Remember the Problem Statement described in Module 1?
You can think of Module 2 as the real start of building our project. In this module, our main task is to gather, clean, and explore the data we will train our model on.
1- Pick one or more communities from the DiscourseHub Community forums.
2- Scrape the data from the forum(s) in the chosen community or communities and store it in a CSV file.
3- Perform basic cleaning to remove HTML tags.
4- Perform Exploratory Data Analysis (EDA) on your data. That is, use data analysis techniques to understand and further clean your data, and to extract insights (valuable information that helps you understand trends or answer questions) from it.
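Steps 2 and 3 can be sketched with the Python standard library alone. This is a minimal sketch, not the method from the tutorials: the column names (`topic_id`, `category`, `text`), the `raw_html` field, and the sample posts are all made up for illustration, and your scraper will likely produce a different schema.

```python
import csv
import io
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    # Tags that usually separate words visually, so we insert a space for them.
    BREAKING_TAGS = {"br", "p", "div", "li"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BREAKING_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw_html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def posts_to_csv(posts, out_file):
    """Write cleaned posts to CSV; the dict keys are a hypothetical schema."""
    writer = csv.DictWriter(out_file, fieldnames=["topic_id", "category", "text"])
    writer.writeheader()
    for post in posts:
        writer.writerow({
            "topic_id": post["topic_id"],
            "category": post["category"],
            "text": strip_html(post["raw_html"]),
        })


# Made-up sample rows standing in for scraped posts.
sample_posts = [
    {"topic_id": 1, "category": "general", "raw_html": "<p>Hello <b>world</b>!</p>"},
    {"topic_id": 2, "category": "help", "raw_html": "<p>Line one<br>line&amp;two</p>"},
]

# In-memory buffer for the demo; for a real file use
# open("posts.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
posts_to_csv(sample_posts, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the raw HTML alongside the cleaned text (for example in a second CSV) makes it easier to try other cleaning strategies later without scraping again.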
- When picking a community or forum to scrape, prefer one with a large number of users and posts, written in English (because ML models have primarily been trained on English text). Try to avoid overly technical forums (though you’re welcome to scrape them).
- While cleaning your data, try different strategies and benchmark them, for example by training a logistic regression model on each cleaned version and comparing the results, to decide what is best to keep or remove.
This matters because cleaning can discard information that is fundamental to improving the performance of your model.
- For EDA, it’s good to first ask some questions (what are you curious to know about your data?) and then use Python code to answer them through visualizations.
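As an illustration of the "ask a question, then answer it in code" approach to EDA, here is a minimal sketch using only the standard library. The three questions, the `category` and `text` columns, and the sample rows are assumptions made up for this example; replace them with your own CSV and your own questions.

```python
import csv
import io
import statistics
from collections import Counter

# Made-up rows in a hypothetical (category, text) schema.
SAMPLE_CSV = """category,text
general,Welcome to the forum everyone
help,How do I reset my password
help,My password reset email never arrived
feedback,The new layout looks great
"""


def load_rows(csv_file):
    """Read every CSV row into a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))


def posts_per_category(rows):
    """Question 1: how balanced are the categories?"""
    return Counter(row["category"] for row in rows)


def word_count_summary(rows):
    """Question 2: how long are the posts, measured in words?"""
    lengths = [len(row["text"].split()) for row in rows]
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


def top_words(rows, category, n=3):
    """Question 3: which words are most frequent within one category?"""
    words = Counter()
    for row in rows:
        if row["category"] == category:
            words.update(row["text"].lower().split())
    return words.most_common(n)


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(posts_per_category(rows))
print(word_count_summary(rows))
print(top_words(rows, "help"))
```

The counts returned by these functions can then be passed to whichever plotting library you prefer (a bar chart of posts per category, a histogram of post lengths) to produce the visualizations.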
Check Module 1 for all the needed references.
@ML-Pathway, check out the Scraping and EDA tutorials and let me know if you’re still facing any issues.
For scraping, I provided an example of how to scrape the Flowster Forum that you can adapt to scrape your forum of interest. If you have already submitted your self-assessment and code on GitHub, please try the tasks again (if you haven’t completed them) using these resources.
Please comment your GitHub username on this post. If you don’t have a GitHub account, please create one, as we will need it.
Instead of the wrap-up meeting, I will share resources and tutorials to help guide you through the scraping and EDA tasks. Please check out the Winter Break ML Pathway Hub Sprint Training Details post for further details about the wrap-up meetings for this module and the next ones.
@ML-Pathway, please push your scraping code and exploratory data analysis to the md2 branch of the repository I invited you to collaborate on.
- To do so, create a folder named in the FirstName_LastName format that contains all the scraping and EDA work you did before the deadline.
Remember to post your self-assessment after finishing this module’s tasks.