Level 1: Module 2 - Data gathering and EDA


Remember the Problem Statement described in Module 1?

You can think of Module 2 as the real start of building our project. In this module, our main task is to gather, clean, and explore the data we will train our model on.


1- Pick one (or several) communities from the DiscourseHub Community forums.
2- Scrape the data from the forum(s) in those communities and store it in a CSV file.
3- Perform basic cleaning to remove HTML tags.
4- Perform Exploratory Data Analysis (EDA) on your data, i.e., use data analysis techniques to understand and further clean your data, and to extract insights (valuable information that helps you understand trends or answer questions) from it.
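The scraping and cleaning steps above can be sketched as follows. Discourse-based forums serve a JSON version of each topic when you append `.json` to its URL, which avoids brittle HTML parsing. This is only a minimal sketch, not the tutorial's exact approach; the forum URL and topic ID are placeholders.

```python
import csv
import json
import re
import urllib.request


def topic_json_url(base_url, topic_id):
    """Discourse serves a JSON version of every topic at /t/<id>.json."""
    return f"{base_url.rstrip('/')}/t/{topic_id}.json"


def strip_html(text):
    """Basic cleaning: drop HTML tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()


def parse_topic(payload):
    """Extract (title, post_text) rows from a Discourse topic payload."""
    title = payload["title"]
    posts = payload["post_stream"]["posts"]
    return [(title, strip_html(p["cooked"])) for p in posts]


def scrape_topic_to_csv(base_url, topic_id, path):
    """Fetch one topic and append its posts to a CSV file."""
    with urllib.request.urlopen(topic_json_url(base_url, topic_id)) as resp:
        rows = parse_topic(json.load(resp))
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

# Example (placeholder URL, not run here):
# scrape_topic_to_csv("https://forum.example.com", 123, "posts.csv")
```

`cooked` is the HTML body Discourse stores for each post, which is why step 3 (removing HTML tags) is needed; looping `scrape_topic_to_csv` over the topic IDs listed in the forum's `/latest.json` would cover a whole category.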


  • When you are picking a community or forum to scrape, it’s best if the forum has a high number of users and posts, and if its language is English (because ML models have primarily been trained on English text). Try to avoid overly technical forums (but you’re welcome to scrape them).
  • While you’re cleaning your data, try different strategies and compare which is best, for example by training a logistic regression model on each cleaned version and benchmarking the results. Otherwise you may lose important information that could be fundamental to improving the performance of your model.
  • For EDA, it’s good to first ask some questions (what are you curious to know about your data?) and then use Python code to answer them through visualizations.
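The benchmarking tip above can be sketched like this, assuming scikit-learn is installed. The two cleaning strategies, the toy posts, and their labels are all invented for illustration; you would run this on your scraped data and real categories.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def clean_light(text):
    """Strategy A: only strip HTML tags and lowercase."""
    return re.sub(r"<[^>]+>", " ", text).lower()


def clean_aggressive(text):
    """Strategy B: also drop digits and punctuation (may lose signal)."""
    return re.sub(r"[^a-z ]", " ", clean_light(text))


# Toy labeled posts; replace with your scraped posts and categories.
texts = [
    "<p>My Python code throws an error in this function</p>",
    "<p>How do I debug a null pointer error?</p>",
    "<p>The compiler reports a syntax error again</p>",
    "<p>Unit tests fail after the code refactor</p>",
    "<p>Meeting scheduled for the 5th of December</p>",
    "<p>Reminder: the deadline is next week</p>",
    "<p>Please post your self-assessment link here</p>",
    "<p>The wrap-up meeting agenda is attached</p>",
]
labels = ["tech"] * 4 + ["admin"] * 4

results = {}
for name, cleaner in [("light", clean_light), ("aggressive", clean_aggressive)]:
    cleaned = [cleaner(t) for t in texts]
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    # Cross-validated accuracy lets us compare cleaning strategies fairly.
    results[name] = cross_val_score(model, cleaned, labels, cv=2).mean()

print(results)
```

Whichever strategy scores higher on held-out folds is the one worth keeping; the point is to measure rather than guess.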


Check Module 1 for all the needed references.


:eyes: @ML-Pathway, Check out the Scraping and EDA tutorials and let me know if you’re still facing any issues.
For scraping, I provided an example of how to scrape the Flowster Forum, which you can adapt to scrape your forum of interest. To those who have already submitted their self-assessment and code on GitHub: please try the tasks again (if you haven’t completed them) using these resources.

Tutorial 0.pdf (568.1 KB)
Tutorial 1.pdf (648.3 KB)
Processing Textual Data.pdf (378.2 KB)


Please comment on this post with your GitHub account username. If you don’t have a GitHub account, please create one, as we will need it.

:rotating_light:Instead of the wrap-up meeting, I will share with you resources and tutorials to help guide you with the scraping and EDA task. Please check out the Winter Break ML Pathway Hub Sprint Training Details post for further details about the wrap-up meetings for this module and the next ones.

:rotating_light: @ML-Pathway please push your scraping code and exploratory data analysis in the md2 branch of the repository I invited you to collaborate on.

  • To do so, create a folder named FirstName_LastName that will contain all the scraping and EDA work you did, before the deadline.

:rotating_light: Remember to post your self-assessment after finishing this module’s tasks.


I kinda need more help extracting the forum texts and relating them to the account IDs that actually posted them. I can extract the forum texts all together as one document. So if there are any useful resources, could someone share them with me? Thank you

Hello @YasaminAbbaszadegan,

It’s best to extract the title of the main post, its contents, and then its comments. You can extract other features, like the date, and whatever else you find useful, but please avoid users: if we add users, we will actually need to build a more complex system (with databases, relationships, etc.), which isn’t the main goal of the Level 1 project.
Our main goal is to recommend posts to users according to similarity in the text of the posts or similarity in categories.
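The similarity-based recommendation described here can be sketched with TF-IDF and cosine similarity, assuming scikit-learn is available; the three example posts are invented, and no users are involved, only post text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented post titles/contents; in practice use your scraped CSV.
posts = [
    "How to scrape a Discourse forum with Python",
    "Cleaning scraped forum text before modeling",
    "Best sourdough bread recipes for beginners",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(posts)


def recommend(post_index, top_n=1):
    """Indices of the top_n posts most similar to posts[post_index]."""
    sims = cosine_similarity(tfidf[post_index], tfidf).ravel()
    sims[post_index] = -1.0  # never recommend the post to itself
    return sims.argsort()[::-1][:top_n].tolist()


print(recommend(0))  # picks the other forum-related post, not the bread one
```

The same shape works for category similarity: replace the raw text with each post’s category labels before vectorizing.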

I hope this helps answer your question.

Hi Sara,
My GitHub account is: angela81ku
Also, may I ask the deadline for this module?

Hey @Angela_Ku,

Please check your email for the GitHub invite. Also, if you could comment on the appropriate post with your GitHub details, that would be great.
As for the deadline for this module it’s going to be the 5th of December 2020.


Thanks for your clarification!


You’re welcome :slight_smile:

Hi Sara mentor,
Thanks for your references in module one.
However, I’m still struggling to scrape the data. It seems like I need to know more about HTML structure and how to use Beautiful Soup to parse the data. Is there any training tutorial for web crawling and HTML structure? I have watched the resources you provided in Module 1 and found some fragmentary materials on the internet, but I still can’t scrape the Discourse forum. Could I see others’ code after 12/5, so that I can learn why I failed?


Hi mentor,
Here is my second self-assessment.


Hello @Angela_Ku,

I will try to prepare a scraping example (which you can follow) to help get you unstuck.

Feel free @ML-Pathway to share any tips or tricks that helped you do the task as well.

Please remember, that we have two tasks in this module:
1- Scrape data
2- Perform exploratory data analysis to clean and understand the scraped data.
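For the EDA side of the two tasks above, here is a stdlib-only sketch that answers two simple questions about a hypothetical list of scraped post texts: how long is a typical post, and which words dominate? In your notebook you would normally plot these (e.g. a histogram of lengths) rather than just print them.

```python
import re
import statistics
from collections import Counter

# Hypothetical scraped posts; replace with the text column of your CSV.
posts = [
    "Welcome to the forum, please introduce yourself",
    "How do I reset my password on the forum",
    "Forum etiquette and posting guidelines",
]

# Q1: How long is a typical post (in words)?
lengths = [len(p.split()) for p in posts]
print("mean words per post:", statistics.mean(lengths))

# Q2: Which words appear most often (only words of 4+ letters)?
words = [w for p in posts for w in re.findall(r"[a-z]{4,}", p.lower())]
print("top words:", Counter(words).most_common(3))
```

Answers like these often feed back into cleaning, e.g. very short posts or uninformative dominant words may be worth filtering out.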


Hello @Sara_EL-ATEIF mam,
I have written my self-assessment for Module 2. Here is the link:


I will check them out and get back to you @Sourav_Naskar.


Hi @Sara_EL-ATEIF , here’s the link to my self assessment:



I need more help extracting the forum texts and relating them to the account IDs that actually posted them. I can extract forum texts altogether as one document. I am still trying to scrape the data, as I am not that comfortable with HTML or Beautiful Soup. So, is there any other option for extracting the data? Thank you,

Hello @Sara_EL-ATEIF, here is the link to my self assessment:

Thank you! Stay Safe!

Hello @Sara_EL-ATEIF, I had a question: Where do we submit the code?

Here is the link to my self assessment!!

[Machine Learning Module 2 Self Assessment - Serafina Alhadad]

@tweshaghosh To the GitHub repo: GitHub - mentorchains/level1_post_recommender_20: Level 1 ML project (2020): Forum post recommender system.

Here is the link to my second self-assessment: