Machine Learning - Self-Assessment 2 - Twesha Ghosh

Machine Learning - Level 1 - Assessment Module 2

Technical Area:

  • Selected the Discourse Hub to scrape the data. Specifically chose the “Car Talk Community” under “Automobiles”.
  • The Python modules I used were:
    • Beautiful Soup
    • Selenium
    • Counter from the collections module
    • json
  • The data scraped from the forum was stored in a formatted JSON file.
  • Used data cleaning and EDA (exploratory data analysis) techniques to explore the scraped data.
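The scrape-and-stage workflow above could be sketched roughly as follows. This is a minimal sketch: the HTML snippet, class names, and file name are hypothetical stand-ins for the actual Discourse markup, and the Selenium page-loading step is replaced with a static HTML string so the example is self-contained.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for HTML that would be retrieved via Selenium;
# the real Car Talk pages use Discourse's own markup, which differs.
html = """
<div class="topic-list">
  <a class="title" href="/t/brakes-squeal/101">Brakes squeal when cold</a>
  <a class="title" href="/t/oil-change/102">How often to change oil?</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect each post title and its URL into a list of dictionaries.
posts = [
    {"title": a.get_text(strip=True), "url": a["href"]}
    for a in soup.select("a.title")
]

# Stage the scraped data as formatted JSON.
with open("posts.json", "w") as f:
    json.dump(posts, f, indent=2)
```

The same pattern extends to comment bodies: find the tag/class that wraps each comment in the rendered page, select it, and append the extracted text to the structure before dumping to JSON.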

Tools:

Beautiful Soup, Selenium WebDriver, NLTK, GitHub, spaCy, TextBlob

Achievement Highlights:

  • Successfully scraped the Discourse Hub community (specifically, the Car Talk community) using Beautiful Soup and Selenium.
  • Successfully organized the scraped data and staged it as JSON.
  • Successfully pushed the files to a GitHub repository.

Detailed Statement of Tasks:

  • To scrape the data, I used Google Chrome’s Developer Tools to study the patterns in the rendered web page.
  • Used Selenium and ChromeDriver to load the web pages and to scroll to the bottom of each page.
  • Collected all the comments for each post and converted them to lowercase.
  • Used Python’s collections.Counter to build word frequencies (all words converted to lowercase), ignoring stop words.
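The word-frequency step above can be sketched like this. It is a minimal sketch: the sample comments and the small stop-word set are made up for illustration (NLTK provides a much fuller stop-word list).

```python
import re
from collections import Counter

# Hypothetical sample comments; the real data came from the scraped forum.
comments = [
    "The brakes squeal when the car is cold.",
    "Check the brakes and the rotors.",
]

# Tiny illustrative stop-word set; in practice NLTK's list would be used.
stop_words = {"the", "and", "is", "when", "a", "an"}

words = []
for comment in comments:
    # Lowercase each comment and split it into word tokens.
    words.extend(re.findall(r"[a-z']+", comment.lower()))

# Count every token that is not a stop word.
freq = Counter(w for w in words if w not in stop_words)
print(freq.most_common(2))  # "brakes" appears twice
```

Counter handles the bookkeeping (missing keys default to zero), which is exactly the case that makes hand-rolled dictionary counting error-prone.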

Problems faced:

  • Used Python dictionaries extensively. As I am not very familiar with Python, I had to spend extra time and seek help.
  • Most of the time was spent understanding the patterns in the source HTML so that Beautiful Soup could be used efficiently.
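Since plain dictionaries came up repeatedly, the counting done above with Counter can also be written with a bare dict, which is the pattern that took time to get comfortable with. A minimal sketch with made-up words:

```python
# Counting word frequencies with a plain dict instead of collections.Counter.
# dict.get(key, default) avoids a KeyError the first time a word is seen.
words = ["brakes", "squeal", "brakes", "rotors"]

freq = {}
for w in words:
    freq[w] = freq.get(w, 0) + 1

print(freq)  # {'brakes': 2, 'squeal': 1, 'rotors': 1}
```

The `freq.get(w, 0) + 1` idiom is the piece Counter abstracts away; forgetting the default is a common source of KeyError for newcomers to Python.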