Machine Learning - Level 1 - Assessment Module 2
- Selected Discourse Hub as the site to scrape; specifically chose the “Car Talk Community” under “Automobiles”
- The Python modules I used are:
- Beautiful Soup
- Counter from the collections module
- The data scraped from the forum was stored in a formatted JSON file
- Used data cleaning and exploratory data analysis (EDA) techniques to explore the scraped data
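The JSON staging step described above can be sketched as follows. The record fields (`title`, `comments`) and the file name are illustrative assumptions, since the report does not specify the exact schema used:

```python
import json

# Hypothetical structure for the scraped records; the actual field
# names used in the project are not stated in the report.
posts = [
    {"title": "Strange noise when braking",
     "comments": ["check the pads", "could be the rotors"]},
    {"title": "Best oil for winter",
     "comments": ["5w-30 works for me"]},
]

# Stage the scraped data as a formatted (indented) JSON file.
with open("car_talk_posts.json", "w", encoding="utf-8") as f:
    json.dump(posts, f, indent=2, ensure_ascii=False)

# Reload the staged file later for cleaning and EDA.
with open("car_talk_posts.json", encoding="utf-8") as f:
    data = json.load(f)
print(len(data))
```

Using `indent=2` is what makes the output a “formatted” JSON file: it is human-readable and diffs cleanly when pushed to a Git repository.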
Tools and libraries: Beautiful Soup, Selenium WebDriver, NLTK, GitHub, spaCy, TextBlob
- Successfully scraped the Discourse Hub community (specifically, the Car Talk community) using Beautiful Soup and Selenium
- Successfully organized the scraped data and staged it as JSON
- Successfully pushed the files to a GitHub repository
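The scrape-and-parse flow above can be sketched with two small helpers. The `a.title` CSS selector is an assumption about Discourse's topic-list markup, and the pause length is a guess; both would need adjusting against the real rendered page:

```python
import time

from bs4 import BeautifulSoup


def scroll_to_bottom(driver, pause=2.0):
    """Scroll a Selenium driver until the page height stops growing.

    Discourse loads topics lazily, so the page must be scrolled
    repeatedly before all topics are present in the DOM.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly loaded topics time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height


def extract_titles(html):
    """Pull topic titles out of rendered HTML with Beautiful Soup.

    The 'a.title' selector is an assumption about Discourse markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("a.title")]
```

In the actual run, one would create a driver with `selenium.webdriver.Chrome()` (ChromeDriver must be on the PATH), call `driver.get(...)` on the forum URL, then `scroll_to_bottom(driver)` followed by `extract_titles(driver.page_source)`.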
Detailed statement of tasks:
- To scrape the data, I used Google Chrome’s Developer Tools to study the patterns in the rendered web page
- Used Selenium with ChromeDriver to load the web pages and scroll to the bottom of each page so that all content was rendered
- Collected all the comments for each post and converted them to lowercase
- Used Python’s collections.Counter to build word frequencies (all words lowercased), ignoring stop words
- Used Python dictionaries extensively. As I am not very familiar with Python, I had to spend extra time and seek help
- Most of the time was spent understanding the patterns in the source HTML so that Beautiful Soup could be used efficiently
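The word-frequency step above can be sketched as a minimal example. The stop-word set here is a small illustrative list; the project presumably drew on NLTK's stop-word corpus (`nltk.corpus.stopwords.words("english")`), and the tokenizing regex is an assumption:

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real run would use NLTK's
# stopwords corpus instead.
STOP_WORDS = {"the", "a", "an", "is", "it", "to", "and", "of", "in", "my", "i"}


def word_frequencies(comments):
    """Lowercase every comment, tokenize, and count non-stop-words."""
    counts = Counter()
    for comment in comments:
        words = re.findall(r"[a-z']+", comment.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


comments = [
    "The engine makes a ticking noise",
    "Check the engine oil first",
]
print(word_frequencies(comments).most_common(3))
```

`Counter.most_common(n)` returns the `n` highest-frequency words, which is a convenient way to summarize the vocabulary of the scraped comments during EDA.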