Machine Learning_level1_Module 2_Baishali Sow Mondal

Concise Overview of Things Learned (Technical Area):

  • I chose the Codecademy forum as the data source for scraping.
  • I scraped the data from this forum using the Beautiful Soup and Selenium libraries and stored it in a CSV file.
  • I applied various data-cleaning and EDA techniques to explore the scraped data.
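
A minimal sketch of the Selenium + Beautiful Soup workflow described above. The forum URL, the `topic-title` CSS class, and the output file name are assumptions for illustration, not the actual values used:

```python
import csv
from bs4 import BeautifulSoup

def parse_topics(html):
    """Extract topic titles from the rendered page HTML.
    The 'topic-title' class is a hypothetical selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("a.topic-title")]

def save_topics(topics, path="topics.csv"):
    """Store the scraped titles in a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["topic"])
        writer.writerows([t] for t in topics)

# Selenium renders the JavaScript-heavy page first (sketch, needs a browser):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://discuss.codecademy.com/")
# driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")  # load lazy content
# html = driver.page_source
# driver.quit()
```

Selenium executes the page's JavaScript so that the content actually exists in the HTML before Beautiful Soup parses it.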

Tools: Beautiful Soup, Selenium WebDriver, NumPy, Pandas, Matplotlib, Scikit-learn, WordCloud, NLTK, GitHub, spaCy, TextBlob

Soft skills: I improved my Google searching skills to cope with bugs.

Achievement Highlights:

  • Successfully scraped the data from the Codecademy forum using the Beautiful Soup and Selenium libraries.
  • Successfully performed exploratory data analysis on the scraped data.
  • Successfully pushed the files to a GitHub repository.

Detailed Statement of Tasks:

  • Firstly, when I used Beautiful Soup alone, I was unable to scrape the data: Beautiful Soup only parses static HTML, so its find_all method cannot reach content that JavaScript renders after the page loads. Using Selenium together with Beautiful Soup worked well.
  • I was unable to scrape the comments on a page directly, so I used Selenium to scroll the page and load them.
  • I lowercased the text, removed digits and words containing digits, stripped punctuation and extra spaces, removed common and rare words, and lemmatized the remaining words.
  • I created a document-term matrix using Scikit-learn's CountVectorizer to find the top words of each category.
  • I generated a word cloud of the top words of each category.
  • I used TextBlob to check the sentiment polarity of each category.
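
The cleaning steps above can be sketched as a single function; the stop-word set here is a tiny placeholder (the real run removed common and rare words found in the corpus, and lemmatized with NLTK):

```python
import re
import string

def clean_text(text, stop_words=frozenset({"the", "a", "an", "is"})):
    """Apply the cleaning steps in order: lowercase, drop digits and
    words containing digits, strip punctuation, collapse extra
    spaces, and remove stop words."""
    text = text.lower()
    text = re.sub(r"\w*\d\w*", " ", text)   # digits and digit-bearing words
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                   # split() also collapses extra spaces
    return " ".join(t for t in tokens if t not in stop_words)

# Lemmatization was done with NLTK (sketch, needs the wordnet corpus):
# from nltk.stem import WordNetLemmatizer
# lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]
```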

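A sketch of the document-term-matrix step, assuming one combined text per category (the example texts and the choice of `n` are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_words(texts, n=3):
    """Return the n most frequent terms for each category's text,
    using a document-term matrix built with CountVectorizer."""
    vec = CountVectorizer(stop_words="english")
    dtm = vec.fit_transform(texts)          # rows: categories, columns: terms
    terms = vec.get_feature_names_out()
    result = []
    for row in dtm.toarray():
        order = row.argsort()[::-1][:n]     # indices of the highest counts
        result.append([terms[i] for i in order if row[i] > 0])
    return result

# Polarity per category with TextBlob (sketch):
# from textblob import TextBlob
# polarity = TextBlob(category_text).sentiment.polarity  # -1.0 .. 1.0
```
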
Problems faced:

  • I used “C:\Program Files (x86)\chromedriver.exe” in a Jupyter notebook and it worked well, but when I tried to use it in Google Colaboratory an error appeared: “Chromedriver not in path”. Colab runs on a Linux machine, so a Windows file path does not exist there.
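
One way to make the driver path portable between the two environments is to prefer whatever chromedriver is already on the PATH and only fall back to the hard-coded local path (a sketch; the Colab install commands in the comments are a commonly used approach and may vary by version):

```python
import shutil

def find_chromedriver(fallback=r"C:\Program Files (x86)\chromedriver.exe"):
    """Prefer a chromedriver found on the system PATH; otherwise
    fall back to a hard-coded local install path."""
    found = shutil.which("chromedriver")
    return found if found is not None else fallback

# On Colab, chromedriver can typically be installed onto the PATH first:
# !apt-get install -y chromium-chromedriver
# and Chrome then needs headless options:
# options = webdriver.ChromeOptions(); options.add_argument("--headless")
```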