Machine Learning - Level One Module Two - Shreya

Concise Overview Of Things Learned

Technical

  1. Learned how to run spiders/web crawlers using Scrapy with VS Code.
  2. Became familiar with HTML tags and how to use them to extract specific data from a website.
  3. Understood the basic steps to prepare data for EDA (e.g., removing stop words, tokenization, and stemming).
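
The preparation steps named in the list above can be sketched in a few lines of plain Python. This is only an illustration, not the code used in the module: the `STOP_WORDS` set and the `preprocess` helper are hypothetical, and stemming is left out because it normally relies on a library such as NLTK.

```python
import string

# Tiny illustrative stop-word list (an assumption; a real project would
# normally take a fuller list from a library).
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "it", "my"}


def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    # Delete every ASCII punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Whitespace tokenization: the simplest possible tokenizer.
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


sample = "The crawler is slow, and my IP got blocked!"
tokens = preprocess(sample)
print(tokens)
```

With pandas, a helper like this could be applied to a text column with `Series.apply`.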

Tools

  • Scrapy
  • VS Code
  • pandas
  • Git and GitHub

Soft Skills

  • Effective problem solving, to work through the many issues that arose.
  • Time management

Achievement Highlights

  • Used a web crawler to scrape forum posts and found a way to scrape multiple “infinite scroll” pages.
  • Performed basic data pre-processing on the scraped CSV file.
  • Successfully pushed all code to a GitHub repository.
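
One common way to handle "infinite scroll" pages is to request the numbered pages that the site loads in the background as the user scrolls. The sketch below assumes the forum exposes a `?page=N` query parameter; the URL pattern, the example domain, and the `page_urls` helper are all hypothetical, since the actual forum and URL scheme are not stated here.

```python
from urllib.parse import urlencode


def page_urls(base_url, first_page, last_page):
    """Build one URL per page of a paginated listing (inclusive range)."""
    return [
        f"{base_url}?{urlencode({'page': n})}"
        for n in range(first_page, last_page + 1)
    ]


# Hypothetical forum URL, used only for illustration.
urls = page_urls("https://forum.example.com/latest", 1, 3)
for url in urls:
    print(url)
```

In a Scrapy spider, a list like this could be assigned to the `start_urls` attribute so that each page is requested and parsed in turn.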

Detailed Statement of Tasks Completed

  • Used Scrapy to create a web crawler that scrapes forum post titles, categories, and comments.
  • Created a loop to iterate through “infinite scroll” page URLs.
  • At first, the forum I was scraping blocked my IP address because I was sending too many requests. I changed my spider settings (depth limit, concurrent requests, and download delay) to slow the crawl so that the site was not overwhelmed.
  • Cleaned the CSV file by lowercasing the text, removing punctuation, stop words, and the most and least common words, and then tokenizing and lemmatizing.
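
The throttling change described in the list above could look roughly like the settings fragment below. `DEPTH_LIMIT`, `CONCURRENT_REQUESTS`, and `DOWNLOAD_DELAY` are standard Scrapy setting names, but the values are assumptions chosen for illustration; the numbers actually used are not recorded here.

```python
# Hypothetical values: the report does not state the numbers actually used.
custom_settings = {
    # Do not follow links more than two hops away from the start URLs.
    "DEPTH_LIMIT": 2,
    # Send one request at a time instead of many in parallel.
    "CONCURRENT_REQUESTS": 1,
    # Wait about three seconds between consecutive requests to the same site.
    "DOWNLOAD_DELAY": 3,
}

for name, value in custom_settings.items():
    print(f"{name} = {value}")
```

A dictionary like this can be set as the `custom_settings` class attribute of a spider, or the same keys can be placed in the project's `settings.py`.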