Machine Learning - Level One Module Two - Shreya
Concise Overview Of Things Learned
- Learned how to run spiders/web crawlers using Scrapy with VS Code.
- Became familiar with HTML tags and how to use them to extract specific data from a website.
- Understood the basic steps to prepare data for EDA (e.g., stop-word removal, tokenization, and stemming).
- VS Code
- Git and GitHub
- Effective problem solving, to work through the many issues that arose.
- Time management
- Used a web crawler to scrape forum posts and found a way to scrape multiple “infinite scroll” pages.
- Performed basic data pre-processing on the scraped CSV file.
- Successfully pushed all code to a GitHub repository.
Detailed Statement of Tasks Completed
- Used Scrapy to create a web crawler that scrapes forum post titles, categories, and comments.
- Created a loop to iterate through “infinite scroll” page URLs.
- At first, the forum I was scraping blocked my IP address because I was sending too many requests. I changed my spider settings (depth limit, concurrent requests, and download delay) to slow the requests down so that the site was not overwhelmed.
- Cleaned the CSV file by lowercasing the text, removing punctuation, removing stop words, removing the most and least common words, tokenizing, and lemmatizing.
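
The page loop and the throttling settings described above can be sketched roughly as follows. This is only an illustration, not the actual spider: the forum URL, the `page` query parameter, and all numeric values are hypothetical, since the report does not state them. The three setting names are real Scrapy settings.

```python
from urllib.parse import urlencode

# Hypothetical values: the report does not give the forum URL or the
# name of the query parameter that selects a page.
BASE_URL = "https://forum.example.com/latest"
PAGE_PARAM = "page"

# Real Scrapy setting names; the values are illustrative guesses.
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 2,       # seconds to wait between requests
    "CONCURRENT_REQUESTS": 1,  # send one request at a time
    "DEPTH_LIMIT": 3,          # do not follow links deeper than this
}


def page_urls(base_url, first_page, last_page, param=PAGE_PARAM):
    """Return one URL per page number, first_page..last_page inclusive."""
    return [
        f"{base_url}?{urlencode({param: n})}"
        for n in range(first_page, last_page + 1)
    ]


# URLs for the first three "infinite scroll" pages.
start_urls = page_urls(BASE_URL, 0, 2)
```

In a Scrapy spider, a list like `start_urls` would typically be assigned to the spider's `start_urls` attribute, and a dictionary like `THROTTLE_SETTINGS` to its `custom_settings` attribute.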
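
Most of the cleaning steps listed above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the code that was actually run: the stop-word list is a tiny hypothetical sample, the function names and thresholds are made up for the example, and lemmatization is left out because it needs an external library (for example NLTK's `WordNetLemmatizer`).

```python
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use
# a fuller list, such as the one shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "it", "my"}


def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def clean_corpus(texts, n_common=1, min_count=2):
    """Tokenize each text and drop stop words, then drop the n_common
    most frequent words and any word seen fewer than min_count times."""
    docs = [[t for t in tokenize(text) if t not in STOP_WORDS] for text in texts]
    counts = Counter(t for doc in docs for t in doc)
    too_common = {word for word, _ in counts.most_common(n_common)}
    too_rare = {word for word, c in counts.items() if c < min_count}
    drop = too_common | too_rare
    return [[t for t in doc if t not in drop] for doc in docs]
```

The frequency filter is applied across the whole corpus rather than per post, so a word that dominates every post (or appears only once overall) is removed everywhere.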