Overview of Things Learned:
- Technical Areas: web scraping, data cleaning
- Tools: Scrapy, Beautiful Soup, Threading, Markdown
- Soft Skills: #communication, #time-management, #teamwork
- Used Scrapy and Beautiful Soup to scrape data from 10 different forums.
- Used multi-threaded programming to collect 315.9 MB of post and topic data.
- Posted a list of quality forums to scrape along with code examples of web crawlers to help team members.
- 7/27 - Introduction to Web Scraping and this Week’s Deliverables
- 7/29 - Web Scraping Check In Meeting and Intro to Preprocessing
- 7/31 - Web Scraping and Preprocessing Presentations
Goals for the Upcoming Week:
- Continue learning about TF-IDF and BERT
- Assist any team members struggling with web scraping or pre-processing.
- Web scraping: Built a web crawler using Scrapy, then used it to collect post and topic data from different forums. After scraping one forum, I decided to scrape several more to obtain a sufficient amount of training/validation/test data. To process such a large volume of data in a limited time frame, I made my crawler code multi-threaded, which allowed me to collect far more data than a serial crawler would have.
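The multi-threaded collection pattern described above can be sketched with the standard library alone: worker threads fetch and parse forum pages in parallel. This is a minimal sketch, not the project's actual crawler; the URLs, sample HTML, and the `scrape` function are hypothetical stand-ins (a real crawler would issue HTTP requests, e.g. via Scrapy or requests, instead of reading from a dict).

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

# Hypothetical stand-in for fetched forum pages; in the real crawler these
# would come from HTTP responses.
PAGES = {
    "https://forum-a.example/t/1": "<html><h2 class='topic'>Fixing boot loops</h2></html>",
    "https://forum-b.example/t/2": "<html><h2 class='topic'>GPU driver crash</h2></html>",
}

class TopicParser(HTMLParser):
    """Extracts the text of <h2 class='topic'> elements."""
    def __init__(self):
        super().__init__()
        self.in_topic = False
        self.topics = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "topic") in attrs:
            self.in_topic = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_topic = False

    def handle_data(self, data):
        if self.in_topic:
            self.topics.append(data.strip())

def scrape(url):
    parser = TopicParser()
    parser.feed(PAGES[url])  # real code would feed the fetched response body
    return parser.topics

# Each URL is fetched and parsed on its own worker thread; map preserves
# input order, so results line up with the URL list.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scrape, PAGES))

all_topics = [t for topics in results for t in topics]
```

Note that Scrapy already parallelizes requests internally (via its `CONCURRENT_REQUESTS` setting), so explicit threading like this mainly helps when driving many independent crawls or doing CPU-bound post-processing alongside fetching.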
- Pre-processing: After collecting all the data, I needed an efficient way to process all of the posts. This involved writing more multi-threaded code to handle the volume of data collected by the crawlers. Using the pandas library, I stored all the cleaned data in CSV files and pushed the completed work to our team's GitHub repo.
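The pandas cleaning-and-export step can be sketched as follows. This is a minimal illustration, assuming a simple topic/post schema; the column names and sample rows are hypothetical, not the project's actual data.

```python
import pandas as pd

# Hypothetical raw rows as a crawler might emit them; the schema is
# illustrative only.
raw = pd.DataFrame({
    "topic": ["Boot loop ", None, "GPU crash"],
    "post":  ["  Try safe mode.\n", "orphan reply", "\tUpdate the driver. "],
})

cleaned = (
    raw.dropna(subset=["topic"])                          # drop rows missing a topic
       .assign(topic=lambda df: df["topic"].str.strip(),  # trim stray whitespace
               post=lambda df: df["post"].str.strip())
       .reset_index(drop=True)
)

# Persist the cleaned data as CSV for the modelling stage; in the project
# these files were then committed to the team repo.
cleaned.to_csv("cleaned_posts.csv", index=False)
```

For a large scrape, the same cleaning function would be applied per-file across worker threads, with each worker writing its own CSV to avoid contention.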