Week: July 26 - August 1, 2020
Overview of Things Learned:
This past week, I learned about different libraries and frameworks for web scraping, including BeautifulSoup and Scrapy. I initially thought that web scraping could only be done by going through HTML tags, but then realized that JSON requests are a much neater way to mine data from websites. In terms of tools, I became familiar with Google Colab for interactive Python notebooks. Although I personally use Jupyter Notebook, Colab has proven to be an excellent tool for team collaboration.
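As a rough illustration of why JSON is neater than walking HTML tags, here is a minimal sketch that pulls topic titles out of a JSON response body. The sample payload and its field names (`topic_list`, `topics`, `title`, `posts_count`) are assumptions made up for this example, not the forum's confirmed format.

```python
import json

# Hypothetical sample of what a forum's JSON endpoint might return.
# The structure and field names are assumptions for illustration only.
SAMPLE_RESPONSE = """
{"topic_list": {"topics": [
  {"id": 1, "title": "How do I center a div?", "posts_count": 4},
  {"id": 2, "title": "CMS collection limits", "posts_count": 9}
]}}
"""


def extract_topics(raw_json):
    """Return (id, title) pairs from a JSON response body."""
    payload = json.loads(raw_json)
    # Use .get() with defaults so a missing key yields an empty list
    # instead of raising KeyError.
    topics = payload.get("topic_list", {}).get("topics", [])
    return [(topic["id"], topic["title"]) for topic in topics]


print(extract_topics(SAMPLE_RESPONSE))
```

Inside a Scrapy spider, the same parsing could be applied to the response body, for example with `json.loads(response.text)` in the parse callback, so no CSS or XPath selectors are needed.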
- Scraping and storing data from the Webflow Community Forum
- Pre-processing text data by removing stop words and punctuation
- Familiarity with Jupyter Notebook and Google Colab
- Web Scraping Check-in (July 29)
- Web Scraping Presentations and Preprocessing (July 31)
Goals for the upcoming week:
- Complete/refine data pre-processing
- Complete TF-IDF
Web scraping using Scrapy:
- I had trouble getting data from the HTML tags, so I reached out to my teammates, who recommended using JSON rather than HTML requests. #teamwork
- Due to the nature of my forum, some of the text data that I pre-processed includes HTML links, from which I was unable to extract keywords with simple pre-processing functions. I'm still working on improving my pre-processing techniques.
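One possible way to handle the leftover links is to strip HTML tags and bare URLs before removing punctuation and stop words. The sketch below uses only the standard library; the `preprocess` helper, the regular expressions, and the tiny stop-word list are all assumptions for illustration, not the pipeline actually used in this project.

```python
import html
import re
import string

# Tiny illustrative stop-word list (an assumption); a real project would
# likely use a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "or",
              "in", "on", "for", "i", "my", "this", "it"}

TAG_RE = re.compile(r"<[^>]+>")                 # any HTML tag, e.g. <a href="...">
URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # bare links left in the text
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Return lowercase keyword tokens with tags, URLs, punctuation and stop words removed."""
    text = TAG_RE.sub(" ", text)    # drop tags first, which also drops href URLs
    text = html.unescape(text)      # turn entities such as &amp; into characters
    text = URL_RE.sub(" ", text)    # drop URLs written directly in the post body
    text = text.lower().translate(PUNCT_TABLE)
    return [token for token in text.split() if token not in STOP_WORDS]


post = ('Check <a href="https://example.com/docs">the docs</a> or '
        'https://example.com/help for my layout issue!')
print(preprocess(post))
```

Removing tags and URLs before punctuation matters here: if punctuation were stripped first, a link would collapse into one long token that no longer matches the URL pattern.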