1st Self Assessment
1.0 Concise overview
What I learned last week:
- Web scraping of the latest page of SmartThings
- Data cleaning to produce a clean CSV data file
- Data visualization
- Jupyter Notebook
- Soft skills
  - Using Slack for team communication
  - Using Asana for task management
  - Solving problems by researching online
2.0 Three achievement highlights
- Built a web crawler that collects metadata on 3000 posts in about 1 minute
- Solved the problem of scraping infinite-scrolling pages by myself
- Solved the problem of scraping the “replies number” and “views number” of a post by myself
3.0 List of meetings/training attended, including social team events
- Weekly team meeting on Monday
- STEMCast: Overview of ML and project, Technical Deep Dive - Data Mining
4.0 Goals for the upcoming week
- Learn how to pre-process text from the collected dataset
- Learn the basics of NLP
5.0 Detailed statement of tasks done
I completed all of the following tasks on my own, with the help of online resources.
5.1 Task 1 Web scraping
I used Beautiful Soup to collect metadata (URL, title, category, number of replies, number of views) for 3000 posts from the latest page of SmartThings. Collecting the URL, title, and category was straightforward. The only issue I ran into was extracting the number of replies and the number of views for each post. I resolved it through my own exploration and found that the HTML content retrieved for Beautiful Soup differs from the HTML content shown by “Inspect Element” in the web browser.
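As an illustration of the extraction step, here is a minimal sketch using Beautiful Soup on an inline HTML snippet. The markup, class names, and the `parse_topics` helper are hypothetical and are not the actual SmartThings page structure or my original crawler; the real selectors have to be written against the HTML the script actually receives, which may not match what the browser shows.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page uses its own tags and classes.
SAMPLE_HTML = """
<table>
  <tr class="topic">
    <td><a class="title" href="/t/hub-offline/101">Hub offline</a></td>
    <td class="category">Devices</td>
    <td class="replies">12</td>
    <td class="views">340</td>
  </tr>
  <tr class="topic">
    <td><a class="title" href="/t/new-app/102">New app feedback</a></td>
    <td class="category">Apps</td>
    <td class="replies">3</td>
    <td class="views">95</td>
  </tr>
</table>
"""


def parse_topics(html):
    """Return one dict of metadata per topic row found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    topics = []
    for row in soup.select("tr.topic"):
        link = row.select_one("a.title")
        topics.append({
            "url": link["href"],
            "title": link.get_text(strip=True),
            "category": row.select_one("td.category").get_text(strip=True),
            "replies": int(row.select_one("td.replies").get_text(strip=True)),
            "views": int(row.select_one("td.views").get_text(strip=True)),
        })
    return topics


topics = parse_topics(SAMPLE_HTML)
for topic in topics:
    print(topic)
```

Printing the fetched HTML (for example with `soup.prettify()`) before writing selectors is one way to check what the script really received.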
5.2 Task 2 Data cleaning
This part was also straightforward: I extracted the URL, title, category, number of replies, and number of views by searching the HTML content, put them into a list, and saved them to a CSV file.
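The cleaning and saving step could look roughly like the sketch below, which uses only the standard `csv` module. The field names, the `parse_count` and `rows_to_csv` helpers, and the sample row are assumptions for illustration; in particular, handling a "k" suffix such as "1.2k" is an assumed format, not something confirmed from the collected data.

```python
import csv
import io

FIELDS = ["url", "title", "category", "replies", "views"]


def parse_count(text):
    """Convert a count such as '340' or '1.2k' to an int ('k' suffix is an assumed format)."""
    text = text.strip().lower().replace(",", "")
    if text.endswith("k"):
        return round(float(text[:-1]) * 1000)
    return int(text)


def rows_to_csv(rows):
    """Clean each row and return the whole table as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        cleaned = dict(row)
        # Collapse repeated whitespace in the title.
        cleaned["title"] = " ".join(row["title"].split())
        cleaned["replies"] = parse_count(row["replies"])
        cleaned["views"] = parse_count(row["views"])
        writer.writerow(cleaned)
    return buffer.getvalue()


# Hypothetical raw row, as it might look straight after scraping.
raw_rows = [
    {"url": "/t/hub-offline/101", "title": "  Hub   offline ",
     "category": "Devices", "replies": "12", "views": "1.2k"},
]
csv_text = rows_to_csv(raw_rows)
print(csv_text)
```

To write a file instead of a string, the same writer can be given a file opened with `open(path, "w", newline="")`.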
5.3 Task 3 Data visualization
Here are my visualization results.
- I used the word cloud library to plot word clouds for the titles and categories, showing the most frequent words in each.
- I used the pyplot library to plot the distribution of the top 5 categories and learned the basics of plotting in Python.
- I used the seaborn library to plot histograms of the distributions of the number of replies and the number of views.
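The counts behind the word cloud and the top-5 category plot can be sketched with the standard library alone. The helper names and sample data below are hypothetical, and the tokenization is a simple assumption rather than what the word cloud library does internally.

```python
from collections import Counter
import re


def word_frequencies(titles):
    """Count lowercase words across all titles."""
    words = []
    for title in titles:
        words.extend(re.findall(r"[a-z0-9']+", title.lower()))
    return Counter(words)


def top_categories(categories, n=5):
    """Return the n most common categories as (category, count) pairs."""
    return Counter(categories).most_common(n)


# Hypothetical sample data for illustration.
titles = ["Hub offline again", "Hub firmware update", "New app feedback"]
categories = ["Devices", "Devices", "Apps"]

word_counts = word_frequencies(titles)
top = top_categories(categories)
print(word_counts.most_common(3))
print(top)
```

The `(category, count)` pairs can then be unpacked with `zip(*top)` and passed to `plt.bar`, and the word counts can be given to `WordCloud.generate_from_frequencies`.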