Myang30 - Machine Learning Pathway

myang30 · June 16, 2020, 6:02pm

1st Self Assessment

1.0 Concise overview

What I learned last week:

Technical

Web scraping of the latest page of SmartThings
Data cleaning to produce a clean CSV data file
Data visualization

Tools

GitHub
Jupyter Notebook
Slack
Asana

Soft Skills

Using Slack for team communication
Using Asana for task management
Solving problems by researching online

2.0 Three achievement highlights

Built a web crawler that collects metadata about 3000 posts in about 1 minute
Solved the problem of scraping infinite-scrolling pages by myself
Solved the scraping of the number of replies and number of views of a post by myself

3.0 List of meetings/ training attended including social team events

Weekly team meeting on Monday
STEMCast: Overview of ML and project, Technical Deep Dive - Data Mining

4.0 Goals for the upcoming week

Learn how to pre-process text from the collected dataset
Learn the basics of NLP

5.0 Detailed statement of tasks done

I finished all the following tasks by myself and with the help of online resources.

5.1 Task 1 Web scraping

I used Beautiful Soup to collect metadata (URL, title, category, number of replies, number of views) for 3000 posts from the latest page of SmartThings. Collecting the URL, title and category was straightforward. The only issue I ran into was extracting the number of replies and the number of views of a post. I solved it by exploring on my own and found that the HTML content Beautiful Soup receives is different from the HTML shown in “Inspect Element” in the web browser.
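Since the fetched HTML can differ from what the browser inspector shows, it helps to test the extraction against the HTML text the scraper actually receives. The following is only a minimal sketch of that idea. It uses `html.parser` from the Python standard library in place of Beautiful Soup so that it runs without extra installs, and the markup, tag names and class names in `SAMPLE_ROW` are hypothetical, not taken from the SmartThings site.

```python
from html.parser import HTMLParser

# Hypothetical markup: these tags and class names are invented for
# illustration and are NOT copied from the real site.
SAMPLE_ROW = """
<tr class="topic-list-item">
  <td><a class="title" href="/t/example-topic/123">Example topic</a></td>
  <td><span class="posts">12</span></td>
  <td><span class="views">3400</span></td>
</tr>
"""


class TopicRowParser(HTMLParser):
    """Collects the integer text of <span class="posts"> and <span class="views">."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # name of the span we are currently inside
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("posts", "views"):
            self.current_field = css_class

    def handle_data(self, data):
        text = data.strip()
        if self.current_field and text:
            self.fields[self.current_field] = int(text)

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def parse_counts(html_text):
    """Return a dict with the reply and view counts found in html_text."""
    parser = TopicRowParser()
    parser.feed(html_text)
    return parser.fields


counts = parse_counts(SAMPLE_ROW)
print(counts)
```

Saving the downloaded HTML to a file and running the parser on that file, rather than on markup copied from the browser, is one way to catch this kind of mismatch early.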

5.2 Task 2 Data cleaning

This part was also straightforward: I extracted the URL, title, category, number of replies and number of views by searching the HTML content, put them into a list and saved them to a CSV file.
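The cleaning and saving step could look roughly like the sketch below. The column names, the `clean_row` helper and the sample rows are assumptions made for illustration; the original code is not shown in this post.

```python
import csv
import io

# Assumed column names; the post lists these five fields but not their labels.
FIELDS = ["url", "title", "category", "replies", "views"]


def clean_row(raw):
    """Strip whitespace from the text fields and convert the counts to integers."""
    return {
        "url": raw["url"].strip(),
        "title": raw["title"].strip(),
        "category": raw["category"].strip(),
        "replies": int(raw["replies"]),
        "views": int(raw["views"]),
    }


def write_posts_csv(raw_rows, out_file):
    """Write a header line followed by one cleaned line per post."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    for raw in raw_rows:
        writer.writerow(clean_row(raw))


# Made-up sample rows, only to exercise the functions.
raw_rows = [
    {"url": " /t/example-topic/123 ", "title": "Hub offline, again",
     "category": "Devices", "replies": "12", "views": "3400"},
    {"url": "/t/another-topic/456", "title": " Motion sensor setup ",
     "category": "Automation", "replies": " 3 ", "views": "85"},
]

buffer = io.StringIO()  # stands in for a real file
write_posts_csv(raw_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file instead of `io.StringIO`, the `csv` module documentation recommends opening it with `newline=""`. The `csv` module also quotes titles that contain commas, so they read back as a single field.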

5.3 Task 3 Data visualization

Here is my visualization result.

I used the word cloud library to plot word clouds for the titles and categories, showing the most frequent words in each.
I used the pyplot library to plot the distribution of the top 5 categories and learned the basics of plotting in Python.
I used the seaborn library to plot histograms of the distributions of the number of replies and the number of views.
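A word cloud is driven by word frequencies, so the counting behind it can be sketched without any plotting library. This is only an illustration: the sample titles, the stop-word set and the `word_frequencies` helper are made up, and the tokenization rule is an assumption.

```python
import re
from collections import Counter


def word_frequencies(texts, stop_words=frozenset()):
    """Count lowercase words across all texts, skipping the given stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


# Made-up titles, only to exercise the function.
titles = [
    "Hub offline after firmware update",
    "Motion sensor not reporting to hub",
    "Firmware update broke my automation",
]

freq = word_frequencies(titles, stop_words={"to", "my", "not", "after"})
print(freq.most_common(3))
```

A mapping like `freq` is the kind of input that the wordcloud library's `generate_from_frequencies` method accepts, if that library is installed.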

myang30 · June 26, 2020, 6:29pm

2nd Self Assessment

1.0 Concise overview

What I learned last week:

Technical: using the TF-IDF model to preprocess a corpus
Tools: scikit-learn library

2.0 Three achievement highlights

Understood the TF-IDF model and why it’s important.
Used the scikit-learn library to implement the TF-IDF model.
Optimized the code to minimize the number of commands.

3.0 List of meetings/ training attended including social team events

Weekly team meeting on Monday
STEMCast: NLP Webinar

4.0 Goals for the upcoming week

Learn how to implement the BERT model
Try some other models if I have time.

5.0 Detailed statement of tasks done

I finished the following task by myself and with the help of online resources.

I imported the content of the 3000 posts from the CSV file produced last week and created 2 corpora, one for the starter content and one for the reply content, each with 3000 documents. I used the scikit-learn library to tokenize the text and compute the word count matrix and the word list index, and finally computed the TF-IDF matrix from the word count matrix.
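The three steps (word list index, word count matrix, TF-IDF matrix) can be sketched in plain Python to show what is being computed. This is not the scikit-learn code used for the task. The whitespace tokenizer and sample documents are assumptions, and the weighting uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which is the variant scikit-learn documents as its default.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer (an assumption): lowercase, split on whitespace."""
    return text.lower().split()


def build_vocabulary(docs):
    """Map each distinct word to a column index, in alphabetical order."""
    words = sorted({w for doc in docs for w in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def count_matrix(docs, vocab):
    """One row per document, one column per vocabulary word."""
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for word, n in Counter(tokenize(doc)).items():
            row[vocab[word]] = n
        matrix.append(row)
    return matrix


def tfidf_matrix(counts):
    """Weight counts by smoothed IDF, then scale each row to unit L2 length."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Made-up documents standing in for post content.
docs = ["hub offline again", "hub firmware update", "motion sensor offline"]
vocab = build_vocabulary(docs)
counts = count_matrix(docs, vocab)
tfidf = tfidf_matrix(counts)
print(vocab)
print(counts[0])
```

In the first document, "again" appears in only one document while "hub" appears in two, so "again" receives the larger TF-IDF weight even though both occur once.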

myang30 · July 3, 2020, 6:53pm

3rd Self Assessment

1.0 Concise overview

What I learned last week:

Technical: BERT Model, Fast.ai deep learning framework
Tools: fastai, Google Colab
Soft Skills: Presentation, public speaking, collaboration on group slides

2.0 Three achievement highlights

Used the fast.ai deep learning framework to implement the BERT model
Used the BERT model for text classification to predict the category of posts
Optimized the classification implementation to improve the accuracy from 37% to 51%

3.0 List of meetings/ training attended including social team events

Weekly team meeting on Monday
BERT demo by team leads

4.0 Goals for the upcoming week

Keep improving the accuracy of text classification
Learn the basics of recommendation systems

5.0 Detailed statement of tasks done

I finished all the following tasks by myself and with the help of online resources.

I read about the BERT model to understand its concepts and how it works. I used the fast.ai deep learning framework to implement a text classification program with the BERT model. The program classifies the 3000 posts from the latest page of SmartThings based on their content, and its predictions are compared with the category metadata of those posts. At first I got an accuracy of about 37%. Then I kept only the posts in the 5 most popular categories and ran it again, which improved the accuracy to 51%.
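The filtering step and the accuracy measure can be sketched independently of the model. This is only an illustration under assumptions: the `keep_top_categories` and `accuracy` helpers, the `category` key and the sample data are hypothetical, and the BERT and fast.ai training code is not reproduced here.

```python
from collections import Counter


def keep_top_categories(posts, k=5):
    """Keep only the posts whose category is among the k most frequent ones."""
    category_counts = Counter(post["category"] for post in posts)
    top = {category for category, _ in category_counts.most_common(k)}
    return [post for post in posts if post["category"] in top]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up posts: category A appears 3 times, B twice, C once.
posts = [
    {"content": "post 1", "category": "A"},
    {"content": "post 2", "category": "A"},
    {"content": "post 3", "category": "A"},
    {"content": "post 4", "category": "B"},
    {"content": "post 5", "category": "B"},
    {"content": "post 6", "category": "C"},
]

filtered = keep_top_categories(posts, k=2)
score = accuracy(["A", "B", "A"], ["A", "B", "B"])
print(len(filtered), score)
```

Restricting the data to fewer categories changes the classification task itself, so the 37% and 51% figures are measured on different label sets.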

Finally, I worked with my team members on the group slides, combined our results, and gave a presentation to the whole team at the team meeting.