Myang30 - Machine Learning Pathway

1st Self Assessment

1.0 Concise overview

What I learned last week:

  1. Technical
  2. Tools
  • GitHub
  • Jupyter Notebook
  • Slack
  • Asana
  3. Soft Skills
  • Used Slack for team communication
  • Used Asana for task management
  • Solved problems by researching online

2.0 Three achievement highlights

  1. Built a web crawler that collects metadata for about 3000 posts in about 1 minute
  2. Solved the problem of scraping infinite-scrolling pages by myself
  3. Solved the scraping of a post's reply count and view count by myself

3.0 List of meetings/ training attended including social team events

  1. Weekly team meeting on Monday
  2. STEMCast: Overview of ML and project, Technical Deep Dive - Data Mining

4.0 Goals for the upcoming week

  1. Learn how to preprocess text from the collected dataset
  2. Learn some basic knowledge of NLP

5.0 Detailed statement of tasks done

I finished all the following tasks by myself and with the help of online resources.

5.1 Task 1 Web scraping

I used Beautiful Soup to collect metadata (URL, title, category, reply count, view count) for 3000 posts from the latest page of SmartThings. Collecting the URL, title, and category was straightforward. The only issue I ran into was extracting a post's reply count and view count. I solved it through self-exploration: the HTML that Beautiful Soup retrieves differs from the HTML shown under "Inspect Element" in the browser, because the browser view reflects the page after JavaScript has run.
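
A minimal sketch of such a crawler, assuming the forum serves further results through simple paged URLs; the URL pattern and every CSS selector below are hypothetical stand-ins, not the exact ones from my script:

    import requests
    from bs4 import BeautifulSoup

    # The HTML that requests receives is the page *before* JavaScript runs,
    # so these selectors must match the raw HTML, not the browser's
    # "Inspect Element" view. All selectors here are assumptions.
    PAGE_URL = "https://community.smartthings.com/latest?page={}"

    rows = []
    page = 0
    while len(rows) < 3000:
        resp = requests.get(PAGE_URL.format(page), timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        topics = soup.select("tr.topic-list-item")  # hypothetical selector
        if not topics:
            break  # no more pages to load
        for topic in topics:
            link = topic.select_one("a.title")
            rows.append({
                "url": link["href"],
                "title": link.get_text(strip=True),
                "category": topic.select_one(".category-name").get_text(strip=True),
                "replies": topic.select_one(".posts").get_text(strip=True),
                "views": topic.select_one(".views").get_text(strip=True),
            })
        page += 1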

5.2 Task 2 Data cleaning

This part was also straightforward: extract the URL, title, category, reply count, and view count from the HTML, put them into a list, and save the list to a CSV file.
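
A short sketch of the cleaning-and-saving step, assuming `rows` is the list of dicts from the scraping sketch above and that counts arrive as strings like "42", "1,234", or "1.2k" (the formats are assumptions):

    import csv

    FIELDS = ["url", "title", "category", "replies", "views"]

    def to_int(text):
        # Normalize counts such as "42", "1,234", or "1.2k" (formats assumed)
        text = text.strip().lower().replace(",", "")
        if text.endswith("k"):
            return int(float(text[:-1]) * 1000)
        return int(text)

    for row in rows:  # `rows` comes from the scraping sketch above
        row["replies"] = to_int(row["replies"])
        row["views"] = to_int(row["views"])

    with open("posts.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)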

5.3 Task 3 Data visualization

Here are my visualization results; a plotting sketch follows the list.

  1. I used the word cloud library to plot word clouds for titles and categories, showing the most frequent words in each.
  2. I used the pyplot library to plot the distribution of the top 5 categories, and learned the basics of Python plotting.
  3. I used the seaborn library to plot histograms of the reply count and view count distributions.
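
A minimal sketch of all three plots, assuming the `posts.csv` file and column names from the cleaning step (`title`, `category`, `replies`, `views`):

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from wordcloud import WordCloud

    df = pd.read_csv("posts.csv")

    # 1. Word cloud of the most frequent words in post titles
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate(" ".join(df["title"].astype(str)))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

    # 2. Bar chart of the 5 most common categories
    df["category"].value_counts().head(5).plot(kind="bar")
    plt.ylabel("number of posts")
    plt.show()

    # 3. Histograms of reply and view counts
    sns.histplot(df["replies"])
    plt.show()
    sns.histplot(df["views"])
    plt.show()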

2nd Self Assessment

1.0 Concise overview

What I learned last week:

  1. Technical: used the TF-IDF model to preprocess the corpus
  2. Tools: the scikit-learn library

2.0 Three achievement highlights

  1. Understood the TF-IDF model and why it's important.
  2. Used the scikit-learn library to implement the TF-IDF model.
  3. Optimized the code to minimize the number of commands.

3.0 List of meetings/ training attended including social team events

  1. Weekly team meeting on Monday
  2. STEMCast: NLP Webinar

4.0 Goals for the upcoming week

  1. Learn how to implement the BERT model
  2. Try some other models if I have time.

5.0 Detailed statement of tasks done

I finished the following task by myself and with the help of online resources.

I imported the content of the 3000 posts from the CSV file collected last week and created two corpora, one for starter content and one for reply content, each containing 3000 documents. I used the scikit-learn library to tokenize the text, compute the word-count matrix and the vocabulary index, and finally compute the TF-IDF matrix from the word-count matrix.
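
A minimal sketch of that pipeline; the column names `starter_content` and `reply_content` are hypothetical. `CountVectorizer` handles tokenization and the word-count matrix, its `vocabulary_` attribute gives the word-to-column index, and `TfidfTransformer` derives TF-IDF weights from the counts:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    df = pd.read_csv("posts.csv")  # hypothetical content columns below

    for column in ["starter_content", "reply_content"]:
        corpus = df[column].fillna("").tolist()  # 3000 documents per corpus

        # Tokenize and build the word-count matrix and the vocabulary index
        vectorizer = CountVectorizer()
        counts = vectorizer.fit_transform(corpus)  # shape: (3000, vocab size)
        vocab = vectorizer.vocabulary_             # maps word -> column index

        # Derive the TF-IDF matrix from the word-count matrix
        tfidf = TfidfTransformer().fit_transform(counts)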

3rd Self Assessment

1.0 Concise overview

What I learned last week:

  1. Technical: BERT Model, Fast.ai deep learning framework
  2. Tools: fastai, Google Colab
  3. Soft Skills: Presentation, public speaking, collaboration on group slides

2.0 Three achievement highlights

  1. Used the deep learning framework fast.ai to implement the BERT model
  2. Used the BERT model for text classification to predict post categories
  3. Optimized the classification implementation, improving accuracy from 37% to 51%

3.0 List of meetings/ training attended including social team events

  1. Weekly team meeting on Monday
  2. BERT demo by team leads

4.0 Goals for the upcoming week

  1. Keep improving the accuracy of text classification
  2. Learn some basic knowledge of recommendation systems

5.0 Detailed statement of tasks done

I finished all the following tasks by myself and with the help of online resources.

I read about the BERT model and came to understand its concepts and how it works. I used the deep learning framework fast.ai to implement a text classification program with the BERT model. The program classifies the 3000 posts from the latest page of SmartThings based on their content, and the predictions are compared against the category metadata on those posts. I got an accuracy of about 37% at first. I then kept only the posts in the 5 most popular categories and reran the experiment, which improved the accuracy to 51%.
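
A minimal sketch of the same train-and-evaluate flow, assuming a hypothetical `content` column in the CSV. Note that fastai ships AWD_LSTM rather than BERT, so the learner below is a stand-in: using actual BERT additionally requires wrapping a Hugging Face tokenizer and model for fastai.

    import pandas as pd
    from fastai.text.all import *

    df = pd.read_csv("posts.csv")  # a `content` text column is assumed

    # Keep only posts in the 5 most popular categories -- the filtering
    # step that lifted accuracy from 37% to 51%
    top5 = df["category"].value_counts().head(5).index
    df = df[df["category"].isin(top5)]

    # fastai's built-in AWD_LSTM classifier stands in here for the BERT
    # setup described above; the train/evaluate flow is the same
    dls = TextDataLoaders.from_df(df, text_col="content",
                                  label_col="category", valid_pct=0.2)
    learn = text_classifier_learner(dls, AWD_LSTM, metrics=accuracy)
    learn.fine_tune(4, 1e-2)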

Finally, I worked with my team members on the group slides, combined our results, and gave a presentation to the whole team at the team meeting.