Sunlong - Machine Learning Pathway

Week: 7/27 – 8/01/2020

Overview of Things Learned:

Technical Area: Web scraping, Preprocessing

Tools: Requests, BeautifulSoup, re, and Pandas libraries; website

Soft Skills: #selfmotivation #communication

Achievement Highlights

  • Web scraping: Used Python and the tools above to build a web-crawling structure and extract tags and key information, such as titles, authors, last-updated dates, and comments, from the Mac Power Users forum.
  • Preprocessing: Removed HTML tags, signs, and unnecessary information from the scraped HTML and stored the cleaned data in a CSV file.
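
The preprocessing step above can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library instead of the BeautifulSoup and Pandas calls used in the actual work; the sample records, the `clean_html` helper, and the column names are assumptions made for the example.

```python
import csv
import html
import io
import re

TAG_RE = re.compile(r"<[^>]+>")   # matches any HTML tag
SPACE_RE = re.compile(r"\s+")     # matches runs of whitespace


def clean_html(raw: str) -> str:
    """Strip tags, decode HTML entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return SPACE_RE.sub(" ", text).strip()


# Hypothetical scraped records; real ones would come from the forum pages.
posts = [
    {
        "title": "<h1>Backup &amp; sync tips</h1>",
        "comment": "<p>Use <b>Time Machine</b> for backups</p>",
    },
]

cleaned = [{key: clean_html(value) for key, value in post.items()} for post in posts]

# Write to an in-memory buffer here; a real run would open a .csv file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "comment"])
writer.writeheader()
writer.writerows(cleaned)
csv_text = buffer.getvalue()
```

A regular expression is enough for simple tag stripping like this, but a parser such as BeautifulSoup is usually safer for irregular forum markup.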

Meetings attended

7/23 – ML team kick-off meeting

7/27 – Team-4 meeting and introduction

7/29 – web scraping check-in

7/31 – web scraping check-in and preprocessing skills

Goals for the Upcoming Week

  • Learning TF-IDF

  • Exploring the BERT library

Tasks Done

Scraped and preprocessed data from the Mac Power Users forum, stored it in a CSV file, and uploaded the file to GitHub.

Week: 8/03 – 8/08/2020

Overview of Things Learned:

Technical Area: Preprocessing and TF-IDF calculation

Tools: pandas, string, numpy, re, nltk, and sklearn

Soft Skills: #timemanagement #problemsolving

Achievement Highlights

  • Preprocessing: Created a user-defined function that loops over the scraped data and removes unnecessary punctuation, symbols, signs, and tags.
  • TF-IDF: Tokenized the text, built a bag-of-words vocabulary of unique tokens, and calculated TF-IDF scores.
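
The TF-IDF steps above (tokenize, build a vocabulary of unique tokens, score) can be sketched as below. This is a minimal illustration of the textbook formulation, tf = count / document length and idf = ln(N / df), written with the standard library only. It is not the notebook's code; the `tokenize` and `tf_idf` helpers and the sample documents are assumptions. Note that sklearn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers will differ.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return the sorted vocabulary and one row of TF-IDF scores per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        rows.append(
            [(counts[tok] / total) * math.log(n_docs / df[tok]) for tok in vocab]
        )
    return vocab, rows


# Hypothetical sample documents standing in for cleaned forum posts.
vocab, scores = tf_idf(["mac backup tips", "mac keyboard shortcuts"])
```

In this sample, "mac" appears in every document, so its idf is ln(1) = 0 and it scores zero everywhere, which is the behavior that lets TF-IDF downweight uninformative words.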

Meetings attended

8/5 – Preprocessing check-in

8/7 – Presenting Preprocessing and TF-IDF

Goals for the Upcoming Week

  • Practice BERT classification modules

Tasks Done

Completed the data preprocessing and TF-IDF score calculation.

Link to colab notebook: Google Colab

Week: 8/10 – 8/15/2020

Overview of Things Learned:

Technical Area: Preprocessing and DistilBERT

Tools: tensorflow, pytorch, transformers, torch, keras, tqdm, pandas, io, numpy, matplotlib.pyplot, re, nltk, html.parser, and sklearn

Soft Skills: #timemanagement #problemsolving

Achievement Highlights

  • Preprocessing: Cleaned the data, created labels by converting ‘topics’ to one-hot vectors, and added the [CLS] and [SEP] tokens to the sentences.
  • BERT: Used the DistilBERT tokenizer to tokenize the text.
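
The label and special-token preparation above can be sketched as follows. This is a minimal, library-free illustration; the `add_special_tokens` and `one_hot` helpers and the sample sentences and topics are assumptions, not the notebook's code.

```python
def add_special_tokens(sentence: str) -> str:
    """Wrap a sentence in the BERT-style [CLS] and [SEP] markers."""
    return f"[CLS] {sentence} [SEP]"


def one_hot(labels):
    """Map each label to a one-hot vector over the sorted set of classes."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    vectors = [
        [1 if index[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]
    return classes, vectors


# Hypothetical sample data standing in for forum posts and their topics.
sentences = ["How do I sync notes?", "Best keyboard for iPad"]
topics = ["software", "hardware"]

marked = [add_special_tokens(s) for s in sentences]
classes, labels = one_hot(topics)
```

One caveat worth checking in the notebook: the Hugging Face `transformers` tokenizers add [CLS] and [SEP] themselves by default when called directly, so inserting them by hand beforehand may duplicate them.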

Meetings attended

Goals for the Upcoming Week

  • Debug some existing problems in BERT

Tasks Done

Structured the BERT framework and generated the inputs, but the labels to attach to the inputs are still missing. Work is in progress to convert each topic to one-hot form to create the list of labels.

Link to colab notebook: Google Colab