Week: 7/27 – 8/01/2020
Overview of Things Learned:
Technical Area: Web scraping, Preprocessing
Tools: Requests, BeautifulSoup, re, and Pandas libraries; website
Soft Skills: #selfmotivation #communication
Achievement Highlights
- Web scraping: Used Python and the tools listed above to build a web-crawling structure that extracts tags and key information, such as titles, authors, last-updated dates, and comments, from the Mac Power Users forum.
- Preprocessing: Removed HTML tags, symbols, and unnecessary information from the scraped HTML and stored the cleaned data in a CSV file.
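The cleaning step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses only the standard library (`html.parser`, `re`, `csv`) instead of BeautifulSoup and Pandas so that it runs without installs, and the sample post, the field names, and the helpers `TextExtractor` and `clean_html` are all made up for the example.

```python
import csv
import io
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(raw):
    """Strip tags, then collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample record; the real forum markup and fields will differ.
posts = [
    {
        "title": "<h1>Backup   tips</h1>",
        "author": "<a href='/u/sam'>sam</a>",
        "comment": "<p>Use <b>Time Machine</b> and a clone.</p>",
    },
]

cleaned = [{key: clean_html(value) for key, value in post.items()} for post in posts]

# Write to an in-memory buffer here; a real run would open a .csv file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "author", "comment"])
writer.writeheader()
writer.writerows(cleaned)
csv_text = buffer.getvalue()
```

Joining text nodes with a space keeps words from adjacent tags apart, at the cost of occasionally splitting a word that is only partly wrapped in a tag.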
Meetings attended
7/23 – ML team kick-off meeting
7/27 – Team-4 meeting and introduction
7/29 – web scraping check-in
7/31 – web scraping check-in and preprocessing skills
Goals for the Upcoming Week
- Learn TF-IDF
- Explore the BERT library
Tasks Done
Scraped and preprocessed data from the Mac Power Users forum, stored it in a CSV file, and uploaded the file to GitHub.
Week: 8/03 – 8/08/2020
Overview of Things Learned:
Technical Area: Preprocessing and TF-IDF calculation
Tools: pandas, string, numpy, re, nltk, and sklearn
Soft Skills: #timemanagement #problem-solving
Achievement Highlights
- Preprocessing: Created a user-defined function that loops over the scraped data and removes unnecessary punctuation, symbols, and tags.
- TF-IDF: Tokenized the text, built a bag-of-words vocabulary of unique tokens, and calculated TF-IDF scores.
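A rough sketch of the tokenize-then-score flow described above. It computes the textbook form, term frequency times `log(N / df)`, by hand; this is an assumption made for illustration and is not the smoothed formula that sklearn's `TfidfVectorizer` uses by default, so the numbers will differ from the notebook's. The function names and sample sentences are hypothetical.

```python
import math
import re
import string
from collections import Counter


def preprocess(text):
    """Lowercase, drop punctuation, and split into alphanumeric tokens."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.findall(r"[a-z0-9]+", text)


def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up sample documents.
docs = ["Mac backup tips!", "Backup your Mac, daily."]
scores = tf_idf(docs)
```

A term that appears in every document (here "mac" and "backup") scores zero under this unsmoothed form, which is the behavior that makes TF-IDF favor distinctive words.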
Meetings attended
8/5 – Preprocessing check-in
8/7 – Presenting Preprocessing and TF-IDF
Goals for the Upcoming Week
- Practice BERT classification modules
Tasks Done
Completed the data preprocessing and TF-IDF score calculation.
Link to colab notebook: Google Colab
Week: 8/10 – 8/15/2020
Overview of Things Learned:
Technical Area: Preprocessing and DistilBERT
Tools: tensorflow, pytorch, transformers, torch, keras, tqdm, pandas, io, numpy, matplotlib.pyplot, re, nltk, html.parser, and sklearn
Soft Skills: #timemanagement #problem-solving
Achievement Highlights
- Preprocessing: Cleaned the data, created labels by converting ‘topics’ to one-hot vectors, and added the [CLS] and [SEP] tokens to the sentences.
- BERT: Used the DistilBERT tokenizer to tokenize the text.
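The labeling and marker steps above might look something like the sketch below. It is plain Python with made-up sentences and topics, and the helpers `add_special_tokens` and `one_hot_labels` are hypothetical. Note that Hugging Face tokenizers normally add [CLS] and [SEP] themselves by default, so adding them by hand as shown here is only needed when the tokenizer is told not to.

```python
def add_special_tokens(sentence):
    """Wrap a sentence in the BERT-style start and separator markers."""
    return f"[CLS] {sentence.strip()} [SEP]"


def one_hot_labels(topics):
    """Return the sorted class names and one one-hot row per topic."""
    classes = sorted(set(topics))
    index = {topic: i for i, topic in enumerate(classes)}
    vectors = []
    for topic in topics:
        row = [0] * len(classes)
        row[index[topic]] = 1
        vectors.append(row)
    return classes, vectors


# Made-up sample data; the real forum topics will differ.
sentences = ["How do I clone a disk?", "Best iPad keyboard"]
topics = ["software", "hardware"]

marked = [add_special_tokens(s) for s in sentences]
classes, labels = one_hot_labels(topics)
```

Sorting the class names gives a stable column order, so the same topic always maps to the same position across runs.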
Meetings attended
Goals for the Upcoming Week
- Debug some existing problems in BERT
Tasks Done
Structured the BERT framework and generated inputs, but the labels to attach to the inputs are still missing. Work is in progress to convert the one-hot encoding for each topic into a list of labels.
Link to colab notebook: Google Colab
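One possible reading of the remaining label step, sketched under assumptions: each one-hot row is turned into an integer class index, and the resulting list is paired with the inputs in order. The helper `one_hot_to_index` and the sample rows are hypothetical, and the notebook may need a different label format.

```python
def one_hot_to_index(rows):
    """Convert one-hot rows to integer class indices, validating each row."""
    labels = []
    for row in rows:
        if set(row) - {0, 1} or sum(row) != 1:
            raise ValueError(f"not a one-hot row: {row}")
        labels.append(row.index(1))
    return labels


# Made-up inputs and one-hot rows, one row per input.
inputs = ["[CLS] post a [SEP]", "[CLS] post b [SEP]", "[CLS] post c [SEP]"]
one_hot = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]

labels = one_hot_to_index(one_hot)
paired = list(zip(inputs, labels))
```

Checking each row before converting catches misaligned or multi-topic rows early, which is one common cause of labels failing to line up with the inputs.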