Week: 7/27
Overview of Things Learned:
Achievement Highlights
- Used Scrapy and Beautiful Soup to scrape data from 10 different forums.
- Used multi-threaded programming to collect 315.9 MB of data in the form of posts and topics.
- Posted a list of quality forums to scrape along with code examples of web crawlers to help team members.
Meetings attended
- 7/27 - Introduction to Web Scraping and this Week’s Deliverables
- 7/29 - Web Scraping Check In Meeting and Intro to Preprocessing
- 7/31 - Web Scraping and Preprocessing Presentations
Goals for the Upcoming Week
- Continue learning about TF-IDF and BERT
- Assist any team members struggling with web scraping or pre-processing.
Tasks Done
- Web scraping: Built a web crawler using Scrapy, then used it to collect post and topic data from different forums. After scraping one forum, I decided to scrape several more to obtain a sufficient amount of training/validation/test data. To process such a large volume of data in a limited time frame, I applied multi-threaded programming to my crawler code, which let me collect far more data than a serialized crawler would have (see the first sketch after this list).
- Pre-processing: After collecting all the data, I needed an efficient way to process all of the posts. This involved writing more multi-threaded code to handle the volume of data collected by the crawlers. Using the Pandas library, I stored all the cleaned data in CSV files and pushed the completed work to our team GitHub repo (see the second sketch after this list).
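A minimal sketch of what the Scrapy crawler could look like; the start URL and CSS selectors are hypothetical placeholders, not the forums actually scraped:

```python
import scrapy

class ForumSpider(scrapy.Spider):
    """Crawls a forum index, follows topic links, and yields post items."""
    name = "forum_posts"
    start_urls = ["https://forum.example.com/"]  # placeholder, not a real target

    def parse(self, response):
        # Follow each topic link found on the index page.
        for href in response.css("a.topic-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_topic)

    def parse_topic(self, response):
        # Yield one item per post in the topic thread.
        for post in response.css("div.post"):
            yield {
                "topic": response.css("h1::text").get(),
                "post": " ".join(post.css("::text").getall()).strip(),
            }
```

This can be run with `scrapy runspider forum_spider.py -o posts.json` to dump the scraped items to JSON.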
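And a minimal sketch of the threaded cleaning step; clean_post() and its rules are hypothetical stand-ins for the actual pre-processing, and the sample posts are placeholders:

```python
import re
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def clean_post(text: str) -> str:
    """Hypothetical cleaning rules: drop links, symbols, and extra whitespace."""
    text = re.sub(r"http\S+", " ", text)       # strip hyperlinks
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # strip symbols and punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

posts = ["Check https://example.com NOW!!", "a second placeholder post..."]

# Clean posts across a pool of worker threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    cleaned = list(pool.map(clean_post, posts))

# Store the cleaned data as CSV, as described above.
pd.DataFrame({"post": cleaned}).to_csv("cleaned_posts.csv", index=False)
```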
Week: 8/3
Overview of Things Learned:
- Technical Area: machine learning
- Tools: Scikit-learn, SciPy, NLTK
- Soft Skills: time management
Achievement Highlights
- Used Scikit-learn to construct a TF-IDF matrix for ~700,000 posts
- Applied additional pre-processing techniques, such as stemming and stop-word removal, to obtain more accurate results
Meetings attended
- 8/5 - Preprocessing Check-In
- 8/7 - Presenting Preprocessing and TF-IDF
Goals for the Upcoming Week
- Read more resources on BERT
- Assist any team members
Tasks Done
- TF-IDF Processing: Luckily, there weren't many challenges in constructing the TF-IDF matrix, since most of the work was handled by Scikit-learn. The library performed well even when the entire data set was used.
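A minimal sketch of this step, assuming the cleaned posts sit in a CSV column named "post" (a hypothetical name); the Porter stemmer stands in for whatever stemmer was actually used:

```python
import pandas as pd
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

posts = pd.read_csv("cleaned_posts.csv")["post"].astype(str)

# Stem each word so that e.g. "posting" and "posted" share one term.
stemmer = PorterStemmer()
stemmed = posts.map(lambda p: " ".join(stemmer.stem(w) for w in p.split()))

# Scikit-learn returns a SciPy sparse matrix, which keeps ~700,000 posts tractable.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(stemmed)
print(tfidf.shape)  # (num_posts, vocabulary_size)
```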
Week: 8/10
Overview of Things Learned:
Achievement Highlights
- Trained DistilBERT to build a classifier capable of distinguishing between 10 different forums
- Re-processed the data to remove more noise and fine-tuned the model to improve predictions
Meetings attended
- 8/10 - Present Pre-Processing and TF-IDF
- 8/12 - Check-in about implementing the BERT Model
- 8/14 - Present BERT Model implementations
Goals for the Upcoming Week
- Continue fine-tuning the model
- Assist any team members
Tasks Done
- BERT Notebook: One of the challenges I encountered was removing noise from the data. In the pre-processing step we had stripped symbols and punctuation, so it was no longer possible to find items such as hyperlinks. To overcome this, I decided to filter out noise starting from the original data set. This opened up a new problem: removing noise from all 700,000 posts was taking too long. Checking whether a post was in English turned out to be very expensive, so I used threads to speed up the noise filtering. Once the posts were filtered, I saved the cleaned data frame for ease of access. Unfortunately, one more problem occurred: due to the sheer size of the data set, I couldn't store all the tokens, padding, and attention masks without running out of memory. This meant I either needed to store them in another format or reduce the amount of training data. In the interest of time, I decided to sample 100,000 posts from the data. The sample contained posts from all 10 forums, but since it wasn't a stratified sample, there was a class imbalance. Out of curiosity, I trained DistilBERT on the sample to examine its behavior, and it achieved an accuracy of ~88%.
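A minimal sketch of the DistilBERT setup described above, using the Hugging Face transformers library; the sample texts, labels, and max_length are placeholders rather than the actual values:

```python
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=10  # one label per forum
)

texts = ["placeholder post one", "placeholder post two"]  # stands in for the 100,000-post sample
labels = torch.tensor([0, 3])                             # placeholder forum ids

# Tokenize with truncation/padding so the tensors stay a fixed, bounded size.
enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

# A single training step of the fine-tuning loop.
out = model(**enc, labels=labels)
out.loss.backward()
```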
Week: 8/17
Overview of Things Learned:
Achievement Highlights
- Shared processed BERT data with team members
- Performed fine-tuning on the BERT model
Meetings attended
- 8/17 - BERT Implementation Check In #2
- 8/20 - BERT Meeting
- 8/21 - BERT Presentations
Goals for the Upcoming Week
- Learn more about AWS/Docker
Tasks Done
- BERT Notebook: Used stratified sampling and fine-tuned the model's hyper-parameters to raise validation accuracy to 90%.
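A minimal sketch of the stratified sampling step, assuming the posts live in a CSV with hypothetical "post" and "forum" columns:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cleaned_posts.csv")  # assumed columns: "post" and "forum"

# stratify keeps each forum's share of the 100,000-post sample equal to its
# share of the full data set, removing the earlier class imbalance.
sample, _ = train_test_split(df, train_size=100_000, stratify=df["forum"], random_state=42)
```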
Week: 8/24
Overview of Things Learned:
Achievement Highlights
- Created a web app in Flask for my machine learning model
- Used AWS and Docker to deploy the model
- Examined the final performance of the model by using it to classify a test set of 50,000 unseen posts
Meetings attended
- 8/24 - Team Meeting
- 8/26 - Practice Presentation Team 4
- 8/27 - Mock presentation Team 4
- 8/27 - Final presentation Team 4
Goals for the Upcoming Week
- Keep building on what I learned in the internship!
Tasks Done
- Deployment: Built a Flask app that takes in a sentence and uses the trained BERT model to predict the forum the sentence fits best under (see the sketch after this list).
- Final Presentation: Helped create a presentation that summarizes the results of our model.
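A minimal sketch of such a Flask app, assuming the fine-tuned model and tokenizer were saved to a local ./model directory; the route and JSON field names are assumptions:

```python
import torch
from flask import Flask, jsonify, request
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

app = Flask(__name__)

# Assumes the fine-tuned model and tokenizer were saved to ./model.
tokenizer = DistilBertTokenizerFast.from_pretrained("./model")
model = DistilBertForSequenceClassification.from_pretrained("./model")
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    sentence = request.json["sentence"]  # assumed request field
    enc = tokenizer(sentence, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return jsonify({"forum_id": int(logits.argmax(dim=-1))})

if __name__ == "__main__":
    app.run()
```

A POST to /predict with {"sentence": "..."} returns a forum id; mapping ids back to forum names would use whatever label encoding was chosen during training.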
FINAL SELF-ASSESSMENT: Brief Summary of Things Learned
- Web Scraping: Learned various tools used for web scraping, including Scrapy, BeautifulSoup, and Python's json module.
- Pre-Processing: Learned various techniques for processing large textual data sets. Built more familiarity with Python's threading framework for handling large volumes of data.
- TF-IDF: Learned how to use Scikit-learn and SciPy to create a TF-IDF matrix from all the posts scraped in earlier weeks.
- BERT: Learned about the transformers library in Python and the process of training a transformer model in PyTorch. Gained a better understanding of how to use BERT for forum classification.
- Deployment: Learned how to use Docker, AWS, and Flask to launch a simple machine learning application.