Chris_De_Leon - Machine Learning Pathway

Week: 7/27

Overview of Things Learned:

Achievement Highlights

  • Used Scrapy and Beautiful Soup to scrape data from 10 different forums.
  • Used multi-threaded programming to collect 315.9 MB of data in the form of posts and topics.
  • Posted a list of quality forums to scrape along with code examples of web crawlers to help team members.

Meetings attended

  • 7/27 - Introduction to Web Scraping and this Week’s Deliverables
  • 7/29 - Web Scraping Check In Meeting and Intro to Preprocessing
  • 7/31 - Web Scraping and Preprocessing Presentations

Goals for the Upcoming Week

  • Continue learning about TF-IDF and BERT
  • Assist any team members struggling with web scraping or pre-processing.

Tasks Done

  • Web scraping: Built a web crawler with Scrapy and used it to collect post and topic data from several forums. After scraping one forum, I decided to scrape several more to obtain enough training, validation, and test data. To process that volume of data in a limited time frame, I made the crawler multi-threaded, which let me collect much more data than a single-threaded crawler would have.
  • Pre-processing: After collecting the data, I needed an efficient way to process all of the posts. This involved writing more multi-threaded code to handle the volume of data collected by the crawlers. Using the Pandas library, I stored the cleaned data in CSV files and pushed the completed work to our team GitHub repo.
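
A minimal sketch of the threaded cleaning step described above, not the code from the team repo. The cleaning rules, the function names (clean_post, clean_all, to_csv), and the worker count are assumptions made for illustration, and the standard csv module stands in for Pandas so the example has no dependencies. In CPython, threads help most when the work waits on I/O such as network requests; for CPU-bound cleaning, a process pool is a common alternative.

```python
import csv
import io
import re
from concurrent.futures import ThreadPoolExecutor


def clean_post(raw: str) -> str:
    """Hypothetical cleaning rule: lowercase, drop symbols, collapse whitespace."""
    text = raw.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_all(posts, max_workers=8):
    """Clean posts on a thread pool; pool.map keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(clean_post, posts))


def to_csv(forum, cleaned_posts):
    """Serialize cleaned posts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["forum", "post"])
    for post in cleaned_posts:
        writer.writerow([forum, post])
    return buffer.getvalue()


raw_posts = ["Hello,   World!", "Scrapy + Beautiful Soup = <3"]
cleaned = clean_all(raw_posts)
print(to_csv("example_forum", cleaned))
```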

Week: 8/3

Overview of Things Learned:

  • Technical Area: machine learning
  • Tools: Scikit-learn, SciPy, NLTK
  • Soft Skills: time-management

Achievement Highlights

  • Used Scikit-learn to construct a TF-IDF matrix for ~700,000 posts
  • Performed more data pre-processing techniques such as stemming and removing stop words to obtain more accurate results

Meetings attended

  • 8/5 - Preprocessing Check-In
  • 8/7 - Presenting Preprocessing and TF-IDF

Goals for the Upcoming Week

  • Read more resources on BERT
  • Assist any team members

Tasks Done

  • TF-IDF Processing: Luckily, there weren’t many challenges in constructing the TF-IDF matrix, since Scikit-learn handled most of the work. The library performed well even when the entire data set was used.
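
To show what the library takes care of, here is a dependency-free sketch of the weighting that Scikit-learn's TF-IDF applies under its default settings: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The whitespace tokenizer and the function name tfidf are simplifications assumed for this example; the real library tokenizes differently and returns a sparse matrix rather than dictionaries.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows), where each row maps term -> TF-IDF weight."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)

    # Smoothed IDF, as if one extra document contained every term.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize so that every non-empty row has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vocab, rows


vocab, rows = tfidf(["the cat sat", "the dog sat", "the cat ran"])
print(vocab)
print(rows[1])
```

Only the non-zero weights are stored per row, which mirrors why a sparse matrix keeps a corpus of this size manageable.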

Week: 8/10

Overview of Things Learned:

Achievement Highlights

  • Trained DistilBERT to build a classifier capable of distinguishing between 10 different forums
  • Re-processed the data to remove more noise and fine-tuned the model to improve predictions

Meetings attended

  • 8/10 - Present Pre-Processing and TF-IDF
  • 8/12 - Check-in about implementing the BERT Model
  • 8/14 - Present BERT Model implementations

Goals for the Upcoming Week

  • Continue fine-tuning the model
  • Assist any team members

Tasks Done

  • BERT Notebook: One of the challenges I encountered was removing noise from the data. The pre-processing step had already removed symbols and punctuation, so items such as hyperlinks could no longer be detected in the cleaned text. To overcome this, I filtered out noise starting from the original data set. This exposed a new problem: removing noise from all 700,000 posts was taking too long. Checking whether a post was in English turned out to be very expensive, so I used threads to speed up the noise filtering. Once the posts were filtered, I saved the filtered data frame for ease of access. One more problem remained: due to the sheer size of the data set, I couldn’t store all the tokens, padding, and attention masks without running out of memory. I either needed to store them in another format or reduce the amount of training data. In the interest of time, I sampled 100,000 posts from the data. The sample contained posts from all 10 forums, but since it wasn’t a stratified sample, there was a class imbalance. Out of curiosity, I tried the sample on DistilBERT to examine its behavior, and it achieved an accuracy of ~88%.
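
The ordering problem with hyperlinks can be illustrated with a short sketch. The regular expressions and function names below are assumptions chosen for the example, not the filters used in the notebook. Once punctuation is stripped, a URL dissolves into ordinary-looking words, so links have to be removed from the raw text first.

```python
import re

# Assumption: a deliberately simple URL pattern, not an exhaustive one.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
PUNCT_PATTERN = re.compile(r"[^\w\s]")


def strip_punctuation(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return re.sub(r"\s+", " ", PUNCT_PATTERN.sub(" ", text)).strip()


def remove_links_then_punctuation(text):
    """Drop URLs while they are still recognizable, then strip punctuation."""
    return strip_punctuation(URL_PATTERN.sub(" ", text))


raw = "Try this fix: https://example.com/patch?id=7 it worked!"
print(strip_punctuation(raw))              # URL fragments survive as words
print(remove_links_then_punctuation(raw))  # URL removed before cleaning
```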

Week: 8/17

Overview of Things Learned:

Achievement Highlights

  • Shared processed BERT data with team members
  • Performed fine-tuning on the BERT model

Meetings attended

  • 8/17 - BERT Implementation Check In #2
  • 8/20 - BERT Meeting
  • 8/21 - BERT Presentations

Goals for the Upcoming Week

  • Learn more about AWS/Docker

Tasks Done

  • BERT Notebook: Used stratified sampling and tuned the model hyperparameters to raise validation accuracy to 90%.
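
A minimal sketch of stratified sampling, assuming an equal number of posts is drawn from each forum. The log does not say which allocation was used, so the equal allocation, the function name stratified_sample, and the per_class parameter are all illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, per_class, seed=0):
    """Return sorted indices with up to per_class items drawn from each label."""
    rng = random.Random(seed)

    # Group row indices by their class label (the strata).
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    chosen = []
    for label in sorted(by_label):
        indices = by_label[label]
        # Small classes contribute everything they have.
        k = min(per_class, len(indices))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy data: 90 posts from forum "a" and 10 from forum "b".
labels = ["a"] * 90 + ["b"] * 10
picked = stratified_sample(labels, per_class=5)
print(Counter(labels[i] for i in picked))
```

A plain random sample of the same toy data would be dominated by forum "a", which is the kind of imbalance seen in the earlier 100,000-post sample.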

Week: 8/24

Overview of Things Learned:

Achievement Highlights

  • Created a web app in Flask for my machine learning model
  • Used AWS and Docker to deploy the model
  • Examined the final performance of the model by using it to classify a test set of 50,000 unseen posts

Meetings attended

  • 8/24 - Team Meeting
  • 8/26 - Practice Presentation Team 4
  • 8/27 - Mock presentation Team 4
  • 8/27 - Final presentation Team 4

Goals for the Upcoming Week

  • Keep building on what I learned in the internship!

Tasks Done

  • Deployment: Built a Flask app that takes in a sentence and uses the trained BERT model to predict the forum that best fits it.
  • Final Presentation: Helped create a presentation that summarizes the results of our model.
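
A rough sketch of what such a Flask endpoint can look like. The route name, the JSON field names, the forum labels, and the predict_forum placeholder are assumptions for illustration; in the real app, the placeholder would call the trained model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical label set; the real app would use the 10 scraped forums.
FORUMS = ["forum_a", "forum_b"]


def predict_forum(sentence):
    """Placeholder for the trained classifier (assumption, not the real model)."""
    return FORUMS[len(sentence) % len(FORUMS)]


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a missing or invalid body.
    payload = request.get_json(silent=True) or {}
    sentence = str(payload.get("sentence", "")).strip()
    if not sentence:
        return jsonify(error="field 'sentence' is required"), 400
    return jsonify(sentence=sentence, forum=predict_forum(sentence))

# To serve locally, call app.run() or use the `flask run` command.
```

The endpoint can be exercised without starting a server through app.test_client(), for example by posting {"sentence": "some text"} to /predict.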

FINAL SELF-ASSESSMENT: Brief Summary of Things Learned

  1. Web Scraping: Learned various tools used for web scraping, including Scrapy, Beautiful Soup, and json.

  2. Pre-Processing: Learned various techniques to process large textual data sets. Became more familiar with multi-threading as a way to handle large volumes of data.

  3. TF-IDF: Learned how to use Scikit-learn and SciPy to create a TF-IDF matrix from all the posts scraped in an earlier week.

  4. BERT: Learned about the transformers library in Python and the process of training a transformer model in PyTorch. Gained a better understanding of how to use BERT for forum classification.

  5. Deployment: Learned how to use Docker, AWS, and Flask to launch a simple machine learning application.