Week: 7/27
Overview of Things Learned:
Achievement Highlights
- Used Scrapy and Beautiful Soup to scrape data from 10 different forums.
- Used multi-threaded programming to collect 315.9 MB of data in the form of posts and topics.
- Posted a list of quality forums to scrape along with code examples of web crawlers to help team members.
Meetings attended
- 7/27 - Introduction to Web Scraping and this Week’s Deliverables
- 7/29 - Web Scraping Check In Meeting and Intro to Preprocessing
- 7/31 - Web Scraping and Preprocessing Presentations
Goals for the Upcoming Week
- Continue learning about TF-IDF and BERT
- Assist any team members struggling with web scraping or pre-processing.
Tasks Done
- Web scraping: Built a web crawler using Scrapy, then used it to collect post and topic data from different forums. After scraping one forum, I decided to scrape several more to obtain a sufficient amount of training/validation/test data. To process such a large volume of data in a limited time frame, I applied multi-threaded programming to my crawler code, which let me collect far more data than a serialized crawler would have (see the first sketch after this list).
- Pre-processing: After collecting all the data, I needed an efficient way to process all of the posts. This involved writing more multi-threaded code to handle the volume of data collected by the crawlers. Using the Pandas library, I stored all the cleaned data in CSV files and pushed the completed work to our team GitHub repo (see the second sketch after this list).
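A minimal sketch of what the Scrapy crawler could look like; the start URL and CSS selectors are hypothetical placeholders, not the forums actually scraped:

```python
import scrapy

class ForumSpider(scrapy.Spider):
    """Crawls a forum index, follows topic links, and yields post items."""
    name = "forum_posts"
    start_urls = ["https://forum.example.com/"]  # placeholder, not a real target

    def parse(self, response):
        # Follow each topic link found on the index page.
        for href in response.css("a.topic-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_topic)

    def parse_topic(self, response):
        # Yield one item per post in the topic thread.
        for post in response.css("div.post"):
            yield {
                "topic": response.css("h1::text").get(),
                "post": " ".join(post.css("::text").getall()).strip(),
            }
```

This can be run with `scrapy runspider forum_spider.py -o posts.json` to dump the scraped items to JSON.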
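And a minimal sketch of the threaded cleaning step; clean_post() and its rules are hypothetical stand-ins for the actual pre-processing, and the sample posts are placeholders:

```python
import re
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def clean_post(text: str) -> str:
    """Hypothetical cleaning rules: drop links, symbols, and extra whitespace."""
    text = re.sub(r"http\S+", " ", text)       # strip hyperlinks
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # strip symbols and punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

posts = ["Check https://example.com NOW!!", "a second placeholder post..."]

# Clean posts across a pool of worker threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    cleaned = list(pool.map(clean_post, posts))

# Store the cleaned data as CSV, as described above.
pd.DataFrame({"post": cleaned}).to_csv("cleaned_posts.csv", index=False)
```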
Week: 8/3
Overview of Things Learned:
- Technical Area: machine learning
- Tools: Scikit-learn, SciPy, NLTK
- Soft Skills: time management
Achievement Highlights
- Used Scikit-learn to construct a TF-IDF matrix for ~700,000 posts
- Applied additional pre-processing techniques, such as stemming and stop-word removal, to obtain more accurate results
Meetings attended
- 8/5 - Preprocessing Check-In
- 8/7 - Presenting Preprocessing and TF-IDF
Goals for the Upcoming Week
- Read more resources on BERT
- Assist any team members
Tasks Done
- TF-IDF Processing: Luckily, there weren't many challenges in constructing the TF-IDF matrix, since most of the work was handled by Scikit-learn. The library performed well even when the entire data set was used.
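A minimal sketch of this step, assuming the cleaned posts sit in a CSV column named "post" (a hypothetical name); the Porter stemmer stands in for whatever stemmer was actually used:

```python
import pandas as pd
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

posts = pd.read_csv("cleaned_posts.csv")["post"].astype(str)

# Stem each word so that e.g. "posting" and "posted" share one term.
stemmer = PorterStemmer()
stemmed = posts.map(lambda p: " ".join(stemmer.stem(w) for w in p.split()))

# Scikit-learn returns a SciPy sparse matrix, which keeps ~700,000 posts tractable.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(stemmed)
print(tfidf.shape)  # (num_posts, vocabulary_size)
```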
Week: 8/10
Overview of Things Learned:
Achievement Highlights
- Trained DistilBERT to build a classifier capable of distinguishing between 10 different forums
- Re-processed the data to remove more noise and fine-tuned the model to improve predictions
Meetings attended
- 8/10 - Present Pre-Processing and TF-IDF
- 8/12 - Check-in about implementing the BERT Model
- 8/14 - Present BERT Model implementations
Goals for the Upcoming Week
- Continue fine-tuning the model
- Assist any team members
Tasks Done
- BERT Notebook: One of the challenges I encountered was removing noise from the data. In the pre-processing step we had stripped symbols and punctuation, so it was no longer possible to find items such as hyperlinks. To overcome this, I decided to filter out noise starting from the original data set. This opened up a new problem: removing noise from all 700,000 posts was taking too long. Checking whether a post was in English turned out to be very expensive, so I used threads to speed up the noise filtering. Once the posts were filtered, I saved the cleaned data frame for ease of access. Unfortunately, one more problem occurred: due to the sheer size of the data set, I couldn't store all the tokens, padding, and attention masks without running out of memory. This meant I either needed to store them in another format or reduce the amount of training data. In the interest of time, I decided to sample 100,000 posts from the data. The sample contained posts from all 10 forums, but since it wasn't a stratified sample, there was a class imbalance. Out of curiosity, I trained DistilBERT on the sample to examine its behavior, and it achieved an accuracy of ~88%.
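A minimal sketch of the DistilBERT setup described above, using the Hugging Face transformers library; the sample texts, labels, and max_length are placeholders rather than the actual values:

```python
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=10  # one label per forum
)

texts = ["placeholder post one", "placeholder post two"]  # stands in for the 100,000-post sample
labels = torch.tensor([0, 3])                             # placeholder forum ids

# Tokenize with truncation/padding so the tensors stay a fixed, bounded size.
enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

# A single training step of the fine-tuning loop.
out = model(**enc, labels=labels)
out.loss.backward()
```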
Week: 8/17
Overview of Things Learned:
Achievement Highlights
- Shared processed BERT data with team members
- Performed fine-tuning on the BERT model
Meetings attended
- 8/17 - BERT Implementation Check In #2
- 8/20 - BERT Meeting
- 8/21 - BERT Presentations
Goals for the Upcoming Week
- Learn more about AWS/Docker
Tasks Done
- BERT Notebook: Used stratified sampling and fine-tuned the model's hyper-parameters to raise validation accuracy to 90%.
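A minimal sketch of the stratified sampling step, assuming the posts live in a CSV with hypothetical "post" and "forum" columns:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cleaned_posts.csv")  # assumed columns: "post" and "forum"

# stratify keeps each forum's share of the 100,000-post sample equal to its
# share of the full data set, removing the earlier class imbalance.
sample, _ = train_test_split(df, train_size=100_000, stratify=df["forum"], random_state=42)
```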
Week: 8/24
Overview of Things Learned:
Achievement Highlights
- Created a web app in Flask for my machine learning model
- Used AWS and Docker to deploy the model
- Examined the final performance of the model by using it to classify a test set of 50,000 unseen posts
Meetings attended
- 8/24 - Team Meeting
- 8/26 - Practice Presentation Team 4
- 8/27 - Mock presentation Team 4
- 8/27 - Final presentation Team 4
Goals for the Upcoming Week
- Keep building on what I learned in the internship!
Tasks Done
- Deployment: Built a Flask app that takes in a sentence and uses the trained BERT model to predict the forum the sentence fits best under (see the sketch after this list).
- Final Presentation: Helped create a presentation that summarizes the results of our model.
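A minimal sketch of such a Flask app, assuming the fine-tuned model and tokenizer were saved to a local ./model directory; the route and JSON field names are assumptions:

```python
import torch
from flask import Flask, jsonify, request
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

app = Flask(__name__)

# Assumes the fine-tuned model and tokenizer were saved to ./model.
tokenizer = DistilBertTokenizerFast.from_pretrained("./model")
model = DistilBertForSequenceClassification.from_pretrained("./model")
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    sentence = request.json["sentence"]  # assumed request field
    enc = tokenizer(sentence, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return jsonify({"forum_id": int(logits.argmax(dim=-1))})

if __name__ == "__main__":
    app.run()
```

A POST to /predict with {"sentence": "..."} returns a forum id; mapping ids back to forum names would use whatever label encoding was chosen during training.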
FINAL SELF-ASSESSMENT: Brief Summary of Things Learned
- Web Scraping: Learned various tools used for web scraping, including Scrapy, BeautifulSoup, and Python's json module.
- Pre-Processing: Learned various techniques for processing large textual data sets. Built more familiarity with Python's threading framework for handling large volumes of data.
- TF-IDF: Learned how to use Scikit-learn and SciPy to create a TF-IDF matrix from all the posts scraped in earlier weeks.
- BERT: Learned about the transformers library in Python and the process of training a transformer model in PyTorch. Gained a better understanding of how to use BERT for forum classification.
- Deployment: Learned how to use Docker, AWS, and Flask to launch a simple machine learning application.