JoeyHuang - Machine Learning Pathway

Week: 7/27

Overview of Things Learned:

Achievement Highlights

  • Scraped data from over 41,000 posts (The Commons forum)
  • Utilized both raw HTML and JSON data collection
  • Learned about cleaning data
  • Learned debugging skills

Meetings attended

  • 7/29 - Web Scraping Check-In Meeting and Intro to Preprocessing
  • 7/31 - Web Scraping and Preprocessing Presentations

Goals for the Upcoming Week

  • Continue learning about BERT and TF-IDF

Tasks Done

  • Web scraping: Built a web crawler using Scrapy and used it to scrape data from JSON files. I originally had a raw HTML scraper using Selenium and BeautifulSoup, but preloading pages and then running the scraper took too much time. Consequently, I switched to Scrapy, which was quicker and more efficient.
  • Pre-processing: The crawler finished after running for a few hours. I stored the collected data directly in a .csv file on my computer using pandas, cleaned it up, and then pushed the .csv file to the GitHub repo.
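
The JSON-to-.csv step described above can be sketched roughly as follows. This is a minimal illustration, not the crawler from the project: the sample payload and its field names (id, topic, body) are hypothetical, and it uses Python's built-in json and csv modules instead of Scrapy and pandas so that it runs without extra installs.

```python
import csv
import io
import json

# Hypothetical payload: these field names are illustrative assumptions,
# not the forum's real JSON schema.
SAMPLE_JSON = """
{"posts": [
  {"id": 1, "topic": "Welcome", "body": "<p>Hello everyone!</p>"},
  {"id": 2, "topic": "Help", "body": "<p>How do I reset my password?</p>"}
]}
"""

FIELDS = ["id", "topic", "body"]


def posts_to_rows(raw_json):
    """Parse one JSON response and keep only the fields we want to store."""
    payload = json.loads(raw_json)
    return [
        {name: post.get(name, "") for name in FIELDS}
        for post in payload.get("posts", [])
    ]


def write_csv(rows, handle):
    """Write the rows to any file-like object as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = posts_to_rows(SAMPLE_JSON)
buffer = io.StringIO()  # stands in for open("posts.csv", "w", newline="")
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

Writing to a file-like object keeps the sketch easy to check; swapping the buffer for a real file handle would produce the .csv on disk.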

Week: 8/3

Overview of Things Learned:

  • Technical Area: Preprocessing, TF-IDF
  • Tools: Pandas, BeautifulSoup, re, Markdown, string, Scikit-Learn
  • Soft Skills: #communication #time-management

Achievement Highlights

  • Polished my skills in finding and removing certain HTML tags, whitespace, and punctuation using different libraries
  • Obtained accurate results and stored the data in a .csv file
  • Created a matrix with weighted values from TF-IDF
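
A minimal sketch of the kind of cleaning listed above (removing HTML tags, punctuation, and extra whitespace), using only the standard re, string, and html modules. The function name clean_post and the exact order of steps are assumptions for illustration; the project's actual cleaning code may have differed.

```python
import html
import re
import string

TAG_RE = re.compile(r"<[^>]+>")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_post(raw):
    """Strip HTML tags and punctuation, then collapse runs of whitespace."""
    # Replace tags with a space so words from adjacent tags do not merge.
    text = TAG_RE.sub(" ", raw)
    # Decode entities such as &amp; before removing punctuation.
    text = html.unescape(text)
    text = text.translate(PUNCT_TABLE)
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_post("<p>Hello,world!</p><p>Second   line.</p>")
```

Replacing tags and punctuation with a space, instead of deleting them outright, is what keeps neighbouring words separated in the output.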

Meetings attended

  • 8/5 - Preprocessing Check-In
  • 8/7 - Presenting Preprocessing and TF-IDF

Goals for the Upcoming Week

  • Polish TF-IDF
  • Work on BERT

Tasks Done

  • Pre-processing: My previous code had a lot of errors. I fixed them and added a space between words.
  • TF-IDF: Completed the weighted TF-IDF matrix with Scikit-Learn
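
To show what the weighted values in a TF-IDF matrix mean, here is a small hand-rolled sketch in plain Python. It is not the Scikit-Learn code from the project. It assumes whitespace tokenization (a simplification), a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2-normalized rows, which is the same weighting Scikit-Learn's TfidfVectorizer applies by default. The function name tfidf_matrix and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, matrix) with one L2-normalized row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed idf: rarer terms get larger weights.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(value * value for value in row))
        matrix.append([value / norm if norm else 0.0 for value in row])
    return vocab, matrix


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, matrix = tfidf_matrix(docs)
dog_weight = matrix[1][vocab.index("dog")]
the_weight = matrix[1][vocab.index("the")]
```

In the second document, "dog" appears in only one post while "the" appears in all three, so dog_weight comes out larger than the_weight. That down-weighting of common words is the point of the technique.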

Week: 8/10

Overview of Things Learned:

Achievement Highlights

  • Learned about BERT, how it works, and its advantages
  • Began training BERT model with data gathered from The Commons Forums

Goals for the Upcoming Week

  • Learn more about BERT
  • Continue working on the model and fine-tune it to fix any mistakes

Tasks Done

  • Data processing: I reprocessed my data from The Commons to work better with BERT and also reprocessed to reduce noise from the posts.
  • BERT: I created a simple BERT model and tested it on small data sets. I am working on fine-tuning the model to reduce the percentage of error.

Week: 8/17

Overview of Things Learned:

  • Technical Area: Machine Learning, BERT
  • Tools: DistilBERT, Machine Learning
  • Soft Skills: #communication #teamwork

Achievement Highlights

  • Learned more about BERT with teammates
  • Troubleshot and fixed a few problems, making the BERT model more accurate

Goals for the Upcoming Week

  • Complete BERT
  • Work on AWS and Docker

Tasks Done

  • Finished training the BERT model and learned a lot about how it works

Week: 8/24

Overview of Things Learned:

Achievement Highlights

  • Tried to deploy the BERT model I made using AWS/Docker
  • Examined performance of model

Goals for the Upcoming Week

  • Continue to learn more about Machine Learning and programming in general after the internship!

Tasks Done

  • Finished BERT and deployment using AWS/Docker

FINAL SELF-ASSESSMENT: Brief Summary of Things Learned

  1. Web Scraping: Learned how to gather data from forums using JSON files, Selenium, BeautifulSoup, and requests.
  2. Pre-Processing: Learned how to process large data sets and remove HTML tags, symbols, and other unwanted information using Python libraries.
  3. TF-IDF: Learned how to use Scikit-learn and SciPy to create a weighted TF-IDF vector matrix from the processed data scraped from forums.
  4. BERT: I had a lot of problems because I was new to this, but I managed to fix most of them. I learned a lot about BERT and PyTorch and managed to finish the model.
  5. Deployment: I learned how to use Docker, AWS, and Flask to deploy a simple machine learning application.