Mandy - Machine Learning (Level 1) Pathway

Mandy · July 7, 2021, 11:32pm

Module 1 Self-Assessment: Weeks 1 & 2

A concise overview of things learned:

Technical Area:
- Prepared environment
- Getting familiar with machine learning concepts
- Understanding the basics of:
1. Collaborative Filtering vs Content-based Filtering and their pros and cons
2. One-hot encoding vs Word Embedding
3. Metrics used: Cosine, Euclidean distance, Dot product
4. Metric Based Evaluation vs Human-Based Evaluation
5. Web Scraping/Crawling
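
For reference, here is a minimal sketch of the three metrics from item 3, written in plain Python. The vectors `post_a` and `post_b` are made-up toy embeddings, not data from the project.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


# Hypothetical toy embeddings: post_b points the same way as post_a but is longer.
post_a = [1.0, 0.0, 2.0]
post_b = [2.0, 0.0, 4.0]

print(dot(post_a, post_b))
print(euclidean_distance(post_a, post_b))
print(cosine_similarity(post_a, post_b))
```

Because the two vectors point in the same direction, cosine similarity treats them as identical while Euclidean distance and the dot product are still affected by their lengths.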
Tools:
- STEM-Away Platform
- Discord
- Jira
- when2meet
- Google Colab
- Anaconda for Jupyter Notebooks (Had this installed before, but will probably use Colab moving forward)
- More Python libraries/packages: Beautiful Soup + Selenium
- Scrapy
- Chrome and Firefox webdrivers
- Previous experience with Python, VSCode, HTML, Git/GitHub
Soft Skills:
- Active communication
- Updating my other leads with details I find
- Directing and guiding a new team on team structure

Achievements:

Set up a detailed teamwork structure with the other leads for our team to follow
Gained a better understanding of machine learning and content-based recommender systems
Learned the basics of how to scrape a website and ways data can be formatted

Goals for the upcoming week:

Tasks Completed:

Mandy · July 23, 2021, 10:54pm

Module 2 Self-Assessment: Weeks 3 & 4

A concise overview of things learned:

Technical Area:
- Followed the tutorials provided in Module 2 and familiarized myself with the HTML of the PyTorch Community
- Scraped data from 7,000+ posts in the PyTorch Community
- Stored the scraped data in a CSV file
- Did some basic data cleaning:
1. Combining the textual data that was gathered: title, leading comment, other comments
2. Removing stop words, punctuation, and other unnecessary data
3. Lowercasing data
4. Tokenization, POS tagging (pos_tags), and WordNet lemmatization
5. Saving into a new cleaned CSV file
- Generated word clouds
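
A rough sketch of cleaning steps 1 to 3 plus tokenization, using only the standard library. To keep it self-contained it swaps in a tiny hard-coded stop-word set instead of the nltk stop-word list, and it leaves out POS tagging and lemmatization. The function name `clean_post` and the sample text are hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word set, not the nltk list used in the project.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "on", "for"}


def clean_post(title, leading_comment, other_comments):
    """Combine a post's text fields, lowercase, strip punctuation, drop stop words."""
    combined = " ".join([title, leading_comment, *other_comments])
    lowered = combined.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    no_punctuation = re.sub(r"[^a-z0-9\s]", " ", lowered)
    tokens = no_punctuation.split()
    return [token for token in tokens if token not in STOP_WORDS]


tokens = clean_post(
    "CUDA error!",
    "The model is on the GPU.",
    ["Try model.cpu()"],
)
print(tokens)
```

The resulting token list is the kind of input a word cloud or a vectorizer would consume.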
Tools:
- VSCode
- Jupyter Notebooks via Anaconda
- Beautiful Soup, Selenium, Geckodriver(Firefox), CSV files, requests, pandas, json, re, nltk, wordcloud, matplotlib
- Python, HTML
- Git/GitHub
- Discord, Jira, STEM-Away Platform
Soft Skills:
- Active communication on Discord
1. Updating and communicating with the leads
2. Replying to all questions asked from participants
- Hosted team meetings and added a weekly games event
1. Team updates & To Do List
2. Github overview
3. Set module due dates
4. Module 2 web scraping and EDA tutorial overview
- Attending mentor meetings hosted by Sara and Anubhav

Achievements:

Understood how data can be scraped from a website and stored in a CSV file.
Learned how to do data cleaning and data visualization.
Provided detailed team structure and resource outline to team members
Gained a better understanding of machine learning

Goals for the upcoming week:

Tasks Completed:

Pushed testing file to our team GitHub repo
Familiarized myself with the PyTorch Community
Properly installed geckodriver and set its path on Windows.
Scraped data from the PyTorch Community. Problem faced: the PyTorch data set is huge (around 45,000+ posts). Scraping took a very long time, and the run timed out whenever the next post did not load within 5 minutes. To solve this, I scraped by category: I did 5 runs (around 1,000-2,000 posts per run), then combined the 5 CSV files into one CSV file and edited it.
Performed data cleaning and EDA on the data gathered.
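
The step of combining the per-category CSV files could look roughly like the sketch below. It uses the standard `csv` module with in-memory files so it runs anywhere. The column names (`url`, `title`), the sample rows, and the duplicate-dropping rule are all assumptions for illustration, since the actual editing done on the combined file is not described here.

```python
import csv
import io


def combine_csv(sources, key="url"):
    """Merge rows from several CSV file-like objects, keeping the first row per key."""
    seen = set()
    combined = []
    for source in sources:
        for row in csv.DictReader(source):
            if row[key] in seen:
                continue  # same post scraped in more than one run
            seen.add(row[key])
            combined.append(row)
    return combined


# Hypothetical output of two category runs; a real run would open files on disk.
run_vision = io.StringIO("url,title\n/t/1,Resize images\n/t/2,Augmentation\n")
run_nlp = io.StringIO("url,title\n/t/2,Augmentation\n/t/3,Tokenizer question\n")

rows = combine_csv([run_vision, run_nlp])
print(len(rows))
```

With pandas, the same idea is usually written as a `pd.concat` over the per-run DataFrames followed by `drop_duplicates`.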

Mandy · August 9, 2021, 5:46am

Module 3 Self-Assessment: Weeks 5 & 6

A concise overview of things learned:

Technical Area:
- Trained basic machine learning models on the collected data using:
1. Naive Bayes
2. Linear SVM
3. Decision Tree
4. Logistic Regression
5. Random Forest
6. XGBoost
7. Light GBM
- Ensemble machine learning models training:
1. Pipeline with Doc2Vec & (Logistic Regression, Random Forest, XGBoost)
2. Pipeline with TF-IDF & (Logistic Regression, Random Forest, XGBoost)
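
As an illustration of item 2, a TF-IDF plus Logistic Regression pipeline in scikit-learn might be set up as in this sketch. The four training texts and the two category labels are invented toy data, and the hyperparameters are defaults rather than the ones used in the project. Swapping `LogisticRegression` for a Random Forest or XGBoost classifier would follow the same pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: cleaned post text and its forum category.
texts = [
    "cuda out of memory error on gpu",
    "gpu memory allocation failed cuda",
    "dataloader num workers slow loading",
    "dataset dataloader batch loading",
]
labels = ["cuda", "cuda", "data", "data"]

# The vectorizer turns text into TF-IDF features; the classifier predicts the category.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

predicted = pipeline.predict(["cuda gpu memory"])[0]
print(predicted)
```

Keeping the vectorizer inside the pipeline means new text is transformed with the same vocabulary that was learned during training.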
Tools:
- Jupyter Notebooks via Anaconda
- CSV files, Beautiful Soup, pandas, re, nltk, wordcloud, matplotlib, numpy, scikit-learn, xgboost, lightgbm, gensim
- Python
- Git/GitHub
- Discord, Jira, STEM-Away Platform
Soft Skills:
- Continue with active communication on Discord
1. Updating and communicating with the leads
2. Replying to all questions asked from participants
- Hosted weekly team meetings
1. Team updates & To-Do List/Reminders
2. Module 3 Discussion
3. Team Presentation Overview
4. Module 4 Overview
- Attending mentor meetings hosted by Anubhav

Achievements:

Trained Naive Bayes, Linear SVM, Logistic Regression, Decision Tree, Random Forest, XGBoost, and Light GBM models on the collected data and analyzed their performance.
Trained Doc2Vec and TF-IDF pipelines with (Logistic Regression, Random Forest, XGBoost). Best performance came from TF-IDF with XGBoost.
Reviewed recorded final presentations by session 1 teams and gained a better understanding of the project
Communicated with other leads regarding project direction

Goals for the upcoming week:

Tasks Completed:

Mandy · August 20, 2021, 9:07pm

Module 4 Self-Assessment: Weeks 7 & 8

A concise overview of things learned:

Technical Area:
- Trained advanced machine learning models on the collected data using:
1. BERT
2. RoBERTa
3. DistilBERT
4. XLNet
- Building and Dockerizing Web Application
Tools:
- Jupyter Notebooks via Anaconda + Google Colab
- VSCode
- Python + HTML + CSS + Dockerfile
- Simple Transformers, Tokenizers, BeautifulSoup4, lxml, PyTorch, tarfile, sklearn, Pandas, Re, etc…
- Flask
- Docker
- Git/GitHub
- Discord, STEM-Away Platform
Soft Skills:
- Continue with active communication on Discord
1. Updating and communicating with the leads
2. Replying to all questions asked from participants
- Hosted weekly team meetings
1. Team updates & To-Do List/Reminders
2. Module 4 Discussion + Project Direction
3. Final Team Presentation Overview + Splitting up sections for each member
4. Gave a step-by-step tutorial on building and dockerizing web applications on Windows for team members.
5. Exchanged LinkedIn profiles with team members to stay in touch

Achievements/Tasks Completed:

Gave first team presentation to mentors and improved based on feedback.
Confirmed project direction with team members.
Retrained all basic models with new data set (45,000+ posts).
Trained four different advanced models (BERT, RoBERTa, DistilBERT, XLNet) and picked BERT based on the highest accuracy as the web app foundation.
The Web App Process:
1. Transferred over trained BERT model & code
2. Added in HTML + CSS + Dockerfile
3. Wrapped everything in a Flask API
4. Dockerized the Web App
5. Ran it locally
Completed Web App Purpose:
1. Recommends categories to users for a specific post.
2. Helps reduce the imbalance in the number of posts between categories.
3. Allows users to find more relevant posts in their category.
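
The Flask wrapping step from the web app process could be sketched like this. It is not the team's actual app: `predict_category` is a hypothetical stand-in for the trained BERT model, and the `/predict` route, JSON field names, and category names are assumptions made for the example.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_category(text):
    """Hypothetical placeholder for the trained BERT classifier."""
    return "vision" if "image" in text.lower() else "uncategorized"


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"text": "..."}; treat anything else as empty.
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "")
    if not text.strip():
        return jsonify({"error": "text is required"}), 400
    return jsonify({"category": predict_category(text)})


if __name__ == "__main__":
    # Binding to 0.0.0.0 makes the app reachable from outside a Docker container.
    app.run(host="0.0.0.0", port=5000)
```

In a Dockerized setup, the Dockerfile would typically install the dependencies, copy this file and the model in, and run it with the port published to the host.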
Coordinated with team members for final presentation
Delivered final team presentation + Web App Demonstration
Completed all modules and gained a deeper understanding of machine learning
Will push all work to team GitHub Repo