Mandy - Machine Learning (Level 1) Pathway

Module 1 Self-Assessment: Weeks 1 & 2

A concise overview of things learned:

  • Technical Area:

    • Prepared my environment
    • Became familiar with machine learning concepts
    • Learned the basics of:
    1. Collaborative Filtering vs Content-based Filtering and their pros and cons
    2. One-hot encoding vs Word Embedding
    3. Metrics used: cosine similarity, Euclidean distance, dot product (see the numpy sketch after this list)
    4. Metric Based Evaluation vs Human-Based Evaluation
    5. Web Scraping/Crawling
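A minimal numpy sketch of the three similarity measures listed above (toy vectors, purely illustrative):

```python
import numpy as np

# Two toy item/user embedding vectors (values are illustrative)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

dot = np.dot(a, b)                    # dot product: grows with vector length
euclidean = np.linalg.norm(a - b)     # Euclidean distance: lower means more similar
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine: angle only, ignores magnitude

print(dot, euclidean, cosine)         # 25.0, 3.0, ~0.996
```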
  • Tools:

    • STEM-Away Platform
    • Discord
    • Jira
    • when2meet
    • Google Colab
    • Anaconda for Jupyter Notebooks (Had this installed before, but will probably use Colab moving forward)
    • More Python libraries/packages: Beautiful Soup + Selenium
    • Scrapy
    • Chrome and Firefox webdrivers
    • Previous experience with Python, VSCode, HTML, Git/GitHub
  • Soft Skills:

    • Active communication
    • Updating my other leads with details I find
    • Directing and guiding a new team on team structure

Achievements:

  • Set up a detailed teamwork structure with the other leads for our team to follow
  • Gained a better understanding of machine learning and content-based recommender systems
  • Learned the basics of how to scrape a website and ways data can be formatted

Goals for the upcoming week:

  • Module 2 + referring back to Module 1 resources & notes I had taken
  • Do more research on machine learning on my own for further understanding
  • Daily 5-10 minute scrum meetings to check in on team progress.

Tasks Completed:

  • Utilized STEM-Away platform and Discord to convey information to team members
  • Hosted first team meeting with other leads
  • Got familiar with the Discourse platform
  • Followed the Scrapy tutorial and scraped some data from its practice website (minimal spider sketched below)
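For reference, this is the shape of the minimal spider that tutorial builds, targeting Scrapy's practice site rather than any project data:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote block on the page becomes one scraped item
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Running `scrapy runspider quotes_spider.py -o quotes.json` writes the scraped items to a JSON file.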

Module 2 Self-Assessment: Weeks 3 & 4

A concise overview of things learned:

  • Technical Area:

    • Followed the tutorials provided in Module 2 and familiarized myself with the HTML structure of the PyTorch Community forum
    • Scraped data from 7,000+ posts in the PyTorch Community
    • Stored the scraped data in a CSV file
    • Did some basic data cleaning (condensed sketch below):
    1. Combined the textual data gathered: title, leading comment, other comments
    2. Removed stop words, punctuation, and other unnecessary data
    3. Lowercased the text
    4. Tokenization, POS tagging, WordNet lookups, and lemmatization
    5. Saved the result into a new cleaned CSV file
    • Generated word clouds
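A condensed sketch of that cleaning flow plus the word cloud step, assuming a pandas DataFrame; the file and column names here are hypothetical stand-ins for the project's actual ones:

```python
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud

for pkg in ("punkt", "stopwords", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(tag):
    # Map Penn Treebank tags from pos_tag to WordNet POS constants
    return {"J": wordnet.ADJ, "V": wordnet.VERB, "R": wordnet.ADV}.get(tag[0], wordnet.NOUN)

def clean(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())        # lowercase, drop punctuation/digits
    tokens = [t for t in nltk.word_tokenize(text) if t not in stop_words]
    tagged = nltk.pos_tag(tokens)                        # POS tags steer the lemmatizer
    return " ".join(lemmatizer.lemmatize(t, to_wordnet_pos(p)) for t, p in tagged)

df = pd.read_csv("scraped_posts.csv")                    # hypothetical file/column names
df["combined"] = df["title"] + " " + df["leading_comment"] + " " + df["other_comments"]
df["cleaned"] = df["combined"].fillna("").apply(clean)
df.to_csv("cleaned_posts.csv", index=False)

WordCloud(width=800, height=400).generate(" ".join(df["cleaned"])).to_file("wordcloud.png")
```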
  • Tools:

    • VSCode
    • Jupyter Notebooks via Anaconda
    • Beautiful Soup, Selenium, Geckodriver(Firefox), CSV files, requests, pandas, json, re, nltk, wordcloud, matplotlib
    • Python, HTML
    • Git/GitHub
    • Discord, Jira, STEM-Away Platform
  • Soft Skills:

    • Active communication on Discord
    1. Updating and communicating with the leads
    2. Replying to all questions asked from participants
    • Hosted team meetings & added a weekly games event
    1. Team updates & To Do List
    2. Github overview
    3. Set modules due dates
    4. Module 2 web scraping and EDA tutorial overview
    • Attended mentor meetings hosted by Sara and Anubhav

Achievements:

  • Understood how data can be scraped from a website and stored in a CSV file.
  • Learned how to do data cleaning and data visualization.
  • Provided detailed team structure and resource outline to team members
  • Gained a better understanding of machine learning

Goals for the upcoming week:

  • Module 3 + referring back to Module 1 and 2 resources & notes I had taken
  • Checking daily team updates channel and Jira for team progress.
  • A more in-depth look at basic recommenders and classifiers

Tasks Completed:

  • Pushed testing file to our team GitHub repo
  • Familiarized myself with the PyTorch Community
  • Properly installed geckodriver and set its PATH on Windows.
  • Scraped data from the PyTorch Community. Problem faced: the full PyTorch data set is huge (around 45,000+ posts), so scraping took very long and hit timeout issues whenever the next post did not load within 5 minutes. To solve this, I scraped by category instead: 5 runs (around 1,000-2,000 posts per run), then combined the five CSV files into one and edited it (rough sketch below).
  • Performed data cleaning and EDA on the data gathered.
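A rough sketch of that per-category workaround; the category URL and CSS selector below are placeholders, not the exact ones used:

```python
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# One category per run kept each scrape small enough to avoid the 5-minute timeouts
CATEGORY_URL = "https://discuss.pytorch.org/c/<category>"   # placeholder

driver = webdriver.Firefox()   # requires geckodriver on PATH
driver.get(CATEGORY_URL)
rows = [
    {"title": link.text, "href": link.get_attribute("href")}
    for link in driver.find_elements(By.CSS_SELECTOR, "a.title")  # placeholder selector
]
driver.quit()
pd.DataFrame(rows).to_csv("run_1.csv", index=False)

# After the 5 runs, combine the per-category CSVs into a single file
combined = pd.concat([pd.read_csv(f"run_{i}.csv") for i in range(1, 6)], ignore_index=True)
combined.to_csv("pytorch_posts.csv", index=False)
```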

Module 3 Self-Assessment: Weeks 5 & 6

A concise overview of things learned:

  • Technical Area:

    • Trained basic machine learning models on the collected data (pipeline sketch after this list) using:
    1. Naive Bayes
    2. Linear SVM
    3. Decision Tree
    4. Logistic Regression
    5. Random Forest
    6. XGBoost
    7. LightGBM
    • Trained ensemble/pipeline models:
    1. Pipeline with Doc2Vec & (Logistic Regression, Random Forest, XGBoost)
    2. Pipeline with TF-IDF & (Logistic Regression, Random Forest, XGBoost)
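A compressed sketch of that Module 3 setup, reusing the hypothetical cleaned CSV from Module 2; every model shares one TF-IDF pipeline so their scores are comparable:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

df = pd.read_csv("cleaned_posts.csv")                # hypothetical file/column names
y = LabelEncoder().fit_transform(df["category"])     # integer labels keep XGBoost happy
X_train, X_test, y_train, y_test = train_test_split(
    df["cleaned"], y, test_size=0.2, random_state=42)

models = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
    "LightGBM": LGBMClassifier(),
}

for name, clf in models.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    pipe.fit(X_train, y_train)
    print(name, pipe.score(X_test, y_test))
```

The Doc2Vec variant swaps the TF-IDF step for gensim Doc2Vec document vectors, which takes a small custom transformer and is omitted here.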
  • Tools:

    • Jupyter Notebooks via Anaconda
    • CSV files, Beautiful Soup, pandas, re, nltk, wordcloud, matplotlib, numpy, scikit-learn, xgboost, lightgbm, gensim
    • Python
    • Git/GitHub
    • Discord, Jira, STEM-Away Platform
  • Soft Skills:

    • Continued active communication on Discord
    1. Updating and communicating with the leads
    2. Replying to all questions asked from participants
    • Hosted weekly team meetings
    1. Team updates & To-Do List/Reminders
    2. Module 3 Discussion
    3. Team Presentation Overview
    4. Module 4 Overview
    • Attended mentor meetings hosted by Anubhav

Achievements:

  • Trained Naive Bayes, Linear SVM, Logistic Regression, Decision Tree, Random Forest, XGBoost, and LightGBM on the collected data and analyzed their performance.
  • Trained Doc2Vec and TF-IDF pipelines with Logistic Regression, Random Forest, and XGBoost; best performance came from TF-IDF with XGBoost.
  • Reviewed recorded final presentations by session 1 teams and gained a better understanding of the project
  • Communicated with other leads regarding project direction

Goals for the upcoming week:

  • Team Presentation to mentors
  • Module 4 + referring back to Module 1, 2, and 3 resources & notes I had taken
  • A better look at cosine similarities + BERT + development of web app

Tasks Completed:

  • Created a detailed outline of the team presentation
  • Finished up the Google Slides

Module 4 Self-Assessment: Weeks 7 & 8

A concise overview of things learned:

  • Technical Area:

    • Trained advanced machine learning models on the collected data (Simple Transformers sketch after this list) using:
    1. BERT
    2. RoBERTa
    3. DistilBERT
    4. XLNet
    • Built and Dockerized a web application
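A minimal Simple Transformers sketch of that training step (hyperparameters are illustrative); the same pattern covers the other three models by swapping the model type and name:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from simpletransformers.classification import ClassificationModel

df = pd.read_csv("cleaned_posts.csv")                  # hypothetical file/column names
df["labels"] = LabelEncoder().fit_transform(df["category"])
train_df = df.rename(columns={"cleaned": "text"})[["text", "labels"]]

# Swap ("bert", "bert-base-uncased") for ("roberta", "roberta-base"),
# ("distilbert", "distilbert-base-uncased"), or ("xlnet", "xlnet-base-cased")
model = ClassificationModel(
    "bert", "bert-base-uncased",
    num_labels=train_df["labels"].nunique(),
    use_cuda=False,                                    # set True on a GPU runtime like Colab
    args={"num_train_epochs": 1, "overwrite_output_dir": True},
)
model.train_model(train_df)                            # saves checkpoints under outputs/
predictions, raw_outputs = model.predict(["how do I install torchvision"])
```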
  • Tools:

    • Jupyter Notebooks via Anaconda + Google Colab
    • VSCode
    • Python + HTML + CSS + Dockerfile
    • Simple Transformers, Tokenizers, BeautifulSoup4, lxml, PyTorch, tarfile, scikit-learn, pandas, re, etc.
    • Flask
    • Docker
    • Git/GitHub
    • Discord, STEM-Away Platform
  • Soft Skills:

    • Continued active communication on Discord
    1. Updating and communicating with the leads
    2. Replying to all questions asked from participants
    • Hosted weekly team meetings
    1. Team updates & To-Do List/Reminders
    2. Module 4 Discussion + Project Direction
    3. Final Team Presentation Overview + Splitting up sections for each member
    4. Gave a step-by-step tutorial on building + dockerizing web applications for team members on Windows.
    5. Exchanged LinkedIn profiles with team members to stay in touch

Achievements/Tasks Completed:

  • Gave first team presentation to mentors and improved based on feedback.

  • Confirmed project direction with team members.

  • Retrained all basic models on the new data set (45,000+ posts).

  • Trained four different advanced models (BERT, RoBERTa, DistilBERT, XLNet) and picked BERT, which had the highest accuracy, as the web app foundation.

  • The Web App Process:

    1. Transferred over the trained BERT model & code
    2. Added in HTML + CSS + a Dockerfile
    3. Wrapped everything in a Flask API (skeleton sketched below)
    4. Dockerized the web app
    5. Ran it locally
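A skeletal version of that Flask wrapper, loading the fine-tuned BERT from its Simple Transformers output directory (route, template, and form names are illustrative):

```python
from flask import Flask, render_template, request
from simpletransformers.classification import ClassificationModel

app = Flask(__name__)
model = ClassificationModel("bert", "outputs/", use_cuda=False)  # path illustrative

@app.route("/", methods=["GET", "POST"])
def index():
    suggestion = None
    if request.method == "POST":
        # Predict a category id for the submitted post text
        predictions, _ = model.predict([request.form["post_text"]])
        suggestion = predictions[0]
    return render_template("index.html", suggestion=suggestion)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)   # 0.0.0.0 so the Docker container can expose it
```

With a Dockerfile that installs the requirements and runs this script, `docker build -t category-app .` followed by `docker run -p 5000:5000 category-app` serves the app locally.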
  • Completed Web App Purpose:

    1. Recommends categories to users for a specific post.
    2. Helps reduce the disproportion between categories in their respective numbers of posts.
    3. Helps users find more relevant posts in their category.
  • Coordinated with team members for final presentation

  • Delivered the final team presentation + web app demonstration

  • Completed all modules and gained a deeper understanding of machine learning

  • Will push all work to the team GitHub repo