Machine Learning - Level One Module Three - Shreya

Concise Overview Of Things Learned

Technical

  1. Transformed text data into word-count vectors (bag of words)
  2. Calculated a similarity measure between posts (cosine similarity)
  3. Recommended the five most similar posts based on the title of a previously liked post
  4. Identified simple machine learning classification models and trained them on my data
  5. Benchmarked these models by calculating different metrics
  6. Picked the best-performing model and tested it by feeding it input data and evaluating its output
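
As a rough illustration of steps 1 to 3, here is a minimal sketch that uses only the Python standard library. It is not the code from the project: the function names (bag_of_words, cosine_similarity, recommend) and the sample titles are made up for illustration, and splitting titles on whitespace is an assumption about the tokenization.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated token appears."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty title is treated as similar to nothing
    return dot / (norm_a * norm_b)


def recommend(titles, liked_index, top_n=5):
    """Return the top_n titles most similar to titles[liked_index]."""
    vectors = [bag_of_words(t) for t in titles]
    scores = [cosine_similarity(vectors[liked_index], v) for v in vectors]
    # Sort post indexes by similarity, highest first.
    ranked = sorted(range(len(titles)), key=lambda i: scores[i], reverse=True)
    # Drop the liked post itself by index rather than by position.
    ranked = [i for i in ranked if i != liked_index]
    return [titles[i] for i in ranked[:top_n]]


# Hypothetical sample titles, for illustration only.
sample_titles = [
    "how to train a puppy",
    "how to train a kitten",
    "best puppy food brands",
    "train travel tips for europe",
    "puppy training mistakes to avoid",
    "how to bake sourdough bread",
    "kitten food guide",
]
print(recommend(sample_titles, liked_index=0, top_n=3))
```

The sketch removes the liked post by its index instead of skipping the first sorted position, so an exact duplicate title with a score of 1 cannot push the liked post out of first place and end up recommended back to the user.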

Tools

  • Google Colaboratory
  • pandas, matplotlib, seaborn, xgboost, etc.
  • Scrapy
  • VS Code

Soft Skills

  • Knowing when to ask for help and being able to learn from others
  • Effective problem solving and persistence
  • Attention to detail

Achievement Highlights

  • Built a basic recommender system that returns the five most similar posts
  • Trained multiple simple classification models by following our mentor’s guide
  • Tested the best performing model and plotted the results

Detailed Statement of Tasks Completed

  • Became more familiar with Google Colaboratory.
  • Converted the cleaned data into a bag-of-words representation, then calculated the cosine similarity matrix.
  • Used the index of the input post to look up its cosine similarity scores and sorted them in descending order, so that the highest scores (up to 1) came first, with position 0 being the input post itself. Then returned the titles of the five most similar posts (positions 1:6 of the sorted list).
  • One issue was that the recommender would recommend other pages of the same post (e.g., page 3 of a large, multipage post), so I went back and cleaned my data further by identifying and removing the extra pages of each post.
  • Learned about and then trained multiple machine learning models (e.g., logistic regression, naive Bayes) on my data.
  • Evaluated the models by calculating metrics such as accuracy, precision, recall, and F1-score.
  • Chose the best model, then tested it and evaluated a plot of the actual vs. predicted results.
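
To make the evaluation step concrete, here is a minimal sketch of how accuracy, precision, recall, and F1-score can be computed and used to pick a model. It assumes binary labels (the report does not say how many classes there were), uses only plain Python rather than the libraries from the project, and the names binary_metrics, pick_best, model_a, and model_b as well as the toy labels are hypothetical.

```python
def binary_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def pick_best(results, metric="f1"):
    """Return the name of the model with the highest value of `metric`."""
    return max(results, key=lambda name: results[name][metric])


# Toy labels and predictions, invented purely to exercise the functions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 1, 1, 0],
    "model_b": [1, 1, 1, 1, 1, 1, 1, 0],
}
results = {name: binary_metrics(actual, pred)
           for name, pred in predictions.items()}
for name, scores in results.items():
    print(name, scores)
print("best by F1:", pick_best(results))
```

Comparing on F1 rather than accuracy alone is one possible choice: a model that predicts the positive class almost everywhere can reach perfect recall while its precision drops, and F1 penalizes that trade-off.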

To Be Continued

  • Generate word vectors differently to see if that improves accuracy
  • Investigate possible class imbalance
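
A first check for class imbalance could look like the sketch below, which counts how often each label occurs. It is only a starting point under stated assumptions: the label names ("news", "question") and the helper names class_distribution and imbalance_ratio are hypothetical, since the actual classes in my data are not listed in this report.

```python
from collections import Counter


def class_distribution(labels):
    """Return the share of examples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest class size."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values())


# Made-up labels: 8 examples of one class and 2 of another.
toy_labels = ["news"] * 8 + ["question"] * 2
print(class_distribution(toy_labels))
print(imbalance_ratio(toy_labels))
```

If one class dominates, a model can score high accuracy just by predicting that class, which is why precision, recall, and F1-score are worth reading alongside accuracy.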