Level 1: Module 3 - Basic recommender & Simple classifiers

Overview

Build the recommender component and train basic machine learning models that classify forum posts into their categories.

Description

Now that you have scraped, visualized, and cleaned the data you need, it is time to train your simple machine learning models and build a basic content-based recommender system.

Tasks

To do this, we will need to:

  • Make sure you pick only the important columns. Example:
    Topic Title, Category, Tags, Leading Post, Post Replies, Created_at, Replies
  • Clean your data well.
  • Transform your textual data into meaningful word vectors or word embeddings (check out the Module 1 NLP webinars to learn more about this). Examples: bag of words or TF-IDF.
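
As a rough illustration of the TF-IDF idea mentioned above, here is a minimal standard-library sketch. The function names (`tokenize`, `tfidf_vectors`) and the sample titles are hypothetical, and the smoothed IDF formula is one common choice (it matches the default smoothing documented for scikit-learn's `TfidfVectorizer`, without the L2 normalization step). In practice you would most likely use a library vectorizer instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document.

    Weight = raw term count * smoothed IDF, where
    IDF = log((1 + N) / (1 + df)) + 1.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


# Hypothetical post titles, for illustration only.
titles = [
    "How to install Python on Windows",
    "Python install fails on Windows 10",
    "Best pasta recipes for beginners",
]
vectors = tfidf_vectors(titles)
```

Terms that appear in many posts (here "python") receive a lower weight than terms that appear in only one post (here "how"), which is the property that makes TF-IDF more informative than raw counts.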

For the basic recommender system:

  • Calculate a distance metric. Example: Cosine similarity.
  • Recommend posts based on the title of a previously liked post (aim for a top-10 list of recommendations).
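
The two recommender steps above could be sketched as follows. This is only an assumption-laden illustration: it uses plain bag-of-words counts rather than TF-IDF to stay short, the helper names (`bag_of_words`, `cosine_similarity`, `recommend`) and the sample titles are made up, and a real solution would run over your scraped titles with `top_n=10`.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Turn a title into a sparse term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(liked_title, titles, top_n=10):
    """Return the top_n titles most similar to liked_title."""
    liked = bag_of_words(liked_title)
    scored = [
        (cosine_similarity(liked, bag_of_words(title)), title)
        for title in titles
        if title != liked_title  # never recommend the liked post itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_n]]


# Hypothetical post titles, for illustration only.
titles = [
    "How to install Python on Windows",
    "Python install fails on Windows 10",
    "Best pasta recipes for beginners",
    "Installing Python packages with pip",
]
recommendations = recommend("How to install Python on Windows", titles, top_n=2)
```

With these sample titles, the Windows installation post ranks first because it shares four of the six words in the liked title, while the pasta post shares none and scores zero.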

For the simple classification model:

  • Identify 5 simple machine learning classification models and train them on your data
  • Benchmark these models by calculating metrics like:
    Accuracy, Precision, Recall, and F1-Score
  • Pick the best performing one and perform hyperparameter tuning for it OR change the way you generate your word vectors or embeddings, etc.
  • Test your model by feeding it your own input data, then evaluate whether its output meets your expectations.
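
To make the benchmarking metrics concrete, here is a small sketch that computes accuracy together with macro-averaged precision, recall, and F1 by hand. The function name `macro_scores` and the toy labels are hypothetical, and macro averaging is just one of several averaging choices. Library implementations exist for all of these metrics and are what you would normally use.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Define a score as 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy category labels, for illustration only.
y_true = ["python", "python", "python", "web", "web", "data"]
y_pred = ["python", "python", "web", "web", "web", "python"]
scores = macro_scores(y_true, y_pred)
```

In this toy example the accuracy is 4/6, but the macro F1 is noticeably lower because the single "data" post is never predicted correctly. This is why it helps to look beyond accuracy when categories are imbalanced.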

Tips

  • Be as proactive as you can: investigate why your model or similarity results are performing well or badly.
  • Investigate things like class imbalance, dropping some columns, or adding other columns, and check whether ensemble methods improve your accuracy.
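
One quick way to look into class imbalance is to count the posts per category and, if needed, derive per-class weights. The sketch below is only an illustration: the function names and the label list are hypothetical, and the weight formula `n_samples / (n_classes * count)` is one common "balanced" heuristic, not the only option.

```python
from collections import Counter


def class_distribution(labels):
    """Fraction of samples that belong to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical category labels, for illustration only.
labels = ["python"] * 6 + ["web"] * 3 + ["data"] * 1
distribution = class_distribution(labels)
weights = balanced_class_weights(labels)
```

Here "python" makes up 60% of the data, so it gets the smallest weight, while the rare "data" class gets the largest. Many classifiers accept weights like these so that mistakes on rare categories count for more during training.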

Resources

@Vishnupriya_Kanuri Module3 is here

Oh thank you!! :slight_smile:

@ML-Pathway,

Normally, the meeting that wraps up Module 3 and introduces Module 4 would happen this Saturday at 7 pm. However, because many of you still haven’t posted your Module 2 assessment or pushed your code to GitHub, and because I know that web scraping can sometimes take longer than expected, I will be pushing our meeting to the 16th of January.

Remember, we (your colleagues and I) are here to help and learn from each other.

Good luck and keep up the good work.

@Sara_EL-ATEIF, I think that’s better. I myself came across a problem where I realized my web scraper duplicated a lot of the information, so I made another one, which gave me 4K entries; however, the data is very imbalanced. I’m now playing around with cleaning it and displaying it using a word cloud.

Great job @YasaminAbbaszadegan.
I am here for you; please tell me if you need either more time or help.

@Sara_EL-ATEIF, I was finally able to extract a more balanced set of entries for each category, with 5k entries overall. I’ll be starting Module 3 today. (Took forever :rofl: :sweat_smile: )

Hello @Sara_EL-ATEIF ma’am,
I have written my self-assessment for Level 1 Module 3. Here is the link:
https://stemaway.com/t/machine-learning-level-1-module-1-nash/7002/12

Great job @YasaminAbbaszadegan

Thank you @Sourav_Naskar I will definitely check it out.

@YasaminAbbaszadegan good job, keep up the good work :partying_face:.

@ML-Pathway @ddas I am thinking of organizing a small showcase session where interested participants can present the work they did and receive feedback and helpful comments from the viewers. This will be a great help, especially to the people who are taking a lead role, as they will be giving presentations during the internship.
Let me know if you’re interested so we can start working on it.

Keep up the good work and don’t give up if you get stuck; persistence is the real key to success :wink:.

Hi @Sara_EL-ATEIF, that is an excellent idea! Please go ahead and organize it. I will try my best to attend.

Encourage anyone with some progress to present.

Hello @Sara_EL-ATEIF ma’am,
I would love to do this.
What do I actually need to do? Go through and explain my code?

@Sara_EL-ATEIF I have made a few classification models, but they are overfitting a lot. I have around 300 samples for each category and couldn’t extract more. Any suggestions?

@Sourav_Naskar I’ve tried your code for web scraping and it kinda freezes midway. Do you know how to resolve this issue?

@YasaminAbbaszadegan Are you using the Chrome browser?
What error are you getting?

It’s not an error; it just gets stuck and doesn’t get through the loop (it stalls at iteration 190). Yes, I am using the Chrome browser.

Hello @YasaminAbbaszadegan, to avoid overfitting we usually:

  • Use regularization methods (try L1 regularization),
  • Train for fewer iterations (keep track of your validation and training loss and stop early once the validation loss stops improving),
  • Reduce the number of features (use dimensionality reduction techniques like PCA).
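
For anyone following along, the early stopping suggestion above can be sketched in a few lines. This is only an illustration under assumptions: the function name `early_stopping_epoch`, the `patience` and `min_delta` parameters, and the loss values are made up, and the validation losses would normally come from your own training loop.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve by
    more than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best validation loss: remember where it happened.
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return best_epoch, epoch
    # Never triggered: training ran through every recorded epoch.
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: they improve, then start rising again.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=2)
```

With these values the validation loss bottoms out at epoch 2, so training would stop at epoch 4 and you would keep the model from epoch 2.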

Here is a good tutorial:

I’ll try. But I think I need a one-on-one meeting with you this weekend, @Sara_EL-ATEIF, to talk about my web scraper.

Hello @YasaminAbbaszadegan,
Here is the data you requested: https://github.com/mentorchains/level1_post_recommender_20/tree/md3/resources

Hi @Sara_EL-ATEIF, thank you for the link. I was able to extract 7 categories out of 15, with 500 samples for each. Then the web scraper disconnected again (I wasn’t able to get past 7 categories). So I will start working on the next modules with what I was able to scrape.
