Machine Learning - Level 1 Module 1 - Nash

Concise Overview of Things Learned:
Technical Area:
• Set up the Google Colab environment

• Learned the basics of machine learning

• Learned the definition, classification, and pros & cons of different recommender systems

• Learned web scraping

• Learned the basics of NLP

• Learned how to train a Logistic Regression model & the intuition behind it

Tools:
Beautiful Soup, Selenium WebDriver, Scrapy, Git, GitHub, Trello

Softskills:
Learned about project management & improved my Google searching skills to cope with bugs.

Achievement highlights:

  • Created site maps to plan the path to take in order to scrape the desired data
  • Learned to use the Beautiful Soup library &, after inspecting the HTML pages of the DiscourseHub community forums, scraped data from their different tags.
  • Learned to use Selenium WebDriver & incorporated it with the Beautiful Soup library.
  • Trained a simple Logistic Regression model on social network ads data. It predicts whether people with a certain age & income will buy a car or not.
  • Familiarised myself with Git commands: clone, commit, merge, push, etc.

Detailed Statement of Tasks Completed:
Recommendation Systems:

  • Content-Based Filtering: recommends similar items based on previous actions or feedback (likes, ratings, etc.)

Measures of similarity: cosine similarity, dot product, Euclidean distance.
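
For reference, a minimal sketch of these three measures with NumPy (the vectors are toy values):

import numpy as np

a = np.array([1.0, 2.0, 3.0])   # feature vector of item A (toy values)
b = np.array([2.0, 4.0, 5.0])   # feature vector of item B (toy values)

dot = np.dot(a, b)                                        # dot-product similarity
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))    # cosine similarity
euclidean = np.linalg.norm(a - b)                         # Euclidean distance
print(dot, cosine, euclidean)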

These problems can be solved in 2 ways:
1) Classification: predict 'like' or 'dislike'; here we use metrics like accuracy, precision or recall

2) Regression: predict the rating given by the user; here we use the MSE (mean squared error) metric. To judge the relevancy of a new recommendation model we need to test it in real conditions.

• Collaborative Filtering: 2 types
1) Model-based: a model is defined based on user-item interactions, where user and item representations are to be learned from the interaction matrix

2) Memory-based: no model is defined; it depends on similarities between users or items in terms of observed interactions

• Learned the recommendation algorithm & the step-by-step tasks behind a content-based recommendation system.

• K-Nearest Neighbours: an algorithm to find the K nearest neighbours of an input point in n-dimensional space, based on a distance metric.
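
A small sketch of a KNN lookup with scikit-learn (random toy data, just to show the API):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(100, 5)      # 100 points in 5-dimensional space
query = np.random.rand(1, 5)    # the input point

nn = NearestNeighbors(n_neighbors=3, metric='euclidean')  # K=3, Euclidean distance
nn.fit(X)
distances, indices = nn.kneighbors(query)   # the 3 nearest neighbours of the input
print(indices, distances)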

• Basics of NLP:
Because of the difficulty of representing language as ML input, vanilla neural networks came first. But if we only supply linguistic features, we don't get the deeper context surrounding individual words or tokens, because these models don't take sequential information into account. So RNNs (Recurrent Neural Networks) came. But an RNN reads words in only one direction and struggles to retain long-range context, so LSTMs (Long Short-Term Memory networks) came.

Learned about the Attention model & its uses. Learned about BERT & its training process:
1) masking 2) next sentence prediction (not useful for sentiment analysis)
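
As a quick illustration of the masking objective, Hugging Face's fill-mask pipeline (a standard pretrained checkpoint, not part of the original work) lets you watch BERT fill in a masked token:

from transformers import pipeline

# BERT is pretrained with masked-language modelling: it learns to fill in [MASK]
unmasker = pipeline('fill-mask', model='bert-base-uncased')
print(unmasker('The movie was absolutely [MASK].'))   # top predictions for the masked token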

Logistic Regression:
Steps (a minimal sketch of these steps appears after this list):
1. Importing the libraries (numpy, pandas, matplotlib)
2. Importing the dataset (a CSV file)
3. Splitting the dataset into the training set and the test set
4. Feature scaling
5. Training the Logistic Regression model on the training set
6. Predicting a new result
7. Predicting the test set results
8. Making the confusion matrix
9. Visualising the training set results
10. Visualising the test set results
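
A minimal sketch of these steps (the file and column names are assumptions based on the common Social_Network_Ads tutorial dataset; the visualisation steps are omitted for brevity):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

dataset = pd.read_csv('Social_Network_Ads.csv')          # assumed file name
X = dataset[['Age', 'EstimatedSalary']].values           # assumed feature columns
y = dataset['Purchased'].values                          # assumed target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sc = StandardScaler()                    # feature scaling
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)         # training on the training set

print(classifier.predict(sc.transform([[30, 87000]])))   # predicting a new result
y_pred = classifier.predict(X_test)                      # predicting the test set results
print(confusion_matrix(y_test, y_pred), accuracy_score(y_test, y_pred))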

Hello @Sourav_Naskar,

First things first, you did a good job with the self-assessment, but you could still improve it a lot.
Please be mindful of the names of libraries or tools, especially in a resume, as they could cost you:
=> It’s Beautiful Soup, not Soap

Second, I recommend you check out how to use Markdown to style your writing, whether in forum posts like this one or in Jupyter notebooks.

Third, in the detailed statement section, I would recommend you focus on what you actually did with the knowledge when providing the details, and only state briefly (no more than a sentence) what you learned theoretically.

Thank you.

Hello @Sara_EL-ATEIF ma'am,
I apologize for the spelling mistake of Beautiful Soup in the self-assessment. Next time I will take care of all the things you have mentioned.
Thank you.

Hello @Sourav_Naskar,

No need to apologize, I am simply giving you advice for future reference; otherwise, we are humans, so mistakes do happen and it’s normal.

Best,
Sara

Machine Learning - Level 1 Module 2

Concise Overview of Things Learned:
Technical Area:

  • I chose the Codecademy forum to scrape data from.

  • I scraped the data from this forum using the Beautiful Soup & Selenium libraries and stored it in a CSV file.

  • I used different data cleaning and EDA techniques to explore the scraped data.

Tools:
Beautiful Soup, Selenium WebDriver, NumPy, Pandas, Matplotlib, Scikit-learn, WordCloud, NLTK, GitHub, spaCy, TextBlob

Softskills:
I improved my Google searching skills to cope with bugs.

Achievement Highlights:

  • Successfully scraped the data from the Codecademy forum using the Beautiful Soup & Selenium libraries

  • Successfully performed exploratory data analysis on the scraped data.

  • Successfully pushed the files to the GitHub repository.

Detailed Statements of Task:

  • At first, when I used only Beautiful Soup, I was unable to scrape the data, because Beautiful Soup's find_all method only sees the static HTML and cannot scrape JavaScript-rendered web pages. So I used Selenium and Beautiful Soup together, and it worked well (see the scraping sketch after this list).

  • I was unable to scrape the comments on a page, so I used Selenium to scroll the page and load them.

  • I lowercased words, removed digits and words containing digits, cleared punctuation, removed extra spaces, removed common & rare words, and lemmatized words.

  • I created a document-term matrix using Scikit-learn's CountVectorizer to find the top words of every category (see the EDA sketch after this list).

  • I generated a word cloud of the top words of each category.

  • I used TextBlob to check the polarity of each category.
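
A rough sketch of the Selenium + Beautiful Soup combination with the scrolling workaround (the URL, tag and class names are placeholders, not the exact ones used):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()                        # assumes chromedriver is on PATH
driver.get('https://discuss.codecademy.com/')      # placeholder forum URL

# Scroll to the bottom a few times so the JavaScript loads more posts/comments
for _ in range(5):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)                                  # give the page time to load

soup = BeautifulSoup(driver.page_source, 'html.parser')   # parse the fully rendered HTML
titles = [a.get_text(strip=True) for a in soup.find_all('a', class_='title')]  # placeholder tag/class
driver.quit()
print(titles[:5])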
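
And a small sketch of the document-term-matrix and polarity parts of the EDA (the toy DataFrame and its columns are assumptions about the cleaned data):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob

# assumed layout of the cleaned data: one row per post, with 'text' and 'category' columns
df = pd.DataFrame({'text': ['python loops help please', 'css grid layout question'],
                   'category': ['python', 'web-dev']})

cv = CountVectorizer(stop_words='english')
dtm = pd.DataFrame(cv.fit_transform(df['text']).toarray(),
                   columns=cv.get_feature_names_out(),
                   index=df['category'])                    # document-term matrix

top_words = dtm.groupby(level=0).sum().apply(
    lambda row: row.nlargest(5).index.tolist(), axis=1)     # top words per category
print(top_words)

df['polarity'] = df['text'].apply(lambda t: TextBlob(t).sentiment.polarity)  # TextBlob polarity
print(df[['category', 'polarity']])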

Problems faced:

  • I used “C:\Program Files (x86)\chromedriver.exe” in a Jupyter notebook and it worked well, but when I tried to use it in Google Colaboratory I got the error “Chromedriver not in path” (a possible workaround is sketched below).
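
A commonly used workaround (untested here, but widely shared for Colab) is to install chromium-chromedriver inside the session and run Chrome headless:

# In a Colab cell, install the driver first (shell commands):
# !pip install selenium
# !apt-get update && apt-get install -y chromium-chromedriver

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')             # Colab has no display
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=options)     # older setups may need the driver path passed explicitly
driver.get('https://discuss.codecademy.com/')  # placeholder URL
print(driver.title)
driver.quit()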

Looks good @Sourav_Naskar. I wanted to check your code, specifically how you applied Selenium for web scraping, but I couldn't find it in the STEM_AWAY GitHub team. Where did you push your code? Also, I had a similar problem.

What you do is as follows:

sub_topic = pd.Series(sub_cat)   # is a smaller array than the main_topics array
main_topics = pd.Series(main_topics)

df_main = pd.DataFrame({'main_topics': main_topics, 'sub_cat': sub_topic.reindex(main_topics.index)})
# you .reindex your smaller array to the larger array, if that makes sense


I meant

sub_topic = pd.Series(sub_cat)

@YasaminAbbaszadegan and @Sourav_Naskar, please check out the scraping tutorial I just added in the md2 branch of our GitHub repo; it may help you solve the issue you're facing.


Hi @Sara_EL-ATEIF, I don't see that branch?

Here is the direct link: https://github.com/mentorchains/level1_post_recommender_20/tree/md2/webScraping_EDA_tutorials


Thanks @YasaminAbbaszadegan. I will try it.


Machine Learning - Level 1 Module 3

Concise Overview of Things Learned:
Technical Area:

  • I learned about some basic machine learning models: Naive Bayes, Linear SVM, Logistic Regression, Decision Tree, Random Forest, XGBoost, LightGBM

  • I learned cross-validation with the Linear SVM, Random Forest, XGBoost & LightGBM models

  • I learned how to use doc2vec and tf-idf embeddings with 3 machine learning models (Random Forest, XGBoost & Logistic Regression) [found that these 3 models give better accuracy]

  • I learned how to find the cosine similarity between columns & recommend similar posts.

Tools:
NLTK, pandas, numpy, sklearn, gensim

Softskills:
I improved my Google searching skills to cope with bugs.

Achievement Highlights:

  • I experimented on the dataset with basic machine learning models (Naive Bayes, Linear SVM, Logistic Regression, Decision Tree, Random Forest, XGBoost, LightGBM) to see the classification results.

  • I used tf-idf & doc2vec embeddings with 3 machine learning models (Random Forest, XGBoost & Logistic Regression) to see the classification results.

  • I combined the important columns into a single column to find the cosine similarity between posts and recommend the 10 most similar posts given the title of a post.

Detailed Statements of Task:

  • Firstly, I combined the topic title, tags, leading comment & other comments into a single column, to classify posts against the Category column.

  • I planned to preprocess the data using 3 strategies and compare the classification results (a sketch of the cleaning & benchmarking appears at the end of this section).

  1. Lowercase all the words; replace the characters [/(){}[]|@,;] with a space; remove any character matching [^0-9a-z #+_]
  2. Remove stop words (+ what was done in strategy 1)
  3. Remove extra spaces & remove digits and words containing digits (+ what was done in strategy 2)
  • In every strategy I trained Naive Bayes, Linear Support Vector Machine, Logistic Regression & Decision Tree models, benchmarked them with metrics like accuracy, precision, recall and F1-score, & cross-validated with Linear SVM. Here I found Logistic Regression to be the most accurate model.

  • After removing stopwords, accuracy increased by 0.05%.

  • In strategy 3 I cross-validated using Linear SVM, Random Forest, XGBoost & LightGBM. XGBoost was the most accurate model, with an accuracy of 83.85%, while with LightGBM accuracy decreased.

  • Then I found the stopwords & upper-case words in every post.

  • Then I experimented with the leading comment vs category, and with (leading comment + other comments) vs category, but these gave lower accuracy than the first combination.

  • Then I used Doc2Vec embeddings with the 3 machine learning models (Logistic Regression, XGBoost, Random Forest) [as these 3 models were giving better accuracy], but these gave lower accuracy than the machine learning models alone.

  • Then I used tf-idf embeddings with the same 3 machine learning models (Logistic Regression, XGBoost, Random Forest), and these gave about the same accuracy as the machine learning models alone.

  • For a simple recommender system I combined all the important columns into a single column & calculated the cosine similarity against the titles of posts, to recommend similar posts by title (see the recommender sketch below).
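
A condensed sketch of the cleaning strategies and the cross-validated benchmark, assuming a scraped_posts.csv with 'text' and 'category' columns (the regexes follow the usual text-classification recipe; everything else is illustrative):

import re
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')   # strategy 1: replace with space
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')           # strategy 1: drop everything else
STOPWORDS = set(stopwords.words('english'))             # needs nltk.download('stopwords')

def clean_text(text):
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    words = [w for w in text.split() if w not in STOPWORDS]         # strategy 2
    words = [w for w in words if not any(c.isdigit() for c in w)]   # strategy 3
    return ' '.join(words)

df = pd.read_csv('scraped_posts.csv')                   # assumed 'text' and 'category' columns
X = TfidfVectorizer().fit_transform(df['text'].astype(str).apply(clean_text))
y = df['category']

models = {'Naive Bayes': MultinomialNB(),
          'Linear SVM': LinearSVC(),
          'Logistic Regression': LogisticRegression(max_iter=1000)}
for name, clf in models.items():
    print(name, cross_val_score(clf, X, y, cv=5).mean())   # cross-validated accuracy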
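
And a rough sketch of the simple recommender described in the last bullet, assuming the same CSV with title/tags/comment columns (all names illustrative):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv('scraped_posts.csv').fillna('')   # assumed columns: title, tags, leading_comment, other_comments
df['combined'] = df['title'] + ' ' + df['tags'] + ' ' + df['leading_comment'] + ' ' + df['other_comments']

sim = cosine_similarity(TfidfVectorizer(stop_words='english').fit_transform(df['combined']))
indices = pd.Series(df.index, index=df['title'])   # look up a post's row by its title

def recommend(title, n=10):
    idx = indices[title]
    scores = sorted(enumerate(sim[idx]), key=lambda x: x[1], reverse=True)[1:n + 1]  # skip the post itself
    return df['title'].iloc[[i for i, _ in scores]]

print(recommend('Some post title'))   # the 10 most similar posts to the given title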


Machine Learning - Level 1 Module 4
Concise Overview of Things Learned:
Technical Area:

  • I learned how to train BERT, XLNet, RoBERTa & DistilBERT models using the Simple Transformers library.

  • I learned how to combine an advanced model like BERT with a simple ML model like Logistic Regression.

  • I learned how to deploy a machine learning model locally using a Flask API & how to build and dockerize the ML app.

Tools:
simpletransformers, tokenizers==0.9.4, sklearn, tarfile, HTML, CSS, Flask, Docker

Achievement Highlights:

  • I successfully trained BERT, RoBERTa, XLNet & DistilBERT models using the Simple Transformers library.

Detailed Statements of Task:

  • I defined hyperparameters; trained BERT, RoBERTa, DistilBERT & XLNet models; saved the models as tar files; loaded the models; & predicted the category given a post title & post body as input. Then I compared the accuracy, evaluation loss, F1-score & MCC of the 4 advanced models (a Simple Transformers sketch is given below).

  • For deploying the Flask web app, I first created home.html, result.html & a CSS file. Then I created model.py & app.py. In app.py I used the request method to predict the post category, incorporating home.html & result.html. When I tried to run the Flask web app from the command prompt, it gave errors (a minimal app.py sketch is given below).

  • For combining BERT with Logistic Regression, I preprocessed & cleaned the dataset, then loaded the pretrained BERT model, tokenized, padded & applied masking. After that, the model() function runs our sentences through BERT and returns the results in last_hidden_states. But when I tried to use the model() function, I got the warning “Your session crashed after using all available RAM” (a batched sketch that avoids this is given below).
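
A minimal sketch of the Simple Transformers workflow described above (toy data; hyperparameters are illustrative, not the ones actually used):

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Simple Transformers expects a DataFrame with 'text' and 'labels' columns
train_df = pd.DataFrame({'text': ['how do python loops work', 'css grid is not aligning'],
                         'labels': [0, 1]})
eval_df = train_df.copy()                      # toy example; use a held-out split in practice

model_args = {'num_train_epochs': 1,           # illustrative hyperparameters
              'overwrite_output_dir': True}
model = ClassificationModel('bert', 'bert-base-uncased', num_labels=2,
                            args=model_args, use_cuda=False)

model.train_model(train_df)
result, model_outputs, wrong_preds = model.eval_model(eval_df)    # reports mcc, eval_loss, etc.
predictions, raw_outputs = model.predict(['post title plus post body'])
print(result, predictions)

Swapping 'bert'/'bert-base-uncased' for 'roberta'/'roberta-base', 'xlnet'/'xlnet-base-cased' or 'distilbert'/'distilbert-base-uncased' gives the other three models.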
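
A minimal app.py sketch of the Flask setup described above (the pickled-model detail and form field name are assumptions):

import pickle
from flask import Flask, render_template, request

app = Flask(__name__)
model = pickle.load(open('model.pkl', 'rb'))   # assumed: model.py pickles a trained classifier

@app.route('/')
def home():
    return render_template('home.html')        # form where the user enters a post

@app.route('/predict', methods=['POST'])
def predict():
    text = request.form['post']                # assumed form field name in home.html
    category = model.predict([text])[0]
    return render_template('result.html', prediction=category)

if __name__ == '__main__':
    app.run(debug=True)                        # run locally with: python app.py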
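
And a sketch of the BERT + Logistic Regression combination, following the common DistilBERT feature-extraction recipe; running the sentences in small chunks under torch.no_grad() is one way to avoid the out-of-RAM crash (batch size and data are illustrative):

import numpy as np
import torch
from transformers import DistilBertModel, DistilBertTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
bert = DistilBertModel.from_pretrained('distilbert-base-uncased')

sentences = ['great forum post', 'this is spam'] * 4    # toy data
labels = [1, 0] * 4
batch_size = 2                                          # small batches keep memory usage low

features = []
with torch.no_grad():                                   # no gradient buffers => far less RAM
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], padding=True,
                          truncation=True, return_tensors='pt')
        last_hidden_states = bert(**batch).last_hidden_state
        features.append(last_hidden_states[:, 0, :].numpy())   # [CLS] embedding per sentence
features = np.vstack(features)

clf = LogisticRegression(max_iter=1000).fit(features, labels)   # simple model on BERT features
print(clf.score(features, labels))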