Soumya Goswami - Machine Learning Pathway

Things Learned
I have learned how to scrape large sets of data from public Discourse forums, extracting the text of posts and their associated metadata, how to clean that data, and how to transform it into vectors in a meaningful way using BERT. Through this process, I learned how to use Selenium WebDriver and the BeautifulSoup library for scraping web HTML. I gained knowledge of the Transformer architecture and used the sentence_transformers Python library to create BERT-based sentence embeddings. I also familiarized myself with Google Colab, GitHub, and VS Code. On the soft-skills side, I improved my communication and learned how important it is to be flexible and adapt to the project's requirements.

Three achievements:

  1. I wrote a Python script that takes control of a browser instance, scrolls all the way to the bottom of a page, extracts every topic link in the history of a forum, and saves the dataset to a CSV file.
  2. I wrote a Python function that preprocesses the dataset using NLTK components such as its stop word list, WordNetLemmatizer, and PorterStemmer.
  3. I implemented TF-IDF (term frequency-inverse document frequency), which leverages the counts of words and their relative rarity across all documents to determine a similarity score. I also used the sentence_transformers library with a pretrained BERT model to obtain different types of sentence embeddings, and used scipy to find the most similar embeddings for queries (a sketch of both scoring approaches follows this list).
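
A minimal sketch of that scoring step on a toy corpus; the checkpoint name is a common pretrained sentence-BERT model and may differ from the one I actually used:

```python
# Sketch: score query-to-document similarity with TF-IDF and BERT embeddings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cdist

docs = ["how to install packages", "theme not loading on startup"]  # placeholder corpus
query = "package installation fails"

# TF-IDF: word counts weighted by inverse document frequency.
vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)
query_vec = vectorizer.transform([query])
tfidf_scores = cosine_similarity(query_vec, doc_vecs)[0]

# BERT: dense sentence embeddings from a pretrained model (name is illustrative).
model = SentenceTransformer("bert-base-nli-mean-tokens")
doc_emb = model.encode(docs)
query_emb = model.encode([query])
bert_scores = 1 - cdist(query_emb, doc_emb, metric="cosine")[0]

# The highest score marks the most similar document under each representation.
print(docs[tfidf_scores.argmax()], docs[bert_scores.argmax()])
```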

Meetings/Training
I attended every team meeting and also the GitHub training session given by the industry leaders.

Goals for the upcoming week
I will try other models on the collected data and explore other similarity metrics (so far I am scoring based on CosineSimilarityLoss).

Detailed statement of tasks done

  1. Implemented web scraping of data from the Atom discussion forum.
    Difficulties faced: I was initially confused about using Scrapy to extract the forum contents, and I did not know how to scroll a page and extract all the available information.
    Solved: With the help of tutorials shared by the project lead, I used Selenium and BeautifulSoup to extract the contents. Following the example the project lead showed in a team meeting, I was able to handle infinite scrolling, where JavaScript dynamically loads more of the page as the user scrolls down (a scraping sketch appears after this list).
  2. Preprocessed the data with a Python function.
    Difficulties faced: I initially had trouble figuring out how to remove characters that are not needed for learning the concepts in the texts.
    Solved: With the help of the tutorial shared by the project lead, I was able to preprocess the data sequentially by lowercasing the text, removing stop words, and stemming/lemmatizing related word forms (a preprocessing sketch appears after this list).
  3. Implemented a document retrieval script using TF-IDF matching scores.
  4. Used BERT, which gives a deeper sense of language context and flow than single-direction language models.
    Difficulties faced: I initially had trouble understanding how the algorithm works.
    Solved: Worked through the materials that were shared and online resources.
  5. I am still not sure how the word2vec model works and need to learn more.
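
Here is a minimal sketch of the infinite-scroll scraping from task 1; the URL and the CSS selector are placeholders for the actual forum and its topic-link markup:

```python
# Sketch: scroll a Discourse-style page to the bottom so JavaScript loads all
# topics, then collect the topic links and save them to a CSV file.
import csv
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://discuss.example.com/latest")  # placeholder forum URL

# Repeatedly scroll to the bottom until the page height stops growing.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the JavaScript time to load the next batch of topics
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Parse the fully loaded page and save every topic link.
soup = BeautifulSoup(driver.page_source, "html.parser")
links = [a["href"] for a in soup.select("a.title")]  # placeholder selector
with open("topics.csv", "w", newline="") as f:
    csv.writer(f).writerows([link] for link in links)
driver.quit()
```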
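
And a sketch of the sequential cleaning steps from task 2, assuming the relevant NLTK corpora (stopwords, wordnet, punkt) are already downloaded:

```python
# Sketch: lowercase, strip non-letters, drop stop words, lemmatize, then stem.
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess(text):
    text = text.lower()                                   # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                 # keep letters only
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words]   # remove stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # lemmatize
    return " ".join(stemmer.stem(t) for t in tokens)      # then stem

print(preprocess("The packages were installed, but the themes aren't loading!"))
```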

Things I have learned
Technical: I implemented the Doc2vec algorithm and further refined my scripts for the TF-IDF and BERT approaches. I finished implementing a similarity detection program that recommends forum threads related to an input text using cosine similarity. I used the GitHub knowledge gained from the previous webinars to push the scripts that I wrote.
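
A minimal sketch of that Doc2vec similarity step, assuming gensim 4.x (where the trained document vectors live under model.dv); the corpus and parameters are illustrative:

```python
# Sketch: train Doc2Vec on cleaned topic texts, then rank topics by cosine
# similarity to an inferred vector for new text.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder cleaned topics; the real corpus came from the Atom forum.
topics = [
    "install packages atom editor",
    "theme not loading on startup",
    "keybinding conflict resolution",
]
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(topics)]

# Small illustrative model; vector_size and epochs were tuned in practice.
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

vec = model.infer_vector("package install fails".split())
print(model.dv.most_similar([vec], topn=2))  # (topic index, cosine similarity)
```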

Tools: I became familiar with the Natural Language Toolkit (nltk), Sentence Transformers, and BERT models, and gained more knowledge of sklearn and tensorflow for building the machine learning algorithms.

Soft Skills: I am learning to be disciplined in completing my tasks and setting goals for myself so that I can keep up with the rest of the team. I learned that in a collaborative project I need to keep the scripts I write clear and organized so that others can understand them easily.

Three Achievements:

I managed to apply the doc2vec model to the collected data to detect similar forum topics. The doc2vec algorithm successfully captures the most similar topics for each topic in the forum.

I compared the similarity results obtained with each of the doc2vec, TF-IDF, and BERT models. For a small dataset, TF-IDF gives much better similarity scores than the others and is faster to compute (if I match a topic to itself, the cosine similarity score is closer to 1 for TF-IDF than for BERT and doc2vec). However, for a large dataset, BERT's scores become more reliable, and it is more computationally efficient than doc2vec and TF-IDF.

Working in a subteam, we managed to obtain 14,000 forum data points across 4 categories and 76 subcategories from the Codecademy forum, saving all of that data into a CSV file for further processing.

Meetings/ Training Attended (Including Social Team Events)
I’ve attended all the team meetings, including the subgroup meetings where we collaborated on collecting the Codecademy forum data points, cleaning them, and building the classification model. I also attended all the NLP webinars on BERT given by the industry experts.

Goals for Upcoming Week
Implement a program to classify a given post into one of the four categories obtained from the Codecademy forum.

First, we have to embed the data into meaningful vectors using distilBERT.
Then I plan on using Random Forests and Support Vector Machines for multi-class classification, and perhaps a hierarchical model that can also classify the subcategories (a sketch of the planned pipeline follows).
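
A minimal sketch of that pipeline on placeholder data; the checkpoint name is one common distilBERT sentence-embedding model, not necessarily the one we will use:

```python
# Sketch: distilBERT sentence embeddings as features for multi-class classifiers.
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Placeholder topics and category ids (0-3) standing in for the scraped data.
texts = ["python list index error", "resume feedback please",
         "css grid not aligning", "how to learn sql"] * 2
labels = [0, 1, 2, 3] * 2

encoder = SentenceTransformer("distilbert-base-nli-mean-tokens")
X = encoder.encode(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25,
                                                    random_state=0)

# Fit each candidate model and report held-out accuracy.
for clf in (RandomForestClassifier(random_state=0), SVC()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```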

Tasks Done

Finished implementing Doc2vec, TF-IDF, and BERT on the previously obtained and cleaned data from the Atom forum.

I compared the similarity scores obtained with the three NLP models mentioned above. For a small dataset, TF-IDF is more accurate and faster. For a large dataset, the BERT model with pretrained weights gives much better results than TF-IDF.

I understood the concept behind DistilBERT and why it is more efficient than the original BERT model.

As a group, we obtained approximately 14,000 data points with Selenium and BeautifulSoup, containing the category name, subcategory name, title, and contents of forum threads from the Codecademy forum. I was responsible for implementing the functions that obtain the topic contents from the forum. I also optimized other team members' functions for getting the category, subcategory, and topic names corresponding to each data point. Finally, I used my preprocessing function to clean the data.

Hi @ddas, I don't know whether this is the right place to mention it, but I was wondering whether there is a vacant technical lead position for the ML July 6 session. I have observed the roles played by our leads @anubhav15129 and @st3939 and learned a lot from them. I actually emailed you about this, but you may have missed my email. If a technical lead position is not available, that is still fine with me; I will join the July session as a participant again. Thanks.

If possible @ddas, I would strongly recommend @sgoswam2 to be a technical lead. He has done some excellent work and has been able to grasp the concepts in NLP very quickly. I am confident that he has the technical expertise to lead a team.

Thank you Anubhav. Recommendations from leads are very helpful. Soumya, your spot is guaranteed!


Thanks @ddas for this opportunity and thanks @anubhav15129 for the valuable recommendation. I am looking forward to it.

Third Self Assessment - Week 5

Things I Have Learned
Technical: I implemented classification models such as SVM, Random Forest, and a GPU-accelerated deep feedforward neural network to classify the texts from the Codecademy forum into 4 different categories. I validated the performance of these models with metrics such as confusion matrices, classification reports, and ROC/AUC curves. Moreover, I implemented data augmentation techniques on the texts to address the class imbalance among categories in our dataset. I also implemented a script for hyperparameter tuning of the models using grid search (a sketch follows).
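
A minimal sketch of that tuning script, with synthetic features standing in for our topic embeddings and an illustrative parameter grid:

```python
# Sketch: grid search over SVM hyperparameters with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic placeholder features; in practice these were distilBERT embeddings.
X, y = make_classification(n_samples=200, n_features=50, n_classes=4,
                           n_informative=10, random_state=0)

# Grid values are illustrative, not the ones from my actual script.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```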

Tools:
I’ve delved deeper into the tools within the sklearn and keras libraries, using many of them to implement the classification models on my data and to evaluate their performance.

Soft Skills: I have improved my virtual communication skills (through Slack), learned to schedule meetings with our subteam, and directly helped my teammates when they were stuck.

Five Achievements:

  1. I managed to successfully implement 3 classification models (SVM, Random Forest, and a feedforward neural network) on the collected Codecademy forum data to classify each topic into its corresponding category.

  2. I implemented data augmentation (swapping words with their thesaurus synonyms) on our dataset because of its massive class imbalance (two categories had significantly less data than the other two), and obtained improved results, with a best of 89% accuracy using the feedforward neural network on our test data (see the augmentation sketch after this list).

  3. I implemented hyperparameter tuning, L2 and dropout regularization, and batch normalization for the deep neural network to improve the model (a model sketch also follows this list).

  4. I combined our dataset with that of another team (who worked on the Ketogenics forum) and created a hierarchical deep neural network classification model that categorizes each post into its forum and then into that forum's categories.

  5. Our team has finished creating a slide deck documenting our processes and results of the Codecademy forum text classification project.
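
A minimal sketch of the synonym-replacement augmentation from achievement 2, using nlpaug's WordNet-backed augmenter as one way to do thesaurus swaps; the post text is a placeholder:

```python
# Sketch: generate extra training examples for the underrepresented categories
# by replacing words with WordNet synonyms (requires nltk's wordnet corpus).
import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src="wordnet")  # parameters left at their defaults

minority_posts = ["my theme stopped loading after the latest update"]  # placeholder
augmented = [aug.augment(post) for post in minority_posts]
print(augmented)
```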
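
And a sketch of a feedforward network with the regularization from achievement 3; layer sizes and rates are illustrative, with the input dimension assumed to match distilBERT's 768-dimensional embeddings:

```python
# Sketch: deep feedforward classifier with L2 regularization, batch
# normalization, and dropout, ending in a 4-way softmax over categories.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(768,)),                      # distilBERT embedding size
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(4, activation="softmax"),          # four forum categories
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```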

Meetings/ Training Attended (Including Social Team Events)
I’ve attended all the team meetings, as well as the subgroup meetings to discuss and work on the tasks concerning the Codecademy forum group project. I also arranged a meeting myself where I walked through the classification model scripts that I wrote for the Codecademy forum.

Goals for Upcoming Week
To optimize the hierarchical model classifier and see whether we can further improve the accuracy of the forum-level classification. This is important because errors at the forum level of the hierarchical model propagate into the subsequent classification of categories within each forum.

Tasks Done

  1. Created distilBERT topic embeddings for all the data scraped from the Codecademy forum. I gained further knowledge of the multiple layers and features of BERT embeddings and the principle behind BERT's attention mechanism. Moreover, I wrote a script that transforms the BERT embeddings of individual words into sentence/topic embeddings, which are used in the subsequent classification steps (see the pooling sketch after this list).

  2. Implemented data augmentation on the collected raw data using libraries like nlpaug and googletrans. We had a class imbalance issue and therefore had to do some data augmentation to improve the classification of the categories. We tried randomly shuffling the words within each topic, but that actually reduced the accuracy, because shuffling the words no longer preserves the meaning of the topic, which the BERT embeddings rely on. I then tried replacing words with their thesaurus synonyms, and that significantly improved our results.

  3. Classified the forum topics into their categories using 3 classification models. Using the sklearn library, we classified the Codecademy forum topics with SVM, Random Forest, and neural network (MLPClassifier) models. I evaluated these models using confusion matrices, ROC curves, and related metrics, and finally chose the deep feedforward NN model, as it gives fairly good precision and recall and an AUC of 0.9 for each category.

  4. Created a slide deck to showcase our project. We made a presentation highlighting the process and results of our project, including how we obtained our data, the issue with our class imbalance, and comparisons of all the classification models.
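
A minimal sketch of the word-to-topic pooling step from task 1, using Hugging Face transformers with mean pooling over non-padding tokens; the checkpoint name is the standard distilbert-base-uncased and may differ from the one we used:

```python
# Sketch: turn distilBERT token embeddings into a single topic embedding by
# mean pooling over the non-padding tokens of each input.
import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

def topic_embedding(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embs = model(**batch).last_hidden_state    # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    return (token_embs * mask).sum(1) / mask.sum(1)      # mean over real tokens

print(topic_embedding(["how do I fix this python error?"]).shape)  # (1, 768)
```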

Goals for July Session

  • Implement professional software development processes in developing an ML solution
  • Learn more about deploying ML solutions
  • Improve remote team leadership, tutorials and project management