Huiwen.goh - Machine Learning Pathway

First Self Assessment - Week 3

Things I Have Learned
Technical: I’ve learned how to infinitely scroll a site and scrape specific data, and how to clean the extracted data for further processing with NLP models. I have also gained knowledge of various NLP methods such as TF-IDF, Bag of Words, and BERT.

Tools: I’ve gained some first-hand experience using Python/ML libraries such as BeautifulSoup and Selenium for scraping data and implementing infinite scrolling.

Soft Skills: Working in a team virtually is a new experience for me. It is very different from working with a team in person, and it has forced me to set independent goals and organize my time to ensure that I complete the required tasks before the next check-in meeting.

Three Achievements:

  1. Using BeautifulSoup, I scraped the Atom discourse site to extract individual topic titles, comments, and tags, then organized the data points in a CSV file

  2. I used Selenium to infinitely scroll the homepage of the Atom forum, making the process more automated. I managed to get ~900 data points, and the approach can easily be scaled to collect much larger volumes of data (a rough sketch of the scroll-and-scrape workflow follows this list)

  3. I’ve implemented several functions to clean the obtained data, preparing it to be fed into NLP models such as TF-IDF/BERT
    (I’ve started those implementations, but I am currently experiencing some errors in my code that I am trying to fix)
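
For reference, here is a minimal sketch of the scroll-and-scrape workflow behind achievements 1 and 2. The forum URL and CSS selector are illustrative assumptions, not necessarily the exact ones I used:

```python
# Scroll a Discourse topic list to the bottom, then parse the loaded page.
import csv
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://discuss.atom.io/")  # assumed forum URL

# Keep scrolling until the page height stops growing (no more topics load).
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the forum time to load the next batch of topics
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Each topic title becomes one CSV row (titles only in this sketch).
with open("atom_topics.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    for link in soup.select("a.title"):  # assumed selector for topic titles
        writer.writerow([link.get_text(strip=True)])
```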

Meetings/ Training Attended (Including Social Team Events)
I’ve attended all the team meetings/trainings, but we have not had any social events (which I will definitely be up for!)

Goals for Upcoming Week
I would like to fix the bug in my code for the TF-IDF implementation on my data and be able to get predictions for forum topics.
After I’m done with that, I will implement BERT (and doc2vec, if time permits) so that I can compare and evaluate my results.

Tasks Done

  • Scraped data from the Atom discourse forum, compiling it into a CSV file
    Initially I tried to use Scrapy for this task (following the tutorial given on STEM-Cast), but it did not work the way I expected. In the end, I used BeautifulSoup and Selenium; a brief tutorial by Anubhav (our project lead) was very helpful, and with that I managed to obtain the data points.

  • Cleaned the obtained data
    I was able to remove punctuation, stopwords, and numeric figures, lemmatize the words, etc. (a rough sketch of this pipeline follows this list). Not a lot of hurdles on this task; Anubhav provided us some online materials, and those sites were pretty straightforward.

  • Read up on the various NLP implementations
    We were given a number of readings that explained how various NLP implementations work and how they can be used. This was a pretty straightforward task; some of the information is still a little confusing, but I think that by trying the methods out I will start to understand them better.

  • Started a TF-IDF implementation on my data
    I managed to get about halfway through this task before encountering an error in my program. I am currently in the process of debugging it, and I think I am close to figuring it out. Hopefully this will be done soon so that I can try other implementations.
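
As mentioned in the cleaning task above, here is a minimal sketch of that kind of pipeline (the regex and tokenization choices are my own simplifications, not my exact code):

```python
# Lowercase, strip punctuation/digits, drop stopwords, lemmatize.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and digits
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

print(clean_text("Atom's editor crashed 3 times after updating!"))
# -> "atom editor crashed time updating"
```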

I’m not sure if this is the right place to mention this, but I am interested in switching from an observer role to a participant role for this session. For the past 2 weeks I have been rather comfortable with the tasks/materials, and I really want to be more involved in the project, so I think a participant role would be suitable. Please let me know if this is possible!

@ddas I would highly recommend @huiwen.goh to join as a participant for the next batch. She has done some excellent work. 🙂

Absolutely! Good work @huiwen.goh
Enjoy the project!

Second Self Assessment - Week 4

Things I Have Learned
Technical: I’ve become more familiar with the TF-IDF and BERT algorithms, and can now implement a similarity-detection program that recommends forum threads related to an input text using cosine similarity.

Tools: I’ve used the Natural Language Toolkit (nltk), Sentence Transformers, and BERT libraries to implement my programs. I am also currently learning how to use sklearn in my implementations.

Soft Skills: As before, I am learning to be more disciplined about completing my tasks and setting goals and deadlines for myself so that I can keep up with the rest of the team.

Three Achievements:

  1. I’ve managed to apply the TF-IDF model to the collected data to detect similar forum topics. The model is fairly successful (it is able to detect similar topics), but the cosine similarity I get is often lower than I expect (e.g. the exact same post gives a cosine similarity of 0.85 instead of the expected 1; a rough sketch of the approach follows this list)

  2. I also successfully applied the BERT model to the same set of data and contrasted the results with the TF-IDF implementation. The BERT implementation was significantly better and more reliable.

  3. Working in a subteam, we managed to obtain 14,000 forum threads across 4 categories and 76 subcategories from the Codecademy forum, saving all of that data into a CSV file for further processing
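
A minimal sketch of the TF-IDF similarity idea from achievement 1 (using sklearn's TfidfVectorizer for brevity; my own pipeline used nltk-based preprocessing):

```python
# TF-IDF similarity between a query post and the scraped topics. Fitting and
# querying must share one vectorizer; otherwise even an identical post won't
# score exactly 1.0, which may explain lower-than-expected similarities.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

topics = [
    "editor crashes after update",
    "how to change the syntax theme",
    "package installation fails behind a proxy",
]

vectorizer = TfidfVectorizer()
topic_vectors = vectorizer.fit_transform(topics)

query = "editor crashes after update"  # identical to the first topic
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, topic_vectors)[0]
for topic, score in sorted(zip(topics, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {topic}")  # the identical topic should print 1.00
```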

Meetings/ Training Attended (Including Social Team Events)
I’ve attended all the team meetings, including subgroup meetings to collaborate on the tasks involving the Codecademy forum

Goals for Upcoming Week
Implementing a program to classify a given post into one of the four categories obtained from the Codecademy forum.
To do that, I plan on learning and using Random Forest, distilBERT, and perhaps a hierarchical model that can also classify the subcategories.

Tasks Done

  • Implemented TF-IDF on previously obtained and cleaned data from the Atom forum
    I managed to implement a fairly accurate model, using nltk to clean and preprocess my data and cosine similarity to measure how alike two topics are. The results were accurate, though there is definitely room for improvement given the lower-than-expected similarity scores.

  • Implemented BERT on the same data obtained from the Atom forum
    I implemented another NLP model on the previously obtained data so that I could compare and contrast my results. After implementing the two models, I could conclude that the BERT implementation was much more successful than the TF-IDF one (a rough sketch follows this list).

  • Read up on more libraries and NLP implementations to increase my understanding and knowledge of the algorithms

  • Scraped more data using BeautifulSoup and Selenium
    As a group, we obtained approximately 14,000 data points containing the category name, subcategory name, title, and contents of forum threads from the Codecademy forum. I was responsible for implementing the functions that obtain the category and subcategory URLs, so the rest of my teammates could continue scraping the data. I also helped refine the program to make it more efficient and prevent overloading the site.
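
For reference, a sketch of the BERT-based similarity from the second task, via the sentence-transformers library (the model name here is just a common default, not necessarily the one I used):

```python
# BERT-style similarity: embed topics and a query, rank by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

topics = [
    "editor crashes after update",
    "how to change the syntax theme",
    "package installation fails behind a proxy",
]
topic_embeddings = model.encode(topics, convert_to_tensor=True)

query_embedding = model.encode("my editor keeps crashing", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, topic_embeddings)[0]

best = int(scores.argmax())
print(f"most similar topic: {topics[best]} (score {scores[best].item():.2f})")
```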

Third Self Assessment - Week 5

Things I Have Learned
Technical: I’ve gained a much deeper understanding of how BERT embedding works and how it is used in text classification. I have also gained knowledge of classification models such as SVM and Random Forest, as well as the methods used to evaluate the performance of these models, such as confusion matrices, classification reports, ROC curves, AUC, etc. I have also learned about data augmentation techniques and used them to solve class imbalances in our project.
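
As a quick illustration of those evaluation tools, a toy sklearn example on synthetic data (settings are arbitrary, not our actual pipeline):

```python
# Confusion matrix, classification report, and multiclass AUC on toy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_classes=4, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

print(confusion_matrix(y_test, pred))       # rows = true class, cols = predicted
print(classification_report(y_test, pred))  # precision/recall/F1 per class
# Multiclass AUC: one-vs-rest over the predicted class probabilities.
print(roc_auc_score(y_test, clf.predict_proba(X_test), multi_class="ovr"))
```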

Tools: I’ve delved deeper into the sklearn library and used a large number of its tools to implement the classification models on my data and to evaluate their performance

Soft Skills: I have improved my virtual communication skills (through Slack), learned to schedule meetings with our subteam, and learned to reach out directly to my teammates for help when I am stuck on a task

Three Achievements:

  1. As a group, we managed to successfully implement 3 classification models (SVM, Random Forest, and a Feedforward Neural Network) on the collected Codecademy forum data, classifying each topic into its corresponding category
  2. We implemented data augmentation (word swapping with Thesaurus) on our data set due to massive class imbalances (two categories had significantly less data than the other two) and obtained improved results - a best of 89% accuracy on our test data using the Feedforward Neural Network (a rough sketch of the synonym swapping follows this list)
  3. Our team has finished creating a slide deck documenting the process and results of the Codecademy forum text classification project
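
A rough sketch of the synonym swapping from achievement 2, using NLTK's WordNet here as a stand-in for the Thesaurus lookup we actually used (the swap rate is arbitrary):

```python
# Replace random words with WordNet synonyms to generate extra training rows
# for the under-represented categories.
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet")
random.seed(0)

def augment(sentence, swap_prob=0.3):
    out = []
    for word in sentence.split():
        synonyms = {lemma.name().replace("_", " ")
                    for synset in wordnet.synsets(word)
                    for lemma in synset.lemmas()}
        synonyms.discard(word)
        if synonyms and random.random() < swap_prob:
            out.append(random.choice(sorted(synonyms)))
        else:
            out.append(word)
    return " ".join(out)

print(augment("my program fails to compile after the update"))
```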

Meetings/ Training Attended (Including Social Team Events)
I’ve attended all the team meetings, and also subgroup meetings to discuss and work on the tasks concerning the Codecademy forum group project

Goals for Upcoming Week
Combining our dataset with another team’s (they previously worked on the Ketogenics forum) to create a hierarchical classification model that can categorize each post into the individual forums and their respective categories.

Tasks Done

  • Created distilBERT topic embeddings for all the data scraped from the Codecademy forum
    With much-needed help from a teammate in understanding the multiple layers and features of BERT embeddings, and how to turn BERT embeddings of individual words into sentence/topic embeddings, I managed to implement the code myself and obtained the embeddings used in the later classification steps (a combined sketch appears after this list)

  • Classified the forum topics into their categories using 3 classification models
    Using the sklearn library, our group was able to classify the Codecademy forum topics using SVM, Random Forest, and Neural Network (MLPClassifier) models. Having less experience in this area, I got a lot of help from referring to a teammate’s code and reading a large amount of documentation to understand how the models work. In the end I understood and could reproduce the program, and could also evaluate the models using confusion matrices, ROC curves, etc.

  • Implemented data augmentation on the collected raw data
    Our group had an issue with class imbalance and hence had to do some data augmentation to improve the classification of the categories. We first tried shuffling the words within each topic, but that actually reduced the accuracy. We then tried replacing words with their synonyms using Thesaurus, and that significantly improved our results.

  • Created a slide deck to showcase our project
    We made a presentation highlighting the process and results of our project, including how we obtained our data, the issue with class imbalance, and comparisons of all the classification models.
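
Finally, a combined sketch of the embedding and classification steps from the first two tasks (the model name, CSV layout, and column names are assumptions):

```python
# distilBERT sentence embeddings as features for the three classifiers.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

df = pd.read_csv("codecademy_topics.csv")  # assumed 'text'/'category' columns

# A distilBERT-based sentence-transformer; mean pooling over token embeddings
# happens inside .encode(), turning word vectors into one topic vector.
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
X = model.encode(df["text"].tolist())
y = df["category"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

for clf in (SVC(), RandomForestClassifier(), MLPClassifier(max_iter=500)):
    clf.fit(X_train, y_train)
    print(clf.__class__.__name__)
    print(classification_report(y_test, clf.predict(X_test)))
```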