Karandeep Kawatra - Machine Learning Self-Assessment

Overview of things learned:


  • Learned how to work with Keras models using BERT and TF-IDF vectorization
  • Preprocessed text from forums
  • Identified the most common tags through exploratory data analysis
  • Was introduced to active learning and worked with the modAL library
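
The active learning idea mentioned above can be illustrated with least-confidence uncertainty sampling, where the model asks for a label on the post it is least sure about. This is only a minimal pure-Python sketch and not the modAL code used in the project; the function name and the probability values are made up for illustration.

```python
def least_confident_index(pool_probabilities):
    """Return the index of the pool item whose top predicted probability is lowest."""
    # Uncertainty of one item = 1 - probability of its most likely class.
    uncertainties = [1.0 - max(probs) for probs in pool_probabilities]
    # Pick the item with the highest uncertainty (first one wins on ties).
    return max(range(len(uncertainties)), key=uncertainties.__getitem__)


# Hypothetical predicted tag probabilities for three unlabeled posts.
pool = [
    [0.90, 0.05, 0.05],  # model is confident
    [0.40, 0.35, 0.25],  # model is unsure, so this post is queried
    [0.70, 0.20, 0.10],
]
query_idx = least_confident_index(pool)
print(query_idx)
```

In a full loop, the queried post would be labeled by a person, added to the training set, and the model retrained before the next query.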

Tools: Google Colab, Slack, Notion, Google Meet, BeautifulSoup, TensorFlow (Keras modeling), modAL, Prodigy, LightTag

Soft Skills:

  • Learned to be more vocal because of the smaller team size
  • Shared my work with my teammates and gave my input
  • Got over the fear of being the youngest and least experienced person in my group and took the risk of sharing my ideas
  • Learned that communication was key since we had a diverse team from all over the world
  • Connected/bonded with my team and learned more about their background

Achievement Highlights:

  • Scraped Stack Exchange forum data (titles, content and tags) from 4 different categories: Computer Science, Data Science, Computational Science and Machine Learning
  • Built a sequential model using BERT embeddings from the scraped data
  • Built a model using TF-IDF embeddings which predicted the tags based on the content of the forum post
  • Learned and experimented with the modAL active learning library and tag annotation tools such as Prodigy and LightTag
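
To show what the TF-IDF embedding step computes, here is a minimal pure-Python sketch of the weighting. It assumes the simplest textbook form (term frequency divided by document length, times log(N / document frequency)); library implementations typically add smoothing and normalization, so their numbers will differ. The function name and the example posts are hypothetical and are not taken from the project data.

```python
import math
from collections import Counter


def tfidf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    tf  = count of the term in the document / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-tokenized forum posts.
posts = [
    ["neural", "network", "training"],
    ["network", "latency", "routing"],
    ["neural", "network", "pruning"],
]
weights = tfidf(posts)
print(weights[0])
```

A term that appears in every post ("network" here) gets a weight of zero, while a term unique to one post ("training") gets the highest weight, which is why TF-IDF features tend to highlight the words that distinguish one tag from another.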

Meetings Attended:

  • Team meetings hosted by our leads (11 in total, held twice a week)
  • Team Building meetings: our leads discussed resume building and how to reach out to various companies for potential employment

Goals for this week:

  • To integrate a fully functioning model that uses the active learning loop we created to predict tags from the title, and then continue with the posts in the dataset. We will probably deploy this through AWS
  • To test our model on the STEM-Away dataset if possible

Tasks Done:

  • Scraped the Stack Exchange forum using BeautifulSoup
  • Researched annotation tools such as Prodigy and LightTag to use in
  • Preprocessed my dataset and used BERT and TF-IDF to embed the text (later discovered that TF-IDF was the better option since it did not have a limit on features)
  • Extracted the 25 most frequent tags and displayed them graphically (discovered machine learning was the most frequent by a landslide)
  • Evaluated both models using precision, recall and F1 score
  • Found that the TF-IDF model gave a better accuracy (70%) than the BERT model (40%)
  • Researched and tried to implement active learning within my model (did not work out well)
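
For reference, the precision, recall and F1 metrics mentioned above can be computed for tag prediction roughly as follows. This is a sketch that assumes micro-averaging over all (post, tag) pairs, which may not match the averaging used in the project; the function name and the example tags are invented for illustration.

```python
def micro_precision_recall_f1(true_tags, predicted_tags):
    """Micro-averaged precision, recall and F1 for multi-label tag prediction."""
    tp = fp = fn = 0
    for truth, pred in zip(true_tags, predicted_tags):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # tags predicted and correct
        fp += len(pred - truth)   # tags predicted but wrong
        fn += len(truth - pred)   # correct tags that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical true and predicted tags for three posts.
true_tags = [["machine-learning", "python"], ["neural-network"], ["algorithms", "graphs"]]
predicted_tags = [["machine-learning"], ["neural-network", "python"], ["algorithms", "graphs"]]

precision, recall, f1 = micro_precision_recall_f1(true_tags, predicted_tags)
print(precision, recall, f1)
```

Precision drops when the model predicts tags that do not belong to a post, and recall drops when it misses tags that do, so reporting both alongside F1 gives a fuller picture than a single accuracy number.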