Soumya Goswami - Machine Learning Pathway Technical Lead - July 2020 Self-Assessment

Weeks 1, 2 & 3 Assessment: July 5 - July 20

Training/Learnings:

  • Leadership skills :

    Earn Trust - Introduced ourselves, listened attentively, spoke candidly, and treated others respectfully, ultimately making them feel comfortable and confident talking with me.
    Customer Obsession: In the first week, while deciding how to proceed with the project, we needed a reality check on the practicality of the application we were going to design so that it would earn the respect of its users.
    Invent and Simplify: This session will bring plenty of hurdles, such as multiple time zones, conflicting opinions, and time management. I need a plan to simplify processes, consider everyone's opinion, decide on the best possible solution, and proceed at a steady pace.

  • Technical skills:
    Tag-recommendation methods, Discourse forum tagging, NLP annotation tool characteristics, active learning architecture, machine learning classifiers

Tools Used:

  • Python
  • Discourse Forum
  • Git
  • Slack, used to communicate
  • Google Colab
  • Notion, used to post online resources

Soft Skills Used:

  • Team Collaboration and Communication
  • Leading a team remotely
  • Conducting remote team meetings, tutorials and brainstorming

Achievements:

  • Successfully conducted 5 team meetings.
  • Developed an active learning architecture with several variants for an NLP annotation tool.
  • Produced a product with basic functionality for benchmarking with Stack Exchange data.
  • Investigated the various models/techniques that can be used for tag prediction, decided which to focus on, and assigned them to members.
  • Started focusing on model exploration, benchmarking data analysis, and the transition to operating on STEM-Away data.

Detailed Statement of Tasks Accomplished/Organized:

  • To improve engagement during meetings, every participant was given a chance to speak. No matter how small or insignificant an input may seem, everyone's contributions are welcomed and analysed. Leads turned on their videos to build a more personal connection with members. Meetings are recorded, and meeting summaries are later shared on STEM-Away for those who are temporarily absent.

  • Initially, we were supposed to build the tag recommendation software solely on STEM-Away data, but it was found that there were not enough posts and not enough relevant tags available. Therefore, Stack Exchange was chosen to be scraped. The plan was to train our model on the Stack Exchange dataset and later test it on the STEM-Away platform. Work was distributed among team members to scrape categories such as data science, artificial intelligence, and computer science.
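As a sketch of that scraping step: the endpoint and field names below follow the public Stack Exchange API (api.stackexchange.com), but the function names and pagination details are illustrative, not our actual scraper.

```python
import gzip
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.stackexchange.com/2.3/questions"

def fetch_questions(site, page=1, pagesize=100):
    """Fetch one page of questions (with bodies and tags) for a given site."""
    params = urlencode({
        "site": site,
        "page": page,
        "pagesize": pagesize,
        "filter": "withbody",  # ask the API to include question bodies
    })
    with urlopen(f"{API_URL}?{params}") as resp:
        raw = resp.read()
        # The Stack Exchange API gzips responses.
        if resp.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
        return json.loads(raw)["items"]

def to_examples(items):
    """Turn API question items into (text, tags) training pairs."""
    return [(item["title"] + " " + item.get("body", ""), item["tags"])
            for item in items]
```

Each category (site) can then be scraped page by page and the pairs pooled into one training set.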

  • We observed a class imbalance in the dataset: the machine-learning tag appeared in almost every post. Data augmentation was not found to be beneficial, because the more we augmented the data for other tags, the more occurrences of the machine-learning tag we would add, since it appeared in almost every post. Removing the machine-learning tag was also not a feasible option, since after combining every team's work the dataset would still contain data for the other tags, so we left it as it was.

  • Found that, in total, we had 60,000 posts and 480 tags to predict, which is difficult. We reduced the tag set to the 25 most frequently occurring tags, which reduced the dataset to 11,000 topics.
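The tag-set reduction above can be sketched as follows; the function name and the (text, tags) pair format are illustrative.

```python
from collections import Counter

def filter_to_top_tags(examples, k=25):
    """Keep only the k most frequent tags; drop posts left with no tags.

    examples: list of (text, tags) pairs.
    """
    counts = Counter(tag for _, tags in examples for tag in tags)
    top = {tag for tag, _ in counts.most_common(k)}
    filtered = []
    for text, tags in examples:
        kept = [t for t in tags if t in top]
        if kept:  # discard posts carrying none of the top-k tags
            filtered.append((text, kept))
    return filtered
```

Dropping posts whose tags all fall outside the top k is what shrinks 60,000 posts down to roughly the 11,000 topics mentioned above.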

  • Explained the concept of multi-label classification. Since our goal was to predict multiple relevant tags for a post, we took a probability-based multi-label classification approach. Explained the concepts of converting topic texts to vectors (BERT/TF-IDF) and tags to binary labels.
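A minimal sketch of that vectorization and label-binarization step, using scikit-learn (the example texts and tags are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["how to tune a neural network", "pandas dataframe merge question"]
tags = [["deep-learning", "python"], ["python", "pandas"]]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # sparse matrix: posts x vocabulary

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)          # binary matrix: posts x tags
```

Each column of `Y` is one tag, so a classifier can emit an independent probability per tag rather than a single class.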

  • Explained the concept of a feed-forward neural network (NN) to everyone. Together we defined three ML models to predict tags: 1. topic vectorization by BERT + training by NN; 2. topic vectorization by TF-IDF + training by NN; 3. topic vectorization by TF-IDF + training by logistic regression.

  • Detailed the pros and cons of each model in terms of efficiency and productivity.
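As an illustration, the TF-IDF + logistic regression variant could be wired up with scikit-learn roughly as below; the toy posts and tags are invented, not our benchmarking data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "train a neural network in keras",
    "merge two pandas dataframes",
    "logistic regression with scikit-learn",
    "plot a dataframe column with pandas",
]
tags = [["deep-learning"], ["pandas"], ["machine-learning"], ["pandas"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)          # binary label matrix: posts x tags

# One independent logistic regression per tag, over shared TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, Y)
probs = model.predict_proba(texts)   # per-tag probabilities, posts x tags
```

The per-tag probabilities are exactly what the probability-based multi-label approach needs: any thresholding or top-n selection can be applied on top of `probs`.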

Goals for the upcoming week:

  • Set up the active learning module.
  • Build a user interface where tags will be predicted for user-posted data.
  • Build a tag annotation tool where users will be allowed to choose relevant tags and feed them back to the model.
  • Work on tuning the hyper-parameters of the ML model.

Weeks 4, 5, 6 & 7 Assessment: July 21 - Aug 20

Training/Learnings:

  • Leadership skills :

    Motivation - Since we had already implemented a machine learning based tag prediction model, the next task was to motivate everyone to implement the active learning and tag annotation tools.

    Strategic Thinking & Creative Solutions - Grouped participants into two teams tasked with implementing the active learning module and the annotation tool. Proposed a batch-based active learning model to trade off model efficiency against implementation difficulty, along with the idea of retraining the model on each newly annotated batch combined with the previous dataset.

    Team Development - Created a slide deck to showcase our project along with each participant's contribution.

  • Technical skills:
    NLP annotation tool characteristics, batch-based active learning architecture, AWS

Tools Used:

  • Python
  • Tensorflow
  • Pytorch
  • scikit-learn
  • Git
  • Slack, used to communicate
  • Google Colab
  • Notion, used to post online resources

Soft Skills Used:

  • Team Collaboration and Communication
  • Developing strategies to improve productivity and efficiency
  • Conducting remote team meetings, tutorials

Achievements:

  • Implemented an active learning model for retraining the tag prediction algorithm.
  • Analyzed tagging behavior on the Stack Exchange forum and used it to predict tags for unlabelled STEM-Away data.
  • Observed increased accuracy, precision, and recall after implementing active learning with the tag annotator.
  • Presented the project to STEM-Away and the public.

Detailed Statement of Tasks Accomplished/Organized:

  • Explained the concept and advantages of a human tag annotator (like the Prodigy software).

  • Our team implemented a probability-based n-tag predictor system, where 'n' is the number of tags to predict for a text and can be chosen by the user. This is helpful because the first few tags are obvious for each post, e.g., a post about machine or deep learning is highly likely to carry machine-learning and deep-learning as tags, but there may be other associated tags such as keras, tensorflow, classification, and python that the n-tag predictor system also surfaces.
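A minimal sketch of that probability-based n-tag selection (the function name is illustrative; it assumes a per-tag probability vector such as a classifier's `predict_proba` output for one post):

```python
import numpy as np

def predict_n_tags(probabilities, tag_names, n=3):
    """Return the n tag names with the highest predicted probabilities."""
    top = np.argsort(probabilities)[::-1][:n]  # tag indices, highest first
    return [tag_names[i] for i in top]
```

Letting the user pick `n` trades precision (small n, only the obvious tags) against recall (larger n, more associated tags).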

  • After implementing the human tag annotator loop, we fed the new data points back into the original dataset using active learning. We decided to feed back these new data points in batches rather than one at a time. This batch-based active learning provides the scope to retrain the model continuously on the next set of posts on the STEM-Away forum.
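The batch-based feedback loop described above might look roughly like this sketch, where retraining is a full refit on the combined data; the names and data shapes are illustrative, not our actual implementation.

```python
def retrain_on_batches(model, X_train, Y_train, annotation_stream, batch_size=50):
    """Batch-based active learning loop (sketch).

    annotation_stream yields (features, labels) pairs confirmed by the human
    annotator; every batch_size pairs, the batch is merged into the training
    set and the model is refit on the combined data.
    """
    batch_X, batch_Y = [], []
    for x, y in annotation_stream:
        batch_X.append(x)
        batch_Y.append(y)
        if len(batch_X) == batch_size:
            X_train = X_train + batch_X   # merge annotated batch into dataset
            Y_train = Y_train + batch_Y
            model.fit(X_train, Y_train)   # retrain on combined data
            batch_X, batch_Y = [], []
    # Annotations left over after the last full batch wait for the next round.
    return model, X_train, Y_train
```

Batching amortizes the retraining cost: the model is refit once per batch instead of once per annotated post.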

  • We deployed this application to AWS so that, in the future, it can be used for STEM-Away forum tag prediction.

Goals for future improvement:

  • Use a deep learning based tag recommendation system that can also generate new tags outside the 25-tag set we chose.
  • Analyze the model's behavior in special cases, e.g., if the same post is assigned different tags by two users during tag annotation, what will the effect be on the model?