Soumya Goswami - Machine Learning Pathway Technical Lead July 2020 self-assessment:
Weeks 1, 2 & 3 Assessment: July 5 - July 20
Leadership Skills:
Earn Trust: Introduced ourselves. Listened attentively, spoke candidly, and treated others respectfully, so that team members felt comfortable talking with me and could trust me.
Customer Obsession: In the first week, while deciding how to proceed with the project, we needed a reality check on the practicality of the application we were going to design, so that it would earn the respect of its users.
Invent and Simplify: This session will present many hurdles, such as multiple time zones, conflicting opinions, and time management. I need to make a plan to simplify processes, consider everyone’s opinion, decide on the best possible solution, and proceed at a steady pace.
Tag-recommendation method, Discourse Forum Tagging, NLP annotation tool characteristics, Active learning architecture, Machine Learning Classifiers
- Discourse Forum
- Slack, used to communicate
- Google Colab
- Notion, used to post online resources
Soft Skills Used:
- Team Collaboration and Communication
- Leading a team remotely
- Conducting remote team meetings, tutorials and brainstorming
- Successfully conducted 5 team meetings.
- Developed an active learning architecture with several variants for an NLP annotation tool.
- Produced a product with basic functionality for benchmarking with Stack Exchange data.
- Investigated the various models/techniques that can be used for tag prediction, decided which ones to focus on, and assigned them to members.
- Started focusing on model exploration, benchmarking data analysis and the transition to operate on STEM-Away data
Detailed Statement of Tasks Accomplished/Organized:
To improve engagement during meetings, every participant was given a chance to speak. No matter how small or insignificant an input may seem, everyone’s contributions are welcomed and analysed. Leads turned on their video to create a more personal connection with members. Meetings are recorded, and meeting summaries are later shared on STEM-Away for those who are temporarily absent.
Initially, we were supposed to build the tag recommendation software solely on STEM-Away data, but we found that there were not enough posts and that the available tags were not relevant. Therefore, Stack Exchange was chosen to be scraped. The plan was to train our model on the Stack Exchange dataset and later test it on the STEM-Away platform. Work was distributed between team members to scrape categories such as data science, artificial intelligence, and computer science.
We observed a class imbalance in the dataset: the machine learning tag appeared in almost every post. Data augmentation was not found to be beneficial, because the more we augmented the posts carrying other tags, the more often the machine learning tag would appear as well, since it was attached to almost every post. Removing the machine learning tag was not a feasible option either, because combining every team’s work would later add data for the other tags as well. We therefore left the tag distribution as it was.
We found that, in total, we had 60,000 data points and 480 tags to predict, which is a difficult task. We reduced the tag set to the 25 most frequently occurring tags, which reduced the dataset to 11,000 topics.
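The tag-reduction step can be sketched roughly as follows. This is a minimal illustration on made-up toy data, not the team's actual script; the function name `keep_top_tags` and the `(text, tags)` tuple layout are assumptions made for the example.

```python
from collections import Counter


def keep_top_tags(topics, k):
    """Keep only the k most frequent tags and drop topics left without any tag.

    `topics` is a list of (text, tags) pairs; this layout is hypothetical.
    """
    counts = Counter(tag for _, tags in topics for tag in tags)
    top = {tag for tag, _ in counts.most_common(k)}
    filtered = []
    for text, tags in topics:
        kept = [tag for tag in tags if tag in top]
        if kept:  # a topic with no remaining tags cannot be used for training
            filtered.append((text, kept))
    return filtered, counts


# Toy data for illustration only (not the scraped Stack Exchange data).
topics = [
    ("How to tune a random forest?", ["machine-learning", "random-forest"]),
    ("BERT fine-tuning tips", ["machine-learning", "nlp"]),
    ("What is a p-value?", ["statistics"]),
    ("Tokenizing text", ["nlp", "machine-learning"]),
]
filtered, counts = keep_top_tags(topics, k=2)
print(counts.most_common())  # tag frequencies also expose the class imbalance
print(len(filtered), "topics remain")
```

With `k=25` on the full scrape, the same idea would shrink both the tag set and the number of usable topics, which is consistent with the drop from 60,000 to 11,000 reported above.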
Explained the concept of multi-label classification. Since our goal was to predict multiple relevant tags for a post, we took a probability-based multi-label classification approach. Explained the concept of converting topic texts to vectors (BERT/TF-IDF) and tags to binary labels.
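As a small illustration of the label side of this approach, the sketch below turns tag lists into binary (multi-hot) label rows and turns per-tag probabilities back into tags. The helper names and the 0.5 threshold are assumptions for the example; the source does not state which threshold the team used.

```python
def binarize_tags(tag_lists, vocabulary):
    """Convert each list of tags into a 0/1 row with one column per tag."""
    index = {tag: i for i, tag in enumerate(vocabulary)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(vocabulary)
        for tag in tags:
            if tag in index:  # tags outside the vocabulary are ignored
                row[index[tag]] = 1
        rows.append(row)
    return rows


def decode_probabilities(probs, vocabulary, threshold=0.5):
    """Return every tag whose predicted probability reaches the threshold.

    The threshold value is an assumption, not taken from the project.
    """
    return [tag for tag, p in zip(vocabulary, probs) if p >= threshold]


vocabulary = ["machine-learning", "nlp", "statistics"]
labels = binarize_tags([["nlp", "machine-learning"], ["statistics"]], vocabulary)
predicted = decode_probabilities([0.91, 0.62, 0.08], vocabulary)
print(labels)
print(predicted)
```

Because each tag gets its own probability, a post can receive zero, one, or several tags, which is what distinguishes multi-label from ordinary multi-class classification.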
Explained the concept of a feed-forward neural network (NN) to everyone. Together we defined three ML models to predict tags: 1. Topic vectorization by BERT + training by NN; 2. Topic vectorization by TF-IDF + training by NN; 3. Topic vectorization by TF-IDF + training by logistic regression.
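The third variant (TF-IDF + logistic regression) could look roughly like the sketch below. It assumes scikit-learn is installed and uses a one-vs-rest wrapper to obtain one probability per tag; the wrapper choice, the toy texts, and the tag names are all assumptions, not the team's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data for illustration only.
texts = [
    "training a neural network with backpropagation",
    "tokenizing text for a language model",
    "neural network for text classification",
    "confidence interval for a sample mean",
]
tags = [["neural-networks"], ["nlp"], ["neural-networks", "nlp"], ["statistics"]]

# Tags -> binary label matrix, one column per tag (columns sorted by name).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# TF-IDF vectorization followed by one logistic regression per tag.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, Y)

# One probability per tag for a new, unseen post.
probs = model.predict_proba(["how to train a neural network"])
for tag, p in zip(mlb.classes_, probs[0]):
    print(f"{tag}: {p:.2f}")
```

Swapping the final estimator for a feed-forward network, or the vectorizer for BERT embeddings, would give the other two variants while keeping the same binary label matrix.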
Detailed the pros and cons of each model in terms of efficiency and productivity.
Goals for the upcoming week:
- Set up the active learning module.
- Build a user interface where tags will be predicted for user-posted data.
- Build a tag annotation tool where users can choose relevant tags and feed them back to the model.
- Work on tuning the hyper-parameters of the ML models.
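
One common way to drive an active learning module is uncertainty sampling: the posts whose tag probabilities are closest to 0.5 are sent to the annotator first. The sketch below assumes that strategy purely for illustration; the source does not say which selection strategy the team planned, and the function names and post IDs are hypothetical.

```python
def uncertainty(probs):
    """Distance of the most ambiguous tag probability from 0.5.

    Lower values mean the model is less sure about at least one tag.
    """
    return min(abs(p - 0.5) for p in probs)


def select_for_annotation(pool_probs, n):
    """Pick the n unlabeled posts the model is least certain about."""
    ranked = sorted(pool_probs, key=lambda post_id: uncertainty(pool_probs[post_id]))
    return ranked[:n]


# Hypothetical per-tag probabilities for three unlabeled posts.
pool_probs = {
    "post-1": [0.97, 0.02, 0.01],
    "post-2": [0.55, 0.48, 0.10],
    "post-3": [0.80, 0.30, 0.05],
}
to_annotate = select_for_annotation(pool_probs, n=2)
print(to_annotate)
```

Tags confirmed or corrected by users in the annotation tool would then be added to the training set before the model is retrained.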