Karandeep_Kawatra - Machine Learning Pathway

Karandeep_Kawatra · August 7, 2020, 6:51pm

Overview of things learned:

Technical:

Tools: Google Colab, Slack, Notion, Google Meet, BeautifulSoup, TensorFlow (Keras modeling), modAL, Prodigy, LightTag

Soft Skills:

Learned to be more vocal because of the smaller team size
Shared my work with my teammates and gave my input
Got over the fear of being the youngest and most naive person in my group and took the risk of sharing my ideas
Learned that communication was key since we had a diverse team from all over the world
Connected/bonded with my team and learned more about their background

Achievement Highlights:

Scraped StackExchange forum posts (titles, content and tags) from four categories: Computer Science, Data Science, Computational Science and Machine Learning
Built a sequential model using BERT embeddings of the scraped data
Built a model using TF-IDF embeddings that predicted the tags based on the content of the forum posts
Learned about and experimented with an Active Learning library and tag annotation tools such as Prodigy and LightTag
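
For readers unfamiliar with TF-IDF, the sketch below shows roughly how post text can be turned into weighted term vectors. It uses the plain textbook formula (term frequency times log inverse document frequency) with only the standard library; it is not necessarily the implementation used in this project, and library versions often apply smoothing or normalization on top. The function name `tfidf_vectors` and the sample titles are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Weight = (share of the document taken by the term) * log(N / df).
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Made-up post titles for illustration, not the scraped data.
posts = [
    "gradient descent for neural network",
    "neural network overfitting",
    "sorting algorithm complexity",
]
vectors = tfidf_vectors(posts)
```

Terms that appear in fewer posts (such as "gradient" here) receive a higher weight than terms shared across posts (such as "neural"), which is what makes the vectors useful as input to a tag classifier.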

Meetings Attended:

Team meetings hosted by our leads (11 in total, held twice a week)
Team Building meetings: our leads discussed resume building and how to reach out to various companies for potential employment

Goals for this week:

Integrate a fully functioning model that uses the Active Learning loop we created to predict tags from each post's title, continuing through the posts in the dataset. We will probably deploy this through AWS
Test our model on the STEM-Away dataset if possible
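
As a rough illustration of the query step in an Active Learning loop, here is a library-free sketch of least-confidence (uncertainty) sampling, one common strategy for choosing which unlabeled items to send for annotation. The post does not say which strategy our loop used, so this is an assumption; the function name `least_confident` and the probabilities are hypothetical.

```python
def least_confident(probabilities, batch_size=1):
    """Return indices of the samples whose top predicted probability is lowest."""
    # Confidence of a prediction = probability of its most likely tag.
    confidence = [max(p) for p in probabilities]
    # Rank pool indices from least to most confident.
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:batch_size]

# Hypothetical model outputs for four unlabeled titles over three tags.
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
    [0.34, 0.33, 0.33],
]
query = least_confident(pool_probs, batch_size=2)
print(query)  # [3, 1]
```

In a full loop, the selected items would be labeled (for example in an annotation tool), added to the training set, and the model retrained before the next query.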

Tasks Done:

Scraped the StackExchange forum using BeautifulSoup
Researched annotation tools such as Prodigy and LightTag for use in our project
Preprocessed my dataset so the text could be embedded with BERT and TF-IDF (later discovered that TF-IDF was the better option since it did not have a limit on features)
Extracted the 25 most frequent tags and plotted them (machine learning was the most frequent by a wide margin)
Evaluated both models using precision, recall and F1 score
Found that the TF-IDF model gave a better accuracy of 70%, compared with 40% for the BERT model
Researched and tried to implement Active Learning within my model (did not work out well)
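
The tag counting and evaluation steps above can be sketched with the standard library alone. This is a minimal illustration under my own assumptions: micro-averaging is one of several ways to combine precision, recall and F1 across tags, and the post does not specify which averaging was used. The helper names `top_tags` and `micro_prf` and the sample tags are hypothetical.

```python
from collections import Counter

def top_tags(tag_lists, n=25):
    """Count tags across all posts and return the n most frequent."""
    return Counter(tag for tags in tag_lists for tag in tags).most_common(n)

def micro_prf(true_tags, predicted_tags):
    """Micro-averaged precision, recall and F1 over per-post tag sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_tags, predicted_tags):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # tags predicted and correct
        fp += len(pred - truth)   # tags predicted but wrong
        fn += len(truth - pred)   # true tags that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up tags for illustration, not the project's results.
true_tags = [["machine-learning", "python"], ["algorithms"], ["machine-learning"]]
predicted = [["machine-learning"], ["algorithms", "python"], ["machine-learning"]]

most_common = top_tags(true_tags, n=1)
scores = micro_prf(true_tags, predicted)
```

Note that precision, recall and F1 are not the same quantity as accuracy, so it helps to state which metric a reported percentage refers to.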