Overview of Things Learned:
Researching various annotation tools.
Web scraping using BeautifulSoup.
Using Python libraries such as pandas, Matplotlib, and scikit-learn.
Pre-processing and cleaning the scraped data.
Identifying the top 25 most frequent tags.
Building a machine learning model with TF-IDF.
Tools: Google Colaboratory, Jupyter Notebook, Python libraries, Notion, Slack
Soft Skills: Project work collaboration, communication skills, active listening
- Worked in a diverse team and learned something from everybody.
- Enhanced my technical skills like web scraping and machine learning model-building.
- Explored various annotation tools.
- Learned about the concept of active learning.
Project Kick-off: Got introduced to the project and the team. We discussed annotation tools and different approaches we could use.
Weekly meetings: We followed the scrum methodology. In every meeting, we discussed our progress and identified tasks to be done next.
Team building meetings: Some best practices on writing professional emails and messages were shared. These were very insightful and helpful. Our Project Lead also shared some resume tips and templates, which were extremely helpful. I am very thankful to him.
Goals for the Upcoming Week
Integrating the predictive machine learning model, the annotator, and the active learning component. Afterwards, we may deploy the model on AWS.
Researched annotation tools: There are many annotation tools available, so to narrow down the options, we primarily looked at Prodigy and LightTag.
Initially, we were to use STEM-Away data, but we found that there were not enough posts and relevant tags available. Therefore, Stack Exchange was chosen for scraping. The plan was to train our model on the Stack Exchange dataset and later test it on the STEM-Away platform.
Web scraped Stack Exchange using BeautifulSoup: Focused on getting tags related to machine learning, deep learning, etc. Scraped categories such as Data Science, Artificial Intelligence, and Computer Science.
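As a rough illustration of the scraping step, here is a minimal BeautifulSoup sketch. The HTML snippet and the CSS class names (`question-summary`, `question-hyperlink`, `post-tag`) are assumptions standing in for a real Stack Exchange listing page, which would be fetched with `requests` and parsed the same way:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking a Stack Exchange question listing;
# a real page would be fetched with requests.get(url).text.
html = """
<div class="question-summary">
  <a class="question-hyperlink">How do CNNs learn features?</a>
  <div class="tags">
    <a class="post-tag">machine-learning</a>
    <a class="post-tag">deep-learning</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
posts = []
for summary in soup.find_all("div", class_="question-summary"):
    # Pull out the question title and its list of tags.
    title = summary.find("a", class_="question-hyperlink").get_text(strip=True)
    tags = [t.get_text(strip=True) for t in summary.find_all("a", class_="post-tag")]
    posts.append({"title": title, "tags": tags})

print(posts)
```

On the real pages, the same loop collects every post's title and tags into a list of records that can then be loaded into a pandas DataFrame.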
We observed a class imbalance in the dataset: the machine-learning tag appeared in almost every post. Data augmentation was not beneficial, because the more we augmented the data for other tags, the more posts carrying the machine-learning tag we would get, since it was in almost every post. Removing the machine-learning tag was also not a good option, since once every team's work is combined we will have data related to the other tags as well. Therefore, we left the dataset as it was.
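Counting how often each tag appears makes the imbalance visible and is also how the most frequent tags can be identified. A minimal sketch with illustrative tag lists (the real ones came from the scraped posts):

```python
from collections import Counter

# Illustrative per-post tag lists; note machine-learning is on every post,
# mirroring the imbalance we saw in the scraped data.
post_tags = [
    ["machine-learning", "deep-learning"],
    ["machine-learning", "nlp"],
    ["machine-learning", "classification"],
    ["machine-learning", "python"],
]

# Flatten the per-post lists and count each tag's occurrences.
counts = Counter(tag for tags in post_tags for tag in tags)

# Most frequent tags first; on the real data we kept the top 25.
top_tags = counts.most_common(25)
print(top_tags)
```

Here `machine-learning` dominates the counts, which is exactly the imbalance described above.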
Model building: After basic data cleaning, such as removing extra white space and stop words, the model was built using TF-IDF.
Active learning: Went through active learning libraries such as modAL. The idea is to take new posts and re-train the model so that it can learn from the new data. Re-training the model after each new post would be time-consuming, so we can instead re-train after every ten or hundred new posts.
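The "re-train every N posts" idea can be sketched as a simple buffer. The `BatchedRetrainer` class and its `retrain` placeholder are hypothetical names standing in for re-fitting the TF-IDF pipeline on the accumulated data:

```python
class BatchedRetrainer:
    """Buffer incoming posts and re-train only when the buffer fills."""

    def __init__(self, batch_size=10):
        self.batch_size = batch_size
        self.buffer = []
        self.retrain_count = 0

    def add_post(self, post):
        self.buffer.append(post)
        if len(self.buffer) >= self.batch_size:
            self.retrain(self.buffer)
            self.buffer = []

    def retrain(self, posts):
        # Placeholder: in the project this would re-fit the TF-IDF model
        # on the old data plus the buffered posts.
        self.retrain_count += 1


trainer = BatchedRetrainer(batch_size=10)
for i in range(25):
    trainer.add_post(f"post {i}")

# 25 posts with a batch size of 10 trigger 2 re-trains, leaving 5 buffered.
print(trainer.retrain_count, len(trainer.buffer))
```

The batch size is the knob mentioned above: ten for fresher models, a hundred for less re-training overhead.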
- Web Scraping
- Data Cleaning
- Annotation tools
- Active learning
- Google Colaboratory
- Jupyter Notebook
- Learning and researching
- Presenting final results
- Problem Solving
- Time Management
- Effective communication
- Worked with a team virtually and finally had our product ready.
- Used Beautiful Soup for web scraping
- Performed basic data cleaning
- Implemented a TF-IDF model
- Presented the results with my whole team
- Managed my time between college classes and our project.
It was an enriching experience to work on such a great project. With teammates in different time zones, we learned how to work across time differences, and I picked up the skills to collaborate virtually. Initially we were a group of strangers, but by the end of the internship we had become a team where everybody respects and appreciates each other. We never met in person, yet over the course of the internship we developed a special bond. Our project lead always gave us wise and useful advice. My technical skills got sharpened along with my soft skills, and I got the opportunity to dive deep into machine learning and learn new things like annotation tools and active learning.