St3939 - Machine Learning Pathway

st3939 · August 7, 2020, 6:49pm

Technical Areas Covered:

Web Scraping
Web driving
Data preprocessing and augmentation
Natural Language Processing Algorithms
- Bag of Words
- TF-IDF Vectorization
- BERT
Machine Learning Classification Models
- Decision Trees
- Random Forest Classifier
- Support Vector Machine (SVM)
- Logistic Regression
Agile (Scrum) Product Development
Task Management and delegation

Tools Used

Python - NumPy, NLTK, Gensim, scikit-learn, pandas, Beautiful Soup, Selenium, Scrapy, transformers, sentence-transformers, stop words, Porter stemmer
VS Code
GitHub
Google Colaboratory
Notion
Slack

Soft Skills Developed and Applied

Research: I was new to NLP and data scraping, so I mainly used the resources shared by the team lead to build my skills for this part of the project. Beyond those, I researched and shared additional resources on Beautiful Soup, Selenium, bag-of-words, BERT, DistilBERT, Support Vector Machines and the Random Forest classifier. I implemented these methods myself and helped teammates who were struggling to implement them.
Staying ahead as a lead: As I was new to some domains, as a technical lead I stayed one step ahead of the team by trying the scripts suggested by the team lead before everyone else, and by finding resources beyond those the team lead shared to help participants with potential roadblocks. I also held one-on-one sessions with the team lead before team meetings to share my findings and to discuss the project roadmap at multiple points.
Project management: Project management responsibilities came to me when the team was divided into two sub-teams that worked independently on building the hierarchical classification model. I made sure both teams worked in parallel and completed their tasks on time. I made sure each team understood its task and the bigger project it was working towards by holding virtual meetings as required, in addition to the sprint meetings conducted by the team lead. I kept the project streamlined by ensuring that the datasets collected and the scripts written by each team could be easily integrated with the other team's work. I designed the structure of the pseudocode used to scrape the data and build the hierarchical classification model. Further, I broke the implementation of the pseudocode down into functions independent of one another and assigned them to individuals, so that everyone on the team could contribute to the final product.
Teaching: For the hierarchical classification model, I conducted two lectures for each team. The first covered how to build the dataset for the specific forum each team was assigned and what the final structure should look like. The second covered how to build the novel hierarchical classification model and how to check the accuracy of such a model. To conduct these lectures, I used Google Meet screen sharing together with an Apple Pencil on my iPad, which effectively created a virtual whiteboard. This technique proved very successful in helping those involved in the project understand the structures and concepts.
Team-building: Our team was made up of students from all over the globe, most of them living in different time zones. Our project could not have been completed without the team-building exercises and the camaraderie created among the individuals. The project lead started the project off with an amazing team-building exercise, which culminated in everyone seeing how aligned and similar their goals were. In addition to the initiatives taken by the project lead, I contributed through simple activities, such as opening meetings with a question (e.g. "How was last week?" or "What was the most important thing that happened for you in the previous week?") and closing them with "What are you looking forward to before the next time we meet?". I feel such activities really helped the participants open up, see the similarities in their interests and develop a deeper bond among themselves and with the team leads. Participants were also encouraged at every meeting to communicate among themselves using Slack or whichever channel worked best for them. I can happily say that I myself have found some really great people to work and collaborate with!
Collaboration: Collaborating within such a big, globally distributed team was new for me and one of the things I looked forward to the most in this project. To ease communication among participants, the use of Slack was encouraged. In hindsight, the Slack threads are populated with valuable discussions on the progress of the project, as well as resources shared by individual participants to help their teammates. As everyone lived in different time zones, when2meet had to be used on multiple occasions to find a time slot that suited the highest number of participants. All team meetings were recorded, and the recordings were shared on the team's shared Google Drive.

I set up the Notion page for our team to create a virtual workspace. This workspace listed the tasks for every sprint and the resources required, and included a task progress table (resembling a Gantt chart) and a help centre.

Google Colab notebooks and GitHub were used to share scripts among team members while building the final product.

Tasks Completed:

Scraped the schizophrenia forum from the Discourse forums.
Preprocessed the scraped data for NLP techniques (bag-of-words, TF-IDF and BERT).
Transformed unstructured data into structured data (vectors) by implementing NLP techniques, and created a similar-topic recommendation engine using cosine similarity. Cosine similarities were computed on the vectors from each NLP technique to find the best-performing algorithm.
Implemented SVM and Random Forest classifiers on the ketogenic forum data to create the hierarchical classification model (I had to fill in for sub-team participants who did not complete their tasks).
Project managed the building of the hierarchical model
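
To illustrate the similar-topic recommendation step described above, here is a minimal sketch of TF-IDF vectorization followed by cosine similarity. It is written in plain Python so it runs without extra packages; the project itself used libraries such as scikit-learn, and this is not the team's actual script. The example topics and the names tfidf_vectors, cosine and recommend_similar are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical topic titles standing in for the scraped forum data.
topics = [
    "side effects of a new medication",
    "coping with medication side effects",
    "favourite recipes for the weekend",
    "weekend recipes to try",
]


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times a smoothed inverse document frequency.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_similar(docs, query_index, top_n=2):
    """Return the indices of the top_n documents most similar to the query."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_index]
    scored = [
        (cosine(query, vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index  # never recommend the query topic itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_n]]


print(recommend_similar(topics, query_index=0, top_n=1))
```

To compare NLP techniques in the same way, the tfidf_vectors step could be swapped for bag-of-words counts or BERT sentence embeddings while keeping the cosine similarity ranking unchanged.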

Meetings Attended/ Conducted

Team and team lead Meetings
Tutorials, Workshops, and Code walkthroughs
Sprint Planning and Retrospective
ML Weekly Lead Meetings
ML Training Sessions with Colin (web driving, Beautiful Soup, Scrapy, Python, BERT)

Tasks as a Lead

Reading assigned resources and implementing scripts before giving resources to teams
Setting up Notion and regularly updating tasks and shared resources
Writing pseudocode for the hierarchical model, creating the data structures and dividing tasks among participants
Clearing up any confusion about implementation using my digital whiteboard in one-on-one meetings
Providing feedback on the hierarchical model tasks assigned to individual participants and guiding them on how to complete their tasks successfully
Sprint planning with the project lead
Keeping track of the performance of the hierarchical models and solving problems as they arose (data augmentation for various classes, how to calculate accuracy for the hierarchical model)
Overseeing and contributing to the making and delivery of the final presentation (contributions were also made by Huiwen, Muniba Bashir, Soumya Goswami and Nikola Danevski).
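
On the question of how to calculate accuracy for the hierarchical model, mentioned in the tasks above, one common approach is to score each level separately and also score the full path. The sketch below assumes a two-level hierarchy in which every label is a (parent, child) pair. The category names and the function hierarchical_accuracy are hypothetical, and this is not necessarily the metric our team settled on.

```python
def hierarchical_accuracy(true_paths, predicted_paths):
    """Score two-level predictions, where each label is a (parent, child) pair."""
    if not true_paths or len(true_paths) != len(predicted_paths):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(true_paths)
    pairs = list(zip(true_paths, predicted_paths))
    # A parent hit only requires the top-level category to match.
    parent_hits = sum(t[0] == p[0] for t, p in pairs)
    # A path hit requires both the parent and the child to match.
    path_hits = sum(t == p for t, p in pairs)
    return {
        "parent_accuracy": parent_hits / n,
        "path_accuracy": path_hits / n,
        # How good the child classifiers are once the parent is right.
        "child_given_parent": path_hits / parent_hits if parent_hits else 0.0,
    }


# Hypothetical labels: (top-level category, sub-category).
true_labels = [
    ("food", "recipes"),
    ("food", "snacks"),
    ("health", "exercise"),
    ("health", "sleep"),
]
predicted_labels = [
    ("food", "recipes"),
    ("food", "recipes"),
    ("health", "exercise"),
    ("food", "snacks"),
]

scores = hierarchical_accuracy(true_labels, predicted_labels)
print(scores)
```

Reporting the three numbers together helps show whether errors come from the top-level classifier or from the child classifiers, which a single flat accuracy figure would hide.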