St3939 - Machine Learning Pathway

Technical Areas Covered:

  • Web Scraping
  • Web driving (browser automation)
  • Data preprocessing and augmentation
  • Natural Language Processing Algorithms
    • Bag of Words
    • TF-IDF Vectorization
    • BERT
  • Machine Learning Classification Models
    • Decision Trees
    • Random Forest Classifier
    • Support Vector Machine (SVM)
    • Logistic Regression
  • Agile (Scrum) Product Development
  • Task Management and delegation

Tools Used

  • Python - NumPy, NLTK (stop words, Porter stemmer), Gensim, scikit-learn, Pandas, Beautiful Soup, Selenium, Scrapy, Transformers, Sentence Transformers
  • VS Code
  • GitHub
  • Google Colaboratory
  • Notion
  • Slack

Soft Skills Developed and Applied

  • Research: I was new to the NLP and data-scraping domain, so I mainly used the resources shared by the team lead to build my skills and expand my knowledge for this part of the project. Beyond those, I researched and shared additional resources on Beautiful Soup, Selenium, bag-of-words, BERT, DistilBERT, Support Vector Machines, and the Random Forest classifier, successfully implemented these techniques myself, and supported struggling teammates in implementing them.
  • Staying ahead as a lead: As I was new to some domains, as technical lead I stayed one step ahead of the team by trying the scripts suggested by the team lead before everyone else, and by finding additional resources that would help participants past potential roadblocks. I also held one-on-one sessions with the team lead before team meetings to share my findings and, at multiple points, discuss the roadmap of the project.
  • Project management: Project management responsibilities came to me when the team was sub-divided into two sub-teams working independently on building the hierarchical classification model. I made sure both teams worked in parallel and completed their tasks on time, and that each team understood both its own task and the bigger project it was working towards, by conducting virtual meetings as required in addition to the sprint meetings run by the team lead. I streamlined the project by ensuring that the datasets collected and the scripts written by each team were easy to integrate with the other team's work. I designed the structure of the pseudo code used to scrape the data and build the hierarchical classification model, then broke the implementation down into functions independent of one another and assigned them to individuals, so everyone on the team could contribute to the final product.
  • Teaching: For the hierarchical classification model I conducted two lectures for each team: the first on how to build the dataset for the specific forum they were assigned and what the final structure should look like, and the second on how to build the novel hierarchical classifier and how to check the accuracy of such a model. To conduct these lectures I used Google Meet screen sharing together with an Apple Pencil on my iPad, which effectively created a virtual whiteboard. This technique proved very successful for conveying structures and concepts to those involved in the project.
  • Team-building: Our team comprised students from all over the globe, most living in different time zones. The project could not have been completed without the team-building exercises and the camaraderie created amongst the individuals. The project lead started the project off with an excellent team-building exercise that showed everyone how aligned and similar their goals were. In addition, I contributed with simple activities, from opening meetings with a question such as "How was last week?" or "What was the most important thing that happened for you in the previous week?" to closing them with "What are you looking forward to until the next time we meet?". I feel such activities really helped participants open up, see the similarities in their interests, and develop a deeper bond amongst themselves and with the team leads. Participants were further encouraged at every meeting to communicate amongst themselves using Slack or whichever channel worked best for them. I can happily say that I have found some really great people to work and collaborate with!
  • Collaboration: Collaborating in such a large, globally distributed team was new to me and one of the things I looked forward to most in this project. To ease communication amongst participants, the use of Slack was encouraged; in hindsight, the Slack threads are populated with valuable discussions on the progress of the project, alongside resources shared by individual participants to help their teammates. As everyone lived in different time zones, when2meet was used on multiple occasions to find a slot that suited the largest number of participants. All team meetings were recorded, and the recordings were shared on the team's Google Drive.

I set up the Notion page for our team to create a virtual workspace. This workspace shared the tasks for every sprint and the resources required, and had a task progress table (resembling a Gantt chart) and a help centre.

Google Colab notebooks and GitHub were used to share scripts amongst team members while building the final product.

Tasks Completed:

  • Scraped the schizophrenia forum, a Discourse-based forum.
  • Preprocessed the scraped data for NLP techniques (bag-of-words, TF-IDF, and BERT)
  • Transformed unstructured data into structured data (vectors) by implementing NLP techniques, and created a similar-topic recommendation engine using cosine similarity. Cosine similarities were computed on the structured data from each NLP technique to find the best-performing algorithm
  • Implemented SVM and Random Forest classifiers to classify the ketogenic forum data for the hierarchical classification model (filling in for sub-team participants who did not complete their tasks)
  • Project managed the building of the hierarchical model
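A minimal sketch of the Discourse scraping step above: Discourse serves JSON when ".json" is appended to a page URL, so post bodies can be read from a topic's JSON payload instead of raw HTML. The payload shape assumed here ("post_stream" → "posts" → "cooked") is from memory of the Discourse API and should be verified against the live forum, so the example parses a hardcoded sample payload rather than making a network call.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Strip the HTML markup Discourse stores in each post's 'cooked' field."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def post_texts(topic_json):
    """Pull plain-text post bodies out of one topic's JSON payload."""
    texts = []
    for post in topic_json["post_stream"]["posts"]:
        extractor = TextExtractor()
        extractor.feed(post["cooked"])
        texts.append("".join(extractor.parts).strip())
    return texts


# In the real pipeline this payload would come from something like
# requests.get("https://<forum>/t/<topic-id>.json").json()
sample = json.loads("""{
  "post_stream": {"posts": [
    {"cooked": "<p>First post about sleep issues</p>"},
    {"cooked": "<p>A reply with <em>advice</em></p>"}
  ]}
}""")
print(post_texts(sample))  # → ['First post about sleep issues', 'A reply with advice']
```

In practice, Beautiful Soup or Selenium (as listed under Tools Used) would replace the hand-rolled parser when pages have to be rendered or paginated.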
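The preprocessing and recommendation steps can be sketched end-to-end in plain Python: tokenise and drop stop words (a stand-in for the NLTK stop words and Porter stemmer used in the project), build TF-IDF vectors, and rank other posts by cosine similarity. The posts and the tiny stop-word list below are invented for illustration.

```python
import math
import re
from collections import Counter

# Toy stop-word list; the project used NLTK's stop words plus a Porter stemmer.
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "on", "of", "and", "to", "for"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Turn tokenised documents into sparse TF-IDF dictionaries."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {term: math.log(n / df[term]) for term in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def recommend(vectors, query_index):
    """Indices of the other posts, most similar first."""
    scores = [(cosine(vectors[query_index], v), i)
              for i, v in enumerate(vectors) if i != query_index]
    return [i for _, i in sorted(scores, reverse=True)]


posts = [
    "Sleep problems and medication side effects",
    "Medication side effects making sleep difficult",
    "Tips for preparing keto friendly meals",
]
vectors = tfidf_vectors([preprocess(p) for p in posts])
print(recommend(vectors, 0))  # → [1, 2]: the medication post outranks the keto one
```

Swapping `tfidf_vectors` for bag-of-words counts or BERT sentence embeddings, while keeping the same cosine ranking, is how the different NLP techniques were compared.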
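The classification step can be sketched with scikit-learn (listed under Tools Used): TF-IDF features feeding an SVM and a Random Forest classifier. The posts and labels below are invented stand-ins for the ketogenic forum data, and the exact pipeline settings are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up training set standing in for the labelled forum posts.
posts = [
    "started keto last week and lost two pounds",
    "keto meal prep ideas for busy weekdays",
    "struggling with medication side effects",
    "new medication dosage questions",
]
labels = ["diet", "diet", "medication", "medication"]

# Two candidate models sharing the same TF-IDF front end.
svm = make_pipeline(TfidfVectorizer(), LinearSVC())
forest = make_pipeline(TfidfVectorizer(),
                       RandomForestClassifier(n_estimators=50, random_state=0))

for model in (svm, forest):
    model.fit(posts, labels)

print(svm.predict(["keto recipes for the week"]))
```

On the real data, the two models were compared on held-out accuracy rather than on training posts as here.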

Meetings Attended/ Conducted

  • Team and team lead Meetings
  • Tutorials, Workshops, and Code walkthroughs
  • Sprint Planning and Retrospective
  • ML Weekly Lead Meetings
  • ML Training Sessions with Colin (web driving + Beautiful Soup + Scrapy, Python, BERT)

Tasks as a Lead

  • Reading assigned resources and implementing the scripts before sharing resources with the teams
  • Setting up Notion and regularly updating tasks + shared resources
  • Writing pseudo code for the hierarchical model, creating the data structures, and dividing tasks amongst participants
  • Clearing up confusion about implementation in one-on-one meetings using my digital whiteboard
  • Providing feedback on the hierarchical-model tasks assigned to individual participants and guiding them on how to complete their tasks successfully
  • Sprint planning with project lead
  • Keeping track of the performance of the hierarchical models and solving arising problems (data augmentation for various classes, how to calculate accuracy for the hierarchical model)
  • Overseeing and contributing to the making and delivery of the final presentation (contributions were also made by Huiwen, Muniba Bashir, Soumya Goswami and Nikola Danevski).
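A hedged sketch of how the hierarchical model's routing could look: a parent classifier predicts the top-level category, then a per-category child classifier predicts the sub-topic. The category names, posts, and the choice of logistic regression here are illustrative assumptions, not the project's exact design.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train(posts, labels):
    """Fit one TF-IDF + logistic regression pipeline on a labelled set."""
    return make_pipeline(TfidfVectorizer(),
                         LogisticRegression(max_iter=1000)).fit(posts, labels)


# Level 1: route a post to a top-level category (invented examples).
parent = train(
    ["keto breakfast recipe", "keto weight loss progress",
     "medication dosage question", "medication side effects"],
    ["diet", "diet", "medication", "medication"],
)

# Level 2: one child classifier per category, trained only on that category's posts.
children = {
    "diet": train(["keto breakfast recipe", "keto weight loss progress"],
                  ["recipes", "progress"]),
    "medication": train(["medication dosage question", "medication side effects"],
                        ["dosage", "side-effects"]),
}


def classify(post):
    """Predict (category, sub-topic) by routing through the hierarchy."""
    top = parent.predict([post])[0]
    return top, children[top].predict([post])[0]


print(classify("keto breakfast recipe"))
```

This routing structure also shows why accuracy is subtle for hierarchical models (one of the problems noted above): a wrong level-1 prediction makes the level-2 prediction wrong by construction, so per-level and end-to-end accuracy have to be tracked separately.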