Ernclb - Machine Learning Pathway

Overview:
I was the Project Lead for Team 5 during the June Session and led the team on a three-product journey in ML: an initial classification of forum submissions into the forums they came from, then a recommendation model for topic submissions, and finally a classification product that groups forum submissions into their appropriate category. This was an amazing experience overall that really stretched the boundaries of my capabilities and taught me autonomy and leadership.

Some of the skills I learned or improved on:

  1. Transformer Neural Networks: BERT and other bidirectional encoders
  2. Agile project management
  3. Kept a business-minded focus throughout the project
  4. Web Scraping, specifically JavaScript websites with infinite scroll (see the sketch after this list)
  5. Used classification algorithms such as SVM, Random Forest, and Neural Networks
  6. Perfected the use of TF-IDF encodings
  7. Perfected pre-processing skills with libraries such as imbalanced-learn (SMOTE) that help with oversampling and undersampling
  8. Extended the use of Beautiful Soup and the Requests library
  9. Collaborated effectively using Git
  10. Led a large team and delegated tasks efficiently
  11. Used Google Colab effectively to run scripts in a virtual environment
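
Below is a minimal sketch of the infinite-scroll scraping approach, assuming a Selenium-driven browser with BeautifulSoup for parsing; the URL and the CSS selector are hypothetical placeholders, not the ones our team actually targeted.

```python
# Minimal sketch: scrape a JavaScript forum with infinite scroll.
# The URL and the "a.title" selector are hypothetical placeholders.
import time

from bs4 import BeautifulSoup
from selenium import webdriver

FORUM_URL = "https://example-forum.com/latest"  # placeholder, not the real forum

driver = webdriver.Chrome()
driver.get(FORUM_URL)

# Scroll until the page height stops growing, so all lazily loaded
# topics are present before we parse the HTML.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the next batch of topics time to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Hand the fully rendered page to BeautifulSoup.
soup = BeautifulSoup(driver.page_source, "html.parser")
titles = [a.get_text(strip=True) for a in soup.select("a.title")]
driver.quit()

print(f"Collected {len(titles)} topic titles")
```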

Meetings:
I was present at most team-level meetings and was the main moderator. It was a great experience preparing for the meetings, trying to inspire my teammates, and providing direction. I believe my social skills were pretty solid going in, but I definitely improved in my ability to inspire a room of engineers to converse. We held 2 to 3 team meetings every week where we reviewed current work, brainstormed, and gave webinars.

Tasks Completed:
We built the two products outlined by STEM-Away and did a good job adding a third. The first two I'm mentioning are the recommendation model and the forum classifier. Our forum classifier reached above 95% accuracy in certain situations and was performing so well overall that I think it's already deployable. The great work our team did in perfecting the TF-IDF encoding was truly inspiring.
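
As a rough illustration of the forum classifier's shape, here is a minimal TF-IDF pipeline sketch; the CSV path, column names, vectorizer settings, and the linear SVM are illustrative assumptions rather than our exact tuned setup.

```python
# Minimal sketch of a TF-IDF forum classifier. The CSV path and column names
# are placeholders standing in for the data our scraper produced.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

df = pd.read_csv("forum_posts.csv")          # assumed columns: "text", "forum"
texts, labels = df["text"], df["forum"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Word and bigram TF-IDF features over the raw post text.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# A linear SVM is a strong baseline on sparse TF-IDF features.
clf = LinearSVC()
clf.fit(X_train_vec, y_train)

print("Accuracy:", accuracy_score(y_test, clf.predict(X_test_vec)))
```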

The second product, the recommendation model, is more difficult to measure, but I'm very proud of that achievement as well. We used a combination approach of both TF-IDF and BERT embeddings to make recommendations. I feel this is also a deployable product: it uses the cutting-edge semantic understanding of BERT while still grounding those findings in the tried and tested method of TF-IDF.
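
The sketch below shows one way such a blend can work: cosine similarities from TF-IDF vectors and from BERT-style sentence embeddings are averaged, and the nearest neighbours are recommended. The model name, the toy topics, and the 50/50 blend weight are assumptions for illustration, not our exact configuration.

```python
# Minimal sketch: blend TF-IDF and BERT-style similarities to recommend topics.
# The model name, toy `topics` list, and 50/50 blend are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

topics = [
    "How do I reset my online banking password?",
    "Forgot my login password, can't get into the app",
    "Card declined while travelling overseas",
    "Setting up automatic payments for rent",
]  # toy stand-ins for real forum topics

# Lexical similarity from TF-IDF vectors.
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(topics)
sim_tfidf = cosine_similarity(tfidf_matrix)

# Semantic similarity from BERT-style sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
sim_bert = cosine_similarity(model.encode(topics))

# Blend the two signals and recommend the top-k most similar topics.
sim = 0.5 * sim_tfidf + 0.5 * sim_bert

def recommend(index, k=2):
    order = np.argsort(-sim[index])
    return [topics[i] for i in order if i != index][:k]

print(recommend(0))
```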

The third product was a category automation tool that predicted the categories of topic submissions. We thought this was the strongest product we had, since it could help automate a strenuous task for the clients of Discourse forums and make them more user-friendly. I feel automation like this is best suited for strenuous tasks that take up unnecessary time, and I think we got very close to fulfilling this dream. We used BERT embeddings once again to classify submissions into categories. We tried many classification methods, from SVM and Random Forest to neural networks, but despite promising results we couldn't break the 90% accuracy mark in the time given to us. I think with a bit more time we could perfect this classification and make it truly worthy of being deployed in the real world.
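
For a sense of what feeding BERT embeddings into a classical classifier looks like, here is a minimal sketch; the model name, CSV path, column names, and the choice of the [CLS] vector and an RBF SVM are assumptions for illustration, not our exact pipeline.

```python
# Minimal sketch of the category classifier: BERT [CLS] embeddings fed to an
# SVM. The model name, CSV path, and column names are illustrative assumptions.
import pandas as pd
import torch
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed(batch):
    """Return one [CLS] embedding per input text."""
    enc = tokenizer(batch, padding=True, truncation=True, max_length=128,
                    return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state[:, 0, :].numpy()

df = pd.read_csv("forum_topics.csv")      # assumed columns: "title", "category"
X = embed(df["title"].tolist())
y = df["category"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# We also experimented with Random Forest and small neural networks here.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```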

We also completed auxiliary deliverables: a web scraper to collect all the data necessary for these products, a presentation that explains all the workings of the project, and a mock business report analysing a forum we had scraped. The business report is available on our Google Drive; it goes over some visualizations and analysis of the web forum of the Bank of New Zealand.