Kdglider - Machine Learning Pathway

Overview of Things Learned
Technical:

  • Webscraping fundamentals
  • Word embeddings for NLP applications
  • ML models for NLP applications

Tools:

  • Webscraping tools such as BeautifulSoup and Selenium
  • Datalogging with Pandas and JSON
  • scikit-learn ML library
  • Google Colab with GitHub functionality

Soft Skills:

  • Remote teamwork and communication
  • Understanding the importance of personal profile sites such as Medium

Achievement Highlights

  • Developed two modular webscrapers for the Amazon and Flowster Discourse forums
  • Implemented the Naïve Bayes algorithm to classify Discourse topics into the correct categories
  • Improved classifier accuracy to ~65% by experimenting with different data pre-processing techniques and word embeddings

List of Meetings/Training Attended

  • All team meetings, work sessions and socials

Goals for the Coming Week

  • Continue to investigate methods to improve Naïve Bayes classifier accuracy
  • Investigate more complex word embeddings and experiment with logistic regression classifier models, with the possibility of expanding to a simple neural network

Detailed Statement of Tasks Done

  • BeautifulSoup alone was not enough to scrape data from dynamic webpages. The solution was to use Selenium webdrivers in conjunction with it, so that each page’s HTML was fully loaded before parsing. Solved by myself and teammates, with validation from our leads.
  • The Naïve Bayes classifier was fairly inaccurate in its initial implementation. Received good feedback from leads on possible ways to pre-process the training data to improve accuracy, as well as on increasing the amount of available data through augmentation.
  • Received good feedback from leads and the rest of the team on training the NB classifier on the Amazon data instead of the Flowster data, since the Amazon data has more samples. The result was more consistent accuracy.
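
For readers unfamiliar with how a Naïve Bayes text classifier works, here is a minimal sketch in plain Python. It is not the code used in the project (that used scikit-learn); the `NaiveBayes` class, the `tokenize` helper and the toy forum titles are all made up for illustration, and the pre-processing is reduced to lowercasing and whitespace splitting.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real pre-processing."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Start from the log prior, then add one log likelihood per known word.
            score = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy data standing in for scraped topic titles and their categories.
train_texts = [
    "how do I ship my order",
    "shipping label will not print",
    "my account login fails",
    "cannot reset account password",
]
train_labels = ["shipping", "shipping", "account", "account"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("password reset for my account")
print(prediction)
```

With so few samples per category the word counts are very noisy, which is one way to see why training on the larger Amazon data gave more consistent accuracy.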

Request Change of Role
I wouldn’t mind becoming a task lead if required. Just let me know of the responsibilities ahead of time 👍

Final Self-Assessment
All items below are in addition to what was reported for the mid-session self-assessment.

Overview of Things Learned
Technical:

  • Advanced deep-learning models (e.g. BERT) for NLP classification
  • Textual data augmentation with a variety of techniques including round-trip translation (RTT) and word substitution using synonyms, antonyms and TF-IDF

Tools:

  • NLP deep learning libraries such as the Hugging Face Transformers library
  • NLP data augmentation libraries such as nlpaug and googletrans
  • Using Google Colab for GPU-accelerated deep learning

Soft Skills:

  • Remote teamwork and communication

Achievement Highlights

  • Mitigated data class imbalance by augmentation using round-trip translation
  • Developed an ML pipeline which utilizes BERT with the ability to classify Discourse topics with an accuracy of ~80%

List of Meetings/Training Attended

  • All team meetings, work sessions and socials

Goals for the July Session

  • Implement object-oriented programming principles and professional software development processes in developing an ML solution
  • Learn more about deploying ML solutions
  • Improve remote team leadership, tutorials and project management

Detailed Statement of Tasks Done

  • Data augmentation using RTT proved to be very challenging and time-consuming due to the relatively poor reliability of the googletrans library. Because several sentences failed to translate back into English, numerous quality checks were put in place to monitor translation success and to bypass a topic completely if a certain failure rate was reached.
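
The quality checks described above could look roughly like the sketch below. This is an illustration under assumptions, not the project code: it does not call googletrans at all, the `translate` argument is a hypothetical callable `(text, src, dest) -> str` that a real translation call would be wrapped in, and the French pivot language, the specific checks and the 0.3 failure threshold are made-up choices.

```python
def round_trip(text, translate, pivot="fr"):
    """Translate text to a pivot language and back; return None on failure."""
    try:
        pivoted = translate(text, "en", pivot)
        restored = translate(pivoted, pivot, "en")
    except Exception:
        return None  # the translation call itself failed
    if not restored or not restored.strip():
        return None  # empty result
    if restored == pivoted:
        return None  # assumed sign that the text never came back to English
    return restored.strip()


def augment_topic(sentences, translate, max_failure_rate=0.3):
    """Return the round-tripped sentences, or None to bypass the whole topic."""
    if not sentences:
        return None
    results = [round_trip(s, translate) for s in sentences]
    failures = sum(1 for r in results if r is None)
    if failures / len(sentences) > max_failure_rate:
        return None  # too many failures: skip this topic completely
    return [r for r in results if r is not None]


def fake_translate(text, src, dest):
    """Hypothetical stand-in for a translation service, used only for the demo."""
    if "bad" in text:
        raise RuntimeError("simulated service error")
    if dest == "en":
        return text.split("] ", 1)[-1]  # strip the fake language tag
    return f"[{dest}] {text}"           # add a fake language tag


kept = augment_topic(["where is my order", "label will not print"], fake_translate)
skipped = augment_topic(
    ["bad sentence one", "bad sentence two", "fine sentence"], fake_translate
)
print(kept)
print(skipped)
```

Passing the translator in as an argument keeps the failure handling separate from any one library, so an unreliable backend can be swapped out or stubbed during testing.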