Sara_EL-ATEIF - Machine Learning Pathway

Written on Sunday 28th June 2020

A concise overview of things learned. Break it up into Technical Area, Tools, Soft Skills

Technical area:

  1. Learned web scraping through a DataCamp course
  2. Improved my Git collaboration skills
  3. Revised and updated my knowledge of the Transformer, ELMo, OpenAI GPT, and BERT neural network architectures
  4. Got a better understanding of how:
    1. I can use BERT embeddings to train another model
    2. BERT performs multi-class classification
  5. Grasped the difference between multi-class classification and multi-label classification
  6. Tested data augmentation techniques in NLP
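
The difference between multi-class and multi-label classification noted in the list above can be illustrated with a minimal sketch. It assumes the classifier outputs one raw score (logit) per category; the category names and scores are hypothetical and are not taken from the project data. In the multi-class case a softmax picks exactly one category, while in the multi-label case each category gets an independent sigmoid and any number of them can pass the threshold.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (multi-class)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash one raw score into an independent probability (multi-label)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_multiclass(logits, labels):
    """Exactly one label wins: the one with the highest softmax probability."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]


def predict_multilabel(logits, labels, threshold=0.5):
    """Zero or more labels: every label whose sigmoid passes the threshold."""
    return [label for x, label in zip(logits, labels) if sigmoid(x) >= threshold]


# Hypothetical categories and scores, for illustration only.
labels = ["billing", "bug", "feature-request"]
logits = [2.0, 1.5, -3.0]

print(predict_multiclass(logits, labels))  # billing
print(predict_multilabel(logits, labels))  # ['billing', 'bug']
```

With the same scores, the multi-class rule returns a single category, whereas the multi-label rule returns both categories whose individual probability is at least 0.5.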

Tools:

  1. Advanced usage of GitHub (such as opening a pull request, reviewing it, resolving conflicts, and closing a PR)
  2. Performed web scraping by using the Scrapy library
  3. Became comfortable with using Git commands to collaborate in a teamwork setting
  4. Learned how to better use VS Code extensions to visualize how I interact with Git commands

Soft skills:

  1. Improved how I dealt with project management issues
  2. Upgraded my communication and collaboration skills
  3. Shifted toward a high-level, goal-oriented mindset instead of focusing solely on the details of the solution

Three achievement highlights

  1. Learned how to perform web scraping using Scrapy
  2. Improved my Git collaboration skills
  3. Grasped the details of how BERT works and of its architecture
  4. Successfully implemented a text augmentation technique that improved accuracy
  5. Handled project management issues diplomatically

List of meetings/training attended including social team events

  • Every one of them
  • I also attended all of Colin’s meetings, but missed the last 3 webinars

Goals for the upcoming week:

  • Train BERT as a classifier on the merged data
  • Improve BERT’s accuracy through fine-tuning
  • Gather my team’s work and prepare a presentation to showcase our results

A detailed statement of tasks done.

State each task, hurdles faced if any, and how you solved the hurdle. You need to clearly mark whether the hurdles were solved with the help of training webinars, some help from project leads, or significant help from project leads.

  • All of the hurdles I faced came from Git commands. For example, to use the mentorchains GitHub account I had to add it to the local configuration on my laptop. This took a long time to resolve because I had trouble adding a second id_rsa key and retrieving its public key. I resolved this by combining the explanation in a freeCodeCamp blog post with the official GitHub documentation.

I faced other issues while preparing the GitHub tutorial for my team and resolved them by reading through Stack Overflow community answers.

  • One challenge we faced with one of the forums (the Flowster forum) was the lack of data. I did a quick Google search on the data augmentation and preprocessing techniques suited to this case, asked some friends experienced in NLP how they usually handle this challenge, and shared their insights with my team.
  • Another challenge we faced this week (week 4) was that the BERT model would not train on Colab: our sequences are longer than the predefined maximum (default = 512, ours ≈ 3000), and the batch size was too large for the RAM Colab provides. I suggested that we use the default sequence length (512) and train on a TPU, as TPUs can usually handle more data and are faster than GPUs.

I will be working on finding the best batch size and sequence length this upcoming week.
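
The suggestion to fall back to the default sequence length can be sketched as follows. This is a simplified illustration, not the code we ran: it assumes a post has already been split into tokens (the `tok0`, `tok1`, ... strings are placeholders, whereas BERT uses WordPiece tokens), and the helper names and the `head=128` split are hypothetical choices. It shows two ways to fit a ~3000-token post into 512 positions while reserving two positions for the `[CLS]` and `[SEP]` markers.

```python
MAX_LEN = 512  # BERT's default maximum sequence length, as noted above


def truncate_head(tokens, max_len=MAX_LEN):
    """Keep only the first tokens, leaving room for [CLS] and [SEP]."""
    budget = max_len - 2
    return ["[CLS]"] + tokens[:budget] + ["[SEP]"]


def truncate_head_tail(tokens, max_len=MAX_LEN, head=128):
    """Alternative: keep the start and the end of a long post."""
    budget = max_len - 2
    if len(tokens) <= budget:
        kept = tokens
    else:
        head = min(head, budget)
        tail = budget - head
        # Guard against tail == 0, because tokens[-0:] would return everything.
        kept = tokens[:head] + (tokens[-tail:] if tail > 0 else [])
    return ["[CLS]"] + kept + ["[SEP]"]


# Placeholder tokens standing in for one ~3000-token forum post.
post = [f"tok{i}" for i in range(3000)]

print(len(truncate_head(post)))       # 512
print(len(truncate_head_tail(post)))  # 512
```

Whether the beginning alone or the beginning plus the end of a post carries more signal for the category is something to check empirically on our data.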

Updated on Wednesday 8th July 2020

A concise overview of things learned. Break it up into Technical Area, Tools, Soft Skills

Technical area:

  1. Learned web scraping through a DataCamp course
  2. Improved my Git collaboration skills
  3. Revised and updated my knowledge of the Transformer, ELMo, OpenAI GPT, and BERT neural network architectures
  4. Got a better understanding of how:
    1. I can use BERT embeddings to train another model
    2. BERT performs multi-class classification
  5. Grasped the difference between multi-class classification and multi-label classification
  6. Tested data augmentation techniques in NLP
  7. Resolved a class imbalance issue to improve model accuracy

Tools:

  1. Advanced usage of GitHub (such as opening a pull request, reviewing it, resolving conflicts, and closing a PR)
  2. Performed web scraping by using the Scrapy library
  3. Became comfortable with using Git commands to collaborate in a teamwork setting
  4. Learned how to better use VS Code extensions to visualize how I interact with Git commands
  5. Augmented the data using the nlpaug library

Soft skills:

  1. Improved how I dealt with project management issues
  2. Upgraded my communication and collaboration skills
  3. Shifted toward a high-level, goal-oriented mindset instead of focusing solely on the details of the solution

Three achievement highlights

  1. Learned how to perform web scraping using Scrapy
  2. Improved my Git collaboration skills
  3. Grasped the details of how BERT works and of its architecture
  4. Successfully implemented a text augmentation technique that improved accuracy
  5. Handled project management issues diplomatically
  6. Improved model accuracy from 70% to 78% through augmenting the data

List of meetings/training attended including social team events

  • Every one of them
  • I also attended all of Colin’s meetings, but missed the last 3 webinars

Goals for the upcoming week:

  • TBD

A detailed statement of tasks done.

State each task, hurdles faced if any, and how you solved the hurdle. You need to clearly mark whether the hurdles were solved with the help of training webinars, some help from project leads, or significant help from project leads.

  • All of the hurdles I faced came from Git commands. For example, to use the mentorchains GitHub account I had to add it to the local configuration on my laptop. This took a long time to resolve because I had trouble adding a second id_rsa key and retrieving its public key. I resolved this by combining the explanation in a freeCodeCamp blog post with the official GitHub documentation.

I faced other issues while preparing the GitHub tutorial for my team and resolved them by reading through Stack Overflow community answers.

  • One challenge we faced with one of the forums (the Flowster forum) was the lack of data. I did a quick Google search on the data augmentation and preprocessing techniques suited to this case, asked some friends experienced in NLP how they usually handle this challenge, and shared their insights with my team.
  • Another challenge we faced in week 4 was that the BERT model would not train on Colab: our sequences are longer than the predefined maximum (default = 512, ours ≈ 3000), and the batch size was too large for the RAM Colab provides. I suggested that we use the default sequence length (512) and train on a TPU, as TPUs can usually handle more data and are faster than GPUs.
  • After fine-tuning the BERT model and finding the best hyperparameters for training (I used a batch size of 8), we faced another challenge: the model's accuracy would not rise above 70%. As a quick trial, I dropped the categories with the least data to see whether accuracy would improve. The result was an increase of about one percentage point (to 71%).
    A quick visualization of the number of samples per category showed that the data suffered from a severe class imbalance. To resolve it, I first augmented the data with the TF-IDF, WordNet, RoBERTa, GPT-2, BERT, and DistilBERT word substitution augmenters of the nlpaug data augmentation library.
    After training, the accuracy was extremely low (4%). I inspected the augmented data and found that the TF-IDF, RoBERTa, and GPT-2 substitutions added more noise, which sidetracked the model.
    I then revised the augmentation: I first merged the Reply Comments in our data into the Leading Comments, and then augmented using only TF-IDF, WordNet, BERT, and DistilBERT word substitution. After training, the accuracy increased to 78%.
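
The general idea of checking class counts and then topping up the smaller categories with word-substitution augmentation can be sketched as below. This is a simplified stand-in under stated assumptions, not the code we ran: the `SYNONYMS` table is a hypothetical replacement for a WordNet lookup, the sample posts and category names are invented, and the nlpaug augmenters themselves are not used here so that the sketch stays self-contained.

```python
import random
from collections import Counter

# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS = {
    "help": ["assist", "support"],
    "issue": ["problem"],
    "fast": ["quick"],
    "idea": ["suggestion", "proposal"],
}


def class_counts(samples):
    """Count how many (text, label) samples each label has."""
    return Counter(label for _, label in samples)


def substitute_words(text, rng, prob=0.5):
    """Replace some words that have a known synonym; leave the rest unchanged."""
    words = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)


def balance_by_augmentation(samples, seed=0):
    """Add augmented copies until every label matches the largest label."""
    rng = random.Random(seed)
    counts = class_counts(samples)
    target = max(counts.values())
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    balanced = list(samples)
    for label, texts in by_label.items():
        for _ in range(target - counts[label]):
            balanced.append((substitute_words(rng.choice(texts), rng), label))
    return balanced


# Invented examples: three posts in one category, one in another.
samples = [
    ("please help with this issue", "support"),
    ("need help fast", "support"),
    ("the login issue is back", "support"),
    ("a fast new workflow idea", "ideas"),
]

balanced = balance_by_augmentation(samples)
print(class_counts(samples))   # imbalanced: 3 vs 1
print(class_counts(balanced))  # balanced: 3 vs 3
```

A precaution worth keeping in mind with this kind of top-up is to augment only the training split, so that near-copies of the same post do not end up in both the training and the evaluation data.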