Sara_EL-ATEIF - Machine Learning Pathway

Sara_EL-ATEIF · July 3, 2020, 11:37am

Written on Sunday 28th June 2020

A concise overview of things learned. Break it up into Technical Area, Tools, Soft Skills

Technical area:

Learned web scraping through DataCamp Course
Improved my collaboration Git skills
Revised and updated my knowledge of the Transformer, ELMo, OpenAI and BERT neural network architectures.
Got a better understanding of how:
1. I can use BERT embeddings to train another model
2. BERT performs multi-class classification
Grasped the difference between Multi-class classification and Multi-label classification
Tested data augmentation techniques in NLP
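
To make the multi-class vs multi-label distinction above concrete, here is a minimal sketch in plain Python. The label names and logit values are made up for illustration and do not come from the project. The idea is that a multi-class head uses a softmax and picks exactly one label, while a multi-label head applies an independent sigmoid per label and may pick several (or none).

```python
import math

def softmax(logits):
    """Multi-class: scores compete, so the probabilities sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Multi-label: each label gets its own independent probability."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

def predict_multiclass(logits, labels):
    """Return the single most probable label."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]

def predict_multilabel(logits, labels, threshold=0.5):
    """Return every label whose probability reaches the threshold."""
    probs = sigmoid(logits)
    return [label for label, p in zip(labels, probs) if p >= threshold]

# Hypothetical labels and raw model scores, for illustration only.
labels = ["bug", "question", "feature"]
logits = [2.0, 1.5, -3.0]

print(predict_multiclass(logits, labels))  # exactly one label
print(predict_multilabel(logits, labels))  # zero or more labels
```

With the same scores, the multi-class prediction keeps only the top label, whereas the multi-label prediction keeps every label that clears the threshold.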

Tools:

Advanced usage of GitHub (such as opening a pull request, reviewing it, resolving conflicts and closing a PR)
Performed web scraping by using the Scrapy library
Became comfortable with using Git commands to collaborate in a teamwork setting
Learned how to better use VS Code extensions to visualize how I interact with Git commands

Soft skills:

Improved how I dealt with project management issues
Upgraded my communication and collaboration skills
Shifted towards a high-level, goal-oriented mindset instead of focusing solely on the details of the solution

Three achievement highlights

Learned how to perform web scraping using Scrapy
Improved my Git collaboration skills
Grasped the details of how BERT works and its architecture
Successfully implemented a text augmentation technique and improved the accuracy thanks to it
Handled project management issues diplomatically

List of meetings/training attended including social team events

Every one of them
I also attended all of Colin’s meetings, but missed the last 3 webinars

Goals for the upcoming week:

Train BERT as a classifier on the merged data
Improve BERT’s accuracy through fine-tuning
Gather my team’s work and prepare a presentation to showcase our results

A detailed statement of tasks done.

State each task, hurdles faced if any, and how you solved the hurdle. You need to clearly mark whether the hurdles were solved with the help of training webinars, some help from project leads, or significant help from project leads.

All of the hurdles I faced came from Git commands. For example, to use the mentorchains GitHub account I had to add it to the local configuration on my laptop. This took me a lot of time to resolve because I had trouble adding a second id_rsa key and getting its public key. I resolved this by combining the explanation in a freeCodeCamp blog post with the official GitHub documentation.

I faced other issues when I was preparing the GitHub tutorial for my team and resolved them by reading through Stack Overflow community answers.

One of the challenges we faced with one of the forums (the Flowster forum) was the lack of data. I did a quick Google search on the data augmentation and preprocessing techniques suited to this case, asked some friends experienced in NLP how they usually handle this challenge, and shared their insights with my team.
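
As an illustration of what text augmentation for a small forum dataset can look like, here is a minimal sketch of two simple word-level techniques: random deletion and random swap. This is an assumption on my part about what such a technique might look like, not necessarily the one our team used. The function names, parameters and the sample sentence are hypothetical.

```python
import random

def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def augment(text, n_copies=2, p_delete=0.1, n_swaps=1, seed=0):
    """Create n_copies noisy variants of text (whitespace tokenisation)."""
    rng = random.Random(seed)  # seeded so the variants are reproducible
    words = text.split()
    variants = []
    for _ in range(n_copies):
        new_words = random_deletion(words, p_delete, rng)
        new_words = random_swap(new_words, n_swaps, rng)
        variants.append(" ".join(new_words))
    return variants

# Hypothetical forum post title, for illustration only.
for variant in augment("how do I export my workflow to a checklist", n_copies=3):
    print(variant)
```

Each variant keeps the original label, so a small dataset gets a few extra noisy training examples per post. Whether this actually helps should be checked on a held-out validation set.
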
Another challenge that we faced this week (week 4) is that the BERT model doesn’t train on Colab: our sequences are longer than the predefined maximum (512 tokens; ours are about 3,000), and the batch size is too high for the RAM provided by Colab. I suggested that we use the default sequence length (512) and train on a TPU, as TPUs can usually handle more data and are faster than GPUs.
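
The suggestion of falling back to a sequence length of 512 amounts to truncating each post before it reaches BERT. Here is a minimal sketch of that step in plain Python. It assumes the text is already tokenised into a list; the function name is hypothetical, and a real BERT tokenizer produces WordPiece tokens, so the counts would differ from a simple whitespace split.

```python
MAX_LEN = 512  # sequence length limit mentioned above

def truncate_for_bert(tokens, max_len=MAX_LEN, cls="[CLS]", sep="[SEP]"):
    """Keep the first max_len - 2 tokens and wrap them in [CLS] ... [SEP].

    Two positions are reserved for the special tokens, so the result
    never exceeds max_len.
    """
    budget = max_len - 2
    if budget < 1:
        raise ValueError("max_len must leave room for at least one token")
    return [cls] + list(tokens[:budget]) + [sep]

# Stand-in for a ~3000-token forum post.
long_post = ["tok"] * 3000
model_input = truncate_for_bert(long_post)

print(len(model_input))                    # fits the limit
print(len(long_post) - (MAX_LEN - 2))      # tokens that were dropped
```

Head-only truncation drops everything after the first 510 tokens, which is the simplest option but may lose useful text at the end of long posts. If the Hugging Face tokenizers are used, passing `truncation=True, max_length=512` to the tokenizer call does the equivalent step.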

I will be working on finding the best batch size and sequence length this upcoming week.

Sara_EL-ATEIF · July 8, 2020, 2:43pm

Updated on Wednesday 8th July 2020