Pavitra_Walia - Machine Learning Pathway

Week 1 & 2

Overview of things learned:

Technical Area: I learned how to scrape data from Discourse forums with the help of BeautifulSoup and Selenium, and stored the scraped data using Pandas and the csv module. I preprocessed the collected data by lowercasing, lemmatizing, and tokenizing it with the NLTK module, and used the string and re modules for string manipulation. I also explored the TfidfVectorizer available in the sklearn module.

Tools: During the last two weeks I explored Jupyter notebooks and project management tools like Asana. I also became familiar with efficiently using the various sections of the StemAway website.

Soft skills: I learned to follow up on what was discussed during meetings and devoted adequate time to practicing what I learned. I communicated effectively during team meetings and through other channels, and worked through the references shared by the Team and Project Leads.

Three Achievement Highlights:

- I scraped data from multiple Discourse forums, explored how different elements of a webpage can be accessed, and learned to use the browser's inspect-element feature effectively.

- I stored the collected data as a CSV file and removed whitespace and special characters from it using the re and string modules.

- I performed preprocessing and data cleaning on the collected dataset using lowercasing, lemmatization, and tokenization.
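As a sketch of the cleaning step described above, whitespace and special characters can be stripped with the standard re and string modules. The exact rules are illustrative; the real pipeline may keep or drop different characters:

```python
import re
import string

def clean_text(text: str) -> str:
    """Lowercase, collapse runs of whitespace, and strip punctuation."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)  # collapse tabs/newlines/double spaces
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.strip()

print(clean_text("  Hello,   World!!  "))  # hello world
```

Running the punctuation strip after collapsing whitespace keeps single spaces between words intact.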

List of meetings I have joined so far:

I have attended all the general meetings held to date except the first introductory meeting on June 1, which I missed due to a timezone mix-up:

1. Make-Up Meeting
2. Web Scraping and Data Mining Technical Workshop
3. Git webinar by Industry Mentor
4. Web Scraping & Data Mining Workshop Pt 2
5. Data Preprocessing & Word Embeddings Basics Workshop

Goals of the upcoming week:

Become more comfortable with data preprocessing, experiment with the TF-IDF vectorizer, and contribute to the project efficiently alongside fellow participants and observers.
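As a starting point for that experiment, a minimal TfidfVectorizer run looks like this. The three documents are a toy stand-in for the preprocessed forum posts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned forum posts
docs = ["the cat sat", "the dog sat", "the cat ran"]

vec = TfidfVectorizer()          # lowercases and tokenizes by default
X = vec.fit_transform(docs)      # sparse matrix: one row per document

print(sorted(vec.vocabulary_))   # ['cat', 'dog', 'ran', 'sat', 'the']
print(X.shape)                   # (3, 5)
```

Each row of `X` is a TF-IDF-weighted vector over the learned vocabulary, ready to feed into a downstream classifier or similarity search.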

Task Done:

1. Web Scraping: Successfully scraped data from the Discourse forums using Beautiful Soup and Selenium. The webinars gave me a high-level overview of how things work, and the learning references shared at the end of the workshop helped me complete the task.
2. Storage: Successfully stored the scraped data in a Pandas dataframe and converted it to .csv format. This went quite easily, as I have worked with Pandas before.
3. Data Cleaning/Preprocessing: Successfully cleaned the collected data with various techniques using the NLTK module. The webinar gave me an idea of how to implement the techniques, and a little extra research on my side helped me complete the task.
4. Utility Tools: Successfully set up my mentorchains.com account along with Slack, Asana, and GitHub. I ran into some setup issues initially, but @shreyas helped me resolve them.
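A minimal sketch of the scrape-and-store flow. The HTML snippet and class names here are invented for illustration; real Discourse markup differs, and a real run would get the page source via Selenium or requests first:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in page source; in practice this comes from driver.page_source
html = """
<div class="post"><span class="author">alice</span><p class="body">Hello, world!</p></div>
<div class="post"><span class="author">bob</span><p class="body">BERT is neat.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [
    {"author": div.select_one(".author").get_text(),
     "body": div.select_one(".body").get_text()}
    for div in soup.select("div.post")
]

df = pd.DataFrame(rows)
df.to_csv("posts.csv", index=False)  # persist for the preprocessing step
print(df.shape)                      # (2, 2)
```

The same pattern scales to thousands of posts: parse each loaded page, append the rows, and write one CSV at the end.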

Other Issues and How I solved them:

The webinars/workshops gave me good insight into how to approach web scraping and data preprocessing. Beyond that, I faced some minor problems setting up my ChromeDriver path and accessing webpage elements by XPath, but a little research helped me resolve them. I am still not fully confident in data cleaning and vectorization, but I am working on it. :slightly_smiling_face:

Week 3

Overview of things learned:

Technical Area: Explored various cleaning and pre-processing techniques, and learned about the BERT architecture and tokenizer. The NLP webinars I attended covered the significance of the BERT model and how to apply it, and I explored how BERT will be helpful in NLP modeling.

Tools: Successfully created a pull request for the web scraping and pre-processing task on a dedicated GitHub repository, set up a private Slack channel with my team for discussions, and delegated sub-tasks among the team members on Asana.

Soft skills: Successfully discussed how to break the work into sub-tasks and plan their execution, and worked through the issues faced by fellow team members to resolve them together.

Three Achievement Highlights:

- Together with my team, successfully scraped more than 5,000 posts from a Discourse forum and stored the collected data.
- Cleaned and pre-processed the collected data by removing stopwords and lemmatizing.
- Learned about the BERT model architecture and its application to our current use case.

List of meetings I have joined so far:

  1. Weekly team follow-up meetings
  2. BERT session by Technical team Lead @shruti
  3. NLP webinar by Industry mentor Colin

Goals of the upcoming week:

Become more comfortable with BERT, experiment with the BERT tokenizer on the collected data, and contribute to the project more efficiently.

Task Done:

  1. Web Scraping: Successfully scraped data (5k+ posts) from the discourse forums using Beautiful Soup and Selenium.
  2. Storage: Successfully stored the scraped data into a Pandas dataframe and converted it into .csv format.
  3. Data Cleaning/Preprocessing: Successfully cleaned the collected data by various techniques using the NLTK module.
  4. Successfully coordinated with my team to complete all the assigned tasks.

Other Issues and How I solved them:
I wasn’t able to load enough of a forum page to reach at least 5k+ posts. After exploring a little over the web, I found a method for scrolling down a webpage with the help of the web driver.
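The fix can be sketched as a scroll loop like the one below. `scroll_to_bottom` and its parameters are illustrative names, and the Selenium wiring is shown commented out because it needs a live browser and a ChromeDriver on PATH:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll an infinite-scrolling page until its height stops growing.
    `driver` is any object exposing Selenium's execute_script()."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the forum time to load the next batch of posts
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; we've hit the bottom
        last_height = new_height
    return last_height

if __name__ == "__main__":
    # Real usage (requires selenium and a browser driver):
    # from selenium import webdriver
    # driver = webdriver.Chrome()
    # driver.get("https://example-discourse-forum/latest")  # placeholder URL
    # scroll_to_bottom(driver, pause=2.0)
    pass
```

Checking that the page height stopped changing is what keeps the loop from spinning forever once all posts are loaded.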

P.S. I will be staying here for the additional weeks (extension to the current program)

Week 4

Overview of things learned:

Technical Area: Explored more of the BERT transformers library and how it can be used for sequence classification. I learned about PyTorch and its integration with the transformers library, and about how the model can be applied.

Tools: Used Github for collaboration and Asana for Task Management.

Soft skills: I coordinated with my fellow teammates to resolve the issues we were facing and improved my communication with them.

Three Achievement Highlights:

- Understood how the BERT model works for natural language processing tasks.
- Learned about special tokens, attention masks, and setting a maximum length for the input.
- Successfully tokenized the input sequences and split the final dataset for training and testing purposes.

List of meetings I have joined so far:
Weekly team follow-up meetings

Goals of the upcoming week:

Carry out the classification task using the BERT model.

Task Done:

  1. BERT Tokenizer: Used the BERT tokenizer to tokenize the sentences.
  2. Encoding: Added special tokens and encoded the tokenized sentences into input IDs.
  3. Attention Masks: Generated attention masks for the corresponding input IDs.
  4. Dataset: Split the generated dataset into training and testing sets.
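Conceptually, steps 1–3 turn each sentence into fixed-length input IDs plus a matching attention mask. The toy vocabulary and ID values below are invented to illustrate the shape of the output; real BERT uses a ~30k-entry WordPiece vocabulary:

```python
# Toy vocabulary; the special-token IDs mirror BERT's conventions
vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "the": 5, "cat": 6, "sat": 7}

def encode(tokens, max_len=8):
    """Add [CLS]/[SEP], map tokens to IDs, pad to max_len, and build
    the attention mask (1 = real token, 0 = padding)."""
    ids = [vocab["[CLS]"]] + [vocab[t] for t in tokens] + [vocab["[SEP]"]]
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad

input_ids, attention_mask = encode(["the", "cat", "sat"])
print(input_ids)       # [101, 5, 6, 7, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets the model ignore the padding positions, which is why it has to line up element-for-element with the input IDs.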

Other Issues and How I solved them:
Initially, I wasn’t able to work out how the transformers library's BERT implementation works. The NLP webinars on BERT by Colin (Industry Mentor) were a great help, and the documentation helped me understand the usage of each function available in the library.

Week 5 (Final)

Overview of things learned:

Technical Area: Explored the BERT sequence classifier in more depth and how it can be used to build a classifier for the dataset generated last week.

Tools: Used Github for collaboration and Asana for Task Management.

Soft skills: I coordinated with my fellow teammates to resolve the issues we were facing and improved my communication with them.

Three Achievement Highlights:

- Explored various aspects of the BERT sequence classifier.
- Tried implementing the classifier on the generated dataset.
- Modified the generated dataset to make it compatible with BERT.

List of meetings I have joined so far:

Weekly team follow-up meetings

Goals of the upcoming week:

Build a classifier for the improved dataset and then move on to deploying it.

Task Done:

  1. BERT Sequence Classifier: Tried implementing the BERT classifier on the collected dataset.
  2. Dataset: Modified the existing dataset to work well with BERT.
  3. BERT Model: Tried training the BERT model on the modified dataset.

Other Issues and How I solved them:

While training the BERT model, an error was raised each time because the descriptions we were feeding in were too long. We resolved the issue by modifying the existing dataset to shorten the inputs.
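One way to shorten over-long inputs is to truncate the token sequence before encoding, reserving room for the special tokens (BERT's hard limit is 512 positions). The helper name below is illustrative, not from our codebase:

```python
def truncate_tokens(tokens, max_len=512):
    """Keep at most max_len - 2 tokens, reserving two slots for the
    [CLS] and [SEP] special tokens that BERT encoding adds."""
    return tokens[: max_len - 2]

# A description far longer than BERT can accept
long_desc = ["tok"] * 1000

print(len(truncate_tokens(long_desc)))      # 510
print(len(truncate_tokens(long_desc)) + 2)  # 512 positions after specials
```

With the Hugging Face tokenizer, the same effect comes from passing `truncation=True, max_length=512` when encoding, which drops the overflow automatically.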