Shreya_Chandra - Machine Learning Pathway

6/16/20 Self Assessment

Things Learned:
I gained a greater understanding of using Python to extract data, specifically how to use web crawling and scraping to collect data from Discourse forums. I researched and came to understand key terms needed for working with NLP and BERT, such as tokenization, one-hot encoding, recurrent neural networks, and term frequency-inverse document frequency (TF-IDF). I learned how to use pickle files, GitHub, Google Colab, and Python libraries such as Beautiful Soup. Two important soft skills I learned were how to communicate what I have accomplished to a team and how to speak up when I do not understand something, since at least one other team member will likely feel the same way.

Highlights:

  1. I learned how to web crawl and scrape a Discourse forum, extracting topics and submissions from the forum and saving them to a pickle file (a minimal sketch of this step follows this list).
  2. I modified and debugged web scraping code provided by our task lead by reviewing the code step by step, using Stack Overflow, and asking questions when necessary.
  3. I researched NLP and BERT, and reviewed additional articles explaining how to set up BERT so that I would be prepared when we began using it as a team.
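
For reference, here is a minimal sketch of that scraping step, assuming the forum's public Discourse JSON endpoints and the requests library; the forum URL, page count, and file name are placeholders rather than our exact code.

```python
# Hedged sketch: Discourse forums expose JSON endpoints such as /latest.json,
# so topic metadata can be collected without parsing HTML by hand.
import pickle
import requests

BASE_URL = "https://discuss.codecademy.com"  # assumed forum URL

def fetch_topics(pages=5):
    """Collect topic metadata from a Discourse forum's paginated JSON feed."""
    topics = []
    for page in range(pages):
        resp = requests.get(f"{BASE_URL}/latest.json", params={"page": page})
        resp.raise_for_status()
        # Each response holds one batch of topics under topic_list -> topics.
        topics.extend(resp.json()["topic_list"]["topics"])
    return topics

if __name__ == "__main__":
    data = fetch_topics()
    # Save the raw topic dictionaries so later steps (BERT, TF-IDF) can reload them.
    with open("codecademy_topics.pkl", "wb") as f:
        pickle.dump(data, f)
```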

Meetings/Training:
I have attended all Monday and Friday team meetings that have taken place since the internship began on June 1st. I have watched all three introductory Machine Learning webinars (Introduction, Content Recommender Systems, and Data Mining), the Git webinar, and the Project Management Skills webinar.

Goals for Upcoming Week:
I would like to use BERT with either TF-IDF or an RNN to parse the forum data we have collected as a team and recommend topics similar to a selected question (a simple TF-IDF baseline is sketched below).
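
As a starting point, here is a hedged sketch of the TF-IDF half of that goal using scikit-learn; the example topics and query are made up, and a BERT or RNN variant would swap learned embeddings in for the vectorizer.

```python
# Hypothetical TF-IDF baseline for ranking forum topics by similarity to a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

topics = [
    "How do I fix an IndexError in my Python loop?",
    "Why is my CSS grid not aligning?",
    "IndexError when iterating over a list",
]
query = "Getting an IndexError while looping in Python"

vectorizer = TfidfVectorizer(stop_words="english")
topic_vecs = vectorizer.fit_transform(topics)
query_vec = vectorizer.transform([query])

# Rank every topic by cosine similarity to the query, highest first.
scores = cosine_similarity(query_vec, topic_vecs).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.2f}  {topics[idx]}")
```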

Tasks Done:

  • I created accounts for and joined Slack, GitHub, and Google Calendar for our team (ML Team 5).
  • I have attended all team meetings.
  • I have completed all required weekly reports (due on Wednesdays and Fridays).
  • I selected the Discourse forum I wanted to scrape during the course of this project (CodeCademy).
  • I developed a web crawler and scraper that works well for my forum, using the outline provided by our team leads. I ran into several errors while modifying this code, including not receiving the pickle file because my forum was too large and encountering issues with JSON. I resolved these errors by researching on Stack Overflow and by asking my team leads, who gave me guidance on how to fix the issues myself.

6/23/20 Self Assessment

Things Learned:
I gained a greater understanding of how to train BERT with actual data from pickle files. I became more experienced with Google Colab, having used it for multiple tasks now, and was exposed to Google Data Studio. Two important soft skills I learned were how to work with a smaller team and how to politely offer suggestions that differ from what another team member has brainstormed.

Highlights:

  1. I trained a BERT model using pre-made training sentences, returned the top five suggestions closest to a selected sentence, and displayed the accuracy of the sentence similarities (a hedged sketch of this step follows this list).
  2. I modified my BERT model to include the pickle file from the forum I scraped the previous week.
  3. I brainstormed ideas with my subgroup as to how we would like to address pain points in the Bank of New Zealand forum. I suggested we address the category/tagging issue, which was one of the things we decided to tackle the following week.
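
For illustration, here is a minimal sketch of that top-five similarity step, assuming the sentence-transformers library as the BERT embedding layer and assuming the pickle file holds Discourse topic dictionaries with a "title" field; neither is necessarily the exact setup our team used.

```python
# Hedged sketch: rank forum topics by BERT-style sentence similarity to a query.
import pickle
from sentence_transformers import SentenceTransformer, util

with open("codecademy_topics.pkl", "rb") as f:
    # Assumption: the pickle holds Discourse topic dicts that include a "title".
    titles = [topic["title"] for topic in pickle.load(f)]

# A small pretrained model stands in for whichever BERT variant the team chose.
model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = model.encode(titles, convert_to_tensor=True)

query = "How do I undo my last git commit?"
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity against every topic, keeping only the five highest scores.
hits = util.semantic_search(query_emb, corpus_emb, top_k=5)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {titles[hit['corpus_id']]}")
```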

Meetings/Training:
I have attended all main team and subgroup meetings, most of which were on Mondays and Fridays. I have also watched the recordings of the three-part NLP webinar.

Goals for Upcoming Week:
I am planning to work with my team on visualizing pain points for the Bank of New Zealand forum by creating a dashboard on Google Data Studio, writing a business report, and creating a model that addresses the various issues we have identified.

Tasks Done:

  • I rejoined the new GitHub repository for ML Team 5.
  • I have attended all team and subgroup meetings.
  • I have completed all required weekly reports (due on Wednesdays and Fridays).
  • I created a BERT model that recommends the top 5 sentences/topics most closely related to the chosen sentence/topic. I ran into a few errors when trying to use this model with my pickle file from last week, but a few team members offered useful suggestions that allowed me to debug my code during our Friday team meeting.

6/30/20 Self Assessment

Things Learned:
I learned how to work more formally with Google Data Studio to create an in-depth dashboard that visualizes important data points from a forum. I also gained experience writing a detailed business report describing potential problems within the forum and how our team planned to remedy them, as well as working with and cleaning CSV files. An important soft skill I learned was how to quickly adapt when my team decided to switch gears moving into the final week of the internship.

Highlights:

  1. I created multiple graphs on Google Data Studio to highlight the Bank of New Zealand’s category performance related to criteria such as like count, views, replies, etc.
  2. I debugged our CSV file that held data from the BNZ forum, allowing us to better display our data through Google Data Studio (an illustrative cleanup sketch follows this list).
  3. I wrote a detailed portion of our business report explaining the graphs in our dashboard, and how we could address the problems shown moving forward.
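
As an illustration only, here is the kind of pandas cleanup that step involved; the file name and column names below are assumptions, not our actual schema.

```python
# Hypothetical cleanup of a scraped forum CSV before loading it into Data Studio.
import pandas as pd

df = pd.read_csv("bnz_forum.csv")  # assumed export of the scraped BNZ data

# Typical fixes: drop duplicate topics, coerce numeric fields that were scraped
# as text, and normalize category labels so groupings in the dashboard line up.
df = df.drop_duplicates(subset="topic_id")
for col in ["views", "reply_count", "like_count"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
df["category"] = df["category"].str.strip().str.title()

df.to_csv("bnz_forum_clean.csv", index=False)
```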

Meetings/Training:
I have attended all team and subgroup meetings that have taken place this past week.

Goals for Upcoming Week:
I am working with my team to automate category assignment for multiple forums using a BERT recommender model. We plan to assign categories based on topic/post similarity, which will remove the need for a forum moderator to manually assign categories to posts (a minimal sketch of the assignment step is below).
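
In outline, the idea is a nearest-neighbour assignment over embeddings; the sketch below assumes each labelled post already has a BERT embedding (for example from the step sketched earlier), and the toy vectors and categories are invented.

```python
# Hedged sketch: give an uncategorised post the category of its most similar
# labelled post, using cosine similarity over precomputed embeddings.
import numpy as np

def assign_category(post_emb, labelled_embs, labelled_cats):
    post = post_emb / np.linalg.norm(post_emb)
    corpus = labelled_embs / np.linalg.norm(labelled_embs, axis=1, keepdims=True)
    best = int(np.argmax(corpus @ post))  # index of the nearest labelled post
    return labelled_cats[best]

# Toy usage with made-up three-dimensional "embeddings".
labelled_embs = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.1, 0.9]])
labelled_cats = ["Python", "Web Dev", "Data Science"]
print(assign_category(np.array([0.85, 0.2, 0.05]), labelled_embs, labelled_cats))
```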

Tasks Done:

  • I wrote my assigned portion of our team business report, describing what was depicted by the graphs on our dashboard.
  • I completed one of the four pages of our subgroup’s dashboard on Google Data Studio, creating graphs to represent performance for each category within the forum.
  • I cleaned parts of the CSV file we retrieved from collecting data from the forum.
  • I have attended all team and subgroup meetings.
  • I have completed all required weekly reports (due on Wednesdays and Fridays).

7/3/20 Self Assessment

Overall, I greatly enjoyed my experience as a STEM-Away Machine Learning participant. This internship was truly educational and challenging at times, but ultimately rewarding. I worked with concepts I had not encountered beforehand, including web crawling/scraping, BERT, NLP, TF-IDF, RNNs, and undersampling.

Things Learned:
I learned how to undersample pickle-file data scraped from a forum, and gained exposure to various machine learning algorithms for testing accuracy (my group was measuring category classification accuracy). I also learned how best to delegate tasks amongst a team during a period with plenty of work to complete.

Highlights:

  1. I learned about undersampling and oversampling for imbalanced datasets, and developed code to undersample our datasets (a hedged sketch follows this list).
  2. I modified our BERT Embedding Generator and Machine Learning Models in order to classify categories for multiple forums, and tested the accuracy of the classification.
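
For reference, here is a minimal sketch of the undersampling and accuracy-check steps using pandas and scikit-learn; the file name, column names, and the TF-IDF/logistic-regression stand-ins for our BERT Embedding Generator and Machine Learning Models are assumptions.

```python
# Hedged sketch: undersample an imbalanced forum dataset, then check how well a
# simple classifier predicts categories on the balanced data.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_pickle("forum_posts.pkl")  # assumed DataFrame with "text"/"category"

# Undersample: shrink every category to the size of the rarest one so the model
# cannot score well by always predicting the dominant category.
smallest = df["category"].value_counts().min()
balanced = df.groupby("category").sample(n=smallest, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    balanced["text"], balanced["category"], test_size=0.2, random_state=42
)

# TF-IDF features stand in here for the BERT embedding generator.
vec = TfidfVectorizer(stop_words="english")
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(vec.transform(X_test))))
```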

Meetings/Training:
I have attended all team and subgroup meetings that have taken place this past week.

Goals for Upcoming Week:
I am working with my team to create a presentation that best displays our product. We are also trying to figure out how we can combine our code into one complete package.

Tasks Done:

  • I researched undersampling and oversampling.
  • I implemented undersampling for imbalanced datasets into our code.
  • I modified our code to classify categories and check accuracy for four additional forums.
  • I have attended all team and subgroup meetings.