Stella Kombo - Machine Learning Pathway - Self Assessment

Week: July 27th - August 1st

Overview of Things Learned:

  • Technical Area: Web scraping of the posts within the Car Talk community forum by going through the different HTML tags with BeautifulSoup, which proved to be a bit tiresome; hence I have been looking for an ‘easier’ alternative to BeautifulSoup, e.g. Scrapy.

  • Tools: Jupyter, Excel, MATLAB, BeautifulSoup

  • Soft Skills: Teamwork and responsiveness, virtual communication, curiosity, time management, goal-setting, and confidence.
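The tag-by-tag traversal described above can be sketched without any third-party library. Below is a minimal, hypothetical stand-in for what BeautifulSoup does, using Python's built-in `html.parser`; the forum markup and the `"post"` class name are made up for illustration:

```python
from html.parser import HTMLParser

# Toy HTML standing in for a forum page; in practice the markup would
# come from fetching the Car Talk forum page over HTTP.
PAGE = """
<html><body>
  <div class="post"><p>My engine makes a ticking noise.</p></div>
  <div class="post"><p>Check the valve clearance first.</p></div>
</body></html>
"""

class PostExtractor(HTMLParser):
    """Collects the text found inside <div class="post"> blocks."""
    def __init__(self):
        super().__init__()
        self.in_post = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "div" and ("class", "post") in attrs:
            self.in_post = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_post = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_post and text:
            self.posts.append(text)

parser = PostExtractor()
parser.feed(PAGE)
print(parser.posts)
```

With BeautifulSoup the same extraction would be roughly `soup.find_all("div", class_="post")`; the hand-rolled parser above just shows what is happening underneath.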

Achievement Highlights

  1. Set up the web scraping code in Jupyter as well as in MATLAB, although it took a lot of time, and obtained relatively similar results from both.
  2. Tried out a few debugging techniques.
  3. Further understood how to work with Jupyter and how certain extensions can make it easier to work with.

Meetings attended

I have been unable to attend meetings this first week because my Google Calendar didn’t show any updates, but I have been watching the recordings and keeping up to date with the tasks due. However, I feel some of the technical issues I keep facing would be cleared up if I had attended the meetings, which is why this week I am particularly conscious of not missing one. I did attend the Python webinar, which was a huge help in further understanding the language.

Goals for the Upcoming Week

  1. I want to attend a meeting live, particularly this Friday’s meeting, and ensure all my pre-processing work has been done. I also want to attend at least one office hour this week so that I can ask for clarification on the different issues I have been having while carrying out this week’s deliverables.

  2. Learning TF-IDF, which is new and very confusing for me.

Tasks Done

  1. Completed the web scraping using BeautifulSoup in Python and in MATLAB. However, running the code in both MATLAB and Jupyter takes a lot of time since the two languages are very different. I stored the collected data on my laptop as a CSV file downloaded directly from Jupyter and as a MAT file from MATLAB.

  2. I have begun the pre-processing part; however, it is proving a bit hard to prototype in MATLAB and then transfer equivalent code to Python in Jupyter, so I am trying to work solely on the CSV file.
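Since I am settling on the CSV file as the single working copy of the data, here is a minimal sketch of writing and re-reading scraped posts with Python's built-in `csv` module. The column names and example rows are made up, and a real run would write to a file such as `posts.csv` instead of an in-memory buffer:

```python
import csv
import io

# Example scraped rows (made up); each row is (post_id, text).
rows = [
    ("1", "My engine makes a ticking noise."),
    ("2", "Check the valve clearance first."),
]

# Write to an in-memory buffer; in practice this would be
# open("posts.csv", "w", newline="").
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["post_id", "text"])  # header row
writer.writerows(rows)

# Read the same CSV back, the way a notebook would consume it later.
buf.seek(0)
reader = csv.DictReader(buf)
posts = [row["text"] for row in reader]
print(posts)
```

Keeping everything in one CSV avoids converting between MATLAB's `.mat` format and Python structures on every run.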

Week: 8/3

Overview of Things Learned:

Achievement Highlights

  • Polished my skills in finding and removing certain HTML tags, whitespace, and punctuation using different libraries
  • Obtained accurate results and stored that data in a .csv file.
  • Created a matrix with weighted values from TF-IDF
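The tag, whitespace, and punctuation clean-up above can be condensed into one small helper. This is only a sketch using the standard library (`re` and `string`) on a made-up example post; real forum HTML would need more careful handling (e.g. HTML entities and emojis):

```python
import re
import string

def clean_post(raw: str) -> str:
    """Remove HTML tags, punctuation, and extra whitespace from one post."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)  # strip <p>, <img ...>, etc.
    no_punct = no_tags.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()  # collapse whitespace runs

raw = "<p>My engine  makes a ticking noise!!</p> <img src='x.png'>"
print(clean_post(raw))  # "My engine makes a ticking noise"
```

Stripping tags before punctuation matters: removing punctuation first would mangle the `<...>` brackets and leave tag names behind as noise words.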

Goals for the Week

  • Fix my TF-IDF matrix and apply it to my pre-processed scraped data
  • Work on BERT and my source code so that it adds the spacing between my different headers and removes any punctuation marks within the data.

Tasks Done
  • Pre-processing: I finished obtaining the scraped data from the forum and saved it as a CSV file. After seeking help, I realized that BeautifulSoup was not the ideal method for scraping the forum I chose, so I had to learn more about Scrapy and how it works.

  • I am still trying to fix the spacing in my output and remove the noise I keep obtaining in my scraped data, such as the images and colons scattered throughout.

Week: August 16 - 22, 2020

Overview of Things Learned:

Technical: BERT, nltk

Tools: Torch, transformers, nltk, DistilBertTokenizer

Achievement Highlights

  1. Learnt a lot about how BERT works and how highly effective it is at transforming post text into meaningful sentences, particularly when the data obtained is “dirty” and illegible.
  2. I also learnt how to tokenize sentences from the posts obtained.
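As a way of checking my understanding of what DistilBertTokenizer is doing, here is a toy sketch of WordPiece tokenization: greedy longest-match against a vocabulary, with `##` marking pieces that continue a word. The tiny vocabulary below is made up; the real tokenizer uses a learned vocabulary of roughly 30,000 pieces:

```python
# Toy WordPiece-style tokenizer: greedy longest-match against a small
# made-up vocabulary; words with no matching pieces become [UNK].
VOCAB = {"[UNK]", "engine", "tick", "##ing", "##s", "noise", "my"}

def wordpiece(word: str) -> list:
    """Split one lowercase word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in VOCAB:
                piece = candidate  # longest match found
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no sub-piece matched at all
        pieces.append(piece)
        start = end
    return pieces

def tokenize(sentence: str) -> list:
    tokens = []
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word))
    return tokens

print(tokenize("my engine ticking"))
```

So "ticking" splits into `tick` + `##ing`, which is why BERT can handle words it has never seen whole, as long as their pieces are in the vocabulary.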

Goals for the upcoming week:

  • Try to complete the BERT model so that it gives classification outputs in a better, layered manner.
  • Try to finalize the BERT model so that I can deploy it and fine-tune it to optimize the output obtained.

Tasks done:

Data preparation:
  • Obtained the data and pre-processed it to remove any images and external noise that was earlier present.

  • Started on the TF-IDF and building the matrix that I will use to carry out the process.

Week: 8/17

Overview of Things Learned:

  • Technical Area: Pre-processing, TF-IDF, BERT
  • Tools: nltk, pandas, sklearn
  • Soft Skills: Communication, Team work, Time management

Achievement Highlights

  • I was able to use sklearn to construct the TF-IDF matrix, as opposed to manually setting it up, in order to use it to clean the data.
  • Further worked on my BERT model: I trained it and, to a greater degree than the previous week, learnt how to successfully embed sentences in order to obtain better accuracy on the training set of data.
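For comparison with the sklearn version, the weighting itself can be written out from scratch in a few lines. This sketch uses the plain tf × log(N/df) formula on made-up documents; sklearn's `TfidfVectorizer` uses a smoothed idf and normalizes each row, so its numbers differ slightly:

```python
import math
from collections import Counter

# Made-up mini corpus; each document is one cleaned forum post.
docs = [
    "engine noise engine",
    "brake noise",
    "engine oil leak",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# idf(t) = log(N / df(t)): terms appearing in fewer documents weigh more.
N = len(tokenized)
df = {t: sum(t in doc for doc in tokenized) for t in vocab}
idf = {t: math.log(N / df[t]) for t in vocab}

# tf(t, d) = count of t in d / length of d; matrix is docs x vocab.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[t] / len(doc) * idf[t] for t in vocab])

# Show the weights for the first document.
for term, weight in zip(vocab, matrix[0]):
    print(f"{term:6s} {weight:.3f}")
```

In the first document, "engine" (frequent locally, but present in two documents) outweighs "noise", while terms absent from the document get exactly zero, which is the structure the weighted matrix should show.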

Goals for the Upcoming Week

  • Having a better understanding of TF-IDF, since my matrix produces output that seems too inaccurate.
  • Attend meetings.
  • Get my TF-IDF model working and fix any bugs that are preventing the matrix from running accurately, because the output I am currently getting does not seem right.

Tasks Done

  • Finalized pre-processing my data and implemented BERT as a forum classifier.
  • I came to understand the importance of deploying any model designed for a project, i.e. the BERT model, and its relevance to the project.

Week: 8/24

Overview of Things Learned:

  • Technical Area: Deployment of a model, BERT, TF-IDF, Machine Learning
  • Tools: AWS
  • Soft Skills: Communication, critical thinking, time keeping, presentation of data

Achievement Highlights

  • I tried to deploy the trained BERT model for the project and was successful to some extent.
  • I helped create the PowerPoint slides for the final presentation of the project.

Meetings attended

  • 8/26 - Practice Presentation Team 4
  • 8/27 - Mock presentation Team 4

Goals for the Upcoming Weeks

  • To further work on my TF-IDF model and see why it was not working, in contrast to how my teammates’ TF-IDF models worked and the results they obtained, in order to have a better understanding of the project.
  • To also understand the BERT model as a whole, as I was quite stuck and frustrated at times during its deployment because there was always an error in my code and in my sentence tokenization.

Tasks Done

  • Deployment: Tried to deploy the project model using AWS

Final Presentation: Although I was unable to attend the final presentation meeting due to a scheduling conflict with school, I contributed to the final presentation with regard to the accuracy of the results we obtained after deploying our model, along with the conclusion on the overall project.

FINAL SELF-ASSESSMENT: Things that I learnt during the Machine Learning Project

  1. Web Scraping: When I started the project, I was not really sure how to approach the forum I chose because BeautifulSoup was not working on it. I almost changed forums, but I kept researching and seeking external help from my teammates as well as my previous Python teaching assistant, until I learnt about pandas and gspread.

  2. Pre-Processing: Once I was able to scrape the data from the forum, I began cleaning it up by removing any images, emojis, and HTML text.

  3. BERT: The team was working on TF-IDF at this point, but I felt I had a fairly better understanding of BERT, so I decided to proceed with it and work on TF-IDF the next week. I was able to apply tokenization as well as learn a lot more than I initially thought I knew about BERT. It took me quite a lot of time to even get the model to run, but with help I was able to get a fairly decent output from the classifier.

  4. TF-IDF: This concept as a whole was completely foreign to me and honestly one of the hardest parts of the project. It felt constantly confusing, and when I thought I had a handle on the basic idea behind it and tried to build a matrix from scratch, it proved difficult. I thus settled on using sklearn, which I learnt about from one of my teammates, and I was then able to form a vectorized matrix that I could work with. However, the output data I got looked a bit inaccurate.

  5. Deployment: As a final step in the development of the project, we had to deploy the trained model. My attempt at it was fairly difficult, although I learnt a lot throughout the process: I learnt about AWS, its advantages and disadvantages, and I was able to see a working machine learning version of the project model.

Conclusion
The project, for me, was a place where my technical skills were put to the test. Even though most days felt bleak and I felt like my overall participation in the project was slow, I was still able to complete it, and in the process I learnt a lot about machine learning and its powerful capabilities, which can be highly useful when applied to the real world.