6/16/20 Self Assessment
I gained a greater understanding of using Python to extract data, specifically how to use web crawling and scraping to collect data from Discourse forums. I researched key terms needed for working with NLP and BERT, such as tokenization, one-hot encoding, recurrent neural networks (RNNs), and term frequency-inverse document frequency (TF-IDF). I learned how to use pickle files, GitHub, Google Colab, and Python libraries such as Beautiful Soup. Two important soft skills I learned were how to communicate what I have accomplished to a team and how to speak up when I do not understand something, since at least one other team member will likely feel the same way.
- I learned how to web crawl and scrape through a Discourse forum, extracting topics and submissions from the forum and saving them to a pickle file.
- I modified and debugged web scraping code provided by our task lead by reviewing the code step by step, using Stack Overflow, and asking questions when necessary.
- I researched NLP and BERT, and reviewed additional articles explaining how to set up BERT in order to be prepared for when we would begin to use it as a team.
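The final step of the scraping work described above, saving the extracted topics to a pickle file, can be sketched roughly as follows. This is only an illustration using Python's standard library, not the code our task lead provided: the function names, the file name, and the sample topics are all hypothetical, and the crawling itself is left out.

```python
import pickle
import tempfile
from pathlib import Path


def save_topics(topics, path):
    """Serialize a list of scraped topics to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(topics, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_topics(path):
    """Read the list of topics back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical sample standing in for real scraped forum data.
sample_topics = [
    {
        "title": "How do I loop over a list?",
        "posts": ["Use a for loop.", "Or a list comprehension."],
    },
    {
        "title": "What is a dictionary?",
        "posts": ["A mapping from keys to values."],
    },
]

# Assumed file name; written to the system temp directory for the demo.
pickle_path = Path(tempfile.gettempdir()) / "forum_topics.pkl"
save_topics(sample_topics, pickle_path)
restored = load_topics(pickle_path)
```

Loading the file back returns the same list of dictionaries, which is what makes a pickle file convenient for passing scraped data between team members. Pickle files should only be loaded when they come from a trusted source.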
I have attended all Monday and Friday team meetings that have taken place since the internship began on June 1st. I have watched all three introductory Machine Learning webinars (Introduction, Content Recommender Systems, and Data Mining), the Git webinar, and the Project Management Skills webinar.
Goals for Upcoming Week:
I would like to use BERT with either TF-IDF or an RNN to process the forum data we have collected as a team and recommend topics similar to a selected question.
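As a rough idea of what the TF-IDF part of this goal could look like, here is a minimal sketch that ranks topics by cosine similarity of their TF-IDF vectors. It is an assumption about one possible approach, not our team's implementation: it uses only the Python standard library, leaves BERT out entirely, and the function names and sample topic titles are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query_index, documents, top_k=2):
    """Return indices of the top_k documents most similar to the query."""
    vectors = tfidf_vectors(documents)
    scores = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_k]]


# Hypothetical topic titles standing in for scraped forum data.
topics = [
    "How do I write a for loop in Python?",
    "Python for loop over a list not working",
    "CSS flexbox centering a div",
    "Centering text with CSS",
]
similar = recommend(0, topics, top_k=1)
```

Here the selected question is the first topic, and `similar` holds the index of the topic that shares the most distinctive words with it. A BERT-based version would replace the TF-IDF vectors with sentence embeddings while keeping the same ranking step.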
- I created accounts for and joined Slack, GitHub, and Google Calendar for our team (ML Team 5).
- I have attended all team meetings.
- I have completed all required weekly reports (due on Wednesdays and Fridays).
- I selected the Discourse forum I wanted to scrape during the course of this project (CodeCademy).
- I developed a web crawler and scraper that would work well for my forum using the outline provided by our team leads. I ran into several errors while modifying this code, including not receiving the pickle file because my forum was too large and encountering issues with JSON. I resolved these errors by researching them on Stack Overflow and by asking my team leads, who gave me guidance on how I could fix the issues myself.