Arturo_Alvarado - Machine Learning Pathway

Arturo_Alvarado · June 17, 2020, 5:36am

Overview of things learned:

Technical Area:

Various NLP models including BERT and bag of words
Learned how to scrape the web using BeautifulSoup and Selenium
Gained experience in inspecting HTML elements in different discourse forums

Tools:

BeautifulSoup, Selenium, BERT

Soft Skills:

Remote Collaboration
Setting independent goals

Three Achievements:

Understanding how to use BeautifulSoup for web scraping. It allowed us to create soup objects from which we could extract only the information we wanted from our Discourse forum.
Fixed an issue where I was only getting a limited number of elements back from BeautifulSoup. The cause was that I wasn’t loading all of the HTML elements on the page. Selenium allowed me to scrape a much larger number of data points.
A minor achievement was getting Python up and running on my computer. I was getting compilation errors for simple commands. Through the command line I was able to update the packages used, which allowed me to continue with my tasks.
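
As a rough illustration of the soup-object idea described above, here is a minimal sketch. The HTML snippet, the `title` class name, and the `extract_topics` helper are all made up for the example; the real forum markup is not shown in this post and will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a real Discourse page uses different tags and class names.
SAMPLE_HTML = """
<html><body>
  <div class="topic-list">
    <a class="title" href="/t/first-topic/1">How do I install packages?</a>
    <a class="title" href="/t/second-topic/2">Editor crashes on startup</a>
    <a class="nav" href="/about">About</a>
  </div>
</body></html>
"""


def extract_topics(html):
    """Return (title, href) pairs for every link marked as a topic title."""
    # "html.parser" is the parser bundled with Python, so no extra backend is needed.
    soup = BeautifulSoup(html, "html.parser")
    # Keep only the anchors we care about and ignore navigation links.
    links = soup.find_all("a", class_="title")
    return [(link.get_text(strip=True), link["href"]) for link in links]


topics = extract_topics(SAMPLE_HTML)
for title, href in topics:
    print(title, href)
```

In a Selenium-driven scrape, the HTML string would presumably come from the browser once more topics had loaded (for example from `driver.page_source`) rather than from a hard-coded string.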

Meetings/Training Attended

I have been able to attend every team meeting. I have also attended the useful webinars provided by some of the other team leaders.

Goals for Upcoming Week

Finish my BERT implementation. Also attempt to implement a different model in order to compare accuracy between the two.

Tasks Done

Watched each of the training webinars to better prepare myself for the project and its technical requirements.
Our task for the first week was to select a forum and scrape key information from it. Our team selected the Atom Discourse Forum as our subject of data collection. I was able to use BeautifulSoup and Selenium to collect a large quantity of data from the Atom Discourse Forum and separate it into a CSV file. The resources that Anubhav and Saad posted were really helpful in fully understanding how BeautifulSoup is utilized.
Our task for the second week was to go through a handful of NLP implementations that would help us increase the accuracy of our recommender system. I’m still working on getting BERT fully operational. The resources that Anubhav has posted have really helped out.

Arturo_Alvarado · July 1, 2020, 1:13am

Overview of things learned:

Technical Area:

Additional research on algorithms such as BERT and TF-IDF
Additional practice using BeautifulSoup and Selenium
Implemented a program that can detect similarities between a set of inputs and different threads

Tools:

Sentence Transformers, nltk

Soft Skills:

Setting independent goals
Presentation of data

Three Achievements:

I was able to take the data I scraped from the Atom forum and implement a BERT model to detect similarity in forum topics.
As a team we were able to collect 14,000 data points stemming from 4 categories and 76 different subcategories. From this we were able to extract data-topics and topic-contents.
As a team we have implemented an SVM model and random forest.

Meetings/Training Attended

I have been able to attend every team meeting. I have also attended the 3 NLP webinars that walked us through the theory and implementation of BERT. I have also attended every subgroup meeting where we are working on the codecademy forum.

Goals for Upcoming Week

Work on presenting the results we have gathered from the codecademy forum. Enhance our scraped data with additional augmentation techniques.

Tasks Done

Finished implementing the BERT model as well as the TF-IDF model
Switched forums from Udacity to codecademy because of a login issue that was preventing us from successfully scraping data.
Took the data we scraped from the codecademy forum and created a CSV file so that our team can continue with our various classifications.
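
The thread-similarity idea from this update can be sketched with plain TF-IDF and cosine similarity. This is only an illustration using the standard library. It is not the code our team wrote, it does not use BERT or Sentence Transformers, and the sample threads, the smoothed IDF formula, and the function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(docs_tokens):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Build a sparse TF-IDF vector (dict), skipping terms unseen in the corpus."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_threads(query, threads):
    """Return (score, thread) pairs sorted from most to least similar."""
    thread_tokens = [tokenize(t) for t in threads]
    idf = build_idf(thread_tokens)
    query_vec = tfidf_vector(tokenize(query), idf)
    scored = [
        (cosine(query_vec, tfidf_vector(tokens, idf)), text)
        for tokens, text in zip(thread_tokens, threads)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up thread titles, not real forum data.
THREADS = [
    "How do I loop over a list in Python",
    "CSS flexbox layout is not centering my div",
    "Python list comprehension versus for loop",
]
ranked = rank_threads("python loop over list", THREADS)
for score, text in ranked:
    print(round(score, 3), text)
```

A BERT-based version would replace the TF-IDF vectors with sentence embeddings and keep the same cosine ranking step.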

Arturo_Alvarado · July 6, 2020, 11:03pm

Final Assessment

Technical Area:

Web scraping and data mining
Extracted contextualized word embeddings
Document classification
Implementation of Classification models
Data Augmentation

Tools:

Beautiful Soup
Selenium
Natural Language Toolkit
BERT Libraries
Sentence Transformers

Soft Skills:

Gained collaboration experience in a remote environment
Accomplished weekly tasks in a timely manner
Learned to evaluate my own progress and write a report on it

Three Achievements:

I was able to complete the weekly tasks in the time provided. For our first task of scraping the Atom forum I needed to ask for some guidance on gathering a larger data set. I read the resources provided and managed to scrape a larger quantity of data than in my first attempt.
I was able to apply the BERT model to the first data set that I collected. The webinars hosted by Colon really gave me a clear understanding of the NLP framework and how we can use its bidirectional characteristics for our own tasks. I ended up with a very effective language model.
My subgroup and I were able to scrape approximately 14,000 data points from the codecademy forum. We separated our data into categories, sub-categories, topics, and replies. Then we used the distilBERT model to transform the data and obtain its hidden state embeddings.

Meetings/Training Attended

During the 5-week session, I was able to attend all webinars hosted by stemaway as well as all team and subteam meetings.

Tasks Done

Gathered 14,000 data points from the codecademy forum
Generated a CSV file that was parsed into 4 different elements: categories, subcategories, topics, and replies
Took this data and applied the distilBERT model to transform the data and obtain its hidden state embeddings
Implemented classification algorithms such as support vector machines, Random Forest, and Feed Forward Neural Network
Implemented various data augmentation techniques to improve the accuracy of our algorithms
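
To give an idea of the four-element CSV layout listed above, here is a minimal sketch using Python's built-in `csv` module. The column names, the sample row, and the choice to join replies with a ` ||| ` separator are assumptions for illustration. The actual file our subgroup produced may be laid out differently.

```python
import csv
import io

# Assumed column names based on the four elements named in this report.
FIELDNAMES = ["category", "subcategory", "topic", "replies"]
# Assumed separator for storing several replies in one CSV cell.
REPLY_SEP = " ||| "


def write_rows(rows, handle):
    """Write scraped rows to an open text handle as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        # Flatten the list of replies into a single cell.
        writer.writerow({**row, "replies": REPLY_SEP.join(row["replies"])})


def read_rows(handle):
    """Read the CSV back and split the replies cell into a list again."""
    return [
        {**row, "replies": row["replies"].split(REPLY_SEP) if row["replies"] else []}
        for row in csv.DictReader(handle)
    ]


# Made-up sample row, not real scraped data.
rows = [
    {
        "category": "Get Help",
        "subcategory": "Python",
        "topic": "Why does my loop never end?",
        "replies": ["Check the exit condition.", "Is the counter, by chance, never updated?"],
    }
]

buffer = io.StringIO()  # stands in for a real file on disk
write_rows(rows, buffer)
buffer.seek(0)
round_trip = read_rows(buffer)
print(round_trip)
```

For a real file, the handle would typically be opened with `open(path, "w", newline="", encoding="utf-8")` so the `csv` module controls the line endings.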