Brandonc - Machine Learning Pathway

What I learned:

  1. Technical
    -Web Scraping
    -Utilizing Python for Data Science
    -Data Cleaning
    -EDA
  2. Tools
    -Jupyter Notebook
    -GitHub
  3. Soft Skills
    -Research
    -Patience
    -Networking

Three Achievement Highlights:

  1. Web scraped on my own for the first time
  2. Learned how to use Jupyter Notebook
  3. Created a word cloud after data cleaning

List of Training/Meetings Attended:

-Git Webinar
-Team’s Monday Meetings
-STEMCasts

Goals for the upcoming week:

-Communicate and work well with my team on the upcoming text analysis task.
-Learn more about text analysis techniques to apply to my data set.

Detailed statement of tasks done:

Task 1: Web scraping
I used BeautifulSoup in Jupyter Notebook to web scrape a SmartThings topic post and retrieved the username and content of each post. The only hurdle I faced was that I had used ChromeDriver for the scraping (following an example notebook that was provided, which also used ChromeDriver) and was asked by one of my project leads to change my code after I had already completed data cleaning and EDA. I solved the problem mostly on my own, with a bit of advice from that same lead.
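
Since I had to move away from ChromeDriver, here is a minimal sketch of the kind of approach that works, assuming requests plus BeautifulSoup; the topic URL and CSS selectors are placeholders, not the exact ones from my notebook:

```python
# Minimal sketch: fetch a topic page and pull each post's username and content.
# The URL and selectors are placeholders; the real SmartThings page markup may differ.
import requests
from bs4 import BeautifulSoup

url = "https://community.smartthings.com/t/example-topic/12345"  # placeholder topic URL
soup = BeautifulSoup(requests.get(url).text, "html.parser")

posts = []
for post in soup.select("div.topic-body"):             # assumed container for a single post
    username = post.select_one("span.username")        # assumed element holding the author
    content = post.select_one("div.post")              # assumed element holding the post text
    if username and content:
        posts.append({
            "username": username.get_text(strip=True),
            "content": content.get_text(" ", strip=True),
        })

print(f"Scraped {len(posts)} posts")
```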

Task 2: Data Cleaning
I removed the HTML, stop words, and punctuation while keeping any URLs found in a post's content by researching a regular expression that did what I needed. I then applied both a lemmatizer and a Porter stemmer and compared their effects on the data I collected. I decided to use the Porter stemmer since it gave better results, but I consulted a lead to confirm. The only hurdle was my misconception of what data cleaning involved; after talking to the project leads on Slack and going to office hours, I understood what I had to do.
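
A rough sketch of those cleaning steps, assuming NLTK for the stop words, lemmatizer, and Porter stemmer; the regex and sample text are illustrative rather than the exact ones I used:

```python
# Rough sketch: strip HTML and punctuation, drop stop words, keep URLs,
# then compare a lemmatizer against a Porter stemmer (NLTK assumed).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

stop_words = set(stopwords.words("english"))
url_pattern = re.compile(r"https?://\S+")

def clean(text):
    urls = url_pattern.findall(text)              # set URLs aside so they survive cleaning
    text = url_pattern.sub(" ", text)
    text = re.sub(r"<[^>]+>", " ", text)          # strip leftover HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)      # drop punctuation and digits
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return tokens, urls

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

sample = "I <b>loved</b> the new devices! Details: https://community.smartthings.com/t/123"
tokens, urls = clean(sample)
print([lemmatizer.lemmatize(t) for t in tokens] + urls)   # lemmatized version
print([stemmer.stem(t) for t in tokens] + urls)           # Porter-stemmed version (what I kept)
```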

Task 3: EDA
I created bar graphs showing the most frequent user on the topic post as well as the most frequently used words. I also created a sentiment graph, but I didn't think it made much sense, so I decided not to showcase it. Finally, since I didn't want two frequency bar graphs, I made a word cloud for the most frequent words instead, based on a suggestion from one of the project leads. The only hurdle was my lack of knowledge of EDA, but my project leads gave me resources to learn from.
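
An illustrative sketch of those plots, assuming pandas, matplotlib, and the wordcloud package; the tiny DataFrame below stands in for the scraped and cleaned posts:

```python
# Illustrative EDA sketch: most frequent users as a bar graph, most frequent
# words as a word cloud. The data below is a stand-in for the scraped posts.
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.DataFrame({
    "username": ["alice", "bob", "alice", "carol"],           # placeholder usernames
    "content": ["hub offline again", "new firmware fixed hub",
                "hub keeps rebooting", "love the new app"],   # placeholder cleaned content
})

# Bar graph of the most frequent posters on the topic
df["username"].value_counts().head(10).plot(kind="bar", title="Most frequent users")
plt.tight_layout()
plt.show()

# Word cloud of the most frequent words across all posts
word_counts = Counter(w for text in df["content"] for w in text.lower().split())
cloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(word_counts)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```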

I would also like to upgrade from an observer to a participant.

What I learned:

  1. Technical
    -Text Pre-processing
    -Term Frequency
    -Inverse Document Frequency
    -Sentiment Analysis
  2. Tools
    -TextBlob
  3. Soft Skills
    -Research
    -Public Speaking

Three Achievement Highlights:

  1. Learned and applied TF/IDF to pre-process the text
  2. Implemented sentiment analysis on all the topic posts
  3. Presented my group’s work and results

List of Training/Meetings Attended:

-NLP Webinar Recordings
-Team’s Monday Meetings

Goals for the upcoming week:

-Learn more about BERT and text classification
-Excel in teamwork with my new group

Detailed statement of tasks done:

Task 1: Text Pre-processing
I computed term frequency and inverse document frequency not only for each post but also across all the posts, so that I could examine how often words were used within their respective posts as well as relative to the whole topic. I also ran sentiment analysis on all of the posts to view the polarity, which I found to be neutral leaning positive. I had a few problems with inverse document frequency: I was unable to create a data frame because the sheer number of unique words produced too many columns. I solved this by calculating inverse document frequency in two other ways instead: a function that takes in a word and computes the IDF for that word, and a dictionary that stores and displays the IDF for every word.
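
A simplified sketch of the TF, IDF, and sentiment steps, assuming TextBlob; `documents` is a placeholder list of cleaned post strings, not my actual data:

```python
# Simplified sketch of per-word TF, on-demand IDF, and overall polarity (TextBlob assumed).
import math
from textblob import TextBlob

documents = ["hub keeps dropping devices",   # placeholder cleaned posts
             "new firmware fixed my hub",
             "love the new app"]

def term_frequency(word, document):
    tokens = document.split()
    return tokens.count(word) / len(tokens)

# IDF for a single word, computed on demand instead of building one huge DataFrame
def inverse_document_frequency(word):
    containing = sum(1 for doc in documents if word in doc.split())
    return math.log(len(documents) / (1 + containing))

# Dictionary storing the IDF of every word across all posts
vocabulary = {w for doc in documents for w in doc.split()}
idf_scores = {w: inverse_document_frequency(w) for w in vocabulary}

print("TF of 'hub' in the first post:", term_frequency("hub", documents[0]))
print("IDF of 'hub':", idf_scores["hub"])

# Sentiment polarity over all posts combined (TextBlob returns -1 to 1)
print("Polarity:", TextBlob(" ".join(documents)).sentiment.polarity)
```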

Task 2: Presentation
I created the slides and presented the work I had done as well as the results I gathered from text pre-processing. The only problem was that most of my group did not respond while I was working on the slides; only one other person helped with them.

What I learned:

  1. Technical
    -BERT
    -Random Forest Classifier
    -AdaBoost Classifier
  2. Tools
    -BERT
    -Text Classifiers
  3. Soft Skills
    -Perseverance
    -Communication

Three Achievement Highlights:

  1. Learned and implemented BERT
  2. Utilized text classifiers to predict tags
  3. Increased accuracy of text classifiers

List of Training/Meetings Attended:

-BERT Lecture by Leads
-Team’s Monday Meetings

Goals for the upcoming week:

-Prepare for July Section
-Consider Machine Learning as a potential career path

Detailed statement of tasks done:

Task 1: BERT and Text Classifiers
I used tokenization, splitting, zero padding, and BERT to turn the content and title strings into numerical representations. These representations were fed into the text classifiers (Random Forest and AdaBoost) to train and test them. I ran into two problems: the code took five hours to run, and the accuracy was low (~33%). I was able to raise the accuracy once I realized that related tags had similar predictions, so I grouped the tags based on how they were categorized on the SmartThings website, and the accuracy increased to about 63%.
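
A condensed sketch of that pipeline, assuming the Hugging Face transformers library with bert-base-uncased; the texts and tags below are placeholders for the actual posts and grouped tags:

```python
# Condensed sketch: BERT [CLS] embeddings as features for Random Forest and AdaBoost.
# Assumes Hugging Face transformers; texts/tags are placeholders, not the real dataset.
import torch
from transformers import BertTokenizer, BertModel
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ["Hub offline after update", "Hub keeps rebooting overnight",
         "How do I pair a Zigbee sensor?", "Motion sensor battery drains fast"]
tags = ["hub", "hub", "devices", "devices"]   # placeholder grouped tag labels

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize with padding (zero padding) so every sequence has the same length
encoded = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

# Use each post's [CLS] embedding as its numerical representation
with torch.no_grad():
    features = model(**encoded).last_hidden_state[:, 0, :].numpy()

X_train, X_test, y_train, y_test = train_test_split(
    features, tags, test_size=0.5, stratify=tags, random_state=0)

for clf in (RandomForestClassifier(random_state=0), AdaBoostClassifier(random_state=0)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))
```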

Task 2: Presentation
After getting my results in the previous task, I worked with my team to create the slides to showcase our results with the use of graphs and screenshots of the code. We then presented our findings and shared the resources we used during the aforementioned task.