Kitty_Gu - Machine Learning Pathway

Kitty_Gu · June 15, 2020, 8:10pm

Concise overview of things learned. Break it up into Technical Area, Tools, Soft Skills

Technical area:
1. Definition of machine learning and natural language processing
2. A basic understanding of project workflow
3. Some machine learning algorithms (Naive Bayes, Linear SVM, Logistic Regression)
4. Able to perform web scraping with BeautifulSoup and Selenium
5. Able to perform data cleaning and EDA
Tools:
1. Basic usage of Git and GitHub (commands such as push, pull, rebase, etc.)
2. BeautifulSoup and Selenium libraries
3. Became familiar with CSV files and data manipulation with pandas
4. Medium is a good source for technical hacking articles
Soft skills:
1. Better at searching for help on the internet
2. Improved communication and collaboration skills
Three achievement highlights

Successfully scraped Flowster Discussion Forum by collaborating with teammates
Started a blog about the project on my new Medium account
Successfully implemented Linear SVM

List of meetings/ training attended including social team events

Every one of them

Goals for the upcoming week.

Be able to understand and use Linear SVM to achieve an accuracy above 75%
Try to understand the other basic models as well
Learn complicated models such as BERT
Understand the code I implement

Detailed statement of tasks done. State each task, hurdles faced if any and how you solved the hurdle. You need to clearly mark whether the hurdles were solved with the help of training webinars, some help from project leads or significant help from project leads.

We scraped the information we needed from the discussion forum (posts, comments, likes, and views) so that we can eventually classify the posts into the correct categories or even build a recommendation system. The main problem we encountered was that BeautifulSoup alone could not scrape everything we needed, because the forum only loads additional content as the page is scrolled. Therefore, we used BeautifulSoup together with Selenium, which can drive a browser and do the scrolling for us. We found this solution mostly through our own web research and by collaborating among team members.
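
The scroll-then-parse approach described above can be sketched roughly as follows. This is only an illustration, not the team's actual scraper: the function names `load_full_page` and `parse_topics`, the CSS selectors (`tr.topic-row`, `a.title`, `td.views`), and the choice of Firefox as the browser are all assumptions made for the example.

```python
import time

from bs4 import BeautifulSoup


def load_full_page(url, pause=2.0, max_scrolls=50):
    """Open `url` in a browser, scroll until no new content loads, return the HTML."""
    # Imported here so that parse_topics() can be used without a browser installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        last_height = driver.execute_script("return document.body.scrollHeight")
        for _ in range(max_scrolls):
            # Scroll to the bottom so the page requests the next batch of content.
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)  # give the new content time to load
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break  # height stopped growing, so nothing more was loaded
            last_height = new_height
        return driver.page_source
    finally:
        driver.quit()


def parse_topics(html):
    """Extract title, link, and view count from each topic row (hypothetical selectors)."""
    soup = BeautifulSoup(html, "html.parser")
    topics = []
    for row in soup.select("tr.topic-row"):
        link = row.select_one("a.title")
        if link is None:
            continue  # skip rows that are not topics
        views = row.select_one("td.views")
        topics.append({
            "title": link.get_text(strip=True),
            "url": link.get("href"),
            "views": views.get_text(strip=True) if views else None,
        })
    return topics


# Parsing works on any HTML string, so it can be tried without a browser:
sample_html = (
    '<table><tr class="topic-row">'
    '<td><a class="title" href="/t/1">Hello</a></td><td class="views">12</td>'
    "</tr></table>"
)
print(parse_topics(sample_html))
```

Keeping the browser step and the parsing step separate means the parsing logic can be checked on saved HTML before running the slower Selenium part.
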
Now we are trying out various basic models for post classification. Our main struggle is that the data needs better cleaning. We are approaching this issue through trial and error and with a bit of guidance from our team lead (such as avoiding deleting words).
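
As a rough illustration of one such basic model, the sketch below trains a Linear SVM on TF-IDF features with scikit-learn. The example posts and the category names are invented for the demonstration and are not from the scraped forum data. Leaving `stop_words` at its default (no words removed) is one possible reading of the advice to avoid deleting words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up posts and categories, only to show the shape of the data.
texts = [
    "my package has not shipped yet",
    "how long does shipping to canada take",
    "tracking number for my shipment is missing",
    "how should I set the price for my listing",
    "is the price too high for this product",
    "pricing strategy for a new product",
]
labels = ["shipping", "shipping", "shipping", "pricing", "pricing", "pricing"]

# TF-IDF turns each post into a weighted word-count vector;
# LinearSVC then learns a linear boundary between the categories.
model = make_pipeline(TfidfVectorizer(lowercase=True), LinearSVC())
model.fit(texts, labels)

predicted = model.predict(["what price should I choose"])
print(predicted)
```

For a real accuracy figure, the model would need to be scored on posts it was not trained on, for example by splitting the data with `sklearn.model_selection.train_test_split` and scoring with `sklearn.metrics.accuracy_score`.
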

Kitty_Gu · July 14, 2020, 4:08am

Updated Version (July 13): Concise overview of things learned. Break it up into Technical Area, Tools, Soft Skills

Technical area:

Definition of machine learning and natural language processing
A basic understanding of project workflow
Some machine learning algorithms (Naive Bayes, Linear SVM, Logistic Regression)
Able to perform web scraping with BeautifulSoup and Selenium
Able to perform data cleaning and EDA
Learned how to merge and join CSV data frames using pandas
Learned about DistilBERT and BERT ( https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/ )
Learned about Recurrent Neural Networks (RNN), their vanishing gradient problem, and Long Short-Term Memory (LSTM), which was developed to address this problem (01 RNN 도입 V6 최종 - YouTube)
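
The vanishing gradient problem mentioned in the last item can be seen in a very small example. The sketch below uses a one-unit RNN, `h_t = tanh(w * h_{t-1} + x_t)`, and tracks how strongly the final state still depends on the initial state as the sequence gets longer. The weight `w = 0.9`, the constant input, and the function name `gradient_through_time` are arbitrary choices for illustration.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_t / d h_0| after each step of h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: each step multiplies the gradient by w * (1 - tanh^2),
        # and 1 - tanh^2 is never larger than 1.
        grad *= w * (1.0 - h * h)
        history.append(abs(grad))
    return history


history = gradient_through_time(w=0.9, inputs=[0.5] * 50)
for step in (1, 10, 50):
    print(step, history[step - 1])
```

Because every step multiplies the gradient by a factor smaller than 1 here, the influence of early inputs shrinks toward zero over long sequences, which is the difficulty LSTM was designed to reduce.
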

Tools:

Basic usage of Git and GitHub (commands such as push, pull, rebase, etc.)
BeautifulSoup and Selenium libraries
Became familiar with CSV files and data manipulation with pandas
Medium is a good source for technical hacking articles
Google Colab (a tool similar to Jupyter Notebook) has special features such as switching the runtime between GPU and TPU depending on the task.
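
For the pandas items above (CSV data manipulation, and merging and joining data frames), a minimal sketch of the difference between an inner and a left merge is shown below. The column names and values are invented for the example. In practice, each frame would be loaded from a scraped CSV file with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical stand-ins for two scraped CSV files.
posts = pd.DataFrame({
    "topic_id": [1, 2, 3],
    "title": ["Listing tips", "Shipping delays", "Pricing help"],
})
stats = pd.DataFrame({
    "topic_id": [1, 2, 4],
    "views": [120, 45, 7],
})

# Inner merge: keep only topics that appear in both frames.
inner = posts.merge(stats, on="topic_id", how="inner")

# Left merge: keep every post; topics without stats get NaN in "views".
left = posts.merge(stats, on="topic_id", how="left")

print(inner)
print(left)
```

Checking the row count after each merge is a quick way to notice keys that did not match.
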

Soft skills:

Better at searching for help on the internet
Improved communication and collaboration skills
Project presentation skills:
~ For important figures in a presentation (e.g. the project workflow chart and the site map of the forum), it is better to explain more when time allows
~ In data presentation, bar charts are generally better than pie charts
~ It is better to state what accuracy and precision refer to for your data, because they can mean different things to different audiences
~ It is better to have visualizations for results

Three achievement highlights (1st self-assessment)

Successfully scraped Flowster Discussion Forum by collaborating with teammates
Started a blog about the project on my new Medium account
Successfully implemented Linear SVM

Three achievement highlights (2nd self-assessment)

Successfully merged our scraped Amazon and Flowster data for the advanced ML models.
Successfully applied DistilBERT embedding to our data.
Successfully built a recommendation function (trained with DistilBERT embedding) that takes in the index of a topic and returns a list of the 10 most similar topics as recommendations. (We did this to better understand how the DistilBERT embedding classifies the posts.)
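
A recommendation function of the kind described in the last item could look roughly like the sketch below. The original post does not say which similarity measure was used, so cosine similarity is an assumption here. The random vectors stand in for real DistilBERT embeddings (768 dimensions, matching the base model's hidden size), and the name `recommend` and its parameters are illustrative.

```python
import numpy as np


def recommend(index, embeddings, top_k=10):
    """Return the indices of the top_k rows most similar to row `index`."""
    # Normalise each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[index]
    scores[index] = -np.inf  # never recommend the query topic itself
    k = min(top_k, len(embeddings) - 1)
    order = np.argsort(-scores)  # highest similarity first
    return order[:k].tolist()


# Stand-in data: 50 random "topic embeddings" instead of real DistilBERT output.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 768))
# Make topic 7 a near-duplicate of topic 3 so there is an obvious best match.
embeddings[7] = embeddings[3] + 0.01 * rng.normal(size=768)

recommendations = recommend(3, embeddings)
print(recommendations)
```

Reading through the recommended topics for a few known posts is a simple way to judge whether the embeddings place related posts close together.
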

List of meetings/ training attended including social team events

Every one of them

Have missed a few social bonding events due to travel

Goals for the upcoming week (1st self-assessment)

Be able to understand and use Linear SVM to achieve an accuracy above 75%
Try to understand the other basic models as well
Learn complicated models such as BERT
Understand the code I implement

Goals for the upcoming week (2nd self-assessment)

Get ready for the July session
Brush up on some ML skills
Reflect on June session
Watch webinars and attend meetings to build a foundation in full-stack skills and improve my ML knowledge.

Detailed statement of tasks done. State each task, hurdles faced if any and how you solved the hurdle. You need to clearly mark whether the hurdles were solved with the help of training webinars, some help from project leads or significant help from project leads.

We scraped the information we needed from the discussion forum (posts, comments, likes, and views) so that we can eventually classify the posts into the correct categories or even build a recommendation system. The main problem we encountered was that BeautifulSoup alone could not scrape everything we needed, because the forum only loads additional content as the page is scrolled. Therefore, we used BeautifulSoup together with Selenium, which can drive a browser and do the scrolling for us. We found this solution mostly through our own web research and by collaborating among team members.
Now we are trying out various basic models for post classification. Our main struggle is that the data needs better cleaning. We are approaching this issue through trial and error and with a bit of guidance from our team lead (such as avoiding deleting words).
I kept running into an import error when importing torch_xla. I fixed the problem by moving that code block to the very top of my notebook.
Google Colab crashed several times when I tried to train the DistilBERT embedding. The error was something about JavaScript that I could not fully understand. Following my team lead’s advice, I ran the code in Firefox, and it worked.