Ckim - Machine Learning (Level 1) Pathway

Things Learned:

Technical Area: Gained a better understanding of the Beautiful Soup documentation by working with it closely. Got a better understanding of GitHub through the videos. Also learned how to use Python to train a logistic regression model that classifies text as positive or negative sentiment.

Tools: Beautiful Soup, Python, and a text editor (I use Atom; if this is a problem, please let me know). Installed Scrapy and Selenium/chromedriver, but didn’t do much with either. Also used nltk.

Soft Skills: The best soft skill I gained was learning how to use different resources across the web to learn a specific skill I wanted (in this case, using logistic regression to classify +/- sentiment). The process of learning this was quite helpful. Got a broader overview of ML from the Kunal Sing web video: learned about content-based vs. collaborative filtering, unsupervised vs. supervised learning, etc. From the Sara EL_ATEIF web video, got a better understanding of a specific application of ML through the example of sorting IMDB movies. Also got a general understanding of NLP. Finally, watched the Maleeha Imran video for more specific knowledge of crawling and scraping and the differences between the two, and looked at specific code there as well.

Achievement highlights:

  1. Was able to download Beautiful Soup and become familiar with the documentation. Also watched the videos under the resources tab to obtain the necessary background information. The web videos were very helpful.
  2. Successfully scraped this website: Folksy 365 - Daily Listing Challenge Thread January 2021 - Showcase - Folksy Forums. The page is basically a discussion thread where people can respond. I was able to use a soup object to represent the page.
  3. I was able to take the text of each post and print it out in an easy, presentable way. I also stored the text of each post as an entry in a Python list, so now I can run a number of different NLP operations on the words in each post.
  4. I also downloaded movie review data, wrote code in Python, and used logistic regression to classify the reviews into positive and negative sentiment. The accuracy was not great (78% or so), but I used a very basic and imprecise way of doing logistic regression, and I will continue to work on ways to improve that. My main focus was just getting my feet wet in ML. But this did give me a good refresher on writing Python, learning about ML/logistic regression, and learning about nltk (which I imagine could be helpful over the summer).
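The scraping step in highlights 2 and 3 can be sketched roughly as follows. This is a minimal sketch, not the actual script: the HTML snippet and the "post" class name are made up for illustration, and the real forum page will use different markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded forum thread; the real page's
# markup (tag names, class names) will differ.
SAMPLE_HTML = """
<html><body>
  <div class="post"><p>First listing of the year!</p></div>
  <div class="post"><p>Lovely work,   keep it up.</p></div>
  <div class="post"><p>Here is my <b>new</b> item.</p></div>
</body></html>
"""


def extract_posts(html):
    """Return the visible text of each post as one list entry."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for node in soup.find_all("div", class_="post"):
        # get_text flattens nested tags; split/join collapses whitespace.
        text = " ".join(node.get_text(" ").split())
        posts.append(text)
    return posts


posts = extract_posts(SAMPLE_HTML)
for i, text in enumerate(posts, start=1):
    print(f"Post {i}: {text}")
```

Once the posts are in a plain list like this, each entry can be passed to whatever NLP step comes next.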

Tasks completed:

  1. Prepared my workspace by installing all of the minimum technical requirements (e.g., Beautiful Soup). Successfully installed Selenium and chromedriver as well.
  2. Watched the webinars and learned a lot (I explained what I learned in more detail above).
  3. I familiarized myself with scraping using the link mentioned above. I was able to extract just the text from each post, but I didn’t do much to process it; I mostly just played around with scraping. Tell me if you would like anything in particular with regard to the processing.
  4. Read some basics about PyTorch, but I plan to study it more. Also, as mentioned above, wrote Python to train a logistic regression model. I have some questions about improving the logistic regression model, but I think I will get the answers by learning by doing and searching online.
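For reference, a sentiment classifier along the lines of task 4 might look like the sketch below. It assumes sklearn's CountVectorizer and LogisticRegression rather than whatever "basic" approach was actually used, and the six reviews are invented toy data, not the movie review dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy reviews (1 = positive, 0 = negative); not the real dataset.
reviews = [
    "a wonderful film with great acting",
    "great story and a wonderful cast",
    "loved it, truly great",
    "a terrible film with awful acting",
    "boring story and a terrible cast",
    "hated it, truly awful",
]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words counts feed straight into logistic regression.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(reviews, labels)

positive_guess = model.predict(["what a great and wonderful movie"])[0]
negative_guess = model.predict(["an awful and boring film"])[0]
print(positive_guess, negative_guess)
```

On real data, the usual next steps would be holding out a test set and trying TF-IDF weights or n-gram features to push accuracy up.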

Self-Assessment - ML Level 1 Module 2

Things Learned:

Technical Area:

How to scrape data using Beautiful Soup, and the different operations you can do on soup objects.

How to use Selenium/chromedriver to drive a headless browser and navigate pages while scraping data.

More knowledge of HTML as a language and how webpages work.

How to do EDA from a CSV file, including pre-processing (removing stop words, etc.) as well as actually analyzing the data using things like n-grams.

Tools: Python, Beautiful Soup, Selenium/chromedriver, pandas, and nltk, mainly.

Soft skills: Used the GitHub link from Sara (level1_post_recommender_20/webScraping_EDA_tutorials at md2 · mentorchains/level1_post_recommender_20 · GitHub), so I learned how to follow instructions from a tutorial. Also learned how to experiment on my own with different tools (esp. Selenium/chromedriver).

3 Achievement Highlights:

  1. Got comfortable w/ BeautifulSoup and using a Beautiful Soup object to scrape data from different webpages. I primarily experimented w/ the Cartalk community and MyPaint Community forums.
  2. Using the chrome webscraper on Sara’s GitHub as a reference, I modified the get_category_and_tags() method (for some reason the way of getting tags was different on the MyPaint forum than on the Amazon Seller’s Forum) as well as the method for getting comments (among a few other adjustments) to scrape the data on the MyPaint community forum into a CSV file, which I have on my computer.
  3. Used that CSV file to do basic EDA. Converted the text to lower case, removed stop words/common words, removed uncommon words, etc. I then did some basic processing using the TfidfVectorizer and CountVectorizer from sklearn. I think I can continue to do more EDA as the team prepares to start building the ML model.

Goals for the upcoming week: Increased team communication as we begin to get into building the model. I want to make sure that we are all on the same page.

Continue to get more comfortable w/ the ML tools we are using for this summer (although I have made a lot of progress this week)

Get a better big picture understanding of the project for this summer.

Tasks Done:

  1. Played around with Beautiful Soup and Selenium/chromedriver using the Discourse community forums. There were some hurdles, but as with any tool in CS, the best way to learn it is to use online resources to figure things out, combined with personal experimentation (learn by doing).
  2. Scraped data from the MyPaint Community Forum. This was made a lot easier by Sara’s GitHub page. The main hurdles I faced were getting started and wrapping my mind around the problem, plus 1) getting the tags, as the process was different from Sara’s example, and 2) getting the leading comment, as again the process was a little different. I overcame this by looking at the HTML code of the MyPaint community and working out the best way of scraping the data. I was able to get a CSV file similar to the one in Sara’s GitHub.
  3. Was able to perform basic pre-processing of the data from the CSV file. I processed the leading comment of each topic post first (I figured that since this column holds the most text, it would be the most informative to analyze). I turned each leading comment to lower case, removed punctuation, stop words, common words, and rare words, and lemmatized the words as well. This was all fairly straightforward. I then made n-grams of the leading comments (I didn’t go through all of the data for this), used sklearn to calculate TF and IDF of words in the leading comments (again, not all of them, as I was just experimenting), and created a bag of words. Finally, I used Sara’s outline for getting the most common bigrams, and my results made sense: “please tell”, “problem question”, “operating system”, and “graphic tablet” were some of the most common bigrams in the leading comments. I still need to do more work to get more comfortable with all of the tools that Sara used on the GitHub for EDA, but I definitely made progress.
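The pre-processing and bigram counting described in task 3 can be sketched in plain Python. This is a simplified illustration under a few assumptions: the stop-word list is a tiny hand-written one (nltk provides a fuller list), lemmatization is skipped, and the three comments are invented rather than taken from the MyPaint CSV.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); nltk ships a fuller one.
STOP_WORDS = {"a", "an", "the", "i", "my", "is", "it", "to", "and", "of",
              "on", "me", "what"}


def preprocess(text):
    """Lower-case, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def top_bigrams(comments, n=3):
    """Count adjacent word pairs within each cleaned comment."""
    counts = Counter()
    for comment in comments:
        tokens = preprocess(comment)
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Invented leading comments, for illustration only.
comments = [
    "My graphic tablet is not detected on my operating system.",
    "Please tell me what operating system to use.",
    "The graphic tablet lags. Please tell me the fix.",
]

for pair, count in top_bigrams(comments):
    print(" ".join(pair), count)
```

Note that because stop words are removed before pairing, some bigrams join words that were not adjacent in the original sentence; that is the same trade-off as in the cleaned data described above.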

Things Learned:

Technical Area: 1) Cleaning the data and doing some preprocessing. 2) Learned how to use sklearn to test different classifiers (Naïve Bayes classifier, LSVM, etc.). 3) I also built a recommender: I applied a TF-IDF vectorizer and then used cosine similarity to get the best matches. Basically, I learned about the activities in module 3.

Tools: Python + different tools that can be used in python: sklearn, seaborn, numpy.

Soft Skills: Following directions/ learning on my own. For example, I searched up an article to get a better understanding of what the tfidf vectorizer was actually doing, which then helped me to understand the code I was writing.

3 achievement highlights:

  1. Was able to combine the author, title, leading comment, and other comments of each post, TF-IDF vectorize the result, and then use cosine similarity to build a recommender.
  2. Was able to perform basic classification of the posts: I used the 4 classifiers that Sara used in her tutorial (Naïve Bayes, LSVM, Logistic Regression, Decision Tree), plus k-nearest neighbors, and tested the accuracy of each. I was also able to document and compare the accuracy: my two most successful tests were LSVM and Logistic Regression.
  3. I was able to make minor adjustments to see if I could make the classifier better. For example, I added the tags to my data and vectorized them (in addition to author, title, etc.), which slightly increased the accuracy for some of my data. I also tried different types of data cleaning.
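The recommender in highlight 1 can be sketched roughly like this, assuming sklearn's TfidfVectorizer and cosine_similarity. The posts, the field names, and the recommend() helper are all invented for illustration and are not the actual project code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented posts standing in for rows of the scraped CSV.
posts = [
    {"title": "Tablet pressure not working",
     "body": "my graphic tablet pressure is ignored"},
    {"title": "Brush lag on large canvas",
     "body": "brush strokes lag when the canvas is large"},
    {"title": "Graphic tablet not detected",
     "body": "the tablet is not detected after update"},
    {"title": "Export to PNG fails",
     "body": "saving or export to png gives an error"},
]

# Combine the fields of each post into one string, then vectorize.
docs = [f"{p['title']} {p['body']}" for p in posts]
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)


def recommend(index, k=2):
    """Return the titles of the k posts most similar to posts[index]."""
    scores = cosine_similarity(matrix[index], matrix).ravel()
    ranked = scores.argsort()[::-1]  # highest similarity first
    return [posts[i]["title"] for i in ranked if i != index][:k]


print(recommend(0, k=1))
```

Adding more fields (author, tags, other comments) just means appending them to the combined string before vectorizing.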

Tasks Completed:

  1. Was able to use cosine similarity to build a recommender for a post. I had to use Google to learn how to do some of it, but it wasn’t a huge challenge. However, right now it just prints the topic titles of the posts I am recommending. I was wondering: is there a way to make this work in a real-life setting? For example, if I were on the actual MyPaint Community website, could I build a recommender that would nicely show each of the posts I am recommending? I know this would be a lot more difficult.
  2. Performed basic classification, as described above. Sara’s tutorial made it pretty easy, and adding k-nearest neighbors was not a big adjustment. The one thing I want to do more of is experimenting with the preprocessing, so that I can increase the accuracy of my model.
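The classifier comparison in task 2 could be set up along the following lines. This is a sketch under assumptions: the texts and the two category labels are invented, TF-IDF features are assumed, and "LSVM" is taken to mean sklearn's LinearSVC. With data this small the scores mean nothing; the point is only the loop structure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Invented training and test posts with made-up category labels.
train_texts = [
    "tablet pressure not detected", "stylus driver problem on linux",
    "graphic tablet lags badly", "pen pressure ignored after update",
    "new brush pack released", "sharing my custom brush set",
    "watercolor brush presets", "how to import a brush pack",
]
train_labels = ["hardware"] * 4 + ["brushes"] * 4
test_texts = ["tablet driver problem", "custom watercolor brush"]
test_labels = ["hardware", "brushes"]

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "LSVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=3),
}

results = {}
for name, clf in classifiers.items():
    # Same features for every model, so only the classifier varies.
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train_texts, train_labels)
    results[name] = model.score(test_texts, test_labels)

for name, accuracy in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {accuracy:.2f}")
```

Keeping the vectorizer inside the pipeline makes it easy to swap in different preprocessing and re-run the whole comparison.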

Things Learned:

Technical Area: Started on module 4, tried to use BERT to create word embeddings (although no final product yet). Still working on it.

Tools: Simple Transformers Library on Github (GitHub - ThilinaRajapakse/simpletransformers: Transformers for Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI).

Soft Skills: Spent time preparing the slideshow for the presentation. Worked with others to present our progress, and learned how to create a slideshow that represents the work we have done up to this point in a concise manner. Also got good feedback on how to make our presentation even better.

3 achievement highlights:

  1. Created the visuals for the presentation (bigrams split up by year, visualization of our recommender).
  2. Created the actual presentation and met up w/ other presenting team members to create + practice the presentation.
  3. Finally, gave our presentation last Tuesday and got feedback.

Tasks completed:

  1. As stated earlier, my team and I went through the process of actually creating a presentation that nicely put together all of the work we had been doing over the past month (roughly).
  2. Also started to explore other forums to scrape. Decided on Drowned in Sound community forum.
  3. Finally, started to look at the simple transformers library and how to use BERT. I think I need to read articles to get a better understanding of what BERT actually is/does, though.

Things Learned:

Technical Area:

  1. Learned how to scrape from a larger forum (Drowned in Sound). I first scraped all of the URLs that I would use and then scraped the data from each of those URLs, so that the script wouldn’t take so long.
  2. Learned how to use the simpletransformers library to work with BERT, DistilBERT, and other ML models.
  3. Built a basic recommender and classifier (but not using BERT or DistilBERT) to recommend/classify posts in the Drowned in Sound forum.
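The two-phase approach in item 1 (collect the URLs first, then scrape each one) can be sketched as below. Everything here is hypothetical: the in-memory FAKE_SITE dictionary and fetch() replace real page downloads, and the "topic" and "body" class names are made up rather than taken from the Drowned in Sound markup.

```python
from bs4 import BeautifulSoup

# Hypothetical in-memory "site": category pages link to topic pages.
FAKE_SITE = {
    "/c/music": '<a class="topic" href="/t/1">A</a>'
                '<a class="topic" href="/t/2">B</a>',
    "/c/social": '<a class="topic" href="/t/2">B</a>'
                 '<a class="topic" href="/t/3">C</a>',
    "/t/1": "<h1>New album thread</h1><div class='body'>Thoughts?</div>",
    "/t/2": "<h1>Gig tickets</h1><div class='body'>Two spare.</div>",
    "/t/3": "<h1>Weekend plans</h1><div class='body'>Anyone about?</div>",
}


def fetch(url):
    """Stand-in for a real download (e.g. via Selenium or requests)."""
    return FAKE_SITE[url]


def collect_topic_urls(category_urls):
    """Phase 1: gather every topic URL once, preserving first-seen order."""
    seen = {}
    for category in category_urls:
        soup = BeautifulSoup(fetch(category), "html.parser")
        for link in soup.find_all("a", class_="topic"):
            seen.setdefault(link["href"], None)
    return list(seen)


def scrape_topics(urls):
    """Phase 2: visit each saved URL and pull out the fields we need."""
    rows = []
    for url in urls:
        soup = BeautifulSoup(fetch(url), "html.parser")
        rows.append({
            "url": url,
            "title": soup.h1.get_text(strip=True),
            "content": soup.find("div", class_="body").get_text(strip=True),
        })
    return rows


urls = collect_topic_urls(["/c/music", "/c/social"])
rows = scrape_topics(urls)
print(len(urls), "topics scraped")
```

Saving the URL list between the two phases also means a failed run can resume from where it stopped instead of re-crawling every category.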

Tools: simpletransformers. (And as always python, sklearn, and other tools used throughout the summer).

Soft skills: Trying to start to communicate more with the team to build a final product.

3 achievement highlights:

  1. Was finally able to scrape data from a larger forum, and got about 13,000 posts. May choose to not use all of them.
  2. Built a classifier that had a much higher accuracy than my previous classifier (probably because we simply had more training data). We can still tune our model to get even better results, though. Also built a recommender that can recommend a URL to the user (same method as before: TF-IDF vectorization followed by cosine similarity).
  3. Was able to use bert/distilbert to some effect (I will describe more in detail later).

Tasks Completed:

  1. Scraped data from the Drowned in Sound forum (title, author, content of post, comments, likes, views, tags, commenters, date, url). About 13k posts in total. I scraped all of the URLs from each category, but I didn’t scrape all of the posts in the music category: only about 5k posts from it. That should be sufficient.
  2. Built a basic recommender using the same model as used with MyPaint (TF-IDF vectorization + cosine similarity).
  3. Tried to use BERT/DistilBERT, but got an error message: zsh:killed. I’m planning on going to office hours to discuss the problem.

Things Learned:

Technical Area:

  1. Learned how to use Google Colab to train machine learning models.
  2. Learned how to use different techniques to get around class imbalance, such as round-trip translation.
  3. Learned how to experiment in different ways to create more effective post embeddings (such as separating the post into title and body and weighting the title more heavily).
  4. Learned how to test both the recommender and the classifier for performance. Testing the recommender for performance was especially tricky.
  5. Learned how to use Flask to deploy a web app for the classifier and recommender.
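The title-weighting idea in item 3 can be illustrated with a small sketch. It assumes the title and body have already been turned into vectors by some embedding model; the embed_post() helper and the 0.7 weight are hypothetical choices for illustration, not the values the team used.

```python
import numpy as np


def embed_post(title_vec, body_vec, title_weight=0.7):
    """Weighted average of title and body vectors, L2-normalised."""
    title_vec = np.asarray(title_vec, dtype=float)
    body_vec = np.asarray(body_vec, dtype=float)
    combined = title_weight * title_vec + (1.0 - title_weight) * body_vec
    norm = np.linalg.norm(combined)
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return combined / norm if norm else combined


# Toy 2-D vectors: the title points along axis 0, the body along axis 1.
post_vec = embed_post([1.0, 0.0], [0.0, 1.0])
print(post_vec)
```

Normalising the result keeps cosine similarity comparisons between posts on the same scale, and the weight can be tuned by checking recommendations by hand.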

Tools: google colab, python, and flask. Others as well, but these were the main tools.

Soft skills: Learned how to problem solve and work together as a team. Also created the final presentation with the team and got feedback on it too.

Achievement highlights:

  1. Finding a model that generated good post embeddings. This is especially difficult because you have to test performance of these embeddings manually/using your own judgement.
  2. Creating a flask app so that the model can actually be seen in action. (This goes for both the classifier and recommender).
  3. Creating a slideshow presentation to showcase our work.

Tasks completed:

  1. Trained a few different classifier models on Google Colab, picked the best one, and created a Flask app to demonstrate it.
  2. Used a few different models to generate post embeddings for our recommender, picked the best one, and created a Flask app to demonstrate it.
  3. Created a slideshow presentation w/ the rest of the team to showcase the work.
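A Flask app wrapping a classifier, as in task 1, might be structured like the minimal sketch below. The /classify route, the JSON shape, and the keyword-based classify() function are all hypothetical placeholders; a real app would load the trained model instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def classify(text):
    """Placeholder for the trained model (a made-up keyword rule)."""
    return "music" if "album" in text.lower() else "other"


@app.route("/classify", methods=["POST"])
def classify_route():
    # Expect a JSON body like {"text": "..."}; reject anything else.
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "")
    if not text:
        return jsonify(error="missing 'text'"), 400
    return jsonify(category=classify(text))
```

Saved as app.py, this can be started locally with `flask --app app run` and queried by POSTing JSON to /classify; the recommender would get a similar route that returns a list of URLs.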