Machine Learning Sachitt Arora - Level 1 Module 2

Technical Area

  1. I chose to scrape the DiscourseHub Community forum for Amazon because it offered a variety of categories and a large amount of information.

  2. By using different commands and libraries from various companies, I was able to scrape the information and store it in a .csv file.

  3. I accessed the csv file and used exploratory data analysis to clean the data.

  4. Finally, I used matrices and mathematical functions to determine which posts to recommend.
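
A minimal sketch of steps 1 and 2, assuming a simplified page layout. The HTML snippet, the tag and class names (topic, title, category), and the output file name posts.csv are all hypothetical placeholders; the real ones have to be found by inspecting the forum's pages in the browser.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a forum listing page. The real
# tag and class names come from inspecting the site in the browser.
SAMPLE_HTML = """
<div class="topic"><a class="title" href="/t/1">Shipping delay</a><span class="category">Orders</span></div>
<div class="topic"><a class="title" href="/t/2">Listing errors</a><span class="category">Selling</span></div>
"""


def parse_posts(html):
    """Return one dict per topic block found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for topic in soup.find_all("div", class_="topic"):
        title = topic.find("a", class_="title")
        category = topic.find("span", class_="category")
        rows.append({
            "title": title.get_text(strip=True),
            "url": title["href"],
            "category": category.get_text(strip=True),
        })
    return rows


# Collect the rows in a DataFrame and save them for later steps.
posts = pd.DataFrame(parse_posts(SAMPLE_HTML))
posts.to_csv("posts.csv", index=False)  # hypothetical file name
```

For a live page, the HTML string would come from an HTTP request instead of a constant, and the selectors would need to match the site's actual markup.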

Tools

  1. Beautiful Soup

  2. VS Code

  3. scikit-learn (sklearn)

  4. Pandas

  5. GitHub

Soft Skills

  1. Growth Mindset - When faced with challenges I was able to overcome them through continuous effort, demonstrating perseverance.
  2. Internet Skills - The internet is a very useful source of information. When I had trouble with certain commands or functions, I went through websites like Stack Overflow to see examples of how people used the libraries.
  3. File structure understanding - Since I needed to use my file system a lot for this module, I gained a deeper understanding of the hierarchies of a file system and what files can or cannot be accessed from specific points.

Achievement Highlights

  1. I successfully scraped the data from my DiscourseHub community and stored it in a CSV file for later use.
  2. I loaded the data into a data frame and cleaned it by removing whitespace and punctuation and keeping only key words, so that it was ready for later use. I also added a column that combined every important word from a post into a single sentence.
  3. I learned the math behind cosine similarity and matrices in order to understand how recommendations are made. Using the tools I had and existing examples, I wrote a function that decides which posts are most similar.
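
To illustrate the math in item 3: cosine similarity treats each post as a vector of word counts and divides the dot product of two vectors by the product of their lengths, giving 1 for posts with the same word proportions and 0 for posts with no words in common. The sketch below works this out by hand with only the standard library; the function name and the example sentences are made up for illustration and are not taken from the project.

```python
import math
from collections import Counter


def cosine_similarity_text(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    # Dot product: only words present in both texts contribute.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    # Vector lengths (Euclidean norms) of the two count vectors.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Two of the three words are shared, so the score is 2 / (sqrt(3) * sqrt(3)).
score = cosine_similarity_text("late package refund", "late package tracking")
```

Here score works out to 2/3, since the two texts share two words and each vector has length sqrt(3).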

Detailed Statement of Tasks

  1. I used the Beautiful Soup library and the browser's Inspect Element tool to scrape data from a webpage.
  2. I used the pandas library to write data to and read data from a CSV file.
  3. I cleaned the data in the CSV file by removing whitespace, punctuation, and words that were not necessary.
  4. I organized all the words I needed to determine similarity into a single sentence, which I stored in another column of the table.
  5. I used a recommend function that calculated the similarity between posts to decide which posts to recommend.
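
Tasks 3 to 5 can be sketched as follows. This is an illustration under assumptions, not the project's actual code: the sample posts, the column names (title, category, tags), the use of scikit-learn's CountVectorizer with its built-in English stop-word list to drop unnecessary words, and the recommend signature are all hypothetical choices.

```python
import string

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sample data standing in for the scraped CSV contents.
posts = pd.DataFrame({
    "title": [
        "Late package refund",
        "Refund for late package!",
        "Listing photo guidelines",
    ],
    "category": ["Orders", "Orders", "Selling"],
})


def clean(text):
    """Lowercase, strip punctuation, and collapse extra whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


# One "sentence" per post holding every word used for comparison.
posts["tags"] = (posts["title"] + " " + posts["category"]).apply(clean)

# Count words per post (dropping common English stop words), then
# compare every post with every other post.
vectorizer = CountVectorizer(stop_words="english")
similarity = cosine_similarity(vectorizer.fit_transform(posts["tags"]))


def recommend(title, top_n=2):
    """Return the titles of the posts most similar to the given one."""
    index = posts.index[posts["title"] == title][0]
    ranked = similarity[index].argsort()[::-1]  # most similar first
    ranked = [i for i in ranked if i != index][:top_n]
    return posts["title"].iloc[ranked].tolist()
```

Calling recommend("Late package refund", top_n=1) on this sample returns the other refund post, since the two share nearly all of their words while the third post shares none.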

Problems Faced

I faced many issues with the elements and tags in the website's HTML, but I was able to get around them after much debugging and troubleshooting. I also had some problems with VS Code itself, which I resolved relatively easily. My most common problem was with the DataFrames themselves: understanding the structure and syntax of pandas and what it took to change them or read information from them. After scouring the docs, I developed a better understanding of this.