Sachitt_Arora - Machine Learning (Level 1) Pathway

Overview of Things Learned

Technical Area

  1. Went over the core concepts of machine learning, including the foundations it is built on and the workflow required to create a machine learning project.
  2. Became familiar with the different models and approaches that machine learning offers for different cases and situations.
  3. Learned the basics of recommender systems and how they categorize data to determine similarity between items, using measures such as cosine similarity, Euclidean distance, and the dot product.
  4. Learned data mining and how to employ scrapers and web crawlers to find and organize data that can then be fed into a program.
  5. Reviewed the mathematical concepts that machine learning depends on, with an emphasis on linear algebra and vector math.
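The three similarity measures mentioned above capture different things: the dot product is sensitive to vector magnitude, Euclidean distance measures absolute separation, and cosine similarity measures only the angle between vectors. A minimal NumPy sketch with made-up vectors:

```python
import numpy as np

# Two hypothetical feature vectors (e.g. word counts for two posts).
# b is exactly twice a, so they point in the same direction.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])

# Dot product: raw alignment, grows with vector magnitude.
dot = np.dot(a, b)

# Euclidean distance: smaller means more similar.
euclidean = np.linalg.norm(a - b)

# Cosine similarity: magnitude-independent; 1.0 means same direction.
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
```

Here the cosine similarity is exactly 1.0 even though the vectors differ in length, which is why recommender systems often prefer it over raw dot products.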

Tools

  1. Scrapy and BeautifulSoup for scraping and parsing data
  2. VS Code as an IDE
  3. spaCy for NLP
  4. Git and GitHub for collaborative work

Soft Skills

  1. Learning and understanding the guidelines for ethical web crawling
  2. Being able to use online discussion forums to understand and resolve problems that occur.
  3. Growth mindset enabled further development in the face of setbacks

Achievements completed

  1. Familiarized myself with the BeautifulSoup documentation, its methods, and its capabilities.
  2. Experimented with Git on the command line, having previously only used GitHub Desktop.
  3. Wrote more detailed and complex functions to become more comfortable in Python.
  4. Reviewed mathematical concepts associated with machine learning, such as linear algebra and vectors.

Detailed Statement of Tasks Completed

  1. Became familiar with ideas and concepts behind machine learning and project management in a cooperative group setting.
  2. Applied web scraping concepts to previous personal coding projects, such as bracketing systems and social media bots, in order to make them more efficient and successful.
  3. Set up the environment required for writing web scraping and data analysis scripts.
  4. Watched the STEM-Away webinars to further understand machine learning as taught by experts in the field.

Technical Area

  1. I chose the Amazon community forum on DiscourseHub to scrape data from, since it offered a variety of categories and a large amount of information.
  2. Using commands and libraries from various sources, I scraped the information and stored it in a .csv file.
  3. I loaded the .csv file and used exploratory data analysis to clean the data.
  4. Finally, I used matrices and mathematical functions to determine which posts to recommend.
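The scrape-and-store step can be sketched as follows. This is a minimal illustration, not the actual script: the HTML snippet and its class names (`topic`, `title`, `category`) are made up, and a real run would first download the forum pages (e.g. with the `requests` library) rather than parse a hardcoded string.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in HTML with hypothetical class names; a real run would fetch
# the DiscourseHub pages over the network first.
html = """
<div class="topic"><a class="title">How do I reset my device?</a>
  <span class="category">Support</span></div>
<div class="topic"><a class="title">Feature request: dark mode</a>
  <span class="category">Ideas</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull the title and category out of each topic block.
rows = [{"title": t.find("a", class_="title").get_text(strip=True),
         "category": t.find("span", class_="category").get_text(strip=True)}
        for t in soup.find_all("div", class_="topic")]

# Store the scraped records in a .csv file for later cleaning.
pd.DataFrame(rows).to_csv("posts.csv", index=False)
```

The Inspect Element tool in the browser is what reveals which tags and class names to pass to `find_all`.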

Tools

  1. BeautifulSoup
  2. VS Code
  3. scikit-learn
  4. Pandas
  5. GitHub

Soft Skills

  1. Growth Mindset - When faced with challenges I was able to overcome them through continuous effort, demonstrating perseverance.
  2. Internet Skills - The internet is a very useful source of information; when I had trouble with certain commands or functions, I would search websites like StackOverflow to see examples of how others used the libraries.
  3. File structure understanding - Since I needed to use my file system a lot for this module, I gained a deeper understanding of the hierarchies of a file system and what files can or cannot be accessed from specific points.

Achievement Highlights

  1. I successfully scraped the data from my DiscourseHub community and stored it in a .csv file for later use.
  2. I cleaned the data and stored it in a dataframe, removing things like whitespace and punctuation and reducing posts to key words. I also added a column containing a single string of every important word that characterizes the post.
  3. I learned the math behind cosine similarity and matrices to understand how recommendation works, and used the tools and examples available to me to create a function that decides which posts are most similar.

Detailed Statement of Tasks

  1. I used the BeautifulSoup library and the browser's Inspect Element tool to scrape data from a webpage.
  2. I used the pandas library to write data to and read data from a .csv file.
  3. I cleaned the data in the .csv file by removing whitespace, punctuation, and unnecessary words.
  4. I gathered all the words needed to determine similarity into a single string, which I stored in another column of the table.
  5. I wrote a recommend function that calculates the similarity between posts in order to recommend one.
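The combined-keywords column and the recommend function can be sketched like this. The post texts and the column name `bag_of_words` are made up for illustration; the real project used the cleaned forum data described above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy posts standing in for the cleaned forum data: each row's
# bag_of_words column holds the important words as one string.
df = pd.DataFrame({"bag_of_words": [
    "kindle screen frozen reset",
    "reset frozen kindle device",
    "alexa music playlist skip",
]})

# Turn each post into a word-count vector, then compare every pair.
matrix = CountVectorizer().fit_transform(df["bag_of_words"])
sim = cosine_similarity(matrix)  # sim[i][j] = similarity of posts i and j

def recommend(index, top_n=2):
    """Return the indices of the top_n posts most similar to `index`."""
    scores = list(enumerate(sim[index]))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scores if i != index][:top_n]
```

For the toy data, post 0 and post 1 share three of four words, so `recommend(0)` ranks post 1 first; the unrelated post about music scores 0.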

Problems Faced

I faced many issues with the elements and tags in the website's HTML, but was able to get around these after much debugging and troubleshooting. I also had some problems with VS Code itself, but resolved these relatively easily. My most common problem was with the dataframes themselves: understanding the structure and syntax of pandas and what it took to make changes to them or read information from them. After scouring the docs, I developed a better understanding of this.
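The pandas patterns that caused the most confusion boil down to two idioms: reading with `.loc` and changing a column by assigning to it. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"title": ["  Help!  ", "Dark mode?"],
                   "replies": [3, 7]})

# Reading: .loc selects rows by a condition and columns by name.
busy = df.loc[df["replies"] > 5, "title"]

# Changing: assigning to a column name creates or overwrites that column.
df["title"] = df["title"].str.strip().str.lower()
df["is_question"] = df["title"].str.endswith("?")
```

The key point is that most pandas "changes" are really whole-column assignments built from vectorized operations like `.str.strip()`, rather than row-by-row edits.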

Technical

  1. I created a new column that combined all of the relevant data.
  2. I used cosine similarity to create a basic recommender system, which simply uses the cosine similarity matrix to determine whether two posts are similar.
  3. I used this recommender function to find the top 10 posts most similar to any inputted post.
  4. I then moved on to more advanced data modeling, creating different pipelines and determining which achieved the highest accuracy on my data.
  5. I experimented with this several times, varying the input data to get the highest accuracy.
  6. I refactored my functions and data files, both in this module and in the others, to simplify the process.
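Comparing pipelines for accuracy typically looks like the sketch below. The toy texts, labels, and the choice of classifiers are illustrative assumptions; the idea is that each pipeline bundles vectorization and a model, so cross-validated scores are directly comparable.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Toy labeled posts (category prediction stands in for the real task).
texts = ["battery drains fast", "battery died overnight",
         "love the new update", "great update features",
         "screen is cracked", "cracked display glass"] * 5
labels = ["hardware", "hardware", "software", "software",
          "hardware", "hardware"] * 5

# One pipeline per model type; each handles vectorization internally.
pipelines = {
    "logreg": Pipeline([("tfidf", TfidfVectorizer()),
                        ("clf", LogisticRegression(max_iter=1000))]),
    "nb": Pipeline([("tfidf", TfidfVectorizer()),
                    ("clf", MultinomialNB())]),
}

# Mean cross-validated accuracy makes the pipelines comparable.
scores = {name: cross_val_score(p, texts, labels, cv=3).mean()
          for name, p in pipelines.items()}
best = max(scores, key=scores.get)
```

Because the vectorizer sits inside the pipeline, each cross-validation fold refits it on training data only, which avoids leaking test vocabulary into the model.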

Tools

  1. VS Code
  2. NumPy
  3. scikit-learn
  4. Pandas

Soft Skills

A problem I had here stemmed largely from not structuring my functions well enough in module 2. I went back and edited many parts of module 2 so it could be reused more efficiently in the future, did the same for module 3, and learned how to keep my code neat and organized. I also used YouTube videos about the math and other functions to better understand what I was doing.

Achievement Highlights

  1. I was able to create a basic recommender system using cosine similarity.
  2. I was able to analyze different models on my data.
  3. I eventually reached almost 90% accuracy with my models after experimenting with many variations of the input data.
  4. I scraped even more data and restructured my code from both modules.

Detailed Statement of Tasks Completed

  1. I first created two new .csv files, each with a bag-of-words column. One contained only key words, while the other still had stopwords and was less thoroughly cleaned, so I could see whether one would work better than the other in modeling.
  2. I then constructed a cosine similarity matrix for a basic recommender system. I reviewed a little linear algebra at this point because I wanted to better understand what was going on, then tried inputting a few posts and examining the top 10 similar ones.
  3. Next I moved on to modeling and constructed pipelines for each of the different model types. I fed these models different kinds of data (fully cleaned or semi-cleaned), different amounts of data, and different constraints, and decided on the best one.
  4. During my data analysis and model training, I plotted graphs to better understand how the data was structured.
  5. I scraped more data and reorganized my code structure and layout.
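The fully-cleaned versus semi-cleaned distinction in step 1 can be reproduced with scikit-learn's built-in stopword list. The two example posts are made up; the point is that dropping stopwords shrinks the vocabulary to the key words that actually distinguish posts.

```python
from sklearn.feature_extraction.text import CountVectorizer

posts = ["how do I reset the device",
         "the device will not reset at all"]

# Semi-cleaned: keep every token, stopwords included.
raw = CountVectorizer().fit(posts)

# Fully cleaned: drop English stopwords so only key words remain.
clean = CountVectorizer(stop_words="english").fit(posts)

raw_vocab = sorted(raw.vocabulary_)
clean_vocab = sorted(clean.vocabulary_)
```

For these two posts the cleaned vocabulary collapses to just `device` and `reset`, while the raw one keeps filler words like `the` and `will`; which version models better is exactly the empirical question the two .csv files were built to answer.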

Problems Faced

The biggest problem I faced in this module was time constraints. I had an extremely busy schedule and had to do most of the work in a very short period of time. However, I stayed motivated even though the module looked tougher than earlier ones, and I completed it before the deadline. I solved the technical problems I faced by using the mentors' guide or by looking on websites such as StackOverflow for general solutions. I will keep trying to reach even greater accuracy by varying the data and attempting new techniques. One more thing I am slowly improving is my teamwork. Since I became a PM later in the program, I have needed to catch up and rise to the challenge despite the late start. I have been looking through the website and discussing with fellow PMs what could be done to maximize my potential as a project manager.

Technical Area:

  • Learned how to train BERT, XLNet, RoBERTa, and DistilBERT models using the Simple Transformers library
  • Combined various models, such as BERT and Random Forest, and measured their overall effectiveness.
  • Learned how BERT operates internally
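The combine-models idea (transformer-style text features feeding a Random Forest) can be sketched with scikit-learn alone. Note the assumption: TF-IDF vectors stand in here for BERT sentence embeddings, since the actual experiments used the Simple Transformers library, which needs a trained transformer to run; the texts and labels are also made up.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# TF-IDF is a lightweight stand-in for BERT embeddings in this sketch;
# the real pipeline produced features with Simple Transformers instead.
texts = ["slow shipping again", "package arrived late",
         "fantastic sound quality", "speakers sound amazing"] * 5
labels = [0, 0, 1, 1] * 5

combo = Pipeline([
    ("features", TfidfVectorizer()),                 # text -> vectors
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
combo.fit(texts, labels)
pred = combo.predict(["sound is amazing"])
```

Swapping the feature step for real BERT embeddings keeps the same structure: one component turns text into vectors, and the Random Forest classifies those vectors.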

Tools:

  • Simple Transformers
  • tokenizers
  • Docker
  • scikit-learn
  • Jupyter Notebook

Achievement Highlights:

  • Was able to train various models and compare their relative efficiencies.
  • Tried different types of data (fully cleaned, semi-cleaned, and not cleaned) to see which was most effective.
  • Tried dockerization of the ML app with the advanced models → currently still having issues with this; it is one of my main challenges.

Challenges:

  • As the modules are becoming more complicated, a better understanding of the ins and outs of machine learning is needed to progress. Even though I had a lot of trouble at first, I was eventually able to obtain the knowledge I needed to complete this module.