- I chose the DiscourseHub community forum for Amazon as my data source because it offered a variety of categories and a large amount of information.
- Using various commands and libraries, I scraped the information and stored it in a .csv file.
- I loaded the .csv file and used exploratory data analysis to clean the data.
- Finally, I used matrices and mathematical functions in order to determine which posts to recommend.
- Beautiful Soup
- VS Code
- Growth Mindset - When faced with challenges I was able to overcome them through continuous effort, demonstrating perseverance.
- Internet Skills - The internet is a very useful source of information; when I had trouble with certain commands or functions, I went through websites like Stack Overflow to see examples of how people used the libraries.
- File structure understanding - Since I needed to use my file system a lot for this module, I gained a deeper understanding of the hierarchies of a file system and what files can or cannot be accessed from specific points.
- I scraped the data from my DiscourseHub community successfully and stored it in a .csv file for later use.
- I cleaned the data and stored it in a DataFrame, which I then modified to remove whitespace and punctuation and to keep only key words, preparing the data for future use. I also added a column containing a single "sentence" of every important word that could contribute to the post.
- I learned the math behind cosine similarity and matrices in order to understand how recommendation works, and I used the tools and examples I had to create a function that decides which posts are most similar.
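The cosine-similarity math mentioned above can be sketched as follows; the two keyword-count vectors are hypothetical stand-ins, not the actual project data:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|).

    1.0 means the vectors point the same way (very similar posts),
    0.0 means they share no direction (no words in common).
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical keyword-count vectors for a pair of posts.
vec_a = [1, 2, 0, 1]
vec_b = [1, 1, 0, 0]
score = cosine_similarity(vec_a, vec_b)
```

Identical vectors score 1.0 and vectors with no shared words score 0.0, which is why this works as a similarity measure between bag-of-words posts.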
Detailed Statement of Tasks
- I used the Beautiful Soup library and the browser's inspect-element tool to scrape data from a webpage.
- I used the pandas library to write data to and read data from a .csv file.
- I cleaned the data in the .csv file by removing whitespace, punctuation, and unnecessary words.
- I gathered all the words I needed to determine similarity into a single sentence, which I stored in another column of the table.
- I wrote a recommend function that calculated the similarity between posts in order to recommend similar ones.
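A minimal sketch of the scrape-and-store step; the HTML snippet and the class names (`topic`, `title`) are made-up stand-ins for the real Discourse markup found via inspect element:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a scraped forum page.
html = """
<div class="topic"><a class="title">How do I return an item?</a></div>
<div class="topic"><a class="title">Shipping delays question</a></div>
"""

# Parse the page and pull out the post titles by their CSS selector.
soup = BeautifulSoup(html, "html.parser")
titles = [a.get_text(strip=True) for a in soup.select("a.title")]

# Store the scraped rows in a DataFrame and write them out for later use.
df = pd.DataFrame({"title": titles})
df.to_csv("posts.csv", index=False)
```

In the real project the `html` string would come from fetching the live forum page rather than a hard-coded snippet.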
I faced many issues with the elements and tags in the HTML of the website, but I was able to get around these after much debugging and troubleshooting. I also had some problems with VS Code itself but resolved these relatively easily. My most common problem was with the DataFrames themselves: understanding the structure and syntax of pandas and what it took to modify or read information from them. After scouring the docs, I was able to develop a better understanding of this.
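The pandas cleaning steps described above can be sketched roughly like this; the stopword list and the sample rows are invented for illustration:

```python
import string
import pandas as pd

# Hypothetical stopword list; the project used its own set of unneeded words.
STOPWORDS = {"the", "a", "is", "to", "how"}

df = pd.DataFrame({
    "title": ["  How to RETURN an item?  ", "Shipping delays!"],
    "body": ["The seller is slow.", "Package stuck a week."],
})

def clean(text):
    """Lowercase, strip punctuation and extra whitespace, drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOPWORDS)

for col in ["title", "body"]:
    df[col] = df[col].map(clean)

# Extra column: one "sentence" of every keyword that describes the post.
df["keywords"] = df["title"] + " " + df["body"]
```

The `keywords` column plays the role of the single-sentence column mentioned above, combining every important word from a post into one string for the similarity step.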
- I created a new column that combined all of the relevant data for each post.
- I used cosine similarity to create a basic recommender system, which uses the cosine-similarity matrix to determine how similar two posts are.
- I used this recommend function to find the top 10 posts most similar to any given post.
- I then moved on to more advanced data modeling, building different pipelines and determining which achieved the highest accuracy on my data.
- I experimented with this several times, varying the input data to get the highest accuracy.
- I abstracted my functions and data files, both in this module and in earlier ones, to simplify the process.
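A rough sketch of the cosine-matrix recommender described above, using a made-up keyword column and scikit-learn's `CountVectorizer` and `cosine_similarity` (the real project recommended the top 10; here there are only a few posts):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical keyword column; the real data came from the scraped .csv file.
posts = pd.Series([
    "return item refund seller",
    "refund return policy",
    "shipping delay package",
    "package lost shipping",
])

# Bag-of-words counts, then an all-pairs cosine-similarity matrix.
counts = CountVectorizer().fit_transform(posts)
sim = cosine_similarity(counts)

def recommend(index, top_n=10):
    """Indices of the most similar posts, excluding the post itself."""
    order = sim[index].argsort()[::-1]
    return [i for i in order if i != index][:top_n]
```

For example, `recommend(0)` ranks the refund/return post highest for the first post, since they share the most words.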
VS Code, NumPy, scikit-learn, pandas
A problem I had with this module was largely due to how poorly I had structured my functions in module 2. I went back and edited many parts of module 2 so it could be reused more efficiently, did the same for module 3, and learned how to keep my code neat and organized. I also used YouTube videos about the math and other functions to better understand what I was doing.
- I was able to create a basic recommender system using cosine similarity
- I was able to analyze different models on my data
- Eventually reached almost 90% accuracy with my models after experimenting with many different inputs and outputs
- Scraped even more data and restructured my code from both modules
Detailed Statement of Tasks Completed
- I first created two new .csv files, each containing a bag of words. One contained only key words, while the other still had stopwords and was only lightly cleaned, so I could see whether one would work better than the other in modeling.
- I then constructed a cosine-similarity matrix for a basic recommender system. I reviewed a little linear algebra at this time, too, because I wanted to better understand what was going on. I then tried inputting a few posts and looking at the top 10 similar ones returned.
- Next I moved on to modeling and constructed pipelines for each of the different model types. I fed these models different kinds of data (fully cleaned or semi-cleaned), different amounts of data, and different constraints, and decided on the best one.
- During my data analysis and model training, I plotted graphs to better understand how the data was structured.
- Scraped more data and reorganized code structure and layout.
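The pipeline-comparison step might look roughly like this; the texts, labels, and the two model choices are illustrative assumptions, not the project's actual data or models:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data: post keywords and a made-up category label.
texts = ["return refund seller", "refund policy return", "shipping delay package",
         "package lost shipping", "refund item broken", "delay shipping courier"] * 5
labels = ["returns", "returns", "shipping",
          "shipping", "returns", "shipping"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0)

# One pipeline per model type; pick whichever scores highest on held-out data.
pipelines = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "logreg": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
}
scores = {name: accuracy_score(y_test, p.fit(X_train, y_train).predict(X_test))
          for name, p in pipelines.items()}
```

Swapping in the fully cleaned versus semi-cleaned text, or adding more pipelines, follows the same pattern: fit each on the training split and compare held-out accuracy.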
The biggest problem I faced in this module was time constraints. I had an extremely busy schedule and had to do most of the work in a very short period of time. However, I stayed motivated even though the module looked tougher than earlier ones, and I completed it before the deadline. I solved the technical problems I faced by using the mentor's guide or by looking on websites such as Stack Overflow for general solutions. I will continue trying to reach even greater accuracy by varying the data and attempting new techniques. One more thing I am slowly improving is my teamwork. Since I became a PM later in the program, I have needed to catch up and rise to the challenge despite a late start. I have been looking through the website and discussing with fellow PMs what could be done to maximize my potential as a project manager.