Badhri_Narayanan_Sur - Machine Learning Pathway

Self Assessment - 6/16

Things Learned

• Learned web crawling, scraping, and parsing web page content for useful information.
• Learned to use Visual Studio Code and additional Git functionality from the Git webinar by Collin.
• Apart from the technical material in the webinar, I also learned, to some extent, how to communicate and teach complex ideas and concepts to a group so that they reach everyone.
• Although I have experience with deep learning in computer vision, NLP was relatively new to me, and over the past week I learned about Transformer neural networks and BERT.

Highlights
• Built a web crawler that iterates through the pages of the Hopscotch forum.
• Scraped about 4900 web pages using the BeautifulSoup package and stored them in a pickle file.
• Modified the scraper code to skip non-responding URLs by using try/except blocks.
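
The skip-on-failure idea in the last highlight can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: the function names (`fetch_html`, `scrape_all`, `save_pages`) are hypothetical, and the use of the standard-library `urllib.request` for downloading is an assumption, since the notes do not say which HTTP client was used.

```python
import pickle
import urllib.request


def fetch_html(url, timeout=10):
    """Download one page and return its HTML as text (assumed HTTP client)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(urls, fetch=fetch_html):
    """Fetch every URL, skipping the ones that fail instead of stopping."""
    pages = {}
    skipped = []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as error:
            # Broad on purpose: one non-responding URL should not abort the run.
            skipped.append((url, repr(error)))
    return pages, skipped


def save_pages(pages, path):
    """Store the scraped pages in a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(pages, handle)
```

Returning the skipped URLs alongside the pages makes it possible to retry them later or at least count how many were lost.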

Meetings
• Attended the Git webinar by the industry mentor. Also attended team meetings about 3 times a week for the last 2 weeks to discuss further work and questions about the assigned work.
Goals for upcoming week
• Use pre-trained BERT or other transformer models to build a basic recommendation system for the chosen forum and evaluate its performance.

Tasks Done
• By inspecting the forum's main page and its network traffic, we identified the correct URL for accessing further pages. With that, we could iteratively crawl successive pages and scrape the data from each one. Scraping was done with the BeautifulSoup package in Python, which helps in parsing HTML pages.
• For the crawler and scraper, the initial briefing by the project lead was useful, as it gave us the basic information needed to continue the work. Apart from a few simple errors, building the crawler and scraper was not very difficult.
• Using Visual Studio Code was helpful. Although I already knew a little about Git, the webinar was really useful and I learned a lot of new concepts.
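
The page-by-page crawl described above might look roughly like the sketch below. It is an illustration under stated assumptions: the URL template, the stop condition (an empty page ends the crawl), and the names `LinkCollector`, `extract_links`, and `crawl_pages` are all hypothetical, and the standard-library `html.parser` stands in for BeautifulSoup so the example has no third-party dependencies.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return every link target found in one HTML page."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def crawl_pages(page_url_template, fetch, max_pages=1000):
    """Request page 0, 1, 2, ... and stop at the first page with no links."""
    all_links = []
    for page_number in range(max_pages):
        html = fetch(page_url_template.format(page=page_number))
        links = extract_links(html)
        if not links:
            break  # assumed stop condition: an empty page means no more data
        all_links.extend(links)
    return all_links
```

With BeautifulSoup, the link extraction step would typically be written as `[a["href"] for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)]`.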

Final Self Assessment

Things Learned

• Learned web crawling, scraping, and parsing web page content for useful information.
• Learned to use Visual Studio Code and additional Git functionality from the Git webinar by Collin.
• Apart from the technical material in the webinar, I also learned, to some extent, how to communicate and teach complex ideas and concepts to a group so that they reach everyone.
• Although I have experience with deep learning in computer vision, NLP was relatively new to me, and I learned about Transformer neural networks and BERT.
• My work focused on the recommendation model. The approaches we used required me to learn about BERT and TF-IDF embeddings. I learned how the sklearn TF-IDF vectorizer transforms text data into vectors that can be used for recommendation.
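
To make the vectorizing step concrete, here is a small pure-Python sketch of TF-IDF. It is a simplified stand-in written for illustration, not the sklearn implementation: it follows the smoothed IDF formula and L2 normalisation that sklearn documents as its defaults, but the tokenizer and the function names (`tokenize`, `tfidf_vectors`) are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one L2-normalised vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so rarer terms weigh more.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors
```

In sklearn itself, the corresponding call is `TfidfVectorizer().fit_transform(documents)`, which returns a sparse matrix rather than Python lists.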

Highlights

• Built a web crawler that iterates through the pages of the Hopscotch forum.

• Scraped about 4900 web pages using the BeautifulSoup package and stored them in a pickle file.

• Modified the scraper code to skip non-responding URLs by using try/except blocks.

• Built a recommendation system based on BERT embeddings and TF-IDF embeddings, and also tried an approach combining the two.

Meetings

• Attended the Git webinar by the industry mentor. Also attended team meetings about 3 times a week for the last 4 weeks to discuss further work and questions about the assigned work.

Tasks Done

• By inspecting the forum's main page and its network traffic, we identified the correct URL for accessing further pages. With that, we could iteratively crawl successive pages and scrape the data from each one. Scraping was done with the BeautifulSoup package in Python, which helps in parsing HTML pages.
• For the crawler and scraper, the initial briefing by the project lead was useful, as it gave us the basic information needed to continue the work. Apart from a few simple errors, building the crawler and scraper was not very difficult.

• After obtaining both the BERT and TF-IDF embeddings, we used cosine similarity to rank the forum topics and return those most similar to the input query topic. The results of the recommendation model indicated that the approach combining TF-IDF and BERT embeddings outperformed the approaches using BERT or TF-IDF alone. Since recommendations can't be evaluated objectively, we used Jaccard similarity to compare the input query topic with the output recommendations.
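
The ranking and evaluation steps can be sketched as below. This is a minimal illustration, not the project's code: the function names are hypothetical, embeddings are assumed to be plain lists of floats, and computing Jaccard similarity over lowercase word sets is an assumption, since the notes do not say how the topics were tokenized for comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def recommend(query_vector, topic_vectors, top_k=3):
    """Return the indices of the top_k topics most similar to the query."""
    scores = [cosine_similarity(query_vector, v) for v in topic_vectors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


def jaccard_similarity(text_a, text_b):
    """Shared words divided by all distinct words across both texts."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

For example, `jaccard_similarity("draw a circle", "draw a square")` gives 0.5, because two of the four distinct words are shared.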