I served as the Technical Lead for ML Team 5 during the June Session. It was a great pleasure to get to know everyone on Team 5, and it was my honor to lead them. We did a great job on the recommendation system for submitted forum topics: we developed three different recommendation models to recommend the best-matched topics and built a forum classifier that categorizes posts using TF-IDF. I learned about web crawling, webpage parsing, and NLP, and gained first-hand experience with BERT embeddings and TF-IDF encoding. Beyond the technical experience, I also learned a great deal about team management and leadership from others.
Developed a web crawler to fetch webpages from Hopscotch, a forum built on Discourse; parsed the topics using BeautifulSoup and stored them. Learned about the Scrapy package from the resources provided.
Developed three recommendation systems based on BERT embeddings, TF-IDF encoding, and a combination of BERT embeddings with TF-IDF encoding. Tuned the models and analyzed the performance of each.
Collaborated with other participants using Git and learned the full Git workflow. Used Google Colab to run crawler scripts online and learned to make use of the built-in GPU resources to accelerate computation.
Led a subgroup to complete the development of the above recommendation systems.
Attended webinars held by the industry mentor on collaboration with Git, data mining, and recommendation systems. Also attended team meetings about three times a week, which improved my social and communication skills.
In the stage of collecting metadata and constructing a dataset for the forum, we applied Scrapy to recursively crawl the topic list, but found that the forum used auto-loading to fetch the next page's content, while the URL of every page was well defined. Therefore, we could iteratively crawl further pages by requesting the successive URLs, terminating the loop as soon as a page's content came back empty. Once we had the metadata, we used BeautifulSoup to parse the HTML responses. For the future needs of the recommendation system, we extracted only the title and URL of each topic.
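The pagination loop described above can be sketched as follows. This is a minimal illustration, not our exact crawler: the base URL and the `a.title` CSS selector are assumptions standing in for the real Discourse topic-list page and markup.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://forum.example.com/latest"  # hypothetical topic-list URL


def parse_topics(html):
    """Extract topic titles and URLs from one topic-list page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: each topic title is rendered as an <a class="title"> link.
    return [{"title": a.get_text(strip=True), "url": a["href"]}
            for a in soup.select("a.title")]


def crawl_topics(fetch=lambda page: requests.get(
        BASE_URL, params={"page": page}, timeout=10).text):
    """Request successive well-defined page URLs until an empty page
    terminates the loop."""
    topics, page = [], 0
    while True:
        batch = parse_topics(fetch(page))
        if not batch:      # empty page -> no further topics, stop crawling
            break
        topics.extend(batch)
        page += 1
    return topics
```

Passing `fetch` as a parameter keeps the pagination logic testable without network access; in production the default `requests.get` call is used.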
First, we applied BERT embeddings and TF-IDF encoding separately to encode the topics and submissions, and calculated cosine similarity to rank relevance to the queried topic. However, the results could not perfectly match the recommendations we wanted, due to the individual limitations of BERT and TF-IDF. Therefore, we combined the two by taking a weighted average of their cosine similarities. To avoid subjective judgment, we introduced Jaccard similarity to compare the performance of the three models, and found that the combined approach produced the most reasonable recommendations.
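The weighted combination can be sketched as below, assuming the BERT embedding and TF-IDF vector for each document have already been computed; the weight `alpha` is a hypothetical value standing in for the one we tuned. A small Jaccard helper of the kind we used for evaluation is included as well.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def combined_score(query_bert, topic_bert, query_tfidf, topic_tfidf, alpha=0.6):
    """Weighted average of the BERT and TF-IDF cosine similarities.

    alpha is a hypothetical weight; in practice it would be tuned by
    comparing recommendations against held-out queries.
    """
    return (alpha * cosine_sim(query_bert, topic_bert)
            + (1 - alpha) * cosine_sim(query_tfidf, topic_tfidf))


def jaccard(set_a, set_b):
    """Jaccard similarity between two recommendation sets:
    |intersection| / |union|."""
    return len(set_a & set_b) / len(set_a | set_b)
```

To produce recommendations, `combined_score` is computed between the query and every candidate topic, and the top-k topics by score are returned; `jaccard` then compares each model's top-k set against a reference set.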