Saiparsa - Machine Learning Pathway

Things Learned:

Understanding the structure of a web page, inspecting web elements to understand how data is distributed across HTML tags, making HTTP requests, and parsing HTML responses. Working with the Beautiful Soup and Scrapy libraries to build web scrapers and crawlers for Discourse forums. Learned to work with text data and the basics of attention-based models, transformer networks, and BERT.
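As a minimal sketch of that request-and-parse workflow: fetch a page, parse it, and pull topic links out of the HTML. The URL and the a.title selector are hypothetical placeholders, not the actual forum's markup, so they would need to be adjusted after inspecting the real page.

```python
import requests
from bs4 import BeautifulSoup

url = "https://forum.example.com/latest"  # placeholder forum URL
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Topic titles on many Discourse pages sit inside <a> tags with a
# "title" class; adjust the selector to match the inspected markup.
for link in soup.select("a.title"):
    print(link.get_text(strip=True), "->", link.get("href"))
```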

Learned the basics of version control with Git and its functionality within VS Code, as well as the basics of coding in PyTorch.

Learned effective ways of communicating an idea, collaborating with people from varied backgrounds, the importance of considering different opinions, and reaching common ground.

Highlights:

  1. Built a web crawler to traverse the pages of a Discourse forum of my choice (Choice Community Forum).
  2. Scraped data from various web pages and stored it as pickle files.
  3. Submitted all bi-weekly reports on time.

Meetings

Attended Colin’s webinar on Git and the bi-weekly team meetings.

Goals for the upcoming week

Working with pre-trained BERT models and developing a basic recommendation system.

Tasks done

Scraped the main page of the forum to get the slugs and IDs of its topics. Crawled through the forum pages via the URLs built from those slugs and IDs, scraping the data in as clean a format as possible with a scraper written in Python using the Beautiful Soup library; a sketch of this crawl appears below. The project lead's guidance on inspecting web elements was helpful in developing the scraper and crawler, and Colin's webinar was instrumental in understanding Git and the importance of version control.
I was comfortable with the pace and faced no challenges beyond minor issues while working with the web pages.
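A hedged sketch of that crawl, under the assumption that topic links follow Discourse's usual /t/slug/id pattern and that post bodies sit in div elements with a "post" class; the real selectors came from inspecting the forum and may well differ.

```python
import pickle
import requests
from bs4 import BeautifulSoup

BASE = "https://forum.example.com"  # placeholder forum URL

def get_topic_urls(listing_url):
    soup = BeautifulSoup(requests.get(listing_url, timeout=10).text, "html.parser")
    # Discourse topic links typically look like /t/<slug>/<id>.
    return [BASE + a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("/t/")]

def scrape_topic(topic_url):
    soup = BeautifulSoup(requests.get(topic_url, timeout=10).text, "html.parser")
    # Post bodies are assumed to live in <div class="post"> elements.
    return [div.get_text(" ", strip=True)
            for div in soup.find_all("div", class_="post")]

pages = {url: scrape_topic(url) for url in get_topic_urls(BASE + "/latest")}

# Persist the scraped data as a pickle file, as described in the report.
with open("forum_posts.pkl", "wb") as f:
    pickle.dump(pages, f)
```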

Final Self-Assessment:
Things Learned:

  • Understanding the structure of a web page, inspecting web elements to understand how data is distributed across HTML tags, making HTTP requests, and parsing HTML responses. Working with the Beautiful Soup and Scrapy libraries to build web scrapers and crawlers for Discourse forums. Learned to work with text data and the basics of attention-based models, transformer networks, and BERT.

  • Learned the basics of version control with Git and its functionality within VS Code, as well as the basics of coding in PyTorch.

  • I had the opportunity to work on the recommendation model and to develop machine learning models that identify the category of a given post. This involved writing scripts to apply a pre-trained BERT model to posts from various forums to generate embeddings (a minimal embedding script is sketched below). To choose the right model for the category classification I went through a few papers and studied how models such as XGBoost work, and to evaluate CNN-based classifiers on BERT embeddings I learned how to apply a CNN model to one-dimensional data.
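A minimal embedding script in the spirit of the one described above, assuming the Hugging Face transformers library and mean pooling over the last hidden states; the report does not name the exact library or pooling strategy, so both are assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(posts):
    # Tokenize a batch of posts, truncating to BERT's 512-token limit.
    inputs = tokenizer(posts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden states into one vector per post.
    return outputs.last_hidden_state.mean(dim=1)

embeddings = embed(["How do I reset my password?", "Payment failed at checkout"])
print(embeddings.shape)  # (2, 768) for bert-base
```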

Highlights:

  1. Built a web scraper to scrape text data and metadata from various forums.
  2. Developed scripts to generate BERT embeddings for the posts of a given forum using pre-trained BERT models.
  3. Developed a cosine-similarity-based recommendation model to recommend the top 10 relevant posts (see the sketch after this list).
  4. Worked as part of a subgroup to develop a business report on the Bank of New Zealand Forum and to identify the pain points of the forum to be addressed.
  5. Performed data preprocessing and developed machine learning models (SVM, CNN, neural networks, XGBoost) to identify the category of a given post, then evaluated and compared their performance on the One Hack forum.
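The cosine-similarity step from highlight 3 can be sketched as follows; scikit-learn's cosine_similarity is one convenient implementation, though the report does not specify which was actually used, and the random vectors stand in for real BERT embeddings.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(query_embedding, post_embeddings, top_k=10):
    # query_embedding: shape (dim,); post_embeddings: shape (n_posts, dim).
    scores = cosine_similarity(query_embedding.reshape(1, -1), post_embeddings)[0]
    # Indices of the top_k most similar posts, highest score first.
    return np.argsort(scores)[::-1][:top_k]

# Example with random vectors standing in for real BERT embeddings.
rng = np.random.default_rng(0)
posts = rng.normal(size=(100, 768))
query = rng.normal(size=768)
print(recommend(query, posts))
```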

Meetings

Attended Colin’s webinars on Git and NLP, the bi-weekly team meetings, and subgroup meetings.

Tasks done

  • Scraped the main page of the forum to get the slugs and IDs of its topics, then crawled through the forum pages via the resulting URLs, scraping the data in as clean a format as possible with a scraper written in Python using the Beautiful Soup library. The project lead's guidance on inspecting web elements was helpful in developing the scraper and crawler, and Colin's webinar was instrumental in understanding Git and the importance of version control.

  • Developed scripts to generate BERT embeddings for the posts of a given forum using pre-trained BERT models, and developed a cosine-similarity-based recommendation model to recommend the top 10 relevant posts.

  • Worked as part of a subgroup to develop a business report on the Bank of New Zealand Forum and to identify the pain points of the forum to be addressed.

  • Performed data preprocessing and developed machine learning models (SVM, CNN, neural networks, XGBoost) to identify the category of a given post, then evaluated and compared their performance on the One Hack forum (a sketch follows below).
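As an illustration of that classification experiment, here is a hedged sketch using XGBoost only (the report also compares SVM, CNN, and neural-network classifiers); the random matrices stand in for real BERT embeddings and category labels, and the hyperparameters are placeholders rather than the values used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Random placeholders standing in for real embeddings and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))        # one 768-dim BERT embedding per post
y = rng.integers(0, 5, size=500)       # e.g. 5 forum categories

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```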