Name: Vishaal Ranjan
Team: ML Team 8 (Bertinator)
Overview of Things Learned:
Technical Area: I learned how to create web crawlers from scratch using Scrapy and BeautifulSoup. I also learned about the API calls that websites with infinite scrolling make to load more content, and how to scrape the data from such pages. In addition, I learned how to clean and manipulate data with the Pandas library. Currently, I am learning more about the BERT model that our team will use by going through various tutorials, articles and videos on the topic.
Tools Used: Python, Pandas, Scrapy, BeautifulSoup, Git, Colab, Visual Studio Code
Soft Skills: Collaborating with teammates across different time zones, and solving technical issues by discussing them with the team and task leads until we reached a solution
- Successfully scraping the airline forum with Scrapy (36,000+ entries and 13 features)
- Successfully joining all the dataframes together after cleaning the data and handling the null values
- Getting a better understanding of the BERT model after going through the resources given by our Team Lead
List of Meetings attended
Attended all the team meetings that have been organized so far (2 meetings/week)
Scraping with BeautifulSoup and Scrapy Webinar
Recommender Systems Webinar
Short meeting with Technical Lead Maleeha
Goals for the Upcoming Week
Finish the given task, i.e., to prepare the data for the BERT model and train the forum classification model to achieve the highest possible accuracy
Scraped the data from the airline forum in the required format. This was quite challenging as it was the first time that I had built a web crawler. I had trouble handling the infinite scrolling of the website. I got help from my teammates as well as our leads, who gave us some useful resources to improve our understanding of the topic.
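
To illustrate the infinite-scrolling part, here is a minimal sketch of the general idea, not the crawler I actually wrote. It assumes the site loads more posts through a paginated API call that returns an empty list once there is nothing left. The names `iter_posts`, `fetch_page` and `fake_fetch` are hypothetical, and `fake_fetch` only stands in for the real request so the example runs offline.

```python
from typing import Callable, Dict, Iterator, List


def iter_posts(
    fetch_page: Callable[[int], List[Dict]], max_pages: int = 1000
) -> Iterator[Dict]:
    """Yield posts page by page until the endpoint returns an empty page."""
    for page in range(1, max_pages + 1):
        posts = fetch_page(page)
        if not posts:  # an empty page is assumed to mean "no more content"
            break
        yield from posts


def fake_fetch(page: int) -> List[Dict]:
    """Hypothetical stand-in for the HTTP call the page makes while scrolling."""
    data = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    return data.get(page, [])


posts = list(iter_posts(fake_fetch))
```

In a real crawler, `fetch_page` would wrap the request to the endpoint found in the browser's network tab. The `max_pages` limit is only a safety stop in case the endpoint never returns an empty page.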
Cleaned the data to get rid of the HTML tags as well as some unnecessary characters. I learned how to use regular expressions and pandas to bring the data into the required format. I interacted with several team members regarding this task and all of them gave valuable input.
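
The cleaning step can be sketched roughly as below. This is a simplified example under my own assumptions, not the exact code used: the column names `post` and `post_clean`, the helper `clean_text` and the sample rows are all made up for illustration, and a simple regular expression like this one only handles plain tags, not every possible HTML input.

```python
import re

import pandas as pd

TAG_RE = re.compile(r"<[^>]+>")  # matches simple HTML tags such as <p> or <br/>
WS_RE = re.compile(r"\s+")       # matches runs of spaces, tabs and newlines


def clean_text(raw: str) -> str:
    """Replace HTML tags with a space, then collapse repeated whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    return WS_RE.sub(" ", no_tags).strip()


# Made-up sample rows standing in for scraped forum posts
df = pd.DataFrame({"post": ["<p>Flight was <b>late</b></p>", "Great\n\n crew<br/>"]})
df["post_clean"] = df["post"].map(clean_text)
```

Tags are replaced with a space rather than removed outright so that words separated only by a tag do not get glued together.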
Concatenated the datasets of the 5 forums scraped by our team. I had to do the data cleaning for a few of the dataframes and rename their columns so that all the dataframes had the same column names in the same order.
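
A small sketch of the renaming and concatenation step is shown below. It uses two tiny made-up dataframes instead of the 5 real forum datasets, and the column names (`title`, `author`, `post`, `text`, `user`) are assumptions chosen only for the example.

```python
import pandas as pd

# Hypothetical shared schema that every forum dataframe should follow
COLUMNS = ["title", "author", "post"]

forum_a = pd.DataFrame(
    {"title": ["Delay"], "author": ["amy"], "post": ["Late again"]}
)
forum_b = pd.DataFrame(
    {"text": ["Lost bag"], "user": ["bob"], "title": ["Baggage"]}
)

# Rename the columns that differ so both dataframes use the same names
forum_b = forum_b.rename(columns={"text": "post", "user": "author"})

# Select the columns in one fixed order, then stack the rows
frames = [frame[COLUMNS] for frame in (forum_a, forum_b)]
combined = pd.concat(frames, ignore_index=True)
```

`pd.concat` aligns columns by name, so the explicit `frame[COLUMNS]` selection is there mainly to fix the column order and to fail early if a dataframe is missing an expected column.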