OVERVIEW
Technical Area
I learned to use the Python library Scrapy to retrieve and parse data from 221 pages of the CodeCombat forum. While working with forum data and infinite scrolling, I became comfortable with CSS Locator syntax and with chaining CSS Locators together with XPath. My partner initially used the Selenium framework and the BeautifulSoup library, and I became familiar with those as well.
Ultimately, we learned that the simple approach is sometimes best, and I learned to use API calls to handle infinite scrolling across multiple pages of a forum.
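As a rough sketch of the API approach: instead of simulating scrolling in a browser, the crawler requests numbered pages until one comes back empty. The JSON shape (`topic_list` / `topics`) is an assumption modeled on Discourse-style forums, and `iter_topics`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names used only for illustration. No network call is made here.

```python
from typing import Callable, Dict, Iterator

def iter_topics(fetch_page: Callable[[int], Dict], max_pages: int = 500) -> Iterator[Dict]:
    """Yield topics page by page until a page comes back empty."""
    for page in range(max_pages):
        payload = fetch_page(page)
        # Assumed response shape: {"topic_list": {"topics": [...]}}
        topics = payload.get("topic_list", {}).get("topics", [])
        if not topics:
            return  # an empty page means there is nothing left to "scroll"
        yield from topics

# Stand-in data so the sketch runs offline (hypothetical content).
FAKE_PAGES = [
    {"topic_list": {"topics": [{"id": 1, "title": "Level help"},
                               {"id": 2, "title": "Bug report"}]}},
    {"topic_list": {"topics": [{"id": 3, "title": "Feature idea"}]}},
]

def fake_fetch(page: int) -> Dict:
    """Return a canned page, or an empty page once the data runs out."""
    if page < len(FAKE_PAGES):
        return FAKE_PAGES[page]
    return {"topic_list": {"topics": []}}

titles = [topic["title"] for topic in iter_topics(fake_fetch)]
print(titles)
```

In a real crawler, `fetch_page` would issue an HTTP GET for the page and decode the JSON body; the exact URL pattern depends on the forum and is not shown here.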
I used the Pandas library, Jupyter Notebook, and regular expressions to clean and analyze 40,000 lines of data.
Tools:
Python, Pandas, Jupyter Notebook, Scrapy, Git, Colab, PyCharm, Regular Expressions
Soft Skills:
Worked and coordinated with team members in different time zones. Helped communicate the lessons my partner and I learned to multiple team members.
Achievement highlights:
- Used Pandas and Jupyter Notebook for the first time, and cleaned over 40,000 lines of data using a combination of Pandas and Regex.
- Created my first webcrawler using the Python library Scrapy and scraped 221 pages of a forum.
- Created my first tokens using the pretrained DistilBERT model.
List of meetings/training attended, including social team events:
Team Meetings
- All team meetings for Group 8
- Small group session with Charlene and Pratik for BERT training
- Two team meetings with Priya and Trang
- One team meeting with Trang
- Small group session with Charlene and team for BERT/project questions.
Python Training
hosted by Gorbal and Shreyas Pasumarthi
- Session 1 - Advanced Level (Intro to handling the dataset)
- Session 2 - Beginner Level (Learning the basics of Python to hopefully bring you up to an advanced level)
- Session 3 - Advanced Level (Answering any questions, etc.)
Git Training
- Git webinar for all ML teams (Hosted by industry mentor)
Navigating STEM-Away 101
- I hosted STEM-Away website navigation training with Shreyas Pasumarthi
Data-mining
- Technical Deep Dive - Data Mining hosted by Maleeha Imran
- STEMCasts - Overview of ML and project hosted by Kunal Singh
- Technical Deep Dive - Recommendation Models hosted by Sara EL-ATEIF
Goals for the upcoming week:
Learn more about scikit-learn and PyTorch. Use DistilBERT to get a working forum classification model up and running.
Detailed statement of tasks done:
Webscraping
Created my first webcrawler using the Python library Scrapy. I went down a rabbit hole and over-complicated this by learning all about CSS Locators with XPath. Maleeha's resources helped my partner rein me in and focus on the task at hand. I came across errors while scraping and noticed that my crawler was not running as fast as my partner's. Thanks to my partner Trang and Charlene, I became more familiar with the errors. I learned which installs my computer needed to work better with Python, and I became more comfortable working in a Colab notebook.
Cleaning Data
Used the Pandas library, Jupyter Notebook, and regular expressions to clean and analyze 100,000 lines of data. I learned that we only needed the posts, and reduced the data to approximately 40,000 lines to meet the project requirements. The Advanced Level training session (Intro to handling the dataset) helped me realize how simple it was to get going with Pandas and Jupyter Notebook. Thanks to my partner's code review, I learned to follow best practices that are widely accepted in the Python community.
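A minimal sketch of this kind of cleaning, assuming a hypothetical table with a `kind` column that marks which rows are posts and a `text` column holding raw HTML. The column names, sample rows, and the specific regular expressions are illustrative assumptions, not the project's actual code.

```python
import pandas as pd

# Hypothetical scraped rows; only rows marked "post" are needed.
raw = pd.DataFrame({
    "kind": ["post", "header", "post", "post"],
    "text": [
        "<p>Hello   world!</p>",
        "Topic title",
        "<p>See https://example.com for help</p>",
        "   ",
    ],
})

# Keep only the posts, then clean the text with regular expressions.
posts = raw[raw["kind"] == "post"].copy()
posts["text"] = (
    posts["text"]
    .str.replace(r"<[^>]+>", " ", regex=True)       # strip HTML tags
    .str.replace(r"https?://\S+", " ", regex=True)  # drop URLs
    .str.replace(r"\s+", " ", regex=True)           # collapse whitespace
    .str.strip()
)

# Drop rows that are empty after cleaning.
posts = posts[posts["text"] != ""].reset_index(drop=True)
cleaned = posts["text"].tolist()
print(cleaned)
```

Tags are stripped before URLs so that a closing tag directly after a link is not swallowed by the URL pattern.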
Understanding BERT
I am testing out DistilBERT and have begun tokenizing the data. I am now learning about padding and masking. I believe the major challenge I will face this week is learning more about scikit-learn and PyTorch.
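To make padding and masking concrete, here is a small pure-Python sketch of the idea, with no tokenizer library involved. The token ids are illustrative values rather than real tokenizer output, and using 0 as the padding id is an assumption; `pad_and_mask` is a hypothetical helper name.

```python
from typing import List, Tuple

PAD_ID = 0  # assumed padding id; check the tokenizer actually in use

def pad_and_mask(batch: List[List[int]], pad_id: int = PAD_ID) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in batch)
    # Padding: append pad_id so all sequences share one length.
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # Masking: 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, mask

# Illustrative token ids for two sentences of different lengths.
token_ids = [[101, 7592, 2088, 102], [101, 7592, 102]]
padded, mask = pad_and_mask(token_ids)
print(padded)
print(mask)
```

Padding makes the batch rectangular so it can be turned into a tensor, and the mask tells the model which positions are filler.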