Atrelex - Machine Learning Pathway

Week: 7/27

Overview of Things Learned:

  • Technical Area: Web scraping, Data cleaning, Data preprocessing
  • Tools: BeautifulSoup, Selenium, Scrapy, Requests, Pandas, Regex, Google Colab

Achievement Highlights

  • Scraped data from more than 30,000 posts
  • Scraped JSON data using Scrapy
  • Learned about data cleaning and preprocessing

Meetings attended

  • 7/23 - ML team kick-off meeting
  • 7/31 - Web Scraping and Preprocessing Presentations

Goals for the Upcoming Week

  • Complete Preprocessing of the scraped data
  • Learn about TF-IDF

Tasks Done

  • Web scraping: Built a web crawler with Scrapy and used it to scrape data from JSON files. Stored the results in a CSV file, loaded them into a dataframe, and cleaned the data.
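The JSON-to-CSV step described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration rather than the crawler itself: the field names in `FIELDS`, the top-level `"posts"` key, and the sample payload are all hypothetical, since the log does not describe the JSON layout. In a Scrapy spider, a function like `posts_to_rows` could be called on `response.text` inside the parse callback.

```python
import csv
import io
import json

# Hypothetical field names; the log does not say which fields were kept.
FIELDS = ["id", "title", "body"]


def posts_to_rows(json_text):
    """Parse a JSON document and return one flat dict per post."""
    payload = json.loads(json_text)
    # Assumption: posts live in a top-level "posts" list.
    posts = payload.get("posts", [])
    # Missing fields become None so every row has the same columns.
    return [{field: post.get(field) for field in FIELDS} for post in posts]


def write_csv(rows, handle):
    """Write the rows to an open text handle as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample payload, for illustration only.
sample = (
    '{"posts": ['
    '{"id": 1, "title": "Hello", "body": "<p>Hi</p>"}, '
    '{"id": 2, "title": "No body"}'
    ']}'
)
rows = posts_to_rows(sample)

# StringIO stands in for open("posts.csv", "w", newline="").
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

Keeping the parsing in a plain function makes it easy to check on a small sample before running a full crawl.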

Week: 8/3

Overview of Things Learned:

  • Technical Area: Data preprocessing, TF-IDF
  • Tools: BeautifulSoup, Requests, Pandas, Regex, Scikit-learn, NLTK

Achievement Highlights

  • Preprocessed scraped data using libraries like markdown and regex
  • Learned about TF-IDF

Meetings attended

  • 7/27 - Introduction to Web Scraping and this Week’s Deliverables
  • 7/29 - Web Scraping Check-In Meeting and Intro to Preprocessing
  • 7/31 - Web Scraping and Preprocessing Presentations

Goals for the Upcoming Week

  • Complete TF-IDF
  • Learn about BERT

Tasks Done

  • Preprocessing: Removed HTML markup, symbols, and repeated spaces from the text in the scraped data. Dropped entries with null values to keep only the useful data.
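A rough sketch of the cleaning steps listed above, using only the Python standard library. The helper names (`strip_html`, `clean_text`, `clean_posts`) are illustrative, and the definition of a "symbol" as anything other than a letter, digit, or whitespace is an assumption; the log does not say which characters were removed or which HTML parser was used.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return the text content of an HTML fragment, with tags dropped."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)


def clean_text(text):
    """Remove HTML markup, symbols, and repeated whitespace."""
    text = strip_html(text)
    # Assumption: "symbols" means anything that is not a letter, digit, or whitespace.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


def clean_posts(posts):
    """Drop null entries, clean the rest, and drop any that end up empty."""
    cleaned = (clean_text(p) for p in posts if p is not None)
    return [p for p in cleaned if p]
```

For example, `clean_posts(["<div>A &amp; B</div>", None, "ok"])` keeps two entries and discards the null one.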

Week: 8/7

Overview of Things Learned:

  • Technical Area: TF-IDF
  • Tools: Scikit-learn, PyTorch, Transformers library

Achievement Highlights

  • Removed stop words from the preprocessed data
  • Used Scikit-learn's TfidfVectorizer to create the TF-IDF matrix

Meetings attended

  • 8/12 - Check-in about implementing the BERT Model
  • 8/14 - Present BERT Model implementations

Goals for the Upcoming Week

  • Complete BERT model implementation

Tasks Done

  • TF-IDF: Created the TF-IDF matrix for the text in the posts after removing stop words and other values that were not useful.
  • BERT: Filtered noise out of the dataset, including emojis and hyperlinks.
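The two tasks above can be illustrated with a small hand-written sketch. It is not the Scikit-learn code used in the project: it is a plain-Python version of the same kind of weighting, following the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 and the L2 row normalization that Scikit-learn documents as TfidfVectorizer defaults. The stop-word list, the emoji ranges, and the function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; a real run would use a fuller one, such as NLTK's.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough, non-exhaustive emoji ranges (an assumption, not a complete definition).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def remove_noise(text):
    """Strip hyperlinks and (most) emojis, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] is the weight of term j in doc i."""
    tokenized = [tokenize(remove_noise(d)) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocab]
        # L2-normalize each row so documents of different lengths are comparable.
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm if norm else 0.0 for w in row])
    return vocab, rows
```

Calling `tfidf_matrix` on a list of post strings returns the sorted vocabulary and one weight row per post; terms shared by every post receive the lowest IDF.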