Name: Richa Singh Madnawat
Team: ML Team 8 (Bertinator)
Concise overview of things learned:
It has been a great learning experience, even at this early stage, on both the technical and the functional side.
1. Technical Area – Web scraping, data mining, data preprocessing, data cleaning, model building (tokenization, encoding, padding and truncating sentences, training and validation splits, optimizer and learning rate scheduler, training)
2. Tools – Scrapy, BeautifulSoup, Python (pandas), Excel, Google Colab
3. Soft Skills – Teamwork, timeline management, reporting, team cohesiveness, technical communication
Three achievement highlights:
- Working with a large dataset (53,000+ rows and 13 features).
- Scraped data from a website with infinite scrolling and concatenated all the data frames to prepare them for model building.
- Understanding of Machine Learning concepts (especially BERT, NLP techniques).
List of meetings/trainings attended, including social team events:
- Weekly Team Meetings (twice per week)
- Git Workshop
- Python Workshop
- Data scraping tutorial within the team
- BERT tutorial within the team
- Team-building activities within the usual meetings
Goals for the upcoming week:
- Prepare the data for BERT by tokenizing the sequences
- Train the forum classification model using the DistilBERT classification model
Detailed statement of tasks done:
- Scraped data from the talk.folksy.com forum, including main page content (id, title, created_at, last_posted_at, views, like_count, category_id) and post_page content (post_id, username, created_at, cooked, post_number, updated_at, reply_count, reply_post_num, reads, topic_id, user_id, topic_slug, forum).
- Cleaned the data using Excel and Python (pandas, regex functions) to remove text in other languages, tags, and unwanted data.
- Concatenated the data scraped by different teams and cleaned it for the next step.
- In terms of help, my questions about handling infinite scrolling were resolved by the technical and project leads of our team, who pointed me to various tutorials and offered tailored solutions to the problem.
- Imported the datasets into a pandas DataFrame and parsed them to prepare for the final data cleaning.
- Filtered out nulls, duplicates, hyperlinks, and foreign-language sentences as part of the data cleaning process.
- Loaded the BERT tokenizer, tokenized the sentences, mapped tokens to their IDs, padded and truncated the sentences, and created attention masks for the tokens.
- Split the data into training and validation sets using randomly selected samples.
- Loaded a pre-trained BERT model, added a single linear classification layer on top, and ran the model in PyTorch.
- Set the number of epochs and training steps and created a learning rate scheduler.
- Ran training and validation, achieving 96% accuracy.
- Plotted training and validation loss to assess the model's performance.
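
The concatenation and cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the exact code used: the function name `clean_posts`, the URL pattern, and the choice of `cooked` as the text column are assumptions, and foreign-language filtering is left out.

```python
import pandas as pd

# Assumed pattern for http(s) and www links; real forum posts may need more cases.
URL_PATTERN = r"(?:https?://|www\.)\S+"


def clean_posts(frames, text_col="cooked"):
    """Concatenate scraped frames, then drop nulls, hyperlinks, and duplicates."""
    df = pd.concat(frames, ignore_index=True)
    df = df.dropna(subset=[text_col])
    # Remove hyperlinks and collapse the whitespace they leave behind.
    cleaned = (
        df[text_col]
        .str.replace(URL_PATTERN, "", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    df = df.assign(**{text_col: cleaned})
    # Rows that contained only a link are now empty and carry no text.
    df = df[df[text_col] != ""]
    return df.drop_duplicates(subset=[text_col]).reset_index(drop=True)
```

Each element of `frames` would be one scraped data frame, so the same function covers both merging the separately scraped parts and the basic filters.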
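
To illustrate what padding, truncating, and creating attention masks do to a tokenized sentence, here is a small pure-Python sketch. In the project this was handled by the BERT tokenizer; the helper `pad_and_mask` and the default `pad_id=0` are assumptions for illustration, and unlike a real tokenizer this simple truncation does not keep the closing special token.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or pad one sequence of token IDs and build its attention mask."""
    ids = list(token_ids)[:max_len]   # truncate sequences that are too long
    mask = [1] * len(ids)             # 1 marks a real token
    padding = max_len - len(ids)
    ids += [pad_id] * padding         # pad short sequences on the right
    mask += [0] * padding             # 0 marks padding the model should ignore
    return ids, mask


# Illustrative token IDs only; real IDs come from the tokenizer's vocabulary.
ids, mask = pad_and_mask([101, 7592, 102], max_len=5)
```

Every sentence ends up with the same length, and the mask tells the model which positions hold actual tokens.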
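
The random training/validation split can be sketched in a few lines. The 10% validation share, the fixed seed, and the helper name `train_val_split` are assumptions for illustration; the report does not state the actual ratio.

```python
import random


def train_val_split(samples, val_fraction=0.1, seed=42):
    """Shuffle indices with a fixed seed, then cut off a validation slice."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    val_size = int(len(samples) * val_fraction)
    val = [samples[i] for i in indices[:val_size]]
    train = [samples[i] for i in indices[val_size:]]
    return train, val
```

Fixing the seed makes the split reproducible, so repeated runs evaluate on the same validation samples.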
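
The relationship between epochs, training steps, and the learning rate scheduler can be shown with a small sketch. It assumes a linear schedule with optional warmup, which is a common choice for BERT fine-tuning; the report does not say which scheduler was used, and the function names and example numbers below are hypothetical.

```python
def total_training_steps(num_batches, epochs):
    """One optimizer step per batch, repeated for every epoch."""
    return num_batches * epochs


def linear_schedule_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical numbers: 100 batches per epoch, 4 epochs, base rate 2e-5.
steps = total_training_steps(100, 4)
halfway_lr = linear_schedule_lr(steps // 2, steps, base_lr=2e-5)
```

With no warmup the rate starts at the base value and falls in a straight line, reaching half the base value midway through training and zero at the final step.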