First Self Assessment - Week 3
Things I Have Learned
Technical: I’ve learned how to infinitely scroll a site and scrape specific data, then clean the extracted text for further processing with NLP models. I have also gained knowledge of various NLP methods such as TF-IDF, Bag of Words, and BERT.
Tools: I’ve gained first-hand experience with Python/ML libraries such as BeautifulSoup and Selenium for scraping data and implementing infinite scrolling.
Soft Skills: Working in a team virtually is a new experience for me; it is very different from working with a team in person, and it has forced me to set independent goals and organize my time to ensure I complete the required tasks before the next check-in meeting.
Using BeautifulSoup, I scraped the Atom Discourse site to extract individual topic titles, comments, and tags, then organized the data points in a CSV file.
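As a rough illustration of the BeautifulSoup step, here is a minimal sketch that parses a topic list and writes titles and tags to a CSV file. The HTML snippet and the class names (`topic-list-item`, `title`, `discourse-tag`) are stand-ins modeled on Discourse's markup and may not match the live Atom forum exactly.

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical fragment of a Discourse topic-list page; the real class
# names on the Atom forum may differ.
html = """
<table class="topic-list">
  <tr class="topic-list-item">
    <td><a class="title">How do I change the theme?</a>
        <div class="discourse-tags"><a class="discourse-tag">themes</a></div></td>
  </tr>
  <tr class="topic-list-item">
    <td><a class="title">Package install fails behind proxy</a>
        <div class="discourse-tags"><a class="discourse-tag">packages</a></div></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.select("tr.topic-list-item"):
    # Pull the topic title and any tags out of each row.
    title = item.select_one("a.title").get_text(strip=True)
    tags = [t.get_text(strip=True) for t in item.select("a.discourse-tag")]
    rows.append({"title": title, "tags": ",".join(tags)})

# Save the data points to a CSV file for later cleaning.
with open("topics.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "tags"])
    writer.writeheader()
    writer.writerows(rows)
```

On the real site the HTML would come from the Selenium-rendered page source rather than a hard-coded string.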
I used Selenium to infinitely scroll the homepage of the Atom forum, making the process more automated. I collected ~900 data points, but the approach can easily be scaled to much larger volumes of data.
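The infinite-scrolling idea can be sketched as a small helper that keeps scrolling until the page height stops growing, meaning no more content is being lazy-loaded. `scroll_to_bottom` and its parameters are hypothetical names; in practice `driver` would be a real Selenium webdriver instance (e.g. `webdriver.Chrome()`).

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom of the page repeatedly until its height
    stops growing, then return the fully loaded page source."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; we have reached the end
        last_height = new_height
    return driver.page_source
```

Typical use would be `driver.get(forum_url)` followed by `html = scroll_to_bottom(driver)`, with the result handed to BeautifulSoup for parsing.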
I’ve implemented several functions to clean the obtained data in preparation for feeding it to NLP models such as TF-IDF/BERT
(I’ve started those implementations, but I am currently experiencing some errors in my code that I am trying to fix)
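A minimal sketch of the kind of cleaning function described above, using only the standard library. The tiny `STOPWORDS` set here is a stand-in for NLTK's full list, and lemmatization is omitted to keep the example dependency-free.

```python
import re
import string

# Illustrative stopword set; in the real pipeline a full list
# (e.g. NLTK's) would be used.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it", "i"}

def clean_text(text):
    """Lowercase the text, drop numeric figures and punctuation,
    and remove stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numeric figures
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(tokens)
```

Applied to each scraped title or comment, this produces the normalized text that the TF-IDF/BERT models consume.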
Meetings/ Training Attended (Including Social Team Events)
I’ve attended all the team meetings/trainings, but we have not had any social events (which I would definitely be up for!)
Goals for Upcoming Week
I would like to fix the bug in my TF-IDF implementation and get predictions for forum topics.
After I’m done with that, I will implement BERT (and doc2vec, if time permits) so that I can compare and evaluate my results.
Scraped data from the Atom Discourse forum, compiling it into a CSV file
Initially I tried to use Scrapy for this task (following the tutorial given on STEM-Cast), but it did not work the way I expected. In the end, I used BeautifulSoup and Selenium; a brief tutorial by Anubhav (our project lead) was very helpful, and with it I managed to obtain the data points.
Cleaned the obtained data
I was able to remove punctuation, stopwords, and numeric figures, lemmatize the words, etc. There weren't many hurdles on this task; Anubhav provided some online materials, and those sites were pretty straightforward.
Read up on the various NLP implementations
We were given a number of readings that explained how various NLP implementations work and how they can be used. This was a pretty straightforward task; some of the information is still a little confusing, but I think by trying the methods out I will start to understand them better.
Started a TF-IDF implementation on my data
I managed to get about halfway through this task before encountering an error in my program. I am currently in the process of debugging it and I think I am close to figuring it out. Hopefully this will be done soon so that I can try other implementations.
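For reference, a minimal TF-IDF sketch with scikit-learn along the lines of what I am working toward; the toy documents and the cosine-similarity lookup are illustrative stand-ins, not my actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for cleaned forum posts; the real data comes from the CSV.
docs = [
    "editor theme syntax highlighting color",
    "package install error proxy network",
    "theme dark ui color scheme",
]

# Learn the vocabulary and build the TF-IDF matrix (n_docs x n_terms).
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# For a new post, find the most similar existing topic by cosine similarity.
query = vectorizer.transform(["dark theme color"])
scores = cosine_similarity(query, matrix).ravel()
best = scores.argmax()
```

Once this works on the real data, the same vectorized matrix can be reused when comparing against the BERT and doc2vec results.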