Overview of the internship so far:
- I have learned the basics of web scraping using a combination of the Selenium engine and the Beautiful Soup HTML parser (a short scraping sketch follows this overview). I have also applied the nltk library for natural language processing (lowercasing, stemming, creating a bag of words, etc.) and reviewed git in more depth with my industry mentor.
- I have attended all the meetings except the first one, which I missed due to communication problems (the Git meeting, technical meeting 1 on NLP, technical meeting 2 on data cleaning, and technical meeting 3 on the tf-idf vectorizer).
- My goal is to keep attending the upcoming meetings and workshops and to refine the NLP project based on a bodybuilding website.
- I had a little trouble with web scraping because I had never done it before, but I was able to produce processed output in the end. The workshops and resources provided to me definitely helped me get through this process.
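For reference, the scraping flow looked roughly like the sketch below. This is a minimal illustration, not our exact script; the URL and the CSS selector are hypothetical placeholders, and it assumes Chrome and chromedriver are available.

```python
# Minimal Selenium + Beautiful Soup scraping sketch.
# The URL and the "div.post" selector are hypothetical placeholders.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
try:
    driver.get("https://forum.example.com/t/some-topic")  # placeholder URL
    # Selenium renders the JavaScript-heavy page; Beautiful Soup parses the HTML.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    posts = [div.get_text(strip=True) for div in soup.select("div.post")]
finally:
    driver.quit()

print(f"Scraped {len(posts)} posts")
```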
Things I learned:
Technical: I became more familiar with pandas DataFrames and web scraping after we increased the size of the data to about 6000 posts. At that scale, the efficiency and time complexity of the program mattered more, so we were able to optimize the dataset-building process (see the pandas sketch after this section). I also attended meetings and researched natural language processing through BERT.
Tools: I learned to use GitHub, Google Colab, and Asana to program and communicate with my teammate.
Soft skills: communicating directly with individuals and reaching out when I need help or do not understand a topic.
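On the dataset-building optimization mentioned under Technical: our exact change isn't reproduced here, but the sketch below shows one illustrative pattern of this kind of fix. Growing a pandas DataFrame one row at a time copies the whole frame on every insertion, so the cost grows quadratically; collecting plain dicts in a list and building the DataFrame once is roughly linear.

```python
import pandas as pd

# Faster pattern: accumulate plain dicts, then build the DataFrame once,
# instead of appending to a DataFrame inside the loop.
rows = []
for i in range(6000):  # stand-in for iterating over scraped posts
    rows.append({"post_id": i, "text": f"post {i}"})

df = pd.DataFrame(rows)  # single construction instead of 6000 appends
print(df.shape)  # (6000, 2)
```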
What we achieved:
My team scraped about 6000 posts from a Discourse forum.
Cleaned and pre-processed the data by removing tags and stopwords, tokenizing, and lemmatizing (sketched below).
Learned about and researched BERT: how it is used and the algorithmic model behind the tool.
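A rough sketch of those pre-processing steps using nltk (the example sentence is made up, and our actual pipeline may differ in its details):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the nltk resources this sketch relies on
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(raw_html: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", raw_html).lower()  # strip tags, lowercase
    tokens = word_tokenize(text)                      # tokenize
    return [lemmatizer.lemmatize(t) for t in tokens   # lemmatize,
            if t.isalpha() and t not in stop_words]   # drop stopwords/punctuation

print(preprocess("<p>The lifters were comparing their squat programs.</p>"))
# ['lifter', 'comparing', 'squat', 'program']
```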
List of meetings I have joined so far:
- Weekly team meetings, web-scraping meetings, etc.
- NLP webinar with Colin
- Git Webinar with Colin
Goals of the upcoming week:
- Finalize and finish training the BERT model
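Since the training setup isn't finalized yet, here is a minimal sketch of what one fine-tuning step could look like, assuming the Hugging Face transformers library (an assumption on my part; the model checkpoint, labels, and example posts are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed setup: Hugging Face transformers with a stock BERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

texts = ["great cutting advice", "question about deadlift form"]  # placeholder posts
labels = torch.tensor([0, 1])                                     # placeholder labels

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # forward pass computes the loss
outputs.loss.backward()                  # one gradient step of fine-tuning
optimizer.step()
```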
Tasks Done:
- Scraped about 6000 posts with beautifulsoup4 and Selenium
- Learned about and researched BERT
- Team communication
Issues that I solved:
The dataset of scraped posts was too big to run in one pass and took disproportionately longer than the smaller dataset we had worked with before. I changed the code to run more efficiently and divided the dataset into chunks that could be run separately.
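In simplified form, the chunked processing looked like the sketch below; the chunk count and the process function are placeholders for our real pipeline.

```python
import numpy as np
import pandas as pd

def process(chunk: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the real cleaning/NLP pipeline
    return chunk

df = pd.DataFrame({"text": [f"post {i}" for i in range(6000)]})  # stand-in data

# Split the posts into smaller pieces, run each one separately,
# then stitch the results back together.
chunks = np.array_split(df, 10)
results = pd.concat([process(chunk) for chunk in chunks], ignore_index=True)
print(len(results))  # 6000
```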
I will be staying for the following weeks of the program (extension).