Concise Description of Things Learned -
Web Scraping using BeautifulSoup and Selenium
Using Python libraries like Pandas, Matplotlib, Scikit-Learn
Using GitHub and basic Git commands
ML Project Workflow
Basic pre-processing and EDA
TF-IDF, Count vectorizer
Word2vec and doc2vec embeddings
BERT deep learning model
Communication with peers
Attended a team-building meeting
GitHub meeting conducted by mentor Collin
Reviewed the STEMCast meeting recordings from Week 3
Gave a presentation on the word2vec embedding and assessed the feasibility of applying it to our project.
Learned Git commands and began using GitHub Desktop
Understood the word2vec model and implemented it on the Flowster data
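One common way to compare two comments with word embeddings is to average the word vectors in each comment and take the cosine similarity of the averages. The sketch below illustrates that idea only. The vocabulary and the 3-dimensional vectors in `TOY_VECTORS` are hypothetical stand-ins for vectors that a trained word2vec model would supply, and the function names are illustrative, not taken from the project.

```python
import math

# Hypothetical toy vectors; a trained word2vec model would provide real ones.
TOY_VECTORS = {
    "workflow": [0.9, 0.1, 0.0],
    "template": [0.8, 0.2, 0.1],
    "billing": [0.0, 0.9, 0.3],
    "invoice": [0.1, 0.8, 0.4],
}


def comment_vector(tokens, vectors):
    """Average the vectors of the known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Two comments about similar topics and one about a different topic.
first = comment_vector(["workflow", "template"], TOY_VECTORS)
second = comment_vector(["template", "workflow", "help"], TOY_VECTORS)
other = comment_vector(["billing", "invoice"], TOY_VECTORS)

print(cosine(first, second) > cosine(first, other))
```

Averaging ignores word order and skips out-of-vocabulary tokens (such as `"help"` here), which is a known limitation of this approach on small datasets.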
- Basic pre-processing and EDA of the Flowster forum
The basic pre-processing techniques included removing HTML tags, removing special characters, and converting text to lowercase.
EDA techniques included bar graphs of the authors and categories, and a word cloud visualization to find the most frequent words in the data.
- Applying word2vec embedding to find similarity between the comments
- Presenting the word2vec embedding and its usage, and explaining the neural network architecture behind it.
- Applying the BERT model to the combined Flowster and Amazon data
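The pre-processing steps named in the first task (removing HTML tags, removing special characters, lowercasing) can be sketched with the Python standard library alone. This is a minimal illustration under assumptions, not the code used on the Flowster data: the function name `clean_post` and the choice to keep only letters, digits, and whitespace are hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")            # anything that looks like an HTML tag
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # special characters, applied after lowercasing
SPACE_RE = re.compile(r"\s+")              # runs of whitespace


def clean_post(raw: str) -> str:
    """Strip tags, decode entities, lowercase, and drop special characters."""
    text = TAG_RE.sub(" ", raw)         # remove HTML tags
    text = html.unescape(text)          # turn entities such as &amp; into characters
    text = text.lower()                 # lowercase everything
    text = NON_ALNUM_RE.sub(" ", text)  # remove special characters
    return SPACE_RE.sub(" ", text).strip()


print(clean_post("<p>Hello, <b>World</b>!&amp; more</p>"))  # prints "hello world more"
```

On a dataset this small, each step is worth checking against a few raw posts before applying it, since aggressive cleaning can remove tokens that carry class information.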
Apart from the above tasks, I gained a lot of knowledge about text representations such as TF-IDF and doc2vec, and about basic ML models such as the Naive Bayes classifier, SVC and linear regression, through the presentations of my team members.
I also learned how to use the BERT model through various tutorials and the STEMCasts.
The Flowster forum contained fewer than 300 entries and had a major class imbalance. Hence each basic pre-processing method had to be reviewed carefully to judge whether it should be applied.
The class imbalance made it difficult to improve accuracy after training the ML models on the data.
The word2vec embeddings gave good accuracy only when combined with certain ML models, such as linear regression.
Solution: The data augmentation techniques used by the team significantly improved the accuracy of the models.
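The notes do not say which augmentation techniques the team used. As a stand-in, the sketch below shows random oversampling, a simpler balancing technique that duplicates minority-class rows until every class is the same size. The function name `oversample` and the toy texts and labels are hypothetical.

```python
import random
from collections import Counter


def oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    if not by_class:
        return [], []
    target = max(len(rows) for rows in by_class.values())
    out_texts, out_labels = [], []
    for label, rows in by_class.items():
        # Sample with replacement to fill the gap up to the largest class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for text in rows + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical toy data: three "support" posts and one "feature" post.
texts = ["a", "b", "c", "d"]
labels = ["support", "support", "support", "feature"]
balanced_texts, balanced_labels = oversample(texts, labels)

print(Counter(labels))
print(Counter(balanced_labels))
```

Balancing of this kind is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.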