- Learned to use vectorized data to train the ML models in the tutorial: XGBoost, Random Forest, and Logistic Regression.
- Learned to gauge the accuracy of these classifier models on a given dataset using scikit-learn.
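A minimal sketch of the train-and-score workflow described above, using scikit-learn. The synthetic dataset is my own stand-in for real vectorized text features; XGBoost is omitted because it lives in the third-party `xgboost` package, but it plugs into the same `fit`/`predict` pattern.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for vectorized text data (e.g. the output of a TF-IDF vectorizer).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train each classifier and report held-out accuracy.
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: {acc:.3f}")
```

The same `accuracy_score` call works for any classifier exposing `predict`, which is what makes comparing models across a dataset so quick.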
Tools: Selenium WebDriver, NumPy, Pandas, Matplotlib, scikit-learn, WordCloud, NLTK, Requests, Gensim, Seaborn
- Connected with a few peers over Discord to track my progress and work through some issues I had understanding the module, particularly how classifiers actually work.
- Found some great YouTube resources that explain the math behind ML models and the logic of Naive Bayes and Logistic Regression, particularly 3Blue1Brown.
In my own tests of different classifier models on the BBC News and myPaint datasets, TF-IDF + cosine similarity was the most accurate. This tracks with the linked tutorial, which states that context-specific models (like Naive Bayes and XGBoost) are more sensitive to text outliers than domain-specific models like TF-IDF.
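A quick sketch of the TF-IDF + cosine approach mentioned above: classify a new document by its cosine similarity to labelled training documents. The toy snippets and labels are my own invention, not from the BBC News or myPaint datasets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny labelled corpus (invented for illustration).
train_docs = [
    "stocks rally as markets rise",
    "team wins the championship final",
    "new phone released with faster chip",
]
train_labels = ["business", "sport", "tech"]

# Fit TF-IDF on the training documents once.
vec = TfidfVectorizer()
train_tfidf = vec.fit_transform(train_docs)

def classify(text):
    # Vectorize the query and pick the label of the most similar training doc.
    sims = cosine_similarity(vec.transform([text]), train_tfidf)[0]
    return train_labels[sims.argmax()]

print(classify("the final match was won by the home team"))  # → sport
```

Because the decision is a nearest-neighbour match over the whole document vector, a single unusual word shifts the similarity score only slightly, which is one intuition for why this approach tolerates text outliers better than the probabilistic models.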