Loading the data

Strategy 1

Explore the data

NB: We have imbalanced data; we will deal with this later on.
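
A minimal sketch of how the class balance might be inspected; the file name and label column are hypothetical stand-ins for the notebook's actual data.

```python
import pandas as pd

# Hypothetical file and label column, used only for illustration.
df = pd.read_csv("data.csv")

# Absolute and relative class frequencies reveal the imbalance.
class_counts = df["category"].value_counts()
print(class_counts)
print(class_counts / class_counts.sum())
```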

Text Pre-processing

NB: 317,518 entries were lost during this step.
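
A minimal sketch of a typical cleaning step, assuming lowercasing, stripping non-letter characters, and dropping rows with missing text; the exact pre-processing used in this notebook may differ, and the column names are hypothetical.

```python
import re
import pandas as pd

def clean_text(text: str) -> str:
    """Lowercase, keep only letters and spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical column names; dropping rows with missing text is one way
# entries can be lost at this stage.
df = pd.read_csv("data.csv")
df = df.dropna(subset=["text"])
df["text_clean"] = df["text"].apply(clean_text)
```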

Modeling the data

Naive Bayes Classifier for Multinomial Models

CountVectorizer + TF-IDFTransformer + MultinomialNB
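
A minimal sketch of this pipeline; the toy texts, labels, and train/test split are hypothetical stand-ins for the pre-processed corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical cleaned texts and labels standing in for the real corpus.
texts = ["great product works fine", "terrible service never again",
         "average quality okay price", "broken on arrival very bad"]
labels = ["positive", "negative", "positive", "negative"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# Bag-of-words counts -> TF-IDF weighting -> multinomial Naive Bayes.
nb_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

nb_pipeline.fit(X_train, y_train)
print(nb_pipeline.score(X_test, y_test))
```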

Linear Support Vector Machine

CountVectorizer + TF-IDFTransformer + SGDClassifier
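
A sketch of the same pipeline with SGDClassifier; hinge loss makes it a linear SVM trained by stochastic gradient descent. It reuses the split from the Naive Bayes sketch above, and the hyperparameters are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# hinge loss + L2 penalty = a linear SVM fitted with SGD.
svm_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                          max_iter=5, tol=None, random_state=42)),
])

svm_pipeline.fit(X_train, y_train)
print(svm_pipeline.score(X_test, y_test))
```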

Logistic Regression

CountVectorizer + TF-IDFTransformer + Logistic Regression
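
A sketch of the logistic regression variant, again reusing the split from the Naive Bayes sketch; the solver settings are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

logreg_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

logreg_pipeline.fit(X_train, y_train)
print(logreg_pipeline.score(X_test, y_test))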

Decision Tree

CountVectorizer + TF-IDFTransformer + DecisionTreeClassifier
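
A sketch of the decision tree variant, reusing the same split; no tuning is implied here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

tree_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

tree_pipeline.fit(X_train, y_train)
print(tree_pipeline.score(X_test, y_test))
```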

Results

Results of the previously trained models
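
A sketch of how the four fitted pipelines might be compared on the held-out test set; it assumes the pipeline names defined in the earlier sketches.

```python
from sklearn.metrics import accuracy_score

models = [("MultinomialNB", nb_pipeline),
          ("Linear SVM", svm_pipeline),
          ("Logistic Regression", logreg_pipeline),
          ("Decision Tree", tree_pipeline)]

# Report test-set accuracy for each model.
for name, model in models:
    pred = model.predict(X_test)
    print(f"{name}: {accuracy_score(y_test, pred):.3f}")
```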

Cross Validation with linear SVM
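
A sketch of cross-validating the linear SVM pipeline; cv is kept at 2 only because the toy data above has four examples, whereas 5 or 10 folds would be more typical on the real corpus.

```python
from sklearn.model_selection import cross_val_score

# Cross-validate the linear SVM pipeline on the full (toy) corpus.
scores = cross_val_score(svm_pipeline, texts, labels, cv=2, scoring="accuracy")
print(scores.mean(), scores.std())
```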

Strategy 2

Modeling the data

Naive Bayes Classifier for Multinomial Models

CountVectorizer + TF-IDFTransformer + MultinomialNB

Linear Support Vector Machine

CountVectorizer + TF-IDFTransformer + SGDClassifier

Logistic Regression

CountVectorizer + TF-IDFTransformer + Logistic Regression

Decision Tree

CountVectorizer + TF-IDFTransformer + DecisionTreeClassifier

Results

Results of the previously trained models

Cross Validation with linear SVM

Strategy 3

Cross Validation with linear SVM

Replacing \n with a space improved the accuracy by ~0.01 compared to the previous strategy.
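
A minimal sketch of that newline replacement, reusing the texts, labels, svm_pipeline, and cross_val_score names from the earlier sketches.

```python
# Strategy 3: normalise newlines to spaces before vectorizing.
texts_no_newlines = [t.replace("\n", " ") for t in texts]

scores = cross_val_score(svm_pipeline, texts_no_newlines, labels,
                         cv=2, scoring="accuracy")
print(scores.mean())
```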

Random Forest

Using TfidfVectorizer as both vectorizer and transformer is slightly better than using CountVectorizer as the vectorizer (see the sketch below).
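
A sketch of that setup: TfidfVectorizer folds counting and TF-IDF weighting into a single step, replacing the CountVectorizer + TfidfTransformer pair used above; it reuses the earlier split, and the hyperparameters are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

rf_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

rf_pipeline.fit(X_train, y_train)
print(rf_pipeline.score(X_test, y_test))
```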

NB: Increasing the number of training iterations (e.g. a larger max_iter for SGDClassifier) makes the SVM's accuracy drop.

XGBoost
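
A hedged sketch of how XGBoost might be applied on the TF-IDF features, reusing the earlier split; XGBoost expects integer class labels, so the string labels are encoded first, and the hyperparameters are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Encode string labels as integers for XGBoost.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

xgb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)),
])

xgb_pipeline.fit(X_train, y_train_enc)
print(xgb_pipeline.score(X_test, y_test_enc))
```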

We can see that:

Light GBM

About Light GBM: https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/
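
A hedged sketch of a LightGBM classifier on the same TF-IDF features, reusing the earlier split; the parameters are illustrative, not tuned values from the notebook.

```python
from lightgbm import LGBMClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

lgbm_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LGBMClassifier(n_estimators=200, learning_rate=0.1)),
])

lgbm_pipeline.fit(X_train, y_train)
print(lgbm_pipeline.score(X_test, y_test))
```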

Other: Investigating Abbreviations
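
A minimal sketch of one way to surface candidate abbreviations: collect all-caps tokens from the raw (un-lowercased) documents and count them. The sample documents below are hypothetical.

```python
import re
from collections import Counter

# Hypothetical raw documents standing in for the un-cleaned corpus.
raw_texts = ["Shipment handled by FBA, see the URL in the invoice",
             "Contact CAS support or check OHI status"]

# All-caps tokens of length 3+ are treated as candidate abbreviations.
abbrev_counts = Counter()
for text in raw_texts:
    abbrev_counts.update(re.findall(r"\b[A-Z]{3,}\b", text))

print(abbrev_counts.most_common(20))
```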

The ones that look recognizable to me are: FBA, OHI, CAS, URL, ASA, FNS, WAY, GET, HEL, FBR, TEL, CHE, CKE, DDB