🔸 Code Along for F-IE-1: Text Classification using NLP

Task: Create, Compare, and Visualize Movie Review Sentiment Analyzers

Welcome to an exciting exploration of Natural Language Processing (NLP)! In this task, you’ll create a sentiment analysis model to determine whether movie reviews are positive or negative. You’ll then analyze its performance and create compelling visualizations to showcase your results.

Please share your work by replying to this post with screenshots of your visualizations. Then, comment on at least two other submissions. Your active participation, analysis, visualizations, and peer feedback will count as your evaluation for the Virtual Internships. Alternatively, you may opt to discuss your work in the AI Evaluator session, where your understanding of and insights into the Text Classification project will be assessed.

Task Overview

  1. Build a Sentiment Analyzer: Create a model using the provided dataset of 200 movie reviews. Experiment with different feature extraction methods (e.g., CountVectorizer, TfidfVectorizer) and choose the best performing one.

  2. Enhance the Model: Improve your model’s accuracy by implementing at least two text preprocessing techniques (e.g., lowercase conversion, punctuation removal, stopword removal, stemming/lemmatization). Briefly explain your choice of preprocessing steps in your visualizations.

  3. Test, Compare, and Visualize: Use your model to analyze real movie reviews. Evaluate its performance using metrics such as accuracy, precision, recall, F1-score, and ROC curve with AUC score. Create visualizations that highlight these metrics and provide insights into your model’s strengths and weaknesses.

  4. Advanced Analysis: Develop at least one custom visualization that provides unique insights into your model’s performance or the data itself. This could be an analysis of feature importance, error patterns, or any other aspect you find interesting.

  5. Peer Review: Engage with your peers by providing constructive feedback on at least two other submissions. Discuss the effectiveness of preprocessing, the creativity and informativeness of visualizations, and the thoroughness of the analysis. Suggest potential improvements and share insights gained that could inform future work.

Basic Python Code for Sentiment Analysis:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc

# Load the IMDB dataset
from sklearn.datasets import load_files

# Note: Before running this script, download the IMDB dataset from
# http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# Extract the contents and ensure you have a folder named 'aclImdb' in your working directory

# Load all positive and negative reviews (the labeled reviews live under aclImdb/train)
dataset = load_files(r"./aclImdb/train", categories=['pos', 'neg'], shuffle=True, random_state=42)

# Convert to DataFrame for easier handling
df = pd.DataFrame({'review': dataset.data, 'sentiment': dataset.target})

# Convert reviews to string (they're initially bytes)
df['review'] = df['review'].apply(lambda x: x.decode('utf-8'))

# Limit to 200 reviews for this task
df = df.sample(n=200, random_state=42)

reviews = df['review'].tolist()
sentiments = df['sentiment'].tolist()

print(f"Dataset loaded. Number of reviews: {len(reviews)}")
print(f"Sample review: {reviews[0][:100]}...")  # Print first 100 characters of first review

def preprocess_text(text):
    # TODO: Implement at least two text preprocessing techniques
    # Examples: lowercase conversion, punctuation removal, stopword removal, stemming/lemmatization
    pass

# Apply preprocessing
preprocessed_reviews = [preprocess_text(review) for review in reviews]

def extract_features(reviews):
    # TODO: Experiment with different feature extraction methods (e.g., CountVectorizer, TfidfVectorizer)
    # Choose the best performing one
    pass

# TODO: Extract features from preprocessed reviews
# X = ...
# y = ...

# TODO: Split the data into training and testing sets
# X_train, X_test, y_train, y_test = ...

def train_model(X_train, y_train):
    # TODO: Train your chosen model
    pass

# TODO: Train the model
# model = ...

def evaluate_model(model, X_test, y_test):
    # TODO: Evaluate the model using accuracy, precision, recall, F1-score, and ROC curve with AUC score
    pass

# TODO: Evaluate the model
# metrics = ...

def visualize_model_performance(metrics):
    # TODO: Create visualizations that highlight the model's performance metrics
    pass

def visualize_feature_importance(model, feature_names):
    # TODO: Create a visualization of feature importance
    pass

def custom_visualization(model, X, y):
    # TODO: Develop at least one unique visualization that provides deeper insights into your model or the data
    # Ideas: heatmap of word correlations, prediction confidence vs. review length, analysis of misclassified reviews
    pass

# Main execution
if __name__ == "__main__":
    # TODO: Run all the necessary steps and create visualizations
    # Remember to save or display all visualizations for submission
    pass
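For orientation, here is one minimal way the stubs above could be filled in. This is a sketch, not the reference solution: it collapses the stubbed functions into a linear script, picks lowercasing and punctuation removal as its two preprocessing techniques, and fixes the vectorizer to TF-IDF. You are expected to experiment beyond it.

import string

# Preprocessing: two basic techniques (lowercase conversion, punctuation removal)
def preprocess_text(text):
    text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation))

preprocessed_reviews = [preprocess_text(review) for review in reviews]

# Feature extraction: TF-IDF (swap in CountVectorizer to compare)
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(preprocessed_reviews)
y = np.array(sentiments)

# Train/test split and model training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluation: the four scalar metrics plus the ROC curve with AUC
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.2f}")
print(f"Precision: {precision_score(y_test, y_pred):.2f}")
print(f"Recall:    {recall_score(y_test, y_pred):.2f}")
print(f"F1-score:  {f1_score(y_test, y_pred):.2f}")

fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()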

Instructions for Students:

  • Setup: Install Jupyter Notebook and the required libraries: numpy, pandas, matplotlib, seaborn, and scikit-learn.

  • Preprocessing: Implement at least two text preprocessing techniques. Visualize the impact of these techniques on your model’s performance.

  • Feature Extraction: Experiment with different feature extraction methods. Create a visualization comparing their effectiveness.

  • Model Evaluation: Calculate and visualize accuracy, precision, recall, F1-score, and the ROC curve with AUC score. Interpret these metrics in the context of sentiment analysis.

  • Custom Visualization: Develop at least one unique visualization that provides deeper insights into your model or the data. This could be:

    • A heatmap showing the correlation between certain words and sentiment
    • A visualization of how prediction confidence varies with review length (one sketch appears after this list)
    • An analysis of misclassified reviews and common patterns among them
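As a starting point for that second idea, here is a hedged sketch plotting prediction confidence against review length. It assumes the model, vectorizer output X, labels y, and raw reviews from the reference sketch above; everything else is standard matplotlib.

# Sketch: prediction confidence vs. review length
confidences = model.predict_proba(X).max(axis=1)        # confidence of the predicted class
lengths = [len(review.split()) for review in reviews]   # review length in words

plt.scatter(lengths, confidences, c=y, cmap='coolwarm', alpha=0.6)
plt.xlabel('Review length (words)')
plt.ylabel('Prediction confidence')
plt.title('Prediction confidence vs. review length')
plt.colorbar(label='True sentiment (0 = neg, 1 = pos)')
plt.show()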

Sharing and Discussion

Share your results by posting screenshots of your Jupyter Notebook visualizations, including one or more of:

  • Model performance metrics and ROC curve
  • Impact of preprocessing techniques
  • Comparison of feature extraction methods
  • Your custom visualization
  • Analysis of real movie reviews

Let your visualizations tell the full story without additional written explanations!

Peer Review

After submitting your own work, review and comment on at least two other students’ submissions. In your comments, consider the following:

  1. What do you find interesting or innovative about their approach or visualizations?
  2. How do their results compare to yours? Are there any notable differences in model performance or insights?
  3. Based on their visualizations, can you suggest any potential improvements or areas for further exploration?
  4. Is there anything you learned from their submission that you might apply to your own work in the future?

Remember to be constructive and respectful in your feedback. The goal is to learn from each other and foster a collaborative learning environment.

Aruuke Bayakmatova: Screenshots of visualizations for F-IE-1 Code Along

[12 screenshots of visualizations] [CodeAlong-TextClassificationUsingNLP.pdf|attachment]


Hello @ARUUKE_BAYAKMATOVA Very well done! Not only is your work great, but it also demonstrates remarkable initiative to be the first one to post. You are welcome to join the foundational team as a lead, and the advanced team as a participant if you wish. The first round of internships starts on the 22nd, and we will begin forming teams shortly!


Mohammed Saiger: Screenshots of visualizations for F-IE-1 Code Along [5 screenshots]

I am not sure if I did this right, but this is what I was able to get.


Thank you so much! Sounds great!

Hi @Moh_Saiger, great effort! A perfect AUC for TfidfVectorizer is unlikely, though. Try adding more performance metrics like precision, recall, and F1-score to investigate further. We suggest focusing on model evaluation and understanding the metrics to prepare for your team experience!
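For example, a quick sanity check of a suspiciously perfect score could look like the sketch below. It assumes the feature matrix X and labels y from the starter code; on a dataset this small, cross-validation gives a more honest picture than a single train/test split.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validated AUC; a perfect single-split AUC often drops here
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring='roc_auc')
print(f"Cross-validated AUC: {scores.mean():.2f} +/- {scores.std():.2f}")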

Hi @ARUUKE_BAYAKMATOVA, please review this advanced project: 🔸 A-TE-2: Building an NLP Pipeline - Recommender Systems. Assess how much you can understand and where you might need to deepen your knowledge. Push yourself to whatever limit you are comfortable with; you can learn more along the way.

Our mentors will soon evaluate students interested in the advanced project and gather input on potential extensions. Our goal is to define a project with clear NLP subtasks that you can lead your foundational team in tackling. Your grasp of feature extraction and model evaluation will be an asset!

Prasun Sharma: Screenshots of visualizations for F-IE-1 Code Along

Dataset loaded. Number of reviews: 200
Sample review: Low budget horror movie. If you don’t raise your expectations too high, you’ll probably enjoy this l…

  • accuracy: 0.9

  • precision: 0.95

  • recall: 0.88

  • f1: 0.92

[3 screenshots]

  • accuracy: 0.77

  • precision: 0.90

  • recall: 0.73

  • f1: 0.80

[3 screenshots]


Screenshots of some visualizations of my attempt for F-IE-1 Code Along [6 screenshots, including the impact of preprocessing]


[screenshot: improved] Not sure if I did this right, but I tried to improve the metrics and model. Still trying to figure it out.


When we kick off the session next week, our mentor will start by reviewing these submissions and providing individual feedback. From there, they’ll help the team decide on the final project. Feedback on the work you’ve done will help you learn more than any reading can! Great job!

  • Trained several classifiers, including Logistic Regression, Random Forest, MLP, Multinomial Naive Bayes, Gaussian Naive Bayes, and Decision Tree
  • Preprocessing steps included normalization (cleaning HTML syntax, removing URLs, standardizing casing, fixing contractions, and removing stopwords and non-word characters), tokenization, and lemmatization (a sketch of such a pipeline follows below)
  • Feature extraction used both CountVectorizer and TfidfVectorizer for vectorization
  • TF-IDF outperformed CountVectorizer across all six models

P.S. I used 200 samples from the /aclImdb/train dataset and reserved 15% of them for testing
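A normalization pipeline along those lines might look like the following sketch. It assumes the NLTK data packages stopwords, wordnet, and punkt have been downloaded; contraction expansion is omitted here (the third-party contractions package is one option for that step).

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('stopwords'), nltk.download('wordnet'), nltk.download('punkt')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def normalize(text):
    text = re.sub(r'<[^>]+>', ' ', text)       # clean HTML syntax
    text = re.sub(r'https?://\S+', ' ', text)  # remove URLs
    text = text.lower()                        # standardize casing
    text = re.sub(r'[^a-z\s]', ' ', text)      # remove non-word characters
    tokens = nltk.word_tokenize(text)          # tokenization
    # Stopword removal and lemmatization
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens if t not in stop_words)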


For preprocessing: texts are lowercased, excess whitespace is removed, and punctuation is replaced with an empty character.

Feature extraction: both the TF-IDF and count vectorizers were used.

Model evaluation: the models are evaluated using accuracy, recall, F1-score, and the ROC curve with an AUC score. Both models are trained with logistic regression. Visualizations when extracting features with the TF-IDF vectorizer are as follows.

ROC curve: [Tfidf_ROC_curve]

Model performance metrics: [Model_performance_metrics_TFIDF]

Top 20 important features: [Top_20features_TDIDF]

Word correlation (custom visualization): [Word_correlation_TFIDF]

Visualizations when extracting features with the count vectorizer are as follows.

ROC curve: [CountVectorizer_ROC_curve]

Model performance metrics: [CountVectorizer_model_metrics]

Top 20 important features: [CountVectorizer_top_features]

Word correlation (custom visualization): [CountVectorizer_word_correlation]
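For anyone who wants to reproduce the top-20-features plots above, here is a hedged sketch that reads the weights out of a fitted logistic regression. It assumes a fitted model and vectorizer as in the starter code; the variable names are chosen here for illustration.

import numpy as np
import matplotlib.pyplot as plt

# Top 20 features by absolute logistic regression coefficient
coefs = model.coef_[0]
names = np.array(vectorizer.get_feature_names_out())
top = np.argsort(np.abs(coefs))[-20:]

plt.barh(names[top], coefs[top])
plt.xlabel('Coefficient (positive = positive sentiment)')
plt.title('Top 20 most influential features')
plt.tight_layout()
plt.show()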
