Getting Started with Collaborative Filtering: Create Your First Movie Recommendation System
Objective
The main goal of this project is to build a practical movie recommendation system using collaborative filtering techniques. You’ll work with real-world movie rating data to create, implement, and evaluate recommendation algorithms that can predict user preferences and suggest movies.
Learning Outcomes
By completing this project, you will:
- Understand how recommendation systems work and their practical applications
- Learn to process and analyze user-item interaction data
- Implement user-based and item-based collaborative filtering
- Use industry-standard libraries for building recommendation systems
- Evaluate and compare recommendation system performance
- Gain experience with real-world data challenges
Prerequisites
- Python Programming Fundamentals
- Libraries
- pandas: Data manipulation
- numpy: Numerical operations
- scikit-learn: Machine learning utilities
- surprise: Recommendation system library
List of Theoretical Concepts
1. Core Concepts
- Types of Recommendation Systems
  - Content-based: Uses item features
  - Collaborative filtering: Uses user behavior
  - Hybrid: Combines multiple approaches
- Key Terms
  - User-Item Matrix: R[u,i] contains the rating of user u for item i
  - Similarity Matrix: S[i,j] contains the similarity between items i and j
  - Rating Prediction: r̂(u,i) is the predicted rating for user u on item i
- Collaborative Filtering Process
  1. Collect user-item interactions
  2. Calculate similarity between users/items
  3. Predict unknown ratings
  4. Recommend items with the highest predicted ratings
2. Mathematical Framework
- Rating Prediction Formulas
  - User-Based CF:
    r̂(u,i) = μ_u + Σ_v sim(u,v) · (r(v,i) − μ_v) / Σ_v |sim(u,v)|
  - Item-Based CF:
    r̂(u,i) = Σ_j sim(i,j) · r(u,j) / Σ_j |sim(i,j)|
- Similarity Metrics
  - Cosine Similarity:
    sim(a,b) = (a · b) / (||a|| ||b||)
  - Pearson Correlation:
    sim(a,b) = cov(a,b) / (σ_a σ_b)
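To make the user-based formula concrete, here is a minimal, self-contained NumPy sketch. The toy ratings matrix and the IDs are illustrative only and are not part of the project code:

import numpy as np

# Toy ratings matrix R[u, i]; 0 marks an unrated item
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 4, 1],
    [1, 1, 5, 5],
], dtype=float)

def user_mean(u):
    """Mean over the items user u has actually rated (non-zero entries)."""
    rated = R[u] > 0
    return R[u][rated].mean()

def cosine_sim(a, b):
    """sim(a, b) = (a . b) / (||a|| ||b||)"""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Predict user 0's rating for item 2 with the user-based CF formula
u, i = 0, 2
num, den = 0.0, 0.0
for v in range(R.shape[0]):
    if v == u or R[v, i] == 0:
        continue  # skip the target user and users who never rated item i
    s = cosine_sim(R[u], R[v])
    num += s * (R[v, i] - user_mean(v))
    den += abs(s)

r_hat = user_mean(u) + num / den if den > 0 else user_mean(u)
print(f"Predicted r(0, 2) = {r_hat:.2f}")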
Tools Required
- Python 3.7+
- pandas
- numpy
- scikit-learn
- scikit-surprise
- matplotlib
- Jupyter Notebook (recommended)
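All of these can typically be installed in one step with pip:

pip install pandas numpy scikit-learn scikit-surprise matplotlib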
Hyperparameter Explanation
In implementing our recommendation system models, several hyperparameters play crucial roles in influencing the performance and accuracy of the recommendations. Understanding these hyperparameters is essential for tuning the models effectively.
1. Number of Neighbors (n_neighbors)
- Definition: The number of similar users or items considered when making predictions in collaborative filtering.
- Significance: Affects the balance between local and global influences in the model.
- Lower Values: Consider only the most similar users/items, which can capture more specific tastes but may be sensitive to noise.
- Higher Values: Include more users/items, providing more general recommendations but potentially diluting personalization.
- Usage in Code: In the UserBasedCF and ItemBasedCF classes, n_neighbors determines how many similar users/items to consider.
2. Similarity Metrics
- Options: Cosine similarity, Pearson correlation, etc.
- Significance: Determines how similarity between users or items is calculated.
- Cosine Similarity: Measures the cosine of the angle between two vectors, focusing on the direction rather than magnitude.
- Pearson Correlation: Measures linear correlation, considering both direction and magnitude after centering the data.
- Usage in Code: The cosine_similarity function from scikit-learn is used to compute similarities.
3. Number of Factors (n_factors)
- Definition: In matrix factorization models like SVD, it represents the number of latent factors.
- Significance: Determines the complexity of the model and its ability to capture underlying patterns.
- Lower Values: Simpler models that may underfit the data.
- Higher Values: More complex models that can capture intricate patterns but may overfit.
- Usage in Code: In the Surprise library's SVD model: SVD(n_factors=100)
4. Learning Rate (lr_all)
- Definition: Controls the step size during optimization in algorithms like SVD.
- Significance: Affects the speed and stability of convergence.
- Low Learning Rate: Slower convergence but potentially more stable.
- High Learning Rate: Faster convergence but risk of overshooting minima.
- Usage in Code: In the SVD model: lr_all=0.005
5. Regularization (reg_all)
- Definition: Penalizes large model parameters to prevent overfitting.
- Significance: Helps improve generalization by discouraging overly complex models.
- Usage in Code: In the SVD model: reg_all=0.02
6. Number of Epochs (n_epochs)
- Definition: Number of times the algorithm processes the entire training data.
- Significance: Determines how long the model trains.
- Too Few Epochs: Underfitting due to insufficient training.
- Too Many Epochs: Overfitting to the training data.
- Usage in Code: In the SVD model: n_epochs=20
7. Test Size (test_size)
- Definition: The proportion of data set aside for testing.
- Significance: Affects the evaluation of model performance.
- Usage in Code: In train_test_split: test_size=0.2
Hyperparameter Tuning Tips
- Start with Default Values: Use standard values as a baseline.
- Experiment Systematically: Adjust one hyperparameter at a time to observe its effect.
- Cross-Validation: Use techniques like grid search with cross-validation to find optimal hyperparameters (see the sketch after this list).
- Balance Complexity and Performance: Aim for a model that generalizes well to new data.
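As an illustration of the grid-search tip, here is a minimal sketch using Surprise's GridSearchCV. The parameter values are illustrative, and ratings refers to the DataFrame loaded in Step 1 below:

from surprise import SVD, Dataset, Reader
from surprise.model_selection import GridSearchCV

# Load the ratings DataFrame into Surprise's format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

# One value list per hyperparameter; the grid is their cross product
param_grid = {
    'n_factors': [50, 100],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.02, 0.1],
}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])   # best cross-validated RMSE
print(gs.best_params['rmse'])  # hyperparameters that achieved it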
Challenges and Solutions in Recommendation Systems
Building effective recommendation systems involves navigating several challenges inherent in real-world data and user behavior. Understanding these challenges and implementing appropriate solutions is crucial for creating robust systems.
1. Cold Start Problem
- Definition: Difficulty in making accurate recommendations for new users or items due to lack of historical data.
- Types:
- User Cold Start: New users with no ratings or interactions.
- Item Cold Start: New items with no ratings.
- Solutions:
- For Users:
- Collect Initial Preferences: Prompt new users to rate items during onboarding.
- Use Demographic Information: Leverage user attributes to find similar users.
- Hybrid Approaches: Combine collaborative filtering with content-based methods.
- For Items:
- Item Attributes: Use metadata (genres, tags) to recommend new items.
- Popularity-Based Recommendations: Suggest popular items to new users (a minimal sketch follows this list).
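As a minimal sketch of the popularity-based fallback, assuming the ratings DataFrame loaded in Step 1 (the thresholds are illustrative):

def popular_movies(ratings, n=10, min_ratings=50):
    """Fallback for cold-start users: best-rated movies with enough ratings."""
    stats = ratings.groupby('movie_id')['rating'].agg(['mean', 'count'])
    # Require a minimum number of ratings so a few enthusiastic raters can't dominate
    return (stats[stats['count'] >= min_ratings]
            .sort_values('mean', ascending=False)
            .head(n)
            .index.tolist())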
2. Data Sparsity
- Definition: User-item matrices are often sparse due to users interacting with a small fraction of items.
- Impact: Reduces the effectiveness of similarity calculations and model training.
- Solutions:
- Dimensionality Reduction: Use techniques like SVD to reduce the number of dimensions (see the sketch after this list).
- Model-Based Methods: Employ algorithms that can handle sparsity, such as matrix factorization.
- Implicit Feedback: Incorporate implicit data like clicks or views to enrich the dataset.
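One way to act on the dimensionality-reduction idea is truncated SVD on a sparse matrix. This sketch assumes the user_movie_matrix pivot built in Step 2; the number of components is illustrative:

from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

sparse_ratings = csr_matrix(user_movie_matrix.values)  # store only non-zero entries
svd = TruncatedSVD(n_components=50, random_state=42)
user_factors = svd.fit_transform(sparse_ratings)       # shape: (n_users, 50)
print(f"Variance explained by 50 factors: {svd.explained_variance_ratio_.sum():.2%}")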
3. Scalability
- Definition: As the number of users and items grows, computational requirements increase significantly.
- Challenges: Real-time recommendations and model updates become computationally intensive.
- Solutions:
- Approximate Nearest Neighbors: Use algorithms like locality-sensitive hashing to speed up similarity searches.
- Parallel Computing: Distribute computations across multiple processors or machines.
- Incremental Updates: Update models incrementally instead of retraining from scratch.
4. Overfitting
- Definition: The model learns the noise in the training data, reducing performance on unseen data.
- Solutions:
- Regularization: Apply penalties to large parameters to prevent overfitting.
- Cross-Validation: Use to select hyperparameters that generalize well.
- Simplify the Model: Reduce complexity by adjusting the number of factors or neighbors.
5. Diversity and Serendipity
- Definition: Recommendations may become too narrow, focusing only on popular or similar items.
- Impact: Reduces user satisfaction and limits exposure to a variety of items.
- Solutions:
- Diversification Techniques: Introduce diversity in recommendations by considering item dissimilarity.
- Serendipitous Recommendations: Include unexpected items that the user may like.
- User Feedback: Allow users to provide feedback to refine recommendations.
6. Privacy Concerns
- Definition: Handling sensitive user data raises privacy issues.
- Solutions:
- Data Anonymization: Remove personally identifiable information.
- Federated Learning: Train models on user devices without transmitting raw data.
- Transparency: Inform users about data usage and obtain consent.
7. Evaluation Challenges
- Definition: Assessing recommendation systems can be complex due to subjective user preferences.
- Solutions:
- Multiple Metrics: Use a combination of accuracy, diversity, and novelty metrics.
- A/B Testing: Deploy different versions to subsets of users and compare performance.
- User Studies: Collect qualitative feedback to complement quantitative metrics.
Implementing Solutions
- Hybrid Recommendation Systems: Combine collaborative filtering with content-based methods to mitigate cold start and sparsity.
- Algorithm Selection: Choose algorithms suited to the data characteristics and business requirements.
- Continuous Improvement: Regularly update models with new data and feedback.
By being aware of these challenges and proactively addressing them, you can build more effective and user-friendly recommendation systems that provide value to both users and stakeholders.
Steps and Tasks
1. Data Acquisition and Exploration
First, let’s get and explore the MovieLens dataset:
# Download MovieLens 100k dataset
!wget https://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip ml-100k.zip
# Load the data
import pandas as pd
# Read ratings
ratings = pd.read_csv('ml-100k/u.data',
                      sep='\t',
                      names=['user_id', 'movie_id', 'rating', 'timestamp'])
# Read movie information
movies = pd.read_csv('ml-100k/u.item',
                     sep='|',
                     encoding='latin-1',
                     names=['movie_id', 'title', 'release_date', 'video_release',
                            'imdb_url', 'unknown', 'Action', 'Adventure', 'Animation',
                            'Children', 'Comedy', 'Crime', 'Documentary', 'Drama',
                            'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
                            'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'])
print("Dataset Overview:")
print(f"Number of ratings: {len(ratings)}")
print(f"Number of users: {ratings['user_id'].nunique()}")
print(f"Number of movies: {ratings['movie_id'].nunique()}")
Complete data exploration code:
import matplotlib.pyplot as plt

def explore_dataset():
    """Comprehensive exploration of the MovieLens dataset"""
    # Basic statistics
    print("\nRating Statistics:")
    print(ratings['rating'].describe())

    # Rating distribution
    plt.figure(figsize=(10, 5))
    ratings['rating'].value_counts().sort_index().plot(kind='bar')
    plt.title('Rating Distribution')
    plt.xlabel('Rating')
    plt.ylabel('Count')
    plt.show()

    # Ratings per user
    user_ratings = ratings['user_id'].value_counts()
    print("\nRatings per User:")
    print(f"Mean: {user_ratings.mean():.2f}")
    print(f"Median: {user_ratings.median():.2f}")

    # Ratings per movie
    movie_ratings = ratings['movie_id'].value_counts()
    print("\nRatings per Movie:")
    print(f"Mean: {movie_ratings.mean():.2f}")
    print(f"Median: {movie_ratings.median():.2f}")

    # Create user-movie matrix
    user_movie_matrix = ratings.pivot(
        index='user_id',
        columns='movie_id',
        values='rating'
    ).fillna(0)

    # Calculate sparsity (fraction of user-movie pairs with no rating)
    sparsity = (user_movie_matrix == 0).sum().sum() / (
        user_movie_matrix.shape[0] * user_movie_matrix.shape[1])
    print(f"\nMatrix Sparsity: {sparsity:.2%}")

explore_dataset()
2. Data Preprocessing
Basic preprocessing steps:
def preprocess_data(ratings, movies):
    # Create user-movie matrix
    user_movie_matrix = ratings.pivot(
        index='user_id',
        columns='movie_id',
        values='rating'
    ).fillna(0)

    # Split into train and test
    from sklearn.model_selection import train_test_split
    train_data, test_data = train_test_split(
        ratings, test_size=0.2, random_state=42)

    return user_movie_matrix, train_data, test_data
Advanced preprocessing code:
def preprocess_data(ratings, movies, test_size=0.2):
    """
    Comprehensive data preprocessing for recommendation system

    Args:
        ratings (pd.DataFrame): Raw ratings data
        movies (pd.DataFrame): Movie metadata
        test_size (float): Proportion of data for testing

    Returns:
        tuple: Processed dataframes and matrices
    """
    # Remove users with too few ratings (cold start handling)
    min_ratings = 5
    user_counts = ratings['user_id'].value_counts()
    valid_users = user_counts[user_counts >= min_ratings].index
    ratings_filtered = ratings[ratings['user_id'].isin(valid_users)]

    # Create user-movie matrix
    user_movie_matrix = ratings_filtered.pivot(
        index='user_id',
        columns='movie_id',
        values='rating'
    ).fillna(0)

    # Normalize ratings by user mean
    user_means = user_movie_matrix.mean(axis=1)
    user_movie_matrix_normalized = user_movie_matrix.sub(user_means, axis=0)

    # Split data per user, ensuring each user has both train and test items
    from sklearn.model_selection import train_test_split

    train_data = []
    test_data = []
    for user in ratings_filtered['user_id'].unique():
        user_ratings = ratings_filtered[ratings_filtered['user_id'] == user]
        user_train, user_test = train_test_split(
            user_ratings, test_size=test_size, random_state=42)
        train_data.append(user_train)
        test_data.append(user_test)

    train_data = pd.concat(train_data)
    test_data = pd.concat(test_data)

    return (user_movie_matrix, user_movie_matrix_normalized,
            train_data, test_data, user_means)

# Example usage
processed_data = preprocess_data(ratings, movies)
user_movie_matrix = processed_data[0]
3. Implementing User-Based Collaborative Filtering
Basic implementation:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

class UserBasedCF:
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, user_movie_matrix):
        self.user_movie_matrix = user_movie_matrix
        self.user_similarity = pd.DataFrame(
            cosine_similarity(user_movie_matrix),
            index=user_movie_matrix.index,
            columns=user_movie_matrix.index
        )
Complete User-Based CF implementation:
class UserBasedCF:
    def __init__(self, n_neighbors=5):
        """
        Initialize User-Based Collaborative Filtering

        Args:
            n_neighbors (int): Number of similar users to consider
        """
        self.n_neighbors = n_neighbors

    def fit(self, user_movie_matrix):
        """
        Fit the model by calculating user similarities

        Args:
            user_movie_matrix (pd.DataFrame): User-movie rating matrix
        """
        self.user_movie_matrix = user_movie_matrix
        self.user_means = user_movie_matrix.mean(axis=1)

        # Calculate user similarities on mean-centered ratings
        normalized_matrix = user_movie_matrix.sub(self.user_means, axis=0)
        self.user_similarity = pd.DataFrame(
            cosine_similarity(normalized_matrix),
            index=user_movie_matrix.index,
            columns=user_movie_matrix.index
        )

    def predict(self, user_id, movie_id):
        """
        Predict rating for a user-movie pair

        Args:
            user_id: Target user ID
            movie_id: Target movie ID

        Returns:
            float: Predicted rating
        """
        # Unknown movie: fall back to the user's mean (or the scale midpoint)
        if movie_id not in self.user_movie_matrix.columns:
            return self.user_means[user_id] if user_id in self.user_means.index else 3.0

        # Unknown user: fall back to the movie's mean over observed ratings
        if user_id not in self.user_movie_matrix.index:
            movie_ratings = self.user_movie_matrix[movie_id]
            rated = movie_ratings[movie_ratings > 0]
            return rated.mean() if len(rated) > 0 else 3.0

        # Get the most similar users (excluding the target user itself)
        similar_users = self.user_similarity[user_id].drop(user_id).nlargest(self.n_neighbors)

        # Get ratings for the movie from similar users
        similar_ratings = []
        similar_sims = []
        for sim_user, sim in similar_users.items():
            rating = self.user_movie_matrix.loc[sim_user, movie_id]
            if rating != 0:  # If the user has rated the movie
                similar_ratings.append(rating - self.user_means[sim_user])
                similar_sims.append(sim)

        if len(similar_ratings) == 0:
            return self.user_means[user_id]

        # Weighted average of mean-centered ratings; the denominator uses
        # |sim| to match the user-based CF formula
        denominator = np.sum(np.abs(similar_sims))
        if denominator == 0:
            return self.user_means[user_id]
        prediction = self.user_means[user_id] + (
            np.sum(np.array(similar_ratings) * np.array(similar_sims)) /
            denominator
        )
        return np.clip(prediction, 1, 5)

    def recommend_movies(self, user_id, n_recommendations=5):
        """
        Recommend movies for a user

        Args:
            user_id: Target user ID
            n_recommendations (int): Number of recommendations

        Returns:
            list: Top N recommendations as (movie_id, predicted_rating) pairs
        """
        # Get movies the user hasn't rated
        user_ratings = self.user_movie_matrix.loc[user_id]
        unrated_movies = user_ratings[user_ratings == 0].index

        # Predict ratings for all unrated movies
        predictions = [
            (movie_id, self.predict(user_id, movie_id))
            for movie_id in unrated_movies
        ]

        # Sort by predicted rating
        recommendations = sorted(predictions, key=lambda x: x[1], reverse=True)
        return recommendations[:n_recommendations]
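A quick usage sketch, assuming the user_movie_matrix from Step 2 (the user and movie IDs here are hypothetical and must exist in the matrix):

user_cf = UserBasedCF(n_neighbors=5)
user_cf.fit(user_movie_matrix)
print(user_cf.predict(user_id=1, movie_id=50))                    # single prediction
print(user_cf.recommend_movies(user_id=1, n_recommendations=5))   # top-5 list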
4. Implementing Item-Based Collaborative Filtering
Basic implementation:
class ItemBasedCF:
    def __init__(self, n_neighbors=10):
        self.n_neighbors = n_neighbors

    def fit(self, user_movie_matrix):
        """Calculate item similarities"""
        self.user_movie_matrix = user_movie_matrix
        self.item_similarity = pd.DataFrame(
            cosine_similarity(user_movie_matrix.T),
            index=user_movie_matrix.columns,
            columns=user_movie_matrix.columns
        )
Complete Item-Based CF implementation:
class ItemBasedCF:
    def __init__(self, n_neighbors=10):
        self.n_neighbors = n_neighbors

    def fit(self, user_movie_matrix):
        """
        Fit the model by calculating item similarities

        Args:
            user_movie_matrix (pd.DataFrame): User-movie rating matrix
        """
        self.user_movie_matrix = user_movie_matrix

        # Per-item means (zeros for unrated entries are included here, a
        # simplification that keeps the matrix dense)
        self.item_means = user_movie_matrix.mean()
        # Global mean over observed ratings, used as a last-resort fallback
        self.global_mean = user_movie_matrix.values[user_movie_matrix.values > 0].mean()
        normalized_matrix = user_movie_matrix.sub(self.item_means, axis=1)

        # Calculate item similarities
        self.item_similarity = pd.DataFrame(
            cosine_similarity(normalized_matrix.T),
            index=user_movie_matrix.columns,
            columns=user_movie_matrix.columns
        )

    def predict(self, user_id, movie_id):
        """
        Predict rating for a user-movie pair
        """
        # Unknown movie: no similarity column exists, so fall back to the global mean
        if movie_id not in self.item_similarity.index:
            return self.global_mean

        # Get user's ratings
        user_ratings = self.user_movie_matrix.loc[user_id]
        rated_movies = user_ratings[user_ratings > 0].index

        # Get the most similar items among those the user has rated
        similar_items = self.item_similarity[movie_id][rated_movies].nlargest(self.n_neighbors)

        if len(similar_items) == 0:
            return self.item_means[movie_id]

        # Weighted average: sum(sim * rating) / sum(|sim|)
        numerator = sum(sim * user_ratings[item] for item, sim in similar_items.items())
        denominator = sum(abs(sim) for sim in similar_items)

        if denominator == 0:
            return self.item_means[movie_id]

        prediction = numerator / denominator
        return np.clip(prediction, 1, 5)

    def recommend_movies(self, user_id, n_recommendations=5):
        """Recommend movies for a user"""
        user_ratings = self.user_movie_matrix.loc[user_id]
        unrated_movies = user_ratings[user_ratings == 0].index

        predictions = [
            (movie_id, self.predict(user_id, movie_id))
            for movie_id in unrated_movies
        ]

        recommendations = sorted(predictions, key=lambda x: x[1], reverse=True)
        return recommendations[:n_recommendations]
5. Using the Surprise Library
Basic implementation using Surprise:
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
# Create Surprise reader object
reader = Reader(rating_scale=(1, 5))
# Load data into Surprise format
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)
# Split the data
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
# Train SVD model
model = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)
model.fit(trainset)
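To sanity-check the fitted model on the held-out testset, Surprise's accuracy module provides RMSE and MAE helpers:

from surprise import accuracy

predictions = model.test(testset)  # list of Prediction objects
accuracy.rmse(predictions)         # prints and returns RMSE
accuracy.mae(predictions)          # prints and returns MAE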
Complete Surprise implementation:
import numpy as np
from surprise import Dataset, Reader, SVD, KNNBasic
from surprise.model_selection import train_test_split
from surprise import accuracy

def train_surprise_models(ratings_df):
    """
    Train and compare different Surprise models

    Args:
        ratings_df: DataFrame with columns user_id, movie_id, rating

    Returns:
        dict: Trained models and their performance metrics
    """
    # Create Surprise reader and load data
    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(ratings_df[['user_id', 'movie_id', 'rating']], reader)

    # Split data
    trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

    # Initialize models
    models = {
        'SVD': SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02),
        'KNN_user': KNNBasic(sim_options={'user_based': True}),
        'KNN_item': KNNBasic(sim_options={'user_based': False})
    }

    # Train and evaluate each model
    results = {}
    for name, model in models.items():
        print(f"\nTraining {name}...")

        # Train model
        model.fit(trainset)

        # Make predictions
        predictions = model.test(testset)

        # Calculate metrics
        results[name] = {
            'RMSE': accuracy.rmse(predictions),
            'MAE': accuracy.mae(predictions),
            'model': model
        }
        print(f"{name} - RMSE: {results[name]['RMSE']:.4f}, "
              f"MAE: {results[name]['MAE']:.4f}")

    return results

# Function to make recommendations using Surprise models
def get_top_n_recommendations(model, user_id, n=10):
    """Get top N recommendations for a user (uses the global ratings DataFrame)"""
    # Get all movies
    all_movies = ratings['movie_id'].unique()

    # Get movies the user hasn't rated
    user_ratings = ratings[ratings['user_id'] == user_id]['movie_id']
    movies_to_predict = np.setdiff1d(all_movies, user_ratings)

    # Make predictions
    predictions = [model.predict(user_id, movie_id) for movie_id in movies_to_predict]

    # Sort by estimated rating
    predictions.sort(key=lambda x: x.est, reverse=True)
    return predictions[:n]

# Example usage
surprise_results = train_surprise_models(ratings)
best_model = surprise_results['SVD']['model']
recommendations = get_top_n_recommendations(best_model, user_id=1, n=5)
6. Evaluation Metrics
Basic evaluation function:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

def evaluate_recommendations(model, test_data):
    """Calculate RMSE and MAE"""
    predictions = []
    actuals = []

    for _, row in test_data.iterrows():
        pred = model.predict(row['user_id'], row['movie_id'])
        predictions.append(pred)
        actuals.append(row['rating'])

    rmse = np.sqrt(mean_squared_error(actuals, predictions))
    mae = mean_absolute_error(actuals, predictions)
    return {'RMSE': rmse, 'MAE': mae}
Complete evaluation implementation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_recommendations(model, test_data, movies_df=None):
    """
    Comprehensive evaluation of recommendation system

    Args:
        model: Trained recommendation model
        test_data: Test dataset
        movies_df: DataFrame with movie information

    Returns:
        dict: Dictionary of evaluation metrics
    """
    predictions = []
    actuals = []

    # Rating prediction metrics
    for _, row in test_data.iterrows():
        pred = model.predict(row['user_id'], row['movie_id'])
        predictions.append(pred)
        actuals.append(row['rating'])

    # Calculate basic metrics
    rmse = np.sqrt(mean_squared_error(actuals, predictions))
    mae = mean_absolute_error(actuals, predictions)

    # Calculate additional metrics
    metrics = {
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2_score(actuals, predictions)
    }

    # Recommendation relevance metrics
    if movies_df is not None:
        metrics.update(calculate_recommendation_metrics(model, test_data, movies_df))

    return metrics

def calculate_recommendation_metrics(model, test_data, movies_df):
    """Calculate additional recommendation-specific metrics"""

    def precision_at_k(actual, predicted, k=10):
        """Calculate precision@k"""
        actual_set = set(actual[:k])
        predicted_set = set(predicted[:k])
        return len(actual_set.intersection(predicted_set)) / k

    def diversity(recommendations, movies_df):
        """Calculate recommendation diversity based on genres"""
        genres = []
        for movie_id in recommendations:
            movie_genres = movies_df[movies_df['movie_id'] == movie_id].iloc[0]
            genres.extend([col for col in movies_df.columns[5:]
                           if movie_genres[col] == 1])
        return len(set(genres)) / len(genres) if genres else 0

    # Calculate metrics for each user
    user_metrics = []
    for user_id in test_data['user_id'].unique():
        user_recs = model.recommend_movies(user_id, n_recommendations=10)
        recommended_movies = [rec[0] for rec in user_recs]

        # Get actual highly rated movies for user
        actual_likes = test_data[
            (test_data['user_id'] == user_id) &
            (test_data['rating'] >= 4)
        ]['movie_id'].tolist()

        user_metrics.append({
            'precision@10': precision_at_k(actual_likes, recommended_movies),
            'diversity': diversity(recommended_movies, movies_df)
        })

    # Average metrics across users
    avg_metrics = {
        'Precision@10': np.mean([m['precision@10'] for m in user_metrics]),
        'Diversity': np.mean([m['diversity'] for m in user_metrics])
    }

    return avg_metrics

# Visualization of results
def plot_evaluation_results(metrics_dict):
    """Plot evaluation metrics comparison"""
    plt.figure(figsize=(10, 5))

    # Plot metrics
    x = np.arange(len(metrics_dict))
    metrics = list(metrics_dict.values())

    plt.bar(x, metrics)
    plt.xticks(x, list(metrics_dict.keys()), rotation=45)
    plt.title('Recommendation System Evaluation Metrics')
    plt.tight_layout()
    plt.show()
7. Model Comparison
def compare_models(train_data, test_data):
    """Compare different recommendation models"""
    # Initialize models
    # Note: the CF classes expect the user-movie matrix from Step 2, while
    # Surprise's SVD expects a Surprise trainset (see Step 5); adapt the
    # fitting step accordingly rather than passing the same object to all.
    models = {
        'UserCF': UserBasedCF(n_neighbors=5),
        'ItemCF': ItemBasedCF(n_neighbors=10),
        'SVD': SVD(n_factors=100)
    }

    # Train and evaluate each model
    results = {}
    for name, model in models.items():
        model.fit(train_data)
        metrics = evaluate_recommendations(model, test_data)
        results[name] = metrics

    return results
Complete model comparison code:
import sys
import time

def compare_models(train_data, test_data, movies_df):
    """
    Comprehensive comparison of different recommendation models

    Args:
        train_data: Training dataset
        test_data: Test dataset
        movies_df: DataFrame with movie information

    Returns:
        dict: Performance metrics for each model
    """
    # Initialize models with different configurations
    models = {
        'UserCF_5': UserBasedCF(n_neighbors=5),
        'UserCF_10': UserBasedCF(n_neighbors=10),
        'ItemCF_10': ItemBasedCF(n_neighbors=10),
        'ItemCF_20': ItemBasedCF(n_neighbors=20),
        'SVD_50': SVD(n_factors=50),
        'SVD_100': SVD(n_factors=100)
    }

    results = {}
    for name, model in models.items():
        print(f"\nTraining {name}...")

        # Train model
        start_time = time.time()
        model.fit(train_data)
        training_time = time.time() - start_time

        # Evaluate model
        metrics = evaluate_recommendations(model, test_data, movies_df)
        metrics['Training Time'] = training_time

        # Add memory usage (a rough estimate; sys.getsizeof is shallow)
        memory_usage = sys.getsizeof(model) / 1024 / 1024  # MB
        metrics['Memory Usage (MB)'] = memory_usage

        results[name] = metrics
        print(f"{name} Results:")
        for metric, value in metrics.items():
            print(f"{metric}: {value:.4f}")

    # Visualize results
    plot_comparison_results(results)
    return results

def plot_comparison_results(results):
    """Create visualizations for model comparison"""
    metrics = ['RMSE', 'MAE', 'Precision@10', 'Diversity']

    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Model Comparison across Different Metrics')

    for i, metric in enumerate(metrics):
        ax = axes[i // 2, i % 2]
        values = [results[model][metric] for model in results]
        ax.bar(list(results.keys()), values)
        ax.set_title(metric)
        ax.tick_params(axis='x', rotation=45)

    plt.tight_layout()
    plt.show()

    # Plot training time comparison
    plt.figure(figsize=(10, 5))
    times = [results[model]['Training Time'] for model in results]
    plt.bar(list(results.keys()), times)
    plt.title('Training Time Comparison')
    plt.ylabel('Time (seconds)')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
8. Making Recommendations
Basic recommendation function:
def get_movie_recommendations(model, user_id, movies_df, n=5):
    """Get movie recommendations for a user"""
    # Get recommendations
    recommendations = model.recommend_movies(user_id, n_recommendations=n)

    # Format results
    rec_list = []
    for movie_id, pred_rating in recommendations:
        movie_title = movies_df[movies_df['movie_id'] == movie_id]['title'].iloc[0]
        rec_list.append({
            'title': movie_title,
            'predicted_rating': pred_rating
        })

    return rec_list
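Example usage, assuming user_cf is the fitted UserBasedCF from Step 3 and movies is the DataFrame from Step 1 (both bindings are hypothetical):

# Print a formatted top-5 list for user 1
for rec in get_movie_recommendations(user_cf, user_id=1, movies_df=movies, n=5):
    print(f"{rec['title']}: {rec['predicted_rating']:.2f}")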
9. Final Evaluation and Visualization
Evaluate the best model:
def final_evaluation(best_model, test_data):
    """
    Evaluate the best-performing model on the test set.

    Args:
        best_model: The trained model selected based on previous evaluations.
        test_data: The test dataset.

    Returns:
        dict: A dictionary containing evaluation metrics.
    """
    # Initialize lists to store actual and predicted ratings
    actual_ratings = []
    predicted_ratings = []

    # Iterate over the test data
    for _, row in test_data.iterrows():
        user_id = row['user_id']
        movie_id = row['movie_id']
        actual_rating = row['rating']

        # Predict the rating
        predicted_rating = best_model.predict(user_id, movie_id)

        # Append to lists
        actual_ratings.append(actual_rating)
        predicted_ratings.append(predicted_rating)

    # Calculate evaluation metrics
    rmse = np.sqrt(mean_squared_error(actual_ratings, predicted_ratings))
    mae = mean_absolute_error(actual_ratings, predicted_ratings)

    print('Final Model Evaluation:')
    print(f'RMSE: {rmse:.4f}')
    print(f'MAE: {mae:.4f}')

    # Plot actual vs. predicted ratings
    plt.figure(figsize=(10, 6))
    plt.scatter(actual_ratings, predicted_ratings, alpha=0.5)
    plt.xlabel('Actual Ratings')
    plt.ylabel('Predicted Ratings')
    plt.title('Actual vs Predicted Ratings')
    plt.show()

    return {'RMSE': rmse, 'MAE': mae}
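Example usage, assuming best_model is whichever model performed best in Step 7:

final_metrics = final_evaluation(best_model, test_data)
print(final_metrics)  # {'RMSE': ..., 'MAE': ...}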
10. Next Steps
- Content-Based Filtering: Incorporate item features (e.g., genres, directors) to improve recommendations.
- Hybrid Systems: Combine collaborative and content-based methods for better performance.
- Advanced Algorithms: Explore matrix factorization techniques like SVD++ or deep learning approaches.
- Scalability: Implement distributed computing solutions to handle larger datasets efficiently.
- User Interface: Develop a web or mobile application interface for users to interact with the recommendation system.
By pursuing these extensions, you can enhance the recommendation system’s capabilities and applicability in real-world scenarios.