Optimizing Graph Neural Network-Based Recommender Systems with PyTorch Geometric
Objective
Develop a graph-based recommender system using PyTorch Geometric (PyG), focusing specifically on hyperparameter optimization and model optimization techniques to improve recommendation accuracy. This project aims to provide hands-on experience in building and fine-tuning GNN models for recommender systems, emphasizing the practical aspects of model performance enhancement using PyG.
Learning Outcomes
By completing this project, you will:
- Gain in-depth knowledge of PyTorch Geometric for building GNNs.
- Understand how to optimize GNN models through hyperparameter tuning and architectural adjustments.
- Learn to implement advanced hyperparameter optimization techniques using tools like Optuna.
- Acquire skills in model evaluation and performance analysis specific to GNN-based recommender systems.
- Develop expertise in handling graph data preprocessing and feature engineering tailored for recommendation tasks.
Prerequisites and Theoretical Foundations
1. Python Programming (Intermediate Level)
- Data Structures: Lists, dictionaries, sets, and tuples.
- Control Flow: Loops, conditionals, and functions.
- Object-Oriented Programming: Classes and inheritance.
- Libraries: Familiarity with PyTorch and Pandas.
2. Basic Understanding of Graph Neural Networks
- Graph Theory:
- Concepts: Nodes, edges, adjacency matrices.
- Neural Networks:
- Fundamentals: Layers, activation functions, loss functions.
- Training process: Forward pass, backward pass, optimization.
3. Basic Understanding of Recommender Systems
- Collaborative Filtering:
- User-item interaction matrix.
- Graph-Based Approaches:
- Representing recommender systems as graphs.
Tools Required
- Programming Language: Python 3.7+
- Libraries and Frameworks:
- PyTorch:
pip install torch
- PyTorch Geometric (PyG): Follow the official installation guide
- Pandas:
pip install pandas
- Optuna:
pip install optuna
- Scikit-learn:
pip install scikit-learn
- Matplotlib:
pip install matplotlib
- Dataset:
- MovieLens dataset (Preferably the 100K variant for manageability)
Project Structure
gnn_recommender_system/
│
├── data/
│   └── movielens/
│       ├── ratings.csv
│       └── movies.csv
│
├── src/
│   ├── data_preprocessing.py
│   ├── graph_construction.py
│   ├── model.py
│   ├── train.py
│   ├── evaluate.py
│   ├── hyperparameter_optimization.py
│   └── utils.py
│
└── notebooks/
    ├── data_exploration.ipynb
    ├── model_training.ipynb
    └── hyperparameter_optimization.ipynb
Steps and Tasks
1. Data Acquisition and Exploration
Tasks:
- Download a MovieLens Dataset:
- Use the official MovieLens website. (Note: the classic 100K release ships u.data/u.item files rather than CSVs; the ml-latest-small release provides the ratings.csv and movies.csv used in the code below.)
- Load Data into Pandas DataFrames:
- Load ratings.csv and movies.csv.
- Perform Exploratory Data Analysis (EDA):
- Understand the distribution of ratings.
- Analyze user and item interactions.
- Check for missing values or inconsistencies.
Implementation:
import pandas as pd
import matplotlib.pyplot as plt
# Load data
ratings = pd.read_csv('data/movielens/ratings.csv')
movies = pd.read_csv('data/movielens/movies.csv')
# EDA
print(ratings.head())
print(movies.head())
# Rating distribution
plt.hist(ratings['rating'], bins=5)
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('Rating Distribution')
plt.show()
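To cover the remaining EDA tasks (interaction analysis and missing-value checks), here is a short sketch; the column names assume the ratings.csv layout loaded above:
# Interaction counts per user and per movie
user_counts = ratings.groupby('userId').size()
movie_counts = ratings.groupby('movieId').size()
print(f'Users: {user_counts.shape[0]}, Movies: {movie_counts.shape[0]}')
print(f'Ratings per user:  min={user_counts.min()}, median={user_counts.median()}, max={user_counts.max()}')
print(f'Ratings per movie: min={movie_counts.min()}, median={movie_counts.median()}, max={movie_counts.max()}')
# Check for missing values
print(ratings.isnull().sum())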
2. Data Preprocessing and Graph Construction
Tasks:
- Preprocess Data:
- Encode user IDs and movie IDs to indices.
- Normalize ratings if necessary.
- Construct Graph Data Structure for PyG:
- Nodes: Users and movies.
- Edges: Interactions (ratings) between users and movies.
- Create Node and Edge Features:
- Node features: Could be embeddings or one-hot encodings.
- Edge features: Ratings or binary indicators of interaction.
Implementation:
import torch
from torch_geometric.data import Data
from sklearn.preprocessing import LabelEncoder
# Encode user IDs and movie IDs
user_encoder = LabelEncoder()
movie_encoder = LabelEncoder()
ratings['user_idx'] = user_encoder.fit_transform(ratings['userId'])
ratings['movie_idx'] = movie_encoder.fit_transform(ratings['movieId'])
# Prepare edge index: movie nodes are offset by the number of users so that
# users and movies share a single node index space
edge_index = torch.tensor([
    ratings['user_idx'].values,
    ratings['movie_idx'].values + ratings['user_idx'].nunique()
], dtype=torch.long)
# Prepare edge attributes (ratings)
edge_attr = torch.tensor(ratings['rating'].values, dtype=torch.float)
# Create node features (optional: use zeros if not available)
num_users = ratings['user_idx'].nunique()
num_movies = ratings['movie_idx'].nunique()
num_nodes = num_users + num_movies
x = torch.zeros((num_nodes, 1)) # Assuming no node features
# Create data object
data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr)
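A quick sanity check on the constructed graph can catch indexing mistakes early; a minimal sketch using the variables defined above:
# Verify that node counts and edge indices are consistent
print(f'Users: {num_users}, Movies: {num_movies}, Total nodes: {num_nodes}')
print(f'Edges: {edge_index.size(1)}')
assert int(edge_index.max()) < num_nodes, 'edge index points outside the node range'
print(data)  # e.g. Data(x=[num_nodes, 1], edge_index=[2, E], edge_attr=[E])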
3. Splitting the Data
Tasks:
- Create Training, Validation, and Test Sets:
- Use techniques suitable for graph data.
- Ensure that there is no data leakage.
- Mask Edges for Validation and Testing:
- Remove certain edges from the training graph to be used for evaluation.
Implementation:
from torch_geometric.utils import train_test_split_edges
# Split data
data = train_test_split_edges(data, val_ratio=0.1, test_ratio=0.1)
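Note that train_test_split_edges is deprecated in recent PyG releases in favor of the RandomLinkSplit transform. A hedged alternative is sketched below; it produces three separate Data objects with different attribute names (edge_label, edge_label_index), so the later steps would need to be adapted if you use it:
from torch_geometric.transforms import RandomLinkSplit

# Split edges into train/val/test link-prediction sets
transform = RandomLinkSplit(num_val=0.1, num_test=0.1, is_undirected=False)
train_data, val_data, test_data = transform(data)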
4. Implementing the GNN Model
Tasks:
- Define the Model Architecture:
- Start with a simple GCNConv layer.
- Set Up the Forward Pass:
- Handle the propagation of node features through the network.
- Implement the Loss Function:
- Use appropriate loss functions for link prediction (e.g., MSELoss).
Implementation:
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.utils import negative_sampling

class GCNRecommender(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GCNRecommender, self).__init__()
        # Learnable embeddings serve as the initial node features for users and movies
        self.embedding = torch.nn.Embedding(data.num_nodes, hidden_channels)
        self.conv1 = GCNConv(hidden_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)

    def encode(self):
        x = self.embedding.weight
        x = self.conv1(x, data.train_pos_edge_index)
        x = F.relu(x)
        x = self.conv2(x, data.train_pos_edge_index)
        return x

    def decode(self, z, edge_index):
        # Dot product of the two node embeddings for each edge
        return (z[edge_index[0]] * z[edge_index[1]]).sum(dim=1)

    def forward(self):
        z = self.encode()
        pos_edge_index = data.train_pos_edge_index
        # train_test_split_edges does not create negative training edges,
        # so sample them on the fly for each forward pass
        neg_edge_index = negative_sampling(
            edge_index=pos_edge_index,
            num_nodes=data.num_nodes,
            num_neg_samples=pos_edge_index.size(1),
        )
        pos_pred = self.decode(z, pos_edge_index)
        neg_pred = self.decode(z, neg_edge_index)
        return pos_pred, neg_pred
5. Training the Model
Tasks:
- Set Up the Optimizer and Loss Function:
- Use Adam optimizer.
- Define a suitable loss function (e.g., binary cross-entropy loss).
- Implement the Training Loop:
- Iterate over epochs.
- Perform forward pass, compute loss, backward pass, and optimizer step.
- Validate the Model:
- Evaluate performance on the validation set after each epoch.
Implementation:
from sklearn.metrics import roc_auc_score

model = GCNRecommender(hidden_channels=64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.BCEWithLogitsLoss()

def train():
    model.train()
    optimizer.zero_grad()
    pos_pred, neg_pred = model()
    pos_loss = criterion(pos_pred, torch.ones(pos_pred.size(0)))
    neg_loss = criterion(neg_pred, torch.zeros(neg_pred.size(0)))
    loss = pos_loss + neg_loss
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def test():
    model.eval()
    z = model.encode()
    # Note: this evaluates on the test split; during training you would normally
    # monitor data.val_pos_edge_index / data.val_neg_edge_index instead
    pos_pred = model.decode(z, data.test_pos_edge_index).sigmoid()
    neg_pred = model.decode(z, data.test_neg_edge_index).sigmoid()
    preds = torch.cat([pos_pred, neg_pred]).cpu().numpy()
    labels = torch.cat([torch.ones(pos_pred.size(0)),
                        torch.zeros(neg_pred.size(0))]).numpy()
    return roc_auc_score(labels, preds)

# Training loop
training_losses = []
for epoch in range(1, 201):
    loss = train()
    training_losses.append(loss)
    if epoch % 10 == 0:
        auc = test()
        print(f'Epoch {epoch}, Loss: {loss:.4f}, AUC: {auc:.4f}')
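The task list above asks for validation after each epoch, while test() uses the held-out test edges. A minimal validation helper, assuming the attributes created by train_test_split_edges in Step 3:
@torch.no_grad()
def validate():
    model.eval()
    z = model.encode()
    pos_pred = model.decode(z, data.val_pos_edge_index).sigmoid()
    neg_pred = model.decode(z, data.val_neg_edge_index).sigmoid()
    preds = torch.cat([pos_pred, neg_pred]).cpu().numpy()
    labels = torch.cat([torch.ones(pos_pred.size(0)),
                        torch.zeros(neg_pred.size(0))]).numpy()
    return roc_auc_score(labels, preds)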
6. Hyperparameter Optimization
Tasks:
- Define Hyperparameters to Tune:
- Learning rate, hidden dimensions, number of layers, activation functions, etc.
- Set Up Optuna for Hyperparameter Tuning:
- Create an objective function that trains the model and returns validation performance.
- Run Hyperparameter Optimization Trials:
- Use Optuna to find the best hyperparameters.
Implementation:
import optuna
def objective(trial):
    global model, optimizer  # train() and test() refer to these module-level names
    # Suggest hyperparameters
    hidden_channels = trial.suggest_categorical('hidden_channels', [32, 64, 128])
    learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-2, log=True)
    num_layers = trial.suggest_int('num_layers', 1, 3)
    # Define the model with the suggested hyperparameters
    # (uses the num_layers-aware GCNRecommender defined in Step 7)
    model = GCNRecommender(hidden_channels=hidden_channels, num_layers=num_layers)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    # Training loop (simplified)
    for epoch in range(1, 51):
        loss = train()
    # Validation: ideally evaluate on the validation split here so the test set
    # stays untouched for the final evaluation
    auc = test()
    return auc
# Run optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Best hyperparameters:', study.best_params)
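Optuna can also prune unpromising trials early. A minimal sketch of the pruning API; the reporting loop would replace the simplified training loop inside objective():
# Inside objective(): report an intermediate metric each epoch and stop early
# if the trial looks unpromising
for epoch in range(1, 51):
    loss = train()
    auc = test()  # ideally a validation-split AUC
    trial.report(auc, epoch)
    if trial.should_prune():
        raise optuna.TrialPruned()

# Attach a pruner when creating the study
study = optuna.create_study(direction='maximize', pruner=optuna.pruners.MedianPruner())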
7. Model Optimization Techniques
Tasks:
- Implement Early Stopping:
- Stop training when validation loss doesn't improve.
- Add Regularization Techniques:
- Apply dropout layers.
- Use weight decay in the optimizer.
- Experiment with Different Architectures:
- Try GATConv, SAGEConv (GraphSAGE), or other PyG layers.
Implementation:
# Modify the model to include dropout
class GCNRecommender(torch.nn.Module):
def __init__(self, hidden_channels, num_layers):
super(GCNRecommender, self).__init__()
self.embedding = torch.nn.Embedding(data.num_nodes, hidden_channels)
self.convs = torch.nn.ModuleList()
self.convs.append(GCNConv(hidden_channels, hidden_channels))
for _ in range(num_layers - 1):
self.convs.append(GCNConv(hidden_channels, hidden_channels))
self.dropout = torch.nn.Dropout(p=0.5)
def encode(self):
x = self.embedding.weight
for conv in self.convs:
x = conv(x, data.train_pos_edge_index)
x = F.relu(x)
x = self.dropout(x)
return x
# Rest of the class remains the same
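The remaining tasks (early stopping and weight decay) are not shown above. A minimal sketch, assuming the train() helper from Step 5 and the validate() sketch added after it; the patience and weight_decay values are illustrative choices:
# Weight decay (L2 regularization) is passed directly to the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# Early stopping on the validation metric
best_auc = 0.0
patience, bad_epochs = 20, 0
for epoch in range(1, 201):
    loss = train()
    val_auc = validate()
    if val_auc > best_auc:
        best_auc = val_auc
        bad_epochs = 0
        torch.save(model.state_dict(), 'best_model.pt')  # keep the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f'Early stopping at epoch {epoch}')
            break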
8. Evaluation and Analysis
Tasks:
- Evaluate the Final Model on Test Set:
- Compute metrics like AUC, Precision@K, Recall@K.
- Analyze Results:
- Interpret the impact of hyperparameter tuning.
- Visualize training and validation loss curves.
Implementation:
# Evaluate on test set
final_auc = test()
print(f'Final Test AUC: {final_auc:.4f}')
# Plot the loss curve (training_losses is collected in the Step 5 training loop;
# record validation losses per epoch in the same way if you want both curves)
plt.plot(training_losses, label='Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
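AUC is the only metric computed above. A hedged sketch of Precision@K and Recall@K, assuming the bipartite index layout from Step 2 (users first, movies offset by num_users) and the edge splits from Step 3:
@torch.no_grad()
def precision_recall_at_k(k=10):
    model.eval()
    z = model.encode()
    user_emb = z[:num_users]                          # user embeddings
    movie_emb = z[num_users:num_users + num_movies]   # movie embeddings
    scores = user_emb @ movie_emb.t()                 # [num_users, num_movies]

    # Collect per-user train/test movie sets (movie indices local to the movie block)
    def edges_to_dict(edge_index):
        d = {}
        for u, m in zip(edge_index[0].tolist(), edge_index[1].tolist()):
            if u < num_users and m >= num_users:      # keep only user -> movie direction
                d.setdefault(u, set()).add(m - num_users)
        return d

    train_items = edges_to_dict(data.train_pos_edge_index)
    test_items = edges_to_dict(data.test_pos_edge_index)

    precisions, recalls = [], []
    for u, relevant in test_items.items():
        user_scores = scores[u].clone()
        for m in train_items.get(u, ()):              # never recommend already-seen items
            user_scores[m] = float('-inf')
        top_k = torch.topk(user_scores, k).indices.tolist()
        hits = len(set(top_k) & relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

prec, rec = precision_recall_at_k(k=10)
print(f'Precision@10: {prec:.4f}, Recall@10: {rec:.4f}')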
Further Enhancements
To continue improving your project and skills:
- Experiment with Advanced GNN Layers:
- Try Graph Attention Networks (GATConv) or GraphSAGE (SAGEConv).
- Explore Different Loss Functions:
- Use BPR Loss for ranking-based recommendations (see the sketch after this list).
- Implement Cross-Validation:
- Ensure robustness of the model by validating across multiple data splits.
- Scale Up with Larger Datasets:
- Apply the model to the full MovieLens dataset or other large-scale datasets.
- Deploy the Model:
- Create a simple API using FastAPI or Flask to serve recommendations.
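A minimal BPR loss sketch for the ranking-based option above, assuming pos_pred and neg_pred are scores for matched positive/negative pairs (as returned by the model's forward pass):
import torch.nn.functional as F

def bpr_loss(pos_pred, neg_pred):
    # Bayesian Personalized Ranking: push each positive score above its paired negative
    return -F.logsigmoid(pos_pred - neg_pred).mean()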