🔴 Advanced Movie Recommendation Systems with Graph Neural Networks and PyTorch Geometric

Optimizing Graph Neural Network-Based Recommender Systems with PyTorch Geometric

Objective

Develop a graph-based recommender system using PyTorch Geometric (PyG), focusing on hyperparameter tuning and model optimization techniques that improve recommendation accuracy. The project provides hands-on experience in building and fine-tuning GNN models for recommender systems, with an emphasis on the practical side of improving model performance in PyG.


Learning Outcomes

By completing this project, you will:

  • Gain in-depth knowledge of PyTorch Geometric for building GNNs.
  • Understand how to optimize GNN models through hyperparameter tuning and architectural adjustments.
  • Learn to implement advanced hyperparameter optimization techniques using tools like Optuna.
  • Acquire skills in model evaluation and performance analysis specific to GNN-based recommender systems.
  • Develop expertise in handling graph data preprocessing and feature engineering tailored for recommendation tasks.

Prerequisites and Theoretical Foundations

1. Python Programming (Intermediate Level)

  • Data Structures: Lists, dictionaries, sets, and tuples.
  • Control Flow: Loops, conditionals, and functions.
  • Object-Oriented Programming: Classes and inheritance.
  • Libraries: Familiarity with PyTorch and Pandas.

2. Basic Understanding of Graph Neural Networks

  • Graph Theory:
    • Concepts: Nodes, edges, adjacency matrices.
  • Neural Networks:
    • Fundamentals: Layers, activation functions, loss functions.
    • Training process: Forward pass, backward pass, optimization.

3. Basic Understanding of Recommender Systems

  • Collaborative Filtering:
    • User-item interaction matrix.
  • Graph-Based Approaches:
    • Representing recommender systems as graphs.
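
To make that last point concrete, here is a toy illustration (with made-up IDs) of how a rating table becomes a bipartite edge list; Step 2 applies the same idea, offsetting movie indices so users and movies share one node-index space:

# Toy rating table: (user, movie, rating) triples with made-up IDs
ratings_table = [
    (0, 0, 5.0),  # user 0 rated movie 0 with 5.0
    (0, 2, 3.0),
    (1, 1, 4.0),
]
num_users = 2

# Each rating becomes an edge; movie m becomes node (num_users + m)
edges = [(u, num_users + m) for u, m, _ in ratings_table]
print(edges)  # [(0, 2), (0, 4), (1, 3)]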

Tools Required

  • Programming Language: Python 3.7+
  • Libraries and Frameworks:
    • PyTorch: pip install torch
    • PyTorch Geometric (PyG): follow the official installation guide
    • Pandas: pip install pandas
    • Optuna: pip install optuna
    • Scikit-learn: pip install scikit-learn
    • Matplotlib: pip install matplotlib
  • Dataset:
    • MovieLens dataset (preferably the 100K variant for manageability)

Project Structure

gnn_recommender_system/
│
├── data/
│   └── movielens/
│       ├── ratings.csv
│       └── movies.csv
│
├── src/
│   ├── data_preprocessing.py
│   ├── graph_construction.py
│   ├── model.py
│   ├── train.py
│   ├── evaluate.py
│   ├── hyperparameter_optimization.py
│   └── utils.py
│
└── notebooks/
    ├── data_exploration.ipynb
    ├── model_training.ipynb
    └── hyperparameter_optimization.ipynb

Steps and Tasks

1. Data Acquisition and Exploration

Tasks:

  • Download the MovieLens 100K Dataset:
    • Available from GroupLens (https://grouplens.org/datasets/movielens/). Note that the classic ml-100k release ships tab-separated files (u.data, u.item), while the ml-latest-small variant ships the ratings.csv and movies.csv files assumed below.
  • Load Data into Pandas DataFrames:
    • ratings.csv and movies.csv.
  • Perform Exploratory Data Analysis (EDA):
    • Understand the distribution of ratings.
    • Analyze user and item interactions.
    • Check for missing values or inconsistencies (both are covered in the follow-up snippet below).

Implementation:

import pandas as pd
import matplotlib.pyplot as plt

# Load data
ratings = pd.read_csv('data/movielens/ratings.csv')
movies = pd.read_csv('data/movielens/movies.csv')

# EDA
print(ratings.head())
print(movies.head())

# Rating distribution
plt.hist(ratings['rating'], bins=5)
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('Rating Distribution')
plt.show()
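
To cover the remaining two EDA tasks (interaction analysis and missing-value checks), a short follow-up:

# Interactions per user and per movie
print(ratings.groupby('userId').size().describe())   # ratings per user
print(ratings.groupby('movieId').size().describe())  # ratings per movie

# Missing values and duplicate interactions
print(ratings.isnull().sum())
print(movies.isnull().sum())
print('Duplicate ratings:', ratings.duplicated(subset=['userId', 'movieId']).sum())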

2. Data Preprocessing and Graph Construction

Tasks:

  • Preprocess Data:
    • Encode user IDs and movie IDs to indices.
    • Normalize ratings if necessary (an optional sketch follows the code below).
  • Construct Graph Data Structure for PyG:
    • Nodes: Users and movies.
    • Edges: Interactions (ratings) between users and movies.
  • Create Node and Edge Features:
    • Node features: Could be embeddings or one-hot encodings.
    • Edge features: Ratings or binary indicators of interaction.

Implementation:

import torch
from torch_geometric.data import Data
from sklearn.preprocessing import LabelEncoder

# Encode user IDs and movie IDs
user_encoder = LabelEncoder()
movie_encoder = LabelEncoder()

ratings['user_idx'] = user_encoder.fit_transform(ratings['userId'])
ratings['movie_idx'] = movie_encoder.fit_transform(ratings['movieId'])

# Prepare edge index: offset movie indices by the number of users so that
# users and movies occupy disjoint ranges in a single node-index space
edge_index = torch.tensor([
    ratings['user_idx'].values,
    ratings['movie_idx'].values + ratings['user_idx'].nunique()
], dtype=torch.long)

# Prepare edge attributes (ratings)
edge_attr = torch.tensor(ratings['rating'].values, dtype=torch.float)

# Create placeholder node features (the model in Step 4 learns an
# Embedding per node instead of using these)
num_users = ratings['user_idx'].nunique()
num_movies = ratings['movie_idx'].nunique()
num_nodes = num_users + num_movies
x = torch.zeros((num_nodes, 1))  # No informative node features yet

# Create data object
data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr)
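
If you do want normalized ratings (the optional preprocessing task above), here is a minimal sketch that rescales the edge attributes to [0, 1]; the rating range differs between MovieLens versions, so it is read from the data rather than hard-coded:

# Optional: min-max scale the ratings before building edge_attr
r_min, r_max = ratings['rating'].min(), ratings['rating'].max()
edge_attr = torch.tensor(
    ((ratings['rating'] - r_min) / (r_max - r_min)).values,
    dtype=torch.float,
)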

3. Splitting the Data

Tasks:

  • Create Training, Validation, and Test Sets:
    • Use techniques suitable for graph data.
    • Ensure that there is no data leakage.
  • Mask Edges for Validation and Testing:
    • Remove certain edges from the training graph to be used for evaluation.

Implementation:

from torch_geometric.utils import train_test_split_edges

# Split data
data = train_test_split_edges(data, val_ratio=0.1, test_ratio=0.1)
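
One caveat: train_test_split_edges is deprecated in recent PyG (2.x) releases in favor of the RandomLinkSplit transform, which returns separate train/val/test Data objects (carrying edge_label / edge_label_index attributes) instead of adding attributes in place. The rest of this guide keeps the older attribute names for simplicity, but the newer equivalent looks roughly like this:

from torch_geometric.transforms import RandomLinkSplit

# 80/10/10 split; negative evaluation edges are sampled automatically,
# and each returned object carries edge_label / edge_label_index
transform = RandomLinkSplit(num_val=0.1, num_test=0.1, is_undirected=False)
train_data, val_data, test_data = transform(data)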

4. Implementing the GNN Model

Tasks:

  • Define the Model Architecture:
    • Start with a simple GCNConv layer.
  • Set Up the Forward Pass:
    • Handle the propagation of node features through the network.
  • Implement the Loss Function:
    • Use a loss suited to the prediction target: binary cross-entropy for implicit link prediction (as in Step 5), or MSE when regressing raw ratings.

Implementation:

import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNRecommender(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        # One learnable embedding per node (users and movies share one index space)
        self.embedding = torch.nn.Embedding(data.num_nodes, hidden_channels)
        self.conv1 = GCNConv(hidden_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)

    def encode(self):
        x = self.embedding.weight
        x = self.conv1(x, data.train_pos_edge_index)
        x = F.relu(x)
        x = self.conv2(x, data.train_pos_edge_index)
        return x

    def decode(self, z, edge_index):
        # Score each candidate edge by the dot product of its endpoint embeddings
        return (z[edge_index[0]] * z[edge_index[1]]).sum(dim=1)

    def forward(self, neg_edge_index):
        # Negative edges are sampled fresh each epoch in the training loop
        # (train_test_split_edges does not create a train_neg_edge_index)
        z = self.encode()
        pos_pred = self.decode(z, data.train_pos_edge_index)
        neg_pred = self.decode(z, neg_edge_index)
        return pos_pred, neg_pred

5. Training the Model

Tasks:

  • Set Up the Optimizer and Loss Function:
    • Use Adam optimizer.
    • Define a suitable loss function (e.g., binary cross-entropy loss).
  • Implement the Training Loop:
    • Iterate over epochs.
    • Perform forward pass, compute loss, backward pass, and optimizer step.
  • Validate the Model:
    • Evaluate performance on the validation set after each epoch.

Implementation:

from sklearn.metrics import roc_auc_score
from torch_geometric.utils import negative_sampling

model = GCNRecommender(hidden_channels=64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.BCEWithLogitsLoss()

def train(model, optimizer):
    model.train()
    optimizer.zero_grad()
    # Sample one negative (non-interacting) node pair per positive edge;
    # this simple sampler can also propose user-user or movie-movie pairs,
    # so a bipartite-aware sampler is a possible refinement
    neg_edge_index = negative_sampling(
        edge_index=data.train_pos_edge_index,
        num_nodes=data.num_nodes,
        num_neg_samples=data.train_pos_edge_index.size(1),
    )
    pos_pred, neg_pred = model(neg_edge_index)
    pos_loss = criterion(pos_pred, torch.ones(pos_pred.size(0)))
    neg_loss = criterion(neg_pred, torch.zeros(neg_pred.size(0)))
    loss = pos_loss + neg_loss
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def evaluate(model, pos_edge_index, neg_edge_index):
    model.eval()
    z = model.encode()
    pos_pred = model.decode(z, pos_edge_index).sigmoid()
    neg_pred = model.decode(z, neg_edge_index).sigmoid()
    preds = torch.cat([pos_pred, neg_pred]).numpy()
    labels = torch.cat([torch.ones(pos_pred.size(0)),
                        torch.zeros(neg_pred.size(0))]).numpy()
    return roc_auc_score(labels, preds)

# Training loop: monitor the validation split, keeping the test split untouched
for epoch in range(1, 201):
    loss = train(model, optimizer)
    if epoch % 10 == 0:
        auc = evaluate(model, data.val_pos_edge_index, data.val_neg_edge_index)
        print(f'Epoch {epoch}, Loss: {loss:.4f}, Val AUC: {auc:.4f}')

6. Hyperparameter Optimization

Tasks:

  • Define Hyperparameters to Tune:
    • Learning rate, hidden dimensions, number of layers, activation functions, etc.
  • Set Up Optuna for Hyperparameter Tuning:
    • Create an objective function that trains the model and returns validation performance.
  • Run Hyperparameter Optimization Trials:
    • Use Optuna to find the best hyperparameters.

Implementation:

import optuna

def objective(trial):
    # Suggest hyperparameters
    hidden_channels = trial.suggest_categorical('hidden_channels', [32, 64, 128])
    learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-2, log=True)
    num_layers = trial.suggest_int('num_layers', 1, 3)

    # Build a fresh model per trial (uses the multi-layer variant from Step 7)
    model = GCNRecommender(hidden_channels=hidden_channels, num_layers=num_layers)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # Shortened training loop
    for epoch in range(1, 51):
        train(model, optimizer)

    # Return the validation (not test) AUC so the test set stays untouched
    return evaluate(model, data.val_pos_edge_index, data.val_neg_edge_index)

# Run optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

print('Best hyperparameters:', study.best_params)
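
Once the study finishes, retrain a fresh model with the best parameters for the full epoch budget before touching the test set. A minimal sketch, reusing train() from Step 5 and the multi-layer model from Step 7:

# Retrain with the best hyperparameters found by Optuna
best = study.best_params
model = GCNRecommender(hidden_channels=best['hidden_channels'],
                       num_layers=best['num_layers'])
optimizer = torch.optim.Adam(model.parameters(), lr=best['learning_rate'])
for epoch in range(1, 201):
    train(model, optimizer)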

7. Model Optimization Techniques

Tasks:

  • Implement Early Stopping:
    • Stop training when the validation metric stops improving (a sketch follows the dropout example below).
  • Add Regularization Techniques:
    • Apply dropout layers.
    • Use weight decay in the optimizer.
  • Experiment with Different Architectures:
    • Try GATConv, SAGEConv, or other PyG layers.

Implementation:

# Multi-layer variant with dropout (this is the signature the Optuna
# objective in Step 6 expects)
class GCNRecommender(torch.nn.Module):
    def __init__(self, hidden_channels, num_layers):
        super().__init__()
        self.embedding = torch.nn.Embedding(data.num_nodes, hidden_channels)
        self.convs = torch.nn.ModuleList()
        for _ in range(num_layers):
            self.convs.append(GCNConv(hidden_channels, hidden_channels))
        self.dropout = torch.nn.Dropout(p=0.5)

    def encode(self):
        x = self.embedding.weight
        for conv in self.convs:
            x = conv(x, data.train_pos_edge_index)
            x = F.relu(x)
            x = self.dropout(x)  # Dropout is only active in train() mode

    # decode() and forward() remain the same as in Step 4
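
The weight-decay and early-stopping tasks are not shown above, so here is a minimal sketch of both, reusing train() and evaluate() from Step 5; the patience of 10 and weight decay of 5e-4 are arbitrary illustrative values:

# Weight decay (L2 regularization) is just an optimizer argument
model = GCNRecommender(hidden_channels=64, num_layers=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# Early stopping: track the best validation AUC with a patience counter
best_auc, patience, bad_epochs = 0.0, 10, 0
for epoch in range(1, 201):
    loss = train(model, optimizer)
    auc = evaluate(model, data.val_pos_edge_index, data.val_neg_edge_index)
    if auc > best_auc:
        best_auc, bad_epochs = auc, 0
        torch.save(model.state_dict(), 'best_model.pt')  # checkpoint the best model
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f'Stopping early at epoch {epoch} (best val AUC: {best_auc:.4f})')
            break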

8. Evaluation and Analysis

Tasks:

  • Evaluate the Final Model on Test Set:
    • Compute metrics like AUC, Precision@K, and Recall@K (a Precision@K sketch follows the implementation below).
  • Analyze Results:
    • Interpret the impact of hyperparameter tuning.
    • Visualize training and validation loss curves.

Implementation:

# Evaluate on the held-out test set, reusing evaluate() from Step 5
final_auc = evaluate(model, data.test_pos_edge_index, data.test_neg_edge_index)
print(f'Final Test AUC: {final_auc:.4f}')

# Plot loss curves (training_losses / validation_losses are lists you
# append to inside the training loop; they are not built automatically)
plt.plot(training_losses, label='Training Loss')
plt.plot(validation_losses, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
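
AUC is computed above, but Precision@K and Recall@K need per-user rankings. A minimal sketch, assuming the data, model, num_users, and num_movies variables from the earlier steps are in scope; it scores every movie for each user, takes the top K, and compares against that user's held-out test edges:

def precision_recall_at_k(model, k=10):
    model.eval()
    with torch.no_grad():
        z = model.encode()
    user_emb = z[:num_users]                       # user nodes come first
    movie_emb = z[num_users:num_users + num_movies]  # movie nodes are offset
    scores = user_emb @ movie_emb.t()              # [num_users, num_movies]
    # Note: items already seen in training are not filtered out of the
    # top-K here; filtering them usually raises both metrics

    test_users = data.test_pos_edge_index[0]
    test_movies = data.test_pos_edge_index[1] - num_users
    precisions, recalls = [], []
    for u in test_users.unique():
        relevant = set(test_movies[test_users == u].tolist())
        top_k = set(scores[u].topk(k).indices.tolist())
        hits = len(relevant & top_k)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

p_at_10, r_at_10 = precision_recall_at_k(model, k=10)
print(f'Precision@10: {p_at_10:.4f}, Recall@10: {r_at_10:.4f}')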

Further Enhancements

To continue improving your project and skills:

  • Experiment with Advanced GNN Layers:
    • Try graph attention (GATConv) or GraphSAGE (SAGEConv).
  • Explore Different Loss Functions:
    • Use BPR loss for ranking-based recommendations (see the sketch below).
  • Implement Cross-Validation:
    • Ensure robustness of the model by validating across multiple data splits.
  • Scale Up with Larger Datasets:
    • Apply the model to the full MovieLens dataset or other large-scale datasets.
  • Deploy the Model:
    • Create a simple API using FastAPI or Flask to serve recommendations.
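
On the BPR suggestion above: Bayesian Personalized Ranking optimizes each positive item to score above a sampled negative for the same user, rather than classifying edges independently, so a strict implementation would corrupt only the movie end of each positive edge when sampling negatives. A minimal sketch of the loss itself, reusing the decoder scores from Step 4 (with that pairing caveat):

import torch.nn.functional as F

def bpr_loss(pos_scores, neg_scores):
    # Push each positive edge's score above its paired negative edge's score
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Usage inside train(), replacing the two binary cross-entropy terms:
#     loss = bpr_loss(pos_pred, neg_pred)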