Image Captioning with CNNs and RNNs: An Intermediate Project
Objective
The goal of this project is to build an image captioning system that can generate descriptive sentences for images. You will combine convolutional neural networks (CNNs) for image feature extraction with recurrent neural networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, for sequence generation. This project bridges computer vision and natural language processing, providing a hands-on experience with multimodal AI systems.
Learning Outcomes
By completing this project, you will:
- Understand how to preprocess image and text data for multimodal models
- Learn to extract image features using pre-trained CNNs
- Implement encoder-decoder architectures combining CNNs and RNNs
- Train a model to generate captions for images
- Evaluate model performance using metrics like BLEU scores
- Gain experience with advanced deep learning frameworks and techniques
Prerequisites and Theoretical Foundations
1. Intermediate Python Programming
- Object-oriented programming concepts
- Advanced functions and modules
- Working with files and data serialization (JSON, pickle)
- Familiarity with Jupyter Notebooks or Python IDEs
Python code examples
import json
import pickle

# Class definitions
class ImageCaptionDataset:
    def __init__(self, image_dir, caption_file):
        self.image_dir = image_dir
        self.caption_file = caption_file
        self.data = self.load_data()

    def load_data(self):
        with open(self.caption_file, 'r') as f:
            captions = json.load(f)
        return captions

# Saving and loading data ('data' is any picklable Python object)
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)

with open('data.pkl', 'rb') as f:
    data = pickle.load(f)
2. Mathematics and Machine Learning Foundations
- Linear algebra (vectors, matrices, tensor operations)
- Probability and statistics (probability distributions, sampling)
- Understanding of loss functions and optimization algorithms
- Familiarity with gradient descent and backpropagation
Mathematical concepts with code
import numpy as np

# Softmax function
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

# Cross-entropy loss
def cross_entropy_loss(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred + 1e-9))

# Gradient of the cross-entropy loss with respect to the weights (softmax regression)
def compute_gradients(X, y, weights):
    predictions = softmax(np.dot(weights, X))
    error = predictions - y
    grad = np.dot(error, X.T)
    return grad
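The gradient above plugs directly into a gradient-descent update. A minimal sketch using the helper functions defined above; the data shapes (4 features, 3 classes, a single one-hot labeled example) are illustrative assumptions, not part of the project data:

# Illustrative shapes only: 4 input features, 3 output classes
X = np.random.rand(4, 1)                      # one training example
y = np.array([[1.0], [0.0], [0.0]])           # one-hot label
weights = np.random.randn(3, 4) * 0.01
learning_rate = 0.1

for step in range(100):
    grad = compute_gradients(X, y, weights)   # dL/dW from the function above
    weights -= learning_rate * grad           # gradient-descent update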
3. Deep Learning Concepts
- Understanding of CNNs (convolutions, pooling, feature maps)
- Familiarity with RNNs and LSTMs (sequence modeling)
- Knowledge of encoder-decoder architectures
- Regularization techniques (dropout, weight decay)
Deep learning concepts
- Convolutional Neural Networks
  - Convolution Operation: Applying filters/kernels to input data to extract features: \[ \text{Feature Map} = \text{Input} \ast \text{Kernel} \]
  - Pooling Layers: Reducing spatial dimensions (e.g., max pooling).
- Recurrent Neural Networks
  - Sequential Data Processing: Handling sequences by maintaining a hidden state.
  - LSTM Units: Addressing the vanishing gradient problem with gates:
    - Forget Gate: Decides what information to discard.
    - Input Gate: Determines what new information to store.
    - Output Gate: Decides what to output.
- Encoder-Decoder Architecture
  - Encoder: Processes the input data and encodes it into a context vector.
  - Decoder: Generates output sequences based on the context vector.
- Attention Mechanism (optional advanced concept)
  - Purpose: Allows the model to focus on different parts of the input during decoding.
  - Mechanism: \[ \text{Attention Weights} = \text{Softmax}(\text{Score}(h_{\text{decoder}}, h_{\text{encoder}})) \]
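For reference, the LSTM gates listed above correspond to the standard gate equations, where \( \sigma \) is the sigmoid function, \( h_{t-1} \) the previous hidden state, \( x_t \) the current input, and \( \odot \) element-wise multiplication:

\[ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)} \]
\[ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)} \]
\[ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)} \]
\[ \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t) \]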
4. Natural Language Processing Basics
- Tokenization and vocabulary building
- Embeddings (word2vec, GloVe)
- Sequence padding and truncation
- Understanding of BLEU scores for evaluation
NLP concepts with code
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sample captions
captions = ['a dog playing with a ball', 'a cat sitting on a sofa']
# Tokenization
tokenizer = Tokenizer(num_words=5000, oov_token='<unk>')
tokenizer.fit_on_texts(captions)
sequences = tokenizer.texts_to_sequences(captions)
# Vocabulary size
vocab_size = len(tokenizer.word_index) + 1
# Padding sequences
max_length = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
Skills Gained
- Building encoder-decoder models combining CNNs and RNNs
- Preprocessing image and text data for multimodal learning
- Extracting features from images using pre-trained CNNs
- Implementing and training sequence models for caption generation
- Evaluating language models using BLEU and other metrics
- Handling datasets with both image and text modalities
Tools Required
- Python 3.7+
- Deep learning frameworks: TensorFlow 2.x (preferred) or PyTorch
- Data manipulation: Pandas, NumPy
- Image processing: OpenCV or PIL
- Natural language processing: NLTK or spaCy
- Data visualization: Matplotlib, Seaborn
- Jupyter Notebook or an IDE like PyCharm
pip install tensorflow==2.8.0
pip install numpy pandas matplotlib seaborn
pip install nltk
pip install pillow
pip install scikit-learn
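To confirm the framework installed correctly, a quick check such as the following can be run (the printed version will depend on your environment):

python -c "import tensorflow as tf; print(tf.__version__)"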
Project Structure
image_captioning_project/
│
├── data/
│ ├── images/ # Image files
│ └── captions.txt # Captions for images
│
├── src/
│ ├── data_loader.py # Code for loading and preprocessing data
│ ├── model.py # Model architecture
│ ├── train.py # Training loop
│ ├── evaluate.py # Evaluation script
│ └── utils.py # Utility functions
│
└── notebooks/
├── data_exploration.ipynb
├── training.ipynb
└── evaluation.ipynb
Steps and Tasks
1. Data Acquisition and Exploration
Dataset: Use the Flickr8k or COCO dataset, which contains images and corresponding captions.
Tasks:
- Download and extract the dataset.
- Load image files and captions.
- Explore the data to understand its structure and content.
Implementation:
# Example for loading the Flickr8k dataset
import os
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

# Define data paths
images_dir = 'data/Flickr8k_Dataset/Flicker8k_Dataset/'
captions_file = 'data/Flickr8k_text/Flickr8k.token.txt'

# Load captions (each image has several captions, keyed as "<image_id>#<n>")
def load_captions(filepath):
    captions = {}
    with open(filepath, 'r') as f:
        for line in f:
            tokens = line.strip().split('\t')
            if len(tokens) < 2:
                continue
            image_id, caption = tokens[0], tokens[1]
            image_id = image_id.split('#')[0]
            if image_id not in captions:
                captions[image_id] = []
            captions[image_id].append(caption)
    return captions

captions = load_captions(captions_file)
print(f"Loaded captions for {len(captions)} images.")

# Display a sample image and its captions
def display_sample_image(image_id):
    image_path = os.path.join(images_dir, image_id)
    image = Image.open(image_path)
    plt.imshow(image)
    plt.axis('off')
    plt.show()
    print("Captions:")
    for caption in captions[image_id]:
        print(f"- {caption}")

sample_image_id = list(captions.keys())[0]
display_sample_image(sample_image_id)
Data exploration code
# Analyze caption lengths
caption_lengths = [len(caption.split()) for captions_list in captions.values() for caption in captions_list]
plt.hist(caption_lengths, bins=20)
plt.title('Caption Length Distribution')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()
# Vocabulary size
all_captions = [caption for captions_list in captions.values() for caption in captions_list]
unique_words = set(' '.join(all_captions).split())
print(f"Vocabulary Size: {len(unique_words)}")
2. Data Preprocessing
Tasks:
- Clean captions (lowercase, remove punctuation, etc.).
- Tokenize captions and build a vocabulary.
- Map words to integers and vice versa.
- Handle rare words and limit vocabulary size.
- Prepare sequences for training (input-output pairs).
Implementation:
import string
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Clean captions
def clean_captions(captions_dict):
    table = str.maketrans('', '', string.punctuation)
    for key, captions_list in captions_dict.items():
        cleaned_list = []
        for caption in captions_list:
            caption = caption.lower()
            caption = caption.translate(table)
            caption = re.sub(r'\d+', '', caption)
            caption = caption.strip()
            caption = ' '.join(caption.split())
            cleaned_list.append(caption)
        captions_dict[key] = cleaned_list
    return captions_dict

captions = clean_captions(captions)

# Build vocabulary
def build_vocabulary(captions_dict, threshold=5):
    word_counts = {}
    for captions_list in captions_dict.values():
        for caption in captions_list:
            for word in caption.split():
                word_counts[word] = word_counts.get(word, 0) + 1
    vocab = [word for word, count in word_counts.items() if count >= threshold]
    print(f"Vocabulary size after thresholding: {len(vocab)}")
    return vocab

vocab = build_vocabulary(captions)

# Create tokenizer on the thresholded vocabulary plus the sequence tokens added below.
# filters='' prevents '<start>', '<end>', and '<unk>' from being stripped of their brackets.
tokenizer = Tokenizer(oov_token='<unk>', filters='')
tokenizer.fit_on_texts([' '.join(vocab + ['<start>', '<end>'])])
Preparing sequences for training
# Add start and end tokens to captions
def add_tokens(captions_dict):
    for key, captions_list in captions_dict.items():
        captions_dict[key] = ['<start> ' + caption + ' <end>' for caption in captions_list]
    return captions_dict

captions = add_tokens(captions)

# Convert captions to sequences
def captions_to_sequences(tokenizer, captions_dict):
    sequences = []
    for captions_list in captions_dict.values():
        for caption in captions_list:
            seq = tokenizer.texts_to_sequences([caption])[0]
            sequences.append(seq)
    return sequences

sequences = captions_to_sequences(tokenizer, captions)

# Determine maximum caption length
max_length = max(len(seq) for seq in sequences)
print(f"Maximum caption length: {max_length}")
3. Feature Extraction from Images
Tasks:
- Use a pre-trained CNN (e.g., InceptionV3, VGG16) to extract features from images.
- Remove the top classification layer to get feature vectors.
- Save extracted features to avoid reprocessing images.
Implementation:
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
import numpy as np
import pickle

# Load pre-trained model; with include_top=False and pooling='avg',
# VGG16 outputs a 512-dimensional feature vector per image
feature_model = VGG16(weights='imagenet', include_top=False, pooling='avg')

def extract_features(image_path, model):
    image = load_img(image_path, target_size=(224, 224))
    image = img_to_array(image)
    image = preprocess_input(image)
    image = np.expand_dims(image, axis=0)
    features = model.predict(image, verbose=0)
    return features

# Extract features for all images
features = {}
for image_id in captions.keys():
    image_path = os.path.join(images_dir, image_id)
    features[image_id] = extract_features(image_path, feature_model)

# Save features
with open('features.pkl', 'wb') as f:
    pickle.dump(features, f)
Loading extracted features
# Load features
with open('features.pkl', 'rb') as f:
features = pickle.load(f)
4. Preparing Data for Training
Tasks:
- Create input-output pairs for the model:
- Image features + partial caption as input
- Next word in the caption as output
- Pad sequences to the maximum caption length
- Split data into training and validation sets
Implementation:
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Create data generator
# Note: vocab_size is defined in Step 5 as len(tokenizer.word_index) + 1
def data_generator(captions_dict, features, tokenizer, max_length, batch_size):
    while True:
        X1, X2, y = [], [], []
        for image_id, captions_list in captions_dict.items():
            feature = features[image_id][0]
            for caption in captions_list:
                seq = tokenizer.texts_to_sequences([caption])[0]
                for i in range(1, len(seq)):
                    in_seq, out_seq = seq[:i], seq[i]
                    in_seq = pad_sequences([in_seq], maxlen=max_length, padding='post')[0]
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    X1.append(feature)
                    X2.append(in_seq)
                    y.append(out_seq)
                    if len(X1) == batch_size:
                        yield ([np.array(X1), np.array(X2)], np.array(y))
                        X1, X2, y = [], [], []
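A quick sanity check of the generator output catches shape mismatches early. The sketch below assumes the captions, features, tokenizer, and max_length built in the previous steps; vocab_size is formally defined in Step 5 and is computed here the same way just for this check:

# Peek at one batch: image features, padded input sequences, one-hot targets
vocab_size = len(tokenizer.word_index) + 1
sample_gen = data_generator(captions, features, tokenizer, max_length, batch_size=4)
(batch_images, batch_seqs), batch_targets = next(sample_gen)
print(batch_images.shape)   # (4, 512)  -- one feature vector per sample
print(batch_seqs.shape)     # (4, max_length)
print(batch_targets.shape)  # (4, vocab_size)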
Splitting data
# Get list of image IDs
image_ids = list(captions.keys())
# Split image IDs
train_ids, val_ids = train_test_split(image_ids, test_size=0.2, random_state=42)
# Create separate caption dictionaries
train_captions = {id: captions[id] for id in train_ids}
val_captions = {id: captions[id] for id in val_ids}
# Adjust features dictionaries
train_features = {id: features[id] for id in train_ids}
val_features = {id: features[id] for id in val_ids}
5. Building the Image Captioning Model
Tasks:
- Define the model architecture:
- Encoder: Pre-trained CNN for image features
- Decoder: LSTM network for sequence generation
- Merge image features and caption sequences
- Compile the model with appropriate loss and optimizer
Implementation:
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, Add
from tensorflow.keras.models import Model
# Define model parameters
embedding_dim = 256
units = 256
vocab_size = len(tokenizer.word_index) + 1
# Image feature input (VGG16 with pooling='avg' produces 512-dim vectors, extracted in Step 3)
inputs1 = Input(shape=(512,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(units, activation='relu')(fe1)
# Sequence model
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(units)(se2)
# Decoder (combine models)
decoder1 = Add()([fe2, se3])
decoder2 = Dense(units, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)
# Build the model
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
6. Training the Model
Tasks:
- Use the data generator to feed data into the model
- Implement checkpoints to save the model after each epoch
- Monitor training loss and adjust parameters if necessary
Implementation:
# Define training parameters
batch_size = 64
# Each caption of length L yields L-1 word-level samples in the generator
steps = sum(len(c.split()) - 1 for caps in train_captions.values() for c in caps) // batch_size
val_steps = sum(len(c.split()) - 1 for caps in val_captions.values() for c in caps) // batch_size
# Create data generators
train_generator = data_generator(train_captions, train_features, tokenizer, max_length, batch_size)
val_generator = data_generator(val_captions, val_features, tokenizer, max_length, batch_size)
# Define checkpoint callback
from tensorflow.keras.callbacks import ModelCheckpoint
checkpoint = ModelCheckpoint('model.h5', save_best_only=True, monitor='val_loss', mode='min')
# Train the model
history = model.fit(
train_generator,
epochs=20,
steps_per_epoch=steps,
validation_data=val_generator,
validation_steps=val_steps,
callbacks=[checkpoint],
verbose=1
)
Plotting training history
# Plot loss over epochs
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
7. Generating Captions
Tasks:
- Implement a function to generate captions for a given image
- Use the trained model to predict the next word in the sequence
- Repeat until the end token is generated or maximum length is reached
Implementation:
def generate_caption(model, tokenizer, photo, max_length):
    in_text = '<start>'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length, padding='post')
        yhat = model.predict([photo, sequence], verbose=0)
        yhat = np.argmax(yhat)
        word = tokenizer.index_word.get(yhat, '<unk>')
        in_text += ' ' + word
        if word == '<end>':
            break
    final_caption = in_text.replace('<start>', '').replace('<end>', '').strip()
    return final_caption
# Example usage
image_id = val_ids[0]
photo = val_features[image_id]
caption = generate_caption(model, tokenizer, photo, max_length)
print(f"Generated Caption: {caption}")
Display the image with generated caption
def display_image_with_caption(image_id, caption):
    image_path = os.path.join(images_dir, image_id)
    image = Image.open(image_path)
    plt.imshow(image)
    plt.title(caption)
    plt.axis('off')
    plt.show()
display_image_with_caption(image_id, caption)
8. Evaluating the Model
Tasks:
- Use BLEU scores to evaluate the quality of generated captions
- Compare generated captions with actual captions
- Analyze cases where the model performs well or poorly
Implementation:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
def evaluate_model(model, captions_dict, features_dict, tokenizer, max_length):
    actual, predicted = [], []
    for image_id, captions_list in captions_dict.items():
        y_pred = generate_caption(model, tokenizer, features_dict[image_id], max_length)
        # Strip the <start>/<end> tokens so references match the generated captions
        references = [caption.replace('<start>', '').replace('<end>', '').split()
                      for caption in captions_list]
        actual.append(references)
        predicted.append(y_pred.split())
    # Calculate averaged BLEU scores
    smooth_fn = SmoothingFunction().method1
    bleu_scores = {
        'BLEU-1': np.mean([sentence_bleu(ref, pred, weights=(1, 0, 0, 0), smoothing_function=smooth_fn) for ref, pred in zip(actual, predicted)]),
        'BLEU-2': np.mean([sentence_bleu(ref, pred, weights=(0.5, 0.5, 0, 0), smoothing_function=smooth_fn) for ref, pred in zip(actual, predicted)]),
        'BLEU-3': np.mean([sentence_bleu(ref, pred, weights=(0.33, 0.33, 0.33, 0), smoothing_function=smooth_fn) for ref, pred in zip(actual, predicted)]),
        'BLEU-4': np.mean([sentence_bleu(ref, pred, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth_fn) for ref, pred in zip(actual, predicted)])
    }
    for metric, score in bleu_scores.items():
        print(f"{metric}: {score:.4f}")
    return bleu_scores
# Evaluate on validation data
bleu_scores = evaluate_model(model, val_captions, val_features, tokenizer, max_length)
9. Improving the Model
Suggestions:
- Implement Attention Mechanism: Enhance the model to focus on different parts of the image during caption generation.
- Use Beam Search: Improve caption generation by considering multiple candidate sequences.
- Experiment with Different Architectures: Try different CNNs for feature extraction or deeper LSTM layers.
- Hyperparameter Tuning: Adjust embedding sizes, units, batch sizes, and learning rates.
Implementing Attention Mechanism (Advanced)
Due to the complexity, implementing attention is considered an advanced task. You can refer to TensorFlow’s Image Captioning with Attention tutorial for detailed guidance.
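Implementing Beam Search (Optional)
Beam search keeps the k most probable partial captions at each decoding step instead of greedily committing to the single best word. A minimal sketch, assuming the trained model, the fitted tokenizer, max_length, and an extracted photo feature vector from the previous steps; the beam width and the simple log-probability scoring (no length normalization) are simplifications:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption_beam(model, tokenizer, photo, max_length, beam_width=3):
    start_id = tokenizer.word_index['<start>']
    end_id = tokenizer.word_index.get('<end>')
    # Each beam is (token_id_sequence, cumulative log-probability)
    beams = [([start_id], 0.0)]
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:
                candidates.append((seq, score))      # finished caption, carry it forward
                continue
            padded = pad_sequences([seq], maxlen=max_length, padding='post')
            probs = model.predict([photo, padded], verbose=0)[0]
            # Expand this beam with its beam_width most probable next words
            for word_id in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(word_id)], score + np.log(probs[word_id] + 1e-9)))
        # Keep only the best beam_width candidates overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_seq = beams[0][0]
    words = [tokenizer.index_word.get(i, '<unk>') for i in best_seq]
    return ' '.join(w for w in words if w not in ('<start>', '<end>'))

Swapping generate_caption for generate_caption_beam in the evaluation loop lets you compare BLEU scores for greedy and beam-search decoding.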
10. Conclusion and Next Steps
In this project, you built an image captioning system that generates descriptive sentences for images by combining CNNs and RNNs. You learned how to preprocess multimodal data, extract image features using pre-trained models, and train an encoder-decoder network for sequence generation. This project has equipped you with the skills to tackle more advanced computer vision and multimodal AI tasks.
Next Steps:
- Explore Attention Mechanisms: Implement attention to improve model performance.
- Work with Larger Datasets: Use datasets like MS COCO for more diverse data.
- Integrate Transformer Models: Experiment with transformer architectures for caption generation.
- Deploy the Model: Build a web application to showcase your model’s capabilities.
- Extend to Other Modalities: Apply similar techniques to video captioning or audio-visual tasks.