Image Captioning with CNNs and RNNs: An Intermediate Project
Objective
The goal of this project is to build an image captioning system that can generate descriptive sentences for images. You will combine convolutional neural networks (CNNs) for image feature extraction with recurrent neural networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, for sequence generation. This project bridges computer vision and natural language processing, providing a hands-on experience with multimodal AI systems.
Learning Outcomes
By completing this project, you will:
- Understand how to preprocess image and text data for multimodal models
- Learn to extract image features using pre-trained CNNs
- Implement encoder-decoder architectures combining CNNs and RNNs
- Train a model to generate captions for images
- Evaluate model performance using metrics like BLEU scores
- Gain experience with advanced deep learning frameworks and techniques
Prerequisites and Theoretical Foundations
1. Intermediate Python Programming
- Object-oriented programming concepts
- Advanced functions and modules
- Working with files and data serialization (JSON, pickle)
- Familiarity with Jupyter Notebooks or Python IDEs
Python code examples
import json
import pickle

# Class definitions
class ImageCaptionDataset:
    def __init__(self, image_dir, caption_file):
        self.image_dir = image_dir
        self.caption_file = caption_file
        self.data = self.load_data()

    def load_data(self):
        with open(self.caption_file, 'r') as f:
            captions = json.load(f)
        return captions

# Saving and loading data ('data' is any picklable Python object)
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)

with open('data.pkl', 'rb') as f:
    data = pickle.load(f)
2. Mathematics and Machine Learning Foundations
- Linear algebra (vectors, matrices, tensor operations)
- Probability and statistics (probability distributions, sampling)
- Understanding of loss functions and optimization algorithms
- Familiarity with gradient descent and backpropagation
Mathematical concepts with code
import numpy as np

# Softmax function
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

# Cross-entropy loss
def cross_entropy_loss(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred + 1e-9))

# Gradient of the cross-entropy loss with respect to the weights (softmax regression)
def compute_gradients(X, y, weights):
    predictions = softmax(np.dot(weights, X))
    error = predictions - y
    grad = np.dot(error, X.T)
    return grad
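The gradient above plugs directly into a gradient-descent update. A minimal sketch using the helper functions defined above; the data shapes (4 features, 3 classes, a single one-hot labeled example) are illustrative assumptions, not part of the project data:

# Illustrative shapes only: 4 input features, 3 output classes
X = np.random.rand(4, 1)                      # one training example
y = np.array([[1.0], [0.0], [0.0]])           # one-hot label
weights = np.random.randn(3, 4) * 0.01
learning_rate = 0.1

for step in range(100):
    grad = compute_gradients(X, y, weights)   # dL/dW from the function above
    weights -= learning_rate * grad           # gradient-descent update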
3. Deep Learning Concepts
- Understanding of CNNs (convolutions, pooling, feature maps)
- Familiarity with RNNs and LSTMs (sequence modeling)
- Knowledge of encoder-decoder architectures
- Regularization techniques (dropout, weight decay)
Deep learning concepts
- Convolutional Neural Networks
  - Convolution Operation: Applying filters/kernels to input data to extract features: \[ \text{Feature Map} = \text{Input} \ast \text{Kernel} \]
  - Pooling Layers: Reducing spatial dimensions (e.g., max pooling).
- Recurrent Neural Networks
  - Sequential Data Processing: Handling sequences by maintaining a hidden state.
  - LSTM Units: Addressing the vanishing gradient problem with gates:
    - Forget Gate: Decides what information to discard.
    - Input Gate: Determines what new information to store.
    - Output Gate: Decides what to output.
- Encoder-Decoder Architecture
  - Encoder: Processes the input data and encodes it into a context vector.
  - Decoder: Generates output sequences based on the context vector.
- Attention Mechanism (optional advanced concept)
  - Purpose: Allows the model to focus on different parts of the input during decoding.
  - Mechanism: \[ \text{Attention Weights} = \text{Softmax}(\text{Score}(h_{\text{decoder}}, h_{\text{encoder}})) \]
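For reference, the LSTM gates listed above correspond to the standard gate equations, where \( \sigma \) is the sigmoid function, \( h_{t-1} \) the previous hidden state, \( x_t \) the current input, and \( \odot \) element-wise multiplication:

\[ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)} \]
\[ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)} \]
\[ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)} \]
\[ \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t) \]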
4. Natural Language Processing Basics
- Tokenization and vocabulary building
- Embeddings (word2vec, GloVe)
- Sequence padding and truncation
- Understanding of BLEU scores for evaluation
NLP concepts with code
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sample captions
captions = ['a dog playing with a ball', 'a cat sitting on a sofa']
# Tokenization
tokenizer = Tokenizer(num_words=5000, oov_token='<unk>')
tokenizer.fit_on_texts(captions)
sequences = tokenizer.texts_to_sequences(captions)
# Vocabulary size
vocab_size = len(tokenizer.word_index) + 1
# Padding sequences
max_length = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
Skills Gained
- Building encoder-decoder models combining CNNs and RNNs
- Preprocessing image and text data for multimodal learning
- Extracting features from images using pre-trained CNNs
- Implementing and training sequence models for caption generation
- Evaluating language models using BLEU and other metrics
- Handling datasets with both image and text modalities
Tools Required
- Python 3.7+
- Deep learning frameworks: TensorFlow 2.x (preferred) or PyTorch
- Data manipulation: Pandas, NumPy
- Image processing: OpenCV or PIL
- Natural language processing: NLTK or spaCy
- Data visualization: Matplotlib, Seaborn
- Jupyter Notebook or an IDE like PyCharm
pip install tensorflow==2.8.0
pip install numpy pandas matplotlib seaborn
pip install nltk
pip install pillow
pip install scikit-learn
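To confirm the framework installed correctly, a quick check such as the following can be run (the printed version will depend on your environment):

python -c "import tensorflow as tf; print(tf.__version__)"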
Project Structure
image_captioning_project/
│
├── data/
│ ├── images/ # Image files
│ └── captions.txt # Captions for images
│
├── src/
│ ├── data_loader.py # Code for loading and preprocessing data
│ ├── model.py # Model architecture
│ ├── train.py # Training loop
│ ├── evaluate.py # Evaluation script
│ └── utils.py # Utility functions
│
└── notebooks/
├── data_exploration.ipynb
├── training.ipynb
└── evaluation.ipynb
Steps and Tasks
1. Data Acquisition and Exploration
Dataset: Use the Flickr8k or COCO dataset, which contains images and corresponding captions.
Tasks:
- Download and extract the dataset.
- Load image files and captions.
- Explore the data to understand its structure and content.
Implementation:
# Example for loading the Flickr8k dataset
import os
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

# Define data paths
images_dir = 'data/Flickr8k_Dataset/Flicker8k_Dataset/'
captions_file = 'data/Flickr8k_text/Flickr8k.token.txt'

# Load captions (each image has several captions, keyed as "<image_id>#<n>")
def load_captions(filepath):
    captions = {}
    with open(filepath, 'r') as f:
        for line in f:
            tokens = line.strip().split('\t')
            if len(tokens) < 2:
                continue
            image_id, caption = tokens[0], tokens[1]
            image_id = image_id.split('#')[0]
            if image_id not in captions:
                captions[image_id] = []
            captions[image_id].append(caption)
    return captions

captions = load_captions(captions_file)
print(f"Loaded captions for {len(captions)} images.")

# Display a sample image and its captions
def display_sample_image(image_id):
    image_path = os.path.join(images_dir, image_id)
    image = Image.open(image_path)
    plt.imshow(image)
    plt.axis('off')
    plt.show()
    print("Captions:")
    for caption in captions[image_id]:
        print(f"- {caption}")

sample_image_id = list(captions.keys())[0]
display_sample_image(sample_image_id)
Data exploration code
# Analyze caption lengths
caption_lengths = [len(caption.split()) for captions_list in captions.values() for caption in captions_list]
plt.hist(caption_lengths, bins=20)
plt.title('Caption Length Distribution')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()
# Vocabulary size
all_captions = [caption for captions_list in captions.values() for caption in captions_list]
unique_words = set(' '.join(all_captions).split())
print(f"Vocabulary Size: {len(unique_words)}")
2. Data Preprocessing
Tasks:
- Clean captions (lowercase, remove punctuation, etc.).
- Tokenize captions and build a vocabulary.
- Map words to integers and vice versa.
- Handle rare words and limit vocabulary size.
- Prepare sequences for training (input-output pairs).
Implementation:
import string
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Clean captions
def clean_captions(captions_dict):
    table = str.maketrans('', '', string.punctuation)
    for key, captions_list in captions_dict.items():
        cleaned_list = []
        for caption in captions_list:
            caption = caption.lower()
            caption = caption.translate(table)
            caption = re.sub(r'\d+', '', caption)
            caption = caption.strip()
            caption = ' '.join(caption.split())
            cleaned_list.append(caption)
        captions_dict[key] = cleaned_list
    return captions_dict

captions = clean_captions(captions)

# Build vocabulary
def build_vocabulary(captions_dict, threshold=5):
    word_counts = {}
    for captions_list in captions_dict.values():
        for caption in captions_list:
            for word in caption.split():
                word_counts[word] = word_counts.get(word, 0) + 1
    vocab = [word for word, count in word_counts.items() if count >= threshold]
    print(f"Vocabulary size after thresholding: {len(vocab)}")
    return vocab

vocab = build_vocabulary(captions)

# Create tokenizer on the thresholded vocabulary plus the sequence tokens added below.
# filters='' prevents '<start>', '<end>', and '<unk>' from being stripped of their brackets.
tokenizer = Tokenizer(oov_token='<unk>', filters='')
tokenizer.fit_on_texts([' '.join(vocab + ['<start>', '<end>'])])
Preparing sequences for training
# Add start and end tokens to captions
def add_tokens(captions_dict):
    for key, captions_list in captions_dict.items():
        captions_dict[key] = ['<start> ' + caption + ' <end>' for caption in captions_list]
    return captions_dict

captions = add_tokens(captions)

# Convert captions to sequences
def captions_to_sequences(tokenizer, captions_dict):
    sequences = []
    for captions_list in captions_dict.values():
        for caption in captions_list:
            seq = tokenizer.texts_to_sequences([caption])[0]
            sequences.append(seq)
    return sequences

sequences = captions_to_sequences(tokenizer, captions)

# Determine maximum caption length
max_length = max(len(seq) for seq in sequences)
print(f"Maximum caption length: {max_length}")
3. Feature Extraction from Images
Tasks:
- Use a pre-trained CNN (e.g., InceptionV3, VGG16) to extract features from images.
- Remove the top classification layer to get feature vectors.
- Save extracted features to avoid reprocessing images.
Implementation:
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
import numpy as np
import pickle

# Load pre-trained model; with include_top=False and pooling='avg',
# VGG16 outputs a 512-dimensional feature vector per image
feature_model = VGG16(weights='imagenet', include_top=False, pooling='avg')

def extract_features(image_path, model):
    image = load_img(image_path, target_size=(224, 224))
    image = img_to_array(image)
    image = preprocess_input(image)
    image = np.expand_dims(image, axis=0)
    features = model.predict(image, verbose=0)
    return features

# Extract features for all images
features = {}
for image_id in captions.keys():
    image_path = os.path.join(images_dir, image_id)
    features[image_id] = extract_features(image_path, feature_model)

# Save features
with open('features.pkl', 'wb') as f:
    pickle.dump(features, f)
Loading extracted features
# Load features
with open('features.pkl', 'rb') as f:
features = pickle.load(f)
4. Preparing Data for Training
Tasks:
- Create input-output pairs for the model:
- Image features + partial caption as input
- Next word in the caption as output
- Pad sequences to the maximum caption length
- Split data into training and validation sets
Implementation:
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Create data generator
# Note: vocab_size is defined in Step 5 as len(tokenizer.word_index) + 1
def data_generator(captions_dict, features, tokenizer, max_length, batch_size):
    while True:
        X1, X2, y = [], [], []
        for image_id, captions_list in captions_dict.items():
            feature = features[image_id][0]
            for caption in captions_list:
                seq = tokenizer.texts_to_sequences([caption])[0]
                for i in range(1, len(seq)):
                    in_seq, out_seq = seq[:i], seq[i]
                    in_seq = pad_sequences([in_seq], maxlen=max_length, padding='post')[0]
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    X1.append(feature)
                    X2.append(in_seq)
                    y.append(out_seq)
                    if len(X1) == batch_size:
                        yield ([np.array(X1), np.array(X2)], np.array(y))
                        X1, X2, y = [], [], []
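A quick sanity check of the generator output catches shape mismatches early. The sketch below assumes the captions, features, tokenizer, and max_length built in the previous steps; vocab_size is formally defined in Step 5 and is computed here the same way just for this check:

# Peek at one batch: image features, padded input sequences, one-hot targets
vocab_size = len(tokenizer.word_index) + 1
sample_gen = data_generator(captions, features, tokenizer, max_length, batch_size=4)
(batch_images, batch_seqs), batch_targets = next(sample_gen)
print(batch_images.shape)   # (4, 512)  -- one feature vector per sample
print(batch_seqs.shape)     # (4, max_length)
print(batch_targets.shape)  # (4, vocab_size)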
Splitting data
# Get list of image IDs
image_ids = list(captions.keys())
# Split image IDs
train_ids, val_ids = train_test_split(image_ids, test_size=0.2, random_state=42)
# Create separate caption dictionaries
train_captions = {id: captions[id] for id in train_ids}
val_captions = {id: captions[id] for id in val_ids}
# Adjust features dictionaries
train_features = {id: features[id] for id in train_ids}
val_features = {id: features[id] for id in val_ids}
5. Building the Image Captioning Model
Tasks:
- Define the model architecture:
- Encoder: Pre-trained CNN for image features
- Decoder: LSTM network for sequence generation
- Merge image features and caption sequences
- Compile the model with appropriate loss and optimizer
Implementation:
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, Add
from tensorflow.keras.models import Model
# Define model parameters
embedding_dim = 256
units = 256
vocab_size = len(tokenizer.word_index) + 1
# Image feature input (VGG16 with pooling='avg' produces 512-dim vectors, extracted in Step 3)
inputs1 = Input(shape=(512,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(units, activation='relu')(fe1)
# Sequence model
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(units)(se2)
# Decoder (combine models)
decoder1 = Add()([fe2, se3])
decoder2 = Dense(units, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)
# Build the model
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
6. Training the Model
Tasks:
- Use the data generator to feed data into the model
- Implement checkpoints to save the model after each epoch
- Monitor training loss and adjust parameters if necessary
Implementation:
# Define training parameters
batch_size = 64
# Each caption of length L yields L-1 word-level samples in the generator
steps = sum(len(c.split()) - 1 for caps in train_captions.values() for c in caps) // batch_size
val_steps = sum(len(c.split()) - 1 for caps in val_captions.values() for c in caps) // batch_size
# Create data generators
train_generator = data_generator(train_captions, train_features, tokenizer, max_length, batch_size)
val_generator = data_generator(val_captions, val_features, tokenizer, max_length, batch_size)
# Define checkpoint callback
from tensorflow.keras.callbacks import ModelCheckpoint
checkpoint = ModelCheckpoint('model.h5', save_best_only=True, monitor='val_loss', mode='min')
# Train the model
history = model.fit(
train_generator,
epochs=20,
steps_per_epoch=steps,
validation_data=val_generator,
validation_steps=val_steps,
callbacks=[checkpoint],
verbose=1
)
Plotting training history
# Plot loss over epochs
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
7. Generating Captions
Tasks:
- Implement a function to generate captions for a given image
- Use the trained model to predict the next word in the sequence
- Repeat until the end token is generated or maximum length is reached
Implementation:
def generate_caption(model, tokenizer, photo, max_length):
    in_text = '<start>'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length, padding='post')
        yhat = model.predict([photo, sequence], verbose=0)
        yhat = np.argmax(yhat)
        word = tokenizer.index_word.get(yhat, '<unk>')
        in_text += ' ' + word
        if word == '<end>':
            break
    final_caption = in_text.replace('<start>', '').replace('<end>', '').strip()
    return final_caption
# Example usage
image_id = val_ids[0]
photo = val_features[image_id]
caption = generate_caption(model, tokenizer, photo, max_length)
print(f"Generated Caption: {caption}")
Display the image with generated caption
def display_image_with_caption(image_id, caption):
    image_path = os.path.join(images_dir, image_id)
    image = Image.open(image_path)
    plt.imshow(image)
    plt.title(caption)
    plt.axis('off')
    plt.show()
display_image_with_caption(image_id, caption)
8. Evaluating the Model
Tasks:
- Use BLEU scores to evaluate the quality of generated captions
- Compare generated captions with actual captions
- Analyze cases where the model performs well or poorly
Implementation:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
def evaluate_model(model, captions_dict, features_dict, tokenizer, max_length):
    actual, predicted = [], []
    for image_id, captions_list in captions_dict.items():
        y_pred = generate_caption(model, tokenizer, features_dict[image_id], max_length)
        # Strip the <start>/<end> tokens so references match the generated captions
        references = [caption.replace('<start>', '').replace('<end>', '').split()
                      for caption in captions_list]
        actual.append(references)
        predicted.append(y_pred.split())
    # Calculate averaged BLEU scores
    smooth_fn = SmoothingFunction().method1
    bleu_scores = {
        'BLEU-1': np.mean([sentence_bleu(ref, pred, weights=(1, 0, 0, 0), smoothing_function=smooth_fn) for ref, pred in zip(actual, predicted)]),
        'BLEU-2': np.mean([sentence_bleu(ref, pred, weights=(0.5, 0.5, 0, 0), smoothing_function=smooth_fn) for ref, pred in zip(actual, predicted)]),
        'BLEU-3': np.mean([sentence_bleu(ref, pred, weights=(0.33, 0.33, 0.33, 0), smoothing_function=smooth_fn) for ref, pred in zip(actual, predicted)]),
        'BLEU-4': np.mean([sentence_bleu(ref, pred, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth_fn) for ref, pred in zip(actual, predicted)])
    }
    for metric, score in bleu_scores.items():
        print(f"{metric}: {score:.4f}")
    return bleu_scores
# Evaluate on validation data
bleu_scores = evaluate_model(model, val_captions, val_features, tokenizer, max_length)
9. Improving the Model
Suggestions:
- Implement Attention Mechanism: Enhance the model to focus on different parts of the image during caption generation.
- Use Beam Search: Improve caption generation by considering multiple candidate sequences.
- Experiment with Different Architectures: Try different CNNs for feature extraction or deeper LSTM layers.
- Hyperparameter Tuning: Adjust embedding sizes, units, batch sizes, and learning rates.
Implementing Attention Mechanism (Advanced)
Due to the complexity, implementing attention is considered an advanced task. You can refer to TensorFlow’s Image Captioning with Attention tutorial for detailed guidance.
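Implementing Beam Search (Optional)
Beam search keeps the k most probable partial captions at each decoding step instead of greedily committing to the single best word. A minimal sketch, assuming the trained model, the fitted tokenizer, max_length, and an extracted photo feature vector from the previous steps; the beam width and the simple log-probability scoring (no length normalization) are simplifications:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption_beam(model, tokenizer, photo, max_length, beam_width=3):
    start_id = tokenizer.word_index['<start>']
    end_id = tokenizer.word_index.get('<end>')
    # Each beam is (token_id_sequence, cumulative log-probability)
    beams = [([start_id], 0.0)]
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:
                candidates.append((seq, score))      # finished caption, carry it forward
                continue
            padded = pad_sequences([seq], maxlen=max_length, padding='post')
            probs = model.predict([photo, padded], verbose=0)[0]
            # Expand this beam with its beam_width most probable next words
            for word_id in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(word_id)], score + np.log(probs[word_id] + 1e-9)))
        # Keep only the best beam_width candidates overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_seq = beams[0][0]
    words = [tokenizer.index_word.get(i, '<unk>') for i in best_seq]
    return ' '.join(w for w in words if w not in ('<start>', '<end>'))

Swapping generate_caption for generate_caption_beam in the evaluation loop lets you compare BLEU scores for greedy and beam-search decoding.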
10. Conclusion and Next Steps
In this project, you built an image captioning system that generates descriptive sentences for images by combining CNNs and RNNs. You learned how to preprocess multimodal data, extract image features using pre-trained models, and train an encoder-decoder network for sequence generation. This project has equipped you with the skills to tackle more advanced computer vision and multimodal AI tasks.
Next Steps:
- Explore Attention Mechanisms: Implement attention to improve model performance.
- Work with Larger Datasets: Use datasets like MS COCO for more diverse data.
- Integrate Transformer Models: Experiment with transformer architectures for caption generation.
- Deploy the Model: Build a web application to showcase your model’s capabilities.
- Extend to Other Modalities: Apply similar techniques to video captioning or audio-visual tasks.