🟡 Exploring Multimodal Learning: Autoencoders for Visual and Textual Data

Unsupervised Representation Learning with Autoencoders

Objective

The goal of this project is to build an unsupervised learning system using autoencoders for image data, and extend it to multimodal AI by incorporating text data. You will learn how to extract meaningful representations from images without labeled data, and explore how these representations can be used in downstream tasks such as clustering, anomaly detection, or as features for other models.


Learning Outcomes

By completing this project, you will:

  • Understand the principles of unsupervised learning and representation learning.
  • Learn how to implement autoencoders and variational autoencoders (VAEs) for image data.
  • Explore dimensionality reduction techniques and their applications in computer vision.
  • Gain experience in handling multimodal data by combining image and text modalities.
  • Apply learned representations to tasks like clustering, visualization, and anomaly detection.
  • Enhance your skills in deep learning frameworks such as TensorFlow or PyTorch.

Prerequisites and Theoretical Foundations

1. Intermediate Python Programming

  • Data Manipulation: Proficiency with pandas and NumPy.
  • Object-Oriented Programming: Understanding of classes and objects.
  • Image Processing Libraries: Familiarity with OpenCV or PIL.
Click to view Python code examples
import pandas as pd
import numpy as np
from PIL import Image

# Reading an image
image = Image.open('sample.jpg')
image_array = np.array(image)

# Basic data manipulation
df = pd.DataFrame({'feature1': [1, 2], 'feature2': [3, 4]})
print(df.head())

2. Mathematics and Machine Learning Foundations

  • Linear Algebra: Understanding of vectors, matrices, eigenvalues, and eigenvectors.
  • Calculus: Basics of derivatives and integrals.
  • Probability and Statistics: Knowledge of probability distributions and statistical measures.
  • Machine Learning Concepts: Understanding unsupervised learning, clustering, and dimensionality reduction.
Click to view mathematical concepts with code
import numpy as np

# PCA example
from sklearn.decomposition import PCA

data = np.random.rand(100, 50)
pca = PCA(n_components=2)
principal_components = pca.fit_transform(data)

3. Deep Learning Concepts

  • Neural Networks: Understanding of feedforward neural networks.
  • Autoencoders: Concept of encoding and decoding data for dimensionality reduction.
  • Variational Autoencoders (VAEs): Knowledge of probabilistic generative models.
  • Loss Functions: Reconstruction loss, KL divergence.
Click to view deep learning concepts
  1. Autoencoder Structure:

    • Encoder: Maps input data to a latent representation.
    • Decoder: Reconstructs the input data from the latent representation.
  2. Loss Function:

    • Reconstruction Loss: Measures the difference between the input and its reconstruction. [ \mathcal{L}_{\text{recon}} = ||X - \hat{X}||^2 ]
  3. Variational Autoencoder (VAE):

    • Concept: Learns a distribution over the latent space.
    • KL Divergence: Regularizes the latent space to follow a prior distribution. [ \mathcal{L}_{\text{KL}} = D_{\text{KL}}(q(z|X) \,||\, p(z)) ]
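Putting the two terms together, the VAE is trained on their sum. A minimal NumPy sketch of the combined objective, assuming a Gaussian posterior parameterized by z_mean and z_log_var and a standard-normal prior (all values below are illustrative placeholders):

import numpy as np

# Illustrative batch of flattened inputs, reconstructions, and latent parameters
X = np.random.rand(4, 32 * 32 * 3)
X_hat = np.random.rand(4, 32 * 32 * 3)
z_mean = np.random.randn(4, 16)
z_log_var = np.random.randn(4, 16)

recon_loss = np.sum((X - X_hat) ** 2, axis=1)  # ||X - X_hat||^2 per sample
kl_loss = -0.5 * np.sum(1 + z_log_var - z_mean ** 2 - np.exp(z_log_var), axis=1)
total_loss = np.mean(recon_loss + kl_loss)     # L_recon + L_KL, averaged over the batch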

4. Computer Vision Basics

  • Image Preprocessing: Resizing, normalization, and augmentation.
  • Convolutional Neural Networks (CNNs): Understanding of convolutions and feature extraction.
  • Feature Visualization: Techniques for visualizing learned features.
Click to view computer vision code examples
import cv2

# Load an image using OpenCV
image = cv2.imread('sample.jpg')

# Resize image
resized_image = cv2.resize(image, (128, 128))

# Normalize image
normalized_image = resized_image / 255.0

5. Multimodal AI Concepts

  • Handling Text Data: Tokenization, embedding techniques.
  • Combining Modalities: Strategies for integrating image and text data.
  • Applications: Cross-modal retrieval, multimodal representations.
Click to view multimodal AI concepts with code
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample text data
texts = ["An image of a cat", "A dog playing fetch"]

# Tokenization
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Padding sequences
padded_sequences = pad_sequences(sequences, maxlen=10)

Skills Gained

  • Implementing autoencoders and VAEs using deep learning frameworks.
  • Performing unsupervised learning tasks such as clustering and anomaly detection.
  • Preprocessing and handling large-scale image datasets.
  • Combining image and text data for multimodal learning.
  • Visualizing high-dimensional data using techniques like t-SNE and PCA.
  • Understanding and applying regularization techniques in unsupervised models.

Tools Required

  • Python 3.x
  • Deep learning frameworks: TensorFlow 2.x or PyTorch
  • Data manipulation: pandas, NumPy
  • Data visualization: Matplotlib, Seaborn
  • Scikit-learn for preprocessing and evaluation
  • Jupyter Notebook or an IDE like PyCharm
pip install tensorflow pandas numpy matplotlib seaborn scikit-learn

Steps and Tasks

1. Data Acquisition and Exploration

Obtain a suitable image dataset for unsupervised learning. Possible options include:

  • CIFAR-10/CIFAR-100: Small images in various categories.
  • MNIST/Fashion-MNIST: Handwritten digits or fashion items.
  • Custom Dataset: Collect images from sources like Unsplash or subsets of ImageNet.

Load and Explore the Data:

from tensorflow.keras.datasets import cifar10
import matplotlib.pyplot as plt
import numpy as np

# Load CIFAR-10 dataset
(X_train, _), (X_test, _) = cifar10.load_data()

# Combine train and test sets for unsupervised learning
data = np.concatenate((X_train, X_test), axis=0)

# Explore data shape
print(f"Data shape: {data.shape}")  # Should be (60000, 32, 32, 3)

# Display some sample images
def display_samples(data, num_samples=5):
    plt.figure(figsize=(10, 2))
    for i in range(num_samples):
        plt.subplot(1, num_samples, i+1)
        plt.imshow(data[i])
        plt.axis('off')
    plt.show()

display_samples(data)
Click to view data exploration code
# Check data statistics
mean = np.mean(data, axis=(0,1,2))
std = np.std(data, axis=(0,1,2))
print(f"Data mean: {mean}, Data std: {std}")

# Standardize data (z-score). Note: this is an alternative to the [0, 1]
# scaling used in Step 2 -- apply one or the other, not both.
data = (data - mean) / std

2. Data Preprocessing

Prepare the data for training the autoencoder.

Tasks:

  • Normalize pixel values.
  • Reshape data if necessary.
  • Apply any desired augmentations (an optional augmentation sketch follows the implementation below).

Implementation:

# Normalize data to [0,1]
data = data.astype('float32') / 255.

# CIFAR-10 images already have shape (32, 32, 3); reshape only if your dataset requires it
data = data.reshape((len(data), 32, 32, 3))

print(f"Preprocessed data shape: {data.shape}")

3. Building the Autoencoder

Create a convolutional autoencoder to learn representations of the images.

Tasks:

  • Define the encoder network.
  • Define the decoder network.
  • Combine them into an autoencoder model.

Implementation:

from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from tensorflow.keras.models import Model

# Encoder
input_img = Input(shape=(32, 32, 3))
x = Conv2D(32, (3,3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2,2), padding='same')(x)
x = Conv2D(64, (3,3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2,2), padding='same')(x)

# Decoder
x = Conv2D(64, (3,3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2,2))(x)
x = Conv2D(32, (3,3), activation='relu', padding='same')(x)
x = UpSampling2D((2,2))(x)
decoded = Conv2D(3, (3,3), activation='sigmoid', padding='same')(x)

# Autoencoder
autoencoder = Model(input_img, decoded)
# Binary cross-entropy works here because pixel values are scaled to [0, 1];
# mean squared error ('mse') is a common alternative reconstruction loss
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

autoencoder.summary()
Click to view model training
# Train the autoencoder
history = autoencoder.fit(
    data, data,
    epochs=50,
    batch_size=256,
    shuffle=True,
    validation_split=0.1
)
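It is worth plotting the training history to check that the loss has converged and that the validation loss tracks the training loss:

# Plot training and validation loss curves
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('Epoch')
plt.ylabel('Reconstruction loss')
plt.legend()
plt.show()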

4. Visualizing the Learned Representations

Extract the encoder part of the autoencoder to obtain the latent representations.

Tasks:

  • Create an encoder model.
  • Use the encoder to transform images into latent space.
  • Visualize the latent space using dimensionality reduction techniques.

Implementation:

from sklearn.manifold import TSNE

# Create encoder model
encoder = Model(input_img, encoded)

# Get latent representations
latent_representations = encoder.predict(data)

# Flatten the representations if necessary
latent_flat = latent_representations.reshape(len(latent_representations), -1)

# Use t-SNE to reduce dimensions for visualization
# (t-SNE on all 60,000 samples is slow; consider fitting it on a random subsample)
tsne = TSNE(n_components=2, random_state=42)
latent_2d = tsne.fit_transform(latent_flat)

# Plot the 2D representations
plt.scatter(latent_2d[:,0], latent_2d[:,1], s=2)
plt.title('t-SNE Visualization of Latent Space')
plt.show()
Click to view reconstruction results
# Visualize original and reconstructed images
decoded_imgs = autoencoder.predict(data)

n = 5  # Number of images to display
plt.figure(figsize=(10,4))
for i in range(n):
    # Original images
    ax = plt.subplot(2, n, i+1)
    plt.imshow(data[i])
    plt.axis('off')

    # Reconstructed images
    ax = plt.subplot(2, n, i+n+1)
    plt.imshow(decoded_imgs[i])
    plt.axis('off')
plt.show()

5. Clustering in Latent Space

Use the latent representations for clustering the images.

Tasks:

  • Apply clustering algorithms like K-Means on the latent representations.
  • Evaluate clustering results qualitatively or quantitatively (a silhouette-score sketch follows the implementation below).

Implementation:

from sklearn.cluster import KMeans

# Apply K-Means clustering
kmeans = KMeans(n_clusters=10, random_state=42)
clusters = kmeans.fit_predict(latent_flat)

# Visualize clusters in t-SNE plot
plt.scatter(latent_2d[:,0], latent_2d[:,1], c=clusters, cmap='tab10', s=2)
plt.title('t-SNE Visualization with Clusters')
plt.show()
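For the quantitative evaluation mentioned in the tasks, the silhouette score is one option; here is a sketch on a random subsample, since scoring all 60,000 points is slow:

from sklearn.metrics import silhouette_score

# Silhouette score ranges from -1 to 1; higher means tighter, better-separated clusters
sample_idx = np.random.choice(len(latent_flat), 5000, replace=False)
score = silhouette_score(latent_flat[sample_idx], clusters[sample_idx])
print(f"Silhouette score (5,000-point subsample): {score:.3f}")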
Click to view cluster samples
# Display sample images from each cluster
for cluster_id in range(10):
    idx = np.where(clusters == cluster_id)[0][:5]
    plt.figure(figsize=(10,2))
    for i, img_idx in enumerate(idx):
        plt.subplot(1,5,i+1)
        plt.imshow(data[img_idx])
        plt.axis('off')
    plt.suptitle(f'Cluster {cluster_id}')
    plt.show()

6. Anomaly Detection

Use the reconstruction error from the autoencoder to detect anomalies.

Tasks:

  • Compute reconstruction errors for all images.
  • Set a threshold to identify anomalies.
  • Analyze the anomalies detected.

Implementation:

# Compute reconstruction errors
reconstructions = autoencoder.predict(data)
mse = np.mean(np.power(data - reconstructions, 2), axis=(1,2,3))

# Plot distribution of reconstruction errors
plt.hist(mse, bins=50)
plt.xlabel('Reconstruction Error')
plt.ylabel('Number of Images')
plt.show()

# Set threshold (e.g., mean + 2*std)
threshold = np.mean(mse) + 2*np.std(mse)

# Identify anomalies
anomalies = data[mse > threshold]

print(f"Number of anomalies detected: {len(anomalies)}")
Click to view anomalies
# Display anomalies
n = min(len(anomalies), 5)
plt.figure(figsize=(10,2))
for i in range(n):
    plt.subplot(1,n,i+1)
    plt.imshow(anomalies[i])
    plt.axis('off')
plt.suptitle('Anomalies Detected')
plt.show()
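The mean + 2*std rule assumes the error distribution is roughly bell-shaped; a percentile cut-off is a simple alternative (the 99th percentile below is an arbitrary illustrative choice):

# Flag the images with the highest 1% of reconstruction errors
threshold_pct = np.percentile(mse, 99)
anomalies_pct = data[mse > threshold_pct]
print(f"Anomalies above the 99th percentile: {len(anomalies_pct)}")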

7. Extending to Variational Autoencoders (VAEs)

Implement a VAE to learn a structured latent space.

Tasks:

  • Modify the autoencoder to include sampling from latent distributions.
  • Define the VAE loss function including reconstruction and KL divergence terms.
  • Train the VAE and analyze the latent space.
Click to view VAE implementation

Due to the added complexity, implementing a VAE requires careful attention to both the model architecture and the loss function.

Here’s a high-level overview:

from tensorflow.keras.layers import Dense, Lambda
from tensorflow.keras import backend as K
from tensorflow.keras.losses import binary_crossentropy

# Define sampling function (the reparameterization trick)
def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=K.shape(z_mean))
    return z_mean + K.exp(0.5 * z_log_var) * epsilon

# Encoder with mean and log variance
# (`x` is the output of your encoder layers; `latent_dim` is the chosen latent size)
z_mean = Dense(latent_dim)(x)
z_log_var = Dense(latent_dim)(x)
z = Lambda(sampling)([z_mean, z_log_var])

# Define VAE loss: reconstruction term plus KL divergence term
# (`original_dim` is the number of values per image, e.g. 32 * 32 * 3)
def vae_loss(x, x_decoded_mean):
    xent_loss = binary_crossentropy(K.flatten(x), K.flatten(x_decoded_mean)) * original_dim
    kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
    return xent_loss + kl_loss
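
In TensorFlow 2.x, a common way to attach this loss is via model.add_loss rather than passing vae_loss to compile. A sketch, assuming x_decoded_mean is the decoder output built from z and input_img is the model input from Step 3:

# Build the VAE and attach the combined loss directly to the model
vae = Model(input_img, x_decoded_mean)
xent = binary_crossentropy(K.flatten(input_img), K.flatten(x_decoded_mean)) * original_dim
kl = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
vae.add_loss(K.mean(xent + kl))
vae.compile(optimizer='adam')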

Implementing a VAE is an excellent way to explore unsupervised learning further.

8. Incorporating Multimodal Data

Combine image data with text data (e.g., captions) to learn joint representations.

Tasks:

  • Obtain or create a dataset with images and corresponding text (e.g., MS COCO dataset).
  • Preprocess text data and create embeddings.
  • Modify the autoencoder to accept both image and text inputs.
Click to view multimodal autoencoder implementation

Implementing a multimodal autoencoder involves:

  • Creating separate encoders for image and text data.
  • Concatenating the latent representations.
  • Decoding from the combined latent space.

Here’s a simplified example:

from tensorflow.keras.layers import Input, Embedding, LSTM, Concatenate
from tensorflow.keras.models import Model

# Text encoder
# (`max_seq_length`, `vocab_size`, and `embedding_dim` come from your text preprocessing)
text_input = Input(shape=(max_seq_length,))
text_embedding = Embedding(vocab_size, embedding_dim)(text_input)
text_encoded = LSTM(256)(text_embedding)

# Image encoder (as before)
image_input = Input(shape=(32, 32, 3))
# ... (build image encoder)
image_encoded = encoder_layers(image_input)

# Combine encodings
combined = Concatenate()([image_encoded, text_encoded])

# Decoder
# ... (build decoder using combined representations)
decoded = decoder_layers(combined)

# Build multimodal autoencoder
multimodal_autoencoder = Model(inputs=[image_input, text_input], outputs=decoded)

This allows the model to learn representations that capture information from both images and text.
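
One possible way to realize the decoder_layers placeholder above is to project the combined vector back to a small feature map and upsample it with transposed convolutions; the layer sizes below are illustrative assumptions:

from tensorflow.keras.layers import Dense, Reshape, Conv2DTranspose

x = Dense(8 * 8 * 64, activation='relu')(combined)
x = Reshape((8, 8, 64))(x)
x = Conv2DTranspose(64, (3, 3), strides=2, activation='relu', padding='same')(x)  # 8x8 -> 16x16
x = Conv2DTranspose(32, (3, 3), strides=2, activation='relu', padding='same')(x)  # 16x16 -> 32x32
decoded = Conv2DTranspose(3, (3, 3), activation='sigmoid', padding='same')(x)     # reconstructed image

A second decoder head (for example an LSTM over the same combined vector) can reconstruct the text as well if you want the model to reconstruct both modalities.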

9. Evaluating and Visualizing Multimodal Representations

Analyze how the model captures relationships between images and text.

Tasks:

  • Visualize the multimodal latent space using t-SNE or PCA.
  • Perform cross-modal retrieval (e.g., find images given text queries).
  • Evaluate the quality of the learned representations.

Implementation:

# Extract multimodal latent representations
# (assumes you have built separate encoder models for images and text)
image_latent = image_encoder.predict(image_data)
text_latent = text_encoder.predict(text_data)

# Combine or compare representations
# For visualization, you might concatenate or average them
combined_latent = np.concatenate([image_latent, text_latent], axis=1)

# Visualize with t-SNE
tsne = TSNE(n_components=2, random_state=42)
latent_2d = tsne.fit_transform(combined_latent)

plt.scatter(latent_2d[:,0], latent_2d[:,1], s=2)
plt.title('t-SNE Visualization of Multimodal Latent Space')
plt.show()

# Cross-modal retrieval example
def retrieve_images(query_sequence, top_k=5):
    # Encode the tokenized, padded text query
    query_latent = text_encoder.predict(query_sequence)
    # Compute dot-product similarities with the image latents
    similarities = np.dot(image_latent, query_latent.T)
    # Get the indices of the top_k most similar images, best match first
    indices = np.argsort(similarities.flatten())[-top_k:][::-1]
    return image_data[indices]

# Example usage
sample_text = ["A cat sitting on a couch"]
sample_sequence = tokenizer.texts_to_sequences(sample_text)
sample_sequence_padded = pad_sequences(sample_sequence, maxlen=max_seq_length)
retrieved_images = retrieve_images(sample_sequence_padded)
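
Raw dot products favor latent vectors with large norms; inside retrieve_images you could swap the similarity line for a cosine-similarity version, which is often more robust for retrieval:

# Cosine similarity: L2-normalize both sets of latents before the dot product
image_latent_norm = image_latent / np.linalg.norm(image_latent, axis=1, keepdims=True)
query_latent_norm = query_latent / np.linalg.norm(query_latent, axis=1, keepdims=True)
similarities = np.dot(image_latent_norm, query_latent_norm.T)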

10. Conclusion and Next Steps

In this project, you learned how to implement unsupervised learning techniques using autoencoders and VAEs for image data, and extended these concepts to handle multimodal data combining images and text. You explored applications such as clustering, anomaly detection, and representation learning.

Next Steps:

  • Generative Models: Explore Generative Adversarial Networks (GANs) for image generation.
  • Self-Supervised Learning: Implement contrastive learning methods like SimCLR or MoCo.
  • Advanced Multimodal Models: Experiment with models like CLIP (Contrastive Language–Image Pre-training).
  • Real-world Applications: Apply unsupervised learning to tasks like recommendation systems or content-based retrieval.