Final Demo
DISSERTATION
AIMLCZG628T
16 March 2025
Outline
Image Captioning is an AI-based system that generates textual descriptions for images.
The generated text, also known as a caption, should accurately reflect the image's content.
Applications: Social Media, E-commerce, Content Moderation, Medical Imaging, assisting the visually impaired, etc.
Two main approaches: CNN + LSTM and Transformer-based models (Vision Transformer, Swin Transformer for medical image captioning).
Introduction
Background: Image Captioning is a critical AI task that bridges vision and language. It helps in generating meaningful
descriptions for images. Applications include accessibility, content tagging, and automated reporting.
Image captioning has multiple applications. For example, on social media platforms, it automatically suggests image captions, saving time for content creators. In online retail, it generates captions for product images, thus improving the shopping experience.
Autonomous vehicles rely on multiple AI technologies, including computer vision, NLP, and deep learning. Image captioning bridges vision and language processing, enabling more intuitive vehicle responses.
Challenges & Solutions
Social Media: automatically suggests image captions on social platforms, saving time for content creators
E-commerce: generates captions for product images, improving the shopping experience
Content Moderation: detecting and labelling problematic content (e.g., inappropriate image, violence detected)
Detection of Inappropriate Language: Identifying and removing captions that contain vulgar, offensive, or inappropriate
language.
Bias and Fairness: Ensuring that captions do not perpetuate stereotypes, biases, or discrimination against any group or
individual.
Contextual Accuracy: Verifying that the captions accurately describe the content of the image without adding
misleading or false information.
Safety and Compliance: Ensuring that captions comply with legal and regulatory standards, as well as platform-specific
guidelines.
Challenges & Solutions
Medical Imaging: CT scans for kidney disease detection
Accurate identification of kidney conditions (normal, cyst, tumor, stone) from CT scans
Providing an automated way for visually impaired people to experience the world by hearing descriptions of the images they capture
Increasing the ability of visually impaired people to understand their surroundings
It can help with understanding signs and signals, with navigation, and with day-to-day activities
Dataset
The images in Flickr30k/ Flickr8k are sourced from Flickr, covering a wide range
of everyday scenes, objects, and human activities. Compared to MS COCO, it has
shorter and more diverse captions, making it valuable for evaluating fine-grained
image-captioning models.
Dataset
Caption preparation
Raw captions are often noisy and not in a format that is usable by the ML model. During caption preparation, we remove
inappropriate captions and ensure the remaining ones are consistent and tokenized. In particular, we perform the following
steps:
Remove pairs with a non-English caption: We remove image-caption pairs where the caption is not in English, as this model's focus will
be on English.
Remove duplicate images or captions: To ensure the diversity and quality of the training data, we eliminate duplicate images and
captions. Duplicate images are identified using perceptual hashing techniques or image similarity models (e.g., CLIP image encoder), while
duplicate captions are detected by exact match or semantic similarity checks (e.g., CLIP text encoder). Removing duplicates prevents the model
from overfitting to redundant data and helps it learn a broader range of associations between images and text.
Remove irrelevant captions: We use a pretrained vision-language model (e.g., CLIP) to assess the
relevance between images and their corresponding captions. A higher score indicates greater
semantic relevance between the image and the text. We remove pairs with scores below a specific threshold,
such as 0.25. This ensures our model learns from high-quality, relevant pairs; a minimal sketch of this
CLIP-based filter is shown after this list.
Summarize long captions: Captions are often long and detailed. Training the model with these captions
leads to the generation of similarly long captions, which doesn't suit our use case. To address this, we
summarize the captions using a large language model such as Llama [4] to create brief, concise descriptions
that meet our requirements.
Tokenize captions: We use a subword-level tokenization algorithm such as Byte-Pair Encoding (BPE) to tokenize captions into a sequence of
IDs.
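As referenced above, the following is a minimal sketch of the CLIP-based relevance filter, assuming the Hugging Face transformers CLIP checkpoint openai/clip-vit-base-patch32; `pairs` is a hypothetical list of (image_path, caption) tuples, and the 0.25 threshold follows the text.

```python
# Sketch: filter image-caption pairs by CLIP relevance score.
# Assumes the Hugging Face `transformers` CLIP checkpoint below is available;
# `pairs` is a hypothetical list of (image_path, caption) tuples.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

def filter_pairs(pairs, threshold=0.25):
    """Keep only pairs whose CLIP score is at or above the threshold."""
    return [(p, c) for p, c in pairs if clip_score(p, c) >= threshold]
```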
Dataset - Data preparation steps
Tokenization and Embeddings
Tokenization:
The process of breaking down text into smaller units, such as words or sub-words, to be processed by language models. Effective tokenization is
crucial for handling diverse vocabularies and languages
Embeddings:
Techniques that convert tokens into dense vector representations, capturing semantic meanings and relationships between words. These
embeddings enable models to understand context and perform tasks like similarity assessments.
Tokenize captions
Text tokenization and token indexing: Text tokenization followed by token indexing converts the raw text into the format the Transformer model expects: a sequence of numbers.
Fig. Converting raw text to a sequence of numbers
Dataset - Data preparation steps
Text tokenization
Text tokenization:
Fig. Example of GPT-4 tokenization
Text tokenization is the process of splitting text into smaller units called tokens.
The figure shows how OpenAI's GPT-4 tokenizes the sentence "Let's go to NYC".
Tokenization can be performed at different levels. For example, "Hello world" can be split into ["Hello",
"world"] or ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]. Generally, tokenization algorithms
are divided into three categories:
• Character-level tokenization
• Word-level tokenization
• Subword-level tokenization
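As an illustration of subword-level tokenization, the sketch below uses the open-source tiktoken library with the cl100k_base encoding (the encoding associated with GPT-4-era OpenAI models). Any BPE tokenizer could be substituted; the exact token IDs depend on the chosen encoding.

```python
# Sketch: subword-level tokenization with the open-source `tiktoken` library.
# cl100k_base is the encoding used by GPT-4-era OpenAI models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Let's go to NYC"
token_ids = enc.encode(text)                        # list of integer token IDs
tokens = [enc.decode([tid]) for tid in token_ids]   # human-readable subword pieces

print(token_ids)
print(tokens)   # each ID maps back to a subword such as "Let", "'s", " go", ...
```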
Dataset - Data preparation steps for LSTM on Flickr8k
Caption preparation Implementation for LSTM
Text tokenization:
Caption text preprocessing steps
Dataset Splitting – The dataset is split into training (85%) and validation (15%) sets, ensuring that the model is trained and evaluated on separate image-caption pairs (a preprocessing sketch follows below).
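A compact sketch of the caption preprocessing and 85/15 split for the Flickr8k LSTM pipeline. It assumes `captions` is a dict mapping image IDs to lists of raw caption strings; the startseq/endseq markers and the Keras Tokenizer are common choices rather than necessarily the exact ones used here.

```python
# Sketch: caption cleaning, tokenization, and an 85/15 split for the LSTM pipeline.
# Assumes `captions` is a dict {image_id: [raw caption strings]} already loaded.
import re
import random
from tensorflow.keras.preprocessing.text import Tokenizer

def clean_caption(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z ]", "", text)          # keep letters and spaces only
    text = re.sub(r"\s+", " ", text).strip()
    return f"startseq {text} endseq"             # sequence start/end markers

cleaned = {img: [clean_caption(c) for c in caps] for img, caps in captions.items()}

# Fit a word-level tokenizer on all cleaned captions.
all_caps = [c for caps in cleaned.values() for c in caps]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_caps)
vocab_size = len(tokenizer.word_index) + 1
max_len = max(len(c.split()) for c in all_caps)

# 85% / 15% split on image IDs so an image never appears in both sets.
ids = list(cleaned.keys())
random.seed(42)
random.shuffle(ids)
cut = int(0.85 * len(ids))
train_ids, val_ids = ids[:cut], ids[cut:]
```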
Dataset - Data preparation steps for LSTM on Flickr8k
Tokenization Implementation for Transformer
As is the case for captions, not all images are useful. We remove images that might hurt
training and ensure the remaining images are consistent and suitable for model training.
In particular, we perform the following steps:
Feature Extraction: Load DenseNet201 (pretrained on ImageNet) to extract image embeddings from the
last convolutional layer.
Resize and preprocess images to match DenseNet201’s input requirements.
Text Processing: Tokenize captions, convert words to integer sequences, and pad sequences to a uniform length.
Create word-to-index and index-to-word mappings for vocabulary.
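A sketch of the DenseNet201 feature-extraction step, assuming `image_paths` is a hypothetical list of image file paths. It takes the globally pooled output just before the classification head, one common choice, which for DenseNet201 is a 1920-dimensional vector.

```python
# Sketch: extracting image features with DenseNet201 pretrained on ImageNet.
# Assumes `image_paths` is a hypothetical list of Flickr8k image file paths.
import numpy as np
from tensorflow.keras.applications import DenseNet201
from tensorflow.keras.applications.densenet import preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

base = DenseNet201(weights="imagenet")
# Take the layer just before the classification head (global average pooled features).
feature_extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_features(path: str) -> np.ndarray:
    img = image.load_img(path, target_size=(224, 224))   # DenseNet201 input size
    arr = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return feature_extractor.predict(arr, verbose=0)[0]   # 1920-dim feature vector

features = {p: extract_features(p) for p in image_paths}
```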
Model Architecture
Feature Extraction:
Use EfficientNetB0 (pretrained on ImageNet) to extract global and local image features.
Apply positional encoding to integrate spatial information into feature embeddings.
Text Preprocessing:
Tokenize captions and map words to integer sequences.
Create word embeddings using a transformer-based embedding layer.
The image feature embeddings are passed to the encoder block, while the word embeddings are fed to the decoder.
Transformer Implementation
Model Architecture
Use EfficientNetB0 to encode image features.
Pass encoded features into a Transformer-based decoder using multi-head attention and positional encoding.
Apply a fully connected output layer with a softmax function to predict words.
Transformer Implementation
Overall Architecture:
Image Input: An image is fed into the EfficientNetB0 CNN.
Feature Extraction: The CNN extracts feature maps from the image.
Feature Reshaping: The feature maps are reshaped into a sequence of feature vectors.
Encoder: The Transformer encoder processes the feature vectors, capturing relationships between
different parts of the image.
Caption Input: A caption (represented as a sequence of token indices) is fed into the decoder.
Positional Embedding: The caption tokens are embedded and combined with positional
embeddings.
Decoder: The Transformer decoder generates the next word in the caption, attending to both the
previous words and the encoded image features.
Output: The decoder outputs a probability distribution over the vocabulary.
Training: The model is trained to minimize the cross-entropy loss between the predicted and actual
captions.
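The following is a minimal sketch of the EfficientNetB0 + Transformer captioning model outlined above. The hyperparameters (VOCAB_SIZE, SEQ_LEN, EMBED_DIM, NUM_HEADS, FF_DIM) are illustrative assumptions, the encoder/decoder use a single attention block each for brevity, and use_causal_mask requires a reasonably recent TensorFlow/Keras.

```python
# Sketch of the EfficientNetB0 + Transformer captioning model described above.
# Hyperparameters are illustrative assumptions, not the exact values used in the project.
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import EfficientNetB0

VOCAB_SIZE, SEQ_LEN = 10000, 25
EMBED_DIM, NUM_HEADS, FF_DIM = 256, 4, 512


class PositionalEmbedding(layers.Layer):
    """Token embedding plus learned positional embedding (decoder input step)."""

    def __init__(self, seq_len, vocab_size, embed_dim):
        super().__init__()
        self.tok = layers.Embedding(vocab_size, embed_dim)
        self.pos = layers.Embedding(seq_len, embed_dim)

    def call(self, tokens):
        positions = tf.range(start=0, limit=tf.shape(tokens)[-1], delta=1)
        return self.tok(tokens) + self.pos(positions)


# 1) Image input -> EfficientNetB0 feature maps -> sequence of patch features.
cnn = EfficientNetB0(include_top=False, weights="imagenet", input_shape=(224, 224, 3))
cnn.trainable = False
image_in = layers.Input(shape=(224, 224, 3))
fmap = cnn(image_in)                                  # (7, 7, 1280) feature maps
img_seq = layers.Reshape((-1, 1280))(fmap)            # 49 patch feature vectors
img_seq = layers.Dense(EMBED_DIM, activation="relu")(img_seq)

# 2) Encoder: self-attention over the image patch features.
enc_attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)(img_seq, img_seq)
enc_out = layers.LayerNormalization()(img_seq + enc_attn)

# 3) Decoder: embedded caption tokens attend to previous tokens and to the encoder output.
caption_in = layers.Input(shape=(SEQ_LEN,), dtype="int32")
x = PositionalEmbedding(SEQ_LEN, VOCAB_SIZE, EMBED_DIM)(caption_in)
self_attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)(
    x, x, use_causal_mask=True)
x = layers.LayerNormalization()(x + self_attn)
cross_attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMBED_DIM)(x, enc_out)
x = layers.LayerNormalization()(x + cross_attn)
x = layers.Dense(FF_DIM, activation="relu")(x)

# 4) Output: a probability distribution over the vocabulary at each position.
probs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)

model = tf.keras.Model([image_in, caption_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```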
System Design and Implementation
LSTM based Implementation
Component | Description
Image Encoder | Uses DenseNet201 to extract deep image features.
CNN-Based Encoder | Converts CNN features into sequence embeddings for the decoder.
Text Decoder | Uses an LSTM-based architecture to generate captions one token at a time.
Unsupervised Pretraining | Uses a pretrained CNN (DenseNet201) and pretrains the LSTM on text data.
Supervised Finetuning | Trains the CNN and LSTM together on image-caption pairs using cross-entropy loss.
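Below is a minimal sketch of one common formulation of this LSTM decoder (the "merge" architecture), assuming precomputed 1920-dimensional DenseNet201 features and the vocab_size/max_len values from the caption preprocessing step; layer sizes are illustrative.

```python
# Sketch: a "merge"-style DenseNet201 + LSTM captioning model.
# `vocab_size` and `max_len` come from caption preprocessing; 1920 is the size of
# the DenseNet201 feature vector; hidden sizes are illustrative assumptions.
from tensorflow.keras import layers, Model

def build_lstm_captioner(vocab_size: int, max_len: int, feat_dim: int = 1920):
    # Image branch: precomputed DenseNet201 features -> dense projection.
    img_in = layers.Input(shape=(feat_dim,))
    img_vec = layers.Dense(256, activation="relu")(layers.Dropout(0.4)(img_in))

    # Text branch: partial caption -> embedding -> LSTM state.
    cap_in = layers.Input(shape=(max_len,))
    emb = layers.Embedding(vocab_size, 256, mask_zero=True)(cap_in)
    cap_vec = layers.LSTM(256)(layers.Dropout(0.4)(emb))

    # Merge both branches and predict the next token.
    merged = layers.add([img_vec, cap_vec])
    hidden = layers.Dense(256, activation="relu")(merged)
    out = layers.Dense(vocab_size, activation="softmax")(hidden)

    model = Model(inputs=[img_in, cap_in], outputs=out)
    # Cross-entropy loss over one-hot next-word targets.
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```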
Transformer based Implementation
Component | Description
Image Encoder | EfficientNetB0 extracts deep image features.
CNN-Based Encoder | Converts CNN features into sequence embeddings for the decoder.
Unsupervised Pretraining | Pretrains the CNN on ImageNet and the decoder on a large text corpus.
Supervised Finetuning | Trains the CNN and Transformer together using cross-entropy loss.
Sampling : Generative models are trained to capture the underlying distribution of the training
data. Once trained, these models can generate new samples that are similar to the data they
were trained on. Sampling is the process of using a trained generative model to generate new
data.
Beam search is a popular deterministic algorithm for generating text from a trained
model. The core idea is to track multiple potential sequences of tokens simultaneously.
At each step, the model calculates the probabilities for the next possible tokens for each
sequence and selects the "top-k" most probable sequences. The value of k, known as
beam width, is configurable.
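A minimal sketch of beam search as described above; `predict_next` is a hypothetical function that returns log-probabilities over the vocabulary given the image features and the tokens generated so far, and k is the beam width.

```python
# Sketch: beam search decoding for a trained captioner.
# `predict_next(image_feats, seq)` is a hypothetical scoring function returning a
# sequence of log-probabilities indexed by token ID; k is the beam width.
def beam_search(predict_next, image_feats, start_id, end_id, k=3, max_len=25):
    # Each beam is a (token_sequence, cumulative_log_prob) pair.
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:            # finished sequences are carried forward
                candidates.append((seq, score))
                continue
            log_probs = predict_next(image_feats, seq)
            # Keep the k best continuations of this beam.
            top = sorted(enumerate(log_probs), key=lambda t: t[1], reverse=True)[:k]
            for tok, lp in top:
                candidates.append((seq + [tok], score + lp))
        # Keep only the k best sequences overall.
        beams = sorted(candidates, key=lambda t: t[1], reverse=True)[:k]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]                        # highest-scoring token sequence
```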
System Design and Implementation
Greedy search
Greedy search is the simplest deterministic algorithm. It always selects the token with the
highest probability as the next token. As shown in the figure, greedy search can lead
to repetitive patterns in the generated text.
This occurs because it follows a narrow path based on the highest-probability tokens without
considering alternative paths that might lead to more coherent sentences. Due to this limitation,
greedy search is rarely used in practice.
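For comparison, the same decoding loop with greedy selection; it reuses the hypothetical `predict_next` scoring function from the beam search sketch above.

```python
# Sketch: greedy decoding - always pick the single most probable next token.
# `predict_next` is the same hypothetical scoring function as in the beam search sketch.
def greedy_search(predict_next, image_feats, start_id, end_id, max_len=25):
    seq = [start_id]
    for _ in range(max_len):
        log_probs = predict_next(image_feats, seq)
        next_tok = max(range(len(log_probs)), key=lambda t: log_probs[t])
        seq.append(next_tok)
        if next_tok == end_id:
            break
    return seq
```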
Evaluation Metrics
To thoroughly evaluate a language translation model, metrics should measure both translation
accuracy and contextual appropriateness. The research community has proposed
several metrics that, over the years, have become widely accepted as standards.
Some commonly used metrics are
- BLEU, ROUGE, METEOR
Task Metrics
BLEU
BP (the brevity penalty) is a term that penalizes candidate translations shorter than the reference translation. The standard formula is shown below.
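For reference, the standard BLEU formulation combines the brevity penalty BP with a weighted geometric mean of the modified n-gram precisions pₙ, where c is the candidate length, r the effective reference length, and the weights wₙ are typically uniform with N = 4:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```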
Precision (pₙ)
Precision measures how many n-grams in the candidate translation are present in reference translations. It is calculated by dividing the number of matching n-grams by the total number of n-grams in the candidate translation.
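A short sketch of how BLEU can be computed in practice with NLTK; the reference and hypothesis token lists below are purely illustrative, not results from this project.

```python
# Sketch: corpus-level BLEU with NLTK. References and hypotheses are token lists;
# each hypothesis may have several reference captions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["a", "dog", "is", "running", "along", "the", "beach"]]]
hypotheses = [["a", "dog", "runs", "along", "the", "beach"]]

smooth = SmoothingFunction().method1   # avoids zero scores for short captions
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```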
Model Prediction/Result
Online inference takes around 4 seconds once the model and all weights are loaded.
Containerization
Docker File
Containerization
• Guidance
------------------End----------------------