
Image Captioning System

DISSERTATION
AIMLCZG628T

Gulam Sarwar (2022AC05156)


First Semester 2024-2025
Work-Integrated Learning Programs Division
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

16 March 2025
Outline

 Introduction
 Case Study (Challenges & Solutions)
 Dataset
 Methodology
 System Design and Implementation
 High Level Design
 Model Evaluation Metrics
 Dashboard
 Result and Conclusion
 Demo
Introduction

 Image Captioning is an AI-based system that generates textual descriptions for images.

The generated text, also known as a caption, should accurately reflect the image's content.

 Applications: Social Media, E-commerce, Content Moderation, Medical Imaging, assistance for the visually impaired, etc.

 Medical Imaging (e.g., CT scans for kidney disease detection)

 Content Moderation (detecting and labelling inappropriate images, e.g., violence detected)

 Two main approaches: CNN + LSTM and Transformer-based models (Vision Transformer and Swin Transformer for medical image captioning).
Introduction

 Background: Image Captioning is a critical AI task that bridges vision and language. It helps in generating meaningful
descriptions for images. Applications include accessibility, content tagging, and automated reporting.

 Image captioning has multiple applications. For example, on social media platforms, it automatically suggests image
captions, saving time for content creators. In online retail, it generates captions for product images, thus improving the
shopping experience.

 Autonomous vehicles rely on multiple AI technologies, including computer vision, NLP, and deep learning. Image
captioning bridges vision and language processing, enabling more intuitive vehicle responses.
Challenges & Solutions

 Social Media: on social media platforms, the system automatically suggests image captions, saving time for content creators.
 E-commerce: it generates captions for product images, thus improving the shopping experience.
 Content Moderation (detecting and labelling inappropriate images, e.g., violence detected):

 Detection of Inappropriate Language: Identifying and removing captions that contain vulgar, offensive, or inappropriate
language.

 Bias and Fairness: Ensuring that captions do not perpetuate stereotypes, biases, or discrimination against any group or
individual.

 Contextual Accuracy: Verifying that the captions accurately describe the content of the image without adding
misleading or false information.

 Safety and Compliance: Ensuring that captions comply with legal and regulatory standards, as well as platform-specific
guidelines.
Challenges & Solutions
 Medical Imaging : (CT scans for kidney disease detection)

 Accurate identification of kidney conditions (normal, cyst, tumor, stone) from CT scans

 Assists radiologists in diagnosis.

 Automates medical report generation.


CT KIDNEY DATASET: Normal-Cyst-Tumor
and Stone Prediction using Vision
Transformer and Swin Transformer
Challenges & Solutions

 Medical Imaging : (CT scans for kidney disease detection)


Challenges & Solutions

 Helping visually impaired

 Providing an automated way for visually impaired people to perceive the world by hearing descriptions of the images they have
captured.
 Increasing the ability of visually impaired people to understand their surroundings.
 It can help in understanding signs and signals.
 It can help in navigation and in day-to-day activities.
Dataset

 Flickr30k is a large-scale image captioning dataset designed to facilitate
research in vision-language tasks. It expands on Flickr8k and contains roughly 30,000
images, each annotated with five human-generated captions, providing rich textual
descriptions for training and evaluation.

 Flickr8k is an image captioning dataset consisting of 8,000 images, each paired


with five human-annotated captions. The images are sourced from Flickr and depict
a variety of everyday scenes, focusing on people, animals, and objects in different
activities.
 Training Set: Flickr_8k.trainImages.txt (6,000 images)
 Validation Set: Flickr_8k.devImages.txt (1,000 images)
 Test Set: Flickr_8k.testImages.txt (1,000 images)

 The images in Flickr30k/Flickr8k are sourced from Flickr, covering a wide range
of everyday scenes, objects, and human activities. Compared to MS COCO, it has
shorter and more diverse captions, making it valuable for evaluating fine-grained
image-captioning models.
Dataset

 Flickr30k is widely used in image captioning, visual question answering (VQA),


and cross-modal retrieval tasks, serving as a benchmark for assessing vision-
language models.
Dataset
Data preparation steps

Caption preparation
Raw captions are often noisy and not in a format that is usable by the ML model. During caption preparation, we remove
inappropriate captions and ensure the remaining ones are consistent and tokenized. In particular, we perform the following
steps:
Remove pairs with a non-English caption: We remove image-caption pairs where the caption is not in English, as this model's focus will
be on English.
Remove duplicate images or captions: To ensure the diversity and quality of the training data, we eliminate duplicate images and
captions. Duplicate images are identified using perceptual hashing techniques or image similarity models (e.g., CLIP image encoder), while
duplicate captions are detected by exact match or semantic similarity checks (e.g., CLIP text encoder). Removing duplicates prevents the model
from overfitting to redundant data and helps it learn a broader range of associations between images and text.
Remove irrelevant captions: We use a pretrained vision-language model (e.g., CLIP) to assess the
relevance between images and their corresponding captions. A higher score usually indicates greater
semantic relevance between the image and the text. We remove pairs with scores below a specific threshold,
such as 0.25. This ensures our model learns from high-quality, relevant pairs. CLIP computes this relevance
score from the similarity between its image and text embeddings (a code sketch of this filtering step follows this list).
Summarize long captions: Captions are often long and detailed. Training the model with these captions
leads to the generation of similarly long captions, which doesn't suit our use case. To address this, we
summarize the captions using a large language model such as Llama [4] to create brief, concise descriptions
that meet our requirements.
Tokenize captions: We use a subword-level tokenization algorithm such as Byte-Pair Encoding (BPE) to tokenize captions into a sequence of
IDs.
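Below is a minimal sketch of how the CLIP-based relevance filter could look, assuming the Hugging Face transformers implementation of CLIP; the function names (clip_score, keep_pair) and the way the 0.25 threshold is applied are illustrative, not the project's actual code.

```python
# Hypothetical sketch: filter image-caption pairs by CLIP similarity.
# Assumes the Hugging Face `transformers` CLIP model; names like
# `clip_score` and `keep_pair` are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

def keep_pair(image_path: str, caption: str, threshold: float = 0.25) -> bool:
    # Drop pairs whose image-text relevance falls below the threshold.
    return clip_score(image_path, caption) >= threshold
```

In practice these scores would be computed in batches over the whole dataset before training rather than one pair at a time.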
Dataset - Data preparation steps
Tokenization and Embeddings

Tokenization:
The process of breaking down text into smaller units, such as words or sub-words, to be processed by language models. Effective tokenization is
crucial for handling diverse vocabularies and languages
Embeddings:
Techniques that convert tokens into dense vector representations, capturing semantic meanings and relationships between words. These
embeddings enable models to understand context and perform tasks like similarity assessments.
Tokenize captions
Text tokenization and token indexing: Text tokenization followed by token indexing converts the raw text into the format the Transformer model expects: a sequence of numbers.
Fig. Converting raw text to a sequence of numbers
Dataset - Data preparation steps
Text tokenization

Text tokenization:
Example of GPT-4 tokenization

Text tokenization is the process of splitting text into smaller units called tokens.
The figure shows how OpenAI's GPT-4 tokenizes the sentence "Let's go to NYC".

Tokenization can be performed at different levels. For example, "Hello world" can be split into ["Hello",
"world"] or ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]. Generally, tokenization algorithms
are divided into three categories:
• Character-level tokenization
• Word-level tokenization
• Subword-level tokenization
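To make the three levels concrete, the sketch below tokenizes the slide's example sentence at each level; the subword example assumes the tiktoken package for GPT-4's BPE vocabulary, which is not part of the project code.

```python
# Illustrative sketch of the three tokenization levels on the slide's example.
# The subword example assumes `tiktoken` (pip install tiktoken).
import tiktoken

sentence = "Let's go to NYC"

# Character-level: every character (including spaces) is a token.
char_tokens = list(sentence)

# Word-level: split on whitespace.
word_tokens = sentence.split()

# Subword-level: GPT-4's BPE tokenizer via tiktoken.
enc = tiktoken.encoding_for_model("gpt-4")
ids = enc.encode(sentence)
subword_tokens = [enc.decode([i]) for i in ids]

print("char   :", char_tokens)
print("word   :", word_tokens)
print("subword:", subword_tokens)   # e.g. ["Let", "'s", " go", " to", " NYC"]
print("ids    :", ids)
```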
Dataset - Data preparation steps for LSTM on Flickr8k
Caption preparation Implementation for LSTM

Caption text preprocessing steps:

- Convert sentences into lowercase
- Remove special characters and numbers present in the text
- Remove extra spaces
- Remove single characters
- Remove duplicate captions
- Add a starting and an ending tag to the sentences to indicate the beginning and the ending of a sentence
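A minimal sketch of these cleaning steps, assuming plain Python with regular expressions; the startseq/endseq tags and the function name are illustrative choices, not the project's exact code.

```python
# Assumed implementation of the caption cleaning steps listed above.
import re

def preprocess_caption(caption: str) -> str:
    caption = caption.lower()                          # lowercase
    caption = re.sub(r"[^a-z ]", " ", caption)         # drop special chars / digits
    caption = re.sub(r"\b\w\b", " ", caption)          # drop single characters
    caption = re.sub(r"\s+", " ", caption).strip()     # collapse extra spaces
    return "startseq " + caption + " endseq"           # add start/end tags

captions = ["A dog runs, 2 dogs play!", "A dog runs, 2 dogs play!"]
cleaned = list(dict.fromkeys(preprocess_caption(c) for c in captions))  # dedupe
print(cleaned)   # ['startseq dog runs dogs play endseq']
```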
Dataset - Data preparation steps for LSTM on Flickr8k
Caption preparation Implementation for LSTM

Tokenization Level : Word-Level Tokenization


The provided code uses the Tokenizer from Keras, which applies word-level tokenization by default. This means the text is split into individual
words, and each unique word is assigned a numerical index.
Word-Based Tokenization Approach – The Tokenizer splits sentences into words rather than characters or subwords. For example, "Hello
world" would be tokenized as ["Hello", "world"] rather than individual letters.
Vocabulary Construction – The tokenizer builds a word index, where each unique word is assigned an integer ID (tokenizer.word_index),
and the vocabulary size is calculated (vocab_size).
Sequence Conversion – The method texts_to_sequences() converts sentences into a sequence of word indices, representing the text
numerically.
Sentence Length Handling – The max_length variable is computed as the maximum number of words in a sentence, which helps in
defining input sequence padding for model training.

Dataset Splitting – The dataset is split into training (85%) and validation (15%), ensuring that the model is trained and tested on separate
image-caption pairs.
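The sketch below shows how these steps could look with the Keras Tokenizer, under the assumption that captions have already been cleaned; the example captions and the split variables are illustrative, not the project's exact code.

```python
# Sketch of word-level tokenization with the Keras Tokenizer.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ["startseq dog runs on grass endseq",
            "startseq child plays with ball endseq"]

tokenizer = Tokenizer()                      # word-level by default
tokenizer.fit_on_texts(captions)             # builds tokenizer.word_index
vocab_size = len(tokenizer.word_index) + 1   # +1 for the padding index 0

sequences = tokenizer.texts_to_sequences(captions)   # words -> integer IDs
max_length = max(len(seq) for seq in sequences)      # longest caption in words
padded = pad_sequences(sequences, maxlen=max_length, padding="post")

print(vocab_size, max_length)
print(padded)

# 85/15 image-level split for training and validation (as on the slide).
image_ids = ["img1.jpg", "img2.jpg"]                 # illustrative IDs
split = int(len(image_ids) * 0.85)
train_ids, val_ids = image_ids[:split], image_ids[split:]
```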
Dataset - Data preparation steps for Transformer on Flickr8k
Tokenization Implementation for Transformer

• Tokenization Level: Word-Level Tokenization


1. The TextVectorization layer processes text into integer sequences, where each unique word is assigned a numerical token.
2. The custom standardization function removes special characters and numbers but does not split words into smaller subword units (as
in subword-level tokenization).
3. The vectorizer uses a fixed vocabulary size (VOCAB_SIZE), which is characteristic of word-level tokenization.
4. The output sequence length (SEQ_LENGTH) ensures that each caption is padded to a uniform number of words, not characters or
subwords.
5. If it were character-level tokenization, each letter would be treated as a separate token. If it were subword-level tokenization, uncommon
words would be broken into smaller meaningful units, which is not happening here.
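A sketch of this TextVectorization setup is shown below; the VOCAB_SIZE and SEQ_LENGTH values and the exact standardization regex are assumptions for illustration rather than the project's exact configuration.

```python
# Assumed sketch of word-level tokenization with TextVectorization.
import re
import string
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

VOCAB_SIZE = 10000      # illustrative
SEQ_LENGTH = 25         # illustrative

def custom_standardization(input_string):
    # Lowercase and strip punctuation/digits; words stay whole (word-level).
    lowercase = tf.strings.lower(input_string)
    strip_chars = string.punctuation + string.digits
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")

vectorizer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=SEQ_LENGTH,     # pad/truncate to a fixed word count
    standardize=custom_standardization,
)

captions = ["startseq a dog runs on the grass endseq"]
vectorizer.adapt(tf.constant(captions))    # build the vocabulary
print(vectorizer(captions))                # shape (1, SEQ_LENGTH) integer IDs
```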
Dataset - Data preparation steps for LSTM on Flickr8k
Image preparation

As is the case for captions, not all images are useful. We remove images that might hurt
training and ensure the remaining images are consistent and suitable for model training.
In particular, we perform the following steps:

Remove low-resolution images: We remove image-caption pairs in which the image


resolution is less than 256x256 because such low-resolution images might not provide
enough detail for accurate caption generation.

Normalize images: We scale the pixel values to a normalized range, such as [0, 1].
This normalization makes the training process more stable.

Remove low-quality images:


To maintain high-quality training data, we filter images that exhibit conditions such as blurriness,
overexposure, underexposure, or other defects that degrade visual clarity. Image quality assessment methods,
such as the LAION Aesthetics Predictor, help identify and remove subpar images by scoring them on factors
such as sharpness, contrast, and lighting.

Adjust image dimensions:


Images typically have a range of sizes and aspect ratios. We resize all images to a uniform size, which is
critical since ML models require fixed-size inputs during training. When adjusting image dimensions, it is
important to preserve their original aspect ratios.
To do so, we often follow two steps:
1. Resizing: First, we resize the image so that the smaller dimension matches the target size.
o Example: If our target size is 256×256 and our original image is 512×768, we resize it to
256×384.
2. Centre-cropping: Next, we centre-crop the resized image to the target dimensions.
o Example: The 256×384 image is centre-cropped to 256×256.
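A minimal sketch of this resize-then-centre-crop step, assuming Pillow as the image library (the deck does not show the actual code):

```python
# Resize so the smaller side matches the target, then centre-crop.
from PIL import Image

TARGET = 256

def resize_and_center_crop(path: str, target: int = TARGET) -> Image.Image:
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = target / min(w, h)                       # match the smaller side
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - target) // 2, (h - target) // 2
    return img.crop((left, top, left + target, top + target))

# Example: a 512x768 image is resized to 256x384, then cropped to 256x256.
```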
Method Selection and developments

 CNN + RNN-Based Image Captioning


 Transformer-Based Image Captioning
Methodology - Models
First implementation follows a CNN-RNN architecture for image captioning, using DenseNet201 as the feature extractor and LSTMs for sequential text
generation.

Deep Learning Model (CNN + LSTM):
• CNN: Load DenseNet201 (pretrained on ImageNet) to extract image embeddings from the last convolutional layer.
• RNN + LSTM: Generate the caption from the image embedding feature vector produced by the CNN.

Generative AI Model (Transformer):
• A custom Transformer-based encoder-decoder model.
• It consists of EfficientNetB0, positional embeddings, etc.
Method Selection and developments

 LSTM based High level Architecture


Method Selection and developments

 LSTM based High level Architecture


This implementation follows a CNN-RNN architecture for image captioning, using
DenseNet201 as the feature extractor and LSTMs for sequential text generation. The
model processes Flickr8k images to generate descriptive captions.
Steps in Implementation
Feature Extraction
Load DenseNet201 (pretrained on ImageNet) to extract image embeddings from
the last convolutional layer.
Resize and preprocess images to match DenseNet201’s input requirements.
Text Processing
Tokenize captions, convert words to integer sequences, and pad sequences for
uniform length.
Create word-to-index and index-to-word mappings for vocabulary.
Model Architecture
Use DenseNet201 to extract feature vectors from input
images.
Pass extracted features into an LSTM-based decoder for caption generation.
LSTMs Implementation

 Feature Extraction: Load DenseNet201 (pretrained on ImageNet) to extract image embeddings from the
last convolutional layer.
 Resize and preprocess images to match DenseNet201’s input requirements.
 Text Processing : Tokenize captions, convert words to integer sequences, and pad sequences for
uniform length.
 Create word-to-index and index-to-word mappings for vocabulary.
Model Architecture

 Use DenseNet201 to extract feature vectors from input images.


 Pass extracted features into an LSTM-based decoder
for caption generation.
 Training: Train the model on Flickr8k using categorical cross-entropy loss and the Adam optimizer.
 Sampling: We used both beam and greedy search for prediction.
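A condensed sketch of this DenseNet201 + LSTM architecture is given below; the layer sizes, dropout rates, and vocabulary/length values are illustrative assumptions, not the trained model's exact configuration (the actual configuration appears in the model summary slide).

```python
# Assumed sketch of the CNN + LSTM captioning model.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import DenseNet201

vocab_size, max_length = 8500, 34          # illustrative values

# 1) Feature extractor: DenseNet201 without its classification head.
cnn = DenseNet201(weights="imagenet", include_top=False, pooling="avg")
# cnn(image_batch) -> (batch, 1920) feature vectors, usually precomputed offline.

# 2) Decoder: image features + partial caption -> next-word distribution.
img_in = layers.Input(shape=(1920,))
img_feat = layers.Dense(256, activation="relu")(layers.Dropout(0.4)(img_in))

cap_in = layers.Input(shape=(max_length,))
cap_emb = layers.Embedding(vocab_size, 256, mask_zero=True)(cap_in)
cap_feat = layers.LSTM(256)(layers.Dropout(0.4)(cap_emb))

merged = layers.add([img_feat, cap_feat])            # fuse image and text features
hidden = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[img_in, cap_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```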
Model Summary
Model Learning Curve

 Learning Curve (Loss Curve).


Transformer Implementation

 Feature Extraction:
Use EfficientNetB0 (pretrained on ImageNet) to extract global and local image features.
Apply positional encoding to integrate spatial information into feature embeddings.
 Text Preprocessing:
 Tokenize captions and map words to integer sequences.
 Create word embeddings using a transformer-based embedding layer.
 The embeddings are then passed to the encoder block.
Transformer Implementation

Model Architecture
 Use EfficientNetB0 to encode image features.
 Pass encoded features into a Transformer-based
decoder using multi-head attention and positional encoding.
 Apply a fully connected output
layer with a softmax function to predict words.
Transformer Implementation

Overall Architecture:
 Image Input: An image is fed into the EfficientNetB0 CNN.
 Feature Extraction: The CNN extracts feature maps from the image.
 Feature Reshaping: The feature maps are reshaped into a sequence of feature vectors.
 Encoder: The Transformer encoder processes the feature vectors, capturing relationships between
different parts of the image.
 Caption Input: A caption (represented as a sequence of token indices) is fed into the decoder.
 Positional Embedding: The caption tokens are embedded and combined with positional
embeddings.
 Decoder: The Transformer decoder generates the next word in the caption, attending to both the
previous words and the encoded image features.
 Output: The decoder outputs a probability distribution over the vocabulary.
 Training: The model is trained to minimize the cross-entropy loss between the predicted and actual
captions.
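A compact, single-block sketch of this pipeline is shown below; it follows the structure described above (EfficientNetB0 features, positional embedding, encoder and decoder blocks with multi-head attention), but the dimensions, block counts, and helper names are illustrative assumptions rather than the project's exact implementation.

```python
# Assumed sketch of the EfficientNetB0 + Transformer encoder-decoder pipeline.
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import EfficientNetB0

EMBED_DIM, DENSE_DIM, NUM_HEADS = 512, 512, 2       # illustrative sizes
VOCAB_SIZE, SEQ_LENGTH = 10000, 25                  # illustrative sizes

# 1) CNN backbone: feature maps reshaped into a sequence of patch vectors.
backbone = EfficientNetB0(include_top=False, weights="imagenet")
def encode_image(images):                            # images: (B, 299, 299, 3)
    fmap = backbone(images)                          # (B, H, W, C)
    b = tf.shape(fmap)[0]
    return tf.reshape(fmap, (b, -1, fmap.shape[-1]))  # (B, H*W, C)

# 2) Positional embedding for caption tokens.
class PositionalEmbedding(layers.Layer):
    def __init__(self):
        super().__init__()
        self.tok = layers.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.pos = layers.Embedding(SEQ_LENGTH, EMBED_DIM)
    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.tok(x) + self.pos(positions)

# 3) One encoder block over image features, one decoder block over tokens.
def encoder_block(img_seq):
    x = layers.Dense(EMBED_DIM, activation="relu")(layers.LayerNormalization()(img_seq))
    attn = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM)(x, x)
    return layers.LayerNormalization()(x + attn)

def decoder_block(tokens, enc_out):
    x = PositionalEmbedding()(tokens)
    self_attn = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM)(x, x, use_causal_mask=True)
    x = layers.LayerNormalization()(x + self_attn)
    cross_attn = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM)(x, enc_out)
    x = layers.LayerNormalization()(x + cross_attn)
    x = layers.Dense(DENSE_DIM, activation="relu")(x)
    return layers.Dense(VOCAB_SIZE, activation="softmax")(x)  # per-position word probs
```

During training, the decoder sees the caption shifted right (teacher forcing) and the causal mask prevents it from attending to future tokens.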
System Design and Implementation
LSTM based Implementation

Component: Description
Image Encoder: Uses DenseNet201 to extract deep image features.
CNN-Based Encoder: Converts CNN features into sequence embeddings for the decoder.
Text Decoder: Uses an LSTM-based architecture to generate captions one token at a time.
Unsupervised Pretraining: Uses a pretrained CNN (DenseNet201) and pretrains the LSTM on text data.
Supervised Finetuning: Trains the CNN and LSTM together on image-caption pairs using cross-entropy loss.
Sampling: Greedy Search, Beam Search
Evaluation Metric: BLEU score
Dataset: Flickr8k


System Design and Implementation
Transformer based Implementation

Component: Description
Image Encoder: EfficientNetB0 extracts deep image features.
CNN-Based Encoder: Converts CNN features into sequence embeddings for the decoder.
Transformer Encoder: Uses multi-head attention for spatial feature extraction.
Positional Embedding: Adds positional encodings to maintain word order.
Transformer Decoder: Uses self-attention and cross-attention for caption generation.
Unsupervised Pretraining: Pretrains the CNN on ImageNet and the decoder on a large text corpus.
Supervised Finetuning: Trains the CNN and Transformer together using cross-entropy loss.
Sampling: Greedy Search, Beam Search
Evaluation Metric: BLEU score


System Design and Implementation
Sampling
Beam search

Sampling : Generative models are trained to capture the underlying distribution of the training
data. Once trained, these models can generate new samples that are similar to the data they
were trained on. Sampling is the process of using a trained generative model to generate new
data.

 Beam search is a popular deterministic algorithm for generating text from a trained
model. The core idea is to track multiple potential sequences of tokens simultaneously.
 At each step, the model calculates the probabilities for the next possible tokens for each
sequence and selects the "top-k" most probable sequences. The value of k, known as
beam width, is configurable.
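A model-agnostic sketch of beam search is shown below; predict_next is a hypothetical stand-in for the trained decoder that returns log-probabilities over the vocabulary for the next token given the tokens so far.

```python
# Sketch of beam search over a hypothetical `predict_next` decoder interface.
import numpy as np

def beam_search(predict_next, start_id, end_id, beam_width=3, max_len=25):
    # Each beam is (token_ids, cumulative log-probability).
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_id:               # finished beams carry over
                candidates.append((tokens, score))
                continue
            log_probs = predict_next(tokens)       # shape: (vocab_size,)
            top_ids = np.argsort(log_probs)[-beam_width:]
            for tok in top_ids:
                candidates.append((tokens + [int(tok)], score + float(log_probs[tok])))
        # Keep only the top-k sequences by cumulative log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t[-1] == end_id for t, _ in beams):
            break
    return beams[0][0]                             # best-scoring sequence
```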
System Design and Implementation
Greedy search

 Greedy search is the simplest deterministic algorithm. It always selects the token with the
highest probability as the next token. As shown in the figure, greedy search can lead
to repetitive patterns in the generated text.
 This occurs because it follows a narrow path based on the highest-probability tokens without
considering alternative paths that might lead to more coherent sentences. Due to this limitation,
greedy search is rarely used in practice.
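For comparison, a greedy-decoding sketch using the same hypothetical predict_next interface as the beam-search example above:

```python
# Greedy decoding: always take the single most probable next token.
import numpy as np

def greedy_search(predict_next, start_id, end_id, max_len=25):
    tokens = [start_id]
    for _ in range(max_len):
        log_probs = predict_next(tokens)           # shape: (vocab_size,)
        next_tok = int(np.argmax(log_probs))       # highest-probability token only
        tokens.append(next_tok)
        if next_tok == end_id:
            break
    return tokens
```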
Evaluation Metrics

Offline evaluation metrics

 To thoroughly evaluate a caption generation model, metrics should measure both accuracy
and contextual appropriateness. The research community has proposed
several metrics that, over the years, have become widely accepted as standards.
Some commonly used metrics are:
- BLEU, ROUGE, METEOR

Task: Metrics
Text Generation: Perplexity, BLEU, METEOR, ROUGE, CIDEr
Image Generation: FID, IS, KID, SWD, PPL, LPIPS
Text-to-Video: FVD, CLIPScore, FID, LPIPS, KID


Evaluation Metrics

BLEU

BLEU (BiLingual Evaluation Understudy) is a precision-based metric that compares n-grams
(a sequence of "n" words) of the candidate translation with n-grams of the reference
translations and counts the ratio of matches. It ranges from 0 to 1, where a higher value
indicates a more precise translation.
The BLEU score is calculated using the following formula:

BLEU = BP · exp( Σₙ₌₁ᴺ wₙ · log pₙ )

where:
• N is the maximum n-gram length considered for evaluation
• BP is the brevity penalty
• pₙ is the n-gram precision
• wₙ represents the weight for different n-gram precisions
Let's explore each of these terms in detail.
Evaluation Metrics
BLEU

 BP is a constant term that penalizes translations shorter than the reference translation. The formula is:

BP = 1                 if c > r
BP = exp(1 − r / c)    if c ≤ r

where c is the length of the candidate translation and r is the length of the reference translation.

 Precision (pₙ)
Precision measures how many n-grams in the candidate translation are present in the reference
translations. It is calculated by dividing the number of matching n-grams by the total number of
n-grams in the candidate translation.
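A small sketch of BLEU scoring using NLTK's sentence_bleu, an assumed tooling choice; the reference and candidate captions below are illustrative.

```python
# Sketch of BLEU-1 and BLEU-4 scoring with NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "on", "the", "grass"],
              ["a", "dog", "is", "running", "on", "grass"]]
candidate = ["a", "dog", "runs", "on", "grass"]

smooth = SmoothingFunction().method1     # avoids zero scores on short captions
bleu1 = sentence_bleu(references, candidate, weights=(1, 0, 0, 0),
                      smoothing_function=smooth)
bleu4 = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```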
Model Prediction/Result

 LSTM Model generated caption for given image


Model Prediction/Result

 Transformer based Model generated caption for given image


Model Offline Evaluation
Avg. BLEU Score

 BLEU score: Transformer-based model vs. LSTM


Model Online Evaluation

 Online inference takes around 4 seconds once everything is loaded.
Containerization

 Containerization with the help of Docker

 Benefits of Docker Containerization


 Portability – Containers ensure the application runs consistently across different environments (development, testing, production).
 Scalability – Easily scale applications up or down by running multiple container instances.
 Resource Efficiency – Containers share the host OS kernel, consuming fewer resources than traditional virtual machines.
 Fast Deployment & Startup – Containers start quickly, reducing downtime during deployments.
 Isolation – Each container runs independently, preventing conflicts between dependencies.
 Simplified Dependency Management – Containers package all necessary dependencies, eliminating compatibility issues.
 Version Control & Rollbacks – Docker enables easy versioning, allowing rollbacks to previous versions if needed.
 Security – Containers provide process isolation, minimizing security risks.
 Consistent Development & Production Environments – Eliminates "works on my machine" issues by ensuring uniform
environments.
 Microservices Architecture Support – Containers allow breaking down applications into smaller, manageable services, improving
maintainability and agility.
Containerization

 Model Running Status in the Container


Containerization

 Docker File
Containerization

 Docker Compose File


Conclusion and Future Work

 Transformer-based models outperform CNN + LSTM in generating captions.


 Can be implemented for multilingual captioning
 Can be implemented for question answering (QA)
 Continuously improving the models for better performance and accuracy
 Enhancing the system to handle real-time image captioning for live video feeds
 Optimizing the models for deployment on edge devices for faster and offline processing
 Integrating the system with assistive technologies to better support visually impaired users
Dashboard
Dashboard

• Beam And Greedy Search Implementation


Dashboard

• Caption Generation from live portal


Suggestions and Guidance for Future Endeavors.

• Guidance

------------------End----------------------
