
Image Captioning

Project Report Submitted in Partial Fulfilment of the Requirements for the Degree of

Bachelor of Technology (Hons.)

in

Computer Science and Engineering

Submitted by

Anshuman Raj (2022UGCS035)
Sulochan Khadka (2022UGCS109)
Prince Kumar Singh (2022UGCS117)

Under the Supervision of

Dr. Gopa Bhaumik
(Assistant Professor)

Department of Computer Science and Engineering


National Institute of Technology Jamshedpur
October, 2024
CERTIFICATE

This is to certify that the report entitled “Image Captioning” is a bonafide record of the Project done by Anshuman Raj (Roll No.: 2022UGCS035), Sulochan Khadka (Roll No.: 2022UGCS109) and Prince Kumar Singh (Roll No.: 2022UGCS117) under my supervision, in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology (Hons.) in Computer Science and Engineering from National Institute of Technology Jamshedpur.

Dr. Gopa Bhaumik


Assistant Professor

Computer Science and Engineering

Date: 24 October 2024

Department seal

DECLARATION

We certify that the work contained in this report is original and has been done by us under the guidance of our supervisor(s). The work has not been submitted to any other Institute for any degree. We have followed the guidelines provided by the Institute in preparing the report. We have conformed to the norms and guidelines given in the Ethical Code of Conduct of the Institute. Whenever we have used materials (data, theoretical analysis, figures, and text) from other sources, we have given due credit to them by citing them in the text of the report and giving their details in the references. Further, we have taken permission from the copyright owners of the sources, whenever necessary.

Signature of the Students

Roll Number: 2022UGCS035   Name: Anshuman Raj        Sign

Roll Number: 2022UGCS109   Name: Sulochan Khadka     Sign

Roll Number: 2022UGCS117   Name: Prince Kumar Singh  Sign

ACKNOWLEDGEMENT
We would like to express our heartfelt gratitude to all those who have contributed to the successful completion and submission of our capstone project. This project has been the culmination of months of hard work, research, and dedication, and we are truly grateful for the support and assistance we have received along the way.

First and foremost, we would like to thank our advisor, Dr. Gopa Bhaumik, for their unwavering guidance and mentorship throughout this journey. Their expertise, feedback, and patience have been instrumental in shaping this project and helping us reach this milestone.

We are also thankful to our professors and instructors at National Institute of Technology Jamshedpur for providing us with a strong academic foundation and inspiring us to pursue this capstone project. Their teachings have been invaluable.

Lastly, we appreciate the understanding and cooperation of all those who may have been inconvenienced during the preparation and submission of this capstone project.

Thank you once again to everyone who has been a part of this journey.

Sincerely,

Anshuman Raj (2022UGCS035)


Sulochan Khadka(2022UGCS109)
Prince Kumar Singh (2022UGCS117)

ABSTRACT

Image captioning, the process of automatically generating textual descriptions from visual content, has gained increasing importance in artificial intelligence and computer vision research. As applications for generating real-time and context-aware captions grow—ranging from assisting visually impaired individuals to improving image search and content recommendation—new challenges have emerged, such as handling diverse datasets, improving contextual understanding, and ensuring semantic relevance. Despite significant advancements with deep learning models, current systems struggle with complex scenes, generalized language generation, and real-time processing, especially in environments with multiple objects and actions.

Selecting the appropriate deep learning techniques for image captioning systems is a nuanced task, given the variety of models available—each with distinct strengths, limitations, and computational requirements. This report explores a range of approaches, from traditional CNN-LSTM models to cutting-edge transformer-based architectures, examining their performance on standard datasets such as Flickr8k and MS-COCO. The study highlights key challenges, including dataset bias, the difficulty of capturing fine-grained relationships between objects, and the ambiguity of generating multiple valid captions for a single image.

By systematically analyzing the state of research in image captioning, this report aims to shed light on existing gaps and areas for improvement. It underscores the necessity of incorporating attention mechanisms, multimodal approaches, and advanced evaluation metrics to enhance model accuracy and contextual relevance. Ultimately, this research provides valuable insights to inform the development of more robust, scalable, and real-time image captioning systems capable of addressing diverse application needs.

Contents
Chapter 1: Introduction
1.1 Introduction
1.2 Background
1.3 Problem Definition
1.4 Outline of the Report

Chapter 2: Literature Review


2.1 Review of CNN-LSTM Architectures for Image Captioning
2.2 Attention Mechanisms in Image Captioning
2.3 Transformer-Based Approaches
2.4 Challenges in Dataset Diversity and Bias
2.5 Multimodal Learning and Hybrid Approaches

Chapter 3: Proposed Methodology


3.1 Overview of Methodologies
3.2 CNN+LSTM Model for Image Captioning
3.2.1 CNN for Feature Extraction
3.2.2 LSTM for Caption Generation
3.2.3 Attention Mechanism for Context-Aware Captioning
3.3 CNN+Transformer Model
3.3.1 Transformer-Based Caption Generation
3.3.2 Self-Attention Mechanisms
3.4 ViT+GPT-2 Model
3.4.1 Vision Transformer for Image Processing
3.4.2 GPT-2 for Language Generation
3.4.3 Global Attention and Contextual Understanding
3.5 Comparison of Models

Chapter 4: Results and Discussion


4.1 LSTM+CNN Model Performance
4.1.1 BLEU Score Evaluation
4.1.2 Challenges with Sequential Processing
4.2 CNN+Transformer Model Performance
4.2.1 BLEU Score and Contextual Coherence
4.2.2 Advantages of Parallel Processing
4.3 ViT+GPT-2 Model Performance
4.3.1 BLEU Score and Rich Captions
4.3.2 Handling Complex Scenes and Actions
4.4 Comparative Analysis of Models

Chapter 5: Conclusion and Scope for Future Work


5.1 Summary of Findings
5.2 Improvements in Model Performance
5.3 Limitations of the Current Approach
5.4 Future Research Directions
5.4.1 Use of Larger Datasets and Data Augmentation
5.4.2 Advanced Architectures (Vision-Language Models, Multimodal
Transformers)
5.4.3 Enhancing Efficiency for Real-Time Applications

6. References

CHAPTER 1
INTRODUCTION

In the modern digital landscape, where artificial intelligence (AI) intersects with
everyday tasks, image captioning has emerged as a critical area of research. This
technology plays a pivotal role in translating visual content into meaningful
textual descriptions, enabling machines to understand and communicate the
essence of an image. Image captioning finds applications in various fields, such as
assisting visually impaired individuals, enhancing search engine capabilities, and
improving content accessibility across multimedia platforms.

1.1 INTRODUCTION

Powered by advanced deep learning models, image captioning systems can automatically generate coherent and contextually accurate captions for images. The
integration of convolutional neural networks (CNNs) for feature extraction and recurrent
neural networks (RNNs), especially long short-term memory (LSTM) networks, for
language generation allows these systems to analyze images and produce natural
language descriptions. As the demand for more efficient, context-aware, and scalable
solutions grows, the potential of image captioning technology to transform the way we
interact with visual data becomes increasingly clear.
By leveraging the capabilities of machine learning algorithms, these systems can
adaptively process visual content, identifying objects, actions, and scenes within images.
Image captioning systems offer a proactive approach to content generation, facilitating a
smoother interaction between humans and machines. As these systems continue to
evolve, they hold the promise of revolutionizing fields ranging from social media to
autonomous vehicles, offering dynamic solutions to interpret and describe the world
around us.

1.2 BACKGROUND

Rooted in the increasing need for smarter content understanding and automation, image
captioning systems build upon foundational developments in both computer vision and
natural language processing. These systems go beyond traditional image recognition
tasks, integrating advanced models to generate coherent descriptions of visual data.
Early efforts in this domain involved hand-crafted features and template-based
captioning methods. However, the advent of deep learning has revolutionized the field,
enabling the development of more adaptive and context-aware systems.
Inspired by advancements in sequence-to-sequence learning and language modeling,
modern image captioning systems harness the power of CNNs for feature extraction and
RNNs, particularly LSTMs, for language generation. This approach is similar to
techniques used in translation tasks, where an encoder-decoder framework is applied to
translate visual information into natural language. Furthermore, attention mechanisms,
first introduced in machine translation, have been instrumental in refining the
performance of image captioning models, allowing for more accurate and contextually
relevant descriptions.
As the demand for intelligent systems capable of understanding and describing complex
visual scenes grows, image captioning has seen increased attention across industries.
From enhancing search engine capabilities to aiding the visually impaired, these systems
leverage both visual and linguistic data to offer comprehensive solutions. Models like
those proposed by Vinyals et al. (2015) and Xu et al. (2015) serve as the foundation for
ongoing research, demonstrating the ability to generate descriptions that capture the
relationships, actions, and context of objects within an image. This convergence of AI
fields has opened new doors for advancements in image processing, recommendation
systems, and human-computer interaction.

1.3 PROBLEM DEFINITION

As the reliance on artificial intelligence for understanding and generating content from
visual data grows, the research at hand delves into the complexities and challenges of
building robust and efficient image captioning systems. Image captioning, a task
requiring the combination of visual comprehension and natural language generation,
faces numerous hurdles, from accurately identifying objects in images to capturing the
relationships and contextual nuances between them.
The key challenge lies in developing a system that can generate captions that are not
only grammatically correct but also semantically meaningful and contextually
appropriate. This involves refining caption generation mechanisms through the
integration of advanced deep learning models, leveraging attention mechanisms to focus
on relevant parts of the image, and employing sophisticated algorithms to ensure that
captions align with human-like understanding of visual scenes.
The contemporary landscape of multimedia and digital communication demands
systems that not only understand visual data but also provide concise, accurate, and
contextually rich descriptions. As such, image captioning systems must address the
intricacies of complex visual scenes, diverse datasets, and multiple valid captions for a
single image. In this project, the focus is on building an efficient image captioning
model that can overcome these challenges, ensuring accurate, coherent, and context-
aware captions for diverse images.

1.4 OUTLINE OF THE REPORT

This report is structured to guide readers through a comprehensive exploration of image captioning systems, addressing the challenges of generating meaningful textual descriptions from visual content.

1. Introduction: This section introduces the topic, emphasizing the importance of image captioning in artificial intelligence and its applications across various fields. It outlines the key challenges, such as the need for accurate contextual understanding and the integration of vision and language models.

2. Literature Review: A detailed review of existing work in the field of image captioning, exploring the evolution of methodologies from template-based systems to deep learning models. This section also identifies research gaps and highlights the role of advanced architectures like CNNs, LSTMs, and transformers in improving caption generation.

3. Proposed Methodology: An in-depth discussion of the methodologies employed, including data collection, preprocessing, and model selection. The integration of convolutional neural networks (CNNs) for feature extraction and long short-term memory networks (LSTMs) for caption generation forms the core of our approach. Attention mechanisms are used to enhance the model's ability to focus on specific image regions during caption generation.

4. Results and Discussion: This section presents and analyzes the outcomes of the image captioning models, including performance evaluation using standard metrics such as BLEU, METEOR, and CIDEr. It discusses the effectiveness of different models and architectures and examines the results across various datasets.

5. Conclusion and Future Scope: Summarizing the findings, this section discusses the implications of the research, achievements, and limitations. It also provides insights into future research directions, such as the exploration of multimodal approaches and the use of larger, more diverse datasets to further improve image captioning systems.

6. References: A comprehensive list of references, acknowledging the sources that contributed to the development and understanding of image captioning systems. This section highlights the key studies and research efforts that have shaped the field.

CHAPTER 2
LITERATURE REVIEW

In navigating the ever-evolving landscape of image captioning systems, this literature review surveys existing research, methodologies, and technological advancements aimed at improving the quality of generated captions. The focal points include the development of encoder-decoder architectures, attention mechanisms, transformer-based models, and the existing research gaps in handling complex scenes and context. This review provides a comprehensive understanding of the challenges and advancements in this field.

2.1 Review of CNN-LSTM Architectures for Image Captioning


A seminal work in image captioning by Vinyals et al. (2015) introduced the concept of
applying an encoder-decoder architecture for generating captions from images. The
encoder, typically a convolutional neural network (CNN), extracts visual features from
the image, which are then passed to a recurrent neural network (RNN), such as a long
short-term memory (LSTM) network, that generates the caption one word at a time.
The CNN extracts high-level visual features that represent objects and their attributes,
while the LSTM decodes these features into a sequential sentence. This architecture has
proven effective for generating simple, descriptive captions but struggles with complex
scenes where relationships between objects are more intricate. The model’s inability to
focus on specific regions of an image while generating a particular word prompted
further research into more adaptive systems.

2.2 Review of Attention Mechanisms in Image Captioning

In a breakthrough study by Xu et al. (2015), the concept of attention mechanisms was introduced to the image captioning task. Attention mechanisms allow the model to
dynamically focus on different parts of the image while generating each word in the
caption, mimicking the way humans visually process scenes. This approach significantly
improved the performance of image captioning systems, particularly in handling images
with multiple objects or complex interactions.

The attention mechanism enhances the decoder by allowing it to selectively attend to certain regions of the image based on the context of the caption being generated. This
development was critical in bridging the gap between basic object detection and
understanding the contextual relationships within an image, leading to more accurate
and contextually relevant captions.

2.3 Review of Transformer-Based Architectures


More recently, transformer models, such as those introduced by Dosovitskiy et al.
(2020) in the Vision Transformer (ViT) and further extended by models like CLIP, have
revolutionized the field of image captioning. Transformers, originally designed for
natural language processing tasks, utilize self-attention mechanisms to capture
relationships between all parts of the input data simultaneously. This ability to model
long-range dependencies in both the image and the caption text has resulted in more
coherent and globally aware captions.
Transformers, unlike CNN-LSTM models, process the entire input image as a set of
patches, allowing for parallel processing rather than the sequential approach of RNNs.
This has enabled the generation of richer, more complex captions, particularly in scenes
with multiple objects or actions. The use of large-scale pre-training on datasets like
ImageNet and MS-COCO has further improved the model's ability to generalize across
diverse images.

2.4 Challenges in Dataset Diversity and Bias


One significant issue in image captioning research is the bias inherent in the
datasets used for training models. Datasets such as MS-COCO and Flickr8k
contain images and captions that are often culturally specific and limited in
diversity. This can result in models that generate biased or stereotypical captions,
particularly when applied to images from different cultural contexts. Research has
shown that models trained on biased datasets can produce inaccurate or insensitive
captions, highlighting the need for more diverse and inclusive training data.

2.5 Hybrid Approaches and Multimodal Learning

Recent advancements have also explored hybrid approaches that integrate multimodal
data sources, combining image, text, and even audio or video data to enhance caption
generation. For example, multimodal transformers such as CLIP and models that
incorporate reinforcement learning have shown promise in improving the accuracy and
contextual relevance of generated captions.
By learning from multiple modalities, these systems can generate captions that not only describe objects but also account for the actions or sounds associated with them. This integration of multimodal data holds great potential for applications such as real-time captioning for video streams or generating captions for multimedia content, where understanding dynamic interactions is critical.

Research Gaps
Lack of Dataset Diversity: One major research gap in image captioning is the
lack of diversity in the datasets used for training models. Popular datasets such as
MS-COCO and Flickr8k contain a limited range of image types, with a
predominant focus on Western cultural contexts. This results in models that are
biased toward generating captions that reflect the dataset’s specific language
patterns and cultural elements, making them less applicable in global or diverse
settings. Addressing this gap requires the development and inclusion of more
culturally and contextually diverse datasets to improve the generalization of image
captioning models across different populations and scenarios.

Handling Complex Visual Contexts: Another significant gap in image captioning research is the challenge of generating captions that accurately describe
complex scenes involving multiple objects, relationships, and actions. While
models equipped with attention mechanisms and transformers have improved the
ability to focus on specific elements within an image, they still struggle with
understanding the interplay between objects and actions. Current models often
generate surface-level descriptions, overlooking the deeper context or interaction
between visual elements. Bridging this gap requires further refinement of model
architectures to handle the dynamic complexity of real-world scenes.

Ambiguity in Caption Generation: A critical issue in image captioning is the ambiguity associated with generating captions for a single image. Many images
can have multiple valid captions that describe them differently based on the
viewer’s interpretation. However, most models generate a single, deterministic
caption, limiting the system's ability to capture the full range of possible
descriptions. There is a growing need for models that can produce diverse sets of captions for a given image, reflecting the various ways in which an image can be
interpreted.

Optimization of Model Hyperparameters: The selection and optimization of hyperparameters, such as learning rates, batch sizes, and network architectures,
play a crucial role in improving the performance of image captioning models.
However, determining the optimal set of hyperparameters for each specific dataset
and application remains an open challenge. Many models suffer from overfitting
or underfitting due to suboptimal hyperparameter selection, leading to reduced
performance in real-world scenarios. Addressing this gap requires the exploration
of more efficient techniques for hyperparameter tuning, such as automated
optimization or reinforcement learning approaches.

Real-Time Captioning and Integration: Another key gap is the real-time processing and seamless integration of image captioning systems with web and
mobile applications. While most current systems focus on accuracy, few models
are optimized for real-time applications, such as live video captioning or
interactive multimedia systems. Real-time image captioning requires models that
are both computationally efficient and responsive, without sacrificing caption
quality. Furthermore, integration with user-friendly interfaces remains a
challenge, as the generated captions must be presented in a way that enhances the
overall user experience. Achieving this level of integration will require more innovation in both model efficiency and interface design.

CHAPTER 3
Methodology
1. Data Collection
The foundation of effective danger detection lies in the quality of the collected
data. Our recommendation is a multi-modal approach to data collection,
encompassing audio, video, and motion data.
Audio Data: Capturing ambient sounds enables the detection of distress
signals, screams, or aggressive voices. This auditory information adds a
crucial layer to the overall safety system.
Motion Data: Incorporating accelerometers and gyroscopes allows the
detection of sudden movements or falls, indicating potential danger. This
sensor-based approach adds another dimension to the safety system.

2. Data Pre-processing
Before model training, meticulous data pre-processing is essential to ensure the
quality and reliability of the information.
Noise Reduction: Background noise in audio recordings can obscure
important signals. Implementing noise reduction techniques enhances the
clarity of audio data.
Feature Extraction: Relevant features, such as spectral characteristics for
audio and motion, should be extracted to facilitate modelling. This step is
crucial for feeding the necessary information into the subsequent stages of
the system.

Converting Audio to Image: In scenarios where Convolutional Neural Network (CNN) models are employed for audio analysis, converting audio data to image format can be beneficial. This transformation enables the utilization of CNNs, which are traditionally used for image classification tasks, for audio-based tasks. One common approach is to convert audio waveforms into spectrograms, which are visual representations of the frequency content of the audio signal over time. Spectrograms can be generated using techniques like the Short-Time Fourier Transform (STFT) or Mel-Frequency Spectral Coefficients (MFSC), which divide the audio signal into short time intervals and calculate the energy distribution across different frequency bands within each interval. The resulting spectrogram is a 2D image where the x-axis represents time, the y-axis represents frequency, and the intensity or color at each pixel represents the magnitude or energy of the corresponding frequency component at a particular time. This conversion facilitates the application of CNN models for audio analysis tasks such as speech recognition, sound classification, and audio event detection.
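
As a concrete illustration of the audio-to-spectrogram conversion described above, the following minimal Python sketch (assuming the librosa and matplotlib packages are available; the file name audio.wav is a placeholder) converts a waveform into a log-scaled mel spectrogram image that can be fed to a CNN:

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load an audio file (the path is a placeholder for illustration only).
waveform, sample_rate = librosa.load("audio.wav", sr=16000)

# Compute a mel-scaled spectrogram: STFT magnitudes pooled into mel frequency bands.
mel_spec = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=1024, hop_length=256, n_mels=128
)

# Convert power values to decibels so the dynamic range suits image-like inputs.
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

# Render and save the spectrogram as a 2D image (time on x-axis, frequency on y-axis).
fig, ax = plt.subplots(figsize=(6, 4))
librosa.display.specshow(mel_spec_db, sr=sample_rate, hop_length=256,
                         x_axis="time", y_axis="mel", ax=ax)
fig.savefig("spectrogram.png", bbox_inches="tight")

The resulting 2D array can either be saved as an image, as above, or passed directly to a CNN as a single-channel input.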

Training Data Preparation: In this phase, collected audio data undergoes annotation, with each instance labeled to denote whether it signifies a
dangerous situation. This categorization involves identifying specific audio
cues indicative of danger, such as distress signals or aggressive sounds.
Consistency and accuracy in annotations are vital for training machine
learning models to accurately discern between safe and hazardous audio
events. This meticulous process lays the foundation for developing robust
danger detection systems, crucial for enhancing women's safety.

3. Proposed Methodology and Model Training

3.1. CNN+LSTM: Convolutional Neural Networks with Long Short-Term Memory

The CNN+LSTM model forms the baseline of our image captioning system. It
combines the strengths of convolutional neural networks for feature extraction and
recurrent neural networks, specifically long short-term memory (LSTM) networks, for
sequential caption generation.

• CNN for Feature Extraction: In this approach, a pre-trained CNN (e.g., ResNet-50 or InceptionV3) is used to encode the image into a high-dimensional feature vector. These CNNs have been trained on large-scale image datasets like ImageNet, enabling them to capture rich visual features such as object shapes, colors, and spatial relations.

• LSTM for Caption Generation: The extracted features from the CNN are then
passed to an LSTM network, which generates a caption word by word. The
LSTM is well-suited for this task as it excels at handling sequential data and
maintaining the context of previously generated words while predicting the next
word.

Attention Mechanism: To further improve caption quality, an attention mechanism is introduced. The attention mechanism allows the LSTM to dynamically focus on different regions of the image during caption generation. This ensures that the model attends to relevant parts of the image, improving the accuracy and relevance of the generated captions, particularly in complex scenes.

Training Process: The CNN+LSTM model is trained using image-caption pairs from
datasets such as MS-COCO or Flickr8k. The model learns to predict the next word in a
caption sequence given the previous words and the image features. The loss function
used is categorical cross-entropy, and techniques such as dropout are employed to
prevent overfitting.
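
To make the encoder-decoder pipeline described above concrete, the following is a minimal Keras-style sketch of a merge-style CNN+LSTM captioner. It assumes InceptionV3 features have already been extracted as 2048-dimensional vectors and that captions have been tokenized; vocab_size and max_len are illustrative placeholders rather than values fixed by this report.

from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                     LSTM, add)
from tensorflow.keras.models import Model

vocab_size = 8000   # illustrative: size of the caption vocabulary
max_len = 35        # illustrative: longest caption length after tokenization

# Image branch: a 2048-d CNN feature vector projected to 256 dimensions.
image_input = Input(shape=(2048,))
image_drop = Dropout(0.5)(image_input)
image_embed = Dense(256, activation="relu")(image_drop)

# Text branch: the partial caption so far, embedded and summarized by an LSTM.
caption_input = Input(shape=(max_len,))
caption_embed = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_drop = Dropout(0.5)(caption_embed)
caption_state = LSTM(256)(caption_drop)

# Merge the two modalities and predict the next word (categorical cross-entropy).
merged = add([image_embed, caption_state])
hidden = Dense(256, activation="relu")(merged)
output = Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()

At inference time, captions are produced word by word: starting from a start token, the model is called repeatedly and the most probable word is appended until an end token is generated or max_len is reached.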

3.2. CNN+Transformer Model: Convolutional Neural Networks with Transformer-Based Caption Generation

The CNN+Transformer model leverages the powerful self-attention mechanisms of transformers for generating captions. While the CNN is still used for image feature extraction, transformers replace the LSTM for caption generation.

• CNN for Image Encoding: Similar to the CNN+LSTM approach, a pre-trained CNN is used to extract a feature map from the input image. The CNN captures important visual details that will be used as input for the transformer network.

• Transformer for Caption Generation: Unlike LSTM, the transformer architecture uses self-attention mechanisms, which allow the model to process all input tokens (in this case, image features and words) in parallel. This enables the model to capture long-range dependencies and relationships within the image and the generated caption. Transformers are better at handling long sequences and complex dependencies, making them more suited for tasks that require contextual understanding over an entire sequence.

Self-Attention Mechanism: The core strength of transformers lies in their self-attention mechanism. This mechanism allows the model to attend to all parts of the image and caption simultaneously, improving its ability to generate coherent and contextually accurate captions. The transformer architecture also enables more efficient training due to its parallel processing capabilities.

Training Process: The CNN+Transformer model is trained similarly to the CNN+LSTM model but with the transformer replacing the LSTM. The transformer is trained to generate captions by learning the relationships between words and image features through attention weights. This architecture is particularly effective in generating captions for complex images with multiple objects and interactions.
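
A rough PyTorch sketch of this design is given below. It is not the exact architecture used in the experiments; dimensions such as vocab_size, d_model, and the number of layers are illustrative. The CNN feature map is flattened into region tokens that serve as the decoder's memory, and a causal mask keeps caption generation autoregressive.

import torch
import torch.nn as nn

class CNNTransformerCaptioner(nn.Module):
    """Sketch: CNN feature-map tokens as memory, transformer decoder for words."""

    def __init__(self, vocab_size=8000, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        self.feature_proj = nn.Linear(2048, d_model)      # project CNN features
        self.word_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(512, d_model)       # learned positions
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.output = nn.Linear(d_model, vocab_size)

    def forward(self, features, captions):
        # features: (batch, num_regions, 2048) CNN feature map flattened to tokens
        # captions: (batch, seq_len) token ids of the target caption so far
        memory = self.feature_proj(features)
        positions = torch.arange(captions.size(1), device=captions.device)
        tgt = self.word_embed(captions) + self.pos_embed(positions)
        # Causal mask so each word can only attend to earlier words.
        causal = nn.Transformer.generate_square_subsequent_mask(
            captions.size(1)).to(captions.device)
        hidden = self.decoder(tgt=tgt, memory=memory, tgt_mask=causal)
        return self.output(hidden)   # (batch, seq_len, vocab_size) word logits

# Usage sketch: 49 region tokens from a 7x7 CNN feature map, batch of 2.
model = CNNTransformerCaptioner()
logits = model(torch.randn(2, 49, 2048), torch.randint(0, 8000, (2, 20)))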

3.3. ViT+GPT-2: Vision Transformer with GPT-2 for Caption Generation

The ViT+GPT-2 model represents the cutting edge in image captioning by using the
Vision Transformer (ViT) for image encoding and GPT-2, a powerful language
model, for caption generation.

• Vision Transformer for Image Encoding: The Vision Transformer (ViT) divides the input image into a sequence of patches, treating each patch as a token. Unlike CNNs, which use convolutional layers to process images, ViT applies transformer-based self-attention mechanisms directly to image patches. This allows the model to capture both local and global relationships in the image, providing a more comprehensive understanding of the visual content.

• GPT-2 for Caption Generation: GPT-2, a transformer-based language model, is used for generating captions. GPT-2 has been pre-trained on large text corpora, giving it a deep understanding of language structure, grammar, and contextual relationships between words. When combined with the Vision Transformer, GPT-2 can generate captions that are not only syntactically correct but also contextually rich, accurately describing the image content.

Global Attention and Contextual Understanding: The ViT+GPT-2 model excels at
capturing global features of the image due to the self-attention applied to all patches
simultaneously. This global context is crucial for generating captions that reflect the
overall scene, especially when multiple objects or complex actions are present.

Training Process: The ViT+GPT-2 model is trained in two stages. First, the Vision
Transformer is fine-tuned on an image dataset (such as MS-COCO) to learn relevant
image features. Then, the GPT-2 model is trained to generate captions based on the
image representations learned by ViT. This two-part training process ensures that both
the image and text components of the model are optimized for the caption generation
task.
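
In practice, a ViT encoder and a GPT-2 decoder can be wired together with the Hugging Face transformers library. The sketch below is one possible implementation path (an assumption about tooling, not the exact code used in this project); it pairs the public google/vit-base-patch16-224-in21k and gpt2 checkpoints and generates text for a single image:

from PIL import Image
from transformers import (VisionEncoderDecoderModel, ViTImageProcessor,
                          GPT2TokenizerFast)

# Pair a pre-trained ViT encoder with a pre-trained GPT-2 decoder.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# GPT-2 has no pad token by default; reuse EOS so generation can stop cleanly.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Encode an example image (the path is a placeholder) and generate a caption.
image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=30, num_beams=4)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)

Note that without fine-tuning on an image-caption dataset such as Flickr8K or MS-COCO, the generated text will not be a meaningful caption; the snippet only shows how the two components are connected and how decoding is invoked.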

3.4. Comparison of the Approaches

Each of these models has its own strengths and limitations:

• CNN+LSTM: The CNN+LSTM approach is computationally efficient and works well for generating simple, descriptive captions. However, it struggles with capturing long-range dependencies and complex relationships between objects.

• CNN+Transformer: The CNN+Transformer model improves upon CNN+LSTM by leveraging self-attention mechanisms to handle more complex image-caption relationships. Transformers allow for parallel processing, making them more efficient when handling long sequences. This approach generates more contextually aware captions and is better suited for diverse, complex scenes.

• ViT+GPT-2: The ViT+GPT-2 model represents the most advanced approach, utilizing transformer architectures for both image encoding and caption generation. The Vision Transformer’s ability to capture global relationships and GPT-2’s natural language generation capabilities make this model ideal for generating high-quality, contextually rich captions. However, this approach is computationally expensive and requires significant resources for training and inference.

Chapter 4
Results

In this section, we provide a detailed breakdown of the results
obtained from three different image captioning models: LSTM +
CNN, CNN + Transformer, and ViT + GPT-2. All models were
trained and evaluated on the Flickr8K dataset, which contains 8,091
images, each paired with five human-generated captions. The
evaluation was performed using four main BLEU scores: B1, B2, B3,
and B4, corresponding to n-grams from unigrams to 4-grams.
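
For reference, corpus-level BLEU-1 through BLEU-4 can be computed with NLTK as in the short sketch below; the reference and candidate captions shown are invented examples, not outputs of the models evaluated in this chapter:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each image has several reference captions (tokenized) and one candidate caption.
references = [
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "brown", "dog", "is", "running", "along", "the", "shore"]],
]
candidates = [["a", "dog", "is", "running", "on", "the", "beach"]]

smooth = SmoothingFunction().method1  # avoids zero scores for short captions
weights = {
    "B1": (1.0, 0, 0, 0),
    "B2": (0.5, 0.5, 0, 0),
    "B3": (1/3, 1/3, 1/3, 0),
    "B4": (0.25, 0.25, 0.25, 0.25),
}
for name, w in weights.items():
    score = corpus_bleu(references, candidates, weights=w,
                        smoothing_function=smooth)
    print(f"{name}: {score:.3f}")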
4.1 LSTM + CNN Results
The LSTM + CNN model combines a CNN to extract visual features
from the image and an LSTM to sequentially generate captions.
 BLEU Scores (Flickr8K):
o B1: 0.675
o B2: 0.503
o B3: 0.355
o B4: 0.297
Analysis:
 The B1 score (0.675) indicates that the model performs reasonably
well in matching individual words (unigrams) between the
generated and reference captions. This suggests that the LSTM +
CNN model can capture the primary objects or actions in an image
but struggles with generating complex sentences.
 The drop in B4 (0.297) highlights the model's difficulty in
constructing longer, grammatically coherent sequences of words.
The model tends to generate simple captions and struggles to
capture more intricate relationships between objects or actions in the
image.
 The LSTM's sequential nature limits its performance, particularly
when generating long captions where earlier words are required to
influence later words. The reliance on sequential processing also
makes it slower during both training and inference.
4.2 CNN + Transformer Results
The CNN + Transformer model utilizes the self-attention mechanism
of transformers, enabling it to process sequences in parallel. This
significantly improves its ability to capture long-range dependencies
in the image captioning task.
 BLEU Scores (Flickr8K):
o B1: 0.706
o B2: 0.527
o B3: 0.375
o B4: 0.318
Analysis:
 The B1 score (0.706) is higher than the LSTM + CNN model,
reflecting the transformer's better ability to recognize key objects
and actions in the images. This is due to the parallel processing of
the transformer, which allows the model to attend to different parts
of the image simultaneously.
 The B2, B3, and B4 scores are also higher than those of the LSTM
+ CNN model, showing that the CNN + Transformer model excels
at generating more complex and contextually accurate captions. The
B4 score (0.318) indicates that the model captures better phrase-
level coherence compared to the LSTM-based model.
 The transformer’s self-attention mechanism enables the model to
effectively model dependencies between distant words, allowing for
the generation of more sophisticated captions that incorporate
complex relationships between objects.
4.3 ViT + GPT-2 Results
The ViT + GPT-2 model leverages the Vision Transformer (ViT) to
split images into patches and apply attention mechanisms across them,
followed by GPT-2, a highly advanced language model, to generate
rich, contextually appropriate captions.
 BLEU Scores (Flickr8K):
o B1: 0.712
o B2: 0.531
o B3: 0.381
o B4: 0.330
Analysis:
 The B1 score (0.712) indicates that the ViT + GPT-2 model is
effective at identifying key objects and actions in images, producing
captions with a high unigram match to reference captions.
 The B4 score (0.330), while only slightly higher than the CNN +
Transformer model, demonstrates the ViT + GPT-2 model's superior
ability to generate longer, contextually richer captions. The
combination of ViT's global image understanding and GPT-2’s
language generation capabilities results in captions that are both
coherent and descriptive.
 The model handles complex scenes particularly well, where
multiple objects or activities are present. This makes the captions
more descriptive, even though it comes at a higher computational
cost due to the heavy usage of attention mechanisms in both ViT
and GPT-2.

4.4 Comparison of Models (Flickr8K Dataset)

Model               B1      B2      B3      B4
LSTM + CNN          0.675   0.503   0.355   0.297
CNN + Transformer   0.706   0.527   0.375   0.318
ViT + GPT-2         0.712   0.531   0.381   0.330

Observations:
 LSTM + CNN performs reasonably well on the Flickr8K dataset,
particularly in recognizing individual objects. However, its
sequential nature limits its ability to generate complex and coherent
captions.
 CNN + Transformer shows an improvement over LSTM + CNN,
particularly in handling long-range dependencies, enabling the
generation of more contextually accurate captions.
 ViT + GPT-2 outperforms both LSTM + CNN and CNN +
Transformer, especially in capturing relationships between objects
in complex scenes, but this comes at a higher computational cost.

Overall, for the Flickr8K dataset, the ViT + GPT-2 model is the best
performer, followed closely by the CNN + Transformer model. Both
of these models demonstrate superior capabilities in generating
complex, contextually relevant captions compared to the LSTM +
CNN model, which remains a strong baseline but falls short in terms
of longer caption coherence and computational efficiency.

Chapter 5
Conclusions and Scope for Future Work

5.1 Conclusions
In this project, we explored image captioning, a critical task at the
intersection of computer vision and natural language processing. By
implementing and comparing three different models—LSTM + CNN, CNN
+ Transformer, and ViT + GPT-2—we evaluated their performance in
generating descriptive captions for images using the Flickr8K dataset. The
project aimed to understand how these models handle the challenges of
image captioning, including object recognition, scene understanding, and
language generation.
The primary findings can be summarized as follows:
 LSTM + CNN: This model performed adequately for basic image
captioning tasks, particularly in simpler scenarios where captions
involved fewer objects or less complex relationships between objects.
Its primary limitation lies in its sequential nature, which results in
slower processing and difficulties in handling long-range
dependencies between words in a caption. Although effective for
short captions, it struggles to maintain coherence and fluency in
longer descriptions.
 CNN + Transformer: The introduction of self-attention mechanisms
in the transformer model significantly improved performance. The
ability of transformers to process sequences in parallel allowed for
faster and more accurate caption generation. This model captured
longer-range dependencies better than LSTM-based models, resulting
in more coherent and contextually accurate captions. However, the
model’s performance could still be improved with larger datasets and
more training data.
 ViT + GPT-2: This model emerged as the top performer, leveraging
Vision Transformer (ViT) for a global understanding of the image
and GPT-2 for generating rich, detailed captions. It excelled in
handling complex scenes with multiple objects or actions, generating
grammatically correct and contextually relevant captions. However,
the high computational cost of this model presents a challenge,
making it less feasible for real-time or resource-constrained
environments.
Evaluation Metrics: Across all models, BLEU scores (B1, B2, B3, B4) were
used to assess caption quality. While BLEU-1 (B1) provided insights into
the models' ability to capture key objects or actions, BLEU-4 (B4) was
particularly useful for evaluating the models’ ability to generate coherent,
complex sentences. The ViT + GPT-2 model achieved the highest BLEU-4
score, indicating its superior capacity for producing longer, more detailed
captions.
Overall, the results highlight the importance of using advanced
architectures like transformers and language models for improving the
quality of image captions. However, these improvements come at the cost
of computational efficiency, especially in models like ViT + GPT-2.
5.2 Scope for Future Work
While the project successfully implemented three different models for
image captioning, there are several avenues for future research and

development that can improve both the accuracy and efficiency of these
models. Below are some potential directions for future work:
5.2.1 Larger Datasets for Training
One of the key limitations of this project was the size of the Flickr8K
dataset. Although it provided a reasonable starting point, larger datasets like
MS COCO (with over 120,000 images) or Flickr30k (with 30,000 images)
could significantly improve model performance. Larger datasets would
allow models, especially transformers and ViT-based architectures, to learn
more diverse and complex image-caption mappings, enhancing their
generalization capabilities.
 Data augmentation: In addition to using larger datasets, data
augmentation techniques (e.g., image transformations, synthetic
captions) can increase the effective size of the training set and
improve the models' robustness to variations in image content.
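
A minimal image-side augmentation pipeline, sketched here with torchvision under the assumption of a PyTorch training setup (the specific parameter values are illustrative), could look as follows:

import torchvision.transforms as T

# A sketch of image-side augmentation for caption training (values are illustrative).
train_transforms = T.Compose([
    T.Resize(256),
    T.RandomCrop(224),                       # random spatial crop
    T.RandomHorizontalFlip(p=0.5),           # mirror images half the time
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

One design caution: augmentations such as horizontal flips can invalidate captions that mention spatial terms like "left" or "right", so augmentation choices should be checked against the caption vocabulary.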
5.2.2 Incorporating More Advanced Architectures
While the ViT + GPT-2 model represents a significant advancement in
image captioning, there are opportunities to explore even more advanced
architectures:
 Vision-Language Models: Models like CLIP (Contrastive Language-
Image Pre-training) and BLIP (Bootstrapping Language-Image Pre-
training) have shown promise in improving image-text alignment.
These models could be incorporated into future image captioning
frameworks to further improve caption quality.
 Multimodal Transformers: Recent advancements in multimodal
transformers, which jointly process visual and textual data, could
lead to more seamless integration of vision and language tasks,
producing captions that are even more contextually aware and
accurate.
5.2.3 Improving Efficiency for Real-Time Applications

One of the primary challenges of advanced models like ViT + GPT-2 is
their computational complexity, which can make them impractical for real-
time applications, such as generating live captions for videos or CCTV
footage. Future work should focus on optimizing these models for speed
and efficiency:
 Model Compression and Pruning: Techniques like quantization, pruning, and distillation can be used to reduce the size and computational requirements of these models without significantly sacrificing performance. This could make models like ViT + GPT-2 more viable for real-time applications (a minimal quantization sketch follows this list).
 Edge AI: As edge computing becomes more prevalent, there is
potential for deploying optimized image captioning models directly
on edge devices (e.g., smartphones, IoT devices). Future research
could focus on building lightweight models that can run efficiently
on such devices, reducing the reliance on cloud-based processing.
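
As a minimal illustration of the compression idea mentioned above (a sketch assuming PyTorch as the deployment framework, not a procedure used in this project), post-training dynamic quantization converts the weights of linear layers to 8-bit integers while leaving the module interface unchanged:

import torch
import torch.nn as nn

# A stand-in decoder with large linear layers (illustrative, not the project model).
decoder = nn.Sequential(
    nn.Linear(768, 3072), nn.ReLU(),
    nn.Linear(3072, 768), nn.ReLU(),
    nn.Linear(768, 8000),           # vocabulary projection
)
decoder.eval()

# Post-training dynamic quantization: nn.Linear weights stored as int8,
# activations quantized on the fly at inference time (CPU-friendly).
quantized = torch.quantization.quantize_dynamic(
    decoder, {nn.Linear}, dtype=torch.qint8
)

# The quantized module produces the same-shaped outputs with a smaller footprint.
dummy = torch.randn(1, 768)
with torch.no_grad():
    out_fp32 = decoder(dummy)
    out_int8 = quantized(dummy)
print(out_fp32.shape, out_int8.shape)   # torch.Size([1, 8000]) twice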
5.2.4 Context-Aware Captioning
While current models generate captions based solely on the visual content
of an image, there is an opportunity to improve captioning by incorporating
contextual information. This could include:
 Temporal Context: For image sequences or video frames,
incorporating temporal context could allow the model to generate
captions that reflect ongoing actions or changes in the scene.
 External Knowledge: Integrating external knowledge sources, such as
knowledge graphs, could allow the model to generate more
informative and contextually rich captions. For example, a model
could recognize an object in an image and generate captions that
reflect its historical or cultural significance.
5.2.5 User Interaction and Customization
Future work could also focus on improving the user interaction with image

captioning systems by allowing users to customize captions based on their
preferences. This could include:
 Personalization: Allowing users to adjust captioning style (e.g.,
formal vs. informal language) or detail level (e.g., concise vs.
descriptive captions).
 Interactive Feedback: Implementing systems where users can provide
feedback on generated captions, helping the model improve its
performance over time through reinforcement learning.
5.2.6 Enhanced Evaluation Metrics
Although BLEU is the most commonly used evaluation metric in image
captioning, it has limitations, particularly in handling multiple valid
captions for the same image. Future research should explore alternative or
enhanced evaluation metrics:
 METEOR and CIDEr scores offer complementary perspectives by
focusing on synonym matching and human consensus. These metrics
could be more extensively used to evaluate the richness and human-
like quality of captions.
 Human Evaluation: Automated metrics can often miss nuances in
language. Incorporating human evaluation, alongside automated
metrics, would provide a more complete picture of model
performance.

5.3 Final Thoughts


The advancements in image captioning, as demonstrated by the models
explored in this project, represent significant progress in AI’s ability to
interpret visual content and generate natural language descriptions.
However, this field is still evolving, and there remain many opportunities
for improvement, particularly in areas like efficiency, contextual
understanding, and user interaction. By exploring larger datasets,

leveraging more advanced architectures, and focusing on real-time
applications, future research can continue to push the boundaries of what
image captioning systems are capable of achieving.

REFERENCES

1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." In Advances in Neural Information Processing Systems. Available at: https://arxiv.org/abs/1706.03762
2. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." Available at: https://arxiv.org/abs/2010.11929
3. Radford, A., Kim, J. W., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." Available at: https://arxiv.org/abs/2103.00020
4. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). "BLEU: A Method for Automatic Evaluation of Machine Translation." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). Available at: https://aclanthology.org/P02-1040/
5. Lin, C. Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries." Available at: https://aclanthology.org/W04-1013/
6. Chen, X., Fang, H., Lin, T., et al. (2015). "Microsoft COCO Captions: Data Collection and Evaluation Server." Available at: https://arxiv.org/abs/1504.00325
7. Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). "SPICE: Semantic Propositional Image Caption Evaluation." Available at: https://arxiv.org/abs/1607.08822
8. Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). "A Comprehensive Survey of Deep Learning for Image Captioning." Available at: https://arxiv.org/abs/1810.04020
