Report 1
Submitted by
(Assistant Professor)
DECLARATION
I certify that the work contained in this report is original and has been done by me
under the guidance of my supervisor(s). The work has not been submitted to any
other Institute for any degree. I have followed the guidelines provided by the
Institute in preparing the report. I have conformed to the norms and guidelines
given in the Ethical Code of Conduct of the Institute. Whenever I have used
materials (data, theoretical analysis, figures, and text) from other sources, I have
given due credit to them by citing them in the text of the report and giving their
details in the references. Further, I have taken permission from the copyright
owners of the sources, whenever necessary.
ACKNOWLEDGEMENT
I would like to express my heartfelt gratitude to all those who have contributed to
the successful completion and submission of my capstone project. This project has
been a culmination of months of hard work, research, and dedication, and I am
truly grateful for the support and assistance I have received along the way.
First and foremost, I would like to thank my advisor, Dr. Gopa Bhaumika, for his
unwavering guidance and mentorship throughout this journey. His expertise,
feedback, and patience have been instrumental in shaping this project and helping
me reach this milestone.
Lastly, I appreciate the understanding and cooperation of all those who may have
been inconvenienced during my capstone project's preparation and submission.
Thank you once again to everyone who has been a part of this journey.
Sincerely,
ABSTRACT
Contents
Chapter 1: Introduction
1.1 Introduction
1.2 Background
1.3 Problem Definition
1.4 Outline of the Report
Chapter 2: Literature Review
Chapter 3: Methodology
Chapter 4: Results
Chapter 5: Conclusions and Scope for Future Work
References
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
In the modern digital landscape, where artificial intelligence (AI) intersects with
everyday tasks, image captioning has emerged as a critical area of research. This
technology plays a pivotal role in translating visual content into meaningful
textual descriptions, enabling machines to understand and communicate the
essence of an image. Image captioning finds applications in various fields, such as
assisting visually impaired individuals, enhancing search engine capabilities, and
improving content accessibility across multimedia platforms.
1.2 BACKGROUND
Rooted in the increasing need for smarter content understanding and automation, image
captioning systems build upon foundational developments in both computer vision and
natural language processing. These systems go beyond traditional image recognition
tasks, integrating advanced models to generate coherent descriptions of visual
data.
Early efforts in this domain involved hand-crafted features and template-based
captioning methods. However, the advent of deep learning has revolutionized the field,
enabling the development of more adaptive and context-aware systems.
Inspired by advancements in sequence-to-sequence learning and language modeling,
modern image captioning systems harness the power of CNNs for feature extraction and
RNNs, particularly LSTMs, for language generation. This approach is similar to
techniques used in translation tasks, where an encoder-decoder framework is applied to
translate visual information into natural language. Furthermore, attention mechanisms,
first introduced in machine translation, have been instrumental in refining the
performance of image captioning models, allowing for more accurate and contextually
relevant descriptions.
As the demand for intelligent systems capable of understanding and describing complex
visual scenes grows, image captioning has seen increased attention across industries.
From enhancing search engine capabilities to aiding the visually impaired, these systems
leverage both visual and linguistic data to offer comprehensive solutions. Models like
those proposed by Vinyals et al. (2015) and Xu et al. (2015) serve as the foundation for
ongoing research, demonstrating the ability to generate descriptions that capture the
relationships, actions, and context of objects within an image. This convergence of AI
fields has opened new doors for advancements in image processing, recommendation
systems, and human-computer interaction.
1.3 PROBLEM DEFINITION
As the reliance on artificial intelligence for understanding and generating content from
visual data grows, the research at hand delves into the complexities and challenges of
building robust and efficient image captioning systems. Image captioning, a task
requiring the combination of visual comprehension and natural language generation,
faces numerous hurdles, from accurately identifying objects in images to capturing the
relationships and contextual nuances between them.
The key challenge lies in developing a system that can generate captions that are not
only grammatically correct but also semantically meaningful and contextually
appropriate. This involves refining caption generation mechanisms through the
integration of advanced deep learning models, leveraging attention mechanisms to focus
on relevant parts of the image, and employing sophisticated algorithms to ensure that
captions align with human-like understanding of visual scenes.
The contemporary landscape of multimedia and digital communication demands
systems that not only understand visual data but also provide concise, accurate, and
contextually rich descriptions. As such, image captioning systems must address the
intricacies of complex visual scenes, diverse datasets, and multiple valid captions for a
single image. In this project, the focus is on building an efficient image captioning
model that can overcome these challenges, ensuring accurate, coherent, and context-
aware captions for diverse images.
1.4 OUTLINE OF THE REPORT
4. Result and Discussion: This section presents and analyzes the outcomes of
the image captioning models, including performance evaluation using
standard metrics such as BLEU, METEOR, and CIDEr. It discusses the
effectiveness of different models and architectures and examines the results
across various datasets.
CHAPTER 2
LITERATURE REVIEW
2.5 Hybrid Approaches and Multimodal Learning
Recent advancements have also explored hybrid approaches that integrate multimodal
data sources, combining image, text, and even audio or video data to enhance caption
generation. For example, multimodal transformers such as CLIP and models that
incorporate reinforcement learning have shown promise in improving the accuracy and
contextual relevance of generated captions.
By learning from multiple modalities, these systems can
generate captions that not only describe objects but also
account for the actions or sounds associated with them. This
integration of multimodal data holds great potential for
applications such as real-time captioning for video streams or
generating captions for multimedia content, where
understanding dynamic interactions is critical.
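As an illustrative aside rather than part of the system described in this report, the following minimal Python sketch shows how a pretrained multimodal model such as CLIP can score the alignment between an image and candidate captions. It assumes the Hugging Face transformers library, PyTorch, Pillow, and a hypothetical local file example.jpg.

```python
# Minimal sketch: scoring image-caption alignment with CLIP (Hugging Face transformers).
# Assumes transformers, torch, and Pillow are installed and "example.jpg" exists locally.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
candidate_captions = [
    "a dog running on the beach",
    "a group of people at a dinner table",
]

inputs = processor(text=candidate_captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, prob in zip(candidate_captions, probs[0].tolist()):
    print(f"{prob:.3f}  {caption}")
```

Similarity scores of this kind can be used, for instance, to rerank candidate captions produced by a separate caption generator.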
Research Gap
Lack of Dataset Diversity: One major research gap in image captioning is the
lack of diversity in the datasets used for training models. Popular datasets such as
MS-COCO and Flickr8k contain a limited range of image types, with a
predominant focus on Western cultural contexts. This results in models that are
biased toward generating captions that reflect the dataset’s specific language
patterns and cultural elements, making them less applicable in global or diverse
settings. Addressing this gap requires the development and inclusion of more
culturally and contextually diverse datasets to improve the generalization of image
captioning models across different populations and scenarios.
Handling Multiple Valid Captions: Another gap lies in evaluation, since standard
metrics and models struggle to account for the fact that there can be several equally
valid captions for a given image, reflecting the various ways in which an image can be
interpreted.
Closing these gaps will require innovation in both model efficiency and interface design.
CHAPTER 3
Methodology
1. Data Collection
The foundation of effective danger detection lies in the quality of the collected
data. Our recommendation is a multi-modal approach to data collection,
encompassing audio, video, and motion data.
Audio Data: Capturing ambient sounds enables the detection of distress
signals, screams, or aggressive voices. This auditory information adds a
crucial layer to the overall safety system.
Motion Data: Incorporating accelerometers and gyroscopes allows the
detection of sudden movements or falls, indicating potential danger. This
sensor-based approach adds another dimension to the safety system.
2. Data Pre-processing
Before model training, meticulous data pre-processing is essential to ensure the
quality and reliability of the information.
Noise Reduction: Background noise in audio recordings can obscure
important signals. Implementing noise reduction techniques enhances the
clarity of audio data.
Feature Extraction: Relevant features, such as spectral characteristics for
audio and motion, should be extracted to facilitate modelling. This step is
crucial for feeding the necessary information into the subsequent stages of
the system.
Spectrogram Conversion: Audio recordings are converted into spectrograms,
generated using techniques like the Short-Time Fourier Transform (STFT) or
Mel-Frequency Spectral Coefficients (MFSC), which divide the audio
signal into short time intervals and calculate the energy distribution across
different frequency bands within each interval. The resulting spectrogram
is a 2D image where the x-axis represents time, the y-axis represents
frequency, and the intensity or color at each pixel represents the magnitude
or energy of the corresponding frequency component at a particular time.
This conversion facilitates the application of CNN models for audio
analysis tasks such as speech recognition, sound classification, and audio
event detection.
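As a minimal sketch of the conversion just described (library choice and parameters are assumptions, not taken from this report), the following Python snippet uses librosa to compute both an STFT spectrogram and a mel-scaled spectrogram from a hypothetical audio file.

```python
# Minimal sketch: turning an audio clip into a (mel) spectrogram for CNN input.
# Assumes librosa and numpy are installed and "clip.wav" is a hypothetical local file.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)           # load audio at 16 kHz

# Short-Time Fourier Transform: magnitudes over short, overlapping windows.
stft = np.abs(librosa.stft(y, n_fft=512, hop_length=256))
spectrogram_db = librosa.amplitude_to_db(stft, ref=np.max)

# Mel-scaled variant, closer to the MFSC-style features described above.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Each result is a 2D array (frequency bins x time frames) that can be treated
# as a single-channel image and fed to a CNN.
print(spectrogram_db.shape, mel_db.shape)
```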
The CNN+LSTM model forms the baseline of our image captioning system. It
combines the strengths of convolutional neural networks for feature extraction and
recurrent neural networks, specifically long short-term memory (LSTM) networks, for
sequential caption generation.
• CNN for Feature Extraction: A convolutional neural network, typically one
pre-trained on a large image classification dataset, processes the input image and
produces a compact feature vector that summarizes its visual content.
• LSTM for Caption Generation: The extracted features from the CNN are then
passed to an LSTM network, which generates a caption word by word. The
LSTM is well-suited for this task as it excels at handling sequential data and
maintaining the context of previously generated words while predicting the next
word.
Training Process: The CNN+LSTM model is trained using image-caption pairs from
datasets such as MS-COCO or Flickr8k. The model learns to predict the next word in a
caption sequence given the previous words and the image features. The loss function
used is categorical cross-entropy, and techniques such as dropout are employed to
prevent overfitting.
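A minimal Keras sketch of this kind of CNN+LSTM decoder is given below. The vocabulary size, maximum caption length, feature dimension, and layer sizes are illustrative assumptions rather than the exact configuration used in this project; in the standard training setup, the CNN features are pre-extracted and the model predicts the next word of a partial caption.

```python
# Minimal sketch of a CNN+LSTM caption generator (Keras), under assumed settings:
# pre-extracted 2048-d CNN features, an 8000-word vocabulary, captions up to 34 tokens.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 8000, 34, 2048   # assumed values, not from the report

# Image branch: project the CNN feature vector into the decoder's hidden space.
img_in = Input(shape=(feat_dim,))
img_x = Dropout(0.5)(img_in)
img_x = Dense(256, activation="relu")(img_x)

# Text branch: embed the partial caption and run it through an LSTM.
txt_in = Input(shape=(max_len,))
txt_x = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_x = Dropout(0.5)(txt_x)
txt_x = LSTM(256)(txt_x)

# Merge both branches and predict the next word (categorical cross-entropy).
merged = add([img_x, txt_x])
merged = Dense(256, activation="relu")(merged)
out = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```

At inference time, captions are generated word by word by feeding the model's own predictions back in until an end-of-sequence token is produced.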
The ViT+GPT-2 model represents the cutting edge in image captioning by using the
Vision Transformer (ViT) for image encoding and GPT-2, a powerful language
model, for caption generation.
Global Attention and Contextual Understanding: The ViT+GPT-2 model excels at
capturing global features of the image due to the self-attention applied to all patches
simultaneously. This global context is crucial for generating captions that reflect the
overall scene, especially when multiple objects or complex actions are present.
Training Process: The ViT+GPT-2 model is trained in two stages. First, the Vision
Transformer is fine-tuned on an image dataset (such as MS-COCO) to learn relevant
image features. Then, the GPT-2 model is trained to generate captions based on the
image representations learned by ViT. This two-part training process ensures that both
the image and text components of the model are optimized for the caption generation
task.
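One compact way to assemble such an encoder-decoder pair is Hugging Face's VisionEncoderDecoderModel, sketched below with assumed checkpoint names and a hypothetical local image. The cross-attention weights connecting the two models are newly initialized, so in practice the combined model would first be fine-tuned on image-caption pairs, as described above, before its captions are usable.

```python
# Minimal sketch: pairing a ViT encoder with a GPT-2 decoder for captioning via
# Hugging Face's VisionEncoderDecoderModel. Checkpoint names and the image file
# are assumptions, not taken from the report.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2TokenizerFast

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2")
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# GPT-2 has no pad token by default; reuse EOS and tell the decoder where to start.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Inference on a single (hypothetical) image; fine-tuning on image-caption pairs
# would precede this step in practice.
pixel_values = image_processor(Image.open("example.jpg"), return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```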
Chapter 4
Results
In this section, we provide a detailed breakdown of the results
obtained from three different image captioning models: LSTM +
CNN, CNN + Transformer, and ViT + GPT-2. All models were
trained and evaluated on the Flickr8K dataset, which contains 8,091
images, each paired with five human-generated captions. The
evaluation was performed using four main BLEU scores: B1, B2, B3,
and B4, corresponding to n-grams from unigrams to 4-grams.
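The BLEU scores reported below can be reproduced with NLTK's corpus-level BLEU implementation, as in the minimal sketch that follows; the captions shown are made-up examples, not outputs of the project's models.

```python
# Minimal sketch: computing corpus-level BLEU-1..BLEU-4 with NLTK, mirroring the
# B1-B4 scores reported in this chapter. Captions here are illustrative only.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One entry per test image: a list of tokenized reference captions and one hypothesis.
references = [
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "brown", "dog", "running", "along", "the", "shore"]],
]
hypotheses = [
    ["a", "dog", "is", "running", "on", "the", "beach"],
]

smooth = SmoothingFunction().method1
weights = {
    "B1": (1.0, 0, 0, 0),
    "B2": (0.5, 0.5, 0, 0),
    "B3": (1 / 3, 1 / 3, 1 / 3, 0),
    "B4": (0.25, 0.25, 0.25, 0.25),
}
for name, w in weights.items():
    score = corpus_bleu(references, hypotheses, weights=w, smoothing_function=smooth)
    print(f"{name}: {score:.3f}")
```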
4.1 LSTM + CNN Results
The LSTM + CNN model combines a CNN to extract visual features
from the image and an LSTM to sequentially generate captions.
BLEU Scores (Flickr8K):
o B1: 0.675
o B2: 0.503
o B3: 0.355
o B4: 0.297
Analysis:
The B1 score (0.675) indicates that the model performs reasonably
well in matching individual words (unigrams) between the
generated and reference captions. This suggests that the LSTM +
CNN model can capture the primary objects or actions in an image
but struggles with generating complex sentences.
The drop in B4 (0.297) highlights the model's difficulty in
constructing longer, grammatically coherent sequences of words.
The model tends to generate simple captions and struggles to
capture more intricate relationships between objects or actions in the
image.
The LSTM's sequential nature limits its performance, particularly
when generating long captions where earlier words are required to
influence later words. The reliance on sequential processing also
makes it slower during both training and inference.
4.2 CNN + Transformer Results
The CNN + Transformer model utilizes the self-attention mechanism
of transformers, enabling it to process sequences in parallel. This
significantly improves its ability to capture long-range dependencies
in the image captioning task.
BLEU Scores (Flickr8K):
o B1: 0.706
o B2: 0.527
o B3: 0.375
o B4: 0.318
Analysis:
The B1 score (0.706) is higher than the LSTM + CNN model,
reflecting the transformer's better ability to recognize key objects
and actions in the images. This is due to the parallel processing of
the transformer, which allows the model to attend to different parts
of the image simultaneously.
The B2, B3, and B4 scores are also higher than those of the LSTM
+ CNN model, showing that the CNN + Transformer model excels
at generating more complex and contextually accurate captions. The
B4 score (0.318) indicates that the model captures better phrase-
level coherence compared to the LSTM-based model.
The transformer’s self-attention mechanism enables the model to
effectively model dependencies between distant words, allowing for
the generation of more sophisticated captions that incorporate
complex relationships between objects.
4.3 ViT + GPT-2 Results
The ViT + GPT-2 model leverages the Vision Transformer (ViT) to
split images into patches and apply attention mechanisms across them,
followed by GPT-2, a highly advanced language model, to generate
rich, contextually appropriate captions.
BLEU Scores (Flickr8K):
o B1: 0.712
o B2: 0.531
o B3: 0.381
o B4: 0.330
Analysis:
The B1 score (0.712) indicates that the ViT + GPT-2 model is
effective at identifying key objects and actions in images, producing
captions with a high unigram match to reference captions.
The B4 score (0.330), while only slightly higher than the CNN +
Transformer model, demonstrates the ViT + GPT-2 model's superior
ability to generate longer, contextually richer captions. The
combination of ViT's global image understanding and GPT-2’s
language generation capabilities results in captions that are both
coherent and descriptive.
The model handles complex scenes particularly well, where
multiple objects or activities are present. This makes the captions
more descriptive, even though it comes at a higher computational
cost due to the heavy usage of attention mechanisms in both ViT
and GPT-2.
Observations:
LSTM + CNN performs reasonably well on the Flickr8K dataset,
particularly in recognizing individual objects. However, its
sequential nature limits its ability to generate complex and coherent
captions.
CNN + Transformer shows an improvement over LSTM + CNN,
particularly in handling long-range dependencies, enabling the
generation of more contextually accurate captions.
ViT + GPT-2 outperforms both LSTM + CNN and CNN +
Transformer, especially in capturing relationships between objects
in complex scenes, but this comes at a higher computational cost.
Overall, for the Flickr8K dataset, the ViT + GPT-2 model is the best
performer, followed closely by the CNN + Transformer model. Both
of these models demonstrate superior capabilities in generating
complex, contextually relevant captions compared to the LSTM +
CNN model, which remains a strong baseline but falls short in terms
of longer caption coherence and computational efficiency.
Chapter 5
Conclusions and Scope for Future Work
5.1 Conclusions
In this project, we explored image captioning, a critical task at the
intersection of computer vision and natural language processing. By
implementing and comparing three different models—LSTM + CNN, CNN
+ Transformer, and ViT + GPT-2—we evaluated their performance in
generating descriptive captions for images using the Flickr8K dataset. The
project aimed to understand how these models handle the challenges of
image captioning, including object recognition, scene understanding, and
language generation.
The primary findings can be summarized as follows:
LSTM + CNN: This model performed adequately for basic image
captioning tasks, particularly in simpler scenarios where captions
involved fewer objects or less complex relationships between objects.
Its primary limitation lies in its sequential nature, which results in
slower processing and difficulties in handling long-range
dependencies between words in a caption. Although effective for
short captions, it struggles to maintain coherence and fluency in
longer descriptions.
CNN + Transformer: The introduction of self-attention mechanisms
in the transformer model significantly improved performance. The
ability of transformers to process sequences in parallel allowed for
faster and more accurate caption generation. This model captured
longer-range dependencies better than LSTM-based models, resulting
in more coherent and contextually accurate captions. However, the
model’s performance could still be improved with larger datasets and
more training data.
ViT + GPT-2: This model emerged as the top performer, leveraging
Vision Transformer (ViT) for a global understanding of the image
and GPT-2 for generating rich, detailed captions. It excelled in
handling complex scenes with multiple objects or actions, generating
grammatically correct and contextually relevant captions. However,
the high computational cost of this model presents a challenge,
making it less feasible for real-time or resource-constrained
environments.
Evaluation Metrics: Across all models, BLEU scores (B1, B2, B3, B4) were
used to assess caption quality. While BLEU-1 (B1) provided insights into
the models' ability to capture key objects or actions, BLEU-4 (B4) was
particularly useful for evaluating the models’ ability to generate coherent,
complex sentences. The ViT + GPT-2 model achieved the highest BLEU-4
score, indicating its superior capacity for producing longer, more detailed
captions.
Overall, the results highlight the importance of using advanced
architectures like transformers and language models for improving the
quality of image captions. However, these improvements come at the cost
of computational efficiency, especially in models like ViT + GPT-2.
5.2 Scope for Future Work
While the project successfully implemented three different models for
image captioning, there are several avenues for future research and
development that can improve both the accuracy and efficiency of these
models. Below are some potential directions for future work:
5.2.1 Larger Datasets for Training
One of the key limitations of this project was the size of the Flickr8K
dataset. Although it provided a reasonable starting point, larger datasets like
MS COCO (with over 120,000 images) or Flickr30k (with 30,000 images)
could significantly improve model performance. Larger datasets would
allow models, especially transformers and ViT-based architectures, to learn
more diverse and complex image-caption mappings, enhancing their
generalization capabilities.
Data augmentation: In addition to using larger datasets, data
augmentation techniques (e.g., image transformations, synthetic
captions) can increase the effective size of the training set and
improve the models' robustness to variations in image content.
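A minimal sketch of such image-side augmentation is shown below, using torchvision transforms with illustrative (assumed) parameter values.

```python
# Minimal sketch: image-side data augmentation with torchvision transforms.
# Parameter choices are illustrative, not taken from the report.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop and resize
    transforms.RandomHorizontalFlip(p=0.5),               # mirror half of the images
    transforms.ColorJitter(brightness=0.2, contrast=0.2,  # mild photometric changes
                           saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],      # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Applied on the fly during training, e.g.:
# augmented = train_transform(Image.open("example.jpg"))
```

Because the augmented views are produced on the fly, no extra storage is needed and each training epoch sees slightly different versions of the images.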
5.2.2 Incorporating More Advanced Architectures
While the ViT + GPT-2 model represents a significant advancement in
image captioning, there are opportunities to explore even more advanced
architectures:
Vision-Language Models: Models like CLIP (Contrastive Language-
Image Pre-training) and BLIP (Bootstrapping Language-Image Pre-
training) have shown promise in improving image-text alignment.
These models could be incorporated into future image captioning
frameworks to further improve caption quality (a brief BLIP sketch follows this list).
Multimodal Transformers: Recent advancements in multimodal
transformers, which jointly process visual and textual data, could
lead to more seamless integration of vision and language tasks,
producing captions that are even more contextually aware and
accurate.
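As a concrete, assumed example of the vision-language models mentioned above (the checkpoint name and image file are not taken from this report), the sketch below generates a caption with an off-the-shelf BLIP model via Hugging Face transformers.

```python
# Minimal sketch: off-the-shelf captioning with BLIP via Hugging Face transformers.
# The checkpoint name and image file are assumptions.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_length=32)
print(processor.decode(out[0], skip_special_tokens=True))
```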
5.2.3 Improving Efficiency for Real-Time Applications
One of the primary challenges of advanced models like ViT + GPT-2 is
their computational complexity, which can make them impractical for real-
time applications, such as generating live captions for videos or CCTV
footage. Future work should focus on optimizing these models for speed
and efficiency:
Model Compression and Pruning: Techniques like quantization,
pruning, and distillation can be used to reduce the size and
computational requirements of these models without significantly
sacrificing performance. This could make models like ViT + GPT-2
more viable for real-time applications (a small quantization sketch follows this list).
Edge AI: As edge computing becomes more prevalent, there is
potential for deploying optimized image captioning models directly
on edge devices (e.g., smartphones, IoT devices). Future research
could focus on building lightweight models that can run efficiently
on such devices, reducing the reliance on cloud-based processing.
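The quantization sketch referenced above is given here; it applies PyTorch's post-training dynamic quantization to a small stand-in network, since the project's trained models are not reproduced in this report.

```python
# Minimal sketch: post-training dynamic quantization in PyTorch, one of the
# compression techniques mentioned above. Shown on a toy model for illustration.
import torch
import torch.nn as nn

# Stand-in for a trained captioning decoder; in practice this would be the real model.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 8000),
)
model.eval()

# Quantize Linear layers to int8 weights; activations are quantized dynamically at runtime.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# The quantized model is a drop-in replacement for inference.
x = torch.randn(1, 512)
print(quantized(x).shape)
```

Dynamic quantization mainly benefits the Linear-heavy decoder layers of models such as GPT-2, reducing model size and often speeding up CPU inference.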
5.2.4 Context-Aware Captioning
While current models generate captions based solely on the visual content
of an image, there is an opportunity to improve captioning by incorporating
contextual information. This could include:
Temporal Context: For image sequences or video frames,
incorporating temporal context could allow the model to generate
captions that reflect ongoing actions or changes in the scene.
External Knowledge: Integrating external knowledge sources, such as
knowledge graphs, could allow the model to generate more
informative and contextually rich captions. For example, a model
could recognize an object in an image and generate captions that
reflect its historical or cultural significance.
5.2.5 User Interaction and Customization
Future work could also focus on improving the user interaction with image
captioning systems by allowing users to customize captions based on their
preferences. This could include:
Personalization: Allowing users to adjust captioning style (e.g.,
formal vs. informal language) or detail level (e.g., concise vs.
descriptive captions).
Interactive Feedback: Implementing systems where users can provide
feedback on generated captions, helping the model improve its
performance over time through reinforcement learning.
5.2.6 Enhanced Evaluation Metrics
Although BLEU is the most commonly used evaluation metric in image
captioning, it has limitations, particularly in handling multiple valid
captions for the same image. Future research should explore alternative or
enhanced evaluation metrics:
METEOR and CIDEr scores offer complementary perspectives by
focusing on synonym matching and human consensus. These metrics
could be more extensively used to evaluate the richness and human-
like quality of captions (a short METEOR example follows this list).
Human Evaluation: Automated metrics can often miss nuances in
language. Incorporating human evaluation, alongside automated
metrics, would provide a more complete picture of model
performance.
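The METEOR example mentioned above is sketched below using NLTK with made-up captions; CIDEr is usually computed with the COCO caption evaluation toolkit and is therefore not shown here.

```python
# Minimal sketch: scoring a single caption with METEOR via NLTK, complementing
# the BLEU example in Chapter 4. Requires the WordNet data (nltk.download("wordnet")).
from nltk.translate.meteor_score import meteor_score

references = [
    "a dog runs along the beach".split(),
    "a brown dog is running on the shore".split(),
]
hypothesis = "a dog is running on the beach".split()

print(f"METEOR: {meteor_score(references, hypothesis):.3f}")
```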
In summary, by training on larger and more diverse datasets,
leveraging more advanced architectures, and focusing on real-time
applications, future research can continue to push the boundaries of what
image captioning systems are capable of achieving.
REFERENCES
1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention is All You
Need." In Advances in Neural Information Processing Systems. Available at:
https://arxiv.org/abs/1706.03762
2. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). "An Image is Worth
16x16 Words: Transformers for Image Recognition at Scale." Available at:
https://arxiv.org/abs/2010.11929
3. Radford, A., Kim, J. W., et al. (2021). "Learning Transferable Visual
Models From Natural Language Supervision." Available at:
https://arxiv.org/abs/2103.00020
4. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). "BLEU: a method
for automatic evaluation of machine translation." In Proceedings of the 40th
annual meeting on association for computational linguistics (ACL). Available at:
https://aclanthology.org/P02-1040/
5. Lin, C.Y. (2004). "ROUGE: A Package for Automatic Evaluation of
Summaries." Available at: https://aclanthology.org/W04-1013/
6. Chen, X., Fang, H., Lin, T., et al. (2015). "Microsoft COCO Captions: Data
Collection and Evaluation Server." Available at: https://arxiv.org/abs/1504.00325
7. Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). "SPICE:
Semantic Propositional Image Caption Evaluation." Available at:
https://arxiv.org/abs/1607.08822
8. Hossain, M. D., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). "A
Comprehensive Survey of Deep Learning for Image Captioning." Available at:
https://arxiv.org/abs/1810.04020