Survey Paper
In the realm of artificial intelligence (AI) and computer vision, the task of image captioning stands as a significant breakthrough. It provides machines with the capability to understand visual content and articulate meaningful descriptions in natural language. This intersection of vision and language technologies has paved the way for innovations across numerous domains, from accessibility tools for the visually impaired to sophisticated e-commerce solutions and automated content generation systems. The ability of AI to analyze an image and generate an accurate description showcases its potential to mimic human-like comprehension and creativity.

Despite these advancements, current image captioning systems face inherent limitations. Many models struggle to contextualize the fine-grained details of images, often resulting in generic or incomplete captions. Furthermore, most existing systems operate in silos, either generating captions or focusing on tasks such as keyword tagging or image retrieval. The lack of integration between these functionalities restricts the adaptability of these systems for real-world applications. For instance, a system capable of describing an image, extracting its key attributes, and finding similar visuals would significantly improve workflows in industries such as media curation, archival management, and personalized content delivery.

The motivation behind this research lies in addressing these gaps by designing an all-encompassing framework that integrates caption generation, tag extraction, and related image retrieval. Leveraging the Generative Image-to-Text (GIT) model, this study proposes a novel approach to enhance the functionality and relevance of image captioning systems. Unlike traditional pipelines that pair a vision-based convolutional neural network (CNN) with a language-generating recurrent neural network (RNN), GIT couples an image encoder with a transformer-based text decoder in a single, seamless architecture. This end-to-end learning paradigm allows for the generation of highly descriptive captions while simultaneously offering ancillary outputs, such as image tags and visually similar image suggestions.

The primary goal of this research is not only to develop a unified system but also to contribute to the AI community by creating a reusable dataset. This dataset can serve as a benchmark for future models, encouraging further exploration in this domain. Additionally, the integration of multi-functional features demonstrates the potential of such systems to transform how visual content is analyzed, described, and utilized in real-world settings.

The primary objectives of the Visual Description Generator project are as follows:
1. Automated Caption Generation: Leverage a Generative Image-to-Text (GIT) model to produce high-accuracy, contextually relevant captions for input images.
2. Tag Extraction: Derive key tags from the generated captions, enabling efficient categorization and improved search functionality.
3. Related Image Retrieval: Identify and display images similar to the input image based on semantic content, enhancing usability in content management systems.
4. Reusable Dataset Creation: Develop a high-quality dataset to train and fine-tune similar models, benefiting future research in image captioning and related domains.

By integrating these components, the system can provide a richer and more interactive experience for users while ensuring scalability for various domains.
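The paper names the GIT model as the caption generator for objective 1 but does not include implementation details. The following is a minimal sketch of how such a caption could be produced with a publicly available Hugging Face GIT checkpoint; the checkpoint name, image file, and generation length are illustrative assumptions rather than the authors' configuration.

```python
# Minimal caption-generation sketch with a public GIT checkpoint.
# "microsoft/git-base-coco" and "example.jpg" are placeholder choices.
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Autoregressively decode a caption conditioned on the image features.
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```

In the workflow described above, the generated caption would then feed the tag-extraction and related-image-retrieval steps of objectives 2 and 3.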
While the proposed system aims to provide an integrated solution for image captioning and retrieval, it has certain limitations:
1. Reliance on the GIT Model: The accuracy and contextual understanding of the generated captions are inherently tied to the capabilities of the GIT model, which may have limitations in handling niche or highly specific domains.
2. Dataset Scope: The dataset created and used for this study is diverse but may not fully encompass all potential use cases, requiring future expansion for niche applications.
3. Performance Constraints: Real-time processing speed and computational requirements may limit the scalability of the system for high-volume applications.

Despite these limitations, the study's outcomes aim to provide a robust foundation for further advancements in the field, ensuring adaptability and extensibility for diverse use cases.

However, the scope of this research is bounded by several constraints. The reliance on the GIT model, while powerful, makes the system's performance contingent on the pre-training and configuration of this specific architecture. Similarly, the outcomes are influenced by the diversity and quality of the dataset used. Computational efficiency also emerges as a potential limitation, particularly for large-scale implementations. These challenges underscore the importance of iterative refinement and optimization in developing such integrated systems.

Image captioning represents a critical advancement at the intersection of computer vision and natural language processing. It aims to bridge the gap between visual perception and linguistic expression, enabling machines to understand and describe the content of images. This capability has far-reaching applications in areas such as assisting visually impaired individuals, automating content creation, and enhancing human-computer interaction. However, despite its potential, existing systems face challenges in generating contextually rich and precise captions for diverse datasets.

Significant challenges therefore persist in existing systems. Traditional models often struggle with generating detailed and context-aware captions, particularly for images with intricate visual compositions. Moreover, the lack of integration between captioning and auxiliary tasks like tagging and retrieval results in fragmented workflows, reducing the utility of these systems in real-world applications. Furthermore, most models rely heavily on large-scale datasets for training, which may not always represent diverse real-world scenarios, leading to issues such as bias and reduced generalizability.

By exploring these possibilities, this study aims to contribute to the evolution of intelligent systems capable of understanding and communicating visual information with remarkable accuracy and functionality. The proposed approach signifies a step forward in creating more intuitive and versatile AI-driven solutions that bridge the gap between vision and language.
2. Related Work
Image captioning has garnered considerable attention in the field of artificial intelligence, leading to the development of numerous models that aim to bridge the gap between visual understanding and natural language generation. One of the foundational approaches is the Show and Tell model, which employs an end-to-end neural network comprising a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for generating captions. This model laid the groundwork for using deep learning in captioning tasks by demonstrating that visual information could be effectively translated into textual descriptions. Building upon this, the Show, Attend, and Tell model introduced an attention mechanism, enabling the system to focus on specific regions of an image while generating each word in the caption. This innovation significantly improved the accuracy of generated captions, particularly for complex images with multiple objects. More recently, transformer-based models, leveraging architectures like Vision Transformers (ViT) and multi-head attention, have pushed the boundaries of image captioning. These models excel at capturing global dependencies and generating richer, contextually accurate descriptions.

Figure 2. A visual representation of the image captioning process using a convolutional neural network (CNN) and a Long Short-Term Memory (LSTM) network.
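The CNN-plus-LSTM pipeline summarized above and depicted in Figure 2 can be captured in a few lines. The sketch below is an illustrative reconstruction, not the authors' implementation; the ResNet-18 backbone, embedding sizes, and vocabulary size are arbitrary placeholders.

```python
# Sketch of the encoder-decoder captioning pattern: a CNN encodes the image,
# an LSTM decodes a token sequence conditioned on the image embedding.
import torch
import torch.nn as nn
import torchvision.models as models


class CaptionModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        cnn = models.resnet18(weights=None)                        # visual encoder backbone
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])   # drop the classifier head
        self.img_proj = nn.Linear(512, embed_dim)                  # map image features to embedding size
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images).flatten(1)        # (B, 512) pooled CNN features
        img_tok = self.img_proj(feats).unsqueeze(1)    # treat the image as the first "token"
        tok = self.embed(captions)                     # (B, T, E) word embeddings
        seq = torch.cat([img_tok, tok], dim=1)         # prepend the image embedding
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                        # logits over the vocabulary


# Example forward pass with random data
model = CaptionModel(vocab_size=1000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 1000, (2, 12))
print(model(images, captions).shape)  # torch.Size([2, 13, 1000])
```

During training the logits would be compared against the shifted ground-truth caption with a cross-entropy loss; at inference time tokens are decoded one step at a time.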
Beyond caption generation, the integration of tagging and related image retrieval has also been a subject of exploration. Some studies have attempted to extend the functionality of captioning models by combining them with tag prediction systems. By extracting salient features from images and generating keywords, these systems provide a complementary layer of information.

The proposed system aims to address these challenges by creating a unified framework that seamlessly combines image captioning, tag extraction, and related image retrieval. By leveraging the GIT (Generative Image-to-Text) model, this approach ensures that the generated outputs are not only accurate but also highly functional for a variety of applications. In doing so, it bridges the gaps in existing systems, paving the way for more versatile, integrated solutions.
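The retrieval component of such a unified framework can be realized by embedding the generated captions and ranking catalogue entries by cosine similarity. The sketch below is one plausible realization: the sentence-transformers encoder ("all-MiniLM-L6-v2") and the toy in-memory catalogue are assumptions, as the paper does not specify the similarity model.

```python
# Hedged sketch of caption-based related-image retrieval via cosine similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical catalogue: image path -> previously generated caption
catalogue = {
    "img_001.jpg": "a brown dog running on the beach",
    "img_002.jpg": "a plate of pasta with tomato sauce",
    "img_003.jpg": "two dogs playing with a ball in the sand",
}

def related_images(query_caption: str, top_k: int = 2) -> list[tuple[str, float]]:
    paths = list(catalogue.keys())
    corpus_emb = encoder.encode(list(catalogue.values()), convert_to_tensor=True)
    query_emb = encoder.encode(query_caption, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]            # cosine similarity to each entry
    ranked = scores.argsort(descending=True)[:top_k]           # best matches first
    return [(paths[int(i)], float(scores[i])) for i in ranked]

print(related_images("a dog runs across the sand"))
```

The same ranking scheme could equally be applied to pooled image features instead of caption embeddings; the choice depends on whether visual or semantic similarity is preferred.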
This research aims to advance intelligent systems by exploring methods for enhanced visual understanding and communication. The goal is to develop AI systems that can accurately and effectively interpret and describe visual information. This approach represents a significant step towards creating more intuitive and adaptable AI solutions that effectively bridge the gap between visual perception and language comprehension.
Figure 6. Examples of generated image captions, ranging from accurate descriptions to irrelevant ones, illustrating the challenges of automatic image captioning.

The quality of the generated captions is assessed using standard evaluation metrics:

BLEU (Bilingual Evaluation Understudy): Measures the precision of n-grams between generated captions and reference captions. Higher BLEU scores indicate closer alignment with ground truth.

METEOR (Metric for Evaluation of Translation with Explicit ORdering): Considers precision, recall, and semantic matching, providing a more nuanced evaluation than BLEU.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Used primarily to assess the overlap of phrases and sequences, focusing on recall.
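As a concrete illustration, the sketch below scores a candidate caption against a reference with BLEU and ROUGE-L. The nltk and rouge-score libraries and the example sentences are assumptions for demonstration, not the paper's evaluation setup.

```python
# Scoring a generated caption against a reference caption.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "a brown dog is running on the beach"
candidate = "a dog runs along the sandy beach"

# BLEU: n-gram precision of the candidate against the reference
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence overlap, recall-oriented
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}  ROUGE-L: {rouge_l:.3f}")
```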
Several strengths underscore the effectiveness of the proposed system. The GIT model, fine-tuned with domain-specific datasets, achieved high accuracy in generating descriptive and contextually appropriate captions. The integration of spaCy facilitated efficient and relevant tagging, adding significant value to the generated outputs.
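The paper credits spaCy for tag extraction but does not document the exact pipeline. A plausible minimal sketch, keeping lemmatized nouns and proper nouns from a generated caption as tags, is shown below; the en_core_web_sm model and the example caption are placeholders.

```python
# Tag extraction from a generated caption with spaCy.
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_tags(caption: str) -> list[str]:
    doc = nlp(caption)
    # Keep lemmatized nouns and proper nouns (non-stopwords) as candidate tags
    tags = {
        tok.lemma_.lower()
        for tok in doc
        if tok.pos_ in {"NOUN", "PROPN"} and not tok.is_stop
    }
    return sorted(tags)

print(extract_tags("A brown dog runs across a sandy beach near the ocean"))
```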
References
[1] A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In ACL, 2010.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[3] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[5] D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, 2013.
[6] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[7] R. Gerber and H.-H. Nagel. Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences. In ICIP. IEEE, 1996.
[8] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014.
[9] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
[11] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47, 2013.
[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
[13] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, 2014.
[14] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014.
[15] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In NIPS Deep Learning Workshop, 2013.
[16] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[17] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.
[18] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi. TreeTalk: Composition and compression of trees for image descriptions. ACL, 2(10), 2014.
[19] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In Conference on Computational Natural Language Learning, 2011.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv:1405.0312, 2014.
[21] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Explain images with multimodal recurrent neural networks. arXiv:1410.1090, 2014.
[22] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.
[23] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. C. Berg, K. Yamaguchi, T. L. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
[24] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[25] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[26] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139-147, 2010.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
[28] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.
[29] R. Socher, A. Karpathy, Q. V. Le, C. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. In ACL, 2014.
[30] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[31] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. arXiv:1411.5726, 2015.
[32] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8), 2010.
[33] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In ACL, 2014.
[34] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv:1409.2329, 2014.