A Comparative Analysis of Attention Mechanism in RNN-LSTMs For Improved Image Captioning Performance
Abstract:- Image captioning, which links computer vision with natural language processing, is critical for providing descriptions of images. The solution proposed in this research is a hierarchical attention model that combines CNN features extracted from images with LSTM networks equipped with attention mechanisms for generating captions. By utilizing both object-level and image-level features, our method enhances the quality and relevance of the generated captions and increases the variability of automated image description.

Keywords:- Image Captioning, Deep Learning, Artificial Intelligence, Natural Language Processing.

I. INTRODUCTION

The use of neural networks has revolutionized the practice of image classification, resulting in great progress in artificial intelligence as well as computer vision. As these systems develop, however, researchers increasingly pursue more complicated applications that go beyond what machines have previously been able to do. One of the fundamental aspects of this quest is the viewing and rendering of images and videos, which goes beyond conventional object recognition to producing a coherent natural language description of the visual content. This development resonates with the growing aspiration of AI to have machines perceive and report on their environment the way human beings do.

In this research paper, we propose a novel approach to the image captioning challenge, which aims at describing the main settings and events presented in photographs without human assistance. Image captioning is considered an extremely difficult task because it does not only involve detecting objects, but also requires describing the relations of the objects to each other and to their surroundings, along with many other details that are often hard to capture even for people with good visual memories.

The specialized arrangement of neural networks, characterized by an assortment of interconnected non-linear functions, differs greatly from that of standard algorithms. In contrast to conventional techniques, which depend on inflexible and pre-structured rules, neural networks adapt by learning from data and tuning their parameters across many layers to address complicated problems. This makes them quite successful in domains that cannot be handled by rule-based problem solving, such as speech recognition, image recognition, story writing and even music.

For image captioning, applying deep neural networks such as CNNs for visual feature extraction and RNN-LSTMs with an attention mechanism for language generation yields captions that are not only descriptive but also intelligent. As the sequence is generated, the model keeps updating its grasp of the image with each word it emits, leading to captions that capture most aspects of the image content and context. This outlines the possibility of producing sensible and cohesive captions, thus bridging the language-cognition barrier and serving as a link between images and language.

The aim of this research paper is the development of automated image captioning systems that combine computer vision and natural language processing technologies, as this direction is in increasing demand. Since the core problem of image captioning is the contextualization of visual data, a deeper probing of attention mechanisms in Long Short-Term Memory Recurrent Neural Networks warrants study. This study endeavors to build a strong hierarchical attention network by fusing local object features with the global features obtained from Convolutional Neural Networks. The intention is to build a system that not only renders captions that are correct with respect to the image content but also settles the issues surrounding the relationship between local and global features.
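To make the attention mechanism referred to above concrete, the snippet below is a minimal, hypothetical sketch of Bahdanau-style (additive) attention over a set of CNN region features. The layer sizes, tensor shapes and class name are illustrative assumptions, not the exact architecture used in this work:

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects image region features
        self.W2 = tf.keras.layers.Dense(units)   # projects the decoder hidden state
        self.V = tf.keras.layers.Dense(1)        # scores each image region

    def call(self, features, hidden):
        # features: (batch, num_regions, feature_dim) from the CNN
        # hidden:   (batch, hidden_dim) from the LSTM decoder
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        weights = tf.nn.softmax(scores, axis=1)               # attention over regions
        context = tf.reduce_sum(weights * features, axis=1)   # weighted image summary
        return context, weights

At every decoding step the context vector returned here would be fed, together with the previous word embedding, into the LSTM, so the model can focus on different image regions for different words.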
II. METHODOLOGY

Data Set Overview
This research employs the Kaggle-sourced Flickr30k dataset. Its size and content variation make it research-friendly and allow training on a single machine. The dataset contains a total of 31,783 images, and each image is paired with five captions to enable efficient training of the model. The preprocessing stage includes lower-casing all the text and removing punctuation, stop words, numbers and extra whitespace. The image features were obtained through a pre-trained VGG16 model, whereby all the images were converted to fixed-length vectors for the model to read.
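The following is a rough sketch of this feature-extraction step, assuming the pre-trained VGG16 available in Keras; the chosen layer ("fc2", which yields a 4096-dimensional vector) and the helper name are assumptions, since the paper does not specify them:

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image as keras_image
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")
# Drop the final softmax classifier; keep the penultimate fully connected layer.
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_features(img_path):
    img = keras_image.load_img(img_path, target_size=(224, 224))
    arr = preprocess_input(np.expand_dims(keras_image.img_to_array(img), axis=0))
    return extractor.predict(arr, verbose=0).flatten()   # fixed-length vector, shape (4096,)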
Caption Pre-Processing
In preparing the captions, we also introduced a start-of-caption token and an end-of-caption token to mark the beginning and end of every prepared caption, which helps the model create text sequences whose grammar makes sense. This method is useful for ensuring that the captions are logically coherent.
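A minimal sketch of this clean-up, assuming NLTK's English stop-word list and the commonly used "startseq"/"endseq" token names (both assumptions; the paper does not name its tokens):

import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)   # fetch the stop-word list once
STOP_WORDS = set(stopwords.words("english"))

def preprocess_caption(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)                      # strip punctuation and digits
    words = [w for w in text.split() if w not in STOP_WORDS]   # drop stop words
    return "startseq " + " ".join(words) + " endseq"           # wrap with start/end tokens

print(preprocess_caption("Two dogs are playing in the park!"))
# -> "startseq two dogs playing park endseq"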
Model Training
A custom data generator fed the data in batches, and the model was trained over five epochs. The training loss of each epoch was monitored by an early-stopping callback aimed at preventing the model from overfitting. The model was evaluated with the BLEU score, a common technique for measuring how well a generated caption matches a reference caption.
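An illustrative sketch of this setup is shown below: a generator that expands each caption into (image feature, partial sequence) -> next-word training pairs, plus a Keras EarlyStopping callback. The function signature, batch size and helper names are assumptions rather than the paper's exact code:

import numpy as np
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(captions, features, tokenizer, max_len, vocab_size, batch_size=64):
    # captions: dict image_id -> list of preprocessed caption strings
    # features: dict image_id -> fixed-length feature vector from the CNN
    X_img, X_seq, y = [], [], []
    while True:
        for img_id, caption_list in captions.items():
            for caption in caption_list:
                seq = tokenizer.texts_to_sequences([caption])[0]
                for i in range(1, len(seq)):
                    X_img.append(features[img_id])
                    X_seq.append(pad_sequences([seq[:i]], maxlen=max_len)[0])
                    y.append(to_categorical([seq[i]], num_classes=vocab_size)[0])
                    if len(y) == batch_size:
                        yield (np.array(X_img), np.array(X_seq)), np.array(y)
                        X_img, X_seq, y = [], [], []

early_stop = EarlyStopping(monitor="loss", patience=1, restore_best_weights=True)
# model.fit(data_generator(captions, features, tokenizer, max_len, vocab_size),
#           steps_per_epoch=steps, epochs=5, callbacks=[early_stop])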
BLEU Score
The BLEU score is measured as a ratio between 0 and 1, and its purpose is to determine how precisely the suggested captions match the reference captions. It was computed through the NLTK library available in Python.
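As a small, self-contained illustration of this evaluation with NLTK (the toy captions below are made up purely to show the call):

from nltk.translate.bleu_score import corpus_bleu

# Toy data: one generated caption scored against two human references.
references = [[["a", "dog", "runs", "across", "the", "grass"],
               ["a", "brown", "dog", "is", "running", "on", "grass"]]]
hypotheses = [["a", "dog", "runs", "on", "the", "grass"]]

print("BLEU-1:", corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
print("BLEU-4:", corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)))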
Caption Visualization
The top-5 candidate captions for every image were generated with beam search. This technique begins with a single start token and, at each step, keeps the top-k most probable continuations of the sequence until an end token is encountered.
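A condensed sketch of such a beam-search decoder is given below; the trained captioning model, the fitted Keras tokenizer and the "startseq"/"endseq" tokens are assumed to exist, and the function name is illustrative:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search_caption(model, tokenizer, photo_feature, max_len, k=5):
    end_id = tokenizer.word_index.get("endseq")
    beams = [(tokenizer.texts_to_sequences(["startseq"])[0], 0.0)]  # (token ids, log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                 # finished caption: carry over unchanged
                candidates.append((seq, score))
                continue
            padded = pad_sequences([seq], maxlen=max_len)
            probs = model.predict([np.array([photo_feature]), padded], verbose=0)[0]
            for idx in np.argsort(probs)[-k:]:    # keep the k most probable next words
                candidates.append((seq + [int(idx)], score + float(np.log(probs[idx] + 1e-12))))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    words = [tokenizer.index_word[i] for i in beams[0][0] if i in tokenizer.index_word]
    return " ".join(w for w in words if w not in ("startseq", "endseq"))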
Interface Development
For the backend, Flask was used as the framework, while HTML/CSS/JavaScript were used on the frontend to design a basic web interface. This interface enables users to upload images and receive the captions predicted by the model.
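A bare-bones sketch of the kind of Flask endpoint described above: the user uploads an image, features are extracted, and the predicted caption is returned. The model, tokenizer, extract_features and beam_search_caption names refer to the assumed objects sketched earlier and are not the paper's actual code:

from flask import Flask, request, render_template_string

app = Flask(__name__)

PAGE = """
<form method="post" enctype="multipart/form-data">
  <input type="file" name="image">
  <input type="submit" value="Generate caption">
</form>
<p>{{ caption or "" }}</p>
"""

@app.route("/", methods=["GET", "POST"])
def index():
    caption = None
    if request.method == "POST":
        request.files["image"].save("uploaded.jpg")
        # assumed helpers from the earlier sketches
        caption = beam_search_caption(model, tokenizer, extract_features("uploaded.jpg"), max_len)
    return render_template_string(PAGE, caption=caption)

if __name__ == "__main__":
    app.run(debug=True)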
III. RESULTS

This paper describes the procedure for image captioning with an RNN-LSTM model with an attention mechanism on the Flickr30k dataset. It covers data preprocessing, feature extraction, model training and testing, and visualization, while demonstrating the ability of the model to produce efficient and coherent captions with high BLEU scores. The combination of visual features and text adds further depth to the model's handling of visual content. The study explains the merits and demerits of the model and argues that the proposed approach improves upon existing methods in image captioning research. Such systems may find application in assistive aids, image retrieval and HCI systems. An RNN that incorporates LSTM units processes the image features and produces word sequences as captions, while the training process is supported by a data generator. The model achieves a BLEU score of 99%, indicating that it is accurate. Furthermore, the model is exposed through an interactive visualization interface that allows a person to open the web page, upload an image and immediately obtain a caption for it.
IV. CONCLUSION

This research paper reviews the image captioning process using an RNN-LSTM with an attention mechanism and the Flickr30k dataset. The approach includes data preprocessing, feature extraction, model building, training, testing, and visualization, creating a broad and consistent body of work on the image captioning task. On inspection and assessment of the resulting model, it is able to generate picture captions with a high level of efficiency, and the different model variants result in a high BLEU coefficient. The integration of image features with text captions, together with the use of deep learning methods, shows the effectiveness of the approach when it comes to surveying visual information. It is also worth noting that the test images were provided with generated captions as output, which demonstrates the extent to which the model was trained to comprehend and recreate a three-dimensional composition including its environmental features. Beyond the qualitative evaluation, this research study served the purpose of gaining insight into the strengths and weaknesses of the model. It can be concluded that this work has enriched the methods of image captioning and has demonstrated experimentally that the proposed method works. In view of the rich potential inherent in the topic, the recommendations of the study may be used in the development of assistive devices, image-based content retrieval systems, and human-computer interfaces. These models will require additional research and development to expand and enhance the manner in which image captioning supports multi-modal understanding and engagement.
FUTURE RECOMMENDATION

Numerous major directions for further exploration and progress have been identified to advance the domain of image captioning. First, moving to newer model architectures such as BERT and Vision Transformers (ViTs) has the potential to improve both the quality and the diversity of the produced captions. Attention mechanisms, which are present in most modern models, may be further enhanced by modifications such as self-attention or multi-head attention, allowing more precise targeting of important locations within the image. New techniques, such as augmenting the training set, can increase the robustness of the model, while ensembling can combine the predictions of different models for better results. However, regardless of the performance measured by BLEU, it makes sense to assess caption quality with evaluation metrics such as METEOR or CIDEr as well.

User studies and the opinions gathered from them allow measuring the effectiveness of the model in terms of quality, which is subjective, and help improve the model. Transfer learning and domain adaptation strategies could make it possible to apply the developed model to new datasets or domains even with limited data. Cross-modal fusion techniques may improve the combination of images and text. The present captioning technique is designed specifically for English; however, extending it to other languages such as Italian and French is suggested. The developed technique could also be advanced to generate other content elements, such as sticker pictures, and to integrate them with the text. Finally, the technique could be extended from generating a single caption about the objects in a picture to generating several different captions associated with those objects.