IJIEMR March 2023 COPYRIGHT
2023 IJIEMR. Personal use of this material is permitted. Permission from IJIEMR must
be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new
collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted
component of this work in other works. No reprint of this paper is permitted; all copyright
rests with the paper's authors.
IJIEMR Transactions, available online on 16th Mar 2023.
Link: http://www.ijiemr.org/downloads.php?vol=Volume-12&issue=Issue 03
DOI: 10.48047/IJIEMR/V12/ISSUE 03/16
Title: AN AUTOMATIC IMAGE CAPTION GENERATION APPROACH USING LSTM AND CNN
Paper Authors
K. SAI CHARAN LAHIRI, M. ANITHA LAKSHMI, P. PREM KUMAR, SK. ALTAF,
M. YASWANTH KUMAR, SLVVD SARMA
images increases rapidly; hence, categorizing these images and retrieving the relevant web images is a difficult process. For people to use the numerous images on the web effectively, technologies must be able to explain image contents and must be capable of searching for the data that users need. Moreover, images must be described with natural sentences based not only on the names of the objects contained in an image, but also on their mutual relations.

Photo captions aim to describe the objects, actions, and details found in an image using natural language. Most image caption research focuses on single-sentence captions, but the descriptive capability of this form is limited; one sentence can describe in detail only a small part of an image [2]. The task of automatically generating captions and describing an image is significantly harder than image classification or object recognition. The description of an image must cover not only the objects in the image, but also the relations between those objects, their attributes, and the activities shown in the image. Most previous work in visual recognition has concentrated on labelling images with a fixed set of classes or categories, which has led to large progress in this field. However, such closed vocabularies of visual concepts are only a convenient and simplifying modelling assumption.

Automatic caption generation for an image is one of the challenging problems in artificial intelligence. Image captioning models not only solve the computer vision challenge of object recognition but also capture and express the relationships between objects in natural language. This task is more complicated than the well-studied image classification and object recognition tasks, which have been the main focus of the computer vision community [5].

Recent advancements in language modeling and object recognition have made image captioning an essential research area in computer vision and natural language processing. Caption generation for an image has a great impact by helping visually impaired people to better understand the contents of the web [3]. Automatic caption generation is a tough undertaking that can aid visually challenged persons in understanding the content of web images. It may also have a significant impact on search engines and robots. This problem is substantially more difficult than image categorization or object recognition, both of which have been extensively researched.

Recently, deep learning methods have achieved state-of-the-art results on examples of this problem. It has been demonstrated that deep learning models are able to achieve excellent results on caption generation problems. Hence, an automatic image caption generation approach using CNN and LSTM is presented in this work. The rest of the paper is organized as follows: Section II presents the literature survey. Section III presents the automatic image caption generation approach using LSTM and CNN. Section IV presents the result analysis of the presented approach. Finally, the work is concluded in Section V.

II. LITERATURE SURVEY
Xiangqing Shen, Bing Liu, Yong Zhou and Jiaqi Zhao [7] describe remote sensing image caption generation via transformer and reinforcement learning. A new model that uses the Transformer to decode image features into target sentences is presented. To make the Transformer more adaptive to the remote sensing image captioning task, the authors additionally employ dropout layers, residual connections, and adaptive feature fusion in the Transformer. Reinforcement learning is then applied to enhance the quality of the generated sentences. The validity of the proposed model is demonstrated on three remote sensing image captioning datasets. The model obtains all seven higher scores on the Sydney dataset and the Remote Sensing Image Caption Dataset (RSICD), and four higher scores on the UCM dataset, which indicates that the proposed methods perform better than the previous state-of-the-art models in remote sensing image caption generation.

Songtao Ding, Shiru Qu, Yuling Xi, Arun Kumar Sangaiah and Shaohua Wan [8] describe image caption generation with high-level image features. A novel image captioning model based on high-level image features is presented. The authors combine low-level information, such as image quality, with high-level features, such as motion classification and face recognition, to detect the attention regions of an image. They demonstrate that their attention model produces good performance in experiments on the MSCOCO, Flickr30K, PASCAL and SBU datasets. This approach gives good performance on benchmark datasets.

Xinlei Chen and C. Lawrence Zitnick [9] describe learning a recurrent visual representation for image caption generation. A novel recurrent visual memory is presented that automatically learns to remember long-term visual concepts to aid both sentence generation and visual feature reconstruction. The authors evaluated this approach on several tasks, including sentence generation, sentence retrieval and image retrieval. State-of-the-art results are shown for the task of generating novel image descriptions. When compared to human-generated captions, the automatically generated captions are preferred by humans over 19.8% of the time. The results are better than or comparable to state-of-the-art results on the image and sentence retrieval tasks for methods using similar visual features.

Philip Kinghorn, Li Zhang and Ling Shao [10] present a region-based image caption generator with refined descriptions. A novel region-based deep learning architecture for image description generation is presented. It employs a regional object detector, recurrent neural network (RNN)-based attribute prediction, and an encoder-decoder language generator embedded with two RNNs to produce refined and detailed descriptions of a given image. Most importantly, the proposed system focuses on a local-based approach to further improve upon existing holistic methods, relating specifically to image regions of people and objects in an image. Evaluated on the IAPR TC-12 dataset, the proposed system shows impressive performance and outperforms state-of-the-art methods on various evaluation metrics.

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel and Yoshua Bengio [11] describe neural image caption generation with visual attention. An attention-based model is described that automatically learns to describe the content of images. The authors describe how this model can be trained in a deterministic manner using standard back-propagation techniques and stochastically by maximizing a variational lower bound. They also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. They validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
III. AUTOMATIC IMAGE CAPTION GENERATION APPROACH

An automatic image caption generation approach using LSTM and CNN is presented in this section. The main objective of this project is to develop a web-based interface through which users can obtain a description of an image, and to build a classification system that can differentiate images according to their descriptions. It can also ease the otherwise complicated task of maintaining and exploring enormous amounts of image data. Fig. 1 shows the system architecture of the automatic caption generation approach using CNN and LSTM.

Fig. 1: System Architecture of the Automatic Caption Generation Approach using CNN and LSTM

Here, the flickr8k dataset, collected from Kaggle, is used. Flickr8k is a public benchmark dataset for image-to-sentence description. It consists of 8000 images with five captions for each image. The images are extracted from diverse groups on the Flickr website, and each caption provides a clear description of the entities and events present in the image. The dataset depicts a variety of events and scenarios and does not include images containing well-known people and places, which makes the dataset more generic. The dataset has 6000 images in the training set, 1000 images in the development set and 1000 images in the test set. The features of the dataset that make it suitable for this project are: multiple captions mapped to a single image make the model generic and reduce overfitting, and the diverse categories of training images allow the captioning model to work for multiple categories of images, making it more robust. The dataset is collected from various sources on the internet.

Data preprocessing at this step includes data cleaning, data reduction and image data preparation. For instance, punctuation, digits and single-character words are removed from the text dataset.
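As a concrete sketch of this caption-cleaning step (assuming the captions are held in a Python dictionary mapping image IDs to lists of caption strings; the helper name clean_captions is ours, not from the paper), the cleaning could look like this:

```python
import string

def clean_captions(captions):
    """Lowercase captions, strip punctuation, and drop digits and one-letter words."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = {}
    for image_id, caption_list in captions.items():
        cleaned[image_id] = []
        for caption in caption_list:
            words = caption.lower().translate(table).split()
            # keep only alphabetic words longer than one character
            words = [w for w in words if len(w) > 1 and w.isalpha()]
            cleaned[image_id].append(" ".join(words))
    return cleaned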
Two deep learning models are then used, namely a CNN and an LSTM. First, the CNN takes the image as input and extracts features such as the background and the objects in the image.

CNN stands for Convolutional Neural Network. It is a deep learning algorithm which takes an image as input. A CNN scans the image from left to right and top to bottom to pull out important features and combines these features to classify the image. The preprocessing required by convolutional neural networks is much lower than that required by other classification algorithms.

The architecture of a ConvNet is analogous to the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field, and a collection of such fields overlaps to cover the entire visual area. CNNs are thus inspired by the visual system of the human brain; the idea behind them is to make computers capable of viewing the world as humans view it. In this way, CNNs can be used in the fields of image recognition and analysis, image classification, and natural language processing.
A CNN is a type of deep neural network that contains convolutional, max pooling and activation layers. The convolutional layer, considered the main layer of a CNN, performs the operation called "convolution" that gives the CNN its name. Kernels in the convolutional layer are applied to the layer inputs, and the outputs of the convolutional layers form feature maps. In this study, the Rectified Linear Unit (ReLU) is used as the activation function after the convolutional layers; it increases the non-linearity of the model, as images are fundamentally non-linear in nature.

The pooling layer is an important building block of a CNN. Pooling can be max, average or sum pooling; in this study, max pooling is used, because the other variants may not pick out sharp features as easily as max pooling does. A dropout layer is also used, which drops randomly chosen neurons during training to reduce the overfitting problem. The CNN is used to extract features from the image; a pre-trained model called Xception is used for this purpose, and such CNNs find application in image recognition, image classification, and natural language processing.
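As an illustrative sketch only, the layer types just described (convolution with ReLU, max pooling and dropout) can be stacked in Keras roughly as follows. This toy classifier is not the network used in this work, which instead relies on the pre-trained Xception model described below:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# A small illustrative CNN: convolution + ReLU, max pooling, dropout, and a classifier head.
toy_cnn = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 3)),  # convolution with ReLU
    MaxPooling2D(pool_size=(2, 2)),   # max pooling keeps the strongest local responses
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.5),                     # randomly drop units during training to reduce overfitting
    Flatten(),
    Dense(10, activation="softmax")   # e.g. a 10-class image classifier
])
toy_cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```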
LSTM stands for Long Short-Term Memory; it is a type of RNN (Recurrent Neural Network). An LSTM is well suited to sequence prediction problems and is used here for word prediction: based on the previous text, the LSTM predicts what the next word will be, much as a search engine suggests the next word based on the text typed so far. The LSTM carries relevant information through the sequence and discards non-relevant information with its forget gate. The LSTM uses the information extracted by the CNN to generate a description of the image.

The CNN-LSTM architecture involves using Convolutional Neural Network (CNN) layers for feature extraction on the input data, combined with LSTMs to support sequence prediction. This architecture was originally referred to as a Long-term Recurrent Convolutional Network (LRCN) model, although the more generic name "CNN-LSTM" is used here to refer to LSTMs that use a CNN as a front end. This architecture is used for the task of generating textual descriptions of images. The key is the use of a CNN that is pre-trained on a challenging image classification task and re-purposed as a feature extractor for the caption generation problem.
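A minimal sketch of how a pre-trained Xception network can be re-purposed as such a feature extractor in Keras is shown below; the paper does not state the exact configuration, so the global average pooling and the 299x299 input size are assumptions based on the standard Xception setup:

```python
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Xception without its classification head; global average pooling yields a 2048-d feature vector.
feature_extractor = Xception(include_top=False, weights="imagenet", pooling="avg")

def extract_features(image_path):
    """Return a 2048-dimensional feature vector for a single image."""
    image = load_img(image_path, target_size=(299, 299))  # Xception's standard input size
    array = preprocess_input(img_to_array(image))
    return feature_extractor.predict(np.expand_dims(array, axis=0), verbose=0)[0]
```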
The pre-trained Xception model thus provides the image features on which the LSTM is trained to generate captions. The features extracted by the CNN are given to the LSTM, and the LSTM generates the caption for the given image. Once all these steps are implemented, a caption is generated for the uploaded image and displayed on screen through a GUI (Graphical User Interface). The whole project is implemented in Python using Keras with a TensorFlow backend.
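To make the wiring of the two models concrete, the following Keras sketch shows one common way to combine the Xception image features with an LSTM over the partial caption; the vocabulary size, maximum caption length and embedding width below are illustrative placeholders, not values reported in this paper:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add

vocab_size, max_length = 7500, 34   # placeholder values for illustration only

# Image branch: 2048-d Xception features reduced to a 256-d representation.
image_input = Input(shape=(2048,))
image_branch = Dense(256, activation="relu")(Dropout(0.5)(image_input))

# Text branch: the partial caption is embedded and summarised by an LSTM.
caption_input = Input(shape=(max_length,))
embedded = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_branch = LSTM(256)(Dropout(0.5)(embedded))

# Merge the two branches and predict the next word of the caption.
merged = Dense(256, activation="relu")(add([image_branch, caption_branch]))
output = Dense(vocab_size, activation="softmax")(merged)

caption_model = Model(inputs=[image_input, caption_input], outputs=output)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")
```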
IV. RESULT ANALYSIS

In this section, the automatic image caption generation approach using LSTM and CNN is implemented in Python and its results are evaluated. In this analysis, the flickr8k dataset is used; the CNN is used to extract the features and the LSTM is used to generate the caption.
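As a hedged illustration of how the trained model can produce a caption word by word at inference time, the following sketch uses greedy decoding; the "startseq"/"endseq" markers and the Keras Tokenizer are assumptions about the preprocessing pipeline rather than details stated in the paper:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_length):
    """Greedily decode a caption for one image, one word at a time."""
    caption = "startseq"                       # assumed start-of-caption token
    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([caption])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        probabilities = model.predict([photo_features.reshape(1, -1), sequence], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(probabilities)))
        if word is None or word == "endseq":   # assumed end-of-caption token
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```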
In this approach, the user first uploads an image as input. This image is forwarded to the CNN, which extracts features such as the background, scene and objects in the image using its convolutional and pooling layers; these features are then passed to the LSTM through a fully connected layer. The dataset, which contains the image dataset and the text dataset, is preprocessed to form the training data used to train the LSTM that generates the captions. The user uploads an image for which a caption has to be generated, and the generated caption is then shown to the user. The uploaded images and their generated captions are shown in the following figures.

Fig. 2 (a), (b), (c) & (d): Uploaded images and their generated captions
VI. REFERENCES

[1] Peerzada Salman Syeed, Dr. Mahmood Usman, "Image Caption Generator Using Deep Learning", NeuroQuantology, October 2022, Volume 20, Issue 12, Pages 2682-2691, doi: 10.14704/Nq.2022.20.12.Nq77261

[2] Dr. P. Srinivasa Rao, Thipireddy Pavankumar, Raghu Mukkera, Gopu Hruthik Kiran, Velisala Hariprasad, "Image Caption Generation Using Deep Learning Technique", International Research Journal of Modernization in Engineering Technology and Science, Volume 04, Issue 06, June 2022, e-ISSN: 2582-5208

[3] Santosh Kumar Mishra, Rijul Dhir, Sriparna Saha and Pushpak Bhattacharyya, "A Hindi Image Caption Generation Framework Using Deep Learning", ACM Trans. Asian Low-Resour. Lang. Inf. Process., 2021, Vol. 20, No. 2, Article 32

[4] Aishwarya Maroju, Sneha Sri Doma, Lahari Chandarlapati, "Image Caption Generating Deep Learning Model", International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 10, Issue 09, September 2021

[5] Moksh Grover, Rajat Rathi Chinkit, Kanishk Garg, Ravinder Beniwal, "AI Optics: Object recognition and caption generation for Blinds using Deep Learning Methodologies", 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), doi: 10.1109/ICCCIS51004.2021.9397143

[6] Omkar Nitin Shinde, Rishikesh Gawde, Anurag Paradkar, "Social Media Image Caption Generation Using Deep Learning", International Journal of Engineering Development and Research, 2020, Volume 8, Issue 4, ISSN: 2321-9939

[7] Xiangqing Shen, Bing Liu, Yong Zhou, Jiaqi Zhao, "Remote sensing image caption generation via transformer and reinforcement learning", Multimedia Tools and Applications, Volume 79, Pages 26661-26682, 2020, doi: 10.1007/s11042-020-09294-7

[8] Songtao Ding, Shiru Qu, Yuling Xi, Arun Kumar Sangaiah, Shaohua Wan, "Image caption generation with high-level image features", Pattern Recognition Letters, Volume 123, 15 May 2019, Pages 89-95, Elsevier, doi: 10.1016/j.patrec.2019.03.021

[9] Xinlei Chen, C. Lawrence Zitnick, "Learning a Recurrent Visual Representation for Image Caption Generation", Computer Vision and Pattern Recognition (cs.CV), arXiv:1411.5654v1, doi: 10.48550/arXiv.1411.5654

[10] Philip Kinghorn, Li Zhang, Ling Shao, "A region-based image caption generator with refined descriptions", Neurocomputing, Volume 272, 10 January 2018, Pages 416-424, Elsevier, doi: 10.1016/j.neucom.2017.07.014

[11] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio, "Neural Image Caption Generation with Visual Attention", Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015, JMLR: W&CP Volume 37