
2023 3rd International Conference on Pervasive Computing and Social Networking (ICPCSN)

Image Captioning - A Deep Learning Approach using CNN and LSTM Network

Preeti Voditel, Aparna Gurjar, Aakansha Pandey, Akrati Jain, Nandita Sharma, Nisha Dubey
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected],
Department of Computer Applications,
Shri Ramdeobaba College of Engineering and Management,

Nagpur, India

Abstract— An image caption generator is a system that uses artificial intelligence and computer vision to analyze an image and generate a written description, or caption. The caption gives a brief description that accurately aligns with the content of the image. The various elements within the image are recognized and interpreted using deep learning techniques.
The process of using datasets to train the model to assign English-language labels or descriptors to an image is known as image tagging. Tagging helps identify an image and its description for easier search and retrieval in the future. In our research, a new model is suggested that utilizes an encoder-decoder architecture to generate appropriate and grammatically correct captions for images. This model employs methods from both image analysis and natural language processing/generation to examine and characterize pictures. The goal is to generate accurate captions that precisely convey the content of the images. We utilize a particular deep learning architecture called VGG16 in this method. VGG16 is a Convolutional Neural Network (CNN) that has demonstrated exceptional performance in image recognition. The VGG16 architecture is used as the encoding layer to extract important features from the image. After the VGG16 model processes the image, the results are fed into an LSTM (Long Short-Term Memory) network, a type of recurrent neural network, which then predicts or generates a textual description of the image, one word at a time. For generating accurate captions, the model is trained on a set of labelled images and their corresponding captions called the Flickr8k Captions dataset. This dataset provides the model with the ground-truth captions. After the training phase is completed, the model creates descriptions for a group of test images. These generated captions are then compared to the actual captions present in the test dataset. The comparison is done using a metric called the BLEU score, which is a measure of the accuracy of the generated captions. The effectiveness of the model is determined based on this score.

Keywords—CNN, LSTM, BLEU, VGG16, Image captioning, Deep learning.

I. PROBLEM STATEMENT
We tackle the problem of image captioning, which requires a computer vision system to identify the important parts of an image and subsequently describe them using natural language generation techniques. This task is an extension of object detection, where the descriptions are more detailed: they consist of multiple words and assign correct semantic labels to the entire image through a natural language description.

II. INTRODUCTION
The ability of the human brain to quickly grasp and accurately describe an image is something that has been difficult to replicate in computers. However, with recent advancements in computer vision and deep learning, it is now possible to train a machine to process and label an image with a highly relevant and accurate caption. Generating a proper sentence to describe an image is still a challenge, but with the right techniques and algorithms, it is possible to build a caption generator that produces accurate results. The task of image captioning involves identifying the objects in an image and finding the appropriate words to describe them. To form a caption, these words are combined to create a sentence that accurately describes the image. This process requires a combination of computer vision and natural language processing techniques, as it involves both understanding the content of the image and being able to properly describe it in natural language. The model is trained on multiple sentences and images, so that it can learn to generate a wide range of captions for different images with different objects.
We now review a few existing methods and the body of work related to image captioning. The captioning process can be classified into categories such as template-based image captioning, retrieval-based image captioning, and novel caption generation [1]. Template-based methods use pre-defined templates consisting of blank slots which are then filled with appropriate objects, their attributes, and action words. Retrieval-based methods use the concept of candidate captions, which are drawn from the existing captions present in the training data. Novel caption generation uses deep learning techniques. Associating semantics with image captioning can achieve a more relevant and richer description for the generated caption. Image captioning approaches can be performed in either a top-down or a bottom-up manner [2]. Top-down approaches create a descriptive caption from an image summary, whereas bottom-up approaches find words for various image aspects and then combine them to get coherent captions. A captioning system called the Context Sequence Memory Network (CSMN) stores words in a long-term memory and attaches them together to capture long-term information [3]. This mechanism is able to tide over the vanishing gradient problem seen in RNNs. Image captioning is also useful for handling the huge volume of image data available nowadays, through the generation and storage of captions which can later be used for content-based retrieval [6]. Image captioning is a problem which has to handle multiple modalities, in which semantics play an important part. A new image captioning architecture has been introduced which generates visual relationship graphs that enhance caption generation [12].

In this paper, we investigate a method for generating image captions using deep neural networks. In particular, we employ a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network to examine the image and produce the description. The objective is to input a picture and output a sentence that accurately describes its contents, with proper grammar.

III. MODEL ARCHITECTURE
The CNN-LSTM architecture is a combination of two different types of neural networks: CNN and LSTM. CNNs are considered very efficient for performing recognition tasks on the components of an image [4]. CNN layers extract characteristics from the input data, while LSTMs generate predictions for sequences of data. Image captioning is related both to computer vision and to language generation [10]. Recurrent Neural Networks using LSTM components are very well suited to these applications as they can handle long-term dependencies [5][9]. This model is ideal for problems that require predicting sequences from spatial inputs such as images or videos. It has various applications, such as recognizing actions, describing images, and describing videos. The CNN-LSTM model has the structure shown in Fig. 1.

Fig. 1. CNN-LSTM model structure

The CNN-LSTM architecture is particularly useful when the input data has both spatial and temporal structure. Spatial structure refers to the placement of components in a particular space, such as the arrangement of pixels in an image or of words in a sentence, paragraph, or document. This technology is also useful in satellite imaging for analyzing deforestation [13]. The architecture is also used when the output is expected to have temporal structure, like the sequence of words in a textual description. In summary, CNN-LSTMs are used when the input data has both spatial and temporal structure and the output is also expected to have temporal structure.

A. Convolutional Neural Network
A Convolutional Neural Network is a type of artificial intelligence model that focuses on analyzing visual data. It processes images by dividing them into smaller parts and recognizing the repeating patterns in those parts. This enables the model to identify objects and features in images and make predictions based on that information. It works by applying a set of filters to the input image at various resolutions. These filters are used to extract features from the image, such as edges, shapes, and colors. The CNN then uses these features to build a representation of the image, which is used to identify objects or patterns within the image. A Convolutional Neural Network typically has a defined structure that includes different layers to process information. The input layer is responsible for receiving the image, while the hidden layers analyse and identify important features. The final layer, known as the output layer, produces the prediction based on the processed information from the hidden layers. In the case of image captioning, the generated caption is closely linked to the semantic relationships present between the components of the image. These components are nothing but the features generated by the intermediate layers of the CNN, at varying degrees of abstraction. Low-level abstractions like edges and shapes are composed to produce high-level shapes and image components. Based on its training data, the CNN is thus able to extract the objects present in the given image. CNNs allow the extraction of meaningful information from an image, which constitutes the first part of the caption generation process. Thus, CNNs are an effective tool for image captioning.

Fig. 2. CNN layers

A convolutional neural network is designed to process images. It uses a series of layers that perform different operations on the input data. These layers are designed to learn specific features of the image, such as edges, brightness, and unique characteristics of the objects in the image. The layers include convolution, activation (ReLU), and pooling layers, shown in Fig. 2. In a Convolutional Neural Network, the image undergoes several processing stages to extract important features. The convolution layer applies filters to the image to identify specific patterns and features. The activation layer introduces non-linearity by transforming negative values to zero. The pooling layer then simplifies the output by reducing its dimensionality through down-sampling. This results in a more compact representation of the information and helps to reduce overfitting.
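As a purely illustrative sketch of the convolution / activation (ReLU) / pooling pattern just described, a toy Keras stack might look as follows. This is not the network used in this work (the pre-trained VGG16 described later is used instead), and the layer sizes are arbitrary.

from tensorflow.keras import layers, models

# Toy CNN illustrating the convolution -> ReLU -> pooling pattern.
toy_cnn = models.Sequential([
    layers.Input(shape=(224, 224, 3)),           # input layer: the raw RGB image
    layers.Conv2D(32, (3, 3), padding="same"),   # convolution: learned filters slide over the image
    layers.Activation("relu"),                   # activation: negative responses become zero
    layers.MaxPooling2D((2, 2)),                 # pooling: down-sample to a more compact representation
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),      # output layer: the prediction
])
toy_cnn.summary()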
B. Long Short Term Memory (LSTM)
An LSTM network is a kind of recurrent neural network which is capable of remembering previous inputs for a certain period of time, which allows it to handle sequential data such as time series, text, and speech. An LSTM network operates using two main components, the memory cell and the gates, to control the flow of information. The memory cell acts as a storage unit that holds information over a prolonged period, while the gates regulate the transfer of information into and out of the memory cell. The gates decide what information should be added to the memory cell, what should be discarded, and what should be passed on to other parts of the network. This allows the LSTM network to effectively handle sequential data and make predictions based on long-term dependencies. These components allow the network to selectively retain or forget information based on the input, enabling it to make better predictions and decisions. An LSTM model is designed to maintain a memory of the information it has processed and to use that memory to inform its current processing.

The memory cell within the LSTM model is responsible for keeping track of this information, and the flow of information into and out of it is controlled by gates. This allows the LSTM to selectively keep or discard information, depending on its relevance to the task at hand. The gates determine what information should be kept or forgotten from one moment to the next. The gates can be opened or closed, and if they are closed, the information stored in the memory cell will not change. This allows the model to retain important information over a longer period of time and use it to make predictions or generate captions for images. The LSTM model is designed to address the vanishing gradient problem, which occurs in many Recurrent Neural Network models [7]. The vanishing gradient problem refers to the difficulty in training deep networks due to the rapidly decreasing gradient magnitude with increasing network depth. The LSTM model is also adept at learning vision-language interactions by keeping track of past context and predicting future context information [14]. This behaviour of the LSTM helps during textual caption generation from visual information. Image captioning requires the generation of a word sequence, and the LSTM model helps predict the caption words based on the input image.

A Long Short Term Memory (LSTM) network is composed of four distinct gates that serve specific functions. These gates work together to regulate the flow of information into and out of the memory cell and to control the stability of the gradient during training. They are described below; a standard mathematical formulation follows the list.
• Forget Gate (f): The forget gate in an LSTM controls the amount of information to be retained from the previous state by calculating a value between 0 and 1, which is then used to update the current state. The previous output and the current input are used to determine the proportion of information that should be kept and the proportion that should be discarded. This mechanism helps to keep only the important information and discard the irrelevant information, leading to better performance of the model.
• Input Gate (i): The input gate in an LSTM determines the amount of new information to be stored in the cell state based on the current input and the previous output. It combines these two signals to produce a scalar between 0 and 1, which acts as a weighting factor. This weighting factor is then applied to the output of the tanh activation function to determine the new information to be stored in the cell state. This new information is added to the previous state, allowing the LSTM to maintain its context over time.
• Input Modulation Gate (g): The input modulation gate in an LSTM is a part of the input gate that adjusts the incoming information before it is stored in the internal cell state. It modulates the input data by using a non-linear transformation, which helps to standardize the data and ensure that it has zero mean. This gate helps to ensure that only relevant information is incorporated into the internal state, and making the input zero-mean through the non-linearity helps to speed up learning by reducing the time it takes for the network to converge. While this gate's actions are not as crucial as the others, it is considered optimal to incorporate it within the design of the LSTM cell.
• Output Gate (o): The output gate in an LSTM uses the current input and the previous state to determine how much of the current state should be output. The state is passed through a tanh function to add non-linearity and make it zero-mean, before being scaled by the output gate's fraction. This final output is then used as the input for the next LSTM block and is also fed back into the current LSTM block as part of the state.
The LSTM also uses an additional internal state, called the cell state, to pass information from one time-step to the next. This allows the LSTM to retain relevant information for longer periods of time, making it well suited for tasks such as sequence prediction. The LSTM structure is shown in Fig. 3.

Fig. 3. LSTM structure
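For reference, the four gates described above are commonly written as follows, using standard LSTM notation with x_t the current input, h_{t-1} the previous output, c_t the cell state, \sigma the logistic sigmoid and \odot element-wise multiplication (this formulation follows the general LSTM literature rather than the specific figures above):

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)    % forget gate
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)    % input gate
g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)     % input modulation gate
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)    % output gate
c_t = f_t \odot c_{t-1} + i_t \odot g_t      % cell state update
h_t = o_t \odot \tanh(c_t)                   % block output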
In our image caption generator model, we combine these two network architectures into what is commonly referred to as a CNN-RNN hybrid model, shown in Fig. 4.

Fig. 4. CNN-LSTM model

A pre-trained CNN is required to extract important features from an input image, which are then used as input for an LSTM network that has been trained to model language. The feature representation obtained from the CNN is altered to fit the input specifications of the LSTM network, enabling it to produce output based on the characteristics of the image.
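A minimal Keras sketch of such a hybrid is given below. It follows the widely used "merge" formulation of a CNN-LSTM captioner, in which a 4096-dimensional VGG16 feature vector and an embedded partial caption are combined to predict the next word. The layer widths and the names vocab_size and max_length are illustrative assumptions; the paper does not list its exact hyperparameters apart from the dropout rate of 0.5 mentioned in the results section.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8763   # vocabulary size reported later in Section V (assumed here)
max_length = 35     # assumed maximum caption length in words

# Image-feature branch (encoder side): 4096-d vector from VGG16.
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)                       # dropout of 0.5, as used during training
fe2 = Dense(256, activation="relu")(fe1)

# Partial-caption branch (decoder side): sequence of word indices.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)                              # LSTM summarises the caption generated so far

# Merge the two branches and predict the next word.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation="relu")(decoder1)
outputs = Dense(vocab_size, activation="softmax")(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")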
The label and target text for training the LSTM model are derived from the text description that needs to be generated. For example, if there is a picture containing an old man who is wearing a hat, the label and target text would be:

Label — [<start>, An, old, man, is, wearing, a, hat]
Target — [An, old, man, is, wearing, a, hat, ., <end>]

This is done so that the model can identify the beginning and end of the caption and understand how the words in the caption relate to each other.
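A rough sketch of how such training pairs can be built is shown below. The start and end markers (here startseq and endseq), the helper name build_pairs, and the use of the Keras Tokenizer are illustrative assumptions; the paper does not reproduce its preprocessing code.

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import numpy as np

def build_pairs(caption, photo_feature, tokenizer, max_length, vocab_size):
    """Unroll one caption into (image feature, partial sequence) -> next-word pairs."""
    seq = tokenizer.texts_to_sequences(["startseq " + caption + " endseq"])[0]
    X1, X2, y = [], [], []
    for i in range(1, len(seq)):
        in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]          # words generated so far
        out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]   # the next word, one-hot
        X1.append(photo_feature)
        X2.append(in_seq)
        y.append(out_word)
    return np.array(X1), np.array(X2), np.array(y)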
C. VGG16
VGG16, whose architecture is shown in Fig. 5, is a popular variant of the CNN and is widely considered to be one of the most advanced models for computer vision tasks. The VGG16 model is a deep convolutional neural network that was created with the goal of improving upon previous computer vision models. Its creators experimented with different architectural designs and ultimately decided to use a deep network with small convolution filters. This resulted in a model with a large number of layers and trainable parameters, and ultimately led to VGG16 becoming one of the best-performing models in the field of computer vision. It is used for the classification and identification of images belonging to diverse categories.

Fig. 5. VGG16 architecture

Loading the VGG16 model
After loading the VGG16 model, the output produced by the final dense layer, which is usually 4096 in size and can be seen in Fig. 6, is passed to the encoder and decoder modules. The encoder module takes the image embeddings and encodes them into a tensor, whereas the decoder generates output (captions) at each step. The encoder-decoder model which combines the CNN and LSTM is very efficient at generating rich text captions for input images [8][15].

Fig. 6. VGG16 model parameters
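A typical way to perform this step with Keras is sketched below: VGG16 is loaded with its pre-trained weights and re-wired so that the second-to-last layer, the 4096-unit fully connected layer, becomes the output. This is a sketch consistent with the description above; the image path is a placeholder.

from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Load VGG16 and drop the final classification layer; keep the 4096-d output.
vgg = VGG16()
feature_extractor = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

def extract_features(image_path):
    image = load_img(image_path, target_size=(224, 224))   # resize as described in Section V
    image = img_to_array(image)
    image = image.reshape((1,) + image.shape)               # add a batch dimension
    image = preprocess_input(image)                          # VGG16-specific pixel preprocessing
    return feature_extractor.predict(image, verbose=0)       # shape (1, 4096)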
IV. DATA SET
The Flickr8k dataset is a collection of 8092 photographs in JPEG format, along with accompanying text descriptions of the images. The dataset is organized into two main directories, one containing the images themselves and the other containing text files with different sources of descriptions for the photographs. The main file of the dataset is called Flickr8k.token and is located in the Flickr8k_text folder. This file contains the names of the images and their corresponding captions, with each image-caption entry separated by a newline character ("\n"). Overall, the dataset includes 8091 images, each with 5 English captions. Flickr8k is an approximately 1 GB dataset comprised of images and accompanying text descriptions, organized into three subsets: the training set contains 6000 images, the testing set has 1000 images, and the validation set also has 1000 images. In order to prepare the dataset for use, the text descriptions undergo a process of cleaning and formatting. This includes removing punctuation, converting all words to lowercase, and removing any numerical values. Additionally, the dataset is processed in a sequential pattern, using a tool called tqdm to track progress.
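A small parsing sketch is given below, assuming the usual Flickr8k.token layout of one "image_name.jpg#index<TAB>caption" entry per line; the file path is a placeholder.

def load_captions(token_file="Flickr8k_text/Flickr8k.token.txt"):
    """Map each image identifier to its list of five captions."""
    mapping = {}
    with open(token_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_part, caption = line.split("\t")   # e.g. "image_name.jpg#0", "a child in a pink dress ..."
            image_id = image_part.split(".")[0]       # drop ".jpg#0"
            mapping.setdefault(image_id, []).append(caption)
    return mapping

captions = load_captions()
print(len(captions))   # roughly 8091 images, 5 captions each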

V. METHODOLOGY
A. Prerequisites
To work on this project, it is important to have a strong understanding of various technologies and tools, including deep learning, Python programming, working with Kaggle notebooks, and using the Keras library. Additionally, experience with NumPy and natural language processing is necessary. It is also important to ensure that all of the necessary libraries are installed, in order to effectively run the project:
• TensorFlow: TensorFlow [11] is an open-source platform that provides extensive support for machine learning and artificial intelligence. With TensorFlow, developers can take advantage of its compatibility with multiple hardware platforms, user-friendly APIs, and the support of an active community.
• Keras: Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. One of the key strengths of Keras is its simplicity and ease of use. Keras also provides a number of pre-trained models, making it easy to get started with transfer learning and fine-tuning existing models.
• Relevant Python libraries such as NumPy, Pickle (for serializing and deserializing Python objects), and tqdm (to get progress feedback through a progress-bar widget for long-running tasks or iterations over large datasets).
• Tokenizer: Tokenization is the process of breaking down a piece of text into smaller units, known as tokens. Tokens can be words, phrases, symbols, or other elements of the text, depending on the specific use case. Tokenization is a fundamental step in many natural language processing (NLP) tasks, including text classification, sentiment analysis, and information retrieval. Tokenization is important because it allows us to represent text in a structured format that can be easily processed by computers. Tokens can be used as features in machine learning models, and can be further processed to extract more meaningful information, such as word embeddings or n-grams.
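For concreteness, the code sketches interspersed in this paper assume roughly the imports below; this is an illustrative mapping of the listed prerequisites to concrete modules, not a verbatim listing of the project's code.

import string
import pickle
import numpy as np
from tqdm import tqdm
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add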

B. Pre-processing the Image
In our project, we use a pre-trained model named VGG16 from the Visual Geometry Group for image recognition. This model is available within the Keras library, and thus it does not require additional installation or setup. For this project, the image features are extracted after resizing the images to a size of 224×224 pixels. The extraction is performed at the layer just before the last layer of the classification model. This location is chosen because the last layer is used for predicting the classification of a photo; since we do not need to classify the images, we exclude the last layer during the feature extraction process.

C. Creating a vocabulary for the images
Before using text data in a machine learning or deep learning model, it is necessary to clean and prepare it for the model. This process includes splitting the text into individual words and handling issues with punctuation and case sensitivity. Additionally, since computers do not understand English words, we need to represent them with numbers. This is done by creating a vocabulary, mapping each word to a unique index value, and then encoding each word into a fixed-size vector. Only after this process does the text become readable by the machine and usable for generating captions for images. We plan to reduce the size of our vocabulary by cleaning the text in the following sequence. In order to accomplish this goal, we have outlined and defined five functions, listed below.
• Retrieving the data.
• Establishing a mapping between images and their descriptions using a dictionary.
• Purifying the descriptions by eliminating punctuation marks, transforming them to lowercase letters, and removing any words containing numbers.
• Creating a list of all the distinct words found in the descriptions and making a vocabulary out of them.
• Making a document in which all the cleaned captions are saved.
In order to establish the vocabulary for the project, all the unique words from the training dataset are tokenized. This process results in 8763 unique words being defined as the vocabulary.
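A condensed sketch of these cleaning and indexing steps is shown below (the captions dictionary is assumed to be the image-to-captions mapping built from Flickr8k.token; the paper's own five helper functions are not reproduced verbatim).

import string
from tensorflow.keras.preprocessing.text import Tokenizer

def clean_caption(caption):
    """Lowercase, strip punctuation, and drop tokens containing digits."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    return " ".join(w for w in words if w.isalpha())

# captions: dict of image_id -> list of raw caption strings
cleaned = {img: [clean_caption(c) for c in caps] for img, caps in captions.items()}

all_captions = [c for caps in cleaned.values() for c in caps]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)            # build the word -> index mapping
vocab_size = len(tokenizer.word_index) + 1      # on the order of the 8763 words reported above
max_length = max(len(c.split()) for c in all_captions)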
D. Training the model
The dataset includes a file called "Flickr_8k.trainImages.txt" containing the list of the 6000 image names that are used during the training phase of the project.

The first step is to load the features extracted from the pre-trained CNN model. To train the model, the training images are broken into smaller chunks called batches, which are used to generate input and output sequences. These sequences are then used to fit the model. The training process is set to run for 20 cycles, known as epochs.
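A compressed training sketch under the assumptions introduced earlier is given below; cleaned, features, build_pairs, tokenizer, max_length, vocab_size and model are the illustrative objects from the previous sketches, and the exact batching scheme of the paper is not reproduced.

epochs = 20   # the training process runs for 20 epochs

for epoch in range(epochs):
    for image_id, caps in cleaned.items():        # iterate over the training images
        photo = features[image_id][0]             # assumed dict: image_id -> extract_features(...) output
        for caption in caps:
            X1, X2, y = build_pairs(caption, photo, tokenizer, max_length, vocab_size)
            model.train_on_batch([X1, X2], y)     # fit on one small batch of sequences

In practice the same loop is often expressed as a Python generator passed to model.fit, which avoids holding all the sequences in memory at once.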

E. Model evaluation – Bilingual Evaluation Understudy (BLEU) score
BLEU is a method to measure the similarity between a generated sentence and a reference sentence. It is mostly used to evaluate the performance of machine translation systems. The score ranges from 0.0 to 1.0, where 1.0 represents a perfect match and 0.0 represents a complete mismatch. To evaluate the performance of the model, the generated captions are compared with the actual captions, and the similarity between the two sets is measured using the BLEU score. This score is calculated over the entire set of captions, and it provides a summary of how well the generated captions match the expected captions.
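The sketch below illustrates one way to carry this out: captions for held-out images are generated word by word with a simple greedy decoder, and corpus_bleu from NLTK is used for the BLEU computation. The decoding strategy and the test_captions / features dictionaries are assumptions for illustration; the paper does not specify its decoding or evaluation code.

import numpy as np
from nltk.translate.bleu_score import corpus_bleu
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo, max_length):
    """Greedy decoding: predict one word at a time until 'endseq' or max_length."""
    index_to_word = {i: w for w, i in tokenizer.word_index.items()}
    text = "startseq"
    for _ in range(max_length):
        seq = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=max_length)
        yhat = model.predict([photo, seq], verbose=0)
        word = index_to_word.get(int(np.argmax(yhat)))
        if word is None or word == "endseq":
            break
        text += " " + word
    return text.replace("startseq", "").strip()

actual, predicted = [], []
for image_id, caps in test_captions.items():      # held-out test images (assumed dict)
    references = [c.split() for c in caps]         # the five reference captions per image
    candidate = generate_caption(model, tokenizer, features[image_id], max_length).split()
    actual.append(references)
    predicted.append(candidate)

print("BLEU-1:", corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
print("BLEU-2:", corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))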

VI. RESULTS

Fig. 7 (a). Generated caption 1
Fig. 7 (b). Generated caption 2
Fig. 7 (c). Generated caption 3

EXPERIMENTAL EVALUATION
Figs. 7 (a), (b), and (c) show the captions generated for three sample images. The BLEU score was calculated over the entire set of captions, for single-word and word-pair matches respectively, as:
BLEU-1: 0.569902
BLEU-2: 0.376622
Our model has achieved a BLEU-1 score of 0.56 for 1-grams (single-word matches) and a BLEU-2 score of 0.37 for 2-grams (word-pair matches). The generated captions were evaluated against 5 reference sentences per image. There was almost a 60% single-word match with the reference sentences, as visible in the BLEU-1 score, which is considered good. We also examined how different numbers of epochs affected the accuracy of our model, and observed that after some epochs the accuracy started decreasing due to overfitting. The dropout technique used while training the model ignores some of the input information to ensure that the model does not overfit. The dropout rate we used is 0.5, and it avoided overfitting of the model.

VII. CONCLUSION
We have built an image caption generator using a combination of a CNN and an LSTM model. This architecture, known as CNN-LSTM, can be used in a variety of areas spanning computer vision and natural language processing. Our specific implementation uses an encoder-decoder approach to generate grammatically correct captions for images: the proposed model uses a CNN as the encoder and an LSTM as the decoder. We have evaluated the model using the standard BLEU metric on the Flickr8k Captions dataset. Our findings indicate that the model performs comparably with other leading techniques, as evaluated by BLEU. The proposed model has displayed positive results in terms of BLEU scores, though there is still scope for enhancement. In future work we plan to improve the semantic relevance of the generated captions by implementing an attention mechanism, where attention scores are used to change the attention strength on various image attributes.

REFERENCES
[1] Md. Z. Hossain, F. Sohel, Md. F. Shiratuddin and H. Laga, "A Comprehensive Survey of Deep Learning for Image Captioning," arXiv:1810.04020v2 [cs.CV], 14 Oct 2018.
[2] Q. You, H. Jin, Z. Wang, C. Fang and J. Luo, "Image Captioning with Semantic Attention," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4651-4659.
[3] C. Park, B. Kim and G. Kim, "Attend to You: Personalized Image Captioning With Context Sequence Memory Networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 895-903.
[4] J. Johnson, A. Karpathy and L. Fei-Fei, "DenseCap: Fully Convolutional Localization Networks for Dense Captioning," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4565-4574.
[5] J. Aneja, A. Deshpande and A. G. Schwing, "Convolutional Image Captioning," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5561-5570.
[6] G. Srivastava and R. Srivastava, "A Survey on Automatic Image Captioning," Mathematics and Computing, ICMC 2018, Communications in Computer and Information Science, vol. 834, Springer.
[7] H. Wang, Y. Zhang and X. Yu, "An Overview of Image Caption Generation Methods," Computational Intelligence and Neuroscience, vol. 2020, Article ID 3062706, 13 pages, 2020.
[8] P. Sharma, N. Ding, S. Goodman and R. Soricut, "Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning," Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556-2565, Melbourne, Australia, 2018.
[9] G. Sairam, M. Mandha, P. Prashanth and P. Swetha, "Image Captioning using CNN and LSTM," 4th Smart Cities Symposium (SCS 2021), Online Conference, Bahrain, 2021, pp. 274-277.
[10] V. Agrawal, S. Dhekane, N. Tuniya and V. Vyas, "Image Caption Generator Using Attention Mechanism," 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 2021, pp. 1-6.
[11] Online: https://www.tensorflow.org
[12] Z. Shi, X. Zhou, X. Qiu and X. Zhu, "Improving Image Captioning with Better Use of Captions," Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7454-7464, July 2020.
[13] G. Geetha, T. Kirthigadevi, G. Godwin Ponsam, T. Karthik and M. Safa, "Image Captioning Using Deep Convolutional Neural Networks (CNNs)," Journal of Physics: Conference Series, vol. 1712, 012015, 2020.
[14] C. Wang, H. Yang, C. Bartz and C. Meinel, "Image Captioning with Deep Bidirectional LSTMs," MM '16: Proceedings of the 24th ACM International Conference on Multimedia, 2016.
[15] A. K. Poddar and R. Rani, "Hybrid Architecture using CNN and LSTM for Image Captioning in Hindi Language," International Conference on Machine Learning and Data Engineering, Procedia Computer Science 218 (2023), pp. 686-696.

