
Proceedings of the Sixth International Conference on Trends in Electronics and Informatics (ICOEI 2022)

IEEE Xplore Part Number: CFP22J32-ART; ISBN: 978-1-6654-8328-5

A Comparative Study of Machine Learning Based Image Captioning Models

Priya Singh, Piyush Gupta, Hardik Jain
Department of Software Engineering, Delhi Technological University, Delhi, India
[email protected], [email protected], [email protected]

Abstract— Automated image captioning is a crucial concept for numerous real-world applications, as it is useful in robotics, image indexing and self-driving vehicles, and greatly helpful for people with impaired eyesight. An image provided in real time can be converted into text using image captioning models developed by machine learning algorithms. Understanding an image mostly depends on the features of the image, and machine learning techniques are widely used for image captioning tasks. This research study has performed a comparative analysis of three Machine Learning (ML) algorithms, i.e. k-Nearest Neighbor (KNN), Convolutional Neural Network (CNN) with Long Short Term Memory (LSTM) and Attention Based LSTM. In addition, an improved KNN algorithm with reduced time complexity, and improved CNN with LSTM and Attention Based LSTM models with an added beam search method, are proposed to improve the underlying approaches further. The performance of the three selected models is empirically evaluated using BLEU, ROUGE and METEOR scores on the widely used Flickr8k dataset, and the experimental results demonstrate the supremacy of the Attention Based LSTM over the other two approaches. Finally, the current study's findings help guide researchers and practitioners in selecting the appropriate approach for Image Captioning, with empirical evidence in terms of standard evaluation metrics.

Keywords— Image Captioning (IC), Attention Based Model, Long-Short Term Memory (LSTM), k-Nearest Neighbor (KNN), Convolutional Neural Network (CNN).

I. INTRODUCTION

Every day, a person encounters different images from sources like the internet, articles, news, and advertisements. These images do not have descriptions, but a human interprets and understands them without any caption. Machines, however, need Image Captioning (IC) methods to describe the image. IC is the process of creating words that describe an image. It requires extracting relevant information from a photograph and interpreting it into natural language. IC is not to be confused with image classification: it involves both recognizing things in the image and conveying information in natural language about the objects in the image. IC is a type of machine translation in which an image is converted into a written description. It is difficult because it demands both computer vision and natural language processing expertise. IC systems are divided into two categories according to whether they provide novel captions. Some IC algorithms choose an appropriate caption from their database that best describes the test image. Others generate unique captions by constructing a description of the image. The Nearest Neighbor approach to IC is an example of the first type. In contrast, the Neural Image Caption generator is an example of the second group of innovative caption generators [1-2], [18] and [22-23].

In the past few years, a lot of research in the field of IC has been done [1-4], [14], [17-19] and [22-25]. The Nearest Neighbor approach can be used to find the best caption from the training dataset, the one nearest to the given image [2]. This means selecting a general caption, from all of the available captions, that can be applied to the various photos nearest to the given test image. Captions are created by combining visually comparable photographs from the test set, and the resulting consensus caption can then be used to describe the test image, which is intuitive. If the dataset is diverse, the chosen caption for the test image will turn out to be more generic, for example 'a cat'; if the dataset is comparable, a more precise caption would be chosen, like 'black cat'. In such a case, when one has to develop new captions for the test image, employing a neural network is a good choice. This method employs the LSTM Recurrent Neural Network (RNN) architecture, which is highly suited for sequential data, such as captions in this case [5], [18] and [22]. The model is trained on the dataset to minimize the negative log likelihood of the probability P(S|I), where I is the image used as the input and S is a sequential list of generated caption words present in the dictionary that sufficiently represents the image. The image is initially fed into a CNN, specifically Inception-V3 [8]. Inception-V3 is an image classification model that has been shown to achieve higher than 78.1 per cent accuracy on the ImageNet database [8].

One more approach is to use an attention-based gated recurrent unit, which uses the previous word and the next crucial part of the image to predict the next word in the caption; thus, the decoder uses only a region of the image to predict the caption [14]. Global attention is computationally costly as it focuses on all input words for all keywords. Thus, local attention is used, which focuses on a tiny subset of the encoder's hidden layer.

Based on these different encoding and decoding methods, we have selected three machine learning-based IC models for the paper: the Nearest Neighbor approach, CNN with LSTM, and the Attention Based LSTM method. We have improved the KNN approach by sorting and selecting a subset of captions, and used the beam search method to improve the performance of the CNN with LSTM and Attention Based LSTM models.


II. LITERATURE SURVEY

Research in the field of image captioning is quite prevalent [1-4]. The early image captioning approaches mainly relied on template-based methods that have to be predefined with the results of object detection and the objects' attributes and relationships [6-7]. Recent image captioning approaches [17-19], [22-25] can be divided into distance-based matching and encoder-decoder models. Jacob Devlin et al. [2] suggested a baseline Nearest Neighbor technique based upon a distance learning classifier. The approach uses the entire training dataset for finding the most suitable caption for the image. The Nearest Neighbor approach can be used for finding the annotations based on different objects in the image with similar meanings, to produce more accurate captions [24].

There are two basic steps in the encoder-decoder approach. First, the CNN encoder extracts the image's visual features into a fixed-length embedding vector. Second, a Recurrent Neural Network (RNN), especially LSTM, is used as a decoder to generate the sequential caption, as proposed by [18] and [22-23].

Y. Chu et al. [18] developed a probability-based encoder-decoder model, which uses ResNet50 [26] for feature extraction and a gated LSTM decoder. The predicted caption is generated using a sequence of words produced by the softmax function, which contains the probabilities of all the words present in the vocabulary.

Inspired by the encoder-decoder model, many current state-of-the-art image captioning models use an attention-based encoder-decoder model to produce better captions [4], [14], [17] and [24-25]. Kelvin Xu et al. [4] developed a model that concentrates on the important region of the image. Sai Siddarth Yv et al. [14] suggested an attention-based encoder-decoder model using an extra gated recurrent unit layer in the classical CNN with LSTM model. The attention mechanism works by giving weights to the different objects in the image and using the relevant parts of the image at a given timestamp.

Instead of going in one direction into the LSTM network, S. Takkar et al. [19] suggested a bidirectional LSTM network to extract the image features. The image region can be divided into a grid of bounding boxes containing the different objects in the image using the Yolo model, as suggested by M. A. Al-Malla et al. [17], to improve the model's accuracy. O. Sargar et al. [13] proposed a Generative Adversarial Network (GAN) with a soft attention layer to enhance the performance of the model. Flickr8k, Flickr30k and MS COCO are the major datasets used by these research papers to evaluate the results produced by their proposed methods.

By now, it is evident that many researchers have proposed approaches for image captioning tasks. In the current study, we conduct a comparative study of the three selected ML approaches, namely CNN with LSTM, Attention Based LSTM and KNN, by empirically assessing their performances on the BLEU, METEOR and ROUGE measures. We also propose improvements of the selected approaches to improve the performance further. To the best of our knowledge, there is no study yet that empirically evaluates the selected approaches for image captioning and provides similar improvements over the selected approaches.

III. METHODOLOGY

A. Dataset Description

For evaluating the performance of the three selected models, namely CNN with LSTM, Attention Based LSTM and KNN, we have chosen the widely used Flickr8k benchmark dataset, available for download at [15]. It contains 8092 images depicting various scenes and situations, together with the respective descriptions. Out of all the images, 6000 are selected for training and the rest for testing. Each image has five captions, as shown in Fig. 1, and the total number of captions is 40460.

Fig. 1. Flickr8k dataset sample

Fig. 2. Architecture of CNN with LSTM

Fig. 3. Architecture of KNN

B. Implementation Details

In this section, the three image captioning models are discussed: CNN with LSTM, KNN and Attention Based LSTM. The CNN with LSTM and Attention Based LSTM models are implemented with the proposed beam search methodology, which is based upon maximum probability, to improve the quality of captions. The proposed KNN model uses a similarity function to reduce the exponential time complexity.

1) Implementation of CNN with LSTM

Image-to-text generation is a combined task of locating the critical regions in the image and sequentially arranging them so that they form a meaningful sentence. A CNN can be used for locating the objects in the image and an LSTM for sequential text generation; thus CNN with LSTM is the most intuitive approach to the IC problem.

The model for IC has two parts, an encoder and a decoder. The encoder part is used to extract the feature vector of the image, which is the second-last layer of the CNN model, and the decoder part generates the sequential caption describing the image. The CNN model used here is Inception V-3 (formerly GoogLeNet), trained on the ImageNet database. The LSTM model has been used as the decoder part of the model. The LSTM is a more advanced RNN, as the ordinary RNN suffers from the problem of vanishing gradients. The LSTM contains a memory cell that stores the important values for a long period of time, and gates are used to control when the state of the cell is updated [5].
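The paper does not include code, but a minimal sketch of this encoder step, assuming TensorFlow/Keras and the stock pre-trained Inception-V3 (the helper name encode_image is illustrative), could look like this:

```python
import numpy as np
import tensorflow as tf

# Load Inception-V3 pre-trained on ImageNet and drop its softmax layer,
# keeping the 2048-dimensional penultimate activation as the image feature.
base = tf.keras.applications.InceptionV3(weights="imagenet")
encoder = tf.keras.Model(base.input, base.layers[-2].output)

def encode_image(path):
    img = tf.keras.preprocessing.image.load_img(path, target_size=(299, 299))
    x = tf.keras.applications.inception_v3.preprocess_input(
        tf.keras.preprocessing.image.img_to_array(img))
    return encoder.predict(x[np.newaxis, ...])[0]   # feature vector, shape (2048,)
```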
The result generated by the final layer of the LSTM is a probability over the unique words present in the dataset that can serve as the next word in the sequence, given the feature vector and the sequence of partial captions. The first partial caption is the start token (<startseq>), the remaining caption is generated from this start token, and the caption ends with the end token (<endseq>). The model has been trained using categorical cross entropy as the loss function and the Adam optimizer to minimize the loss [12].

The preprocessing of the dataset is completed before training the model. The dictionary keeps track of every word with more than five occurrences. GloVe embedding is used to convert the words to a vector representation [21]. Words are translated into a meaningful space in which semantic similarity is related to the distance between them.
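A small sketch of this preprocessing step is shown below; it is not the authors' code, and it assumes a plain-text GloVe file whose path and dimensionality are illustrative:

```python
import numpy as np
from collections import Counter

def build_vocab(captions, min_count=5):
    # Keep only words that occur more than five times, as described above.
    counts = Counter(word for cap in captions for word in cap.lower().split())
    words = sorted(w for w, n in counts.items() if n > min_count)
    return {w: i + 1 for i, w in enumerate(words)}   # index 0 reserved for padding

def load_glove_matrix(path, word_index, dim=200):
    # Embedding matrix: row i holds the GloVe vector of the word with index i.
    matrix = np.zeros((len(word_index) + 1, dim))
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in word_index:
                matrix[word_index[parts[0]]] = np.asarray(parts[1:], dtype="float32")
    return matrix
```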
To enhance the performance of the model, we used the beam search method with a variable k, where k is the number of candidate captions kept during the generation of the caption using softmax probabilities [20]. The captions are generated using beam search, which considers the k most probable sentences, and it is further improved using length normalization. Since the probabilities are small numbers, multiplying them produces even smaller numbers approaching zero; therefore we use the logarithm of the probabilities. The sentence with the highest probability will have the highest log probability, or equivalently, the smallest negative log probability.

Fig. 2 shows the basic steps to implement the CNN with LSTM model. Higher values of k may provide better results theoretically, but they also increase the complexity. Simple naïve beam search assigns greater probabilities to shorter sentences than to longer ones. The solution is to perform length normalization on the probabilities while sorting and selecting the k most probable partial sentences.
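Below is a minimal sketch of beam search over log-probabilities with length normalization; it is not the authors' implementation, step(feat, seq) is an assumed helper returning the model's next-word distribution for a partial caption, and the normalization exponent alpha is illustrative:

```python
import numpy as np

def beam_search(step, feat, start_id, end_id, k=7, max_len=35, alpha=0.7):
    beams = [([start_id], 0.0)]                       # (partial caption, log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            probs = step(feat, seq)                   # softmax over the vocabulary
            for w in np.argsort(probs)[-k:]:          # k best continuations of this beam
                candidates.append((seq + [int(w)], logp + np.log(probs[w] + 1e-12)))
        # Length normalization: rank by log probability divided by length**alpha,
        # so shorter sentences are not unfairly favoured.
        candidates.sort(key=lambda c: c[1] / len(c[0]) ** alpha, reverse=True)
        beams = []
        for seq, logp in candidates[:k]:
            (finished if seq[-1] == end_id else beams).append((seq, logp))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[1] / len(c[0]) ** alpha)
    return best[0]
```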
2) Implementation of KNN model

The nearest neighbor approach is one of the machine learning approaches that is very popular and easy to use because of its simplicity of implementation and understanding. Thus KNN can be used for IC problems, and it gives better results if the dataset has images comparable with the test image. As there is no training in KNN, each test image uses the entire training dataset to find a suitable caption. This method uses the cosine similarity measure in a single feature space to determine the k photos in the training dataset closest to the test image. The caption set consists of 5*k captions, because each of the k photos has five captions. The selected caption c* is then the one with the highest value given by equation (1):

c^* = \arg\max_{c \in C} \sum_{c' \in C} \mathrm{Sim}(c, c')    (1)

where C is the set of candidate captions and Sim is the similarity between two captions. The selected caption is thus the caption that has the largest total similarity score with all the captions in the set. Further, to reduce noise due to outliers, the solution is to evaluate the similarity score over only a subset of the candidate captions instead of the entire set. So the consensus caption after noise reduction is given by equation (2):

c^* = \arg\max_{c \in C} \max_{S \subseteq C,\, |S| = M} \sum_{c' \in S} \mathrm{Sim}(c, c')    (2)

So for every caption, it is required to evaluate the inner summation for all possible size-M subsets of the set of candidate captions. We propose algorithms below to compute the consensus caption efficiently.

Fig. 3 shows the basic steps to implement the KNN model. To evaluate the consensus caption c*, we are required to evaluate the similarity scores of a given caption c with all possible M-sized subsets of the N candidate captions. Using the brute force algorithm to evaluate the consensus caption, the program would have an exponential time complexity, and no computer or supercomputer could execute it in any reasonable time. We therefore propose an algorithm that reduces the time complexity to O(N^2 log N).

a) The Brute Force Algorithm.
We generate all possible M-sized subsets of the N candidate captions and then, for every caption c, calculate its similarity score. Once we have calculated the similarity scores for all the captions, we choose the caption that achieves the maximum consensus score. As discussed, the time complexity of this algorithm is exponential.

b) Proposed optimized KNN Algorithm.
This algorithm has an asymptotic runtime complexity of O(N^2 log N) under the assumption that SIM(i, j), which is the BLEU-1 score, can be evaluated in constant time.
i. First, arbitrarily number the captions from 1 to N.
ii. Choose the i-th caption.
iii. Evaluate the values SIM(i, j) where j goes from 1 to N.
iv. Among the N values evaluated, choose the M largest values.
v. Sum the selected M values to get the score for the i-th caption.
Do this for all the N captions and choose as the consensus caption the one that gets the highest score.
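The steps above map directly onto a short routine; the following sketch is illustrative rather than the authors' code, and sim(i, j) is an assumed helper returning the BLEU-1 similarity between candidate captions i and j:

```python
import numpy as np

def consensus_caption(captions, sim, M=50):
    """Pick the caption with the highest consensus score in O(N^2 log N)."""
    n = len(captions)
    scores = np.empty(n)
    for i in range(n):                                        # N captions
        sims = np.array([sim(i, j) for j in range(n) if j != i])
        sims.sort()                                           # O(N log N) per caption
        scores[i] = sims[-M:].sum()                           # sum of the M largest values
    return captions[int(scores.argmax())]                     # highest consensus score wins
```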


3) Attention Based LSTM Model

The shortcoming of the CNN with LSTM model is that, when generating the next word of the caption, the model does not use the input image effectively. The attention mechanism helps overcome this problem by focusing on the part of the image that describes the next target word.

The CNN with LSTM based IC model encodes the input image with the already trained convolutional neural network Inception V-3 into a hidden state (Hj). Then the LSTM decodes the hidden state and generates a caption. The outputs from previous elements and new sequence data are used as inputs for subsequent target words.

However, as RNNs are computationally costly, it is not easy to analyze the whole input image by picking the important segment from it. This problem can be overcome: the image is first partitioned into n pieces, and with the CNN we compute the representation of each piece. When the LSTM generates the next target word, the attention mechanism uses only the relevant part of the image, so that the LSTM does not become computationally costly.

Only a few source positions are attended to in local attention, in contrast with global attention, which is very costly as it focuses on all the input parts of the image. To compensate for this shortcoming, local attention focuses only on the part of the CNN hidden layer relevant to the next word. Local attention uses a centre point and folds the attention weights of the region to the left and right of the centre point into the context vector. We have also used the beam search method to improve the model's performance.

The score for local attention is given by equation (3):

e_{ij} = f(s_{i-1}, h_j)    (3)

and the importance of the j-th image segment is given by equation (4):

\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{n} \exp(e_{ik})    (4)

where:
- s_{i-1} is the prior state of the decoder,
- h_j is the encoder state of the j-th image segment,
- f is the feed forward neural network, a linear input transformation.
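As a purely illustrative numeric sketch of equations (3) and (4): the paper does not specify the exact form of f, so an additive scorer with assumed parameters W, U and v is used here:

```python
import numpy as np

def attention_weights(s_prev, H, W, U, v):
    # s_prev: previous decoder state, shape (d,); H: encoder states, shape (n, d_enc)
    e = np.tanh(s_prev @ W.T + H @ U.T) @ v    # equation (3): one score per image segment
    a = np.exp(e - e.max())                    # numerically stable softmax
    return a / a.sum()                         # equation (4): weights summing to 1
```

In the full model, these weights are used to combine the segment representations into the context vector fed to the decoder at each step.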
The Attention Based LSTM model is made up of four logical parts:
i. Encoder: Because the pre-trained Inception model has already done the picture encoding, the Encoder here is pretty basic. It includes a linear layer that passes the pre-encoded picture features to the next part.
ii. Sequence Decoder: This is a recurrent neural network with a gated recurrent unit. After travelling via an Embedding layer, the captions are fed in as inputs.
iii. Attention: As the Decoder creates each word in the output sequence, the attention module assists it in focusing on the most important region of the image for that word.
iv. Sentence Generator: This module is made up of a few linear layers. It uses the decoder's result to generate probabilities for each word in the corpus and each location in the expected sequence. The beam search applied in this module keeps the k candidate words with maximum probability, where k is the beam size. The end result is the output caption for the given test image having the highest total probability.

Fig. 4. Working of Attention Based LSTM model

IV. RESULTS AND DISCUSSIONS

For evaluating the performance of the three models - CNN with LSTM, KNN and the Attention-based network - we have selected the following metrics:
a) BLEU: BLEU is the Bilingual Evaluation Understudy, a well-known performance metric for comparing a candidate text against several references [9]. It gives a number between 0 and 1, where one indicates that the reference sentence and the delivered sentence are comparable. The Attention Based LSTM model with a beam size of 7 outperforms the KNN and CNN with LSTM models with a BLEU-4 score of 32.21, as shown in Table 1, Table 2 and Table 3.
b) ROUGE: ROUGE is the Recall-Oriented Understudy. It is selected as an evaluation performance metric because it


evaluates the recall of the reference sentence against the given summary [11]. It estimates the longest common subsequence of the candidate and the reference, which is useful for assessing the generated caption. Our architecture for the Attention Based LSTM with a beam size of 7 showed the best ROUGE-L score of 46.41, as shown in Table 5, and outperforms the KNN and CNN with LSTM models.
c) METEOR: METEOR is the Metric for Evaluation of Translation with Explicit Ordering, used for evaluating machine-generated translations [10]. The issue with BLEU is that it favours candidate captions identical in length to the reference caption. METEOR addresses this by replacing the BLEU precision and recall calculation with a weighted F-score. Our architecture for the Attention Based LSTM with a beam size of 7 showed the best METEOR score of 28.86, as shown in Table 4, and outperforms the KNN and CNN with LSTM models.
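The paper does not state which metric implementations were used; a minimal sketch with common Python packages (nltk for BLEU and METEOR, rouge_score for ROUGE), on toy data, could look like this:

```python
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.meteor_score import meteor_score   # requires nltk's wordnet data
from rouge_score import rouge_scorer

# Toy example: one generated caption scored against two reference captions.
references = [["a black dog runs on the grass".split(),
               "a dog is running across a lawn".split()]]
hypotheses = ["a dog runs on the grass".split()]

bleu1 = corpus_bleu(references, hypotheses, weights=(1, 0, 0, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
meteor = meteor_score(references[0], hypotheses[0])
rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rougeL = rouge.score(" ".join(references[0][0]), " ".join(hypotheses[0]))["rougeL"].fmeasure
print(bleu1, bleu4, meteor, rougeL)
```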
The above results show that the Attention Based LSTM with a beam size of 7 gives better results than the other two models and can produce good quality image captions.

The above metrics were assessed on the Flickr8k dataset using the Nvidia K80 GPU provided by Kaggle [16]. The findings of different variations of the neural network-based models are presented and compared. We compared greedy search and beam search on the CNN with LSTM, and KNN with different k and M, where k is the number of images and M is the number of captions. We then improve the LSTM model by adding the local attention layer and varying the beam size. Finally, we compared all three models based on the ROUGE-1, ROUGE-2, ROUGE-L, BLEU-1, BLEU-4 and METEOR performance measures.

We investigated various model versions and report findings for:
i. CNN with LSTM based neural network model with beam sizes of 1, 3, 5 and 7
ii. KNN model with values of M = 40, 50 and k = 20, 30, 40
iii. Attention Based LSTM with beam sizes of 1, 3, 5 and 7
iv. METEOR scores of the different models
v. ROUGE scores of the different models

TABLE 1. BLEU SCORES FOR CNN WITH LSTM WITH VARIOUS BEAM SIZES
Beam Size     BLEU-1    BLEU-4
1 (Greedy)    58.05     15.45
3             59.04     15.90
5             58.88     15.97
7             58.85     15.90

TABLE 2. BLEU SCORES FOR KNN
KNN           BLEU-1    BLEU-4
k=20, M=40    52.62     10.92
k=30, M=50    53.49     11.01
k=40, M=50    55.24     11.87

TABLE 3. BLEU SCORES FOR ATTENTION BASED LSTM MODEL
Beam Size     BLEU-1    BLEU-4
1 (Greedy)    58.72     28.36
3             59.16     29.26
5             59.86     31.12
7             62.34     32.21

TABLE 4. METEOR SCORES FOR DIFFERENT MODELS
Model                            METEOR
CNN with LSTM (Beam=5)           27.5
KNN (k=40, M=50)                 27.3
Attention Based LSTM (Beam=7)    28.6

TABLE 5. ROUGE SCORES FOR DIFFERENT MODELS
           CNN with LSTM (Beam=5)    KNN (k=40, M=50)    Attention Based LSTM (Beam=7)
ROUGE-1    34.6                      35.4                48.27
ROUGE-2    10.4                      4.0                 11.42
ROUGE-L    32.4                      31.9                46.41

Fig. 5. Results showing generation of captions ((a) and (b))


V. CONCLUSION AND FUTURE WORK

In the current study, we have empirically evaluated three models - CNN with LSTM, KNN and Attention Based LSTM - for IC on the Flickr8k public dataset. Further, three performance metrics - BLEU, METEOR and ROUGE - are selected to compare the results.

As discussed above, the Attention Based LSTM model showed better results than the KNN and CNN with LSTM models on all three performance measures. The proposed KNN algorithm also helps in reducing the time complexity without compromising the final result. We have added the beam search algorithm, which helps in improving the performance of the CNN with LSTM and Attention Based LSTM. The BLEU-4 score of the Attention Based LSTM with a beam size of 7 is 32.21, which is much better than its greedy search score of 28.36; likewise, the CNN with LSTM performance with a beam size of 5 is 15.97, which is better than its greedy search BLEU-4 score of 15.45.

Every approach has its own set of limits and advantages; for example, the KNN model performs best when the dataset is huge. At the same time, beam search proves to be an essential method to improve the quality of captions.

Also, there are a few limitations in the current study. Firstly, the alignment and dimensional relationships of the features are not considered by the CNN, and it is tough to assess the results of natural language generation systems. Secondly, the Attention Based LSTM is the best compared to the other models on the selected metrics, which may not be the case for other metrics. Further, we have evaluated results only on the Flickr8k dataset, so there is a possibility of the models not performing as well on other datasets.

In the future, it would be intriguing to analyze the performance of hybrid models, models based on reinforcement learning and multi-directional RNNs. One more future direction could be to explore the models on much larger datasets like MS COCO and Flickr30k. A larger dataset implies more data for training, which can help the suggested models perform better. Finally, the performance on other valuable metrics like CIDEr and SPICE can be explored.

In the end, we conclude that the findings of the comparative study for IC are significant for researchers and practitioners.

REFERENCES
[1] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[2] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick, "Exploring nearest neighbor approaches for image captioning," arXiv [cs.CV], 2015.
[3] G. Ding, M. Chen, S. Zhao, H. Chen, J. Han, and Q. Liu, "Neural image caption generation with weighted training and reference," Cognit. Comput., vol. 11, no. 6, pp. 763–777, 2019, doi: 10.1007/s12559-018-9581-x.
[4] K. Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," arXiv [cs.LG], 2015.
[5] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997, doi: 10.1162/neco.1997.9.8.1735.
[6] G. Kulkarni et al., "Babytalk: Understanding and generating simple image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.
[7] M. Ivasic-Kos, M. Pobar, and S. Ribaric, "Two-tier image annotation model based on a multi-label classifier and fuzzy-knowledge representation scheme," Pattern Recognit., vol. 52, pp. 287–305, 2016.
[8] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, doi: 10.1109/cvpr.2016.308.
[9] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02), 2002, doi: 10.3115/1073083.1073135.
[10] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, 2005.
[11] C.-Y. Lin and F. J. Och, "Looking for a few good metrics: ROUGE and its evaluation," in 4th NTCIR Workshop, 2004.
[12] A. Mustapha, L. Mohamed, and K. Ali, "Comparative study of optimization techniques in deep learning: Application in the ophthalmology field," J. Phys. Conf. Ser., vol. 1743, no. 1, p. 012002, 2021.
[13] O. Sargar and S. Kinger, "Image captioning methods and metrics," in 2021 International Conference on Emerging Smart Computing and Informatics (ESCI), pp. 522–526, 2021.
[14] S. S. Yv, Y. Choubey, and D. Naik, "Image captioning with attention based model," in 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), 2021, doi: 10.1109/iccmc51019.2021.9418347.
[15] Machine Learning Mastery, glass.csv at d20fcb6402ae34e653d4513b00f39257bb37ed7f · jbrownlee/Datasets.
[16] "Kaggle: Your machine learning and data science community," Kaggle.com. [Online]. Available: https://www.kaggle.com/. [Accessed: 05-Mar-2022].
[17] M. A. Al-Malla, A. Jafar, and N. Ghneim, "Image captioning model using attention and object features to mimic human image understanding," J. Big Data, vol. 9, no. 1, 2022.
[18] Y. Chu, X. Yue, L. Yu, M. Sergei, and Z. Wang, "Automatic image captioning based on ResNet50 and LSTM with soft attention," Wirel. Commun. Mob. Comput., vol. 2020, pp. 1–7, 2020.
[19] S. Takkar, A. Jain, and P. Adlakha, "Comparative study of different image captioning models," in 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), pp. 1366–1371, 2021.
[20] "9.8. Beam Search — Dive into Deep Learning 0.16.1 documentation." [Online]. Available: https://d2l.ai/chapter_recurrent-modern/beamsearch.html. [Accessed: 13-Mar-2021].
[21] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[22] C. Amritkar and V. Jabade, "Image caption generation using deep learning technique," in 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1–4, 2018.
[23] A. Puscasiu, A. Fanca, D.-I. Gota, and H. Valean, "Automated image captioning," in 2020 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR), pp. 1–6, 2020.
[24] Applsci-11-10176-v2.pdf. [Online]. Available: http://applsci-11-10176-v2.pdf. [Accessed: 17-Mar-2022].
[25] I. Azhar, I. Afyouni, and A. Elnagar, "Facilitated deep learning models for image captioning," in 2021 55th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6, 2021.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
