Image Captioning Techniques A Review
Image Captioning Techniques A Review
net/publication/364583341
CITATIONS READS
8 248
3 authors, including:
All content following this page was uploaded by Anbara Al-Jamal on 11 June 2024.
Authorized licensed use limited to: Jordan University of Science & Technology. Downloaded on June 11,2024 at 12:39:48 UTC from IEEE Xplore. Restrictions apply.
To understand the process of image captioning; we
III. LITERATURE REVIEW
should know that image captioning is more complex to
handle by traditional classification models such as zero Mao et al. [8] developed a multimodal Recurrent
rule, one rule, decision tree, etc. Computer Vision comes to Neural Network (m-RNN) model for generating novel
play a role in processing, analyzing, and interpreting visual image captions. It directly models the probability
data. The first step in designing image caption applications distribution of generating a word given previous words
is to collect a large number of images using different and an image. The vision part contains a deep CNN which
datasets like ImageNet, Flicker8k and other big datasets generates the image representation. The multimodal part
that provide a good number of images for this task and use connects the language model and the deep CNN together
them as a training data fed to the deep learning model such by a one-layer representation.
as Convolutional Neural Network CNN for feature For image caption models studies noted to use of
extraction from the image. As a second step using these language generated using deep learning algorithmsin ,
features to feed the model like theLong Short Term ,addition to using deep learning for extracted features
Memory LSTM model which in turn generates the image from images. Some researchers [11] used theCNN model
captions. as an image encoder and RNN as a decoder to generate
sentences. To get a good performance some researchers
use decoder models-multi encoder for image captioning.
II. DEEP LEARNING MODELS IN IMAGE CAPTIONING To get better performance of the image caption model
A. Convolutional Neural Network Cyganek et al. [4] try to increase the resolution of the
images to extract a high level of features before feeding
CNN [7] stands for Convolutional Neural Network; is an
the CNN as preprocessing data analysis step. As another
artificial deep learning neural network. It is used for image
preprocessing step; images can be resized and rechanged
classifications, computer vision, image recognition, and
their extension before the decoder process according to
object detection. CNN image classifications take an input
the selected decoder algorithm. The figure below shows
image, process it, and classify it under certain categories (E.g.,
this process, increasing the resolution of training and
Dog, Cat, etc.). Such scans images from left to right and top
testing images is an optional step in the image caption
to bottom to pull out important features from the image and
model. In [3] they care about the resolution of images in
combines the feature to classify images.
their study.
B. Recurrent Neural Network (LSTM) As part of a text generation in the model, most of the
LSTM [7] stands for Long Short-Term Memory; they text documents are available in the English language, but
are a type of RNN (recurrent neural network) that is well Pa Aung et al. [2] used text documents in the Myanmar
suited for sequence prediction problems. Based on the language as a new challenge in image caption generation
previous text, we can predict what the next word will be. in Myanmar using deep learning. BLEU scores and 10-
It has proven itself more effective than the traditional fold cross-validation were used as evaluation metrics.
RNN by overcoming the limitations of RNN which had They created the first corpus of image captioning for the
short-term memory. LSTM can carry out relevant Myanmar language, and manually checked and built the
information throughout the processing of inputs and with descriptions in detail to match captions and images. But it
a forget gate, it discards non-relevant information. was easy to use the English databases that were available
In image captioning, researchers use CNN and LSTM in public and use the image caption model to get the
as an encoder-decoder architecture in general, see Fig. 2. captions in English then convert these captions from
below: English to Myanmar language such as using APIs for
language translation provided by Google, instead of
wasting time in preparing their database.
IV. DATASETS
The development of this research area greatly depends
on the availability of large datasets that contain images
with corresponding descriptions. In addition to the size of
Fig. 2. Image Caption Model as Encoder-Decoder Architecture the dataset, an image captioning model benefits also
significantly from the quality of captions in the spirit of
natural language and their adaptation to a given task [10].
TABLE I. COMMON TRADITIONAL IMAGE CAPTIONING TECHNIQUES There are several publicly available datasets that are
useful for training image captioning tasks. The most
popular datasets include ImageNet [5], UIUC PASCAL
Method Functionality
[13], Flickr8k [14], Flickr30k [15], MSCOCO dataset
[16], PASCAL VOC [17], some datasets of images are
Encoder (CNN) Image feature extraction
available online for public to access and download to use
[2].
Decoder (LSTM) Language modeling
ImageNet is a dataset of over 15 million labeled high-
resolution images belonging to roughly 22,000
categories. The images were collected from the web and
Authorized licensed use limited to: Jordan University of Science & Technology. Downloaded on June 11,2024 at 12:39:48 UTC from IEEE Xplore. Restrictions apply.
labeled by human labelers using Amazon’s Mechanical V. EVALUATION METRICS
Turk crowd-sourcing tool. Starting in 2010, as part of the Human evaluations of machine translation are
Pascal Visual Object Challenge, an annual competition extensive but expensive. Human evaluations can take
called the ImageNet Large-Scale Visual Recognition months to finish and involve human labor that cannot be
Challenge (ILSVRC) has been held. ILSVRC uses a reused. Several metrics for automatic evaluation of
subset of ImageNet with roughly 1000 images in each of machine translation are proposed for quick, inexpensive
1000 categories. In all, there are roughly 1.2 million and language-independent, that correlates highly with
training images, 50,000 validation images, and 150,000 human evaluation. BLEU (Bilingual Evaluation
testing images. ILSVRC-2010 is the only version of Understudy) [19]: as a metric, it counts the number of
ILSVRC for which the test set labels are available, so this matching n-grams in the model’s prediction compared to
is the version on which we performed most of our the ground truth. With this, precision is calculated based
experiments. Since we also entered our model in the on the mean n-grams computed, and the recall is
ILSVRC-2012 competition, in Section 6 we report our computed via the introduction of a brevity penalty in the
results on this version of the dataset as well, for which test caption label.
set labels are unavailable. On ImageNet, it is customary
to report two error rates: top-1 and top-5, where the top-5 ROUGE (Recall-Oriented Understudy for Gisting
error, the rate is the fraction of test images for which the Evaluation) [20]: it is useful for summary evaluation and
correct label is not among the five labels considered most is calculated as the overlap of either 1-gram or bigrams
probable by the model. ImageNet consists of variable- between the referenced caption and the predicted
resolution images, while our system requires a constant sequence. Using the longest sequence available, the co-
input dimensionality. Therefore, we down-sampled the occurrence F-score mean of the predicted sequence’s
images to a fixed resolution of 256 × 256. Given a recall and prediction is obtained.
rectangular image, we first rescaled the image such that METEOR (Metric for Evaluation of Translation with
the shorter side was of length 256 and then cropped out Explicit Ordering) [21]: it addresses the drawback of
the central 256×256 patch from the resulting image. We BLEU, and it is based on a weighted F-score computation
did not pre-process the images in any other way, except as well as a penalty function meant to check the order of
for subtracting the mean activity over the training set from the candidate sequence. It adopts synonyms matching in
each pixel. So, we trained our network on the (centered) the detection of similarity between sentences.
raw RGB values of the pixels.
CIDEr (Consensus-based Image Description
UIUC PASCAL Sentences was one of the first image- Evaluation) [18]: it determines the consensus between a
caption datasets, consisting of 1,000 images and reference sequence and a predicted sequence via cosine
associated with five different descriptions collected via similarity, stemming, and TF-IDF weighting. -e predicted
crowdsourcing. It was used by early image captioning sequence is compared to the combination of all available
systems, but it is rarely used due to its limited domain, reference sequences.
small size, and relatively simple captions.
SPICE (Semantic Propositional Image Caption
Evaluation) [22]: it is a relatively new caption metric that
Flickr 30K includes and extends the previous Flickr relates to the semantic interrelationship between the
8K dataset. It consists of 31,783 images showing generated and referenced sequence. Its graph-based
everyday activities, events, and scenes described by methodology uses a scene graph of semantic
158,915 captions obtained via crowdsourcing. representations to indicate details of objects and their
interaction to describe their textual illustrations.
Microsoft COCO Captions dataset containing more
complex images of everyday objects and scenes. By
adding human-generated captions, two datasets were VI. DISCUSSION
created: c5 with five captions for each of the more than
300K images and an additional, c40 dataset with 40 Table 2 is a comparison between encoder- decoder models
different captions for the randomly chosen 5K images. used in different research. Language CNN helps to
The c40 was created because it was observed [18] that understand better word embeddings and better history word
some evaluation metrics benefit from more reference representations even with fewer data.
captions
PASCAL VOC: PASCAL Visual Object Classes TABLE II. ENCODER-DECODER COMPARISON
(VOC) is arguably the most popular semantic
segmentation dataset with 21 classes of predefined object Model Encod Deco
er der
labels, background included. -e dataset contains images model model
and annotations which could be used for detection, M. Bhalekar et al. [12] D- LST
classification, action classification, person layout, and CNN M
segmentation tasks. The dataset’s training, validation, and S.pa Aung et al. [2] VGG VGG
test set have 1464, 1449, and 1456 images, respectively. Oxford 16,
Yearly, the dataset has been used for public competitions net VGG
since 2005 CNN 19
with
Flickr 30K and MS COCO Captions are widely LST
accepted as benchmark datasets for image captioning by M
most models using deep neural networks. M. Zakir Hossain et al. [9] CNN LST
M
Authorized licensed use limited to: Jordan University of Science & Technology. Downloaded on June 11,2024 at 12:39:48 UTC from IEEE Xplore. Restrictions apply.
Y. Chu et al. [1] IJCAI Int. Jt. Conf. Artif. Intell., vol. 2015-January, pp. 4188–4192,
ResNe LST
2015.
t50 M/
Soft [15] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image
Attent descriptions to visual denotations: New similarity metrics for semantic
ion inference over event descriptions,” Trans. Assoc. Comput. Linguist.,
vol. 2, pp. 67–78, 2014, doi: 10.1162/tacl_a_00166.
VII. CONCLUSION [16] X. Chen et al., “Microsoft COCO Captions: Data Collection and
Evaluation Server,” pp. 1–7, 2015, [Online]. Available:
In this paper, we have discussed various image http://arxiv.org/abs/1504.00325.
captioning models. We have also presented the limitations [17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A.
of the discussed approaches. Different evaluation metrics Zisserman, “The pascal visual object classes (VOC) challenge,” Int. J.
are also presented and discussed. We have shown the Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010, doi: 10.1007/s11263-
results of various methods performed on Flickr8k, 009-0275-4.
Flickr30k, ImageNet, and Microsoft COCO datasets. In [18] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based
image description evaluation,” Proc. IEEE Comput. Soc. Conf.
the future, models using reinforcement learning and Comput. Vis. Pattern Recognit., vol. 07-12-June-2015, pp. 4566–4575,
unsupervised learning will be highly acceptable for the 2015, doi: 10.1109/CVPR.2015.7299087.
captioning of natural scenes. Integration of textual content [19] C. Cormier, “Bleu,” Landscapes, vol. 7, no. 1, pp. 16–17, 2005, doi:
and visual regions will definitely enhance the image 10.3917/chev.030.0107.
captioning task to great extent. [20] G. Tsuchiya, “Postmortem Angiographic Studies on the Intercoronary
Arterial Anastomoses.: Report I. Studies on Intercoronary Arterial
Anastomoses in Adult Human Hearts and the Influence on the
REFERENCES Anastomoses of Strictures of the Coronary Arteries.,” Jpn. Circ. J., vol.
[1] Y. Chu, X. Yue, L. Yu, M. Sergei, and Z. Wang, “Automatic Image 34, no. 12, pp. 1213–1220, 1971, doi: 10.1253/jcj.34.1213.
Captioning Based on ResNet50 and LSTM with Soft Attention,” Wirel.
[21] S. Banerjee and A. Lavie, “METEOR: An automatic metric for mt
Commun. Mob. Comput., vol. 2020, 2020, doi:
evaluation with improved correlation with human judgments,” Intrinsic
10.1155/2020/8909458.
Extrinsic Eval. Meas. Mach. Transl. and/or Summ. Proc. Work. ACL
[2] S. Pa Pa Aung, W. Pa Pa, and T. L. Nwe, “Automatic Myanmar Image 2005, no. June, pp. 65–72, 2005.
Captioning using CNN and LSTM-Based Language Model,” Proc. 1st
[22] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE:
Jt. Work. Spok. Lang. Technol. Under-resourced Lang. Collab.
Semantic propositional image caption evaluation,” Lect. Notes
Comput. Under-Resourced Lang., no. May, pp. 139–143, 2020,
Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes
[Online]. Available: https://www.aclweb.org/anthology/2020.sltu-
Bioinformatics), vol. 9909 LNCS, pp. 382–398, 2016, doi:
1.19.
10.1007/978-3-319-46454-1_24.
[3] A. Oluwasammi et al., “Features to text: A comprehensive survey of
[23] S. Aljawarneh, V. Radhakrishna, and G. R. Kumar, “An imputation
deep learning on semantic segmentation and image captioning,”
measure for data imputation and disease classification of medical
Complexity, vol. 2021, 2021, doi: 10.1155/2021/5538927.
datasets,” in AIP Conference Proceedings, 2019, vol. 2146.
[4] M. Koziarski and B. Cyganek, “Impact of low resolution on image
[24] S. Aljawarneh and J. A. Lara, “Data science for analyzing and
recognition with deep neural networks: An experimental study,” Int. J.
improving educational processes,” J. Comput. High. Educ., vol. 33, no.
Appl. Math. Comput. Sci., vol. 28, no. 4, pp. 735–744, 2018, doi:
3, pp. 545–550, 2021.
10.2478/amcs-2018-0056.
[25] J. A. Lara, A. A. De Sojo, S. Aljawarneh, R. P. Schumaker, and B. Al-
[5] T. F. Gonzalez, “Handbook of approximation algorithms and
Shargabi, “Developing big data projects in open university engineering
metaheuristics,” Handb. Approx. Algorithms Metaheuristics, pp. 1–
courses: Lessons learned,” IEEE Access, vol. 8, pp. 22988–23001,
1432, 2007, doi: 10.1201/9781420010749.
2020.
[6] S. Bai and S. An, “A survey on automatic image caption generation,”
[26] S. Aljawarneh and J. A. Lara, “Editorial: Special Issue onQuality
Neurocomputing, vol. 311, pp. 291–304, 2018, doi:
Assessment and Management in Big Data-Part i,” ACM Trans. Embed.
10.1016/j.neucom.2018.05.080.
Comput. Syst., vol. 13, no. 2, 2021.
[7] A. A. Mohamed, “Image Caption using CNN & LSTM,” no. June,
[27] S. A. Aljawarneh, “Formulating models to survive multimedia big
2020.
content from integrity violation,” J. Ambient Intell. Humaniz. Comput.,
[8] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep 2018.
captioning with multimodal recurrent neural networks (m-RNN),” 3rd
[28] S. A. Aljawarneh, R. Vangipuram, V. K. Puligadda, and J. Vinjamuri,
Int. Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc., vol. 1090,
“G-SPAMINE: An approach to discover temporal association patterns
no. 2014, pp. 1–17, 2015.
and trends in internet of things,” Futur. Gener. Comput. Syst., vol. 74,
[9] H. Sharma, M. Agrahari, S. K. Singh, M. Firoj, and R. K. Mishra, pp. 430–443, 2017.
“Image Captioning: A Comprehensive Survey,” 2020 Int. Conf. Power
[29] V. Chang, S. A. Aljawarneh, and C.-S. Li, “Special issue on ‘advances
Electron. IoT Appl. Renew. Energy its Control. PARC 2020, pp. 325–
in visual analytics and mining visual data,’” Expert Syst., vol. 37, no.
328, 2020, doi: 10.1109/PARC49193.2020.236619.
5, 2020.
[10] I. Hrga and M. Ivašic-Kos, “Deep image captioning: An overview,”
[30] J. A. Lara, S. Aljawarneh, and S. Pamplona, “Special issue on the
2019 42nd Int. Conv. Inf. Commun. Technol. Electron. Microelectron.
current trends in E-learning Assessment,” J. Comput. High. Educ., vol.
MIPRO 2019 - Proc., pp. 995–1000, 2019, doi:
32, no. 1, 2020.
10.23919/MIPRO.2019.8756821.
[31] J. A. Lara, J. Pazos, A. A. de Sojo, and S. Aljawarneh, “The Paternity
[11] Z. Shi, H. Liu, and X. Zhu, “Enhancing Descriptive Image Captioning
of the Modern Computer,” Found. Sci., 2021.
with Natural Language Inference,” ACL-IJCNLP 2021 - 59th Annu.
Meet. Assoc. Comput. Linguist. 11th Int. Jt. Conf. Nat. Lang. Process. [32] S. Aljawarneh, V. Radhakrishna, P. V. Kumar and V. Janaki, "A
Proc. Conf., vol. 2, pp. 269–277, 2021, doi: 10.18653/v1/2021.acl- similarity measure for temporal pattern discovery in time series data
short.36. generated by IoT," 2016 International Conference on Engineering &
MIS (ICEMIS), 2016, pp. 1-4, doi: 10.1109/ICEMIS.2016.7745355.
[12] M. Bhalekar, “D-CNN : A New model for Generating Image Captions
with Text Extraction Using Deep Learning for Visually Challenged [33] S. A. Aljawarneh, V. Radhakrishna and A. Cheruvu, "Extending the
Individuals,” vol. 12, no. 2, pp. 8366–8373, 2022. Gaussian membership function for finding similarity between temporal
patterns," 2017 International Conference on Engineering & MIS
[13] R. Vedantam, C. L. Zitnick, and D. Parikh, “Collecting Image
(ICEMIS), 2017, pp. 1-6, doi: 10.1109/ICEMIS.2017.8273100.
Description Datasets using Crowdsourcing,” 2014, [Online].
Available: http://arxiv.org/abs/1411.3041. [34] E. Ayoubi and S. Aljawarneh, “Challenges and opportunities of
adopting business intelligence in SMEs: Collaborative model,”
[14] M. Hodosh, P. Young, and J. Hockenmaier, “Framing image
in Proceedings of the First International Conference on Data Science,
description as a ranking task: Data, models and evaluation metrics,”
E-learning and Information Systems, 2018.
Authorized licensed use limited to: Jordan University of Science & Technology. Downloaded on June 11,2024 at 12:39:48 UTC from IEEE Xplore. Restrictions apply.
[35] M. N. Mouchili, S. Aljawarneh, and W. Tchouati, “Smart city data
analysis,” in Proceedings of the First International Conference on Data
Science, E-learning and Information Systems, 2018.
[36] A. Nagaraja, S. Aljawarneh, and P. H., “PAREEKSHA: A machine
learning approach for intrusion and anomaly detection,” in Proceedings
of the First International Conference on Data Science, E-learning and
Information Systems, 2018.
[37] B. K. Muslmani, S. Kazakzeh, E. Ayoubi, and S. Aljawarneh,
“Reducing integration complexity of cloud-based ERP systems,”
in Proceedings of the First International Conference on Data Science,
E-learning and Information Systems, 2018.
[38] M. N. Mouchili, J. W. Atwood, and S. Aljawarneh, “Call data record
based big data analytics for smart cities,” in Proceedings of the Second
International Conference on Data Science, E-Learning and Information
Systems - DATA ’19, 2019.
[39] S. Aljawarneh and M. Malhotra, Critical research on scalability and
security issues in virtual cloud environments. Hershey, PA: IGI Global,
2017.
[40] S. Aljawarneh and M. Malhotra, Impacts and Challenges of Cloud
Business Intelligence. Hershey, PA: Business Science Reference,
2020.
Authorized licensed use limited to: Jordan University of Science & Technology. Downloaded on June 11,2024 at 12:39:48 UTC from IEEE Xplore. Restrictions apply.