Overview On Image Captioning Techniques

Urja Bahety et al., International Journal of Emerging Trends in Engineering Research, 9(8), August 2021, 1118 – 1123
Available Online at http://www.warse.org/IJETER/static/pdf/file/ijeter15982021.pdf
https://doi.org/10.30534/ijeter/2021/15982021
To ensure that the generated descriptions of an image are semantically and syntactically correct, techniques from NLP and computer vision are adopted to deal appropriately with the problems arising from modality and integrity. Several approaches have been proposed to create descriptions of an image correctly.

This survey paper presents a theoretical analysis of several different methods for image captioning, some used earlier and some among the latest. The overall categorization of image captioning methods is presented below in structured form, and a detailed description of these methods follows in the succeeding sections.

Image captioning has attracted considerable study in the area of AI and has played a remarkable role in computer vision and language processing. Captioning an image describes the objects, their attributes, and their relationships. To explain image captioning and how captions are produced for an image, this survey paper describes various methods for performing the captioning task. It provides a summarized understanding of methods achieved earlier as well as the latest methods achieved so far. The surveyed methods for captioning an image are retrieval based and deep neural network based, along with their types. A detailed discussion of these methods follows in the remaining sections of the paper.

The remainder of the paper is organised as follows: Section 3 outlines retrieval based image captioning techniques, Section 4 outlines some of the deep neural network based techniques, and finally concluding remarks are discussed.
conditioned on the querying image. To retrieve the caption with the largest score for the query, the word probability density method is used to rank the candidate captions. The principle behind this assumed that, given a query image, there always exists a sentence relevant to it. This hypothesis is barely accurate in practice, so instead of using the selected sentences directly as the description of the query image, another line of retrieval based research uses the retrieved sentences to compose a new description for the query image.

The dataset is provided with paired images and captions. The Stanford CoreNLP toolkit is used by Y. Verma et al. [18] to process the sentences in the dataset and derive a sentence list for every single image. To produce a description for a query image, image retrieval is first performed based on the global features of the image, retrieving a set of images for the query. A model trained to predict phrase relevance is then used to select phrases from those associated with the retrieved images. Lastly, a description is produced based on the selected relevant phrases.

This method also has certain limitations. Although it creates general and syntactically correct captions, i.e. captions that are grammatically correct and simple, it is unable to create image specific captions or captions that are semantically correct, i.e. in some conditions the generated captions may be unrelated to the content of the image.
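To make the retrieval idea above concrete, the following is a minimal sketch of the core step shared by such methods: find the database images whose global features are closest to the query and reuse their captions as candidates. The feature dimensions, the cosine scoring, and the names retrieve_captions, db_feats, and db_captions are illustrative assumptions, not the exact procedure of [18].

```python
# Hypothetical sketch of retrieval-based captioning: nearest neighbours by
# global image features, returning the neighbours' captions as candidates.
import numpy as np

def retrieve_captions(query_feat, db_feats, db_captions, k=3):
    """Return captions of the k database images closest to the query."""
    # Cosine similarity between the query and every database image.
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = db @ q
    top = np.argsort(-sims)[:k]          # indices of the k best matches
    return [(db_captions[i], float(sims[i])) for i in top]

# Toy example: 4 "images" with 5-dim global features and paired captions.
rng = np.random.default_rng(0)
db_feats = rng.normal(size=(4, 5))
db_captions = ["a dog on grass", "a red car", "two people walking", "a cat on a sofa"]
query = db_feats[0] + 0.1 * rng.normal(size=5)   # query resembles image 0

for caption, score in retrieve_captions(query, db_feats, db_captions):
    print(f"{score:.3f}  {caption}")
```

A subsequent ranking or phrase-selection model, as in [18], would then filter or recombine these candidates rather than returning them verbatim.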
4. DEEP NEURAL NETWORKS BASED IMAGE CAPTIONING

Owing to the rapid evolution in the arena of deep learning, work has started to rely on deep neural networks for the captioning of images. Deep neural networks are now broadly adopted to handle the image captioning task. Deep neural network based approaches are categorised into subclasses on the basis of their key structures and frameworks.

4.1 Using Neural Networks in Retrieval and Template Based Methods

With the progress in the arena of deep neural networks, the captioning task can be performed with ease. The problems of ranking and embedding can be resolved with retrieval based methods, and researchers have recommended treating image captioning as a multimodal problem for use in deep modelling. "To retrieve a description for a query image, Socher et al. [19] proposed to use a dependency tree recursive neural network to represent sentences or phrases as compositional vectors. Another deep neural network model is used as the visual model to extract features from the images. The obtained multimodal features are mapped to a common space using a max margin objective function. After training, a correct sentence and image pair has a larger inner product in the common space" [20]. In the end, sentence retrieval is performed based on the similarities between the representations of sentences and images in the common space.
Significantly, by using deep neural networks the performance of image captioning methods is enhanced. However, using deep neural networks in template and retrieval based methods does not overcome their weaknesses, so the limitations of the captions produced by these techniques are not removed completely.

4.2 Captioning of Image Based on Multimodal Learning v/s Visual Space

Captioning based on deep neural networks can create captions from visual or multimodal space. Image captioning datasets have captions in text form for each corresponding image. In multimodal space methods, a shared multimodal space is learned from the images and their matching captions, and this multimodal representation is then passed to a language decoder. In contrast, in a visual space based method the matching captions and the image features are passed to the language decoder independently. Several image captioning techniques use the visual space to generate captions for images; this approach is discussed briefly in section 4.4.

4.2.1 Multimodal Space

The multimodal space method architecture contains a vision part, a language encoder, a language decoder, and a multimodal space part. To extract image features, the vision part uses a deep Convolutional Neural Network (CNN) as the feature extractor, while the language encoder extracts word features and learns a dense feature embedding for every word, which is then passed to a recurrent layer. The multimodal space maps the image features together with the word features to a common space, and the resulting feature map is handed to a language decoder that produces captions by decoding it.
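A shape-level sketch of this pipeline may help: image features and word features are projected into one shared space, and the fused representation is scored against the vocabulary by the decoder. All dimensions, the additive fusion, and the weight names (W_img, W_word, W_out) are assumptions made only for illustration.

```python
# Hypothetical forward pass through a multimodal space: project both
# modalities into a common space, fuse, then decode a next-word distribution.
import numpy as np

rng = np.random.default_rng(2)
d_img, d_word, d_multi, vocab = 512, 256, 300, 1000

W_img = rng.normal(scale=0.01, size=(d_img, d_multi))    # vision-part projection
W_word = rng.normal(scale=0.01, size=(d_word, d_multi))  # language-encoder projection
W_out = rng.normal(scale=0.01, size=(d_multi, vocab))    # decoder output weights

cnn_feature = rng.normal(size=d_img)      # stand-in for a CNN image feature
word_feature = rng.normal(size=d_word)    # stand-in for a recurrent word state

# Map both modalities into the common multimodal space and fuse them.
m = np.tanh(cnn_feature @ W_img + word_feature @ W_word)

# The decoder scores every vocabulary word from the fused representation.
logits = m @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("next-word distribution sums to", probs.sum())
```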
Initial work in this field was done by Kiros et al. [21]. The technique uses a CNN to extract image features for generating captions and uses a multimodal space that represents text and image together for multimodal learning and caption generation; it also proposes a multimodal neural language model conditioned on the input image to create image captions. In their approach, a log bilinear language model is adapted to the multimodal space. This method depends on high level image features and word representations learned from the multimodal neural language model and the deep neural network.

"A. Karpathy et al. [22] proposed a deep multimodal model that embeds natural language and image data for the bidirectional image and caption retrieval task. The previously mentioned multimodal approaches use a common embedding space that maps images and captions directly; this method, however, works at a finer level and embeds fragments of captions and fragments of images, breaking images down into a number of objects and captions into dependency tree relations. This approach achieves some improvement in the retrieval task compared to other methods. It also has some drawbacks: in terms of modelling, dependency trees can model relations easily, but they are not always correct" [23].
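A rough sketch of the fragment-level scoring idea attributed to [22] follows: an image is treated as a set of object fragments, a caption as a set of relation fragments, and the image-sentence score aggregates fragment-pair inner products. The aggregation rule used here (max over objects per relation) is a simplification for illustration, not the published objective.

```python
# Hypothetical fragment-level image-sentence scoring.
import numpy as np

def image_sentence_score(obj_embs, rel_embs):
    """obj_embs: (n_objects, d); rel_embs: (n_relations, d)."""
    pair_scores = rel_embs @ obj_embs.T        # (n_relations, n_objects)
    # Each caption fragment aligns with its best-matching image fragment.
    return pair_scores.max(axis=1).sum()

rng = np.random.default_rng(3)
objects = rng.normal(size=(5, 16))     # e.g. detected object regions
relations = rng.normal(size=(3, 16))   # e.g. dependency-tree relations
print(image_sentence_score(objects, relations))
```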
A multimodal recurrent neural network (MRNN) language model is used by J. Mao et al. [24] to create novel captions for images. There are two subnetworks in this method: a deep Recurrent Neural Network (RNN) used for the captions and a deep CNN used for the images; both networks interact with one another in a multimodal space to form the MRNN model. In this method, fragments of both the captions and the images are given as input, and a probability distribution is computed in order to produce the subsequent word of the caption. The model contains five further layers: a recurrent layer, two word embedding layers, a multimodal layer, and a softmax layer. Kiros [21] proposed a technique based on a log bilinear model with AlexNet for feature extraction. The MRNN approach is somewhat related to the method of Kiros [21], which uses a fixed length context, whereas in "this method the temporal context is stored in a recurrent architecture, which allows a variable context length. The two word embedding layers use vectors in order to generate a dense word representation. This encodes both the semantic and syntactic meaning of words. The semantic relevance of two words can be obtained by calculating the Euclidean distance between their dense word vectors in the embedding layer. Most of the image sentence based multimodal methods use precomputed word embedding vectors to initialise their model. This method, on the other hand, randomly initialises the word embedding layers and learns them from the training dataset, which helps to generate good captions for an image" [23]. The figure below shows multimodal based image captioning.
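The paragraph above notes that the semantic relevance of two words can be read off as the Euclidean distance between their dense embedding vectors; a tiny sketch with made-up embeddings follows. The vectors and words are invented for illustration.

```python
# Hypothetical word embeddings: smaller Euclidean distance means the words
# are semantically closer in the learned embedding space.
import numpy as np

embedding = {                      # made-up 4-dim word embeddings
    "dog": np.array([0.9, 0.1, 0.3, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.35, 0.05]),
    "car": np.array([0.0, 0.9, 0.1, 0.7]),
}

def word_distance(a, b):
    return float(np.linalg.norm(embedding[a] - embedding[b]))

print(word_distance("dog", "puppy"))  # small distance -> semantically close
print(word_distance("dog", "car"))    # larger distance -> less related
```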
A CNN is used to get the scene type, i.e. to detect objects and the relationships between them, and its outputs are then used by a language model that transforms them into words and combined phrases to create the caption of the image.

Vinyals et al. [25] describe "a method called the neural image caption generator. This method uses an LSTM for caption generation and a CNN for image representation. This CNN uses a novel method for batch normalisation, and the output of its last layer is used as input to the LSTM decoder. The LSTM is able to keep track of the objects that have already been described using text" [23], and the neural image caption generator is trained with maximum likelihood estimation.

To generate a caption for an image, the image statistics are fed into the initial state of the LSTM, and the succeeding word to be generated is based on the preceding hidden state and the current time step. This procedure continues until the end of sentence token is reached. The figure below shows the encoder decoder architecture.
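The generation loop just described can be sketched as greedy decoding: the image feature seeds the decoder state, and words are emitted one step at a time until the end-of-sentence token. The step function below is a random stand-in for a trained LSTM (so the emitted "caption" is meaningless), and the toy vocabulary and weight names are assumptions for illustration only.

```python
# Hypothetical greedy decoding loop for an encoder-decoder captioner.
import numpy as np

vocab = ["<start>", "a", "dog", "runs", "<end>"]
rng = np.random.default_rng(4)
W_h = rng.normal(scale=0.5, size=(8, 8))          # toy recurrent weights
W_x = rng.normal(scale=0.5, size=(len(vocab), 8)) # toy input embeddings
W_o = rng.normal(scale=0.5, size=(8, len(vocab))) # toy output weights

def lstm_step(h, word_id):
    """Stand-in for one LSTM step: new state from old state and last word."""
    return np.tanh(h @ W_h + W_x[word_id])

def generate(image_feature, max_len=10):
    h = np.tanh(image_feature)                 # image seeds the initial state
    word = vocab.index("<start>")
    caption = []
    for _ in range(max_len):
        h = lstm_step(h, word)
        word = int(np.argmax(h @ W_o))         # greedy: most likely next word
        if vocab[word] == "<end>":             # stop at end-of-sentence token
            break
        caption.append(vocab[word])
    return " ".join(caption)

print(generate(rng.normal(size=8)))
```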
The whole process of these methods follows steps like: first, image features are obtained using a convolutional neural network; then attributes are obtained from the visual features; following these two steps, multiple captions are created by the language model; at last, the generated captions are re-ranked using a model called the deep multimodal similarity model to select high quality captions for the image.

Figure 4.4.1: Attention Based Image Captioning

4.4.2 Semantic Caption Based Captioning

"This method selectively attends to a set of semantic concept proposals extracted from the image; these concepts are then combined into the hidden states and outputs of an RNN. In this method, a CNN based encoder is first used to encode the semantic concepts and image features. After this, the image features are given as input to the language generation model, and the semantic concepts are added to different hidden states of the language generation model. At last, the language generation model generates captions with the semantic concepts" [23]. The figure below shows semantic caption based captioning.
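A small sketch of the selective attention over semantic concept proposals described above: the current hidden state scores each concept, and their weighted sum is mixed back into the state. The dimensions, the dot-product scoring, and the additive mixing are illustrative assumptions rather than any particular published model.

```python
# Hypothetical attention over semantic concept embeddings.
import numpy as np

def attend_to_concepts(hidden, concept_embs):
    """hidden: (d,), concept_embs: (n_concepts, d) -> updated hidden state."""
    scores = concept_embs @ hidden                   # one score per concept
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax attention weights
    context = weights @ concept_embs                 # weighted concept summary
    return np.tanh(hidden + context), weights

rng = np.random.default_rng(5)
hidden = rng.normal(size=16)
concepts = rng.normal(size=(4, 16))   # e.g. embeddings of "dog", "grass", ...
new_hidden, w = attend_to_concepts(hidden, concepts)
print("attention weights:", np.round(w, 3))
```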
5. CONCLUSION

The proposed survey paper describes image captioning models and the techniques adopted in each method. The survey paper has also described some strengths and limitations of the discussed approaches. First, this paper describes the primary image captioning works, which are template and retrieval based; it also describes that retrieval methods are good at generating simple and grammatically correct captions but have the limitation of being unable to generate semantically correct captions. Afterwards, the key attention is dedicated to the approaches based on deep neural networks, which give prominent results in generating good captions; these are further divided into subcategories, since distinct architectures are used in the deep neural network based techniques, together with some other methods related to image captioning.
REFERENCES

1. T. Yang, C. Gan, B. Gong, Learning attributes equals multisource domain generalization, in: IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 87–97.
2. C. H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object classes by between class attribute transfer, in: IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 951–958.
3. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part based models, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), 2010, pp. 1627–1645.
4. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 580–587.
5. P. Lakshmi Prasanna, D. Raghava Lavanya, T. Sasidhar, B. Sekhar Babu, Image Classification Using Convolutional Neural Network, International Journal of Emerging Trends in Engineering Research, October 2020, pp. 6816–6820.
6. S. Vimala, P. Haritha, S. Malathi, A systematic literature review on storytelling for kids using image captioning - deep learning, in: 4th International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2020.
7. Shubham Chaturvedi, S. Iyer, Tirthraj Dash, Image captioning based image search engine: An alternative to retrieval by metadata, in: SocProS 2017, pp. 181–191.
8. B. Makav, V. Kilic, A new image captioning approach for visually impaired people, in: 11th International Conference on Electrical and Electronics Engineering (ELECO), 2019.
9. A. Farhadi et al., Every picture tells a story: Generating sentences from images, in: European Conference on Computer Vision, 2010, pp. 15–29.
10. D. Lin, An information-theoretic definition of similarity, in: Proceedings of the Fifteenth International Conference on Machine Learning, 1998, pp. 296–304.
11. J. Curran, S. Clark, J. Bos, Linguistically motivated large-scale NLP with C&C and Boxer, in: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 33–36.
12. Himanshu Sharma, Manmohan Agrahari, Mohd Firoj, Image Captioning: A Comprehensive Survey, in: International Conference on Power Electronics and IoT Applications in Renewable Energy and its Control (PARC), 2020, pp. 325–328.
13. V. Ordonez, G. Kulkarni, T. L. Berg, Im2Text: Describing images using 1 million captioned photographs, in: Advances in Neural Information Processing Systems, 2011, pp. 1143–1151.
14. M. Hodosh, P. Young, J. Hockenmaier, Framing image description as a ranking task: data, models and evaluation metrics, Journal of Artificial Intelligence Research 47, 2013, pp. 853–899.
15. F. R. Bach, M. I. Jordan, Kernel independent component analysis, Journal of Machine Learning Research 3, 2002, pp. 1–48.
16. D. R. Hardoon, S. R. Szedmak, J. R. Shawe-Taylor, Canonical correlation analysis: An overview with application to learning methods, Neural Computation 16, 2004, pp. 2639–2664.
17. R. Mason, E. Charniak, Nonparametric method for data-driven image captioning, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
18. A. Gupta, Y. Verma, C. V. Jawahar, Choosing linguistics over vision to describe images, in: AAAI Conference on Artificial Intelligence, Vol. 5, 2012.
19. R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, A. Y. Ng, Grounded compositional semantics for finding and describing images with sentences, Transactions of the Association for Computational Linguistics 2, 2014, pp. 207–218.
20. Shuang Bai, Shan An, A survey on automatic image caption generation, Neurocomputing, May 2018, pp. 291–304.
21. R. Kiros, R. Salakhutdinov, R. Zemel, Multimodal neural language models, in: International Conference on Machine Learning, 2014, pp. 595–603.
22. A. Karpathy, A. Joulin, Fei-Fei Li, Deep fragment embeddings for bidirectional image sentence mapping, in: Advances in Neural Information Processing Systems, 2014, pp. 1889–1897.
23. Md. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, Hamid Laga, A comprehensive survey of deep learning for image captioning, ACM Computing Surveys, Vol. 51, 2019, pp. 1–36.
24. Junhua Mao et al., Deep captioning with multimodal recurrent neural networks (m-RNN), in: International Conference on Learning Representations (ICLR), 2015.
25. O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.