Generating Captions for Images Using Beam Search and Analysis with an Unsupervised Image Captioning Algorithm
Rajat Kumar
Department of Computer Engineering
Delhi Technological University
New Delhi, India
rajatkumar [email protected]
Abstract—In today's world of social media, almost everyone is part of a social platform and actively interacting with others through the internet. People on social media upload many pictures to their accounts with different captions, and thinking of an appropriate caption is a tedious process. A caption is important to effectively describe the content and meaning of a picture; it describes the image in meaningful sentences. A model for an image caption generator can be built which is used to generate captions for images of different types and resolutions, in language that is understandable by a human being. CNN (convolutional neural network) and RNN (recurrent neural network) are used in an encoder-decoder arrangement to build this model. The CNN is used for image feature extraction, where only the important features or pixels are extracted when the image is considered as a matrix of pixels; instead of a CNN model built from scratch, other pre-trained ImageNet models which have higher accuracy are used, and their results are then compared using the BLEU score metric. For the prediction of captions, the beam search method and the argmax method are used and compared. The supervised image caption model discussed above is also compared with the built unsupervised image captioning model. The Flickr8k dataset and then the MSCOCO dataset are used to train and test the model. If implemented within a mobile application, this model can be very useful for differently abled people who rely completely on the assistance of a text-to-speech feature.

Index Terms—Image caption generator, deep learning, CNN, RNN, LSTM, supervised and unsupervised learning, beam search, argmax.
I. INTRODUCTION
An image caption generator, being a very famous and accepted field of research and study, defines a way to build a model which can generate an English-language sentence or caption that is understandable by a human brain, i.e., a sentence that is both syntactically and semantically correct. The resultant model will be able to describe the image precisely to a human reader.

In recent times, studies and research related to captioning an image, i.e., providing the image with a meaningful sentence describing it using advanced deep learning techniques and algorithms and the concepts of computer vision, have become immensely popular and advanced.

Getting a machine ready to automatically describe the objects in a picture can be a difficult task, but its effect is immense in many fields. As an example, it can be used by visually impaired people, be helpful to them, and act as a guidance system for their day-to-day life. It does not only describe the image but also makes the user understand it in an understandable language. This is the very reason the animation field is taken seriously in this generation. This paper has the scope of copying or mimicking the brain's capability to describe an image using natural, meaningful sentences and of analyzing the various processes involved. Because of this, it is a very good problem for the field of artificial intelligence, involving heavy deep learning and the concepts of computer vision, and it has resulted in many organizations studying the concept of the image caption generator.

Image captioning is used for a wide range of cases: as a way to help blind people through text-to-speech with real-time output describing the camera feed and the particular environment, or to improve public awareness by converting photo captions in social feeds into speech messages. Captions for each image on the web can enable quick and precise image search and targeting. For robots, the agent's natural view is often given context by displaying natural-language captions of the images in the center of the camera's view.
This paper uses the encoder-decoder concept and methodology with a CNN-RNN pair of architectures and beam search [1]. Here, the RNN is given as input the optimized output of the CNN model. It is used instead of an LSTM (Long Short-Term Memory) network to avoid the vanishing gradient problem: if the value of the recurrent weight parameter is taken to be less than 1, the gradient vanishes, the training of the model is improper, and the resultant output is not as accurate as required, while a value greater than 1 can cause the opposite problem of exploding gradients. So, an RNN is used in which the value of the weight is taken to be exactly equal to 1.
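To make the encoder-decoder arrangement concrete, the following is a minimal sketch of a caption decoder in Keras, assuming the image has already been reduced to a fixed-length feature vector by the CNN encoder. The layer sizes, vocabulary size, and maximum caption length are illustrative placeholders rather than the exact configuration used in this paper.

```python
# Minimal sketch of the CNN-feature + RNN caption decoder (illustrative sizes).
from tensorflow.keras.layers import Input, Dense, Embedding, SimpleRNN, add
from tensorflow.keras.models import Model

vocab_size = 8000      # assumed vocabulary size
max_length = 34        # assumed maximum caption length
feature_dim = 2048     # e.g. pooled feature size of a pre-trained CNN

# Image branch: project the pre-extracted CNN feature into the decoder space.
image_input = Input(shape=(feature_dim,))
image_dense = Dense(256, activation='relu')(image_input)

# Text branch: embed the partial caption and run it through a simple RNN.
caption_input = Input(shape=(max_length,))
caption_embed = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_rnn = SimpleRNN(256)(caption_embed)

# Merge both branches and predict the next word of the caption.
merged = add([image_dense, caption_rnn])
output = Dense(vocab_size, activation='softmax')(Dense(256, activation='relu')(merged))

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

During training, such a model is fed the image feature together with the caption generated so far and learns to predict the next word; at inference time, the same model is queried repeatedly by argmax or beam search.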
In this paper, instead of building the CNN structure from scratch for the optimization of the input image, which is present as a matrix of pixels, various pre-trained models such as InceptionV3 and VGG16, already trained on large datasets and therefore giving higher accuracy and lower loss/validation loss, are used, and their results are compared using BLEU (Bilingual Evaluation Understudy) scores. The BLEU score is the metric or standard used to measure how understandable a generated sentence is when compared with human reference sentences.
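As an illustration of how such a comparison could be scored, the snippet below computes a sentence-level BLEU score with NLTK for a hypothetical generated caption against two hypothetical reference captions; the captions themselves are made-up examples, not outputs of the model described here.

```python
# Sentence-level BLEU for one generated caption (illustrative captions only).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog is running across the green grass".split(),
    "a brown dog runs through the grass".split(),
]
candidate = "a dog runs across the grass".split()

# Cumulative 4-gram BLEU with smoothing so short sentences do not score zero.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {score:.3f}")
```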
After covering the supervised methodology for the image caption generator, we compare it with the unsupervised approach to image caption generation in terms of the accuracy of the results. The summary of our contributions in this paper:
• A model for the image caption generator is built with beam search, using pre-trained models such as VGG16 and InceptionV3 that are already trained on huge datasets, for higher accuracy.
• To resolve the "out of memory" issue, the batch size during training on the dataset is reduced.
• As the algorithm for the image caption generator is stochastic in nature, it may produce slightly different results every time the model is run. So, a random seed is set to generate the same results every time.
• A higher value of k is used in beam search for more efficient and accurate results with lower loss/validation loss (see the decoding sketch after this list).
• The results of the supervised and unsupervised methodologies of the image caption generator are compared.
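The contribution on beam search referred to above can be illustrated with the following sketch, which contrasts greedy (argmax) decoding with beam search over a generic next-word scoring function. Here `predict_next`, the start/end tokens, and the beam width are assumptions for illustration rather than the exact implementation used in this paper.

```python
import heapq
import math
import random
import numpy as np

random.seed(42)            # fixed seeds so the stochastic pipeline is reproducible
np.random.seed(42)

def argmax_decode(predict_next, max_len=20):
    """Greedy decoding: always take the single most probable next word."""
    caption = ["<start>"]
    for _ in range(max_len):
        probs = predict_next(caption)            # dict: word -> probability
        word = max(probs, key=probs.get)
        caption.append(word)
        if word == "<end>":
            break
    return caption

def beam_search_decode(predict_next, k=5, max_len=20):
    """Keep the k best partial captions (by total log-probability) at each step."""
    beams = [(0.0, ["<start>"])]                 # (cumulative log-prob, words)
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            if words[-1] == "<end>":
                candidates.append((logp, words))
                continue
            probs = predict_next(words)
            for word, p in probs.items():
                candidates.append((logp + math.log(p + 1e-12), words + [word]))
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]
```

A larger k explores more alternative captions at the cost of more model evaluations, which is consistent with the observation in this paper that a higher k gives better captions.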
II. RELATED WORK

This section investigates the work that has recently been attempted in this difficult area. Earlier image captioning techniques depend on substituting a generative model for text production in a common language. Farhadi et al. (2010) [2] use triples together with a predetermined structure to produce text; they use a multi-label Markov Random Field to predict the triples. Kulkarni et al. (2011) [8] separate objects from the image, predict a set of attributes and related words (local data contrasting different objects) for each object, create a Conditional Random Field graph, and construct descriptive captions using templates and ordering. Furthermore, language-parsing-based models, which are more efficient and powerful, have also been studied and used [4, 5, 3, 6, 7]. These methods are limited in that they fail to produce a novel caption even when each item is available in the configuration details; likewise, formal evaluation of their output format remains an issue.

To approach and resolve this problem, the model is implemented with a deep neural network in which a vector space is shared by both the image and the caption. Here, a Convolutional Neural Network has been used as the typical approach [11] along with a recurrent neural network [10], and it develops a number of features used in the sequence modeling process, training the natural-language model to produce an understandable language output. Our approach differs from these implementations in order to improve training stability. Show, Attend and Tell uses recent improvements in machine translation and object recognition to introduce an attention-based model that looks at a few "spots" in a picture while generating captions. They extracted the features from a lower convolutional layer, in contrast to the final layers, giving a feature tensor of 14 × 14 × 512 for each image. This defines 196 "spots" in the picture, each with a vector of 512 elements, and these 196 locations are included in the decoding process. Using visual attention, the model is able to learn to focus on the key elements in the images when generating captions. They introduced two attention mechanisms: a "soft" deterministic mechanism trained with standard back-propagation, and a "hard" stochastic mechanism trained to maximize a variational lower bound. Adding attention improves the quality of the captions produced, but at the cost of additional trainable weights. In addition, preparing and deploying the embedding takes a great deal of computing time, making this approach less suitable for ever-present applications on consumers' wireless devices. Karpathy and Li (2015) [12] introduced a method that uses available image datasets with their sentences to learn multimodal correspondences between visual data and descriptive language.

Their model is defined by integrating Convolutional Neural Networks over the image regions, bidirectional Recurrent Neural Networks over the language, and a structured objective that aligns these two modalities. The Multimodal Recurrent Neural Network designed along these lines uses the learned alignments to generate the required descriptive text. The resulting model produces state-of-the-art results on the Flickr8K, Flickr30K and MSCOCO [13] datasets. As there have been various improvements and enhancements to such frameworks, Rennie et al. (2016) [14] introduced another improvement, self-critical sequence training, which is a modified form of REINFORCE [15]. In this way the problem is addressed by optimizing the model directly on non-differentiable test metrics such as BLEU, ROUGE, etc., rather than on a surrogate objective.
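The 14 × 14 × 512 feature map mentioned above is simply reshaped into 196 annotation vectors of dimension 512, one per spatial location; a minimal illustration (with a randomly generated tensor standing in for the real convolutional output) is shown below.

```python
import numpy as np

# Stand-in for the lower-layer convolutional output of shape 14 x 14 x 512.
feature_map = np.random.rand(14, 14, 512)

# Flatten the two spatial dimensions: 14 * 14 = 196 locations, each a 512-d vector.
annotations = feature_map.reshape(-1, 512)
print(annotations.shape)   # (196, 512) -- one annotation vector per image "spot"
```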
• Input gate: This gate helps determine which part of the input data is to be written to memory. The sigmoid function decides which values are allowed through, producing values between 0 and 1, while the tanh function assigns a weight to the passed values, giving each an importance level between −1 and 1.

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)    (1)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)    (2)

• Forget gate: This gate finds out what information will be discarded from the cell state. It is determined by the sigmoid function, which looks at the previous hidden state (h_{t−1}) and the input text (x_t) and outputs a number between 0 (discard this) and 1 (keep this) for each of the numbers in the cell state C_{t−1}:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)    (3)
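The gate equations (1)-(3) can be written directly in NumPy as below; the dimensions and the concatenation of h_{t−1} and x_t follow the equations above, while the weight values themselves are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_dim, input_dim = 4, 3
rng = np.random.default_rng(0)

# Random placeholder parameters; each gate has its own weights and bias.
W_i, b_i = rng.standard_normal((hidden_dim, hidden_dim + input_dim)), np.zeros(hidden_dim)
W_C, b_C = rng.standard_normal((hidden_dim, hidden_dim + input_dim)), np.zeros(hidden_dim)
W_f, b_f = rng.standard_normal((hidden_dim, hidden_dim + input_dim)), np.zeros(hidden_dim)

h_prev = np.zeros(hidden_dim)            # h_{t-1}, previous hidden state
x_t = rng.standard_normal(input_dim)     # current input

concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]

i_t = sigmoid(W_i @ concat + b_i)        # input gate, Eq. (1)
C_tilde = np.tanh(W_C @ concat + b_C)    # candidate cell state, Eq. (2)
f_t = sigmoid(W_f @ concat + b_f)        # forget gate, Eq. (3)

print(i_t.shape, C_tilde.shape, f_t.shape)
```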
An external corpus is used to obtain the sentences; they are unrelated to the image descriptions.

The Model
The model comprises three parts: the first is an image encoder, the second is a sentence generator, and the third is a discriminator that distinguishes real sentences from those produced by the generator.

Encoder
A CNN model is used to extract the most important features, or pixels, of an image I as a feature vector f_im:

f_im = CNN(I)    (6)

We used Inception-V4 for the purpose of the encoder.

Training
As this is not supervised training, there are three objectives of the training in this paper: adversarial caption generation, image reconstruction, and sentence reconstruction.
• Adversarial caption generation: An adversarial text generation method [22] is used in this paper. The image feature is passed as an input to the generator, which generates a sentence best suited to the feature of the image. The main job of the generator is to generate sentences that are as realistic as possible so that it can fool the discriminator. To accomplish this, a reward is given at each time step, known as the "adversarial reward"; the value of the reward for the t-th generated word is given by Eq. (9), and the adversarial loss for the discriminator is defined in Eq. (10).

Image Reconstruction
We use the sentence generated by the generator to reconstruct the image feature instead of reconstructing the whole image. As can be seen in Figure 5, the discriminator is also an encoder for sentences. A fully connected layer is arranged on top of the discriminator so that its last hidden state can be projected to the latent space that is common to both sentences and images (Eq. (11)). For the purpose of training the discriminator, an additional image reconstruction loss L_im (the reconstruction error) is defined (Eq. (12)), and from it the reward r_t^im given to the generator for reconstructing the image is derived (Eq. (13)).

Sentence Reconstruction
The corresponding reconstruction objective and reward are defined in Eqs. (15) and (16).
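As a rough illustration of the image reconstruction idea described above, the sketch below projects the discriminator's last hidden state into the shared latent space with a fully connected layer and measures an L2-style reconstruction error against the encoded image feature. The layer sizes and the exact form of the loss are assumptions for illustration, not the paper's exact formulation in Eqs. (11)-(13).

```python
import numpy as np

rng = np.random.default_rng(0)

latent_dim, hidden_dim = 256, 512

# Placeholder for the image feature already projected into the shared latent space.
x_img = rng.standard_normal(latent_dim)

# Last hidden state of the discriminator after reading the generated sentence.
h_last = rng.standard_normal(hidden_dim)

# Fully connected layer on top of the discriminator (cf. Eq. (11), illustrative weights).
W_proj = rng.standard_normal((latent_dim, hidden_dim)) * 0.01
b_proj = np.zeros(latent_dim)
x_sent = W_proj @ h_last + b_proj

# Image reconstruction error between the two latent vectors (cf. Eq. (12)).
L_im = np.sum((x_sent - x_img) ** 2)

# The generator's reconstruction reward moves opposite to the error (cf. Eq. (13)).
r_im = -L_im
print(L_im, r_im)
```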
V. CONCLUSION
Various earlier studies in the field of image captioning have provided state-of-the-art models which primarily generate captions for images using the encoder-decoder architecture (i.e., a CNN-RNN pair). Here, an enhancement in accuracy and a reduction in loss/validation loss is achieved, as some of the already trained ImageNet models are taken into consideration for the feature extraction phase of the model building. The results of these models are compared for accuracy, and the different methods used for predicting captions for an input image are also compared. It is observed that the model improves with respect to the memory issue and accuracy if the batch size during training is taken to be of a higher value and the value of k in the beam search method is higher. The unsupervised method for image captioning is also experimented with, and the model built for it shows an improvement in the performance and accuracy of the image caption generator. A corpus of image-describing content consisting of over two million sentences gathered from Shutterstock is used for implementing the unsupervised image captioning algorithm, which subsequently leads to promising, more efficient and accurate results even when image-sentence labels are not provided.
REFERENCES
[1] O. Vinyals, A. Toshev, S. Bengio and D. Erhan, "Show and tell: A neural image caption generator," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 3156-3164, doi: 10.1109/CVPR.2015.7298935.
[2] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[3] A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In ACL, 2010.
[4] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. C. Berg, K. Yamaguchi, T. L. Berg, K. Stratos, and H. D. III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
[5] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.
[6] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi. Treetalk: Composition and compression of trees for image descriptions. ACL, 2(10), 2014.
[7] D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, 2013.
[8] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[9] R. Kiros and R. Z. R. Salakhutdinov. Multimodal neural language models. In NIPS Deep Learning Workshop, 2013.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In arXiv:1502.03167, 2015.
[12] Andrej Karpathy and Li Fei-Fei (2015). Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence (April 2017), vol. 39, issue 4: 664-676.
[13] T.-Y. Lin, et al. (2014). Microsoft COCO: Common objects in context. arXiv:1405.0312.
[14] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross and Vaibhava Goel (2016). Self-critical sequence training for image captioning. arXiv:1612.00563.
[15] Ronald J. Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning, pages 229-256, 1992.
[16] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba (2015). Sequence level training with recurrent neural networks. ICLR, 2015.
[17] P. Mathur, A. Gill, A. Yadav, A. Mishra and N. K. Bansode, "Camera2Caption: A real-time image caption generator," 2017 International Conference on Computational Intelligence in Data Science (ICCIDS), Chennai, India, 2017, pp. 1-6, doi: 10.1109/ICCIDS.2017.8272660.
[18] Haoran Wang, Yue Zhang, Xiaosheng Yu, "An Overview of Image Caption Generation Methods," Computational Intelligence and Neuroscience, vol. 2020, Article ID 3062706, 13 pages, 2020. https://doi.org/10.1155/2020/3062706.
[19] T.-Y. Lin, et al. (2014). Microsoft COCO: Common objects in context. arXiv:1405.0312.
[20] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017.
[21] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 2000.
[22] William Fedus, Ian Goodfellow, and Andrew M. Dai. MaskGAN: Better text generation via filling in the ______. In ICLR, 2018.
[23] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[24] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47, 2013.
[25] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In arXiv:1502.03167, 2015.
[26] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
[28] Ranganathan, G. "Real Life Human Movement Realization in Multimodal Group Communication Using Depth Map Information and Machine Learning." Journal of Innovative Image Processing (JIIP) 2, no. 02 (2020): 93-101.
[29] Bindhu, V., and Villankurichi Saravanampatti PO. "Semi-Automated Segmentation Scheme for Computerized Axial Tomography Images of Esophageal Tumors." Journal of Innovative Image Processing (JIIP) 2, no. 02 (2020): 110-120.
[30] S. Han and H. Choi, "Domain-Specific Image Caption Generator with Semantic Ontology," 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), Busan, Korea (South), 2020, pp. 526-530, doi: 10.1109/BigComp48618.2020.00-12.
[31] Vijayakumar, T., and R. Vinothkanna. "Retrieval of complex images using visual saliency guided cognitive classification." Journal of Innovative Image Processing (JIIP) 2, no. 02 (2020): 102-109.
[32] N. K. Kumar, D. Vigneswari, A. Mohan, K. Laxman and J. Yuvaraj, "Detection and Recognition of Objects in Image Caption Generator System: A Deep Learning Approach," 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS), Coimbatore, India, 2019, pp. 107-109, doi: 10.1109/ICACCS.2019.8728516.