
Proceedings of the Fifth International Conference on Intelligent Computing and Control Systems (ICICCS 2021)

IEEE Xplore Part Number: CFP21K74-ART; ISBN: 978-0-7381-1327-2

Generating Caption for Image using Beam Search and Analyzation with Unsupervised Image Captioning Algorithm

Prashant Giridhar Shambharkar, Priyanka Kumari, Pratik Yadav, Rajat Kumar
Department of Computer Engineering, Delhi Technological University, New Delhi, India
[email protected], priyankakumari [email protected], pratikyadav [email protected], rajatkumar [email protected]

DOI: 10.1109/ICICCS51141.2021.9432245

Abstract—In today's world of social media, almost everyone is part of a social platform and actively interacting with others through the internet. People upload many pictures to their social media accounts with different captions. Thinking of an appropriate caption is a tedious process, yet the caption is important to effectively describe the content and meaning of a picture in meaningful sentences. A model for an image caption generator can be built to generate captions for images of different types and resolutions, in language that is understandable by a human being. CNN (convolutional neural network) and RNN (recurrent neural network) are used in an encoder-decoder arrangement to build this model. The CNN is used for image feature extraction, where only the important features (the important pixels, if the image is considered as a matrix of pixels) are extracted from the input image; instead of a CNN trained from scratch, pre-trained ImageNet models with higher accuracy are used, and their results are compared using the BLEU score metric. For the prediction of captions, the beam search method and the argmax method are used and compared. The supervised image caption model discussed above is also compared with the built unsupervised image captioning model. The Flickr8k dataset and then the MSCOCO dataset are used to train and test the model. This model, if implemented within a mobile application, can be very useful for differently abled people who rely completely on the assistance of a text-to-speech feature.

Index Terms—Image caption generator, deep learning, CNN, RNN, LSTM, supervised and unsupervised learning, beam search, argmax.

I. INTRODUCTION

Image caption generator, being a very famous and accepted field of research and study, defines a way to build a model which can generate an English-language sentence or caption that is understandable by a human brain, i.e., a sentence that is both syntactically and semantically correct. The resultant model will be able to describe the image precisely to a human.

In recent times, studies and research related to captioning an image, or providing the image with a meaningful sentence describing it, using advanced deep learning techniques and algorithms and the concept of computer vision, have become immensely popular and advanced.

Getting a machine to automatically describe the objects in a picture placed next to it or shown on a display can be difficult work, but its effect is immense in many fields. As an example, it can be helpful to visually impaired people and can act as a guidance system for their day-to-day life. It does not only describe the image but also makes the user understand it in an understandable language.

It is for this very reason that the animation field is taken seriously in this generation. This paper has the scope of mimicking the brain's capability to describe an image using natural, meaningful sentences and of analysing the various processes involved. Because of this, it is a very good problem for the field of artificial intelligence, involving heavy deep learning and the concepts of computer vision, and it has resulted in many organizations studying the concept of an image caption generator.

Image captioning is used for a wide range of cases: as a way to help blind people, by combining text-to-speech with real-time output describing the camera feed and the particular environment; and to improve public awareness by converting photo captions in social feeds into speech messages.


Captions for each image on the web can enable quick and precise image search and targeting. For robots, the agent's natural view is often given context by displaying natural-language captions of the images in the camera's centre of view.

This paper uses the encoder-decoder concept and methodology with a CNN-RNN pair of architectures and beam search [1]. The optimized output of the CNN model is provided as input to the recurrent decoder. A plain RNN suffers from the vanishing gradient problem when the recurrent weight parameter is less than 1, due to which the training of the model is improper and the resultant output is not as accurate as required, and when its value is greater than 1 the gradients may explode. An LSTM (Long Short-Term Memory) network is therefore used as the decoder, as it avoids the vanishing gradient problem by design rather than requiring the recurrent weight to be kept exactly equal to 1.

In this paper, instead of building a CNN structure from scratch for encoding the input image, which is present as a matrix of pixels, pre-trained models such as InceptionV3 and VGG16, already trained on heavy datasets, are used, resulting in higher accuracy and lower loss/validation loss; their results are compared using BLEU (Bilingual Evaluation Understudy) scores. The BLEU score is the standard metric used here to measure how understandable a generated sentence is to a human reader.
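For concreteness, a BLEU score compares the n-grams of a candidate caption against one or more reference captions. The following is an illustrative computation with NLTK's sentence_bleu, not necessarily the exact evaluation script used here; the sentences are toy examples.

```python
# Illustrative BLEU-1 and BLEU-4 computation with NLTK (toy sentences).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [['a', 'dog', 'runs', 'on', 'the', 'grass'],
              ['the', 'dog', 'is', 'running', 'through', 'grass']]
candidate = ['a', 'dog', 'is', 'running', 'on', 'grass']

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu(references, candidate, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f'BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}')
```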
After covering the supervised methodology for the image caption generator, we have compared it with an unsupervised approach in terms of the accuracy of the results. The summary of our contributions in this paper:
• Built a model for the image caption generator with beam search, using pre-trained models such as VGG16 and InceptionV3 that are already trained on huge datasets, for higher accuracy.
• To resolve the "out of memory" issue, the batch size during training on the dataset is reduced.
• As the algorithm for the image caption generator is stochastic in nature, it may give slightly different results every time the model is run; so, a random seed is set to generate the same results every time (a minimal sketch is shown below).
• We have used a higher value of k in beam search for more efficient and accurate results with less loss/validation loss.
• We compared the results of the supervised and unsupervised methodologies for the image caption generator.
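As mentioned in the contributions above, the stochastic parts of the pipeline can be made reproducible by fixing the random seeds. A minimal sketch, assuming a TensorFlow 2.x / NumPy environment (the seed value itself is arbitrary):

```python
# Fix all relevant random seeds so repeated runs produce the same captions.
import os, random
import numpy as np
import tensorflow as tf

SEED = 42                                # arbitrary fixed value
os.environ['PYTHONHASHSEED'] = str(SEED)
random.seed(SEED)                        # Python's own RNG
np.random.seed(SEED)                     # NumPy (shuffling, sampling)
tf.random.set_seed(SEED)                 # TensorFlow weight init / dropout
```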
II. RELATED WORK

This section reviews the work that has previously been attempted in this difficult area. Early image captioning techniques depend on substituting the output of a generative model into templates for text production in natural language. Farhadi et al. (2010) [2] use triples together with a predetermined structure to produce text, and train a multi-label Markov Random Field to predict the triples. Kulkarni et al. (2011) [8] separate the objects from the image, predict a set of attributes and related words (local data contrasting the different objects) for each object, and build a graph, a Conditional Random Field, to construct descriptive captions using the labels and their ordering. Furthermore, language-parsing-based models, which are more efficient and powerful, have also been studied and used [4, 5, 3, 6, 7]. These methods are limited in that they may fail to produce a sensible caption even when every item is available in the configuration details; likewise, their formal evaluation is an issue.

To approach and resolve this problem, the model is implemented with a deep neural network in which a vector space is shared by both the image and the caption. Here, a Convolutional Neural Network is used as the typical approach [11], along with a recurrent neural network [10], to develop a set of features that are used in a sequence-modelling process that learns a natural-language model to produce an understandable output. Our approach differs from those implementations in order to improve stability. Show, Attend and Tell uses advances in machine translation and object recognition to introduce an attention-based model that looks at a few "spots" in a picture while generating the caption. They extracted the features from a lower convolutional layer, in contrast to earlier work that used the final layers, giving a 14 × 14 × 512 feature map for each image. This defines 196 "spots" in the picture, each with a vector of 512 elements, and these 196 locations are included in the decoding process. Using visual attention, the model can learn to fix its gaze on the key elements of the image while generating the caption. They introduced two attention mechanisms: a "soft" deterministic attention trained with standard back-propagation, and a "hard" stochastic attention trained to maximize a variational lower bound. Adding attention improves the quality of the captions produced, but at the cost of additional trainable weights; in addition, training and running the resulting model takes a great deal of computing time, making this approach impractical for the ever-present applications on consumers' wireless devices. Karpathy and Li (2015) [12] introduced a method that uses available image datasets with their sentences to learn the multimodal correspondences between visual data and descriptive language.

The planned model is defined by integrating Convolutional Neural Networks over the image regions and bidirectional Recurrent Neural Networks over the language, with a structured objective that aligns the two modalities. The design of the Multimodal Recurrent Neural Network, along these lines, uses the learned alignments to generate the required descriptive text. The resulting model produces state-of-the-art results on the Flickr8K, Flickr30K and MSCOCO [13] datasets. As a further improvement and enhancement of these frameworks, Rennie, Marcheret et al. (2016) [14] introduced another optimization approach, self-critical sequence training, which is a modified form of REINFORCE [15]. Thus, the problem is addressed and models can be optimized directly against non-differentiable test metrics such as BLEU, ROUGE, etc., rather than a surrogate objective.


III. METHODOLOGY AND ARCHITECTURAL ANALYSIS

A. Datasets and Evaluation

This part of the paper helps in understanding image captioning, the open-source databases, and the various methods for it. The important elements feeding all the inventions created by humans are data, statistics and, obviously, computing power, and these keys depend on each other. It is a fact that data can make a model more powerful, more reliable and more effective in its usage. Captioning an image shown on a display is somewhat similar to machine translation, and its evaluation strategy extends machine-translation evaluation to create its own test rules.
B. Data sets that are accessible

Information and knowledge are the basic keys to the ideas going on in the human brain. It has been noticed that rules and regulations that carry a bulk of data and information are the ones most likely to go unfollowed by people. In earlier image caption generator work, various datasets of large or small size were used, for instance Flickr8k, Flickr30k [17], MSCOCO, PASCAL, STAIR and the AI Challenger dataset. There are five reference captions for each image in the dataset, each worded differently so that the same image is described properly. For example, the Microsoft team created the Microsoft COCO Captions database just so that the image captioning method could be applied to everyday images, and it can also be used for related tasks such as image recognition and so on.

To build the required model of image captioning, we have used the MSCOCO dataset, which contains 82,780 images.
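As an illustration of how the reference captions can be organised, the sketch below groups the captions per image from an MS COCO-style annotation file. The file name follows the standard COCO release layout and is an assumption here, not a detail taken from the paper.

```python
# Group the (roughly five) reference captions per image from a COCO-style file.
import json
from collections import defaultdict

with open('annotations/captions_train2014.json') as f:   # assumed standard COCO path
    coco = json.load(f)

captions_per_image = defaultdict(list)
for ann in coco['annotations']:                # each entry has image_id, id, caption
    captions_per_image[ann['image_id']].append(ann['caption'].strip())

print(len(captions_per_image), 'images with captions loaded')
```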
C. Steps involved in data processing

Data cleaning: Raw data may have missing values or insignificant amounts of data, which can lead to inconsistencies. The data cleaning step handles all of these problems: all missing values are rectified in accordance with the relevant terms, and all non-essential values are removed.

Creating test and training sets: This is one of the most important steps in data preparation. In this step the dataset is split into a training set and a test set. A large portion of the data (70%-80%) is used for training and a small portion (20%-30%) is used for testing, as sketched below.

Feature scaling: If independent variables are measured at different scales, one variable can dominate another. Feature scaling is therefore an important step, in which all the values in the data are converted to a common range.
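An illustrative 80/20 split of the image identifiers with scikit-learn; the placeholder IDs merely stand in for the real dataset keys.

```python
# Split image IDs into training and test sets in an 80/20 proportion.
from sklearn.model_selection import train_test_split

image_ids = [f'img_{i:05d}' for i in range(1000)]   # placeholder identifiers
train_ids, test_ids = train_test_split(image_ids, test_size=0.2, random_state=42)
print(len(train_ids), 'training images,', len(test_ids), 'test images')
```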
D. Model analysis

1) Supervised image captioning: For our paper, we build a caption generator using a CNN (Convolutional Neural Network) and a specialized RNN, i.e. LSTM (Long Short-Term Memory). Image features are extracted with the InceptionV3/VGG16 model, which is a CNN trained on large image databases, and are thereafter fed to the LSTM model, which is responsible for producing the caption of a given image.

Convolutional Neural Network (CNN)
Convolutional Neural Networks are deep neural networks that can process input in the form of a 2-D array. We convert the images into a 2-D array, which is then fed into the CNN, which can then successfully classify/identify the given image. In this way, a CNN provides us an easy way to work with images. We primarily use CNNs for image classification and image recognition. It scans the image from left to right and top to bottom to extract features and merges them into feature groups. It can deal with translated, rotated and rescaled images and with changes in context.

Recurrent Neural Network (RNN)
A Recurrent Neural Network is an extension of a feed-forward neural network with internal memory. The RNN is recurrent as it applies the same operation to each element of the input sequence, while the output for the current input depends on the previous computation. After an output is produced, it is fed back into the recurrent network, so that when making a decision it considers the current input together with what it has learned from previous inputs.

Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks are a type of RNN, an altered version of the recurrent neural network adjusted so that it can easily remember past data in memory; the vanishing gradient issue of the plain RNN is thereby resolved. The LSTM improves upon the RNN for classifying, processing and predicting time series with gaps of undetermined duration. Back-propagation is used to train the model. In the LSTM network, there are three gates:


• Input gate: This gate determines which part of the input data is to be written to memory. The sigmoid function decides which values to let through (0 to 1), and the tanh function assigns a weight to the values passed, indicating their level of importance (−1 to 1).

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)    (1)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)    (2)

• Forget gate: Determines which information is to be discarded from the block. It is decided by the sigmoid function: it looks at the previous state (h_{t−1}) and the input (x_t) and outputs a number between 0 (forget this) and 1 (keep this) for each value in the cell state C_{t−1}.

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)    (3)

• Output gate: The output of the LSTM block is provided by this gate.

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)    (4)
h_t = o_t ∗ tanh(C_t)    (5)
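To make the CNN-feature plus LSTM decoder concrete, the sketch below shows one common "merge" formulation in Keras. It is illustrative rather than the authors' exact architecture; the vocabulary size, maximum caption length and feature dimension are assumed values.

```python
# Minimal "merge" captioning decoder: pre-extracted CNN features + partial caption
# are combined to predict the next word over the vocabulary.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 8000, 34, 2048          # assumed values

# Image-feature branch: 2048-d CNN feature projected to 256-d
img_in = Input(shape=(feat_dim,))
img_vec = Dense(256, activation='relu')(Dropout(0.5)(img_in))

# Partial-caption branch: word indices -> embedding -> LSTM state
seq_in = Input(shape=(max_len,))
seq_emb = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
seq_vec = LSTM(256)(Dropout(0.5)(seq_emb))

# Merge both branches and predict the next word
decoder = Dense(256, activation='relu')(add([img_vec, seq_vec]))
outputs = Dense(vocab_size, activation='softmax')(decoder)

model = Model(inputs=[img_in, seq_in], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
```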
In this paper, the CNN model that would normally be trained from scratch for feature extraction of an image is replaced by the VGG16/InceptionV3 models, as these models provide better accuracy. The image captions are predicted using beam search with different values of k and using argmax search, and the two are then compared with each other with respect to accuracy and loss/validation loss. The Flickr8k dataset and the MSCOCO dataset are used in this paper for experimentation.
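A sketch of the feature-extraction step with a pre-trained ImageNet model follows (InceptionV3 is shown; VGG16 is analogous with a 224 × 224 input). The image path is a placeholder, and the snippet is an illustration rather than the authors' exact pipeline.

```python
# Extract a fixed-length image feature vector with a pre-trained InceptionV3.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

# pooling='avg' yields a single 2048-d vector instead of class scores
encoder = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(299, 299))   # InceptionV3 input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return encoder.predict(x)[0]                              # shape: (2048,)

features = extract_features('example.jpg')                   # placeholder path
```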
• Approximately 1 hour is required for feature extraction/training with the VGG16 model on the Flickr8k dataset, and 5-6 hours on the MSCOCO dataset.
• Because InceptionV3 has far fewer parameters than the VGG16 model, it took 20 minutes for feature extraction on the Flickr8k dataset and 2 hours 45 minutes on the MSCOCO dataset.
2) Unsupervised image captioning: We have used the concept of unsupervised machine translation for image captioning, with an unsupervised deep learning algorithm; we consider the image as the source language. We use a set of images I and a set of sentences Ŝ, and a detector which detects visual concepts. N_i is the total number of images and N_s is the total number of sentences. An external corpus is used to obtain the sentences; they are unrelated to the image descriptions.

The Model
The model is composed of three parts: the first is an image encoder, the second is a sentence generator, and the third is a discriminator that distinguishes real sentences from those produced by the generator.

Encoder
A CNN model is used to provide the most important features, or pixels, of an image as a feature vector f_im:

f_im = CNN(I)    (6)

We used Inception-V4 as the encoder.

Generator
We use an LSTM for the generation of sentences. It takes the image representation and converts it into a sentence, in language understandable by a human, to describe the image. At each step it provides the probability distribution over all the words of the vocabulary, conditioned on the feature representation of the image and on the words already generated. The following equations are used to sample the generated words from the vocabulary:

x_{−1} = FC(f_im)
x_t = W_e · s_t,  t ∈ {0, …, n−1}
[p_{t+1}, h^g_{t+1}] = LSTM^g(x_t, h^g_t),  t ∈ {0, …, n−1}
s_t ∼ p_t,  t ∈ {1, …, n}    (7)

where FC is a fully connected layer, ∼ denotes the sampling operation, n is the length of the generated sentence, W_e is the word embedding matrix, x_t is the input of the LSTM, s_t is the one-hot representation of the word produced by the generator, h^g_t is the hidden state of the generator LSTM, p_t is the probability distribution over the dictionary at time step t, and s_0 and s_n are the sentence start and end tokens; s_t is sampled from the probability distribution p_t.
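The sketch below illustrates this sampling procedure with a toy LSTM generator: the image feature enters through a fully connected layer, and each subsequent input is the embedding of the previously sampled word. The dimensions and the end-of-sentence token id are illustrative assumptions, not the paper's values.

```python
# Toy word-by-word sampling from an LSTM generator conditioned on an image feature.
import tensorflow as tf

vocab_size, embed_dim, hidden_dim, feat_dim, max_len = 5000, 256, 512, 1536, 20

fc = tf.keras.layers.Dense(embed_dim)                      # FC: image feature -> x_{-1}
embed = tf.keras.layers.Embedding(vocab_size, embed_dim)   # word embedding W_e
lstm_cell = tf.keras.layers.LSTMCell(hidden_dim)           # LSTM^g
to_vocab = tf.keras.layers.Dense(vocab_size)               # logits over the dictionary

def generate(f_im, end_token=2):
    """Sample a word sequence from an image feature f_im of shape [1, feat_dim]."""
    state = [tf.zeros([1, hidden_dim]), tf.zeros([1, hidden_dim])]   # (h, c)
    x = fc(f_im)                                   # x_{-1} = FC(f_im)
    sentence = []
    for _ in range(max_len):
        _, state = lstm_cell(x, state)             # update hidden state h^g_t
        logits = to_vocab(state[0])                # unnormalised p_t
        word = int(tf.random.categorical(logits, 1)[0, 0])   # s_t ~ p_t
        if word == end_token:                      # assumed end-of-sentence id
            break
        sentence.append(word)
        x = embed(tf.constant([word]))             # x_t = W_e * s_t
    return sentence

print(generate(tf.random.normal([1, feat_dim])))
```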

Discriminator
We use an LSTM to implement the discriminator, whose task is to distinguish the real sentences contained in the corpus from the sentences that the model generates.

Training
As this is not supervised training, three training objectives are used in this paper.

• Adversarial caption generation: An adversarial text generation method [22] is used in this paper. The image feature is passed as input to the generator, which generates a sentence best suited to the features of the image. The main task of the generator is to generate sentences that are as realistic as possible, so that it can fool the discriminator. To accomplish this, a reward mechanism known as the "adversarial reward" is applied at each time step, assigning a reward value to the t-th generated word, and a corresponding adversarial loss is defined for the discriminator.

• Visual Concept Distillation: The adversarial type of reward only ensures the generation of plausible sentences, but they may be irrelevant to the image. So it is important to identify the visual concepts present in the image, for which we use a visual concept detector. When a word generated by the model matches a visual concept detected in the image, a reward named the concept reward is given to the generated word, with the confidence score of the visual concept as the reward value. The visual concept detector takes an image I and yields a set of concepts and their confidence scores:

C = {(c_1, v_1), …, (c_i, v_i), …, (c_{N_c}, v_{N_c})}

where c_i is the i-th detected visual concept, v_i is its confidence score, and N_c is the total number of detected visual concepts. The t-th generated word s_t is then assigned the concept reward

r^c_t = Σ_{i=1}^{N_c} 1(s_t = c_i) · v_i,

i.e. the confidence score of the matching concept, and zero when no concept matches (see the sketch below).
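A small sketch of this concept reward, with made-up detections: the generated word receives the detector's confidence score when it matches a detected concept and zero otherwise.

```python
# Concept reward: confidence score of the detected visual concept matching the word.
def concept_reward(word, detections):
    """detections: list of (concept, confidence) pairs from the visual concept detector."""
    return sum(conf for concept, conf in detections if concept == word)

detections = [('dog', 0.92), ('grass', 0.81), ('frisbee', 0.65)]   # (c_i, v_i), made up
print(concept_reward('dog', detections))    # 0.92
print(concept_reward('table', detections))  # 0 (no matching concept)
```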
• Bi-directional Image-Sentence Reconstruction: Only a limited number of object concepts can be detected reliably by existing visual concept detectors, and it is important for our model to have a better understanding of the semantic concepts of the image. The images and sentences are therefore projected into a common latent space so that they can reconstruct each other.

Image Reconstruction: We reconstruct the image feature from the sentence generated by the generator, instead of reconstructing the whole image. As can be seen in figure 5, the discriminator also acts as an encoder for sentences. A fully connected layer is placed on top of the discriminator so that its last hidden state can be projected into the latent space that is common to both sentences and images. For the purpose of training, an additional image reconstruction loss L_im is defined, and the corresponding reward given to the generator for reconstructing the image is denoted r^im_t.

Sentence Reconstruction: The derived representation can be used by the generator to reconstruct the sentences; this acts as a kind of sentence denoising auto-encoder [23]. It aligns the images and sentences in the latent space, and by having a common space for the representation of an image, the decoding of sentences can be learned. A cross-entropy loss is defined for this sentence reconstruction. A sketch of both reconstruction objectives follows.
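The following sketch illustrates the two reconstruction objectives with made-up tensors: the image feature is reconstructed from the sentence encoder's last hidden state through a fully connected layer, and the sentence is reconstructed under a standard cross-entropy loss. It is a simplified illustration, not the authors' implementation.

```python
# Image- and sentence-reconstruction losses on random stand-in tensors.
import tensorflow as tf

hidden_dim, feat_dim, vocab_size, seq_len = 512, 1536, 5000, 12   # assumed sizes

project = tf.keras.layers.Dense(feat_dim)       # FC layer on top of the discriminator

h_last = tf.random.normal([1, hidden_dim])      # last hidden state of the sentence encoder
f_im = tf.random.normal([1, feat_dim])          # CNN image feature

# Image reconstruction loss: distance between projected sentence code and image feature
L_im = tf.reduce_mean(tf.square(project(h_last) - f_im))

# Sentence reconstruction loss: cross-entropy between the original words and the
# probabilities assigned when decoding from the shared latent code
true_words = tf.random.uniform([1, seq_len], maxval=vocab_size, dtype=tf.int32)
logits = tf.random.normal([1, seq_len, vocab_size])   # stand-in for decoder outputs
L_sen = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=true_words, logits=logits))

print(float(L_im), float(L_sen))
```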

Integration
In this paper, policy gradient [21] is used for training the generator, with the gradients estimated with respect to all the trainable parameters. Additional gradients for the generator are provided by the sentence reconstruction loss through back-propagation, and both sets of gradients are used to update the generator. The parameters of the discriminator are updated by gradient descent on the combination of the image reconstruction loss and the adversarial loss.
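As an illustration of the policy-gradient idea, a generic REINFORCE-style loss can be formed from the per-word rewards and the log-probabilities of the sampled words; this is a sketch, not the authors' exact training code.

```python
# Generic REINFORCE-style policy-gradient loss from per-step rewards.
import tensorflow as tf

def policy_gradient_loss(log_probs, rewards, baseline=0.0):
    """log_probs: log p_t of the sampled words, shape [T]; rewards: per-step rewards, shape [T]."""
    advantage = rewards - baseline                       # optional baseline reduces variance
    return -tf.reduce_sum(tf.stop_gradient(advantage) * log_probs)

# Toy example: 4 generated words with their log-probabilities and combined rewards
log_probs = tf.constant([-1.2, -0.8, -2.0, -1.5])
rewards = tf.constant([0.3, 0.9, 0.1, 0.5])
print(float(policy_gradient_loss(log_probs, rewards, baseline=0.4)))
```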

Initialization
A pipeline is proposed for pre-training the discriminator and the generator. For each training image we generate a pseudo caption, and the image captioning model is then initialized with these pseudo image-caption pairs. A concept dictionary is built which contains all the object classes present in the OpenImages dataset [20]. The sentence corpus is used to train a concept-to-sentence model; the concepts in each image are detected with the existing visual concept detector, and the generator is then trained with a standard supervised image captioning algorithm [1] on the pseudo image-caption pairs.

IV. RESULTS AND DISCUSSION

In this paper, pre-trained models are used for feature extraction, and the extracted features are fed as the input vector to the RNN/LSTM architecture for the final image captioning. The VGG16 model has a larger number of parameters than the InceptionV3 model, which is why the latter takes less time for feature extraction/training on the input dataset.

For image caption prediction, two methods are used in this paper, beam search and argmax search, and their results are compared across the different pre-trained models using tables. The criterion for comparison is the loss/validation loss value rather than the accuracy value, and the standard metric used for comparison is the BLEU score.


Here, in figure 3, we can observe that the beam search method for predicting the caption of an image provides more accuracy and precision than the argmax method, and that within the beam search method, the higher the value of k, the better the results obtained. Having discussed the results of the supervised image captioning model, in the following part we discuss the outcome of the image captioning model when implemented in an unsupervised manner.
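For reference, the sketch below contrasts beam-search decoding with greedy argmax decoding on a toy next-word distribution; in the actual model, the next-word probabilities would come from the trained LSTM decoder conditioned on the image features.

```python
# Beam search vs. greedy argmax decoding over a toy next-word distribution.
import heapq, math

toy_lm = {                      # made-up conditional next-word probabilities
    '<s>':  {'a': 0.6, 'the': 0.4},
    'a':    {'dog': 0.5, 'cat': 0.3, 'man': 0.2},
    'the':  {'dog': 0.7, 'cat': 0.3},
    'dog':  {'runs': 0.6, '</s>': 0.4},
    'cat':  {'sits': 0.5, '</s>': 0.5},
    'man':  {'</s>': 1.0}, 'runs': {'</s>': 1.0}, 'sits': {'</s>': 1.0},
}

def next_word_probs(seq):
    return toy_lm.get(seq[-1], {'</s>': 1.0})

def beam_search(k=3, max_len=6):
    beams = [(0.0, ['<s>'])]                     # (cumulative log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == '</s>':
                candidates.append((logp, seq))   # keep finished hypotheses
                continue
            for word, p in next_word_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [word]))
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])   # keep top-k
    return beams[0][1]

def argmax_search(max_len=6):
    seq = ['<s>']
    while seq[-1] != '</s>' and len(seq) < max_len:
        seq.append(max(next_word_probs(seq).items(), key=lambda kv: kv[1])[0])
    return seq

print('beam  :', beam_search(k=3))
print('argmax:', argmax_search())
```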
• Shutterstock is used here to gather image descriptions, which are used to build the corpus of sentences. The visual concept detector uses an object detection model [27] trained on OpenImages [20].
• The model built for the image caption generator using the unsupervised algorithm provides enhanced and better results, with an accuracy of 28.9% when measured with the CIDEr score.
• The dimension of the shared latent space and the dimensions of the LSTM are fixed at 512, and, to bring the weighting parameters for the different rewards (sen, im, c, …) to approximately the same scale, their values are taken to be 1, 0.2, 10 and 0.9, respectively.
• The model is trained using the Adam optimizer [26] with a learning rate of 10^-4. In the initial phase, the cross-entropy loss is minimized using a learning rate of 10^-3. During testing of the model, the beam search method is used with the value of k taken to be 3 (a minimal sketch of these optimizer settings follows the list).
• We have used both supervised learning and unsupervised learning to build the image caption generator model, and the results were very noticeable and promising; but as we have used a sentence corpus rather than image-sentence pairs in unsupervised learning, the captions are more diversified. Our vocabulary is much larger in unsupervised learning, and the captions generated with unsupervised learning are more plausible, as can be seen in figure 4.
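A minimal sketch of the optimizer settings listed above (Adam [26] with a 10^-3 learning rate for the initial cross-entropy phase and 10^-4 afterwards), assuming a TensorFlow/Keras setup:

```python
# Two Adam optimizers with the learning rates mentioned above.
import tensorflow as tf

warmup_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # initial cross-entropy phase
main_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)    # subsequent training
```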

V. CONCLUSION

Various earlier studies in the field of image captioning have produced state-of-the-art models that generate captions for images using an encoder-decoder architecture (i.e., a CNN-RNN pair). Here, an enhancement in accuracy and a reduction in loss/validation loss is achieved by using already-trained ImageNet models for the feature extraction phase of model building. The results of these models are compared for accuracy, and the different methods used for predicting captions for the input image are also compared. It is observed that the model improves with respect to the memory issue and accuracy when the batch size during training is chosen appropriately (reduced to avoid out-of-memory errors) and a higher value of k is used in the beam search method. The unsupervised method for image captioning is also experimented with, and the model built for it shows improvement in the performance and accuracy of the image caption generator. A corpus of image-describing content consisting of over two million sentences gathered from Shutterstock is used for implementing the unsupervised image captioning algorithm, which subsequently leads to promising, more efficient and accurate results even though image-sentence labels are not provided.
REFERENCES

[1] O. Vinyals, A. Toshev, S. Bengio and D. Erhan, "Show and tell: A neural image caption generator," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 3156-3164, doi: 10.1109/CVPR.2015.7298935.
[2] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in ECCV, 2010.
[3] A. Aker and R. Gaizauskas, "Generating image descriptions using dependency relational patterns," in ACL, 2010.
[4] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. C. Berg, K. Yamaguchi, T. L. Berg, K. Stratos, and H. Daumé III, "Midge: Generating image descriptions from computer vision detections," in EACL, 2012.
[5] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi, "Collective generation of natural image descriptions," in ACL, 2012.
[6] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi, "TreeTalk: Composition and compression of trees for image descriptions," ACL, 2(10), 2014.
[7] D. Elliott and F. Keller, "Image description using visual dependency representations," in EMNLP, 2013.
[8] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "Baby talk: Understanding and generating simple image descriptions," in CVPR, 2011.
[9] R. Kiros, R. Zemel, and R. Salakhutdinov, "Multimodal neural language models," in NIPS Deep Learning Workshop, 2013.
[10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 9(8), 1997.
[11] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167, 2015.
[12] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 664-676, April 2017.
[13] T.-Y. Lin, et al., "Microsoft COCO: Common objects in context," arXiv:1405.0312, 2014.
[14] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical sequence training for image captioning," arXiv:1612.00563, 2016.
[15] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, pp. 229-256, 1992.
[16] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, "Sequence level training with recurrent neural networks," ICLR, 2015.
[17] P. Mathur, A. Gill, A. Yadav, A. Mishra and N. K. Bansode, "Camera2Caption: A real-time image caption generator," 2017 International Conference on Computational Intelligence in Data Science (ICCIDS), Chennai, India, 2017, pp. 1-6, doi: 10.1109/ICCIDS.2017.8272660.
[18] H. Wang, Y. Zhang, and X. Yu, "An overview of image caption generation methods," Computational Intelligence and Neuroscience, vol. 2020, Article ID 3062706, 13 pages, 2020, https://doi.org/10.1155/2020/3062706.
[19] T.-Y. Lin, et al., "Microsoft COCO: Common objects in context," arXiv:1405.0312, 2014.
[20] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy, "OpenImages: A public dataset for large-scale multi-label and multi-class image classification," dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017.
[21] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in NIPS, 2000.
[22] W. Fedus, I. Goodfellow, and A. M. Dai, "MaskGAN: Better text generation via filling in the ______," in ICLR, 2018.
[23] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in ICML, 2008.
[24] M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," JAIR, 47, 2013.
[25] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167, 2015.
[26] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.
[27] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in CVPR, 2017.
[28] G. Ranganathan, "Real life human movement realization in multimodal group communication using depth map information and machine learning," Journal of Innovative Image Processing (JIIP), 2(02), pp. 93-101, 2020.
[29] V. Bindhu and Villankurichi Saravanampatti PO, "Semi-automated segmentation scheme for computerized axial tomography images of esophageal tumors," Journal of Innovative Image Processing (JIIP), 2(02), pp. 110-120, 2020.
[30] S. Han and H. Choi, "Domain-specific image caption generator with semantic ontology," 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), Busan, Korea (South), 2020, pp. 526-530, doi: 10.1109/BigComp48618.2020.00-12.
[31] T. Vijayakumar and R. Vinothkanna, "Retrieval of complex images using visual saliency guided cognitive classification," Journal of Innovative Image Processing (JIIP), 2(02), pp. 102-109, 2020.
[32] N. K. Kumar, D. Vigneswari, A. Mohan, K. Laxman and J. Yuvaraj, "Detection and recognition of objects in image caption generator system: A deep learning approach," 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS), Coimbatore, India, 2019, pp. 107-109, doi: 10.1109/ICACCS.2019.8728516.
