Image Captioning based on Deep Reinforcement Learning

Haichao Shi
Institute of Information Engineering, Chinese Academy of Sciences
School of Cyber Security, University of Chinese Academy of Sciences
Beijing, China
[email protected]

Peng Li†
College of Information and Control Engineering, China University of Petroleum (East China)
Qingdao, China
[email protected]

Bo Wang†
National Computer Network Emergency Response Technical Team/Coordination Center of China
Beijing, China
[email protected]

Zhenyu Wang
Institute of Information Engineering, Chinese Academy of Sciences
Beijing, China
[email protected]

† Corresponding author.

ABSTRACT
Recently it has been shown that policy-gradient methods for reinforcement learning can be used to train deep end-to-end systems on natural language processing tasks. Moreover, because of the complexity of understanding image content and the diverse ways of describing it in natural language, image captioning remains a challenging problem. To the best of our knowledge, most state-of-the-art methods follow the pattern of a sequential model, such as a recurrent neural network (RNN). In this paper, we propose a novel architecture that uses deep reinforcement learning to optimize image captioning. We utilize two networks, a "policy network" and a "value network", to collaboratively generate the captions of images. Experiments are conducted on the Microsoft COCO dataset, and the results verify the effectiveness of the proposed method.

CCS CONCEPTS
• Computing methodologies → Sequential decision making; Natural language generation;

KEYWORDS
Image captioning, deep reinforcement learning, policy, value

ACM Reference format:
Haichao Shi, Peng Li, Bo Wang and Zhenyu Wang. 2018. Image Captioning based on Deep Reinforcement Learning. In Proceedings of the 10th International Conference on Internet Multimedia Computing and Service (ICIMCS 2018), Nanjing, China, August 2018, 5 pages. https://doi.org/10.1145/3240876.3240900

1 INTRODUCTION
Learning based methods have been widely used in various image analysis tasks [30,31,32,33,34,35,36]. Image captioning is the challenging task of generating a natural language description for an input image. It requires a fine-grained understanding of the global and local entities in an image, as well as their relationships and attributes, and it has attracted increasing interest in computer vision recently. Most state-of-the-art methods [2,3,4,5,6,7] adopt an encoder-decoder framework to generate the captions of images. Inspired by deep learning approaches, which have yielded impressive results on computer vision tasks, encoder-decoder models usually utilize a convolutional neural network to encode the image and a recurrent neural network to decode the semantic information into complete sentences. These models are trained end-to-end with back-propagation and have achieved fairly good results on the Microsoft COCO dataset [1].
In this paper, we introduce a novel architecture with deep reinforcement learning for image captioning. Different from former works, which train a recurrent model to look for the next suitable word, we utilize two networks, a "policy network" and a "value network", that jointly learn to predict the correct word at each state. In detail, the policy network evaluates the confidence of predicting the next word according to the current state, while the value network evaluates the reward value of the predictions made at the current state. In other words, the value network aims at adjusting the prediction targets so that the generated captions are as natural as possible. Based on deep reinforcement learning, our model can use the two networks to generate captions similar to human natural language descriptions. Fig. 1 shows an instance of the proposed image captioning model based on deep reinforcement learning: the policy network gives some possible actions for the next word, and the value network then decides whether to choose an action given by the policy network based on the evaluated reward scores. Both networks are devoted to the ultimate goal of generating a sufficiently good description.

Figure 1: An instance of the proposed image captioning model based on deep reinforcement learning. The policy network is intended to predict the action according to the current state, and the value network makes an inference based on the rewards.
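To make this interplay concrete, the sketch below shows one plausible greedy decoding loop. This is only our reading of Fig. 1, not the authors' released procedure; the helper functions policy_topk and value_score and the candidate width k are assumptions.

```python
# A minimal sketch of how the two networks could interact at decoding time.
# Assumption: greedy selection over the policy's top-k candidate words;
# policy_topk and value_score are hypothetical helpers, not the paper's API.

def generate_caption(image, policy_topk, value_score, max_len=20, k=5,
                     bos="<bos>", eos="<eos>"):
    caption = [bos]
    for _ in range(max_len):
        # Policy network: propose candidate next words with their confidences.
        candidates = policy_topk(image, caption, k)        # [(word, prob), ...]
        # Value network: score the state each candidate would lead to.
        scored = [(value_score(image, caption + [w]), w) for w, _ in candidates]
        _, best_word = max(scored)
        caption.append(best_word)
        if best_word == eos:
            break
    return caption[1:]
```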
We conduct detailed experiments to demonstrate the competitive performance of our model against state-of-the-art approaches. Our evaluation indicators include BLEU [22], METEOR [23], CIDEr [24] and ROUGE [25]. The contributions of this paper are summarized as follows:
• We integrate deep reinforcement learning with the image captioning task to generate descriptions that read more like natural language. We use two networks, a "policy network" and a "value network", that complement each other to improve the quality of the generated descriptions.
• We optimize the two networks with the temporal-difference (TD) method [26]. For the training process, we use supervised learning with a cross-entropy loss to train the sequence part of the policy network and a mean squared loss to train the value network.

The rest of the paper is structured as follows. In Section 2, we discuss related work on image captioning and reinforcement learning. In Section 3, we elaborate the proposed method. In Section 4, experiments are conducted to demonstrate the effectiveness of the proposed method. In Section 5, we draw conclusions.

2 RELATED WORK

2.1 Image Captioning
Image captioning is a comprehensive problem that integrates computer vision, natural language processing and machine learning. With the rise of machine translation and big data, most state-of-the-art methods follow an encoder-decoder framework to generate captions for natural images. To the best of our knowledge, the encoder is usually a convolutional neural network, with the activations of the last fully connected layer or convolutional layer taken as the features of an image. The decoder is generally a recurrent neural network and is mainly utilized for generating the image description. Because of the vanishing-gradient problem, an ordinary RNN can only memorize the contents of a limited number of previous time steps. LSTM (Long Short-Term Memory) [27] is a special RNN architecture that alleviates this problem and has long-term memory, so the LSTM is now commonly used in the decoder stage.

In Vinyals's [5] work, an encoder-decoder framework is proposed that utilizes a convolutional neural network to extract image features and generates the target-language description through an LSTM by maximizing the likelihood of the target description. In Fang's [8] work, multiple instance learning is utilized to train detectors that first extract the words contained in an image, and a statistical model is then learned to generate descriptions. In Xu's [6] work, an attention mechanism is incorporated into image captioning: spatial attention is combined with the convolutional features of the image, and the context information is fed into the encoder-decoder framework. These works all combine a CNN (convolutional neural network) and an RNN (recurrent neural network) for caption generation. The words are sequentially drawn according to the local confidence, and such methods usually choose the words with the highest local confidence. As a result, some good caption results may be missed. In contrast, our model can choose suitable caption results via the reward scores to generate a good description.

2.2 Reinforcement Learning
Reinforcement learning is a core problem in computer gaming, control theory and path planning, among other areas. These problems share the same setting: an agent needs to interact with the environment and execute a series of actions in order to complete an expected goal. Before the rise of deep reinforcement learning, some preliminary work had been carried out; however, those approaches used deep neural networks only to reduce the dimension of the high-dimensional input data so that traditional reinforcement learning algorithms could process it. Riedmiller [14] proposed to use a multilayer perceptron to approximate the Q-value function, resulting in the Neural Fitted Q Iteration (NFQ) algorithm. In Lange's work [15], deep learning models were combined with reinforcement learning in a Deep Auto-Encoder (DAE). Abtahi et al. [16] utilized a deep belief network as a function approximator in traditional reinforcement learning, which greatly improves the learning efficiency of the agent and was successfully applied to character segmentation of license plate images.
Recently, Silver et al. [17] designed a professional-level computer Go program using deep neural networks and Monte Carlo tree search, and human-level gaming control [18] was achieved through deep Q-learning. A visual navigation system [21] was proposed recently based on an actor-critic reinforcement learning model. There are also generation tasks that use reinforcement learning, such as [19], which uses reinforcement learning [20] to train a model to generate text by directly optimizing a user-specified evaluation metric. In this paper, we address the generation task of image captioning using deep reinforcement learning.

3 IMAGE CAPTIONING BASED ON DEEP REINFORCEMENT LEARNING
In this section, we first elaborate the formulation of image captioning based on deep reinforcement learning. Then we introduce our model architecture and the training procedure.

3.1 Problem Formulation
Since we introduce deep reinforcement learning to the image captioning task, we formulate the problem within the scheme of reinforcement learning, i.e., we model it as a decision-making process. Four factors affect the whole process: the agent, the actions, the environment and the goal, and all four interact with each other. The agent interacts with the environment and executes a series of actions to achieve the goal, and the evaluation indicators are formulated according to the rewarding mechanism. In the task of image captioning, given an image $I$, we aim to generate a natural description $S = \{\omega_1, \omega_2, \ldots, \omega_n\}$ of it, where $\omega_i$ is a word in the description and $n$ is the number of words. Our model includes a policy network $q_\pi$ and a value network $v_\pi$. In this setting, the two networks can be viewed as the agent; the image $I$ and the partially generated description $S_t = \{\omega_1, \omega_2, \ldots, \omega_t\}$, where $t$ is the internal time step, can be regarded as the environment; and the prediction of the next word $\omega_{t+1}$ is viewed as an action.
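To make the formulation concrete, the state and action can be represented as follows. This is a minimal sketch; the class and field names are ours, not the paper's.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    """Environment at step t: the input image I plus the partial caption S_t."""
    image_feature: list                               # CNN feature of I (placeholder type)
    words: List[str] = field(default_factory=list)    # {w_1, ..., w_t}

# An action a_t is the choice of the next word w_{t+1} from the vocabulary.
Action = str

def step(state: State, action: Action) -> State:
    """Executing an action appends the chosen word to the partial caption."""
    return State(state.image_feature, state.words + [action])
```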
3.2 Model Architecture
3.2.1 The Policy Network. Similar to an encoder-decoder architecture, the policy network $q_\pi$ consists of two networks, a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), and it provides the probability for the agent to take each action at each state. The architecture of the policy network $q_\pi$ is shown in Fig. 2. Suppose the current state is $s_t$, which includes the environment $e = \{I, \omega_1, \omega_2, \ldots, \omega_t\}$ that the agent interacts with; the action is $a_t = \omega_{t+1}$. When the policy network is fed an image, the CNN is utilized to encode the visual information. This information is then fed into the RNN module, which provides the action $a_t$ at each step according to its hidden state $h_t$. Since the RNN keeps the sequential information, the word $\omega_t$ it generates at time $t$ is fed back into the RNN at the next step, and when the input is updated the hidden state is updated to the next state as well. Empirically, we design the function of $q_\pi$ for each input $x_t$ by the following equations:

$h_t = \mathrm{RNN}(h_{t-1}, x_t)$    (1)
$x_0 = W \cdot \mathrm{CNN}(I)$    (2)
$x_t = \phi(\omega_{t-1}), \quad t \ge 1$    (3)
$q_\pi(a_t \mid s_t) = \varphi(h_t)$    (4)

where $W$ is the weight matrix applied to the CNN feature of the input image, $x_0$ is the initial input of the RNN, and $\phi$ and $\varphi$ denote the input and output functions of the RNN, respectively.

Figure 2: The architecture of the policy network $q_\pi$, which is comprised of a CNN and an RNN. Through the policy function $q_\pi(a_t \mid s_t)$, the probability of executing an action $a_t$ at a certain state $s_t$ is computed.
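To make Eqs. (1)-(4) concrete, below is a minimal PyTorch-style sketch of such a policy network. It is an illustration under our own assumptions (a VGG-16 encoder as in Section 4.1, an LSTM decoder, and layer names of our choosing), not the authors' code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PolicyNetwork(nn.Module):
    """CNN encoder + RNN decoder producing q_pi(a_t | s_t) over the vocabulary (Eqs. 1-4)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        vgg = models.vgg16(pretrained=True)                      # assumed encoder (Sec. 4.1)
        self.cnn = nn.Sequential(*list(vgg.features), nn.Flatten())
        self.visual_proj = nn.Linear(512 * 7 * 7, embed_dim)     # x_0 = W * CNN(I), Eq. (2)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)    # x_t = phi(w_{t-1}), Eq. (3)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # h_t = RNN(h_{t-1}, x_t), Eq. (1)
        self.output = nn.Linear(hidden_dim, vocab_size)          # q_pi(a_t|s_t) = varphi(h_t), Eq. (4)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) indices of the words generated so far.
        x0 = self.visual_proj(self.cnn(images)).unsqueeze(1)     # (B, 1, E)
        inputs = torch.cat([x0, self.word_embed(captions)], 1)   # image step first, then words
        hidden, _ = self.rnn(inputs)                             # hidden states h_t for every step
        return torch.log_softmax(self.output(hidden), dim=-1)    # log q_pi(a_t | s_t)
```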
3.2.2 The Value Network. The value network contains three parts, a CNN, an RNN and a linear mapping layer (here we use a perceptron model), and it is utilized to evaluate the predictions in order to choose the most suitable action. The architecture of the value network $v_\pi$ is shown in Fig. 3. The CNN is utilized to encode the visual information of the given image, the RNN is utilized to encode the semantic information of the partially generated caption, and the linear mapping layer is designed to predict a reward for the generated caption.

Figure 3: The architecture of the value network $v_\pi$, which is comprised of a CNN, an RNN and a linear mapping layer. The value network can evaluate the policy's value given an image and a partially generated caption.
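A matching sketch of the value network, under the same assumptions (the layer sizes and names are ours; Section 4.1 only states that a perceptron maps the features to a reward):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ValueNetwork(nn.Module):
    """CNN + RNN + linear mapping layer scoring an (image, partial caption) state."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        vgg = models.vgg16(pretrained=True)
        self.cnn = nn.Sequential(*list(vgg.features), nn.Flatten())
        self.visual_proj = nn.Linear(512 * 7 * 7, embed_dim)      # visual information of the image
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # semantics of the partial caption
        self.mapping = nn.Sequential(                             # perceptron mapping to a scalar value
            nn.Linear(embed_dim + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, images, captions):
        img = self.visual_proj(self.cnn(images))                  # (B, E)
        _, (h_last, _) = self.rnn(self.word_embed(captions))      # last hidden state of the RNN
        state = torch.cat([img, h_last[-1]], dim=-1)              # (B, E + H)
        return self.mapping(state).squeeze(-1)                    # v_pi(s), one scalar per state
```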
3.3 Reward Mechanism and Training Strategy
3.3.1 Reward Mechanism. The reward mechanism measures how well an action performs towards completing the goal. We take inspiration from the Reproducing Kernel Hilbert Space (RKHS) [28], in which raw data are mapped through a non-linear mapping so that an originally linearly inseparable problem becomes linearly separable. In our model, we utilize a linear mapping to embed the images and captions into a semantic embedding space, where the distance between images and captions can be calculated. Since the value network $v_\pi$ aims at giving a certain reward to an action, we denote the components of $v_\pi$ as $\mathrm{CNN}$, $\mathrm{RNN}$ and $l_m$. Given an image caption $S = \{\omega_1, \omega_2, \ldots, \omega_T\}$, its embedding feature is represented by $h_{T-1}(S)$, which depends on the last state $h_T$; $f_v$ denotes the feature extracted by the CNN, and $l_m$ denotes the function that maps the features into the embedding space. The mapping loss is defined as

$L_m = \sum_{f_v} \sum_{S} \gamma \left[ \max\left(0,\, h_{T-1}(S) \cdot l_m(f_v)\right) - h_T(S) \cdot l_m(f_v) \right]$    (5)

where $\gamma$ is a penalty coefficient in $(0,1)$, and $h_T$ and $h_{T-1}$ are the adjacent hidden states. The final reward is defined as the cosine similarity between the caption embedding and the mapped image feature:

$R_T = \dfrac{h_T(S) \cdot l_m(f_v)}{\lVert h_T(S) \rVert \, \lVert l_m(f_v) \rVert}$    (6)

Then the final loss can be represented as

$L = \alpha \left( L_m + R_T \right)$    (7)

where $\alpha$ is a coefficient in $(0,1)$.
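As a concrete reading of Eqs. (5)-(7), the snippet below computes the cosine-similarity reward of Eq. (6) and a hinge-style mapping loss in the spirit of Eq. (5). How the embeddings are batched and paired is our assumption, since the paper does not spell it out.

```python
import torch
import torch.nn.functional as F

def embedding_reward(caption_emb, image_emb):
    """Eq. (6): cosine similarity between the caption embedding h_T(S)
    and the mapped image feature l_m(f_v)."""
    return F.cosine_similarity(caption_emb, image_emb, dim=-1)     # one reward per pair

def mapping_loss(prev_emb, last_emb, image_emb, gamma=0.5):
    """A hinge-style reading of Eq. (5); gamma in (0, 1) is the penalty coefficient.
    prev_emb = h_{T-1}(S), last_emb = h_T(S), image_emb = l_m(f_v)."""
    prev_score = (prev_emb * image_emb).sum(dim=-1)    # h_{T-1}(S) . l_m(f_v)
    last_score = (last_emb * image_emb).sum(dim=-1)    # h_T(S) . l_m(f_v)
    return (gamma * (torch.clamp(prev_score, min=0.0) - last_score)).sum()
```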
3.3.2 Training Strategy. First, we train the two networks separately. The policy network attempts to produce the possible action predictions; we train it with supervised learning and optimize it with a cross-entropy loss. We then train the value network by minimizing a mean squared loss. The two networks' loss functions are defined as follows:

$L_p = -\log q(\omega_1, \omega_2, \ldots, \omega_T \mid I; \pi) = -\sum_{t=1}^{T} \log q_\pi(a_t \mid s_t)$    (8)

$L_v = \lVert v_\pi(s_t) - R \rVert^2$    (9)

where $L_p$ represents the loss of the policy network and $q_\pi(a_t \mid s_t)$ represents the policy function: given a certain state $s_t$ at time $t$, it gives a predicted action $a_t$. $L_v$ is the corresponding mean squared loss of the value network, with $R$ the observed reward.
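A minimal sketch of this supervised pre-training stage, assuming the PolicyNetwork and ValueNetwork classes sketched above and a data loader that yields (images, captions, rewards) batches; both assumptions are ours.

```python
import torch.nn as nn

nll = nn.NLLLoss()     # the policy outputs log-probabilities, so NLL equals cross entropy
mse = nn.MSELoss()

def pretrain_step(policy, value, images, captions, rewards, opt_p, opt_v):
    # Eq. (8): supervised cross-entropy on the ground-truth caption.
    # Input steps: image, w_1, ..., w_{T-1}; targets: w_1, ..., w_T.
    log_probs = policy(images, captions[:, :-1])
    loss_p = nll(log_probs.reshape(-1, log_probs.size(-1)), captions.reshape(-1))
    opt_p.zero_grad(); loss_p.backward(); opt_p.step()

    # Eq. (9): mean squared error between v_pi(s) and the observed reward R.
    loss_v = mse(value(images, captions), rewards)
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()
    return loss_p.item(), loss_v.item()
```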
Secondly, we jointly train the two networks $q_\pi$ and $v_\pi$ with deep reinforcement learning. We learn the parameters of the model by maximizing the total reward the agent receives when interacting with the environment. Denoting by $R_t$ the reward at time $t$, the expected reward is $J(\pi) = \mathbb{E}_{a_{1:T} \sim q_\pi}\left[\sum_{t=1}^{T} R_t\right]$. For calculation convenience, we regard the objective as a Markov decision process. Because the problem involves a semantic-space representation obtained from high-dimensional interaction, which may contain unknown environment variables, we use the following approximation:

$\nabla_{\theta_q} J = \sum_{t=1}^{T} \nabla_{\theta_q} \log q_\pi(a_t \mid s_t) \left( R - v_\pi(s_t) \right)$    (10)

$\nabla_{\theta_v} J = \nabla_{\theta_v} v_\pi(s_t) \left( R - v_\pi(s_t) \right)$    (11)

$v_\pi(s) = \mathbb{E}\left[ R \mid a_t \sim q_\pi,\, s_t = s \right]$    (12)

where $\theta_q$ and $\theta_v$ denote the parameters of the policy network and the value network, and $v_\pi(s)$ represents the evaluation score of state $s_t$ when given a policy predicted by $q_\pi$ at time $t$.
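Equations (10)-(11) are the standard policy-gradient update with the value network as a baseline, so the advantage is $R - v_\pi(s_t)$. Below is a minimal sketch of one joint update, again assuming the classes introduced above, a sampled caption, and its final reward $R$ from Eq. (6); for simplicity it uses one value estimate per sequence rather than per step.

```python
def joint_update(policy, value, images, sampled_captions, final_reward, opt_p, opt_v):
    # final_reward: R for each sampled caption, shape (B,).
    log_probs = policy(images, sampled_captions[:, :-1])                         # (B, T, V)
    taken = log_probs.gather(-1, sampled_captions.unsqueeze(-1)).squeeze(-1)     # log q_pi(a_t|s_t)

    values = value(images, sampled_captions)                 # v_pi(s), shape (B,)
    advantage = (final_reward - values).detach()             # R - v_pi(s_t), Eq. (10)

    policy_loss = -(taken.sum(dim=1) * advantage).mean()     # gradient ascent on Eq. (10)
    value_loss = ((final_reward - values) ** 2).mean()       # regress v_pi toward R, cf. Eq. (11)

    opt_p.zero_grad(); policy_loss.backward(); opt_p.step()
    opt_v.zero_grad(); value_loss.backward(); opt_v.step()
    return policy_loss.item(), value_loss.item()
```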
4 EXPERIMENTS
In this section, we perform extensive experiments to evaluate the proposed framework. All the experiments are conducted on the MS COCO dataset, using evaluation metrics such as BLEU-3, BLEU-4, METEOR, CIDEr and ROUGE-L, all of which are widely used in caption evaluation tasks.
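The reported numbers would normally come from the official COCO caption evaluation toolkit; as a quick offline sanity check, corpus-level BLEU can also be computed with NLTK. This is our suggestion, not the paper's tooling.

```python
from nltk.translate.bleu_score import corpus_bleu

# references: per image, a list of tokenized reference captions (list of token lists).
# hypotheses: per image, one tokenized generated caption (list of tokens).
def bleu_3_4(references, hypotheses):
    bleu3 = corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3))
    bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
    return bleu3, bleu4
```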
4.1 Dataset Preparation and Network Settings
For the convenience of evaluation, we use the data splits from [4], containing more than 80,000 images for training, 5,000 images for testing and 5,000 images for validation. We use VGG-16 [29] as the CNN architecture and an LSTM as the RNN architecture, and a three-layer perceptron is used to predict a reward for the generated captions. All the inputs, including the hidden units, are set to 512 dimensions. We initialize the model with the Adam [37] optimizer with an initial learning rate of 5 × 10^{−•}, and we decay the learning rate by a factor of 0.9 every two epochs.
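In PyTorch terms this schedule corresponds roughly to the following; the initial learning rate below is a placeholder because the exponent is not legible in the source, and policy, value, train_one_epoch and num_epochs are hypothetical names.

```python
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

INIT_LR = 5e-4   # placeholder for the paper's "5 x 10^-?" value; treat as an assumption

params = list(policy.parameters()) + list(value.parameters())
optimizer = optim.Adam(params, lr=INIT_LR)
scheduler = StepLR(optimizer, step_size=2, gamma=0.9)   # multiply by 0.9 every two epochs

for epoch in range(num_epochs):
    train_one_epoch(policy, value, optimizer)            # hypothetical training loop
    scheduler.step()
```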
4.2 Comparison with the State-of-the-Art
As shown in Table 1, we summarize the performance of our model on the MS COCO dataset under several evaluation metrics. Note that Semantic ATT [7] utilizes rich extra data to train its predictor, so its results are not directly comparable to methods that do not use extra training data. Compared with these methods, except [7], our approach shows significant improvements on these evaluation metrics.

Table 1: Performance of our method on the MS COCO dataset compared with state-of-the-art methods. For the competing methods, we show the results from the latest version of their papers. A dash (-) indicates an unknown score.

Methods      BLEU-3   BLEU-4   METEOR   CIDEr
DeepVS [4]   32.1     23.0     19.5     66.0
NIC [5]      32.9     27.7     23.7     85.5
gLSTM [9]    35.8     26.4     22.7     81.3
m-RNN [11]   35.0     25.0     -        -
ATT [7]      40.2     30.4     24.3     -
Ours         39.5     28.2     24.3     90.7

Since our framework achieves a significant improvement over the other methods, in future work we would like to combine the decision-making framework with existing approaches to further optimize our model.

Figure 4: Visualization results of our model on the MS COCO dataset.

We show some qualitative captioning results of our model in Fig. 4. From the results, we can see that our method is better at recognizing key objects and the overall environment.
5 CONCLUSIONS
In this paper, we present a novel architecture for image captioning with deep reinforcement learning, which achieves good performance compared with other state-of-the-art methods. Different from the previous encoder-decoder frameworks, our model utilizes two networks, a "policy network" and a "value network", to generate captions. The policy network proposes possible actions for the agent, and the value network then decides whether to choose the action given by the policy network. We also use the temporal-difference method to optimize the model. In future work, we will consider improving the model based on existing methods and exploring alternative ways of designing reward mechanisms for natural language generation tasks.

6 ACKNOWLEDGEMENT
This work is supported by the National Natural Science Foundation of China (No. 61501457, No. 61602517) and the National Key Research and Development Program of China (Grant 2016YFB0801305).
REFERENCES
[1] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common Objects in Context. 2014, 8693: 740-755.
[2] Chen X, Zitnick C L. Mind's eye: A recurrent visual representation for image caption generation. 2014: 2422-2431.
[3] Donahue J, Hendricks L A, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition, 2015: 677.
[4] Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2014, 39(4): 664-676.
[5] Vinyals O, Toshev A, Bengio S, et al. Show and tell: A neural image caption generator. 2014: 3156-3164.
[6] Xu K, Ba J, Kiros R, et al. Show, Attend and Tell: Neural image caption generation with visual attention. 2015: 2048-2057.
[7] You Q, Jin H, Wang Z, et al. Image captioning with semantic attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2016: 4651-4659.
[8] Fang H, Gupta S, Iandola F, et al. From captions to visual concepts and back. In IEEE Conference on Computer Vision and Pattern Recognition, 2015: 1473-1482.
[9] Jia X, Gavves E, Fernando B, et al. Guiding Long-Short Term Memory for image caption generation. 2015.
[10] Zhou L, Xu C, Koch P, et al. Image caption generation with text-conditional semantic attention. 2016.
[11] Mao J, Xu W, Yang Y, Wang J, Yuille A. Deep captioning with multimodal recurrent neural networks. In ICLR, 2015.
[12] Yao T, Pan Y, Li Y, et al. Boosting image captioning with attributes. 2016.
[13] Lu J, Xiong C, Parikh D, et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. 2016.
[14] Riedmiller M. Neural Fitted Q Iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, Springer-Verlag, 2005: 317-328.
[15] Lange S, Riedmiller M. Deep auto-encoder neural networks in reinforcement learning. 2010: 1-8.
[16] Abtahi F, Fasel I. Deep belief nets as function approximators for reinforcement learning. In AAAI Conference on Lifelong Learning, AAAI Press, 2011: 2-7.
[17] Silver D, Huang A, Maddison C J, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, Hassabis D. Mastering the game of Go with deep neural networks and tree search. Nature, 529: 484-489, 2016.
[18] Mnih V, Kavukcuoglu K, Silver D, Rusu A, Veness J, Bellemare M G, Graves A, Riedmiller M, Fidjeland A, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D. Human-level control through deep reinforcement learning. Nature, 518: 529-533, 2015.
[19] Ranzato M, Chopra S, Auli M, Zaremba W. Sequence level training with recurrent neural networks. In ICLR, 2016.
[20] Williams R. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8: 229-256, 1992.
[21] Zhu Y, Mottaghi R, Kolve E, Lim J J, Gupta A. Target-driven visual navigation in indoor scenes using deep reinforcement learning. arXiv:1609.05143, 2016.
[22] Papineni K, Roukos S, Ward T, Zhu W-J. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[23] Banerjee S, Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, pages 65-72, 2005.
[24] Vedantam R, Zitnick C L, Parikh D. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
[25] Lin C-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8, Barcelona, Spain, 2004.
[26] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Bradford Book. IEEE Transactions on Neural Networks, 2005, 16(1): 285-286.
[27] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 9: 1735-1780, 1997.
[28] Zhou S K, Chellappa R. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2006, 28(6): 917-929.
[29] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[30] Zhang X. Interactive patent classification based on multi-classifier fusion and active learning. Neurocomputing, 2014, 127: 200-205.
[31] Zhang X Y. Simultaneous optimization for robust correlation estimation in partially observed social network. Neurocomputing, 2017, 205: 455-462.
[32] Zhang X Y, Wang S, Yun X. Bidirectional active learning: A two-way exploration into unlabeled and labeled dataset. IEEE Transactions on Neural Networks and Learning Systems, 2015, 26(12): 3034-3044.
[33] Zhang X Y, Wang S, Zhu X, et al. Update vs. upgrade: Modeling with indeterminate multi-class active learning. Neurocomputing, 2015, 162: 163-170.
[34] Zhu X, Jin X, Zhang X, Li C, He F, Wang L. Context-aware local abnormality detection in crowded scene. Science China Information Sciences, 2015, 58(5): 1-11.
[35] Liu Y, Zhang X, Zhu X, et al. ListNet-based object proposals ranking. Neurocomputing, 2017, 267: 182-194.
[36] Zhu X, Liu J, Wang J, et al. Sparse representation for robust abnormality detection in crowded scenes. Pattern Recognition, 2014, 47(5): 1791-1799.
[37] Kingma D, Ba J. Adam: A method for stochastic optimization. In ICLR, 2015.