Image Captioning based on Deep Reinforcement Learning

Haichao Shi
Institute of Information Engineering, Chinese Academy of Sciences
School of Cyber Security, University of Chinese Academy of Sciences
Beijing, China
[email protected]

Peng Li†
College of Information and Control Engineering, China University of Petroleum (East China)
Qingdao, China
[email protected]

Bo Wang†
National Computer Network Emergency Response Technical Team/Coordination Center of China
Beijing, China
[email protected]

Zhenyu Wang
Institute of Information Engineering, Chinese Academy of Sciences
Beijing, China
[email protected]

† Corresponding author.

ABSTRACT
Recently it has been shown that policy-gradient methods for reinforcement learning can be used to train deep end-to-end systems on natural language processing tasks. Moreover, because of the complexity of understanding image content and the diverse ways of describing it in natural language, image captioning remains a challenging problem. To the best of our knowledge, most state-of-the-art methods follow the pattern of a sequential model, such as a recurrent neural network (RNN). In this paper, we propose a novel architecture that uses deep reinforcement learning to optimize image captioning. We utilize two networks, a "policy network" and a "value network", to collaboratively generate the captions of images. Experiments are conducted on the Microsoft COCO dataset, and the results verify the effectiveness of the proposed method.

CCS CONCEPTS
• Computing methodologies → Sequential decision making; Natural language generation;

KEYWORDS
Image captioning, deep reinforcement learning, policy, value

ACM Reference format:
Haichao Shi, Peng Li, Bo Wang and Zhenyu Wang. 2018. Image Captioning based on Deep Reinforcement Learning. In Proceedings of the 10th International Conference on Internet Multimedia Computing and Service (ICIMCS 2018), Nanjing, China, August 2018, 5 pages. https://doi.org/10.1145/3240876.3240900

1 INTRODUCTION
Learning based methods have been widely used in various image analysis tasks [30,31,32,33,34,35,36]. Image captioning is the challenging task of generating a natural language description for an input image. It requires a fine-grained understanding of the global and local entities in an image, as well as their relationships and attributes, and it has attracted increasing interest in computer vision recently. Most state-of-the-art methods [2,3,4,5,6,7] adopt an encoder-decoder framework to generate the captions of images. Inspired by deep learning approaches, which have yielded impressive results on computer vision tasks, encoder-decoder models usually utilize a convolutional neural network to encode the image and a recurrent neural network to decode the semantic information into complete sentences. These models are trained end-to-end with back-propagation and have achieved fairly good results on the Microsoft COCO dataset [1].
In this paper, we introduce a novel architecture with deep reinforcement learning for image captioning. Different from former works, which train a recurrent model to look for the next suitable word, we utilize two networks, a "policy network" and a "value network", that jointly learn to predict the correct word at each state. In detail, the policy network evaluates the confidence of predicting the next word according to the current state, while the value network evaluates the reward value of the predictions made at the current state. In other words, the value network aims at adjusting the prediction targets so that the generated captions are as natural as possible. Based on deep reinforcement learning, our model can use the two networks to generate captions similar to human natural language descriptions. Fig. 1 shows an instance of the proposed image captioning model based on deep reinforcement learning: the policy network gives some possible actions for the next word, and the value network then decides whether to choose an action given by the policy network based on the evaluated reward scores. Both networks are devoted to the ultimate goal of generating a sufficiently good description.

Figure 1: An instance of the proposed image captioning model based on deep reinforcement learning. The policy network is intended to predict the action according to the current state, and the value network makes an inference based on the rewards.
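To make this interplay concrete, the sketch below shows one plausible greedy decoding loop. This is only our reading of Fig. 1, not the authors' released procedure; the helper functions policy_topk and value_score and the candidate width k are assumptions.

```python
# A minimal sketch of how the two networks could interact at decoding time.
# Assumption: greedy selection over the policy's top-k candidate words;
# policy_topk and value_score are hypothetical helpers, not the paper's API.

def generate_caption(image, policy_topk, value_score, max_len=20, k=5,
                     bos="<bos>", eos="<eos>"):
    caption = [bos]
    for _ in range(max_len):
        # Policy network: propose candidate next words with their confidences.
        candidates = policy_topk(image, caption, k)        # [(word, prob), ...]
        # Value network: score the state each candidate would lead to.
        scored = [(value_score(image, caption + [w]), w) for w, _ in candidates]
        _, best_word = max(scored)
        caption.append(best_word)
        if best_word == eos:
            break
    return caption[1:]
```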
We conduct detailed experiments to demonstrate the competitive performance of our model against state-of-the-art approaches. Our evaluation indicators include BLEU [22], METEOR [23], CIDEr [24] and ROUGE [25]. The contributions of this paper are summarized as follows:
• We integrate deep reinforcement learning with the image captioning task to generate descriptions that read more like natural language. We use two networks, a "policy network" and a "value network", that complement each other to improve the quality of the generated descriptions.
• We optimize the two networks with the temporal-difference (TD) method [26]. For the training process, we use supervised learning with a cross-entropy loss to train the sequence part of the policy network and a mean squared loss to train the value network.

The rest of the paper is structured as follows. In Section 2, we discuss related work on image captioning and reinforcement learning. In Section 3, we elaborate the proposed method. In Section 4, experiments are conducted to demonstrate the effectiveness of the proposed method. In Section 5, we draw conclusions.

2 RELATED WORK

2.1 Image Captioning
Image captioning is a comprehensive problem that integrates computer vision, natural language processing and machine learning. With the rise of machine translation and big data, most state-of-the-art methods follow an encoder-decoder framework to generate captions for natural images. To the best of our knowledge, the encoder is usually a convolutional neural network, with the activations of the last fully connected layer or convolutional layer taken as the features of an image. The decoder is generally a recurrent neural network and is mainly utilized for generating the image description. Because of the vanishing-gradient problem, an ordinary RNN can only memorize the contents of a limited number of previous time steps. LSTM (Long Short-Term Memory) [27] is a special RNN architecture that alleviates this problem and has long-term memory, so the LSTM is now commonly used in the decoder stage.

In Vinyals's [5] work, an encoder-decoder framework is proposed that utilizes a convolutional neural network to extract image features and generates the target-language description through an LSTM by maximizing the likelihood of the target description. In Fang's [8] work, multiple instance learning is utilized to train detectors that first extract the words contained in an image, and a statistical model is then learned to generate descriptions. In Xu's [6] work, an attention mechanism is incorporated into image captioning: spatial attention is combined with the convolutional features of the image, and the context information is fed into the encoder-decoder framework. These works all combine a CNN (convolutional neural network) and an RNN (recurrent neural network) for caption generation. The words are sequentially drawn according to the local confidence, and such methods usually choose the words with the highest local confidence. As a result, some good caption results may be missed. In contrast, our model can choose suitable caption results via the reward scores to generate a good description.

2.2 Reinforcement Learning
Reinforcement learning is a core problem in computer gaming, control theory and path planning, among other areas. These problems share the same setting: an agent needs to interact with the environment and execute a series of actions in order to complete an expected goal. Before the rise of deep reinforcement learning, some preliminary work had been carried out; however, those approaches used deep neural networks only to reduce the dimension of the high-dimensional input data so that traditional reinforcement learning algorithms could process it. Riedmiller [14] proposed to use a multilayer perceptron to approximate the Q-value function, resulting in the Neural Fitted Q Iteration (NFQ) algorithm. In Lange's work [15], deep learning models were combined with reinforcement learning in a Deep Auto-Encoder (DAE). Abtahi et al. [16] utilized a deep belief network as a function approximator in traditional reinforcement learning, which greatly improves the learning efficiency of the agent and was successfully applied to character segmentation of license plate images.
Recently, Silver et al. [17] designed a professional-level computer Go program using deep neural networks and Monte Carlo tree search, and human-level gaming control [18] was achieved through deep Q-learning. A visual navigation system [21] was proposed recently based on an actor-critic reinforcement learning model. There are also generation tasks that use reinforcement learning, such as [19], which uses reinforcement learning [20] to train a model to generate text by directly optimizing a user-specified evaluation metric. In this paper, we address the generation task of image captioning using deep reinforcement learning.

3 IMAGE CAPTIONING BASED ON DEEP REINFORCEMENT LEARNING
In this section, we first elaborate the formulation of image captioning based on deep reinforcement learning. Then we introduce our model architecture and the training procedure.

3.1 Problem Formulation
Since we introduce deep reinforcement learning to the image captioning task, we formulate the problem within the scheme of reinforcement learning, i.e., we model it as a decision-making process. Four factors affect the whole process: the agent, the actions, the environment and the goal, and all four interact with each other. The agent interacts with the environment and executes a series of actions to achieve the goal, and the evaluation indicators are formulated according to the rewarding mechanism. In the task of image captioning, given an image $I$, we aim to generate a natural description $S = \{\omega_1, \omega_2, \ldots, \omega_n\}$ of it, where $\omega_i$ is a word in the description and $n$ is the number of words. Our model includes a policy network $q_\pi$ and a value network $v_\pi$. In this setting, the two networks can be viewed as the agent; the image $I$ and the partially generated description $S_t = \{\omega_1, \omega_2, \ldots, \omega_t\}$, where $t$ is the internal time step, can be regarded as the environment; and the prediction of the next word $\omega_{t+1}$ is viewed as an action.
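To make the formulation concrete, the state and action can be represented as follows. This is a minimal sketch; the class and field names are ours, not the paper's.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    """Environment at step t: the input image I plus the partial caption S_t."""
    image_feature: list                               # CNN feature of I (placeholder type)
    words: List[str] = field(default_factory=list)    # {w_1, ..., w_t}

# An action a_t is the choice of the next word w_{t+1} from the vocabulary.
Action = str

def step(state: State, action: Action) -> State:
    """Executing an action appends the chosen word to the partial caption."""
    return State(state.image_feature, state.words + [action])
```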
3.2 Model Architecture
3.2.1 The Policy Network. Similar to an encoder-decoder architecture, the policy network $q_\pi$ consists of two networks, a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), and it provides the probability for the agent to take each action at each state. The architecture of the policy network $q_\pi$ is shown in Fig. 2. Suppose the current state is $s_t$, which includes the environment $e = \{I, \omega_1, \omega_2, \ldots, \omega_t\}$ that the agent interacts with; the action is $a_t = \omega_{t+1}$. When the policy network is fed an image, the CNN is utilized to encode the visual information. This information is then fed into the RNN module, which provides the action $a_t$ at each step according to its hidden state $h_t$. Since the RNN keeps the sequential information, the word $\omega_t$ it generates at time $t$ is fed back into the RNN at the next step, and when the input is updated the hidden state is updated to the next state as well. Empirically, we design the function of $q_\pi$ for each input $x_t$ by the following equations:

$h_t = \mathrm{RNN}(h_{t-1}, x_t)$    (1)
$x_0 = W \cdot \mathrm{CNN}(I)$    (2)
$x_t = \phi(\omega_{t-1}), \quad t \ge 1$    (3)
$q_\pi(a_t \mid s_t) = \varphi(h_t)$    (4)

where $W$ is the weight matrix applied to the CNN feature of the input image, $x_0$ is the initial input of the RNN, and $\phi$ and $\varphi$ denote the input and output functions of the RNN, respectively.

Figure 2: The architecture of the policy network $q_\pi$, which is comprised of a CNN and an RNN. Through the policy function $q_\pi(a_t \mid s_t)$, the probability of executing an action $a_t$ at a certain state $s_t$ is computed.
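To make Eqs. (1)-(4) concrete, below is a minimal PyTorch-style sketch of such a policy network. It is an illustration under our own assumptions (a VGG-16 encoder as in Section 4.1, an LSTM decoder, and layer names of our choosing), not the authors' code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PolicyNetwork(nn.Module):
    """CNN encoder + RNN decoder producing q_pi(a_t | s_t) over the vocabulary (Eqs. 1-4)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        vgg = models.vgg16(pretrained=True)                      # assumed encoder (Sec. 4.1)
        self.cnn = nn.Sequential(*list(vgg.features), nn.Flatten())
        self.visual_proj = nn.Linear(512 * 7 * 7, embed_dim)     # x_0 = W * CNN(I), Eq. (2)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)    # x_t = phi(w_{t-1}), Eq. (3)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # h_t = RNN(h_{t-1}, x_t), Eq. (1)
        self.output = nn.Linear(hidden_dim, vocab_size)          # q_pi(a_t|s_t) = varphi(h_t), Eq. (4)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) indices of the words generated so far.
        x0 = self.visual_proj(self.cnn(images)).unsqueeze(1)     # (B, 1, E)
        inputs = torch.cat([x0, self.word_embed(captions)], 1)   # image step first, then words
        hidden, _ = self.rnn(inputs)                             # hidden states h_t for every step
        return torch.log_softmax(self.output(hidden), dim=-1)    # log q_pi(a_t | s_t)
```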
3.2.2 The Value Network. The value network contains three parts, a CNN, an RNN and a linear mapping layer (here we use a perceptron model), and it is utilized to evaluate the predictions in order to choose the most suitable action. The architecture of the value network $v_\pi$ is shown in Fig. 3. The CNN is utilized to encode the visual information of the given image, the RNN is utilized to encode the semantic information of the partially generated caption, and the linear mapping layer is designed to predict a reward for the generated caption.

Figure 3: The architecture of the value network $v_\pi$, which is comprised of a CNN, an RNN and a linear mapping layer. The value network can evaluate the policy's value given an image and a partially generated caption.
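A matching sketch of the value network, under the same assumptions (the layer sizes and names are ours; Section 4.1 only states that a perceptron maps the features to a reward):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ValueNetwork(nn.Module):
    """CNN + RNN + linear mapping layer scoring an (image, partial caption) state."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        vgg = models.vgg16(pretrained=True)
        self.cnn = nn.Sequential(*list(vgg.features), nn.Flatten())
        self.visual_proj = nn.Linear(512 * 7 * 7, embed_dim)      # visual information of the image
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # semantics of the partial caption
        self.mapping = nn.Sequential(                             # perceptron mapping to a scalar value
            nn.Linear(embed_dim + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, images, captions):
        img = self.visual_proj(self.cnn(images))                  # (B, E)
        _, (h_last, _) = self.rnn(self.word_embed(captions))      # last hidden state of the RNN
        state = torch.cat([img, h_last[-1]], dim=-1)              # (B, E + H)
        return self.mapping(state).squeeze(-1)                    # v_pi(s), one scalar per state
```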
3.3 Reward Mechanism and Training Strategy
3.3.1 Reward Mechanism. The reward mechanism measures how well an action performs towards completing the goal. We take inspiration from the Reproducing Kernel Hilbert Space (RKHS) [28], in which raw data are mapped through a non-linear mapping so that an originally linearly inseparable problem becomes linearly separable. In our model, we utilize a linear mapping to embed the images and captions into a semantic embedding space, where the distance between images and captions can be calculated. Since the value network $v_\pi$ aims at giving a certain reward to an action, we denote the components of $v_\pi$ as $\mathrm{CNN}$, $\mathrm{RNN}$ and $l_m$. Given an image caption $S = \{\omega_1, \omega_2, \ldots, \omega_T\}$, its embedding feature is represented by $h_{T-1}(S)$, which depends on the last state $h_T$; $f_v$ denotes the feature extracted by the CNN, and $l_m$ denotes the function that maps the features into the embedding space. The mapping loss is defined as

$L_m = \sum_{f_v} \sum_{S} \gamma \left[ \max\left(0,\, h_{T-1}(S) \cdot l_m(f_v)\right) - h_T(S) \cdot l_m(f_v) \right]$    (5)

where $\gamma$ is a penalty coefficient in $(0,1)$, and $h_T$ and $h_{T-1}$ are the adjacent hidden states. The final reward is defined as the cosine similarity between the caption embedding and the mapped image feature:

$R_T = \dfrac{h_T(S) \cdot l_m(f_v)}{\lVert h_T(S) \rVert \, \lVert l_m(f_v) \rVert}$    (6)

Then the final loss can be represented as

$L = \alpha \left( L_m + R_T \right)$    (7)

where $\alpha$ is a coefficient in $(0,1)$.
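As a concrete reading of Eqs. (5)-(7), the snippet below computes the cosine-similarity reward of Eq. (6) and a hinge-style mapping loss in the spirit of Eq. (5). How the embeddings are batched and paired is our assumption, since the paper does not spell it out.

```python
import torch
import torch.nn.functional as F

def embedding_reward(caption_emb, image_emb):
    """Eq. (6): cosine similarity between the caption embedding h_T(S)
    and the mapped image feature l_m(f_v)."""
    return F.cosine_similarity(caption_emb, image_emb, dim=-1)     # one reward per pair

def mapping_loss(prev_emb, last_emb, image_emb, gamma=0.5):
    """A hinge-style reading of Eq. (5); gamma in (0, 1) is the penalty coefficient.
    prev_emb = h_{T-1}(S), last_emb = h_T(S), image_emb = l_m(f_v)."""
    prev_score = (prev_emb * image_emb).sum(dim=-1)    # h_{T-1}(S) . l_m(f_v)
    last_score = (last_emb * image_emb).sum(dim=-1)    # h_T(S) . l_m(f_v)
    return (gamma * (torch.clamp(prev_score, min=0.0) - last_score)).sum()
```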
3.3.2 Training Strategy. First, we train the two networks separately. The policy network attempts to produce the possible action predictions; we train it with supervised learning and optimize it with a cross-entropy loss. We then train the value network by minimizing a mean squared loss. The two networks' loss functions are defined as follows:

$L_p = -\log q(\omega_1, \omega_2, \ldots, \omega_T \mid I; \pi) = -\sum_{t=1}^{T} \log q_\pi(a_t \mid s_t)$    (8)

$L_v = \lVert v_\pi(s_t) - R \rVert^2$    (9)

where $L_p$ represents the loss of the policy network and $q_\pi(a_t \mid s_t)$ represents the policy function: given a certain state $s_t$ at time $t$, it gives a predicted action $a_t$. $L_v$ is the corresponding mean squared loss of the value network, with $R$ the observed reward.
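A minimal sketch of this supervised pre-training stage, assuming the PolicyNetwork and ValueNetwork classes sketched above and a data loader that yields (images, captions, rewards) batches; both assumptions are ours.

```python
import torch.nn as nn

nll = nn.NLLLoss()     # the policy outputs log-probabilities, so NLL equals cross entropy
mse = nn.MSELoss()

def pretrain_step(policy, value, images, captions, rewards, opt_p, opt_v):
    # Eq. (8): supervised cross-entropy on the ground-truth caption.
    # Input steps: image, w_1, ..., w_{T-1}; targets: w_1, ..., w_T.
    log_probs = policy(images, captions[:, :-1])
    loss_p = nll(log_probs.reshape(-1, log_probs.size(-1)), captions.reshape(-1))
    opt_p.zero_grad(); loss_p.backward(); opt_p.step()

    # Eq. (9): mean squared error between v_pi(s) and the observed reward R.
    loss_v = mse(value(images, captions), rewards)
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()
    return loss_p.item(), loss_v.item()
```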
Secondly, we jointly train the two networks $q_\pi$ and $v_\pi$ with deep reinforcement learning. We learn the parameters of the model by maximizing the total reward the agent receives when interacting with the environment. Denoting by $R_t$ the reward at time $t$, the expected reward is $J(\pi) = \mathbb{E}_{a_{1:T} \sim q_\pi}\left[\sum_{t=1}^{T} R_t\right]$. For calculation convenience, we regard the objective as a Markov decision process. Because the problem involves a semantic-space representation obtained from high-dimensional interaction, which may contain unknown environment variables, we use the following approximation:

$\nabla_{\theta_q} J = \sum_{t=1}^{T} \nabla_{\theta_q} \log q_\pi(a_t \mid s_t) \left( R - v_\pi(s_t) \right)$    (10)

$\nabla_{\theta_v} J = \nabla_{\theta_v} v_\pi(s_t) \left( R - v_\pi(s_t) \right)$    (11)

$v_\pi(s) = \mathbb{E}\left[ R \mid a_t \sim q_\pi,\, s_t = s \right]$    (12)

where $\theta_q$ and $\theta_v$ denote the parameters of the policy network and the value network, and $v_\pi(s)$ represents the evaluation score of state $s_t$ when given a policy predicted by $q_\pi$ at time $t$.
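Equations (10)-(11) are the standard policy-gradient update with the value network as a baseline, so the advantage is $R - v_\pi(s_t)$. Below is a minimal sketch of one joint update, again assuming the classes introduced above, a sampled caption, and its final reward $R$ from Eq. (6); for simplicity it uses one value estimate per sequence rather than per step.

```python
def joint_update(policy, value, images, sampled_captions, final_reward, opt_p, opt_v):
    # final_reward: R for each sampled caption, shape (B,).
    log_probs = policy(images, sampled_captions[:, :-1])                         # (B, T, V)
    taken = log_probs.gather(-1, sampled_captions.unsqueeze(-1)).squeeze(-1)     # log q_pi(a_t|s_t)

    values = value(images, sampled_captions)                 # v_pi(s), shape (B,)
    advantage = (final_reward - values).detach()             # R - v_pi(s_t), Eq. (10)

    policy_loss = -(taken.sum(dim=1) * advantage).mean()     # gradient ascent on Eq. (10)
    value_loss = ((final_reward - values) ** 2).mean()       # regress v_pi toward R, cf. Eq. (11)

    opt_p.zero_grad(); policy_loss.backward(); opt_p.step()
    opt_v.zero_grad(); value_loss.backward(); opt_v.step()
    return policy_loss.item(), value_loss.item()
```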
4 EXPERIMENTS
In this section, we perform extensive experiments to evaluate the proposed framework. All the experiments are conducted on the MS COCO dataset, using evaluation metrics such as BLEU-3, BLEU-4, METEOR, CIDEr and ROUGE-L, all of which are widely used in caption evaluation tasks.
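The reported numbers would normally come from the official COCO caption evaluation toolkit; as a quick offline sanity check, corpus-level BLEU can also be computed with NLTK. This is our suggestion, not the paper's tooling.

```python
from nltk.translate.bleu_score import corpus_bleu

# references: per image, a list of tokenized reference captions (list of token lists).
# hypotheses: per image, one tokenized generated caption (list of tokens).
def bleu_3_4(references, hypotheses):
    bleu3 = corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3))
    bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
    return bleu3, bleu4
```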
4.1 Dataset Preparation and Network Settings
For the convenience of evaluation, we use the data splits from [4], containing more than 80,000 images for training, 5,000 images for testing and 5,000 images for validation. We use VGG-16 [29] as the CNN architecture and an LSTM as the RNN architecture, and a three-layer perceptron is used to predict a reward for the generated captions. All the inputs, including the hidden units, are set to 512 dimensions. We initialize the model with the Adam [37] optimizer with an initial learning rate of 5 × 10^{−•}, and we decay the learning rate by a factor of 0.9 every two epochs.
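In PyTorch terms this schedule corresponds roughly to the following; the initial learning rate below is a placeholder because the exponent is not legible in the source, and policy, value, train_one_epoch and num_epochs are hypothetical names.

```python
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

INIT_LR = 5e-4   # placeholder for the paper's "5 x 10^-?" value; treat as an assumption

params = list(policy.parameters()) + list(value.parameters())
optimizer = optim.Adam(params, lr=INIT_LR)
scheduler = StepLR(optimizer, step_size=2, gamma=0.9)   # multiply by 0.9 every two epochs

for epoch in range(num_epochs):
    train_one_epoch(policy, value, optimizer)            # hypothetical training loop
    scheduler.step()
```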
4.2 Comparison with the State-of-the-Art
As shown in Table 1, we summarize the performance of our model on the MS COCO dataset under several evaluation metrics. Note that Semantic ATT [7] utilizes rich extra data to train its predictor, so its results are not directly comparable to methods that do not use extra training data. Compared with these methods, except [7], our approach shows significant improvements on these evaluation metrics.

Table 1: Performance of our method on the MS COCO dataset compared with state-of-the-art methods. For the competing methods, we show the results from the latest version of their papers. A dash (-) indicates an unknown score.

Methods      BLEU-3   BLEU-4   METEOR   CIDEr
DeepVS [4]   32.1     23.0     19.5     66.0
NIC [5]      32.9     27.7     23.7     85.5
gLSTM [9]    35.8     26.4     22.7     81.3
m-RNN [11]   35.0     25.0     -        -
ATT [7]      40.2     30.4     24.3     -
Ours         39.5     28.2     24.3     90.7

Since our framework achieves a significant improvement over the other methods, in future work we would like to combine the decision-making framework with existing approaches to further optimize our model.

Figure 4: Visualization results of our model on the MS COCO dataset.

We show some qualitative captioning results of our model in Fig. 4. From the results, we can see that our method is better at recognizing key objects and the overall environment.
5 CONCLUSIONS
In this paper, we present a novel architecture for image captioning with deep reinforcement learning, which achieves good performance compared with other state-of-the-art methods. Different from the previous encoder-decoder frameworks, our model utilizes two networks, a "policy network" and a "value network", to generate captions. The policy network proposes possible actions for the agent, and the value network then decides whether to choose the action given by the policy network. We also use the temporal-difference method to optimize the model. In future work, we will consider improving the model based on existing methods and exploring alternative ways of designing reward mechanisms for natural language generation tasks.

6 ACKNOWLEDGEMENT
This work is supported by the National Natural Science Foundation of China (No. 61501457, No. 61602517) and the National Key Research and Development Program of China (Grant 2016YFB0801305).
REFERENCES
[1] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common Objects in Context. 2014, 8693: 740-755.
[2] Chen X, Zitnick C L. Mind's eye: A recurrent visual representation for image caption generation. 2014: 2422-2431.
[3] Donahue J, Hendricks L A, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition, 2015: 677.
[4] Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2014, 39(4): 664-676.
[5] Vinyals O, Toshev A, Bengio S, et al. Show and tell: A neural image caption generator. 2014: 3156-3164.
[6] Xu K, Ba J, Kiros R, et al. Show, Attend and Tell: Neural image caption generation with visual attention. 2015: 2048-2057.
[7] You Q, Jin H, Wang Z, et al. Image captioning with semantic attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2016: 4651-4659.
[8] Fang H, Gupta S, Iandola F, et al. From captions to visual concepts and back. In IEEE Conference on Computer Vision and Pattern Recognition, 2015: 1473-1482.
[9] Jia X, Gavves E, Fernando B, et al. Guiding Long-Short Term Memory for image caption generation. 2015.
[10] Zhou L, Xu C, Koch P, et al. Image caption generation with text-conditional semantic attention. 2016.
[11] Mao J, Xu W, Yang Y, Wang J, Yuille A. Deep captioning with multimodal recurrent neural networks. In ICLR, 2015.
[12] Yao T, Pan Y, Li Y, et al. Boosting image captioning with attributes. 2016.
[13] Lu J, Xiong C, Parikh D, et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. 2016.
[14] Riedmiller M. Neural Fitted Q Iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, Springer-Verlag, 2005: 317-328.
[15] Lange S, Riedmiller M. Deep auto-encoder neural networks in reinforcement learning. 2010: 1-8.
[16] Abtahi F, Fasel I. Deep belief nets as function approximators for reinforcement learning. In AAAI Conference on Lifelong Learning, AAAI Press, 2011: 2-7.
[17] Silver D, Huang A, Maddison C J, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, Hassabis D. Mastering the game of Go with deep neural networks and tree search. Nature, 529: 484-489, 2016.
[18] Mnih V, Kavukcuoglu K, Silver D, Rusu A, Veness J, Bellemare M G, Graves A, Riedmiller M, Fidjeland A, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D. Human-level control through deep reinforcement learning. Nature, 518: 529-533, 2015.
[19] Ranzato M, Chopra S, Auli M, Zaremba W. Sequence level training with recurrent neural networks. In ICLR, 2016.
[20] Williams R. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8: 229-256, 1992.
[21] Zhu Y, Mottaghi R, Kolve E, Lim J J, Gupta A. Target-driven visual navigation in indoor scenes using deep reinforcement learning. arXiv:1609.05143, 2016.
[22] Papineni K, Roukos S, Ward T, Zhu W-J. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[23] Banerjee S, Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, pages 65-72, 2005.
[24] Vedantam R, Zitnick C L, Parikh D. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
[25] Lin C-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8, Barcelona, Spain, 2004.
[26] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Bradford Book. IEEE Transactions on Neural Networks, 2005, 16(1): 285-286.
[27] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 9: 1735-1780, 1997.
[28] Zhou S K, Chellappa R. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2006, 28(6): 917-929.
[29] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[30] Zhang X. Interactive patent classification based on multi-classifier fusion and active learning. Neurocomputing, 2014, 127: 200-205.
[31] Zhang X Y. Simultaneous optimization for robust correlation estimation in partially observed social network. Neurocomputing, 2017, 205: 455-462.
[32] Zhang X Y, Wang S, Yun X. Bidirectional active learning: A two-way exploration into unlabeled and labeled dataset. IEEE Transactions on Neural Networks and Learning Systems, 2015, 26(12): 3034-3044.
[33] Zhang X Y, Wang S, Zhu X, et al. Update vs. upgrade: Modeling with indeterminate multi-class active learning. Neurocomputing, 2015, 162: 163-170.
[34] Zhu X, Jin X, Zhang X, Li C, He F, Wang L. Context-aware local abnormality detection in crowded scene. Science China Information Sciences, 2015, 58(5): 1-11.
[35] Liu Y, Zhang X, Zhu X, et al. ListNet-based object proposals ranking. Neurocomputing, 2017, 267: 182-194.
[36] Zhu X, Liu J, Wang J, et al. Sparse representation for robust abnormality detection in crowded scenes. Pattern Recognition, 2014, 47(5): 1791-1799.
[37] Kingma D, Ba J. Adam: A method for stochastic optimization. In ICLR, 2015.