S. Bai, S. An / Neurocomputing 311 (2018) 291–304

Article history: Received 5 May 2017; Revised 13 April 2018; Accepted 19 May 2018; Available online 26 May 2018.

Abstract

Image captioning means automatically generating a caption for an image. As a recently emerged research area, it is attracting more and more attention. To achieve the goal of image captioning, semantic information of images needs to be captured and expressed in natural languages. Connecting both research communities of computer vision and natural language processing, image captioning is a quite challenging task. Various approaches have been proposed to solve this problem. In this paper, we present a survey on advances in image captioning research. Based on the technique adopted, we classify image captioning approaches into different categories. Representative methods in each category are summarized, and their strengths and limitations are discussed. We first discuss methods used in early work, which are mainly retrieval based and template based. Then, we focus our main attention on neural network based methods, which give state of the art results. Neural network based methods are further divided into subcategories based on the specific framework they use, and each subcategory is discussed in detail. After that, state of the art methods are compared on benchmark datasets. Following that, discussions on future research directions are presented.

Keywords: Image captioning; Sentence template; Deep neural networks; Multimodal embedding; Encoder–decoder framework; Attention mechanism
Table 1
Summary of image captioning methods.

Early work
  Retrieval based: Farhadi et al. [13], Ordonez et al. [15], Gupta et al. [16], Hodosh et al. [32], Mason and Charniak [49], Kuznetsova et al. [50].
  Template based: Yang et al. [14], Kulkarni et al. [51], Li et al. [52], Mitchell et al. [53], Ushiku et al. [54].
Neural networks based
  Augmenting early work by deep models: Socher et al. [55], Karpathy et al. [37], Ma et al. [56], Yan and Mikolajczyk [57], Lebret et al. [58].
  Multimodal learning: Kiros et al. [59], Mao et al. [60], Karpathy and Li [61], Chen and Zitnick [62].
  Encoder–decoder framework: Kiros et al. [63], Vinyals et al. [64], Donahue et al. [34], Jia et al. [65], Wu et al. [66], Pu et al. [67].
  Attention guided: Xu et al. [68], You et al. [69], Yang et al. [70].
  Compositional architectures: Fang et al. [33], Tran et al. [71], Fu et al. [72], Ma and Han [73], Oruganti et al. [74], Wang et al. [75].
  Describing novel objects: Mao et al. [76], Hendricks and Venugopalan [36].
environment. Hede et al. used a dictionary of objects and language templates to describe images of objects in backgrounds without clutter [12]. Apparently, such methods are far from applicable to describing the images that we encounter in our everyday life.

It is not until recently that work aiming to generate descriptions for generic real-life images has been proposed [13]–[16]. Early work on image captioning mainly follows two lines of research, i.e. retrieval based and template based. Because these methods accomplish the image captioning task either by making use of existing captions in the training set or by relying on hard-coded language structures, the disadvantage of methods adopted in early work is that they are not flexible enough. As a result, the expressiveness of descriptions generated by these methods is, to a large extent, limited.

Despite the difficult nature of the image captioning task, thanks to recent advances in deep neural networks [17–22], which are widely applied to the fields of computer vision [23–26] and natural language processing [27–31], image captioning systems based on deep neural networks have been proposed. Powerful deep neural networks provide efficient solutions to visual and language modelling. Consequently, they are used to augment existing systems and to design numerous new approaches. Employing deep neural networks to tackle the image captioning problem has demonstrated state of the art results [32]–[37].

With the recent surge of research interest in image captioning, a large number of approaches have been proposed. To give readers a quick overview of the advances in image captioning, we present this survey to review past work and envision future research directions. Although there exist several research topics that also involve both computer vision and natural language processing, such as visual question answering [38–42], text summarization [43], [44] and video description [45–48], because each of them has its own focus, in this survey we mainly focus on work that aims to automatically generate descriptions for generic real-life images.

Based on the technique adopted in each method, we classify image captioning approaches into different categories, which are summarized in Table 1. Representative methods in each category are listed. Methods in early work are mainly retrieval and template based, in which hard-coded rules and hand-engineered features are utilized. Outputs of such methods have obvious limitations, so we review early work relatively briefly in this survey. With the great progress made in research on deep neural networks, approaches that employ neural networks for image captioning have been proposed and demonstrate state of the art results. Based on the framework used in each deep neural network based method, we further classify these methods into subcategories. In this survey, we focus our main attention on neural network based methods. The framework used in each subcategory will be introduced, and the corresponding representative methods will be discussed in more detail.

This paper is organized as follows. In Sections 2 and 3, we first review retrieval based and template based image captioning methods, respectively. Section 4 is about neural network based methods; in that section we divide neural network based image captioning methods into subcategories and discuss representative methods in each subcategory. State of the art methods will be compared on benchmark datasets in Section 5. After that, we will envision future research directions of image captioning in Section 6. The conclusion will be given in Section 7.

2. Retrieval based image captioning

One type of image captioning method that is common in early work is retrieval based. Given a query image, retrieval based methods produce a caption for it by retrieving one or a set of sentences from a pre-specified sentence pool. The generated caption can either be a sentence that already exists or a sentence composed from the retrieved ones. First, let us investigate the line of research that directly uses retrieved sentences as captions of images.

Farhadi et al. establish an ⟨object, action, scene⟩ meaning space to link images and sentences. Given a query image, they map it into the meaning space by solving a Markov Random Field, and use the Lin similarity measure [77] to determine the semantic distance between this image and each existing sentence parsed by the Curran et al. parser [78]. The sentence closest to the query image is taken as its caption [13].

In [15], to caption an image, Ordonez et al. first employ global image descriptors to retrieve a set of images from a web-scale collection of captioned photographs. Then, they utilize semantic contents of the retrieved images to perform re-ranking and use the caption of the top image as the description of the query.

Hodosh et al. frame image captioning as a ranking task [32]. The authors employ the Kernel Canonical Correlation Analysis technique [79], [80] to project image and text items into a common space, where training images and their corresponding captions are maximally correlated. In the new common space, cosine similarities between images and sentences are calculated to select top ranked sentences to act as descriptions of query images.

To alleviate the impact of noisy visual estimation in methods that depend on image retrieval for image captioning, Mason and Charniak first use visual similarity to retrieve a set of captioned images for a query image [49]. Then, from the captions of the retrieved images, they estimate a word probability density conditioned on the query image. The word probability density is used to score the existing captions and select the one with the largest score as the caption of the query.

The above methods implicitly assume that, given a query image, there always exists a sentence that is pertinent to it. This assumption is hardly true in practice. Therefore, instead of using retrieved sentences as descriptions of query images directly, in the other line of retrieval based research, retrieved sentences are utilized to compose a new description for a query image.
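Both lines of retrieval based captioning share the same first step: scoring the sentences in a pool by their similarity to the query image in a common space, as in the KCCA based ranking of Hodosh et al. [32]. The sketch below is a minimal illustration of that step, assuming the image and sentence embeddings have already been projected into a joint space by some trained model; the embedding dimension and example data are invented purely for illustration.

```python
import numpy as np

def rank_sentences(image_vec, sentence_vecs, top_k=3):
    """Rank pooled sentence embeddings by cosine similarity to a query image embedding."""
    img = image_vec / np.linalg.norm(image_vec)
    sents = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    sims = sents @ img                    # cosine similarity in the joint space
    order = np.argsort(-sims)[:top_k]     # indices of the top ranked sentences
    return order, sims[order]

# Toy usage: 5 pooled sentences and one query image, all already embedded in a 4-d joint space.
rng = np.random.default_rng(0)
pool = rng.normal(size=(5, 4))
query = rng.normal(size=4)
idx, scores = rank_sentences(query, pool)
print(idx, scores)  # top-1 can be used directly as the caption; top-k can feed phrase composition
```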
Provided with a dataset of paired images and sentences, Gupta et al. use the Stanford CoreNLP toolkit (http://nlp.stanford.edu/software/corenlp.shtml) to process sentences in the dataset and derive a list of phrases for each image. In order to generate a description for a query image, image retrieval is first performed based on global image features to retrieve a set of images for the query. Then, a model trained to predict phrase relevance is used to select phrases from the ones associated with the retrieved images. Finally, a description sentence is generated based on the selected relevant phrases [16].

With a similar idea, Kuznetsova et al. propose a tree based method to compose image descriptions by making use of captioned web images [50]. After performing image retrieval and phrase extraction, the authors take extracted phrases as tree fragments and model description composition as a constraint optimization problem, which is encoded by using Integer Linear Programming [81], [82] and solved by using the CPLEX solver (ILOG CPLEX, http://www.ilog.com/products/cplex/). Before this paper, the same authors reported a similar method in [83].

Disadvantages of retrieval based image captioning methods are obvious. Such methods transfer well-formed human-written sentences or phrases to generate descriptions for query images. Although the yielded outputs are usually grammatically correct and fluent, constraining image descriptions to sentences that already exist cannot adapt to new combinations of objects or novel scenes. Under certain conditions, generated descriptions may even be irrelevant to image contents. Retrieval based methods therefore have large limitations in their capability to describe images.

3. Template based image captioning

In early image captioning work, another type of method that is commonly used is template based. In template based methods, image captions are generated through a syntactically and semantically constrained process. Typically, in order to use a template based method to generate a description for an image, a specified set of visual concepts needs to be detected first. Then, the detected visual concepts are connected through sentence templates, specific language grammar rules or combinatorial optimization algorithms [84], [53] to compose a sentence.

A method that uses a sentence template for generating image descriptions is presented in [14] by Yang et al., where a quadruplet (Nouns-Verbs-Scenes-Prepositions) is utilized as the sentence template. To describe an image, the authors first use detection algorithms [2], [85] to estimate objects and scenes in this image. Then, they employ a language model [86] trained over the Gigaword corpus (https://catalog.ldc.upenn.edu/LDC2003T05) to predict verbs, scenes and prepositions that may be used to compose the sentence. With probabilities of all elements computed, the best quadruplet is obtained by using Hidden Markov Model inference. Finally, the image description is generated by filling the sentence structure given by the quadruplet.
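The following toy sketch illustrates the template filling idea behind [14]. For simplicity it scores every combination of detected elements exhaustively instead of running the Hidden Markov Model inference used in the original method, and the detections, candidate words, scores and template wording are invented for illustration.

```python
from itertools import product

# Hypothetical detector and language-model outputs: each candidate carries a score in [0, 1].
nouns = {"dog": 0.9, "frisbee": 0.6}
verbs = {"plays": 0.7, "sits": 0.2}
scenes = {"park": 0.8, "street": 0.3}
preps = {"in": 0.9, "at": 0.4}

def best_quadruplet(nouns, verbs, scenes, preps):
    """Pick the highest scoring (noun, verb, preposition, scene) combination by brute force."""
    best, best_score = None, -1.0
    for n, v, p, s in product(nouns, verbs, preps, scenes):
        score = nouns[n] * verbs[v] * preps[p] * scenes[s]  # independence assumed for the toy example
        if score > best_score:
            best, best_score = (n, v, p, s), score
    return best

n, v, p, s = best_quadruplet(nouns, verbs, scenes, preps)
# Fill a rigid sentence template with the selected elements.
print(f"The {n} {v} {p} the {s}.")   # e.g. "The dog plays in the park."
```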
Kulkarni et al. employ a Conditional Random Field to determine the image contents to be rendered in the image caption [87], [51]. In their method, nodes of the graph correspond to objects, object attributes and spatial relationships between objects, respectively. In the graph model, unary potential functions of nodes are obtained by using corresponding visual models, while pairwise potential functions are obtained by making statistics on a collection of existing descriptions. Image contents to be described are determined by performing Conditional Random Field inference. Outputs of the inference are used to generate a description based on a sentence template.

Li et al. use visual models to perform detections in images for extracting semantic information including objects, attributes and spatial relationships [52]. Then, they define a triplet of the format ⟨⟨adj1, obj1⟩, prep, ⟨adj2, obj2⟩⟩ for encoding recognition results. To generate a description with the triplet, web-scale n-gram data, which is able to provide frequency counts of possible n-gram sequences, is resorted to for performing phrase selection, so that candidate phrases that may compose the triplet can be collected. After that, phrase fusion is implemented, using dynamic programming to find the optimal compatible set of phrases to act as the description of the query image.

Mitchell et al. employ computer vision algorithms to process an image and represent this image by using ⟨objects, actions, spatial relationships⟩ triplets [53]. After that, they formulate image description as a tree-generating process based on the visual recognition results. Through object noun clustering and ordering, the authors determine the image contents to describe. Then sub-trees are created for object nouns, which are further used for creating full trees. Finally, a trigram language model [88] is used to select a string from the generated full trees as the description of the corresponding image.

The methods mentioned above use visual models to predict individual words from a query image in a piece-wise manner. Then, predicted words such as objects, attributes, verbs and prepositions are connected in later stages to generate human-like descriptions. Since phrases are combinations of words, compared to individual words, phrases carry bigger chunks of information [89]. Sentences yielded based on phrases tend to be more descriptive. Therefore, methods utilizing phrases under the template based image captioning framework have been proposed.

Ushiku et al. present a method called Common Subspace for Model and Similarity to learn phrase classifiers directly for captioning images [54]. Specifically, the authors extract continuous words [84] from training captions as phrases. Then, they map image features and phrase features into the same subspace, where similarity based and model based classification are integrated to learn a classifier for each phrase. In the testing stage, phrases estimated from a query image are connected by using multi-stack beam search [84] to generate a description.

Template based image captioning can generate syntactically correct sentences, and descriptions yielded by such methods are usually more relevant to image contents than retrieval based ones. However, there are also disadvantages to template based methods. Because description generation under the template based framework is strictly constrained to image contents recognized by visual models, and the number of visual models available is typically small, there are usually limitations to the coverage, creativity and complexity of generated sentences. Moreover, compared to human-written captions, using rigid templates as the main structures of sentences makes generated descriptions less natural.

4. Deep neural network based image captioning

Retrieval based and template based image captioning methods are adopted mainly in early work. Due to the great progress made in the field of deep learning [18], [90], recent work begins to rely on deep neural networks for automatic image captioning. In this section, we will review such methods. Even though deep neural networks are now widely adopted for tackling the image captioning task, different methods may be based on different frameworks. Therefore, we classify deep neural network based methods into subcategories on the basis of the main framework they use and discuss each subcategory, respectively.
4.1. Retrieval and template based methods augmented by neural networks

Encouraged by advances in the field of deep neural networks, instead of utilizing hand-engineered features and shallow models as in early work, deep neural networks are employed to perform image captioning. With inspiration from retrieval based methods, researchers propose to utilize deep models to formulate image captioning as a multi-modality embedding [91] and ranking problem.

To retrieve a description sentence for a query image, Socher et al. propose to use dependency-tree recursive neural networks to represent phrases and sentences as compositional vectors. They use another deep neural network [92] as the visual model to extract features from images [55]. The obtained multimodal features are mapped into a common space by using a max-margin objective function. After training, correct image and sentence pairs in the common space will have larger inner products and vice versa. At last, sentence retrieval is performed based on similarities between representations of images and sentences in the common space.

Karpathy et al. propose to embed sentence fragments and image fragments into a common space for ranking sentences for a query image [37]. They use dependency tree relations [93] of a sentence as sentence fragments and use detection results of the Region Convolutional Neural Network method [3] in an image as image fragments. Representing both image fragments and sentence fragments as feature vectors, the authors design a structured max-margin objective, which includes a global ranking term and a fragment alignment term, to map visual and textual data into a common space. In the common space, similarities between images and sentences are computed based on fragment similarities; as a result, sentence ranking can be conducted at a finer level.

In order to measure similarities between images and sentences with different levels of interaction between them taken into consideration, Ma et al. propose a multimodal Convolutional Neural Network [56]. Ma's framework includes three kinds of components, i.e. image CNNs to encode visual data [94], [95], matching CNNs to jointly represent visual and textual data [96], [97] and multilayer perceptrons to score the compatibility of visual and textual data. The authors use different variants of matching CNNs to account for joint representations of images and words, phrases and sentences. The final matching score between an image and a sentence is determined based on an ensemble of multimodal Convolutional Neural Networks.

Yan and Mikolajczyk propose to use deep Canonical Correlation Analysis [98] to match images and sentences [57]. They use a deep Convolutional Neural Network [8] to extract visual features from images and use a stacked network to extract textual features from Term Frequency-Inverse Document Frequency represented sentences. The Canonical Correlation Analysis objective is employed to map visual and textual features to a joint latent space with the correlation between paired features maximized. In the joint latent space, similarities between an image feature and a sentence feature can be computed directly for sentence retrieval.

Besides using deep models to augment retrieval based image captioning methods, utilizing deep models under the template based framework is also attempted. Lebret et al. leverage a kind of soft-template to generate image captions with deep models [58]. In this method, the authors use the SENNA software (available at http://ml.nec-labs.com/senna/) to extract phrases from training sentences and make statistics on the extracted phrases. Phrases are represented as high-dimensional vectors by using a word vector representation approach [31], [99], [100], and images are represented by using a deep Convolutional Neural Network [94]. A bilinear model is trained as a metric between image features and phrase features, so that given a query image, phrases can be inferred from it. Phrases inferred from an image are used to generate a sentence under the guidance of the statistics made in the early stage.

With the utilization of deep neural networks, the performance of image captioning methods is improved significantly. However, introducing deep neural networks into retrieval based and template based methods does not overcome their disadvantages. The limitations of sentences generated by these methods are not removed.

4.2. Image captioning based on multimodal learning

Retrieval based and template based image captioning methods impose limitations on generated sentences. Thanks to powerful deep neural networks, image captioning approaches have been proposed that do not rely on existing captions or assumptions about sentence structures in the caption generation process. Such methods can yield more expressive and flexible sentences with richer structures. Using multimodal neural networks is one of the attempts that rely on pure learning to generate image captions.

The general structure of multimodal learning based image captioning methods is shown in Fig. 1. In such methods, image features are first extracted by using a feature extractor, such as a deep convolutional neural network. Then, the obtained image feature is forwarded to a neural language model, which maps the image feature into a common space with the word features and performs word prediction conditioned on the image feature and previously generated context words.

Kiros et al. propose to use a neural language model conditioned on image inputs to generate captions for images [59]. In their method, the log-bilinear language model [30] is adapted to multimodal cases. In a natural language processing problem, a language model is used to predict the probability of generating a
word w_t conditioned on previously generated words w_1, ..., w_{t−1}, which is shown below:

P(w_t | w_1, ..., w_{t−1}).  (1)

The authors make the language model dependent on images in two different ways, i.e. adding an image feature as an additive bias to the representation of the next predicted word, and gating the word representation matrix by using the image feature. Consequently, in the multimodal case the probability of generating a word w_t is as follows:

P(w_t | w_1, ..., w_{t−1}, I),  (2)

where I is an image feature. In their method, images are represented by a deep Convolutional Neural Network, and joint image-text feature learning is implemented by back propagating gradients from the loss function through the multimodal neural network model. By using this model, an image caption can be generated word by word, with the generation of each word conditioned on previously generated words and the image feature.

To generate novel captions for images, Mao et al. adapt a Recurrent Neural Network language model to multimodal cases for directly modelling the probability of generating a word conditioned on a given image and previously generated words [60], [35]. Under their framework, a deep Convolutional Neural Network [8] is used to extract visual features from images, and a Recurrent Neural Network [101] with a multimodal part is used to model word distributions conditioned on image features and context words. For the Recurrent Neural Network language model, each unit consists of an input word layer w, a recurrent layer r and an output layer y. At the t-th unit of the Recurrent Neural Network language model, the calculation performed by these three layers is shown as follows:

x(t) = [w(t) r(t − 1)],  (3)

r(t) = f(U · x(t)),  (4)

y(t) = g(V · r(t)),  (5)

where f(·) and g(·) are element-wise non-linear functions, and U and V are matrices of weights to be learned. The multimodal part calculates its layer activation vector m(t) by using the equation below:

m(t) = g_m(V_w · w(t) + V_r · r(t) + V_I · I),  (6)

where g_m is a non-linear function, I is the image feature, and V_w, V_r and V_I are matrices of weights to be learned. The multimodal part fuses image features and distributed word representations by mapping and adding them. To train the model, a perplexity based cost function is minimized based on back propagation.
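As a rough numpy sketch of this kind of multimodal recurrent language model, the snippet below computes the recurrent and multimodal layers of Eqs. (3), (4) and (6) for one time step, and replaces the output layer of Eq. (5) with a softmax read-out over the vocabulary taken from the multimodal layer for illustration. The layer sizes, the choice of tanh as the non-linearity and all weight values are placeholders, not details of the original model.

```python
import numpy as np

rng = np.random.default_rng(0)
D_W, D_R, D_M, D_I, V_SIZE = 8, 8, 8, 8, 20   # illustrative layer sizes and vocabulary size

# Randomly initialized weights standing in for learned parameters.
U = rng.normal(scale=0.1, size=(D_R, D_W + D_R))
Vw = rng.normal(scale=0.1, size=(D_M, D_W))
Vr = rng.normal(scale=0.1, size=(D_M, D_R))
VI = rng.normal(scale=0.1, size=(D_M, D_I))
W_out = rng.normal(scale=0.1, size=(V_SIZE, D_M))

def step(w_t, r_prev, image_feat):
    """One time step of the multimodal recurrent language model."""
    x_t = np.concatenate([w_t, r_prev])                     # Eq. (3): x(t) = [w(t) r(t-1)]
    r_t = np.tanh(U @ x_t)                                  # Eq. (4): recurrent layer
    m_t = np.tanh(Vw @ w_t + Vr @ r_t + VI @ image_feat)    # Eq. (6): multimodal layer
    logits = W_out @ m_t                                    # illustrative read-out over the vocabulary
    p_next = np.exp(logits - logits.max())
    return r_t, p_next / p_next.sum()                       # distribution over the next word

r = np.zeros(D_R)
word_emb, img = rng.normal(size=D_W), rng.normal(size=D_I)
r, p = step(word_emb, r, img)
print(p.argmax())   # index of the most probable next word under the toy model
```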
Karpathy and Li present an approach that aligns image regions represented by a Convolutional Neural Network and sentence segments represented by a Bidirectional Recurrent Neural Network [102] to learn a multimodal Recurrent Neural Network model that generates descriptions for image regions [61]. In their method, after representing image regions and sentence segments by using the corresponding neural networks, a structured objective is used to map visual and textual data into a common space and associate each region feature with the textual feature that describes the region. The aligned two modalities are then employed to train a multimodal Recurrent Neural Network model that can be used to predict the probability of generating the next word given an image feature and context words.

Recurrent Neural Networks are known to have difficulties in learning long term dependencies [103], [104]. To alleviate this weakness in image captioning, Chen and Zitnick propose to dynamically build a visual representation of an image as a caption is being generated for it, so that long term visual concepts can be remembered during this process [62]. To this end, a set of latent variables U_{t−1} is introduced to encode the visual interpretation of the words W_{t−1} that have already been generated. With these latent variables, the probability of generating a word w_t is given below:

P(w_t, V | W_{t−1}, U_{t−1}) = P(w_t | V, W_{t−1}, U_{t−1}) P(V | W_{t−1}, U_{t−1}),  (7)

where V denotes observed visual features, and W_{t−1} denotes the generated words (w_1, ..., w_{t−1}). The authors realize the above idea by adding a recurrent visual hidden layer u into the Recurrent Neural Network. The recurrent layer u is helpful for both reconstructing the visual features V from previous words W_{t−1} and predicting the next word w_t.

4.3. Image captioning based on the encoder–decoder framework

Inspired by recent advances in neural machine translation [28], [105], [106], the encoder–decoder framework is adopted to generate captions for images. The general structure of encoder–decoder based image captioning methods is shown in Fig. 2. This framework was originally designed to translate sentences from one language into another language. Motivated by the neural machine translation idea, it is argued that image captioning can be formulated as a translation problem, where the input is an image, while the output is a sentence [63]. In image captioning methods under this framework, an encoder neural network first encodes an image into an intermediate representation, then a decoder recurrent neural network takes the intermediate representation as input and generates a sentence word by word.

Kiros et al. introduce the encoder–decoder framework into image captioning research to unify joint image-text embedding models and multimodal neural language models, so that given an image input, a sentence output can be generated word by word [63], like language translation. They use Long Short-Term Memory (LSTM) Recurrent Neural Networks to encode textual data [107] and a deep Convolutional Neural Network to encode visual data. Then, through optimizing a pairwise ranking loss, encoded visual data is projected into an embedding space spanned by the LSTM hidden states that encode textual data. In the embedding space, a structure-content neural language model is used to decode visual features conditioned on context word feature vectors, allowing for sentence generation word by word.
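The max-margin and pairwise ranking objectives used in [55], [37] and [63] all share the same core idea: a matched image–sentence pair should score higher than mismatched pairs by a margin. A minimal numpy sketch of such a loss is given below, assuming both modalities have already been embedded into the common space; the margin value and dot-product scoring are illustrative choices rather than the exact formulation of any one of these papers.

```python
import numpy as np

def pairwise_ranking_loss(img_emb, sen_emb, margin=0.1):
    """Hinge-based ranking loss over a batch of aligned image/sentence embeddings.

    img_emb, sen_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    """
    scores = img_emb @ sen_emb.T                  # scores[i, j] = similarity(image_i, sentence_j)
    diag = np.diag(scores)                        # similarities of the matched pairs
    # Penalize every mismatched pair that comes within `margin` of the matched score.
    cost_s = np.maximum(0.0, margin + scores - diag[:, None])  # contrast over sentences
    cost_i = np.maximum(0.0, margin + scores - diag[None, :])  # contrast over images
    np.fill_diagonal(cost_s, 0.0)
    np.fill_diagonal(cost_i, 0.0)
    return cost_s.sum() + cost_i.sum()

rng = np.random.default_rng(0)
print(pairwise_ranking_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```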
With the same inspiration from neural machine translation, Vinyals et al. use a deep Convolutional Neural Network as an encoder to encode images and use Long Short-Term Memory (LSTM) Recurrent Neural Networks to decode the obtained image features into sentences [64], [108]. With the above framework, the authors formulate image captioning as predicting the probability of a sentence conditioned on an input image:

S = arg max_S P(S | I; θ),  (8)

where I is an input image and θ is the model parameter. Since a sentence S equals a sequence of words (S_0, ..., S_{T+1}), with the chain rule Eq. (8) is reformulated below:

S = arg max_S ∏_t P(S_t | I, S_0, ..., S_{t−1}; θ).  (9)

Vinyals et al. use a Long Short-Term Memory neural network to model P(S_t | I, S_0, ..., S_{t−1}; θ) as a hidden state h_t, which can be updated by the update function below:

h_{t+1} = f(h_t, x_t),  (10)

where x_t is the input to the Long Short-Term Memory neural network. In the first unit, x_t is an image feature, while in other units x_t is a feature of previously predicted context words. The model parameter θ is obtained by maximizing the likelihood of sentence–image pairs in the training set. With the trained model, possible output word sequences can be predicted by either sampling or beam search.
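Beam search, mentioned here and reused by several later methods (e.g. the left-to-right beam search in [33]), keeps the k best partial sentences at each step instead of committing to a single word. The sketch below is a generic, model-agnostic version: `log_prob_fn` is a stand-in for any conditional word model such as the LSTM decoder of Eq. (10), and the toy scoring function at the bottom exists only to make the example runnable.

```python
import math

def beam_search(log_prob_fn, vocab, bos="<s>", eos="</s>", beam_size=3, max_len=10):
    """Return the highest scoring word sequence under a conditional log-probability model."""
    beams = [([bos], 0.0)]                      # (partial sequence, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for w in vocab + [eos]:
                candidates.append((seq + [w], score + log_prob_fn(seq, w)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:                           # every kept hypothesis has ended
            break
    best_seq = max(finished + beams, key=lambda c: c[1])[0]
    return [w for w in best_seq if w not in (bos, eos)]

# Toy conditional model standing in for a trained decoder: it prefers "a dog runs" then the end token.
def toy_log_prob(seq, word):
    preferred = {1: "a", 2: "dog", 3: "runs", 4: "</s>"}
    return 0.0 if preferred.get(len(seq)) == word else math.log(0.1)

print(beam_search(toy_log_prob, vocab=["a", "dog", "runs", "sits"]))  # -> ['a', 'dog', 'runs']
```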
Similar to Vinyals's work [64], [108], Donahue et al. also adopt a deep Convolutional Neural Network for encoding and Long Short-Term Memory Recurrent Networks for decoding to generate a sentence description for an input image [34]. The difference is that instead of inputting image features to the system only at the initial stage, Donahue et al. provide both the image feature and the context word feature to the sequential model at each time step.

Using the encoder–decoder framework to tackle the image captioning problem has demonstrated promising results. Encouraged by this success, approaches aiming to augment the framework for obtaining better performance have been proposed.

Aiming to generate image descriptions that are closely related to image contents, Jia et al. extract semantic information from images and add the information to each unit of the Long Short-Term Memory Recurrent Neural Network during the process of sentence generation [65]. The original forms of the memory cell and gates of an LSTM unit [109] are defined as follows:

i_l = σ(W_ix x_l + W_im m_{l−1}),  (11)

f_l = σ(W_fx x_l + W_fm m_{l−1}),  (12)

o_l = σ(W_ox x_l + W_om m_{l−1}),  (13)

c_l = f_l ⊙ c_{l−1} + i_l ⊙ h(W_cx x_l + W_cm m_{l−1}),  (14)

m_l = o_l ⊙ c_l,  (15)

where σ(·) and h(·) are non-linear functions, the variables i_l, f_l and o_l stand for the input gate, forget gate and output gate of an LSTM cell, respectively, c_l and m_l stand for the state and hidden state of the memory cell unit, x_l is the input, W_[·][·] are model parameters, and ⊙ denotes an element-wise multiplication operation. With the addition of semantic information to an LSTM unit, the forms of the memory cell and gates are changed to be as follows:

i_l = σ(W_ix x_l + W_im m_{l−1} + W_ig g),  (16)

f_l = σ(W_fx x_l + W_fm m_{l−1} + W_fg g),  (17)

o_l = σ(W_ox x_l + W_om m_{l−1} + W_og g),  (18)

c_l = f_l ⊙ c_{l−1} + i_l ⊙ h(W_cx x_l + W_cm m_{l−1} + W_cg g),  (19)

m_l = o_l ⊙ c_l,  (20)

where g is the representation of semantic information, which can be from any source as long as it can provide guidance for image captioning.
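A compact numpy sketch of one such guided LSTM step (Eqs. (16)–(20)) is given below. For brevity, the per-gate weight matrices are stacked into single matrices acting on the concatenation [x_l, m_{l−1}, g], which is algebraically equivalent to the separate products in the equations; the dimensions and weight values are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glstm_step(x, m_prev, c_prev, g, W_i, W_f, W_o, W_c):
    """One LSTM step with an extra semantic guidance vector g added to every gate (Eqs. 16-20)."""
    z = np.concatenate([x, m_prev, g])      # stacked input: [x_l, m_{l-1}, g]
    i = sigmoid(W_i @ z)                    # input gate,  Eq. (16)
    f = sigmoid(W_f @ z)                    # forget gate, Eq. (17)
    o = sigmoid(W_o @ z)                    # output gate, Eq. (18)
    c = f * c_prev + i * np.tanh(W_c @ z)   # memory cell, Eq. (19), with tanh as the non-linearity h
    m = o * c                               # hidden state, Eq. (20)
    return m, c

rng = np.random.default_rng(0)
D_X, D_H, D_G = 6, 5, 4
Ws = [rng.normal(scale=0.1, size=(D_H, D_X + D_H + D_G)) for _ in range(4)]
m, c = glstm_step(rng.normal(size=D_X), np.zeros(D_H), np.zeros(D_H), rng.normal(size=D_G), *Ws)
print(m.shape, c.shape)
```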
Given an image, the approaches introduced above seek to directly derive a description from its visual features. In order to utilize high-level semantic information for image captioning, Wu et al. incorporate visual concepts into the encoder–decoder framework [66]. To this end, the authors first mine a set of semantic attributes from the training sentences. Under the region-based multi-label classification framework [110], a Convolutional Neural Network based classifier is trained for each attribute. With the trained semantic attribute classifiers, an image can be represented as a prediction vector V_att(I) giving the probability of each attribute being present in the image. After encoding an image I as V_att(I), a Long Short-Term Memory network [107] is employed as a decoder to generate a sentence describing the contents of the image based on this representation. Under this condition, the image captioning problem can be reformulated below:

S = arg max_S P(S | V_att(I); θ),  (21)

where I is the input image, S is a sentence, and θ is the model parameter.

Because in practical applications there may be far fewer captioned images than uncaptioned ones, semi-supervised learning of image captioning models is of significant practical value. To obtain an image captioning system by leveraging the vast quantity of uncaptioned images available, Pu et al. propose a semi-supervised learning method under the encoder–decoder framework that uses a deep Convolutional Neural Network to encode images and a Deep Generative Deconvolutional Network to decode latent image features for image captioning [67]. The system uses the deep Convolutional Neural Network to provide an approximation to the distribution of the latent features of the Deep Generative Deconvolutional Network and links the latent features to generative models for captions. After training, given an image, the caption can be generated by averaging across the distribution of latent features of the Deep Generative Deconvolutional Network.

4.4. Attention guided image captioning

It is well-known that images are rich in the information they contain, while in image captioning it is unnecessary to describe all details of a given image. Only the most salient contents are supposed to be mentioned in the description. Motivated by the visual attention mechanism of primates and humans [111], [112], approaches that utilize attention to guide image description generation have been proposed. By incorporating attention into the encoder–decoder image captioning framework, sentence generation will be conditioned on hidden states that are computed based on the attention mechanism. The general structure of attention guided image captioning methods is given in Fig. 3. In such methods, an attention mechanism based on various kinds of cues from the input image is incorporated into the encoder–decoder framework to make the decoding process focus on certain aspects of the input image at each time step when generating a description for the input image.

Encouraged by the successes of other tasks that employ attention mechanisms [113–115], Xu et al. propose an attentive encoder–decoder model that is able to dynamically attend to salient image regions during the process of image description generation [68]. Forwarding an image to a deep Convolutional Neural Network and extracting features from a lower
convolutional layer of the network, the authors encode an image as a set of feature vectors, which is shown as follows:

a = (a_1, ..., a_N), a_i ∈ R^D,  (22)

where a_i is a D-dimensional feature vector that represents one part of the image. As a result, an image is represented by N vectors. In the decoding stage, a Long Short-Term Memory network is used as the decoder. Different from previous LSTM versions, a context vector z_l is utilized to dynamically represent the image parts that are relevant for caption generation at time l. Consequently, the memory cell and gates of an LSTM unit take the forms given below:

i_l = σ(W_ix x_l + W_im m_{l−1} + W_iz z_l),  (23)

f_l = σ(W_fx x_l + W_fm m_{l−1} + W_fz z_l),  (24)

o_l = σ(W_ox x_l + W_om m_{l−1} + W_oz z_l),  (25)

c_l = f_l ⊙ c_{l−1} + i_l ⊙ h(W_cx x_l + W_cm m_{l−1} + W_cz z_l),  (26)

m_l = o_l ⊙ c_l.  (27)

Attention is imposed on the decoding process by using the context vector z_l, which is a function of the image region vectors (a_1, ..., a_N) and the weights associated with them (α_1, ..., α_N):

z_l = φ({a_i}, {α_i}).  (28)

With different function forms, different attention mechanisms can be applied. In [68], Xu et al. proposed a stochastic hard attention and a deterministic soft attention for image captioning. At each time step, the stochastic hard attention mechanism selects a visual feature from one of the N locations as the context vector to generate a word, while the deterministic soft attention mechanism combines visual features from all N locations to obtain the context vector to generate a word.

Specifically, in the stochastic hard attention mechanism, at time step l, for each location i, the positive weight α_{l,i} associated with it is taken as the probability for this location to be focused on for generating the corresponding word. The context vector z_l is calculated as follows:

z_l = Σ_{i=1}^{N} s_{l,i} a_i,  (29)

where s_{l,i} is an indicator variable, which is set to 1 if the visual feature a_i from the i-th location out of N is attended at time step l, and 0 otherwise. The distribution of the variable s_{l,i} is treated as a multinoulli distribution parametrized by {α_{l,i}}, and its value is determined based on sampling.

Contrarily, in the deterministic soft attention mechanism, the positive weight α_{l,i} associated with location i at time step l is used to represent the relative importance of the corresponding location in blending the visual features from all N locations to calculate the context vector z_l, which is formulated below:

z_l = Σ_{i=1}^{N} α_{l,i} a_i.  (30)
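The two variants of Eqs. (29) and (30) can be sketched in a few lines of numpy, as below. How the weights α_{l,i} are produced from the decoder state is left abstract here (a softmax over illustrative scores stands in for the attention network), so the snippet only shows how the context vector is formed once the weights are available.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 4                      # N image regions, each a D-dimensional feature (Eq. 22)
a = rng.normal(size=(N, D))      # region features a_1 ... a_N
scores = rng.normal(size=N)      # stand-in for the attention network's relevance scores

alpha = np.exp(scores) / np.exp(scores).sum()   # positive weights summing to one

# Deterministic soft attention, Eq. (30): blend all regions by their weights.
z_soft = alpha @ a

# Stochastic hard attention, Eq. (29): sample one region index from a multinoulli
# distribution parametrized by alpha and use its feature as the context vector.
i = rng.choice(N, p=alpha)
z_hard = a[i]

print(z_soft.shape, z_hard.shape)   # both are D-dimensional context vectors
```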
Finding that both bottom-up [13], [87], [116] and top-down [34], [61], [62] image captioning approaches have certain limitations, You et al. propose a semantic attention model to take advantage of the complementary properties of both types of approaches [69]. To achieve this goal, the authors use a deep Convolutional Neural Network and a set of visual attribute detectors to extract a global feature v and a list of visual attributes {A_i} from an input image, respectively. With each attribute corresponding to one entry of the used vocabulary, the words to generate and the attributes to detect share the same vocabulary. Under the encoder–decoder framework, the global visual feature v is only forwarded to the encoder at the initial step. In the decoding stage, using an input attention function φ(·), certain cognitive visual cues in the attribute list {A_i} will be attended with a probability distribution:

{α_t^i} = φ(y_{t−1}, {A_i}),  (31)

where α_t^i is the weight assigned to an attribute in the list, and y_{t−1} is the previously generated word. These weights are used to calculate the input vector x_t to the t-th unit of a Long Short-Term Memory neural network. With an output attention function ϕ(·), the attention on all the attributes will be modulated by using the weights given below:

{β_t^i} = ϕ(m_t, {A_i}),  (32)

where β_t^i is the weight assigned to an attribute, and m_t is the hidden state of the t-th unit of the Long Short-Term Memory neural network. The obtained weights are further used to predict the probability distribution of the next word to be generated.

Arguing that attentive encoder–decoder models lack global modelling abilities due to their sequential information processing manner, Yang et al. propose a review network to enhance the encoder–decoder framework [70]. To overcome the above-mentioned problem, a reviewer module is introduced to perform review steps on the hidden states of the encoder and give a thought vector at each step. During this process, an attention mechanism is applied to determine the weights assigned to the hidden states. In this manner, the information encoded by the encoder can be reviewed and learned by the thought vectors, which can capture global properties of the input. The obtained thought vectors are used by the decoder for word prediction. Specifically, the authors use the VGGNet [94], which is a commonly used deep Convolutional
Neural Network, to encode an image as a context vector c and a set of hidden states H = {h_t}. A Long Short-Term Memory neural network is used as the reviewer to produce thought vectors. A thought vector f_t at the t-th LSTM unit is calculated as follows:

f_t = g_t(H, f_{t−1}),  (33)

where g_t is a function performed by a reviewer with the attention mechanism applied. After obtaining the thought vectors F = {f_t}, a Long Short-Term Memory neural network decoder can predict the word probability distribution based on them as given below:

y_t = g_t(F, s_{t−1}, y_{t−1}),  (34)

where s_t is the hidden state of the t-th LSTM unit in the decoder, and y_t is the t-th word.

4.5. Compositional architectures for image captioning

In Section 4, we focus on image captioning methods that are based on deep neural networks. Most of the approaches in the previous subsections are based on end-to-end frameworks, whose parameters can be trained jointly. Such methods are neat and efficient. However, believing that each type of approach has its own advantages and disadvantages, architectures composed of independent building blocks have also been proposed for image captioning. In this subsection, we will discuss compositional image captioning architectures that consist of independent functional building blocks that may be used in different types of methods.

The general structure of compositional image captioning methods is given in Fig. 4. In contrast to end-to-end image captioning frameworks, compositional image captioning methods integrate independent building blocks into a pipeline to generate captions for input images. Generally, compositional image captioning methods use a visual model to detect visual concepts appearing in the input image. Then, detected visual concepts are forwarded to a language model to generate candidate descriptions, which are then post-processed to select one of them as the caption of the input image.

Fang et al. propose a system that consists of visual detectors, language models and multimodal similarity models for automatic image captioning [33]. The authors first detect a vocabulary of words that are most common in the training captions. Then, corresponding to each word, a visual detector is trained by using a Multiple Instance Learning approach [117]. Visual features used by these detectors are extracted by a deep Convolutional Neural Network [8]. Given an image, conditioned on the words detected from it, a maximum entropy language model [118] is adopted to generate candidate captions. During this process, left-to-right beam search [119] with a stack of pre-specified length l is performed. Consequently, l candidate captions are obtained for this image. Finally, a deep multimodal similarity model, which maps images and text fragments into a common space for similarity measurement, is used to re-rank the candidate descriptions.
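The sketch below shows the shape of such a three-stage pipeline with deliberately trivial stand-ins for each block: a fixed dictionary plays the word detectors, a couple of hard-coded candidates play the language model, and detector-weighted word overlap plays the multimodal similarity model. None of these stand-ins reflect the actual components of [33]; only the detect, generate and re-rank structure is the point.

```python
def detect_words(image):
    """Stand-in visual detectors: return words believed to be in the image, with confidences."""
    return {"dog": 0.9, "grass": 0.8, "ball": 0.4}          # hypothetical detections

def generate_candidates(detected):
    """Stand-in language model: propose candidate captions conditioned on detected words."""
    return ["a dog plays with a ball on the grass",
            "a cat sleeps on a sofa",
            "a dog runs on the grass"]

def similarity(caption, detected):
    """Stand-in multimodal similarity model: score a caption by detector-weighted word overlap."""
    return sum(score for word, score in detected.items() if word in caption.split())

def caption_image(image):
    detected = detect_words(image)
    candidates = generate_candidates(detected)
    return max(candidates, key=lambda c: similarity(c, detected))   # re-rank and keep the best

print(caption_image("image.jpg"))   # -> "a dog plays with a ball on the grass"
```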
Based on Fang et al.'s work [33], Tran et al. present a system for captioning open domain images [71]. Similar to [33], the authors use a deep residual network based vision model to detect a broad range of visual concepts [120], a maximum entropy language model for candidate description generation, and a deep multimodal similarity model for caption ranking. Moreover, the authors add detection of landmarks and celebrities and a confidence model for dealing with images that are difficult to describe.

To exploit parallel structures between images and sentences for image captioning, Fu et al. propose to align the word generation process to visual perception of image regions [72]. Furthermore, the authors introduce scene-specific contexts to capture high-level semantic information in images for adapting word generation to specific scene types. Given an image, Fu et al. first use the selective search method [121] to extract a large number of image regions. Then, based on the criteria of being semantically meaningful, non-compositional and contextually rich, a small number of them are selected for further processing. Each selected region is represented as a visual feature by using the ResNet network [120]. These features are dynamically attended by an attention-based decoder, which is a Long Short-Term Memory neural network [107]. Finally, to exploit semantic contexts in images for better captioning, Latent Dirichlet Allocation [122] and a multilayer perceptron are used to predict a context vector for an image to bias the word generation in the Long Short-Term Memory neural network.

To be able to produce detailed descriptions of image contents, Ma and Han propose to use structural words for image captioning [73]. Their method consists of two stages, i.e. structural word recognition and sentence translation. The authors first employ a multi-layer optimization method to generate hierarchical concepts that represent an image as a tetrad ⟨objects, attributes, activities, scenes⟩. The tetrad plays the role of structural words. Then, they utilize an encoder–decoder machine translation model, which is based on the Long Short-Term Memory neural network, to translate the structural words into sentences.

Oruganti et al. present a fusion based model which consists of an image processing stage, a language processing stage and a fusion stage [74]. In their method, images and languages are independently processed in their corresponding stages based on a Convolutional Neural Network and a Long Short-Term Memory network, respectively. After that, the outputs of these two stages are mapped into a common vector space, where the fusion stage associates the two modalities and makes predictions. Such a method is argued to make the system more flexible and to mitigate the shortcomings of previous approaches regarding their inability to accommodate disparate inputs.

A parallel-fusion RNN-LSTM architecture is presented in [75] by Wang et al. to take advantage of the complementary properties of simple Recurrent Neural Networks and Long Short-Term Memory networks for improving the performance of image captioning systems. In their method, inputs are mapped to hidden states by Recurrent Neural Network units and Long Short-Term Memory units in parallel. Then, the hidden states in these two networks are merged with certain ratios for word prediction.

4.6. Generating descriptions for images with novelties

So far, all of the introduced image captioning methods are limited to pre-specified and fixed word dictionaries and are not able to generate descriptions for concepts that are not trained with paired image-sentence training data. Humans have the ability to recognize, learn and use novel concepts in various visual understanding tasks. And in practical image description
applications, it is quite possible to come across situations where there are novel objects which are not in the pre-specified vocabulary or have not been trained with paired image-sentence data. It is undesirable to retrain the whole system every time a few images with novel concepts appear. Therefore, it is a useful ability for image captioning systems to adapt to novelties appearing in images in order to generate image descriptions efficiently. In this subsection, we discuss approaches that can deal with novelties in images during image captioning.

In order to learn novel visual concepts without retraining the whole system, Mao et al. propose to use linguistic context and visual features to hypothesize the semantic meanings of new words and use these words to describe images with novelties [76]. To accomplish the novelty learning task, the authors build their system by making two modifications to the model proposed in [35]. First, they use a transposed weight sharing strategy to reduce the number of parameters in the model, so that the over-fitting problem can be prevented. Second, they use a Long Short-Term Memory (LSTM) layer [107] to replace the recurrent layer to avoid the gradient explosion and vanishing problem.

With the aim of describing novel objects that are not present in the training image-sentence pairs, Hendricks et al. propose the Deep Compositional Captioner method [36]. In this method, large object recognition datasets and external text corpora are leveraged, and novel object description is realised based on knowledge transferred between semantically similar concepts. To achieve this goal, Hendricks et al. first train a lexical classifier and a language model over image datasets and text corpora, respectively. Then, they train a deep multimodal caption model to integrate the lexical classifier and the language model. Particularly, as a linear combination of affine transformations of image and language features, the caption model enables easy transfer of semantic knowledge between these two modalities, which allows prediction of novel objects.

5. State of the art method comparison

In this section, we will compare image captioning methods that give state of the art results. Being plagued by the complexity of the outputs, image captioning methods are difficult to evaluate. In order to compare image captioning systems as to their capability to generate human-like sentences with respect to linguistic quality and semantic correctness, various evaluation metrics have been designed. For state of the art method comparison, we need to introduce the commonly used evaluation metrics first.

In fact, the most intuitive way to determine how well a generated sentence describes the content of an image is by direct human judgement. However, because human evaluation requires large amounts of un-reusable human effort, it is difficult to scale up. Furthermore, human evaluation is inherently subjective, making it suffer from user variance. Therefore, in this paper we report method comparison based on automatic image captioning evaluation metrics. The used automatic evaluation metrics include BLEU [123], ROUGE-L [124], METEOR [125] and CIDEr [126]. BLEU, ROUGE-L and METEOR were originally designed to judge the quality of machine translation. Because the evaluation process of image captioning is essentially the same as that of machine translation, in which generated sentences are compared against ground truth sentences, these metrics are widely used for image captioning evaluation.

BLEU [123] uses variable lengths of phrases of a candidate sentence to match against reference sentences written by humans to measure their closeness. In other words, BLEU metrics are determined by comparing a candidate sentence with reference sentences in n-grams. Specifically, to determine BLEU-1, the candidate sentence is compared with reference sentences in unigrams, while for calculating BLEU-2, bigrams are used for matching. A maximum order of four is empirically determined to obtain the best correlation with human judgements. For BLEU metrics, the unigram scores account for adequacy, while higher n-gram scores account for fluency.
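A sentence-level sketch of the clipped n-gram precision at the heart of BLEU is shown below; the brevity penalty is included, but the smoothing and corpus-level aggregation used in official implementations are omitted for brevity.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, references, max_n=4):
    """Geometric mean of clipped n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        max_ref_counts = Counter()
        for ref in refs:                              # clip each n-gram by its max count in any reference
            for gram, cnt in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
        clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0                                # no smoothing in this sketch
        log_prec_sum += math.log(clipped / total) / max_n
    closest_ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > closest_ref_len else math.exp(1 - closest_ref_len / max(len(cand), 1))
    return bp * math.exp(log_prec_sum)

refs = ["a dog is playing on the grass", "a dog plays in the grass"]
print(round(sentence_bleu("a dog is playing in the grass", refs), 3))
```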
ROUGE-L [124] is designed to evaluate the adequacy and fluency of machine translation. This metric employs the longest common subsequence between a candidate sentence and a set of reference sentences to measure their similarity at sentence level. The longest common subsequence between two sentences only requires in-sequence word matches, and the matched words are not necessarily consecutive. Determination of the longest common subsequence is achieved by using a dynamic programming technique. Because this metric automatically includes the longest in-sequence common n-grams, sentence level structure can be naturally captured.
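The dynamic programming step and the resulting ROUGE-L F-measure can be sketched as follows; the recall-weighted F-measure (beta favouring recall) follows the usual formulation, and keeping the best score over multiple references is one common convention rather than the only one.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists, via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1] else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(candidate, references, beta=1.2):
    """ROUGE-L F-measure of a candidate against a set of references (best score is kept)."""
    cand = candidate.split()
    best = 0.0
    for ref in (r.split() for r in references):
        lcs = lcs_length(cand, ref)
        if lcs == 0:
            continue
        prec, rec = lcs / len(cand), lcs / len(ref)
        best = max(best, (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec))
    return best

print(round(rouge_l("a dog runs on the grass", ["the dog is running on green grass"]), 3))
```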
METEOR [125] is an automatic machine translation evaluation metric. It first performs generalized unigram matches between a candidate sentence and a human-written reference sentence, then computes a score based on the matching results. The computation involves precision, recall and alignments of the matched words. In the case of multiple reference sentences, the best score among all independently computed ones is taken as the final evaluation result of the candidate. The introduction of this metric addresses a weakness of the BLEU metric, which is derived only from the precision of matched n-grams.

CIDEr [126] is a paradigm that uses human consensus to evaluate the quality of image captioning. This metric measures the similarity of a sentence generated by an image captioning method to the majority of ground truth sentences written by humans. It achieves this by encoding how frequently the n-grams in the candidate sentence appear in the reference sentences, where a Term Frequency-Inverse Document Frequency weighting for each n-gram is used. This metric is designed to evaluate generated sentences in the aspects of grammaticality, saliency, importance and accuracy.
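A deliberately simplified CIDEr-style score is sketched below: the candidate and references are turned into TF-IDF weighted n-gram vectors and compared by cosine similarity, averaged over references and over n-gram orders. For the sketch, document frequencies are computed over the supplied reference sets rather than over the full corpus, and the length-based penalty of CIDEr-D is omitted; real evaluations should use the official implementation.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cider_like(candidate, references, all_reference_sets, max_n=4):
    """TF-IDF weighted n-gram cosine similarity, averaged over references and n-gram orders."""
    score = 0.0
    for n in range(1, max_n + 1):
        # Document frequency of each n-gram over the available reference sets ("documents").
        df = Counter()
        for ref_set in all_reference_sets:
            df.update(set(g for r in ref_set for g in ngram_counts(r.split(), n)))
        num_docs = max(len(all_reference_sets), 1)
        idf = lambda g: math.log(num_docs / max(df[g], 1.0))   # unseen n-grams get the maximum idf

        def tfidf(tokens):
            counts = ngram_counts(tokens, n)
            total = max(sum(counts.values()), 1)
            return {g: (c / total) * idf(g) for g, c in counts.items()}

        cand_vec = tfidf(candidate.split())
        for ref in references:
            ref_vec = tfidf(ref.split())
            dot = sum(cand_vec.get(g, 0.0) * w for g, w in ref_vec.items())
            norm = math.sqrt(sum(v * v for v in cand_vec.values())) * \
                   math.sqrt(sum(v * v for v in ref_vec.values()))
            score += (dot / norm if norm else 0.0) / (len(references) * max_n)
    return score

refs = ["a dog is playing on the grass", "a dog plays in the grass"]
other = ["a cat sits on a sofa", "two cats are sleeping"]
print(round(cider_like("a dog is playing in the grass", refs, [refs, other]), 3))
```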
Three benchmark datasets that are widely used to evaluate image captioning methods are employed as the testbed for method comparison. The datasets are Flickr8k [32], Flickr30k [127] and the Microsoft COCO Caption dataset [128].

Flickr8k [32] contains 8,000 images extracted from Flickr. The images in this dataset mainly contain humans and animals. Each image is annotated with five sentences based on a crowdsourcing service from Amazon Mechanical Turk. During image annotation, the Amazon Mechanical Turk workers are instructed to focus on the images and describe their contents without considering the context in which the pictures are taken.

Flickr30k [127] is a dataset that is extended from the Flickr8k dataset. There are 31,783 annotated images in Flickr30k. Each image is associated with five sentences purposely written for it. The images in this dataset are mainly about humans involved in everyday activities and events.

The Microsoft COCO Caption dataset [128] is created by gathering images of complex everyday scenes with common objects in their natural context. Currently, there are 123,287 images in total, of which 82,783 and 40,504 are used for training and validation, respectively. For each image in the training and validation sets, five human written captions are provided. Captions of test images are not publicly available. This dataset poses great challenges to the image captioning task.

The comparison is based on an experiment protocol that is commonly adopted in previous work. For the Flickr8k and Flickr30k datasets, 1,000 images are used for validation and testing
Table 2
Method comparison on datasets Flcikr8k and Flick30k. In this table, B-n, MT, RG, CD stand for BLEU-n, METEOR, ROUGE-L and CIDEr, respectively.
Multimodal learning Karpathy and Fei-Fei [61] 0.579 0.383 0.245 0.160 – – – 0.573 0.369 0.240 0.157 – – –
Mao et al. [35] 0.565 0.386 0.256 0.170 – – – 0.600 0.410 0.280 0.190 – – –
Kiros et al. [59] 0.656 0.424 0.277 0.177 0.173 – – 0.600 0.380 0.254 0.171 0.169 – –
encoder–decoder framework Donahue et al. [34] – – – – – – – 0.587 0.391 0.251 0.165 – – –
Vinyals et al. [64] 0.630 0.410 0.270 – – – – 0.670 0.450 0.300 – – – –
Jia et al. [65] 0.647 0.459 0.318 0.216 0.202 – – 0.646 0.446 0.305 0.206 0.179 – –
Attention guided You et al. [69] – – – – – – – 0.647 0.460 0.324 0.230 0.189 – –
Xu et al. [68] 0.670 0.457 0.314 0.213 0.203 – – 0.669 0.439 0.296 0.199 0.185 – –
Compositional architectures Fu et al. [72] 0.639 0.459 0.319 0.217 0.204 0.470 0.538 0.649 0.462 0.324 0.224 0.194 0.451 0.472
Table 3
Method comparison on Microsoft COCO Caption dataset under the commonly used protocol. In this table, B-n, MT, RG, CD stand
for BLEU-n, METEOR, ROUGE-L and CIDEr, respectively.
Multimodal learning Karpathy and Fei-Fei [61] 0.625 0.450 0.321 0.230 0.195 – 0.660
Mao et al. [35] 0.670 0.490 0.350 0.250 – – –
encoder–decoder framework Donahue et al. [34] 0.669 0.489 0.349 0.249 – – –
Jia et al. [65] 0.670 0.491 0.358 0.264 0.227 – 0.813
Vinyals et al. [64] – – – 0.277 0.237 – 0.855
Wu et al. [66] 0.74 0.56 0.42 0.31 0.26 – 0.94
Attention guided Xu et al. [68] 0.718 0.504 0.357 0.250 0.230 – –
You et al. [69] 0.709 0.537 0.402 0.304 0.243 – –
Compositional architectures Fang et al. [33] – – – 0.257 0.236 – –
Fu et al. [72] 0.724 0.555 0.418 0.313 0.248 0.532 0.955
For the Microsoft COCO Caption dataset, since the captions of the test set are unavailable, only the training and validation sets are used. All images in the training set are used for training, while 5,000 validation images are used for validation, and another 5,000 images from the validation set are used for testing. Under the experiment setting described above, the image captioning comparison on the Flickr8k and Flickr30k datasets is shown in Table 2, and comparison results on the Microsoft COCO Caption dataset are shown in Table 3.

In the method of Karpathy and Li [61], a multimodal Recurrent Neural Network is trained to align image regions and sentence fragments for image captioning. The authors report their results on the benchmark datasets Flickr8k, Flickr30k and Microsoft COCO Caption in Tables 2 and 3. On Flickr8k, the achieved BLEU-1, BLEU-2, BLEU-3 and BLEU-4 scores are 0.579, 0.383, 0.245 and 0.160, respectively. Similar results are achieved on the Flickr30k dataset, which are 0.573, 0.369, 0.240 and 0.157, respectively. Higher scores are achieved by their method on the Microsoft COCO Caption dataset for all the BLEU-n evaluation metrics. Furthermore, on this dataset, METEOR and CIDEr scores are reported, which are 0.195 and 0.660, respectively.

Another multimodal learning based image captioning method is that of Mao et al. [35], where a deep Convolutional Neural Network is used to extract visual features from images, and a Recurrent Neural Network with a multimodal part is used to model word distributions conditioned on image features and context words. In their method, words are generated one by one to caption images. They evaluate their method on all three benchmark datasets with respect to the BLEU-n metrics. Their method outperforms Karpathy and Li [61] on all three benchmarks. The results show that a multimodal learning based image captioning method that generates image descriptions word by word can outperform one that uses language fragments, owing to its flexibility.

After the encoder–decoder framework was introduced to solve the image captioning problem, it became a popular paradigm, and promising performances have been demonstrated. Donahue et al. adopt a deep Convolutional Neural Network for encoding and a Long Short-Term Memory Recurrent Network for decoding to generate sentence descriptions for input images [34]. In Donahue's method, both the image feature and the context word feature are provided to the sequential model at each time step. On the Flickr30k dataset, the achieved BLEU-n scores are 0.587, 0.391, 0.251 and 0.165, respectively. On the Microsoft COCO Caption dataset, the achieved BLEU-n scores are 0.669, 0.489, 0.349 and 0.249, respectively. The results are superior to Karpathy and Li [61], but slightly inferior to Mao et al. [35].

With the same encoder–decoder framework, Vinyals et al. [64] outperform Donahue et al. [34] by feeding image features to the decoder network at only the initial time step. In Vinyals' method, inputs to the decoder at the following time steps are features of previously predicted context words. They report BLEU-1, BLEU-2 and BLEU-3 scores on the Flickr8k and Flickr30k datasets and BLEU-4, METEOR and CIDEr scores on the MSCOCO dataset. The reported results outperform the multimodal learning based image captioning methods [35], [61] and the other encoder–decoder based method [34]. The results show that, compared to the multimodal learning based image captioning framework, the encoder–decoder framework is more effective for image captioning.

Following the encoder–decoder paradigm, Jia et al. [65] propose to extract semantic information from images and add this information to each unit of the Long Short-Term Memory Recurrent Neural Network during sentence generation, so as to generate image descriptions that are closely related to image contents. In this manner, the BLEU-n scores on the Flickr8k dataset are improved to 0.647, 0.459, 0.318 and 0.216, respectively, and the BLEU-n scores on the Flickr30k dataset are improved to 0.646, 0.446, 0.305 and 0.206, respectively. The METEOR scores on Flickr8k and Flickr30k are 0.202 and 0.179, respectively. Compared to the basic encoder–decoder framework, the results achieved by their method are much higher, and the scores reported by the authors on the MSCOCO dataset are also competitive with other methods.
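The difference between these decoder conditioning schemes can be made concrete with a small sketch. The following PyTorch-style code is only an illustration under assumed names and dimensions (CaptionDecoder, feat_dim, image_every_step and the layer sizes are our own choices, not the released implementations of [34] or [64]): with image_every_step=True the image feature is concatenated to the word input at every time step, roughly as in Donahue et al. [34]; with image_every_step=False the image feature is fed only once, as the first decoder input, as in Vinyals et al. [64].

import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    # Illustrative LSTM caption decoder; hyper-parameters are assumptions.
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256,
                 hidden_dim=512, image_every_step=True):
        super().__init__()
        self.image_every_step = image_every_step
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(feat_dim, embed_dim)
        lstm_in = 2 * embed_dim if image_every_step else embed_dim
        self.lstm = nn.LSTM(lstm_in, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):
        # image_feat: (B, feat_dim) CNN encoding; captions: (B, T) word indices.
        img = self.img_proj(image_feat)                   # (B, embed_dim)
        words = self.embed(captions)                      # (B, T, embed_dim)
        if self.image_every_step:
            # Image feature repeated and concatenated to every word input (cf. [34]).
            img_seq = img.unsqueeze(1).expand(-1, words.size(1), -1)
            inputs = torch.cat([words, img_seq], dim=2)   # (B, T, 2*embed_dim)
        else:
            # Image feature used once, as the first decoder input (cf. [64]).
            inputs = torch.cat([img.unsqueeze(1), words], dim=1)  # (B, T+1, embed_dim)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                           # per-step vocabulary logits

# Example: decoder = CaptionDecoder(vocab_size=10000, image_every_step=False)

In the same spirit, Jia et al. [65] can be read as a further variant in which an additional semantic feature vector is injected into the recurrent unit at every step.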
Table 4
Automatic metric scores on the MSCOCO test server. In this table, B-n, MT, RG, CD stand for BLEU-n, METEOR, ROUGE-L and CIDEr, respectively. The first seven score columns are computed with c5, the last seven with c40.

                                                          c5                                              c40
                                                          B-1   B-2   B-3   B-4   MT    RG    CD          B-1   B-2   B-3   B-4   MT    RG    CD
Multimodal learning           Mao et al. [35]             0.680 0.506 0.369 0.272 0.225 0.499 0.791       0.865 0.760 0.641 0.529 0.304 0.640 0.789
Encoder–decoder framework     Donahue et al. [34]         0.700 0.530 0.380 0.280 0.240 0.520 0.870       0.870 0.770 0.650 0.530 0.320 0.660 0.890
                              Vinyals et al. [64]         0.713 0.542 0.407 0.309 0.254 0.530 0.943       0.895 0.802 0.694 0.587 0.346 0.682 0.946
                              Wu et al. [66]              0.730 0.560 0.410 0.310 0.250 0.530 0.920       0.890 0.800 0.690 0.580 0.330 0.670 0.930
Attention guided              Xu et al. [68]              0.705 0.528 0.383 0.277 0.241 0.516 0.865       0.881 0.779 0.658 0.537 0.322 0.654 0.893
                              You et al. [69]             0.731 0.565 0.424 0.316 0.250 0.535 0.943       0.900 0.815 0.709 0.599 0.335 0.682 0.958
                              Yang et al. [70]            –     –     –     –     –     –     –           –     –     –     0.597 0.347 0.686 0.969
Compositional architectures   Fang et al. [33]            0.695 –     –     0.291 0.247 0.519 0.912       0.880 –     –     0.567 0.331 0.662 0.925
                              Fu et al. [72]              0.722 0.556 0.418 0.314 0.248 0.530 0.939       0.902 0.817 0.711 0.601 0.336 0.680 0.946
With the encoder–decoder framework, Xu et al. [68] propose to add an attention mechanism to the model, so that the attentive encoder–decoder model is able to dynamically attend to salient image regions during the process of image description generation. Xu et al. report their BLEU-n and METEOR scores on all three benchmark datasets. Their results are comparable to Jia et al. [65].

To take advantage of the complementary properties of bottom-up and top-down image captioning approaches, You et al. [69] propose a semantic attention model to incorporate cognitive visual cues into the decoder as attention guidance for image captioning. Their method is evaluated on the Flickr30k and MSCOCO datasets, with BLEU-n and METEOR scores reported. The experiment results show that their method can further improve the scores compared to Xu et al. [68] and Jia et al. [65]. These results show that appropriate modifications to the basic encoder–decoder framework, such as introducing an attention mechanism, can effectively improve image captioning performance.

A compositional architecture is used by Fu et al. [72] to integrate independent building blocks for generating captions for input images. In their method, the word generation process is aligned to visual perception of image regions, and scene-specific contexts are introduced to capture high-level semantic information in images, adapting word generation to specific scene types. The authors report their experiment results on all three benchmark datasets with respect to the evaluation metrics BLEU-n, METEOR and CIDEr. Most of the reported results outperform other methods. However, although methods based on compositional architectures can utilize information from different sources and take advantage of the strengths of various methods to give better results than most of the other methods, they are usually much more complex and relatively hard to implement.

To ensure consistency in the evaluation of image captioning methods, a test server is hosted by the MSCOCO team [128]. For method evaluation, this server allows researchers to submit captions generated by their own models, and it computes several popular metric scores, including BLEU, METEOR, ROUGE and CIDEr. The evaluation on the server is performed on the "test 2014" test set of the Microsoft COCO Caption dataset, whose ground truth captions are not publicly available. With each image in the test set accompanied by 40 human-written captions, two types of metric scores can be computed for caption evaluation, i.e., c5 and c40, which compare each generated caption against 5 and 40 reference captions, respectively. Evaluation results of previous methods on the test server are summarized in Table 4.

From Table 4, it can be seen that image captioning evaluation metric scores computed based on c40 are higher than the ones computed based on c5. This is because the evaluation metrics are computed based on the consistency between the generated description and the reference descriptions; more references therefore usually lead to a higher probability of matching, resulting in higher metric scores.

From Tables 3 and 4, it can be seen that although image captioning evaluation metric scores computed on the MSCOCO test server differ from the ones computed under the commonly used protocol, the tendencies of the performances of the methods are similar. The method of Mao et al. [35], which is multimodal learning based, is outperformed by the encoder–decoder based image captioning methods of Donahue et al. [34] and Vinyals et al. [64]. Although both Donahue et al. [34] and Vinyals et al. [64] are based on the encoder–decoder framework, they use different decoding mechanisms and, as in Tables 2 and 3, Vinyals et al. [64] achieve higher scores than Donahue et al. [34] with respect to all used evaluation metrics.

Incorporating additional information into the encoder–decoder framework can further improve image captioning performance. For example, by using the attention mechanism, Xu et al. [68] give superior performance to Donahue et al. [34]. By incorporating visual concepts into the encoder–decoder framework, Wu et al. [66] outperform Xu et al. [68]. By using a semantic attention model, You et al. [69] achieve superior performance to nearly all the other methods.

These results show that various kinds of cues from the images can be utilized to improve the image captioning performance of the encoder–decoder framework, and that different kinds of information differ in how much they help. Even with the same structure, when information is fed to the framework in different ways, quite different results may be achieved.

On the MSCOCO test server, image captioning methods based on compositional architectures can usually give relatively good results. Fu et al. [72], which is a compositional architecture, achieve image captioning scores comparable to You et al. [69], and another compositional method, Fang et al. [33], can also outperform the multimodal based method of Mao et al. [35] and the encoder–decoder based methods of Donahue et al. [34] and Xu et al. [68].

In summary, from Table 4 it can be observed that, when using the MSCOCO test server for image captioning evaluation, image captioning methods based on the encoder–decoder framework [34], [64] noticeably outperform the multimodal learning based image captioning method [35]. When semantic information or attention mechanisms are used [66], [69], the performance can be improved further. Currently, the best results on the MSCOCO test server are achieved by image captioning methods that utilize attention mechanisms to augment the encoder–decoder framework [69], [70], which outperform the compositional method [72] slightly (accessed in March 2017).

Finally, in Fig. 5 we show examples of image captioning results obtained with different approaches, to give readers a straightforward impression of the different kinds of image captioning methods.
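As a closing note on the attention-guided methods compared in this section, the soft visual attention underlying Xu et al. [68] can be summarized as follows (a standard formulation, restated with generic symbols that are not used elsewhere in this paper):

\[
e_{ti} = f_{\mathrm{att}}(a_i, h_{t-1}), \qquad
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}, \qquad
z_t = \sum_{i=1}^{L} \alpha_{ti}\, a_i,
\]

where a_1, ..., a_L are the annotation vectors of image regions produced by the CNN encoder, h_{t-1} is the previous decoder hidden state, f_att is a small scoring network, and the context vector z_t is supplied to the decoder when it predicts the word at time step t. The semantic attention of You et al. [69] follows the same weighting scheme but attends over detected visual concepts rather than spatial regions.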
Automatic image captioning is a relatively new task; thanks to the efforts made by researchers in this field, great progress has been made. In our opinion, there is still much room to improve the performance of image captioning. First, with the fast development of deep neural networks, employing more powerful network structures as language models and/or visual models will undoubtedly improve the performance of image description generation. Second, because images consist of objects distributed in space while image captions are sequences of words, investigation of the presence and order of visual concepts in image captions is important for image captioning. Furthermore, since this problem fits well with the attention mechanism, and attention has been suggested to benefit a wide range of AI-related tasks [129], how to utilize attention mechanisms to generate image captions effectively will continue to be an important research topic. Third, due to the lack of paired image-sentence training data, research on utilizing unsupervised data, either from images alone or from text alone, to improve image captioning will be promising. Fourth, current approaches mainly focus on generating captions that are general about image contents. However, as pointed out by Johnson et al. [130], to describe images at a human level and to be applicable in real-life environments, image descriptions should be well grounded in the elements of the images. Therefore, image captioning grounded in image regions will be one of the future research directions. Fifth, so far most previous methods are designed for generic image captioning, while task-specific image captioning is needed in certain cases. Research on solving image captioning problems in various special cases will also be interesting.

7. Conclusion

In this paper, we present a survey on image captioning. Based on the technique adopted in each method, we classify image captioning approaches into different categories. Representative methods in each category are summarized, and the strengths and limitations of each type of work are discussed. We first discuss early image captioning work, which is mainly retrieval based and template based. Then, our main attention is focused on neural network based methods, which give state of the art results. Because different frameworks are used in neural network based methods, we further divide them into subcategories and discuss each subcategory, respectively. After that, state of the art methods are compared on benchmark datasets. Finally, we present a discussion on future research directions of automatic image captioning.

Acknowledgment

This work was supported by National Natural Science Foundation of China (61602027).

References

[1] L. Fei-Fei, A. Iyer, C. Koch, P. Perona, What do we perceive in a glance of a real-world scene? J. Vis. 7 (1) (2007) 1–29.
[2] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1627–1645.
[3] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 580–587.
[4] C.H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object classes by between class attribute transfer, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 951–958.
[5] C. Gan, T. Yang, B. Gong, Learning attributes equals multi-source domain generalization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2016, pp. 87–97.
[6] L. Bourdev, J. Malik, S. Maji, Action recognition from a distributed representation of pose and appearance, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, 2011, pp. 3177–3184.
[7] Y.-W. Chao, Z. Wang, R. Mihalcea, J. Deng, Mining semantic affordances of visual object categories, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 4259–4267.
[8] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of the Twenty Fifth International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.
[9] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva, Learning deep features for scene recognition using places database, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2014, pp. 487–495.
[10] Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in: Proceedings of the European Conference on Computer Vision, 2014, pp. 392–407.
[11] A. Kojima, T. Tamura, K. Fukunaga, Natural language description of human activities from video images based on concept hierarchy of actions, Int. J. Comput. Vis. 50 (2002) 171–184.
[12] P. Hede, P. Moellic, J. Bourgeoys, M. Joint, C. Thomas, Automatic generation of natural language descriptions for images, in: Proceedings of Recherche d'Information Assistée par Ordinateur, 2004.
[13] A. Farhadi, M. Hejrati, M.A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, D. Forsyth, Every picture tells a story: generating sentences from images, in: Proceedings of the European Conference on Computer Vision, 2010, pp. 15–29.
[14] Y. Yang, C.L. Teo, H. Daume, Y. Aloimonos, Corpus-guided sentence generation of natural images, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011, pp. 444–454.
[15] V. Ordonez, G. Kulkarni, T.L. Berg, Im2Text: describing images using 1 million captioned photographs, in: Proceedings of the Advances in Neural Information Processing Systems, 2011, pp. 1143–1151.
[16] A. Gupta, Y. Verma, C.V. Jawahar, Choosing linguistics over vision to describe images, in: Proceedings of the AAAI Conference on Artificial Intelligence, 5, 2012.
[17] H. Goh, N. Thome, M. Cord, J. Lim, Learning deep hierarchical visual feature coding, IEEE Trans. Neural Netw. Learn. Syst. 25 (12) (2014) 2212–2225.
[18] Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828.
[19] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF: a deep convolutional activation feature for generic visual recognition, in: Proceedings of the Thirty First International Conference on Machine Learning, 2014, pp. 647–655.
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, arXiv:1408.5093v1 (2014).
[21] N. Zhang, S. Ding, J. Zhang, Y. Xue, Research on point-wise gated deep networks, Appl. Soft Comput. 52 (2017) 1210–1221.
[22] J.P. Papa, W. Scheirer, D.D. Cox, Fine-tuning deep belief networks using harmony search, Appl. Soft Comput. 46 (2016) 875–885.
[23] C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8).
[24] E.P. Ijjina, C.K. Mohan, Hybrid deep neural network model for human action recognition, Appl. Soft Comput. 46 (2016) 936–952.
[25] S. Wang, Y. Jiang, F.-L. Chung, P. Qian, Feedforward kernel neural networks, generalized least learning machine, and its deep learning with application to image classification, Appl. Soft Comput. 37 (2015) 125–141.
[26] S. Bai, Growing random forest on deep convolutional neural networks for scene categorization, Expert Syst. Appl. 71 (2017) 279–287.
[27] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv:1409.0473v7 (2016).
[28] K. Cho, B.V. Merrinboer, C. Gulcehre, Learning phrase representations using RNN encoder–decoder for statistical machine translation, arXiv:1406.1078v3 (2014).
[29] R. Collobert, J. Weston, A unified architecture for natural language processing: deep neural networks with multitask learning, in: Proceedings of the Twenty Fifth International Conference on Machine Learning, 2008, pp. 160–167.
[30] A. Mnih, G. Hinton, Three new graphical models for statistical language modelling, in: Proceedings of the Twenty Fourth International Conference on Machine Learning, 2007, pp. 641–648.
[31] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proceedings of the Advances in Neural Information Processing Systems, 2013.
[32] M. Hodosh, P. Young, J. Hockenmaier, Framing image description as a ranking task: data, models and evaluation metrics, J. Artif. Intell. Res. 47 (2013) 853–899.
[33] H. Fang, S. Gupta, F. Iandola, R. Srivastava, From captions to visual concepts and back, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1473–1482.
[34] J. Donahue, L. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, Long-term recurrent convolutional networks for visual recognition and description, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
[35] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, A. Yuille, Deep captioning with multimodal recurrent neural networks, in: Proceedings of the International Conference on Learning Representation, 2015.
[36] L.A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell, Deep compositional captioning: describing novel object categories without paired training data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1–10.
[37] A. Karpathy, A. Joulin, F. Li, Deep fragment embeddings for bidirectional image sentence mapping, in: Proceedings of the Twenty Seventh Advances in Neural Information Processing Systems (NIPS), 3, 2014, pp. 1889–1897.
[38] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C.L. Zitnick, D. Parikh, VQA: visual question answering, arXiv:1505.00468v7 (2016).
[39] M. Malinowski, M. Fritz, A multi-world approach to question answering about real-world scenes based on uncertain input, in: Proceedings of the Advances in Neural Information Processing Systems, pp. 1682–1690.
[40] M. Malinowski, M. Rohrbach, M. Fritz, Ask your neurons: a neural-based approach to answering questions about images, in: Proceedings of the International Conference on Computer Vision, 2015.
[41] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, W. Xu, Are you talking to a machine? Dataset and methods for multilingual image question answering, in: Proceedings of the Advances in Neural Information Processing Systems, pp. 2296–2304.
[42] D. Geman, S. Geman, N. Hallonquist, L. Younes, Visual Turing test for computer vision systems, in: Proceedings of the National Academy of Sciences of the United States of America, vol. 112, pp. 3618–3623.
[43] Y. Feng, M. Lapata, Automatic caption generation for news images, IEEE Trans. Pattern Anal. Mach. Intell. 35 (4).
[44] A. Tariq, H. Foroosh, A context-driven extractive framework for generating realistic image descriptions, IEEE Trans. Image Process. 26 (2).
[45] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, K. Saenko, YouTube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition, in: Proceedings of the International Conference on Computer Vision, pp. 2712–2719.
[46] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, R. Mooney, Integrating language and vision to generate natural language descriptions of videos in the wild, in: Proceedings of the International Conference on Computational Linguistics, 2014.
[47] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, K. Saenko, Sequence to sequence – video to text, in: Proceedings of the International Conference on Computer Vision, 2015.
[48] S. Venugopalan, L. Hendricks, R. Mooney, K. Saenko, Improving LSTM-based video description with linguistic knowledge mined from text, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016.
[49] R. Mason, E. Charniak, Nonparametric method for data driven image captioning, in: Proceedings of the Fifty Second Annual Meeting of the Association for Computational Linguistics, 2014.
[50] P. Kuznetsova, V. Ordonez, T. Berg, Y. Choi, TREETALK: composition and compression of trees for image descriptions, Trans. Assoc. Comput. Linguist. 2 (10) (2014) 351–362.
[51] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A.C. Berg, T.L. Berg, BabyTalk: understanding and generating simple image descriptions, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 2891–2903.
[52] S. Li, G. Kulkarni, T.L. Berg, A.C. Berg, Y. Choi, Composing simple image descriptions using web-scale n-grams, in: Proceedings of the Fifteenth Conference on Computational Natural Language Learning, 2011.
[53] M. Mitchell, J. Dodge, A. Goyal, K. Yamaguchi, K. Stratos, X. Han, A. Mensch, A. Berg, T. Berg, H. Daume, Midge: generating image descriptions from computer vision detections, in: Proceedings of the Thirteenth Conference of the European Chapter of the Association for Computational Linguistics, 2012.
[54] Y. Ushiku, M. Yamaguchi, Y. Mukuta, T. Harada, Common subspace for model and similarity: phrase learning for caption generation from images, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2668–2676.
[55] R. Socher, A. Karpathy, Q.V. Le, C.D. Manning, A.Y. Ng, Grounded compositional semantics for finding and describing images with sentences, TACL 2 (2014) 207–218.
[56] L. Ma, Z. Lu, L. Shang, H. Li, Multimodal convolutional neural networks for matching image and sentences, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2623–2631.
[57] F. Yan, K. Mikolajczyk, Deep correlation for matching images and text, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3441–3450.
[58] R. Lebret, P.O. Pinheiro, R. Collobert, Phrase-based image captioning, in: Proceedings of the International Conference on Machine Learning, 2015.
[59] R. Kiros, R. Zemel, R. Salakhutdinov, Multimodal neural language models, in: Proceedings of the International Conference on Machine Learning, 2014.
[60] J. Mao, W. Xu, Y. Yang, J. Wang, A.L. Yuille, Explain images with multimodal recurrent neural networks, arXiv:1410.1090v1 (2014).
[61] A. Karpathy, F. Li, Deep visual-semantic alignments for generating image descriptions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[62] X. Chen, C. Zitnick, Mind's eye: a recurrent visual representation for image caption generation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2422–2431.
[63] R. Kiros, R. Salakhutdinov, R. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, arXiv:1411.2539 (2014).
[64] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: a neural image caption generator, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[65] X. Jia, E. Gavves, B. Fernando, T. Tuytelaars, Guiding the long-short term memory model for image caption generation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2407–2415.
[66] Q. Wu, C. Shen, L. Liu, A. Dick, A. van den Hengel, What value do explicit high level concepts have in vision to language problems? in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 203–212.
[67] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, L. Carin, Variational autoencoder for deep learning of images, labels and captions, in: Proceedings of the Advances in Neural Information Processing Systems, 2016.
[68] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio, Show, attend and tell: neural image caption generation with visual attention, arXiv:1502.03044v3 (2016).
[69] Q. You, H. Jin, Z. Wang, C. Fang, J. Luo, Image captioning with semantic attention, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4651–4659.
[70] Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, W.W. Cohen, Review networks for caption generation, in: Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 2361–2369.
[71] K. Tran, X. He, L. Zhang, J. Sun, Rich image captioning in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 434–441.
[72] K. Fu, J. Jin, R. Cui, F. Sha, C. Zhang, Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts, IEEE Trans. Pattern Anal. Mach. Intell. (2016).
[73] S. Ma, Y. Han, Describing images by feeding LSTM with structural words, in: Proceedings of the IEEE International Conference on Multimedia and Expo, 2016, pp. 1–6.
[74] R. Oruganti, S. Sah, S. Pillai, R. Ptucha, Image description through fusion based recurrent multi-modal learning, in: Proceedings of the IEEE International Conference on Image Processing, 2016, pp. 3613–3617.
[75] M. Wang, L. Song, X. Yang, C. Luo, A parallel-fusion RNN-LSTM architecture for image caption generation, in: Proceedings of the IEEE International Conference on Image Processing, 2016.
[76] J. Mao, X. Wei, Y. Yang, J. Wang, Learning like a child: fast novel visual concept learning from sentence descriptions of images, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2533–2541.
[77] D. Lin, An information-theoretic definition of similarity, in: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 296–304.
[78] J. Curran, S. Clark, J. Bos, Linguistically motivated large-scale NLP with C&C and Boxer, in: Proceedings of the Forty Fifth Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 33–36.
[79] F.R. Bach, M.I. Jordan, Kernel independent component analysis, J. Mach. Learn. Res. 3 (2002) 1–48.
[80] D.R. Hardoon, S.R. Szedmak, J.R. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Comput. 16 (2004) 2639–2664.
[81] D. Roth, W.-t. Yih, A linear programming formulation for global inference in natural language tasks, in: Proceedings of the Annual Conference on Computational Natural Language Learning, 2004.
[82] J. Clarke, M. Lapata, Global inference for sentence compression: an integer linear programming approach, J. Artif. Intell. Res. 31 (2008) 339–429.
[83] P. Kuznetsova, V. Ordonez, A.C. Berg, T.L. Berg, Y. Choi, Collective generation of natural image descriptions, in: Proceedings of the Meeting of the Association for Computational Linguistics, 2012.
[84] Y. Ushiku, T. Harada, Y. Kuniyoshi, Efficient image annotation for automatic sentence generation, in: Proceedings of the Twentieth ACM International Conference on Multimedia, 2012.
[85] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vis. 42 (3) (2001) 145–175.
[86] T. Dunning, Accurate methods for the statistics of surprise and coincidence, Comput. Linguist. 19 (1) (1993) 61–74.
[87] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A.C. Berg, T.L. Berg, Baby talk: understanding and generating simple image descriptions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[88] P. Koehn, Europarl: a parallel corpus for statistical machine translation, in: MT Summit, 2005.
[89] A. Farhadi, M.A. Sadeghi, Phrasal recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 2854–2865.
[90] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[91] A. Frome, G.S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, DeViSE: a deep visual-semantic embedding model, in: Proceedings of the Twenty Sixth International Conference on Neural Information Processing Systems, 2013, pp. 2121–2129.
[92] Q.V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A.Y. Ng, Building high-level features using large scale unsupervised learning, in: Proceedings of the International Conference on Machine Learning, 2012.
[93] M. Marneffe, B. Maccartney, C. Manning, Generating typed dependency parses from phrase structure parses, in: Proceedings of the LREC, 2006, pp. 449–454.
[94] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556v6 (2015).
[95] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, arXiv:1409.4842 (2014).
[96] B. Hu, Z. Lu, H. Li, Q. Chen, Convolutional neural network architectures for matching natural language sentences, in: Proceedings of the Twenty Seventh International Conference on Neural Information Processing Systems, 2014, pp. 2042–2050.
[97] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, arXiv:1404.2188v1 (2014).
[98] G. Andrew, R. Arora, J. Bilmes, K. Livescu, Deep canonical correlation analysis, in: Proceedings of the International Conference on Machine Learning, 2013, pp. 1247–1255.
[99] A. Mnih, K. Kavukcuoglu, Learning word embeddings efficiently with noise-contrastive estimation, in: Proceedings of the Advances in Neural Information Processing Systems, 2013, pp. 2265–2273.
[100] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv:1301.3781v3 (2013).
[101] J.L. Elman, Finding structure in time, Cognit. Sci. 14 (2) (1990) 179–211.
[102] M. Schuster, K. Paliwal, Bidirectional recurrent neural networks, IEEE Trans. Signal Process. 45 (11) (1997) 2673–2681.
[103] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. 5 (5).
[104] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, S. Khudanpur, Recurrent neural network based language model, in: Proceedings of the Conference of the International Speech Communication Association, 2010, pp. 1045–1048.
[105] N. Kalchbrenner, P. Blunsom, Recurrent continuous translation models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013.
[106] I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2014.
[107] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[108] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4).
[109] K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, J. Schmidhuber, LSTM: a search space odyssey, arXiv:1503.04069v2 (2017).
[110] Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, S. Yan, CNN: single-label to multi-label, arXiv:1406.5726v3 (2014) 1–14.
[111] R.A. Rensink, The dynamic representation of scenes, Vis. Cognit. 7 (1) (2000) 17–42.
[112] M. Spratling, M.H. Johnson, A feedback model of visual attention, J. Cognit. Neurosci. 16 (2) (2004) 219–237.
[113] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv:1409.0473v7 (2017).
[114] J. Ba, V. Mnih, K. Kavukcuoglu, Multiple object recognition with visual attention, in: Proceedings of the International Conference on Learning Representation, 2015.
[115] V. Mnih, N. Heess, A. Graves, K. Kavukcuoglu, Recurrent models of visual attention, in: Proceedings of the Advances in Neural Information Processing Systems, 2014.
[116] D. Elliott, F. Keller, Image description using visual dependency representations, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1292–1302.
[117] C. Zhang, J.C. Platt, P.A. Viola, Multiple instance boosting for object detection, in: Proceedings of the Advances in Neural Information Processing Systems, 2005, pp. 1419–1426.
[118] A.L. Berger, S.A.D. Pietra, V.J.D. Pietra, A maximum entropy approach to natural language processing, Comput. Linguist. 22 (1) (1996) 39–71.
[119] A. Ratnaparkhi, Trainable methods for surface natural language generation, in: Proceedings of the North American Chapter of the Association for Computational Linguistics Conference, 2000, pp. 194–201.
[120] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[121] J.R. Uijlings, K.E. van de Sande, T. Gevers, A.W. Smeulders, Selective search for object recognition, Int. J. Comput. Vis. 104 (2) (2013) 154–171.
[122] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993–1022.
[123] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the Meeting of the Association for Computational Linguistics, vol. 4, 2002.
[124] C.-Y. Lin, F.J. Och, Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics, in: Proceedings of the Meeting of the Association for Computational Linguistics, 2004.
[125] A. Lavie, A. Agarwal, METEOR: an automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the Second Workshop on Statistical Machine Translation, 2007, pp. 228–231.
[126] R. Vedantam, C.L. Zitnick, D. Parikh, CIDEr: consensus-based image description evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[127] P. Young, A. Lai, M. Hodosh, J. Hockenmaier, From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions, in: Proceedings of the Meeting of the Association for Computational Linguistics, 2014, pp. 67–78.
[128] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollar, C. Zitnick, Microsoft COCO captions: data collection and evaluation server, arXiv:1504.00325v2 (2015).
[129] K. Cho, A. Courville, Y. Bengio, Describing multimedia content using attention-based encoder–decoder networks, IEEE Trans. Multimed. 17 (11) (2015) 1875–1886.
[130] J. Johnson, A. Karpathy, L. Fei-Fei, DenseCap: fully convolutional localization networks for dense captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4565–4574.

Shuang Bai received the degrees of B.Eng. and M.Eng. from the School of Electrical Engineering and Automation of Tianjin University, Tianjin, China in 2007 and 2009, respectively. In 2013, he received the degree of D.Eng. in the Graduate School of Information Science of Nagoya University. Currently, he is an associate professor in the School of Electronic and Information Engineering of Beijing Jiaotong University, Beijing, China. His research interests include machine learning and computer vision.

Shan An received the degree of B.Eng. from the School of Electrical Engineering and Automation of Tianjin University, China in 2007 and received the degree of M.Eng. from the School of Control Science and Engineering of Shandong University, China, in 2010. Currently, he is a senior algorithm engineer in JD.COM. Before joining JD.COM, he worked for China Academy of Space Technology and Alibaba.com. His research interests include machine learning and computer vision.