Fang 2015
Microsoft Research
Abstract
This paper presents a novel approach for automatically
generating image descriptions: visual detectors, language
models, and multimodal similarity models learnt directly
from a dataset of image captions. We use multiple instance
learning to train visual detectors for words that commonly
occur in captions, including many different parts of speech
such as nouns, verbs, and adjectives. The word detector
outputs serve as conditional inputs to a maximum-entropy
language model. The language model learns from a set of
over 400,000 image descriptions to capture the statistics
of word usage. We capture global semantics by re-ranking
caption candidates using sentence-level features and a deep
multimodal similarity model. Our system is state-of-the-art
on the official Microsoft COCO benchmark, producing a
BLEU-4 score of 29.1%. When human judges compare the
system captions to ones written by other people on our held-
out test set, the system captions have equal or better quality
34% of the time.
for generating image captions by conditioning its output on image features extracted by a convolutional neural network. More recently, Donahue et al. [9] also applied a similar model to video description. Lebret et al. [25] have investigated the use of a phrase-based model for generating captions, while Xu et al. [46] have proposed a model based on visual attention.

Unlike these approaches, in this work we detect words by applying a CNN to image regions [13] and integrating the information with MIL [49]. We minimize a priori assumptions about how sentences should be structured by training directly from captions. Finally, in contrast to [20, 29], we formulate the problem of generation as an optimization problem and search for the most likely sentence [40].

3. Word Detection

The first step in our caption generation pipeline detects a set of words that are likely to be part of the image's description. These words may belong to any part of speech, including nouns, verbs, and adjectives. We determine our vocabulary V using the 1000 most common words in the training captions, which cover over 92% of the word occurrences in the training data (available on the project webpage^1).

3.1. Training Word Detectors

Given a vocabulary of words, our next goal is to detect the words from images. We cannot use standard supervised learning techniques for learning detectors, since we do not know the image bounding boxes corresponding to the words. In fact, many words relate to concepts for which bounding boxes may not be easily defined, such as open or beautiful. One possible approach is to use image classifiers that take as input the entire image. As we show in Section 6, this leads to worse performance, since many words or concepts only apply to image sub-regions. Instead, we learn our detectors using the weakly-supervised approach of Multiple Instance Learning (MIL) [30, 49].

Figure 2. Multiple Instance Learning detections for cat, red, flying and two (left to right, top to bottom). View in color.

For each word w ∈ V, MIL takes as input sets of "positive" and "negative" bags of bounding boxes, where each bag corresponds to one image i. A bag b_i is said to be positive if word w is in image i's description, and negative otherwise. Intuitively, MIL performs training by iteratively selecting instances within the positive bags, followed by re-training the detector using the updated positive labels.

We use a noisy-OR version of MIL [49], where the probability of bag b_i containing word w is calculated from the probabilities of individual instances in the bag:

    1 − ∏_{j ∈ b_i} (1 − p^w_{ij})    (1)

where p^w_{ij} is the probability that a given image region j in image i corresponds to word w. We compute p^w_{ij} using a multi-layered architecture [21, 42]^2, by computing a logistic function on top of the fc7 layer (this can be expressed as a fully connected fc8 layer followed by a sigmoid layer):

    p^w_{ij} = 1 / (1 + exp(−(v_w^T φ(b_{ij}) + u_w)))    (2)

where φ(b_{ij}) is the fc7 representation for image region j in image i, and v_w, u_w are the weights and bias associated with word w.
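As an illustration of Equations (1) and (2), the following sketch computes the image-level word probability from per-region fc7 features with a single noisy-OR pooling step. The array shapes and names (fc7_regions, v_w, u_w) are assumptions made for this sketch, not the authors' released code.

```python
import numpy as np

def word_probability(fc7_regions, v_w, u_w):
    """Noisy-OR MIL probability that image i contains word w.

    fc7_regions: (num_regions, 4096) array, one fc7 vector per region b_ij.
    v_w, u_w:    per-word weight vector (4096,) and scalar bias from Eq. (2).
    """
    # Eq. (2): logistic function on top of fc7 gives p^w_ij for each region.
    p_ij = 1.0 / (1.0 + np.exp(-(fc7_regions @ v_w + u_w)))
    # Eq. (1): noisy-OR over the bag; the image is positive for w unless
    # every region fails to fire.
    return 1.0 - np.prod(1.0 - p_ij)

# Toy usage: a 12-region bag with random features.
rng = np.random.default_rng(0)
p_w_i = word_probability(rng.normal(size=(12, 4096)),
                         0.01 * rng.normal(size=4096), -2.0)
```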
We express the fully connected layers (fc6, fc7, fc8) of these networks as convolutions to obtain a fully convolutional network. When this fully convolutional network is run over the image, we obtain a coarse spatial response map. Each location in this response map corresponds to the response obtained by applying the original CNN to overlapping shifted regions of the input image (thereby effectively scanning different locations in the image for possible objects). We up-sample the image so that its longer side is 565 pixels, which gives a 12 × 12 response map at fc8 for both [21, 42] and corresponds to sliding a 224 × 224 bounding box over the up-sampled image with a stride of 32. The noisy-OR version of MIL is then implemented on top of this response map to generate a single probability p^w_i for each word and each image. We use a cross-entropy loss and optimize the CNN end-to-end for this task with stochastic gradient descent. We use one image in each batch and train for 3 epochs. For initialization, we use the network pre-trained on ImageNet [7].

^1 http://research.microsoft.com/image_captioning
^2 We denote the CNN from [21] as AlexNet and the 16-layer CNN from [42] as VGG for subsequent discussion. We use the code base and models available from the Caffe Model Zoo https://github.com/BVLC/caffe/wiki/Model-Zoo [17].
3.2. Generating Word Scores for a Test Image
Given a novel test image i, we up-sample the image and forward propagate it through the fully convolutional network described above to obtain p^w_i for each word w ∈ V.
Table 1. Features used in the maximum entropy language model, which models Pr(w_l = w̄_l | w̄_{l−1}, · · · , w̄_1, <s>, Ṽ_{l−1}).

Attribute (type 0/1): w̄_l ∈ Ṽ_{l−1}. The predicted word is in the attribute set, i.e., it has been visually detected and not yet used.
N-gram+ (type 0/1): w̄_{l−N+1}, · · · , w̄_l = κ and w̄_l ∈ Ṽ_{l−1}. The N-gram ending in the predicted word is κ and the predicted word is in the attribute set.
N-gram− (type 0/1): w̄_{l−N+1}, · · · , w̄_l = κ and w̄_l ∉ Ṽ_{l−1}. The N-gram ending in the predicted word is κ and the predicted word is not in the attribute set.
End (type 0/1): w̄_l = κ and Ṽ_{l−1} = ∅. The predicted word is κ and all attributes have been mentioned.
Score (type R): score(w̄_l) when w̄_l ∈ Ṽ_{l−1}. The log-probability of the predicted word when it is in the attribute set.
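The features in Table 1 can be read as indicator functions of the candidate word, its N-gram history, and the set of not-yet-mentioned attributes. The sparse dictionary encoding and the argument names below are assumptions made for illustration; this is not the authors' feature extractor.

```python
def me_lm_features(candidate, history, remaining_attrs, det_score, n=3):
    """Table 1 style features for one candidate next word w̄_l.

    candidate:       proposed word w̄_l.
    history:         previously generated words w̄_1 .. w̄_{l-1} (a list).
    remaining_attrs: set Ṽ_{l-1} of detected words not yet mentioned.
    det_score:       log-probability score(w̄_l) from the word detector.
    n:               N-gram order (assumed value for this sketch).
    """
    ngram = tuple(history[-(n - 1):] + [candidate])  # N-gram ending in w̄_l
    in_attrs = candidate in remaining_attrs
    return {
        ("attribute",): 1.0 if in_attrs else 0.0,                 # Attribute
        ("ngram+",) + ngram: 1.0 if in_attrs else 0.0,            # N-gram+
        ("ngram-",) + ngram: 0.0 if in_attrs else 1.0,            # N-gram-
        ("end", candidate): 1.0 if not remaining_attrs else 0.0,  # End
        ("score",): det_score if in_attrs else 0.0,               # Score
    }
```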
Table 2. Features used by MERT.
1. The log-likelihood of the sequence.
2. The length of the sequence.
3. The log-probability per word of the sequence.
4. The logarithm of the sequence's rank in the log-likelihood.
5. 11 binary features indicating whether the number of mentioned objects is x (x = 0, . . . , 10).
6. The DMSM score between the sequence and the image.

After obtaining the set of completed sentences C, we form an M-best list as follows. Given a target number of T image attributes to be mentioned, the sequences in C covering at least T objects are added to the M-best list, sorted in descending order by the log-likelihood. If there are fewer than M sequences covering at least T objects found in C, we reduce T by 1 until M sequences are found.
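A small sketch of the M-best selection loop just described. Representing each completed hypothesis as a (log_likelihood, words, covered_attributes) tuple is an assumption made for illustration.

```python
def build_m_best(completed, m, t):
    """Form an M-best list from the completed sentences C.

    completed: list of (log_likelihood, words, covered) tuples, where
               `covered` is the set of image attributes the sentence mentions.
    m: target list size M.  t: initial coverage target T.
    """
    while t >= 0:
        # Keep sequences covering at least T attributes, best first.
        kept = sorted((s for s in completed if len(s[2]) >= t),
                      key=lambda s: s[0], reverse=True)
        if len(kept) >= m:
            return kept[:m]
        t -= 1  # relax the coverage target until M sequences are found
    return sorted(completed, key=lambda s: s[0], reverse=True)[:m]
```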
5. Sentence Re-Ranking

Our LM produces an M-best set of sentences. Our final stage uses MERT [35] to re-rank the M sentences. MERT uses a linear combination of features computed over an entire sentence, shown in Table 2. The MERT model is trained on the M-best lists for the validation set using the BLEU metric, and applied to the M-best lists for the test set. Finally, the best sequence after the re-ranking is selected as the caption of the image. Along with standard MERT features, we introduce a new multimodal semantic similarity model, discussed below.
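The re-ranking step itself reduces to scoring each candidate with a learned linear combination of the sentence-level features in Table 2; MERT is only used to tune the weights. The fixed-length feature-vector representation below is an assumption made for this sketch.

```python
import numpy as np

def rerank(feature_vectors, weights):
    """Order M-best captions by a linear combination of Table 2 features.

    feature_vectors: (M, F) array; row k holds the sentence-level features
                     of candidate k (log-likelihood, length, DMSM score, ...).
    weights:         (F,) weights tuned with MERT on the validation set
                     to maximize BLEU.
    Returns candidate indices sorted from best to worst.
    """
    return np.argsort(-(feature_vectors @ weights))
```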
5.1. Deep Multimodal Similarity Model

To model global similarity between images and text, we develop a Deep Multimodal Similarity Model (DMSM). The DMSM learns two neural networks that map images and text fragments to a common vector representation. We measure similarity between images and text by measuring cosine similarity between their corresponding vectors. This cosine similarity score is used by MERT to re-rank the sentences. The DMSM is closely related to the unimodal Deep Structured Semantic Model (DSSM) [16, 41], but extends it to the multimodal setting. The DSSM was initially proposed to model the semantic relevance between textual search queries and documents, and is extended in this work to replace the query vector in the original DSSM by the image vector computed from the deep convolutional network. The DMSM consists of a pair of neural networks, one for mapping each input modality to a common semantic space, which are trained jointly. In training, the data consists of a set of image/caption pairs. The loss function minimized during training represents the negative log posterior probability of the caption given the corresponding image.

Image model: We map images to semantic vectors using the same CNN (AlexNet / VGG) as used for detecting words in Section 3. We first finetune the networks on the COCO dataset for the full image classification task of predicting the words occurring in the image caption. We then extract the fc7 representation from the finetuned network and stack three additional fully connected layers with tanh non-linearities on top of this representation to obtain a final representation of the same size as the last layer of the text model. We learn the parameters in these additional fully connected layers during DMSM training.

Text model: The text part of the DMSM maps text fragments to semantic vectors, in the same manner as in the original DSSM. In general, the text fragments can be a full caption. Following [16] we convert each word in the caption to a letter-trigram count vector, which uses the count distribution of context-dependent letters to represent a word. This representation has the advantage of reducing the size of the input layer while generalizing well to infrequent, unseen and incorrectly spelled words. Then, following [41], this representation is forward propagated through a deep convolutional neural network to produce the semantic vector at the last layer.
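The letter-trigram input used by the text model can be sketched as below. The '#' boundary marker is the usual DSSM-style word-hashing convention; treating a caption as a sequence of per-word count vectors follows the description above, while the exact tokenization is an assumption for this sketch.

```python
from collections import Counter

def letter_trigrams(word):
    """Letter-trigram counts for one word, e.g. 'cat' -> {'#ca', 'cat', 'at#'}."""
    padded = "#" + word.lower() + "#"
    return Counter(padded[k:k + 3] for k in range(len(padded) - 2))

def caption_to_trigram_sequence(caption):
    """One count vector per word; this sequence feeds the convolutional text model."""
    return [letter_trigrams(w) for w in caption.split()]
```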
Objective and training: We define the relevance R as the cosine similarity between an image or query (Q) and a text fragment or document (D) based on their representations y_Q and y_D obtained using the image and text models: R(Q, D) = cosine(y_Q, y_D) = (y_Q^T y_D) / (||y_Q|| ||y_D||). For a given image-text pair, we can compute the posterior probability of the text being relevant to the image via:

    P(D|Q) = exp(γ R(Q, D)) / Σ_{D′ ∈ D} exp(γ R(Q, D′))    (5)

Here γ is a smoothing factor determined using the validation set, which is 10 in our experiments. D denotes the set of all candidate documents (captions) which should be compared to the query (image). We found that restricting D to one matching document D+ and a fixed number N of randomly selected non-matching documents D− worked reasonably well, although using noise-contrastive estimation could further improve results. Thus, for each image we select one relevant text fragment and N non-relevant fragments to compute the posterior probability. N is set to 50 in our experiments. During training, we adjust the model parameters Λ to minimize the negative log posterior probability that the relevant captions are matched to the images:

    L(Λ) = − log ∏_{(Q, D+)} P(D+ | Q)    (6)
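Equations (5) and (6) can be exercised directly on candidate semantic vectors. The sketch below assumes the image vector and caption vectors have already been computed by the two networks; names and shapes are illustrative, not the authors' training code.

```python
import numpy as np

def dmsm_loss(y_q, y_pos, y_negs, gamma=10.0):
    """Negative log posterior of the matching caption (Eqs. 5 and 6).

    y_q:    semantic vector of the image (query Q).
    y_pos:  semantic vector of the matching caption D+.
    y_negs: (N, d) array of vectors for randomly sampled non-matching captions.
    gamma:  smoothing factor (10 in the experiments described above).
    """
    def cosine(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Relevance R(Q, D) for the positive caption followed by the negatives.
    r = np.array([cosine(y_q, y_pos)] + [cosine(y_q, d) for d in y_negs])
    logits = gamma * r
    # Eq. (5): softmax over {D+} and the N sampled D-; Eq. (6): -log of it.
    m = logits.max()
    log_posterior = logits[0] - (m + np.log(np.sum(np.exp(logits - m))))
    return -log_posterior
```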
Figure 4. Qualitative results for images on the PASCAL sentence
dataset. Captions using our approach (black), Midge [32] (blue)
and Baby Talk [22] (red) are shown.
6. Experimental Results

We next describe the datasets used for testing, followed by an evaluation of our approach for word detection and experimental results on sentence generation.

6.1. Datasets

Most of our results are reported on the Microsoft COCO dataset [28, 4]. The dataset contains 82,783 training images and 40,504 validation images. The images create a challenging testbed for image captioning since most images contain multiple objects and significant contextual information. The COCO dataset provides 5 human-annotated captions per image. The test annotations are not available, so we split the validation set into validation and test sets^4.

For experimental comparison with prior papers, we also report results on the PASCAL sentence dataset [38], which contains 1000 images from the 2008 VOC Challenge [11], with 5 human captions each.

6.2. Word Detection

To gain insight into our weakly-supervised approach for word detection using MIL, we measure its accuracy on the word classification task: if a word is used in at least one ground truth caption, it is included as a positive instance. Note that this is a challenging task, since conceptually similar words are classified separately; for example, the words cat/cats/kitten, or run/ran/running, all correspond to different classes. Attempts at adding further supervision, e.g., in the form of lemmas, did not result in significant gains.

Average Precision (AP) and Precision at Human Recall (PHR) [4] results for different parts of speech are shown in Table 3. We report two baselines. The first (Chance) is the result of randomly classifying each word. The second (Classification) is the result of a whole-image classifier which uses features from the AlexNet or VGG CNN [21, 42]. These features were fine-tuned for this word classification task using a logistic regression loss.

As shown in Table 3, the MIL NOR approach improves over both baselines for all parts of speech, demonstrating that better localization can help predict words. In fact, we observe the largest improvement for nouns and adjectives, which often correspond to concrete objects in an image sub-region. Results for both classification and MIL NOR are lower for parts of speech that may be less visually informative and difficult to detect, such as adjectives (e.g., few, which has an AP of 2.5), pronouns (e.g., himself, with an AP of 5.7), and prepositions (e.g., before, with an AP of 1.0). In comparison, words with high AP scores are typically either visually informative (red: AP 66.4, her: AP 45.6) or associated with specific objects (polar: AP 94.6, stuffed: AP 74.2). Qualitative results demonstrating word localization are shown in Figures 2 and 3.

6.3. Caption Generation

We next describe our caption generation results, beginning with a short discussion of evaluation metrics.

Metrics: The sentence generation process is measured using both automatic metrics and human studies. We use three different automatic metrics: PPLX, BLEU [37], and METEOR [1]. PPLX (perplexity) measures the uncertainty of the language model, corresponding to how many bits on average would be needed to encode each word given the language model. A lower PPLX indicates a better score. BLEU [37] is widely used in machine translation and measures the fraction of N-grams (up to 4-gram) that are in common between a hypothesis and a reference or set of references; here we compare against 4 randomly selected references. METEOR [1] measures unigram precision and recall, extending exact word matches to include similar words based on WordNet synonyms and stemmed tokens. We additionally report performance on the metrics made available from the MSCOCO captioning challenge^5, which includes scores for BLEU-1 through BLEU-4, METEOR, CIDEr [44], and ROUGE-L [27].
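Since PPLX is described as the average number of bits needed to encode each word under the language model, it can be computed from the per-word probabilities the model assigns to the reference captions, as in this generic sketch (not the authors' evaluation script).

```python
import math

def perplexity(word_probabilities):
    """PPLX = 2 ** (average bits per word).

    word_probabilities: iterable of P(w_l | w_1..w_{l-1}, detections) values
    assigned by the language model to each ground-truth caption word.
    """
    probs = list(word_probabilities)
    bits_per_word = -sum(math.log2(p) for p in probs) / len(probs)
    return 2.0 ** bits_per_word
```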
All of these automatic metrics are known to only roughly correlate with human judgment [10]. We therefore include human evaluation to further explore the quality of our models. Each task presents a human (Mechanical Turk worker) with an image and two captions: one is automatically generated, and the other is a human caption. The human is asked to select which caption better describes the image, or to choose a "same" option when they are of equal quality. In each experiment, 250 humans were asked to compare 20 caption pairs each, and 5 humans judged each caption pair. We used Crowdflower, which automatically filters out spammers. The ordering of the captions was randomized to avoid bias, and we included four check-cases where the answer was known and obvious; workers who missed any of these were excluded. The final judgment is the majority vote of the judgment of the 5 humans. In ties, one-half of a count is distributed to the two best answers. We also compute error bars on the human results by taking 1000 bootstrap resamples of the majority vote outcome (with ties), then reporting the difference between the mean and the 5th or 95th percentile (whichever is farther from the mean).

^4 We split the COCO train/val set into 82,729 train / 20,243 val / 20,244 test images. Unless otherwise noted, test results are reported on the 20,244 test images drawn from the validation set.
^5 http://mscoco.org/dataset/#cap2015
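The bootstrap error bars described above can be reproduced with a few lines; encoding each image's majority-vote outcome as 1.0 (system wins), 0.5 (tie), or 0.0 (human wins) is an assumption made for this sketch.

```python
import numpy as np

def bootstrap_error_bars(outcomes, n_resamples=1000, seed=0):
    """Error bars from 1000 bootstrap resamples of the majority-vote outcomes.

    outcomes: per-image results, e.g. 1.0 when the system caption wins,
              0.5 for a tie, 0.0 when the human caption wins.
    Returns (mean, lower_diff, upper_diff); the text reports whichever
    difference is farther from the mean.
    """
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    # Resample the per-image outcomes with replacement and recompute the mean.
    means = np.array([rng.choice(outcomes, size=len(outcomes)).mean()
                      for _ in range(n_resamples)])
    mean = outcomes.mean()
    return mean, mean - np.percentile(means, 5), np.percentile(means, 95) - mean
```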
Table 3. Average precision (AP) and Precision at Human Recall (PHR) [4] for words with different parts of speech (NN: Nouns, VB: Verbs, JJ: Adjectives, DT: Determiners, PRP: Pronouns, IN: Prepositions). Results are shown using a chance classifier, full image classification, and Noisy OR multiple instance learning with AlexNet [21] and VGG [42] CNNs. Word counts per part of speech: NN 616, VB 176, JJ 119, DT 10, PRP 11, IN 38, Others 30, All 1000.

Average Precision
                          NN    VB    JJ    DT    PRP   IN    Others  All
Chance                    2.0   2.3   2.5   23.6  4.7   11.9  7.7     2.9
Classification (AlexNet)  32.4  16.7  20.7  31.6  16.8  21.4  15.6    27.1
Classification (VGG)      37.0  19.4  22.5  32.9  19.4  22.5  16.9    30.8
MIL (AlexNet)             36.9  18.0  22.9  31.7  16.8  21.4  15.2    30.4
MIL (VGG)                 41.4  20.7  24.9  32.4  19.1  22.8  16.3    34.0

Precision at Human Recall
                          NN    VB    JJ    DT    PRP   IN    Others  All
Classification (AlexNet)  39.0  27.7  37.0  37.3  26.2  31.5  25.0    35.9
Classification (VGG)      45.3  31.0  37.1  40.2  29.6  33.9  25.5    40.6
MIL (AlexNet)             46.0  29.4  40.1  37.9  25.9  31.5  21.6    40.8
MIL (VGG)                 51.6  33.3  44.3  39.2  29.4  34.3  23.9    45.7
Human Agreement           63.8  35.0  35.9  43.1  32.5  34.3  31.6    52.8
Figure 3. Qualitative results for several randomly chosen images on the Microsoft COCO dataset, with our generated caption (black) and a
human caption (blue) for each image. In the bottom two rows we show localizations for the words used in the sentences. More examples
can be found on the project website1 .
Generation results: Table 4 summarizes our results on the Microsoft COCO dataset. We provide several baselines for experimental comparison, including two baselines that measure the complexity of the dataset: Unconditioned, which generates sentences by sampling an N-gram LM without knowledge of the visual word detectors; and Shuffled Human, which randomly picks another human-generated caption from another image. Both the BLEU and METEOR scores are very low for these approaches, demonstrating the variation and complexity of the Microsoft COCO dataset.

We provide results on seven variants of our end-to-end approach: Baseline is based on visual features from AlexNet and uses the ME LM with all the discrete features described in Table 1. Baseline+Score adds the feature for the word detector score into the ME LM. Both of these versions use the same set of sentence features (excluding the DMSM score) described in Section 5 when re-ranking the captions using MERT. Baseline+Score+DMSM uses the same ME LM as Baseline+Score, but adds the DMSM score as a feature for re-ranking. Baseline+Score+DMSM+ft adds finetuning. VGG+Score+ft and VGG+Score+DMSM+ft are analogous to Baseline+Score and Baseline+Score+DMSM but use finetuned VGG features. Note: the AlexNet baselines without finetuning are from an early version of our system which used object proposals from [50] instead of dense scanning.
Table 4. Caption generation performance for seven variants of our system on the Microsoft COCO dataset. We report performance on
our held out test set (half of the validation set). We report Perplexity (PPLX), BLEU and METEOR, using 4 randomly selected caption
references. Results from human studies of subjective performance are also shown, with error bars in parentheses. Our final System
“VGG+Score+DMSM+ft” is “same or better” than human 34% of the time.
System PPLX BLEU METEOR ≈human >human ≥human
1. Unconditioned 24.1 1.2% 6.8%
2. Shuffled Human – 1.7% 7.3%
3. Baseline 20.9 16.9% 18.9% 9.9% (±1.5%) 2.4% (±0.8%) 12.3% (±1.6%)
4. Baseline+Score 20.2 20.1% 20.5% 16.9% (±2.0%) 3.9% (±1.0%) 20.8% (±2.2%)
5. Baseline+Score+DMSM 20.2 21.1% 20.7% 18.7% (±2.1%) 4.6% (±1.1%) 23.3% (±2.3%)
6. Baseline+Score+DMSM+ft 19.2 23.3% 22.2% – – –
7. VGG+Score+ft 18.1 23.6% 22.8% – – –
8. VGG+Score+DMSM+ft 18.1 25.7% 23.6% 26.2% (±2.1%) 7.8% (±1.3%) 34.0% (±2.5%)
Human-written captions – 19.3% 24.1%
Table 5. Official COCO evaluation server results on the test set (40,775 images). The first row shows results using 5 reference captions; the second row, 40 references. Human results are reported in parentheses.

          CIDEr         BLEU-4        BLEU-1        ROUGE-L       METEOR
5 refs    .912 (.854)   .291 (.217)   .695 (.663)   .519 (.484)   .247 (.252)
40 refs   .925 (.910)   .567 (.471)   .880 (.880)   .662 (.626)   .331 (.335)

As shown in Table 4, the PPLX of the ME LM with and without the word detector score feature is roughly the same, but BLEU and METEOR improve with the addition of the word detector scores in the ME LM. Performance improves further with the addition of the DMSM scores in re-ranking. Surprisingly, the BLEU scores are actually above those produced by human-generated captions (25.69% vs. 19.32%). Improvements in performance using the DMSM scores with the VGG model are statistically significant as measured by 4-gram overlap and METEOR per image (Wilcoxon signed-rank test, p < .001).

We also evaluated an approach (not shown) with whole-image classification rather than MIL. We found this approach to under-perform relative to MIL in the same setting (for example, using the VGG+Score+DMSM+ft setting, PPLX = 18.9, BLEU = 21.9%, METEOR = 21.4%). This suggests that integrating information about words associated with image regions via MIL leads to improved performance over image classification alone.

The VGG+Score+DMSM approach produces captions that are judged to be of the same or better quality than human-written descriptions 34% of the time, which is a significant improvement over the Baseline results. Qualitative results are shown in Figure 3, and many more are available on the project website.

COCO evaluation server results: We further generated the captions for the images in the actual COCO test set consisting of 40,775 images (human captions for these images are not available publicly), and evaluated them on the COCO evaluation server. These results are summarized in Table 5. Our system gives a BLEU-4 score of 29.1%, and equals or surpasses human performance on 12 of the 14 metrics reported (the only system to do so). These results are also state-of-the-art on all 14 reported metrics among the four other results available publicly at the time of writing this paper. In particular, our system is the only one exceeding the human CIDEr score; CIDEr has been specifically proposed for evaluating image captioning systems [44].

To enable direct comparison with previous work on automatic captioning, we also test on the PASCAL sentence dataset [38], using the 847 images tested for both the Midge [32] and Baby Talk [22] systems. We show significantly improved results over the Midge [32] system, as measured by both BLEU and METEOR (2.0% vs. 17.6% BLEU and 9.2% vs. 19.2% METEOR)^6. To give a basic sense of the progress quickly being made in this field, Figure 4 shows output from the system on the same images.^7

7. Conclusion

This paper presents a new system for generating novel captions from images. The system trains on images and corresponding captions, and learns to extract nouns, verbs, and adjectives from regions in the image. These detected words then guide a language model to generate text that reads well and includes the detected words. Finally, we use a global deep multimodal similarity model introduced in this paper to re-rank candidate captions.

At the time of writing, our system is state-of-the-art on all 14 official metrics of the COCO image captioning task, and equals or exceeds human performance on 12 out of the 14 official metrics. Our generated captions have been judged by humans (Mechanical Turk workers) to be equal to or better than human-written captions 34% of the time.

^6 Baby Talk generates long, multi-sentence captions, making comparison by BLEU/METEOR difficult; we thus exclude evaluation here.
^7 Images were selected visually, without viewing system captions.
References

[1] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005. 2, 6
[2] A. L. Berger, S. A. D. Pietra, and V. J. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 1996. 2, 4
[3] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010. 2
[4] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 2, 6, 7
[5] X. Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from web data. In ICCV, 2013. 1
[6] X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. CVPR, 2015. 2
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 2, 3
[8] S. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014. 1
[9] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. CVPR, 2015. 2, 3
[10] D. Elliott and F. Keller. Comparing automatic evaluation measures for image description. In ACL, 2014. 6
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, June 2010. 2, 6
[12] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010. 2
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 3
[14] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In CVPR, 2014. 4
[15] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47:853–899, 2013. 2
[16] P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. 5
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 3
[18] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015. 2
[19] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. arXiv preprint arXiv:1406.5679, 2014. 2
[20] R. Kiros, R. Zemel, and R. Salakhutdinov. Multimodal neural language models. In NIPS Deep Learning Workshop, 2013. 2, 3
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 2, 3, 6, 7
[22] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011. 1, 2, 6, 8
[23] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012. 2
[24] R. Lau, R. Rosenfeld, and S. Roukos. Trigger-based language models: A maximum entropy approach. In ICASSP, 1993. 4
[25] R. Lebret, P. O. Pinheiro, and R. Collobert. Phrase-based image captioning. arXiv preprint arXiv:1502.03671, 2015. 2, 3
[26] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In CoNLL, 2011. 2
[27] C.-Y. Lin and F. J. Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL, 2004. 6
[28] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 2, 6
[29] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014. 2, 3
[30] O. Maron and T. Lozano-Pérez. A framework for multiple-instance learning. NIPS, 1998. 2, 3
[31] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Cernocky. Strategies for training large scale neural network language models. In ASRU, 2011. 4
[32] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012. 2, 6, 8
[33] A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. In ICML, 2007. 4
[34] A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. In ICML, 2012. 4
[35] F. J. Och. Minimum error rate training in statistical machine translation. In ACL, 2003. 2, 5
[36] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, 2011. 2
[37] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002. 2, 6
[38] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010. 2, 6, 8
[39] A. Ratnaparkhi. Trainable methods for surface natural language generation. In NAACL, 2000. 4
[40] A. Ratnaparkhi. Trainable approaches to surface natural language generation and their application to conversational dialog systems. Computer Speech & Language, 16(3):435–455, 2002. 2, 3
[41] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. 5
[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 2, 3, 6, 7
[43] R. Socher, Q. Le, C. Manning, and A. Ng. Grounded compositional semantics for finding and describing images with sentences. In NIPS Deep Learning Workshop, 2013. 2
[44] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. CoRR, abs/1411.5726, 2014. 6, 8
[45] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. CVPR, 2015. 2
[46] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015. 2, 3
[47] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, 2011. 1, 2
[48] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8):1485–1508, 2010. 2
[49] C. Zhang, J. C. Platt, and P. A. Viola. Multiple instance boosting for object detection. In NIPS, 2005. 2, 3
[50] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014. 8
[51] C. L. Zitnick and D. Parikh. Bringing semantics into focus using visual abstraction. In CVPR, 2013. 1