
From Captions to Visual Concepts and Back

Hao Fang∗ Saurabh Gupta∗ Forrest Iandola∗ Rupesh K. Srivastava∗


Li Deng Piotr Dollár† Jianfeng Gao Xiaodong He
Margaret Mitchell John C. Platt‡ C. Lawrence Zitnick Geoffrey Zweig

Microsoft Research

Abstract
This paper presents a novel approach for automatically
generating image descriptions: visual detectors, language
models, and multimodal similarity models learnt directly
from a dataset of image captions. We use multiple instance
learning to train visual detectors for words that commonly
occur in captions, including many different parts of speech
such as nouns, verbs, and adjectives. The word detector
outputs serve as conditional inputs to a maximum-entropy
language model. The language model learns from a set of
over 400,000 image descriptions to capture the statistics
of word usage. We capture global semantics by re-ranking
caption candidates using sentence-level features and a deep
multimodal similarity model. Our system is state-of-the-art
on the official Microsoft COCO benchmark, producing a
BLEU-4 score of 29.1%. When human judges compare the
system captions to ones written by other people on our held-
out test set, the system captions have equal or better quality
34% of the time.

Figure 1. An illustrative example of our pipeline.

1. Introduction

When does a machine “understand” an image? One definition is when it can generate a novel caption that summarizes the salient content within an image. This content may include objects that are present, their attributes, or their relations with each other. Determining the salient content requires not only knowing the contents of an image, but also deducing which aspects of the scene may be interesting or novel through commonsense knowledge [51, 5, 8].

This paper describes a novel approach for generating image captions from samples. We train our caption generator from a dataset of images and corresponding image descriptions. Previous approaches to generating image captions relied on object, attribute, and relation detectors learned from separate hand-labeled training data [47, 22].

The direct use of captions in training has three distinct advantages. First, captions only contain information that is inherently salient. For example, a dog detector trained from images with captions containing the word dog will be biased towards detecting dogs that are salient and not those that are in the background. Image descriptions also contain a variety of word types, including nouns, verbs, and adjectives. As a result, we can learn detectors for a wide variety of concepts. While some concepts, such as riding or beautiful, may be difficult to learn in the abstract, these terms may be highly correlated to specific visual patterns (such as a person on a horse or mountains at sunset).

∗ H. Fang, S. Gupta, F. Iandola and R. K. Srivastava contributed equally to this work while doing internships at Microsoft Research. Current affiliations are H. Fang: University of Washington; S. Gupta and F. Iandola: University of California at Berkeley; R. K. Srivastava: IDSIA, USI-SUPSI.
† P. Dollár is currently at Facebook AI Research.
‡ J. Platt is currently at Google.



Second, training a language model (LM) on image captions captures commonsense knowledge about a scene. A language model can learn that a person is more likely to sit on a chair than to stand on it. This information disambiguates noisy visual detections.

Third, by learning a joint multimodal representation on images and their captions, we are able to measure the global similarity between images and text, and select the most suitable description for the image.

An overview of our approach is shown in Figure 1. First, we use weakly-supervised learning to create detectors for a set of words commonly found in image captions. Learning directly from image captions is difficult, because the system does not have access to supervisory signals, such as object bounding boxes, that are found in other data sets [11, 7]. Many words, e.g., crowded or inside, do not even have well-defined bounding boxes. To overcome this difficulty, we use three ideas. First, the system reasons with image sub-regions rather than with the full image. Next, we featurize each of these regions using rich convolutional neural network (CNN) features, fine-tuned on our training data [21, 42]. Finally, we map the features of each region to words likely to be contained in the caption. We train this map using multiple instance learning (MIL) [30, 49], which learns a discriminative visual signature for each word.

Generating novel image descriptions from a bag of likely words requires an effective LM. In this paper, we view caption generation as an optimization problem. In this view, the core task is to take the set of word detection scores, and find the highest likelihood sentence that covers each word exactly once. We train a maximum entropy (ME) LM from a set of training image descriptions [2, 40]. This training captures commonsense knowledge about the world through language statistics [3]. An explicit search over word sequences is effective at finding high-likelihood sentences.

The final stage of the system (Figure 1) re-ranks a set of high-likelihood sentences by a linear weighting of sentence features. These weights are learned using Minimum Error Rate Training (MERT) [35]. In addition to several common sentence features, we introduce a new feature based on a Deep Multimodal Similarity Model (DMSM). The DMSM learns two neural networks that map images and text fragments to a common vector representation in which the similarity between sentences and images can be easily measured. As we demonstrate, the use of the DMSM significantly improves the selection of quality sentences.

To evaluate the quality of our automatic captions, we use three easily computable metrics and better/worse/equal comparisons by human subjects on Amazon’s Mechanical Turk (AMT). The evaluation was performed on the challenging Microsoft COCO dataset [28, 4] containing complex images with multiple objects. Each of the 82,783 training images has 5 human annotated captions. For measuring the quality of our sentences we use the popular BLEU [37], METEOR [1] and perplexity (PPLX) metrics. Surprisingly, we find our generated captions outperform humans based on the BLEU metric; and this effect holds when evaluated on unseen test data from the COCO dataset evaluation server, reaching 29.1% BLEU-4 vs. 21.7% for humans. Human evaluation on our held-out test set has our captions judged to be of the same quality or better than humans 34% of the time. We also compare to previous work on the PASCAL sentence dataset [38], and show marked improvements over previous work. Our results demonstrate the utility of training both visual detectors and LMs directly on image captions, as well as using a global multimodal semantic model for re-ranking the caption candidates.

2. Related Work

There are two well-studied approaches to automatic image captioning: retrieval of existing human-written captions, and generation of novel captions. Recent retrieval-based approaches have used neural networks to map images and text into a common vector representation [43]. Other retrieval-based methods use similarity metrics that take pre-defined image features [15, 36]. Farhadi et al. [12] represent both images and text as linguistically-motivated semantic triples, and compute similarity in that space. A similar fine-grained analysis of sentences and images has been done for retrieval in the context of neural networks [19].

Retrieval-based methods always return well-formed human-written captions, but these captions may not be able to describe new combinations of objects or novel scenes. This limitation has motivated a large body of work on generative approaches, where the image is first analyzed and objects are detected, and then a novel caption is generated. Previous work utilizes syntactic and semantic constraints in the generation process [32, 48, 26, 23, 22, 47], and we compare against prior state of the art in this line of work. We focus on the Midge system [32], which combines syntactic structures using maximum likelihood estimation to generate novel sentences; and compare qualitatively against the Baby Talk system [22], which generates descriptions by filling sentence template slots with words selected from a conditional random field that predicts the most likely image labeling. Both of these previous systems use the same set of test sentences, making direct comparison possible.

Recently, researchers have explored purely statistical approaches to guiding language models using images. Kiros et al. [20] use a log-bilinear model with bias features derived from the image to model text conditioned on the image. Also related are several contemporaneous papers [29, 45, 6, 18, 9, 46, 25]. Among these, a common theme [29, 45, 6, 18] has been to utilize a recurrent neural network for generating image captions by conditioning its output on image features extracted by a convolutional neural network. More recently, Donahue et al. [9] also applied a similar model to video description. Lebret et al. [25] have investigated the use of a phrase-based model for generating captions, while Xu et al. [46] have proposed a model based on visual attention.

Unlike these approaches, in this work we detect words by applying a CNN to image regions [13] and integrating the information with MIL [49]. We minimize a priori assumptions about how sentences should be structured by training directly from captions. Finally, in contrast to [20, 29], we formulate the problem of generation as an optimization problem and search for the most likely sentence [40].

3. Word Detection

The first step in our caption generation pipeline detects a set of words that are likely to be part of the image's description. These words may belong to any part of speech, including nouns, verbs, and adjectives. We determine our vocabulary V using the 1000 most common words in the training captions, which cover over 92% of the word occurrences in the training data (available on the project webpage^1).

^1 http://research.microsoft.com/image_captioning

3.1. Training Word Detectors

Given a vocabulary of words, our next goal is to detect the words from images. We cannot use standard supervised learning techniques for learning detectors, since we do not know the image bounding boxes corresponding to the words. In fact, many words relate to concepts for which bounding boxes may not be easily defined, such as open or beautiful. One possible approach is to use image classifiers that take as input the entire image. As we show in Section 6, this leads to worse performance since many words or concepts only apply to image sub-regions. Instead, we learn our detectors using the weakly-supervised approach of Multiple Instance Learning (MIL) [30, 49].

Figure 2. Multiple Instance Learning detections for cat, red, flying and two (left to right, top to bottom). View in color.

For each word w ∈ V, MIL takes as input sets of “positive” and “negative” bags of bounding boxes, where each bag corresponds to one image i. A bag b_i is said to be positive if word w is in image i's description, and negative otherwise. Intuitively, MIL performs training by iteratively selecting instances within the positive bags, followed by re-training the detector using the updated positive labels.

We use a noisy-OR version of MIL [49], where the probability of bag b_i containing word w is calculated from the probabilities of individual instances in the bag:

    1 − ∏_{j ∈ b_i} (1 − p^w_{ij})    (1)

where p^w_{ij} is the probability that a given image region j in image i corresponds to word w. We compute p^w_{ij} using a multi-layered architecture [21, 42]^2, by computing a logistic function on top of the fc7 layer (this can be expressed as a fully connected fc8 layer followed by a sigmoid layer):

    p^w_{ij} = 1 / (1 + exp(−(v_w^T φ(b_{ij}) + u_w)))    (2)

where φ(b_{ij}) is the fc7 representation for image region j in image i, and v_w, u_w are the weights and bias associated with word w.

^2 We denote the CNN from [21] as AlexNet and the 16-layer CNN from [42] as VGG for subsequent discussion. We use the code base and models available from the Caffe Model Zoo https://github.com/BVLC/caffe/wiki/Model-Zoo [17].
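To make Eqs. (1)–(2) concrete, the following is a minimal NumPy sketch of the per-region logistic probabilities and the noisy-OR bag probability. It assumes the fc7 features φ(b_{ij}) for one image's regions are already extracted; the array shapes and toy values at the end are illustrative only, not part of the original system.

```python
import numpy as np

def region_word_probs(fc7_feats, v_w, u_w):
    """Eq. (2): per-region probability p^w_ij = sigmoid(v_w . phi(b_ij) + u_w).

    fc7_feats: (num_regions, d) array of fc7 activations phi(b_ij).
    v_w, u_w:  weight vector of shape (d,) and scalar bias for word w.
    """
    logits = fc7_feats @ v_w + u_w
    return 1.0 / (1.0 + np.exp(-logits))

def noisy_or_bag_prob(region_probs):
    """Eq. (1): probability that the bag (image) contains word w."""
    return 1.0 - np.prod(1.0 - region_probs)

# Toy usage: 10 candidate regions with 4096-dimensional fc7 features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 4096))
v_w, u_w = 0.01 * rng.normal(size=4096), -2.0
p_ij = region_word_probs(feats, v_w, u_w)   # per-region probabilities
p_i = noisy_or_bag_prob(p_ij)               # image-level probability p^w_i
```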
We express the fully connected layers (fc6, fc7, fc8) of these networks as convolutions to obtain a fully convolutional network. When this fully convolutional network is run over the image, we obtain a coarse spatial response map. Each location in this response map corresponds to the response obtained by applying the original CNN to overlapping shifted regions of the input image (thereby effectively scanning different locations in the image for possible objects). We up-sample the image so that its longer side is 565 pixels, which gives us a 12 × 12 response map at fc8 for both [21, 42] and corresponds to sliding a 224 × 224 bounding box in the up-sampled image with a stride of 32. The noisy-OR version of MIL is then implemented on top of this response map to generate a single probability p^w_i for each word for each image. We use a cross entropy loss and optimize the CNN end-to-end for this task with stochastic gradient descent. We use one image in each batch and train for 3 epochs. For initialization, we use the network pre-trained on ImageNet [7].
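A small sketch of the training objective described above, i.e. a per-image cross-entropy loss on the noisy-OR probabilities computed from the spatial response map. The 12 × 12 map size and the three-word vocabulary in the toy usage are placeholders; in the real system the map is produced by the fully convolutional CNN and the loss is back-propagated through it.

```python
import numpy as np

def mil_noisy_or_loss(response_map, caption_words, vocab):
    """Cross-entropy between the noisy-OR image-level probabilities and the
    bag labels (whether each vocabulary word appears in the image's captions).

    response_map:  (H, W, V) array of per-location word probabilities p^w_ij.
    caption_words: set of words occurring in the image's captions.
    vocab:         list of V vocabulary words, aligned with the last axis.
    """
    eps = 1e-12
    flat = response_map.reshape(-1, len(vocab))
    p_img = 1.0 - np.prod(1.0 - flat, axis=0)        # noisy OR over locations
    labels = np.array([w in caption_words for w in vocab], dtype=float)
    return -np.mean(labels * np.log(p_img + eps)
                    + (1.0 - labels) * np.log(1.0 - p_img + eps))

# Toy usage: a 12 x 12 response map over a three-word vocabulary.
rng = np.random.default_rng(1)
resp = rng.uniform(0.0, 0.2, size=(12, 12, 3))
loss = mil_noisy_or_loss(resp, {"dog"}, ["dog", "cat", "red"])
```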

3.2. Generating Word Scores for a Test Image

Given a novel test image i, we up-sample and forward propagate the image through the CNN to obtain p^w_i as described above. We do this for all words w in the vocabulary V. Note that all the word detectors have been trained independently and hence their outputs need to be calibrated. To calibrate the output of different detectors, we use the image level likelihood p^w_i to compute precision on a held-out subset of the training data [14]. We threshold this precision value at a global threshold τ, and output all words Ṽ with a precision of τ or higher along with the image level probability p^w_i, and raw score max_j p^w_{ij}.

Figure 2 shows some sample MIL detections. For each image, we visualize the spatial response map p^w_{ij}. Note that the method has not used any bounding box annotations for training, but is still able to reliably localize objects and also associate image regions with more abstract concepts.
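The calibration step above can be sketched as follows: for each word, estimate the precision the detector achieves on a held-out split at the test image's score, and keep only words whose estimated precision reaches the global threshold τ. This is a simplified illustration under the assumption that held-out image-level scores and binary labels (word present in a caption or not) are available per word; the exact precision computation used in [14] may differ.

```python
import numpy as np

def precision_at_score(score, heldout_scores, heldout_labels):
    """Precision of the detector on held-out images whose image-level
    probability is at least `score` (a simple score-to-precision mapping)."""
    selected = heldout_scores >= score
    if not selected.any():
        return 0.0
    return float(heldout_labels[selected].mean())

def calibrated_word_set(test_probs, heldout, tau=0.5):
    """Return the set V~ of words whose estimated precision is >= tau.

    test_probs: {word: image-level probability p^w_i for the test image}.
    heldout:    {word: (scores, labels)} NumPy arrays from a held-out split.
    tau:        global precision threshold (a placeholder value here).
    """
    kept = {}
    for w, p in test_probs.items():
        scores, labels = heldout[w]
        if precision_at_score(p, scores, labels) >= tau:
            kept[w] = p
    return kept
```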
4. Language Generation

We cast the generation process as a search for the likeliest sentence conditioned on the set of visually detected words. The language model is at the heart of this process because it defines the probability distribution over word sequences. Note that despite being a statistical model, the LM can encode very meaningful information, for instance that running is more likely to follow horse than talking. This information can help identify false word detections and encodes a form of commonsense knowledge.

4.1. Statistical Model

To generate candidate captions for an image, we use a maximum entropy (ME) LM conditioned on the set of visually detected words. The ME LM estimates the probability of a word w_l conditioned on the preceding words w_1, w_2, ..., w_{l−1}, as well as the set of words with high likelihood detections Ṽ_l ⊂ Ṽ that have yet to be mentioned in the sentence. The motivation for conditioning on the unused words is to encourage all the words to be used, while avoiding repetitions. The top 15 most frequent closed-class words^3 are removed from the set Ṽ since they are detected in nearly every image (and are trivially generated by the LM). It should be noted that the detected words are usually somewhat noisy. Thus, when the end of sentence token is being predicted, the set of remaining words may still contain some words with a high confidence of detection.

^3 The top 15 frequent closed-class words are a, on, of, the, in, with, and, is, to, an, at, are, next, that and it.

Following the definition of an ME LM [2], the word probability conditioned on preceding words and remaining objects can be written as:

    Pr(w_l = w̄_l | w̄_{l−1}, ..., w̄_1, <s>, Ṽ_{l−1}) =
        exp[ Σ_{k=1}^{K} λ_k f_k(w̄_l, w̄_{l−1}, ..., w̄_1, <s>, Ṽ_{l−1}) ]
        / Σ_{v ∈ V ∪ </s>} exp[ Σ_{k=1}^{K} λ_k f_k(v, w̄_{l−1}, ..., w̄_1, <s>, Ṽ_{l−1}) ]    (3)

where <s> denotes the start-of-sentence token, w̄_j ∈ V ∪ </s>, and f_k(w_l, ..., w_1, Ṽ_{l−1}) and λ_k respectively denote the k-th max-entropy feature and its weight. The basic discrete ME features we use are summarized in Table 1. These features form our “baseline” system. It has proven effective to extend this with a “score” feature, which evaluates to the log-likelihood of a word according to the corresponding visual detector. We have also experimented with distant bigram features [24] and continuous space log-bilinear features [33, 34], but while these improved PPLX significantly, they did not improve BLEU, METEOR or human preference, and space restrictions preclude further discussion.

To train the ME LM, the objective function is the log-likelihood of the captions conditioned on the corresponding set of detected objects, i.e.:

    L(Λ) = Σ_{s=1}^{S} Σ_{l=1}^{#(s)} log Pr(w̄^{(s)}_l | w̄^{(s)}_{l−1}, ..., w̄^{(s)}_1, <s>, Ṽ^{(s)}_{l−1})    (4)

where the superscript (s) denotes the index of sentences in the training data, and #(s) denotes the length of the sentence. The noise contrastive estimation (NCE) technique is used to accelerate the training by avoiding the calculation of the exact denominator in (3) [34]. In the generation process, we use the unnormalized NCE likelihood estimates, which are far more efficient than the exact likelihoods, and produce very similar outputs. However, all PPLX numbers we report are computed with exhaustive normalization. The ME features are implemented in a hash table as in [31]. In our experiments, we use N-gram features up to 4-gram and 15 contrastive samples in NCE training.
w1 , w2 , · · · , wl−1 , as well as the set of words with high
4.2. Generation Process
likelihood detections Ṽl ⊂ Ṽ that have yet to be mentioned
in the sentence. The motivation of conditioning on the un- During generation, we perform a left-to-right beam
used words is to encourage all the words to be used, while search similar to the one used in [39]. This maintains a stack
avoiding repetitions. The top 15 most frequent closed-class of length l partial hypotheses. At each step in the search, ev-
words3 are removed from the set Ṽ since they are detected in ery path on the stack is extended with a set of likely words,
nearly every image (and are trivially generated by the LM). and the resulting length l + 1 paths are stored. The top k
It should be noted that the detected words are usually some- length l + 1 paths are retained and the others pruned away.
what noisy. Thus, when the end of sentence token is being We define the possible extensions to be the end of sen-
predicted, the set of remaining words may still contain some tence token </s>, the 100 most frequent words, the set of at-
words with a high confidence of detection. tribute words that remain to be mentioned, and all the words
Following the definition of an ME LM [2], the word in the training data that have been observed to follow the last
probability conditioned on preceding words and remaining word in the hypothesis. Pruning is based on the likelihood
objects can be written as: of the partial path. When </s> is generated, the full path to
</s> is removed from the stack and set aside as a completed
3 The top 15 frequent closed-class words are a, on, of, the, in, sentence. The process continues until a maximum sentence
with, and, is, to, an, at, are, next, that and it. length L is reached.
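A compact sketch of the beam search just described, under simplifying assumptions: `next_word_prob` stands for the ME LM of Eq. (3), `extensions` stands for the candidate-word generator (end-of-sentence token, frequent words, unused attributes, observed successors), and the bookkeeping of which attribute words each hypothesis has covered is omitted.

```python
import math

def beam_search(next_word_prob, extensions, beam_size=10, max_len=20):
    """Left-to-right beam search over partial captions (Section 4.2 sketch).

    next_word_prob(hypothesis, word) -> probability of `word` given the
        partial caption `hypothesis`.
    extensions(hypothesis) -> iterable of candidate next words, which should
        include the end-of-sentence token '</s>'.
    Returns a list of (caption_words, log_likelihood) completed sentences.
    """
    stack = [([], 0.0)]                      # (partial caption, log-likelihood)
    completed = []
    for _ in range(max_len):
        expanded = []
        for hyp, logp in stack:
            for w in extensions(hyp):
                new_logp = logp + math.log(next_word_prob(hyp, w) + 1e-12)
                if w == '</s>':
                    completed.append((hyp, new_logp))     # set aside full path
                else:
                    expanded.append((hyp + [w], new_logp))
        # Retain the top-k length l+1 paths, prune the rest.
        stack = sorted(expanded, key=lambda x: x[1], reverse=True)[:beam_size]
        if not stack:
            break
    return completed
```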

Table 1. Features used in the maximum entropy language model.

Feature    Type  Definition                                        Description
Attribute  0/1   w̄_l ∈ Ṽ_{l−1}                                     Predicted word is in the attribute set, i.e. has been visually detected and not yet used.
N-gram+    0/1   w̄_{l−N+1}, ..., w̄_l = κ and w̄_l ∈ Ṽ_{l−1}         N-gram ending in predicted word is κ and the predicted word is in the attribute set.
N-gram−    0/1   w̄_{l−N+1}, ..., w̄_l = κ and w̄_l ∉ Ṽ_{l−1}         N-gram ending in predicted word is κ and the predicted word is not in the attribute set.
End        0/1   w̄_l = κ and Ṽ_{l−1} = ∅                           The predicted word is κ and all attributes have been mentioned.
Score      R     score(w̄_l) when w̄_l ∈ Ṽ_{l−1}                     The log-probability of the predicted word when it is in the attribute set.

Table 2. Features used by MERT.
1. The log-likelihood of the sequence.
2. The length of the sequence.
3. The log-probability per word of the sequence.
4. The logarithm of the sequence's rank in the log-likelihood.
5. 11 binary features indicating whether the number of mentioned objects is x (x = 0, ..., 10).
6. The DMSM score between the sequence and the image.

After obtaining the set of completed sentences C, we form an M-best list as follows. Given a target number of T image attributes to be mentioned, the sequences in C covering at least T objects are added to the M-best list, sorted in descending order by the log-likelihood. If there are fewer than M sequences covering at least T objects found in C, we reduce T by 1 until M sequences are found.
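A short sketch of the M-best list construction described above, assuming each completed sentence is paired with its log-likelihood and that attribute coverage is measured by simple set intersection; the M and T defaults are placeholders.

```python
def m_best_list(completed, attributes, M=500, T=3):
    """Build the M-best list: keep sentences covering at least T attributes,
    sorted by log-likelihood, relaxing T until M sequences are found.

    completed:  list of (caption_words, log_likelihood) pairs (the set C).
    attributes: set of detected attribute words for the image.
    """
    while T >= 0:
        covering = [(words, ll) for words, ll in completed
                    if len(attributes & set(words)) >= T]
        if len(covering) >= M or T == 0:
            covering.sort(key=lambda x: x[1], reverse=True)
            return covering[:M]
        T -= 1     # fewer than M sequences cover T attributes: relax T
    return []
```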
5. Sentence Re-Ranking

Our LM produces an M-best set of sentences. Our final stage uses MERT [35] to re-rank the M sentences. MERT uses a linear combination of features computed over an entire sentence, shown in Table 2. The MERT model is trained on the M-best lists for the validation set using the BLEU metric, and applied to the M-best lists for the test set. Finally, the best sequence after the re-ranking is selected as the caption of the image. Along with standard MERT features, we introduce a new multimodal semantic similarity model, discussed below.

5.1. Deep Multimodal Similarity Model

To model global similarity between images and text, we develop a Deep Multimodal Similarity Model (DMSM). The DMSM learns two neural networks that map images and text fragments to a common vector representation. We measure similarity between images and text by measuring cosine similarity between their corresponding vectors. This cosine similarity score is used by MERT to re-rank the sentences. The DMSM is closely related to the unimodal Deep Structured Semantic Model (DSSM) [16, 41], but extends it to the multimodal setting. The DSSM was initially proposed to model the semantic relevance between textual search queries and documents, and is extended in this work to replace the query vector in the original DSSM by the image vector computed from the deep convolutional network. The DMSM consists of a pair of neural networks, one for mapping each input modality to a common semantic space, which are trained jointly. In training, the data consists of a set of image/caption pairs. The loss function minimized during training represents the negative log posterior probability of the caption given the corresponding image.

Image model: We map images to semantic vectors using the same CNN (AlexNet / VGG) as used for detecting words in Section 3. We first finetune the networks on the COCO dataset for the full image classification task of predicting the words occurring in the image caption. We then extract out the fc7 representation from the finetuned network and stack three additional fully connected layers with tanh non-linearities on top of this representation to obtain a final representation of the same size as the last layer of the text model. We learn the parameters in these additional fully connected layers during DMSM training.

Text model: The text part of the DMSM maps text fragments to semantic vectors, in the same manner as in the original DSSM. In general, the text fragments can be a full caption. Following [16] we convert each word in the caption to a letter-trigram count vector, which uses the count distribution of context-dependent letters to represent a word. This representation has the advantage of reducing the size of the input layer while generalizing well to infrequent, unseen and incorrectly spelled words. Then following [41], this representation is forward propagated through a deep convolutional neural network to produce the semantic vector at the last layer.
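A small sketch of the letter-trigram word hashing used in the text model. The padding character and lower-casing are assumptions of this sketch (the DSSM in [16] uses word-boundary markers in a similar way); the resulting sparse count vector is what would be fed to the text network.

```python
from collections import Counter

def letter_trigrams(word, boundary="#"):
    """Letter-trigram counts for one word, DSSM-style word hashing:
    the word is wrapped in boundary markers and split into 3-letter pieces."""
    padded = boundary + word.lower() + boundary
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def caption_trigram_vector(caption):
    """Bag of letter trigrams for a whole caption (the text-model input)."""
    counts = Counter()
    for w in caption.split():
        counts.update(letter_trigrams(w))
    return counts

# Example: "cat" contributes the trigrams '#ca', 'cat' and 'at#'.
vec = caption_trigram_vector("a cat sitting on a chair")
```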
Objective and training: We define the relevance R as the cosine similarity between an image or query (Q) and a text fragment or document (D) based on their representations y_Q and y_D obtained using the image and text models: R(Q, D) = cosine(y_Q, y_D) = (y_Q^T y_D) / (‖y_Q‖ ‖y_D‖). For a given image-text pair, we can compute the posterior probability of the text being relevant to the image via:

    P(D | Q) = exp(γ R(Q, D)) / Σ_{D′ ∈ D} exp(γ R(Q, D′))    (5)

Here γ is a smoothing factor determined using the validation set, which is 10 in our experiments. D denotes the set of all candidate documents (captions) which should be compared to the query (image). We found that restricting D to one matching document D+ and a fixed number N of randomly selected non-matching documents D− worked reasonably well, although using noise-contrastive estimation could further improve results. Thus, for each image we select one relevant text fragment and N non-relevant fragments to compute the posterior probability. N is set to 50 in our experiments. During training, we adjust the model parameters Λ to minimize the negative log posterior probability that the relevant captions are matched to the images:

    L(Λ) = − log ∏_{(Q, D+)} P(D+ | Q)    (6)
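Eqs. (5) and (6) can be transcribed directly as below, assuming the image and caption semantic vectors have already been produced by the two networks; the 300-dimensional random vectors in the toy usage are placeholders for those outputs.

```python
import numpy as np

def relevance(y_q, y_d):
    """R(Q, D): cosine similarity between image and text semantic vectors."""
    return float(y_q @ y_d / (np.linalg.norm(y_q) * np.linalg.norm(y_d)))

def posterior(y_q, y_pos, y_negs, gamma=10.0):
    """Eq. (5): P(D+|Q) over one matching caption and N non-matching ones."""
    sims = np.array([relevance(y_q, y_pos)] +
                    [relevance(y_q, y_n) for y_n in y_negs])
    expd = np.exp(gamma * sims)
    return float(expd[0] / expd.sum())

def dmsm_loss(batch, gamma=10.0):
    """Eq. (6): negative log posterior summed over (image, caption+) pairs.

    batch: iterable of (y_q, y_pos, list_of_y_neg) vector triples.
    """
    return -sum(np.log(posterior(y_q, y_pos, y_negs, gamma))
                for y_q, y_pos, y_negs in batch)

# Toy usage with random 300-dimensional vectors and N = 50 negatives.
rng = np.random.default_rng(2)
y_q, y_pos = rng.normal(size=300), rng.normal(size=300)
y_negs = [rng.normal(size=300) for _ in range(50)]
loss = dmsm_loss([(y_q, y_pos, y_negs)])
```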

Figure 4. Qualitative results for images on the PASCAL sentence dataset. Captions using our approach (black), Midge [32] (blue) and Baby Talk [22] (red) are shown.

6. Experimental Results

We next describe the datasets used for testing, followed by an evaluation of our approach for word detection and experimental results on sentence generation.

6.1. Datasets

Most of our results are reported on the Microsoft COCO dataset [28, 4]. The dataset contains 82,783 training images and 40,504 validation images. The images create a challenging testbed for image captioning since most images contain multiple objects and significant contextual information. The COCO dataset provides 5 human-annotated captions per image. The test annotations are not available, so we split the validation set into validation and test sets^4.

^4 We split the COCO train/val set into 82,729 train / 20,243 val / 20,244 test. Unless otherwise noted, test results are reported on the 20,244 images from the validation set.

For experimental comparison with prior papers, we also report results on the PASCAL sentence dataset [38], which contains 1000 images from the 2008 VOC Challenge [11], with 5 human captions each.

6.2. Word Detection

To gain insight into our weakly-supervised approach for word detection using MIL, we measure its accuracy on the word classification task: If a word is used in at least one ground truth caption, it is included as a positive instance. Note that this is a challenging task, since conceptually similar words are classified separately; for example, the words cat/cats/kitten, or run/ran/running, all correspond to different classes. Attempts at adding further supervision, e.g., in the form of lemmas, did not result in significant gains.

Average Precision (AP) and Precision at Human Recall (PHR) [4] results for different parts of speech are shown in Table 3. We report two baselines. The first (Chance) is the result of randomly classifying each word. The second (Classification) is the result of a whole image classifier which uses features from AlexNet or VGG CNN [21, 42]. These features were fine-tuned for this word classification task using a logistic regression loss.

As shown in Table 3, the MIL NOR approach improves over both baselines for all parts of speech, demonstrating that better localization can help predict words. In fact, we observe the largest improvement for nouns and adjectives, which often correspond to concrete objects in an image sub-region. Results for both classification and MIL NOR are lower for parts of speech that may be less visually informative and difficult to detect, such as adjectives (e.g., few, which has an AP of 2.5), pronouns (e.g., himself, with an AP of 5.7), and prepositions (e.g., before, with an AP of 1.0). In comparison, words with high AP scores are typically either visually informative (red: AP 66.4, her: AP 45.6) or associated with specific objects (polar: AP 94.6, stuffed: AP 74.2). Qualitative results demonstrating word localization are shown in Figures 2 and 3.

6.3. Caption Generation

We next describe our caption generation results, beginning with a short discussion of evaluation metrics.

Metrics: The sentence generation process is measured using both automatic metrics and human studies. We use three different automatic metrics: PPLX, BLEU [37], and METEOR [1]. PPLX (perplexity) measures the uncertainty of the language model, corresponding to how many bits on average would be needed to encode each word given the language model. A lower PPLX indicates a better score. BLEU [37] is widely used in machine translation and measures the fraction of N-grams (up to 4-gram) that are in common between a hypothesis and a reference or set of references; here we compare against 4 randomly selected references. METEOR [1] measures unigram precision and recall, extending exact word matches to include similar words based on WordNet synonyms and stemmed tokens. We additionally report performance on the metrics made available from the MSCOCO captioning challenge^5, which includes scores for BLEU-1 through BLEU-4, METEOR, CIDEr [44], and ROUGE-L [27].

^5 http://mscoco.org/dataset/#cap2015
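To make the n-gram overlap idea behind BLEU concrete, here is a sketch of clipped n-gram precision against a set of references. This is only the core ingredient: the official metric combines n = 1 through 4 with a geometric mean and a brevity penalty, which are omitted here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_ngram_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in the references,
    with each n-gram's count clipped by its maximum reference count."""
    hyp = ngrams(hypothesis.split(), n)
    if not hyp:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref.split(), n).items():
            max_ref[g] = max(max_ref[g], c)
    matched = sum(min(c, max_ref[g]) for g, c in hyp.items())
    return matched / sum(hyp.values())

# Example with one hypothesis and two references.
p4 = clipped_ngram_precision("a dog sitting on a couch",
                             ["a dog is sitting on a couch",
                              "a small dog on a sofa"], n=4)
```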
All of these automatic metrics are known to only roughly correlate with human judgment [10]. We therefore include human evaluation to further explore the quality of our models. Each task presents a human (Mechanical Turk worker) with an image and two captions: one is automatically generated, and the other is a human caption. The human is asked to select which caption better describes the image, or to choose a “same” option when they are of equal quality. In each experiment, 250 humans were asked to compare 20 caption pairs each, and 5 humans judged each caption pair. We used Crowdflower, which automatically filters out spammers. The ordering of the captions was randomized to avoid bias, and we included four check-cases where the answer was known and obvious; workers who missed any of these were excluded. The final judgment is the majority vote of the judgment of the 5 humans. In ties, one-half of a count is distributed to the two best answers. We also compute error bars on the human results by taking 1000 bootstrap resamples of the majority vote outcome (with ties), then reporting the difference between the mean and the 5th or 95th percentile (whichever is farther from the mean).
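A small sketch of that bootstrap procedure, assuming each image's majority-vote outcome has been encoded as a number (the 1.0 / 0.5 / 0.0 encoding of win / tie / loss below is an assumption of this sketch, not specified in the text):

```python
import numpy as np

def bootstrap_error_bar(outcomes, n_resamples=1000, seed=0):
    """Error bar from bootstrap resamples of per-image majority-vote outcomes:
    the larger of the distances from the mean of the resampled means to their
    5th and 95th percentiles."""
    outcomes = np.asarray(outcomes, dtype=float)
    rng = np.random.default_rng(seed)
    means = np.array([rng.choice(outcomes, size=outcomes.size, replace=True).mean()
                      for _ in range(n_resamples)])
    lo, hi = np.percentile(means, [5, 95])
    center = means.mean()
    return max(center - lo, hi - center)

# Toy usage: 1.0 = system caption preferred, 0.5 = tie, 0.0 = human preferred.
outcomes = [1.0, 0.0, 0.5, 0.0, 1.0, 0.0, 0.0, 0.5]
err = bootstrap_error_bar(outcomes)
```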

Table 3. Average precision (AP) and Precision at Human Recall (PHR) [4] for words with different parts of speech (NN: Nouns, VB: Verbs, JJ: Adjectives, DT: Determiners, PRP: Pronouns, IN: Prepositions). Results are shown using a chance classifier, full image classification, and Noisy OR multiple instance learning with AlexNet [21] and VGG [42] CNNs.

                          Average Precision                                   Precision at Human Recall
                          NN    VB    JJ    DT    PRP   IN    Others  All    NN    VB    JJ    DT    PRP   IN    Others  All
Count                     616   176   119   10    11    38    30      1000
Chance                    2.0   2.3   2.5   23.6  4.7   11.9  7.7     2.9
Classification (AlexNet)  32.4  16.7  20.7  31.6  16.8  21.4  15.6    27.1   39.0  27.7  37.0  37.3  26.2  31.5  25.0    35.9
Classification (VGG)      37.0  19.4  22.5  32.9  19.4  22.5  16.9    30.8   45.3  31.0  37.1  40.2  29.6  33.9  25.5    40.6
MIL (AlexNet)             36.9  18.0  22.9  31.7  16.8  21.4  15.2    30.4   46.0  29.4  40.1  37.9  25.9  31.5  21.6    40.8
MIL (VGG)                 41.4  20.7  24.9  32.4  19.1  22.8  16.3    34.0   51.6  33.3  44.3  39.2  29.4  34.3  23.9    45.7
Human Agreement                                                              63.8  35.0  35.9  43.1  32.5  34.3  31.6    52.8

Figure 3. Qualitative results for several randomly chosen images on the Microsoft COCO dataset, with our generated caption (black) and a human caption (blue) for each image. In the bottom two rows we show localizations for the words used in the sentences. More examples can be found on the project website^1.

Generation results: Table 4 summarizes our results on the Microsoft COCO dataset. We provide several baselines for experimental comparison, including two baselines that measure the complexity of the dataset: Unconditioned, which generates sentences by sampling an N-gram LM without knowledge of the visual word detectors; and Shuffled Human, which randomly picks another human generated caption from another image. Both the BLEU and METEOR scores are very low for these approaches, demonstrating the variation and complexity of the Microsoft COCO dataset.

We provide results on seven variants of our end-to-end approach: Baseline is based on visual features from AlexNet and uses the ME LM with all the discrete features as described in Table 1. Baseline+Score adds the feature for the word detector score into the ME LM. Both of these versions use the same set of sentence features (excluding the DMSM score) described in Section 5 when re-ranking the captions using MERT. Baseline+Score+DMSM uses the same ME LM as Baseline+Score, but adds the DMSM score as a feature for re-ranking. Baseline+Score+DMSM+ft adds finetuning. VGG+Score+ft and VGG+Score+DMSM+ft are analogous to Baseline+Score and Baseline+Score+DMSM but use finetuned VGG features. Note: the AlexNet baselines without finetuning are from an early version of our system which used object proposals from [50] instead of dense scanning.

Table 4. Caption generation performance for seven variants of our system on the Microsoft COCO dataset. We report performance on our held-out test set (half of the validation set). We report Perplexity (PPLX), BLEU and METEOR, using 4 randomly selected caption references. Results from human studies of subjective performance are also shown, with error bars in parentheses. Our final system "VGG+Score+DMSM+ft" is "same or better" than human 34% of the time.

System                        PPLX   BLEU    METEOR   ≈human          >human          ≥human
1. Unconditioned              24.1   1.2%    6.8%
2. Shuffled Human             –      1.7%    7.3%
3. Baseline                   20.9   16.9%   18.9%    9.9% (±1.5%)    2.4% (±0.8%)    12.3% (±1.6%)
4. Baseline+Score             20.2   20.1%   20.5%    16.9% (±2.0%)   3.9% (±1.0%)    20.8% (±2.2%)
5. Baseline+Score+DMSM        20.2   21.1%   20.7%    18.7% (±2.1%)   4.6% (±1.1%)    23.3% (±2.3%)
6. Baseline+Score+DMSM+ft     19.2   23.3%   22.2%    –               –               –
7. VGG+Score+ft               18.1   23.6%   22.8%    –               –               –
8. VGG+Score+DMSM+ft          18.1   25.7%   23.6%    26.2% (±2.1%)   7.8% (±1.3%)    34.0% (±2.5%)
Human-written captions        –      19.3%   24.1%

Table 5. Official COCO evaluation server results on the test set (40,775 images). The first row shows results using 5 reference captions, the second row 40 references. Human results are reported in parentheses.

           CIDEr         BLEU-4        BLEU-1        ROUGE-L       METEOR
5 refs     .912 (.854)   .291 (.217)   .695 (.663)   .519 (.484)   .247 (.252)
40 refs    .925 (.910)   .567 (.471)   .880 (.880)   .662 (.626)   .331 (.335)

As shown in Table 4, the PPLX of the ME LM with and without the word detector score feature is roughly the same. But BLEU and METEOR improve with the addition of the word detector scores in the ME LM. Performance improves further with the addition of the DMSM scores in re-ranking. Surprisingly, the BLEU scores are actually above those produced by human generated captions (25.69% vs. 19.32%). Improvements in performance using the DMSM scores with the VGG model are statistically significant as measured by 4-gram overlap and METEOR per-image (Wilcoxon signed-rank test, p < .001).

We also evaluated an approach (not shown) with whole-image classification rather than MIL. We found this approach to under-perform relative to MIL in the same setting (for example, using the VGG+Score+DMSM+ft setting, PPLX=18.9, BLEU=21.9%, METEOR=21.4%). This suggests that integrating information about words associated with image regions via MIL leads to improved performance over image classification alone.

The VGG+Score+DMSM approach produces captions that are judged to be of the same or better quality than human-written descriptions 34% of the time, which is a significant improvement over the Baseline results. Qualitative results are shown in Figure 3, and many more are available on the project website.

COCO evaluation server results: We further generated the captions for the images in the actual COCO test set consisting of 40,775 images (human captions for these images are not available publicly), and evaluated them on the COCO evaluation server. These results are summarized in Table 5. Our system gives a BLEU-4 score of 29.1%, and equals or surpasses human performance on 12 of the 14 metrics reported – the only system to do so. These results are also state-of-the-art on all 14 reported metrics among the four other results available publicly at the time of writing this paper. In particular, our system is the only one exceeding human CIDEr scores, which has been specifically proposed for evaluating image captioning systems [44].

To enable direct comparison with previous work on automatic captioning, we also test on the PASCAL sentence dataset [38], using the 847 images tested for both the Midge [32] and Baby Talk [22] systems. We show significantly improved results over the Midge [32] system, as measured by both BLEU and METEOR (2.0% vs. 17.6% BLEU and 9.2% vs. 19.2% METEOR).^6 To give a basic sense of the progress quickly being made in this field, Figure 4 shows output from the system on the same images.

^6 Baby Talk generates long, multi-sentence captions, making comparison by BLEU/METEOR difficult; we thus exclude evaluation here.

7. Conclusion

This paper presents a new system for generating novel captions from images. The system trains on images and corresponding captions, and learns to extract nouns, verbs, and adjectives from regions in the image. These detected words then guide a language model to generate text that reads well and includes the detected words. Finally, we use a global deep multimodal similarity model introduced in this paper to re-rank candidate captions.

At the time of writing, our system is state-of-the-art on all 14 official metrics of the COCO image captioning task, and equal to or exceeding human performance on 12 out of the 14 official metrics. Our generated captions have been judged by humans (Mechanical Turk workers) to be equal to or better than human-written captions 34% of the time.

References

[1] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005. 2, 6
[2] A. L. Berger, S. A. D. Pietra, and V. J. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 1996. 2, 4
[3] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010. 2
[4] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 2, 6, 7
[5] X. Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from web data. In ICCV, 2013. 1
[6] X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. CVPR, 2015. 2
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 2, 3
[8] S. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014. 1
[9] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. CVPR, 2015. 2, 3
[10] D. Elliott and F. Keller. Comparing automatic evaluation measures for image description. In ACL, 2014. 6
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, June 2010. 2, 6
[12] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010. 2
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 3
[14] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In CVPR, 2014. 4
[15] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47:853–899, 2013. 2
[16] P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. 5
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 3
[18] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015. 2
[19] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. arXiv preprint arXiv:1406.5679, 2014. 2
[20] R. Kiros, R. Zemel, and R. Salakhutdinov. Multimodal neural language models. In NIPS Deep Learning Workshop, 2013. 2, 3
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 2, 3, 6, 7
[22] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011. 1, 2, 6, 8
[23] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012. 2
[24] R. Lau, R. Rosenfeld, and S. Roukos. Trigger-based language models: A maximum entropy approach. In ICASSP, 1993. 4
[25] R. Lebret, P. O. Pinheiro, and R. Collobert. Phrase-based image captioning. arXiv preprint arXiv:1502.03671, 2015. 2, 3
[26] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In CoNLL, 2011. 2
[27] C.-Y. Lin and F. J. Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL, 2004. 6
[28] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 2, 6
[29] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014. 2, 3
[30] O. Maron and T. Lozano-Pérez. A framework for multiple-instance learning. NIPS, 1998. 2, 3
[31] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Cernocky. Strategies for training large scale neural network language models. In ASRU, 2011. 4
[32] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012. 2, 6, 8
[33] A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. In ICML, 2007. 4
[34] A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. In ICML, 2012. 4
[35] F. J. Och. Minimum error rate training in statistical machine translation. In ACL, 2003. 2, 5
[36] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011. 2
[37] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002. 2, 6

[38] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier.
Collecting image annotations using Amazon’s mechanical
turk. In NAACL HLT Workshop Creating Speech and Lan-
guage Data with Amazon’s Mechanical Turk, 2010. 2, 6, 8
[39] A. Ratnaparkhi. Trainable methods for surface natural lan-
guage generation. In NAACL, 2000. 4
[40] A. Ratnaparkhi. Trainable approaches to surface natural lan-
guage generation and their application to conversational dia-
log systems. Computer Speech & Language, 16(3):435–455,
2002. 2, 3
[41] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent
semantic model with convolutional-pooling structure for in-
formation retrieval. In CIKM, 2014. 5
[42] K. Simonyan and A. Zisserman. Very deep convolu-
tional networks for large-scale image recognition. CoRR,
abs/1409.1556, 2014. 2, 3, 6, 7
[43] R. Socher, Q. Le, C. Manning, and A. Ng. Grounded com-
positional semantics for finding and describing images with
sentences. In NIPS Deep Learning Workshop, 2013. 2
[44] R. Vedantam, C. L. Zitnick, and D. Parikh. Cider:
Consensus-based image description evaluation. CoRR,
abs/1411.5726, 2014. 6, 8
[45] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and
tell: A neural image caption generator. CVPR, 2015. 2
[46] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov,
R. Zemel, and Y. Bengio. Show, attend and tell: Neural im-
age caption generation with visual attention. arXiv preprint
arXiv:1502.03044, 2015. 2, 3
[47] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos.
Corpus-guided sentence generation of natural images. In
EMNLP, 2011. 1, 2
[48] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T:
Image parsing to text description. Proceedings of the IEEE,
98(8):1485–1508, 2010. 2
[49] C. Zhang, J. C. Platt, and P. A. Viola. Multiple instance
boosting for object detection. In NIPS, 2005. 2, 3
[50] C. L. Zitnick and P. Dollár. Edge boxes: Locating object
proposals from edges. In ECCV, 2014. 8
[51] C. L. Zitnick and D. Parikh. Bringing semantics into focus
using visual abstraction. In CVPR, 2013. 1

