
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset for Automatic Image Captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, Radu Soricut

Google AI
Venice, CA 90291
{piyushsharma,dingnan,seabass,rsoricut}@google.com

Abstract

We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages. We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNet-v2 (Szegedy et al., 2016) for image-feature extraction and Transformer (Vaswani et al., 2017) for sequence modeling achieves the best performance when trained on the Conceptual Captions dataset.

Alt-text: A Pakistani worker helps to clear the debris from the Taj Mahal Hotel November 7, 2005 in Balakot, Pakistan.
Conceptual Captions: a worker helps to clear the debris.
Alt-text: Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee.
Conceptual Captions: pop artist performs at the festival in a city.
Figure 1: Examples of images and image descriptions from the Conceptual Captions dataset; we start from existing alt-text descriptions, and automatically process them into Conceptual Captions with a balance of cleanliness, informativeness, fluency, and learnability.

1 Introduction

Automatic image description is the task of producing a natural-language utterance (usually a sentence) which correctly reflects the visual content of an image. This task has seen an explosion in proposed solutions based on deep learning architectures (Bengio, 2009), starting with the winners of the 2015 COCO challenge (Vinyals et al., 2015a; Fang et al., 2015), and continuing with a variety of improvements (see e.g. Bernardi et al. (2016) for a review). Practical applications of automatic image description systems include leveraging descriptions for image indexing or retrieval, and helping those with visual impairments by transforming visual signals into information that can be communicated via text-to-speech technology. The scientific challenge is seen as aligning, exploiting, and pushing further the latest improvements at the intersection of Computer Vision and Natural Language Processing.

There are two main categories of advances responsible for increased interest in this task. The first is the availability of large amounts of annotated data. Relevant datasets include the ImageNet dataset (Deng et al., 2009), with over 14 million images and 1 million bounding-box annotations, and the MS-COCO dataset (Lin et al., 2014), with 120,000 images and 5-way image-caption annotations. The second is the availability of powerful modeling mechanisms such as modern Convolutional Neural Networks (e.g. Krizhevsky et al. (2012)), which are capable of converting image pixels into high-level features with no manual feature engineering.

In this paper, we make contributions to both the data and modeling categories. First, we present a new dataset of caption annotations*, Conceptual Captions (Fig. 1), which has an order of magnitude more images than the COCO dataset.

* https://github.com/google-research-datasets/conceptual-captions

Conceptual Captions consists of about 3.3M ⟨image, description⟩ pairs. In contrast with the curated style of the COCO images, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. The raw descriptions are harvested from the Alt-text HTML attribute† associated with web images. We developed an automatic pipeline (Fig. 2) that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.

† https://en.wikipedia.org/wiki/Alt_attribute

As a contribution to the modeling category, we evaluate several image-captioning models. Based on the findings of Huang et al. (2016), we use Inception-ResNet-v2 (Szegedy et al., 2016) for image-feature extraction, which confers optimization benefits via residual connections and computationally efficient Inception units. For caption generation, we use both RNN-based (Hochreiter and Schmidhuber, 1997) and Transformer-based (Vaswani et al., 2017) models. Our results indicate that Transformer-based models achieve higher output accuracy; combined with the reports of Vaswani et al. (2017) regarding the reduced number of parameters and FLOPs required for training & serving (compared with RNNs), models such as T2T8x8 (Section 4) push forward the performance on image captioning and deserve further attention.

2 Related Work

Automatic image captioning has a long history (Hodosh et al., 2013; Donahue et al., 2014; Karpathy and Fei-Fei, 2015; Kiros et al., 2015). It has accelerated with the success of Deep Neural Networks (Bengio, 2009) and the availability of annotated data as offered by datasets such as Flickr30K (Young et al., 2014) and MS-COCO (Lin et al., 2014).

The COCO dataset is not large (order of 10^6 images), given the training needs of DNNs. In spite of that, it has been very popular, in part because it offers annotations for images with non-iconic views, or non-canonical perspectives of objects, and therefore reflects the composition of everyday scenes (the same is true about Flickr30K (Young et al., 2014)). COCO annotations (category labeling, instance spotting, and instance segmentation) are done for all objects in an image, including those in the background, in a cluttered environment, or partially occluded. Its images are also annotated with captions, i.e., sentences produced by human annotators to reflect the visual content of the images in terms of objects and their actions or relations.

A large number of DNN models for image caption generation have been trained and evaluated using COCO captions (Vinyals et al., 2015a; Fang et al., 2015; Xu et al., 2015; Ranzato et al., 2015; Yang et al., 2016; Liu et al., 2017; Ding and Soricut, 2017). These models are inspired by sequence-to-sequence models (Sutskever et al., 2014; Bahdanau et al., 2015) but use CNN-based encodings instead of RNNs (Hochreiter and Schmidhuber, 1997; Chung et al., 2014). Recently, the Transformer architecture (Vaswani et al., 2017) has been shown to be a viable alternative to RNNs (and CNNs) for sequence modeling. In this work, we evaluate the impact of the Conceptual Captions dataset on the image captioning task using models that combine CNN, RNN, and Transformer layers.

Also related to this work is the Pinterest image and sentence-description dataset (Mao et al., 2016). It is a large dataset (order of 10^8 examples), but its text descriptions do not strictly reflect the visual content of the associated image, and therefore cannot be used directly for training image-captioning models.

3 Conceptual Captions Dataset Creation

The Conceptual Captions dataset is programmatically created using a Flume (Chambers et al., 2010) pipeline. This pipeline processes billions of Internet webpages in parallel. From these webpages, it extracts, filters, and processes candidate ⟨image, caption⟩ pairs. The filtering and processing steps are described in detail in the following sections.

Image-based Filtering The first filtering stage, image-based filtering, discards images based on encoding format, size, aspect ratio, and offensive content. It only keeps JPEG images where both dimensions are greater than 400 pixels, and the ratio of the larger to the smaller dimension is no more than 2. It excludes images that trigger pornography or profanity detectors. These filters discard more than 65% of the candidates.
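A minimal sketch of these image-level checks is given below. This is not the production Flume pipeline: it only mirrors the stated heuristics (JPEG only, both dimensions above 400 pixels, aspect ratio at most 2), and `triggers_unsafe_content` is a hypothetical placeholder for the pornography/profanity detectors, which are not described in the paper.

```python
from PIL import Image

MIN_DIM = 400           # both dimensions must exceed 400 pixels
MAX_ASPECT_RATIO = 2.0  # larger/smaller dimension must be at most 2


def triggers_unsafe_content(image):
    """Hypothetical stand-in for the pornography/profanity image detectors."""
    return False


def passes_image_filter(path):
    """Apply the image-based filtering heuristics described above."""
    try:
        image = Image.open(path)
    except OSError:
        return False                      # unreadable or undesired encoding
    if image.format != "JPEG":
        return False
    width, height = image.size
    if min(width, height) <= MIN_DIM:
        return False
    if max(width, height) / min(width, height) > MAX_ASPECT_RATIO:
        return False
    return not triggers_unsafe_content(image)
```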

Figure 2: Conceptual Captions pipeline steps with examples and final output. The pipeline stages are Image Filtering, Text Filtering, Image/Text Filtering, and Text Transform. Examples: an Alt-text not processed because of undesired image format, aspect ratio, or size; "Ferrari dice" (Alt-text discarded: text does not contain a preposition/article); "The meaning of life" (Alt-text discarded: no overlap between text and image objects); "Demi Lovato wearing a black Ester Abner Spring 2018 gown and Stuart Weitzman sandals at the 2017 American Music Awards" transformed into the final caption "pop rock artist wearing a black gown and sandals at awards".

Text-based Filtering The second filtering stage, text-based filtering, harvests Alt-text from HTML webpages. Alt-text generally accompanies images and intends to describe the nature or the content of the image. Because these Alt-text values are not in any way restricted or enforced to be good image descriptions, many of them have to be discarded, e.g., search engine optimization (SEO) terms or Twitter hash-tag terms.

We analyze candidate Alt-text using the Google Cloud Natural Language APIs, specifically part-of-speech (POS), sentiment/polarity, and pornography/profanity annotations. On top of these annotations, we apply the following heuristics (a code sketch follows the list):

• a well-formed caption should have a high unique-word ratio covering various POS tags; candidates with no determiner, no noun, or no preposition are discarded, and candidates with a high noun ratio are also discarded;

• candidates with a high rate of token repetition are discarded;

• capitalization is a good indicator of well-composed sentences; candidates where the first word is not capitalized, or with a too-high capitalized-word ratio, are discarded;

• highly unlikely tokens are a good indicator of undesirable text; we use a vocabulary VW of 1B token types, appearing at least 5 times in the English Wikipedia, and discard candidates that contain tokens that are not found in this vocabulary;

• candidates that score too high or too low on the polarity annotations, or that trigger the pornography/profanity detectors, are discarded;

• predefined boiler-plate prefix/suffix sequences matching the text are cropped, e.g. “click to enlarge picture”, “stock photo”; we also drop text which begins/ends in certain patterns, e.g. “embedded image permalink”, “profile photo”.

These filters only allow around 3% of the incoming candidates to pass to the later stages.
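A rough sketch of these text-side heuristics is shown below. It assumes the caption has already been annotated with (token, POS-tag) pairs using Universal-style tags such as DET/NOUN/ADP (the paper obtains these from the Google Cloud Natural Language APIs); `passes_text_filter` and all threshold values are illustrative assumptions, not the ones used to build the dataset, and the polarity/profanity scoring and boilerplate cropping are omitted.

```python
# Illustrative thresholds; the actual values used for the dataset are not published.
MAX_NOUN_RATIO = 0.8
MAX_REPETITION_RATIO = 0.3
MAX_CAPITALIZED_RATIO = 0.5


def passes_text_filter(tagged_tokens, vocabulary):
    """tagged_tokens: list of (token, pos_tag); vocabulary: the 1B-type Wikipedia vocab V_W."""
    tokens = [tok for tok, _ in tagged_tokens]
    tags = {tag for _, tag in tagged_tokens}
    if not tokens:
        return False

    # Require a determiner, a noun, and a preposition (adposition).
    if not {"DET", "NOUN", "ADP"} <= tags:
        return False

    # Discard candidates dominated by nouns (e.g. keyword lists).
    noun_ratio = sum(1 for _, tag in tagged_tokens if tag == "NOUN") / len(tokens)
    if noun_ratio > MAX_NOUN_RATIO:
        return False

    # Discard candidates with a high rate of token repetition.
    repetition_ratio = 1.0 - len({t.lower() for t in tokens}) / len(tokens)
    if repetition_ratio > MAX_REPETITION_RATIO:
        return False

    # Capitalization heuristics: first word capitalized, not too many capitalized words.
    if not tokens[0][0].isupper():
        return False
    capitalized_ratio = sum(1 for t in tokens if t[0].isupper()) / len(tokens)
    if capitalized_ratio > MAX_CAPITALIZED_RATIO:
        return False

    # Every token must appear in the vocabulary V_W.
    return all(t.lower() in vocabulary for t in tokens)
```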

Original Alt-text: Harrison Ford and Calista Flockhart attend the premiere of ‘Hollywood Homicide’ at the 29th American Film Festival September 5, 2003 in Deauville, France.
Conceptual Captions: actors attend the premiere at festival.
What happened: “Harrison Ford and Calista Flockhart” mapped to “actors”; name, location, and date dropped.

Original Alt-text: Side view of a British Airways Airbus A319 aircraft on approach to land with landing gear down - Stock Image
Conceptual Captions: side view of an aircraft on approach to land with landing gear down
What happened: phrase “British Airways Airbus A319 aircraft” mapped to “aircraft”; boilerplate removed.

Original Alt-text: Two sculptures by artist Duncan McKellar adorn trees outside the derelict Norwich Union offices in Bristol, UK - Stock Image
Conceptual Captions: sculptures by person adorn trees outside the derelict offices
What happened: object count (e.g. “Two”) dropped; proper noun-phrase hypernymized to “person”; proper-noun modifiers dropped; location dropped; boilerplate removed.

Table 1: Examples of Conceptual Captions as derived from their original Alt-text versions.

Image&Text-based Filtering In addition to the separate filtering based on image and text content, we filter out candidates for which none of the text tokens can be mapped to the content of the image. To this end, we use classifiers available via the Google Cloud Vision APIs to assign class labels to images, using an image classifier with a large number of labels (order of magnitude of 10^5). Notably, these labels are also 100% covered by the VW token types.

Images are generally assigned between 5 and 20 labels, though the exact number depends on the image. We match these labels against the candidate text, taking into account morphology-based stemming as provided by the text annotation. Candidate ⟨image, caption⟩ pairs with no overlap are discarded. This filter discards around 60% of the incoming candidates.

Text Transformation with Hypernymization In the current version of the dataset, we considered over 5 billion images from about 1 billion English webpages. The filtering criteria above are designed to be high-precision (which comes with potentially low recall). From the original input candidates, only 0.2% of ⟨image, caption⟩ pairs pass the filtering criteria described above.

While the remaining candidate captions tend to be appropriate Alt-text image descriptions (see Alt-text in Fig. 1), a majority of these candidate captions contain proper names (people, venues, locations, etc.), which would be extremely difficult to learn as part of the image captioning task. To give an idea of what would happen in such cases, we train an RNN-based captioning model (see Section 4) on non-hypernymized Alt-text data and present an output example in Fig. 3. If automatic determination of person identity, location, etc. is needed, it should be attempted as a separate task and would need to leverage meta-information about the image (e.g. location).

Figure 3: Example of model output trained on clean, non-hypernymized Alt-text data. Alt-text (groundtruth): “Jimmy Barnes performs at the Sydney Entertainment Centre”. Model output: “Singer Justin Bieber performs onstage during the Billboard Music Awards at the MGM”.

Using the Google Cloud Natural Language APIs, we obtain named-entity and syntactic-dependency annotations. We then use the Google Knowledge Graph (KG) Search API to match the named-entities to KG entries and exploit the associated hypernym terms. For instance, both “Harrison Ford” and “Calista Flockhart” are identified as named-entities, so we match them to their corresponding KG entries. These KG entries have “actor” as their hypernym, so we replace the original surface tokens with that hypernym.

The following steps are applied to achieve the text transformations (a code sketch follows the list):

• noun modifiers of certain types (proper nouns, numbers, units) are removed;

• dates, durations, and preposition-based locations (e.g., “in Los Angeles”) are removed;

• named-entities are identified, matched against the KG entries, and substituted with their hypernym;

• resulting coordination noun-phrases with the same head (e.g., “actor and actor”) are resolved into a single-head, pluralized form (e.g., “actors”).

Around 20% of samples are discarded during this transformation because it can leave sentences too short or inconsistent.
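The toy sketch below illustrates the shape of this rewrite on the Table 1 example. It is only a rough approximation under strong assumptions: the `ENTITY_HYPERNYMS` dictionary stands in for the named-entity resolution and Knowledge Graph hypernym lookup, crude regular expressions stand in for the syntactic-dependency-based removal of dates, numbers, and locations, and pluralization is naive.

```python
import re

# Stand-in for the KG lookup: entity surface form -> hypernym (hypothetical values).
ENTITY_HYPERNYMS = {
    "Harrison Ford": "actor",
    "Calista Flockhart": "actor",
    "Hollywood Homicide": "film",
    "American Film Festival": "festival",
}


def hypernymize(caption):
    text = caption
    # 1. Substitute resolved named-entities with their KG hypernyms.
    for surface, hypernym in ENTITY_HYPERNYMS.items():
        text = text.replace(surface, hypernym)
    # 2. Drop dates and number/ordinal modifiers (very rough approximations).
    text = re.sub(r"\b(January|February|March|April|May|June|July|August|"
                  r"September|October|November|December)\s+\d{1,2},?\s*\d{4}\b", "", text)
    text = re.sub(r"\b\d+(st|nd|rd|th)?\b", "", text)
    # 3. Drop preposition-based locations of the form "in <Capitalized ...>".
    text = re.sub(r"\bin\s+(?:[A-Z][\w'-]*[,]?\s*)+", "", text)
    # 4. Resolve same-head coordinations ("actor and actor" -> "actors"; naive plural).
    text = re.sub(r"\b(\w+) and \1\b", r"\1s", text)
    # Tidy whitespace and stray trailing punctuation.
    text = re.sub(r"\s+", " ", text).strip(" ,.'")
    return text.lower()


print(hypernymize("Harrison Ford and Calista Flockhart attend the premiere of "
                  "'Hollywood Homicide' at the 29th American Film Festival "
                  "September 5, 2003 in Deauville, France."))
# -> roughly "actors attend the premiere of 'film' at the festival"
```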

Finally, we perform another round of text analysis and entity resolution to identify concepts with low counts. We cluster all resolved entities (e.g., “actor”, “dog”, “neighborhood”, etc.) and keep only the candidates for which all detected types have a count of over 100 (around 55% of the candidates). The remaining ⟨image, caption⟩ pairs contain around 16,000 entity types, guaranteed to be well represented in terms of number of examples. Table 1 contains several examples of before/after-transformation pairs.

Conceptual Captions Quality To evaluate the precision of our pipeline, we consider a random sample of 4K examples extracted from the test split of the Conceptual Captions dataset. We perform a human evaluation on this sample, using the same methodology described in Section 5.4.

Table 2: Human evaluation results on a sample from Conceptual Captions (GOOD judgments out of 3): 1+ 96.9%, 2+ 90.3%, 3 78.5%.

The results are presented in Table 2 and show that, out of 3 annotations, over 90% of the captions receive a majority (2+) of GOOD judgments. This indicates that the Conceptual Captions pipeline, though involving extensive algorithmic processing, produces high-quality image captions.

Table 3: Statistics over the Train/Validation/Test splits for Conceptual Captions.
Split | Examples | Unique Tokens | Tokens/Caption (Mean / StdDev / Median)
Train | 3,318,333 | 51,201 | 10.3 / 4.5 / 9.0
Validation | 28,355 | 13,063 | 10.3 / 4.6 / 9.0
Test | 22,530 | 11,731 | 10.1 / 4.5 / 9.0

We present in Table 3 statistics over the Train/Validation/Test splits for the Conceptual Captions dataset. The training set consists of slightly over 3.3M examples, while there are slightly over 28K examples in the validation set and 22.5K examples in the test set. The size of the training-set vocabulary (unique tokens) is 51,201. Note that the test set has been cleaned using human judgments (2+ GOOD), while both the training and validation splits contain all the data as produced by our automatic pipeline. The mean/stddev/median statistics for tokens-per-caption are consistent across the data splits, at around 10.3/4.5/9.0, respectively.

4 Image Captioning Models

In order to assess the impact of the Conceptual Captions dataset, we consider several image captioning models previously proposed in the literature. These models can be understood using the illustration in Fig. 4, as they mainly differ in the way in which they instantiate some of these components.

Figure 4: The main model components: an Image Embedding module produces X, the Encoder maps X to H, and the Decoder consumes H together with the shifted caption input Y (starting with <GO>) to produce the output Z (e.g., “people playing frisbee”).

There are three main components to this architecture:

• A deep CNN that takes a (preprocessed) image and outputs a vector of image embeddings X = (x_1, x_2, ..., x_L).

• An Encoder module that takes the image embeddings and encodes them into a tensor H = f_enc(X).

• A Decoder module that generates outputs z_t = f_dec(Y_{1:t}, H) at each step t, conditioned on H as well as the decoder inputs Y_{1:t}.

We explore two main instantiations of this architecture. One uses RNNs with LSTM cells (Hochreiter and Schmidhuber, 1997) to implement the f_enc and f_dec functions, corresponding to the Show-And-Tell (Vinyals et al., 2015b) model. The other uses Transformer self-attention networks (Vaswani et al., 2017) to implement f_enc and f_dec. All models in this paper use Inception-ResNet-v2 (Szegedy et al., 2016) as the CNN component.

4.1 RNN-based Models

Our instantiation of the RNN-based model is close to the Show-And-Tell (Vinyals et al., 2015b) model:

h_l ≜ RNN_enc(x_l, h_{l-1}), with H = h_L,
z_t ≜ RNN_dec(y_t, z_{t-1}), where z_0 = H.
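A minimal sketch of this encoder/decoder recursion is shown below, written with PyTorch's LSTMCell purely for concreteness (the paper does not specify the framework, and details such as the class name and greedy decoding here are our assumptions; the paper uses beam search at inference time).

```python
import torch
import torch.nn as nn


class ShowAndTellLike(nn.Module):
    """Sketch of h_l = RNN_enc(x_l, h_{l-1}), H = h_L; z_t = RNN_dec(y_t, z_{t-1}), z_0 = H."""

    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # word embeddings (projection tying omitted)
        self.rnn_enc = nn.LSTMCell(dim, dim)
        self.rnn_dec = nn.LSTMCell(dim, dim)
        self.project = nn.Linear(dim, vocab_size)

    def forward(self, image_embeddings, go_id, max_len=15):
        # image_embeddings: (batch, L, dim), e.g. L = 64 for the 8x8 partitions.
        batch, L, dim = image_embeddings.shape
        h = image_embeddings.new_zeros(batch, dim)
        c = image_embeddings.new_zeros(batch, dim)
        for l in range(L):                       # encoder recursion over image partitions
            h, c = self.rnn_enc(image_embeddings[:, l], (h, c))
        state = (h, c)                           # H = h_L initializes the decoder

        tokens = torch.full((batch,), go_id, dtype=torch.long,
                            device=image_embeddings.device)
        outputs = []
        for _ in range(max_len):                 # greedy decoding for illustration only
            state = self.rnn_dec(self.embed(tokens), state)
            logits = self.project(state[0])
            tokens = logits.argmax(dim=-1)
            outputs.append(tokens)
        return torch.stack(outputs, dim=1)       # (batch, max_len) predicted token ids
```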

In the original Show-And-Tell model, a single image embedding of the entire image is fed to the first cell of an RNN, which is also used for text generation. In our model, a single image embedding is fed to an RNN_enc with only one cell, and then a different RNN_dec is used for text generation. We tried both single-image (1x1) embeddings and 8x8 partitions of the image, where each partition has its own embedding. In the 8x8 case, the image embeddings are fed in a sequence to the RNN_enc. In both cases, we apply plain RNNs without cross-attention, as in the Show-And-Tell model. RNNs with cross-attention were used in the Show-Attend-Tell model (Xu et al., 2015), but we find its performance to be inferior to the Show-And-Tell model.

4.2 Transformer Model

In the Transformer-based models, both the encoder and the decoder contain a stack of N layers. We denote the n-th layer in the encoder by X_n = {x_{n,1}, ..., x_{n,L}}, with X_0 = X and H = X_N. Each of these layers contains two sub-layers: a multi-head self-attention layer ATTN, and a position-wise feedforward network FFN:

x'_{n,j} = ATTN(x_{n,j}, X_n; W_e^q, W_e^k, W_e^v)
         ≜ softmax(⟨x_{n,j} W_e^q, X_n W_e^k⟩) X_n W_e^v
x_{n+1,j} = FFN(x'_{n,j}; W_e^f)

where W_e^q, W_e^k, and W_e^v are the encoder weight matrices for the query, key, and value transformations in the self-attention sub-layer, and W_e^f denotes the encoder weight matrix of the feedforward sub-layer. Similar to the RNN-based model, we consider using a single image embedding (1x1) and a vector of 8x8 image embeddings.

In the decoder, we denote the n-th layer by Z_n = {z_{n,1}, ..., z_{n,T}}, with Z_0 = Y. There are two main differences between the decoder and encoder layers. First, the self-attention sub-layer in the decoder is masked to the right, in order to prevent attending to “future” positions (i.e., z_{n,j} does not attend to z_{n,j+1}, ..., z_{n,T}). Second, in between the self-attention layer and the feedforward layer, the decoder adds a third cross-attention layer that connects z_{n,j} to the top-layer encoder representation H = X_N:

z'_{n,j}  = ATTN(z_{n,j}, Z_{n,1:j}; W_d^q, W_d^k, W_d^v)
z''_{n,j} = ATTN(z'_{n,j}, H; W_c^q, W_c^k, W_c^v)
z_{n+1,j} = FFN(z''_{n,j}; W_d^f)

where W_d^q, W_d^k, and W_d^v are the weight matrices for the query, key, and value transformations in the decoder self-attention sub-layer; W_c^q, W_c^k, and W_c^v are the corresponding decoder weight matrices in the cross-attention sub-layer; and W_d^f is the decoder weight matrix of the feedforward sub-layer.

The Transformer-based models utilize position information at the embedding layer. In the 8x8 case, the 64 embedding vectors are serialized into a 1D sequence with positions [0, ..., 63]. The position information is modeled by applying sine and cosine functions at each position, with different frequencies for each embedding dimension, as in Vaswani et al. (2017), and the result is added to the embedding representations.
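As an illustration of this positional scheme, the following is a small numpy sketch of the standard sinusoidal encoding of Vaswani et al. (2017) applied to the 64 serialized partition embeddings; it is not the paper's code, only the textbook formula.

```python
import numpy as np


def sinusoidal_positions(num_positions: int, dim: int) -> np.ndarray:
    """Sine/cosine position encodings with a different frequency per embedding dimension."""
    positions = np.arange(num_positions)[:, np.newaxis]        # (P, 1)
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    angles = positions * inv_freq[np.newaxis, :]               # (P, dim/2)
    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding


# 8x8 image partitions serialized to positions 0..63, embedding size 512:
# embeddings = embeddings + sinusoidal_positions(64, 512)
```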

COCO-trained outputs:
RNN8x8: “a group of men standing in front of a building” | “a couple of people walking down a walkway” | “a child sitting at a table with a cake on it” | “a close up of a stuffed animal on a table”
T2T8x8: “a group of men in uniform and ties are talking” | “a narrow hallway with a clock and two doors” | “a woman cutting a birthday cake at a party” | “a picture of a fish on the side of a car”

Conceptual-trained outputs:
RNN8x8: “graduates line up for the commencement ceremony” | “a view of the nave” | “a child's drawing at a birthday party” | “a cartoon businessman thinking about something”
T2T8x8: “graduates line up to receive their diplomas” | “the cloister of the cathedral” | “learning about the arts and crafts” | “a cartoon businessman asking for help”

Figure 5: Side-by-side comparison of model outputs under two training conditions. Conceptual-based models (lower half) tend to hallucinate less, are more expressive, and handle a larger variety of images well. The two images in the middle are from Flickr; the other two are from Conceptual Captions.

5 Experimental Results

In this section, we evaluate the impact of using the Conceptual Captions dataset (referred to as ‘Conceptual’ in what follows) for training image captioning models. To this end, we train the models described in Section 4 under two experimental conditions: using the training & development sets provided by the COCO dataset (Lin et al., 2014), versus training & development sets from the Conceptual dataset. We quantitatively evaluate the resulting models using three different test sets: the blind COCO-C40 test set (in-domain for COCO-trained models, out-of-domain for Conceptual-trained models); the Conceptual test set (out-of-domain for COCO-trained models, in-domain for Conceptual-trained models); and the Flickr (Young et al., 2014) 1K test set (out-of-domain for both COCO-trained and Conceptual-trained models).

5.1 Dataset Details

COCO Image Captions The COCO image captioning dataset is normally divided into 82K images for training and 40K images for validation. Each of these images comes with at least 5 groundtruth captions. Following standard practice, we combine the training set with most of the validation set for training our model, and hold out only a subset of 4K images for validation.

Conceptual Captions The Conceptual Captions dataset contains around 3.3M images for training, 28K for validation, and 22.5K for the test set. For more detailed statistics, see Table 3.

5.2 Experimental Setup

Image Preprocessing Each input image is first preprocessed by random distortion and cropping (using a random ratio from 50%∼100%). This prevents models from overfitting to individual pixels of the training images.

Encoder-Decoder For RNN-based models, we use a 1-layer, 512-dim LSTM as the RNN cell. For the Transformer-based models, we use the default setup from Vaswani et al. (2017), with N = 6 encoder and decoder layers, a hidden-layer size of 512, and 8 attention heads.

Text Handling Training captions are truncated to a maximum of 15 tokens. We use a token-type min-count of 4, which results in around 9,000 token types for the COCO dataset and around 25,000 token types for the Conceptual Captions dataset. All other tokens are replaced with the special token <UNK>. The word embedding matrix has size 512 and is tied to the output projection matrix.

Optimization All models are trained using MLE loss and optimized using Adagrad (Duchi et al., 2011) with learning rate 0.01. The mini-batch size is 25. All model parameters are trained for a total of 5M steps, with batch updates asynchronously distributed across 40 workers. The final model is selected based on the best CIDEr score on the development set for the given training condition.

Inference During inference, the decoder prediction at the previous position is fed to the input of the next position. We use a beam search of beam size 4 to compute the most likely output sequence.
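For reference, the setup above can be summarized as a configuration sketch; the grouping and field names below are ours, not identifiers from the paper's codebase, and only the values stated in Section 5.2 are reflected.

```python
# Hyperparameters as reported in Section 5.2; naming and structure are illustrative only.
EXPERIMENT_CONFIG = {
    "image_preprocessing": {"random_crop_ratio": (0.5, 1.0), "random_distortion": True},
    "rnn_model": {"cell": "LSTM", "layers": 1, "hidden_size": 512},
    "transformer_model": {"layers": 6, "hidden_size": 512, "attention_heads": 8},
    "text": {"max_caption_tokens": 15, "token_min_count": 4,
             "embedding_size": 512, "tie_output_projection": True},
    "optimization": {"loss": "MLE", "optimizer": "Adagrad", "learning_rate": 0.01,
                     "batch_size": 25, "train_steps": 5_000_000, "async_workers": 40},
    "model_selection": {"metric": "CIDEr", "split": "dev"},
    "inference": {"beam_size": 4},
}
```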

5.3 Qualitative Results

Before we present the numerical results of our experiments, we briefly discuss the patterns that we have observed.

One difference between COCO-trained models and Conceptual-trained models is their ability to use the appropriate natural-language terms for the entities in an image. For the left-most image in Fig. 5, COCO-trained models use “group of men” to refer to the people in the image; Conceptual-based models use the more appropriate and informative term “graduates”. The second image, from the Flickr test set, makes this even clearer: the Conceptual-trained T2T8x8 model perfectly renders the image content as “the cloister of the cathedral”. None of the other models come close to producing such an accurate description.

A second difference is that COCO-trained models often seem to hallucinate objects. For instance, they hallucinate “front of building” for the first image, “clock and two doors” for the second, and “birthday cake” for the third image. In contrast, Conceptual-trained models do not seem to have this problem. We hypothesize that the hallucination issue for COCO-based models comes from the high correlations present in the COCO data (e.g., if there is a kid at a table, there is also cake). This high degree of correlation in the data does not allow the captioning model to correctly disentangle and learn representations at the right level of granularity.

A third difference is the resilience to a large spectrum of image types. COCO contains only natural images, and therefore a cartoon image like the fourth one results in massive hallucination effects for COCO-trained models (“stuffed animal”, “fish”, “side of car”). In contrast, Conceptual-trained models handle such images with ease.

5.4 Quantitative Results

In this section, we present quantitative results on the quality of the outputs produced by several image captioning models. We present both automatic evaluation results and human evaluation results.

5.4.1 Human Evaluation Results

For human evaluations, we use a pool of professional raters (tens of raters), under a double-blind evaluation condition. Raters are asked to assign a GOOD or BAD label to a given ⟨image, caption⟩ input, using just common-sense judgment. This approximates the reaction of a typical user, who normally would not accept predefined notions of GOOD vs. BAD. We ask 3 separate raters to rate each input pair and report the percentage of pairs that receive k or more (k+) GOOD annotations.

Table 4: Human evaluation results on the Flickr 1K test set.
Model | Training | 1+ | 2+ | 3+
RNN8x8 | COCO | 0.390 | 0.276 | 0.173
T2T8x8 | COCO | 0.478 | 0.362 | 0.275
RNN8x8 | Conceptual | 0.571 | 0.418 | 0.277
T2T8x8 | Conceptual | 0.659 | 0.506 | 0.355

In Table 4, we report the results on the Flickr 1K test set. This evaluation is out-of-domain for both training conditions, so all models are on relatively equal footing. The results indicate that the Conceptual-based models are superior: in 50.6% of cases (for the T2T8x8 model), a majority of annotators (2+) assigned a GOOD label. The results also indicate that the Transformer-based models are superior to the RNN-based models by a good margin, over 8 points (for 2+), under both the COCO and Conceptual training conditions.

5.4.2 Automatic Evaluation Results

In this section, we report automatic evaluation results using established image captioning metrics. For the COCO C40 test set (Table 5), we report the numerical values returned by the COCO online evaluation server‡, using the CIDEr (Vedantam et al., 2015), ROUGE-L (Lin and Och, 2004), and METEOR (Banerjee and Lavie, 2005) metrics. For the Conceptual Captions (Table 6) and Flickr (Table 7) test sets, we report numerical values for CIDEr, ROUGE-L, and SPICE (Anderson et al., 2016)§. For all metrics, a higher number means a closer distance between the candidates and the groundtruth captions.

‡ http://mscoco.org/dataset/#captions-eval
§ https://github.com/tylin/coco-caption

Table 5: Automatic metrics on the COCO C40 test set.
Model | Training | CIDEr | ROUGE-L | METEOR
RNN1x1 | COCO | 1.021 | 0.694 | 0.348
RNN8x8 | COCO | 1.044 | 0.698 | 0.354
T2T1x1 | COCO | 1.032 | 0.700 | 0.358
T2T8x8 | COCO | 1.032 | 0.700 | 0.356
RNN1x1 | Conceptual | 0.403 | 0.445 | 0.191
RNN8x8 | Conceptual | 0.410 | 0.437 | 0.189
T2T1x1 | Conceptual | 0.348 | 0.403 | 0.171
T2T8x8 | Conceptual | 0.345 | 0.400 | 0.170

Table 6: Automatic metrics on the 22.5K Conceptual Captions test set.
Model | Training | CIDEr | ROUGE-L | SPICE
RNN1x1 | COCO | 0.183 | 0.149 | 0.062
RNN8x8 | COCO | 0.191 | 0.152 | 0.065
T2T1x1 | COCO | 0.184 | 0.148 | 0.062
T2T8x8 | COCO | 0.190 | 0.151 | 0.064
RNN1x1 | Conceptual | 1.351 | 0.326 | 0.235
RNN8x8 | Conceptual | 1.401 | 0.330 | 0.240
T2T1x1 | Conceptual | 1.588 | 0.331 | 0.254
T2T8x8 | Conceptual | 1.676 | 0.336 | 0.257

Table 7: Automatic metrics on the Flickr 1K test set.
Model | Training | CIDEr | ROUGE-L | SPICE
RNN1x1 | COCO | 0.340 | 0.414 | 0.101
RNN8x8 | COCO | 0.356 | 0.413 | 0.103
T2T1x1 | COCO | 0.341 | 0.404 | 0.101
T2T8x8 | COCO | 0.359 | 0.416 | 0.103
RNN1x1 | Conceptual | 0.269 | 0.310 | 0.076
RNN8x8 | Conceptual | 0.275 | 0.309 | 0.076
T2T1x1 | Conceptual | 0.226 | 0.280 | 0.068
T2T8x8 | Conceptual | 0.227 | 0.277 | 0.066

The automatic metrics are good at detecting in- vs. out-of-domain situations. For COCO-trained models tested on COCO, the results in Table 5 show CIDEr scores in the 1.02-1.04 range for both RNN- and Transformer-based models; the scores drop to the 0.35-0.41 range (CIDEr) for the Conceptual-trained models tested against COCO groundtruth. For Conceptual-trained models tested on the Conceptual Captions test set, the results in Table 6 show scores as high as 1.676 CIDEr for the T2T8x8 model, which corroborates the human-evaluation finding that the Transformer-based models are superior to the RNN-based models; the scores for the COCO-trained models tested against Conceptual Captions groundtruth are all below 0.2 CIDEr.
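These metrics can be computed with the coco-caption toolkit cited in footnote §. The snippet below is a minimal sketch assuming the pycocoevalcap packaging of that toolkit; it further assumes captions are already tokenized and lowercased, and the exact scoring setup of the official COCO evaluation server may differ.

```python
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Keys are image ids; values are lists of caption strings (one candidate per image).
references = {
    "img1": ["graduates line up for the commencement ceremony"],
    "img2": ["a worker helps to clear the debris"],
}
candidates = {
    "img1": ["graduates line up to receive their diplomas"],
    "img2": ["a worker helps to clear the debris"],
}

cider_score, _ = Cider().compute_score(references, candidates)
rouge_score, _ = Rouge().compute_score(references, candidates)
print(f"CIDEr: {cider_score:.3f}  ROUGE-L: {rouge_score:.3f}")
```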

The automatic metrics fail to corroborate the human evaluation results. According to the automatic metrics, the COCO-trained models are superior to the Conceptual-trained models (CIDEr scores in the mid-0.3 range for the COCO-trained condition, versus mid-0.2 for the Conceptual-trained condition), and the RNN-based models are superior to the Transformer-based models. Notably, these are the same metrics that score humans lower than the methods that won the COCO 2015 challenge (Vinyals et al., 2015a; Fang et al., 2015), despite the fact that humans are still much better at this task. The failure of these metrics to align with the human evaluation results again casts grave doubts on their ability to drive progress in this field. A significant weakness of these metrics is that hallucination effects are under-penalized (a small precision penalty for tokens with no correspondent in the reference), compared to human judgments, which tend to dive dramatically in the presence of hallucinations.

6 Conclusions

We present a new image captioning dataset, Conceptual Captions, which has several key characteristics: it has around 3.3M examples, an order of magnitude more than the COCO image-captioning dataset; it consists of a wide variety of images, including natural images, product images, professional photos, cartoons, drawings, etc.; and its captions are based on descriptions taken from original Alt-text attributes, automatically transformed to achieve a balance between cleanliness, informativeness, and learnability.

We evaluate both the quality of the resulting image/caption pairs and the performance of several image-captioning models when trained on the Conceptual Captions data. The results indicate that such models achieve better performance, and avoid some of the pitfalls seen with COCO-trained models, such as object hallucination. We hope that the availability of the Conceptual Captions dataset will foster considerable progress on the automatic image-captioning task.

References

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In ECCV.

D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.

Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1–127.

Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. JAIR 55.

Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert Henry, Robert Bradshaw, and Nathan Weizenbaum. 2010. FlumeJava: Easy, efficient data-parallel pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 363–375. http://dl.acm.org/citation.cfm?id=1806638.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.

Nan Ding and Radu Soricut. 2017. Cold-start reinforcement learning with softmax policy gradients. In NIPS.

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2014. Long-term recurrent convolutional networks for visual recognition and description. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121–2159.

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, et al. 2015. From captions to visual concepts and back. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR.

Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. 2016. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR abs/1611.10012.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2015. Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics.

A. Krizhevsky, I. Sutskever, and G. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.

Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of ACL.

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. CoRR abs/1405.0312.

Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2017. Optimization of image description metrics using policy gradient methods. In International Conference on Computer Vision (ICCV).

Junhua Mao, Jiajing Xu, Yushi Jing, and Alan Yuille. 2016. Training and evaluating multimodal word embeddings with large-scale web annotated images. In NIPS.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR abs/1511.06732.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR abs/1602.07261.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015a. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015b. Show and tell: A neural image caption generator. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proc. of the 32nd International Conference on Machine Learning (ICML).

Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen. 2016. Review networks for caption generation. In NIPS.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2:67–78.
