
DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Justin Johnson∗ Andrej Karpathy∗ Li Fei-Fei


Department of Computer Science, Stanford University
{jcjohns,karpathy,feifeili}@cs.stanford.edu

Abstract

We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and Image Captioning when one predicted region covers the full image. To address the localization and description task jointly we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state-of-the-art approaches in both generation and retrieval settings.

Figure 1. We address the Dense Captioning task (bottom right) with a model that jointly generates both dense and rich annotations in a single forward pass. (The figure arranges tasks along a label density axis, whole image vs. image regions, and a label complexity axis, single label vs. sequence: Classification, Detection, Captioning, and Dense Captioning.)

1. Introduction

Our ability to effortlessly point out and describe all aspects of an image relies on a strong semantic understanding of a visual scene and all of its elements. However, despite numerous potential applications, this ability remains a challenge for our state of the art visual recognition systems.

In the last few years there has been significant progress in image classification [39, 26, 53, 45], where the task is to assign one label to an image. Further work has pushed these advances along two orthogonal directions: First, rapid progress in object detection [40, 14, 46] has identified models that efficiently identify and label multiple salient regions of an image. Second, recent advances in image captioning [3, 32, 21, 49, 51, 8, 4] have expanded the complexity of the label space from a fixed set of categories to sequences of words able to express significantly richer concepts.

However, despite encouraging progress along the label density and label complexity axes, these two directions have remained separate. In this work we take a step towards unifying these two inter-connected tasks into one joint framework. First, we introduce the dense captioning task (see Figure 1), which requires a model to predict a set of descriptions across regions of an image. Object detection is hence recovered as a special case when the target labels consist of one word, and image captioning is recovered when all images consist of one region that spans the full image.

Additionally, we develop a Fully Convolutional Localization Network (FCLN) for the dense captioning task. Our model is inspired by recent work in image captioning [49, 21, 32, 8, 4] in that it is composed of a Convolutional Neural Network and a Recurrent Neural Network language model. However, drawing on work in object detection [38], our second core contribution is to introduce a new dense localization layer. This layer is fully differentiable and can be inserted into any neural network that processes images to enable region-level training and predictions. Internally, the localization layer predicts a set of regions of interest in the image and then uses bilinear interpolation [19, 16] to smoothly crop the activations in each region.

We evaluate the model on the large-scale Visual Genome dataset, which contains 94,000 images and 4,100,000 region captions. Our results show both performance and speed improvements over approaches based on previous state of the art. We make our code and data publicly available to support further progress on the dense captioning task.

∗ Both authors contributed equally to this work.
2. Related Work

Our work draws on recent work in object detection, image captioning, and soft spatial attention that allows downstream processing of particular regions in the image.

Object Detection. Our core visual processing module is a Convolutional Neural Network (CNN) [29, 26], which has emerged as a powerful model for visual recognition tasks [39]. The first application of these models to dense prediction tasks was introduced in R-CNN [14], where each region of interest was processed independently. Further work has focused on processing all regions with only a single forward pass of the CNN [17, 13], and on eliminating explicit region proposal methods by directly predicting the bounding boxes either in the image coordinate system [46, 9], or in a fully convolutional [31] and hence position-invariant setting [40, 38, 37]. Most related to our approach is the work of Ren et al. [38], who develop a region proposal network (RPN) that regresses from anchors to regions of interest. However, they adopt a 4-step optimization process, while our approach does not require a multi-stage training pipeline. Additionally, we replace their RoI pooling mechanism with a differentiable, spatial soft attention mechanism [19, 16]. In particular, this change allows us to backpropagate through the region proposal network and train the whole model jointly.

Image Captioning. Several pioneering approaches have explored the task of describing images with natural language [1, 27, 12, 34, 42, 43, 28, 20]. More recent approaches based on neural networks have adopted Recurrent Neural Networks (RNNs) [50, 18] as the core architectural element for generating captions. These models have previously been used in language modeling [2, 15, 33, 44], where they are known to learn powerful long-term interactions [22]. Several recent approaches to Image Captioning [32, 21, 49, 8, 4, 24, 11] rely on an RNN language model conditioned on image information, possibly with soft attention mechanisms [51, 5]. Similar to our work, Karpathy and Fei-Fei [21] run an image captioning model on regions, but they do not tackle the joint task of detection and description in one model. Our model is end-to-end and designed in such a way that the prediction for each region is a function of the global image context, which we show also ultimately leads to stronger performance. Finally, the metrics we develop for the dense captioning task are inspired by metrics developed for image captioning [48, 7, 3].

3. Model

Overview. Our goal is to design an architecture that jointly localizes regions of interest and then describes each with natural language. The primary challenge is to develop a model that supports end-to-end training with a single step of optimization, and both efficient and effective inference. Our proposed architecture (see Figure 2) draws on architectural elements present in recent work on object detection, image captioning and soft spatial attention to simultaneously address these design constraints.

In Section 3.1 we first describe the components of our model. Then in Sections 3.2 and 3.3 we address the loss function and the details of training and inference.

3.1. Model Architecture

3.1.1 Convolutional Network

We use the VGG-16 architecture [41] for its state-of-the-art performance [39]. It consists of 13 layers of 3 × 3 convolutions interspersed with 5 layers of 2 × 2 max pooling. We remove the final pooling layer, so an input image of shape 3 × W × H gives rise to a tensor of features of shape C × W′ × H′, where C = 512, W′ = ⌊W/16⌋, and H′ = ⌊H/16⌋. The output of this network encodes the appearance of the image at a set of uniformly sampled image locations, and forms the input to the localization layer.

3.1.2 Fully Convolutional Localization Layer

The localization layer receives an input tensor of activations, identifies spatial regions of interest and smoothly extracts a fixed-sized representation from each region. Our approach is based on that of Faster R-CNN [38], but we replace their RoI pooling mechanism [13] with bilinear interpolation [19], allowing our model to propagate gradients backward through the coordinates of predicted regions. This modification opens up the possibility of predicting affine or morphed region proposals instead of bounding boxes [19], but we leave these extensions to future work.

Inputs/outputs. The localization layer accepts a tensor of activations of size C × W′ × H′. It then internally selects B regions of interest and returns three output tensors giving information about these regions:

1. Region Coordinates: A matrix of shape B × 4 giving bounding box coordinates for each output region.

2. Region Scores: A vector of length B giving a confidence score for each output region. Regions with high confidence scores are more likely to correspond to ground-truth regions of interest.

3. Region Features: A tensor of shape B × C × X × Y giving features for the output regions; each region is represented by an X × Y grid of C-dimensional features.

Convolutional Anchors. Similar to Faster R-CNN [38], our localization layer predicts region proposals by regressing offsets from a set of translation-invariant anchors. In particular, we project each point in the W′ × H′ grid of input features back into the W × H image plane, and consider k anchor boxes of different aspect ratios centered at this projected point. For each of these k anchor boxes, the localization layer predicts a confidence score and four scalars regressing from the anchor to the predicted box coordinates. These are computed by passing the input feature map through a 3 × 3 convolution with 256 filters, a rectified linear nonlinearity, and a 1 × 1 convolution with 5k filters. This results in a tensor of shape 5k × W′ × H′ containing scores and offsets for all anchors.
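As a concrete illustration of the anchor prediction head described above (3 × 3 convolution with 256 filters, ReLU, then a 1 × 1 convolution with 5k filters), the following PyTorch sketch maps the VGG-16 feature map to one confidence score and four regression scalars per anchor at every location. The paper's released implementation is in Torch 7, so this re-implementation, the class name, and the tensor layout (PyTorch's N × C × H × W rather than the paper's C × W′ × H′) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AnchorHead(nn.Module):
    """Predicts, for each of k anchors at every feature-map location,
    one confidence score and four box-regression scalars (5k channels total)."""
    def __init__(self, in_channels=512, k=12):
        super().__init__()
        self.k = k
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.pred = nn.Conv2d(256, 5 * k, kernel_size=1)

    def forward(self, features):                      # features: (N, C, H', W')
        out = self.pred(self.relu(self.conv(features)))   # (N, 5k, H', W')
        n, _, hp, wp = out.shape
        out = out.view(n, self.k, 5, hp, wp)
        scores = out[:, :, 0]                         # (N, k, H', W') anchor confidences
        offsets = out[:, :, 1:]                       # (N, k, 4, H', W') -> (tx, ty, tw, th)
        return scores, offsets

# Example: features for a 720 x 540 image have spatial size roughly 45 x 33 (W/16, H/16)
feats = torch.randn(1, 512, 540 // 16, 720 // 16)
scores, offsets = AnchorHead()(feats)
```

With k = 12 anchors the 5k output channels correspond to the 5k × W′ × H′ tensor of scores and offsets described above.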
Figure 2. Model overview. An input image is first processed by a CNN. The Localization Layer proposes regions and smoothly extracts a batch of corresponding activations using bilinear interpolation. These regions are processed with a fully-connected recognition network and described with an RNN language model. The model is trained end-to-end with gradient descent. (The diagram traces shapes through the pipeline: image 3 × W × H → conv features C × W′ × H′ → region proposals 4k × W′ × H′ and region scores k × W′ × H′ → B best proposals (B × 4) → sampling grid B × X × Y × 2 → bilinear sampler → region features B × 512 × 7 × 7 → recognition network → region codes B × D → LSTM captions such as “Striped gray cat” and “Cats watching TV”.)

Box Regression. We adopt the parameterization of [13] to regress from anchors to the region proposals. Given an anchor box with center (xa, ya), width wa, and height ha, our model predicts scalars (tx, ty, tw, th) giving normalized offsets and log-space scaling transforms, so that the output region has center (x, y) and shape (w, h) given by

x = xa + tx wa        y = ya + ty ha        (1)
w = wa exp(tw)        h = ha exp(th)        (2)

Note that the boxes are discouraged from drifting too far from their anchors due to L2 regularization.

Box Sampling. Processing a typical image of size W = 720, H = 540 with k = 12 anchor boxes gives rise to 17,280 region proposals. Since running the recognition network and the language model for all proposals would be prohibitively expensive, it is necessary to subsample them.

At training time, we follow the approach of [38] and sample a minibatch containing B = 256 boxes with at most B/2 positive regions and the rest negatives. A region is positive if it has an intersection over union (IoU) of at least 0.7 with some ground-truth region; in addition, the predicted region of maximal IoU with each ground-truth region is positive. A region is negative if it has IoU < 0.3 with all ground-truth regions. Our sampled minibatch contains BP ≤ B/2 positive regions and BN = B − BP negative regions, sampled uniformly without replacement from the set of all positive and all negative regions respectively.

At test time we subsample using greedy non-maximum suppression (NMS) based on the predicted proposal confidences to select the B = 300 most confident proposals. The coordinates and confidences of the sampled proposals are collected into tensors of shape B × 4 and B respectively, and are output from the localization layer.

Bilinear Interpolation. After sampling, we are left with region proposals of varying sizes and aspect ratios. In order to interface with the fully-connected recognition network and the RNN language model, we must extract a fixed-size feature representation for each variably sized region proposal. To solve this problem, Fast R-CNN [13] proposes an RoI pooling layer where each region proposal is projected onto the W′ × H′ grid of convolutional features and divided into a coarse X × Y grid aligned to pixel boundaries by rounding. Features are max-pooled within each grid cell, resulting in an X × Y grid of output features.

The RoI pooling layer is a function of two inputs: convolutional features and region proposal coordinates. Gradients can be propagated backward from the output features to the input features, but not to the input proposal coordinates. To overcome this limitation, we replace the RoI pooling layer with bilinear interpolation [16, 19].

Concretely, given an input feature map U of shape C × W′ × H′ and a region proposal, we interpolate the features of U to produce an output feature map V of shape C × X × Y. After projecting the region proposal onto U we follow [19] and compute a sampling grid G of shape X × Y × 2 associating each element of V with real-valued coordinates into U. If Gi,j = (xi,j, yi,j) then Vc,i,j should be equal to U at (c, xi,j, yi,j); however, since (xi,j, yi,j) are real-valued, we convolve with a sampling kernel k and set

Vc,i,j = Σ(i′=1..W′) Σ(j′=1..H′) Uc,i′,j′ k(i′ − xi,j) k(j′ − yi,j).        (3)

We use bilinear sampling, corresponding to the kernel k(d) = max(0, 1 − |d|). The sampling grid is a linear function of the proposal coordinates, so gradients can be propagated backward into predicted region proposal coordinates. Running bilinear interpolation to extract features for all sampled regions gives a tensor of shape B × C × X × Y, forming the final output from the localization layer.
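A minimal numpy sketch of the box parameterization in Eqs. (1)–(2); the function name and the center-size box convention used here are assumptions for illustration, not the authors' code.

```python
import numpy as np

def decode_boxes(anchors, transforms):
    """Apply predicted (tx, ty, tw, th) to anchors (xa, ya, wa, ha), as in Eqs. (1)-(2).

    anchors:    (N, 4) array of (xa, ya, wa, ha) in center-size format
    transforms: (N, 4) array of (tx, ty, tw, th)
    returns:    (N, 4) array of decoded (x, y, w, h)
    """
    xa, ya, wa, ha = anchors.T
    tx, ty, tw, th = transforms.T
    x = xa + tx * wa                 # Eq. (1): offsets normalized by anchor size
    y = ya + ty * ha
    w = wa * np.exp(tw)              # Eq. (2): log-space scaling
    h = ha * np.exp(th)
    return np.stack([x, y, w, h], axis=1)

# One anchor centered at (100, 80) with size 64 x 32, nudged right and widened 1.5x
print(decode_boxes(np.array([[100., 80., 64., 32.]]),
                   np.array([[0.25, 0.0, np.log(1.5), 0.0]])))
# -> [[116.  80.  96.  32.]]
```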

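The train-time sampling rule above (IoU ≥ 0.7 positive, plus the best proposal per ground-truth box; IoU < 0.3 negative; at most B/2 positives in a minibatch of B = 256) can be sketched as follows. Boxes are assumed to be in (x1, y1, x2, y2) corner format and the helper names are illustrative.

```python
import numpy as np

def iou_matrix(boxes, gt):
    """Pairwise IoU between proposals (P, 4) and ground-truth boxes (G, 4)."""
    x1 = np.maximum(boxes[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_b[:, None] + area_g[None, :] - inter)

def sample_minibatch(boxes, gt, B=256, rng=np.random.default_rng(0)):
    """Return indices of sampled positive / negative proposals
    (assumes both the positive and negative sets are non-empty)."""
    iou = iou_matrix(boxes, gt)
    pos_mask = iou.max(axis=1) >= 0.7
    pos_mask[iou.argmax(axis=0)] = True          # best proposal per GT box is positive
    neg_mask = iou.max(axis=1) < 0.3
    pos_idx = np.flatnonzero(pos_mask)
    neg_idx = np.flatnonzero(neg_mask)
    n_pos = min(len(pos_idx), B // 2)            # BP <= B/2
    pos_sample = rng.choice(pos_idx, n_pos, replace=False)
    n_neg = min(len(neg_idx), B - n_pos)         # BN = B - BP
    neg_sample = rng.choice(neg_idx, n_neg, replace=False)
    return pos_sample, neg_sample
```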
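Below is a naive numpy rendering of Eq. (3) with the bilinear kernel k(d) = max(0, 1 − |d|). It sums over every feature-map cell exactly as written, although only the four nearest cells contribute; because the output is piecewise-linear in the sampling coordinates, gradients can flow into them, as noted above. The function names and index conventions are assumptions.

```python
import numpy as np

def bilinear_kernel(d):
    """k(d) = max(0, 1 - |d|): only the two nearest integer coordinates contribute."""
    return np.maximum(0.0, 1.0 - np.abs(d))

def bilinear_sample(U, grid):
    """Naive implementation of Eq. (3).

    U:    (C, Hp, Wp) feature map
    grid: (X, Y, 2) real-valued sampling coordinates (x_ij, y_ij) into U,
          where x indexes the Wp axis and y indexes the Hp axis
    returns V of shape (C, X, Y)
    """
    C, Hp, Wp = U.shape
    X, Y, _ = grid.shape
    xs = np.arange(Wp)                       # i' over the W' axis
    ys = np.arange(Hp)                       # j' over the H' axis
    V = np.zeros((C, X, Y))
    for i in range(X):
        for j in range(Y):
            x_ij, y_ij = grid[i, j]
            wx = bilinear_kernel(xs - x_ij)  # k(i' - x_ij)
            wy = bilinear_kernel(ys - y_ij)  # k(j' - y_ij)
            # Sum over all (i', j'); only the 4 nearest cells have nonzero weight.
            V[:, i, j] = np.einsum('chw,w,h->c', U, wx, wy)
    return V

# Sample a 2 x 2 grid from a random 512-channel feature map
U = np.random.rand(512, 33, 45)
grid = np.array([[[10.3, 5.7], [12.3, 5.7]],
                 [[10.3, 7.7], [12.3, 7.7]]])
V = bilinear_sample(U, grid)                 # shape (512, 2, 2)
```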
3.1.3 Recognition Network

The recognition network is a fully-connected neural network that processes region features from the localization layer. The features from each region are flattened into a vector and passed through two fully-connected layers, each using rectified linear units and regularized using Dropout. For each region this produces a code of dimension D = 4096 that compactly encodes its visual appearance. The codes for all positive regions are collected into a matrix of shape B × D and passed to the RNN language model.

In addition, we allow the recognition network one more chance to refine the confidence and position of each proposal region. It outputs a final scalar confidence for each proposed region and four scalars encoding a final spatial offset to be applied to the region proposal. These two outputs are computed as a linear transform from the D-dimensional code for each region. The final box regression uses the same parameterization as Section 3.1.2.

3.1.4 RNN Language Model

Following previous work [32, 21, 49, 8, 4], we use the region codes to condition an RNN language model [15, 33, 44]. Concretely, given a training sequence of tokens s1, . . . , sT, we feed the RNN T + 2 word vectors x−1, x0, x1, . . . , xT, where x−1 = CNN(I) is the region code encoded with a linear layer and followed by a ReLU non-linearity, x0 corresponds to a special START token, and xt encodes the token st for t = 1, . . . , T. The RNN computes a sequence of hidden states ht and output vectors yt using a recurrence formula ht, yt = f(ht−1, xt) (we use the LSTM [18] recurrence). The vectors yt have size |V| + 1, where V is the token vocabulary and the additional entry is for a special END token. The loss function on the vectors yt is the average cross entropy, where the targets at times t = 0, . . . , T − 1 are the token indices for st+1, and the target at t = T is the END token. The vector y−1 is ignored. Our tokens and hidden layers have size 512.

At test time we feed the visual information x−1 to the RNN. At each time step we sample the most likely next token and feed it to the RNN in the next time step, repeating the process until the special END token is sampled.

3.2. Loss function

During training our ground truth consists of positive boxes and descriptions. Our model predicts positions and confidences of sampled regions twice: in the localization layer and again in the recognition network. We use binary logistic losses for the confidences trained on sampled positive and negative regions. For box regression, we use a smooth L1 loss in transform coordinate space similar to [38]. The fifth term in our loss function is a cross-entropy term at every time-step of the language model.

We normalize all loss functions by the batch size and sequence length in the RNN. We searched over an effective setting of the weights between these contributions and found that a reasonable setting is to use a weight of 0.1 for the first four criteria, and a weight of 1.0 for captioning.

3.3. Training and optimization

We train the full model end-to-end in a single step of optimization. We initialize the CNN with weights pretrained on ImageNet [39] and all other weights from a Gaussian with standard deviation of 0.01. We use stochastic gradient descent with momentum 0.9 to train the weights of the convolutional network, and Adam [23] to train the other components of the model. We use a learning rate of 1 × 10−6 and set β1 = 0.9, β2 = 0.99. We begin fine-tuning the layers of the CNN after 1 epoch, and for efficiency we do not fine-tune the first four convolutional layers of the network.

Our training batches consist of a single image that has been resized so that the longer side has 720 pixels. Our implementation uses Torch 7 [6] and [36]. One mini-batch runs in approximately 300ms on a Titan X GPU and it takes about three days of training for the model to converge.
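To make the decoding procedure of Section 3.1.4 concrete, here is a hedged PyTorch sketch of conditioning an LSTM on a region code and greedily sampling until END. The class name, the token-id layout for START/END, and max_len are illustrative assumptions; this is not the authors' Torch 7 implementation.

```python
import torch
import torch.nn as nn

class RegionCaptioner(nn.Module):
    """Sketch of the language model: a region code conditions an LSTM,
    START begins decoding, and END terminates it."""
    def __init__(self, code_dim=4096, vocab_size=10497, hidden=512):
        super().__init__()
        self.end = vocab_size                      # END is an extra output class
        self.start = vocab_size + 1                # START is an extra input-only token
        self.embed = nn.Embedding(vocab_size + 2, hidden)
        self.encode = nn.Sequential(nn.Linear(code_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size + 1)   # |V| + 1 scores (words + END)

    @torch.no_grad()
    def generate(self, region_code, max_len=10):
        """region_code: (1, code_dim) tensor from the recognition network."""
        state = self.lstm(self.encode(region_code))    # feed x_{-1}; its output is ignored
        token = torch.tensor([self.start])             # x_0 = START
        caption = []
        for _ in range(max_len):
            state = self.lstm(self.embed(token), state)
            token = self.out(state[0]).argmax(dim=1)   # greedily take the most likely token
            if token.item() == self.end:
                break
            caption.append(token.item())
        return caption

caption_ids = RegionCaptioner().generate(torch.randn(1, 4096))
```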
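A sketch of the optimization setup in Sections 3.2–3.3: SGD with momentum 0.9 for the convolutional network, Adam for the remaining components, and the 0.1 / 1.0 loss weighting. Applying the single quoted learning rate of 1e-6 to both optimizers, as well as the function and argument names, are assumptions made for this illustration.

```python
import torch

def build_optimizers(cnn_params, other_params):
    """Two optimizers as in Section 3.3: SGD+momentum for the CNN,
    Adam for the localization / recognition / RNN components."""
    sgd = torch.optim.SGD(cnn_params, lr=1e-6, momentum=0.9)
    adam = torch.optim.Adam(other_params, lr=1e-6, betas=(0.9, 0.99))
    return sgd, adam

def total_loss(loc_cls, loc_box, rec_cls, rec_box, caption_ce):
    """Weighted sum from Section 3.2: 0.1 for the four detection terms
    (confidence + box regression, predicted twice), 1.0 for captioning."""
    return 0.1 * (loc_cls + loc_box + rec_cls + rec_box) + 1.0 * caption_ce

# Usage sketch with hypothetical parameter groups and per-term losses:
#   sgd, adam = build_optimizers(cnn.parameters(), rest.parameters())
#   loss = total_loss(l1, l2, l3, l4, l5); loss.backward(); sgd.step(); adam.step()
```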
4. Experiments

Dataset. Existing datasets that relate images and natural language either only include full image captions [3, 52], or ground words of image captions in regions but do not provide individual region captions [35]. We perform our experiments using the Visual Genome (VG) region captions dataset [25].¹ Our version contained 94,313 images and 4,100,413 snippets of text (43.5 per image), each grounded to a region of an image. Images were taken from the intersection of MS COCO and YFCC100M [47], and annotations were collected on Amazon Mechanical Turk by asking workers to draw a bounding box on the image and describe its content in text. Example captions from the dataset include “cats play with toys hanging from a perch”, “newspapers are scattered across a table”, “woman pouring wine into a glass”, “mane of a zebra”, and “red light”.

¹ The dataset can be downloaded at http://visualgenome.org/.

Figure 3. Example captions generated and localized by our model on test images. We render the top few most confident predictions. On the bottom row we additionally contrast the amount of information our model generates compared to the Full image RNN. (The bottom row shows whole-image captions such as “A large jetliner flying through a blue sky”, “A train is traveling down the tracks near a forest”, “A man and a woman sitting at a table with a cake”, and “A teddy bear with a red bow on it” for comparison.)

Preprocessing. We collapse words that appear fewer than 15 times into a special <UNK> token, giving a vocabulary of 10,497 words. We strip referring phrases such as “there is...” or “this seems to be a”. For efficiency we discard all annotations with more than 10 words (7% of annotations). We also discard all images that have fewer than 20 or more than 50 annotations to reduce the variation in the number of regions per image. We are left with 87,398 images; we assign 5,000 each to the val/test splits and the rest to train.

For test time evaluation we also preprocess the ground truth regions in the validation/test images by merging heavily overlapping boxes into single boxes with several reference captions. For each image we iteratively select the box with the highest number of overlapping boxes (based on IoU with a threshold of 0.7), and merge these together (by taking the mean) into a single box with multiple reference captions. We then exclude this group and repeat the process.

4.1. Dense Captioning

In the dense captioning task the model receives a single image and produces a set of regions, each annotated with a confidence and a caption.

Evaluation metrics. Intuitively, we would like our model to produce both well-localized predictions (as in object detection) and accurate descriptions (as in image captioning). Inspired by evaluation metrics in object detection [10, 30] and image captioning [48], we propose to measure the mean Average Precision (AP) across a range of thresholds for both localization and language accuracy. For localization we use intersection over union (IoU) thresholds .3, .4, .5, .6, .7. For language we use METEOR score thresholds 0, .05, .1, .15, .2, .25. We adopt METEOR since this metric was found to be most highly correlated with human judgments in settings with a low number of references [48]. We measure the average precision across all pairwise settings of these thresholds and report the mean AP.

To isolate the accuracy of language in the predicted captions without localization, we also merge ground truth captions across each test image into a bag of reference sentences and evaluate predicted captions with respect to these references without taking into account their spatial position.
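A simplified, single-image sketch of the metric just described: average precision is computed for every pairwise (IoU threshold, METEOR threshold) setting and then averaged. It assumes precomputed IoU and METEOR matrices and a greedy matching rule, and is not the official evaluation code.

```python
import numpy as np
from itertools import product

IOU_THRESH = [0.3, 0.4, 0.5, 0.6, 0.7]
METEOR_THRESH = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25]

def average_precision(confidences, iou, meteor, iou_t, met_t):
    """AP for one image and one threshold pair (simplified sketch).

    confidences: (P,)   prediction scores
    iou:         (P, G) IoU of each predicted box with each ground-truth region
    meteor:      (P, G) METEOR of each predicted caption vs. each region's references
    """
    order = np.argsort(-confidences)
    matched = np.zeros(iou.shape[1], dtype=bool)
    tp = np.zeros(len(order)); fp = np.zeros(len(order))
    for rank, p in enumerate(order):
        ok = (iou[p] >= iou_t) & (meteor[p] >= met_t) & ~matched
        if ok.any():
            matched[np.flatnonzero(ok)[0]] = True    # greedily claim one GT region
            tp[rank] = 1
        else:
            fp[rank] = 1
    recall = np.cumsum(tp) / iou.shape[1]
    precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
    return np.trapz(precision, recall)               # area under the PR curve

def mean_ap(confidences, iou, meteor):
    """Mean of AP over all pairwise (IoU, METEOR) threshold settings."""
    aps = [average_precision(confidences, iou, meteor, it, mt)
           for it, mt in product(IOU_THRESH, METEOR_THRESH)]
    return float(np.mean(aps))
```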
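The ground-truth merging step described under Preprocessing above (iteratively pick the box with the most IoU ≥ 0.7 overlaps, average the group into one box, pool its captions, exclude the group, and repeat) can be sketched as follows; corner-format boxes and the helper names are assumptions.

```python
import numpy as np

def merge_ground_truth(boxes, captions, iou_thresh=0.7):
    """Iteratively merge heavily overlapping ground-truth boxes (x1, y1, x2, y2)
    into single boxes with pooled reference captions."""
    def iou(a, b):                                   # a: (4,), b: (N, 4)
        x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
        x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda z: (z[..., 2] - z[..., 0]) * (z[..., 3] - z[..., 1])
        return inter / (area(a) + area(b) - inter)

    boxes = np.asarray(boxes, dtype=float)
    captions = list(captions)
    merged = []
    remaining = list(range(len(boxes)))
    while remaining:
        idx = np.array(remaining)
        overlaps = np.array([(iou(boxes[i], boxes[idx]) >= iou_thresh).sum() for i in idx])
        seed = idx[overlaps.argmax()]                # box with most overlapping boxes
        group = idx[iou(boxes[seed], boxes[idx]) >= iou_thresh]
        merged.append((boxes[group].mean(axis=0),    # merge by taking the mean
                       [captions[g] for g in group]))
        remaining = [i for i in remaining if i not in set(group.tolist())]
    return merged
```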
Baseline models. Following Karpathy and Fei-Fei [21], we train only the Image Captioning model (excluding the localization layer) on individual, resized regions. We refer to this approach as a Region RNN model. To investigate the differences between captioning trained on full images or regions, we also train the same model on full images and captions from MS COCO (Full Image RNN model).

At test time we consider three sources of region proposals. First, to establish an upper bound we evaluate the model on ground truth boxes (GT). Second, similar to [21] we use an external region proposal method to extract 300 boxes for each test image. We use EdgeBoxes [54] (EB) due to their strong performance and speed. Finally, EdgeBoxes have been tuned to obtain high recall for objects, but our regions data contains a wide variety of annotations around groups of objects, stuff, etc. Therefore, as a third source of test time regions we follow Faster R-CNN [38] and train a separate Region Proposal Network (RPN) on the VG regions data. This corresponds to training our full model except without the RNN language model.

As the last baseline we reproduce the approach of Fast R-CNN [13], where the region proposals during training are fixed to EdgeBoxes instead of being predicted by the model (FCLN on EB). The results of this experiment can be found in Table 1. We now highlight the main takeaways.

                     | Language (METEOR)      | Dense captioning (AP)  | Test runtime (ms)
Region source        | EB     RPN     GT      | EB     RPN     GT      | Proposals  CNN+Recog  RNN   Total
Full image RNN [21]  | 0.173  0.197   0.209   | 2.42   4.27    14.11   | 210        2950       10    3170
Region RNN [21]      | 0.221  0.244   0.272   | 1.07   4.26    21.90   | 210        2950       10    3170
FCLN on EB [13]      | 0.264  0.296   0.293   | 4.88   3.21    26.84   | 210        140        10    360
Our model (FCLN)     | 0.264  0.273   0.305   | 5.24   5.39    27.03   | 90         140        10    240

Table 1. Dense captioning evaluation on the test set of 5,000 images. The language metric is METEOR (high is good), our dense captioning metric is Average Precision (AP, high is good), and the test runtime performance for a 720 × 600 image with 300 proposals is given in milliseconds on a Titan X GPU (ms, low is good). EB, RPN, and GT correspond to EdgeBoxes [54], Region Proposal Network [38], and ground truth boxes respectively, used at test time. Numbers in the GT columns serve as upper bounds assuming perfect localization.

Discrepancy between region and image level statistics. Focusing on the first two rows of Table 1, the Region RNN model obtains consistently stronger results on METEOR alone, supporting the difference in the language statistics present on the level of regions and images. Note that these models were trained on nearly the same images, but one on full image captions and the other on region captions. However, despite the differences in the language, the two models reach comparable performance on the final metric.

RPN outperforms external region proposals. In all cases we obtain performance improvements when using the RPN network instead of EB regions. The only exception is the FCLN model that was only trained on EB boxes. Our hypothesis is that this reflects people's tendency to annotate regions that are more general than those containing objects. The RPN network can learn these distributions from the raw data, while the EdgeBoxes method was designed for high recall on objects. In particular, note that this also allows our model (FCLN) to outperform the FCLN on EB baseline, which is constrained to EdgeBoxes during training (5.24 vs. 4.88 and 5.39 vs. 3.21). This is despite the fact that their localization-independent language scores are comparable, which suggests that our model achieves improvements specifically due to better localization. Finally, the noticeable drop in performance of the FCLN on EB model when evaluating on RPN boxes (4.88 down to 3.21) also suggests that the EB boxes have particular visual statistics, and that the RPN boxes are likely out of sample for the FCLN on EB model.

Our model outperforms individual region description. Our final model performance is listed under the RPN column as 5.39 AP. In particular, note that in this one cell of Table 1 we report the performance of our full joint model instead of our model evaluated on the boxes from the independently trained RPN network. Our performance is quite a bit higher than that of the Region RNN model, even when the region model is evaluated on the RPN proposals (5.39 vs. 4.26). We attribute this improvement to the fact that our model can take advantage of visual information from the context outside of the test regions.

Qualitative results. We show example predictions of the dense captioning model in Figure 3. The model generates rich snippet descriptions of regions and accurately grounds the captions in the images. For instance, note that several parts of the elephant are correctly grounded and described (“trunk of an elephant”, “elephant is standing”, and both “leg of an elephant”). The same is true for the airplane example, where the tail, engine, nose and windows are correctly localized. Common failure cases include repeated detections (e.g. the elephant is described as standing twice).

Runtime evaluation. Our model is efficient at test time: a 720 × 600 image is processed in 240ms. This includes running the CNN, computing B = 300 region proposals, and sampling from the language model for each region.

Table 1 (right) compares the test-time runtime performance of our model with baselines that rely on EdgeBoxes. Region RNN is slowest since it processes each region with an independent forward pass of the CNN; with a runtime of 3170ms it is more than 13× slower than our method.

FCLN on EB extracts features for all regions after a single forward pass of the CNN. Its runtime is dominated by EdgeBoxes, and it is ≈ 1.5× slower than our method.

Our method takes 88ms to compute region proposals, of which nearly 80ms is spent running NMS to subsample regions in the Localization Layer. This time can be drastically reduced by using fewer proposals: using 100 region proposals reduces our total runtime to 166ms.
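For reference, the greedy NMS used for test-time subsampling in the localization layer (Section 3.1.2), whose cost dominates the proposal time reported above, can be sketched as below. The suppression IoU threshold is not stated in the text and is an assumption here, as are the corner-format boxes.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7, max_out=300):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes: repeatedly keep
    the most confident box and drop boxes overlapping it above iou_thresh,
    until max_out proposals are selected (B = 300 at test time)."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)
    keep = []
    while order.size and len(keep) < max_out:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]]); yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]]); yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return np.array(keep)
```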
                          | Ranking                      | Localization
                          | R@1   R@5   R@10   Med. rank | IoU@0.1   IoU@0.3   IoU@0.5   Med. IoU
Full Image RNN [21]       | 0.10  0.30  0.43   13        | -         -         -         -
EB + Full Image RNN [21]  | 0.11  0.40  0.55   9         | 0.348     0.156     0.053     0.020
Region RNN [21]           | 0.18  0.43  0.59   7         | 0.460     0.273     0.108     0.077
Our model (FCLN)          | 0.27  0.53  0.67   5         | 0.560     0.345     0.153     0.137

Table 2. Results for image retrieval experiments. We evaluate ranking using recall at k (R@K, higher is better) and median rank of the target image (Med. rank, lower is better). We evaluate localization using ground-truth region recall at different IoU thresholds (IoU@t, higher is better) and median IoU (Med. IoU, higher is better). Our method outperforms baselines at both ranking and localization.

Figure 4. Example image retrieval results using our dense captioning model. From left to right, each row shows a ground-truth test image, ground-truth region captions describing the image, and the top images retrieved by our model using the text of the captions as a query. Our model is able to correctly retrieve and localize people, animals, and parts of both natural and man-made objects.

4.2. Image Retrieval using Regions and Captions

In addition to generating novel descriptions, our dense captioning model can support image retrieval using natural-language queries, and can localize these queries in retrieved images. We evaluate our model's ability to correctly retrieve images and accurately localize textual queries.

Experiment setup. We use 1000 random images from the VG test set for this experiment. We generate 100 test queries by repeatedly sampling four random captions from some image, and then expect the model to correctly retrieve the source image for each query.

Evaluation. To evaluate ranking, we report the fraction of queries for which the correct source image appears in the top k positions for k ∈ {1, 5, 10} (recall at k) and the median rank of the correct image across all queries.

To evaluate localization, for each query caption we examine the image and ground-truth bounding box from which the caption was sampled. We compute IoU between this ground-truth box and the model's predicted grounding for the caption. We then report the fraction of query captions for which this overlap is greater than a threshold t for t ∈ {0.1, 0.3, 0.5} (recall at t), and the median IoU across all query captions.

Models. We compare the ranking and localization performance of our full model with baseline models from Section 4.1.

For the Full Image RNN model trained on MS COCO, we compute the probability of generating each query caption from the entire image and rank test images by mean probability across query captions. Since this does not localize captions we only evaluate its ranking performance.

The Full Image RNN and Region RNN methods are trained on full MS COCO images and ground-truth VG regions respectively. In either case, for each query and test image we generate 100 region proposals using EdgeBoxes, and for each query caption and region proposal we compute the probability of generating the query caption from the region. Query captions are aligned to the proposal of maximal probability, and images are ranked by the mean probability of aligned caption / region pairs.

The process for the full FCLN model is similar, but uses the top 100 proposals from the localization layer rather than EdgeBoxes proposals.
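A sketch of the retrieval scoring just described: each query caption is aligned to its best-scoring candidate region, and images are ranked by the mean score over the query's captions. The caption_logprob hook is a hypothetical stand-in for the captioning model, and it returns per-region log-probabilities as a numerically convenient variant of the mean-probability ranking described above.

```python
import numpy as np

def rank_images(query_captions, images, caption_logprob):
    """Rank candidate images for a multi-caption query, as in Section 4.2.

    caption_logprob(image, caption) -> (R,) array of log-probabilities of the
    caption under each of the image's R candidate regions (e.g. 100 EdgeBoxes
    or the top-100 localization-layer proposals for the full model).
    """
    scores = []
    for image in images:
        per_caption = []
        for caption in query_captions:
            region_lp = caption_logprob(image, caption)
            per_caption.append(region_lp.max())     # align caption to its best region
        scores.append(np.mean(per_caption))         # average over the query's captions
    order = np.argsort(-np.asarray(scores))         # best-scoring image first
    return order, scores

# Usage sketch: order, _ = rank_images(captions, test_images, model_logprob_fn)
```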
Figure 5. Example results for open world detection. We use our dense captioning model to localize arbitrary pieces of text in images, and display the top detections on the test set for several queries (“head of a giraffe”, “legs of a zebra”, “red and white sign”, “white tennis shoes”, “hands holding a phone”, “front wheel of a bus”).

Discussion. Figure 4 shows examples of ground-truth images, query phrases describing those images, and images retrieved from these queries using our model. Our model is able to localize small objects (“hand of the clock”, “logo with red letters”), object parts (“black seat on bike”, “chrome exhaust pipe”), people (“man is wet”) and some actions (“man playing tennis outside”).

Quantitative results comparing our model against the baseline methods are shown in Table 2. The relatively poor performance of the Full Image RNN model (Med. rank 13 vs. 9, 7, 5) may be due to mismatched statistics between its train and test distributions: the model was trained on full images, but in this experiment it must match region-level captions to whole images (Full Image RNN) or process image regions rather than full images (EB + Full Image RNN). The Region RNN model does not suffer from a mismatch between train and test data, and outperforms the Full Image RNN model on both ranking and localization. Compared to Full Image RNN, it reduces the median rank from 9 to 7 and improves localization recall at 0.5 IoU from 0.053 to 0.108.

Our model outperforms the Region RNN baseline for both ranking and localization under all metrics, further reducing the median rank from 7 to 5 and increasing localization recall at 0.5 IoU from 0.108 to 0.153.

The baseline uses EdgeBoxes, which was tuned to localize objects, but not all query phrases refer to objects. Our model achieves superior results since it learns to propose regions from the training data.

Open-world Object Detection. Using the retrieval setup described above, our dense captioning model can also be used to localize arbitrary pieces of text in images. This enables “open-world” object detection, where instead of committing to a fixed set of object classes at training time we can specify object classes using natural language at test time. We show example results for this task in Figure 5, where we display the top detections on the test set for several phrases. Our model can detect animal parts (“head of a giraffe”, “legs of a zebra”) and also understands some object attributes (“red and white sign”, “white tennis shoes”) and interactions between objects (“hands holding a phone”). The phrase “front wheel of a bus” is a failure case: the model correctly identifies wheels of buses, but cannot distinguish between the front and back wheel.

5. Conclusion

We introduced the dense captioning task, which requires a model to simultaneously localize and describe regions of an image. To address this task we developed the FCLN architecture, which supports end-to-end training and efficient test-time performance. Our FCLN architecture is based on recent CNN-RNN models developed for image captioning but includes a novel, differentiable localization layer that can be inserted into any neural network to enable spatially-localized predictions. Our experiments in both generation and retrieval settings demonstrate the power and efficiency of our model with respect to baselines related to previous work, and qualitative experiments show visually pleasing results. In future work we would like to relax the assumption of rectangular proposal regions and to discard test-time NMS in favor of a trainable spatial suppression layer.
6. Acknowledgments

Our work is partially funded by an ONR MURI grant and an Intel research grant. We thank Vignesh Ramanathan, Yuke Zhu, Ranjay Krishna, and Joseph Lim for helpful comments and discussion. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.

References

[1] K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. JMLR, 2003.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
[3] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[4] X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. CVPR, 2015.
[5] K. Cho, A. C. Courville, and Y. Bengio. Describing multimedia content using attention-based encoder-decoder networks. CoRR, abs/1507.01053, 2015.
[6] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[7] M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.
[8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
[9] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. CVPR, 2014.
[10] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[11] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, et al. From captions to visual concepts and back. CVPR, 2015.
[12] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. ECCV, 2010.
[13] R. Girshick. Fast R-CNN. ICCV, 2015.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.
[15] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[16] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. ICML, 2015.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
[18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[19] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. NIPS, 2015.
[20] Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. ICCV, 2011.
[21] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015.
[22] A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.
[23] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[24] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. TACL, 2015.
[25] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. 2016.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[27] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. CVPR, 2011.
[28] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Generalizing image captions for image-text parallel corpus. In ACL (2), pages 790–796. Citeseer, 2013.
[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[30] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.
[31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR, 2015.
[32] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.
[33] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010.
[34] V. Ordonez, X. Han, P. Kuznetsova, G. Kulkarni, M. Mitchell, K. Yamaguchi, K. Stratos, A. Goyal, J. Dodge, A. Mensch, et al. Large scale retrieval and generation of image descriptions. International Journal of Computer Vision (IJCV), 2015.
[35] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. ICCV, 2015.
[36] qassemoquab. stnbhwd. https://github.com/qassemoquab/stnbhwd, 2015.
[37] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.
[38] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), pages 1–42, April 2015.
[40] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. ICLR, 2014.
[41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
[42] R. Socher and L. Fei-Fei. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. CVPR, 2010.
[43] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014.
[44] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. ICML, 2011.
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CVPR, 2015.
[46] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov. Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441, 2014.
[47] B. Thomee, B. Elizalde, D. A. Shamma, K. Ni, G. Friedland, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
[48] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. CVPR, 2015.
[49] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. CVPR, 2015.
[50] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.
[51] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015.
[52] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.
[53] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. ECCV, 2014.
[54] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. ECCV, 2014.
