
What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?

Marc Tanti          Albert Gatt          Kenneth P. Camilleri

Institute of Linguistics and Language Technology, University of Malta
Department of Systems and Control Engineering, University of Malta
[email protected]   [email protected]   [email protected]

arXiv:1708.02043v2 [cs.CL] 25 Aug 2017

Abstract

In neural image captioning systems, a recurrent neural network (RNN) is typically viewed as the primary 'generation' component. This view suggests that the image features should be 'injected' into the RNN. This is in fact the dominant view in the literature. Alternatively, the RNN can instead be viewed as only encoding the previously generated words. This view suggests that the RNN should only be used to encode linguistic features and that only the final representation should be 'merged' with the image features at a later stage. This paper compares these two architectures. We find that, in general, late merging outperforms injection, suggesting that RNNs are better viewed as encoders, rather than generators.

1 Introduction

Image captioning (Bernardi et al., 2016) has emerged as an important testbed for solutions to the fundamental AI challenge of grounding symbolic or linguistic information in perceptual data (Harnad, 1990; Roy and Reiter, 2005). Most captioning systems focus on what Hodosh et al. (2013) refer to as concrete conceptual descriptions, that is, captions that describe what is strictly within the image, although recently there has been growing interest in moving beyond this, with research on visual question-answering (Antol et al., 2015) and image-grounded narrative generation (Huang et al., 2016), among others.

Approaches to image captioning can be divided into three main classes (Bernardi et al., 2016):

1. Systems that rely on computer vision techniques to extract object detections and features from the source image, using these as input to an NLG stage (Kulkarni et al., 2011; Mitchell et al., 2012; Elliott and Keller, 2013). The latter is roughly akin to the microplanning and realisation modules in the well-known NLG pipeline architecture (Reiter and Dale, 2000).

2. Systems that frame the task as a retrieval problem, where a caption, or parts thereof, is identified by computing the proximity/relevance of strings in the training data to a given image. This is done by exploiting either a unimodal (Ordonez et al., 2011; Gupta et al., 2012; Mason and Charniak, 2014) or multimodal (Hodosh et al., 2013; Socher et al., 2014) space. Many retrieval-based approaches rely on neural models to handle both image features and linguistic information (Ordonez et al., 2011; Socher et al., 2014).

3. Systems that also rely on neural models but, rather than performing partial or wholesale caption retrieval, generate novel captions using a recurrent neural network (RNN), usually a long short-term memory (LSTM). Typically, such models use image features extracted from a pre-trained convolutional neural network (CNN) such as the VGG CNN (Simonyan and Zisserman, 2014) to bias the RNN towards sampling terms from the vocabulary in such a way that a sequence of such terms produces a caption that is relevant to the image (Kiros et al., 2014b; Kiros et al., 2014a; Vinyals et al., 2015; Mao et al., 2015a; Hendricks et al., 2016).

This paper focuses on the third class. The key property of these models is that the CNN image features are used to condition the predictions of the best caption to describe the image. However, this can be done in different ways, and the role of the RNN depends in large measure on the mode in which the CNN and RNN are combined.

It is quite typical for RNNs to be viewed as 'generators'. For example, Bernardi et al. (2016) suggest that 'the RNN is trained to generate the next word [of a caption]', a view also expressed by LeCun et al. (2015). A similar position has also been taken in work focusing on the use of RNNs as language models for generation (Sutskever et al., 2011; Graves, 2013). However, an alternative view is possible, whereby the role of the RNN can be thought of as primarily to encode sequences, but not directly to generate them.

These two views can be associated with different architectures for neural caption generators, which we discuss below and illustrate in Figure 1. In one class of architectures, image features are directly incorporated into the RNN during the sequence encoding process (Figure 1a). In these models, it is natural to think of the RNN as the primary generation component of the image captioning system, making predictions conditioned by the image. A different architecture keeps the encoding of linguistic and perceptual features separate, merging them in a later multimodal layer, at which point predictions are made (Figure 1b). In this type of model, the RNN functions primarily as an encoder of sequences of word embeddings, with the visual features merged with the linguistic features in a later, multimodal layer. This multimodal layer is the one that drives the generation process, since the RNN never sees the image and hence would not be able to direct the generation process.

Figure 1: The inject and merge architectures for caption generation. (a) Conditioning by injecting the image means injecting the image into the same RNN that processes the words. (b) Conditioning by merging the image means merging the image with the final state of the RNN in a "multimodal layer" after processing the words. The RNN's previous state going into the RNN is not shown. Legend: RNN - Recurrent Neural Network; FF - Feed Forward layer.

While both architectural alternatives have been attested in the literature, their implications have not, to our knowledge, been systematically discussed and comparatively evaluated. In what follows, we first discuss the distinction between the two architectures (Section 2) and then present some experiments comparing the two (Sections 3 and 4). Our conclusion is that grounding language generation in image data is best conducted in an architecture that first encodes the two modalities separately, before merging them to predict captions.

2 Background: Neural Caption Generation Architectures

In a neural language model, an RNN encodes a prefix (for example, the caption generated so far) and either itself predicts the next item in the sequence with the help of a feed-forward layer, or else passes the encoding to the next layer, which will make the prediction itself. This new item is added to the prefix at the next iteration to predict another item, until an end-of-sequence symbol is reached. Typically, the prediction is carried out using a softmax function to sample the next item according to a probability distribution over the vocabulary items, based on their activation. This process is illustrated in Figure 2.

Figure 2: How RNNs work: each state of the RNN encodes a prefix, which incorporates the output word derived from the previous state. In practice the neural network does not output a single word but a probability distribution over all known words in the vocabulary. Legend: FF - feedforward layer; <beg> - the start-of-sentence token; <end> - the end-of-sentence token.
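To make this loop concrete, the following is a minimal sketch (not the authors' code) of the prediction cycle: a prefix is repeatedly fed to the model, the next item is chosen from the resulting distribution over the vocabulary, and generation stops at the end-of-sequence symbol. Here `next_word_probs` is a hypothetical wrapper around a trained model; picking the arg-max is shown for simplicity, whereas sampling from the distribution works the same way.

```python
import numpy as np

def generate(next_word_probs, image, beg_id, end_id, max_len=20):
    """Grow a caption prefix one item at a time until <end> is produced."""
    prefix = [beg_id]                            # start-of-sentence token <beg>
    while len(prefix) < max_len:
        probs = next_word_probs(prefix, image)   # softmax over all vocabulary items
        next_id = int(np.argmax(probs))          # most probable next item
        prefix.append(next_id)                   # the new item extends the prefix
        if next_id == end_id:                    # stop at the end-of-sequence symbol
            break
    return prefix
```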
One way to condition the RNN to predict image captions is to inject both visual and linguistic features directly into the RNN, depicted in Figure 1a. We refer to this as 'conditioning-by-inject' (or inject for short). Different types of inject architectures have become the most widely attested among deep learning approaches to image captioning (Chen and Zitnick, 2015; Donahue et al., 2015; Hessel et al., 2015; Karpathy and Fei-Fei, 2015; Liu et al., 2016; Yang et al., 2016; Zhou et al., 2016) [1]. Given training pairs consisting of an image and a caption, the RNN component of such models is trained by exposure to prefixes of increasing length extracted from the caption, in tandem with the image.

[1] See Tanti et al. (2017) for an overview of different versions of the inject architecture and a systematic comparison among models. In this paper we focus on parallel-inject.

An alternative architecture – which we refer to as 'conditioning-by-merge' (Figure 1b) – treats the RNN exclusively as a 'language model' to encode linguistic sequences of varying length. The linguistic vector resulting from this encoding is subsequently combined with the image features in a separate multimodal layer. This amounts to viewing the RNN as primarily an encoder of linguistic information. This type of architecture is also attested in the literature, albeit to a lesser extent than the inject architecture (Mao et al., 2014; Mao et al., 2015a; Mao et al., 2015b; Song and Yoo, 2016; Hendricks et al., 2016; You et al., 2016). A limited number of approaches have also been proposed in which both architectures are combined (Lu et al., 2016; Xu et al., 2015).

Notice that both architectures are compatible with the inclusion of attentional mechanisms (Xu et al., 2015). The effect of attention in the inject architecture is to combine a different representation of the image with each word. In the case of merge, a different representation of the image can be combined with the final RNN state before each prediction. Attentional mechanisms are however beyond the scope of the present work.

The main differences between inject and merge architectures can be summed up as follows: In an inject model, the RNN is trained to predict sequences based on histories consisting of both linguistic and perceptual features. Hence, in this model, the RNN is primarily responsible for image-conditioned language generation. By contrast, in the merge architecture, RNNs in effect encode linguistic representations, which themselves constitute the input to a later prediction stage that comes after a multimodal layer. It is only at this late stage that image features are used to condition predictions.

As a result, a model involving conditioning by inject is trained to learn linguistic representations directly conditioned by image data; a merge architecture maintains a distinction between the two representations, but brings them together in a later layer. Put somewhat differently, it could be argued that at a given time step, the merge architecture predicts what to generate next by combining the RNN-encoded prefix of the string generated so far (the 'past' of the generation process) with non-linguistic information (the guide of the generation process). The inject architecture on the other hand uses the full image features with every word of the prefix during training, in effect learning a 'visuo-linguistic' representation of each word. One effect of this is that image features can serve to further specify or disambiguate the 'meaning' of words, by disambiguating tokens of the same word which are correlated with different image features (such as 'crane' as in the bird versus the construction equipment). This implies that inject models learn a larger vocabulary during training.
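As a concrete illustration of the two architectures (see also Figure 3 in Section 3), the following sketch expresses them with the Keras functional API. It is not the authors' released TensorFlow code; the layer size x, vocabulary size v, the 4,096-element image vectors, and the fixed prefix length MAX_LEN are assumptions based on the experimental setup described in Section 3.

```python
from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense,
                                     Concatenate, RepeatVector)
from tensorflow.keras.models import Model

x, v, IMG_SIZE, MAX_LEN = 256, 2539, 4096, 20   # example sizes from the experiments

def build_merge():
    """Merge: the LSTM never sees the image; the two are combined before the softmax."""
    img = Input(shape=(IMG_SIZE,))
    img_proj = Dense(x)(img)                         # projected image vector (size x)
    words = Input(shape=(MAX_LEN,), dtype='int32')
    emb = Embedding(v, x)(words)                     # word embeddings (size x)
    state = LSTM(x)(emb)                             # final LSTM state only
    multimodal = Concatenate()([state, img_proj])    # late, multimodal combination
    out = Dense(v, activation='softmax')(multimodal)
    return Model([img, words], out)

def build_inject():
    """Parallel-inject: the image vector is concatenated with every word vector."""
    img = Input(shape=(IMG_SIZE,))
    img_proj = Dense(x)(img)
    img_seq = RepeatVector(MAX_LEN)(img_proj)        # same image vector at each step
    words = Input(shape=(MAX_LEN,), dtype='int32')
    emb = Embedding(v, x)(words)
    visuo_linguistic = Concatenate()([emb, img_seq]) # image + word at every time step
    state = LSTM(x)(visuo_linguistic)
    out = Dense(v, activation='softmax')(state)
    return Model([img, words], out)
```

Both models predict the next word of a caption prefix; the only difference is where the image enters the network, inside the RNN (inject) or after it (merge).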
The two architectures also differ in the number of parameters they need to handle. As noted above, since an inject architecture combines the image with each word during training, it is effectively handling a larger vocabulary than merge. Assume that the image vectors are concatenated with the word embedding vectors (inject) or the final RNN state (merge). Then, in the inject architecture, the number of weights in the RNN is a function of both the caption embedding and the images, whereas in merge, it is only the word embeddings that contribute to the size of this layer of the network. Let e be the size of the word embedding, v the size of the vocabulary, i the image vector size and s the state size of the RNN. In the inject case, the number of weights in the RNN is w ∝ (e + i) × s, whereas it is w ∝ e × s in merge. The smaller number of weights handled by the RNN in merge is offset by a larger number of weights at the final softmax layer, which has to take as input the RNN state and the image, having size ∝ (s + i) × v.
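As a small numerical illustration of these proportionalities (not a full parameter count, and ignoring gate multiplicities and bias terms), the following evaluates the three quantities for one of the configurations used later in the experiments, assuming e = i = s = 512 and v = 2,539:

```python
e = i = s = 512         # embedding, (projected) image and state sizes
v = 2539                # vocabulary size

rnn_weights_inject = (e + i) * s     # w ∝ (e + i) × s: the image enters the RNN
rnn_weights_merge = e * s            # w ∝ e × s: the RNN sees word embeddings only
softmax_weights_merge = (s + i) * v  # the image joins at the softmax layer instead

print(rnn_weights_inject, rnn_weights_merge, softmax_weights_merge)
```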
A systematic comparison of these two architectures would shed light on the best way to conceive of the role of RNNs in neural language generation. Apart from the theoretical implications concerning the stage at which language should be grounded in visual information, such a comparison also has practical implications. In particular, if it turns out that merge outperforms inject, this would imply that the linguistic representations encoded in an RNN could be pre-trained and re-used for a variety of tasks and/or image captioning datasets, with domain-specific training only required for the final feedforward layer, where the tuning required to make perceptually grounded predictions is carried out. We return to this point in Section 6.1.

In the following sections, we describe some experiments to conduct such a comparison.

3 Experiments

To evaluate the performance of the inject and merge architectures, and thus the roles of the RNN, we trained and evaluated them on the Flickr8k (Hodosh et al., 2013) and Flickr30k (Young et al., 2014) datasets of image-caption pairs. For the purposes of these experiments, we used the version of the datasets distributed by Karpathy and Fei-Fei (2015) [2]. The dataset splits are identical to those used by Karpathy and Fei-Fei (2015): Flickr8k is split into 6,000 images for training, 1,000 for validation, and 1,000 for testing, whilst Flickr30k is split into 29,000 images for training, 1,014 images for validation, and 1,000 images for testing. Each image in both datasets has five different captions. 4,096-element image feature vectors that were extracted from the pre-trained VGG CNN (Simonyan and Zisserman, 2014) are also available in the distributed datasets. We normalised the image vectors to unit length during preprocessing.

[2] http://cs.stanford.edu/people/karpathy/deepimagesent/

Tokens with frequency lower than a threshold in the training set were replaced with the 'unknown' token. In our experiments we varied the threshold between 3 and 5 in order to measure the performance of each model as the vocabulary size changes. For thresholds of 3, 4, and 5, this gives vocabulary sizes of 2,539, 2,918, and 3,478 for Flickr8k and 7,415, 8,275, and 9,584 for Flickr30k.
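A minimal sketch of this preprocessing is given below. It is an assumption about the implementation rather than the authors' code: captions are taken to be lists of tokens, and the helper names are illustrative.

```python
import numpy as np
from collections import Counter

def normalise_image_vectors(img_vecs):
    """Scale each 4,096-element image feature vector to unit length."""
    norms = np.linalg.norm(img_vecs, axis=1, keepdims=True)
    return img_vecs / norms

def replace_rare_tokens(training_captions, threshold=3, unknown='<unk>'):
    """Replace tokens whose training-set frequency is below `threshold`."""
    counts = Counter(tok for caption in training_captions for tok in caption)
    vocab = {tok for tok, c in counts.items() if c >= threshold}
    replaced = [[tok if tok in vocab else unknown for tok in caption]
                for caption in training_captions]
    return replaced, vocab | {unknown}
```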
Since our purpose is to compare the performance of architectures, we used the 'barest' models possible, with the fewest hyperparameters. This means that complexities that are usually introduced in order to reach state-of-the-art performance, such as regularization, were avoided, since it is difficult to determine which combination of hyperparameters does not give an unfair advantage to one architecture over the other.

We constructed a basic neural language model consisting of a word embedding matrix, a basic LSTM (Hochreiter and Schmidhuber, 1997), and a softmax layer. The LSTM is defined as follows:
    i_n = sig(x_n W_xi + s_{n-1} W_si + b_i)    (1)
    f_n = sig(x_n W_xf + s_{n-1} W_sf + b_f)    (2)
    o_n = sig(x_n W_xo + s_{n-1} W_so + b_o)    (3)
    g_n = tanh(x_n W_xc + s_{n-1} W_sc + b_c)   (4)
    c_n = f_n ⊙ c_{n-1} + i_n ⊙ g_n             (5)
    s_n = o_n ⊙ tanh(c_n)                       (6)
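The update equations above translate directly into code. The following NumPy sketch of a single LSTM step is an illustration rather than the experiment code; the notation (x_n, s_n, c_n, the W_αβ matrices and b_α biases) is spelled out in the paragraph that follows.

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))      # the sigmoid function 'sig'

def lstm_step(x_n, s_prev, c_prev, W, b):
    i_n = sig(x_n @ W['xi'] + s_prev @ W['si'] + b['i'])      # input gate, Eq. (1)
    f_n = sig(x_n @ W['xf'] + s_prev @ W['sf'] + b['f'])      # forget gate, Eq. (2)
    o_n = sig(x_n @ W['xo'] + s_prev @ W['so'] + b['o'])      # output gate, Eq. (3)
    g_n = np.tanh(x_n @ W['xc'] + s_prev @ W['sc'] + b['c'])  # modified input, Eq. (4)
    c_n = f_n * c_prev + i_n * g_n                            # cell state, Eq. (5)
    s_n = o_n * np.tanh(c_n)                                  # hidden state, Eq. (6)
    return s_n, c_n
```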
where x_n is the nth input, s_n is the hidden state after n inputs, s_0 is the all-zeros vector, c_n is the cell state after n inputs, c_0 is the all-zeros vector, i_n is the input gate after n inputs, f_n is the forget gate after n inputs, o_n is the output gate after n inputs, g_n is the modified input used to calculate c_n after n inputs, W_αβ is the weight matrix between α and β, b_α is the bias vector for α, ⊙ is the elementwise vector multiplication operator, and 'sig' refers to the sigmoid function. The hidden state and the cell state always have the same size.

In the experiments, this basic neural language model is used as a part of two different architectures: In the inject architecture, the image vector is concatenated with each of the word vectors in a caption. In the merge architecture, it is only concatenated with the final LSTM state. The layer sizes of the embedding, LSTM state, and projected image vector were also varied in the experiments in order to measure the effect of increasing the capacity of the networks. The layer sizes used are 128, 256, and 512. The details of the architectures used in the experiments are illustrated in Figure 3.

Figure 3: An illustration of the different architectures that are tested in this paper: (a) the merge architecture and (b) the inject architecture. The numbers or letters at the bottom of each box refer to the vector size output of a layer. 'x' is an arbitrary layer size that is varied in the experiments and 'v' is the vocabulary size, which is also varied in the experiments. 'Dense' means a fully connected layer with bias.

Training was performed using the Adam optimisation algorithm (Kingma and Ba, 2014) with default hyperparameters and a minibatch size of 50 captions. The cost function used was sum cross-entropy. Training was carried out with an early stopping criterion which terminated training as soon as performance on the validation data started to deteriorate (validation performance is measured after each training epoch). Initialization of weights was done using Xavier initialization (Glorot and Bengio, 2010) and biases were set to zero.

Each architecture was trained three separate times; the results reported below are averages over these three separate runs.

To evaluate the trained models we generated captions for images in the test set using beam search with a beam width of 3 and a clipped maximum length of 20 words. The MSCOCO evaluation code [3] was used to measure the quality of the captions using the standard evaluation metrics BLEU-(1,2,3,4) (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), CIDEr (Vedantam et al., 2015), and ROUGE-L (Lin and Och, 2004). We also calculated the percentage of word types that were actually used in the generated captions out of the vocabulary of available word types. This measure indicates how well each architecture exploits the vocabulary it is trained on.

The code used for the experiments was implemented with TensorFlow and is available online [4].

[3] https://github.com/tylin/coco-caption
[4] https://github.com/mtanti/rnn-role
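The beam-search decoding just described (beam width 3, captions clipped at 20 words) can be sketched as follows. `next_word_logprobs` is a hypothetical wrapper around a trained model returning log-probabilities over the vocabulary for a given prefix and image; the details are illustrative rather than taken from the released code.

```python
import numpy as np

def beam_search(next_word_logprobs, image, beg_id, end_id,
                beam_width=3, max_len=20):
    beams = [([beg_id], 0.0)]                        # (prefix, cumulative log-prob)
    complete = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            logp = next_word_logprobs(prefix, image)
            for w in np.argsort(logp)[-beam_width:]:           # top extensions
                candidates.append((prefix + [int(w)], score + float(logp[w])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == end_id:
                complete.append((prefix, score))               # finished caption
            else:
                beams.append((prefix, score))
        if not beams:                                          # every beam finished
            break
    complete.extend(beams)                                     # clipped at max_len
    return max(complete, key=lambda c: c[1])[0]
```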
% Vocabulary CIDEr METEOR ROUGE-L
Layer Vocab. Merge Inject Merge Inject Merge Inject Merge Inject
128 2539 14.730 (0.40) 10.555 (0.34) 0.460 (0.01) 0.431 (0.01) 0.192 (0.00) 0.183 (0.00) 0.445 (0.00) 0.430 (0.00)
128 2918 13.719 (0.49) 8.876 (0.24) 0.456 (0.00) 0.431 (0.00) 0.191 (0.00) 0.185 (0.00) 0.437 (0.00) 0.434 (0.00)
128 3478 11.223 (0.35) 8.175 (0.31) 0.458 (0.01) 0.433 (0.01) 0.192 (0.00) 0.187 (0.00) 0.442 (0.00) 0.432 (0.00)
256 2539 15.439 (0.84) 11.448 (0.71) 0.462 (0.01) 0.456 (0.01) 0.192 (0.00) 0.189 (0.00) 0.439 (0.00) 0.436 (0.00)
256 2918 13.697 (0.19) 10.430 (0.34) 0.456 (0.01) 0.451 (0.01) 0.190 (0.00) 0.189 (0.00) 0.438 (0.00) 0.440 (0.00)
256 3478 11.252 (0.51) 8.405 (0.39) 0.470 (0.01) 0.449 (0.02) 0.191 (0.00) 0.189 (0.00) 0.439 (0.00) 0.437 (0.00)
512 2539 15.741 (0.40) 12.761 (0.81) 0.452 (0.01) 0.464 (0.00) 0.191 (0.00) 0.192 (0.00) 0.437 (0.00) 0.442 (0.00)
512 2918 13.114 (0.75) 10.155 (0.42) 0.469 (0.01) 0.457 (0.00) 0.193 (0.00) 0.189 (0.00) 0.440 (0.00) 0.437 (0.00)
512 3478 11.501 (0.49) 8.587 (0.50) 0.458 (0.01) 0.439 (0.01) 0.192 (0.00) 0.188 (0.00) 0.439 (0.00) 0.434 (0.00)

(a) Flickr8k: % of vocabulary used, CIDEr, METEOR and ROUGE-L results.

BLEU-1 BLEU-2 BLEU-3 BLEU-4


Layer Vocab. Merge Inject Merge Inject Merge Inject Merge Inject
128 2539 0.600 (0.00) 0.592 (0.01) 0.410 (0.00) 0.405 (0.01) 0.272 (0.00) 0.270 (0.01) 0.179 (0.00) 0.177 (0.00)
128 2918 0.595 (0.01) 0.590 (0.00) 0.405 (0.01) 0.406 (0.00) 0.267 (0.01) 0.271 (0.00) 0.175 (0.00) 0.178 (0.00)
128 3478 0.608 (0.01) 0.586 (0.01) 0.416 (0.01) 0.401 (0.01) 0.276 (0.01) 0.268 (0.01) 0.182 (0.01) 0.178 (0.01)
256 2539 0.594 (0.00) 0.591 (0.00) 0.407 (0.01) 0.408 (0.00) 0.269 (0.01) 0.276 (0.00) 0.176 (0.01) 0.184 (0.00)
256 2918 0.596 (0.01) 0.596 (0.01) 0.405 (0.01) 0.413 (0.01) 0.265 (0.00) 0.278 (0.01) 0.172 (0.00) 0.184 (0.00)
256 3478 0.601 (0.00) 0.596 (0.01) 0.411 (0.00) 0.409 (0.01) 0.272 (0.01) 0.274 (0.01) 0.179 (0.01) 0.181 (0.01)
512 2539 0.597 (0.01) 0.603 (0.00) 0.406 (0.01) 0.419 (0.00) 0.267 (0.01) 0.283 (0.00) 0.176 (0.01) 0.188 (0.00)
512 2918 0.593 (0.01) 0.589 (0.01) 0.404 (0.01) 0.409 (0.00) 0.268 (0.00) 0.277 (0.00) 0.177 (0.00) 0.185 (0.00)
512 3478 0.597 (0.01) 0.587 (0.00) 0.407 (0.01) 0.405 (0.00) 0.270 (0.01) 0.272 (0.00) 0.178 (0.00) 0.180 (0.01)

(b) Flickr8k: BLEU-n scores.

% Vocabulary CIDEr METEOR ROUGE-L


Layer Vocab. Merge Inject Merge Inject Merge Inject Merge Inject
128 7415 6.253 (0.06) 5.255 (0.02) 0.362 (0.01) 0.339 (0.01) 0.174 (0.00) 0.169 (0.00) 0.417 (0.00) 0.415 (0.00)
128 8275 5.402 (0.20) 4.939 (0.08) 0.376 (0.00) 0.351 (0.00) 0.174 (0.00) 0.171 (0.00) 0.420 (0.00) 0.417 (0.00)
128 9584 4.793 (0.01) 4.090 (0.18) 0.378 (0.00) 0.355 (0.00) 0.175 (0.00) 0.171 (0.00) 0.420 (0.00) 0.419 (0.00)
256 7415 6.150 (0.18) 5.597 (0.11) 0.363 (0.00) 0.361 (0.01) 0.174 (0.00) 0.173 (0.00) 0.414 (0.00) 0.420 (0.00)
256 8275 5.559 (0.08) 5.410 (0.10) 0.364 (0.01) 0.359 (0.00) 0.174 (0.00) 0.173 (0.00) 0.416 (0.00) 0.417 (0.00)
256 9584 4.873 (0.07) 4.309 (0.18) 0.364 (0.01) 0.359 (0.01) 0.175 (0.00) 0.173 (0.00) 0.416 (0.00) 0.420 (0.00)
512 7415 6.330 (0.56) 5.732 (0.32) 0.365 (0.01) 0.367 (0.01) 0.173 (0.00) 0.173 (0.00) 0.416 (0.00) 0.422 (0.01)
512 8275 5.619 (0.09) 5.221 (0.49) 0.370 (0.00) 0.369 (0.01) 0.174 (0.00) 0.174 (0.00) 0.419 (0.00) 0.422 (0.00)
512 9584 4.887 (0.16) 4.309 (0.25) 0.357 (0.01) 0.360 (0.01) 0.172 (0.00) 0.172 (0.00) 0.414 (0.00) 0.417 (0.00)

(c) Flickr30k: % of vocabulary used, CIDEr, METEOR and ROUGE-L results.

BLEU-1 BLEU-2 BLEU-3 BLEU-4


Layer Vocab. Merge Inject Merge Inject Merge Inject Merge Inject
128 7415 0.601 (0.01) 0.595 (0.01) 0.403 (0.01) 0.400 (0.01) 0.268 (0.01) 0.265 (0.01) 0.179 (0.01) 0.175 (0.01)
128 8275 0.605 (0.01) 0.604 (0.00) 0.411 (0.01) 0.409 (0.00) 0.276 (0.01) 0.275 (0.00) 0.185 (0.00) 0.183 (0.00)
128 9584 0.610 (0.01) 0.605 (0.00) 0.414 (0.01) 0.411 (0.00) 0.278 (0.00) 0.275 (0.01) 0.186 (0.00) 0.184 (0.01)
256 7415 0.593 (0.01) 0.606 (0.00) 0.400 (0.01) 0.412 (0.00) 0.268 (0.01) 0.277 (0.00) 0.179 (0.01) 0.186 (0.01)
256 8275 0.594 (0.01) 0.603 (0.01) 0.402 (0.01) 0.409 (0.00) 0.269 (0.01) 0.275 (0.00) 0.180 (0.00) 0.183 (0.00)
256 9584 0.596 (0.01) 0.614 (0.01) 0.404 (0.00) 0.419 (0.01) 0.270 (0.00) 0.283 (0.00) 0.181 (0.00) 0.189 (0.00)
512 7415 0.598 (0.02) 0.617 (0.01) 0.404 (0.02) 0.422 (0.01) 0.270 (0.01) 0.285 (0.00) 0.181 (0.01) 0.191 (0.00)
512 8275 0.603 (0.00) 0.609 (0.01) 0.406 (0.00) 0.419 (0.01) 0.271 (0.00) 0.284 (0.01) 0.181 (0.00) 0.191 (0.00)
512 9584 0.596 (0.00) 0.609 (0.01) 0.399 (0.00) 0.414 (0.01) 0.265 (0.00) 0.278 (0.01) 0.177 (0.00) 0.185 (0.00)

(d) Flickr30k: BLEU-n scores.

Table 1: Results on the captions generated using the inject and merge architectures. Values are means over
three separately retrained models, together with the standard deviation in parentheses. Legend: Layer - the
layer size used (‘x’ in Figure 3); Vocab. - the vocabulary size used.

4 Results

Table 1 reports means and standard deviations over the three runs of all the MSCOCO measures and the vocabulary usage. Since the point is to compare the effects of the architectures rather than to reach state-of-the-art performance, we do not include results from other published systems in our tables.

Across all experimental variables (dataset, vocabulary, and layer sizes), the performance of the merge architecture is generally superior to that of the inject architecture in all measures except for ROUGE-L and BLEU (ROUGE-L is designed for evaluating text summarization, whilst BLEU is criticized for its lack of correlation with human-given scores). In what follows, we focus on the CIDEr measure for caption quality as it was specifically designed for captioning systems.

Although merge outperforms inject by a rather narrow margin, the low standard deviation over the three training runs suggests that this is a consistent performance advantage across train-and-test runs. In any case, there is clearly no disadvantage to the merge strategy with respect to injecting image features.

One peculiarity is that results on Flickr8k are better than those on Flickr30k. This could mean that Flickr8k captions contain less variation and hence are easier to perform well on. Preliminary results on the larger dataset MSCOCO (Lin et al., 2014) (currently in progress) show CIDEr results over 0.7, which means that either Flickr8k is too easy or Flickr30k is too hard when compared to the much larger MSCOCO.

The best-performing models are merge with a state size of 256 on Flickr8k, and merge with a state size of 128 on Flickr30k, both with a minimum token frequency threshold of 3. Inject models tend to improve with increasing state size on both datasets, while the relationship between the performance of merge and the state size shows no discernible trend. Inject therefore does not seem to overfit as state size increases, even on the larger dataset. At the same time, inject only seems to be able to outperform the best scores achieved by merge if it has a much larger layer size. Therefore, in practical terms, inject models have to have larger capacity to be on a par with merge. Put differently, merge has a higher performance-to-model-size ratio and makes more efficient use of limited resources (this observation holds even when model size is defined in terms of the number of parameters instead of layer sizes).

Given the same layer sizes and vocabulary, the number of parameters for merge is greater than for inject. The difference becomes greater as the vocabulary size is increased. For a vocabulary size of 2,539 and a layer size of 512, merge has about 3% more parameters than inject, whilst for a vocabulary size of 9,584 and a layer size of 512, merge has about 20% more parameters. However, the foregoing remarks concerning over- and under-fitting also apply when the difference between the numbers of parameters is small. That is, the difference in performance is due at least in part to architectural differences, not just to differences in the number of parameters.

Merge models use a greater proportion of the training vocabulary on test captions. However, the proportion of vocabulary used is generally quite small for both architectures: less than 16% for Flickr8k and less than 7% for Flickr30k. Overall, the trend is for smaller proportions of the overall training vocabulary to be used as the vocabulary grows larger, suggesting that neural language models find it harder to use infrequent words (which are more numerous at larger vocabulary sizes, by definition). In practice, this means that reducing training vocabularies results in minimal performance loss.

Overall, the evidence suggests that delaying the merging of image features with linguistic encodings to a late stage in the architecture may be advantageous, at least as far as corpus-based evaluation measures are concerned. Furthermore, the results suggest that a merge architecture has a higher capacity than an inject architecture and can generate better quality captions with smaller layers.

5 Discussion

If the RNN had the primary role of generating captions, then it would need to have access to the image in order to know what to generate. This does not seem to be the case, as including the image in the RNN is not generally beneficial to its performance as a caption generator.

When viewing RNNs as having the primary role of encoding rather than generating, it makes sense that the inject architecture generally suffers in performance when compared to the merge architecture. The most plausible explanation has to do with the handling of variation. Consider once more the task of the RNN in the image captioning task: during training, captions are broken down into prefixes of increasing length, with each prefix compressed to a fixed-size vector, as illustrated in Figure 2 above.

In the inject architecture, the encoding task is made more complex by the inclusion of image features. Indeed, in the version of inject used in our experiments – the most commonly used solution in the caption generation literature [5] – image features are concatenated with every word in the caption. The upshot is (a) a requirement to compress caption prefixes together with image data into a fixed-size vector and (b) a substantial growth in the vocabulary size the RNN has to handle, because each image+word is treated as a single 'word'. This problem is alleviated in merge, where the RNN encodes linguistic histories only, at the expense of more parameters in the softmax layer.

[5] We are referring to architectures that inject image features in parallel with word embeddings in the RNN. In the literature, when this type of architecture is used, the image features might only be included with some of the words or are changed for different words (such as in attention models).

One practical consequence of these findings is that, while merge models can handle more variety with smaller layers, increasing the state size of the RNN in the merge architecture is potentially quite profitable, as the entire state will be used to remember a greater variety of previously generated words. By contrast, in the inject architecture, this increase in memory would be used to better accommodate information from two distinct, but combined, modalities.
6 Conclusions

This paper has presented two views of the role of the RNN in an image caption generator. In the first, an RNN decides which word is the most likely to be generated next, given what has been generated before. In multimodal generation, this view encourages architectures where the image is incorporated into the RNN along with the words that were generated, in order to allow the RNN to make visually-informed predictions.

The second view is that the RNN's role is purely memory-based and is only there to encode the sequence of words that have been generated thus far. This representation informs caption prediction at a later layer of the network as a function of both the RNN encoding and perceptual features. This view encourages architectures where vision and language are brought together late, in a multimodal layer.

Caption generation turns out to perform worse, in general, when image features are injected into the RNN. Thus, the role of the RNN is better conceived in terms of the learning of linguistic representations, to be used to inform later layers in the neural network, where predictions are made based on what has been generated in the past together with the image that is guiding the generation. Had the RNN been the component primarily involved in generating the caption, it would need to be informed about the image in order to know what needs to be generated; however, this line of reasoning seems to hurt performance when applied to an actual architecture. This suggests that the RNN is not the main component of the caption generator involved in generation.

In short, given a neural network architecture that is expected to process input sequences from multiple modalities, arriving at a joint representation, it would be better to have a separate component to encode each input, bringing them together at a late stage, rather than to pass them all into the same RNN through separate input channels. With respect to the question of how language should be grounded in perceptual data, the tentative answer offered by these experiments is that the link between the symbolic and perceptual should be established late, once encoding has been performed. To this end, recurrent networks are best viewed as learning representations, not as generating sequences.

6.1 Future work

The experiments reported here were conducted on two separate datasets. One concern is that results on Flickr8k and Flickr30k are not entirely consistent, though the superiority of merge over inject is clear in both. We are currently extending our experiments to the larger MSCOCO dataset (Lin et al., 2014).

The insights discussed in this paper invite future research on how generally applicable the merge architecture is in different domains. We would like to investigate whether similar changes in architecture would work in sequence-to-sequence tasks such as machine translation, where instead of conditioning a language model on an image we are conditioning a target language model on sentences in a source language. A similar question arises in image processing. If a CNN were conditioned to be more sensitive to certain types of objects or saliency differences among regions of a complex image, should the conditioning vector be incorporated at the beginning, thereby conditioning the entire CNN, or would it be better to instead incorporate it in a final layer, where saliency differences would then be based on high-level visual features?

There are also more practical advantages to merge architectures, such as for transfer learning. Since merge keeps the image separate from the RNN, the RNN used for captioning can conceivably be transferred from a neural language model that has been trained on general text. This cannot be done with an inject architecture, since the RNN would need to be trained to combine image and text in the input. In future work, we intend to see how the performance of a caption generator is affected when the weights of the RNN are initialized from those of a general neural language model, along lines explored in neural machine translation (Ramachandran et al., 2016).
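As a sketch of how such a transfer could be set up (an assumption about one possible implementation, not an experiment reported here), the linguistic layers of a merge captioner can simply be initialised from a separately trained neural language model, because they never see the image. The layer names below are hypothetical and assume both models were built with matching embedding and LSTM sizes.

```python
def transfer_language_model_weights(language_model, merge_captioner,
                                    layer_names=('embedding', 'lstm')):
    """Copy pre-trained linguistic weights into a merge-architecture captioner."""
    for name in layer_names:
        source = language_model.get_layer(name)
        target = merge_captioner.get_layer(name)
        target.set_weights(source.get_weights())   # weights transfer layer by layer
    return merge_captioner
```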
Acknowledgments

This work was partially funded by the Endeavour Scholarship Scheme (Malta), part-financed by the European Social Fund (ESF).
References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In Proc. ICCV'15, pages 2425–2433, Santiago, Chile. IEEE.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, pages 65–72.

Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures. Journal of Artificial Intelligence Research, 55:409–442.

Xinlei Chen and C. Lawrence Zitnick. 2015. Mind's eye: A recurrent visual representation for image caption generation. In Proc. CVPR'15. Institute of Electrical and Electronics Engineers (IEEE), June.

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Institute of Electrical and Electronics Engineers (IEEE), June.

Desmond Elliott and Frank Keller. 2013. Image Description using Visual Dependency Representations. In Proc. EMNLP'13, pages 1292–1302, Seattle, WA. Association for Computational Linguistics.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pages 249–256.

Alex Graves. 2013. Generating Sequences with Recurrent Neural Networks. arXiv preprint, arXiv:1308:1–43.

Ankush Gupta, Yashaswi Verma, and C. V. Jawahar. 2012. Choosing Linguistics over Vision to Describe Images. In Proc. AAAI'12, pages 606–612.

Stevan Harnad. 1990. The symbol grounding problem.

Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Darrell. 2016. Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. In Proc. CVPR'16. Institute of Electrical and Electronics Engineers (IEEE), June.

Jack Hessel, Nicolas Savva, and Michael J. Wilber. 2015. Image Representations and New Domains in Neural Image Captioning. CoRR, abs/1508.02091.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research, 47(1):853–899, May.

Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual Storytelling. In Proc. NAACL-HLT'16, pages 1233–1239.

Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In Proc. CVPR'15. Institute of Electrical and Electronics Engineers (IEEE), June.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014a. Multimodal neural language models. In Proceedings of The 31st International Conference on Machine Learning, pages 595–603.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014b. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2011. Baby Talk: Understanding and Generating Image Descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'11), pages 1601–1608, Colorado Springs. IEEE.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436–444.

Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proc. ACL'04. Association for Computational Linguistics (ACL).

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755. Springer Nature.

Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2016. Optimization of image description metrics using policy gradient methods. CoRR, abs/1612.00370.

Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2016. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. CoRR, abs/1612.01887.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2014. Explain images with multimodal recurrent neural networks. In Proc. NIPS Deep Learning Workshop.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2015a. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). In Proc. ICLR'15.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2015b. Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images. In Proc. ICCV'15, Santiago, Chile, December 2015. Institute of Electrical and Electronics Engineers (IEEE).

Rebecca Mason and Eugene Charniak. 2014. Domain-Specific Image Captioning. In Proc. CONLL'14, pages 11–20, Baltimore, MA. Association for Computational Linguistics.

Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Xufeng Han, Alyssa Mensch, Alex Berg, Tamara Berg, and Hal Daume III. 2012. Midge: Generating Image Descriptions From Computer Vision Detections. In Proc. EACL'12, pages 747–756, Avignon, France. Association for Computational Linguistics.

V. Ordonez, G. Kulkarni, and T. L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In Proceedings of the 2011 Conference on Advances in Neural Information Processing Systems (NIPS'11), pages 1143–1151, Granada, Spain. Curran Associates Ltd.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL'02, pages 311–318. Association for Computational Linguistics.

Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. 2016. Unsupervised pretraining for sequence to sequence learning. arXiv.

E. Reiter and R. Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, Cambridge, UK.

Deb Roy and Ehud Reiter. 2005. Connecting language to the world. Artificial Intelligence, 167(1-2):1–12.

Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1556.

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded Compositional Semantics for Finding and Describing Images with Sentences. Transactions of the Association for Computational Linguistics (TACL), 2(April):207–218.

Mingoo Song and Chang D. Yoo. 2016. Multimodal representation: Kneser-Ney smoothing/skip-gram based neural language model. In 2016 IEEE International Conference on Image Processing (ICIP). Institute of Electrical and Electronics Engineers (IEEE), September.

Ilya Sutskever, James Martens, and Geoffrey Hinton. 2011. Generating Text with Recurrent Neural Networks. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), pages 1017–1024, Bellevue, WA. ACM.

Marc Tanti, Albert Gatt, and Kenneth P. Camilleri. 2017. Where to put the image in an image caption generator. CoRR, abs/1703.09137.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Institute of Electrical and Electronics Engineers (IEEE), June.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Institute of Electrical and Electronics Engineers (IEEE), June.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of The 32nd International Conference on Machine Learning, volume abs/1502.03044, pages 2048–2057.

Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, and William W. Cohen. 2016. Encode, review, and decode: Reviewer module for caption generation. CoRR, abs/1605.07912.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proc. CVPR'16. Institute of Electrical and Electronics Engineers (IEEE), June.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Luowei Zhou, Chenliang Xu, Parker Koch, and Jason J. Corso. 2016. Image caption generation with text-conditional semantic attention. CoRR, abs/1606.04621.