Table 1: Results on the captions generated using the inject and merge architectures. Values are means over three separately retrained models, together with the standard deviation in parentheses. Legend: Layer - the layer size used (‘x’ in Figure 3); Vocab. - the vocabulary size used.
which means that either Flickr8k is too easy or Flickr30k is too hard when compared to the much larger MSCOCO.

The best-performing models are merge with a state size of 256 on Flickr8k, and merge with a state size of 128 on Flickr30k, both with a minimum token frequency threshold of 3. Inject models tend to improve with increasing state size on both datasets, while the relationship between the performance of merge and the state size shows no discernible trend. Inject therefore does not seem to overfit as state size increases, even on the larger dataset. At the same time, inject only seems to be able to outperform the best scores achieved by merge if it has a much larger layer size. Therefore, in practical terms, inject models have to have larger capacity to be on par with merge. Put differently, merge has a higher performance-to-model-size ratio and makes more efficient use of limited resources (this observation holds even when model size is defined in terms of the number of parameters instead of layer sizes).
Given the same layer sizes and vocabulary, the number of parameters for merge is greater than for inject. The difference becomes greater as the vocabulary size is increased. For a vocabulary size of 2,539 and a layer size of 512, merge has about 3% more parameters than inject, whilst for a vocabulary size of 9,584 and a layer size of 512, merge has about 20% more parameters. However, the foregoing remarks concerning over- and under-fitting also apply when the difference in the number of parameters is small. That is, the difference in performance is due at least in part to architectural differences, not just to differences in the number of parameters.
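To make the source of this gap concrete, the sketch below counts parameters for simplified versions of the two architectures. The use of an LSTM, the 4096-dimensional image feature vector and the placement of the image projection are assumptions made for illustration rather than the paper's exact configuration, so the resulting ratios are indicative only; the point is that merge's extra parameters sit in the softmax layer and therefore grow with the vocabulary size.

    # Rough parameter count for simplified inject and merge models (assumed
    # design, not the paper's exact one): word embeddings and an image
    # projection of size x, an LSTM with state size x, and a softmax over a
    # vocabulary of size v. Inject feeds [embedding; image] (size 2x) into
    # the LSTM; merge feeds only the embedding and concatenates the image
    # with the LSTM state just before the softmax.

    def lstm_params(input_dim, state_dim):
        # Four gates, each with weights over [input; state] plus a bias.
        return 4 * (state_dim * (input_dim + state_dim) + state_dim)

    def count_params(vocab, x, image_dim=4096, arch="merge"):
        embedding = vocab * x
        image_proj = (image_dim + 1) * x
        if arch == "inject":
            rnn = lstm_params(2 * x, x)
            softmax = (x + 1) * vocab          # softmax over the RNN state only
        else:
            rnn = lstm_params(x, x)
            softmax = (2 * x + 1) * vocab      # softmax over [state; image]
        return embedding + image_proj + rnn + softmax

    for vocab in (2539, 9584):
        inject = count_params(vocab, 512, arch="inject")
        merge = count_params(vocab, 512, arch="merge")
        print(vocab, round(100 * (merge - inject) / inject, 1), "% more parameters for merge")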
Merge models use a greater proportion of the training vocabulary on test captions. However, the proportion of vocabulary used is generally quite small for both architectures: less than 16% for Flickr8k and less than 7% for Flickr30k. Overall, the trend is for a smaller proportion of the training vocabulary to be used as the vocabulary grows larger, suggesting that neural language models find it harder to use infrequent words (which are more numerous at larger vocabulary sizes, by definition). In practice, this means that reducing training vocabularies results in minimal performance loss.
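As a point of reference for how such figures can be obtained (a sketch of the metric as described, not the paper's evaluation code), the proportion of the training vocabulary "used" can be computed as the fraction of training word types that appear at least once in the captions generated for the test set:

    def vocabulary_usage(train_vocab, generated_captions):
        # train_vocab: set of word types seen in training.
        # generated_captions: list of generated captions, each a list of tokens.
        used = {token for caption in generated_captions for token in caption}
        return len(used & set(train_vocab)) / len(train_vocab)

    # Toy example: 6 of the 8 training word types appear in the generated captions.
    train_vocab = {"a", "dog", "runs", "on", "the", "beach", "boy", "plays"}
    generated = [["a", "dog", "runs", "on", "the", "beach"],
                 ["a", "dog", "on", "the", "beach"]]
    print(vocabulary_usage(train_vocab, generated))  # 0.75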
Overall, the evidence suggests that delaying the merging of image features with linguistic encodings to a late stage in the architecture may be advantageous, at least as far as corpus-based evaluation measures are concerned. Furthermore, the results suggest that a merge architecture has a higher capacity than an inject architecture and can generate better quality captions with smaller layers.

5 Discussion

If the RNN had the primary role of generating captions, then it would need to have access to the image in order to know what to generate. This does not seem to be the case, as including the image in the RNN is not generally beneficial to its performance as a caption generator.

When viewing RNNs as having the primary role of encoding rather than generating, it makes sense that the inject architecture generally suffers in performance when compared to the merge architecture. The most plausible explanation has to do with the handling of variation. Consider once more the task of the RNN in image captioning: During [...]
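The training setup at issue here, and the "caption prefixes" referred to in the next paragraph, can be sketched as follows: each training caption is expanded into prefix/next-word pairs, and the RNN must compress each prefix into its fixed-size state. The start/end markers and token handling below are illustrative assumptions, not the paper's implementation.

    def caption_to_prefixes(tokens):
        # Expand one caption into (prefix, next word) training pairs.
        tokens = ["<start>"] + tokens + ["<end>"]
        return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

    for prefix, target in caption_to_prefixes(["a", "dog", "runs"]):
        print(prefix, "->", target)
    # ['<start>'] -> a
    # ['<start>', 'a'] -> dog
    # ['<start>', 'a', 'dog'] -> runs
    # ['<start>', 'a', 'dog', 'runs'] -> <end>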
In the inject architecture, the encoding task is made more complex by the inclusion of image features. Indeed, in the version of inject used in our experiments – the most commonly used solution in the caption generation literature⁵ – image features are concatenated with every word in the caption. The upshot is (a) a requirement to compress caption prefixes together with image data into a fixed-size vector and (b) a substantial growth in the vocabulary size the RNN has to handle, because each image+word is treated as a single ‘word’. This problem is alleviated in merge, where the RNN encodes linguistic histories only, at the expense of more parameters in the softmax layer.
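The contrast described above can be made concrete with a small Keras-style sketch. The layer sizes, the 4096-dimensional image features and the use of an LSTM are illustrative assumptions rather than the paper's exact configuration; what matters is where the image enters each model.

    from tensorflow.keras import layers, Model

    vocab_size, seq_len, x = 2539, 20, 512
    image_in = layers.Input(shape=(4096,))                    # e.g. CNN image features
    words_in = layers.Input(shape=(seq_len,), dtype="int32")  # caption prefix (word indices)
    img = layers.Dense(x)(image_in)                           # project image features to size x
    emb = layers.Embedding(vocab_size, x)(words_in)           # word embeddings

    # Inject: image features are concatenated with every word embedding, so
    # the RNN must compress both modalities into its fixed-size state.
    img_seq = layers.RepeatVector(seq_len)(img)
    inject_state = layers.LSTM(x)(layers.Concatenate()([emb, img_seq]))
    inject_out = layers.Dense(vocab_size, activation="softmax")(inject_state)
    inject_model = Model([image_in, words_in], inject_out)

    # Merge: the RNN encodes the linguistic prefix only; the image is merged
    # with the RNN state just before the softmax layer.
    merge_state = layers.LSTM(x)(emb)
    merge_out = layers.Dense(vocab_size, activation="softmax")(
        layers.Concatenate()([merge_state, img]))
    merge_model = Model([image_in, words_in], merge_out)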
One practical consequence of these findings is that, while merge models can handle more variety with smaller layers, increasing the state size of the RNN in the merge architecture is potentially quite profitable, as the entire state will be used to remember a greater variety of previously generated words. By contrast, in the inject architecture, this increase in memory would be used to better accommodate information from two distinct, but combined, modalities. [...]tations, not as generating sequences.

⁵ We are referring to architectures that inject image features in parallel with word embeddings in the RNN. In the literature, when this type of architecture is used, the image features might only be included with some of the words or are changed for different words (such as in attention models).

6 Conclusions