Sequence to Sequence - Video to Text
Subhashini Venugopalan¹, Marcus Rohrbach²,⁴, Trevor Darrell², Jeff Donahue², Kate Saenko³, Raymond Mooney¹
Abstract
[Figure 1: raw frames are fed to an object-pretrained CNN and flow images to an action-pretrained CNN; the CNN outputs feed a stack of LSTMs that emits a sentence such as "A man is cutting a bottle" followed by <eos>.]
1. Introduction
2. Related Work
Early work on video captioning considered tagging
videos with metadata [1] and clustering captions and videos
[14, 25, 42] for retrieval tasks. Several previous methods
for generating sentence descriptions [11, 19, 36] used a two-stage pipeline that first identifies the semantic content (subject, verb, object) and then generates a sentence based on a template. This typically involved training individual classifiers to identify candidate objects, actions, and scenes. A probabilistic graphical model then combines the visual confidences with a language model to estimate the most likely content (subject, verb, object, scene) in the video, which is in turn used to generate a sentence. While this simplifies the problem by detaching content identification from surface realization, it requires selecting a set of relevant objects and actions to recognize. Moreover, a template-based approach to sentence generation is insufficient to model the richness of language used in human descriptions, e.g., which attributes to use and how to combine them effectively to produce a good description. In contrast, our approach avoids the separation of content identification and
sentence generation by learning to directly map videos to
full human-provided sentences, learning a language model
simultaneously conditioned on visual features.
Our models take inspiration from the image caption generation models in [8, 40]. Their first step is to generate a fixed-length vector representation of an image by extracting features from a CNN. The next step learns to decode
this vector into a sequence of words composing the description of the image. While any RNN can be used in principle
to decode the sequence, the resulting long-term dependencies can lead to inferior performance. To mitigate this issue,
LSTM models have been exploited as sequence decoders, as
they are more suited to learning long-range dependencies.
In addition, since we are using variable-length video as input, we use LSTMs as sequence to sequence transducers,
following the language translation models of [34].
In [39], LSTMs are used to generate video descriptions
by pooling the representations of individual frames. Their
technique extracts CNN features for frames in the video and
then mean-pools the results to get a single feature vector
representing the entire video. They then use an LSTM as
a sequence decoder to generate a description based on this
vector. A major shortcoming of this approach is that this
representation completely ignores the ordering of the video
frames and fails to exploit any temporal information. The
approach in [8] also generates video descriptions using an
LSTM; however, they employ a version of the two-step approach that uses CRFs to obtain semantic tuples of activity, object, tool, and location, and then use an LSTM to translate this tuple into a sentence. Moreover, the model in [8] is
applied to the limited domain of cooking videos while ours
is aimed at generating descriptions for videos in the wild.
Contemporaneous with our work, the approach in [43]
also addresses the limitations of [39] in two ways. First,
they employ a 3-D convnet model that incorporates spatiotemporal motion features. To obtain the features, they assume videos are of fixed volume (width, height, time). They
extract dense trajectory features (HoG, HoF, MBH) [41]
over non-overlapping cuboids and concatenate these to form
the input. The 3-D convnet is pre-trained on video datasets
for action recognition. Second, they include an attention
mechanism that learns to weight the frame features nonuniformly conditioned on the previous word input(s) rather
than uniformly weighting features from all frames as in
[39]. The 3-D convnet alone provides limited performance
improvement, but in conjunction with the attention model it
notably improves performance. We propose a simpler approach to exploiting temporal information: an LSTM encodes the sequence of video frames into a distributed vector representation that is sufficient to generate a sentential description. Our direct sequence to sequence model therefore does not require an explicit attention mechanism.
Another recent project [33] uses LSTMs to predict the
future frame sequence from an encoding of the previous
frames. Their model is more similar to the language translation model in [34], which uses one LSTM to encode the input sequence into a fixed representation and a second LSTM to decode it into the output sequence.
3. Approach
We propose a sequence to sequence model for video description, where the input is the sequence of video frames
$(x_1, \ldots, x_n)$, and the output is the sequence of words $(y_1, \ldots, y_m)$. Naturally, both the input and output are of variable, potentially different, lengths. In our case, there are typically many more frames than words.
In our model, we estimate the conditional probability of an output sequence $(y_1, \ldots, y_m)$ given an input sequence $(x_1, \ldots, x_n)$, i.e.,

$$p(y_1, \ldots, y_m \mid x_1, \ldots, x_n) \qquad (1)$$
This problem is analogous to machine translation between
natural languages, where a sequence of words in the input
language is translated to a sequence of words in the output
language. Recently, [6, 34] have shown how to effectively
attack this sequence to sequence problem with an LSTM
Recurrent Neural Network (RNN). We extend this paradigm
to inputs comprised of sequences of video frames, significantly simplifying prior RNN-based methods for video description. In the following, we describe our model and architecture in detail, as well as our input and output representation for video and sentences.
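Since the hidden states $h_t$ appear in the objective below, it may help to recall the standard LSTM unit of [12] that our layers use. This is the textbook formulation in our own notation (not an equation reproduced from this paper), with $\sigma$ the sigmoid, $\odot$ the elementwise product, and $W$, $b$ learned parameters:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$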
$$\theta^{*} = \operatorname*{argmax}_{\theta} \sum_{t=1}^{m} \log p(y_t \mid h_{n+t-1}, y_{t-1}; \theta) \qquad (4)$$
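The stacked-LSTM formulation behind Equations (1) and (4) can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the paper's original Caffe implementation; the class name `S2VTSketch`, the feature and hidden dimensions, and the dummy tensors are assumptions made for the example. The first LSTM reads projected frame features during the encoding stage and zero padding during decoding, while the second LSTM receives padded word inputs during encoding and the word embeddings during decoding; maximizing Equation (4) corresponds to minimizing the per-step cross-entropy.

```python
import torch
import torch.nn as nn


class S2VTSketch(nn.Module):
    """Two-layer LSTM stack: encode frame features, then decode words."""

    def __init__(self, feat_dim=4096, vocab_size=10000, hidden=500, embed=500):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, embed)        # project CNN features
        self.word_embed = nn.Embedding(vocab_size, embed)   # word embeddings
        self.lstm1 = nn.LSTM(embed, hidden, batch_first=True)           # visual layer
        self.lstm2 = nn.LSTM(embed + hidden, hidden, batch_first=True)  # language layer
        self.out = nn.Linear(hidden, vocab_size)            # scores W_y z_t, cf. Eq. (5)

    def forward(self, frame_feats, word_ids):
        """frame_feats: (B, n, feat_dim); word_ids: (B, m) starting with <BOS>."""
        B, n, _ = frame_feats.shape
        m = word_ids.size(1)
        x_enc = self.frame_proj(frame_feats)                  # encoding-stage input
        x_dec = x_enc.new_zeros(B, m, x_enc.size(2))          # <pad> frames while decoding
        h1, _ = self.lstm1(torch.cat([x_enc, x_dec], dim=1))  # (B, n+m, hidden)
        w = self.word_embed(word_ids)                         # (B, m, embed)
        pad_w = w.new_zeros(B, n, w.size(2))                  # <pad> words while encoding
        h2, _ = self.lstm2(torch.cat([torch.cat([pad_w, w], dim=1), h1], dim=2))
        return self.out(h2[:, n:, :])                         # logits only for decode steps


# Maximizing Eq. (4) is equivalent to minimizing cross-entropy on the next word.
model = S2VTSketch()
feats = torch.randn(2, 30, 4096)                 # 30 frames of fc7-like features
words = torch.randint(0, 10000, (2, 8))          # <BOS> w1 ... w6 <EOS> as ids
logits = model(feats, words[:, :-1])             # condition on y_{t-1}
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), words[:, 1:].reshape(-1))
loss.backward()
```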
[Figure 2 diagram: a two-layer LSTM stack unrolled over time; an encoding stage reads the frame inputs (with <pad> text inputs), then a decoding stage, given <BOS>, emits "man is talking <EOS>" while the frame input is padded.]
Figure 2. We propose a stack of two LSTMs that learn a representation of a sequence of frames in order to decode it into a sentence that
describes the event in the video. The top LSTM layer (colored red) models visual feature inputs. The second LSTM layer (colored green)
models language given the text input and the hidden representation of the video sequence. We use <BOS> to indicate begin-of-sentence
and <EOS> for the end-of-sentence tag. Zeros are used as a <pad> when there is no input at the time step.
As the loss is propagated back in time, the LSTM learns to generate an appropriate hidden state representation ($h_n$) of the input sequence. The output ($z_t$) of the second LSTM layer is used to obtain the emitted word ($y$). We apply a softmax function to get the probability distribution over the words $y'$ in the vocabulary $V$:
$$p(y \mid z_t) = \frac{\exp(W_y z_t)}{\sum_{y' \in V} \exp(W_{y'} z_t)} \qquad (5)$$
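As a small numerical illustration of Equation (5) (a sketch with made-up sizes, not code from the paper), the distribution over the vocabulary is simply a softmax over the scores $W_y z_t$:

```python
import numpy as np

vocab_size, hidden = 6, 4
W = np.random.randn(vocab_size, hidden)    # one row W_y per vocabulary word
z_t = np.random.randn(hidden)              # output of the second LSTM layer at step t

scores = W @ z_t                           # W_y z_t for every y in V
p = np.exp(scores - scores.max())          # subtract max for numerical stability
p /= p.sum()                               # p[y] = p(y | z_t); probabilities sum to 1
```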
Corpus statistics for the three datasets:

                        MSVD      MPII-MD     M-VAD
#-sentences            80,827      68,375     56,634
#-tokens              567,874     679,157    568,408
vocab                   12,594      21,700     18,092
#-videos                 1,970      68,337     46,009
avg. length              10.2s        3.9s       6.2s
#-sents per video           41           1        1-2
4. Experimental Setup
Quantitative evaluation of the models is performed using the METEOR [7] metric, which was originally proposed to evaluate machine translation results. The METEOR score is computed based on the alignment between a given hypothesis sentence and a set of candidate reference sentences. METEOR compares exact token matches, stemmed tokens, and paraphrase matches, as well as semantically similar matches using WordNet synonyms. This semantic aspect of METEOR distinguishes it from others such as BLEU [26], ROUGE-L [21], or CIDEr [38]. The authors of CIDEr [38] evaluated these four measures for image description. They showed that METEOR is always better than BLEU and ROUGE and outperforms CIDEr when the number of references is small (CIDEr is comparable to METEOR when the number of references is large). Since MPII-MD and M-VAD have only a single reference, we decided to use METEOR in all our evaluations. We employ METEOR version 1.5² using the code³ released with the Microsoft COCO Evaluation Server [4].
2 alavie/METEOR
3 https://github.com/tylin/coco-caption
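For reference, scoring a set of generated sentences with this METEOR setup can look roughly like the sketch below. It assumes the pycocoevalcap package layout of the coco-caption repository cited above and a working Java installation for the bundled METEOR 1.5 jar; the video ids and sentences are dummy examples, not data from our experiments.

```python
# Hedged sketch: METEOR via the coco-caption code (github.com/tylin/coco-caption).
from pycocoevalcap.meteor.meteor import Meteor

# Both structures map a video id to a list of sentences; MPII-MD and M-VAD
# provide only a single reference per clip.
references = {"vid1": ["a man is cutting a bottle"],
              "vid2": ["a woman is slicing an onion"]}
hypotheses = {"vid1": ["a man is slicing a bottle"],
              "vid2": ["a woman is cutting an onion"]}

corpus_score, per_video = Meteor().compute_score(references, hypotheses)
print("METEOR: %.1f%%" % (100 * corpus_score))
```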
METEOR scores (%) on the MSVD dataset:

Model                                     METEOR
(1)  FGM [36]                               23.9
     Mean pool
(2)   - AlexNet [39]                        26.9
(3)   - VGG                                 27.7
(4)   - AlexNet COCO pre-trained [39]       29.1
(5)   - GoogleNet [43]                      28.7
     Temporal attention
(6)   - GoogleNet [43]                      29.0
(7)   - GoogleNet + 3D-CNN [43]             29.6
     S2VT (ours)
(8)   - Flow (AlexNet)                      24.3
(9)   - RGB (AlexNet)                       27.9
(10)  - RGB (VGG) random frame order        28.2
(11)  - RGB (VGG)                           29.2
(12)  - RGB (VGG) + Flow (AlexNet)          29.8
Edit-Distance      k=0    k<=1    k<=2    k<=3
MSVD              42.9    81.2    93.6    96.6
MPII-MD           28.8    43.5    56.4    83.0
M-VAD             15.6    28.7    37.8    45.0
Table 3. Percentage of generated sentences which match a sentence of the training set with an edit (Levenshtein) distance of less than 4. All values reported in percentage (%).
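The edit-distance analysis in Table 3 can be reproduced in outline with the following sketch (the function names and the word-level tokenization are our assumptions, not the authors' script): each generated sentence is compared against every training sentence and counted if its minimum Levenshtein distance is at most k.

```python
def levenshtein(a, b):
    """Word-level edit distance between token lists a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (wa != wb)))   # substitution (0 if equal)
        prev = cur
    return prev[-1]


def percent_within(generated, training, k):
    """Percentage of generated sentences within k edits of some training sentence."""
    train_tok = [s.split() for s in training]
    hits = sum(1 for g in generated
               if min(levenshtein(g.split(), t) for t in train_tok) <= k)
    return 100.0 * hits / len(generated)
```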
METEOR scores (%) on the MPII-MD dataset:

Approach                        METEOR
SMT (best variant) [28]            5.6
Visual-Labels [27]                 7.0
Mean pool (VGG)                    6.7
S2VT: RGB (VGG), ours              7.1
METEOR scores (%) on the M-VAD dataset:

Approach                                   METEOR
Visual-Labels [27]                            6.3
Temporal att. (GoogleNet+3D-CNN) [43]⁴        4.3
Mean pool (VGG)                               6.1
S2VT: RGB (VGG), ours                         6.7
6. Conclusion
This paper proposed a novel approach to video description. In contrast to related work, we construct descriptions using a sequence to sequence model, where frames
are first read sequentially and then words are generated sequentially. This allows us to handle variable-length input
and output while simultaneously modeling temporal structure. Our model achieves state-of-the-art performance on
the MSVD dataset, and outperforms related work on two
large and challenging movie-description datasets. Despite
its conceptual simplicity, our model significantly benefits
from additional data, suggesting that it has a high model
capacity, and is able to learn complex temporal structure
in the input and output sequences for challenging movie-description datasets.
Acknowledgments
We thank Lisa Anne Hendricks, Matthew Hausknecht, and Damian Mrowca for helpful discussions; Anna Rohrbach for help with both movie corpora; and the anonymous reviewers for insightful comments and suggestions. We acknowledge support from ONR ATL Grant N00014-11-1-010, DARPA, AFRL, DoD MURI award N000141110688, the DEFT program (AFRL grant FA8750-13-2-0026), NSF awards IIS-1427425, IIS-1451244, and IIS-1212798, and the Berkeley Vision and Learning Center. Raymond and Kate acknowledge support from Google. Marcus was supported by the FITweltweit-Program of the German Academic Exchange Service (DAAD).
4 We report results using the predictions provided by [43] but using the original COCO Evaluation scripts. [43] report 5.7% METEOR for their temporal attention + 3D-CNN model using a different tokenization.
5 LSMDC: sites.google.com/site/describingmovies
6 http://vsubhashini.github.io/s2vt.html
Figure 3. Qualitative results on MSVD YouTube dataset from our S2VT model (RGB on VGG net). (a) Correct descriptions involving
different objects and actions for several videos. (b) Relevant but incorrect descriptions. (c) Descriptions that are irrelevant to the event in
the video.
[Figure 4 panels (1)-(6b); the S2VT output shown for clip (1) begins "Now, the van pulls out a window and a ...".]
Figure 4. M-VAD Movie corpus: Representative frame from 6 contiguous clips from the movie Big Mommas: Like Father, Like Son.
From left: Temporal Attention (GoogleNet+3D-CNN) [43], S2VT (in blue) trained on the M-VAD dataset, and DVS: ground truth.
References
[1] H. Aradhye, G. Toderici, and J. Yagnik. Video2text: Learning to annotate video content. In ICDMW, 2009. 2
[2] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, pages 25-36, 2004. 2, 4
[3] D. L. Chen and W. B. Dolan. Collecting highly parallel data
for paraphrase evaluation. In ACL, 2011. 2, 5
[4] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015. 5
[5] X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. In CVPR, 2015. 1
[6] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014. 3
[7] M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In EACL, 2014. 5
[8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015. 1, 2, 3, 4
[9] G. Gkioxari and J. Malik. Finding action tubes. 2014. 4
[10] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014. 1
[11] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013. 1, 2
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997. 1, 3
[13] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In TACL, 2014. 6
[14] H. Huang, Y. Lu, F. Zhang, and S. Sun. A multi-modal clustering method for web videos. In ISCTCS, 2013. 2
[15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACMMM, 2014. 2
[16] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015. 1
[17] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014. 7
[18] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014. 1
[19] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. In AAAI, 2013. 2
[20] P. Kuznetsova, V. Ordonez, T. L. Berg, U. C. Hill, and Y. Choi. Treetalk: Composition and compression of trees for image descriptions. In TACL, 2014. 1
[21] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74-81, 2004. 5
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 6
[23] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv:1412.6632, 2014. 1