
Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, Ajmal Mian
Computer Science and Software Engineering,
The University of Western Australia.
[email protected],{naveed.akhtar, wei.liu, syed.gilani, ajmal.mian}@uwa.edu.au

Abstract

Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful designing of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying the Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish new state-of-the-art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE_L metrics.

1. Introduction

Describing videos in natural language is trivial for humans; however, it is a very complex task for machines. To generate meaningful video captions, machines are required to understand objects, their interactions, the spatio-temporal order of events and other such minutiae in videos, and yet have the ability to articulate these details in grammatically correct and meaningful natural language sentences [1]. The bicephalic nature of this problem has recently led researchers from Computer Vision and Natural Language Processing (NLP) to combine efforts in addressing its challenges [3, 4, 5, 30]. Incidentally, wide applications of video captioning in emerging technologies, e.g. procedure generation from instructional videos [2] and video indexing and retrieval [45, 55], have recently caused it to receive attention as a fundamental task in Computer Vision.

Early methods in video captioning and description, e.g. [26, 9], primarily aimed at generating the correct Subject, Verb and Object (a.k.a. SVO-Triplet) in the captions. More recent methods [50, 39] rely on Deep Learning [28] to build frameworks resembling a typical neural machine translation system that can generate a single sentence [57, 33] or multiple sentences [38, 43, 60] to describe videos. The two-pronged problem of video captioning provides a default division for the deep learning methods: encode the visual contents of videos using Convolutional Neural Networks (CNNs) [44, 48] and decode those into captions using language models. Recurrent Neural Networks (RNNs) [16, 14, 22] are the natural choice for the latter component of the problem.

Since semantically correct sentence generation has a longer history in the field of NLP, deep learning based captioning techniques mainly focus on language modelling [51, 34]. For visual encoding, these methods forward pass video frames through a pre-trained 2D CNN, or a video clip through a 3D CNN, and extract features from an inner layer of the network, referred to as the 'extraction layer'. Features of frames/clips are commonly combined with mean pooling to compute the final representation of the whole video. This and similar other visual encoding techniques [33, 51, 18, 34], due to the nascency of video captioning research, grossly under-exploit the prowess of visual representation for the captioning task. To the best of our knowledge, this paper presents the first work that concentrates on improving the visual encoding mechanism for the captioning task.

We propose a visual encoding technique to compute representations enriched with spatio-temporal dynamics of the scene, while also accounting for the high-level semantic attributes of the videos. Our visual code ('v' in Fig. 1) fuses information from multiple sources.

Figure 1. The ‘c’ clips and ‘f ’ frames of a video are processed with 3D and 2D CNNs respectively. Neuron-wise Short Fourier Transform
is applied hierarchically to the extraction layer activations of these networks (using the whole video). This results in spatio-temporal
dynamics enriched encodings α and β. Relevant high-level object semantics γ and action semantics η are derived using the intersection
of vocabulary from the language model dictionary with the labels of 3D CNN and an Object Detector. The output features of the Object
Detector are also used to embed spatial dynamics of the scene and plurality of the objects therein. The resulting codes are compressed with
a fully-connected layer and used to learn a multi-layer GRU as a language model.

We process activations of the 2D and 3D CNN extraction layers by hierarchically applying the Short Fourier Transform [31] to them, where InceptionResNetV2 [46] and C3D [48] are used as the 2D and 3D CNNs respectively. The proposed neuron-wise activation transformation over whole videos results in encoding fine temporal dynamics of the scenes. We encode spatial dynamics by processing objects' locations and their multiplicity information extracted from an Object Detector (YOLO [37]). The semantics attached to the output layers of the Object Detector and the 3D CNN are also exploited to embed high-level semantic attributes in our visual codes. We compress the visual codes and learn a language model using the resulting representation. With highly rich visual codes, a relatively simple Gated Recurrent Unit (GRU) network, comprising two layers, is proposed for language modeling, which already results in on-par or better performance compared to the existing sophisticated models [52, 54, 34, 18] on multiple evaluation metrics. The main contributions of this paper are as follows. We propose a visual encoding technique that effectively encapsulates the spatio-temporal dynamics of the videos and embeds relevant high-level semantic attributes in the visual codes for video captioning. The proposed visual features contain the detected object attributes, their frequency of occurrence, as well as the evolution of their locations over time. We establish the effectiveness of the proposed encoding by learning a GRU-based language model and perform thorough experimentation on the MSVD [11] and MSR-VTT [57] datasets. Our method achieves up to 2.64% and 2.44% gain over the state-of-the-art on the METEOR and ROUGE_L metrics for these datasets.

2. Related Work

Classical methods in video captioning commonly use template based techniques in which Subject (S), Verb (V), and Object (O) are detected separately and then joined together in a sentence. However, the advancement of deep learning research has also transcended to modern video captioning methods. The latest approaches in this direction generally exploit deep learning for visual feature encoding as well as its decoding into meaningful captions.

Among template based approaches, the first successful video captioning method was proposed by Kojima et al. [26], which focuses on describing videos of one person performing one action only. Its heavy reliance on the correctness of a manually created activity concept hierarchy and state transition model prevented its extension to more complex videos. Hanckmann et al. [21] proposed a method to automatically describe events involving multiple actions (seven on average), performed by one or more individuals. Whilst most of the prior work was restricted to constrained domains [25, 9], Krishnamoorthy et al. [27] led the early works of describing open domain videos. [20] proposed semantic hierarchies to establish relationships between actor, action and objects. [40] used a CRF to model the relationship between visual entities and treated video description as a machine translation problem. However, the aforementioned approaches depend on predefined sentence templates and fill in the template by detecting entities with classical methods. These approaches are not sufficient for the syntactically rich sentence generation needed to describe open domain videos.

In contrast to the methods mentioned above, deep models directly generate sentences given a visual input. For example, LSTM-YT [51] feeds the visual content of a video, obtained by average pooling all the frames, into an LSTM to produce the sentences. LSTM-E [33] explores the relevance between the visual context and sentence semantics. The initial visual features in this framework were obtained using a 2D-CNN and a 3D-CNN, whereas the final video representation was achieved by average pooling the features from frames/clips, neglecting the temporal dynamics of the video. TA [59] explored the temporal domain of video by introducing an attention mechanism to assign weights to the features of each frame and later fused them based on the attention weights. S2VT [50] incorporated optical flow to cater for the temporal information of the video. SCN-LSTM [18] proposed a semantic compositional network that can detect semantic concepts from the mean pooled visual content of the video and fed that information into a language model to generate captions with more relevant words. LSTM-TSA [34] proposed a transfer unit that extracts semantic attributes from images as well as the mean pooled visual content of videos and added them as complementary information to the video representation to further improve the quality of caption generation. M3-VC [54] proposed a multi-modal memory network to cater for long-term visual-textual dependency and to guide the visual attention.

Even though the above methods have employed deep learning, they have used mean pooled visual features or attention based high level features from CNNs. These features have been used directly in their frameworks, either in the language model or by introducing an additional unit in the standard framework. We argue that this practice under-utilizes the state-of-the-art CNN features in the video captioning framework. We propose features that are rich in visual content and empirically show that this enrichment of visual features alone, when combined with a standard and simple language model, can outperform existing state-of-the-art methods. Visual features are part of every video captioning framework. Hence, instead of using high level or mean pooled features, building on top of our visual features can further enhance the performance of video captioning frameworks.

3. Proposed Approach

Let V denote a video that has 'f' frames or 'c' clips. The fundamental task in automatic video captioning is to generate a textual sentence S = {W_1, W_2, ..., W_w} comprising 'w' words that matches closely to human generated captions for the same video. Deep learning based video captioning methods typically define an energy loss function of the following form for this task:

Ξ(v, S) = − Σ_{t=1}^{w} log Pr(W_t | v, W_1, ..., W_{t−1}),    (1)

where Pr(.) denotes the probability, and v ∈ R^d is a visual representation of V. By minimizing the cost defined as the Expected value of the energy Ξ(.) over a large corpus of videos, it is hoped that the inferred model M can automatically generate meaningful captions for unseen videos.

In this formulation, 'v' is considered a training input, which makes the remainder of the problem a sequence learning task. Consequently, the existing methods in video captioning mainly focus on tailoring RNNs [16] or LSTMs [22] to generate better captions, assuming an effective visual encoding of V to be available in the form of 'v'. The representation prowess of CNNs has made them the default choice for visual encoding in the existing literature. However, due to the nascency of video captioning research, only primitive methods of using CNN features for 'v' can be found in the literature. These methods directly use 2D/3D CNN features or their concatenations for visual encoding, where the temporal dimension of the video is resolved by mean pooling [33, 34, 18].

We acknowledge the role of apt sequence modeling for video description; however, we also argue that designing specialized visual encoding techniques for captioning is equally important. Hence, we mainly focus on the operator Q(.) in the mapping M(Q(V)) → S, where Q(V) → v. We propose a visual encoding technique that, along with harnessing the power of CNN features, explicitly encodes the spatio-temporal dynamics of the scene in the visual representation, and embeds semantic attributes in it to further help the sequence modelling phase of video description generate semantically rich textual sentences.

3.1. Visual Encoding

For clarity, we describe the visual representation of a video V as v = [α; β; γ; η], where α to η are themselves column vectors computed by the proposed technique. We explain these computations in the following.

3.1.1 Encoding Temporal Dynamics

In the context of video description, features extracted from pre-trained 2D-CNNs, e.g. VGG [44], and 3D-CNNs, e.g. C3D [48], have been shown useful for visual encoding of videos. The standard practice is to forward pass individual video frames through a 2D CNN, store the activation values of a pre-selected extraction layer of the network, and then perform mean pooling over those activations for all the frames to compute the visual representation. A similar procedure is adopted with a 3D CNN, with the difference that video clips are used in the forward passes instead of frames.

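
For concreteness, this standard practice can be sketched as follows. This is a minimal NumPy illustration in which the extraction-layer activations are assumed to be pre-computed; the function name and array shapes are ours, not the authors' code.

    import numpy as np

    def mean_pool_video_features(frame_feats, clip_feats):
        """Baseline visual encoding: average the extraction-layer activations
        over all frames (2D CNN) and all clips (3D CNN), then concatenate.

        frame_feats : (f, m) array - 2D CNN extraction-layer activations for
                      the f frames of one video (m neurons).
        clip_feats  : (c, k) array - 3D CNN extraction-layer activations for
                      the c clips of the same video (k neurons).
        """
        v_2d = frame_feats.mean(axis=0)        # (m,) mean-pooled frame features
        v_3d = clip_feats.mean(axis=0)         # (k,) mean-pooled clip features
        return np.concatenate([v_2d, v_3d])    # temporal order is lost here

    # Example with random stand-ins for CNN activations.
    rng = np.random.default_rng(0)
    frame_feats = rng.normal(size=(120, 1536))   # e.g. 120 frames, 1536-d avg_pool of IRV2
    clip_feats = rng.normal(size=(15, 4096))     # e.g. 15 clips, 4096-d fc6 of C3D
    v = mean_pool_video_features(frame_feats, clip_feats)
    print(v.shape)                               # (5632,)

The averaging discards the temporal order of frames and clips entirely, which is precisely the shortcoming addressed next.
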
A simple mean pooling operation over activation values is bound to fail in encoding the fine-grained temporal dynamics of the video. This is true for both 2D and 3D CNNs, despite the fact that the latter models video clips. We address this shortcoming by defining transformations T_f(F) → α and T_c(C) → β, such that F = {a_1^2D, a_2^2D, ..., a_f^2D} and C = {a_1^3D, a_2^3D, ..., a_c^3D}. Here, a_t^2D and a_t^3D denote the activation vectors of the extraction layers of the 2D and 3D CNNs for the t-th video frame and video clip respectively. The aim of these transformations is to compute α and β that encode the temporal dynamics of the complete video with high fidelity. We use the last avg_pool layer of InceptionResNetV2 [46] to compute a_i^2D, and the fc6 layer of C3D [48] to get a_i^3D. The transformations T_{f/c}(.) are defined over the activations of those extraction layers. Below, we explain T_f(.) in detail. The transformation T_c(.) is similar, except that it uses activations of clips instead of frames.

Figure 2. Illustration of the hierarchical application of the Short Fourier Transform Φ(.) to the activations a_j^i of the j-th neuron of the extraction layer for the i-th video.

Let a_{j,t}^i denote the activation value of the j-th neuron of the network's extraction layer for the t-th frame of the i-th training video. We leave out the superscript 2D for better readability. To perform the transform, we first define 1a_j^i = [a_{j,1}^i, a_{j,2}^i, ..., a_{j,f}^i] ∈ R^f and compute Ψ(1a_j^i) → ς_1 ∈ R^p, where the operator Ψ(.) computes the Short Fourier Transform [31] of the vector in its argument and stores the first 'p' coefficients of the transform. Then, we divide 1a_j^i into two smaller vectors 21a_j^i ∈ R^h and 22a_j^i ∈ R^{f−h}, where h = ⌊f/2⌋. We again apply the operator Ψ(.) to these vectors to compute ς_21 and ς_22 in p-dimensional space. We recursively perform the same operations on ς_21 and ς_22 to get the p-dimensional vectors ς_311, ς_312, ς_321, and ς_322. We combine all these vectors as ς(j) = [ς_1, ς_21, ς_22, ..., ς_322] ∈ R^{(p×7)×1}. We also illustrate this operation in Fig. 2. The same operation is performed individually for each neuron of our extraction layer. We then concatenate ς(j) : j ∈ {1, 2, ..., m} to form α ∈ R^{(p×7×m)×1}, where m denotes the number of neurons in the extraction layer. As a result of performing T_f(F) → α, we have computed a representation of the video while accounting for the fine temporal dynamics in the whole sequence of video frames. Consequently, T_f(.) results in a much more informative representation than that obtained with mean pooling of the neuron activations.

We define T_c(.) in a similar manner for the set C of video clip activations. This transformation results in β ∈ R^{(p×7×k)×1}, where k denotes the number of neurons in the extraction layer of the 3D CNN. It is worth mentioning that a 3D CNN is already trained on short video clips. Hence, its features account for the temporal dimension of V to some extent. Nevertheless, accounting for the fine temporal details in the whole video adds to our encoding significantly (see Section 4.3). It is noteworthy that exploiting the Fourier Transform in a hierarchical fashion to encode temporal dynamics has also been considered in human action recognition [53, 36]. However, this work is the first to apply the Short Fourier Transform hierarchically for video captioning.

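
To make the hierarchical transform T_f(.) concrete, the following NumPy sketch implements one plausible reading of the construction above (p Fourier coefficient magnitudes per segment, three levels giving the 1 + 2 + 4 = 7 segments of Fig. 2). It is an illustration under our assumptions, not the authors' released code.

    import numpy as np

    def sft(x, p):
        """Short Fourier Transform operator Psi(.): return the first p Fourier
        coefficients of the 1-D signal x. Magnitudes are used here; the paper
        does not pin down real/imaginary handling, so this is an assumption."""
        coeffs = np.fft.rfft(x, n=max(len(x), 2 * p))
        return np.abs(coeffs[:p])

    def hierarchical_sft(trace, p, levels=3):
        """Hierarchically apply Psi to one neuron's activation trace over the
        whole video: the full trace, its two halves, and their halves again
        (1 + 2 + 4 = 7 segments for levels=3), giving a (7*p,) descriptor."""
        segments = [trace]
        current = [trace]
        for _ in range(levels - 1):
            nxt = []
            for seg in current:
                h = len(seg) // 2
                nxt.extend([seg[:h], seg[h:]])
            segments.extend(nxt)
            current = nxt
        return np.concatenate([sft(seg, p) for seg in segments])

    def encode_temporal_dynamics(acts, p=4):
        """acts: (f, m) extraction-layer activations of one video (f frames,
        m neurons). Returns alpha of size 7*p*m, mirroring T_f(F) -> alpha."""
        return np.concatenate([hierarchical_sft(acts[:, j], p)
                               for j in range(acts.shape[1])])

    rng = np.random.default_rng(0)
    acts = rng.normal(size=(120, 8))      # toy example: 120 frames, 8 neurons
    alpha = encode_temporal_dynamics(acts, p=4)
    print(alpha.shape)                    # (224,) = 7 * 4 * 8

T_c(.) follows by feeding the clip activations of the 3D CNN through the same routine.
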
3.1.2 Encoding Semantics and Spatial Evolution

It is well-established that the latter layers of CNNs are able to learn features at higher levels of abstraction due to the hierarchical application of convolution operations in the earlier layers [28]. The common use of activations of e.g. fully-connected layers as visual features for captioning is also motivated by the fact that these representations are discriminative transformations of high-level video features. We take this concept further and argue that the output layers of CNNs can themselves serve as discriminative encodings of the highest abstraction level for video captioning. We describe the technique to effectively exploit these features in the paragraphs to follow. Here, we briefly emphasize that the output layer of a network contains additional information for video captioning beyond what is provided by the commonly used extraction layers of networks, because:

1. The output labels are yet another transformation of the extraction layer features, resulting from network weights that are unaccounted for by the extraction layer.

2. The semantics attached to the output layer are at the same level of abstraction that is encountered in video captions - a unique property of the output layers.

We use the output layers of an Object Detector (i.e. YOLO [37]) and a 3D CNN (i.e. C3D [48]) to extract semantics pertaining to the objects and actions recorded in videos. The core idea is to quantitatively embed object labels, their frequencies of occurrence, and the evolution of their spatial locations in videos in the visual encoding vector. Moreover, we also aim to enrich our visual encoding with the semantics of actions performed in the video. The details of materializing this concept are presented below.

Objects Information: Different from classifiers that only predict labels of input images/frames, object detectors can localize multiple objects in individual frames, thereby providing cues for ascertaining the plurality of the same type of objects in individual frames and the evolution of objects' locations across multiple frames. Effective embedding of such high-level information in the vector 'v' promises descriptions that can clearly differentiate between e.g. 'people running' and 'person walking' in a video.

The sequence modeling component of a video captioning system generates a textual sentence by selecting words from a large dictionary D. An object detector provides a set L̃ of object labels at its output. We first compute L = D ∩ L̃, and define γ = [ζ_1, ζ_2, ..., ζ_|L|], where |.| denotes the cardinality of a set. The vectors ζ_i, ∀i in γ are further defined with the help of 'q' frames sampled from the original video. We perform this sampling using a fixed time interval between the sampled frames of a given video. The samples are passed through the object detector and its output is utilized in computing ζ_i, ∀i. A vector ζ_i is defined as ζ_i = [Pr(ℓ_i), Fr(ℓ_i), ν_i^1, ν_i^2, ..., ν_i^(q−1)], where ℓ_i indicates the i-th element of L (i.e. an object name), Pr(.) and Fr(.) respectively compute the probability and frequency of occurrence of the object corresponding to ℓ_i, and ν_i^z represents the velocity of the object between the frames z and z + 1 (in the sampled q frames).

We define γ over 'q' frames, whereas the used object detector processes individual frames, which results in a probability and frequency value for each frame. We resolve this and related mismatches by using the following definitions of the components of ζ_i:

• Pr(.) = max_z Pr_z(.) : z ∈ {1, ..., q}.

• Fr(.) = (max_z Fr_z(.)) / N : z ∈ {1, ..., q}, where 'N' is the allowed maximum number of the same class of objects detected in a frame. We let N = 10 in experiments.

• ν_i^z = [δ_x^z, δ_y^z] : δ_x^z = x̃^{z+1} − x̃^z and δ_y^z = ỹ^{z+1} − ỹ^z. Here, x̃ and ỹ denote the Expected values of the x and y coordinates of the same type of objects in a given frame, such that the coordinates are also normalized by the respective frame dimensions.

We let q = 5 in our experiments, resulting in ζ_i ∈ R^10, ∀i, which compose γ ∈ R^{(10×|L|)×1}. The indices of the coefficients in γ identify the object labels in videos (i.e. probable nouns to appear in the description). Unless an object is detected in the video, the coefficients of γ corresponding to it are kept zero. The proposed embedding of high level semantics in γ contains highly relevant information about objects in explicit form for the sequence learning module of a video description system.

Actions Information: Videos generally record objects and their interaction. The latter is best described by the actions performed in the videos. We already use a 3D CNN that learns action descriptors for the videos. We tap into the output layer of that network to further embed high level action information in our visual encoding. To that end, we compute A = Ã ∩ D, where Ã is the set of labels at the output of the 3D CNN. Then, we define η = [[ϑ_1, Pr(ℓ_1)], [ϑ_2, Pr(ℓ_2)], ..., [ϑ_|A|, Pr(ℓ_|A|)]] ∈ R^{(2×|A|)×1}, where ℓ_i is the i-th element of A (an action label) and ϑ is a binary variable that is 1 only if the action is predicted by the network.

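
As an illustration of how γ and η can be assembled from detector and 3D-CNN outputs, consider the following sketch. The detection tuple format, the zero fall-back for frames in which an object class is missing, and the top-1 reading of "predicted by the network" are our assumptions rather than details fixed by the paper.

    import numpy as np

    def object_semantics(detections, labels, q=5, N=10):
        """Build gamma from object detections in q uniformly sampled frames.

        detections : list of length q; each entry is a list of
                     (label, confidence, x_center, y_center) tuples, with the
                     coordinates already normalized by the frame size.
        labels     : ordered list L (detector labels that also occur in D).
        Returns 2 + 2*(q-1) values per label (10 for q=5):
        [max probability, max frequency / N, 2*(q-1) velocity components].
        """
        gamma = []
        for lab in labels:
            probs, counts, centers = [], [], []
            for frame in detections:
                hits = [d for d in frame if d[0] == lab]
                probs.append(max((d[1] for d in hits), default=0.0))
                counts.append(len(hits))
                if hits:  # expected (mean) location of this object class
                    centers.append((np.mean([d[2] for d in hits]),
                                    np.mean([d[3] for d in hits])))
                else:
                    centers.append((0.0, 0.0))
            vel = []
            for z in range(q - 1):  # displacement of the mean location
                vel.extend([centers[z + 1][0] - centers[z][0],
                            centers[z + 1][1] - centers[z][1]])
            gamma.extend([max(probs), min(max(counts), N) / N] + vel)
        return np.asarray(gamma)

    def action_semantics(action_probs, action_labels, top_k=1):
        """Build eta: for every action label in A (3D-CNN labels that occur
        in D), a binary 'predicted' flag and the network's probability."""
        top = set(np.argsort(action_probs)[-top_k:])
        eta = []
        for i, _ in enumerate(action_labels):
            eta.extend([1.0 if i in top else 0.0, float(action_probs[i])])
        return np.asarray(eta)

    # Toy example: two sampled frames (q=2 for brevity), two labels of interest.
    frames = [[("dog", 0.9, 0.40, 0.55), ("person", 0.8, 0.20, 0.50)],
              [("dog", 0.7, 0.45, 0.60)]]
    print(object_semantics(frames, ["dog", "person"], q=2).shape)      # (8,)
    print(action_semantics(np.array([0.1, 0.7, 0.2]), ["run", "walk", "jump"]))

Labels in L that are never detected simply keep their zero coefficients, matching the zero-filling described above.
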
We concatenate the above described vectors α, β, γ and η to form our visual encoding vector v ∈ R^d, where d = 2×(p×7×m) + (10×|L|) + (2×|A|). Before passing this vector to the sequence modelling component of our method, we perform its compression using a fully connected layer, as shown in Fig. 1. Using the tanh activation function and fixed weights, this layer projects 'v' to a 2K-dimensional space. The resulting projection 'υ' is used by our language model.

3.2. Sequence Modelling

We follow the common pipeline of video description techniques that feeds the visual representation of a video to a sequence modelling component, see Fig. 1. Instead of resorting to a sophisticated language model, we develop a relatively simple model employing multiple layers of Gated Recurrent Units (GRUs) [14]. GRUs are known to be more robust to the vanishing gradient problem - an issue encountered in long captions - due to their ability to remember the relevant information and forget the rest over time. A GRU has two gates: reset Γ_r and update Γ_u, where the update gate decides how much the unit updates its previous memory and the reset gate determines how to combine the new input with the previous memory. Concretely, our language model computes the hidden state h^<t> of a GRU as:

Γ_u = σ(W_u [h^<t−1>, x^<t>] + b_u)

Γ_r = σ(W_r [h^<t−1>, x^<t>] + b_r)

h̃^<t> = tanh(W_h [Γ_r ⊙ h^<t−1>, x^<t>] + b_h)

h^<t> = Γ_u ⊙ h̃^<t> + (1 − Γ_u) ⊙ h^<t−1>

where ⊙ denotes the Hadamard product, σ(.) is the sigmoid activation, W_q, ∀q are learnable weight matrices, and b_{u/r/h} denote the respective biases. In our approach, h^<0> = υ for a given video, whereas the signal x is the word embedding vector. In Section 4.3, we report results using two layers of GRUs, and demonstrate that our language model under the proposed straightforward sequence modelling already provides highly competitive performance due to the proposed visual encoding.

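
The following NumPy sketch ties the compression layer and one GRU step together, mirroring the equations above with toy dimensions (the paper uses a 2048-dimensional state and 300-dimensional word embeddings; the second stacked GRU layer is omitted for brevity). The weight initialisation and dimensions are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def compress_visual_code(v, W_proj):
        """Fixed fully connected layer with tanh that projects the visual code
        v to the compact space used to initialise the language model."""
        return np.tanh(W_proj @ v)

    def gru_step(h_prev, x, params):
        """One GRU step implementing the update/reset equations of Section 3.2."""
        Wu, bu, Wr, br, Wh, bh = params
        hx = np.concatenate([h_prev, x])
        gamma_u = sigmoid(Wu @ hx + bu)                       # update gate
        gamma_r = sigmoid(Wr @ hx + br)                       # reset gate
        h_tilde = np.tanh(Wh @ np.concatenate([gamma_r * h_prev, x]) + bh)
        return gamma_u * h_tilde + (1.0 - gamma_u) * h_prev   # new hidden state

    # Toy dimensions; real values would be far larger.
    state, embed, d_v = 8, 6, 20
    rng = np.random.default_rng(0)
    W_proj = rng.normal(scale=0.1, size=(state, d_v))
    params = (rng.normal(scale=0.1, size=(state, state + embed)), np.zeros(state),
              rng.normal(scale=0.1, size=(state, state + embed)), np.zeros(state),
              rng.normal(scale=0.1, size=(state, state + embed)), np.zeros(state))

    v = rng.normal(size=d_v)                 # enriched visual code [alpha; beta; gamma; eta]
    h = compress_visual_code(v, W_proj)      # h^<0> = upsilon
    for word_embedding in rng.normal(size=(4, embed)):   # feed 4 dummy word embeddings
        h = gru_step(h, word_embedding, params)
    print(h.shape)                           # (8,)
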
4. Experimental Evaluation

4.1. Datasets

We evaluate our technique using two popular benchmark datasets from the existing literature in video description, namely the Microsoft Video Description (MSVD) dataset [11] and the MSR-Video To Text (MSR-VTT) dataset [57]. We first give details of these datasets and the processing performed on them in this work, before discussing the experimental results.

MSVD Dataset [11]: This dataset is composed of 1,970 YouTube open domain videos that predominantly show only a single activity each. Generally, each clip spans 10 to 25 seconds. The dataset provides multilingual human annotated sentences as captions for the videos. We experiment with the captions in English. On average, 41 ground truth captions are associated with a single video. For benchmarking, we follow the common data split of 1,200 training samples, 100 samples for validation and 670 videos for testing [59, 54, 18].

MSR-VTT Dataset [57]: This recently introduced open domain video dataset contains a wide variety of videos for the captioning task. It consists of 7,180 videos that are transformed into 10,000 clips. The clips are grouped into 20 different categories. Following the common settings [57], we divide the 10,000 clips into 6,513 samples for training, 497 samples for validation and the remaining 2,990 clips for testing. Each video is described by 20 single-sentence annotations by Amazon Mechanical Turk (AMT) workers. This is one of the largest clip-sentence pair datasets available for the video captioning task, which is the main reason for choosing this dataset for benchmarking our technique.

4.2. Dataset Processing & Evaluation Metrics

We converted the captions in both datasets to lower case and removed all punctuation. All the sentences were then tokenized. We set the vocabulary size for MSVD to 9,450 and for MSR-VTT to 23,500. We employed "fasttext" [10] word embedding vectors of dimension 300. Embedding vectors of 1,615 words for MSVD and 2,524 words for MSR-VTT were not present in the pretrained set. Instead of using randomly initialized vectors or ignoring the out-of-vocabulary words entirely in the training set, we generated embedding vectors for these words using character n-grams within the word, summing the resulting vectors to produce the final vector. We performed dataset-specific fine-tuning on the pretrained word embeddings.

In order to compare our technique with the existing methods, we report results on the four most popular metrics: Bilingual Evaluation Understudy (BLEU) [35], Metric for Evaluation of Translation with Explicit Ordering (METEOR) [7], Consensus based Image Description Evaluation (CIDEr_D) [49] and Recall Oriented Understudy of Gisting Evaluation (ROUGE_L) [29]. We refer to the original works for the concrete definitions of these metrics. The subscript 'D' in CIDEr indicates the metric variant that inhibits higher values for inappropriate captions in human judgment. Similarly, the subscript 'L' indicates the variant of ROUGE that is based on recall-precision scores of the longest common sequence between the prediction and the ground truth. We used the Microsoft COCO server [12] to compute our results.

4.3. Experiments

In our experiments reported below¹, we use InceptionResnetV2 (IRV2) [46] as the 2D CNN, whereas C3D [48] is used as the 3D CNN. The last 'avg_pool' layer of the former and the 'fc6' layer of the latter are considered as the extraction layers. The 2D CNN is pre-trained on the popular ImageNet dataset [41], whereas the Sports-1M dataset [24] is used for the pre-training of C3D. To process videos, we re-size the frames to match the input dimensions of these networks. For the 3D CNN, we use 16-frame clips as inputs with an 8-frame overlap. YOLO [37] is used as the object detector in all our experiments. To train our language model, we include a start and an end token in the captions to deal with the dynamic length of different sentences. We set the maximum sentence length to 30 words in the case of experiments with the MSVD dataset, and to 50 for the MSR-VTT dataset. These length limits are based on the available captions in the datasets. We truncate a sentence if its length exceeds the set limit, and zero-pad in the case of a shorter length. We tune the hyper-parameters of our language model on the validation set. The results below use two layers of GRUs that employ 0.5 as the dropout value. We use the RMSProp algorithm with a learning rate of 2 × 10^−4 to train the models. A batch size of 60 is used for training in our experiments. We performed training of our models for 50 epochs. We used the sparse cross entropy loss to train our model. The training is conducted using an NVIDIA Titan XP 1080 GPU. We used the TensorFlow framework for development.

¹ Due to thorough evaluation, the supplementary material also contains further results. Only the best performing setting is discussed here.

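
The out-of-vocabulary handling described in Section 4.2 is essentially the fastText subword mechanism; a minimal sketch of the idea is given below. The n-gram table, the example words and the vector dimension are placeholders, and in practice the released fastText model and tools would supply them.

    import numpy as np

    def char_ngrams(word, n_min=3, n_max=6):
        """Character n-grams of a word, with boundary markers as in fastText."""
        w = "<" + word + ">"
        return [w[i:i + n] for n in range(n_min, n_max + 1)
                for i in range(len(w) - n + 1)]

    def oov_vector(word, ngram_vectors, dim=300):
        """Embedding for an out-of-vocabulary word: sum the vectors of its
        character n-grams (only those present in the pretrained n-gram table)."""
        vecs = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
        return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

    # Tiny illustrative n-gram table (a real table comes with the fastText model).
    rng = np.random.default_rng(0)
    ngram_vectors = {g: rng.normal(size=300) for g in char_ngrams("skateboarding")}
    print(oov_vector("skateboard", ngram_vectors).shape)   # (300,)
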
4.3.1 Results on MSVD dataset

We comprehensively benchmark our method against the current state-of-the-art in video captioning. We report the results of the existing methods and our approach in Table 1. For the existing techniques, recent best performing methods are chosen and their results are directly taken from the existing literature (the same evaluation protocol is ensured). The table columns present scores for the metrics BLEU-4 (B-4), METEOR (M), CIDEr_D (C) and ROUGE_L (R).

Table 1. Benchmarking on the MSVD dataset [11] in terms of BLEU-4 (B-4), METEOR (M), CIDEr_D (C) and ROUGE_L (R). See the text for the description of the proposed method GRU-EVE's variants.

Model | B-4 | M | C | R
FGM [47] | 13.7 | 23.9 | - | -
S2VT [50] | - | 29.2 | - | -
LSTM-YT [51] | 33.3 | 29.1 | - | -
Temporal-Attention (TA) [59] | 41.9 | 29.6 | 51.67 | -
h-RNN [60] | 49.9 | 32.6 | 65.8 | -
MM-VDN [56] | 37.6 | 29.0 | - | -
HRNE [32] | 43.8 | 33.1 | - | -
GRU-RCN [6] | 47.9 | 31.1 | 67.8 | -
LSTM-E [33] | 45.3 | 31.0 | - | -
SCN-LSTM [18] | 51.1 | 33.5 | 77.7 | -
DMRM [58] | 51.1 | 33.6 | 74.8 | -
LSTM-TSA [34] | 52.8 | 33.5 | 74.0 | -
TDDF [61] | 45.8 | 33.3 | 73.0 | 69.7
BAE [8] | 42.5 | 32.4 | 63.5 | -
PickNet [13] | 46.1 | 33.1 | 76.0 | 69.2
aLSTMs [19] | 50.8 | 33.3 | 74.8 | -
M3-IC [54] | 52.8 | 33.3 | - | -
RecNet_local [52] | 52.3 | 34.1 | 80.3 | 69.8
GRU-MP - (C3D) | 28.8 | 27.7 | 42.6 | 61.6
GRU-MP - (IRV2) | 41.4 | 32.3 | 68.2 | 67.6
GRU-MP - (CI) | 41.0 | 31.3 | 61.9 | 67.6
GRU-EVE_hft - (C3D) | 40.6 | 31.0 | 55.7 | 67.4
GRU-EVE_hft - (IRV2) | 45.6 | 33.7 | 74.2 | 69.8
GRU-EVE_hft - (CI) | 47.8 | 34.7 | 75.8 | 71.1
GRU-EVE_hft+sem - (CI) | 47.9 | 35.0 | 78.1 | 71.5

Table 2. Performance comparison with single 2D-CNN based methods on the MSVD dataset [11].

Model | METEOR
FGM [47] | 23.90
S2VT [50] | 29.2
LSTM-YT [51] | 29.07
TA [59] | 29.0
p-RNN [60] | 31.1
HRNE [32] | 33.1
BGRCN [6] | 31.70
MAA [17] | 31.80
RMA [23] | 31.90
LSTM-E [33] | 29.5
M3-inv3 [54] | 32.18
mGRU [62] | 33.39
GRU-EVE_hft - (IRV2) | 33.7

Table 3. Performance comparison on the MSVD dataset [11] with methods using multiple features. The scores of the existing methods are taken from [54]. V denotes VGG19, C is C3D, Iv denotes Inception-V3, G is GoogleNet and I denotes InceptionResNet-V2.

Model | METEOR
SA-G-3C [59] | 29.6
S2VT-RGB-Flow [50] | 29.8
LSTM-E-VC [33] | 31.0
p-RNN-VC [60] | 32.6
M3-IvC [54] | 33.3
GRU-EVE_hft+sem - (CI) | 35.0

The last seven rows of Table 1 report results of different variants of our method to highlight the contribution of the various components of the overall technique. GRU-MP indicates that we use our two-layer GRU model, while the common 'Mean Pooling (MP)' strategy is adopted to resolve the temporal dimension of videos. 'C3D' and 'IRV2' in the parentheses identify the networks used to compute the visual codes. We abbreviate the joint use of C3D and IRV2 as 'CI'. We use 'EVE' to denote our Enriched Visual Encoding that applies the Hierarchical Fourier Transform - indicated by the subscript 'hft' - to the activations of the network extraction layers. The proposed final technique, which also incorporates the high-level semantic information - indicated by the subscript '+sem' - is mentioned in the last row of the Table. We follow the same notational conventions for our method in the remaining Tables.

Our method achieves a strong METEOR value of 35.0, which provides a (35.0 − 34.1)/34.1 × 100 = 2.64% gain over the closest competitor. Similarly, the gain over the current state-of-the-art for ROUGE_L is 2.44%. For the other metrics, our scores remain competitive with the best performing methods. It is emphasized that our approach derives its main strength from the visual encoding part, in contrast to a sophisticated language model, which is generally the case for the existing methods. Naturally, complex language models entail a difficult and computationally expensive training process, which is not a limitation of our approach.

We illustrate representative qualitative results of our method in Fig. 3. We abbreviate our final approach as 'GRU-EVE' in the figure for brevity. The semantic details and accuracy of e.g. plurality, nouns and verbs are clearly visible in the captions generated by the proposed method. The figure also reports the captions for GRU-MP-(CI) and GRU-EVE_hft-(CI) to show the difference resulting from the hierarchical Fourier transform (hft) as compared to the Mean Pooling (MP) strategy. These captions justify the noticeable gain achieved by the proposed hft over the traditional MP in Table 1. We also observe in the table that our method categorically outperforms the mean pool based methods, i.e. LSTM-YT [51], LSTM-E [33], SCN-LSTM [18], and LSTM-TSA [34], on METEOR, CIDEr and ROUGE_L. Under these observations, we safely recommend the proposed hierarchical Fourier transformation as the substitute for 'mean pooling' in video captioning.

Figure 3. Illustration of captions generated for the MSVD test set. The final approach is abbreviated as GRU-EVE for brevity. A sentence from the ground truth captions is shown for reference.

In Table 2, we compare the variant of our method based on a single CNN with the best performing single-CNN based existing methods. The results are directly taken from [54] for the provided METEOR metric. As can be seen, our method outperforms all these methods. In Table 3, we also compare our method on METEOR with the state-of-the-art methods that necessarily use multiple visual features to obtain the best performance. A significant 5.1% gain is achieved by our method over the closest competitor in this regard.

4.3.2 Results on MSR-VTT dataset

MSR-VTT [57] is a recently released dataset. We compare the performance of our approach on this dataset with the latest published models such as Alto [42], RUC-UVA [15], TDDF [61], PickNet [13], M3-VC [54] and RecNet_local [52]. The results are summarized in Table 4. Similar to the MSVD dataset, our method significantly improves the state-of-the-art on this dataset on the METEOR and ROUGE_L metrics, while achieving strong results on the remaining metrics. These results ascertain the effectiveness of the proposed enriched visual encoding for visual captioning. We provide examples of qualitative results on this dataset in the supplementary material of the paper.

Table 4. Benchmarking on the MSR-VTT dataset [57] in terms of BLEU-4 (B-4), METEOR (M), CIDEr_D (C) and ROUGE_L (R).

Model | B-4 | M | C | R
Alto [42] | 39.8 | 26.9 | 45.7 | 59.8
RUC-UVA [15] | 38.7 | 26.9 | 45.9 | 58.7
TDDF [61] | 37.3 | 27.8 | 43.8 | 59.2
PickNet [13] | 38.9 | 27.2 | 42.1 | 59.5
M3-VC [54] | 38.1 | 26.6 | - | -
RecNet_local [52] | 39.1 | 26.6 | 42.7 | 59.3
GRU-EVE_hft - (IRV2) | 32.9 | 26.4 | 39.2 | 57.2
GRU-EVE_hft - (CI) | 36.1 | 27.7 | 45.2 | 59.9
GRU-EVE_hft+sem - (CI) | 38.3 | 28.4 | 48.1 | 60.7

5. Discussion

We conducted a thorough empirical evaluation of the proposed method to explore its different aspects. Below we discuss and highlight a few of these aspects. Where necessary, we also provide results in the supplementary material of the paper to back the discussion.

For the settings discussed in the previous section, we generally observed semantically rich captions generated by the proposed approach. In particular, these captions well captured the plurality of objects and their motions/actions. Moreover, the captions generally described the whole videos instead of partial clips. Instead of only two, we also tested different numbers of GRU layers, and observed that increasing the number of GRU layers deteriorated the BLEU-4 score. However, there were improvements in all the remaining metrics. We retained only two GRU layers in the final method mainly for computational gains. Moreover, we also tested different architectures of the GRU, e.g. with state sizes 512, 1024, 2048 and 4096. We observed a trend of performance improvement until 2048 states. However, further states did not improve the performance. Hence, 2048 states were finally used in the results reported in the previous section.

Whereas all the components of the proposed technique contributed to the overall final performance, the biggest revelation of our work is the use of the hierarchical Fourier Transform to capture the temporal dynamics of videos. As compared to the 'nearly standard' mean pooling operation performed in the existing captioning pipeline, the proposed use of the Fourier Transform promises a significant performance gain for any method. Hence, we safely recommend replacing the mean pooling operation with our transformation in future techniques.

6. Conclusion

We presented a novel technique for visual encoding of videos to generate semantically rich captions. Besides capitalizing on the representation power of CNNs, our method explicitly accounts for the spatio-temporal dynamics of the scene and the high-level semantic concepts encountered in the video. We apply the Short Fourier Transform to 2D and 3D CNN features of the videos in a hierarchical manner, and account for the high-level semantics by processing output layer features of an Object Detector and the 3D CNN. Our enriched visual representation is used to learn a relatively simple GRU based language model that performs on par with or better than the existing video description methods on the popular MSVD and MSR-VTT datasets.

Acknowledgment: This research was supported by ARC Discovery Grant DP160101458 and partially by DP190102443. The Titan XP GPU used in our experiments was donated by NVIDIA Corporation.

References

[1] N. Aafaq, A. Mian, W. Liu, S. Z. Gilani, and M. Shah. Video description: A survey of methods, datasets and evaluation metrics. arXiv preprint arXiv:1806.00186, 2018.
[2] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien. Unsupervised learning from narrated instruction videos. In IEEE CVPR, 2016.
[3] B. Andrei, E. Georgios, H. Daniel, M. Krystian, N. Siddharth, X. Caiming, and Z. Yibiao. A Workshop on Language and Vision at CVPR 2015.
[4] B. Andrei, M. Tao, N. Siddharth, Z. Quanshi, S. Nishant, L. Jiebo, and S. Rahul. A Workshop on Language and Vision at CVPR 2018. http://languageandvision.com/.
[5] R. Anna, T. Atousa, R. Marcus, P. Christopher, L. Hugo, C. Aaron, and S. Bernt. The Joint Video and Language Understanding Workshop at ICCV 2015.
[6] N. Ballas, L. Yao, C. Pal, and A. Courville. Delving deeper into convolutional networks for learning video representations. In ICLR, 2016.
[7] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72, 2005.
[8] L. Baraldi, C. Grana, and R. Cucchiara. Hierarchical boundary-aware neural encoder for video captioning. In IEEE CVPR, 2017.
[9] A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, et al. Video in sentences out. In UAI, 2012.
[10] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. In TACL, pages 135-146, 2017.
[11] D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL: Human Language Technologies - Volume 1, pages 190-200. ACL, 2011.
[12] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[13] Y. Chen, S. Wang, W. Zhang, and Q. Huang. Less is more: Picking informative frames for video captioning. In ECCV, 2018.
[14] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724-1734, 2014.
[15] J. Dong, X. Li, W. Lan, Y. Huo, and C. G. Snoek. Early embedding and late reranking for video captioning. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1082-1086. ACM, 2016.
[16] J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179-211, 1990.
[17] R. Fakoor, A.-r. Mohamed, M. Mitchell, S. B. Kang, and P. Kohli. Memory-augmented attention modelling for videos. arXiv preprint arXiv:1611.02261, 2016.
[18] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic Compositional Networks for visual captioning. In IEEE CVPR, 2017.
[19] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen. Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia, 19(9):2045-2055, 2017.
[20] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2712-2719, 2013.
[21] P. Hanckmann, K. Schutte, and G. J. Burghouts. Automated textual descriptions for a wide range of video events with 48 human actions. In ECCV, pages 372-380, 2012.
[22] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[23] A. K. Jain, A. Agarwalla, K. K. Agrawal, and P. Mitra. Recurrent memory addressing for describing videos. In CVPR Workshops, 2017.
[24] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725-1732, 2014.
[25] M. U. G. Khan, L. Zhang, and Y. Gotoh. Human focused video description. In IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011.
[26] A. Kojima, T. Tamura, and K. Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. IJCV, 50(2):171-184, 2002.
[27] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. In AAAI, volume 1, page 2, 2013.
[28] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
[29] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8. Barcelona, Spain, 2004.
[30] M. Margaret, M. Ishan, H. Ting-Hao, and F. Frank. Story Telling Workshop and Visual Story Telling Challenge at NAACL 2018.
[31] A. V. Oppenheim. Discrete-Time Signal Processing. Pearson Education India, 1999.
[32] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. Hierarchical recurrent neural encoder for video representation with application to captioning. In IEEE CVPR, pages 1029-1038, 2016.
[33] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4594-4602, 2016.
[34] Y. Pan, T. Yao, H. Li, and T. Mei. Video captioning with transferred semantic attributes. In IEEE CVPR, 2017.
[35] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on ACL, pages 311-318, 2002.
[36] H. Rahmani and A. Mian. 3D action recognition from novel viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1506-1515, 2016.
[37] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. In IEEE CVPR, 2017.
[38] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. In German Conference on Pattern Recognition, 2014.
[39] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele. Movie description. IJCV, 123(1):94-120, 2017.
[40] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 433-440, 2013.
[41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[42] R. Shetty and J. Laaksonen. Frame- and segment-level features and candidate pool evaluation for video caption generation. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1073-1076. ACM, 2016.
[43] A. Shin, K. Ohnishi, and T. Harada. Beyond caption to narrative: Video captioning with multiple sentences. In IEEE International Conference on Image Processing (ICIP), 2016.
[44] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[45] J. Song, L. Gao, L. Liu, X. Zhu, and N. Sebe. Quantization-based hashing: a general framework for scalable image and video retrieval. Pattern Recognition, 75:175-187, 2018.
[46] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
[47] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In COLING, volume 2, page 9, 2014.
[48] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489-4497, 2015.
[49] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In IEEE CVPR, 2015.
[50] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision, pages 4534-4542, 2015.
[51] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL, pages 1494-1504, 2015.
[52] B. Wang, L. Ma, W. Zhang, and W. Liu. Reconstruction network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7622-7631, 2018.
[53] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Learning actionlet ensemble for 3D human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):914-927, 2014.
[54] J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan. M3: Multimodal memory modelling for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7512-7520, 2018.
[55] J. Wang, T. Zhang, N. Sebe, H. T. Shen, et al. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):769-790, 2018.
[56] H. Xu, S. Venugopalan, V. Ramanishka, M. Rohrbach, and K. Saenko. A multi-scale multiple instance video description network. A Workshop on Closing the Loop Between Vision and Language at ICCV 2015.
[57] J. Xu, T. Mei, T. Yao, and Y. Rui. MSR-VTT: A large video description dataset for bridging video and language. In IEEE CVPR, 2016.
[58] Z. Yang, Y. Han, and Z. Wang. Catching the temporal regions-of-interest for video captioning. In 25th ACM Multimedia, pages 146-153, 2017.
[59] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision, pages 4507-4515, 2015.
[60] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. In IEEE CVPR, 2016.
[61] X. Zhang, K. Gao, Y. Zhang, D. Zhang, J. Li, and Q. Tian. Task-driven dynamic fusion: Reducing ambiguity in video description. In IEEE CVPR, 2017.
[62] L. Zhu, Z. Xu, and Y. Yang. Bidirectional multirate reconstruction for temporal modeling in videos. In IEEE CVPR, pages 2653-2662, 2017.