
Video Representation Learning by Dense Predictive Coding

Tengda Han Weidi Xie Andrew Zisserman


Visual Geometry Group, Department of Engineering Science, University of Oxford
{htd, weidi, az}@robots.ox.ac.uk
arXiv:1909.04656v3 [cs.CV] 27 Sep 2019

Figure 1: Nearest Neighbour (NN) video clip retrieval on UCF101. Each row contains four video clips: a query clip and the top three retrievals using clip embeddings. To get the embedding, each video is passed to a 3D-ResNet18 and average pooled to a single vector, and cosine similarity is used for retrieval. (a) Embeddings obtained by Dense Predictive Coding (DPC); (b) embeddings obtained by using the inflated ImageNet pretrained weights. DPC captures the semantics of the human action, rather than the scene appearance or layout captured by the ImageNet trained embeddings. In the DPC retrievals the actual appearances of frames can vary dramatically, e.g. the change in camera viewpoint for the climbing case.

Abstract

The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition.

We make three contributions: First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos. This learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations. Second, we propose a curriculum training scheme to predict further into the future with progressively less temporal context. This encourages the model to only encode slowly varying spatio-temporal signals, therefore leading to semantic representations. Third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then finetuning the representation on a downstream task, i.e. action recognition. With a single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 (75.7% top1 acc) and HMDB51 (35.7% top1 acc), outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet. The code is available at https://github.com/TengdaHan/DPC.

1. Introduction

Videos are very appealing as a data source for self-supervision: there is an almost infinite supply available (from YouTube etc.); image-level proxy losses can be used at the frame level; and there are plenty of additional proxy losses that can be employed from the temporal information. One of the most natural, and consequently one of the first, video proxy losses is to predict future frames in the video based on frames in the past. This has ample scope for exploration by varying the extent of the past knowledge (the temporal aggregation window used for the prediction) and also the temporal distance into the future for the predicted frames. However, future frame prediction does have a serious disadvantage – the future is not deterministic – so methods may have to consider multiple hypotheses with multiple instance losses, or other distributions and losses over their predictions.
Previous approaches to future frame prediction in video [23, 24, 36, 42, 41] can roughly be divided into two types: those that predict a reconstruction of the actual frames [23, 24, 36, 42]; and those that only predict the latent representation (the embedding) of the frames [41]. If our goal of self-supervision is only to learn a representation that allows generalization for downstream discriminative tasks, e.g. action recognition in video, then it may not be necessary to waste model capacity on resolving the stochasticity of frame appearance in detail, e.g. appearance changes due to shadows, illumination changes, camera motion, etc. Approaches that only predict the frame embedding, such as Vondrick et al. [41], avoid this potentially unnecessary task of detailed reconstruction, and use a mixture model to resolve the uncertainty in future prediction. Although not applied to videos (but rather to speech signals and images), the Contrastive Predictive Coding (CPC) model of Oord et al. [40] also learns embeddings, in their case by using a multi-way classification over temporal audio frames (or image patches), rather than the regression loss of [41].

In this paper we propose a new idea for learning spatio-temporal video embeddings, which we term "Dense Predictive Coding" (DPC). The model is designed to predict future representations based on the recent past [47]. It is inspired by the CPC [40] framework, and more generally by previous research on learning word embeddings [25, 26, 28]. DPC is also trained using a variant of noise contrastive estimation [9]; therefore, in practice, the model is never optimized to predict the exact future, it is only asked to solve a multiple choice question, i.e. pick the correct future states from lots of distractors. In order to succeed in this task, the model only needs to learn the shared semantics of the multiple possible future states, and this common/shared representation is the kind of invariance required in many vision tasks, e.g. action recognition in videos. In other words, the optimization objective will actually benefit from the fact that the future is not deterministic, and map the representations of all possible future states to a space where their embeddings are close. Concurrent work [2] applies a similar method to reinforcement learning.

The contributions of this paper are three-fold: First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos, where we task the model to predict the future embeddings of the spatio-temporal blocks recurrently (as used in N-gram prediction). The model is trained to pick the "correct" future states from a pool of distractors, and the task is therefore treated as a multi-way classification problem. Second, we propose a curriculum training scheme that enables the model to gradually predict further in the future (up to 2 seconds) with progressively less temporal context, leading to more challenging training samples, and preventing the model from using shortcuts such as optical flow. Third, we evaluate the approach by first training the DPC model on the Kinetics-400 [16] dataset using self-supervised learning, and then fine-tuning on action recognition benchmarks. Our DPC model achieves state-of-the-art self-supervised performance on both UCF101 (75.7% top1 acc) and HMDB51 (35.7% top1 acc), outperforming all previous single-stream (RGB only) self-supervised learning methods by a significant margin.

2. Related Work

Self-supervised learning from images. In recent years, methods for self-supervised learning on images have achieved an impressive performance in learning high-level image representations. Inspired by the variants of Word2vec [3, 25, 26] that rely on predicting words from their context, Doersch et al. [5] proposed the pretext task of predicting the relative location of image patches. This work spawned a line of work in context-based self-supervised visual representation learning methods, e.g. in [29]. In contrast to the context-based idea, another set of pretext tasks includes carefully designed image-level classification, such as rotation [8] or pseudo-labels from clustering [4]. Another class of pretext tasks targets dense predictions, e.g. image inpainting [32], image colorization [48], and motion segmentation prediction [31]. Other methods instead enforce structural constraints on the representation space [30].

Self-supervised learning from videos. Other than the predictive tasks reviewed in the introduction, another class of proxy tasks is based on temporal sequence ordering of the frames [27, 7, 46]. [12, 14, 44] use temporal coherence as a proxy loss. Other approaches use egomotion [1, 13] to enforce equivariance in feature space [13]. In contrast, [15] predicts the transformation applied to a spatio-temporal block. In [17], the authors propose to use a 3D puzzle as the proxy loss. Recently, [43, 21, 45] leveraged the natural temporal coherency of color in videos to train a network for tracking and correspondence related tasks.

Action recognition with two-stream architectures. Recently, the two-stream architecture [33] has been a foundation for many competitive methods. The authors show that optical flow is a powerful representation that improves action recognition dramatically. Other modalities like the audio signal can also benefit visual representation learning [19]. In this paper, we deliberately avoid using any information from optical flow or audio, and aim to probe the upper bound of self-supervised learning with only RGB streams. We leave it as future work to explore how much boost optical flow and audio branches can bring to our self-supervised learning architecture.

3. Dense Predictive Coding (DPC)

In this section, we describe the learning framework, details of the architecture, and the curriculum training that gradually learns to predict further into the future with progressively less temporal context.

Figure 2: A diagram of the Dense Predictive Coding method. The left part is the pipeline of the DPC, which is explained in Sec. 3.1. The right part (in the dashed rectangle) is an illustration of the Pred-GT pair construction (positive, spatial negative and temporal negative pairs over the video segments) for the contrastive loss, which is explained in Sec. 3.2.
3.1. Learning Framework

The goal of DPC is to predict a slowly varying semantic representation based on the recent past, e.g. we construct a prediction task that observes about 2.5 seconds of the video and predicts the embedding for the future 1.5 seconds, as illustrated in Figure 2. A video clip is partitioned into multiple non-overlapping blocks x_1, x_2, ..., x_n, with each block containing an equal number of frames. First, a non-linear encoder function f(.) maps each input video block x_t to its latent representation z_t, then an aggregation function g(.) temporally aggregates t consecutive latent representations into a context representation c_t:

    z_t = f(x_t)                      (1)
    c_t = g(z_1, z_2, ..., z_t)       (2)

where x_t has dimension R^{T×H×W×C}, and z_t is a feature map with dimension R^{1×H'×W'×D}, organized as time × height × width × channels.¹

The intuition behind the predictive task is that if one can infer future semantics from c_t, then the context representation c_t and the latent representations z_1, z_2, ..., z_t must have encoded strong semantics of the input video clip. Thus, we introduce a predictive function φ(.) to predict the future. In detail, φ(.) takes the context representation as input and predicts the future clip representation:

    ẑ_{t+1} = φ(c_t)     = φ(g(z_1, z_2, ..., z_t))              (3)
    ẑ_{t+2} = φ(c_{t+1}) = φ(g(z_1, z_2, ..., z_t, ẑ_{t+1}))     (4)

where c_t denotes the context representation from time step 1 to t, and ẑ_{t+1} denotes the predicted latent representation of time step t+1. In the spirit of Seq2seq [37], representations are predicted in a sequential manner. We predict q steps into the future; at each time step t, the model consumes the previously generated embedding (ẑ_{t−1}) as input when generating the next (ẑ_t), further enforcing the prediction to be conditioned on all previous observations and predictions, and therefore encouraging an N-gram like video representation.

¹ In our initial experiments, x_t ∈ R^{5×128×128×3}, z_t ∈ R^{1×4×4×256}.
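For concreteness, the following is a minimal PyTorch-style sketch of the forward pass in Eq. (1)-(4); the interfaces of f, g and φ (here f, g, phi) are our own assumptions for illustration, not the released implementation.

    import torch

    def dpc_forward(blocks, f, g, phi, pred_steps=3):
        # blocks: (B, N, C, T, H, W) video blocks; the first N - pred_steps blocks are observed.
        # f encodes one block to a (B, D, H', W') feature map; g(feature_map, hidden) is an
        # RNN-style aggregator returning the new hidden state; phi predicts the next feature map.
        n_blocks = blocks.shape[1]
        n_obs = n_blocks - pred_steps
        z = torch.stack([f(blocks[:, t]) for t in range(n_blocks)], dim=1)  # Eq. (1): dense features z_t
        hidden = None
        for t in range(n_obs):                                              # Eq. (2): aggregate context c_t
            hidden = g(z[:, t], hidden)
        preds = []
        for _ in range(pred_steps):                                         # Eq. (3)-(4): sequential prediction
            z_hat = phi(hidden)
            preds.append(z_hat)
            hidden = g(z_hat, hidden)                                       # feed the prediction back in
        return torch.stack(preds, dim=1), z[:, n_obs:]                      # predicted ẑ and ground-truth z

In the 5pred3 setting, there are 8 blocks in total, n_obs = 5 and pred_steps = 3.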
3.2. Contrastive Loss

Noise Contrastive Estimation (NCE) [9] constructs a binary classification task: a classifier is fed with real samples and noise samples, and the objective is to distinguish them. A variant of NCE [28, 40] classifies one real sample among many noise samples. Similar to [28, 40], we use a loss based on NCE for the predictive task. NCE over feature embeddings encourages the predicted representation ẑ to be close to the ground truth representation z, but not so strictly that it has to resolve the low-level stochasticity.

In the forward pass, the ground truth representation z and the predicted representation ẑ are computed. The representation for the i-th time step is denoted as z_i and ẑ_i, which have the same dimensions. Note that, instead of pooling into a feature vector, both z_i and ẑ_i are kept as feature maps (z_i, ẑ_i ∈ R^{H'×W'×D}), which maintains the spatial layout of the representation. We denote the feature vector at each spatial location of the feature map as z_{i,k} ∈ R^D and ẑ_{i,k} ∈ R^D, where i denotes the temporal index and k the spatial index, k ∈ {(1,1), (1,2), ..., (H', W')}. The similarity of a predicted and ground-truth pair (Pred-GT pair) is computed by the dot product ẑ_{i,k}ᵀ z_{j,m}. The objective is to optimize:

    L = − Σ_{i,k} log [ exp(ẑ_{i,k} · z_{i,k}) / Σ_{j,m} exp(ẑ_{i,k} · z_{j,m}) ]     (5)

In essence, this is simply a cross-entropy loss (negative log-likelihood) that distinguishes the positive Pred-GT pair from all other negative pairs.

For a predicted feature vector ẑ_{i,k}, the only positive pair is (ẑ_{i,k}, z_{i,k}), i.e. the predicted and ground-truth features at the same time step and the same spatial location. All the other pairs (ẑ_{i,k}, z_{j,m}) with (i,k) ≠ (j,m) are negative pairs. The loss encourages the positive pair to have a higher similarity than any negative pair. If the network is trained with a mini-batch consisting of B video clips, and each of the B clips comes from a distinct video, more negative pairs can be obtained.
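A minimal sketch of Eq. (5) over the dense feature maps of a single clip is given below (the helper name dense_nce_loss is ours); extending the denominator to the other clips in the mini-batch, i.e. the easy negatives defined next, only requires concatenating their ground-truth vectors as extra columns.

    import torch
    import torch.nn.functional as F

    def dense_nce_loss(z_hat, z):
        # z_hat, z: (steps, Hp, Wp, D) predicted and ground-truth feature maps of one clip,
        # e.g. (3, 4, 4, 256) in the 5pred3 setting.
        D = z.shape[-1]
        pred = z_hat.reshape(-1, D)                                # rows are the vectors ẑ_{i,k}
        gt = z.reshape(-1, D)                                      # rows are the vectors z_{j,m}
        logits = pred @ gt.t()                                     # dot products ẑ_{i,k}ᵀ z_{j,m}
        target = torch.arange(pred.size(0), device=logits.device)  # positive: same i and same k
        return F.cross_entropy(logits, target)                     # Eq. (5), averaged over positives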
To discriminate the different types of negative pairs, given a Pred-GT pair (ẑ_{i,k}, z_{j,m}), we define the terminology as follows:

Easy negatives: Pred-GT pairs that are formed from two distinct videos. These pairs are naturally easy because the two videos usually have distinct color distributions, and thus the predicted feature and the ground-truth feature have low similarity.

Spatial negatives: Pred-GT pairs that are formed from the same video but at a different spatial position in the feature map, i.e. k ≠ m, while i, j can be any index.

Temporal negatives (hard negatives): Pred-GT pairs that come from the same video and the same spatial position, but from different time steps, i.e. k = m, i ≠ j. They are the hardest pairs to classify because their scores will be very close to those of the positive pairs.

Overall, we use a similar idea to Multi-batch training [38]. If the mini-batch has batch size B, the feature map has spatial dimension H′ × W′ and the task is to classify one of q time steps, the number of pairs in each class follows:

    Pos : N_temporal : N_spatial : N_easy = 1 : (q − 1) : (H′W′ − 1)q : (B − 1)H′W′q
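As a purely illustrative calculation (assuming negatives are drawn within one GPU's mini-batch of B = 64 clips, with H′ = W′ = 4 and q = 3 as in the 5pred3 setting), each positive is then contrasted against (q − 1) = 2 temporal negatives, (H′W′ − 1)q = 15 × 3 = 45 spatial negatives and (B − 1)H′W′q = 63 × 16 × 3 = 3024 easy negatives.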
Curriculum learning strategy. A curriculum learning strategy is designed by progressively increasing the number of prediction steps of the model (Sec. 4.1.4). For instance, the training process can start by predicting only 2 steps (about 1 second), i.e. only computing ẑ_{t+1} and ẑ_{t+2}, and the Pred-GT pairs are constructed between {z_{t+1}, z_{t+2}} and {ẑ_{t+1}, ẑ_{t+2}}. After the network has learnt this simple task, it can be trained to predict 3 steps (about 1.5 seconds), e.g. computing ẑ_{t+1}, ẑ_{t+2} and ẑ_{t+3} and constructing the Pred-GT pairs accordingly. Importantly, curriculum learning introduces more hard negatives throughout the training process, and forces the model to gradually learn to predict further into the future with progressively less temporal context. Meanwhile, the model is gradually trained to grasp the uncertain nature of its prediction.
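A sketch of how such a schedule could be driven, under our own assumption of a train_one_stage helper that runs one self-supervised stage to convergence (the model weights simply persist across stages):

    def run_curriculum(model, train_one_stage):
        # With 8 blocks per sample, predicting more steps also means observing fewer,
        # i.e. progressively less temporal context (cf. 5pred3 -> 4pred4 in Sec. 4.1.4).
        schedule = [(6, 2), (5, 3), (4, 4)]   # (observed blocks, predicted steps)
        for observe, pred_steps in schedule:
            train_one_stage(model, observe=observe, pred_steps=pred_steps)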
3.3. Avoiding Shortcuts and Learning Semantics

Empirical experience in self-supervised learning indicates that if the proxy task is well-designed and requires semantic understanding, a more difficult learning task usually leads to a better-quality representation [22]. However, ConvNets are notoriously known for learning shortcuts when tackling tasks [5, 29, 46]. In our training, we employ a number of mechanisms to avoid potential shortcuts, as detailed next.

Disrupting optical flow. A trivial solution of our predictive task is that f(.), g(.) and φ(.) together learn to capture low-level optical flow information and perform feature extrapolation as the prediction. To force the model to learn high-level semantics, a critical operation is frame-wise augmentation, i.e. random augmentation for each individual frame in the video blocks, such as frame-wise color jittering with random brightness, contrast, saturation, hue and random greyscale during training. Furthermore, the curriculum of predicting further into the future, i.e. predicting the semantics for the next few seconds, also ensures that optical flow alone will not be able to solve the prediction task.
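A minimal sketch of this frame-wise augmentation with recent torchvision is shown below; the clip-consistent crop/flip versus per-frame color jitter split follows Sec. 3.5, but the jitter magnitudes and the greyscale probability are illustrative choices, not the paper's settings.

    import torch
    from torchvision import transforms

    # Photometric augmentation applied independently to every frame, so that low-level
    # color and brightness cues are not temporally consistent across a block.
    frame_jitter = transforms.Compose([
        transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.25),  # illustrative
        transforms.RandomGrayscale(p=0.5),                                               # illustrative
    ])

    def augment_clip(frames):
        # frames: (N, C, H, W) float tensor in [0, 1] for one clip; crop and horizontal flip
        # (not shown) would be applied consistently to the whole clip, the jitter frame-wise.
        return torch.stack([frame_jitter(f) for f in frames])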
Temporal receptive field. The temporal receptive field (RF) of f(.) is limited by cutting the input video clip into non-overlapping blocks before feeding it into f(.). Thus, the effective temporal RF of each feature map z_i is strictly restricted to within each video block. This prevents the network from being able to discriminate positives from hard negatives by recognizing relative temporal position.

Spatial receptive field. Due to the depth of the CNN, each feature vector ẑ_{i,k} in the final predicted feature map ẑ_i has a large spatial RF that (almost) covers the entire input spatial dimension. This creates a shortcut for discriminating positives from spatial negatives by using padding patterns. One can limit the spatial RF by cutting the input frames into patches [40, 17]. However, this brings some drawbacks: First, the self-supervised pre-trained network will have a limited receptive field (RF), so the representation may not generalize well to downstream tasks where a large RF is required. Second, limiting the spatial RF in videos makes the context feature too weak: the context feature then has a spatio-temporal RF that covers only a thin cube of the video volume. Neglecting context is also not ideal for understanding video semantics and brings ambiguity to the predictive task. Considering this trade-off, our method does not restrict the spatial RF.

Batch normalization. Common practice uses Batch Normalization [11] (BN) in deep CNN architectures. The BN layer may provide a shortcut in which the network exploits the statistical distribution of the mini-batch, which benefits the classification. In [40], the authors demonstrate that BN results in network cheating, and the ResNet trained with BN does not generalize to the downstream image classification task. In our method, we find the effect of the BN shortcut is very limited. The self-supervised training gives similar accuracy using either BN or Instance Normalization [39] (IN). For downstream tasks like classification, a network with BN gives a 5%-10% accuracy gain compared with a network with IN. It is hard to train a deep CNN without normalization, for either self-supervised or supervised training. Overall, we use BN in our encoder function f(.).

3.4. Network Architecture

We choose to use a 3D-ResNet similar to [10] as the encoder f(.). Following the convention of [6], there are four residual blocks in the ResNet architecture, namely res2, res3, res4 and res5, and we only expand the convolutional kernels in res4 and res5 to be 3D ones. For the experimental analysis, we use 3D-ResNet18, denoted as R-18 below.

To train a strong encoder f(.), a weak aggregation function g(.) is preferable. Specifically, a one-layer Convolutional Gated Recurrent Unit (ConvGRU) with kernel size (1, 1) is used, which shares its weights amongst all spatial positions in the feature map. This design allows the aggregation function to propagate features along the temporal axis. A dropout [35] with p = 0.1 is used when computing the hidden state at each time step. A shallow two-layer perceptron is used as the predictive function φ(.).
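With (1, 1) kernels, a ConvGRU cell reduces to GRU gating applied at every spatial position with shared weights; the following is a simplified sketch we provide for exposition (the dropout on the hidden state is omitted), not the authors' implementation.

    import torch
    import torch.nn as nn

    class ConvGRUCell1x1(nn.Module):
        # GRU gating with 1x1 convolutions: weights are shared across all spatial positions,
        # so information is only propagated along the temporal axis.
        def __init__(self, channels):
            super().__init__()
            self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)  # update and reset gates
            self.cand = nn.Conv2d(2 * channels, channels, kernel_size=1)       # candidate hidden state

        def forward(self, x, h=None):
            # x, h: (B, C, H', W') current feature map and previous hidden state
            if h is None:
                h = torch.zeros_like(x)
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * h_tilde                                    # new hidden state

In DPC, x and h would be the (B, 256, 4, 4) feature maps produced by the encoder.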
3.5. Self-Supervised Training

For data pre-processing, we use 30 fps videos with uniform temporal downsampling by a factor of 3, i.e. we take one frame out of every 3 frames. These consecutive frames are grouped into 8 video blocks, where each block consists of 5 frames. Frames are sampled in a consecutive way with a consistent temporal stride to preserve the temporal regularity, because a random temporal stride introduces uncertainties into the predictive task, especially when the network needs to distinguish the differences among time steps. Specifically, each video block spans 0.5s and the entire 8 segments span 4s of the raw video. The predictive task is initially designed to observe the first 5 blocks and predict the remaining 3 blocks (denoted '5pred3' afterwards), i.e. observing 2.5 seconds to predict the following 1.5 seconds. We also experiment with different predictive configurations such as 4pred4 in Sec. 4.1.4.
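The implied frame indexing is simple; the sketch below builds the index grid for one training sample under these settings (30 fps source, temporal stride 3, 8 blocks of 5 frames). The random start offset is our assumption.

    import numpy as np

    def sample_block_indices(num_video_frames, num_blocks=8, frames_per_block=5, stride=3, rng=np.random):
        # Returns a (num_blocks, frames_per_block) grid of frame indices: 8 blocks x 5 frames,
        # taking every 3rd frame of a 30 fps video, i.e. 0.5 s per block and 4 s in total.
        span = num_blocks * frames_per_block * stride          # 120 raw frames = 4 s at 30 fps
        assert num_video_frames >= span, "video too short for one sample"
        start = rng.randint(0, num_video_frames - span + 1)    # random temporal crop (assumed)
        idx = start + stride * np.arange(num_blocks * frames_per_block)
        return idx.reshape(num_blocks, frames_per_block)

In the 5pred3 setting, the first 5 rows would be the observed blocks and the last 3 rows the prediction targets.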
For data augmentation, we apply random crop, random horizontal flip, random grey, and color jittering. Note that the random crop and random horizontal flip are applied to the entire clip in a consistent way. Random grey and color jittering are applied in a frame-wise manner to prevent the network from learning low-level flow information, as mentioned above (in Sec. 3.3), e.g. each video block may contain both colored and grey-scale images with different contrast. All models are trained end-to-end using the Adam [18] optimizer with an initial learning rate of 10^-3 and weight decay of 10^-5. The learning rate is decayed to 10^-4 when the validation loss plateaus. A batch size of 64 samples per GPU is used, and our experiments use 4 GPUs.
4. Experiments and Analysis

In the following sections we present controlled experiments, and aim to investigate four aspects: First, an ablation study on the DPC model to show the effect of the different design choices, e.g. sequential prediction and dense prediction. Second, the benefits of training on a larger and more diverse dataset. Third, the correlation between performance on self-supervised learning and performance on the downstream supervised learning task. Fourth, the variation in the learnt representations when predicting further into the future.

Datasets. The DPC is a general self-supervised learning framework for any video type, but we focus here on human action videos, e.g. the UCF101 [34], HMDB51 [20] and Kinetics-400 [16] datasets. UCF101 contains 13K videos spanning 101 human action classes. HMDB51 contains 7K videos from 51 human action classes. Kinetics-400 (K400) is a large video dataset containing 306K video clips of 400 human action classes.

Evaluation methodology. The self-supervised model is trained either on UCF101 or on K400. The representation is evaluated by its performance on a downstream task, i.e. action classification on UCF101 and HMDB51. For all the experiments below, we report the top1 accuracy for self-supervised learning in the middle column of each table, and the top1 accuracy of supervised action classification on UCF101 in the rightmost column. For self-supervised learning, the top1 accuracy refers to how often the multi-way classifier picks the right Pred-GT pair, i.e. it is not related to any action classes. For supervised learning, the top1 accuracy indicates the action classification accuracy on UCF101. Note, we report results on the first training/testing split of UCF101 and HMDB51 in all the experiments, apart from the comparison with the state of the art in Table 4, where we report the average accuracy over three splits.

Action classifier. During supervised learning, 5 video blocks are passed as input (the same as for self-supervised training, i.e. each block is of R^{5×128×128×3}), and encoded as a sequence of feature maps with the encoding function f(.) (a 3D-ResNet). As in the self-supervised architecture, the aggregation function g(.) (a ConvGRU) aggregates the feature maps over time and produces a context feature. The context feature is further passed through a spatial pooling layer followed by a fully-connected layer and a multi-way softmax for action classification. The classifier is trained using the Adam [18] optimizer with an initial learning rate of 10^-3 and weight decay of 10^-3. The learning rate is decayed twice, to 10^-4 and 10^-5. Note that the entire network is trained end-to-end. The details of the architecture are given in Appendix A.

During inference, video clips from the validation set are densely sampled from the input video and cut into blocks (R^{5×128×128×3}) with half-length overlap. Augmentations are removed and only a center crop is used. The softmax probabilities are averaged to give the final classification result.
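As a small sketch of this inference scheme (our reading of the half-length overlap; the classifier callable stands in for the network described above):

    import torch

    def classify_video(blocks, classifier, window=5):
        # blocks: (N, C, T, H, W) center-cropped blocks of one validation video;
        # classifier maps a (1, window, C, T, H, W) stack of blocks to class logits.
        probs = []
        stride = max(1, window // 2)                           # half-length overlap between windows
        for start in range(0, max(1, blocks.shape[0] - window + 1), stride):
            clip = blocks[start:start + window].unsqueeze(0)
            probs.append(classifier(clip).softmax(dim=-1))
        return torch.stack(probs).mean(dim=0)                  # average the softmax probabilities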

4.1. Performance Analysis

4.1.1 Ablation Study on Architecture

In this section, we present an ablation study by gradually removing components from the DPC model (see Table 1). For efficiency, all the self-supervised learning experiments refer to the 5pred3 setting, i.e. 5 video blocks (2.5 seconds) are used as input to predict the future 3 steps (1.5 seconds).

Network | setting | method | Self-Sup. (UCF) top1 acc | Sup. (UCF) top1 acc
R-18 | - | - (rand. init.) | - | 46.5
R-18 | 5pred3 | DPC | 53.6 | 60.6
R-18 | 5pred3 | remove Seq. | 51.3 | 56.9
R-18 | 5pred3 | remove Map | 36.5 | 44.9

Table 1: Ablation study of DPC. 'remove Seq.' means removing the sequential prediction mechanism in DPC and replacing it by parallel prediction. 'remove Map' means removing the dense feature map design in DPC and using a feature vector instead. Self-supervised tasks are trained on UCF101 using the 5pred3 setting. The representation learned from each self-supervised task is evaluated by training a supervised action classifier on UCF101.

Compared with the baseline model trained with random initialization and fully supervised learning, our DPC model pre-trained with self-supervised learning gives a significant boost (top1 acc: 46.5% vs. 60.6%). When removing the sequential prediction, i.e. all 3 future steps are predicted in parallel with three different fully-connected layers, the accuracy for both self-supervised learning and supervised learning starts to drop. Lastly, when we further replace the dense feature map by an average-pooled feature vector, i.e. it becomes a CPC-like model, we are not able to train this model either on the self-supervised learning task or on supervised learning. This demonstrates that dense predictive coding is essential to our success, and sequential prediction also helps to boost the model performance.

4.1.2 Benefits of Large Datasets

In this section, we investigate the benefits of pre-training on a large-scale dataset (UCF101 vs. K400); we keep the 5pred3 setting and evaluate the effectiveness on the downstream task on UCF101. Results are shown in Table 2.

Network | setting | dataset | Self-Sup. top1 acc | Sup. (UCF) top1 acc
R-18 | 5pred3 | UCF101 | 53.6 | 60.6
R-18 | 5pred3 | K400 | 61.1 | 65.9

Table 2: Results of DPC trained on UCF101 and K400 respectively. Both experiments use the 5pred3 setting. Representations are evaluated by training a supervised action classifier on UCF101 (right column).

Training the model on K400 increases the self-supervised accuracy to 61.1%, and the supervised accuracy from 60.6% to 65.9%, suggesting the model has captured more regularities than from a smaller dataset like UCF101. It is clear that DPC will benefit from large-scale video datasets (an infinite supply is available), which naturally provide more diverse negative Pred-GT pairs.

4.1.3 Self-Supervised vs. Classification Accuracy

In this section, we investigate the correlation between the accuracy of self-supervised learning and downstream supervised learning. While training DPC (5pred3 task on K400), we evaluate the representation at different training stages (numbers of epochs) on the downstream task (on UCF101). The results are shown in Figure 3.

[Figure 3: scatter plot of classification top1 acc (%) on UCF101 (y-axis, data labels 61.2, 62.8, 64.2, 65.9) against self-supervised top1 acc (%) on K400 (x-axis).]
Figure 3: Relation between self-supervised accuracy and classification accuracy. The self-supervised model (DPC) is trained on K400 and the weights at epochs {13, 48, 81, 109} are saved, which achieve {50.7%, 57.4%, 59.1%, 61.1%} self-supervised accuracy respectively. The checkpoints are evaluated by finetuning on UCF101.

It can be seen that a higher accuracy on the self-supervised task always leads to a higher accuracy in downstream classification. The result indicates that DPC has actually learnt visual representations that are not only specific to the self-supervised task, but are also generic enough to be beneficial for the downstream task.

4.1.4 Benefits of Predicting Further into the Future

Due to the increase of uncertainty, predicting further into the future in video sequences gets more difficult, therefore more abstract (semantic) understanding is required. We hypothesize that if we can train the model to predict further, the learnt representation should be even better. In this section, we employ curriculum learning to gradually train the model to predict further with progressively less temporal context, i.e. from 5pred3 to 4pred4 (4 video blocks as input and predicting the future 4 steps).

Network | setting | curr. | Self-Sup. (K400) top1 acc | Sup. (UCF) top1 acc
R-18 | 5pred3 | ✗ | 61.1 | 65.9
R-18 | 4pred4 | ✗ | 48.3 | 64.9
R-18 | 5pred3+4pred4 | ✓ | 50.8 | 68.2

Table 3: Results of DPC with different prediction steps. All models are trained on K400 with the same number of 320k iterations. Note that for 5pred3 and 4pred4, the model is trained from scratch. '5pred3+4pred4' denotes the curriculum learning strategy, i.e. initialized with the pre-trained weights from the 5pred3 task. The representation is evaluated by training an action classifier on UCF101 (right column).

The results show that the 4pred4 setting gives a substantially lower accuracy on the self-supervised learning task than 5pred3. This is actually not surprising, as 4pred4 naturally introduces 33% more hard negative pairs than predicting the future 3 steps, making the self-supervised learning more difficult (explained in Section 3.2).

Interestingly, despite a lower accuracy on the self-supervised learning task, curriculum learning on 4pred4 provides a 2.3% performance boost on the downstream supervised task when compared with 5pred3 (top1 acc: 68.2% vs. 65.9%).

The experiment also shows that curriculum learning is effective, as it achieves higher performance than training the 4pred4 task from scratch (top1 acc: 68.2% vs. 64.9%). A similar effect is also observed in [19].

4.1.5 Summary

Through the experiments above, we have demonstrated the keys to the success of DPC. First, it is critical to do dense predictive coding, i.e. predicting both the temporal and spatial representation of the future blocks, and sequential prediction enables a further boost in the quality of the learnt representation. Second, a large-scale dataset helps to improve the self-supervised learning, as it naturally contains more world patterns and provides more diverse negative sample pairs. Third, the representation learnt by DPC is generic, as a higher accuracy on the self-supervised task also yields a higher accuracy on the downstream classification task. Fourth, predicting further into the future is also beneficial, as the model is forced to encode the high-level semantic representations and ignore the low-level information.

5. Comparison with State-of-the-art Methods

The results are given in Table 4; four phenomena can be observed: First, when self-supervised training with only UCF101, our DPC (60.6%) outperforms all previous methods under similar settings. Note that OPN [22] performs worse when the input resolution increases, which indicates that a simple self-supervised task like order prediction may not capture the rich semantics of videos. Second, when using Kinetics-400 for self-supervised pre-training, our DPC (68.2%) outperforms all the previous methods by a large margin. Note that, in the works [15, 17], the authors use a full-scale 3D-ResNet18 architecture (33.6M parameters), i.e. all convolutions are 3D, whereas our modified 3D-ResNet18 has fewer parameters (only the last 2 blocks are 3D convolutions). The authors of [17] obtain 65.8% accuracy by combining the rotation classification [15] with their Space-Time Cubic Puzzles method, essentially multi-task learning. When only considering their Space-Time Cubic Puzzles method, they obtain 63.9% top1 accuracy. On HMDB51, our method also outperforms the previous state of the art by 0.8% (34.5% vs. 33.7%). Third, when applying a larger input resolution (224 × 224) and using a model with more capacity (3D-ResNet34), our DPC clearly dominates all self-supervised learning methods (75.7% on UCF101 and 35.7% on HMDB51), further demonstrating that DPC is able to take advantage of networks with more capacity and today's large-scale datasets. Fourth, ImageNet pretrained weights have been a golden baseline for action recognition [33]; our self-supervised DPC is the first model that surpasses the performance of models (VGG-M) pretrained with ImageNet (75.7% vs. 73.0% on UCF101).

5.1. Visualization

We visualize the Nearest Neighbours (NN) of the video segments in the spatio-temporal feature space in Figure 4 and Figure 1. In detail, one video segment is randomly sampled from each video, then the spatio-temporal feature z_i = f(x_i) is extracted and pooled into a vector. The feature vector is then used to compute the cosine similarity score. In these figures, Figure 4a shows the video clips retrieved using our DPC model from self-supervised learning; note that the network does not receive any class label information during training. In comparison, Figure 4b uses the inflated ImageNet pre-trained weights.

It can be seen that the ImageNet model is able to encode the scene semantics, e.g. human faces, crowds, but does not capture any semantics about the human actions. In contrast, our DPC model has actually learnt the video semantics without using any manual annotation; for instance, despite the background change in running, DPC can still correctly retrieve the video block.

5.2. Discussion

Why should the DPC model succeed in learning a representation suitable for action recognition, given the problem of a non-deterministic future? There are three reasons: First, the use of the softmax function and multi-way classification loss enables multi-modal, skewed, peaked or long-tailed distributions; the model can therefore handle the task of predicting the non-deterministic future. Second, by avoiding the shortcuts, the model has been prevented from learning a simple smooth extrapolation of the embeddings; it is forced to learn semantic embeddings to succeed in its learning task. Third, in essence, DPC is trained by predicting future representations, and uses them as a “query” to pick the correct “key” from lots of distractors. In order to succeed in this task, the model has to learn the shared semantics of the multiple possible future states, as this is the only way to always solve the multiple choice problem, no matter which future state appears along with the distractors. This common/shared representation is the invariance we are wishing for, i.e. higher level semantics. In other words, the representations of all these possible future states will be mapped to a space where their embeddings are close.

Self-Supervised Method (RGB stream only) | Supervised Accuracy (top1 acc)
Method | Architecture (#param) | Dataset | UCF101 | HMDB51
Random Initialization | 3D-ResNet18 (14.2M) | - | 46.5 | 17.1
ImageNet Pretrained [33] | VGG-M-2048 (25.4M) | - | 73.0 | 40.5
Shuffle & Learn [27] (227 × 227) | CaffeNet (58.3M) | UCF101/HMDB51 | 50.2 | 18.1
OPN [22] (80 × 80) | VGG-M-2048 (8.6M) | UCF101/HMDB51 | 59.8 | 23.8
OPN [22] (120 × 120) | VGG-M-2048 (11.2M) | UCF101/HMDB51 | 55.4 | -
OPN [22] (224 × 224) | VGG-M-2048 (25.4M) | UCF101/HMDB51 | 51.9 | -
Ours (128 × 128) | 3D-ResNet18 (14.2M) | UCF101 | 60.6 | -
3D-RotNet [15] (112 × 112) | 3D-ResNet18-full (33.6M) | Kinetics-400 | 62.9 | 33.7
3D-ST-Puzzle [17] (224 × 224) | 3D-ResNet18-full (33.6M) | Kinetics-400 | 63.9 (65.8*) | 33.7*
Ours (128 × 128) | 3D-ResNet18 (14.2M) | Kinetics-400 | 68.2 | 34.5
Ours (224 × 224) | 3D-ResNet34 (32.6M) | Kinetics-400 | 75.7 | 35.7

Table 4: Comparison with other self-supervised methods; results are reported as an average over three training-testing splits. Note that previous works [15, 17] use a full-scale 3D-ResNet18, i.e. all convolutions are 3D, and the input sizes for the different models are shown. * indicates results from multi-task self-supervised learning, i.e. Rotation + 3D Puzzle.

Figure 4: More examples of video retrieval with nearest neighbour (same setting as Figure 1). Figure 4a is the NN retrieval with the DPC pre-trained f(.) on UCF101 (performance reported in Sec. 4.1.2). Figure 4b is the NN retrieval with the ImageNet inflated f(.). Retrieval is performed on the UCF101 validation set.

6. Conclusion

In this paper, we have introduced the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos, and outperformed the previous state of the art by a large margin on the downstream tasks of action classification on UCF101 and HMDB51. As for future work, one straightforward extension of this idea is to employ different methods for aggregating the temporal information – instead of using a ConvGRU for temporal aggregation (g(.) in the paper), other methods like masked CNNs and attention-based methods are also promising. In addition, empirical evidence shows that optical flow is able to boost the performance of action recognition significantly; it will be interesting to explore how optical flow can be trained jointly with DPC with self-supervised learning to further enhance the representation quality.

Acknowledgements

Funding for this research is provided by the Oxford-Google DeepMind Graduate Scholarship, and by the EPSRC Programme Grant Seebibyte EP/M013774/1.

References

[1] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In ICCV, 2015.
[2] Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, and R. Devon Hjelm. Unsupervised state representation learning in atari. In NIPS, 2019.
[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. In JMLR, 2003.
[4] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
[5] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In CVPR, 2015.
[6] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018.
[7] Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In CVPR, 2017.
[8] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[9] Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
[10] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR, 2018.
[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[12] Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H. Adelson. Learning visual groups from co-occurrences in space and time. In ICLR, 2015.
[13] Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to ego-motion. In ICCV, 2015.
[14] Dinesh Jayaraman and Kristen Grauman. Slow and steady feature analysis: higher order temporal coherence in video. In CVPR, 2016.
[15] Longlong Jing and Yingli Tian. Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387, 2018.
[16] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[17] Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time cubic puzzles. In AAAI, 2019.
[18] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[19] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NIPS, 2018.
[20] Hilde Kuehne, Huei-han Jhuang, Estibaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[21] Zihang Lai and Weidi Xie. Self-supervised learning for video correspondence flow. In BMVC, 2019.
[22] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequence. In ICCV, 2017.
[23] William Lotter, Gabriel Kreiman, and David D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, 2017.
[24] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[25] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In NIPS, 2013.
[26] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[27] Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016.
[28] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS, 2013.
[29] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
[30] Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In ICCV, 2017.
[31] Deepak Pathak, Ross B. Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
[32] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[33] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[34] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[35] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. In JMLR, 2014.
[36] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
[37] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[38] Oren Tadmor, Yonatan Wexler, Tal Rosenwein, Shai Shalev-Shwartz, and Amnon Shashua. Learning a metric embedding for face recognition using the multibatch method. In NIPS, 2016.
[39] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[40] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[41] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabelled video. In CVPR, 2016.
[42] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NIPS, 2016.
[43] Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos. In ECCV, 2018.
[44] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
[45] Xiaolong Wang, Allan Jabri, and Alexei A. Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019.
[46] Donglai Wei, Joseph Lim, Andrew Zisserman, and William T. Freeman. Learning and using the arrow of time. In CVPR, 2018.
[47] Laurenz Wiskott and Terrence Sejnowski. Slow feature analysis: Unsupervised learning of invariances. In Neural Computation, 2002.
[48] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, 2016.
Appendix

A. Architectures in detail

We use tables to display the CNN structures. The dimensions of convolutional kernels are denoted by {temporal × spatial², channel size}. The strides are denoted by {temporal stride, spatial stride²}. The 'output sizes' column displays the dimension of the feature map after the operation (except for the dimension of the input data in the first row), where {t × d² × C} denotes {temporal size × spatial size² × channel size}, and T denotes the number of video blocks. In the following tables we take the 3D-ResNet18 backbone with 128 × 128 input resolution as an example.

Structure of the action classifier. Table 5 gives the details of the action classifier which is used to evaluate the learned representation. Figure 5 is a diagram of the action classifier structure. For an input video at 30 fps, first a temporal stride of 3 is applied, i.e. every 3rd frame is taken, resulting in 10 fps. Then T × 5 consecutive frames are sampled and truncated into T video blocks, i.e. each video block has size 5 × 128² × 3, and we take T = 5 for the action classifier.

The action classifier is built from f(.) and g(.). The encoder function f(.) takes 5 video blocks, each block containing 5 video frames (5 × (5 × 128² × 3)), as input; spatio-temporal features (z) are extracted from the 5 video blocks with the shared encoder f(.). Then the aggregation function g(.) (ConvGRU) aggregates the 5 spatio-temporal feature maps into one spatio-temporal feature map, which is referred to as the context c in the paper. The context c is then pooled into a feature vector followed by a fully-connected layer.

Figure 5: The action classifier structure used to evaluate the representation.

module | specification | output sizes (T × t × d² × C)
input data | - | 5 × (5 × 128² × 3)
f(.) | see Table 7 | 5 × (1 × 4² × 256) (z)
g(.) | see Table 8 | 1 × 1 × 4² × 256 (c)
pool | 1 × 4², stride 1, 1² | 1 × 1 × 1² × 256
final fc | 1-layer FC | 1 × 1 × 1² × #classes
 | compute cross-entropy loss |

Table 5: The structure of the linear classifier.

Structure of the DPC. The DPC is built from f(.) and g(.) with an additional prediction mechanism, which is described in Table 6. Here we use the 5pred3 setting as an example, where f(.) takes 5 video blocks and extracts 5 spatio-temporal feature maps, then g(.) aggregates the feature maps into the context c. The prediction function φ(.) is a two-layer perceptron, which takes the context c as input and produces a predicted feature ẑ as output. The contrastive loss is computed using z and ẑ as described in Sec. 3.2 of the paper.

module | specification | output sizes (T × t × d² × C)
input data | - | 5 × (5 × 128² × 3)
f(.) | see Table 7 | 5 × (1 × 4² × 256) (z)
g(.) | see Table 8 | 1 × 1 × 4² × 256 (c)
φ(.) | 2-layer FC | 1 × 1 × 4² × 256 (ẑ)
 | compute loss using z and ẑ |

Table 6: The structure of the DPC model.

Structure of f(.). The detailed structure of the encoder function f(.) is shown in Table 7. Note that f(.) takes the input video blocks independently, so the number of video blocks T is omitted in the table.

stage | specification | output sizes (t × d² × C)
input data | - | 5 × 128² × 3
conv1 | 1 × 7², 64, stride 1, 2² | 5 × 64² × 64
pool1 | 1 × 3², 64, stride 1, 2² | 5 × 32² × 64
res2 | [1 × 3², 64 / 1 × 3², 64] × 2 | 5 × 32² × 64
res3 | [1 × 3², 128 / 1 × 3², 128] × 2 | 5 × 16² × 128
res4 | [3 × 3², 256 / 3 × 3², 256] × 2 | 3 × 8² × 256
res5 | [3 × 3², 256 / 3 × 3², 256] × 2 | 2 × 4² × 256
pool2 | 2 × 1², 256, stride 1, 1² | 1 × 4² × 256

Table 7: The structure of the encoding function f(.) with the 3D-ResNet18 backbone.

Structure of g(.). The structure of the temporal aggregation function g(.) is shown in Table 8. It aggregates the feature maps over the past T time steps. Note that in the case of sequential prediction, T increments by 1 after each prediction step. Table 8 shows the case where g(.) aggregates the feature maps over the past 5 steps.

stage | specification | output sizes (T × t × d² × C)
input data | - | 5 × 1 × 4² × 256
ConvGRU | [1², 256] × 1 layer | 1 × 1 × 4² × 256

Table 8: The structure of the aggregation function g(.).

B. t-SNE clustering of DPC context representation

This section shows the t-SNE clustering of the context representation on UCF101 extracted by f(.) and g(.) (Figure 6). In detail, 5 consecutive video blocks are sampled from each video in the validation set, then the feature maps {z_1, ..., z_5} are extracted from each video block and aggregated into the context representation c_5, which is then pooled into a vector. We use t-SNE to visualize the context vectors in 2D. For clarity, only 10 action classes (out of the 101 classes of UCF101) are displayed. The upper-left figure visualizes the context features extracted by randomly initialized f(.) and g(.). The following 3 figures show the context features extracted by f(.) and g(.) after {13, 48, 109} epochs of DPC training on K400, without any finetuning on UCF101.

It can be seen that as the DPC training proceeds, the intra-class distance is reduced (compared to the random initialization) and the inter-class distance is increased, i.e. the self-supervised DPC method is clustering the feature vectors into action classes.

[Figure 6 panels: 'random initialized weights', 'weights after 13 epochs DPC training on K400', 'weights after 48 epochs DPC training on K400', 'weights after 109 epochs DPC training on K400'; legend classes: ApplyEyeMakeup, CliffDiving, Drumming, GolfSwing, HulaHoop, LongJump, PlayingPiano, Rafting, Skiing, Surfing.]
Figure 6: t-SNE visualization of the context representations on the UCF101 validation set extracted by different f(.) and g(.) after {0 (random init.), 13, 48, 109} epochs of DPC training on K400.

C. Cosine distance histogram of DPC context representation

This section shows the cosine distance of the context representations on UCF101 extracted by the DPC pre-trained f(.) and g(.) (Figure 7). We use the same setting as Figure 6 and extract one context representation for each video, pooled into a vector. Then we compute the cosine distance of each pair of context vectors across the entire UCF101 validation set. The cosine distances are summarized in a histogram, where 'positive' means the two source videos are from the same action class and 'negative' means the two source videos are from different action classes. For clarity, 17 out of the 101 action classes are evenly sampled from UCF101 and visualized. Note that there is no finetuning in this stage, i.e. the network doesn't see any action labels.

It can be seen that for all action classes, the context representations from the same action class have higher cosine similarity, i.e. DPC can cluster actions without knowing action labels.
similarity, i.e. DPC can cluster actions without knowing ac-
tion labels.

Figure 7: Histogram of the cosine distance of the context representations extracted from UCF101 validation set by DPC weights. ‘Positive’ and ‘negative’
refer to the video pairs that are from the same or different action classes. DPC is trained on K400 without any finetuning on UCF101.

