Action Recognition

Abstract

In recent years, a number of approaches based on 2D or 3D convolutional neural networks (CNN) have emerged for

[Figure 1: Top-1 accuracy of I3D-R50 compared with recent models, including CorrNet-R50, SlowFast-R50, NL-I3D-R50, TEA-R50, TAM-R50, CIDC-R50, TSN-R50, and S3D-G-Inception.]
inappropriate for validating a model's capability of spatio-temporal modeling. In addition, there seems to be a tendency in current research to overly focus on pursuing state-of-the-art (SOTA) performance while overlooking other important factors such as the backbone network and the number of input frames. For instance, I3D [2] based on 3D-InceptionV1 has become a "gatekeeper" baseline that any recently proposed approach to action recognition is compared against. However, such comparisons are often unfair against stronger backbones such as ResNet50 [24]. As shown in Figure 1, I3D with ResNet50 as the backbone performs comparably with or outperforms many recent methods that are claimed to be better. As a result, such evaluations are barely informative as to whether the improved results of an approach come from a better backbone or from the algorithm itself. As discussed in Section 3, performance evaluation in action recognition may be further confounded by many other issues such as variations in training and evaluation protocols, model inputs, and pretrained models.

In light of the great need for a better understanding of CNN-based action recognition models, in this paper we provide a common ground for comparative analysis of 2D-CNN and 3D-CNN models without any bells and whistles. We conduct comprehensive experiments and analysis to compare several representative 2D-CNN and 3D-CNN methods on three large-scale benchmark datasets. Our main goal is to deliver a deep understanding of the important questions brought up above, especially a) the current progress of action recognition and b) the differences between 2D-CNN and 3D-CNN methods w.r.t. spatio-temporal representations of video data. Our systematic analysis provides insights that help researchers understand the spatio-temporal effects of different action models across backbones and architectures, and it should broadly stimulate discussion in the community regarding a very important but largely neglected issue: fair comparison in video action recognition.

The main contributions of our work are as follows:

• A Unified Framework for Action Recognition. We present a unified framework for 2D-CNN and 3D-CNN approaches and implement several representative methods for comparative analysis on three standard action recognition benchmark datasets.

• Spatio-Temporal Analysis. We systematically compare 2D-CNN and 3D-CNN models to better understand their differences and spatio-temporal behavior. Our analysis leads to some interesting findings: a) temporal pooling tends to suppress the efficacy of temporal modeling in an action model, but surprisingly provides a significant performance boost to TSN [68]; b) after removing non-structural differences between 2D-CNN and 3D-CNN models, they behave similarly in terms of spatio-temporal representation abilities and transferability.

• Benchmarking of SOTA Approaches. We thoroughly benchmarked several SOTA approaches and compared them with I3D. Our analysis reveals that I3D still stays on par with SOTA approaches in terms of accuracy (Figure 1), and that recent advances in action recognition are mostly on the efficiency side, not on accuracy. Our analysis also suggests that the input sampling strategy taken by a model (i.e., uniform or dense sampling) should be considered for fairness when comparing two models (Section 6.2).

2. Related Work

Video understanding has made rapid progress with the introduction of a number of large-scale video datasets such as Kinetics [31], Sports1M [30], Moments-In-Time [44], and YouTube-8M [1]. A number of recently introduced models have emphasized the need to efficiently model spatio-temporal information for video action recognition.

Most successful deep architectures for action recognition are based on the two-stream model [54], which processes RGB frames and optical flow in two separate CNNs with a late fusion in the upper layers [30]. Two-stream approaches have been used in many action recognition methods [3, 6, 19, 75, 56, 63, 70, 66, 11, 12]. Another straightforward but popular approach is to use a 2D-CNN to extract frame-level features and then model the temporal causality. For example, TSN [68] proposes a consensus module to aggregate features, while TRN [77] uses a bag of features to model the relationships between frames. TSM [38] shifts part of the channels along the temporal dimension, thereby allowing information to be exchanged among neighboring frames, whereas TAM [8] relies on depthwise 1 × 1 convolutions to capture temporal dependencies across frames effectively. Different methods for temporal aggregation of feature descriptors have also been proposed [13, 35, 73, 66, 48, 16, 15], and more complex approaches have been investigated for capturing long-range dependencies, e.g., non-local neural networks [69].
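To make the channel-shift idea concrete, the following is a minimal PyTorch sketch of a TSM-style temporal shift; the tensor layout, the shifted fraction (`shift_div`), and zero-padding at clip boundaries are illustrative assumptions rather than details taken from [38].

```python
import torch

def temporal_shift(x, n_frames, shift_div=8):
    """Shift a fraction of channels along the temporal dimension (TSM-style).

    x: frame-level features of shape (batch * n_frames, channels, h, w).
    1/shift_div of the channels receive the previous frame's features,
    another 1/shift_div receive the next frame's, and the rest stay in place,
    so information is exchanged between neighboring frames at zero FLOPs.
    """
    nt, c, h, w = x.shape
    x = x.view(nt // n_frames, n_frames, c, h, w)
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # take from frame t-1
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # take from frame t+1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels untouched
    return out.view(nt, c, h, w)
```

In [38] the shift is placed inside residual blocks so that the following 2D convolutions mix the shifted channels spatially.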
Another approach is to use a 3D-CNN, which extends the success of 2D models in image recognition [28] to recognizing actions in videos. For example, C3D [60] learns 3D ConvNets that outperform 2D CNNs through the use of large-scale video datasets. Many variants of 3D-CNNs have been introduced for learning spatio-temporal features, such as I3D [2] and ResNet3D [22]. 3D-CNN features have also been shown to generalize well to other vision tasks, such as action detection [52], video captioning [45], action localization [47], and video summarization [46]. Nonetheless, since 3D convolution leads to a high computational load, a few works aim to reduce the complexity by decomposing the 3D convolution into a 2D spatial convolution and a 1D temporal convolution, e.g., P3D [50], S3D [72], and R(2+1)D [62], by incorporating group convolution [61], or by using a combination of 2D-CNN and 3D-CNN [79]. Furthermore, the SlowFast network employs two pathways to capture short-term and long-term temporal information [10] by processing a video at both slow and fast frame rates. Beyond that, Timeception applies the Inception concept in the temporal domain to capture long-range temporal dependencies [26]. Feichtenhofer [9] finds efficient networks by extending 2D architectures through a stepwise expansion approach over key variables such as temporal duration, frame rate, spatial resolution, and network width. Leveraging weak supervision [14, 67, 33] or distillation [18] is another recent trend in action recognition. Few works have assessed the importance of temporal information in a video; e.g., Sigurdsson et al. analyzed performance per action category based on different levels of object complexity, verb complexity, and motion [53], and state that to differentiate temporally similar but semantically different videos, it is important for models to develop temporal understanding. Huang et al. analyzed the effect of motion via an ablation analysis on the C3D model [25]. Nonetheless, these works only study a limited set of backbones and temporal modeling methods.

3. Challenges of Evaluating Action Models

The first challenge in evaluating action models stems from the fact that, unlike ImageNet for image classification, action recognition does not have one dataset that is widely used in every paper. As shown in Figure 2, the most popular dataset, Kinetics-400, is used by around 60% of papers¹. On the other hand, Something-Something (V1 and V2), which has very different temporal characteristics from Kinetics-400, is also used by about 50% of papers. Furthermore, two successors of Kinetics-400, Kinetics-600 and Kinetics-700, were released recently. It is difficult to evaluate different methods if they do not test on common datasets. We further checked how those 37 papers [60, 68, 2, 17, 50, 78, 58, 77, 5, 76, 22, 65, 34, 79, 72, 62, 69, 23, 42, 29, 49, 38, 8, 36, 26, 41, 10, 61, 71, 39, 57, 51, 74, 7, 64, 9, 37] compare performance in their papers. We evaluate those papers from four aspects: backbone, input length, training protocol, and evaluation protocol. Figure 2 summarizes how differently the papers compare to others.

Figure 2: Statistics collected from 37 action recognition papers from 2015 to 2020. Left: used datasets. Right: ratio of papers that used different settings to compare with others.

Backbone. From our analysis, we observe that about 70% of papers compare results across different backbones (e.g., most papers use ResNet50 as the backbone but compare with I3D [2], which uses InceptionV1 as the backbone). Comparing action models with different types of backbones can often lead to incorrect conclusions and makes it harder to evaluate the advantage of the proposed temporal modeling. For example, using a stronger backbone for I3D improves its results by 4.0% on Kinetics-400 (see Figure 7).

¹The Kinetics-400 dataset has only been available since 2017; the usage rate increases to 69% if only papers published after 2017 are counted.

Input Length. Figure 2 shows that about 80% of the papers use a different number of frames for comparison. This is because each method may prefer a different number of frames; however, comparing under different numbers of frames can favor either the proposed method or the reference method.

Training Protocol. Due to recent advances in technology, it is now often easy to train action recognition models for a very long time (many epochs), which was not possible a few years ago; this means that older methods might not be well trained. Furthermore, many works reuse ImageNet weights to initialize their models while others do not. This raises the concern of whether a reported gain comes from a different training protocol. Based on our analysis, about 60% of the papers use different protocols to train action recognition models.

Evaluation Protocol. As models are trained under different sampling strategies and input lengths, a model may take more than one clip from a video for prediction. Hence, different evaluation protocols can lead to unclear comparisons. About 60% of papers evaluated models differently when comparing to others.
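To illustrate what a typical multi-clip evaluation protocol involves, here is a minimal sketch that averages softmax scores over sampled clips (and spatial crops); the score-averaging rule and the clip enumeration are common conventions assumed for illustration, not the exact protocol of any paper surveyed above.

```python
import torch

def evaluate_video(model, clips):
    """Video-level prediction from several clips/crops of one video.

    `clips` is an iterable of tensors shaped (channels, frames, height, width);
    the video-level score is the mean of the per-clip softmax scores.
    """
    model.eval()
    scores = []
    with torch.no_grad():
        for clip in clips:
            logits = model(clip.unsqueeze(0))        # add a batch dimension
            scores.append(torch.softmax(logits, dim=1))
    return torch.cat(scores, dim=0).mean(dim=0)      # (num_classes,)
```

Whether a paper reports single-clip or multi-clip, multi-crop scores changes the measured accuracy, which is why mismatched evaluation protocols blur comparisons.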
4. 2D-CNN and 3D-CNN Approaches

To address the above-mentioned issues and enable a fair comparison, we analyze several popular 2D-CNN and 3D-CNN approaches for action recognition, including I3D [2], ResNet3D [21], S3D [72], R(2+1)D [62], TSN [68] and TAM [8]. These approaches not only yield competitive results on popular large-scale datasets, but also widely serve as fundamental building blocks for many other successive approaches such as SlowFast [10] and CSN [61].

Among these approaches, I3D and ResNet3D are pure 3D-CNN models, differing only in their backbones. S3D and R(2+1)D factorize a 3D convolutional filter into a 2D spatial filter followed by a 1D temporal filter; in this sense they are architecturally similar to 2D models, but we categorize them as 3D-CNN models since their implementations are based on 3D convolutions. While TSN relies only on 2D convolution without temporal modeling, TAM, another 2D-CNN approach, adds efficient depthwise temporal aggregation on top of TSN and shows strong results on Something-Something [8]. Finally, since SlowFast is arguably one of the best approaches on Kinetics, we use it

Approach | Backbone | Model Input | Input Sampling | Temporal Pooling | Spatial Module | Temporal Aggregation | Initial Weights
I3D [2] | InceptionV1 | 4D | Dense | Y | 3D Conv. | 3D Conv. | Inflation
R3D [21] | ResNet | 4D | Dense | Y | 3D Conv. | 3D Conv. | Inflation
S3D [72] | InceptionV1 | 4D | Dense | Y | 2D Conv. | 1D Conv. | Inflation
R(2+1)D [62] | ResNet | 4D | Dense | Y | 2D Conv. | 1D Conv. | Scratch
TAM [8] | bLResNet | 3D | Uniform | N | 2D Conv. | 1D dw Conv. | ImageNet
TSN [68] | InceptionV1 | 3D | Uniform | N | 2D Conv. | None | ImageNet

Table 1: 2D-CNN and 3D-CNN approaches in our study.
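To illustrate the factorization listed under "Spatial Module" and "Temporal Aggregation" in Table 1, below is a minimal sketch of an S3D/R(2+1)D-style block that replaces one 3×3×3 convolution with a 1×3×3 spatial convolution followed by a 3×1×1 temporal convolution; the intermediate width, normalization, and activation placement are illustrative assumptions, not the exact published blocks.

```python
import torch.nn as nn

class FactorizedSTConv(nn.Module):
    """(2+1)D-style block: 2D spatial conv followed by 1D temporal conv.

    Input/output shape: (batch, channels, time, height, width).
    """

    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)   # 1x3x3
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)  # 3x1x1
        self.bn = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.temporal(self.spatial(x))))
```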
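The "Inflation" entries in Table 1 refer to bootstrapping 3D filters from 2D ImageNet weights as done in I3D [2]: each pretrained 2D kernel is repeated along a new temporal axis and rescaled so that the inflated network initially responds to a frame-repeated video the way the 2D network responds to a single image. A minimal sketch:

```python
import torch

def inflate_conv2d_weight(w2d, t):
    """Inflate a 2D kernel (out_ch, in_ch, kh, kw) into a 3D kernel
    (out_ch, in_ch, t, kh, kw) by repeating it t times over the temporal
    axis and dividing by t to preserve the activation scale."""
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t

# e.g., inflating a 3x3 ImageNet-pretrained kernel to a 3x3x3 video kernel
w2d = torch.randn(64, 3, 3, 3)
print(inflate_conv2d_weight(w2d, 3).shape)   # torch.Size([64, 3, 3, 3, 3])
```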
Figure 4: Top-1 accuracy of all the compared models without temporal pooling on three mini-datasets. The video architectures are separated by color while the backbones by symbol.

Figure 6: Accuracy gain of the models with temporal pooling w.r.t. the models without temporal pooling. Temporal pooling significantly hurts the performance of all models except TSNs.
Figure 8: Model performance tested using 3 256×256 spatial crops and different number of clips. 'U': uniform sampling; 'D': dense sampling. Best viewed in color.

Dataset | Frames | InceptionV1 (None / I3D / Conv. / TAM) | ResNet50 (None / I3D / Conv. / TAM / TSM / NLN)
Mini-SSV2 | f=8 | 33.1 / 56.4 / 58.2 / 59.7 | 33.9 / 62.6 / 61.6 / 65.4 / 64.1 / 53.0
Mini-SSV2 | f=16 | 34.7 / 61.8 / 63.7 / 63.9 | 35.3 / 66.2 / 65.7 / 68.6 / 67.4 / 55.0
Mini-Kinetics | f=8 | 70.4 / 68.1 / 68.3 / 68.8 | 72.1 / 73.3 / 71.5 / 74.1 / 74.1 / 73.7
Mini-Kinetics | f=16 | 70.5 / 70.9 / 70.7 / 70.0 | 72.5 / 75.5 / 73.4 / 76.4 / 75.6 / 74.5

Table 5: Performance of different temporal aggregation strategies w/o temporal pooling. FLOPs and parameters of different models can be found in the supplementary material.
that is largely overlooked by the community when assessing model efficiency, i.e., the impact of input sampling. As shown in Figure 8 (right), when putting I3D and SlowFast in a plot of accuracy vs. FLOPs for comparison, the advantage of SlowFast over I3D is better and more fairly represented: when considering uniform sampling for I3D, SlowFast is only slightly more accurate at the same efficiency in FLOPs. This clearly suggests that the input sampling strategy of a model (i.e., uniform or dense) should factor into the evaluation for fairness when comparing it to another model.
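The two sampling strategies can be made concrete with the short sketch below; the mid-segment rule for uniform sampling and the fixed temporal stride for dense sampling are illustrative assumptions, not the exact settings used by any specific model compared here.

```python
import numpy as np

def sample_frame_indices(num_frames, clip_len, strategy="uniform",
                         clip_start=0, stride=2):
    """Pick clip_len frame indices from a video with num_frames frames.

    'uniform' spreads the frames over the whole video (one per segment);
    'dense' takes consecutive frames with a fixed stride from clip_start,
    so several dense clips are needed to cover a long video.
    """
    if strategy == "uniform":
        ticks = np.linspace(0, num_frames, clip_len + 1)
        return ((ticks[:-1] + ticks[1:]) / 2).astype(int)   # segment midpoints
    if strategy == "dense":
        idx = clip_start + stride * np.arange(clip_len)
        return np.clip(idx, 0, num_frames - 1)
    raise ValueError(f"unknown strategy: {strategy}")

print(sample_frame_indices(300, 8, "uniform"))   # spans the whole 300 frames
print(sample_frame_indices(300, 8, "dense"))     # covers only ~16 frames
```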
Model | UCF101 | HMDB51 | Jester | Mini-SSV2
I3D-ResNet50 | 97.12 | 72.32 | 96.39 | 65.86
TAM-ResNet50 | 95.05 | 71.67 | 96.35 | 66.91
SlowFast-ResNet50-8×8 | 95.67 | 74.61 | 96.75 | 63.93

Table 4: Top-1 Acc. of the transferability study from Kinetics; columns are the target datasets.

Model Transferability. We further compare the transferability of the three models trained above on four small-scale datasets: UCF101 [55], HMDB51 [32], Jester [43], and Mini-SSV2. We follow the same training setting as in Section 5 and finetune for 45 epochs with a cosine annealing learning-rate schedule starting at 0.01; furthermore, since these are 32-frame models, we train them with a batch size of 48 and synchronized batch normalization. Table 4 shows the results, indicating that all three models have very similar performance (a difference of less than 2%) on the downstream tasks. In particular, I3D performs on par with SOTA approaches like TAM and SlowFast in transfer learning (e.g., I3D obtains the best accuracy of 97.12% on UCF101), which once again corroborates the fact that the improved spatio-temporal modeling is largely due to the use of stronger backbones.
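For concreteness, the finetuning recipe described above roughly corresponds to the sketch below; the optimizer choice, momentum, and weight decay are assumptions (the paragraph only specifies 45 epochs, a cosine-annealed learning rate starting at 0.01, a batch size of 48, and synchronized batch normalization, the latter two being handled outside this function).

```python
import torch

def finetune(model, train_loader, epochs=45, base_lr=0.01):
    """Finetune a pretrained video model with a cosine-annealed learning rate."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for clips, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()   # one cosine step per epoch
```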
6.3. Analysis of Spatio-temporal Effects

It is generally believed that temporal modeling is at the core of action recognition and that state-of-the-art approaches capture better temporal information. However, it has also been demonstrated on datasets such as Kinetics and Moments-in-Time (MiT) [44] that approaches purely based on spatial modeling [68, 44] can achieve very competitive results compared to more sophisticated spatio-temporal models. More recently, a paper [27] also shows that 2D models outperform their 3D counterparts on the MiT benchmark. These findings seem to imply that more complex temporal modeling is not necessary for "static" datasets such as Kinetics and MiT. We believe that the lack of fairness in performance evaluation leads to confusion in understanding the significance of temporal modeling for action recognition.

Temporal Aggregation. The essence of temporal modeling is how it aggregates temporal information. The 2D architecture offers great flexibility in temporal modeling; for example, TSM [38] and TAM [8] can be easily inserted into a CNN for learning spatio-temporal features. Here we analyze several basic temporal aggregations on top of the 2D architecture, including 1D convolution (Conv, i.e., S3D [72]), 1D depthwise convolution (dw Conv, i.e., TAM), and TSM. We also consider the non-local network module (NLN) [69] for its ability to capture long-range temporal video dependencies; we add 3 NLN modules and 2 NLN modules at stage 2 and stage 3 of TSN-ResNet50, respectively, as in [69].

Table 5 shows the results of using different temporal aggregations as well as those of TSN (i.e., without any temporal aggregation) on InceptionV1 and ResNet50. The results suggest that effective temporal modeling is required for achieving competitive results, even on datasets such as Kinetics where temporal information is thought to be non-essential for recognition. On the other hand, TAM and TSM, while being simple and efficient, demonstrate better performance than I3D, 1D regular convolution, and the NLN module, which have more parameters and FLOPs. We argue this is because the frames sampled under uniform sampling are sparse and thus not well suited to temporal modeling with 3D convolution, whereas the depthwise operation used by TAM and TSM models temporal information more effectively, since it considers a single feature map across frames at a time instead of combining all channels of all frames at once. We also find the same pattern on full Kinetics in Table 3. Interestingly, the NLN module does not perform as expected on Mini-SSV2, possibly because NLN models temporal dependencies by matching spatial features between frames, which are weak in Mini-SSV2.
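As a concrete illustration of the depthwise temporal aggregation compared in Table 5 (the dw Conv/TAM column), here is a minimal sketch; the kernel size, the reshaping convention, and the position of the module inside the backbone are illustrative assumptions rather than the exact TAM design.

```python
import torch.nn as nn

class DepthwiseTemporalConv(nn.Module):
    """Per-channel 1D convolution over time on frame-level feature maps.

    Each channel is mixed across neighboring frames by its own small kernel,
    so temporal aggregation is added with very few parameters and FLOPs.
    Input/output shape: (batch * n_frames, channels, height, width).
    """

    def __init__(self, channels, n_frames, kernel_size=3):
        super().__init__()
        self.n_frames = n_frames
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels,
                              bias=False)

    def forward(self, x):
        nt, c, h, w = x.shape
        n = nt // self.n_frames
        # (n, t, c, h, w) -> (n*h*w, c, t) so the 1D conv runs over time
        x = x.view(n, self.n_frames, c, h, w).permute(0, 3, 4, 2, 1)
        x = x.reshape(n * h * w, c, self.n_frames)
        x = self.conv(x)
        x = x.view(n, h, w, c, self.n_frames).permute(0, 4, 3, 1, 2)
        return x.reshape(nt, c, h, w)
```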
Locations of Temporal Modules. In [72] and [62], some preliminary analysis of the effect of the locations of temporal modules on 3D models was performed on Kinetics-400. Here we conduct a similar experiment on both Mini-Kinetics and Mini-SSV2 to understand whether the same holds for 2D models. We modified TAM-ResNet18 in a number of different ways by keeping: a) half of the temporal modules only in the bottom network layers (Bottom-Half); b) half of the temporal modules only in the top network layers (Top-Half); c) every other temporal module (Uniform-Half); and d) all the temporal modules (All). As observed in Table 6, only half of the temporal modules (Top-Half) is needed to achieve the best accuracy on Mini-SSV2, while the accuracy on Mini-Kinetics is not sensitive to the number and locations of temporal modules. It is thus interesting to explore whether this observation can lead to an efficient yet effective video architecture that mixes 2D and 3D modeling, similar to the idea of ECO [79].

# of TAMs | Locations | Mini-SSV2 | Mini-Kinetics
8 | All | 59.1 | 69.08
4 | Top-half | 59.7 | 69.21
4 | Bottom-half | 56.5 | 69.27
4 | Uniform-half | 59.4 | 69.14

Table 6: Top-1 Acc. comparison using different numbers and locations of TAMs in ResNet18 (w/o temporal pooling). Top and bottom refer to the residual blocks closer to the output and the input, respectively.

Disentangling Spatial and Temporal Effects. So far we have only looked at the overall spatio-temporal effect of a model (i.e., top-1 accuracy) in our analysis. Here we further disentangle the spatial and temporal contributions of a model to understand its ability of spatio-temporal modeling. Doing so provides great insight into which information, spatial or temporal, is more essential to recognition. We treat TSN w/o temporal pooling as the baseline spatial model, as it does not model temporal information; TSN can evolve into different types of spatio-temporal models by adding temporal modules on top of it. With this, we compute the spatial and temporal contributions of a model as follows. Let S^b_a(k) be the accuracy of a model of some architecture a that is based on a backbone b and takes k frames as input. For instance, S^{ResNet50}_{I3D}(16) is the accuracy of a 16-frame I3D-ResNet50 model. The spatial contribution Φ^b_a and the temporal improvement Ψ^b_a of a model (k is omitted here for clarity) are then given by

$$\Phi_a^b = S_{TSN}^b \,/\, \max\!\left(S_a^b,\, S_{TSN}^b\right), \qquad \Psi_a^b = \left(S_a^b - S_{TSN}^b\right) / \left(100 - S_{TSN}^b\right). \tag{1}$$

Note that Φ^b_a is between 0 and 1, and Ψ^b_a < 0 indicates that temporal modeling is harmful to model performance. For example, using the 16-frame InceptionV1 results on Mini-SSV2 in Table 5 (61.8 for 3D-conv aggregation vs. 34.7 for the TSN baseline), Eq. (1) gives Φ ≈ 34.7/61.8 ≈ 0.56 and Ψ ≈ (61.8 − 34.7)/(100 − 34.7) ≈ 0.42. We further combine Φ^b_a and Ψ^b_a across all models with different backbone networks to obtain the average spatial and temporal contributions of a network architecture, as shown below:

$$\bar{\Phi}_a = \frac{1}{Z_\Phi}\sum_{b\in B}\sum_{k\in K}\Phi_a^b(k), \qquad \bar{\Psi}_a = \frac{1}{Z_\Psi}\sum_{b\in B}\sum_{k\in K}\Psi_a^b(k), \tag{2}$$

where B = {InceptionV1, ResNet18, ResNet50}, K = {8, 16, 32, 64}, and Z_Φ and Z_Ψ are normalization factors.

Datasets | Metrics | I3D | S3D | TAM
Mini-SSV2 | Φ̄_a | 0.53 | 0.53 | 0.52
Mini-SSV2 | Ψ̄_a^ta | 0.46 | 0.45 | 0.47
Mini-SSV2 | Ψ̄_a^{ta+tp} | 0.38 | 0.38 | 0.37
Mini-Kinetics | Φ̄_a | 0.97 | 0.97 | 0.96
Mini-Kinetics | Ψ̄_a^ta | 0.06 | 0.08 | 0.09
Mini-Kinetics | Ψ̄_a^{ta+tp} | -0.08 | -0.10 | -0.12
Mini-MiT | Φ̄_a | 0.89 | 0.91 | 0.87
Mini-MiT | Ψ̄_a^ta | 0.04 | 0.03 | 0.04
Mini-MiT | Ψ̄_a^{ta+tp} | 0.02 | 0.02 | 0.04

Table 7: Effects of spatio-temporal modeling. Ψ̄_a^ta: the improvement from temporal aggregation only; Ψ̄_a^{ta+tp}: the improvement from combining temporal aggregation and temporal pooling.

Table 7 shows the results of Φ̄_a and Ψ̄_a for three spatio-temporal representations. All three representations behave similarly: their spatial modeling contributes slightly more than temporal modeling on Mini-SSV2, much more on Mini-MiT, and dominantly on Mini-Kinetics. This convincingly explains why a model lacking temporal modeling like TSN can perform well on Mini-Kinetics but fails badly on Mini-SSV2. Note that similar observations have been made in the literature, but not in a quantitative way like ours. Furthermore, while all the approaches indicate the utmost importance of spatial modeling on Mini-Kinetics, the results of Ψ̄_a^ta suggest that temporal modeling is more effective on Mini-Kinetics than on Mini-MiT for both 2D and 3D approaches. We also observe that temporal pooling deters the effectiveness of temporal modeling on all the approaches, as the values of Ψ̄_a^{ta+tp} are consistently lower than Ψ̄_a^ta. Such damage is especially substantial on Mini-Kinetics, indicated by the negative values of Ψ̄_a^{ta+tp}.

7. Conclusion

In this paper, we conducted a comprehensive comparative analysis of several representative CNN-based video action recognition approaches with different backbones and temporal aggregations. Our extensive analysis enables a better understanding of the differences and spatio-temporal effects of 2D-CNN and 3D-CNN approaches. It also provides significant insights with regard to the efficacy of spatio-temporal representations for action recognition.

Acknowledgments. This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via DOI/IBC contract number D17PC00341. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. This work is also supported by the MIT-IBM Watson AI Lab.

Disclaimer. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.
References [17] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic,
and Bryan Russell. Actionvlad: Learning spatio-temporal
[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Nat- aggregation for action classification. In Proceedings of the
sev, George Toderici, Balakrishnan Varadarajan, and Sud- IEEE Conference on Computer Vision and Pattern Recogni-
heendra Vijayanarasimhan. Youtube-8m: A large-scale tion (CVPR), July 2017. 3
video classification benchmark. arXiv:1609.08675, 2016. 2
[18] Rohit Girdhar, Du Tran, Lorenzo Torresani, and Deva Ra-
[2] Joao Carreira and Andrew Zisserman. Quo vadis, action
manan. Distinit: Learning video representations without a
recognition? a new model and the kinetics dataset. In CVPR,
single labeled video. arXiv:1901.09244, 2019. 3
pages 6299–6308, 2017. 1, 2, 3, 4
[19] Georgia Gkioxari and Jitendra Malik. Finding action tubes.
[3] Guilhem Chéron, Ivan Laptev, and Cordelia Schmid. P-cnn:
In CVPR, pages 759–768, 2015. 2
Pose-based cnn features for action recognition. In ICCV,
[20] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal-
pages 3218–3226, 2015. 2
ski, Joanna Materzynska, Susanne Westphal, Heuna Kim,
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz
and Li Fei-Fei. Imagenet: A large-scale hierarchical image
Mueller-Freitag, et al. The” something something” video
database. In 2009 IEEE conference on computer vision and
database for learning and evaluating visual common sense.
pattern recognition, pages 248–255. Ieee, 2009. 1
In ICCV, 2017. 1
[5] Ali Diba, Mohsen Fayyaz, Vivek Sharma, M. Mahdi Arzani,
[21] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Learn-
Rahman Yousefzadeh, Juergen Gall, and Luc Van Gool.
ing spatio-temporal features with 3d residual networks for
Spatio-temporal channel correlation networks for action
action recognition. In ICCV, pages 3154–3160, 2017. 3, 4
classification. In ECCV, September 2018. 3
[6] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, [22] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can
Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Spatiotemporal 3D CNNs Retrace the History of 2D CNNs
and Trevor Darrell. Long-term recurrent convolutional net- and ImageNet? In CVPR, June 2018. 1, 2, 3
works for visual recognition and description. In CVPR, June [23] Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu,
2015. 2 Yandong Li, Limin Wang, and Shilei Wen. StNet: Local
[7] Linxi Fan, Shyamal Buch, Guanzhi Wang, Ryan Cao, Yuke and Global Spatial-Temporal Modeling for Action Recogni-
Zhu, Juan Carlos Niebles, and Li Fei-Fei. RubiksNet: Learn- tion. Proceedings of the AAAI Conference on Artificial Intel-
able 3D-Shift for Efficient Video Action Recognition. In ligence, 33(01):8401–8408, July 2019. 3
Proceedings of the European Conference on Computer Vi- [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
sion (ECCV), 2020. 3 Deep Residual Learning for Image Recognition. In CVPR,
[8] Quanfu Fan, Chun-Fu (Richard) Chen, Hilde Kuehne, Marco
Pistoia, and David Cox. More Is Less: Learning Efficient [25] De-An Huang, Vignesh Ramanathan, Dhruv Mahajan,
Video Representations by Temporal Aggregation Modules. Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, and Juan
In NeurIPS, 2019. 1, 2, 3, 4, 6, 7 Carlos Niebles. What makes a video a video: Analyz-
[9] Christoph Feichtenhofer. X3d: Expanding architectures for ing temporal information in video understanding models and
efficient video recognition. In CVPR, June 2020. 3, 5, 6 datasets. In CVPR, pages 7366–7375, 2018. 3
[10] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and [26] Noureldien Hussein, Efstratios Gavves, and Arnold W.M.
Kaiming He. Slowfast networks for video recognition. Smeulders. Timeception for complex action recognition. In
arXiv:1812.03982, 2018. 1, 3, 5, 6 CVPR, June 2019. 3
[11] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. [27] Matthew Hutchinson, Siddharth Samsi, William Arcand,
Spatiotemporal residual networks for video action recogni- David Bestor, Bill Bergeron, Chansup Byun, Micheal Houle,
tion. In NeurIPS, pages 3468–3476, 2016. 2 Matthew Hubbell, Micheal Jones, Jeremy Kepner, et al. Ac-
[12] Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes. curacy and performance comparison of video action recog-
Spatiotemporal multiplier networks for video action recog- nition approaches. arXiv:2008.09037, 2020. 5, 7
nition. In CVPR, pages 4768–4777, 2017. 2 [28] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neu-
[13] Basura Fernando, Efstratios Gavves, Jose M Oramas, Amir ral networks for human action recognition. IEEE TPAMI,
Ghodrati, and Tinne Tuytelaars. Modeling video evolution 35(1):221–231, Jan 2013. 2
for action recognition. In CVPR, pages 5378–5387, 2015. 2 [29] Boyuan Jiang, MengMeng Wang, Weihao Gan, Wei Wu, and
[14] Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. Large- Junjie Yan. Stm: Spatiotemporal and motion encoding for
scale weakly-supervised pre-training for video action recog- action recognition. In Proceedings of the IEEE/CVF Inter-
nition. In CVPR, pages 12046–12055, 2019. 3 national Conference on Computer Vision (ICCV), October
[15] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zis- 2019. 3
serman. Video action transformer network. In CVPR, pages [30] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas
244–253, 2019. 2 Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video
[16] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, classification with convolutional neural networks. In CVPR,
and Bryan Russell. Actionvlad: Learning spatio-temporal pages 1725–1732, 2014. 2
aggregation for action classification. In CVPR, pages 971– [31] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang,
980, 2017. 2 Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola,
Tim Green, Trevor Back, Paul Natsev, et al. The kinetics [46] Rameswar Panda and Amit K Roy-Chowdhury. Collabora-
human action video dataset. arXiv:1705.06950, 2017. 1, 2 tive summarization of topic-related videos. In CVPR, pages
[32] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. 7083–7092, 2017. 2
HMDB: a large video database for human motion recogni- [47] Sujoy Paul, Sourya Roy, and Amit K Roy-Chowdhury. W-
tion. In ICCV, 2011. 7 talc: Weakly-supervised temporal activity localization and
[33] Hilde Kuehne, Alexander Richard, and Juergen Gall. Weakly classification. In ECCV, pages 563–579, 2018. 2
supervised learning of actions from transcripts. Computer [48] Xiaojiang Peng, Changqing Zou, Yu Qiao, and Qiang Peng.
Vision and Image Understanding, 163:78–89, 2017. 3 Action recognition with stacked fisher vectors. In ECCV,
[34] Myunggi Lee, Seungeui Lee, Sungjoon Son, Gyutae Park, pages 581–595. Springer, 2014. 2
and Nojun Kwak. Motion Feature Network: Fixed Mo- [49] AJ Piergiovanni and Michael S. Ryoo. Representation flow
tion Filter for Action Recognition. In Vittorio Ferrari, Mar- for action recognition. In Proceedings of the IEEE/CVF
tial Hebert, Cristian Sminchisescu, and Yair Weiss, edi- Conference on Computer Vision and Pattern Recognition
tors, Computer Vision – ECCV 2018, pages 392–408, Cham, (CVPR), June 2019. 3
2018. Springer International Publishing. 3 [50] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-
[35] Guy Lev, Gil Sadeh, Benjamin Klein, and Lior Wolf. Rnn temporal representation with pseudo-3d residual networks.
fisher vectors for action recognition and image annotation. In ICCV, Oct 2017. 2, 3
In ECCV, pages 833–850. Springer, 2016. 2 [51] Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, and
[36] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Collab- Anelia Angelova. Assemblenet: Searching for multi-stream
orative Spatiotemporal Feature Learning for Video Action neural connectivity in video architectures. In International
Recognition. In Proceedings of the IEEE/CVF Conference Conference on Learning Representations, 2020. 3, 6
on Computer Vision and Pattern Recognition (CVPR), June [52] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal
2019. 3, 6 action localization in untrimmed videos via multi-stage cnns.
[37] Xinyu Li, Bing Shuai, and Joseph Tighe. Directional Tem- In CVPR, pages 1049–1058, 2016. 2
poral Modeling for Action Recognition. In Andrea Vedaldi, [53] Gunnar A Sigurdsson, Olga Russakovsky, and Abhinav
Horst Bischof, Thomas Brox, and Jan-Michael Frahm, edi- Gupta. What actions are needed for understanding human
tors, Computer Vision – ECCV 2020, pages 275–291, Cham, actions in videos? In ICCV, pages 2137–2146, 2017. 3
2020. Springer International Publishing. 3 [54] Karen Simonyan and Andrew Zisserman. Two-stream con-
[38] Ji Lin, Chuang Gan, and Song Han. Temporal Shift Module volutional networks for action recognition in videos. In
for Efficient Video Understanding. In ICCV, 2019. 1, 2, 3, 7 NeurIPS, 2014. 2
[39] Zhaoyang Liu, Donghao Luo, Yabiao Wang, Limin Wang, [55] Khurram Soomro, Amir Roshan Zamir, Mubarak Shah,
Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Tong Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah.
Lu. TEINet: Towards an Efficient Architecture for Video Ucf101: A dataset of 101 human actions classes from videos
Recognition. Proceedings of the AAAI Conference on Artifi- in the wild. arXiv, 2012. 7
cial Intelligence, 34(07):11669–11676, Apr. 2020. 3 [56] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudi-
[40] Chenxu Luo and Alan L Yuille. Grouped spatial-temporal nov. Unsupervised learning of video representations using
aggregation for efficient action recognition. In ICCV, pages lstms. In ICML, pages 843–852, 2015. 2
5512–5521, 2019. 1 [57] Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz.
[41] Chenxu Luo and Alan L. Yuille. Grouped spatial-temporal Gate-shift networks for video action recognition. In CVPR,
aggregation for efficient action recognition. In Proceedings pages 1102–1111, 2020. 1, 3
of the IEEE/CVF International Conference on Computer Vi- [58] Shuyang Sun, Zhanghui Kuang, Lu Sheng, Wanli Ouyang,
sion (ICCV), October 2019. 3 and Wei Zhang. Optical flow guided feature: A fast and
[42] Brais Martinez, Davide Modolo, Yuanjun Xiong, and Joseph robust motion representation for video action recognition.
Tighe. Action recognition with spatial-temporal discrimi- In Proceedings of the IEEE Conference on Computer Vision
native filter banks. In Proceedings of the IEEE/CVF Inter- and Pattern Recognition (CVPR), June 2018. 3
national Conference on Computer Vision (ICCV), October [59] C Szegedy, Wei Liu, Yangqing Jia, P Sermanet, S Reed, D
2019. 3 Anguelov, D Erhan, V Vanhoucke, and A Rabinovich. Going
[43] Joanna Materzynska, Guillaume Berger, Ingo Bax, and deeper with convolutions. In CVPR, pages 1–9, 2015. 1
Roland Memisevic. The jester dataset: A large-scale video [60] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani,
dataset of human gestures. In ICCV Workshops, Oct 2019. 7 and Manohar Paluri. Learning Spatiotemporal Features With
[44] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ra- 3D Convolutional Networks. In ICCV, 2015. 2, 3
makrishnan, Sarah Adel Bargal, Yan Yan, Lisa Brown, [61] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feis-
Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments zli. Video classification with channel-separated convolu-
in time dataset: one million videos for event understanding. tional networks. In ICCV, October 2019. 2, 3, 6
IEEE TPAMI, 2019. 1, 2, 7 [62] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann
[45] Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong LeCun, and Manohar Paluri. A Closer Look at Spatiotem-
Rui. Jointly modeling embedding and translation to bridge poral Convolutions for Action Recognition. In CVPR, June
video and language. In CVPR, pages 4594–4602, 2016. 2 2018. 2, 3, 4, 7
[63] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Don- [78] Yizhou Zhou, Xiaoyan Sun, Zheng-Jun Zha, and Wenjun
ahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Zeng. Mict: Mixed 3d/2d convolutional tube for human
Sequence to sequence-video to text. In ICCV, pages 4534– action recognition. In Proceedings of the IEEE Conference
4542, 2015. 2 on Computer Vision and Pattern Recognition (CVPR), June
[64] Heng Wang, Du Tran, Lorenzo Torresani, and Matt Feiszli. 2018. 3
Video Modeling With Correlation Networks. In IEEE/CVF [79] Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas
Conference on Computer Vision and Pattern Recognition Brox. Eco: Efficient convolutional network for online video
(CVPR), June 2020. 3, 6 understanding. In ECCV, pages 695–712, 2018. 3, 8
[65] Limin Wang, Wei Li, Wen Li, and Luc Van Gool.
Appearance-and-relation networks for video classification.
In CVPR, June 2018. 3
[66] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recogni-
tion with trajectory-pooled deep-convolutional descriptors.
In CVPR, pages 4305–4314, 2015. 2
[67] Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool.
Untrimmednets for weakly supervised action recognition
and detection. In CVPR, pages 4325–4334, 2017. 3
[68] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua
Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment
networks: Towards good practices for deep action recogni-
tion. In ECCV. Springer, 2016. 1, 2, 3, 4, 7
[69] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim-
ing He. Non-local neural networks. In CVPR, June 2018. 2,
3, 7
[70] Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia
Schmid. Learning to track for spatio-temporal action local-
ization. In ICCV, pages 3164–3172, 2015. 2
[71] Junwu Weng, Donghao Luo, Yabiao Wang, Ying Tai,
Chengjie Wang, Jilin Li, Feiyue Huang, Xudong Jiang, and
Junsong Yuan. Temporal Distinct Representation Learning
for Action Recognition. In Andrea Vedaldi, Horst Bischof,
Thomas Brox, and Jan-Michael Frahm, editors, Computer
Vision – ECCV 2020, pages 363–378, Cham, 2020. Springer
International Publishing. 3
[72] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and
Kevin Murphy. Rethinking Spatiotemporal Feature Learn-
ing: Speed-Accuracy Trade-offs in Video Classification. In
ECCV, Sept. 2018. 2, 3, 4, 7
[73] Zhongwen Xu, Yi Yang, and Alex G Hauptmann. A dis-
criminative cnn video representation for event detection. In
CVPR, pages 1798–1807, 2015. 2
[74] Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, and Bolei
Zhou. Temporal pyramid network for action recognition.
In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2020. 3
[75] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vi-
jayanarasimhan, Oriol Vinyals, Rajat Monga, and George
Toderici. Beyond short snippets: Deep networks for video
classification. In CVPR, pages 4694–4702, 2015. 2
[76] Yue Zhao, Yuanjun Xiong, and Dahua Lin. Trajectory con-
volution for action recognition. In Proceedings of the 32nd
International Conference on Neural Information Processing
Systems, NIPS’18, page 2208–2219, Red Hook, NY, USA,
2018. Curran Associates Inc. 3
[77] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Tor-
ralba. Temporal relational reasoning in videos. In ECCV,
pages 803–818, 2018. 2, 3