
Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition

Chun-Fu (Richard) Chen1,†, Rameswar Panda1,†, Kandan Ramakrishnan1, Rogerio Feris1, John Cohn1, Aude Oliva2, Quanfu Fan1,†
†: Equal Contribution
1: MIT-IBM Watson AI Lab, 2: Massachusetts Institute of Technology

Abstract

In recent years, a number of approaches based on 2D or 3D convolutional neural networks (CNNs) have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out an in-depth comparative analysis to better understand the differences between these approaches and the progress made by them. To this end, we develop a unified framework for both 2D-CNN and 3D-CNN action models, which enables us to remove bells and whistles and provides a common ground for fair comparison. We then conduct an effort towards a large-scale analysis involving over 300 action recognition models. Our comprehensive analysis reveals that a) a significant leap has been made in the efficiency of action recognition, but not in accuracy; b) 2D-CNN and 3D-CNN models behave similarly in terms of spatio-temporal representation abilities and transferability. Our codes are available at https://github.com/IBM/action-recognition-pytorch.

[Figure 1: Recent progress of action recognition on Kinetics-400 (only models based on InceptionV1 and ResNet50 are included). The x-axis is the publication year (2016-2021) and the y-axis is top-1 accuracy (%). Models marked with * are re-trained and evaluated (see Section 6.2) while others are from the existing literature. The size of a circle indicates the 1-clip FLOPs of a model. With temporal pooling turned off, I3D performs on par with the state-of-the-art approaches. Best viewed in color.]

1. Introduction

With the recent advances in convolutional neural networks (CNNs) [59, 24] and the availability of large-scale video benchmark datasets [31, 44], deep learning approaches have dominated the field of video action recognition by using 2D-CNNs [68, 38, 8], 3D-CNNs [2, 22, 10], or both [40, 57]. 2D CNNs perform temporal modeling independently of 2D spatial convolutions, while their 3D counterparts learn space and time information jointly by 3D convolution. These methods have achieved state-of-the-art performance on multiple large-scale benchmarks such as Kinetics [31] and Something-Something [20].

Although CNN-based approaches have made impressive progress in action recognition, several fundamental questions still largely remain unanswered in the field. For example, what contributes to the improved spatio-temporal representations of these recent approaches? Do these approaches enable more effective temporal modeling, the crux of the matter for action recognition? Furthermore, there seems to be no clear winner between 2D-CNN and 3D-CNN approaches in terms of accuracy: 3D models report better performance than 2D models on Kinetics, while the latter are superior on Something-Something. How differently do these two types of models behave with regard to spatial-temporal modeling of video data?

We argue that the difficulty of understanding the recent progress on action recognition is mainly due to the lack of fairness in performance evaluation related to datasets, backbones and experimental practices. In contrast to image recognition, where ImageNet [4] has served as a gold-standard benchmark for evaluation, there are at least 4∼5 popular action datasets widely used for evaluation (see Figure 2). While Kinetics-400 [31] has recently emerged as a primary benchmark for action recognition, it is known to be strongly biased towards spatial modeling, thus being
inappropriate for validating a model's capability of spatio-temporal modeling. In addition, there seems to be a tendency in current research to overly focus on pursuing state-of-the-art (SOTA) performance while overlooking other important factors such as the backbone networks and the number of input frames. For instance, I3D [2] based on 3D-InceptionV1 has become a "gatekeeper" baseline to compare with for any recently proposed approach to action recognition. However, such comparisons are often unfair against stronger backbones such as ResNet50 [24]. As shown in Figure 1, I3D, with ResNet50 as backbone, performs comparably with or outperforms many recent methods that are claimed to be better. As a result, such evaluations are barely informative w.r.t. whether the improved results of an approach come from a better backbone or the algorithm itself. As discussed in Section 3, performance evaluation in action recognition may be further confounded by many other issues such as variations in training and evaluation protocols, model inputs and pretrained models.

In light of the great need for better understanding of CNN-based action recognition models, in this paper we provide a common ground for comparative analysis of 2D-CNN and 3D-CNN models without any bells and whistles. We conduct comprehensive experiments and analysis to compare several representative 2D-CNN and 3D-CNN methods on three large-scale benchmark datasets. Our main goal is to deliver a deep understanding of the important questions brought up above, especially a) the current progress of action recognition and b) the differences between 2D-CNN and 3D-CNN methods w.r.t. spatial-temporal representations of video data. Our systematic analysis provides insights to researchers to understand the spatio-temporal effects of different action models across backbones and architectures, and will broadly stimulate discussions in the community regarding a very important but largely neglected issue of fair comparison in video action recognition.

The main contributions of our work are as follows:

• A Unified Framework for Action Recognition. We present a unified framework for 2D-CNN and 3D-CNN approaches and implement several representative methods for comparative analysis on three standard action recognition benchmark datasets.

• Spatio-Temporal Analysis. We systematically compare 2D-CNN and 3D-CNN models to better understand the differences and spatio-temporal behavior of these models. Our analysis leads to some interesting findings: a) Temporal pooling tends to suppress the efficacy of temporal modeling in an action model, but surprisingly provides a significant performance boost to TSN [68]; b) By removing non-structural differences between 2D-CNN and 3D-CNN models, they behave similarly in terms of spatio-temporal representation abilities and transferability.

• Benchmarking of SOTA Approaches. We thoroughly benchmarked several SOTA approaches and compared them with I3D. Our analysis reveals that I3D still stays on par with SOTA approaches in terms of accuracy (Figure 1) and that the recent advance in action recognition is mostly on the efficiency side, not on accuracy. Our analysis also suggests that the input sampling strategy taken by a model (i.e., uniform or dense sampling) should be considered for fairness when comparing two models (Section 6.2).

2. Related Work

Video understanding has made rapid progress with the introduction of a number of large-scale video datasets such as Kinetics [31], Sports1M [30], Moments-In-Time [44], and YouTube-8M [1]. A number of models introduced recently have emphasized the need to efficiently model spatio-temporal information for video action recognition.

Most successful deep architectures for action recognition are usually based on the two-stream model [54], processing RGB frames and optical flow in two separate CNNs with a late fusion in upper layers [30]. Two-stream approaches have been used in different action recognition methods [3, 6, 19, 75, 56, 63, 70, 66, 11, 12]. Another straightforward but popular approach is to use a 2D-CNN to extract frame-level features and then model the temporal causality. For example, TSN [68] proposes a consensus module to aggregate features, while TRN [77] uses a bag of features to model relationships between frames. TSM [38] shifts part of the channels along the temporal dimension, thereby allowing information to be exchanged among neighboring frames, whereas TAM [8] is based on depthwise 1×1 convolutions to capture temporal dependencies across frames effectively. Different methods for temporal aggregation of feature descriptors have also been proposed [13, 35, 73, 66, 48, 16, 15]. More complex approaches have also been investigated for capturing long-range dependencies, e.g., non-local neural networks [69].

Another approach is to use a 3D-CNN, which extends the success of 2D models in image recognition [28] to recognize actions in videos. For example, C3D [60] learns 3D ConvNets that outperform 2D CNNs through the use of large-scale video datasets. Many variants of 3D-CNNs have been introduced for learning spatio-temporal features, such as I3D [2] and ResNet3D [22]. 3D CNN features were also demonstrated to generalize well to other vision tasks, such as action detection [52], video captioning [45], action localization [47], and video summarization [46]. Nonetheless, as 3D convolution leads to a high computational load, a few works aim to reduce the complexity by decomposing the 3D convolution into a 2D spatial convolution and a 1D temporal convolution, e.g., P3D [50], S3D [72], R(2+1)D [62], by incorporating group convolution [61], or by using a
combination of 2D-CNN and 3D-CNN [79]. Furthermore, the SlowFast network employs two pathways to capture short-term and long-term temporal information [10] by processing a video at both slow and fast frame rates. Beyond that, Timeception applies the Inception concept in the temporal domain for capturing long-range temporal dependencies [26]. Feichtenhofer [9] finds efficient networks by extending 2D architectures through a stepwise expansion approach over key variables such as temporal duration, frame rate, spatial resolution, network width, etc. Leveraging weak supervision [14, 67, 33] or distillation [18] is another recent trend in action recognition. A few works have assessed the importance of temporal information in a video; e.g., Sigurdsson et al. analyzed performance per action category based on different levels of object complexity, verb complexity, and motion [53]. They state that, to differentiate temporally similar but semantically different videos, it is important for models to develop temporal understanding. Huang et al. analyzed the effect of motion via an ablation analysis on the C3D model [25]. Nonetheless, these works only study a limited set of backbones and temporal modeling methods.

3. Challenges of Evaluating Action Models

The first challenge in evaluating action models stems from the fact that, unlike ImageNet for image classification, action recognition does not have one dataset widely used in every paper. As shown in Figure 2, the most popular dataset, Kinetics-400, is used by around 60% of papers¹. On the other hand, Something-Something (V1 and V2), which has very different temporal characteristics from Kinetics-400, is also used by about 50% of papers. Furthermore, two successors of the Kinetics-400 dataset, Kinetics-600 and Kinetics-700, were released recently. It is difficult to evaluate different methods if they do not test on common datasets. We further check how those 37 papers [60, 68, 2, 17, 50, 78, 58, 77, 5, 76, 22, 65, 34, 79, 72, 62, 69, 23, 42, 29, 49, 38, 8, 36, 26, 41, 10, 61, 71, 39, 57, 51, 74, 7, 64, 9, 37] compare performance in their experiments. We evaluate those papers from four aspects: backbone, input length, training protocol and evaluation protocol. Figure 2 summarizes how differently papers compare to others.

¹ The Kinetics-400 dataset only became available in 2017; the usage rate increases to 69% if only papers published after 2017 are counted.

[Figure 2: Statistics collected from 37 action recognition papers from 2015 to 2020. Left: Used datasets. Right: Ratio of papers that used different settings to compare with others.]

Backbone. From our analysis, we observe that about 70% of papers compare results with different backbones (e.g., most of the papers use ResNet50 as the backbone but compare with I3D [2], which uses InceptionV1 as the backbone). Comparing action models with different types of backbones can often lead to incorrect conclusions, and also makes it harder to evaluate the advantage of the proposed temporal modeling. For example, using a stronger backbone for I3D improves the results by 4.0% on Kinetics-400 (see Figure 7).

Input Length. Figure 2 shows that about 80% of the papers use different numbers of frames for comparison. This is because each method may prefer a different number of frames; however, comparing under different numbers of frames could favor either the proposed method or the reference method.

Training Protocol. Due to recent advances in technology, it is now easier to train action recognition models for many epochs, which was not possible a few years ago, indicating that older methods might not be well-trained. Furthermore, many works reuse ImageNet weights to initialize their models while others do not. This raises the concern of whether a reported gain comes from a different training protocol. Based on our analysis, about 60% of the papers use different protocols to train action recognition models.

Evaluation Protocol. As models are trained under different sampling strategies and input lengths, a model typically takes more than one clip from a video for prediction. Hence, different evaluation protocols could lead to unclear comparisons. About 60% of papers evaluated models differently when comparing to others.

4. 2D-CNN and 3D-CNN Approaches

To address the above-mentioned issues for fair comparison, we analyze several popular 2D-CNN and 3D-CNN approaches for action recognition, including I3D [2], ResNet3D [21], S3D [72], R(2+1)D [62], TSN [68] and TAM [8]. These approaches not only yield competitive results on popular large-scale datasets, but also widely serve as fundamental building blocks for many other successive approaches such as SlowFast [10] and CSN [61].

Among these approaches, I3D and ResNet3D are pure 3D-CNN models, differing only in backbones. S3D and R(2+1)D factorize a 3D convolutional filter into a 2D spatial filter followed by a 1D temporal filter. In this sense, they are architecturally similar to 2D models. However, we categorize them as 3D-CNN models since their implementations are based on 3D convolutions. While TSN relies only on 2D convolution without temporal modeling, TAM, another 2D-CNN approach, adds efficient depthwise temporal aggregation on top of TSN, which shows strong results on Something-Something [8]. Finally, since SlowFast is arguably one of the best approaches on Kinetics, we use it
Approach | Backbone | Model Input | Input Sampling | Temporal Pooling | Spatial Module | Temporal Aggregation | Initial Weights
I3D [2] | InceptionV1 | 4D | Dense | Y | 3D Conv. | 3D Conv. | Inflation
R3D [21] | ResNet | 4D | Dense | Y | 3D Conv. | 3D Conv. | Inflation
S3D [72] | InceptionV1 | 4D | Dense | Y | 2D Conv. | 1D Conv. | Inflation
R(2+1)D [62] | ResNet | 4D | Dense | Y | 2D Conv. | 1D Conv. | Scratch
TAM [8] | bLResNet | 3D | Uniform | N | 2D Conv. | 1D dw Conv. | ImageNet
TSN [68] | InceptionV1 | 3D | Uniform | N | 2D Conv. | None | ImageNet

Table 1: 2D-CNN and 3D-CNN approaches in our study.

[Figure 3: A general framework for 2D-CNN and 3D-CNN approaches of video action recognition. A video action recognition model can be viewed as a sequence of stacked spatio-temporal modules. The input frames are formed as a 3D tensor for 2D models and a 4D tensor for 3D models.]

as a reference to SOTA results. Apart from using different types of convolutional kernels, 2D and 3D models differ in a number of other aspects, including model input, temporal pooling, and temporal aggregation, as shown in Table 1.

The differences between 2D-CNN and 3D-CNN approaches make it a challenge to compare these approaches. To remove the bells and whistles and ensure a fair comparison, we show in Figure 3 that 2D and 3D models can be represented by a general framework. Under such a framework, an action recognition model is viewed as a sequence of stacked spatio-temporal modules with temporal pooling optionally applied. Thus, what differentiates one model from another boils down to only its spatio-temporal module. We re-implemented all the approaches used in our comparison under this framework, which allows us to test an approach flexibly using different configurations such as backbone, temporal pooling and temporal aggregation. For example, in S3D-ResNet (i.e., R(2+1)D), we do not expand the channel dimension between the spatial and temporal convolution, to keep it aligned with S3D [72]. More details on the models and implementations can be found in the Supplemental.
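To make the shared framework concrete, the minimal PyTorch-style sketch below (our own illustration, not the released code) shows how the same batch of sampled frames can be routed to either model family: folded into the batch dimension as a 3D (C, H, W) tensor per frame for 2D models, or permuted into a 4D (C, T, H, W) volume per video for 3D models.

```python
import torch

def prepare_input(frames, model_type):
    """Illustrative sketch (not the authors' code): route the same batch of
    sampled frames to a 2D-CNN-style or a 3D-CNN-style model.

    frames: tensor of shape (B, T, C, H, W) -- B videos, T frames each.
    """
    B, T, C, H, W = frames.shape
    if model_type == "2d":
        # 2D models see a (B*T, C, H, W) batch of images; the temporal
        # module (e.g., TAM/TSM) later mixes information across the T frames.
        return frames.reshape(B * T, C, H, W)
    elif model_type == "3d":
        # 3D models expect a 4D spatio-temporal volume per video: (C, T, H, W).
        return frames.permute(0, 2, 1, 3, 4).contiguous()
    raise ValueError(model_type)

# Example: 8 frames of 224x224 RGB from a batch of 2 videos.
clip = torch.randn(2, 8, 3, 224, 224)
print(prepare_input(clip, "2d").shape)   # torch.Size([16, 3, 224, 224])
print(prepare_input(clip, "3d").shape)   # torch.Size([2, 3, 8, 224, 224])
```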
5. Datasets, Training, Evaluation Protocols

To ensure fair comparison and facilitate reproducibility, we train all the models using the same data preprocessing, training protocol, and evaluation protocol. Below we provide a brief description and refer the reader to the Supplemental for more details, including the source codes.

Datasets. We choose Something-Something V2 (SSV2), Kinetics-400 (Kinetics) and Moments-in-Time (MiT) for our experiments. We also create a mini version of each dataset: Mini-SSV2 and Mini-Kinetics account for half of their full datasets by randomly selecting half of the categories of SSV2 and Kinetics. Mini-MiT is provided on the official MiT website and consists of 1/8 of the videos in the full dataset.

Training. Following [8], we progressively train the models using different numbers of input frames. Let Ki ∈ {8, 16, 32, 64} where i = 1 . . . 4. We first train a starter model using 8 frames. The model is either inflated from (e.g., I3D) or initialized from (e.g., TAM) its corresponding ImageNet pre-trained model. We then finetune the model using more frames Ki, starting from the model trained with Ki−1 frames.
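The inflation step referred to above is usually just a replication of each 2D ImageNet kernel along a new temporal axis, rescaled to preserve the activation magnitude; a hedged sketch of that idea (our simplification, not the exact initialization code) is given below.

```python
import torch

def inflate_conv_weight(w2d, t):
    """Inflate a 2D conv kernel (out, in, kH, kW) into a 3D kernel
    (out, in, t, kH, kW) by replicating it t times along the temporal
    axis and dividing by t so activations keep roughly the same scale."""
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t

# Example: inflate a 3x3 ImageNet kernel into a 3x3x3 spatio-temporal kernel.
w2d = torch.randn(64, 3, 3, 3)            # e.g., a conv layer of a 2D ResNet
w3d = inflate_conv_weight(w2d, t=3)
print(w3d.shape)                           # torch.Size([64, 3, 3, 3, 3])
```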
Evaluation. There are two major evaluation metrics for video action recognition: clip-level accuracy and video-level accuracy. Clip-level accuracy is the prediction obtained by feeding a single clip into the network, while video-level accuracy combines the predictions of multiple clips; thus, video-level accuracy is usually higher than clip-level accuracy. By default, we report the clip-level accuracy.
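Assuming per-clip class scores are already available, the two metrics reduce to a small amount of tensor arithmetic; the sketch below (with hypothetical tensor names, not the released evaluation code) averages the scores of multiple clips to obtain the video-level prediction.

```python
import torch

def clip_level_accuracy(clip_scores, labels):
    # clip_scores: (N, C) scores from a single clip per video; labels: (N,)
    return (clip_scores.argmax(dim=1) == labels).float().mean().item()

def video_level_accuracy(multi_clip_scores, labels):
    # multi_clip_scores: (N, num_clips, C); average the scores over clips
    # before taking the argmax, which is why it is usually higher.
    avg_scores = multi_clip_scores.mean(dim=1)
    return (avg_scores.argmax(dim=1) == labels).float().mean().item()

# Toy example with 4 videos, 10 clips each, 5 classes.
scores = torch.randn(4, 10, 5)
labels = torch.tensor([0, 3, 2, 1])
print(clip_level_accuracy(scores[:, 0], labels))
print(video_level_accuracy(scores, labels))
```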
6. Experimental Results and Analysis

In this section, we provide a detailed analysis of the performance of 2D and 3D models (Section 6.1), SOTA results and transferability (Section 6.2), and their spatio-temporal effects (Section 6.3). For clarity, from now onwards, we refer to each of I3D, S3D and TAM as one type of spatio-temporal module illustrated in Figure 3. We name a specific model by module-backbone[-tp], where tp indicates that temporal pooling is applied. For example, I3D-ResNet18-tp is a 3D model based on ResNet18 with temporal pooling. To verify the correctness of our implementation, we trained an I3D-InceptionV1 model following the original paper [2] and found that our model achieves 73.1% top-1 accuracy, which is 2% better than the result reported in the original paper. This confirms that the results obtained with our setup are reliable.

6.1. Performance Analysis on Mini Datasets

For each spatio-temporal module, we experiment with 3 backbones (InceptionV1, ResNet18 and ResNet50) and two scenarios (w/ and w/o temporal pooling) on three datasets. In each case, 8, 16, 32 and 64 frames are considered as input. This results in a total of 4 × 3 × 2 × 3 × 4 = 288 models to train, many of which haven't been explored in the original papers. Since temporal pooling is detrimental to model performance (see Figure 6), our analysis in this work mainly focuses on models w/o temporal pooling unless otherwise specified. Figure 4 reports the clip-level top-1 accuracies w/o temporal pooling for all models. We refer readers to the Supplemental for the results w/ temporal pooling.

Backbone Network and Input Length. As seen from Figure 4, regardless of the spatio-temporal modules used, there is a general tendency that ResNet50 > InceptionV1 > ResNet18 w.r.t. their overall spatio-temporal representation capability. Longer input frames tend to produce better results; however, the performance improvement does not
seem significant after 32 frames on all three datasets.

[Figure 4: Top-1 accuracy of all the compared models without temporal pooling on three mini-datasets. The video architectures are separated by color while the backbones by symbol.]

[Figure 5: Performance comparison between Uniform Sampling (U) and Dense Sampling (D) for (a) I3D-ResNet18 and (b) TAM-ResNet18. Both models do not include temporal pooling. Solid bars are the clip-level accuracy while transparent bars indicate the improvement by the video-level (multi-clip) evaluation.]

Input Sampling. Two sampling strategies are widely adopted in action recognition to create model inputs. The first one, uniform sampling, which is often seen in 2D models, divides a video into multiple equal-length segments and then randomly selects one frame from each segment. The other method, dense sampling, used by 3D models, instead directly takes a set of continuous frames as the input. It is not clear, though, why these two types of models prefer different inputs. Figure 5 shows that uniform sampling (blue) yields better clip-level accuracies than dense sampling (orange) under all circumstances. This is not surprising, as dense sampling only uses part of the test video in the clip-level evaluation. Even though the video-level evaluation boosts the performance of dense sampling by 6%∼15% on Mini-Kinetics and 5%∼20% on Mini-SSV2, its computational cost increases proportionally; e.g., using 10 clips as in Figure 5 to obtain the video-level accuracy increases the FLOPs tenfold. Such costs make it inappropriate in practice. Thus, all our analysis is based on uniform sampling and clip-level evaluation unless otherwise stated. We will further analyze the effect of input sampling strategies in Section 6.2 based on the results from the full datasets.
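The sketch below (our own illustration with assumed parameter names) makes the two strategies concrete by computing the frame indices each one would read from a video; dense sampling additionally uses a frame stride, which is set to 2 here purely as an assumption.

```python
import random

def uniform_sampling(num_frames, num_segments):
    """Split the video into equal-length segments and pick one random
    frame per segment (the strategy commonly used by 2D models)."""
    seg_len = num_frames / num_segments
    return [int(i * seg_len + random.random() * seg_len) for i in range(num_segments)]

def dense_sampling(num_frames, clip_len, stride=2):
    """Take a set of continuous frames (with a fixed stride) starting at a
    random position (the strategy commonly used by 3D models)."""
    span = clip_len * stride
    start = random.randint(0, max(0, num_frames - span))
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]

random.seed(0)
print(uniform_sampling(num_frames=300, num_segments=8))  # spread over the whole video
print(dense_sampling(num_frames=300, clip_len=8))        # a short contiguous snippet
```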
[Figure 6: Accuracy gain of the models with temporal pooling w.r.t. the models without temporal pooling. Temporal pooling significantly hurts the performance of all models except TSNs.]

Temporal Pooling. Temporal pooling is usually applied to 3D models to reduce computational complexity. It is known that temporal pooling negatively affects model performance. Such effects, however, have not been well understood in the literature. Figure 6 shows the performance gaps between models w/ and w/o temporal pooling across different backbones and architectures. As can be seen, temporal pooling in general counters the effectiveness of temporal modeling and hurts the performance of action models, just like what spatial pooling does to object recognition and detection. For this reason, more recent 3D-CNN approaches such as SlowFast [10] and X3D [9] drop temporal pooling and rely on other techniques for reducing computation. Similarly, one important reason for the prior finding in [27] that 3D models are inferior to C2D (pure spatial models) on Kinetics and MiT is that their comparisons neglect the negative impact of temporal pooling on 3D models. As shown in Section 6.2, I3D w/o temporal pooling is competitively comparable with the SOTA approaches.
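Concretely, temporal pooling in a 3D backbone is typically just a pooling layer whose kernel spans the time axis; the brief sketch below (our illustration, not the authors' implementation) halves the temporal length of a feature map while leaving the spatial resolution untouched.

```python
import torch
import torch.nn as nn

# A feature map from a 3D-CNN stage: (batch, channels, T, H, W).
feat = torch.randn(2, 256, 16, 28, 28)

# Temporal pooling: stride 2 along T only; spatial dims are unchanged.
temporal_pool = nn.MaxPool3d(kernel_size=(2, 1, 1), stride=(2, 1, 1))
print(temporal_pool(feat).shape)   # torch.Size([2, 256, 8, 28, 28])

# "Turning temporal pooling off" simply keeps the full temporal resolution
# (an identity here), at the cost of more computation in later layers.
```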
Interestingly, TSN is the only architecture benefiting from temporal pooling, demonstrating a large boost in performance on Mini-SSV2 (>20%) and Mini-MiT (3%∼5%). Also, as the number of input frames increases, the improvement is more pronounced. On Mini-Kinetics, even though TSN is also negatively affected by temporal pooling, it suffers the least and starts seeing positive gains after 32 frames. To further confirm this, we trained a 32-frame TSN model with temporal pooling on Kinetics. This model (TSN-R50∗ in Figure 1) achieves a top-1 accuracy of 74.9%, 5.1% higher than the version w/o temporal pooling and only about 2.0% shy of the SOTA results. We interpret temporal pooling as a simple form of exchanging information across frames, which empowers TSN with the ability of temporal modeling. The consistent improvements by temporal pooling across all the datasets provide strong evidence that temporal modeling is necessary for video action recognition, even for datasets like Kinetics where temporal information has been shown to be less crucial for recognition.

6.2. Benchmarking of SOTA Approaches

Results on Full Datasets. I3D based on InceptionV1 has been used as an important baseline by many papers to show-
[Figure 7: Performance of I3D models by changing the backbone (I3D-R50-tp), removing temporal pooling (I3D-R50) and adding squeeze-excitation modules (I3D-SE-R50) on Kinetics and SSV2. Red numbers indicate performance changes. All models are trained with 32 frames and evaluated using 3 × 10 clips on Kinetics, and 3 × 2 clips on SSV2, respectively.]

Model | Pretrain dataset | FLOPs | Kinetics | SSV2 | MiT
I3D-ResNet50 | ImageNet | 335.3G | 76.61 | 62.84 | 31.21
I3D-ResNet50 | None | 335.3G | 76.54 | − | −
TAM-ResNet50 | ImageNet | 171.5G | 76.18 | 63.83 | 30.80
SlowFast-ResNet50-8×8† [10] | None∗ | 65.7G | 76.40 | 60.10 | 31.20
I3D-ResNet101 | ImageNet | 654.7G | 77.80 | 64.29 | −
TAM-ResNet101 | ImageNet | 327.1G | 77.61 | 65.32 | −
SlowFast-ResNet50-8×8‡ [10] | None∗ | 65.7G | 77.00 | − | −
SlowFast-ResNet50-16×8‡ [10] | Kinetics | 124.5G | − | 63.0 | −
CorrNet-ResNet50‡ [64] | None∗ | 115G | 77.20 | − | −
SlowFast-ResNet101-8×8† [10] | None | 125.9G | 76.72 | − | −
SlowFast-ResNet101-8×8‡ [10] | None | 125.9G | 78.00 | − | −
SlowFast-ResNet101-16×8‡ [10] | None | 213G | 78.90 | − | −
CSN-ResNet101‡ [61] | None∗ | 83G | 76.70 | − | −
CorrNet-ResNet101‡ [64] | None∗ | 224G | 79.20 | − | −
X3D-L‡ [9] | None∗ | 24.8G | 77.50 | − | −
X3D-XL‡ [9] | None∗ | 48.4G | 79.10 | − | −
AssembleNet-50¹ [51] | − | − | − | − | 31.41
GST-ResNet101 [36] | ImageNet | − | − | − | 32.40

∗: These networks cannot be initialized from ImageNet due to their structure. †: Retrained by ourselves. ‡: Reported by the authors of the paper. ¹: Uses RGB + Flow.

Table 2: Performance of SOTA models.

Model | Pretrain | U-Sampling | D-Sampling
I3D-ResNet50 | ImageNet | 76.07 | 76.61
TAM-ResNet50 | ImageNet | 76.45 | 76.18
SlowFast-ResNet50-8×8 | − | 71.85 | 76.40

Table 3: Model performance on Kinetics based on uniform and dense sampling. Models trained with uniform sampling are evaluated under 3 256×256 spatial crops and 2 clips.

case their progress. However, the results of I3D on the mini datasets, especially the unexpectedly significant impact of temporal pooling, seem to suggest that the spatio-temporal modeling capability of I3D has been underestimated by the field. To more precisely understand the recent progress in action recognition, we further conduct a more rigorous benchmarking effort including I3D, TAM and SlowFast on the full datasets. I3D was the prior SOTA method, while SlowFast [10] and TAM [8], both of which have official codes released, are competitively comparable with existing SOTA methods. To ensure an apple-to-apple comparison, we follow the same training settings as SlowFast to train all the models using 32 frames as input. During evaluation, we use 3 × 10 clips for Kinetics and MiT, and 3 × 2 clips for SSV2.

We first augment the original I3D with a stronger backbone (ResNet50) and turn off temporal pooling. As shown in Figure 7, ResNet50 alone pushes up the accuracy of I3D by 4.0% on Kinetics, and removing temporal pooling adds another 1.1% performance gain, putting I3D on par with SlowFast in terms of top-1 accuracy. Further inserting Squeeze-Excitation modules into I3D makes it surpass SlowFast by 0.8%. On SSV2, a stronger backbone provides I3D little benefit in accuracy, but removing temporal pooling boosts the performance substantially by 6%, making I3D comparable to TAM. Table 2 provides more detailed results from this experiment. In summary, I3D-ResNet50 demonstrates impressive results, staying on par with state-of-the-art approaches in accuracy on all three datasets. The fact that I3D remains very strong across multiple large-scale datasets suggests that the recent progress of action recognition in terms of accuracy is largely attributed to the use of more powerful backbone networks, and not to improved spatio-temporal modeling as expected. Nevertheless, we do observe that recent approaches such as X3D [9] have made a large leap ahead in efficiency (FLOPs) compared to I3D. Moreover, SlowFast performs worse than I3D and TAM on SSV2. We speculate that this could be related to: (I) the slow pathway only using temporal convolutions after stage4 of ResNet, which weakens its temporal modeling capability; and (II) the two-stream architecture being less effective in capturing temporal dependencies in such a highly temporal dataset.

Uniform Sampling vs. Dense Sampling. We revisit the effect of input sampling on model performance and retrain all three approaches using uniform sampling on Kinetics. As shown in Table 3, the small difference between uniform and dense sampling results indicates that both I3D and TAM are flexible w.r.t. model input. In contrast, uniform sampling is not as friendly as dense sampling to SlowFast, producing an accuracy ∼5% lower than dense sampling. We conjecture that this has to do with the dual-path architecture of SlowFast. Such an architecture is primarily designed for efficiency and is possibly less effective in learning spatial-temporal representations from sparsely sampled frames (i.e., 8-frame uniform sampling in this case). This also explains why SlowFast, when trained with uniform sampling, underperforms by 2%∼3% on SSV2 in Table 2 in contrast to I3D and TAM.

Furthermore, Figure 8 (Left) shows model accuracy vs. the number of clips used for evaluation in uniform and dense sampling, respectively. As can be observed, the model performance with dense sampling saturates quickly after 4-5 clips for both I3D and SlowFast. This suggests that the common practice in the literature of using 10 clips for dense sampling is often not necessary. As opposed to dense sampling, uniform sampling benefits slightly (i.e., for SlowFast) or little from multiple clips. This raises another pitfall
that is largely overlooked by the community when assessing model efficiency, i.e., the impact of input sampling. As shown in Figure 8 (Right), when putting I3D and SlowFast in a plot of accuracy vs. FLOPs for comparison, the advantage of SlowFast over I3D is better and more fairly represented; i.e., when considering uniform sampling for I3D, SlowFast is only slightly more accurate but at the same efficiency in FLOPs. This clearly suggests that the input sampling strategy of a model (i.e., uniform or dense) should be factored into the evaluation for fairness when comparing it to another model.

[Figure 8: Model performance tested using 3 256×256 spatial crops and different numbers of clips. 'U': uniform sampling; 'D': dense sampling. Best viewed in color.]

Model | UCF101 | HMDB51 | Jester | Mini-SSV2
I3D-ResNet50 | 97.12 | 72.32 | 96.39 | 65.86
TAM-ResNet50 | 95.05 | 71.67 | 96.35 | 66.91
SlowFast-ResNet50-8×8 | 95.67 | 74.61 | 96.75 | 63.93

Table 4: Top-1 Acc. of the transferability study from Kinetics (columns are the target datasets).

Model Transferability. We further compare the transferability of the three models trained above on four small-scale datasets including UCF101 [55], HMDB51 [32], Jester [43], and Mini-SSV2. We follow the same training setting in Section 5 and finetune for 45 epochs with a cosine annealing learning rate schedule starting at 0.01; furthermore, since these are 32-frame models, we train them with a batch size of 48 and synchronized batch normalization. Table 4 shows the results, indicating that all three models have very similar performance (differences of less than 2%) on the downstream tasks. In particular, I3D performs on par with SOTA approaches like TAM and SlowFast in transfer learning (e.g., I3D obtains the best accuracy of 97.12% on UCF101), which once again corroborates the fact that the reported improvements are largely due to the use of stronger backbones.
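A minimal sketch of this finetuning recipe is given below; the 45 epochs, cosine annealing schedule and base learning rate of 0.01 come from the text, while the momentum and weight decay values are illustrative assumptions.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_finetune_schedule(model, epochs=45, base_lr=0.01):
    """Finetuning setup for the transfer study: 45 epochs with a cosine
    annealing learning-rate schedule starting at 0.01. Momentum and weight
    decay below are illustrative assumptions, not reported numbers."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

# Usage sketch: step the scheduler once per epoch of target-dataset training.
model = torch.nn.Linear(512, 101)  # stand-in for a pretrained video model head
optimizer, scheduler = build_finetune_schedule(model)
for epoch in range(45):
    # ... one training epoch (batch size 48, synchronized batch norm) ...
    scheduler.step()
```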
6.3. Analysis of Spatio-temporal Effects

It is generally believed that temporal modeling is the core of action recognition and that state-of-the-art approaches can capture better temporal information. However, it has also been demonstrated on datasets such as Kinetics and Moments-in-Time (MiT) [44] that approaches purely based on spatial modeling [68, 44] can achieve very competitive results compared to more sophisticated spatio-temporal models. More recently, a paper [27] also shows that 2D models outperform their 3D counterparts on the MiT benchmark. These findings seem to imply that more complex temporal modeling is not necessary for "static" datasets such as Kinetics and MiT. We believe that the lack of fairness in performance evaluation leads to confusion in understanding the significance of temporal modeling for action recognition.

Dataset | Frames | None | I3D | Conv. | TAM | None | I3D | Conv. | TAM | TSM | NLN
Mini-SSV2 | f=8 | 33.1 | 56.4 | 58.2 | 59.7 | 33.9 | 62.6 | 61.6 | 65.4 | 64.1 | 53.0
Mini-SSV2 | f=16 | 34.7 | 61.8 | 63.7 | 63.9 | 35.3 | 66.2 | 65.7 | 68.6 | 67.4 | 55.0
Mini-Kinetics | f=8 | 70.4 | 68.1 | 68.3 | 68.8 | 72.1 | 73.3 | 71.5 | 74.1 | 74.1 | 73.7
Mini-Kinetics | f=16 | 70.5 | 70.9 | 70.7 | 70.0 | 72.5 | 75.5 | 73.4 | 76.4 | 75.6 | 74.5

Table 5: Performance of different temporal aggregation strategies w/o temporal pooling. The first four aggregation columns (None, I3D, Conv., TAM) use InceptionV1 and the remaining six (None, I3D, Conv., TAM, TSM, NLN) use ResNet50. FLOPs and parameters of different models can be found in the supplementary material.

Temporal Aggregation. The essence of temporal modeling is how it aggregates temporal information. The 2D architecture offers great flexibility in temporal modeling. For example, TSM [38] and TAM [8] can be easily inserted into a CNN for learning spatio-temporal features. Here we analyze several basic temporal aggregations on top of the 2D architecture, including 1D convolution (Conv, i.e., S3D [72]), 1D depthwise convolution (dw Conv, i.e., TAM), and TSM. We also consider the non-local network module (NLN) [69] for its ability to capture long-range temporal dependencies in video; we add 3 NLN modules and 2 NLN modules at stage 2 and stage 3 of TSN-ResNet50, respectively, as in [69].
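To ground these aggregation choices, the sketch below gives simplified PyTorch versions of two of them, a TSM-style channel shift and a TAM-style depthwise 1D temporal convolution, both operating on features with an explicit temporal axis (our own illustration, not the released modules).

```python
import torch
import torch.nn as nn

def tsm_shift(x, fold_div=8):
    """TSM-style aggregation: shift a fraction of channels one step
    forward/backward in time. x: (B, T, C, H, W)."""
    B, T, C, H, W = x.shape
    fold = C // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # remaining channels untouched
    return out

class DepthwiseTemporalConv(nn.Module):
    """TAM-style aggregation: a depthwise 1D convolution over time,
    applied independently per channel and per spatial location."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)

    def forward(self, x):                                   # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, C, T)
        y = self.conv(y)
        return y.reshape(B, H, W, C, T).permute(0, 4, 3, 1, 2)

x = torch.randn(2, 8, 64, 14, 14)
print(tsm_shift(x).shape, DepthwiseTemporalConv(64)(x).shape)
```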
Table 5 shows the results of using different temporal aggregations, as well as those of TSN (i.e., w/o any temporal aggregation), on InceptionV1 and ResNet50. The results suggest that effective temporal modeling is required for achieving competitive results, even on datasets such as Kinetics where temporal information is thought to be non-essential for recognition. On the other hand, TAM and TSM, while being simple and efficient, demonstrate better performance than I3D, regular 1D convolution and the NLN module, which have more parameters and FLOPs. We argue that this is because the frames sampled under uniform sampling are sparse, making 3D convolution less suitable for modeling temporal information across them, whereas TAM and TSM use depthwise operations that model temporal information more effectively, since they consider each feature map across frames separately instead of combining all channels of the frames at once. We also find the same pattern on full Kinetics in Table 3. Interestingly, the NLN module does not perform as expected on Mini-SSV2. This is possibly because NLN models temporal dependencies through matching spatial features between frames, which are weak in Mini-SSV2.

Locations of Temporal Modules. In [72] and [62], some preliminary analysis w.r.t. the effect of the locations of temporal modules in 3D models was performed on Kinetics-400. In this experiment, we conduct a similar experiment on both Mini-Kinetics and Mini-SSV2 to understand if this holds for 2D models. We modified TAM-ResNet18 in a number
of different ways by keeping: a) half of the temporal modules only in the bottom network layers (Bottom-Half); b) half of the temporal modules only in the top network layers (Top-Half); c) every other temporal module (Uniform-Half); and d) all the temporal modules (All). As observed in Table 6, only half of the temporal modules (Top-Half) are needed to achieve the best accuracy on Mini-SSV2, while the accuracy on Mini-Kinetics is not sensitive to the number and locations of temporal modules. It is thus interesting to explore whether this insightful observation can lead to an efficient but effective video architecture by mixing 2D and 3D modeling, similar to the idea of ECO in [79].

# of TAMs | Locations | Mini-SSV2 | Mini-Kinetics
8 | All | 59.1 | 69.08
4 | Top-half | 59.7 | 69.21
4 | Bottom-half | 56.5 | 69.27
4 | Uniform-half | 59.4 | 69.14

Top and bottom mean the residual blocks closer to the output and the input, respectively.

Table 6: Performance comparison using different numbers and locations of TAMs in ResNet18 (w/o temporal pooling).
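One simple way to realize these variants is to decide, when building the backbone, which residual blocks receive a temporal module; the sketch below (a hypothetical helper, not the paper's code) produces that selection for the four placements studied here.

```python
def temporal_module_mask(num_blocks, placement):
    """Return one boolean per residual block indicating whether it gets a
    temporal module. 'Top' means blocks closer to the output."""
    half = num_blocks // 2
    if placement == "all":
        return [True] * num_blocks
    if placement == "top-half":
        return [i >= half for i in range(num_blocks)]
    if placement == "bottom-half":
        return [i < half for i in range(num_blocks)]
    if placement == "uniform-half":
        return [i % 2 == 0 for i in range(num_blocks)]
    raise ValueError(placement)

# ResNet18 has 8 residual blocks in total (2 per stage).
print(temporal_module_mask(8, "top-half"))  # temporal modules in the last 4 blocks
```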
Disentangling Spatial and Temporal Effects. So far we have only looked at the overall spatio-temporal effects of a model (i.e., top-1 accuracy) in our analysis. Here we further disentangle the spatial and temporal contributions of a model to understand its ability of spatio-temporal modeling. Doing so provides great insights into which information, spatial or temporal, is more essential to recognition. We treat TSN w/o temporal pooling as the baseline spatial model, as it does not model temporal information. TSN can evolve into different types of spatio-temporal models by adding temporal modules on top of it. With this, we compute the spatial and temporal contributions of a model as follows. Let S^b_a(k) be the accuracy of a model of some architecture a that is based on a backbone b and takes k frames as input. For instance, S^ResNet50_I3D(16) is the accuracy of a 16-frame I3D-ResNet50 model. Then the spatial contribution Φ^b_a and the temporal improvement Ψ^b_a of a model (k is omitted here for clarity) are given by

Φ^b_a = S^b_TSN / max(S^b_a, S^b_TSN),
Ψ^b_a = (S^b_a − S^b_TSN) / (100 − S^b_TSN).    (1)

Note that Φ^b_a is between 0 and 1; Ψ^b_a < 0 indicates that temporal modeling is harmful to model performance. We further combine Φ^b_a and Ψ^b_a across all models with different backbone networks to obtain the average spatial and temporal contributions of a network architecture, as shown below:

Φ̄_a = (1/Z_Φ) Σ_{b∈B} Σ_{k∈K} Φ^b_a(k),    Ψ̄_a = (1/Z_Ψ) Σ_{b∈B} Σ_{k∈K} Ψ^b_a(k),    (2)

where B = {InceptionV1, ResNet18, ResNet50}, K = {8, 16, 32, 64}, and Z_Φ and Z_Ψ are the normalization factors.
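Given per-model top-1 accuracies, these quantities are straightforward to compute; the sketch below evaluates Eq. (1) and the averaging of Eq. (2) using the Mini-SSV2 TAM and TSN ("None") accuracies from Table 5 for two backbones and two frame lengths (the paper's full average also covers ResNet18 and 32/64 frames).

```python
def spatial_contribution(acc_model, acc_tsn):
    # Phi in Eq. (1): how much of the accuracy the spatial baseline explains.
    return acc_tsn / max(acc_model, acc_tsn)

def temporal_improvement(acc_model, acc_tsn):
    # Psi in Eq. (1): relative gain over the TSN baseline; negative values
    # mean temporal modeling hurt the model.
    return (acc_model - acc_tsn) / (100.0 - acc_tsn)

# Accuracies acc[backbone][frames] from Table 5 (Mini-SSV2): TAM vs. "None" (TSN).
acc_tam = {"InceptionV1": {8: 59.7, 16: 63.9}, "ResNet50": {8: 65.4, 16: 68.6}}
acc_tsn = {"InceptionV1": {8: 33.1, 16: 34.7}, "ResNet50": {8: 33.9, 16: 35.3}}

pairs = [(acc_tam[b][k], acc_tsn[b][k]) for b in acc_tam for k in acc_tam[b]]
phi_bar = sum(spatial_contribution(m, t) for m, t in pairs) / len(pairs)
psi_bar = sum(temporal_improvement(m, t) for m, t in pairs) / len(pairs)
print(round(phi_bar, 2), round(psi_bar, 2))  # roughly 0.53 and 0.46, close to Table 7
```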
Datasets | Metrics | I3D | S3D | TAM
Mini-SSV2 | Φ̄_a | 0.53 | 0.53 | 0.52
Mini-SSV2 | Ψ̄^ta_a | 0.46 | 0.45 | 0.47
Mini-SSV2 | Ψ̄^{ta+tp}_a | 0.38 | 0.38 | 0.37
Mini-Kinetics | Φ̄_a | 0.97 | 0.97 | 0.96
Mini-Kinetics | Ψ̄^ta_a | 0.06 | 0.08 | 0.09
Mini-Kinetics | Ψ̄^{ta+tp}_a | -0.08 | -0.10 | -0.12
Mini-MiT | Φ̄_a | 0.89 | 0.91 | 0.87
Mini-MiT | Ψ̄^ta_a | 0.04 | 0.03 | 0.04
Mini-MiT | Ψ̄^{ta+tp}_a | 0.02 | 0.02 | 0.04

Ψ̄^ta_a: the improvement from temporal aggregation only. Ψ̄^{ta+tp}_a: the improvement from combining temporal aggregation and temporal pooling.

Table 7: Effects of spatio-temporal modeling.

Table 7 shows the results of Φ̄_a and Ψ̄_a for the three spatio-temporal representations. All three representations behave similarly; namely, their spatial modeling contributes slightly more than temporal modeling on Mini-SSV2, much more on Mini-MiT, and dominantly on Mini-Kinetics. This convincingly explains why a model lacking temporal modeling, like TSN, can perform well on Mini-Kinetics but fails badly on Mini-SSV2. Note that similar observations have been made in the literature, but not in a quantitative way like ours. Furthermore, while all the approaches indicate the utmost importance of spatial modeling on Mini-Kinetics, the results of Ψ̄^ta_a suggest that temporal modeling is more effective on Mini-Kinetics than on Mini-MiT for both 2D and 3D approaches. We also observe that temporal pooling deters the effectiveness of temporal modeling for all the approaches, as the results of Ψ̄^{ta+tp}_a are consistently lower than Ψ̄^ta_a. Such damage is especially substantial on Mini-Kinetics, as indicated by the negative values of Ψ̄^{ta+tp}_a.

7. Conclusion

In this paper, we conducted a comprehensive comparative analysis of several representative CNN-based video action recognition approaches with different backbones and temporal aggregations. Our extensive analysis enables a better understanding of the differences and spatio-temporal effects of 2D-CNN and 3D-CNN approaches. It also provides significant insights with regard to the efficacy of spatio-temporal representations for action recognition.

Acknowledgments. This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via DOI/IBC contract number D17PC00341. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. This work is also supported by the MIT-IBM Watson AI Lab.

Disclaimer. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.
References [17] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic,
and Bryan Russell. Actionvlad: Learning spatio-temporal
[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Nat- aggregation for action classification. In Proceedings of the
sev, George Toderici, Balakrishnan Varadarajan, and Sud- IEEE Conference on Computer Vision and Pattern Recogni-
heendra Vijayanarasimhan. Youtube-8m: A large-scale tion (CVPR), July 2017. 3
video classification benchmark. arXiv:1609.08675, 2016. 2
[18] Rohit Girdhar, Du Tran, Lorenzo Torresani, and Deva Ra-
[2] Joao Carreira and Andrew Zisserman. Quo vadis, action
manan. Distinit: Learning video representations without a
recognition? a new model and the kinetics dataset. In CVPR,
single labeled video. arXiv:1901.09244, 2019. 3
pages 6299–6308, 2017. 1, 2, 3, 4
[19] Georgia Gkioxari and Jitendra Malik. Finding action tubes.
[3] Guilhem Chéron, Ivan Laptev, and Cordelia Schmid. P-cnn:
In CVPR, pages 759–768, 2015. 2
Pose-based cnn features for action recognition. In ICCV,
[20] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal-
pages 3218–3226, 2015. 2
ski, Joanna Materzynska, Susanne Westphal, Heuna Kim,
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz
and Li Fei-Fei. Imagenet: A large-scale hierarchical image
Mueller-Freitag, et al. The” something something” video
database. In 2009 IEEE conference on computer vision and
database for learning and evaluating visual common sense.
pattern recognition, pages 248–255. Ieee, 2009. 1
In ICCV, 2017. 1
[5] Ali Diba, Mohsen Fayyaz, Vivek Sharma, M. Mahdi Arzani,
[21] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Learn-
Rahman Yousefzadeh, Juergen Gall, and Luc Van Gool.
ing spatio-temporal features with 3d residual networks for
Spatio-temporal channel correlation networks for action
action recognition. In ICCV, pages 3154–3160, 2017. 3, 4
classification. In ECCV, September 2018. 3
[6] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, [22] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can
Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Spatiotemporal 3D CNNs Retrace the History of 2D CNNs
and Trevor Darrell. Long-term recurrent convolutional net- and ImageNet? In CVPR, June 2018. 1, 2, 3
works for visual recognition and description. In CVPR, June [23] Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu,
2015. 2 Yandong Li, Limin Wang, and Shilei Wen. StNet: Local
[7] Linxi Fan, Shyamal Buch, Guanzhi Wang, Ryan Cao, Yuke and Global Spatial-Temporal Modeling for Action Recogni-
Zhu, Juan Carlos Niebles, and Li Fei-Fei. RubiksNet: Learn- tion. Proceedings of the AAAI Conference on Artificial Intel-
able 3D-Shift for Efficient Video Action Recognition. In ligence, 33(01):8401–8408, July 2019. 3
Proceedings of the European Conference on Computer Vi- [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
sion (ECCV), 2020. 3 Deep Residual Learning for Image Recognition. In CVPR,
[8] Quanfu Fan, Chun-Fu (Ricarhd) Chen, Hilde Kuehne, Marco June 2016. 1, 2
Pistoia, and David Cox. More Is Less: Learning Efficient [25] De-An Huang, Vignesh Ramanathan, Dhruv Mahajan,
Video Representations by Temporal Aggregation Modules. Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, and Juan
In NeurIPS, 2019. 1, 2, 3, 4, 6, 7 Carlos Niebles. What makes a video a video: Analyz-
[9] Christoph Feichtenhofer. X3d: Expanding architectures for ing temporal information in video understanding models and
efficient video recognition. In CVPR, June 2020. 3, 5, 6 datasets. In CVPR, pages 7366–7375, 2018. 3
[10] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and [26] Noureldien Hussein, Efstratios Gavves, and Arnold W.M.
Kaiming He. Slowfast networks for video recognition. Smeulders. Timeception for complex action recognition. In
arXiv:1812.03982, 2018. 1, 3, 5, 6 CVPR, June 2019. 3
[11] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. [27] Matthew Hutchinson, Siddharth Samsi, William Arcand,
Spatiotemporal residual networks for video action recogni- David Bestor, Bill Bergeron, Chansup Byun, Micheal Houle,
tion. In NeurIPS, pages 3468–3476, 2016. 2 Matthew Hubbell, Micheal Jones, Jeremy Kepner, et al. Ac-
[12] Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes. curacy and performance comparison of video action recog-
Spatiotemporal multiplier networks for video action recog- nition approaches. arXiv:2008.09037, 2020. 5, 7
nition. In CVPR, pages 4768–4777, 2017. 2 [28] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neu-
[13] Basura Fernando, Efstratios Gavves, Jose M Oramas, Amir ral networks for human action recognition. IEEE TPAMI,
Ghodrati, and Tinne Tuytelaars. Modeling video evolution 35(1):221–231, Jan 2013. 2
for action recognition. In CVPR, pages 5378–5387, 2015. 2 [29] Boyuan Jiang, MengMeng Wang, Weihao Gan, Wei Wu, and
[14] Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. Large- Junjie Yan. Stm: Spatiotemporal and motion encoding for
scale weakly-supervised pre-training for video action recog- action recognition. In Proceedings of the IEEE/CVF Inter-
nition. In CVPR, pages 12046–12055, 2019. 3 national Conference on Computer Vision (ICCV), October
[15] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zis- 2019. 3
serman. Video action transformer network. In CVPR, pages [30] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas
244–253, 2019. 2 Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video
[16] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, classification with convolutional neural networks. In CVPR,
and Bryan Russell. Actionvlad: Learning spatio-temporal pages 1725–1732, 2014. 2
aggregation for action classification. In CVPR, pages 971– [31] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang,
980, 2017. 2 Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola,

Tim Green, Trevor Back, Paul Natsev, et al. The kinetics [46] Rameswar Panda and Amit K Roy-Chowdhury. Collabora-
human action video dataset. arXiv:1705.06950, 2017. 1, 2 tive summarization of topic-related videos. In CVPR, pages
[32] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. 7083–7092, 2017. 2
HMDB: a large video database for human motion recogni- [47] Sujoy Paul, Sourya Roy, and Amit K Roy-Chowdhury. W-
tion. In ICCV, 2011. 7 talc: Weakly-supervised temporal activity localization and
[33] Hilde Kuehne, Alexander Richard, and Juergen Gall. Weakly classification. In ECCV, pages 563–579, 2018. 2
supervised learning of actions from transcripts. Computer [48] Xiaojiang Peng, Changqing Zou, Yu Qiao, and Qiang Peng.
Vision and Image Understanding, 163:78–89, 2017. 3 Action recognition with stacked fisher vectors. In ECCV,
[34] Myunggi Lee, Seungeui Lee, Sungjoon Son, Gyutae Park, pages 581–595. Springer, 2014. 2
and Nojun Kwak. Motion Feature Network: Fixed Mo- [49] AJ Piergiovanni and Michael S. Ryoo. Representation flow
tion Filter for Action Recognition. In Vittorio Ferrari, Mar- for action recognition. In Proceedings of the IEEE/CVF
tial Hebert, Cristian Sminchisescu, and Yair Weiss, edi- Conference on Computer Vision and Pattern Recognition
tors, Computer Vision – ECCV 2018, pages 392–408, Cham, (CVPR), June 2019. 3
2018. Springer International Publishing. 3 [50] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-
[35] Guy Lev, Gil Sadeh, Benjamin Klein, and Lior Wolf. Rnn temporal representation with pseudo-3d residual networks.
fisher vectors for action recognition and image annotation. In ICCV, Oct 2017. 2, 3
In ECCV, pages 833–850. Springer, 2016. 2 [51] Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, and
[36] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Collab- Anelia Angelova. Assemblenet: Searching for multi-stream
orative Spatiotemporal Feature Learning for Video Action neural connectivity in video architectures. In International
Recognition. In Proceedings of the IEEE/CVF Conference Conference on Learning Representations, 2020. 3, 6
on Computer Vision and Pattern Recognition (CVPR), June [52] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal
2019. 3, 6 action localization in untrimmed videos via multi-stage cnns.
[37] Xinyu Li, Bing Shuai, and Joseph Tighe. Directional Tem- In CVPR, pages 1049–1058, 2016. 2
poral Modeling for Action Recognition. In Andrea Vedaldi, [53] Gunnar A Sigurdsson, Olga Russakovsky, and Abhinav
Horst Bischof, Thomas Brox, and Jan-Michael Frahm, edi- Gupta. What actions are needed for understanding human
tors, Computer Vision – ECCV 2020, pages 275–291, Cham, actions in videos? In ICCV, pages 2137–2146, 2017. 3
2020. Springer International Publishing. 3 [54] Karen Simonyan and Andrew Zisserman. Two-stream con-
[38] Ji Lin, Chuang Gan, and Song Han. Temporal Shift Module volutional networks for action recognition in videos. In
for Efficient Video Understanding. In ICCV, 2019. 1, 2, 3, 7 NeurIPS, 2014. 2
[39] Zhaoyang Liu, Donghao Luo, Yabiao Wang, Limin Wang, [55] Khurram Soomro, Amir Roshan Zamir, Mubarak Shah,
Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Tong Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah.
Lu. TEINet: Towards an Efficient Architecture for Video Ucf101: A dataset of 101 human actions classes from videos
Recognition. Proceedings of the AAAI Conference on Artifi- in the wild. arXiv, 2012. 7
cial Intelligence, 34(07):11669–11676, Apr. 2020. 3 [56] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudi-
[40] Chenxu Luo and Alan L Yuille. Grouped spatial-temporal nov. Unsupervised learning of video representations using
aggregation for efficient action recognition. In ICCV, pages lstms. In ICML, pages 843–852, 2015. 2
5512–5521, 2019. 1 [57] Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz.
[41] Chenxu Luo and Alan L. Yuille. Grouped spatial-temporal Gate-shift networks for video action recognition. In CVPR,
aggregation for efficient action recognition. In Proceedings pages 1102–1111, 2020. 1, 3
of the IEEE/CVF International Conference on Computer Vi- [58] Shuyang Sun, Zhanghui Kuang, Lu Sheng, Wanli Ouyang,
sion (ICCV), October 2019. 3 and Wei Zhang. Optical flow guided feature: A fast and
[42] Brais Martinez, Davide Modolo, Yuanjun Xiong, and Joseph robust motion representation for video action recognition.
Tighe. Action recognition with spatial-temporal discrimi- In Proceedings of the IEEE Conference on Computer Vision
native filter banks. In Proceedings of the IEEE/CVF Inter- and Pattern Recognition (CVPR), June 2018. 3
national Conference on Computer Vision (ICCV), October [59] C Szegedy, Wei Liu, Yangqing Jia, P Sermanet, S Reed, D
2019. 3 Anguelov, D Erhan, V Vanhoucke, and A Rabinovich. Going
[43] Joanna Materzynska, Guillaume Berger, Ingo Bax, and deeper with convolutions. In CVPR, pages 1–9, 2015. 1
Roland Memisevic. The jester dataset: A large-scale video [60] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani,
dataset of human gestures. In ICCV Workshops, Oct 2019. 7 and Manohar Paluri. Learning Spatiotemporal Features With
[44] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ra- 3D Convolutional Networks. In ICCV, 2015. 2, 3
makrishnan, Sarah Adel Bargal, Yan Yan, Lisa Brown, [61] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feis-
Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments zli. Video classification with channel-separated convolu-
in time dataset: one million videos for event understanding. tional networks. In ICCV, October 2019. 2, 3, 6
IEEE TPAMI, 2019. 1, 2, 7 [62] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann
[45] Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong LeCun, and Manohar Paluri. A Closer Look at Spatiotem-
Rui. Jointly modeling embedding and translation to bridge poral Convolutions for Action Recognition. In CVPR, June
video and language. In CVPR, pages 4594–4602, 2016. 2 2018. 2, 3, 4, 7

[63] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Don- [78] Yizhou Zhou, Xiaoyan Sun, Zheng-Jun Zha, and Wenjun
ahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Zeng. Mict: Mixed 3d/2d convolutional tube for human
Sequence to sequence-video to text. In ICCV, pages 4534– action recognition. In Proceedings of the IEEE Conference
4542, 2015. 2 on Computer Vision and Pattern Recognition (CVPR), June
[64] Heng Wang, Du Tran, Lorenzo Torresani, and Matt Feiszli. 2018. 3
Video Modeling With Correlation Networks. In IEEE/CVF [79] Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas
Conference on Computer Vision and Pattern Recognition Brox. Eco: Efficient convolutional network for online video
(CVPR), June 2020. 3, 6 understanding. In ECCV, pages 695–712, 2018. 3, 8
[65] Limin Wang, Wei Li, Wen Li, and Luc Van Gool.
Appearance-and-relation networks for video classification.
In CVPR, June 2018. 3
[66] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recogni-
tion with trajectory-pooled deep-convolutional descriptors.
In CVPR, pages 4305–4314, 2015. 2
[67] Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool.
Untrimmednets for weakly supervised action recognition
and detection. In CVPR, pages 4325–4334, 2017. 3
[68] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua
Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment
networks: Towards good practices for deep action recogni-
tion. In ECCV. Springer, 2016. 1, 2, 3, 4, 7
[69] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim-
ing He. Non-local neural networks. In CVPR, June 2018. 2,
3, 7
[70] Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia
Schmid. Learning to track for spatio-temporal action local-
ization. In ICCV, pages 3164–3172, 2015. 2
[71] Junwu Weng, Donghao Luo, Yabiao Wang, Ying Tai,
Chengjie Wang, Jilin Li, Feiyue Huang, Xudong Jiang, and
Junsong Yuan. Temporal Distinct Representation Learning
for Action Recognition. In Andrea Vedaldi, Horst Bischof,
Thomas Brox, and Jan-Michael Frahm, editors, Computer
Vision – ECCV 2020, pages 363–378, Cham, 2020. Springer
International Publishing. 3
[72] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and
Kevin Murphy. Rethinking Spatiotemporal Feature Learn-
ing: Speed-Accuracy Trade-offs in Video Classification. In
ECCV, Sept. 2018. 2, 3, 4, 7
[73] Zhongwen Xu, Yi Yang, and Alex G Hauptmann. A dis-
criminative cnn video representation for event detection. In
CVPR, pages 1798–1807, 2015. 2
[74] Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, and Bolei
Zhou. Temporal pyramid network for action recognition.
In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2020. 3
[75] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vi-
jayanarasimhan, Oriol Vinyals, Rajat Monga, and George
Toderici. Beyond short snippets: Deep networks for video
classification. In CVPR, pages 4694–4702, 2015. 2
[76] Yue Zhao, Yuanjun Xiong, and Dahua Lin. Trajectory con-
volution for action recognition. In Proceedings of the 32nd
International Conference on Neural Information Processing
Systems, NIPS’18, page 2208–2219, Red Hook, NY, USA,
2018. Curran Associates Inc. 3
[77] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Tor-
ralba. Temporal relational reasoning in videos. In ECCV,
pages 803–818, 2018. 2, 3

