TubeR: Tubelet Transformer for Video Action Detection
Jiaojiao Zhao1 *, Yanyi Zhang2 *, Xinyu Li3 *, Hao Chen3 , Bing Shuai3 , Mingze Xu3 , Chunhui Liu3 ,
Kaustav Kundu3 , Yuanjun Xiong3 , Davide Modolo3 , Ivan Marsic2 , Cees G.M. Snoek1 , Joseph Tighe3
1University of Amsterdam   2Rutgers University   3AWS AI Labs
actors and generate action tubelets that focus on single actors instead of a fixed area in the frame. The TubeR decoder applies the tubelet attention module to the tubelet queries Q for generating the tubelet query feature F_q ∈ R^{N×T_out×C′}:

F_q = \text{TA}(Q). \label{equ:tube_query}  (5)

Decoder. The decoder contains a tubelet-attention module and a cross-attention (CA) layer, which is used to decode the tubelet-specific feature F_tub from F_en and F_q:

\text{CA}(F_q, F_\text{en}) = \text{softmax}\Big(\frac{F_q \times \sigma_{k}(F_\text{en})^T}{\sqrt{C'}}\Big) \times \sigma_{v}(F_\text{en}), \label{equ:actor_query}  (6)

F_\text{tub} = \text{Decoder}(F_q, F_\text{en}).  (7)

F_tub ∈ R^{N×T_out×C′} is the tubelet-specific feature. Note that with temporal pooling, T_out < T_in, TubeR produces sparse tubelets; for T_out = T_in, TubeR produces dense tubelets.
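For concreteness, Eqs. (5)-(7) amount to a self-attention over the query set followed by a cross-attention into the encoder feature. The sketch below is a PyTorch-style illustration; the module names and the single joint self-attention (the paper's tubelet attention instead models relations within a tubelet and across tubelets) are simplifying assumptions, not the exact implementation.

```python
# Minimal sketch of the TubeR decoder step (Eqs. 5-7), PyTorch-style.
# Shapes: Q, F_q are (B, N, T_out, C'); F_en is (B, S, C') with S = T'H'W'.
import math
import torch
import torch.nn as nn


class TubeletAttention(nn.Module):
    """Self-attention over the tubelet query set (Eq. 5). Here the N x T_out
    queries are flattened into one sequence; the paper's TA module instead
    factorizes attention within a tubelet and across tubelets."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, q):
        b, n, t, c = q.shape
        x = q.reshape(b, n * t, c)
        x, _ = self.attn(x, x, x)          # F_q = TA(Q)
        return x.reshape(b, n, t, c)


def cross_attention(f_q, f_en, sigma_k, sigma_v):
    """Eq. 6: softmax(F_q * sigma_k(F_en)^T / sqrt(C')) * sigma_v(F_en)."""
    c = f_q.shape[-1]
    attn = torch.softmax(f_q @ sigma_k(f_en).transpose(-2, -1) / math.sqrt(c), dim=-1)
    return attn @ sigma_v(f_en)


def decode_tubelets(f_q, f_en, sigma_k, sigma_v):
    """Eq. 7: decode the tubelet-specific feature F_tub from F_q and F_en."""
    b, n, t, c = f_q.shape
    f_tub = cross_attention(f_q.reshape(b, n * t, c), f_en, sigma_k, sigma_v)
    return f_tub.reshape(b, n, t, c)       # (B, N, T_out, C')
```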
3.3. Task-Specific Heads

The bounding boxes and action classification for each tubelet can be predicted simultaneously with independent task-specific heads. This design minimizes the computational overhead and makes our system extensible.

Context aware classification head. The classification can be achieved simply with a linear projection:

y_\text{class} = \text{Linear}_\text{c}(F_\text{tub}), \label{equ:cls}  (8)

where y_class ∈ R^{N×L} denotes the classification scores over L possible labels, one for each tubelet.

Short-term context head. It is known that context is important for understanding sequences [40]. We further propose to leverage spatio-temporal video context to help video sequence understanding. We query the action-specific feature F_tub from some context feature F_context to strengthen F_tub, and obtain the feature F_c ∈ R^{N×C′} for the final classification:

F_\text{c} = \text{CA}(\text{Pool}_t(F_\text{tub}), \text{SA}(F_\text{context})) + \text{Pool}_t(F_\text{tub}). \label{equ:backbone_query}  (9)

When we set F_context = F_b to utilize the short-term context in the backbone feature, we call it the short-term context head. A self-attention layer is first applied to F_context, then a cross-attention layer utilizes F_tub to query from F_context. Linear_c is applied to F_c for the final classification.
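As an illustration of Eqs. (8)-(9), the context-aware classification head can be sketched with standard attention layers as below; the module name, head count and the use of mean pooling for Pool_t are assumptions rather than the exact implementation.

```python
# Sketch of the context-aware classification head (Eqs. 8-9).
# F_tub: (B, N, T_out, C'); F_context: (B, S, C') flattened over space-time.
import torch
import torch.nn as nn


class ContextHead(nn.Module):
    def __init__(self, dim, num_classes, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.linear_c = nn.Linear(dim, num_classes)        # Linear_c of Eq. 8

    def forward(self, f_tub, f_context):
        f_pool = f_tub.mean(dim=2)                         # Pool_t: (B, N, C')
        ctx, _ = self.self_attn(f_context, f_context, f_context)   # SA(F_context)
        f_c, _ = self.cross_attn(f_pool, ctx, ctx)         # CA(Pool_t(F_tub), SA(F_context))
        f_c = f_c + f_pool                                 # residual term of Eq. 9
        return self.linear_c(f_c)                          # y_class: (B, N, L)
```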
Long-term context head. The long-term context F_long ∈ R^{T_long×HW×C} (T_long = (2w+1)T′) is a buffer containing the backbone features extracted from the current clip and its 2w adjacent clips, concatenated along time. To compress this long-term feature buffer into an embedding Emb_long with a lower temporal dimension, we apply two stacked decoders with two token embeddings, Emb_n0 and Emb_n1. Specifically, we first apply a compression token Emb_n0 (n0 < T_long) to query important information from F_long and obtain an intermediate compressed embedding with temporal dimension n0. We then use another compression token Emb_n1 (n1 < n0) to query from the intermediate compressed embedding and obtain the final compressed embedding Emb_long, which contains the long-term video information at a lower temporal dimension n1. A cross-attention layer is then applied to F_b and Emb_long to generate the long-term context feature F_lt ∈ R^{T′×H′×W′×C′}:

F_\text{lt} = \text{CA}(F_\text{b}, \text{Emb}_\text{long}),  (11)

We set F_context = F_lt in Eq. 9 to utilize the long-term context for classification.
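A possible implementation of this two-stage token compression is sketched below with learned query tokens and cross-attention; the module name and the sizes n0, n1 are illustrative assumptions.

```python
# Sketch of the long-term context compression: two stacked decoders, each a
# set of learned tokens that cross-attends into the previous (longer) buffer.
# n0 and n1 are hypothetical sizes with n1 < n0 < T_long.
import torch
import torch.nn as nn


class LongTermCompressor(nn.Module):
    def __init__(self, dim, n0=64, n1=16, num_heads=8):
        super().__init__()
        self.emb_n0 = nn.Parameter(torch.randn(n0, dim))     # compression tokens Emb_n0
        self.emb_n1 = nn.Parameter(torch.randn(n1, dim))     # compression tokens Emb_n1
        self.dec0 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dec1 = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_long):                     # f_long: (B, T_long*H*W, C)
        b = f_long.shape[0]
        q0 = self.emb_n0.unsqueeze(0).expand(b, -1, -1)
        inter, _ = self.dec0(q0, f_long, f_long)   # intermediate embedding, length n0
        q1 = self.emb_n1.unsqueeze(0).expand(b, -1, -1)
        emb_long, _ = self.dec1(q1, inter, inter)  # final Emb_long, length n1
        return emb_long                            # cross-attended with F_b in Eq. 11
```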
Action switch regression head. The T_out bounding boxes in a tubelet are regressed simultaneously with an FC layer:

y_\text{coor} = \text{Linear}_\text{b}(F_\text{tub}),  (12)

where y_coor ∈ R^{N×T_out×4}, N is the number of action tubelets, and T_out is the temporal length of an action tubelet. To remove non-action boxes in a tubelet, we further include an FC layer that decides whether a box prediction depicts the actor performing the action(s) of the tubelet, which we call the action switch. The action switch allows our method to generate action tubelets with a more precise temporal extent. The probabilities of the T_out predicted boxes in a tubelet being visible are:

y_\text{switch} = \text{Linear}_\text{s}(F_\text{tub}),  (13)

where y_switch ∈ R^{N×T_out}. For each predicted tubelet, each of its T_out bounding boxes obtains an action switch score.
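Both heads in Eqs. (12)-(13) are plain linear projections on the tubelet feature; a minimal sketch follows, where the box parameterization and head names are assumptions.

```python
# Sketch of the box regression and action-switch heads (Eqs. 12-13)
# applied to F_tub of shape (B, N, T_out, C').
import torch
import torch.nn as nn

dim = 256                                  # C', assumed feature width
box_head = nn.Linear(dim, 4)               # Linear_b; (cx, cy, w, h) is a DETR-style assumption
switch_head = nn.Linear(dim, 1)            # Linear_s; per-box visibility logit

f_tub = torch.randn(2, 15, 16, dim)        # B=2, N=15 tubelet queries, T_out=16
y_coor = box_head(f_tub).sigmoid()         # (B, N, T_out, 4), normalized coordinates
y_switch = switch_head(f_tub).squeeze(-1)  # (B, N, T_out), action-switch scores (logits)
```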
3.4. Losses

The total loss is a linear combination of four losses:

\begin{split} \mathcal{L} = \lambda_1\mathcal{L}_\text{switch}(y_\text{switch}, Y_\text{switch}) + \lambda_2\mathcal{L}_\text{class}(y_\text{class}, Y_\text{class}) \\ + \lambda_3\mathcal{L}_\text{box}(y_\text{coor}, Y_\text{coor}) + \lambda_4\mathcal{L}_\text{iou}(y_\text{coor}, Y_\text{coor}), \end{split}  (14)
where y is the model output and Y denotes the ground truth. The action switch loss L_switch is a binary cross-entropy loss. The classification loss L_class is a cross-entropy loss. L_box and L_iou denote the per-frame bounding box matching errors. Note that when T_out < T_in, the tubelet is sparse and the coordinate ground truth Y_coor comes from the correspondingly temporally down-sampled frame sequence. We use Hungarian matching similar to [4]; more details can be found in the supplementary material. We empirically set the scale parameters to λ1 = 1, λ2 = 5, λ3 = 2, λ4 = 2.
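For reference, the sketch below shows how the matched, weighted loss of Eq. (14) could be assembled; the matching cost, the box format (x1, y1, x2, y2), the single-label cross-entropy and the plain IoU term are simplifying assumptions, whereas the paper follows the matcher and box losses of [4].

```python
# Schematic assembly of the total loss (Eq. 14) after bipartite matching.
# Simplifications: boxes in (x1, y1, x2, y2), single-label classification,
# plain IoU instead of the paper's exact box/IoU losses.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

LAMBDAS = {"switch": 1.0, "class": 5.0, "box": 2.0, "iou": 2.0}   # lambda_1..lambda_4


def pairwise_iou(a, b):
    """Element-wise IoU of matched box pairs in (x1, y1, x2, y2) format."""
    lt = torch.max(a[..., :2], b[..., :2])
    rb = torch.min(a[..., 2:], b[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[..., 2] - a[..., 0]).clamp(min=0) * (a[..., 3] - a[..., 1]).clamp(min=0)
    area_b = (b[..., 2] - b[..., 0]).clamp(min=0) * (b[..., 3] - b[..., 1]).clamp(min=0)
    return inter / (area_a + area_b - inter + 1e-6)


def tuber_loss(y_switch, y_class, y_coor, Y_switch, Y_class, Y_coor, cost):
    # cost: (N_pred, N_gt) matching cost between predicted and GT tubelets.
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())

    l_switch = F.binary_cross_entropy_with_logits(          # per-box visibility
        y_switch[pred_idx], Y_switch[gt_idx].float())
    l_class = F.cross_entropy(y_class[pred_idx], Y_class[gt_idx])  # single-label simplification
    l_box = F.l1_loss(y_coor[pred_idx], Y_coor[gt_idx])            # per-frame box error
    l_iou = (1.0 - pairwise_iou(y_coor[pred_idx], Y_coor[gt_idx])).mean()

    return (LAMBDAS["switch"] * l_switch + LAMBDAS["class"] * l_class
            + LAMBDAS["box"] * l_box + LAMBDAS["iou"] * l_iou)
```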
Figure 3. Visualizations of the action switch on UCF101-24 ((a) with action switch; (b) without action switch). Best viewed in color. The red box and label represent the ground truth; yellow indicates our detected tubelets. With the action switch (top row), TubeR avoids misclassifying the transitional states.

4. Experiments
4.1. Experimental Setup
Datasets. We report experiments on three commonly used video datasets for action detection. UCF101-24 [34] is a subset of UCF101. It contains 24 sport classes in 3,207 untrimmed videos. We use the revised annotations for UCF101-24 from [32] and report the performance on split-1. JHMDB51-21 [18] contains 21 action categories in 928 trimmed videos. We report the average results over all three splits. AVA [15] is larger-scale and includes 299 15-minute movies, 235 for training and the remaining 64 for validation. Box and label annotations are provided on keyframes sampled at one frame per second. We evaluate on AVA with both annotation versions, v2.1 and v2.2.

Evaluation criteria. We report the video-mAP at different IoUs on UCF101-24 and JHMDB51-21. As AVA only has keyframe annotations, we report frame-mAP@IoU=0.5 following [15], using a single, center-crop inference protocol.

Implementation details. We pre-train the backbone on Kinetics-400 [20]. The encoder and decoder contain 6 blocks on AVA. For the smaller UCF101-24 and JHMDB51-21, we reduce the number of blocks to 3 to avoid overfitting. We empirically set the number of tubelet queries N to 15. During training, we use bipartite matching [11] based on the Hungarian algorithm [22] between predictions and the ground truth. We use the AdamW [27] optimizer with an initial learning rate of 1e-5 for the backbone and 1e-4 for the transformers. We decrease the learning rate by 10× when the validation loss saturates. We set the weight decay to 1e-4. Scale jittering in the range of (288, 320) and color jittering are used for data augmentation. During inference, we always resize the short edge to 256 and use a single center crop (1-view). We also tested the horizontal-flip trick to create 2-view inference. For fair comparison with previous methods on UCF101-24 and JHMDB51-21, we also test a two-stream setting with optical flow, following [49].
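In PyTorch, the optimizer setup above could be expressed with parameter groups as in the following sketch; the split by a "backbone." name prefix is an assumption about how the model is organized.

```python
# Sketch of the AdamW setup: lr 1e-5 for the backbone, 1e-4 for the
# transformer encoder/decoder and heads, weight decay 1e-4.
import torch


def build_optimizer(model):
    backbone = [p for n, p in model.named_parameters()
                if n.startswith("backbone.") and p.requires_grad]
    transformer = [p for n, p in model.named_parameters()
                   if not n.startswith("backbone.") and p.requires_grad]
    return torch.optim.AdamW(
        [{"params": backbone, "lr": 1e-5},
         {"params": transformer, "lr": 1e-4}],
        weight_decay=1e-4)

# The learning rate is decreased 10x when the validation loss saturates, e.g. with
# torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1).
```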
4.2. Ablations

We perform our ablations on both UCF101-24 and AVA 2.1 to demonstrate the effectiveness of our designs under different evaluation protocols. Only RGB inputs are considered. For UCF101-24, with per-frame annotations, we report video-mAP at IoU=0.5. A standard I3D-VGG backbone [15] is utilized and the input length is set to 7 frames if not specified. For AVA 2.1, with 1-fps annotations, we only take the model predictions on keyframes and report frame-mAP at IoU=0.5. We use a CSN-50 backbone [38] with a single-view evaluation protocol if not specified.

Benefit of tubelet queries. We first show the benefit of the proposed tubelet query sets. Each query set is composed of T_out per-frame query embeddings (see Section 3.2), which predict the spatial location of the action on their respective frames. We compare this to using a single query embedding that represents a whole tubelet and must regress T_out box locations for all frames in the clip. Our results are shown in Table 1a. Compared to using a single query embedding, our tubelet query set improves performance by +4.1% video-mAP on UCF101-24, showing that modeling action detection as a sequence-to-sequence task effectively leverages the capabilities of transformer architectures.

Effect of tubelet attention. In Table 1b, we show that using our tubelet attention module improves video-mAP on UCF101-24 by 0.9% and frame-mAP on AVA by 0.3%. The tubelet attention saves about 10% memory (4,414 MB) compared to the typical self-attention implementation (5,026 MB) during training (16-frame input with a batch size of 1).

Benefit of action switch. We report the effectiveness of our action switch head in Table 1c. On UCF101-24, the action switch increases the video-mAP from 53.8% to 57.7% by precisely determining the temporal start and end points of actions. Without the action switch, TubeR misclassifies transitional states as actions, like the example shown in Figure 3 (bottom row). As only frame-level evaluation can be done on AVA, the advantage of the action switch is not reflected in the frame-mAP. Instead, we demonstrate its effect in Figure 4 and Figure 5: the action switch produces tubelets with a precise temporal extent for videos with shot changes.

Effect of short- and long-term context head. We report the impact of our context-aware classification head with both short- and long-term features in Table 1d.
(a) Analysis on tubelet query. Our tubelet query set design allows each query to focus on the spatial location of the action on a specific frame.
                      UCF101-24   AVA
  single query        48.8        26.2
  tubelet query set   52.9        27.4

(b) Effect of tubelet attention. With tubelet attention, modeling relations within a tubelet and across tubelets improves performance.
                      UCF101-24   AVA
  self-attention      52.9        27.4
  tubelet attention   53.8        27.7

(c) Benefit of action switch. The action switch produces a more precise temporal extent, which can only be shown by video-mAP.
                      UCF101-24   AVA
  w/o switch          53.8        27.7
  w/ switch           57.7        27.7

(d) Effectiveness of short- and long-term context. The short-term and long-term context help performance, more noticeably on AVA.
                        UCF101-24   AVA
  FC head               57.8        23.4
  + short-term context  58.4        27.7
  + long-term context   -           28.8

(e) Length of input clip. Longer input video leads to better performance on both UCF101-24 and AVA.
  frames   UCF101-24   AVA
  8        53.9        24.4
  16       58.2        26.9
  32       58.4        27.7

(f) Long-term context length analysis on AVA. The right amount of long-term context helps improve frame-mAP on AVA.
  w    # of clips   duration (s)   mAP
  -    1            2.1            27.7
  2    5            10.6           28.4
  3    7            14.9           28.8
  5    11           23.5           28.6

Table 1. Ablation studies on UCF101-24 and AVA 2.1. The proposed tubelet query, tubelet attention, action switch and context-awareness generally improve model performance. The proposed TubeR works well on long clips with shot changes. We report video-mAP@IoU=0.5 for UCF101-24 and frame-mAP@IoU=0.5 for AVA.
The context head brings a decent performance gain (+4.3%) on AVA. This is probably because the movie clips in AVA contain shot changes, so the network benefits from seeing the full context of the clip. On UCF101-24, the videos are usually short and without shot changes; the context does not bring a significant improvement there.

Length of input clip. We report results with variable input lengths in Table 1e. We compare input lengths of 8, 16 and 32 frames on both UCF101-24 and AVA with CSN-152 as backbone. TubeR is able to handle long video clips as expected. We notice that our performance on UCF101-24 saturates faster than on AVA, probably because UCF101-24 does not contain shot changes that require longer temporal context for classification.

Length of long-term context. This ablation is only conducted on AVA, as videos in UCF101-24 are too short to use long-term context. Table 1f shows that the right amount of long-term context helps performance, but too much long-term context harms it. This is probably because the long-term feature contains both useful information and noise. The experiments show that about 15s of context serves best. Note that the context length varies per dataset, but can be easily determined empirically.

4.3. Frame-Level State-of-the-Art

AVA 2.1 Comparison. We first compare our results with previously proposed methods on AVA 2.1 in Table 2. Compared to previous end-to-end models with a comparable backbone (I3D-Res50) and the same inference protocol, the proposed TubeR outperforms all of them. TubeR outperforms the most recent end-to-end works WOO [5] by 0.9% and VTr [13] by 1.2%. This demonstrates the effectiveness of our designs.

Compared to previous work using an offline person detector, the proposed TubeR is also more effective under the same inference protocols. This is because TubeR generates tubelet-specific features without assumptions on location, while the two-stage methods have to assume the actions occur at a fixed location. It is also worth mentioning that TubeR with CSN backbones outperforms the two-stage model with the same backbone by +4.4%, demonstrating that the gain is not from the backbone but from our TubeR design. TubeR even outperforms the methods with multi-view augmentations (horizontal flip, multiple spatial crops and multi-scale). TubeR is also considerably faster than previous models; we have attempted to collect the reported FLOPs from previous works (Table 2). TubeR has 8% fewer FLOPs than the most recently published end-to-end model [5] with higher accuracy. TubeR is also 4× more efficient than the two-stage model [9] with a noticeable performance gain. Thanks to our sequence-to-sequence design, the heavy backbone is shared and we do not need temporal iteration for tubelet regression.

We finally present the highest numbers reported in the literature, regardless of the inference protocol, pre-training dataset and additional information used. TubeR still achieves the best performance, even better than the model using additional object bounding boxes as input [37]. The results show that the proposed sequence-to-sequence model with tubelet-specific features is a promising direction for action detection.

AVA 2.2 Comparison. The results are shown in Table 3. Under the same single-view protocol, TubeR is considerably better than previous methods, including the most recent work with an end-to-end design (WOO [5], +5.1%) and the two-stage work with strong backbones (MViT [7], +4.7%). A fair comparison between TubeR and a two-stage model [48] with the same backbone CSN-152 shows TubeR gains +5.5% frame-mAP. This demonstrates that TubeR's superior performance comes from our design rather than the backbone.

UCF101-24 Comparison. We also compare TubeR with the state-of-the-art using frame-mAP@IoU=0.5 on UCF101-24 (see the first column with numbers in Table 4). Compared to existing methods, TubeR acquires better results with comparable backbones, for both RGB-stream and two-stream
Table 2. Comparison on AVA v2.1 validation set. Columns: Model, Detector, Input, Backbone, Pre-train, Inference, GFLOPs, mAP. Detector indicates whether an additional detector is required; * denotes results we tested. IG denotes the IG-65M dataset; SF denotes the SlowFast network. The FLOPs for two-stage models are the sum of the Faster RCNN-R101-FPN FLOPs (246 GFLOPs [4]) and the classifier FLOPs multiplied by the number of views. TubeR performs more effectively and efficiently.