TubeR: Tubelet Transformer for Video Action Detection
Jiaojiao Zhao1 *, Yanyi Zhang2 *, Xinyu Li3 *, Hao Chen3 , Bing Shuai3 , Mingze Xu3 , Chunhui Liu3 ,
Kaustav Kundu3 , Yuanjun Xiong3 , Davide Modolo3 , Ivan Marsic2 , Cees G.M. Snoek1 , Joseph Tighe3
1University of Amsterdam   2Rutgers University   3AWS AI Labs
actors and generate action tubelets that focus on single actors instead of a fixed area in the frame. The TubeR decoder applies the tubelet attention module to the tubelet queries Q for generating the tubelet query feature F_q ∈ R^{N×T_out×C′}:

F_q = \text{TA}(Q). \label{equ:tube_query}  (5)

Decoder. The decoder contains a tubelet-attention module and a cross-attention (CA) layer, which is used to decode the tubelet-specific feature F_tub from F_en and F_q:

\text{CA}(F_q, F_\text{en}) = \text{softmax}\Big(\frac{F_q \times \sigma_{k}(F_\text{en})^T}{\sqrt{C'}}\Big) \times \sigma_{v}(F_\text{en}), \label{equ:actor_query}  (6)

F_\text{tub} = \text{Decoder}(F_q, F_\text{en}).  (7)

F_tub ∈ R^{N×T_out×C′} is the tubelet-specific feature. Note that with temporal pooling, T_out < T_in, TubeR produces sparse tubelets; for T_out = T_in, TubeR produces dense tubelets.
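For concreteness, Eqs. (5)-(7) amount to a self-attention over the query set followed by a cross-attention into the encoder feature. The sketch below is a PyTorch-style illustration; the module names and the single joint self-attention (the paper's tubelet attention instead models relations within a tubelet and across tubelets) are simplifying assumptions, not the exact implementation.

```python
# Minimal sketch of the TubeR decoder step (Eqs. 5-7), PyTorch-style.
# Shapes: Q, F_q are (B, N, T_out, C'); F_en is (B, S, C') with S = T'H'W'.
import math
import torch
import torch.nn as nn


class TubeletAttention(nn.Module):
    """Self-attention over the tubelet query set (Eq. 5). Here the N x T_out
    queries are flattened into one sequence; the paper's TA module instead
    factorizes attention within a tubelet and across tubelets."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, q):
        b, n, t, c = q.shape
        x = q.reshape(b, n * t, c)
        x, _ = self.attn(x, x, x)          # F_q = TA(Q)
        return x.reshape(b, n, t, c)


def cross_attention(f_q, f_en, sigma_k, sigma_v):
    """Eq. 6: softmax(F_q * sigma_k(F_en)^T / sqrt(C')) * sigma_v(F_en)."""
    c = f_q.shape[-1]
    attn = torch.softmax(f_q @ sigma_k(f_en).transpose(-2, -1) / math.sqrt(c), dim=-1)
    return attn @ sigma_v(f_en)


def decode_tubelets(f_q, f_en, sigma_k, sigma_v):
    """Eq. 7: decode the tubelet-specific feature F_tub from F_q and F_en."""
    b, n, t, c = f_q.shape
    f_tub = cross_attention(f_q.reshape(b, n * t, c), f_en, sigma_k, sigma_v)
    return f_tub.reshape(b, n, t, c)       # (B, N, T_out, C')
```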
3.3. Task-Specific Heads

The bounding boxes and action classification for each tubelet can be predicted simultaneously with independent task-specific heads. This design minimizes the computational overhead and makes our system extensible.

Context aware classification head. The classification can be achieved simply with a linear projection:

y_\text{class} = \text{Linear}_\text{c}(F_\text{tub}), \label{equ:cls}  (8)

where y_class ∈ R^{N×L} denotes the classification scores over L possible labels, one for each tubelet.

Short-term context head. It is known that context is important for understanding sequences [40]. We further propose to leverage spatio-temporal video context to help video sequence understanding. We query the action-specific feature F_tub from some context feature F_context to strengthen F_tub, and obtain the feature F_c ∈ R^{N×C′} for the final classification:

F_\text{c} = \text{CA}(\text{Pool}_t(F_\text{tub}), \text{SA}(F_\text{context})) + \text{Pool}_t(F_\text{tub}). \label{equ:backbone_query}  (9)

When we set F_context = F_b to utilize the short-term context in the backbone feature, we call it the short-term context head. A self-attention layer is first applied to F_context, then a cross-attention layer utilizes F_tub to query from F_context. Linear_c is applied to F_c for the final classification.
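As an illustration of Eqs. (8)-(9), the context-aware classification head can be sketched with standard attention layers as below; the module name, head count and the use of mean pooling for Pool_t are assumptions rather than the exact implementation.

```python
# Sketch of the context-aware classification head (Eqs. 8-9).
# F_tub: (B, N, T_out, C'); F_context: (B, S, C') flattened over space-time.
import torch
import torch.nn as nn


class ContextHead(nn.Module):
    def __init__(self, dim, num_classes, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.linear_c = nn.Linear(dim, num_classes)        # Linear_c of Eq. 8

    def forward(self, f_tub, f_context):
        f_pool = f_tub.mean(dim=2)                         # Pool_t: (B, N, C')
        ctx, _ = self.self_attn(f_context, f_context, f_context)   # SA(F_context)
        f_c, _ = self.cross_attn(f_pool, ctx, ctx)         # CA(Pool_t(F_tub), SA(F_context))
        f_c = f_c + f_pool                                 # residual term of Eq. 9
        return self.linear_c(f_c)                          # y_class: (B, N, L)
```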
Long-term context head. The long-term context F_long ∈ R^{T_long×HW×C} (T_long = (2w+1)T′) is a buffer containing the backbone features extracted from the current clip and its 2w adjacent clips, concatenated along time. To compress this long-term feature buffer into an embedding Emb_long with a lower temporal dimension, we apply two stacked decoders with two token embeddings, Emb_n0 and Emb_n1. Specifically, we first apply a compression token Emb_n0 (n0 < T_long) to query important information from F_long and obtain an intermediate compressed embedding with temporal dimension n0. We then use another compression token Emb_n1 (n1 < n0) to query from the intermediate compressed embedding and obtain the final compressed embedding Emb_long, which contains the long-term video information at a lower temporal dimension n1. A cross-attention layer is then applied to F_b and Emb_long to generate the long-term context feature F_lt ∈ R^{T′×H′×W′×C′}:

F_\text{lt} = \text{CA}(F_\text{b}, \text{Emb}_\text{long}),  (11)

We set F_context = F_lt in Eq. 9 to utilize the long-term context for classification.
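A possible implementation of this two-stage token compression is sketched below with learned query tokens and cross-attention; the module name and the sizes n0, n1 are illustrative assumptions.

```python
# Sketch of the long-term context compression: two stacked decoders, each a
# set of learned tokens that cross-attends into the previous (longer) buffer.
# n0 and n1 are hypothetical sizes with n1 < n0 < T_long.
import torch
import torch.nn as nn


class LongTermCompressor(nn.Module):
    def __init__(self, dim, n0=64, n1=16, num_heads=8):
        super().__init__()
        self.emb_n0 = nn.Parameter(torch.randn(n0, dim))     # compression tokens Emb_n0
        self.emb_n1 = nn.Parameter(torch.randn(n1, dim))     # compression tokens Emb_n1
        self.dec0 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dec1 = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_long):                     # f_long: (B, T_long*H*W, C)
        b = f_long.shape[0]
        q0 = self.emb_n0.unsqueeze(0).expand(b, -1, -1)
        inter, _ = self.dec0(q0, f_long, f_long)   # intermediate embedding, length n0
        q1 = self.emb_n1.unsqueeze(0).expand(b, -1, -1)
        emb_long, _ = self.dec1(q1, inter, inter)  # final Emb_long, length n1
        return emb_long                            # cross-attended with F_b in Eq. 11
```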
Action switch regression head. The T_out bounding boxes in a tubelet are regressed simultaneously with an FC layer:

y_\text{coor} = \text{Linear}_\text{b}(F_\text{tub}),  (12)

where y_coor ∈ R^{N×T_out×4}, N is the number of action tubelets, and T_out is the temporal length of an action tubelet. To remove non-action boxes in a tubelet, we further include an FC layer that decides whether a box prediction depicts the actor performing the action(s) of the tubelet, which we call the action switch. The action switch allows our method to generate action tubelets with a more precise temporal extent. The probabilities of the T_out predicted boxes in a tubelet being visible are:

y_\text{switch} = \text{Linear}_\text{s}(F_\text{tub}),  (13)

where y_switch ∈ R^{N×T_out}. For each predicted tubelet, each of its T_out bounding boxes obtains an action switch score.
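Both heads in Eqs. (12)-(13) are plain linear projections on the tubelet feature; a minimal sketch follows, where the box parameterization and head names are assumptions.

```python
# Sketch of the box regression and action-switch heads (Eqs. 12-13)
# applied to F_tub of shape (B, N, T_out, C').
import torch
import torch.nn as nn

dim = 256                                  # C', assumed feature width
box_head = nn.Linear(dim, 4)               # Linear_b; (cx, cy, w, h) is a DETR-style assumption
switch_head = nn.Linear(dim, 1)            # Linear_s; per-box visibility logit

f_tub = torch.randn(2, 15, 16, dim)        # B=2, N=15 tubelet queries, T_out=16
y_coor = box_head(f_tub).sigmoid()         # (B, N, T_out, 4), normalized coordinates
y_switch = switch_head(f_tub).squeeze(-1)  # (B, N, T_out), action-switch scores (logits)
```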
3.4. Losses

The total loss is a linear combination of four losses:

\begin{split} \mathcal{L} = \lambda_1\mathcal{L}_\text{switch}(y_\text{switch}, Y_\text{switch}) + \lambda_2\mathcal{L}_\text{class}(y_\text{class}, Y_\text{class}) \\ + \lambda_3\mathcal{L}_\text{box}(y_\text{coor}, Y_\text{coor}) + \lambda_4\mathcal{L}_\text{iou}(y_\text{coor}, Y_\text{coor}), \end{split}  (14)
where y is the model output and Y denotes the ground truth. The action switch loss L_switch is a binary cross-entropy loss. The classification loss L_class is a cross-entropy loss. L_box and L_iou denote the per-frame bounding box matching errors. Note that when T_out < T_in, the tubelet is sparse and the coordinate ground truth Y_coor comes from the correspondingly temporally down-sampled frame sequence. We use Hungarian matching similar to [4]; more details can be found in the supplementary material. We empirically set the scale parameters to λ1 = 1, λ2 = 5, λ3 = 2, λ4 = 2.
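For reference, the sketch below shows how the matched, weighted loss of Eq. (14) could be assembled; the matching cost, the box format (x1, y1, x2, y2), the single-label cross-entropy and the plain IoU term are simplifying assumptions, whereas the paper follows the matcher and box losses of [4].

```python
# Schematic assembly of the total loss (Eq. 14) after bipartite matching.
# Simplifications: boxes in (x1, y1, x2, y2), single-label classification,
# plain IoU instead of the paper's exact box/IoU losses.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

LAMBDAS = {"switch": 1.0, "class": 5.0, "box": 2.0, "iou": 2.0}   # lambda_1..lambda_4


def pairwise_iou(a, b):
    """Element-wise IoU of matched box pairs in (x1, y1, x2, y2) format."""
    lt = torch.max(a[..., :2], b[..., :2])
    rb = torch.min(a[..., 2:], b[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[..., 2] - a[..., 0]).clamp(min=0) * (a[..., 3] - a[..., 1]).clamp(min=0)
    area_b = (b[..., 2] - b[..., 0]).clamp(min=0) * (b[..., 3] - b[..., 1]).clamp(min=0)
    return inter / (area_a + area_b - inter + 1e-6)


def tuber_loss(y_switch, y_class, y_coor, Y_switch, Y_class, Y_coor, cost):
    # cost: (N_pred, N_gt) matching cost between predicted and GT tubelets.
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())

    l_switch = F.binary_cross_entropy_with_logits(          # per-box visibility
        y_switch[pred_idx], Y_switch[gt_idx].float())
    l_class = F.cross_entropy(y_class[pred_idx], Y_class[gt_idx])  # single-label simplification
    l_box = F.l1_loss(y_coor[pred_idx], Y_coor[gt_idx])            # per-frame box error
    l_iou = (1.0 - pairwise_iou(y_coor[pred_idx], Y_coor[gt_idx])).mean()

    return (LAMBDAS["switch"] * l_switch + LAMBDAS["class"] * l_class
            + LAMBDAS["box"] * l_box + LAMBDAS["iou"] * l_iou)
```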
Figure 3. Visualizations of the action switch on UCF101-24 ((a) with action switch; (b) without action switch). Best viewed in color. The red box and label represent the ground truth; yellow indicates our detected tubelets. With the action switch (top row), TubeR avoids misclassifying the transitional states.

4. Experiments
4.1. Experimental Setup
Datasets. We report experiments on three commonly used video datasets for action detection. UCF101-24 [34] is a subset of UCF101. It contains 24 sport classes in 3,207 untrimmed videos. We use the revised annotations for UCF101-24 from [32] and report the performance on split-1. JHMDB51-21 [18] contains 21 action categories in 928 trimmed videos. We report the average results over all three splits. AVA [15] is larger-scale and includes 299 15-minute movies, 235 for training and the remaining 64 for validation. Box and label annotations are provided on keyframes sampled at one frame per second. We evaluate on AVA with both annotation versions, v2.1 and v2.2.

Evaluation criteria. We report the video-mAP at different IoUs on UCF101-24 and JHMDB51-21. As AVA only has keyframe annotations, we report frame-mAP@IoU=0.5 following [15], using a single, center-crop inference protocol.

Implementation details. We pre-train the backbone on Kinetics-400 [20]. The encoder and decoder contain 6 blocks on AVA. For the smaller UCF101-24 and JHMDB51-21, we reduce the number of blocks to 3 to avoid overfitting. We empirically set the number of tubelet queries N to 15. During training, we use bipartite matching [11] based on the Hungarian algorithm [22] between predictions and the ground truth. We use the AdamW [27] optimizer with an initial learning rate of 1e-5 for the backbone and 1e-4 for the transformers. We decrease the learning rate by 10× when the validation loss saturates. We set the weight decay to 1e-4. Scale jittering in the range of (288, 320) and color jittering are used for data augmentation. During inference, we always resize the short edge to 256 and use a single center crop (1-view). We also tested the horizontal-flip trick to create 2-view inference. For fair comparison with previous methods on UCF101-24 and JHMDB51-21, we also test a two-stream setting with optical flow, following [49].
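In PyTorch, the optimizer setup above could be expressed with parameter groups as in the following sketch; the split by a "backbone." name prefix is an assumption about how the model is organized.

```python
# Sketch of the AdamW setup: lr 1e-5 for the backbone, 1e-4 for the
# transformer encoder/decoder and heads, weight decay 1e-4.
import torch


def build_optimizer(model):
    backbone = [p for n, p in model.named_parameters()
                if n.startswith("backbone.") and p.requires_grad]
    transformer = [p for n, p in model.named_parameters()
                   if not n.startswith("backbone.") and p.requires_grad]
    return torch.optim.AdamW(
        [{"params": backbone, "lr": 1e-5},
         {"params": transformer, "lr": 1e-4}],
        weight_decay=1e-4)

# The learning rate is decreased 10x when the validation loss saturates, e.g. with
# torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1).
```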
4.2. Ablations

We perform our ablations on both UCF101-24 and AVA 2.1 to demonstrate the effectiveness of our designs under different evaluation protocols. Only RGB inputs are considered. For UCF101-24, with per-frame annotations, we report video-mAP at IoU=0.5. A standard I3D-VGG backbone [15] is utilized and the input length is set to 7 frames if not specified. For AVA 2.1, with 1-fps annotations, we only take the model predictions on keyframes and report frame-mAP at IoU=0.5. We use a CSN-50 backbone [38] with a single-view evaluation protocol if not specified.

Benefit of tubelet queries. We first show the benefit of the proposed tubelet query sets. Each query set is composed of T_out per-frame query embeddings (see Section 3.2), which predict the spatial location of the action on their respective frames. We compare this to using a single query embedding that represents a whole tubelet and must regress T_out box locations for all frames in the clip. Our results are shown in Table 1a. Compared to using a single query embedding, our tubelet query set improves performance by +4.1% video-mAP on UCF101-24, showing that modeling action detection as a sequence-to-sequence task effectively leverages the capabilities of transformer architectures.

Effect of tubelet attention. In Table 1b, we show that using our tubelet attention module improves video-mAP on UCF101-24 by 0.9% and frame-mAP on AVA by 0.3%. The tubelet attention saves about 10% memory (4,414 MB) compared to the typical self-attention implementation (5,026 MB) during training (16-frame input with a batch size of 1).

Benefit of action switch. We report the effectiveness of our action switch head in Table 1c. On UCF101-24, the action switch increases the video-mAP from 53.8% to 57.7% by precisely determining the temporal start and end points of actions. Without the action switch, TubeR misclassifies transitional states as actions, like the example shown in Figure 3 (bottom row). As only frame-level evaluation can be done on AVA, the advantage of the action switch is not reflected in the frame-mAP. Instead, we demonstrate its effect in Figure 4 and Figure 5: the action switch produces tubelets with a precise temporal extent for videos with shot changes.

Effect of short- and long-term context head. We report the impact of our context-aware classification head with both short- and long-term features in Table 1d.
(a) Analysis on tubelet query. Our tubelet query set design allows each query to focus on the spatial location of the action on a specific frame.
                      UCF101-24   AVA
  single query        48.8        26.2
  tubelet query set   52.9        27.4

(b) Effect of tubelet attention. With tubelet attention, modeling relations within a tubelet and across tubelets improves performance.
                      UCF101-24   AVA
  self-attention      52.9        27.4
  tubelet attention   53.8        27.7

(c) Benefit of action switch. The action switch produces a more precise temporal extent, which can only be shown by video-mAP.
                      UCF101-24   AVA
  w/o switch          53.8        27.7
  w/ switch           57.7        27.7

(d) Effectiveness of short- and long-term context. The short-term and long-term context help performance, more noticeably on AVA.
                        UCF101-24   AVA
  FC head               57.8        23.4
  + short-term context  58.4        27.7
  + long-term context   -           28.8

(e) Length of input clip. Longer input video leads to better performance on both UCF101-24 and AVA.
  frames   UCF101-24   AVA
  8        53.9        24.4
  16       58.2        26.9
  32       58.4        27.7

(f) Long-term context length analysis on AVA. The right amount of long-term context helps improve frame-mAP on AVA.
  w    # of clips   duration (s)   mAP
  -    1            2.1            27.7
  2    5            10.6           28.4
  3    7            14.9           28.8
  5    11           23.5           28.6

Table 1. Ablation studies on UCF101-24 and AVA 2.1. The proposed tubelet query, tubelet attention, action switch and context-awareness generally improve model performance. The proposed TubeR works well on long clips with shot changes. We report video-mAP@IoU=0.5 for UCF101-24 and frame-mAP@IoU=0.5 for AVA.
The context head brings a decent performance gain (+4.3%) on AVA. This is probably because the movie clips in AVA contain shot changes, so the network benefits from seeing the full context of the clip. On UCF101-24, the videos are usually short and without shot changes; the context does not bring a significant improvement there.

Length of input clip. We report results with variable input lengths in Table 1e. We compare input lengths of 8, 16 and 32 frames on both UCF101-24 and AVA with CSN-152 as backbone. TubeR is able to handle long video clips as expected. We notice that our performance on UCF101-24 saturates faster than on AVA, probably because UCF101-24 does not contain shot changes that require longer temporal context for classification.

Length of long-term context. This ablation is only conducted on AVA, as videos in UCF101-24 are too short to use long-term context. Table 1f shows that the right amount of long-term context helps performance, but too much long-term context harms it. This is probably because the long-term feature contains both useful information and noise. The experiments show that about 15s of context serves best. Note that the context length varies per dataset, but can be easily determined empirically.

4.3. Frame-Level State-of-the-Art

AVA 2.1 Comparison. We first compare our results with previously proposed methods on AVA 2.1 in Table 2. Compared to previous end-to-end models with a comparable backbone (I3D-Res50) and the same inference protocol, the proposed TubeR outperforms all of them. TubeR outperforms the most recent end-to-end works WOO [5] by 0.9% and VTr [13] by 1.2%. This demonstrates the effectiveness of our designs.

Compared to previous work using an offline person detector, the proposed TubeR is also more effective under the same inference protocols. This is because TubeR generates tubelet-specific features without assumptions on location, while the two-stage methods have to assume the actions occur at a fixed location. It is also worth mentioning that TubeR with CSN backbones outperforms the two-stage model with the same backbone by +4.4%, demonstrating that the gain is not from the backbone but from our TubeR design. TubeR even outperforms the methods with multi-view augmentations (horizontal flip, multiple spatial crops and multi-scale). TubeR is also considerably faster than previous models; we have attempted to collect the reported FLOPs from previous works (Table 2). TubeR has 8% fewer FLOPs than the most recently published end-to-end model [5] with higher accuracy. TubeR is also 4× more efficient than the two-stage model [9] with a noticeable performance gain. Thanks to our sequence-to-sequence design, the heavy backbone is shared and we do not need temporal iteration for tubelet regression.

We finally present the highest numbers reported in the literature, regardless of the inference protocol, pre-training dataset and additional information used. TubeR still achieves the best performance, even better than the model using additional object bounding boxes as input [37]. The results show that the proposed sequence-to-sequence model with tubelet-specific features is a promising direction for action detection.

AVA 2.2 Comparison. The results are shown in Table 3. Under the same single-view protocol, TubeR is considerably better than previous methods, including the most recent work with an end-to-end design (WOO [5], +5.1%) and the two-stage work with strong backbones (MViT [7], +4.7%). A fair comparison between TubeR and a two-stage model [48] with the same backbone CSN-152 shows TubeR gains +5.5% frame-mAP. This demonstrates that TubeR's superior performance comes from our design rather than the backbone.

UCF101-24 Comparison. We also compare TubeR with the state-of-the-art using frame-mAP@IoU=0.5 on UCF101-24 (see the first column with numbers in Table 4). Compared to existing methods, TubeR acquires better results with comparable backbones, for both RGB-stream and two-stream
Table 2. Comparison on AVA v2.1 validation set. Columns: Model, Detector, Input, Backbone, Pre-train, Inference, GFLOPs, mAP. Detector indicates whether an additional detector is required; * denotes results we tested. IG denotes the IG-65M dataset; SF denotes the SlowFast network. The FLOPs for two-stage models are the sum of the Faster RCNN-R101-FPN FLOPs (246 GFLOPs [4]) and the classifier FLOPs multiplied by the number of views. TubeR performs more effectively and efficiently.