Self-Training Multi-Sequence Learning With Transformer
Self-Training Multi-Sequence Learning With Transformer
1395
to predict video-level anomaly scores, and a Snippet Regres- Newsam 2019). Wan et al. propose a dynamic MIL loss to
sor to predict snippet-level anomaly scores. In the inference enlarge the inter-class distance between anomalous and nor-
stage, we propose to use video-level anomaly scores to sup- mal instances, and a center loss to reduce the intra-class dis-
press fluctuations in the snippet-level anomaly scores. Since tance of normal instances (Wan et al. 2020). Feng, Hong,
the goal of VAD is to predict fine-grained anomaly scores and Zheng propose a MIL-based pseudo label generator and
(Tian et al. 2021), a two stage self-training strategy is used adopt a self-training scheme to refine pseudo-label by op-
to gradually refine the anomaly scores. timizing a self-guided attention encoder and a task-specific
To demonstrate the performance of our MSL, we use encoder (Feng, Hong, and Zheng 2021). Tian et al. propose
VideoSwin (a Transformer-based method) (Liu et al. 2021c) an robust temporal feature magnitude learning to effectively
as the backbone to extract snippet-level features and con- recognize the anomaly instances (Tian et al. 2021).
duct experiments on ShanghaiTech (Luo, Liu, and Gao
2017), UCF-Crime (Sultani, Chen, and Shah 2018), and XD- Self-Training
Violence (Wu et al. 2020). For a fair comparison, we also Self-training is widely used in semi-supervised learning
use C3D (Tran et al. 2015) and I3D (Carreira and Zisserman (Rosenberg, Hebert, and Schneiderman 2005; Tanha, van
2017) as the backbone to extract features. Experiments show Someren, and Afsarmanesh 2017; Tao et al. 2018; Li et al.
that our MSL achieve the state-of-the-art results. In summa- 2019; Jeong, Lee, and Kwak 2020; Tai, Bailis, and Valiant
ry, our main contributions are as follows: 2021). In self-training, the training data usually contain la-
• We propose a Multi-Sequence Learning method, which beled and unlabeled data (Liu et al. 2011). Self-training
uses a sequence composed of multiple instances as an includes the following steps (Zheng et al. 2020; Yu et al.
optimization unit. Based on this, we propose a Multi- 2021): 1) Train model with labeled data; 2) Use the trained
Sequence Learning ranking loss, which selects the se- model to predict unlabeled data to generate pseudo-labels;
quence with the highest sum of anomaly scores. 3) Train model with labeled and pseudo-labeled data to-
• Based on Multi-Sequence Learning and its ranking loss, gether; 4) Repeat 2) and 3). In VAD, Pang et al. pro-
we design a Transformer-based Multi-Sequence Learn- pose a self-training deep neural network for ordinal regres-
ing network, and propose to use the video-level anoma- sion (Pang et al. 2020). Feng, Hong, and Zheng propose
ly classification probability to suppress the fluctuation of a multi-instance self-training method that assigns snippet-
the snippet-level anomaly score in the inference stage. level pseudo-labels to all snippets in abnormal videos (Feng,
Hong, and Zheng 2021). Unlike them, our focus is on refin-
• By gradually reducing the length of selected sequence, ing anomaly scores through self-training.
we propose a two stage self-training strategy to gradually
refine the anomaly scores, because VAD needs to predict Transformer Combined With Convolution
fine-grained anomaly scores.
More and more studies have shown that Transformer has ex-
• Experimental results show that our method achieves the cellent performance (Dosovitskiy et al. 2021; Touvron et al.
state-of-the-art results on ShanghaiTech, UCF-Crime, 2021; Liu et al. 2021b). Dosovitskiy et al. first prove that a
and XD-Violence. The visualization shows that our pure Transformer architecture can attain state-of-the-art per-
method can realize the detection of abnormal snippets. formance (Dosovitskiy et al. 2021). Touvron et al. further
explore the data-efficient training strategies for the vision
Related Work transformer (Dosovitskiy et al. 2021; Touvron et al. 2021).
Weakly Supervised Video Anomaly Detection Liu et al. further introduces the inductive biases of local-
ity, hierarchy and translation invariance for various image
Most existing weakly supervised VAD methods (He, Shao, recognition tasks (Liu et al. 2021b). Because transformer
and Sun 2018; Zhang, Qing, and Miao 2019) are based on lacks the ability of local perception, many works combine
MIL. Since most methods (Li, Mahadevan, and Vasconcelos convolution and transformer (d’Ascoli et al. 2021; Wu et al.
2014; Zhao, Fei-Fei, and Xing 2011) earlier than 2017 on- 2021; Li et al. 2021; Xu et al. 2021; Yan et al. 2021; Zhang
ly used normal training videos, He, Shao, and Sun propose and Yang 2021; Liu et al. 2021a). To introduce local inter-
an anomaly-introduced learning method to detect abnormal frame perception, similar to Wu et al., we turn the linear pro-
events, and propose a graph-based MIL model with both jection in the Transformer Block into a Depthwise Separable
normal and abnormal video data (He, Shao, and Sun 2018). 1D Convolution (Chollet 2017; Howard et al. 2017).
Sultani, Chen, and Shah propose a deep MIL ranking loss
to predict anomaly scores (Sultani, Chen, and Shah 2018). Our Approach
Zhang, Qing, and Miao further introduces inner-bag score
gap regularization by defining an inner bag loss (Zhang, In this section, we first define the notations and problem s-
Qing, and Miao 2019). Zhong et al. consider the anomaly tatement. We then introduce our Multi-Sequence Learning
detection with weak labels as a supervised learning under (MSL). Finally, we present the pipeline of our approach.
noise labels, and design an alternate training procedure to
promote the discrimination of action classifiers (Zhong et al. Notations and Problem Statement
2019). Zhu and Newsam propose an attention-based tempo- In weakly supervised VAD, training videos are only labeled
ral MIL ranking loss, which use temporal context to distin- at the video-level. That is, videos containing anomalies are
guish between abnormal and normal events better (Zhu and labeled as 1 (positive), and videos without any anomalies
1396
Video Classifier
Linear Head
𝑝 BCE Loss +
Class Token
𝑓𝜃 (𝑣1 ) MLP
Snippet Regressor
CTE 2
𝑓𝜃 (𝑣2 )
Linear Head
Backbone 𝑓𝜃 (𝑣3 ) MSL
CTE 2
Layer Norm
𝑓𝜃 (𝑣4 ) Ranking
𝑓𝜃 (𝑣5 ) Loss
𝑓𝜃 (𝑣6 ) +
𝑓𝜃 (𝑣7 )
New Snippet-
Multi-Sequence Learning Level Pseudo-
Labels DW Conv1D DW Conv1D DW Conv1D
Features
1
Stage 1: Stage 2:
Initial Snippet- 𝐾⟸𝑇 MSL with MSL with 0 V K Q
1 T
Level Pseudo- pseudo-labels to predictions to bag
Labels Layer Norm
1 1
select sequences select sequences
0 0
Figure 1: Overall framework. (a) The architecture of our Multi-Sequence Learning (MSL), which includes a Backbone and
a Transformer-based MSL Network (MSLNet). The feature F ∈ RT × D extracted by the Backbone is input into MSLNet
to predict the anomaly scores, where T is the number of snippets and D is the feature dimension of each snippet. MSLNet
contains a video classifier to predict the probability p of the video containing anomalies and a snippet regressor to predict the
snippet anomaly score fθ (vi ) of the i-th snippet. BCE is the Binary Cross Entropy loss. (b) The pipeline of self-training MSL,
where K gradually changes from T to 1 through a self-training mechanism. According to the way of selecting sequences,
the optimization of MSL includes two stages: the first stage uses pseudo-labels to select sequences and the second stage uses
predictions to select sequences. (c) Convolutional Transformer Encoder (CTE), which is similar to (Dosovitskiy et al. 2021),
except that the linear projection is replaced with DW Conv1D (Depthwise Separable 1D Convolution) (Howard et al. 2017).
are labeled as 0 (negative). Given a video V = {vi }Ti=1 bag (Zhu and Newsam 2019). In order to keep a large margin
with T snippets and its video-level label Y ∈ {0, 1}. MIL- between the positive and negative instances, Sultani, Chen,
based methods treat video V as a bag and each snippet as and Shah give a hinge-based ranking loss:
an instance. A positive video is regarded as a positive bag
Ba = (a1 , a2 , ..., aT ), and a negative video is regarded as a L(Ba , Bn ) = max(0, 1 − max fθ (ai ) + max fθ (ni )). (3)
i∈Ba i∈Bn
negative bag Bn = (n1 , n2 , ..., nT ). The goal of VAD is to
learn a function fθ maps snippets to their anomaly scores, At the beginning of the optimization, fθ needs to have a
ranging from 0 to 1. Generally, MIL-based VAD assumes certain ability to predict abnormalities. Otherwise, it will be
that abnormal snippets have higher abnormal scores than possible to select a normal instance as an abnormal instance.
normal snippets. Sultani, Chen, and Shah formulate VAD If fθ predicts the instances in the positive bag incorrectly,
as an anomaly score regression problem and propose a MIL e.g. predicting normal instances as abnormal instances, this
ranking objective function and a MIL ranking loss (Sultani, error will be strengthened as the training progresses. In ad-
Chen, and Shah 2018): dition, the abnormal event is usually multiple consecutive
snippets, but MIL-based methods do not consider this prior.
max fθ (ai ) > max fθ (ni ). (1)
i∈Ba i∈Bn
Multi-Sequence Learning
L(Ba , Bn ) = max(0, max fθ (ni ) − max fθ (ai )). (2) In order to alleviate the above shortcomings in MIL-based
i∈Bn i∈Ba
methods, we propose a novel Multi-Sequence Learning (M-
The intuition behind Eq.1 and Eq.2 that the snippet with SL) method. As shown in Figure 2, given a video V =
highest anomaly score in the positive bag should rank higher {vi }Ti=1 with T snippets, the anomaly score curve is pre-
than the snippet with highest anomaly score in the negative dicted through a mapping function fθ . Let us assume that
1397
𝑓𝜃 (𝑣7 ) sequence data as input to model long-range relationships,
and has made great progress in many tasks. We adopt Trans-
(a)
score
former as our basic layer. The representation between the
local frames or snippets of the video is also very important.
1 5 snippet 𝑇 However, Transformer is not good at learning local represen-
(b) tations of adjacent frames or snippets (Yan et al. 2021). Mo-
1 5 𝑇 tivated by this, as shown in Figure 1(c), we replace the lin-
ear projection in the original Transformer with a DW Con-
1 𝑖 𝑖+𝐾 𝑇 (c) v1D (Depthwise Separable 1D Convolution) (Howard et al.
2017) projection. The new Transformer is named Convolu-
Figure 2: Comparison of instance selection method between tional Transformer Encoder (CTE). In this way, our CTE
MIL and our MSL. (a) Anomaly score curve of a video con- can inherit the advantages of Transformer and Convolutional
taining T snippets, assuming that the 5-th snippet has the Neural Network.
largest anomaly score fθ (v5 ). (b) Instance selection method
of MIL, which selects the 5-th snippet. (c) Instance selection Transformer-based MSL Network As shown in Figure
method of our MSL, which selects a sequence consisting of 1 (a), our architecture includes a Backbone and a MSLNet.
K consecutive snippets starting from the i-th snippet. Any action recognition method can be used as the Backbone,
such as C3D (Tran et al. 2015), I3D (Carreira and Zisserman
2017), and VideoSwin (Liu et al. 2021c). Similar to (Tian
the 5-th snippet v5 has the largest anomaly score fθ (v5 ). In et al. 2021), the Backbone uses pre-trained weights on the
MIL-based methods, the 5-th snippet will be selected to op- action recognition datasets (Karpathy et al. 2014; Kay et al.
timize the network (Zhu and Newsam 2019). In our MSL, 2017). Through the Backbone, a feature F ∈ RT ×D extracts
given a hyperparameter K, we propose a sequence selection from a video containing T snippets, where D is the feature
method, which selects a sequence that contains K consec- dimension of each snippet. Our MSLNet will use F as the
utive snippets. In detail, we calculate the mean of anomaly input to predict anomalies.
scores of all possible sequences of K consecutive snippets: Our MSLNet includes a video classifier and a snippet re-
K−1 gressor. The video classifier is used to predict whether the
−K 1 X video contains anomalies. Specifically, the video classifier
S = {si }Ti=1 , si = fθ (vi+k ), (4)
K contains two layers of CTE and a linear head for predicting
k=0
where si represents the mean of anomaly scores of the se- the probability of whether the video contains anomalies:
quence of K consecutive snippets starting from the i-th snip- p = σ(W c · E c [0]), E c = CT E×2 (class token||F ), (7)
pet. Then, the sequence with the largest mean of abnormal
scores can be selected by maxsi ∈S si . where W c is the parameter of the linear head, p is the prob-
Based on the above sequence selection method, we can ability that the video contains anomalies, and class token
simply use an MSL ranking objective function as: is used to predict the probability by aggregated features in
max sa,i > max sn,i , CTE. Since whether the video contains anomalies is a bina-
sa,i ∈Sa sn,i ∈Sn ry classification problem, σ chooses the sigmoid function.
K−1 K−1 (5) The snippet regressor is used to predict the anomaly s-
1 X 1 X core of each snippet. Specifically, the snippet regressor con-
sa,i = fθ (ai+k ), sn,i = fθ (ni+k ).
K K tains two layers of CTE and a linear head for predicting the
k=0 k=0
where sa,i and sn,i represent the mean of abnormal scores anomaly score of each snippet:
of K consecutive snippets starting from the i-th snippet in fθ (vi ) = σ(W r · E r [i]), E r = CT E×2 (E c ), (8)
abnormal video and normal video, respectively. The intu-
r
ition of our MSL ranking objective function is that the mean where W is the parameter of the linear head, fθ (vi ) is the
of abnormal scores of K consecutive snippets in abnormal abnormal score of the i-th snippet, and E r [i] is the feature of
videos should be greater than the mean of abnormal scores the i-th snippet. Since predicting the anomaly score is treat-
of K consecutive snippets in normal videos. To keep a large ed as a regression problem, σ chooses the sigmoid function.
margin between the positive and negative instances, similar We regard the optimization of the video classifier and s-
to Eq. 3, our hinge-based MSL ranking loss is defined as: nippet regressor as a multi-task learning problem. The total
L(Ba , Bn ) = max(0, 1 − max sa,i + max sn,i ). (6) loss to optimize the parameters of MSLNet is the sum of our
sa,i ∈Sa sn,i ∈Sn hinge-based MSL ranking loss and the classification loss:
It can be seen that MIL is a case of our MSL. When K = 1, L = L(Ba , Bn ) + BCE(p, Y ), (9)
MIL and our MSL are equivalent. When K = T , our MSL
treats every snippet in the abnormal video as abnormal. where L(Ba , Bn ) is the Eq. 6, and BCE is the Binary Cross
Entropy loss between the output p and the target Y .
Transformer-based MSL Network To reduce the fluctuation of the abnormal scores predict-
Convolutional Transformer Encoder Before introduc- ed by the snippet regressor, we propose a score correction
ing our Transformer-based MSL architecture, we first intro- method in the inference stage. Specifically, the score cor-
duce the basic layer. Transformer (Vaswani et al. 2017) uses rection method corrects the abnormal scores by using the
1398
probability of whether the video contains anomalies: Algorithm 1: Our Self-Training Multi-Sequence Learning.
Input: A set of features F and its video-level labels Y.
fˆθ (vi ) = fθ (vi ) × p. (10) Parameter: The number T of snippets.
The intuition of this method is that to keep the anomaly s- Output: MSLNet.
cores when the video classifier predicts that the video con- 1: Set K ← T .
tains anomalies with a higher probability, and weaken the 2: Get the initial snippet-level pseudo-labels Ŷ by Y.
anomaly scores when the video classifier predicts that the 3: while K ≥ 1 do
video contains anomalies with a lower probability. 4: Initialize the parameters of MSLNet.
5: // Stage one: with pseudo-labels to select sequences.
Self-Training MSL 6: Optimize MSLNet with K by F , Ŷ, and Eq. 11.
As shown in Figure 1 (b), we propose a self-training mecha- 7: // Stage two: with predictions to select sequences.
nism to achieve the training from coarse to fine. The training 8: Optimize MSLNet with K by F and Eq. 6.
process of our MSLNet includes two training stages. Be- 9: Inference the new snippet-level pseudo-labels Ŷ.
fore introducing our self-training mechanism, we first get 10: Set K ← K 2 .
the pseudo-labels Ŷ of the training videos. By taking the 11: end while
known video-level labels Y in weakly supervised VAD as 12: return MSLNet.
the anomaly scores of snippets, we can immediately get the
initial snippet-level pseudo-labels Ŷ. That is, for an abnor-
mal video, the pseudo label of each snippet is 1, and for a supervised setting, we adopt the split proposed by (Zhong
normal video, the pseudo label of each snippet is 0. et al. 2019): 238 training videos and 199 testing videos.
In the initial stage of training, the function fθ has a poor UCF-Crime is a large-scale dataset that contains 1,900
ability to predict abnormalities. Therefore, if the sequence untrimmed real-world street and indoor surveillance videos
is selected directly through the prediction of fθ , there is a with 13 classes of anomalous events and a total duration of
probability of selecting the wrong sequence. Based on this 128 hours (Sultani, Chen, and Shah 2018). The training set
motivation, we propose a transitional stage (stage one): MSL contains 1,610 videos with video-level labels, and the test
with pseudo-labels to select sequences. Specifically, by re- set contains 290 videos with frame-level labels.
placing the predicted anomaly score fθ (vi ) in Eq. 4 with the XD-Violence is a large-scale dataset that contains 4,754
pseudo-label ŷi of each snippet vi , we select the sequence untrimmed videos with a total duration of 217 hours and col-
with the largest mean of pseudo labels by maxsi ∈S si . Based lect from multiple sources, such as movies, sports, surveil-
on this sequence, we can calculate sa,i and sn,i , and then op- lances, and CCTVs (Wu et al. 2020). The training set con-
timize MSLNet through the hinge-based MSL ranking loss: tains 3,954 videos with video-level labels, and the test set
contains 800 videos with frame-level labels.
L(Ba , Bn ) = max(0, 1 − sa,i + sn,i ), (11) Following previous works (Zhong et al. 2019; Wan et al.
where sa,i and sn,i are the sequence with the largest mean of 2020), we use the AUC (Area Under Curve) of frame-level
pseudo labels starting from the i-th snippet in the abnormal ROC (Receiver Operating Characteristic) as our metric for
and normal video, respectively. After E1 epochs training, fθ ShanghaiTech and UCF-Crime. Following previous works
has a preliminary ability to predict the anomaly scores. (Wu et al. 2020; Tian et al. 2021), we use the AP (Aver-
In stage two, MSLNet is optimized with predictions to age Precision) as our metric for XD-Violence. Note that the
select sequences. This stage uses Eq. 5 and Eq. 6 to calculate larger the value of AUC and AP, the better the performance.
the ranking loss. After E2 epochs training, the new snippet-
Implementation Details
level pseudo-labels Ŷ of training videos are inferenced. By
halving the sequence length K and repeating the above two We extract the 4,096D features from the f c6 layer of the
stages, the predicted anomaly scores are gradually refined. pre-trained C3D (Tran et al. 2015) on Sports-1M (Karpa-
The role of the transitional stage is to establish a connec- thy et al. 2014), the 1,024D features from the mixed5c lay-
tion between MSL and different self-training rounds. By in- er of the pre-trained I3D (Carreira and Zisserman 2017) on
troducing a self-training mechanism, we achieve the predic- Kinetics-400 (Kay et al. 2017), and the 1,024D features from
tion of anomaly scores from coarse to fine. For better under- the Stage4 layer of the pre-trained VideoSwin (Liu et al.
standing, we show our self-training MSL in Algorithm 1. 2021c) on Kinetics-400. Following previous works (Tian
et al. 2021), we divide each video into 32 snippets, that is,
Experiments T = 32 and K ∈ {32, 16, 8, 4, 2, 1}. The length of each s-
nippet is 16. Our MSLNet is trained using the SGD optimiz-
Datasets and Evaluation Metrics er with a learning rate of 0.001, a weight decay of 0.0005
We conduct sufficient experiments on the ShanghaiTech, and a batch size of 64. We set E1 to 100 and E2 to 400. Fol-
UCF-Crime, and XD-Violence datasets. lowing (Tian et al. 2021), each mini-batch is composed of 32
ShanghaiTech is a medium-scale dataset that contains 437 randomly selected normal and abnormal videos. In abnormal
campus surveillance videos with 130 abnormal events in 13 videos, we randomly select one of the top 10% snippets as
scenes (Luo, Liu, and Gao 2017). However, all the training the abnormal snippet. In CTE, we set the number of headers
videos of this dataset are normal. In line with the weakly to 12 and use DW Conv1D with kernel size is 3.
1399
Method Feature Crop AUC(%) ↑ Method Feature Crop AUC(%) ↑
MIL-Rank† I3D RGB one 85.33 MIL-Rank C3D RGB one 75.41
GCN C3D-RGB ten 76.44 MIL-Rank† I3D RGB one 77.92
GCN TSN-Flow ten 84.13 Motion-Aware PWC-Flow one 79.00
GCN TSN-RGB ten 84.44 GCN C3D-RGB ten 81.08
IBL I3D-RGB one 82.50 GCN TSN-Flow ten 78.08
AR-Net† C3D RGB one 85.01 GCN TSN-RGB ten 82.12
AR-Net I3D Flow one 82.32 IBL C3D-RGB one 78.66
AR-Net I3D RGB one 85.38 CLAWS C3D-RGB ten 83.03
AR-Net I3D-RGB+Flow one 91.24 MIST C3D-RGB one 81.40
CLAWS C3D-RGB one 89.67 MIST I3D-RGB one 82.30
MIST C3D-RGB one 93.13 RTFM C3D-RGB ten 83.28
MIST I3D-RGB one 94.83 RTFM I3D-RGB ten 84.03
RTFM C3D-RGB ten 91.51 RTFM∗ VideoSwin-RGB one 83.31
RTFM I3D-RGB ten 97.21 Ours C3D-RGB one 82.85
RTFM∗ VideoSwin-RGB ten 96.76 Ours I3D-RGB one 85.30
Ours C3D-RGB one 94.23 Ours VideoSwin-RGB one 85.62
Ours I3D-RGB one 95.45
Ours VideoSwin-RGB one 96.93 Table 2: Compared with other methods on UCF-Crime. The
Ours C3D-RGB ten 94.81 method with † is reported by (Tian et al. 2021). ∗ indicates
Ours I3D-RGB ten 96.08 we re-train the method. Bold represents the best results.
Ours VideoSwin-RGB ten 97.32
Method Feature Crop AP(%) ↑
Table 1: Compared with related methods on ShanghaiTech. MIL-Rank† C3D RGB five 73.20
The methods with † are reported by (Feng, Hong, and Zheng MIL-Rank† I3D RGB five 75.68
2021) or (Tian et al. 2021). ∗ indicates we re-train the Multimodal-VD I3D-RGB five 75.41
method. Under the same feature, the highest result is bolded. RTFM C3D-RGB five 75.89
RTFM I3D-RGB five 77.81
RTFM∗ VideoSwin-RGB five 77.95
Results on ShanghaiTech Ours C3D-RGB five 75.53
We report the results on ShanghaiTech (Zhong et al. 2019) Ours I3D-RGB five 78.28
in Table 1. For a fair comparison, we use two features: Ours VideoSwin-RGB five 78.59
one-crop and ten-crop. One-crop means cropping snippet-
s into the center. Ten-crop means cropping snippets into Table 3: Compared with related methods on XD-Violence.
the center, four corners, and their flipped version (Zhong The methods with † are reported by (Wu et al. 2020) or (Tian
et al. 2019). Under the same backbone and crop, compared et al. 2021). ∗ indicates we re-train the method.
with the previous weakly supervised methods, our method-
s achieve the superior performance on AUC. For example,
with the one-crop I3D-RGB feature, our model achieves an features with other methods. Five-crop means cropping s-
AUC of 95.45% and outperforms all other methods with the nippets into the center and four corners. Under the same
same crop, and with the ten-crop VideoSwin-RGB feature, backbone, our method outperforms all previous weakly su-
our model achieves the best AUC of 97.32%. pervised VAD methods on the AP metric. For example, with
five-crop I3D-RGB features, our model achieves an AP of
Results on UCF-Crime 78.28% and outperforms all other methods, and with five-
We report our experimental results on UCF-Crime (Sul- crop VideoSwin-RGB features, our model achieves an AP
tani, Chen, and Shah 2018) in Table 2. Under I3D and of 78.59% which higher than RTFM by 0.64%.
VideoSwin as the backbone, our method outperforms all pre-
vious weakly supervised methods on the frame-level AUC Complexity Analysis
metric. Under C3D as the backbone, our method has also Generally, Transformer has been often computationally ex-
achieved competitive result. For example, with the one-crop pensive, but our method can achieve real-time surveillance.
I3D-RGB feature, our model achieves an AUC of 85.30% On an NVIDIA 2080 GPU, with VideoSwin (Liu et al.
and outperforms all other methods, and with the one-crop 2021c) as the backbone processes 3.6 snippets per second
VideoSwin-RGB feature, our model achieves the best AUC (a snippet has 16 frames), which is 57.6 frames per second
of 85.62% which is higher than RTFM by 2.31%. (FPS); with I3D (Carreira and Zisserman 2017) as the back-
bone processes 6.5 snippets per second, which is 104 FPS.
Results on XD-Violence Our MSL Network can reach 156.4 forwards per second.
We report our results on XD-Violence (Wu et al. 2020) in Overall, the speed with VideoSwin as the backbone is 42
Table 3. For a fair comparison, we use the same five-crop FPS, and the speed with I3D as the backbone is 63 FPS.
1400
(a) Abnormal Video 01 0025 (b) Abnormal Video 03 0032 (c) Abnormal Video 01 0051 (d) Normal Video 08 045
(e) Explosion 033 x264 (f) RoadAccidents 012 x264 (g) Robbery 102 x264 (h) Normal Video 894 x264
Figure 3: Visualization of abnormal score curves. The horizontal axis represents the number of frames, and the vertical axis
represents the abnormal scores. Videos of (a), (b), (c), and (d) are from the ShanghaiTech dataset, and videos of (e), (f), (g), and
(h) are from the UCF-Crime dataset. The curves indicate the abnormal scores of the video frames, pink areas indicate that the
interval contains an abnormal event, and the red rectangles indicate the location of abnormal events. Best viewed in color.
Table 4: Compared with Transformer (Dosovitskiy et al. Table 5: Performance improvement brought by the score
2021), AUC(%) improvement brought by CTE on the correction method in the inference stage measured by
ShanghaiTech and UCF-Crime datasets. AUC(%) on the ShanghaiTech and UCF-Crime datasets.
Qualitative Analysis sults of this ablation experiment. Compared with the result
In order to further demonstrate the effect of our method, as using the standard Transformer as the basic layer, the result
shown in Figure 3, we visualize the anomaly score curves. with CTE as the basic layer increases an AUC by 0.42% and
The first row shows the ground truth and prediction anomaly 0.21% on the ShanghaiTech and UCF-Crime datasets.
scores of three abnormal videos and one normal video from
the ShanghaiTech dataset. From the first row of Figure 3, Impact of score correction in the inference stage. As
we can see that our method can detect abnormal events in shown in Table 5, we conduct an experiment to report the
surveillance videos. Our method successfully predicts short- performance improvement brought by the score correction
term abnormal events (Figure 3 (a)) and long-term abnormal method in the inference stage. From Table 5 we can ob-
events (Figure 3 (b)). Furthermore, our method can also de- serve that score correction can bring an AUC improvemen-
tect multiple abnormal events in a video (Figure 3 (c)). The t of 0.95% and 0.68% with the one-crop features on the
second row shows the ground truth and predicted anomaly ShanghaiTech and UCF-Crime datasets, respectively.
scores of three abnormal videos and one normal video from
the UCF-Crime dataset. From the second row of Figure 3, Conclusion
we can see that our proposed method can also detect abnor-
mal events in complex surveillance scenes. In this work, we first propose an MSL method and a hinge-
based MSL ranking loss. We then design a Transformer-
Ablation Analysis based network to learn both video-level anomaly probability
In order to further evaluate our method, we perform ablation and snippet-level anomaly scores. In the inference stage, we
studies on the ShanghaiTech and UCF-Crime datasets with propose to use the video-level anomaly probability to sup-
one-crop VideoSwin-RGB features. press the fluctuation of snippet-level anomaly scores. Final-
ly, since VAD needs to predict the instance-level anomaly s-
Improvement brought by CTE. To evaluate the effect of cores, by gradually reducing the length of selected sequence,
our CTE, we replace CTE with the standard Transformer we propose a self-training strategy to refine the anomaly s-
(Dosovitskiy et al. 2021). The dimension of the standard cores. Experimental results show that our method achieves
Transformer is the same as our CTE. Table 4 reports the re- significant improvements on three public datasets.
1401
Acknowledgements Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.;
This work was supported in part by the National Natural Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev,
Science Foundation of China (No.62076192), Key Research P.; Suleyman, M.; and Zisserman, A. 2017. The Kinetics
and Development Program in Shaanxi Province of China Human Action Video Dataset. CoRR, abs/1705.06950.
(No.2019ZDLGY03-06), the State Key Program of National Li, W.; Mahadevan, V.; and Vasconcelos, N. 2014. Anoma-
Natural Science of China (No.61836009), in part by the Pro- ly Detection and Localization in Crowded Scenes. IEEE
gram for Cheung Kong Scholars and Innovative Research Transactions on Pattern Analysis and Machine Intelligence,
Team in University (No.IRT 15R53), in part by The Fund 36(1): 18–32.
for Foreign Scholars in University Research and Teaching Li, Y.; Xing, R.; Jiao, L.; Chen, Y.; Chai, Y.; Marturi, N.;
Programs (the 111 Project) (No.B07048), in part by the and Shang, R. 2019. Semi-Supervised PolSAR Image Clas-
Key Scientific Technological Innovation Research Project sification Based on Self-Training and Superpixels. Remote.
by Ministry of Education, the National Key Research and Sens., 11(16): 1933.
Development Program of China. Li, Y.; Zhang, K.; Cao, J.; Timofte, R.; and Gool, L. V. 2021.
LocalViT: Bringing Locality to Vision Transformers. CoRR,
References abs/2104.05707.
Cai, R.; Zhang, H.; Liu, W.; Gao, S.; and Hao, Z. Liu, K.; and Ma, H. 2019. Exploring Background-Bias for
2021. Appearance-Motion Memory Consistency Network Anomaly Detection in Surveillance Videos. In Proceedings
for Video Anomaly Detection. In AAAI, 938–946. of the 27th ACM International Conference on Multimedia,
Carreira, J.; and Zisserman, A. 2017. Quo Vadis, Action MM ’19, 14901499.
Recognition? A New Model and the Kinetics Dataset. In Liu, X.; Li, K.; Zhou, M.; and Xiong, Z. 2011. Enhancing
CVPR, 4724–4733. Semantic Role Labeling for Tweets Using Self-Training. In
Chollet, F. 2017. Xception: Deep Learning with Depthwise AAAI.
Separable Convolutions. In CVPR, 1800–1807. Liu, Y.; Sun, G.; Qiu, Y.; Zhang, L.; Chhatkuli, A.; and Gool,
d’Ascoli, S.; Touvron, H.; Leavitt, M. L.; Morcos, A. S.; L. V. 2021a. Transformer in Convolutional Neural Network-
Biroli, G.; and Sagun, L. 2021. ConViT: Improving Vision s. CoRR, abs/2106.03180.
Transformers with Soft Convolutional Inductive Biases. In
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin,
ICML, volume 139, 2286–2296.
S.; and Guo, B. 2021b. Swin Transformer: Hierarchical
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, Vision Transformer using Shifted Windows. CoRR, ab-
D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; s/2103.14030.
Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021.
Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; and Hu,
An Image is Worth 16x16 Words: Transformers for Image
H. 2021c. Video Swin Transformer. CoRR, abs/2106.13230.
Recognition at Scale. In ICLR.
Feng, J.-C.; Hong, F.-T.; and Zheng, W.-S. 2021. Mist: Mul- Luo, W.; Liu, W.; and Gao, S. 2017. A Revisit of Sparse
tiple instance self-training framework for video anomaly de- Coding Based Anomaly Detection in Stacked RNN Frame-
tection. In CVPR, 14009–14018. work. In ICCV, 341–349.
Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M. R.; Pang, G.; Yan, C.; Shen, C.; van den Hengel, A.; and Bai, X.
Venkatesh, S.; and van den Hengel, A. 2019. Memorizing 2020. Self-Trained Deep Ordinal Regression for End-to-End
Normality to Detect Anomaly: Memory-Augmented Deep Video Anomaly Detection. In CVPR, 12170–12179.
Autoencoder for Unsupervised Anomaly Detection. In IC- Rosenberg, C.; Hebert, M.; and Schneiderman, H. 2005.
CV, 1705–1714. Semi-Supervised Self-Training of Object Detection Models.
Guo, Z.; Zhao, J.; Jiao, L.; Liu, X.; and Liu, F. 2021. A In WACV/MOTION, 29–36.
Universal Quaternion Hypergraph Network for Multimodal Sultani, W.; Chen, C.; and Shah, M. 2018. Real-World
Video Question Answering. IEEE Transactions on Multime- Anomaly Detection in Surveillance Videos. In CVPR, 6479–
dia, 1–1. 6488.
He, C.; Shao, J.; and Sun, J. 2018. An anomaly-introduced Tai, K. S.; Bailis, P.; and Valiant, G. 2021. Sinkhorn
learning method for abnormal event detection. Multimedia Label Allocation: Semi-Supervised Classification via An-
Tools and Applications, 77(22): 29573–29588. nealed Self-Training. In ICML, volume 139, 10065–10075.
Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, Tanha, J.; van Someren, M.; and Afsarmanesh, H. 2017.
W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. Mo- Semi-supervised self-training for decision tree classifiers.
bileNets: Efficient Convolutional Neural Networks for Mo- Int. J. Mach. Learn. Cybern., 8(1): 355–370.
bile Vision Applications. CoRR, abs/1704.04861. Tao, Y.; Zhang, D.; Cheng, S.; and Tang, X. 2018. Improv-
Jeong, J.; Lee, S.; and Kwak, N. 2020. Self-Training us- ing semi-supervised self-training with embedded manifold
ing Selection Network for Semi-supervised Learning. In transduction. Trans. Inst. Meas. Control, 40(2): 363–374.
ICPRAM, 23–32. Tian, Y.; Pang, G.; Chen, Y.; Singh, R.; Verjans, J. W.; and
Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, Carneiro, G. 2021. Weakly-supervised Video Anomaly De-
R.; and Li, F. 2014. Large-Scale Video Classification with tection with Robust Temporal Feature Magnitude Learning.
Convolutional Neural Networks. In CVPR, 1725–1732. CoRR, abs/2101.10030.
1402
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles,
A.; and Jégou, H. 2021. Training data-efficient image trans-
formers & distillation through attention. In ICML, volume
139, 10347–10357.
Tran, D.; Bourdev, L. D.; Fergus, R.; Torresani, L.; and
Paluri, M. 2015. Learning Spatiotemporal Features with 3D
Convolutional Networks. In ICCV, 4489–4497.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. At-
tention is All you Need. In NIPS, 5998–6008.
Wan, B.; Fang, Y.; Xia, X.; and Mei, J. 2020. Weakly super-
vised video anomaly detection via center-guided discrimina-
tive learning. In ICME, 1–6. IEEE.
Wan, B.; Jiang, W.; Fang, Y.; Luo, Z.; and Ding, G. 2021.
Anomaly detection in video sequences: A benchmark and
computational model. IET Image Processing.
Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.;
and Zhang, L. 2021. CvT: Introducing Convolutions to Vi-
sion Transformers. CoRR, abs/2103.15808.
Wu, P.; Liu, j.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; and Yang,
Z. 2020. Not only Look, but also Listen: Learning Multi-
modal Violence Detection under Weak Supervision. In EC-
CV.
Xu, W.; Xu, Y.; Chang, T. A.; and Tu, Z. 2021. Co-
Scale Conv-Attentional Image Transformers. CoRR, ab-
s/2104.06399.
Yan, H.; Li, Z.; Li, W.; Wang, C.; Wu, M.; and Zhang, C.
2021. ConTNet: Why not use convolution and transformer
at the same time? CoRR, abs/2104.13497.
Yu, F.; Zhang, M.; Dong, H.; Hu, S.; Dong, B.; and Zhang, L.
2021. DAST: Unsupervised Domain Adaptation in Semantic
Segmentation Based on Discriminator Attention and Self-
Training. In AAAI, 10754–10762.
Zhang, J.; Qing, L.; and Miao, J. 2019. Temporal Convo-
lutional Network with Complementary Inner Bag Loss for
Weakly Supervised Anomaly Detection. In ICIP, 4030–
4034.
Zhang, Q.; and Yang, Y. 2021. ResT: An Efficient Trans-
former for Visual Recognition. CoRR, abs/2105.13677.
Zhao, B.; Fei-Fei, L.; and Xing, E. P. 2011. Online detection
of unusual events in videos via dynamic sparse coding. In
CVPR, 3313–3320.
Zheng, H.; Zhang, Y.; Yang, L.; Wang, C.; and Chen, D. Z.
2020. An Annotation Sparsification Strategy for 3D Medical
Image Segmentation via Representative Selection and Self-
Training. In AAAI, 6925–6932.
Zhong, J.; Li, N.; Kong, W.; Liu, S.; Li, T. H.; and Li, G.
2019. Graph Convolutional Label Noise Cleaner: Train a
Plug-And-Play Action Classifier for Anomaly Detection. In
CVPR, 1237–1246.
Zhu, Y.; and Newsam, S. D. 2019. Motion-Aware Feature
for Improved Video Anomaly Detection. In BMVC, 270.
1403