0% found this document useful (0 votes)

47 views9 pages

Self-Training Multi-Sequence Learning With Transformer

This document presents a novel approach for weakly supervised Video Anomaly Detection (VAD) using a Multi-Sequence Learning (MSL) method combined with a Transformer-based network. The proposed method aims to enhance the accuracy of anomaly detection by treating sequences of snippets as optimization units and employing a self-training strategy to refine anomaly scores. Experimental results demonstrate significant improvements in performance on benchmark datasets such as ShanghaiTech, UCF-Crime, and XD-Violence.

Uploaded by

Trong Nguyen Duc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

47 views9 pages

Self-Training Multi-Sequence Learning With Transformer

Uploaded by

Trong Nguyen Duc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

Self-Training Multi-Sequence Learning with Transformer for Weakly Supervised

Video Anomaly Detection
Shuo Li, Fang Liu∗ , Licheng Jiao
Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education,
International Research Center for Intelligent Perception and Computation,
Joint International Research Laboratory of Intelligent Perception and Computation,
School of Artificial Intelligent, Xidian University, Xi’an, 710071, P.R. China,
[email protected], [email protected], [email protected]

Abstract Hong, and Zheng 2021). Recently, many researchers have

focused on weakly supervised VAD (Zhong et al. 2019).
Weakly supervised Video Anomaly Detection (VAD) using
Multi-Instance Learning (MIL) is usually based on the fact Most weakly supervised VADs are based on Multi-
that the anomaly score of an abnormal snippet is higher than Instance Learning (MIL) (Sultani, Chen, and Shah 2018;
that of a normal snippet. In the beginning of training, due Zhu and Newsam 2019; Wan et al. 2020; Tian et al. 2021).
to the limited accuracy of the model, it is easy to select the MIL-based methods treat a video as a bag, which contains
wrong abnormal snippet. In order to reduce the probability of multiple instances. Each instance is a snippet. The bag gen-
selection errors, we first propose a Multi-Sequence Learning erated from an abnormal video is called a positive bag, and
(MSL) method and a hinge-based MSL ranking loss that uses the bag generated from a normal video is called a negative
a sequence composed of multiple snippets as an optimization bag. Since the video-level label indicates whether the video
unit. We then design a Transformer-based MSL network to contains anomalies, the positive bag contains at least one
learn both video-level anomaly probability and snippet-level
abnormal snippet and the negative bag contains no abnormal
anomaly scores. In the inference stage, we propose to use the
video-level anomaly probability to suppress the fluctuation snippet. MIL-based methods learn instance-level anomaly s-
of snippet-level anomaly scores. Finally, since VAD needs to cores through the bag-level labels (Zhong et al. 2019).
predict the snippet-level anomaly scores, by gradually reduc- In MIL-based methods, at least one instance of the posi-
ing the length of selected sequence, we propose a self-training tive bag contains the anomaly, and any instance of the neg-
strategy to gradually refine the anomaly scores. Experimental ative bag does not contain the anomaly (Sultani, Chen, and
results show that our method achieves significant improve- Shah 2018). Generally, MIL-based methods assume that the
ments on ShanghaiTech, UCF-Crime, and XD-Violence. instance with the highest anomaly score in the positive bag
should rank higher than the instance with the highest anoma-
Introduction ly score in the negative bag (Zhu and Newsam 2019). There-
fore, the important thing for MIL-based methods is to cor-
Video Anomaly Detection (VAD) aims to detect abnormal rectly select anomalous instance in the positive bag. Most
events in the video, which has important practical value MIL-based methods regard an instance as an optimization
(Zhang, Qing, and Miao 2019; Guo et al. 2021). General- unit (Zhang, Qing, and Miao 2019; Feng, Hong, and Zheng
ly, VAD predicts the anomaly score of each snippet in the 2021; Tian et al. 2021). However, if the model predicts the
video. There are three main paradigms: unsupervised VAD anomalous instances incorrectly in the positive bag, this er-
(Gong et al. 2019; Cai et al. 2021), weakly supervised VAD ror will be strengthened as the training progresses. That is, if
(Zhong et al. 2019), and supervised VAD (Liu and Ma 2019; a normal instance is predicted as an abnormal instance, this
Wan et al. 2021). Unsupervised VAD only learns on normal error will affect subsequent instance selection. In addition,
videos, assuming that unseen abnormal videos have high re- the abnormal event is usually multiple consecutive snippets,
construction errors. Due to the lack of prior knowledge of but MIL-based methods do not consider this prior.
abnormality and inability to learn all normal video patterns, In order to alleviate the above-mentioned shortcomings,
the performance of unsupervised VAD is usually poor (Tian we propose a Multi-Sequence Learning (MSL) method. Our
et al. 2021). Because fine-grained anomaly label is time- MSL no longer uses an instance as the optimization unit, but
consuming and laborious, it is difficult to collect large-scale a sequence composed of multiple instances as the optimiza-
datasets for supervised paradigm. With whether the video tion unit. In other words, instead of choosing the instance
contains anomalies as video-level label, the weakly super- with the highest anomaly score, our MSL method choos-
vised paradigm predicts the anomaly score of each frame. es the sequence with the highest sum of anomaly scores.
The weakly supervised paradigm is verified to be a feasi- This reduces the probability of incorrect selection of anoma-
ble method because of its competitive performance (Feng, lous instances. In order to achieve our MSL, we propose
∗
Corresponding author a Transformer-based Multi-Sequence Learning Network,
Copyright c 2022, Association for the Advancement of Artificial which includes a multi-layer Convolutional Transformer En-
Intelligence (www.aaai.org). All rights reserved. coder to encode extracted snippet features, a Video Classifier

1395
to predict video-level anomaly scores, and a Snippet Regres- Newsam 2019). Wan et al. propose a dynamic MIL loss to
sor to predict snippet-level anomaly scores. In the inference enlarge the inter-class distance between anomalous and nor-
stage, we propose to use video-level anomaly scores to sup- mal instances, and a center loss to reduce the intra-class dis-
press fluctuations in the snippet-level anomaly scores. Since tance of normal instances (Wan et al. 2020). Feng, Hong,
the goal of VAD is to predict fine-grained anomaly scores and Zheng propose a MIL-based pseudo label generator and
(Tian et al. 2021), a two stage self-training strategy is used adopt a self-training scheme to refine pseudo-label by op-
to gradually refine the anomaly scores. timizing a self-guided attention encoder and a task-specific
To demonstrate the performance of our MSL, we use encoder (Feng, Hong, and Zheng 2021). Tian et al. propose
VideoSwin (a Transformer-based method) (Liu et al. 2021c) an robust temporal feature magnitude learning to effectively
as the backbone to extract snippet-level features and con- recognize the anomaly instances (Tian et al. 2021).
duct experiments on ShanghaiTech (Luo, Liu, and Gao
2017), UCF-Crime (Sultani, Chen, and Shah 2018), and XD- Self-Training
Violence (Wu et al. 2020). For a fair comparison, we also Self-training is widely used in semi-supervised learning
use C3D (Tran et al. 2015) and I3D (Carreira and Zisserman (Rosenberg, Hebert, and Schneiderman 2005; Tanha, van
2017) as the backbone to extract features. Experiments show Someren, and Afsarmanesh 2017; Tao et al. 2018; Li et al.
that our MSL achieve the state-of-the-art results. In summa- 2019; Jeong, Lee, and Kwak 2020; Tai, Bailis, and Valiant
ry, our main contributions are as follows: 2021). In self-training, the training data usually contain la-
• We propose a Multi-Sequence Learning method, which beled and unlabeled data (Liu et al. 2011). Self-training
uses a sequence composed of multiple instances as an includes the following steps (Zheng et al. 2020; Yu et al.
optimization unit. Based on this, we propose a Multi- 2021): 1) Train model with labeled data; 2) Use the trained
Sequence Learning ranking loss, which selects the se- model to predict unlabeled data to generate pseudo-labels;
quence with the highest sum of anomaly scores. 3) Train model with labeled and pseudo-labeled data to-
• Based on Multi-Sequence Learning and its ranking loss, gether; 4) Repeat 2) and 3). In VAD, Pang et al. pro-
we design a Transformer-based Multi-Sequence Learn- pose a self-training deep neural network for ordinal regres-
ing network, and propose to use the video-level anoma- sion (Pang et al. 2020). Feng, Hong, and Zheng propose
ly classification probability to suppress the fluctuation of a multi-instance self-training method that assigns snippet-
the snippet-level anomaly score in the inference stage. level pseudo-labels to all snippets in abnormal videos (Feng,
Hong, and Zheng 2021). Unlike them, our focus is on refin-
• By gradually reducing the length of selected sequence, ing anomaly scores through self-training.
we propose a two stage self-training strategy to gradually
refine the anomaly scores, because VAD needs to predict Transformer Combined With Convolution
fine-grained anomaly scores.
More and more studies have shown that Transformer has ex-
• Experimental results show that our method achieves the cellent performance (Dosovitskiy et al. 2021; Touvron et al.
state-of-the-art results on ShanghaiTech, UCF-Crime, 2021; Liu et al. 2021b). Dosovitskiy et al. first prove that a
and XD-Violence. The visualization shows that our pure Transformer architecture can attain state-of-the-art per-
method can realize the detection of abnormal snippets. formance (Dosovitskiy et al. 2021). Touvron et al. further
explore the data-efficient training strategies for the vision
Related Work transformer (Dosovitskiy et al. 2021; Touvron et al. 2021).
Weakly Supervised Video Anomaly Detection Liu et al. further introduces the inductive biases of local-
ity, hierarchy and translation invariance for various image
Most existing weakly supervised VAD methods (He, Shao, recognition tasks (Liu et al. 2021b). Because transformer
and Sun 2018; Zhang, Qing, and Miao 2019) are based on lacks the ability of local perception, many works combine
MIL. Since most methods (Li, Mahadevan, and Vasconcelos convolution and transformer (d’Ascoli et al. 2021; Wu et al.
2014; Zhao, Fei-Fei, and Xing 2011) earlier than 2017 on- 2021; Li et al. 2021; Xu et al. 2021; Yan et al. 2021; Zhang
ly used normal training videos, He, Shao, and Sun propose and Yang 2021; Liu et al. 2021a). To introduce local inter-
an anomaly-introduced learning method to detect abnormal frame perception, similar to Wu et al., we turn the linear pro-
events, and propose a graph-based MIL model with both jection in the Transformer Block into a Depthwise Separable
normal and abnormal video data (He, Shao, and Sun 2018). 1D Convolution (Chollet 2017; Howard et al. 2017).
Sultani, Chen, and Shah propose a deep MIL ranking loss
to predict anomaly scores (Sultani, Chen, and Shah 2018). Our Approach
Zhang, Qing, and Miao further introduces inner-bag score
gap regularization by defining an inner bag loss (Zhang, In this section, we first define the notations and problem s-
Qing, and Miao 2019). Zhong et al. consider the anomaly tatement. We then introduce our Multi-Sequence Learning
detection with weak labels as a supervised learning under (MSL). Finally, we present the pipeline of our approach.
noise labels, and design an alternate training procedure to
promote the discrimination of action classifiers (Zhong et al. Notations and Problem Statement
2019). Zhu and Newsam propose an attention-based tempo- In weakly supervised VAD, training videos are only labeled
ral MIL ranking loss, which use temporal context to distin- at the video-level. That is, videos containing anomalies are
guish between abnormal and normal events better (Zhu and labeled as 1 (positive), and videos without any anomalies

1396
Video Classifier
Linear Head
𝑝 BCE Loss +
Class Token

𝑓𝜃 (𝑣1 ) MLP

Snippet Regressor
CTE 2
𝑓𝜃 (𝑣2 )

Linear Head
Backbone 𝑓𝜃 (𝑣3 ) MSL

CTE 2
Layer Norm
𝑓𝜃 (𝑣4 ) Ranking
𝑓𝜃 (𝑣5 ) Loss
𝑓𝜃 (𝑣6 ) +
𝑓𝜃 (𝑣7 )

Transformer-based Multi-Sequence Multi-Head

Feature 𝐹 Linear
Learning Network Attention

(a) Multi-Sequence Learning Architecture Concat

h head

Inference Features Scaled Dot-Product Attention

New Snippet-
Multi-Sequence Learning Level Pseudo-
Labels DW Conv1D DW Conv1D DW Conv1D
Features
1
Stage 1: Stage 2:
Initial Snippet- 𝐾⟸𝑇 MSL with MSL with 0 V K Q
1 T
Level Pseudo- pseudo-labels to predictions to bag
Labels Layer Norm
1 1
select sequences select sequences
0 0

positive bag negative bag 𝐾

𝐾⟸
2
Feature 𝐹

(b) Self-Training Multi-Sequence Learning Pipeline (c) Convolutional Transformer Encoder

Figure 1: Overall framework. (a) The architecture of our Multi-Sequence Learning (MSL), which includes a Backbone and
a Transformer-based MSL Network (MSLNet). The feature F ∈ RT × D extracted by the Backbone is input into MSLNet
to predict the anomaly scores, where T is the number of snippets and D is the feature dimension of each snippet. MSLNet
contains a video classifier to predict the probability p of the video containing anomalies and a snippet regressor to predict the
snippet anomaly score fθ (vi ) of the i-th snippet. BCE is the Binary Cross Entropy loss. (b) The pipeline of self-training MSL,
where K gradually changes from T to 1 through a self-training mechanism. According to the way of selecting sequences,
the optimization of MSL includes two stages: the first stage uses pseudo-labels to select sequences and the second stage uses
predictions to select sequences. (c) Convolutional Transformer Encoder (CTE), which is similar to (Dosovitskiy et al. 2021),
except that the linear projection is replaced with DW Conv1D (Depthwise Separable 1D Convolution) (Howard et al. 2017).

are labeled as 0 (negative). Given a video V = {vi }Ti=1 bag (Zhu and Newsam 2019). In order to keep a large margin
with T snippets and its video-level label Y ∈ {0, 1}. MIL- between the positive and negative instances, Sultani, Chen,
based methods treat video V as a bag and each snippet as and Shah give a hinge-based ranking loss:
an instance. A positive video is regarded as a positive bag
Ba = (a1 , a2 , ..., aT ), and a negative video is regarded as a L(Ba , Bn ) = max(0, 1 − max fθ (ai ) + max fθ (ni )). (3)
i∈Ba i∈Bn
negative bag Bn = (n1 , n2 , ..., nT ). The goal of VAD is to
learn a function fθ maps snippets to their anomaly scores, At the beginning of the optimization, fθ needs to have a
ranging from 0 to 1. Generally, MIL-based VAD assumes certain ability to predict abnormalities. Otherwise, it will be
that abnormal snippets have higher abnormal scores than possible to select a normal instance as an abnormal instance.
normal snippets. Sultani, Chen, and Shah formulate VAD If fθ predicts the instances in the positive bag incorrectly,
as an anomaly score regression problem and propose a MIL e.g. predicting normal instances as abnormal instances, this
ranking objective function and a MIL ranking loss (Sultani, error will be strengthened as the training progresses. In ad-
Chen, and Shah 2018): dition, the abnormal event is usually multiple consecutive
snippets, but MIL-based methods do not consider this prior.
max fθ (ai ) > max fθ (ni ). (1)
i∈Ba i∈Bn
Multi-Sequence Learning
L(Ba , Bn ) = max(0, max fθ (ni ) − max fθ (ai )). (2) In order to alleviate the above shortcomings in MIL-based
i∈Bn i∈Ba
methods, we propose a novel Multi-Sequence Learning (M-
The intuition behind Eq.1 and Eq.2 that the snippet with SL) method. As shown in Figure 2, given a video V =
highest anomaly score in the positive bag should rank higher {vi }Ti=1 with T snippets, the anomaly score curve is pre-
than the snippet with highest anomaly score in the negative dicted through a mapping function fθ . Let us assume that

1397
𝑓𝜃 (𝑣7 ) sequence data as input to model long-range relationships,
and has made great progress in many tasks. We adopt Trans-
(a)
score
former as our basic layer. The representation between the
local frames or snippets of the video is also very important.
1 5 snippet 𝑇 However, Transformer is not good at learning local represen-
(b) tations of adjacent frames or snippets (Yan et al. 2021). Mo-
1 5 𝑇 tivated by this, as shown in Figure 1(c), we replace the lin-
ear projection in the original Transformer with a DW Con-
1 𝑖 𝑖+𝐾 𝑇 (c) v1D (Depthwise Separable 1D Convolution) (Howard et al.
2017) projection. The new Transformer is named Convolu-
Figure 2: Comparison of instance selection method between tional Transformer Encoder (CTE). In this way, our CTE
MIL and our MSL. (a) Anomaly score curve of a video con- can inherit the advantages of Transformer and Convolutional
taining T snippets, assuming that the 5-th snippet has the Neural Network.
largest anomaly score fθ (v5 ). (b) Instance selection method
of MIL, which selects the 5-th snippet. (c) Instance selection Transformer-based MSL Network As shown in Figure
method of our MSL, which selects a sequence consisting of 1 (a), our architecture includes a Backbone and a MSLNet.
K consecutive snippets starting from the i-th snippet. Any action recognition method can be used as the Backbone,
such as C3D (Tran et al. 2015), I3D (Carreira and Zisserman
2017), and VideoSwin (Liu et al. 2021c). Similar to (Tian
the 5-th snippet v5 has the largest anomaly score fθ (v5 ). In et al. 2021), the Backbone uses pre-trained weights on the
MIL-based methods, the 5-th snippet will be selected to op- action recognition datasets (Karpathy et al. 2014; Kay et al.
timize the network (Zhu and Newsam 2019). In our MSL, 2017). Through the Backbone, a feature F ∈ RT ×D extracts
given a hyperparameter K, we propose a sequence selection from a video containing T snippets, where D is the feature
method, which selects a sequence that contains K consec- dimension of each snippet. Our MSLNet will use F as the
utive snippets. In detail, we calculate the mean of anomaly input to predict anomalies.
scores of all possible sequences of K consecutive snippets: Our MSLNet includes a video classifier and a snippet re-
K−1 gressor. The video classifier is used to predict whether the
−K 1 X video contains anomalies. Specifically, the video classifier
S = {si }Ti=1 , si = fθ (vi+k ), (4)
K contains two layers of CTE and a linear head for predicting
k=0
where si represents the mean of anomaly scores of the se- the probability of whether the video contains anomalies:
quence of K consecutive snippets starting from the i-th snip- p = σ(W c · E c [0]), E c = CT E×2 (class token||F ), (7)
pet. Then, the sequence with the largest mean of abnormal
scores can be selected by maxsi ∈S si . where W c is the parameter of the linear head, p is the prob-
Based on the above sequence selection method, we can ability that the video contains anomalies, and class token
simply use an MSL ranking objective function as: is used to predict the probability by aggregated features in
max sa,i > max sn,i , CTE. Since whether the video contains anomalies is a bina-
sa,i ∈Sa sn,i ∈Sn ry classification problem, σ chooses the sigmoid function.
K−1 K−1 (5) The snippet regressor is used to predict the anomaly s-
1 X 1 X core of each snippet. Specifically, the snippet regressor con-
sa,i = fθ (ai+k ), sn,i = fθ (ni+k ).
K K tains two layers of CTE and a linear head for predicting the
k=0 k=0
where sa,i and sn,i represent the mean of abnormal scores anomaly score of each snippet:
of K consecutive snippets starting from the i-th snippet in fθ (vi ) = σ(W r · E r [i]), E r = CT E×2 (E c ), (8)
abnormal video and normal video, respectively. The intu-
r
ition of our MSL ranking objective function is that the mean where W is the parameter of the linear head, fθ (vi ) is the
of abnormal scores of K consecutive snippets in abnormal abnormal score of the i-th snippet, and E r [i] is the feature of
videos should be greater than the mean of abnormal scores the i-th snippet. Since predicting the anomaly score is treat-
of K consecutive snippets in normal videos. To keep a large ed as a regression problem, σ chooses the sigmoid function.
margin between the positive and negative instances, similar We regard the optimization of the video classifier and s-
to Eq. 3, our hinge-based MSL ranking loss is defined as: nippet regressor as a multi-task learning problem. The total
L(Ba , Bn ) = max(0, 1 − max sa,i + max sn,i ). (6) loss to optimize the parameters of MSLNet is the sum of our
sa,i ∈Sa sn,i ∈Sn hinge-based MSL ranking loss and the classification loss:
It can be seen that MIL is a case of our MSL. When K = 1, L = L(Ba , Bn ) + BCE(p, Y ), (9)
MIL and our MSL are equivalent. When K = T , our MSL
treats every snippet in the abnormal video as abnormal. where L(Ba , Bn ) is the Eq. 6, and BCE is the Binary Cross
Entropy loss between the output p and the target Y .
Transformer-based MSL Network To reduce the fluctuation of the abnormal scores predict-
Convolutional Transformer Encoder Before introduced by the snippet regressor, we propose a score correction
ing our Transformer-based MSL architecture, we first intro- method in the inference stage. Specifically, the score cor-
duce the basic layer. Transformer (Vaswani et al. 2017) uses rection method corrects the abnormal scores by using the

1398
probability of whether the video contains anomalies: Algorithm 1: Our Self-Training Multi-Sequence Learning.
Input: A set of features F and its video-level labels Y.
fˆθ (vi ) = fθ (vi ) × p. (10) Parameter: The number T of snippets.
The intuition of this method is that to keep the anomaly s- Output: MSLNet.
cores when the video classifier predicts that the video con- 1: Set K ← T .
tains anomalies with a higher probability, and weaken the 2: Get the initial snippet-level pseudo-labels Ŷ by Y.
anomaly scores when the video classifier predicts that the 3: while K ≥ 1 do
video contains anomalies with a lower probability. 4: Initialize the parameters of MSLNet.
5: // Stage one: with pseudo-labels to select sequences.
Self-Training MSL 6: Optimize MSLNet with K by F , Ŷ, and Eq. 11.
As shown in Figure 1 (b), we propose a self-training mecha- 7: // Stage two: with predictions to select sequences.
nism to achieve the training from coarse to fine. The training 8: Optimize MSLNet with K by F and Eq. 6.
process of our MSLNet includes two training stages. Be- 9: Inference the new snippet-level pseudo-labels Ŷ.
fore introducing our self-training mechanism, we first get 10: Set K ← K 2 .
the pseudo-labels Ŷ of the training videos. By taking the 11: end while
known video-level labels Y in weakly supervised VAD as 12: return MSLNet.
the anomaly scores of snippets, we can immediately get the
initial snippet-level pseudo-labels Ŷ. That is, for an abnor-
mal video, the pseudo label of each snippet is 1, and for a supervised setting, we adopt the split proposed by (Zhong
normal video, the pseudo label of each snippet is 0. et al. 2019): 238 training videos and 199 testing videos.
In the initial stage of training, the function fθ has a poor UCF-Crime is a large-scale dataset that contains 1,900
ability to predict abnormalities. Therefore, if the sequence untrimmed real-world street and indoor surveillance videos
is selected directly through the prediction of fθ , there is a with 13 classes of anomalous events and a total duration of
probability of selecting the wrong sequence. Based on this 128 hours (Sultani, Chen, and Shah 2018). The training set
motivation, we propose a transitional stage (stage one): MSL contains 1,610 videos with video-level labels, and the test
with pseudo-labels to select sequences. Specifically, by re- set contains 290 videos with frame-level labels.
placing the predicted anomaly score fθ (vi ) in Eq. 4 with the XD-Violence is a large-scale dataset that contains 4,754
pseudo-label ŷi of each snippet vi , we select the sequence untrimmed videos with a total duration of 217 hours and col-
with the largest mean of pseudo labels by maxsi ∈S si . Based lect from multiple sources, such as movies, sports, surveil-
on this sequence, we can calculate sa,i and sn,i , and then op- lances, and CCTVs (Wu et al. 2020). The training set con-
timize MSLNet through the hinge-based MSL ranking loss: tains 3,954 videos with video-level labels, and the test set
contains 800 videos with frame-level labels.
L(Ba , Bn ) = max(0, 1 − sa,i + sn,i ), (11) Following previous works (Zhong et al. 2019; Wan et al.
where sa,i and sn,i are the sequence with the largest mean of 2020), we use the AUC (Area Under Curve) of frame-level
pseudo labels starting from the i-th snippet in the abnormal ROC (Receiver Operating Characteristic) as our metric for
and normal video, respectively. After E1 epochs training, fθ ShanghaiTech and UCF-Crime. Following previous works
has a preliminary ability to predict the anomaly scores. (Wu et al. 2020; Tian et al. 2021), we use the AP (Aver-
In stage two, MSLNet is optimized with predictions to age Precision) as our metric for XD-Violence. Note that the
select sequences. This stage uses Eq. 5 and Eq. 6 to calculate larger the value of AUC and AP, the better the performance.
the ranking loss. After E2 epochs training, the new snippet-
Implementation Details
level pseudo-labels Ŷ of training videos are inferenced. By
halving the sequence length K and repeating the above two We extract the 4,096D features from the f c6 layer of the
stages, the predicted anomaly scores are gradually refined. pre-trained C3D (Tran et al. 2015) on Sports-1M (Karpa-
The role of the transitional stage is to establish a connec- thy et al. 2014), the 1,024D features from the mixed5c lay-
tion between MSL and different self-training rounds. By in- er of the pre-trained I3D (Carreira and Zisserman 2017) on
troducing a self-training mechanism, we achieve the predic- Kinetics-400 (Kay et al. 2017), and the 1,024D features from
tion of anomaly scores from coarse to fine. For better under- the Stage4 layer of the pre-trained VideoSwin (Liu et al.
standing, we show our self-training MSL in Algorithm 1. 2021c) on Kinetics-400. Following previous works (Tian
et al. 2021), we divide each video into 32 snippets, that is,
Experiments T = 32 and K ∈ {32, 16, 8, 4, 2, 1}. The length of each s-
nippet is 16. Our MSLNet is trained using the SGD optimiz-
Datasets and Evaluation Metrics er with a learning rate of 0.001, a weight decay of 0.0005
We conduct sufficient experiments on the ShanghaiTech, and a batch size of 64. We set E1 to 100 and E2 to 400. Fol-
UCF-Crime, and XD-Violence datasets. lowing (Tian et al. 2021), each mini-batch is composed of 32
ShanghaiTech is a medium-scale dataset that contains 437 randomly selected normal and abnormal videos. In abnormal
campus surveillance videos with 130 abnormal events in 13 videos, we randomly select one of the top 10% snippets as
scenes (Luo, Liu, and Gao 2017). However, all the training the abnormal snippet. In CTE, we set the number of headers
videos of this dataset are normal. In line with the weakly to 12 and use DW Conv1D with kernel size is 3.

1399
Method Feature Crop AUC(%) ↑ Method Feature Crop AUC(%) ↑
MIL-Rank† I3D RGB one 85.33 MIL-Rank C3D RGB one 75.41
GCN C3D-RGB ten 76.44 MIL-Rank† I3D RGB one 77.92
GCN TSN-Flow ten 84.13 Motion-Aware PWC-Flow one 79.00
GCN TSN-RGB ten 84.44 GCN C3D-RGB ten 81.08
IBL I3D-RGB one 82.50 GCN TSN-Flow ten 78.08
AR-Net† C3D RGB one 85.01 GCN TSN-RGB ten 82.12
AR-Net I3D Flow one 82.32 IBL C3D-RGB one 78.66
AR-Net I3D RGB one 85.38 CLAWS C3D-RGB ten 83.03
AR-Net I3D-RGB+Flow one 91.24 MIST C3D-RGB one 81.40
CLAWS C3D-RGB one 89.67 MIST I3D-RGB one 82.30
MIST C3D-RGB one 93.13 RTFM C3D-RGB ten 83.28
MIST I3D-RGB one 94.83 RTFM I3D-RGB ten 84.03
RTFM C3D-RGB ten 91.51 RTFM∗ VideoSwin-RGB one 83.31
RTFM I3D-RGB ten 97.21 Ours C3D-RGB one 82.85
RTFM∗ VideoSwin-RGB ten 96.76 Ours I3D-RGB one 85.30
Ours C3D-RGB one 94.23 Ours VideoSwin-RGB one 85.62
Ours I3D-RGB one 95.45
Ours VideoSwin-RGB one 96.93 Table 2: Compared with other methods on UCF-Crime. The
Ours C3D-RGB ten 94.81 method with † is reported by (Tian et al. 2021). ∗ indicates
Ours I3D-RGB ten 96.08 we re-train the method. Bold represents the best results.
Ours VideoSwin-RGB ten 97.32
Method Feature Crop AP(%) ↑
Table 1: Compared with related methods on ShanghaiTech. MIL-Rank† C3D RGB five 73.20
The methods with † are reported by (Feng, Hong, and Zheng MIL-Rank† I3D RGB five 75.68
2021) or (Tian et al. 2021). ∗ indicates we re-train the Multimodal-VD I3D-RGB five 75.41
method. Under the same feature, the highest result is bolded. RTFM C3D-RGB five 75.89
RTFM I3D-RGB five 77.81
RTFM∗ VideoSwin-RGB five 77.95
Results on ShanghaiTech Ours C3D-RGB five 75.53
We report the results on ShanghaiTech (Zhong et al. 2019) Ours I3D-RGB five 78.28
in Table 1. For a fair comparison, we use two features: Ours VideoSwin-RGB five 78.59
one-crop and ten-crop. One-crop means cropping snippet-
s into the center. Ten-crop means cropping snippets into Table 3: Compared with related methods on XD-Violence.
the center, four corners, and their flipped version (Zhong The methods with † are reported by (Wu et al. 2020) or (Tian
et al. 2019). Under the same backbone and crop, compared et al. 2021). ∗ indicates we re-train the method.
with the previous weakly supervised methods, our method-
s achieve the superior performance on AUC. For example,
with the one-crop I3D-RGB feature, our model achieves an features with other methods. Five-crop means cropping s-
AUC of 95.45% and outperforms all other methods with the nippets into the center and four corners. Under the same
same crop, and with the ten-crop VideoSwin-RGB feature, backbone, our method outperforms all previous weakly su-
our model achieves the best AUC of 97.32%. pervised VAD methods on the AP metric. For example, with
five-crop I3D-RGB features, our model achieves an AP of
Results on UCF-Crime 78.28% and outperforms all other methods, and with five-
We report our experimental results on UCF-Crime (Sul- crop VideoSwin-RGB features, our model achieves an AP
tani, Chen, and Shah 2018) in Table 2. Under I3D and of 78.59% which higher than RTFM by 0.64%.
VideoSwin as the backbone, our method outperforms all pre-
vious weakly supervised methods on the frame-level AUC Complexity Analysis
metric. Under C3D as the backbone, our method has also Generally, Transformer has been often computationally ex-
achieved competitive result. For example, with the one-crop pensive, but our method can achieve real-time surveillance.
I3D-RGB feature, our model achieves an AUC of 85.30% On an NVIDIA 2080 GPU, with VideoSwin (Liu et al.
and outperforms all other methods, and with the one-crop 2021c) as the backbone processes 3.6 snippets per second
VideoSwin-RGB feature, our model achieves the best AUC (a snippet has 16 frames), which is 57.6 frames per second
of 85.62% which is higher than RTFM by 2.31%. (FPS); with I3D (Carreira and Zisserman 2017) as the back-
bone processes 6.5 snippets per second, which is 104 FPS.
Results on XD-Violence Our MSL Network can reach 156.4 forwards per second.
We report our results on XD-Violence (Wu et al. 2020) in Overall, the speed with VideoSwin as the backbone is 42
Table 3. For a fair comparison, we use the same five-crop FPS, and the speed with I3D as the backbone is 63 FPS.

1400
(a) Abnormal Video 01 0025 (b) Abnormal Video 03 0032 (c) Abnormal Video 01 0051 (d) Normal Video 08 045

(e) Explosion 033 x264 (f) RoadAccidents 012 x264 (g) Robbery 102 x264 (h) Normal Video 894 x264

Figure 3: Visualization of abnormal score curves. The horizontal axis represents the number of frames, and the vertical axis
represents the abnormal scores. Videos of (a), (b), (c), and (d) are from the ShanghaiTech dataset, and videos of (e), (f), (g), and
(h) are from the UCF-Crime dataset. The curves indicate the abnormal scores of the video frames, pink areas indicate that the
interval contains an abnormal event, and the red rectangles indicate the location of abnormal events. Best viewed in color.

Basic Layer ShanghaiTech UCF-Crime Score correction ShanghaiTech UCF-Crime

Transformer 96.51 85.41 × 95.98 84.94
CTE 96.93 (+0.42) 85.62 (+0.21) X 96.93 (+0.95) 85.62 (+0.68)

Table 4: Compared with Transformer (Dosovitskiy et al. Table 5: Performance improvement brought by the score
2021), AUC(%) improvement brought by CTE on the correction method in the inference stage measured by
ShanghaiTech and UCF-Crime datasets. AUC(%) on the ShanghaiTech and UCF-Crime datasets.

Qualitative Analysis sults of this ablation experiment. Compared with the result
In order to further demonstrate the effect of our method, as using the standard Transformer as the basic layer, the result
shown in Figure 3, we visualize the anomaly score curves. with CTE as the basic layer increases an AUC by 0.42% and
The first row shows the ground truth and prediction anomaly 0.21% on the ShanghaiTech and UCF-Crime datasets.
scores of three abnormal videos and one normal video from
the ShanghaiTech dataset. From the first row of Figure 3, Impact of score correction in the inference stage. As
we can see that our method can detect abnormal events in shown in Table 5, we conduct an experiment to report the
surveillance videos. Our method successfully predicts short- performance improvement brought by the score correction
term abnormal events (Figure 3 (a)) and long-term abnormal method in the inference stage. From Table 5 we can ob-
events (Figure 3 (b)). Furthermore, our method can also de- serve that score correction can bring an AUC improvemen-
tect multiple abnormal events in a video (Figure 3 (c)). The t of 0.95% and 0.68% with the one-crop features on the
second row shows the ground truth and predicted anomaly ShanghaiTech and UCF-Crime datasets, respectively.
scores of three abnormal videos and one normal video from
the UCF-Crime dataset. From the second row of Figure 3, Conclusion
we can see that our proposed method can also detect abnor-
mal events in complex surveillance scenes. In this work, we first propose an MSL method and a hinge-
based MSL ranking loss. We then design a Transformer-
Ablation Analysis based network to learn both video-level anomaly probability
In order to further evaluate our method, we perform ablation and snippet-level anomaly scores. In the inference stage, we
studies on the ShanghaiTech and UCF-Crime datasets with propose to use the video-level anomaly probability to sup-
one-crop VideoSwin-RGB features. press the fluctuation of snippet-level anomaly scores. Final-
ly, since VAD needs to predict the instance-level anomaly s-
Improvement brought by CTE. To evaluate the effect of cores, by gradually reducing the length of selected sequence,
our CTE, we replace CTE with the standard Transformer we propose a self-training strategy to refine the anomaly s-
(Dosovitskiy et al. 2021). The dimension of the standard cores. Experimental results show that our method achieves
Transformer is the same as our CTE. Table 4 reports the re- significant improvements on three public datasets.

1401
Acknowledgements Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.;
This work was supported in part by the National Natural Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev,
Science Foundation of China (No.62076192), Key Research P.; Suleyman, M.; and Zisserman, A. 2017. The Kinetics
and Development Program in Shaanxi Province of China Human Action Video Dataset. CoRR, abs/1705.06950.
(No.2019ZDLGY03-06), the State Key Program of National Li, W.; Mahadevan, V.; and Vasconcelos, N. 2014. Anoma-
Natural Science of China (No.61836009), in part by the Pro- ly Detection and Localization in Crowded Scenes. IEEE
gram for Cheung Kong Scholars and Innovative Research Transactions on Pattern Analysis and Machine Intelligence,
Team in University (No.IRT 15R53), in part by The Fund 36(1): 18–32.
for Foreign Scholars in University Research and Teaching Li, Y.; Xing, R.; Jiao, L.; Chen, Y.; Chai, Y.; Marturi, N.;
Programs (the 111 Project) (No.B07048), in part by the and Shang, R. 2019. Semi-Supervised PolSAR Image Clas-
Key Scientific Technological Innovation Research Project sification Based on Self-Training and Superpixels. Remote.
by Ministry of Education, the National Key Research and Sens., 11(16): 1933.
Development Program of China. Li, Y.; Zhang, K.; Cao, J.; Timofte, R.; and Gool, L. V. 2021.
LocalViT: Bringing Locality to Vision Transformers. CoRR,
References abs/2104.05707.
Cai, R.; Zhang, H.; Liu, W.; Gao, S.; and Hao, Z. Liu, K.; and Ma, H. 2019. Exploring Background-Bias for
2021. Appearance-Motion Memory Consistency Network Anomaly Detection in Surveillance Videos. In Proceedings
for Video Anomaly Detection. In AAAI, 938–946. of the 27th ACM International Conference on Multimedia,
Carreira, J.; and Zisserman, A. 2017. Quo Vadis, Action MM ’19, 14901499.
Recognition? A New Model and the Kinetics Dataset. In Liu, X.; Li, K.; Zhou, M.; and Xiong, Z. 2011. Enhancing
CVPR, 4724–4733. Semantic Role Labeling for Tweets Using Self-Training. In
Chollet, F. 2017. Xception: Deep Learning with Depthwise AAAI.
Separable Convolutions. In CVPR, 1800–1807. Liu, Y.; Sun, G.; Qiu, Y.; Zhang, L.; Chhatkuli, A.; and Gool,
d’Ascoli, S.; Touvron, H.; Leavitt, M. L.; Morcos, A. S.; L. V. 2021a. Transformer in Convolutional Neural Network-
Biroli, G.; and Sagun, L. 2021. ConViT: Improving Vision s. CoRR, abs/2106.03180.
Transformers with Soft Convolutional Inductive Biases. In
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin,
ICML, volume 139, 2286–2296.
S.; and Guo, B. 2021b. Swin Transformer: Hierarchical
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, Vision Transformer using Shifted Windows. CoRR, ab-
D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; s/2103.14030.
Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021.
Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; and Hu,
An Image is Worth 16x16 Words: Transformers for Image
H. 2021c. Video Swin Transformer. CoRR, abs/2106.13230.
Recognition at Scale. In ICLR.
Feng, J.-C.; Hong, F.-T.; and Zheng, W.-S. 2021. Mist: Mul- Luo, W.; Liu, W.; and Gao, S. 2017. A Revisit of Sparse
tiple instance self-training framework for video anomaly de- Coding Based Anomaly Detection in Stacked RNN Frame-
tection. In CVPR, 14009–14018. work. In ICCV, 341–349.
Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M. R.; Pang, G.; Yan, C.; Shen, C.; van den Hengel, A.; and Bai, X.
Venkatesh, S.; and van den Hengel, A. 2019. Memorizing 2020. Self-Trained Deep Ordinal Regression for End-to-End
Normality to Detect Anomaly: Memory-Augmented Deep Video Anomaly Detection. In CVPR, 12170–12179.
Autoencoder for Unsupervised Anomaly Detection. In IC- Rosenberg, C.; Hebert, M.; and Schneiderman, H. 2005.
CV, 1705–1714. Semi-Supervised Self-Training of Object Detection Models.
Guo, Z.; Zhao, J.; Jiao, L.; Liu, X.; and Liu, F. 2021. A In WACV/MOTION, 29–36.
Universal Quaternion Hypergraph Network for Multimodal Sultani, W.; Chen, C.; and Shah, M. 2018. Real-World
Video Question Answering. IEEE Transactions on Multime- Anomaly Detection in Surveillance Videos. In CVPR, 6479–
dia, 1–1. 6488.
He, C.; Shao, J.; and Sun, J. 2018. An anomaly-introduced Tai, K. S.; Bailis, P.; and Valiant, G. 2021. Sinkhorn
learning method for abnormal event detection. Multimedia Label Allocation: Semi-Supervised Classification via An-
Tools and Applications, 77(22): 29573–29588. nealed Self-Training. In ICML, volume 139, 10065–10075.
Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, Tanha, J.; van Someren, M.; and Afsarmanesh, H. 2017.
W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. Mo- Semi-supervised self-training for decision tree classifiers.
bileNets: Efficient Convolutional Neural Networks for Mo- Int. J. Mach. Learn. Cybern., 8(1): 355–370.
bile Vision Applications. CoRR, abs/1704.04861. Tao, Y.; Zhang, D.; Cheng, S.; and Tang, X. 2018. Improv-
Jeong, J.; Lee, S.; and Kwak, N. 2020. Self-Training using semi-supervised self-training with embedded manifold
ing Selection Network for Semi-supervised Learning. In transduction. Trans. Inst. Meas. Control, 40(2): 363–374.
ICPRAM, 23–32. Tian, Y.; Pang, G.; Chen, Y.; Singh, R.; Verjans, J. W.; and
Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, Carneiro, G. 2021. Weakly-supervised Video Anomaly De-
R.; and Li, F. 2014. Large-Scale Video Classification with tection with Robust Temporal Feature Magnitude Learning.
Convolutional Neural Networks. In CVPR, 1725–1732. CoRR, abs/2101.10030.

1402
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles,
A.; and Jégou, H. 2021. Training data-efficient image trans-
formers & distillation through attention. In ICML, volume
139, 10347–10357.
Tran, D.; Bourdev, L. D.; Fergus, R.; Torresani, L.; and
Paluri, M. 2015. Learning Spatiotemporal Features with 3D
Convolutional Networks. In ICCV, 4489–4497.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. At-
tention is All you Need. In NIPS, 5998–6008.
Wan, B.; Fang, Y.; Xia, X.; and Mei, J. 2020. Weakly super-
vised video anomaly detection via center-guided discrimina-
tive learning. In ICME, 1–6. IEEE.
Wan, B.; Jiang, W.; Fang, Y.; Luo, Z.; and Ding, G. 2021.
Anomaly detection in video sequences: A benchmark and
computational model. IET Image Processing.
Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.;
and Zhang, L. 2021. CvT: Introducing Convolutions to Vi-
sion Transformers. CoRR, abs/2103.15808.
Wu, P.; Liu, j.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; and Yang,
Z. 2020. Not only Look, but also Listen: Learning Multi-
modal Violence Detection under Weak Supervision. In EC-
CV.
Xu, W.; Xu, Y.; Chang, T. A.; and Tu, Z. 2021. Co-
Scale Conv-Attentional Image Transformers. CoRR, ab-
s/2104.06399.
Yan, H.; Li, Z.; Li, W.; Wang, C.; Wu, M.; and Zhang, C.
2021. ConTNet: Why not use convolution and transformer
at the same time? CoRR, abs/2104.13497.
Yu, F.; Zhang, M.; Dong, H.; Hu, S.; Dong, B.; and Zhang, L.
2021. DAST: Unsupervised Domain Adaptation in Semantic
Segmentation Based on Discriminator Attention and Self-
Training. In AAAI, 10754–10762.
Zhang, J.; Qing, L.; and Miao, J. 2019. Temporal Convo-
lutional Network with Complementary Inner Bag Loss for
Weakly Supervised Anomaly Detection. In ICIP, 4030–
4034.
Zhang, Q.; and Yang, Y. 2021. ResT: An Efficient Trans-
former for Visual Recognition. CoRR, abs/2105.13677.
Zhao, B.; Fei-Fei, L.; and Xing, E. P. 2011. Online detection
of unusual events in videos via dynamic sparse coding. In
CVPR, 3313–3320.
Zheng, H.; Zhang, Y.; Yang, L.; Wang, C.; and Chen, D. Z.
2020. An Annotation Sparsification Strategy for 3D Medical
Image Segmentation via Representative Selection and Self-
Training. In AAAI, 6925–6932.
Zhong, J.; Li, N.; Kong, W.; Liu, S.; Li, T. H.; and Li, G.
2019. Graph Convolutional Label Noise Cleaner: Train a
Plug-And-Play Action Classifier for Anomaly Detection. In
CVPR, 1237–1246.
Zhu, Y.; and Newsam, S. D. 2019. Motion-Aware Feature
for Improved Video Anomaly Detection. In BMVC, 270.

1403

Data Science Master
100% (1)
Data Science Master
2 pages
Reguno Manual
No ratings yet
Reguno Manual
2 pages
Completeness and Uncertainty of Pseudo Labels
No ratings yet
Completeness and Uncertainty of Pseudo Labels
10 pages
Paper 05
No ratings yet
Paper 05
9 pages
MIST
No ratings yet
MIST
10 pages
A Coarse-to-Fine Pseudo-Labeling (C2FPL) Framework For Unsupervised Video
No ratings yet
A Coarse-to-Fine Pseudo-Labeling (C2FPL) Framework For Unsupervised Video
13 pages
Paper 02
No ratings yet
Paper 02
5 pages
Biradar Robust Anomaly Detection Through Transformer-Encoded Feature Diversity Learning ACCVW 2024 Paper
No ratings yet
Biradar Robust Anomaly Detection Through Transformer-Encoded Feature Diversity Learning ACCVW 2024 Paper
14 pages
Cross-Epoch Learning For Weakly Supervised Anomaly Detection in Surveillance Videos
No ratings yet
Cross-Epoch Learning For Weakly Supervised Anomaly Detection in Surveillance Videos
5 pages
Learning Anomalies With Normality Prior For Unsupervised Video Anomaly Detection
No ratings yet
Learning Anomalies With Normality Prior For Unsupervised Video Anomaly Detection
17 pages
Computer Vision3
No ratings yet
Computer Vision3
5 pages
Ai Paper 3
No ratings yet
Ai Paper 3
14 pages
View of Real-World Anomaly Detection in Video Using Spatio-Temporal Features Analysis For Weakly Labelled Data With Auto Label Generation
No ratings yet
View of Real-World Anomaly Detection in Video Using Spatio-Temporal Features Analysis For Weakly Labelled Data With Auto Label Generation
9 pages
Anomaly Detection r1 in Weakly Supervised Videos Using Multistage Graphs and General Deep Learning Based Spatial-Temporal Feature Enhancement
No ratings yet
Anomaly Detection r1 in Weakly Supervised Videos Using Multistage Graphs and General Deep Learning Based Spatial-Temporal Feature Enhancement
15 pages
Icses 24 T3 1047
No ratings yet
Icses 24 T3 1047
8 pages
Learning Memory-Guided Normality For Anomaly Detection
No ratings yet
Learning Memory-Guided Normality For Anomaly Detection
10 pages
Georgescu Anomaly Detection in Video Via Self-Supervised and Multi-Task Learning CVPR 2021 Paper
No ratings yet
Georgescu Anomaly Detection in Video Via Self-Supervised and Multi-Task Learning CVPR 2021 Paper
11 pages
Wu - Learning Causal Temporal Relation and Feature Discrimination For Anomaly Detection - 21
No ratings yet
Wu - Learning Causal Temporal Relation and Feature Discrimination For Anomaly Detection - 21
15 pages
RTFM
No ratings yet
RTFM
13 pages
Improved Anomaly Detection in Surveillance Videos Based On A Deep Learning Method (2018)
No ratings yet
Improved Anomaly Detection in Surveillance Videos Based On A Deep Learning Method (2018)
9 pages
Anomaly Detection in Video Via Self-Supervised and Multi-Task Learning
No ratings yet
Anomaly Detection in Video Via Self-Supervised and Multi-Task Learning
15 pages
Zaheer Generative Cooperative Learning For Unsupervised Video Anomaly Detection CVPR 2022 Paper
No ratings yet
Zaheer Generative Cooperative Learning For Unsupervised Video Anomaly Detection CVPR 2022 Paper
11 pages
Annomally Detection Reserach Paper
No ratings yet
Annomally Detection Reserach Paper
21 pages
Sensors 23 06256 v2
No ratings yet
Sensors 23 06256 v2
27 pages
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
From Everand
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
Fouad Sabry
No ratings yet
Electronics 12 00029
No ratings yet
Electronics 12 00029
22 pages
Real-World Anomaly Detection in Surveillance Videos
No ratings yet
Real-World Anomaly Detection in Surveillance Videos
10 pages
Qiu 22 B
No ratings yet
Qiu 22 B
15 pages
Latent Space Conditioning For Improved Classificat
No ratings yet
Latent Space Conditioning For Improved Classificat
19 pages
Real
No ratings yet
Real
8 pages
Glancevad: Exploring Glance Supervision For Label-Efficient Video Anomaly Detection
No ratings yet
Glancevad: Exploring Glance Supervision For Label-Efficient Video Anomaly Detection
21 pages
Log Eucledian Covariance Matrix
No ratings yet
Log Eucledian Covariance Matrix
8 pages
5-Uninformed Students Student-Teacher Anomaly Detection
No ratings yet
5-Uninformed Students Student-Teacher Anomaly Detection
11 pages
(Babenko2009) Multiple Instance Learning - Algorithms and Applications
No ratings yet
(Babenko2009) Multiple Instance Learning - Algorithms and Applications
19 pages
ASurveyof Deep Learning Solutionsfor Anomaly Detectionin Surveillance Videos
No ratings yet
ASurveyof Deep Learning Solutionsfor Anomaly Detectionin Surveillance Videos
9 pages
Multiple Instance Learning With Bag Dissimilarities
No ratings yet
Multiple Instance Learning With Bag Dissimilarities
35 pages
Video Anomaly Detection For Smart Surveillance: Related Concepts
No ratings yet
Video Anomaly Detection For Smart Surveillance: Related Concepts
12 pages
Invariant Anomaly Detection Under Distribution Shifts: A Causal Perspective
No ratings yet
Invariant Anomaly Detection Under Distribution Shifts: A Causal Perspective
28 pages
VideoAnamolyDetection Survey
No ratings yet
VideoAnamolyDetection Survey
36 pages
Any-Shot Sequential Anomaly Detection in Surveillance Videos CVPRW 2020 Paper
No ratings yet
Any-Shot Sequential Anomaly Detection in Surveillance Videos CVPRW 2020 Paper
6 pages
Semi-Supervised Deep Learning Based Method For Abnormality Detection in Videos
No ratings yet
Semi-Supervised Deep Learning Based Method For Abnormality Detection in Videos
5 pages
Lin Interventional Bag Multi-Instance Learning On Whole-Slide Pathological Images CVPR 2023 Paper
No ratings yet
Lin Interventional Bag Multi-Instance Learning On Whole-Slide Pathological Images CVPR 2023 Paper
10 pages
Anomaly Detection in Industrial Software Systems: Using Variational Autoencoders
No ratings yet
Anomaly Detection in Industrial Software Systems: Using Variational Autoencoders
8 pages
3603 9910 1 PB
No ratings yet
3603 9910 1 PB
8 pages
PHD Persentation
No ratings yet
PHD Persentation
22 pages
Enhancing Video Anomaly Detection For Human Suspicious Behavior Through Deep Hybrid Temporal Spatial Network
No ratings yet
Enhancing Video Anomaly Detection For Human Suspicious Behavior Through Deep Hybrid Temporal Spatial Network
8 pages
1 s2.0 S1047320321000201 Main
No ratings yet
1 s2.0 S1047320321000201 Main
14 pages
Introduction To Multiple Instance Learning PDF
No ratings yet
Introduction To Multiple Instance Learning PDF
29 pages
Electronics 11 02306 v2 PDF
No ratings yet
Electronics 11 02306 v2 PDF
15 pages
Anomaly Detection Using Deep Learning Based Model With Feature Attention
No ratings yet
Anomaly Detection Using Deep Learning Based Model With Feature Attention
8 pages
Chapters
No ratings yet
Chapters
27 pages
使用半监督学习进行细微异常检测的特征提取
No ratings yet
使用半监督学习进行细微异常检测的特征提取
12 pages
Cse 12
No ratings yet
Cse 12
12 pages
Algorithms 13 00140
No ratings yet
Algorithms 13 00140
4 pages
Anomaly Detection in Surveillance
No ratings yet
Anomaly Detection in Surveillance
9 pages
WSDM21 Tutorial DLAD Slides
No ratings yet
WSDM21 Tutorial DLAD Slides
110 pages
Suspicious Activity Detection Using Convolution Neural Network
No ratings yet
Suspicious Activity Detection Using Convolution Neural Network
11 pages
Reverse Accessible in Local Outlier Factor Density Based Recognition
No ratings yet
Reverse Accessible in Local Outlier Factor Density Based Recognition
10 pages
NeurIPS 2022 A Unified Model For Multi Class Anomaly Detection Paper Conference
No ratings yet
NeurIPS 2022 A Unified Model For Multi Class Anomaly Detection Paper Conference
14 pages
IEEE Xplore Citation Plain Text Download 2025.1.25.10.43.24
No ratings yet
IEEE Xplore Citation Plain Text Download 2025.1.25.10.43.24
3 pages
Review of Violence Detection System.
No ratings yet
Review of Violence Detection System.
4 pages
Abnormal Event Detection in Videos Using Spatiotemporal Autoencoder
100% (1)
Abnormal Event Detection in Videos Using Spatiotemporal Autoencoder
20 pages
01 Introduction
No ratings yet
01 Introduction
50 pages
Chap1 Course Introduction
No ratings yet
Chap1 Course Introduction
34 pages
10 Tuples
No ratings yet
10 Tuples
17 pages
Setup The Environment - Visual Studio Code
No ratings yet
Setup The Environment - Visual Studio Code
6 pages
05 Iterations
No ratings yet
05 Iterations
48 pages
Gentle Start To Natural Language Processing Using Python
No ratings yet
Gentle Start To Natural Language Processing Using Python
6 pages
Natural Language Processing Is Fun! - Adam Geitgey - Medium
No ratings yet
Natural Language Processing Is Fun! - Adam Geitgey - Medium
19 pages
A Practitioner's Guide To Natural Language Processing (Part I) - Processing Understanding Text
No ratings yet
A Practitioner's Guide To Natural Language Processing (Part I) - Processing Understanding Text
46 pages
Session08 1slot Collections SuTV
No ratings yet
Session08 1slot Collections SuTV
31 pages
Session06 2slots Numbers and Strings SuTV
No ratings yet
Session06 2slots Numbers and Strings SuTV
30 pages
Session09 1slot Algorithms SuTV
No ratings yet
Session09 1slot Algorithms SuTV
18 pages
Modality-Aware Contrastive Instance Learning With
No ratings yet
Modality-Aware Contrastive Instance Learning With
10 pages
2020-10-19SupplementaryMA102MA102-I - Ktu Qbank
No ratings yet
2020-10-19SupplementaryMA102MA102-I - Ktu Qbank
3 pages
Familiarization of Matlab - Part2 - lsdcRNyfAH
No ratings yet
Familiarization of Matlab - Part2 - lsdcRNyfAH
3 pages
Introduction To Finite Element Analysis: Dr. A. Kumaravel, M.Tech., PH.D.
No ratings yet
Introduction To Finite Element Analysis: Dr. A. Kumaravel, M.Tech., PH.D.
7 pages
Cbse Math PH - Ii Polynomials Solutions
No ratings yet
Cbse Math PH - Ii Polynomials Solutions
29 pages
Introduction To IPL Winning Team Prediction Model
No ratings yet
Introduction To IPL Winning Team Prediction Model
10 pages
Literature Survey On Image Deblurring Techniques
100% (1)
Literature Survey On Image Deblurring Techniques
3 pages
Quantum Gravity As Gravitized Quantum Theory: Tristan H Ubsch and Djordje Minic
No ratings yet
Quantum Gravity As Gravitized Quantum Theory: Tristan H Ubsch and Djordje Minic
60 pages
Single-Source Shortest Paths - Cormen Book CH 24
No ratings yet
Single-Source Shortest Paths - Cormen Book CH 24
28 pages
Report CA2 Template
No ratings yet
Report CA2 Template
5 pages
Sharma CreditScoring
No ratings yet
Sharma CreditScoring
45 pages
Networking Analysis (Shortest Route Program) : Example 1
No ratings yet
Networking Analysis (Shortest Route Program) : Example 1
7 pages
6.1.9 Recursion - Recursive Algorithms Assignment
No ratings yet
6.1.9 Recursion - Recursive Algorithms Assignment
3 pages
Finite Elements in Analysis & Design: Abhishek Arora, Benjamin M. Ward, Caglar Oskay
No ratings yet
Finite Elements in Analysis & Design: Abhishek Arora, Benjamin M. Ward, Caglar Oskay
25 pages
HW #3
No ratings yet
HW #3
3 pages
DSA Question Bank
No ratings yet
DSA Question Bank
6 pages
Insilico Gene Analysis
No ratings yet
Insilico Gene Analysis
34 pages
Mock End Sem 2024-2025 NMCP
No ratings yet
Mock End Sem 2024-2025 NMCP
2 pages
ITM Chapter 5 New On Probability Distributions
No ratings yet
ITM Chapter 5 New On Probability Distributions
16 pages
Week 5 EMQ Solution
100% (2)
Week 5 EMQ Solution
4 pages
Low Code No Code Ai Presentation
No ratings yet
Low Code No Code Ai Presentation
11 pages
Chapter Three Lecture Note
No ratings yet
Chapter Three Lecture Note
12 pages
04B. Bioinformatics-Lecture 4 (Alternative) - Blast
100% (1)
04B. Bioinformatics-Lecture 4 (Alternative) - Blast
38 pages
Dynamic Soil-Structure Interaction Analysis of Buildings by Neural Networks
No ratings yet
Dynamic Soil-Structure Interaction Analysis of Buildings by Neural Networks
13 pages
Business Intelligence and Decision Support Systems (9 Ed., Prentice Hall)
No ratings yet
Business Intelligence and Decision Support Systems (9 Ed., Prentice Hall)
41 pages
Applied Computing and Informatics: Kumash Kapadia, Hussein Abdel-Jaber, Fadi Thabtah, Wael Hadi
No ratings yet
Applied Computing and Informatics: Kumash Kapadia, Hussein Abdel-Jaber, Fadi Thabtah, Wael Hadi
6 pages
Computer Vision: Image Enhancement in Spatial Domain
No ratings yet
Computer Vision: Image Enhancement in Spatial Domain
18 pages
RISE QM MCQs Ch#04
No ratings yet
RISE QM MCQs Ch#04
14 pages
An Explicit Construction of Casimir Operators
No ratings yet
An Explicit Construction of Casimir Operators
16 pages
Ghafari 2012
No ratings yet
Ghafari 2012
9 pages

Self-Training Multi-Sequence Learning With Transformer

Uploaded by

Self-Training Multi-Sequence Learning With Transformer

Uploaded by

The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

Self-Training Multi-Sequence Learning with Transformer for Weakly Supervised

Abstract Hong, and Zheng 2021). Recently, many researchers have

Transformer-based Multi-Sequence Multi-Head

(a) Multi-Sequence Learning Architecture Concat

Inference Features Scaled Dot-Product Attention

positive bag negative bag 𝐾

(b) Self-Training Multi-Sequence Learning Pipeline (c) Convolutional Transformer Encoder

Basic Layer ShanghaiTech UCF-Crime Score correction ShanghaiTech UCF-Crime

You might also like