
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 30, 2021

Learning Causal Temporal Relation and Feature Discrimination for Anomaly Detection

Peng Wu and Jing Liu, Senior Member, IEEE

Manuscript received June 11, 2020; revised November 29, 2020 and January 25, 2021; accepted February 16, 2021. Date of publication March 3, 2021; date of current version March 11, 2021. This work was supported in part by the Key Project of Science and Technology Innovation 2030 of the Ministry of Science and Technology of China under Grant 2018AAA0101302 and in part by the General Program of the National Natural Science Foundation of China (NSFC) under Grant 61773300. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Soma Biswas. (Corresponding author: Jing Liu.) The authors are with the School of Artificial Intelligence, Xidian University, Xi'an 710071, China (e-mail: [email protected]; [email protected]). This article has supplementary downloadable material available at https://doi.org/10.1109/TIP.2021.3062192, provided by the authors. Digital Object Identifier 10.1109/TIP.2021.3062192

Abstract—Weakly supervised anomaly detection is a challenging task since frame-level labels are not given in the training phase. Previous studies generally employ neural networks to learn features and produce frame-level predictions and then use multiple instance learning (MIL)-based classification loss to ensure the interclass separability of the learned features; all operations simply take the current time information as input and ignore the historical observations. According to investigations, these solutions are universal but ignore two essential factors, i.e., the temporal cue and feature discrimination. The former introduces temporal context to enhance the current time feature, and the latter enforces the samples of different categories to be more separable in the feature space. In this article, we propose a method that consists of four modules to leverage the effect of these two ignored factors. The causal temporal relation (CTR) module captures local-range temporal dependencies among features to enhance features. The classifier (CL) projects enhanced features to the category space using the causal convolution and further expands the temporal modeling range. Two additional modules, namely, the compactness (CP) and dispersion (DP) modules, are designed to learn the discriminative power of features, where the compactness module ensures the intraclass compactness of normal features, and the dispersion module enhances the interclass dispersion. Extensive experiments on three public benchmarks demonstrate the significance of causal temporal relations and feature discrimination for anomaly detection and the superiority of our proposed method.

Index Terms—Anomaly detection, discriminative features, temporal modeling, weak supervision.

I. INTRODUCTION

MILLIONS of videos are currently captured by surveillance cameras and stored on hard disks. If there were no special requirements, these videos would be overwritten periodically. The public does not much care how videos are stored, but is deeply concerned about how to identify, in a timely manner, unusual events in videos that may endanger public security. It is impractical to locate unusual events in time through human effort, since petabytes of video data are generated by cameras every minute. Therefore, we need intelligent machines to automatically detect anomalies within a video, where anomaly detection technology is the essential link. However, anomaly detection in unconstrained videos is also a challenging task. Here, we list several major challenges, including but not limited to the rare occurrence of anomalies, the diversity of anomalies, large intraclass variations, and time-consuming temporal annotation.

A. Development of Video Anomaly Detection

Early studies mainly focus on semi-supervised anomaly detection, namely, only normal videos are available in the training set. These methods [1]–[3], [5], [48]–[52], [63], [64] generally construct a normal pattern to encode normal samples. Under this pattern, test samples that are not consistent with the normal pattern are regarded as anomalies. Such semi-supervised anomaly detection relieves, to a certain extent, the pressure triggered by collecting anomalous samples. However, such solutions suffer from several limitations. First, they are more likely to produce high false alarm rates for unseen normal events. Second, several anomalies are not realistic; for instance, in some datasets, e.g., the UCSD Ped [3] and CUHK Avenue [4] datasets, bicycling, running, and throwing are regarded as anomalies. Third, most video datasets for semi-supervised anomaly detection are small in scale, and their total duration is a few minutes. These limitations are major obstacles to real-world anomaly detection.

To further move toward real-world anomaly detection, recent studies have worked on weakly supervised anomaly detection. Unlike semi-supervised anomaly detection, in the weak supervision mode, both normal and abnormal videos are available in the training set. "Weak supervision" means that there are neither trimmed anomalous segments nor accurate temporal annotations; namely, only video-level labels are available in the training videos. When time-consuming temporal annotations are not required, it is favorable to build large-scale datasets. Sultani et al. [10] constructed a large-scale anomaly dataset and proposed a deep multiple instance learning (MIL) ranking model. After that, several innovative methods [6], [14]–[16] were developed, and most of them used the MIL-based loss as the classification loss and achieved remarkable performance. Recently, in order to explore the multi-modality potential, Wu et al. [57] developed a larger dataset that contains audio-video signals.

Fig. 1. Three different paradigms of temporal modeling.

Fig. 2. An example video from the anomaly (Abuse) category of the benchmark [10].

Fig. 3. Difference between vanilla convolution and causal convolution.

B. Motivation and Challenge

However, comparing the abovementioned weakly supervised methods, we identify two key factors that are overlooked and have been widely exploited in other related tasks, such as temporal action detection [17], [24], [27], [47]. The first one is temporal relation modeling. A video is more than a stack of images: temporality appears throughout the images, and long-range temporal structure plays an important role in understanding the dynamics in action videos [34]. However, video anomaly detection is generally applied to surveillance videos; this task should be dealt with online and cannot rely on any future cues to learn long-range temporal dependencies. Based on this principle, previous anomaly detection works [10], [16] either a) only use the current information, where the current information is a short-range temporal encoding, generally a video snippet comprised of 5 to 16 frames whose total duration is less than 1 second (Figure 1(a)); or b) use only the current information in the test phase despite capturing long-range temporal dependencies of the full video in the training phase (Figure 1(b)). The former completely ignores temporal cues, and the latter cannot use temporal cues explicitly in the test phase. Here, we take a video segment in an anomaly detection benchmark as an example to illustrate the significance of temporal relations using previous video frames for anomaly detection. As seen from Figure 2, given only the current frame, the 360th frame, where an old lady lies on the ground, we cannot determine whether an anomaly has occurred. Once the previous video frames, frames 210 through 360, are given, we immediately know that this woman was knocked down by a man, which means that an anomalous event occurred here. This example indicates that previous frames can provide useful information for our decision making under reasonable temporal relation modeling. The second one is the discriminative power of learned features (otherwise known as feature discrimination). Previous studies have only used MIL-based classification supervision, and no other explicit supervision was provided to enforce the features of different categories to be more separable. As pointed out in [17], while MIL-based classification loss ensures the interclass separability of learned features, this separability at the video level alone is insufficient for accurate action localization. In contrast, the discriminative power of features is learned by clustering similar features and isolating different features, and it should complement and even enhance the separability guided by the MIL-based classification loss. Therefore, it is essential to learn the discriminative power of features.

C. Our Proposal and Contribution

The above observations motivate this work. In this article, we propose a method for weakly supervised anomaly detection, aiming at exploring the above two crucial factors to enhance anomaly detectors. First, we propose a causal temporal relation (CTR) module to capture temporal cues between video snippets. As mentioned, there should be no information leakage from future to past in the anomaly detection task (i.e., the standard temporal modeling methods, such as self-attention networks [18], [30], [54] that use the information of full videos, are inapplicable). As in [19] and [25], our CTR module is causal and local: it only captures temporal relations between the current snippet and its historical neighbors within a local range (Figure 1(c); the local-range dependency is a case of long-range dependency). In addition, the CTR module mines two types of temporal relationships simultaneously, namely, the semantic similarity and the positional prior. Although recurrent neural networks (RNNs) and their variants are designed to model a sequence over time and have made great achievements, we do not adopt an RNN-based model to learn long-range temporal relations, since RNNs experience difficulty modeling long sequences [28] and cannot allow the network to selectively extract the most relevant information and relationships [47]. The experimental results in Section IV-D also indicate that our CTR is superior to RNNs in temporal relation modeling. Furthermore, the designed classifier (CL) in our method, which is used to compute the final anomaly score, also captures local-range temporal dependencies by the causal convolution operation [25], [26]. We show the difference between causal convolution and vanilla convolution in Figure 3, where causal convolution does not depend on any future information.
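The distinction in Figure 3 can be made concrete with a few lines of PyTorch. The sketch below is ours, not the authors' code: a vanilla Conv1d with symmetric padding looks at future snippets, whereas padding only on the left makes the convolution causal, so the output at time t depends only on snippets up to t. The kernel size 7 mirrors the K reported later for the CL module; everything else is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only sees the current and past time steps."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.left_pad = kernel_size - 1          # pad history only, never the future
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x):                        # x: (batch, channels, T)
        return self.conv(F.pad(x, (self.left_pad, 0)))

x = torch.randn(1, 512, 32)                      # 32 snippets of 512-d features
vanilla = nn.Conv1d(512, 1, kernel_size=7, padding=3)   # receptive field includes 3 future snippets
causal = CausalConv1d(512, 1, kernel_size=7)             # receptive field covers the past only
print(vanilla(x).shape, causal(x).shape)         # both: torch.Size([1, 1, 32])
```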


Second, in order to learn the discriminative power of features, we propose two auxiliary modules termed the compactness (CP) module and the dispersion (DP) module, which are applied to normal and abnormal videos, respectively. For the sake of discussion, NFs, AAFs, and ANFs are short for the features of a normal video, the features of anomalous regions in an abnormal video, and the features of normal regions in an abnormal video, respectively. On the one hand, the CP module is only used for NFs; it efficiently pulls NFs to their center and thus reduces the intraclass variation. On the other hand, the DP module is based on the motivation that ANFs should be closer to NFs than AAFs in the feature space. In view of this motivation, the DP module is intended to increase the interclass dispersion by minimizing the distance from NFs to ANFs and maximizing the distance from NFs to AAFs. However, due to the lack of temporal annotations in the training phase, we use an attention-based feature aggregation scheme to compute AAFs and ANFs. In summary, the CP module targets the learning objective of intraclass compactness, and the DP module aims at interclass dispersion; both of them are used to learn the discriminative power of features. The operating principles of the CP and DP modules are shown in Figure 4.

Fig. 4. Principles of CP and DP modules. The CP module tries to compress NFs, whereas the DP module makes ANFs close to the center of NFs and maneuvers AAFs away from the center.

The main contributions of this article can be summarized as follows:

1) We introduce a weakly supervised method for anomaly detection. In our method, we propose four modules to explore two non-trivial types of information, namely, temporal relations and feature discrimination. The CTR and CL modules are used to enhance the interaction between the current and historical information, and both the CP and DP modules are utilized to learn discriminative features. To our knowledge, we are the first to introduce causal temporal relations and feature discrimination for weakly supervised anomaly detection.

2) We conduct extensive experiments on three benchmarks, and the experimental results demonstrate that our method, with the assistance of causal temporal relations and feature discrimination, achieves significant performance boosts. More specifically, on the UCF-Crime dataset, compared to a simple baseline that leverages no temporal relations and no feature discrimination, effective use of temporal relations in our method achieves clear performance boosts of 1.8% (AUC@ROC) and 4.6% (AUC@PR); on top of temporal relations, leveraging feature discrimination further improves the performance by 2.0% (AUC@ROC) and 3.1% (AUC@PR), and the total gain is 3.8% and 7.7% in terms of AUC@ROC and AUC@PR, respectively. On the XD-Violence and ShanghaiTech datasets, the total gains are 6.3% (AUC@ROC) and 9.2% (AUC@PR) and 1.2% (AUC@ROC) and 7.1% (AUC@PR), respectively.

The rest of this article is organized as follows. Section II reviews the existing anomaly detection works as well as other related tasks. Our proposed method is introduced in detail in Section III. The comparison and ablation experiments on the benchmarks are given in Section IV. Finally, conclusions are given in Section V.

II. RELATED WORK

A. Video Anomaly Detection

As we mentioned, video anomaly detection falls into two major categories, namely, semi-supervision and weak supervision. Common semi-supervised solutions are based on reconstruction errors or one-class classifiers; e.g., Cong et al. [2] proposed sparse reconstruction costs over the normal dictionary to detect anomalies. Luo et al. [20] proposed a stacked recurrent neural network based on reconstruction errors to classify whether a test event belongs to normal or abnormal activities. Xu et al. [21] first proposed an appearance and motion deep network to generate discriminative features and then used a one-class SVM [29] to predict anomaly scores for each test feature. Coincidentally, Wu et al. [8] introduced a one-stage deep one-class classifier for anomaly detection; the core of this model is that it can jointly learn discriminative features and train a one-class classifier.

Recently, several weakly supervised methods have been developed. Sultani et al. [10] introduced a deep multiple instance learning ranking model for weakly supervised anomaly detection. Zhu et al. [6] pointed out that stronger visual features can generally improve the anomaly detection performance, and they presented a temporal augmented network to generate motion-aware features. At the same time, Liu et al. [15] focused on the anomalous areas, attempted to eliminate the effect of background, and reannotated the UCF-Crime dataset, adding extra anomalous localization information. However, the above works simply utilize the current snippet and overlook temporal relations, e.g., previous snippet features. Zhong et al. [16] cast weakly supervised anomaly detection as a supervised learning task under noisy labels [56] and proposed a graph convolutional label noise cleaner that uses the graph convolution network (GCN) to capture temporal relations, of which the core of temporal modeling is consistent with ours. The key difference is that capturing long-range temporal dependencies only exists in the training phase for the graph convolutional label noise cleaner, because the future information is unknown in the test phase. Our method is designed to address weakly supervised anomaly detection and achieves good performance.

Compared with previous methods, our method has two advantages: on the one hand, the CTR and CL modules introduce long-range temporal dependencies in both the training and test phases, which ensures that contextual information is aggregated for obtaining the current anomaly score; on the other hand, the CP and DP modules improve the discriminative power of features by compacting (attracting) normal samples and repulsing anomalies. The related validation experiments are shown in Section IV.

B. Anomaly Detection in Other Fields

Anomaly detection is a significant problem that has been well studied within diverse research fields and application domains, such as video anomaly detection [66], image anomaly detection [65], hyperspectral anomaly detection [58], [60], network anomaly detection [59], [61], geochemical anomaly detection [62], and defect detection [67]. There are several reviews of anomaly detection [76]–[78]. Thereinto, clustering algorithms [79], [80], e.g., K-means, density-based spatial clustering of applications with noise (DBSCAN), and density peak clustering, are widely used to detect anomalies. For instance, Song et al. [60] used DBSCAN to construct a dictionary for hyperspectral anomaly detection. Tu et al. [58] adopted the density peak clustering algorithm to calculate the density of each pixel and detected anomalies using the obtained density map. Unlike these unsupervised methods, our method does not use such distance- or density-based clustering algorithms to directly classify each sample; the gist of our method is that using distance-based losses on neural networks makes all normal samples compact and isolates abnormal samples, where the compacting operation is similar to several prior works [8], [17], [74], and the isolation operation is similar to the recent work [24].

C. Video Action Recognition

Action recognition is a fundamental problem in video analytics. Most video-related tasks, such as video anomaly detection, video action detection, and video captioning, use off-the-shelf action recognition models as feature extractors for further analysis. Before the age of deep learning, most action recognition methods [36]–[38] used hand-crafted features. Recently, with the renaissance of convolutional neural networks, deep learning-based methods have been proposed for action recognition, e.g., 3D ConvNet (C3D) [9], Inflated 3D (I3D) [23], Expand 3D (X3D) [32], the two-stream network [33], the temporal segment network (TSN) [34], and the efficient convolutional network (ECO) [35].

D. Temporal Action Detection

Temporal action detection (or localization) is different from action recognition, aiming at identifying the start and end time stamps and the categories of actions. Anomaly detection can be seen as a coarse temporal action detection. We focus on weakly supervised action localization (WSAL) and online action detection (OAD), which are more related to weakly supervised anomaly detection (WSAD).

WSAL and WSAD have many common traits, e.g., only video-level labels for training and the goal of detecting the start and end time stamps. The major difference is that WSAL is an offline task that takes the full video as input. Several innovative methods have been presented in the last few years, such as UntrimmedNet [39], W-TALC [24], 3C-Net [17], STPN [40], AutoLoc [41], CleanNet [42], BaSNet [43], and Winners-out [7]. Among them, W-TALC and 3C-Net are similar to ours; all of them force learned features from the same class to be similar and otherwise dissimilar, and our method is more comprehensive than W-TALC and 3C-Net.

OAD and online anomaly detection both aim to detect ongoing actions or events from streaming videos based only on current and historical observations. Previous methods [44]–[46] have mainly adopted RNNs, e.g., long short-term memory (LSTM) and gated recurrent units (GRU), to model the current action; however, these methods do not explicitly consider the relevance of historical observations to the current action. Recently, a modified GRU [47] was developed to model the relations between an ongoing action and past actions, aiming at the aforementioned shortage.

III. METHOD

In this section, we first present our overall architecture and then describe each module in our proposed method in detail.

A. Overall Architecture

The overall architecture of our method is shown in Figure 5. As we can see, a video first passes through the pretrained feature extractor to extract initial features. To learn the discriminative power, these features are fed into a fully connected (FC) layer followed by ReLU and dropout; then, the CP and DP modules come into play. The resulting features are called discriminative features hereafter. These discriminative features are fed into the ad hoc CTR module and are transformed into interaction feature representations, since each feature comprises the current information as well as the interactive information between the current and previous features. After that, the CL module, which includes two parallel convolutional layers, projects the aggregated features into the 1-dimensional (1D) category space to obtain snippet-level (or frame-level) anomaly activations. The main difference between these two convolution layers is the kernel size. The classification (CS) module takes the activations as input and uses the MIL-based classification loss to learn high activations for anomalous regions.

Fig. 5. Overview of our proposed method. Our method is mainly comprised of six components, where the off-the-shelf feature extractor (I3D) is used to extract snippet features, and the CP and DP modules enforce the features to be discriminative. After that, the CTR module further processes features and captures causal local-range temporal dependencies, and the CL module projects features into the category space and generates the anomaly prediction at the snippet level. Finally, the CS module uses MIL to obtain the video-level prediction and compute the classification loss.

Formally, suppose that we are given N training videos {v_n}_{n=1}^{N} and the corresponding weak labels {q_n}_{n=1}^{N}, where q_n ∈ {0, 1}; q_n = 1 indicates that v_n covers abnormal events, but the start and end times are unknown. For each v_n, before v_n is fed into the feature extractor, it is divided into non-overlapping snippets, where we use T_n to denote the number of snippets (the length of v_n); the length of untrimmed videos varies greatly. Let X_n ∈ R^{T×D} denote the discriminative features, where D denotes the feature dimension; in this article, D = 512. The interaction features are denoted by Z_n ∈ R^{T×2D}. The final anomaly activations are denoted by A_n ∈ R^{T}, where each element of A_n is in the range [0, 1], representing the anomaly score of the corresponding snippet.
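To keep the notation above concrete, the following sketch (ours, not the authors' code) traces the tensor shapes of Figure 5 for a single video with T snippets; the 1024-dimensional inputs anticipate the I3D features described in Section IV-B, and the FC layer maps them to D = 512.

```python
import torch
import torch.nn as nn

T = 64                                           # number of snippets in one video
initial = torch.randn(T, 1024)                   # pretrained I3D snippet features
fc = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.7))
X_n = fc(initial)                                # discriminative features, X_n in R^{T x D}, D = 512
Z_n = torch.randn(T, 1024)                       # interaction features, Z_n in R^{T x 2D} (CTR output)
A_n = torch.rand(T)                              # anomaly activations, A_n in [0, 1]^T (CL output)
print(X_n.shape, Z_n.shape, A_n.shape)
```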


B. Causal Temporal Relation Module

The CTR module aims to aggregate useful information from historical and current features by a temporal attention mechanism. Following the temporal attention mechanism [18], [30], we formulate the CTR module as follows:

Y_{n,t} = \left[\operatorname{softmax}\!\left(X_{n,t} W_{\theta} W_{\varphi}^{\top} X_{n,t-\Delta t:t}^{\top}\right),\ \operatorname{softmax}(R)\right] X_{n,t-\Delta t:t}   (1)

Z_{n,t} = \operatorname{Dropout}\!\left(\operatorname{ReLU}\!\left(\operatorname{Conv1D}\!\left(Y_{n,t}\right)\right)\right)   (2)

Here, (1) is the core of the CTR module, which formulates how to model causal temporal relations of local snippets. Although (1) is similar to the formulation of the non-local network [18], our CTR module captures temporal dependencies within a local range only by leveraging previous information rather than the global information in [18], which is well-suited for online tasks. Moreover, the CTR module also captures the positional prior that is absent in non-local networks. We describe (1) in detail later. In addition, (2) is a standard 1D convolution operation (Conv1D) followed by ReLU and dropout, where the number of kernels is set to 2D. The detailed structure of CTR is shown in Figure 6.

Fig. 6. Structure of the CTR module.

In (1), [·] is the concatenation operation, which concatenates two different terms, where the first term measures the semantic similarity between the current feature X_{n,t} and its historical neighbors X_{n,t-\Delta t:t} = [X_{n,t-\Delta t}, \ldots, X_{n,t-1}, X_{n,t}], and the second term defines the positional prior. More specifically, X_{n,t} ∈ R^{1×D} and X_{n,t-\Delta t:t} ∈ R^{\Delta t×D}, where Δt is the temporal length of the local scope. First, X_{n,t} and X_{n,t-\Delta t:t} are projected into the embedded space by FC layers with learnable weights W_θ and W_φ. Then, the dot product operation is performed to compute the embedded similarity between X_{n,t} and each feature of X_{n,t-\Delta t:t}. Finally, softmax normalization is used over the local scope to compute the final similarity matrix of size 1×Δt. The positional prior is also crucial; next, we show how to compute it. R is defined as the relative position, which is {−Δt, …, −1, 0}, and softmax normalization is used to compute the final positional prior matrix, which indicates that the farther from the current feature, the more information attenuation exists. Different from the similarity matrix, the positional prior matrix is constant; namely, it does not change with the input features. To execute the following convolution operation, the size of Y_{n,t} is transformed from 2×D to 1×2D.
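A minimal PyTorch sketch of (1)-(2) for a single video, written by us under a few assumptions: the local window holds the Δt most recent snippets (the current one included), the first snippets are left-padded by repeating the initial feature, and the Conv1D in (2) uses a kernel size of 1 with 2D output channels, which the text does not specify. Class and variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTRModule(nn.Module):
    """Sketch of the causal temporal relation module, Eqs. (1)-(2)."""
    def __init__(self, dim=512, delta_t=9, dropout=0.7):
        super().__init__()
        self.delta_t = delta_t
        self.W_theta = nn.Linear(dim, dim, bias=False)     # embeds the current feature
        self.W_phi = nn.Linear(dim, dim, bias=False)       # embeds the historical window
        # Positional prior: softmax over relative positions, constant w.r.t. the input,
        # so that farther (more negative) positions receive smaller weights.
        rel_pos = torch.arange(-(delta_t - 1), 1, dtype=torch.float32)
        self.register_buffer("pos_prior", torch.softmax(rel_pos, dim=0))
        self.conv = nn.Conv1d(2 * dim, 2 * dim, kernel_size=1)   # Eq. (2), 2D kernels
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                  # x: (T, D) features of one video
        T, D = x.shape
        pad = x[:1].expand(self.delta_t - 1, D)            # repeat the first snippet as history
        ctx = torch.cat([pad, x], dim=0)                   # (T + delta_t - 1, D)
        windows = ctx.unfold(0, self.delta_t, 1).permute(0, 2, 1)   # (T, delta_t, D)

        q = self.W_theta(x).unsqueeze(1)                   # (T, 1, D)
        k = self.W_phi(windows)                            # (T, delta_t, D)
        sim = torch.softmax((q * k).sum(-1), dim=-1)       # semantic similarity term, (T, delta_t)
        pos = self.pos_prior.expand(T, -1)                 # positional prior term, (T, delta_t)
        attn = torch.stack([sim, pos], dim=1)              # the concatenation [.,.] of Eq. (1)
        y = torch.bmm(attn, windows).reshape(T, 2 * D)     # Y_{n,t}: 2 x D reshaped to 1 x 2D
        z = self.conv(y.t().unsqueeze(0))                  # Conv1D over the temporal axis
        return self.dropout(F.relu(z)).squeeze(0).t()      # interaction features Z_n, (T, 2D)

Z_n = CTRModule()(torch.randn(64, 512))                    # shape (64, 1024)
```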
C. Classifier and Classification Modules

The CL takes the interaction features Z_n as input and outputs snippet-level anomaly activations A_n. As seen from Figure 5, the CL comprises two parallel layers: one is a standard 1D convolution layer with a kernel size of 1, equated with an FC layer, and the other is a 1D causal convolution layer with a kernel size of K. Notably, our CL also introduces temporal relations by the causal convolutional layer.
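A sketch of the CL module under our reading of the text: a point-wise Conv1d branch and a causal Conv1d branch with kernel size K run in parallel over the interaction features, and their outputs are fused into one activation per snippet. The fusion by element-wise addition followed by a sigmoid is our assumption; the text only states that the two layers differ in kernel size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLModule(nn.Module):
    """Sketch of the classifier (CL): two parallel 1D convolutions over time."""
    def __init__(self, dim=1024, k=7):
        super().__init__()
        self.pointwise = nn.Conv1d(dim, 1, kernel_size=1)   # kernel size 1, equivalent to an FC layer
        self.causal = nn.Conv1d(dim, 1, kernel_size=k)      # causal branch with kernel size K
        self.left_pad = k - 1                                # pad the past only

    def forward(self, z):                    # z: (T, 2D) interaction features of one video
        z = z.t().unsqueeze(0)               # (1, 2D, T)
        a = self.pointwise(z) + self.causal(F.pad(z, (self.left_pad, 0)))
        return torch.sigmoid(a).view(-1)     # (T,) snippet-level anomaly activations in [0, 1]

A_n = CLModule(dim=1024, k=7)(torch.randn(64, 1024))
```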


To ensure the interclass separability of features at the video level, we use the CS module; its illustration is shown in Figure 7. Specifically, given the anomaly activations A_n of the video v_n, we use the average of the top-k activations over the temporal dimension as the anomaly prediction p_n, where k = \lfloor T_n / 16 \rfloor + 1 if v_n is abnormal, and k = 1 otherwise. The final classification loss is the binary cross-entropy between the predicted labels {p_n}_{n=1}^{N} and the ground truth {q_n}_{n=1}^{N} on the training videos, which is given by

L_{cs} = -\sum_{n=1}^{N} \left[ q_n \log(p_n) + (1 - q_n) \log(1 - p_n) \right]   (3)

Fig. 7. Illustration of the CS module.
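The CS module of (3) reduces the snippet activations of each video to one prediction by averaging the top-k activations and supervises it with the weak label. A short sketch under our assumptions about batching (a Python list of videos) and the summed reduction:

```python
import torch

def video_prediction(activations: torch.Tensor, is_abnormal: bool) -> torch.Tensor:
    """Average of the top-k snippet activations of one video (k depends on the weak label)."""
    k = activations.numel() // 16 + 1 if is_abnormal else 1
    return activations.topk(k).values.mean()

def cs_loss(activations_per_video, labels) -> torch.Tensor:
    """Binary cross-entropy between video-level predictions and weak labels, Eq. (3)."""
    preds = torch.stack([video_prediction(a, q == 1) for a, q in zip(activations_per_video, labels)])
    q = torch.tensor(labels, dtype=preds.dtype)
    eps = 1e-8                                   # numerical safety, not part of Eq. (3)
    return -(q * torch.log(preds + eps) + (1 - q) * torch.log(1 - preds + eps)).sum()

loss_cs = cs_loss([torch.rand(200), torch.rand(64)], [1, 0])   # one abnormal, one normal video
```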
D. Compactness Module

The above classification loss ensures the interclass separability of features, whereas it cannot ensure more discriminative power of features, e.g., the intraclass compactness and interclass dispersion, since there is no explicit supervision. Here, we first propose the CP module to provide supervision to explore the intraclass compactness.

Several studies on semi-supervised anomaly detection [8] and one-class classification [11] bring enlightenment to us; in these works, normal samples are forced to be collected in a compact space such that normal samples have a lower intraclass distance and are far from the abnormal space. In weakly supervised anomaly detection, it is hard to build a model that completely covers all anomalous activities due to the high variance within anomalies and the weak video-level labels. Conversely, there are reasonable grounds to believe that normal activities are supposed to be compact in the feature space. Consequently, we adopt the CP module to cluster normal features. To reduce computation, we first select the mean feature and stochastic features from the normal features as the representative. Formally, given {X_n}_{n=1}^{N}, for each normal X_n, we compute the mean feature by averaging features over the temporal dimension and randomly select a quarter of X_n over the temporal dimension as the stochastic features. Then, we stack all mean features and stochastic features as the representative, which is denoted by \bar{X}^{n} of size \bar{N}^{n} × D. With \bar{X}^{n} in hand, we use the center loss [13] to learn the cluster center (denoted by C^{n}) of normal features and penalize the distance between features and the center. Following [13], the center loss L_{cp} and the update \Delta C^{n} for the center are given by

L_{cp} = \frac{1}{\bar{N}^{n}} \sum_{i=1}^{\bar{N}^{n}} \left\| \bar{X}^{n}_{i} - C^{n} \right\|_{2}^{2}   (4)

\Delta C^{n} = \frac{\sum_{i=1}^{\bar{N}^{n}} \left( C^{n} - \bar{X}^{n}_{i} \right)}{1 + \bar{N}^{n}}   (5)
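A sketch of (4)-(5), assuming a single running center C for normal features kept as a buffer and updated by hand; the update step size (0.5 here) and the module name are our assumptions, chosen in the spirit of [13].

```python
import torch
import torch.nn as nn

class CompactnessModule(nn.Module):
    """Center loss on representatives of normal videos, Eqs. (4)-(5)."""
    def __init__(self, dim=512, center_lr=0.5):
        super().__init__()
        self.register_buffer("center", torch.zeros(dim))    # C, the center of normal features
        self.center_lr = center_lr

    @staticmethod
    def representatives(x: torch.Tensor) -> torch.Tensor:
        """Temporal mean feature plus a random quarter of the snippets of one normal video."""
        idx = torch.randperm(x.size(0))[: max(x.size(0) // 4, 1)]
        return torch.cat([x.mean(dim=0, keepdim=True), x[idx]], dim=0)

    def forward(self, normal_videos):                        # list of (T_i, D) feature tensors
        reps = torch.cat([self.representatives(x) for x in normal_videos], dim=0)
        diff = reps - self.center
        loss = diff.pow(2).sum(dim=1).mean()                 # Eq. (4)
        with torch.no_grad():                                # Eq. (5): move C toward the representatives
            delta = (-diff).sum(dim=0) / (1 + reps.size(0))
            self.center -= self.center_lr * delta
        return loss

cp = CompactnessModule(dim=512)
loss_cp = cp([torch.randn(120, 512), torch.randn(48, 512)])
```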
E. Dispersion Module

The CP module ensures the intraclass compactness of normal features, but it is incapable of increasing the interclass distance. A large interclass distance means that normal and anomalous regions are separated with a greater probability. We design the DP module to achieve interclass dispersion, where the DP module pulls ANFs close to NFs and pushes AAFs to stay away from NFs. Suppose that X_n is anomalous; the primary requirement is to obtain AAFs and ANFs from X_n. Since frame-level labels are unknown, we use the activations to identify the required portions. For AAFs, only the high activation regions are considered; conversely, low anomaly activation regions correspond to ANFs. Based on such considerations, we take the feature corresponding to the highest activation, the attention-based aggregated feature on entire regions, and the attention-based aggregated feature on high activation regions as AAFs. Likewise, the feature corresponding to the lowest activation and the reversed attention-based aggregated feature on entire regions are taken as ANFs. Features corresponding to the highest (lowest) activation are easy to identify; hence, we emphatically introduce the attention-based feature aggregation. The first step is to compute the attention (denoted by A^{a}_{n}) by normalizing the activations over the temporal dimension as follows:

A^{a}_{n,t} = \frac{\exp\!\left(\omega \left(A_{n,t} - 1\right)\right)}{\sum_{k} \exp\!\left(\omega \left(A_{n,k} - 1\right)\right)}   (6)

where ω is a predefined scaling factor used to amplify the impact of high activations. The attention-based aggregated feature on entire regions F^{a}_{n} is computed using

F^{a}_{n} = A^{a}_{n} X_{n}   (7)

For the attention-based aggregated feature on high activation regions F^{th}_{n}, we first use the hard shrinkage operation on A_n to compute the truncated activation A^{th}_{n} as follows,

A^{th}_{n,t} = \begin{cases} A_{n,t} & \text{if } A_{n,t} > \tau, \\ 0 & \text{otherwise,} \end{cases}   (8)

where τ denotes the shrinkage threshold. To facilitate backpropagation, we rewrite the hard shrinkage operation using the following continuous function,

A^{th}_{n,t} = \frac{\max\!\left(A_{n,t} - \tau,\ 0\right) \cdot A_{n,t}}{\max\!\left(A_{n,t} - \tau,\ 0\right) + \varepsilon}   (9)

Here, ε is a very small positive scalar to prevent the denominator from being zero. Then, we use the same operations as in (6) and (7) on A^{th}_{n} to compute F^{th}_{n}.

For ANFs, the formula of the reversed attention-based aggregated feature on entire regions is the same as that of F^{a}_{n}; the only difference is that we use the reversed activations, namely, 1 − A_n, as the input, where 1 is an all-ones array with the same size as A_n.

There are both normal and anomalous features in {X_n}_{n=1}^{N}. For the anomalous features, all AAFs and ANFs are computed in the above manner and then stacked and denoted by X^{aa} and X^{an}, respectively. Here, the sizes of X^{aa} and X^{an} are N^{aa} × D and N^{an} × D, respectively.

We use the cosine similarity to measure the similarity between two features, and it is expressed as follows:

d\!\left(F_{a}, F_{b}\right) = \frac{F_{a} \cdot F_{b}}{\left\| F_{a} \right\| \cdot \left\| F_{b} \right\|}   (10)

To implement the functionality discussed above, i.e., ANFs should be closer to NFs than AAFs in the feature space, we use the ranking loss [22]. Given X^{aa} and X^{an}, the loss function is represented as follows,

L_{dp} = \max\!\left( \frac{1}{N^{aa}} \sum_{i=1}^{N^{aa}} d\!\left(X^{aa}_{i}, C^{n}\right) - \frac{1}{N^{an}} \sum_{i=1}^{N^{an}} d\!\left(X^{an}_{i}, C^{n}\right) + 1,\ 0 \right)   (11)
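A sketch of the DP module, (6)-(11), simplified to a single abnormal video: AAFs come from the highest activation, the whole-video attention of (6)-(7), and the shrunk attention of (8)-(9); ANFs come from the lowest activation and the reversed attention (1 − A_n); and the margin-1 ranking loss of (11) compares their average cosine similarities to the normal center C. Function names and the per-video batching are ours.

```python
import torch
import torch.nn.functional as F

def attention_pool(act: torch.Tensor, feats: torch.Tensor, omega: float = 10.0) -> torch.Tensor:
    """Eqs. (6)-(7): softmax attention over snippets, then a weighted sum of features."""
    attn = torch.softmax(omega * (act - 1.0), dim=0)    # (T,)
    return attn @ feats                                  # (D,)

def soft_shrink(act: torch.Tensor, tau: float = 0.95, eps: float = 1e-12) -> torch.Tensor:
    """Eq. (9): differentiable version of the hard shrinkage in Eq. (8)."""
    gate = torch.clamp(act - tau, min=0.0)
    return gate * act / (gate + eps)

def dp_loss(feats: torch.Tensor, act: torch.Tensor, center: torch.Tensor) -> torch.Tensor:
    """Eqs. (10)-(11) for one abnormal video: features (T, D), activations (T,), center (D,)."""
    aafs = torch.stack([
        feats[act.argmax()],                              # highest activation snippet
        attention_pool(act, feats),                       # attention over entire regions
        attention_pool(soft_shrink(act), feats),          # attention over high-activation regions
    ])
    anfs = torch.stack([
        feats[act.argmin()],                              # lowest activation snippet
        attention_pool(1.0 - act, feats),                 # reversed attention over entire regions
    ])
    sim_aa = F.cosine_similarity(aafs, center.unsqueeze(0), dim=1).mean()   # d(X^aa, C), Eq. (10)
    sim_an = F.cosine_similarity(anfs, center.unsqueeze(0), dim=1).mean()   # d(X^an, C)
    return torch.clamp(sim_aa - sim_an + 1.0, min=0.0)    # Eq. (11): margin-1 ranking loss

loss_dp = dp_loss(torch.randn(200, 512), torch.rand(200), torch.randn(512))
```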
F. Training and Test Procedures

The total loss can be represented as follows:

L = L_{cs} + \alpha L_{cp} + \beta L_{dp}   (12)

where α and β are, respectively, the weights for the center loss and the ranking loss.

In the test phase, we use the anomaly activations as the final anomaly scores to evaluate the performance of anomaly detectors.
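Equation (12) just weights the three terms; a compact sketch of how a training step could combine the pieces above (the optimizer wiring is an assumption, and at test time the snippet activations are used directly as anomaly scores):

```python
import torch

alpha, beta = 1e-3, 0.5          # loss weights reported in Section IV-B

def total_loss(l_cs: torch.Tensor, l_cp: torch.Tensor, l_dp: torch.Tensor) -> torch.Tensor:
    """Eq. (12): MIL classification + compactness + dispersion."""
    return l_cs + alpha * l_cp + beta * l_dp

# One hypothetical training step, reusing the sketches above:
#   loss = total_loss(cs_loss(acts, labels), cp(normal_feats), dp_loss(feats, acts, cp.center))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```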
G. Analysis of Time-Space Complexity

In this subsection, we analyze the time and space complexity of our method and show the comparison between our method and two other classical methods [10], [16] in Table I. Since these three methods use FC, Conv1D, and (or) GCN layers, we mainly analyze the time-space complexity of these layers.

TABLE I
COMPARISONS OF TIME-SPACE COMPLEXITY

For each Conv1D (or FC) layer, the time complexity can be expressed by the following formula:

\text{Time} \sim O\!\left(S \times F \times C_{in} \times C_{out}\right)   (13)

where S and F are the lengths of the output feature sequence and the kernel, respectively, and C_{in} and C_{out} are the numbers of channels of the kernel and the output feature sequence, respectively. Generally, the time complexity of a model can be measured using FLOPs, which is the number of floating-point operations. With respect to the space complexity, we focus on the parameters and express the space complexity as follows,

\text{Space} \sim O\!\left(F \times C_{in} \times C_{out}\right)   (14)

Formally, the GCN layer can be presented as follows [16], [30],

Y = A X W   (15)

where A represents the adjacency matrix, X represents the input features, and W is the weight matrix of the layer in our case. Suppose there are two matrices whose sizes are M × N and N × Z; the time complexity of the matrix multiplication between them is

\text{Time} \sim O\!\left(M \times N \times Z\right)   (16)

In regard to the space complexity, we take the size of the adjacency matrix and the learned weights into account.

From Table I, we observe that 1) the method in [10], without any temporal modeling, has the lowest time and space complexity, and hence shows the worst performance (results in Tables II-IV); and 2) compared with the GCN [16] that uses the global temporal information, our method has lower space and time complexity per layer because it uses local temporal information rather than global information for online video anomaly detection.

IV. EXPERIMENTS

A. Datasets and Evaluation Metric

We conduct experiments on three datasets, i.e., the UCF-Crime¹ [10], XD-Violence² [57], and ShanghaiTech³ [68] datasets, where the UCF-Crime and XD-Violence datasets are large-scale datasets for weakly supervised anomaly detection, and the ShanghaiTech dataset is used for semi-supervised anomaly detection.

¹ https://visionlab.uncc.edu/download/summary/60-data/477-ucf-anomaly-detection-dataset
² https://roc-ng.github.io/XD-Violence/
³ https://svip-lab.github.io/dataset/campus_dataset.html

TABLE II
PERFORMANCE COMPARISONS BETWEEN OUR METHOD AND STATE-OF-THE-ART METHODS ON THE UCF-CRIME DATASET

TABLE III
PERFORMANCE COMPARISONS BETWEEN OUR METHOD AND STATE-OF-THE-ART METHODS ON THE SHANGHAITECH DATASET

TABLE IV
PERFORMANCE COMPARISONS BETWEEN OUR METHOD AND STATE-OF-THE-ART METHODS ON THE XD-VIOLENCE DATASET

1) UCF-Crime: This has a total of 128 hours with 1900 long untrimmed videos, 1610 of which are used for training, and the rest for testing. UCF-Crime covers 13 realistic anomalies, including abuse, arrest, arson, assault, accident, burglary, explosion, fighting, robbery, shooting, stealing, shoplifting, and vandalism. All videos are captured from real-world surveillance.

2) XD-Violence: This is an audio-visual dataset with 217 hours of videos. In contrast to UCF-Crime, XD-Violence is collected from both movies and real-world scenes. It has a total of 4754 videos, of which 3954 videos are divided into the training set. XD-Violence covers 6 violent classes, including abuse, car accident, explosion, fighting, riot, and shooting.

3) ShanghaiTech: This is a medium-scale dataset of 437 videos for semi-supervised anomaly detection. Most of the anomalies are performed by actors. Following Zhong et al. [16], we split the data into two subsets: the training set is made up of 238 videos, and the testing set contains 199 videos.

4) Evaluation Metric: Following previous work [10], we use the frame-level receiver operating characteristic curve (ROC) and the corresponding area under the curve (AUC@ROC) to evaluate the performance of our proposed method and the comparison methods. In addition, we also utilize the frame-level precision-recall curve (PR) and the corresponding area under the curve (average precision, AUC@PR), since AUC@ROC usually shows an optimistic result when dealing with class-imbalanced data, and AUC@PR focuses on positive samples (anomalies). We also use the AUC@ROC of anomaly videos (termed AUC@AnoROC); since the whole test set contains both normal and anomaly videos, superior performance on normal videos conceals the poor accuracy of anomaly localization within anomalous videos.
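Both frame-level metrics can be computed with scikit-learn. A small sketch under the assumption that snippet activations are broadcast to their 16 frames to obtain frame-level scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def frame_scores(snippet_scores: np.ndarray, frames_per_snippet: int = 16) -> np.ndarray:
    """Repeat each snippet activation so that every frame of the snippet shares its score."""
    return np.repeat(snippet_scores, frames_per_snippet)

def evaluate(scores: np.ndarray, labels: np.ndarray) -> dict:
    """Frame-level AUC@ROC and AUC@PR (average precision)."""
    return {"AUC@ROC": roc_auc_score(labels, scores),
            "AUC@PR": average_precision_score(labels, scores)}

rng = np.random.default_rng(0)
labels = (rng.random(3200) < 0.05).astype(int)       # 5% anomalous frames, for illustration
print(evaluate(frame_scores(rng.random(200)), labels))
```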
B. Implementation Details

1) Feature Extraction: Given a video, we first split it into non-overlapping snippets, where each snippet contains 16 video frames. Then, we use the I3D network [23], which is pretrained on the Kinetics-400 dataset, to extract per-snippet features. The snippet features of size 1024 are obtained after global average pooling of the Mixed_5c layer of the RGB I3D network. For the XD-Violence dataset, we use the VGGish network [71] to extract audio features, which is consistent with the setting in [57].

2) Video Sampling: As aforementioned, we only have video-level labels; hence, we need to process the entire video at once. Unfortunately, videos vary in length from seconds to hours, and videos that are too long may exceed GPU memory limitations. Therefore, we need a video sampling mechanism. We process the entire video if its length is less than the predefined sampling length; otherwise, we evenly extract clips to meet the GPU bandwidth.

3) Training Details: We train our network on a single GTX 1080Ti GPU using PyTorch. Our network is trained with a mini-batch size of 64 using the Adam [12] optimizer. The learning rate is set to 5 × 10^{-5}, which is divided by 5 at the 3rd epoch. Following [24], the dropout rate is set to 0.7. For the hyper-parameters, Δt and K are set to 9 and 7, respectively, which are suitable according to our observations. ω in (6) and τ in (8) are set to 10 and 0.95, respectively. The predefined sampling length is set to 200. α in (12) is set to 10^{-3}, since the center loss penalty has a higher magnitude compared with the other loss terms, and β in (12) is set to 0.5. For the ShanghaiTech dataset, due to the short length of each video, Δt and K are both set to 3.
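A sketch of the snippet splitting in 1) and one plausible reading of the sampling rule in 2): if a feature sequence is longer than the predefined sampling length (200 in 3)), evenly spaced snippets are kept. Both helper names and the even-spacing choice are our assumptions.

```python
import numpy as np

def snippet_boundaries(num_frames: int, frames_per_snippet: int = 16) -> list:
    """Start/end frame indices of non-overlapping 16-frame snippets (an incomplete tail is dropped)."""
    starts = range(0, num_frames - frames_per_snippet + 1, frames_per_snippet)
    return [(s, s + frames_per_snippet) for s in starts]

def sample_snippets(features: np.ndarray, max_len: int = 200) -> np.ndarray:
    """Keep the whole sequence if short enough, otherwise take evenly spaced snippets."""
    T = features.shape[0]
    if T <= max_len:
        return features
    idx = np.linspace(0, T - 1, num=max_len).round().astype(int)
    return features[idx]

print(len(snippet_boundaries(4000)))                       # 250 snippets for a 4000-frame video
print(sample_snippets(np.random.randn(250, 1024)).shape)   # capped at (200, 1024)
```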


Fig. 8. ROC and PR curve comparisons between our method and other state-of-the art methods. (a) ROC curve on the UCF-Crime dataset; (b) PR curve on
the UCF-Crime dataset; (c) PR curve on the XD-Violence dataset.

C. Comparison With State-of-the-Art Methods TABLE V


P ERFORMANCE C OMPARISONS ON THE UCF-C RIME D ATASET FOR
We compare our method with current methods on three I NVESTIGATING THE E FFECT OF T EMPORAL M ODELING
datasets, and show results in Tables II-IV. These comparison
methods include semi-supervised anomaly detection baselines
(Semi-Anomaly), weakly supervised anomaly detection base-
lines (Weak-Anomaly), and even weakly supervised action
detection baseline (Weak-Action). For the UCF-Crime dataset,
comparison results show that our method outperforms all
comparison methods. Due to the proximity between temporal
action detection and anomaly detection, we also use a well-
known WSAL method, W-TALC [24], for anomaly detec-
tion and evaluate its performance. Our method achieves an
absolute gain of 4.3% and 5.3%, in terms of AUC@ROC
and AUC@PR, respectively, over W-TALC. We also show the
performance of NL-Net [15], [18], whose initial snippet fea-
tures contain long-range spatial-temporal dependencies. Our
method achieves clear improvements against the NL-Net by
6.0% in terms of AUC@ROC, which demonstrates ad hoc using MIL-based classification loss Lcs . In addition, the first
temporal modeling is essential to extracted snippet features. layer of all baselines is the same as that of our method,
We also show the ROC and PR curves in Figure 8. As seen, which is an FC layer. Here, ReC denotes the 1D causal
the curves of our method completely enclose the others, which convolution followed by ReLU and dropout, where the number
indicates our method is consistently superior to its competitors of kernels is set to 2D. SiC denotes the 1D causal convolution
at various thresholds. Our method also outperforms other followed by sigmoid, the number of kernels of which is set
existing methods by a large margin on the ShanghaiTech to 1, and ‘+’ denotes the cascade. For example, ReC+SiC
dataset. In addition, on the XD-Violence, our method is means this baseline consists of two 1D causal convolutional
better than other comparison methods except the approach layers in which the first layer is followed by ReLU and
in [57]. Because this approach uses the global long-range dropout, and the second layer is followed by the sigmoid
dependencies to model temporal relations, it cannot address function. The comparison results in Table V convincingly
online anomaly detection. However, when this approach [57] demonstrate the improvement contributed by CTR and CL
uses only local-range temporal relation, our proposed method modules. FC+ReC+SiC (the 1st row), a baseline without mod-
achieves an absolute gain of 2.2% over it. eling any temporal relations, shows a drop of 1.8% and 4.6% in
terms of AUC@ROC and AUC@PR, respectively, compared
with our method, and such a performance is still competitive.
D. Ablation Studies Even with the addition of temporal relation modeling (the
In this subsection, we conduct ablation studies to investigate 2nd , 3rd and 4th rows), the performance is still worse than
the contributions of each component of our method. that of our method, which also demonstrates that our CTR
1) Importance of Temporal Relation Modeling: Both CTR module achieves clear advantages over the standard causal
and CL modules capture temporal relations between the convolution layer and RNN that even have the same temporal
current and historical snippets. To investigate the relative length as CTR module. In addition, our CL outperforms the
contributions of the above two modules, we compare our baseline that uses a single layer, i.e., FC+CTR+SiC. Another
method with several baselines on the UCF-Crime dataset observation is that the two relationships in the CTR module,
and show the results in Table V. For a fair and simple namely, the semantic similarity and positional prior, both count
comparison, all baselines and our method are trained only as causal temporal relations. Similarly, the comparison results


TABLE VI TABLE IX
E FFECTS OF F EATURE D ISCRIMINATION AND T EMPORAL M ODELING ON P ERFORMANCE G AINS OF L EVERAGING F EATURE D ISCRIMINATION ON
THE XD-V IOLENCE D ATASET U SING RGB I3D F EATURES THE UCF-C RIME D ATASET

TABLE X
P ERFORMANCE C OMPARISONS W ITH D IFFERENT W EIGHT α ON THE
UCF-C RIME D ATASET
TABLE VII
E FFECTS OF F EATURE D ISCRIMINATION AND T EMPORAL M ODELING ON
THE S HANGHAITECH D ATASET U SING RGB I3D F EATURES

using t-SNE [31] in Figure 9. It is obvious to observe that with


the help of CP and DP modules, NFs are more compact and
TABLE VIII ANFs are closer to NFs than AAFs, which indicates learned
P ERFORMANCE C OMPARISONS ON THE UCF-C RIME D ATASET FOR features have the discriminative power in our method. From
I NVESTIGATING THE E FFECT OF F EATURE D ISCRIMINATION the evaluation results (the results in the last three rows of
Tables VI and VII) on the XD-Violence and ShanghaiTech
datasets, we can draw consistent conclusions: using only
feature discrimination achieves poor performance and the joint
use of feature discrimination with classification is beneficial
to video anomaly detection.

E. Evaluation on Parameters
(the results in the first two rows of Tables VI and VII) on the 1) Temporal Length t and K: On top of CTR and CL mod-
XD-Violence and ShanghaiTech datasets also clearly verify ules, we also conduct experiments to further investigate the
the importance of explicitly modeling temporal context for temporal length of CTR and CL modules, and show the change
online anomaly detection. Moreover, performance improve- of performance with different temporal lengths in Figure 10.
ments on the XD-Violence dataset are more remarkable since We observe that setting t and K to 9 and 7, respectively, is an
the XD-Violence comprises more diverse and complicated optimum choice within a certain range, where small temporal
activities compared with the ShanghaiTech dataset. lengths would not provide enough temporal context, whereas
2) Importance of Feature Discrimination: We conduct too large temporal lengths increase the calculation burden and
experiments and show comparison results to evaluate the are likely to introduce extra noises.
effects of feature discrimination. We use two baselines, 2) Weights of Loss α and β: The weights α and β balance
of which one lacks CP and DP modules with respect to (w.r.t) the classification and discrimination terms. We evaluate the
our method, and another one lacks CS modules w.r.t our performance of our method with different weights, and the
method. From Table VIII, we can see that CP and DP modules results are summarized in Tables X and XI. We observe that
gain 2.0% and 3.1% improvement in terms of AUC@ROC 1) our method yields the best performance when α is set to
and AUC@PR, respectively, and they also significantly reduce 10−3 and β is set to 0.5 or 1; 2) Generally, as α and β
the false alarm rate (we use the standard threshold of 50%) increase, the false alarm rate declines gradually; and 3) CP and
from 2.11% to 0.73%. The performance of the baseline that DP modules can release more powerful abilities by working
only uses CP and DP modules is still far from being satis- together.
factory (57.91% in AUC@ROC and 10.52% in AUC@PR), 3) Sampling Length : As shown in Table XII, with the
and the main gap between this baseline and our method increasing dimension, the false alarm rate improves rapidly
consists of the lack of MIL-based classification loss that and the run time per training epoch also increases. AUC@ROC
cannot learn anomaly activations. In addition, we also show and AUC@PR reach the peak at 400 and 200, respectively.
the performance gain of leveraging feature discrimination on In this article, we choose =200 which is a trade-off between
several baselines. As shown in Table IX, leveraging feature accuracy and computation.
discrimination on other baselines also obtains clear gains in 4) Different Modalities in the XD-Violence: Based on the
terms of AUC@ROC, AUC@PR and the false alarm rate. audio-visual attributes of XD-Violence dataset, we take the
Moreover, we show distributions of discriminative features audio feature, the visual feature and the concatenation of them


Fig. 9. Distributions of discriminative features using t-SNE on the UCF-Crime dataset.

Fig. 10. Changes of AUC@ROC and AUC@PR w.r.t t and K on the UCF-Crime dataset. (a) Performance versus t when K=7; (b) Performance versus
K when t =9.

TABLE XI
P ERFORMANCE C OMPARISONS W ITH D IFFERENT W EIGHT β ON
THE UCF-C RIME D ATASET

TABLE XII
P ERFORMANCE C OMPARISONS ON THE UCF-C RIME D ATASET W HEN
VARYING THE S AMPLING L ENGTH D URING T RAINING
Fig. 11. Effects of different modalities on the XD-Violence dataset.

ShanghaiTech datasets to demonstrate the effect of causal


temporal relations and feature discrimination using qualitative
descriptions. It is not hard to observe that equipped with
the power of temporal relations and feature discrimination,
our method achieves superior results. Moreover, the baseline
as input and study the effect of different modalities. From
that lacks feature discrimination also achieves competitive
Figure 11, we draw the following conclusions: 1) visual input
results compared with our method, whereas our method has
achieves better performance compared with audio input; and
advantages against this baseline in producing high anomaly
2) audio-visual input outperforms the single-modality input.
scores for anomalous regions. Another baseline, without the
This also accords with the perception of neurobiology, namely,
help of temporal relations and feature discrimination, cannot
the primary form of human information acquisition is vision,
detect some anomalous regions and its detection curves are not
and the perceptual benefits of integrating visual and auditory
smooth. In regard to the less successful examples in Figure 13,
information are extensive [72], [73].
all three methods produce high false alarm rates, and the
false alarm rate of the baseline “w/o FD&TR” is relatively
F. Qualitative Results lower. One potential reason is temporal relation modeling
Qualitative results are shown in Figures 12-14. Here, we uses aggregation to enlarge anomaly scores of some regions
compare our method with two baselines on UCF-Crime and that should be normal. Another noteworthy observation is


Fig. 12. Qualitative results of our method and another two baselines on the ShanghaiTech dataset. (a) Bicycle, (b) Baby carriage, (c) Romp. In addition,
“GT” is the ground truth at the frame-level. “w/o FD” denotes a baseline that lacks feature discriminations w.r.t our method, and “w/o FD&TR” denotes a
baseline that lacks feature discrimination and causal temporal relations w.r.t our method, i.e., the first baseline in Table V. Best viewed in high resolution.

Fig. 13. Qualitative results of our method and another two baselines on the UCF-Crime dataset. (a) Arrest, (b) Explosion, (c) Fighting, (d) Shooting,
(e) Robbery, (f) Vandalism, (g)-(i) Normal activities, and (j)-(l) less successful examples.

that our method generates a delayed response for explosion boom. For the XD-Violence dataset, we show the qualitative
(Figure 13(b)) since it relies on the visual billowing smoke results taking different modality features as input. Videos in
seen in the mid-to-late explosions rather than the audible the top row are collected from movies, and videos in the


Fig. 14. Qualitative results of our method using different modal inputs on the XD-Violence dataset. (a) Fighting, (b) Car accident, (c) Shooting and explosion,
(d) Riot, (e) Riot and abuse, (f) Explosion.

bottom row are collected from in-the-wild scenarios. Regard- [6] Y. Zhu and S. Newsam, “Motion-aware feature for improved video
less of whether movies or real-world videos are considered, anomaly detection,” in Proc. Brit. Mach. Vis. Conf. (BMVC), Sep. 2019,
p. 270.
audio-visual input is superior to single-modal input since [7] R. Zeng, C. Gan, P. Chen, W. Huang, Q. Wu, and M. Tan, “Breaking
audio-visual information is comprehensive. winner-takes-all: Iterative-winners-out networks for weakly supervised
temporal action localization,” IEEE Trans. Image Process., vol. 28,
no. 12, pp. 5797–5808, Dec. 2019.
V. CONCLUSION

In this article, we identified a problem posed by the lack of temporal relation modeling and discriminative features in previous weakly supervised methods for anomaly detection. To solve this problem, we proposed a method that explores causal temporal relations within a local range as well as the discriminative power of features. By directly comparing the evaluation results of our method with those of other baselines, we demonstrated that causal temporal relations and feature discrimination are both requisite for achieving performance improvements. Extensive experiments on three benchmarks showed that our method gains a clear advantage over current state-of-the-art methods. In the future, temporal relations between current and future information remain to be explored. In addition, audio-visual anomaly detection merits further study, since audio cues provide complementary information that could improve the performance of anomaly detectors.
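For concreteness, the following minimal sketch illustrates the causal constraint underlying the temporal modeling discussed above: a causal 1-D convolution over snippet features whose output at each time step depends only on current and past snippets. The kernel size, dilation, and channel widths are illustrative assumptions, not the settings of our full model.

# A minimal, illustrative sketch of a causal 1-D temporal convolution over
# snippet features: the output at time t depends only on inputs at times <= t.
# Kernel size, dilation, and channel widths are assumptions for illustration,
# not the settings of our full model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, dilation=1):
        super().__init__()
        # Pad only on the left so that no future snippets leak into the output.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

    def forward(self, x):
        # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)

# Example: 2 videos, 1024-D snippet features, 32 snippets each.
out = CausalConv1d(1024, 512)(torch.randn(2, 1024, 32))
print(out.shape)  # torch.Size([2, 512, 32])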

Peng Wu received the B.Eng. degree from Xidian University, Xi'an, China, in 2017, where he is currently pursuing the Ph.D. degree with the Guangzhou Institute of Technology. His current research interests include video anomaly detection, video retrieval, temporal action detection, weakly supervised learning, and deep learning.

Jing Liu (Senior Member, IEEE) received the B.S. degree in computer science and technology and the Ph.D. degree in circuits and systems from Xidian University in 2000 and 2004, respectively. In 2005, she joined Xidian University as a Lecturer, and was promoted to a Full Professor in 2009. From April 2007 to April 2008, she was a Postdoctoral Research Fellow with The University of Queensland, Australia. From July 2009 to July 2011, she was a Research Associate with The University of New South Wales, Australian Defence Force Academy. She is currently a Full Professor with the Guangzhou Institute of Technology, Xidian University. Her research interests include evolutionary computation, complex networks, fuzzy cognitive maps, and computer vision.

Dr. Liu was the Chair of Emerging Technologies Technical Committee (ETTC) of the IEEE Computational Intelligence Society from 2017 to 2018 and an Associate Editor of the IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION from 2015 to 2020.
