Learning Causal Temporal Relation and Feature Discrimination for Anomaly Detection
Peng Wu and Jing Liu
Abstract— Weakly supervised anomaly detection is a challenging task since frame-level labels are not given in the training phase. Previous studies generally employ neural networks to learn features and produce frame-level predictions and then use multiple instance learning (MIL)-based classification loss to ensure the interclass separability of the learned features; all operations simply take the current time information as input and ignore historical observations. According to our investigations, these solutions are universal but ignore two essential factors, i.e., the temporal cue and feature discrimination. The former introduces temporal context to enhance the current-time feature, and the latter enforces the samples of different categories to be more separable in the feature space. In this article, we propose a method that consists of four modules to leverage the effect of these two ignored factors. The causal temporal relation (CTR) module captures local-range temporal dependencies among features to enhance features. The classifier (CL) projects enhanced features to the category space using the causal convolution and further expands the temporal modeling range. Two additional modules, namely, the compactness (CP) and dispersion (DP) modules, are designed to learn the discriminative power of features, where the compactness module ensures the intraclass compactness of normal features, and the dispersion module enhances the interclass dispersion. Extensive experiments on three public benchmarks demonstrate the significance of causal temporal relations and feature discrimination for anomaly detection and the superiority of our proposed method.

Index Terms— Anomaly detection, discriminative features, temporal modeling, weak supervision.

I. INTRODUCTION

There is an increasing demand for intelligent machines to automatically detect anomalies within a video, where anomaly detection technology is the essential link. However, anomaly detection in unconstrained videos is also a challenging task. Here, we list several major challenges, including but not limited to the rare occurrence of anomalies, the diversity of anomalies, large intraclass variations, and time-consuming temporal annotation.

A. Development of Video Anomaly Detection

Early studies mainly focus on semi-supervised anomaly detection, namely, only normal videos are available in the training set. These methods [1]–[3], [5], [48]–[52], [63], [64] generally construct a normal pattern to encode normal samples. Under this pattern, test samples that are not consistent with the normal pattern are regarded as anomalies. Such semi-supervised anomaly detection relieves, to a certain extent, the pressure of collecting anomalous samples. However, such solutions suffer from several limitations. First, they are more likely to produce high false alarm rates for unseen normal events. Second, several anomalies are not realistic; for instance, in some datasets, e.g., the UCSD Ped [3] and CUHK Avenue [4] datasets, bicycling, running, and throwing are regarded as anomalies. Third, most video datasets for semi-supervised anomaly detection are small scale, and their total duration is a few minutes. These limitations are major obstacles to real-world anomaly detection.

To further move toward real-world anomaly detection, recent studies have turned to weakly supervised anomaly detection, in which large-scale training sets carry only video-level labels.
We improve the discriminative power of features by compacting (attracting) normal samples and repulsing anomalies. The related validation experiments are shown in Section IV.

B. Anomaly Detection in Other Fields

Anomaly detection is a significant problem that has been well studied within diverse research fields and application domains, such as video anomaly detection [66], image anomaly detection [65], hyperspectral anomaly detection [58], [60], network anomaly detection [59], [61], geochemical anomaly detection [62], and defect detection [67]. There are several reviews of anomaly detection [76]–[78]. Among these methods, clustering algorithms [79], [80], e.g., K-means, density-based spatial clustering of applications with noise (DBSCAN), and density peak clustering, are widely used to detect anomalies. For instance, Song et al. [60] used DBSCAN to construct a dictionary for hyperspectral anomaly detection. Tu et al. [58] adopted the density peak clustering algorithm to calculate the density of each pixel and detected anomalies using the obtained density map. Unlike these unsupervised methods, our method does not use such distance- or density-based clustering algorithms to directly classify each sample; the gist of our method is that using a distance-based loss on neural networks makes all normal samples compact and isolates abnormal samples, where the compacting operation is similar to several prior works [8], [17], [74], and the isolation operation is similar to the recent work [24].

C. Video Action Recognition

Action recognition is a fundamental problem in video analytics. Most video-related tasks, such as video anomaly detection, video action detection, and video captioning, use off-the-shelf action recognition models as feature extractors for further analysis. Before the age of deep learning, most action recognition methods [36]–[38] used hand-crafted features. Recently, with the renaissance of convolutional neural networks, deep learning-based methods have been proposed for action recognition, e.g., 3D ConvNet (C3D) [9], Inflated 3D (I3D) [23], Expand 3D (X3D) [32], the two-stream network [33], the temporal segment network (TSN) [34], and the efficient convolutional network (ECO) [35].

D. Temporal Action Detection

Temporal action detection (or localization) is different from action recognition, aiming at identifying the start and end time stamps and categories of actions. Anomaly detection can be seen as a coarse temporal action detection. We focus on weakly supervised action localization (WSAL) and online action detection (OAD), which are more related to weakly supervised anomaly detection (WSAD).

WSAL and WSAD have many common traits, e.g., only video-level labels for training and the goal of detecting the start and end time stamps. The major difference is that WSAL is an offline task that takes the full video as input. Several innovative methods have been presented in the last few years, such as UntrimmedNet [39], W-TALC [24], 3C-Net [17], STPN [40], AutoLoc [41], CleanNet [42], BaSNet [43], and Winners-out [7]. Among them, W-TALC and 3C-Net are similar to ours in that they force learned features from the same class to be similar and features from different classes to be dissimilar; our method is more comprehensive than W-TALC and 3C-Net.

OAD and online anomaly detection both aim to detect ongoing actions or events from streaming videos based only on current and historical observations. Previous methods [44]–[46] have mainly adopted RNNs, e.g., the long short-term memory (LSTM) and the gated recurrent unit (GRU), to model the current action; however, these methods do not explicitly consider the relevance of historical observations to the current action. Recently, a modified GRU [47] was developed to model the relations between an ongoing action and past actions, aiming to address this shortcoming.

III. METHOD

In this section, we first present our overall architecture and then describe each module in our proposed method in detail.

A. Overall Architecture

The overall architecture of our method is shown in Figure 5. As we can see, a video first passes through the pretrained feature extractor to extract initial features. To learn the discriminative power, these features are fed into a fully connected (FC) layer followed by ReLU and dropout; then, the CP and DP modules come into play. The resulting features are called discriminative features hereafter. These discriminative features are fed into the ad hoc CTR module and are transformed into interaction feature representations, since each feature comprises current information as well as interactive information between current and previous features. After that, the CL module, which includes two parallel convolutional layers, projects the aggregated features into the 1-dimension (1D) category space to obtain snippet-level (or frame-level) anomaly activations; the main difference between these two convolutional layers is the kernel size. The classification (CS) module takes the activations as input and uses an MIL-based classification loss to learn high activations for anomalous regions.

Formally, suppose that we are given N training videos {v_n}_{n=1}^{N} and the corresponding weak labels {q_n}_{n=1}^{N}, where q_n ∈ {0, 1}; q_n = 1 indicates that v_n covers abnormal events, but the start and end times are unknown. For each v_n, before v_n is fed into the feature extractor, it is divided into non-overlapping snippets, where we use T_n to denote the number of snippets (the length of v_n); the length of untrimmed videos varies greatly. Let X_n ∈ R^{T×D} denote the discriminative features, where D denotes the feature dimension (in this article, D = 512). The interaction features are denoted by Z_n ∈ R^{T×2D}. The final anomaly activations are denoted by A_n ∈ R^{T}, where each element of A_n is in the range [0, 1], representing the anomaly score of the corresponding snippet.

B. Causal Temporal Relation Module

The CTR module aims to aggregate useful information from historical and current features by temporal attention.
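The detailed formulation of the CTR module is not reproduced in this excerpt. As a rough illustration only, the sketch below implements causal, local-range temporal attention under two assumptions that are ours, not the paper's: the relevance weights come from a scaled dot product with the current snippet, and the interaction feature is the concatenation of the current feature with the attended context (which is at least consistent with Z_n ∈ R^{T×2D}).

```python
import numpy as np

def ctr_sketch(x, t=9):
    """Causal local-range temporal attention (illustrative sketch, not the authors' code).

    x : (T, D) array of discriminative snippet features.
    t : local temporal window length (the experiments report t = 9 works well).

    Returns (T, 2D) "interaction" features: each row concatenates the current
    feature with an attention-weighted summary of its causal window
    [max(0, i - t + 1), i], i.e., only current and past snippets are used.
    """
    T, D = x.shape
    z = np.zeros((T, 2 * D), dtype=x.dtype)
    for i in range(T):
        lo = max(0, i - t + 1)
        window = x[lo:i + 1]                    # current + previous snippets only (causal)
        scores = window @ x[i] / np.sqrt(D)     # dot-product relevance to the current snippet
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()
        context = attn @ window                 # attention-weighted aggregation
        z[i] = np.concatenate([x[i], context])  # current + interactive information
    return z

# toy usage: 32 snippets with 512-D features -> (32, 1024) interaction features
feats = np.random.randn(32, 512).astype(np.float32)
print(ctr_sketch(feats).shape)
```

Because each snippet attends only to itself and its predecessors, such a module remains usable in the online setting discussed in Section II-D.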
Fig. 5. Overview of our proposed method. Our method mainly comprises six components: the off-the-shelf feature extractor (I3D) is used to extract snippet features, and the CP and DP modules enforce the features to be discriminative. After that, the CTR module further processes the features and captures causal local-range temporal dependencies, and the CL module projects the features into the category space and generates the anomaly prediction at the snippet level. Finally, the CS module uses MIL to obtain the video-level prediction and compute the classification loss.
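To make the data flow of Fig. 5 concrete, the following shape-level walk-through traces one video through the pipeline; the raw I3D feature dimension and all module internals are placeholders chosen for illustration, not the authors' implementation.

```python
import numpy as np

# Shape-level walk-through of Fig. 5 (illustrative only).
T, D = 64, 512                                          # snippets per video, feature dimension
snippet_feats = np.random.randn(T, 2048)                # raw I3D snippet features (dimension assumed)
W_fc = np.random.randn(2048, D)
discriminative = np.maximum(snippet_feats @ W_fc, 0)    # FC + ReLU -> X_n in R^{T x 512}
interaction = np.random.randn(T, 2 * D)                 # CTR output Z_n in R^{T x 2D} (placeholder)
activations = 1 / (1 + np.exp(-np.random.randn(T)))     # CL output A_n in [0, 1]^T (placeholder)
k = T // 16 + 1                                         # top-k used by the CS module for an abnormal video
video_pred = np.sort(activations)[-k:].mean()           # MIL-style video-level prediction
print(discriminative.shape, interaction.shape, activations.shape, round(video_pred, 3))
```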
Fig. 7. Illustration of the CS module.

To ensure the interclass separability of features at the video level, we use the CS module, whose illustration is shown in Figure 7. Specifically, given the anomaly activations A_n of the video v_n, we use the average of the top-k activations over the temporal dimension as the anomaly prediction p_n, where k = ⌊T_n/16⌋ + 1 if v_n is abnormal and k = 1 otherwise. The final classification loss is the binary cross-entropy between the predicted labels {p_n}_{n=1}^{N} and the ground truth {q_n}_{n=1}^{N} on the training videos, which is given by

L_{cs} = -\sum_{n=1}^{N} \left[ q_n \log(p_n) + (1 - q_n) \log(1 - p_n) \right] \qquad (3)
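A minimal NumPy sketch of the top-k pooling and the loss in (3) follows (illustrative only, not the authors' code; the ⌊T/16⌋ + 1 rule mirrors the text above):

```python
import numpy as np

def video_prediction(activations, is_abnormal):
    """Top-k mean over the temporal dimension (CS module, illustrative).

    activations : (T,) snippet-level anomaly activations in [0, 1].
    k = floor(T / 16) + 1 for an abnormal video, otherwise k = 1.
    """
    T = len(activations)
    k = T // 16 + 1 if is_abnormal else 1
    return np.sort(activations)[-k:].mean()

def classification_loss(preds, labels, eps=1e-8):
    """Binary cross-entropy between video-level predictions and weak labels, as in (3)."""
    preds = np.clip(preds, eps, 1 - eps)
    return -np.sum(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))

# toy batch: one abnormal and one normal video
a1, a2 = np.random.rand(48), np.random.rand(30)
p = np.array([video_prediction(a1, True), video_prediction(a2, False)])
q = np.array([1.0, 0.0])
print(classification_loss(p, q))
```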
D. Compactness Module

The above classification loss ensures the interclass separability of features, whereas it cannot ensure further discriminative power, e.g., intraclass compactness and interclass dispersion, since there is no explicit supervision. Here, we first propose the CP module to provide supervision that encourages intraclass compactness.

Several studies on semi-supervised anomaly detection [8] and one-class classification [11] provide inspiration: in these works, normal samples are forced to be collected in a compact space such that normal samples have a low intraclass distance and lie far from the abnormal space. In weakly supervised anomaly detection, it is hard to build a model that completely covers all anomalous activities due to the high variance within anomalies and the weak video-level labels. Conversely, there are reasonable grounds to believe that normal activities are supposed to be compact in the feature space. Consequently, we adopt the CP module to cluster normal features. To reduce computation, we first select the mean feature and stochastic features from the normal features as the representative. Formally, given {X_n}_{n=1}^{N}, for each normal X_n, we compute the mean feature by averaging features over the temporal dimension and randomly select a quarter of X_n over the temporal dimension as the stochastic features. Then, we stack all mean features and stochastic features as the representative, which is denoted by X̄ and has size N̄ × D. With X̄ in hand, we use the center loss [13] to learn the cluster center (denoted by C_n) of normal features and penalize the distance between each representative feature and C_n.
due to high variance within anomalies and weak video-level An,t if An,t > τ,
labels. Conversely, there are reasonable grounds to believe that An,t =
th
(8)
0 otherwise,
normal activities are supposed to be compact in the feature
space. Consequently, we adopt the CP module to cluster where τ denotes the shrinkage threshold. To facilitate back-
normal features. To reduce computation, we first select the propagation, we rewrite the hard shrinkage operation using the
mean feature and stochastic features from normal features as following continuous function,
the representative. Formally, given {X n }n=1
N
, for each normal
max An,t − τ, 0 · An,t
X n , we compute the mean feature by averaging features over An,t =
th (9)
max An,t − τ, 0 + ε
the temporal dimension and random select a quarter of X n
over the temporal dimension as the stochastic features. Then, Here, ε is a very small positive scalar to prevent the denom-
we stack all mean features and stochastic features as the inator from being zero. Then, we use the same operations as
representative, which is denoted by X n of size N n × D. With (6) and (7) on Atnh to compute F tnh . For ANFs, the formula of
X n in hands, we use center loss [13] to learn the cluster center reversed attention-based aggregated features on entire regions
(denoted by C n ) of normal features and penalize the distance is the same as that of F an ; the only difference is that we use
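A compact NumPy sketch of (6), (7), and (9) is given below; the values of ω, τ, and ε are placeholders, since the paper's settings are not given in this excerpt.

```python
import numpy as np

def attention_weights(a, omega=5.0):
    """Eq. (6): softmax over time of omega * (A_{n,t} - 1); omega amplifies high activations."""
    e = np.exp(omega * (a - 1.0))
    return e / e.sum()

def soft_shrink(a, tau=0.5, eps=1e-6):
    """Eq. (9): differentiable surrogate of the hard shrinkage in Eq. (8)."""
    m = np.maximum(a - tau, 0.0)
    return m * a / (m + eps)

def aggregate(a, x, omega=5.0):
    """Eq. (7): attention-weighted aggregation of snippet features X_n using weights from Eq. (6)."""
    return attention_weights(a, omega) @ x

# toy anomalous video: activations a (T,) and features x (T, D)
T, D = 20, 512
a = np.random.rand(T)
x = np.random.randn(T, D)
f_entire = aggregate(a, x)                  # AAF aggregated on entire regions
f_high   = aggregate(soft_shrink(a), x)     # AAF aggregated on high-activation regions
f_rev    = aggregate(1.0 - a, x)            # reversed attention -> ANF on entire regions
print(f_entire.shape, f_high.shape, f_rev.shape)
```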
For ANFs, the formula of the reversed attention-based aggregated feature on entire regions is the same as that of F^a_n; the only difference is that we use the reversed activations, namely, 1 − A_n, as the input, where 1 is an all-ones array with the same size as A_n.

There are both normal and anomalous features in {X_n}_{n=1}^{N}. For the anomalous features, all AAFs and ANFs are computed in the above manner and then stacked, denoted by X^{aa} and X^{an}, respectively. Here, the sizes of X^{aa} and X^{an} are N^{aa} × D and N^{an} × D, respectively.

We use the cosine similarity to measure the similarity between two features, which is expressed as follows:

d\left(F_a, F_b\right) = \frac{F_a \cdot F_b}{\lVert F_a \rVert \, \lVert F_b \rVert} \qquad (10)

To implement the functionality discussed above, i.e., ANFs should be closer to NFs than AAFs in the feature space, we use the ranking loss [22]. Given X^{aa} and X^{an}, the loss function is represented as follows,

L_{dp} = \max\left( \frac{1}{N^{aa}} \sum_{i=1}^{N^{aa}} d\left(X^{aa}_i, C_n\right) - \frac{1}{N^{an}} \sum_{i=1}^{N^{an}} d\left(X^{an}_i, C_n\right) + 1, \; 0 \right) \qquad (11)
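A small NumPy sketch of (10) and (11) follows (illustrative only); C_n stands for the learned normal center from the CP module, and the margin of 1 follows the equation above.

```python
import numpy as np

def cos_sim(a, b):
    """Eq. (10): cosine similarity between two feature vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def dispersion_loss(x_aa, x_an, center, margin=1.0):
    """Eq. (11): ranking loss that pushes AAFs away from the normal center C_n
    while pulling ANFs toward it."""
    s_aa = np.mean([cos_sim(f, center) for f in x_aa])   # AAFs should have LOW similarity to C_n
    s_an = np.mean([cos_sim(f, center) for f in x_an])   # ANFs should have HIGH similarity to C_n
    return max(s_aa - s_an + margin, 0.0)

# toy usage
rng = np.random.default_rng(1)
center = rng.standard_normal(512)
x_aa = rng.standard_normal((6, 512))    # stacked AAFs
x_an = rng.standard_normal((6, 512))    # stacked ANFs
print(dispersion_loss(x_aa, x_an, center))
```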
F. Training and Test Procedures

The total loss can be represented as follows:

L = L_{cs} + \alpha L_{cp} + \beta L_{dp} \qquad (12)

where α and β are, respectively, the weights for the center loss and the ranking loss.

In the test phase, we use the anomaly activations as the final anomaly scores to evaluate the performance of anomaly detectors.

G. Analysis of Time–Space Complexity

In this subsection, we analyze the time and space complexity of our method and show the comparison between our method and two other classical methods [10], [16] in Table I. Since these three methods use FC, Conv1D, and (or) GCN layers, we mainly analyze the time–space complexity of these layers.

For each Conv1D (or FC) layer, the time complexity can be expressed as follows,

\mathrm{Time} \sim O\left(S \times F \times C_{in} \times C_{out}\right) \qquad (13)

where S and F are the lengths of the output feature sequence and the kernel, respectively, and C_{in} and C_{out} are the numbers of channels of the kernel and the output feature sequence, respectively. Generally, the time complexity of a model can be measured using FLOPs, i.e., the number of floating-point operations. With respect to the space complexity, we focus on the parameters and express the space complexity as follows,

\mathrm{Space} \sim O\left(F \times C_{in} \times C_{out}\right) \qquad (14)

Formally, the GCN layer can be presented as follows [16], [30],

Y = A X W \qquad (15)

where A represents the adjacency matrix, X represents the input features, and W is the weight matrix of the layer in our case. Suppose there are two matrices whose sizes are M × N and N × Z; the time complexity of the matrix multiplication between them is

\mathrm{Time} \sim O\left(M \times N \times Z\right) \qquad (16)

In regard to the space complexity, we take the sizes of the adjacency matrix and the learned weights into account.

TABLE I. Comparisons of Time–Space Complexity

From Table I, we observe that 1) the method in [10], without any temporal modeling, has the lowest time and space complexity and hence shows the worst performance (results in Tables II–IV); and 2) compared with the GCN-based method [16] that uses global temporal information, our method has lower space and time complexity per layer because it uses local rather than global temporal information, which suits online video anomaly detection.
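As a quick sanity check of (13), (14), and (16), the helper below plugs in illustrative layer sizes; the numbers are arbitrary and are not those reported in Table I.

```python
def conv1d_flops(seq_len, kernel, c_in, c_out):
    """Eq. (13): time complexity of a Conv1D/FC layer, O(S x F x C_in x C_out)."""
    return seq_len * kernel * c_in * c_out

def conv1d_params(kernel, c_in, c_out):
    """Eq. (14): space complexity (parameter count) of a Conv1D/FC layer, O(F x C_in x C_out)."""
    return kernel * c_in * c_out

def matmul_flops(m, n, z):
    """Eq. (16): time complexity of multiplying an (M x N) matrix by an (N x Z) matrix."""
    return m * n * z

# illustrative numbers only
print(conv1d_flops(seq_len=200, kernel=7, c_in=1024, c_out=1))
print(conv1d_params(kernel=7, c_in=1024, c_out=1))
print(matmul_flops(200, 200, 1024))   # e.g., the A @ X term of one GCN layer
```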
IV. EXPERIMENTS

A. Datasets and Evaluation Metric

We conduct experiments on three datasets, i.e., the UCF-Crime¹ [10], XD-Violence² [57], and ShanghaiTech³ [68] datasets, where the UCF-Crime and XD-Violence datasets are large-scale datasets for weakly supervised anomaly detection, and the ShanghaiTech dataset is used for semi-supervised anomaly detection.

¹https://visionlab.uncc.edu/download/summary/60-data/477-ucf-anomaly-
Fig. 8. ROC and PR curve comparisons between our method and other state-of-the-art methods. (a) ROC curve on the UCF-Crime dataset; (b) PR curve on the UCF-Crime dataset; (c) PR curve on the XD-Violence dataset.
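For readers who want to reproduce curves like those in Fig. 8, the snippet below sketches how frame-level AUC@ROC, AUC@PR (approximated here by average precision), and the false alarm rate at the 0.5 threshold can be computed with scikit-learn; the exact evaluation protocol (e.g., which frames enter the false-alarm computation) should be taken from the paper, not from this sketch.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(scores, labels, threshold=0.5):
    """Frame-level AUC@ROC, AUC@PR, and false alarm rate (illustrative sketch).

    scores : frame-level anomaly scores in [0, 1].
    labels : frame-level ground truth (1 = anomalous, 0 = normal).
    The false alarm rate is measured on normal frames with the 0.5 threshold.
    """
    auc_roc = roc_auc_score(labels, scores)
    auc_pr = average_precision_score(labels, scores)
    normal = labels == 0
    far = np.mean(scores[normal] > threshold)
    return auc_roc, auc_pr, far

# toy example
labels = np.array([0] * 900 + [1] * 100)
scores = np.clip(np.random.rand(1000) * 0.4 + labels * 0.5, 0, 1)
print(evaluate(scores, labels))
```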
TABLE VI. Effects of Feature Discrimination and Temporal Modeling on the XD-Violence Dataset Using RGB I3D Features
TABLE VII. Effects of Feature Discrimination and Temporal Modeling on the ShanghaiTech Dataset Using RGB I3D Features
TABLE IX. Performance Gains of Leveraging Feature Discrimination on the UCF-Crime Dataset
TABLE X. Performance Comparisons With Different Weight α on the UCF-Crime Dataset

The results in the first two rows of Tables VI and VII on the XD-Violence and ShanghaiTech datasets also clearly verify the importance of explicitly modeling temporal context for online anomaly detection. Moreover, the performance improvements on the XD-Violence dataset are more remarkable since XD-Violence comprises more diverse and complicated activities than the ShanghaiTech dataset.

2) Importance of Feature Discrimination: We conduct experiments and show comparison results to evaluate the effects of feature discrimination. We use two baselines: one lacks the CP and DP modules with respect to (w.r.t.) our method, and the other lacks the CS module w.r.t. our method. From Table VIII, we can see that the CP and DP modules gain 2.0% and 3.1% improvements in terms of AUC@ROC and AUC@PR, respectively, and they also significantly reduce the false alarm rate (we use the standard threshold of 50%) from 2.11% to 0.73%. The performance of the baseline that only uses the CP and DP modules is still far from satisfactory (57.91% in AUC@ROC and 10.52% in AUC@PR); the main gap between this baseline and our method stems from the lack of the MIL-based classification loss, without which the model cannot learn anomaly activations. In addition, we also show the performance gain of leveraging feature discrimination on several baselines. As shown in Table IX, leveraging feature discrimination on other baselines also obtains clear gains in terms of AUC@ROC, AUC@PR, and the false alarm rate. Moreover, we show the distributions of the discriminative features.

E. Evaluation on Parameters

1) Temporal Lengths t and K: On top of the CTR and CL modules, we also conduct experiments to further investigate the temporal lengths of the CTR and CL modules and show the change in performance with different temporal lengths in Figure 10. We observe that setting t and K to 9 and 7, respectively, is an optimal choice within a certain range: small temporal lengths do not provide enough temporal context, whereas overly large temporal lengths increase the calculation burden and are likely to introduce extra noise.

2) Weights of Loss α and β: The weights α and β balance the classification and discrimination terms. We evaluate the performance of our method with different weights, and the results are summarized in Tables X and XI. We observe that 1) our method yields the best performance when α is set to 10⁻³ and β is set to 0.5 or 1; 2) generally, as α and β increase, the false alarm rate declines gradually; and 3) the CP and DP modules are more powerful when working together.

3) Sampling Length: As shown in Table XII, as the sampling length increases, the false alarm rate improves rapidly, and the run time per training epoch also increases. AUC@ROC and AUC@PR reach their peaks at 400 and 200, respectively. In this article, we choose a sampling length of 200, which is a trade-off between accuracy and computation.

4) Different Modalities in XD-Violence: Based on the audio-visual attributes of the XD-Violence dataset, we take the audio feature, the visual feature, and the concatenation of the two as inputs.
Fig. 10. Changes of AUC@ROC and AUC@PR w.r.t t and K on the UCF-Crime dataset. (a) Performance versus t when K=7; (b) Performance versus
K when t =9.
TABLE XI. Performance Comparisons With Different Weight β on the UCF-Crime Dataset
TABLE XII. Performance Comparisons on the UCF-Crime Dataset When Varying the Sampling Length During Training
Fig. 11. Effects of different modalities on the XD-Violence dataset.
Fig. 12. Qualitative results of our method and another two baselines on the ShanghaiTech dataset. (a) Bicycle, (b) Baby carriage, (c) Romp. In addition,
“GT” is the ground truth at the frame-level. “w/o FD” denotes a baseline that lacks feature discriminations w.r.t our method, and “w/o FD&TR” denotes a
baseline that lacks feature discrimination and causal temporal relations w.r.t our method, i.e., the first baseline in Table V. Best viewed in high resolution.
Fig. 13. Qualitative results of our method and another two baselines on the UCF-Crime dataset. (a) Arrest, (b) Explosion, (c) Fighting, (d) Shooting,
(e) Robbery, (f) Vandalism, (g)-(i) Normal activities, and (j)-(l) less successful examples.
We also observe that our method generates a delayed response for explosion (Figure 13(b)), since it relies on the visual billowing smoke seen in the mid-to-late stages of explosions rather than the audible boom. For the XD-Violence dataset, we show the qualitative results taking different modality features as input.
Fig. 14. Qualitative results of our method using different modal inputs on the XD-Violence dataset. (a) Fighting, (b) Car accident, (c) Shooting and explosion,
(d) Riot, (e) Riot and abuse, (f) Explosion.
Videos in the top row are collected from movies, and videos in the bottom row are collected from in-the-wild scenarios. Regardless of whether movies or real-world videos are considered, audio-visual input is superior to single-modal input since audio-visual information is more comprehensive.

V. CONCLUSION

In this article, we identified a problem posed by the lack of temporal relation modeling and discriminative features in previous weakly supervised methods for anomaly detection. To solve this problem, we proposed a method that explores causal temporal relations within the local range as well as the discriminative power of features. By directly comparing the evaluation results of our method with other baselines, we demonstrated that causal temporal relations and feature discrimination are both requisite to achieving performance improvements. Extensive experiments on three benchmarks showed that our method gains a clear advantage over the current state-of-the-art methods. In the future, temporal relations between current and future information are yet to be explored. In addition, audio-visual anomaly detection is of particular interest since audio cues provide complementary information, which could improve the performance of anomaly detectors.

REFERENCES

[1] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, "Learning temporal regularity in video sequences," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 733–742.
[2] Y. Cong, J. Yuan, and J. Liu, "Sparse reconstruction cost for abnormal event detection," in Proc. CVPR, Jun. 2011, pp. 3449–3456.
[3] W. Li, V. Mahadevan, and N. Vasconcelos, "Anomaly detection and localization in crowded scenes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 18–32, Jan. 2014.
[4] C. Lu, J. Shi, and J. Jia, "Abnormal event detection at 150 FPS in MATLAB," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2720–2727.
[5] W. Liu, W. Luo, D. Lian, and S. Gao, "Future frame prediction for anomaly detection–a new baseline," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6536–6545.
[6] Y. Zhu and S. Newsam, "Motion-aware feature for improved video anomaly detection," in Proc. Brit. Mach. Vis. Conf. (BMVC), Sep. 2019, p. 270.
[7] R. Zeng, C. Gan, P. Chen, W. Huang, Q. Wu, and M. Tan, "Breaking winner-takes-all: Iterative-winners-out networks for weakly supervised temporal action localization," IEEE Trans. Image Process., vol. 28, no. 12, pp. 5797–5808, Dec. 2019.
[8] P. Wu, J. Liu, and F. Shen, "A deep one-class neural network for anomalous event detection in complex scenes," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2609–2622, Jul. 2020.
[9] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4489–4497.
[10] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6479–6488.
[11] P. Perera and V. M. Patel, "Learning deep features for one-class classification," IEEE Trans. Image Process., vol. 28, no. 11, pp. 5450–5463, Nov. 2019.
[12] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[13] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 499–515.
[14] S. Lin, H. Yang, X. Tang, T. Shi, and L. Chen, "Social MIL: Interaction-aware for crowd anomaly detection," in Proc. 16th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Sep. 2019, pp. 1–8.
[15] K. Liu and H. Ma, "Exploring background-bias for anomaly detection in surveillance videos," in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 1490–1499.
[16] J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, "Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1237–1246.
[17] S. Narayan, H. Cholakkal, F. S. Khan, and L. Shao, "3C-net: Category count and center loss for weakly-supervised action localization," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8679–8687.
[18] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7794–7803.
[19] H. Hu, Z. Zhang, Z. Xie, and S. Lin, "Local relation networks for image recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3464–3473.
[20] W. Luo et al., "Video anomaly detection with sparse coding inspired deep neural networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 3, pp. 1070–1084, Mar. 2021.
[21] D. Xu, Y. Yan, E. Ricci, and N. Sebe, "Detecting anomalous events in videos by learning deep representations of appearance and motion," Comput. Vis. Image Understand., vol. 156, pp. 117–127, Mar. 2017.
[22] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 815–823.
[23] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6299–6308.
[24] S. Paul, S. Roy, and A. K. Roy-Chowdhury, "W-TALC: Weakly-supervised temporal activity localization and classification," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 563–579.
[25] A. van den Oord et al., "WaveNet: A generative model for raw audio," 2016, arXiv:1609.03499. [Online]. Available: http://arxiv.org/abs/1609.03499
[26] S. Bai, J. Zico Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," 2018, arXiv:1803.01271. [Online]. Available: http://arxiv.org/abs/1803.01271
[27] W. Wang, X. Peng, Y. Qiao, and J. Cheng, "A comprehensive study on temporal modeling for online action detection," 2020, arXiv:2001.07501. [Online]. Available: http://arxiv.org/abs/2001.07501
[28] J. Collins, J. Sohl-Dickstein, and D. Sussillo, "Capacity and trainability in recurrent neural networks," in Proc. Int. Conf. Learn. Represent., Apr. 2017.
[29] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Proc. Adv. Neural Inf. Process. Syst., 2000, pp. 582–588.
[30] X. Wang and A. Gupta, "Videos as space-time region graphs," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 399–417.
[31] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[32] C. Feichtenhofer, "X3D: Expanding architectures for efficient video recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 203–213.
[33] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 568–576.
[34] L. Wang et al., "Temporal segment networks: Towards good practices for deep action recognition," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 20–36.
[35] M. Zolfaghari, K. Singh, and T. Brox, "ECO: Efficient convolutional network for online video understanding," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 695–712.
[36] D. Oneata, J. Verbeek, and C. Schmid, "Action and event recognition with Fisher vectors on a compact feature set," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1817–1824.
[37] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 3551–3558.
[38] I. Laptev, "On space-time interest points," Int. J. Comput. Vis., vol. 64, nos. 2–3, pp. 107–123, Sep. 2005.
[39] L. Wang, Y. Xiong, D. Lin, and L. Van Gool, "UntrimmedNets for weakly supervised action recognition and detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4325–4334.
[40] P. Nguyen, B. Han, T. Liu, and G. Prasad, "Weakly supervised action localization by sparse temporal pooling network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6752–6761.
[41] Z. Shou, H. Gao, L. Zhang, K. Miyazawa, and S. Chang, "AutoLoc: Weakly-supervised temporal action localization in untrimmed videos," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 154–171.
[42] Z. Liu et al., "Weakly supervised temporal action localization through contrast based evaluation networks," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3899–3908.
[43] P. Lee, Y. Uh, and H. Byun, "Background suppression network for weakly-supervised temporal action localization," in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 7, Apr. 2020, pp. 11320–11327.
[44] J. Gao, Z. Yang, and R. Nevatia, "RED: Reinforced encoder-decoder networks for action anticipation," in Proc. Brit. Mach. Vis. Conf., 2017, p. 92.
[45] R. De Geest, E. Gavves, A. Ghodrati, Z. Li, G. Snoek, and T. Tuytelaars, "Online action detection," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 269–285.
[46] M. Xu, M. Gao, Y.-T. Chen, L. Davis, and D. Crandall, "Temporal recurrent networks for online action detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5532–5541.
[47] H. Eun, J. Moon, J. Park, C. Jung, and C. Kim, "Learning to discriminate information for online action detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 809–818.
[48] M. Sabokrou, M. Fayyaz, M. Fathy, and R. Klette, "Deep-cascade: Cascading 3D deep neural networks for fast anomaly detection and localization in crowded scenes," IEEE Trans. Image Process., vol. 26, no. 4, pp. 1992–2004, Apr. 2017.
[49] K.-W. Cheng, Y.-T. Chen, and W.-H. Fang, "Gaussian process regression-based video anomaly detection and localization with hierarchical feature representation," IEEE Trans. Image Process., vol. 24, no. 12, pp. 5288–5301, Dec. 2015.
[50] R. Leyva, V. Sanchez, and C.-T. Li, "Video anomaly detection with compact feature sets for online performance," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3463–3478, Jul. 2017.
[51] S. Biswas and V. Gupta, "Abnormality detection in crowd videos by tracking sparse components," Mach. Vis. Appl., vol. 28, nos. 1–2, pp. 35–48, Feb. 2017.
[52] P. Wu, J. Liu, M. Li, Y. Sun, and F. Shen, "Fast sparse coding networks for anomaly detection in videos," Pattern Recognit., vol. 107, Nov. 2020, Art. no. 107515.
[53] J. Wang and A. Cherian, "GODS: Generalized one-class discriminative subspaces for anomaly detection," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2019, pp. 8201–8211.
[54] M. Xu, C. Zhao, D. S. Rojas, A. Thabet, and B. Ghanem, "G-TAD: Sub-graph localization for temporal action detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10156–10165.
[55] F. Sohrab, J. Raitoharju, M. Gabbouj, and A. Iosifidis, "Subspace support vector data description," in Proc. 24th Int. Conf. Pattern Recognit. (ICPR), Aug. 2018, pp. 722–727.
[56] D. Mandal, S. Bharadwaj, and S. Biswas, "A novel self-supervised re-labeling approach for training with noisy labels," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 1381–1390.
[57] P. Wu et al., "Not only look, but also listen: Learning multimodal violence detection under weak supervision," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 322–339.
[58] B. Tu, X. Yang, N. Li, C. Zhou, and D. He, "Hyperspectral anomaly detection via density peak clustering," Pattern Recognit. Lett., vol. 129, pp. 144–149, Jan. 2020.
[59] R. A. A. Habeeb et al., "Clustering-based real-time anomaly detection—A breakthrough in big data technologies," Trans. Emerg. Telecommun. Technol., Art. no. e3647, Jun. 2019.
[60] S. Song, H. Zhou, Y. Yang, and J. Song, "Hyperspectral anomaly detection via convolutional neural network and low rank with density-based clustering," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 9, pp. 3637–3649, Sep. 2019.
[61] S. Garg, K. Kaur, S. Batra, G. Kaddoum, N. Kumar, and A. Boukerche, "A multi-stage anomaly detection scheme for augmenting the security in IoT-enabled applications," Future Gener. Comput. Syst., vol. 104, pp. 105–118, Mar. 2020.
[62] R. Ghezelbash, A. Maghsoudi, and E. J. M. Carranza, "Optimization of geochemical anomaly detection using a novel genetic K-means clustering (GKMC) algorithm," Comput. Geosci., vol. 134, Jan. 2020, Art. no. 104335.
[63] H. Park, J. Noh, and B. Ham, "Learning memory-guided normality for anomaly detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 14372–14381.
[64] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, "Learning regularity in skeleton trajectories for anomaly detection in videos," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 11996–12004.
[65] G. Somepalli, Y. Wu, Y. Balaji, B. Vinzamuri, and S. Feizi, "Unsupervised anomaly detection with adversarial mirrored AutoEncoders," 2020, arXiv:2003.10713. [Online]. Available: http://arxiv.org/abs/2003.10713
[66] G. Pang, C. Yan, C. Shen, A. van den Hengel, and X. Bai, "Self-trained deep ordinal regression for end-to-end video anomaly detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12173–12182.
[67] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, "Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4183–4192.
[68] W. Luo, W. Liu, and S. Gao, "A revisit of sparse coding based anomaly detection in stacked RNN framework," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 341–349.
[69] J. Zhang, L. Qing, and J. Miao, "Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 4030–4034.
[70] B. Wan, Y. Fang, X. Xia, and J. Mei, "Weakly supervised video anomaly detection via center-guided discriminative learning," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2020, pp. 1–6.
[71] J. F. Gemmeke et al., "Audio set: An ontology and human-labeled dataset for audio events," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 776–780.
[72] D. A. Bulkin and J. M. Groh, "Seeing sounds: Visual and auditory interactions in the brain," Current Opinion Neurobiol., vol. 16, no. 4, pp. 415–419, Aug. 2006.
[73] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, "Audio-visual event localization in unconstrained videos," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 247–263.
[74] L. Ruff et al., "Deep one-class classification," in Proc. Int. Conf. Mach. Learn., 2018, pp. 4393–4402.
[75] M. Z. Zaheer, A. Mahmood, M. Astrid, and S. I. Lee, "CLAWS: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 358–376.
[76] B. Kiran, D. Thomas, and R. Parakkal, "An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos," J. Imag., vol. 4, no. 2, p. 36, Feb. 2018.
[77] R. Chalapathy and S. Chawla, "Deep learning for anomaly detection: A survey," 2019, arXiv:1901.03407. [Online]. Available: http://arxiv.org/abs/1901.03407
[78] G. Pang, C. Shen, L. Cao, and A. van den Hengel, "Deep learning for anomaly detection: A review," 2020, arXiv:2007.02500. [Online]. Available: http://arxiv.org/abs/2007.02500
[79] M. D. Parmar et al., "FREDPC: A feasible residual error-based density peak clustering algorithm with the fragment merging strategy," IEEE Access, vol. 7, pp. 89789–89804, 2019.
[80] M. Parmar et al., "REDPC: A residual error-based density peak clustering algorithm," Neurocomputing, vol. 348, pp. 82–96, Jul. 2019.

Peng Wu received the B.Eng. degree from Xidian University, Xi'an, China, in 2017, where he is currently pursuing the Ph.D. degree with the Guangzhou Institute of Technology. His current research interests include video anomaly detection, video retrieval, temporal action detection, weakly supervised learning, and deep learning.

Jing Liu (Senior Member, IEEE) received the B.S. degree in computer science and technology and the Ph.D. degree in circuits and systems from Xidian University in 2000 and 2004, respectively. In 2005, she joined Xidian University as a Lecturer, and was promoted to a Full Professor in 2009. From April 2007 to April 2008, she was a Postdoctoral Research Fellow with The University of Queensland, Australia. From July 2009 to July 2011, she was a Research Associate with The University of New South Wales, Australian Defence Force Academy. She is currently a Full Professor with the Guangzhou Institute of Technology, Xidian University. Her research interests include evolutionary computation, complex networks, fuzzy cognitive maps, and computer vision.

Dr. Liu was the Chair of the Emerging Technologies Technical Committee (ETTC) of the IEEE Computational Intelligence Society from 2017 to 2018 and an Associate Editor of the IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION from 2015 to 2020.