
Generalized Video Anomaly Event Detection: Systematic Taxonomy and

Comparison of Deep Models

YANG LIU∗ , Fudan University, China and University of Toronto, Canada


DINGKANG YANG and YAN WANG, Fudan University, China
JING LIU, Fudan University, China, University of British Columbia, Canada, and Duke Kunshan University, China
JUN LIU, Singapore University of Technology and Design, Singapore
AZZEDINE BOUKERCHE, University of Ottawa, Canada
PENG SUN† , Duke Kunshan University, China
LIANG SONG† , Fudan University, China

arXiv:2302.05087v3 [cs.CV] 1 Feb 2024

Video Anomaly Detection (VAD) serves as a pivotal technology in intelligent surveillance systems, enabling the temporal or spatial
identification of anomalous events within videos. While existing reviews predominantly concentrate on conventional unsupervised
methods, they often overlook the emergence of weakly-supervised and fully-unsupervised approaches. To address this gap, this survey
extends the conventional scope of VAD beyond unsupervised methods, encompassing a broader spectrum termed Generalized Video
Anomaly Event Detection (GVAED). By skillfully incorporating recent advancements rooted in diverse assumptions and learning
frameworks, this survey introduces an intuitive taxonomy that seamlessly navigates through unsupervised, weakly-supervised,
supervised and fully-unsupervised VAD methodologies, elucidating the distinctions and interconnections within these research
trajectories. In addition, this survey facilitates prospective researchers by assembling a compilation of research resources, including
public datasets, available codebases, programming tools, and pertinent literature. Furthermore, this survey quantitatively assesses
model performance, delves into research challenges and directions, and outlines potential avenues for future exploration.

CCS Concepts: • General and reference → Surveys and overviews; • Applied computing → Surveillance mechanisms; •
Information systems → Data streaming.

Additional Key Words and Phrases: Anomaly detection, video understanding, deep learning, intelligent surveillance system

ACM Reference Format:


Yang Liu, Dingkang Yang, Yan Wang, Jing Liu, Jun Liu, Azzedine Boukerche, Peng Sun, and Liang Song. 2024. Generalized Video
Anomaly Event Detection: Systematic Taxonomy and Comparison of Deep Models. 1, 1 (February 2024), 36 pages. https://doi.org/
XXXXXXX.XXXXXXX
∗ This paper was revised by Yang Liu during his FDU-UofT Joint Ph.D. Training Program at the University of Toronto, Canada.
† Prof. Peng Sun and Prof. Liang Song are the co-corresponding authors of this paper.

Authors’ addresses: Yang Liu, [email protected], Fudan University, Shanghai, 200433, China and University of Toronto, Toronto, M5S 1A1,
Ontario, Canada; Dingkang Yang, [email protected]; Yan Wang, [email protected], Fudan University, Shanghai, 200433, China; Jing Liu,
[email protected], Fudan University, Shanghai, 200433, China and University of British Columbia, Vancouver, V6T 1Z4, British Columbia, Canada and
Duke Kunshan University, Suzhou, 215316, Jiangsu, China; Jun Liu, [email protected], Singapore University of Technology and Design, Singapore, 487372,
Singapore; Azzedine Boukerche, [email protected], University of Ottawa, Ottawa, K1N 6N5, Ontario, Canada; Peng Sun, [email protected],
Duke Kunshan University, Suzhou, 215316, Jiangsu, China; Liang Song, [email protected], Fudan University, Shanghai, 200433, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2024 Association for Computing Machinery.
Manuscript submitted to ACM




1 INTRODUCTION
Surveillance cameras can sense environmental spatial-temporal information without contact and have been the primary
data collection tool for public services such as security protection [28], crime warning [158], and traffic management
[101]. However, with the rapid development of smart cities and digital society, the number of surveillance cameras
is growing explosively, making the ensuing video analysis a significant challenge. Traditional manual inspection is
time-consuming and laborious, may cause missed detections due to human visual fatigue [100], and can hardly cope with
vast-scale video streams. As the core technology of intelligent surveillance systems, Video Anomaly Detection (VAD)
aims to automatically analyze video patterns and locate abnormal events. Due to its potential application in unmanned
factories, self-driving vehicles, and secure communities, VAD has received wide attention from academia and industry.
VAD in a narrow sense refers specifically to the unsupervised research paradigm that uses only normal videos
to learn a normality model, abbreviated as UVAD. Such methods share the same assumption as the long-established
Anomaly Detection (AD) tasks in non-visual data (e.g., time series [7] and graphs [117]) and images [61]. They assume
the normality model learned on normal samples cannot represent anomalous samples. Typically, UVAD consists of two
phases: normality learning and downstream anomaly detection [50, 92, 101, 197]. UVAD shares a similar modeling process
with other AD tasks without predefining and collecting anomalies, following the open-world principle. In the real world,
anomalies are diverse and rare, so they cannot be defined and fully collected in advance. Therefore, UVAD was favored
by early researchers and was once considered the prevailing VAD paradigm. However, the definition of anomaly is
idealistic, ignoring that normal events are diverse. It is also unrealistic to collect all possible regular events for modeling.
In addition, the learned UVAD model has difficulty maintaining a reasonable balance between representation and
generalization power: insufficient representational power raises false alarms on unseen normal events, while excessive
generalization power reconstructs anomalous events equally well. Numerous experiments [218] have
shown that UVAD is valid for only simple scenarios. The model performance on complex datasets [92] is much inferior
to that on simple single-scene videos [84, 107], which limits the application of VAD techniques in realistic scenarios.
In contrast, Weakly-supervised Abnormal Event Detection (WAED) departs from the ambiguous setting in which everything
except normal is anomalous, adopting a clearer definition of anomaly that is more consistent with human perception
(e.g., traffic accidents, robbery, stealing, and shooting) [158]. Given its potential for immediate references in real-
life applications such as traffic management platforms and violence warning systems, WAED has become another
mainstream VAD paradigm since 2018 [39, 98, 164]. Generally, WAED models directly output anomaly scores by
comparing the spatial-temporal features of normal and abnormal events through Multiple Instance Learning (MIL).
The previous study [39] proved that WAED could understand the essential difference between normal and abnormal.
Therefore, its results are more reliable than those of UVAD. Unfortunately, WAED does not follow the basic assumptions
of AD tasks, which is more like a binary classification under unique settings (e.g., data imbalance and positive samples
containing multiple subcategories). Therefore, existing reviews [12, 136, 139] mainly focus on UVAD and consider
WAED as a marginal research pathway, lacking the systematic organization for WAED datasets and methods.
In recent times, certain researchers [129, 207] have introduced fully-unsupervised methods, which require neither
labels nor any prerequisites on the training data. Given the deeply ingrained association of UVAD
with modeling using solely normal data, we retain the nomenclature UVAD for such normal-data-only techniques rather than
referring to them as semi-supervised VAD, and denote the emerging fully-unsupervised setting as
FVAD, contributing to conceptual clarity of terminology in this survey and future research.




[Figure 1 diagram: GVAED is organized by supervision (UVAD, WAED, SVAD, FVAD), by input (frame-level, patch-level, object-level, unimodal, multimodal, frame-level label-based, unlabeled video-based), and by structure (single-stream, multi-stream, MIL ranking, classification, and regression models), alongside research resources (datasets, metrics, codes, literature) and discussions (challenges, trends).]

Fig. 1. Taxonomy of Generalized Video Anomaly Event Detection (GVAED). We provide a hierarchical taxonomy that organizes
existing deep GVAED models by supervised signals, model inputs, and network structure into a systematic framework, including
Unsupervised Video Anomaly Detection (UVAD), Weakly-supervised Abnormal Event Detection (WAED), Fully-unsupervised VAD
(FVAD) and Supervised VAD (SVAD). Besides, we collate benchmark datasets, evaluation metrics, available codes, and literature to a
public GitHub repository1 . Finally, we analyze the research challenges and possible trends.

In summary, this survey focuses on anomaly event detection in surveillance videos, integrating deep VAD methods
based on different assumptions, learning paradigms, and supervision into a systematic taxonomy: Generalized Video
Anomaly Event Detection (GVAED), as shown in Fig. 1. We compare the differences and performance among different
methods, sorting out the recent advances in GVAED. In addition, we collate available research resources, such as
datasets, metrics, codes, and literature, into a public GitHub repository1 . Moreover, we analyze the research challenges
and future trends, which can guide further research and promote the development and applications of GVAED.

1.1 Literature Statistics


We count the publications and citations of academic papers on the topic of Video Anomaly Detection and Abnormal
Event Detection in the past 12 years through reference databases (e.g., ACM Digital Library, IEEE Xplore, ScienceDirect,
and SpringerLink) and search engines. The results are shown in Fig. 2, where the bar and line graph indicate the number
of publications and citations, respectively. The lines in Fig. 2(a) and Fig. 2(b) show a steadily increasing trend, indicating
that GVAED has received wide attention. Therefore, a systematic taxonomy and comparison of GVAED methods are
necessary to guide further development. As mentioned above, current works focus on unsupervised methods that use
only regular videos to train models to represent normality. Thus, the development of UVAD is limited by representation
means. Before 2016, UVAD relied on handcrafted features, such as Local Binary Patterns (LBP) [56, 123, 216], Histograms of
Oriented Gradients (HOG) [50, 116, 148], and Space-Time Interest Points (STIP) [34]. Their performance is poor and relies on a priori
knowledge. As a result, VAD developed slowly. Fortunately, after 2016, with the development of deep learning, especially
the application of Convolutional Neural Networks (CNNs) in image processing [20, 74, 193] and video understanding
[194–196], VAD has ushered in new development opportunities. The research progress increased significantly, as
shown in Fig. 2(a). Deep CNNs can extract video patterns end-to-end, freeing VAD research from complex a priori
1 https://github.com/fudanyliu/GVAED.git


Fig. 2. Publication and citation statistics on the topic of (a) Video Anomaly Detection and (b) Abnormal Event Detection.

Table 1. Analysis and Comparison of Related Reviews.

Year | Ref. | Main Focus | Research Pathways^a: UVAD WAED SVAD FVAD | Topics^a,b: LW OD CS OE
2018 | [67] | Unsupervised and semi-supervised methods. | ✓ ✗ ✗ ✗ | ✓ ✗ ✗ ✗
2019 | [106] | Weakly-supervised VAD methods and applications. | ✗ ✓ ✗ ✗ | ✗ ✗ ✗ ✗
2020 | [139] | Unsupervised single-scene video anomaly detection. | ✓ ✗ ✓ ✗ | ✓ ✗ ✓ ✗
2021 | [125] | Deep learning driven unsupervised VAD methods. | ✓ ✗ ✗ ✗ | ✓ ✗ ✗ ✗
2021 | [144] | Unsupervised crowd anomaly detection methods. | ✓ ✗ ✗ ✗ | ✗ ✗ ✗ ✗
2021 | [150] | Traffic scene video anomaly detection. | ✓ ✗ ✗ ✗ | ✗ ✗ ✗ ✗
2022 | [136] | Unsupervised video anomaly detection. | ✓ ✗ ✗ ✗ | ✗ ✓ ✗ ✗
2022 | [12] | One&two-class classification-based methods. | ✓ ✓ ✓ ✗ | ✓ ✓ ✗ ✗
2023 | Ours | GVAED taxonomy, challenges and trends. | ✓ ✓ ✓ ✓ | ✓ ✓ ✓ ✓
a: ✗ means no systematic analysis, while ✓ is vice versa. b: LW=lightweight, OD=online detection, CS=cross-scene, and OE=online evolution.

knowledge construction. In addition, compared with manual features [34, 56, 148], deep representations can capture
multi-scale spatial semantic features and extract long-range temporal contextual features, which are more efficient in
learning video normality. On the one hand, the large amount of video generated by the surveillance cameras provides
sufficient training data for deep GVAED models. On the other hand, the iteratively updated Graphics Processing Units
(GPUs) make it possible to train large-scale models. As a result, VAD has developed rapidly in recent years and started
to move from academic research to commercial applications. Similarly, Fig. 2(b) reflects the research enthusiasm and
development potential of abnormal event detection. To accelerate the application of GVAED in terminal devices and
inspire future researchers, this survey organizes various GVAED models into a unified framework. Additionally, we
collect commonly used datasets, publicly available codes, and classic literature for further research.

1.2 Related Reviews


In the past four years, several reviews [7, 12, 26, 64, 67, 106, 122, 125, 128, 136, 139, 144, 150, 154] have covered GVAED
works and generated various classification systems. We analyze the methodologies covered in recent reviews and the
research topics related to real-world deployment, as shown in Table 1. The mainstream reviews [67, 139] still consider
VAD as a narrow unsupervised task, lacking attention to WAED with excellent application value and FVAD methods
using unlimited training data. In addition, they are biased against Supervised Video Anomaly Detection (SVAD), arguing
that data labeling makes SVAD challenging to develop. However, the game engines [38, 152] and automatic annotations
[42, 49] make it possible to obtain anomalous events and fine-grained labels. In addition, existing reviews suffer

from three major weaknesses: (1) [139] and [150] attempted to link existing works to specific scenes, missing the cross-
scene challenges in the real world. Specifically, [139] pointed out that existing works were trained and tested on videos of
the same scene, so they only reviewed single-scene methods, leaving out the latest cross-scene VAD research. [150]
focuses on the traffic VAD methods, innovatively analyzing the applicability of existing works in traffic scenes. However,
weakly-supervised methods for crime and violence detection are not included in [150]. (2) Due to timeliness, earlier
reviews [26, 125, 144, 150] were unable to cover the latest research and were outdated for predicting research trends.
Recent surveys [12, 136] lack discussion of the interaction of GVAED with new techniques such as causal machine
learning [88, 103], domain adaptation [44, 183], and online evolutive learning [76, 77, 156], which are expected to be
the future directions of GVAED and essential to model deployment. (3) Although the latest review [12] in 2022 has
started to incorporate WAED and SVAD, it still treats them as a marginal exploration, lacking a systematic organization
of the datasets, literature, and trends.

1.3 Contribution Summary


GVAED will usher in new development opportunities with the rapid growth of deep learning techniques and surveillance
videos. To clarify the development of GVAED and inspire future research, this survey integrates UVAD, WAED, FVAD,
and SVAD into a unified framework from an application perspective. The main contributions of this survey are
summarized in the following four aspects:

(1) To the best of our knowledge, it is the first comprehensive survey that extends video anomaly detection from
narrow unsupervised methods to generalized video anomaly event detection. We analyze the various research
routes and clearly state the lineage and trends of deep GVAED models to help advance the field.
(2) We organize various GVAED methods with different assumptions and learning frameworks from an application
perspective, providing an intuitive taxonomy based on supervision signals, input data, and network structures.
(3) This survey collects accessible datasets, literature, and codes and makes them publicly available. Moreover, we
analyze the potential applications of other deep learning techniques and structures in GVAED tasks.
(4) We examine the research challenges and trends within GVAED in the context of the development of deep learning
techniques, which is anticipated to provide valuable guidance for upcoming researchers and engineers.

The remainder of this survey is organized as follows. Section 2 provides an overview of the basics and research
background of GVAED, including the definition of anomalies, basic assumptions, main evaluation metrics, and benchmark
datasets. Sections 3-5 introduce the unsupervised, weakly-supervised, supervised and fully-unsupervised GVAED
methods, respectively. We analyze the extant methods’ general ideas and specific implementations and compare their
strengths and weaknesses. Further, we quantitatively compare the performance in Section 7. Section 8 analyzes the
challenges and research outlook on the development of GVAED. Section 9 concludes this survey.

2 FOUNDATIONS OF GVAED

2.1 Definition of the Anomaly


UVAD follows the assumption of general AD tasks [128] and considers all events that have not occurred in the training
set as abnormal. In other words, the training set of the UVAD dataset contains only normal events, while videos in the
test set that differ from the training set are considered anomalies. Thus, certain normal events in the subjective human

[Figure 3 legend: normal frame, abnormal frame, w/ frame-level label, w/o frame-level label, video-level label; panels (a) UVAD, (b) WAED, (c) SVAD, (d) FVAD.]

Fig. 3. Illustration of training data. (a) UVAD trains the model using only normal data, with the hidden implication that all video-
level and frame-level labels are 0. (b) WAED models use positive and negative samples and require only video-level labels, where 𝑌 = 0
indicates a normal video and 𝑌 = 1 indicates an anomaly. (c) SVAD uses fine-grained frame-level labels to supervise the
model, where the semantics of the frame-level labels expose the video-level labels. (d) FVAD attempts to learn the anomaly detector
from under-processed data with training data containing both normal and anomalous samples and without any level of labeling.

consciousness may also be labeled anomalies. For instance, in the UCSD Pedestrian datasets [84], riding a bicycle on
the college campus is labeled abnormal simply because the training set fails to contain such events. This seemingly odd
definition is dictated by the diversity and rarity of real-world anomalies. Collecting a sufficient number of anomalous
events with a full range of categories is nearly impossible. In response, researchers have taken the alternative route of
collecting enough normal videos to train models to describe the boundary of normal patterns and treat events that fall
outside the boundary as anomalies. Unfortunately, it is also costly to collect all possible normal events for training. In
addition, abnormal and normal frames share most of the appearance and motion information, making their patterns
overlap. Therefore, letting the model find a discriminative pattern boundary without seeing abnormal events is infeasible.
In contrast, WAED takes a more intuitive definition of anomalies. Events that are subjectively perceived as abnormal
by humans are considered anomalies, such as thefts and traffic accidents [158]. The training set for WAED tasks contains
both normal and abnormal events and provides easily accessible video-level labels to supervise the model. Compared
with fine-grained frame-level labels, video-level labels only tell the model whether a video contains abnormal events
without revealing the exact location of the abnormalities, avoiding the costly frame-by-frame labeling and providing
more reliable supervision. In contrast, the discrete frame-level annotations (0=normal, 1=abnormal) in SVAD ignore the
transition continuity from normal to abnormal events. WAED needs to predefine abnormal events so that it can only
distinguish specified abnormal events.

2.2 Problem Formation


In UVAD, the training data contains only normal events, as shown in Fig. 3(a). Such methods aim to describe the
boundaries of normal spatial-temporal patterns with a proxy task and consider the test samples whose patterns fall
outside the learned boundaries as anomalies [104]. Fig. 4 shows the two-stage anomaly detection framework in UVAD.
The deep network trained by performing the proxy task in the training phase is directly applied as a normality model
for anomaly detection in the testing phase. The performance of the proxy task is a credential to calculate the anomaly
score. Formulaically, the process of UVAD is as follows:

$$e = d\big(f(\boldsymbol{x}_{\mathrm{test}}; \theta),\, \boldsymbol{x}_{\mathrm{test}}\big) \qquad (1)$$

where 𝜃 denotes the learnable parameters of the deep model 𝑓 , designed to characterize the prototype of normal events.
𝑑 denotes the deviation between the test sample 𝑥 test and the well-trained 𝑓 , which is usually a quantifiable distance,
such as the Mean Square Error (MSE) of the prediction result, the 𝐿2 distance in the feature space and the difference
of the distribution [112, 131, 219]. Note that the normality model is obtained by optimizing the proxy task. This

[Figure 4 diagram: the training phase performs data pre-processing and normality learning to obtain a well-trained model; the testing phase performs data pre-processing and score calculation to output the anomaly score.]

Fig. 4. Illustration of the two-stage UVAD framework. Anomaly detection is performed in the test phase as a downstream task of
proxy task-based normality learning. The example video frames are from the CUHK Avenue [107] dataset.

process is independent of the downstream anomaly detection, so the performance of the proxy task cannot directly
determine the anomaly detection performance. In addition, for the reconstruction-based [46, 50] and prediction-based
[92, 101] methods, the final anomaly score is usually a relative value in the range [0, 1]. A higher score indicates a
larger deviation. Generally, these methods convert the absolute deviation 𝑒 into a relative anomaly score by performing
maximum-minimum normalization. They not only explicitly require all training data to be normal but also include the
hidden assumption that the test videos must include anomalous events. In other words, any test video will yield high
score intervals, which indicates that such methods are offline and may produce false alarms for normal videos.
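To make this concrete, the following minimal sketch (our own illustrative code, assuming a trained prediction-based normality model and PyTorch tensors; the function and variable names are hypothetical) computes per-frame deviations as in Eq. (1) and converts them into relative anomaly scores via max-min normalization:

```python
import torch

@torch.no_grad()
def anomaly_scores(model, clips, targets):
    """Score a test video with a trained prediction-based normality model.

    clips:   (N, C, T, H, W) input sub-sequences
    targets: (N, C, H, W) frames to be predicted
    Returns relative anomaly scores in [0, 1], one per frame.
    """
    model.eval()
    preds = model(clips)                         # f(x_test; theta)
    # Per-frame deviation e = d(f(x_test), x_test), here the mean squared error
    errors = ((preds - targets) ** 2).flatten(1).mean(dim=1)
    # Max-min normalization turns absolute deviations into relative scores,
    # which is why every test video yields a high-score interval.
    return (errors - errors.min()) / (errors.max() - errors.min() + 1e-8)
```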
WAED methods [39, 98, 164] always follow the MIL ranking framework. Fig. 3(b) shows the training data composition
for WAED, where both normal and anomalous events need to be pre-collected and labeled. Video-level labels are easy
to obtain and often more accurate than the fine-grained frame-level labels for SVAD shown in Fig. 3(c). In a concrete
implementation, WAED treats the video as a bag containing several instances, as illustrated in Fig. 5. The normal video
V𝑛 forms a negative bag B𝑛 , while the abnormal video V𝑎 forms a positive bag B𝑎 . Based on MIL, WAED aims to train a
regression model 𝑟 (·) to assign scores to instances, with the basic goal that the maximum score of B𝑎 is higher than
that of B𝑛 . Thus, the WAED methods do not rely on an additional self-supervised proxy task but compute anomaly
scores directly. The objective function 𝑂 (B𝑎 , B𝑛 ) is as follows:
$$O(\mathcal{B}_a, \mathcal{B}_n) = \min\Big[\max\big(0,\, 1 - \max_{i \in \mathcal{B}_a} r(\mathcal{V}_a^i) + \max_{i \in \mathcal{B}_n} r(\mathcal{V}_n^i)\big)\Big] + \lambda_1 \underbrace{\sum_{i}^{n-1} \big(r(\mathcal{V}_a^i) - r(\mathcal{V}_a^{i+1})\big)^2}_{C_{smooth}} + \lambda_2 \underbrace{\sum_{i}^{n} r(\mathcal{V}_a^i)}_{C_{sparsity}} \qquad (2)$$
Besides the anomaly curve smoothness constraint $C_{smooth}$ and the sparsity constraint $C_{sparsity}$, the core
of 𝑂 (B𝑎 , B𝑛 ) is to train a ranking model capable of distinguishing the spatial-temporal patterns between B𝑎 and
B𝑛 . Subsequent WAED works [39, 98, 102, 164, 172, 223] have followed the idea of MIL ranking and made effective
improvements regarding feature extraction [223], label denoising [220], and the objective function [98]. However, as
shown in Fig. 5, the MIL regression module takes the extracted feature representations as input, so the performance of
WAED methods partially depends on the pre-trained feature extractor, making the calculation costly.
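As a rough illustration of Eq. (2), the sketch below implements the MIL ranking objective with the smoothness and sparsity constraints over the instance scores of one positive and one negative bag. The function name and default weights are our own and are not taken from [158]:

```python
import torch

def mil_ranking_loss(scores_abn, scores_nor, lambda1=8e-5, lambda2=8e-5):
    """MIL ranking loss in the spirit of Eq. (2).

    scores_abn: (n,) instance scores r(V_a^i) of the positive (anomalous) bag
    scores_nor: (n,) instance scores r(V_n^i) of the negative (normal) bag
    """
    # Hinge term: the highest-scored instance of the positive bag should
    # outrank the highest-scored instance of the negative bag by a margin of 1.
    hinge = torch.clamp(1.0 - scores_abn.max() + scores_nor.max(), min=0.0)
    # C_smooth: temporal smoothness over consecutive instances of the anomalous video
    smooth = ((scores_abn[:-1] - scores_abn[1:]) ** 2).sum()
    # C_sparsity: anomalies should occupy only a few instances
    sparsity = scores_abn.sum()
    return hinge + lambda1 * smooth + lambda2 * sparsity
```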
The training data of FVAD contains both normal and abnormal, and none of the data labels are available for model
training, as shown in Fig. 3(d). One class of FVAD methods follows a similar workflow to that of UVAD, i.e., learning
the normality model directly from the original data. Although the training data contains anomalous events, the low

[Figure 5 diagram: anomalous, normal, and test videos are sliced into instances and passed through a pretrained feature extractor and an MIL regression module; instance scores of the positive and negative bags (e.g., 0.4, 0.8, 0.7, ..., 0.1 vs. 0.1, 0.2, 0.2, ..., 0.1) feed the MIL ranking loss with constraints, while test instances receive anomaly scores directly.]

Fig. 5. Structure of the MIL ranking model [158]. The anomalous video V𝑎 and the normal video V𝑛 are first sliced into several equal-
size instances. The positive bag B𝑎 contains at least one positive instance, while the negative bag B𝑛 contains only normal instances.
In the test phase, the well-trained MIL regression model output the anomaly scores of instances in the test video V𝑡 directly.

frequency of anomalies limits their impact on model optimization. As a result, the model learned on many normal
videos and a small number of abnormal frames is still only effective in representing normal events and generates large
errors for anomalous events. Another class of methods tries to discover anomalies through the mutual collaboration of
the representation learner and anomaly detector. Generally, the learning process of FVAD can be formulated as follows:
$$\mathcal{F} = \arg\min_{\Theta} \sum_{I \in \mathcal{I}} \mathcal{L}_{foc}\big(\hat{y} = \phi(m = \varphi(x = f(I))),\, l\big) \qquad (3)$$
where the aim is to learn an anomaly detector F via a deep neural network consisting of a backbone network
f(·; Θ_b): R^{H×W×3} → R^{D_b} that transforms an input video frame I into a feature x, an anomaly representation learner
φ(·; Θ_a): R^{D_b} → R^{D_n} that converts x into an anomaly-specific representation m, and an anomaly score regression
layer ϕ(·; Θ_s): R^{D_n} → R that maps m to an anomaly score ŷ. The overall parameters Θ = {Θ_b, Θ_a, Θ_s}
are optimized with the focal loss. Research on fully-unsupervised methods is still in its infancy; such methods exploit
the sample imbalance and the distinctiveness of anomalies in the GVAED task.
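The composition in Eq. (3) can be sketched as below. This is an illustrative skeleton rather than any published FVAD model: the backbone, layer sizes, and the source of the pseudo-labels l (e.g., scores from a preliminary detector) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FVADDetector(nn.Module):
    """F = phi(varphi(f(I))): backbone -> anomaly representation -> score, cf. Eq. (3)."""

    def __init__(self, backbone, feat_dim=512, rep_dim=128):
        super().__init__()
        self.backbone = backbone                                            # f(.; Theta_b): frame -> feature x
        self.rep = nn.Sequential(nn.Linear(feat_dim, rep_dim), nn.ReLU())   # varphi(.; Theta_a): x -> m
        self.score = nn.Linear(rep_dim, 1)                                  # phi(.; Theta_s): m -> anomaly logit

    def forward(self, frames):
        x = self.backbone(frames)
        m = self.rep(x)
        return self.score(m).squeeze(-1)

def focal_loss(logits, pseudo_labels, alpha=0.25, gamma=2.0):
    """Focal loss on (self-generated) pseudo-labels, down-weighting easy normal samples."""
    ce = F.binary_cross_entropy_with_logits(logits, pseudo_labels, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * pseudo_labels + (1 - p) * (1 - pseudo_labels)
    alpha_t = alpha * pseudo_labels + (1 - alpha) * (1 - pseudo_labels)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```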

2.3 Benchmark Datasets


In Table 2, we show and compare the statistical results and properties of the frequently used GVAED datasets. Several
datasets [1, 92, 158, 186] have been proposed with different annotated signals to match new research requirements
after 2018, which reflects the trend of GVAED from unsupervised to weakly-supervised [158], from unimodal to
multimodal [186] and from simple to complex real-world scenarios [92] at the data level.

2.3.1 Subway Entrance & Exit. As an earlier dataset, Subway [2] includes two independent sub-datasets, Entrance and
Exit, which record the subway entrance and exit scenes, respectively. The anomalous events include people who skip
the subway entrance to evade tickets, cleaners who behave differently from regular entry and exit, and people who
travel in the wrong direction. Due to the cursory nature of the labeling work and the lack of clarity in the definition of
anomalous events, most existing works have refrained from using this dataset for model evaluation. Therefore, we do
not provide quantitative performance comparison results on this dataset but only briefly describe its characteristics to
reflect the lineage of GVAED datasets development.

Table 2. Representative GVAED Datasets. Italicized ones indicate WAED datasets, and underlined one is multimodal dataset.

Year | Dataset | #Videos (Total / Training / Testing) | #Frames (Total / Training / Testing / Normal / Abnormal) | #Scenes | #Anomalies
2008 Subway Entrance1 - - - 144,250 76,543 67,797 132,138 12,112 1 51
2008 Subway Exit2 - - - 64,901 22,500 42,401 60,410 4,491 1 14
2011 UMN2† - - - 7,741 - - 6,165 1,576 3 11
2013 UCSD Ped13 70 34 36 14,000 6,800 7,200 9,995 4,005 1 61
2013 UCSD Ped24 28 16 12 4,560 2,550 2,010 2,924 1,636 1 21
2013 CUHK Avenue4 37 16 21 30,652 15,328 15,324 26,832 3,820 1 77
2018 ShanghaiTech5 - - - 317,398 274,515 42,883 300,308 17,090 13 158
2018 UCF-Crime6 1,900 1,610 290 13,741,393 12,631,211 1,110,182 - - - 950
2019 ShanghaiTech Weakly 7 437 330 107 - - - - - - -
2020 Street Scene8 81 46 35 203,257 56,847 146,410 159,341 43,916 205 17
2020 XD-Violence9 4,754 - - - - - - - - -
2022 UBnormal10 ‡ 543 268 211 236,902 116,087 92,640 147,887 89,015 29 660
† Following previous works, we set the frame rate to 15 fps. ‡ The UBnormal dataset is supervised and includes a validation set with 64 videos.

2.3.2 UMN. The UMN [28] is also an early GVAED dataset containing 11 short videos captured from three different
scenes: grassland, indoor hall, and park. The scenes are set by the researcher rather than naturally filmed to detect
abnormal crowd behavior in indoor and outdoor scenes, i.e., the crowd suddenly shifts from normal interaction to
evacuation and flees abruptly to simulate fear. The anomalies are artificially conceived and played out, ignoring the
diversity and rarity of anomalies in the real world. Similar to the Subway [2] dataset, UMN has been abandoned by
recent researchers due to the lack of spatial annotation.

2.3.3 UCSD Pedestrian. UCSD Ped1 & Ped2 [84] are the most widely used UVAD datasets collected from university
campuses with simple but realistic scenarios. They reflect the value of GVAED in public security applications. Specifically,
the Ped1 dataset is captured by a camera with a viewpoint perpendicular to the road, so the moving object’s size changes
with its spatial position. In contrast, the Ped2 dataset used a camera whose viewpoint is parallel to the direction of the
road, which is simpler than Ped1. Pedestrian walking is defined as normal, while behaviors and objects different from it
are considered abnormal, such as biking, skateboarding, and driving. Since the scene is classical and anomalous events
are easy to understand, UCSD Pedestrian is widely used by existing works, and the frame-level AUC has been as high
as 99%, reflecting the saturation of model performance. Therefore, datasets with simple scenes have become a constraint
on GVAED development, and large-scale, cross-scene datasets have become an inevitable trend.

2.3.4 CUHK Avenue. Similar to UCSD Pedestrian, the CUHK Avenue [107] dataset is also collected from the university
campus, and both focus on anomalous events that occur on the road outside of expectations. The difference is that most of
the 47 anomalous events in CUHK Avenue are simulated by the data collector, including appearance anomalies (e.g., bags
placed on the grass) and motion anomalies, such as throwing and wrong direction. CUHK Avenue provides both frame-
level and pixel-level spatial annotations. In addition, its large data scale makes it one of the mainstream UVAD datasets.

2.3.5 ShanghaiTech. The UCSD Pedestrian [84] and CUHK Avenue [107] datasets only consider anomalous events in
a single scene, while the real world usually faces the challenge of spatial-temporal pattern shifts across scenes. For
this reason, the team from ShanghaiTech University proposed the ShanghaiTech [92] dataset containing 13 scenes,
1 https://vision.eecs.yorku.ca/research/anomalous-behaviour-data/sets/ 2 http://mha.cs.umn.edu/proj_events.shtml#crowd
3 http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm 4 http://www.cse.cuhk.edu.hk/leojia/projects/detectabnormal/dataset.html
5 https://svip-lab.github.io/dataset/campus_dataset.html 6 https://webpages.charlotte.edu/cchen62/dataset.html
7 https://github.com/jx-zhong-for-academic-purpose/GCN-Anomaly-Detection/ 8 https://www.merl.com/demos/video-anomaly-detection
9 https://roc-ng.github.io/XD-Violence/ 10 https://github.com/lilygeorgescu/UBnormal

[Figure 6 panels: Normal, Shooting, Riot, Explosion, Car accident.]

Fig. 6. Examples of XD-Violence dataset [186]. XD-Violence is a multimodal dataset for violence detection, including video and audio.
We show video frames here. The anomalous events are not all from the real-world but also include movie and game footage, etc.

providing the largest UVAD benchmark. Abnormal behaviors are defined as all collected behaviors that differ
from normal walking, such as riding a bicycle, crossing a road, and jumping forward. Unfortunately, although
the collectors pointed out the shortcomings of the existing dataset with a single scenario, their proposed FFP [92] was
not explicitly designed to address the cross-scene challenge but rather treats the dataset as a whole without differentiating
between scenes. For the WAED setting, researchers [220] proposed to move some anomalous videos from the test
set to the training set and provided video-level labels for each training video, introducing the ShanghaiTech Weakly
dataset, which has become the mainstream WAED benchmark. A compelling phenomenon is that the performance of
WAED methods on ShanghaiTech Weakly (frame-level AUC is typically > 85% and has reached up to 95%) is generally
higher than that of UVAD methods on the ShanghaiTech (frame-level AUC is typically between 70 ∼ 80%), providing
evidence for the applicability of WAED in complex scenarios over UVAD.

2.3.6 UCF-Crime. UCF-Crime [158] is the first WAED dataset, presented together with the original MIL ranking
framework. UCF-Crime consists of 1900 unedited real-world surveillance videos collected from the Internet. The
abnormal events contain 850 anomalies of human concern in 13 categories: Abuse, Arrest, Arson, Assault, Burglary,
Explosion, Fighting, Road Accidents, Robbery, Shooting, Shoplifting, Stealing, and Vandalism. Unlike the UVAD dataset
above, its training set contains anomalous videos and provides a video-level label for each video, where 0 indicates
normal, and 1 indicates anomalous. The anomalous events in the WAED dataset are predefined and are usually associated
with specific scenarios, such as car accidents in urban traffic, shoplifting, and shootings in neighborhoods. Therefore,
WAED can provide more credible results for real scenarios with better application potential.

2.3.7 XD-Violence. As the first audio-video dataset, XD-Violence [186] expands anomaly event detection from single-
modal video understanding to multimodal signal processing, facilitating the coexistence of GVAED and multimedia
communities. XD-Violence focuses on violent behaviors, such as abuse, explosion, car accident, struggle, shootings,
and riots, as shown in Fig. 6. Due to the rarity of violent behaviors and the high difficulty of capturing violence, the
original videos include some movie clips in addition to real-world surveillance videos. XD-Violence provides a new way
to think about GVAED by extending the data modality from video alone to sound, text, and other signals.

2.3.8 UBnormal. Inspired by the computer vision community benefiting from synthetic data, Acsintoae et al. [1]
propose the first GVAED benchmark with virtual scenes, named UBnormal. Notably, utilizing a data engine to synthesize
data under predetermined instructions rather than collecting real-world data makes pixel-level labeling possible.

[Figure 7 diagram: (a) the metrics system spans qualitative criteria (anomaly score curve, prediction error map), quantitative criteria (PR curve, AUC curve, TPR, FNR, FPR, TNR, AUROC, EER, etc.), and cost criteria (parameter size, inference speed, FLOPs); (b) the confusion matrix relates ground-truth and predicted labels through TP, FN, FP, and TN.]

Fig. 7. Illustration of GVAED performance evaluation system. We show the (a) metrics system and (b) confusion matrix.

Therefore, UBnormal is supervised. UBnormal is built to address the problem that WAED ignores the open-set nature of
anomalies, which prevents models from correctly handling unseen anomalies. The test set contains anomalous
events not present in the training set. Moreover, it provides a validation set for model tuning for the first time.

2.4 Performance Evaluation


Existing GVAED methods evaluate model performance in terms of detection accuracy and operational cost. The former
concerns the ability to discriminate anomalous events while the latter aims to measure the deployment potential on
resource-limited devices. According to the scale of detected anomalies, the detection accuracy criteria are divided into
three levels: Temporal-Detection-Oriented (TDO), Object-Detection-Oriented (ODO), and Spatial-Localization-Oriented
(SLO). Specifically, TDO criteria require the model to determine anomalous events’ starting and ending temporal
position without spatial localization of abnormal pixels. In contrast, ODO criteria include object-level, region-level,
and track-level, focusing on specific anomaly objects or trajectories. SLO criteria encourage pixel-level localization of
anomalous events. As for operational cost criteria, the commonly used metrics include parameter size, inference speed,
and the number of FLoating-point OPerations (FLOPs) on the same platform, as shown in Fig. 7(a).
We can evaluate the model performance quantitatively by comparing the predicted results with the ground truth
labels. It is worth noting that the predicted labels of some GVAED models (e.g., prediction-based UVAD and WAED) are
continuous values in the range of [0, 1]. In contrast, the true labels are discrete 0 or 1, so a threshold value must first be
selected when calculating the performance metrics. Samples with abnormal scores below the threshold are considered
normal, and vice versa. In this way, we obtain the confusion matrix shown in Fig. 7(b), where TP, FN, FP, and TN denote
the number of abnormal samples correctly detected, abnormal samples mistakenly detected as normal, normal samples
mistakenly detected as abnormal, and normal samples correctly detected, respectively. The True-Positive-Rate (TPR),
False-Positive-Rate (FPR), True-Negative-Rate (TNR), and False-Negative-Rate (FNR) are defined as follows:
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}};\quad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}};\quad \mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{FP}+\mathrm{TN}};\quad \mathrm{FNR} = \frac{\mathrm{FN}}{\mathrm{TP}+\mathrm{FN}} \qquad (4)$$
which are used to calculate the Area Under the Receiver Operating Characteristic (AUROC) and Average Precision
(AP).
AUROC: The horizontal and vertical coordinates of the Receiver Operating Characteristic (ROC) curve are the
FPR and TPR, and the curve is obtained by calculating the FPR and TPR under multiple sets of thresholds. The area
of the region enclosed by the ROC curve and the horizontal axis is often used to evaluate binary classification tasks,
denoted as AUROC. The value of AUROC is within the range of [0, 1], and higher values indicate better performance.
AUROC can visualize the generalization performance of the GVAED model and help to select the best alarm threshold.

In addition, the Equal Error Rate (EER), i.e., the proportion of incorrectly classified frames when the FPR and FNR are equal,
is also used to measure the performance of anomaly detection models.
AP: Due to the highly unbalanced nature of positive and negative samples in GVAED tasks, i.e., normal frames vastly
outnumber abnormal ones, researchers argue that the area under the Precision-Recall (PR) curve is more suitable for evaluating
GVAED models, denoted as AP. The horizontal coordinate of the PR curve is the Recall (i.e., the TPR in Eq. 4), while
the vertical coordinate represents the Precision, defined as Precision = TP / (TP + FP). A point on the PR curve corresponds
to the Precision and Recall values at a certain threshold. Currently, AP has become the main metric for multimodal
GVAED models [181, 186, 187] and is widely used to evaluate the performance on the XD-Violence dataset [186].
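For reference, a minimal evaluation sketch using scikit-learn (an assumption of ours; any ROC/PR implementation works) computes the frame-level AUROC, EER, and AP from continuous anomaly scores and binary ground-truth labels:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, average_precision_score

def evaluate(scores, labels):
    """Frame-level AUROC, EER, and AP from anomaly scores and 0/1 ground truth."""
    fpr, tpr, _ = roc_curve(labels, scores)        # sweeps all thresholds
    auroc = auc(fpr, tpr)
    fnr = 1.0 - tpr
    eer = fpr[np.nanargmin(np.abs(fpr - fnr))]     # operating point where FPR ~ FNR
    ap = average_precision_score(labels, scores)   # area under the PR curve
    return {"AUROC": auroc, "EER": eer, "AP": ap}
```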

3 UNSUPERVISED VIDEO ANOMALY DETECTION

Existing reviews [125, 139] usually classify UVAD methods into distance-based [29, 30, 148], probability-based
[6, 23, 146, 149], and reconstruction-based [46, 50, 101] according to the deviations calculation means. Early traditional
methods relied on manual features such as foreground masks [148], histogram of flow [30], motion magnitude [148], HOG
[29], dense trajectories [173], and STIP [34], which relied on human a priori knowledge and had poor representational
power. With the rise of deep learning in computer vision tasks [151, 209, 217], recent approaches prefer to extract
feature representations in an end-to-end framework with deep Auto-Encoders (AE) [24, 50, 97, 101, 131], Generative
Adversarial Network (GAN) [9, 17, 59, 60, 75, 126, 214], and Vision Transformer (ViT) [40, 70, 204].
This section is dedicated to providing a systematic overview of UVAD methods driven by deep learning techniques [21,
22, 32, 95, 103]. It’s worth noting that the traditional taxonomy, as outlined in previous studies [125, 139], predominantly
focuses on manual feature-based methods, which are limited in elucidating the evolving landscape of deep UVAD
models. Deep CNNs exhibit a remarkable capacity for modeling the spatiotemporal intricacies within video sequences
and generating profound representations of various sensory domains, contingent upon the nature of the input data.
Consequently, we categorize the work found in existing literature into three principal groups, contingent upon the
nature of the data employed: 1) Frame-level methods always utilize entire RGB and optical flow frames as input and
endeavor to capture a holistic understanding of normality within the video data. Such approaches consider the entirety
of the frame, aiming to comprehend the global context. 2) Patch-level methods recognize the repetitive spatial-
temporal information present in video sequences, so they extract features solely from designated regions of interest.
They intentionally disregard redundant data from repetitive regions and interactions among regional information that
do not warrant particular attention. This strategy offers distinct advantages in terms of computational efficiency and
inference speed. 3) Object-level methods, emerging in recent years with the development of target detection models,
shift the focus towards detecting foreground objects and scrutinizing the behavior of specific objects within the video
context. Object-level methods consider the relationship between objects and their backgrounds, showcasing impressive
performance in the task of identifying anomalous events within complex scenes. Based on the aforementioned analysis,
this section classifies UVAD methods into three distinct categories: frame-level, patch-level, and object-level. This
categorization is aligned with a hierarchical "input-structure" taxonomy shown in Fig. 1. This taxonomy serves as a
guiding framework for organizing and understanding the landscape of UVAD.

3.1 Frame-level Methods


Deep CNNs can directly extract abstract features from videos and learn task-specific deep representations. Frame-level
methods use complete RGB frames, sequences, or optical flows as input to model the normality of normal events in a

self-supervised learning manner. Existing methods can be classified into two categories according to model structure:
single-stream and multi-stream. The former does not distinguish spatial and temporal information. They usually take the
original RGB videos as input and learn the spatial-temporal patterns by reconstructing the input sequence or predicting
the next frame. Existing single-stream work focuses on designing more efficient network structures. They introduce
more powerful representational learners such as 3D convolution [219] and U-net [92]. In contrast, multi-stream networks
typically treat appearance and motion as different dimensions of information and attempt to learn spatial and temporal
normality using different proxy tasks or network architectures. In addition to spatial-temporal separation modeling,
existing dual-stream works explored spatial-temporal coherence [97] and consistency [8] to perform anomaly detection.

3.1.1 Single-Stream Models. Single-stream models typically use a single generative model to describe the spatial-
temporal patterns of normal events by performing a proxy task and preserving the normality in learnable parameters.
For example, the Predictive Convolutional Long Short-Term Memory (PC-LSTM) [121] used a conforming ConvLSTM
network to model the evolution of video sequences. Hasan et al. [50] constructed a fully convolutional Feed-Forward Auto-
Encoder (FF-AE) with manual features as input, which can learn task-specific representations in an end-to-end manner.
Liu et al. [92] proposed a Future Frame Prediction (FFP) method that used a GAN-based video prediction framework
to learn the normality. Its extension, FFPN [112], further specified the design principles of predictive UVAD networks.
Singh and Pankajakshan [155] also used a predictive task to detect anomalies, proposing conformal structures based on
2D & 3D convolution and convLSTM to characterize spatial-temporal patterns more efficiently.
To address the detail loss in frame generation, Li et al. [85] proposed a Spatial-Temporal U-net network (STU-net) that
combined the advantages of U-net in representing spatial information with the ability of convLSTM to model temporal
variations for moving objects. [221] proposed a sparse coding-based neural network called AnomalyNet, which used
three neural networks to integrate the advantages of feature learning, sparse representation, and dictionary learning.
In [124], the authors proposed an Incremental Spatial-Temporal Learner (ISTL) to explore the nature of anomalous
behavior over time. ISTL used active learning with fuzzy aggregation to continuously update the model and distinguish
between newly evolving anomalous and normal events. The AnoPCN in [198] unified the reconstruction and prediction methods
into a deep predictive coding network by introducing an error refinement module to reconstruct the prediction errors
and refining the coarse predictions generated by the predictive coding module.
To lessen the deep model’s ability to generalize anomalous samples, memory-augmented Auto-Encoder (memAE)
[46] embedded an external memory network between the encoder and decoder to record the prototypical patterns of
normal events. Further, Park et al. [131] introduced an attention-based memory addressing mechanism and proposed to
update the memory pool during the testing phase to ensure that the network can better represent normal events.
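A minimal sketch of the memory read operation used by such memory-augmented auto-encoders is given below; the pool size, addressing scheme, and variable names are simplified by us, and the published models [46, 131] additionally use sparse addressing and memory update rules:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryRead(nn.Module):
    """Re-express an encoding as a weighted combination of learned normal prototypes."""

    def __init__(self, num_items=2000, dim=256):
        super().__init__()
        self.items = nn.Parameter(torch.randn(num_items, dim))   # prototype pool

    def forward(self, z):                                        # z: (B, dim) encoder output
        # Cosine-similarity addressing followed by softmax attention weights
        attn = F.softmax(F.normalize(z, dim=1) @ F.normalize(self.items, dim=1).t(), dim=1)
        z_hat = attn @ self.items                                # memory-filtered encoding
        return z_hat, attn                                       # decoder receives z_hat instead of z
```

Because anomalous inputs must also be expressed through normal prototypes, their reconstructions degrade, which limits the generalization to anomalies discussed above.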
Luo et al. [113] proposed a sparse coding-inspired neural network model, namely Temporally-coherent Sparse Coding
(TSC). It used a sequential iterative soft thresholding algorithm to optimize the sparse coefficients. [31] introduced
residual connections [53] into the auto-encoder to alleviate the gradient vanishing problem during normality
learning. Experiments have shown that the residual connections brought 3%, 2%, and 5% frame-level AUC gains for the proposed Residual
Spatial-Temporal Auto-Encoder (R-STAE) on CUHK Avenue [107], LV [72] and UCSD Ped2 [84] datasets, respectively.
The DD-GAN in [35] introduced an additional motion discriminator to GAN. The dual discriminators structure
encouraged the generator to generate more realistic frames with motion continuity. Yu et al. [202] also used GAN to
model normality. The proposed Adversarial Event Prediction (AEP) network performed adversarial learning on past
and future events to explore the correlation. Similarly, Zhao et al. [218] explored spatial-temporal correlations by GAN
and used a spatial-temporal LSTM to extract appearance and motion information within a unified unit.

Table 3. Frame-level Multi-stream UVAD Methods.

Year Method | Backbone | Analysis
2017 AMDN [191] | AE | Pros: Learning appearance and motion patterns separately. Cons: Determining boundaries by OC-SVM with limited capability.
2017 STAE [219] | AE | Pros: Using 3D CNN to learn the spatial-temporal patterns. Cons: Dual decoders causing huge computational costs.
2019 AMC [127] | GAN | Pros: Learning the correspondence between appearance and motion. Cons: Limited performance on complex datasets.
2019 GANs [142] | GAN | Pros: Training two GANs to learn temporal and spatial distributions. Cons: Unstable training process and high training cost.
2020 CDD-AE [14] | AE | Pros: Using two auto-encoders to learn spatial and temporal patterns. Cons: No special consideration in the design of the encoder structure.
2020 OGNet [205] | GAN | Pros: Using generators and discriminators to learn normality. Cons: Adversarial learning making the training process unstable.
2021 AMMC-net [8] | AE | Pros: Exploring the consistency of appearance and motion. Cons: Lack of analysis of the flow-frame generation task.
2021 DSTAE [82] | AE, ConvLSTM | Pros: Using two auto-encoders to perform different tasks. Cons: High computational cost and reliance on an optical flow network.
2022 AMAE [97] | AE | Pros: Using two encoders and three decoders to learn features. Cons: High training cost and reliance on an optical flow network.
2022 STM-AE [101] | AE, GAN | Pros: Using two memory-enhanced auto-encoders to learn normality. Cons: Unstable training process and high computational costs.

[16] proposed a Bidirectional Prediction (Bi-Pre) framework that used forward and backward prediction sub-networks
to reason about normal frames. In the test phase, only the significant regions are used to calculate the anomaly score,
allowing the model to focus on the foreground. Wang et al. [178] used multi-path convGRU to perform frame prediction.
The proposed ROADMAP model included three non-local modules to handle different scales of objects.

3.1.2 Multi-Stream Models. The multiplicity of multi-stream models is reflected in the multiple sources of the input
data and the multiple tasks corresponding to multiple outputs. Considering that video anomaly may manifest as outliers
in appearance or motion, an intuitive idea is to use multi-stream networks to model spatial and temporal normality
separately [13, 14, 82, 175, 188]. In addition, learning associations between appearance and motion, such as consistency [8],
coherence [97, 101], and correspondence [127], is another effective GVAED solution. Events without such associations
are identified as anomalies. Multi-stream models have achieved significant success in recent years due to
the close match between their design motivation and the GVAED task. The multi-stream methods are summarized in Table 3.
Motivated by the remarkable success of 3D CNN in video understanding tasks, Zhao et al. [219] proposed a 3D
convolutional-based Spatial-Temporal Auto-Encoder (STAE) to model normality by simultaneously performing
reconstruction and prediction tasks. STAE included two decoders, which outputted reconstructed and predicted frames,
respectively. In contrast, Appearance and Motion DeepNet (AMDN) [191] used two stacked denoising auto-encoders to encode
RGB frames and optical flow separately. Similarly, Chang et al. [14] also used two auto-encoders to capture spatial and
temporal information, respectively. One learned the appearance by reconstructing the last frame, while the other outputted

RGB differences to simulate the generation of optical flow. Deep K-means clustering was used to force the extracted fea-
ture compact and detect anomalies. DSTAE [82] introduced convLSTM to a two-stream auto-encoder to better model the
temporal variations. The reconstruction errors of the two encoders are weighted and used to calculate anomaly scores.
In addition to spatial-temporal separation, Nguyen and Meunier [127] proposed to learn the correspondence between
appearance and motion. To this end, they proposed an AE with two decoders, one for reconstructing input frames
and the other for predicting optical flow. Cai et al. [8] proposed an Appearance-Motion Memory Consistency network
(AMMC-net), which aimed to capture the spatial-temporal consistency in high-level feature space.
Liu et al. [97] proposed an Appearance-Motion united Auto-Encoder (AMAE) framework using two independent auto-
encoders to perform denoising and optical flow generation tasks separately. Moreover, they utilized an additional decoder
to fuse spatial-temporal features and predict future frames to model spatial-temporal normality. STM-AE [101] and AMP-
Net [99] introduced the memory into the dual-stream auto-encoder to record prototype appearance and motion patterns.
Adversarial learning was used to explore the connection between spatial and temporal information of regular events.
Aside from the above anomaly detection means such as reconstruction error [82, 97, 101, 219], clustering [13, 14]
and one-class classification [191], researchers attempted to utilize the discriminator of GAN to directly output results.
For instance, Ravanbakhsh et al. [142] used GAN to learn the normal distribution and detect anomalies directly by
discriminators. The authors use a cross-channel approach to prevent the discriminator from learning mundane constant
functions. OGNet [205] shifted the discriminator from discriminating real or generated frames to distinguishing good
or poor reconstructions. The well-trained discriminator can find subtle distortions in the reconstruction results and
detect non-obvious anomalies.
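To summarize the separation-plus-association idea, the sketch below shows a shared encoder with two decoders, one reconstructing the RGB frame and one predicting the corresponding optical flow, loosely in the spirit of the appearance-motion correspondence in [127]; the layer sizes are our own simplification:

```python
import torch
import torch.nn as nn

class TwoStreamAE(nn.Module):
    """Shared encoder; one decoder reconstructs the frame, the other predicts optical flow."""

    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.dec_frame = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1))     # appearance stream
        self.dec_flow = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1))     # motion stream

    def forward(self, frame):
        z = self.enc(frame)
        return self.dec_frame(z), self.dec_flow(z)

# Training combines a frame reconstruction loss and a flow prediction loss;
# at test time, a large error in either stream raises the anomaly score.
```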
3.2 Patch-level Methods

Patch-level methods [96, 118, 145, 147] take the video patch (spatial-temporal cube) as input. Compared with
frame-level methods that consider anomalies roughly, i.e., anomalies are reflected in spatial or temporal dimensions
beyond expectation, patch-level methods consider finding anomalies from specific spatial-temporal regions rather than
analyzing the whole sequence. Patch formation can be divided into three categories: scale equipartition [25, 78, 96, 118,
147, 185, 222], information equipartition [73], and foreground object extraction [177]. Specifically, scale equipartition
is the simplest. The video sequence is equipartitioned into several spatial-temporal cubes of uniform size along the
spatial and temporal dimensions. The subsequent modeling process is similar to frame-level methods. The information
equipartition strategy considers that image blocks of the same size do not contain the same information. Regions close
to the camera contain less information per unit area than those far away. Before representation, all cubes will be first
resized to the same size. The foreground object extraction focuses on modeling regions with information variation to
avoid the learning cost and disruption of the background. After the sequences are equated into same-scale cubes, those
containing only background will be eliminated.
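The scale-equipartition strategy (together with the background-cube elimination used by foreground-oriented variants) can be sketched as follows, assuming the clip is a (T, H, W) grayscale array whose dimensions are multiples of the cube size; the motion-energy threshold is an illustrative assumption.

```python
import numpy as np

def scale_equipartition(clip, cube=(7, 32, 32), motion_thr=1e-3):
    """Split a (T, H, W) clip into uniform spatial-temporal cubes, dropping static ones."""
    t, h, w = cube
    T, H, W = clip.shape
    cubes = []
    for ti in range(0, T - t + 1, t):
        for hi in range(0, H - h + 1, h):
            for wi in range(0, W - w + 1, w):
                c = clip[ti:ti + t, hi:hi + h, wi:wi + w]
                # keep only cubes with noticeable frame-to-frame variation (foreground)
                if np.abs(np.diff(c, axis=0)).mean() > motion_thr:
                    cubes.append(c)
    return cubes
```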
Roshtkhari and Levine [145] densely sampled video sequences at different spatial and temporal scales and used a probabilistic framework to model the spatial-temporal composition of the video volumes. STCNN [222] treated UVAD as a binary classification task: it partitioned the video sequence into patches of 3 × 3 × 7, retained only the regions containing moving pixels to keep the model robust to local noise and improve detection accuracy, and then extracted the patches' appearance and motion information and output the discriminative results with an FCN. Deep-Cascade [147] employed a cascaded autoencoder to represent video patches. It used a lightweight network to select local patches of interest and then applied a complex 3D convolutional network to detect anomalies; the lightweight network filters out simple normal patches to reduce computational costs and save processing time. S2-VAE [177] first detected the foreground and retained only the cells containing objects as input. A shallow generative network was then used to fit the data distribution, and its output was fed to another deep generative network to model normality.
Table 4. Object-level UVAD Methods.

Year | Method | Detector | Decision Logic | Contributions
2017 | LDGK [54] | Fast R-CNN | Anomaly score of the detected object proposal | Integrating a generic CNN and an environment-related anomaly detector to detect video anomalies and record the cause of the anomalies.
2018 | DCF [68] | YOLO | Classification | Extracting foreground objects by object detection models and Kalman filtering and discriminating anomalies by pose and motion classification.
2019 | OC-AE [62] | SSD | One-versus-rest binary classification | Proposing an object-centric convolutional autoencoder to encode motion and appearance and discriminating anomalies using a one-versus-rest classifier.
2021 | Background-Agnostic [44] | SSD-FPN, YOLOv3 | Binary classification | Using a set of autoencoders to extract the appearance and motion features of foreground objects and then using a set of binary classifiers to detect anomalies.
2021 | Multi-task [43] | YOLOv3 | Binary classification | Training a 3D convolutional neural network to generate discriminative representations by performing multiple self-supervised learning tasks.
2021 | OAD [37] | YOLOv3 | Clustering | An online VAD method with asymptotic bounds on the false alarm rate, providing a procedure for selecting a proper decision threshold.
2021 | HF2-VAD [105] | Cascade R-CNN | Prediction error | A hybrid framework that seamlessly integrates sequence reconstruction and frame prediction to handle video anomaly detection.
2020 | VEC [201] | Cascade R-CNN | Cube construction error | Proposing a video event completion framework to exploit advanced semantic and temporal contextual information for video anomaly detection.
2022 | BiP [15] | Cascade R-CNN | Appearance and motion error | Proposing a bi-directional architecture with three consistency constraints to regularize the prediction task from the pixel, cross-pattern, and temporal levels.
2022 | HSNBM [5] | Cascade R-CNN | Frame and object prediction error | Designing a hierarchical scene normality-binding modeling framework to detect global and local anomalies.

Wu et al. [185] proposed a deep one-class neural network (DeepOC). Specifically, DeepOC used stacked auto-encoders to generate low-dimensional features for frame and optical flow patches and simultaneously trained a one-class classifier to make these representations more compact.
The Spatial-Temporal Cascade Auto-Encoder (ST-CaAE) [78] first used an adversarial autoencoder to coarsely identify anomalous patches and exclude simple regular ones. The retained patches were fed to a convolutional autoencoder, which discriminated anomalies based on reconstruction errors. Liu et al. [96] proposed an Attention-augmented Spatial-Temporal Auto-Encoder (AST-AE) that partitioned each frame spatially into 8 × 8 parts and modeled spatial and temporal information using a CNN and an LSTM, respectively. In the downstream anomaly detection stage, AST-AE only retained salient regions with large prediction errors to calculate the anomaly score.
3.2 Object-level Methods


The emergence of high-performance object detection models [45, 143, 192] provides a new idea for GVAED, i.e., using
a pre-trained object detector to extract the objects of interest from the video sequence before normality learning.
Compared with the frame-level and patch-level methods, the object-level methods [44, 93, 160, 200] enable the model
to ignore redundant background information and focus on modeling the behavioral interactions of foreground objects.
In addition to outperforming object-free methods, object-level methods are also regarded as a feasible route toward scene-adaptive GVAED models. Existing studies [43, 44] show that object-level methods perform
significantly better than other methods on multi-scene datasets such as ShanghaiTech [92]. Table 4 compares the object
detectors, decision logic, and main contributions of existing object-level methods.
Ryota et al. [54] attempted to describe anomalous events in a human-understandable form by detecting and analyzing
the classes, behaviors, and attributes of specific objects. The proposed LDGK model first used multi-task learning to
obtain anomaly-related semantic information and then inserted an anomaly detector to analyze scene-independent
features to detect anomalies. The DCF [68] used a pose classifier and an LSTM network to model the spatial and motion
information of the detected objects, respectively. The work in [62] formalized UVAD as a one-versus-rest binary classification task.
The proposed OC-AE first encoded the motion and appearance of selected objects and then clustered the training
samples into normal clusters. An object is considered anomalous in the inference stage if the one-versus-rest classifier’s
highest classification score is negative. Its extension, the Background-Agnostic framework [44], introduced instance
segmentation, allowing the model to focus only on the primary object. In addition, the authors used pseudo-anomaly
examples to perform adversarial learning to improve the appearance and motion auto-encoders.
To make full use of the contextual information, Yu et al. [201] proposed a Video Event Completion (VEC) method that
used appearance and motion as cues to locate regions of interest. VEC recovered the original video events by solving
visual completion tests to capture high-level semantics and inferring deleted patches. Georgescu et al. [43] designed
several self-supervised learning tasks, including discrimination of forward/backward moving objects, discrimination
of objects in continuous/intermittent frames, and reconstruction of object-specific appearance. In the testing phase,
anomalous objects would lead to large prediction discrepancies.
Doshi and Yilmaz [37] proposed an Online Anomaly Detection (OAD) scheme that used detected object information
such as location, category, and size as input to a clustering model to detect anomalous events. HF2-VAD [105] seamlessly integrated flow reconstruction and frame prediction. It used memory to record the normal patterns of optical flow reconstruction and captured the correlation between RGB frames and optical flow using a conditional variational auto-encoder.
Chen et al. [15] proposed a Bidirectional Prediction (BiP) architecture with three consistency constraints. Specifically,
prediction consistency considered the symmetry of motion and appearance in forward and backward prediction.
Association consistency considered the correlation between frames and optical flow, and temporal consistency was
used to ensure that BiP can generate temporally consistent frames.
In summary, object-level methods, employing well-trained object detection/segmentation models to isolate significant
foreground targets from video frames and developing scene-independent anomaly detection models through analysis of
target-specific attributes, offer notable advantages over frame-level and patch-level approaches in real-world cross-scene
datasets. However, these methods face challenges in capturing the interaction between foreground targets and background scenes, leading
to performance degradation in handling scene-specific anomalous events, such as a person walking on a motorway.
Addressing this limitation, Liu et al. [93] explored the semantic interaction between prototypical features of foreground
targets and the background scene using memory networks. Alternatively, instance segmentation proves more effective
in modeling target-scene interactions. For instance, the Hierarchical Scene Normality-Binding Modeling (HSNBM)
framework [5] attempted to dissect global and local scenes and introduced a scene object-binding frame prediction module to capture the relationship between foreground and background through scene segmentation. Looking ahead,
object-level methods with object detection or instance segmentation will play a crucial role in discovering anomalous
events in real-world highly dynamic environments, such as autonomous driving and intelligent industries.
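To make the object-level pipeline concrete, the sketch below couples a pre-trained torchvision detector with a placeholder per-object normality scorer and max-pools the object scores into a frame score. The detector choice, confidence threshold, and pooling rule are assumptions rather than the design of any specific surveyed method.

```python
import torch
import torchvision
from torchvision.transforms.functional import crop

# Pre-trained detector; depending on the torchvision version, `weights=...`
# may be required instead of `pretrained=True`.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

@torch.no_grad()
def frame_anomaly_score(frame, object_scorer, conf_thr=0.5):
    """Detect foreground objects in a (3, H, W) frame and score each crop.

    `object_scorer` stands in for any hypothetical per-object normality model
    (e.g., an object-centric auto-encoder); it is not part of torchvision.
    """
    det = detector([frame])[0]                       # dict with 'boxes', 'labels', 'scores'
    scores = []
    for box, conf in zip(det["boxes"], det["scores"]):
        if conf < conf_thr:                          # skip low-confidence detections
            continue
        x1, y1, x2, y2 = box.int().tolist()
        patch = crop(frame, y1, x1, y2 - y1, x2 - x1)  # object-centric patch
        scores.append(float(object_scorer(patch)))
    return max(scores) if scores else 0.0            # frame score = most anomalous object
```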

4 WEAKLY-SUPERVISED ABNORMAL EVENT DETECTION


Using weak video-level labels to supervise the model was first proposed by Sultani et al. [158] in 2018, laying the foundation for WAED based on multiple instance learning [114, 132, 211]. The simultaneously released UCF-Crime dataset collected 13 classes of real-world criminal behaviors and provided video-level labels for the training set. Subsequent researchers [164, 220] adapted UVAD datasets to the WAED setting by moving some anomalous test videos to the training set, introducing WAED benchmarks such as the reorganized UCSD Ped2 [98] and ShanghaiTech Weakly [79, 220]. In 2020, the XD-Violence [186] dataset extended GVAED to multimodal signal processing.
This section is dedicated to providing an in-depth exploration of existing WAED models, with a focus on their
categorization into unimodal and multimodal approaches based on the input data modalities. This taxonomy is
instrumental in guiding the development of GVAED methods, fostering the transition from video processing to the broader multimodal understanding community. Unimodal models [39, 79, 102, 158, 164, 208, 220], similar to UVAD
techniques, primarily rely on RGB frames as input data. However, they distinguish themselves by directly computing
the anomaly score. These models center their efforts on analyzing successive RGB frames to detect anomalies within
the videos. In contrast, multimodal models [19, 181, 186, 187, 203] aim to leverage diverse data sources, including
video, audio, text, and optical flow, to extract effective anomaly-related clues. These methods harness the power of
multiple modalities to enhance the overall understanding of anomalies, making them more robust and versatile in
capturing complex abnormal events. This categorization scheme not only clarifies the distinctions between unimodal
and multimodal WAED models but also sets the stage for the evolution of GVAED techniques that integrate various
data modalities, paving the way for a more comprehensive approach to anomaly detection and event understanding.

4.1 Unimodal Models


Unimodal WAED models typically slice the unedited video into several fixed-size clips. Each clip is treated as an instance, and all clips from a video form a bag that shares the video-level label. Pre-trained feature extractors, such as Convolutional 3D (C3D) [165], Temporal Segment Networks (TSN) [176], and Inflated 3D (I3D) [11], are then used to extract the spatial-temporal features of the instances. Generally, the scoring module takes the deep representations as input and calculates an anomaly score for each instance under the supervision of video-level labels.
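As a concrete illustration of the bag construction described above, the sketch below groups pre-extracted clip features of one video into a fixed number of instances; the 32-segment setting follows the common practice of [158], while averaging consecutive clips is an implementation assumption.

```python
import numpy as np

def form_bag(clip_features, num_instances=32):
    """Group per-clip features (N, D) of one video into a fixed-size bag.

    Consecutive clips are averaged so that every video yields the same number
    of instances; 32 segments follow the common setup of [158].
    """
    idx = np.linspace(0, len(clip_features), num_instances + 1, dtype=int)
    return np.stack([
        clip_features[idx[i]:max(idx[i] + 1, idx[i + 1])].mean(axis=0)
        for i in range(num_instances)
    ])
```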
The MIL ranking framework [158] introduced multiple instance learning to GVAED for the first time, using a 3-
layer Fully Connected Network (FCN) to predict high anomaly scores for anomalous clips and introducing sparsity and
smoothness constraints to avoid drastic fluctuations in the score curve. Zhu and Newsam [223] considered motion as the
key to WAED performance. To this end, they proposed a temporal augmented network to learn motion-aware features
and used attention blocks [89] to incorporate temporal context into a MIL ranking model. Snehashis et al. [119] used a
dual-stream CNN to extract spatial and temporal features separately and fed the fused features as spatial-temporal
representations into an FCN to perform anomaly classification. The authors compared the performance of different
deep CNN architectures (e.g., ResNet-50 [53], Inception V3 [161], and VGG-16 [157]) for feature extraction.
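Returning to the MIL ranking objective of [158] mentioned at the beginning of this subsection, a minimal PyTorch sketch of the ranking loss with smoothness and sparsity constraints is given below; the lambda weights shown are illustrative rather than the exact settings of the paper.

```python
import torch

def mil_ranking_loss(scores_abn, scores_nrm, lam_smooth=8e-5, lam_sparse=8e-5):
    """MIL ranking loss for one (anomalous bag, normal bag) pair.

    `scores_abn` / `scores_nrm` are 1-D tensors of instance scores in [0, 1];
    the lambda values are illustrative assumptions.
    """
    hinge = torch.relu(1.0 - scores_abn.max() + scores_nrm.max())  # ranking between bags
    smooth = ((scores_abn[1:] - scores_abn[:-1]) ** 2).sum()       # temporal smoothness
    sparse = scores_abn.sum()                                       # anomalies should be sparse
    return hinge + lam_smooth * smooth + lam_sparse * sparse
```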
Zhong et al. [220] treated WAED as a supervised learning task under noisy labels, arguing that the supervised
action recognition models can perform anomaly detection after the label noise is removed. In response, they designed
a graph convolutional network to correct the labels. The work in [172] proposed an Anomaly Regression Network (ARNet) to learn discriminative features for WAED. Specifically, ARNet used a dynamic multiple-instance learning loss and a center loss to enlarge the inter-class distance between instances and reduce the intra-class distance of regular instances, respectively.
Waseem et al. [80] proposed a two-stage WAED framework that first used an echo state network to obtain spatially and temporally aware features. They then used a 3D convolutional network to extract spatial-temporal features and fused them with the features from the first stage as the input to a binary classifier. Tian et al. [164] proposed Robust Temporal Feature Magnitude (RTFM) learning, training a feature magnitude learning function to identify positive instances efficiently. In addition, RTFM utilized self-attention to capture both long- and short-range temporal correlations. Muhammad et al. [208] proposed a self-reasoning framework that used binary clustering to generate pseudo-labels to supervise the MIL regression model. The CLustering Assisted Weakly Supervised (CLAWS) learning framework with normalcy suppression in [206] proposed a random batch-based training strategy to reduce the correlation between batches. In addition, the authors introduced a loss based on clustering distance to weaken the effect of label noise. Kamoona et al. [66] proposed a Deep Temporal Encoding-Decoding (DTED) approach to capture the temporal evolution of videos over time. They treated instances of the same bag as sequential visual data rather than as independent individuals. In addition, DTED used a joint loss that maximizes the average distance between normal and abnormal videos.
The Weakly-supervised Temporal Relationship (WSTR) learning framework [212] enhanced the model’s discrimina-
tive power by exploring the temporal relationships between clips. The proposed transformer-enabled encoder converts
the task-irrelevant representations into task-specific features by mining the semantic correlations and positional rela-
tionships between video clips. Weakly Supervised Anomaly Localization (WSAL) [115] performed anomaly detection
by fusing temporal and spatial contexts and proposed a higher-order context encoding model to measure temporal
dynamic changes. In addition, the authors collected a dataset called TAD for traffic anomaly detection.
Feng et al. [39] proposed a Multi-Instance Self-Training (MIST) framework consisting of a multi-instance pseudo
label generator and a self-guided attention-enhancing feature encoder for generating more reliable fragment-level
pseudo labels and extracting task-specific representations, respectively. Liu et al. [98] proposed a Self-guiding Multi-
instance Ranking (SMR) framework that used a clustering module to generate pseudo labels to aid the training of
supervised multi-instance regression models to explore task-relevant feature representations. The authors compared the
performance of different recurrent neural networks in exploring temporal correlation. Spatial-Temporal Attention (STA)
[102] explored the connection between example local representations and global spatial-temporal features through a
recurrent cross-attention operation and used mutual cosine loss to encourage the enhanced features to be task specific.

4.2 Multimodal Models


The emergence of TV shows and streaming media has broadened the application scope of GVAED techniques, transi-
tioning them from traditional offline surveillance video analysis to online video stream detection. Unlike surveillance
videos, which typically consist of only RGB images, most online video content, including vlogs, live streams, and talk
shows, incorporates multiple modalities such as language, speech, and subtitle text. Extracting anomaly-related cues
from these diverse data modalities exceeds the capabilities of current unimodal methods.
Real-world data are heterogeneous, and effectively exploiting the complementary nature of multimodal data is
the key to developing robust and efficient GVAED models. Due to the limitation of datasets, most existing works
[186, 187, 203] focused on video and audio information fusion to detect violent behaviors from surveillance videos.
Table 5. Multimodal WAED Models.

Year | Method | Input Modality | Contributions
2020 | HL-Net [186] | Video + Audio | Collecting the XD-Violence violence detection dataset and proposing a three-branch neural network model for multimodal anomaly detection.
2021 | FVAI [130] | Video + Audio | Proposing a pooling-based feature fusion strategy to fuse video and audio information to obtain more discriminative feature representations.
2022 | SC [133] | Video + Audio | Proposing an audio-visual scene classification dataset containing 5 classes of anomalous events and a deep classification model.
2022 | MACIL-SD [203] | Video + Audio | Proposing a modality-aware contrastive instance learning with a self-distillation strategy to address the modality heterogeneity challenges.
2022 | ACF [182] | Video + Audio | Proposing a two-stage multimodal information fusion method for violence detection that first refines video-level labels into clip-level labels.
2022 | MSAF [181] | Video + Audio, Video + Optical flow | Proposing multimodal label refinement to refine video-level ground truth into pseudo-clip-level labels and implicitly align multimodal information with a multimodal supervise-attention fusion network.
2022 | MD [153] | Video + Audio + Flow | Using mutual distillation to transfer information and proposing a multimodal fusion network to fuse video, audio, and flow features.
2022 | HL-Net+ [187] | Video + Audio | Introducing coarse-grained violent frame and fine-grained violence detection tasks and proposing an audio-visual violence detection network.
2022 | AGAN [134] | Video + Audio | Using cross-modal interaction to enhance video and audio and computing high-confidence violence scores using temporal convolution.

Moreover, inspired by the frame-level multi-stream UVAD models [82, 97], recent work [181] considered RGB frames
and optical flow as different modalities. We display the modalities and principles of existing multimodal GVAED models
[130, 133, 134, 153, 181, 182, 186, 187, 203] in Table 5.
Wu et al. [186] released the first multimodal GVAED dataset and proposed a three-branch network called HL-
Net for multimodal violence detection. Specifically, the similarity branch used a similarity prior to capture long-
range correlations. In contrast, the proximity branch used a proximity prior to capture local positional relationships, and
the scoring branch dynamically captured the proximity of predicted scores. Experimental results demonstrated the
multimodal data’s positive impact on GVAED. The following MACIL-SD in [203] utilized a lightweight dual-stream
network to overcome the heterogeneity challenge. It used self-distillation to transfer unimodal visual knowledge to
audio-visual models to narrow the semantic gap between multimodal features.
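As a simplified illustration of the audio-visual late fusion adopted in this line of work (not the actual HL-Net or MACIL-SD architecture), clip-level features from the two modalities can be concatenated and regressed to an anomaly score; the feature dimensions assume the I3D visual and VGGish audio features commonly used on XD-Violence, and the two-layer head is an assumption.

```python
import torch
import torch.nn as nn

class FusionScorer(nn.Module):
    """Late fusion of clip-level visual and audio features into an anomaly score."""

    def __init__(self, dim_v=1024, dim_a=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim_v + dim_a, 512), nn.ReLU(),
            nn.Linear(512, 1), nn.Sigmoid(),
        )

    def forward(self, feat_v, feat_a):
        # feat_v: (B, dim_v) visual features; feat_a: (B, dim_a) audio features
        return self.head(torch.cat([feat_v, feat_a], dim=-1))  # (B, 1) score in [0, 1]
```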
Researchers [130, 134] attempted to explore more effective feature extraction and multimodal information fusion
strategies. For example, Pang et al. [130] proposed to use a bilinear pooling mechanism to fuse visual and audio
information and encourage the two modalities to learn from each other to obtain a more effective representation. The Audio-Guided
Attention Network (AGAN) [134] first used a deep neural network to extract video and audio features and then enhanced
the features in the temporal dimension using a cross-modal perceptual local arousal network.
Wei et al. [182] proposed a two-stage multimodal information fusion method, which first refined video-level hard labels into clip-level soft labels and then used an attention module for multimodal information fusion. Their extension work, Multimodal Supervised Attentional Augmentation Fusion (MSAF) [181], used supervised attention fusion to achieve implicit alignment of multimodal data.
[Fig. 8 block diagram: (a) SDOR — initial anomaly detection splits unlabeled videos into pseudo-normal and pseudo-anomalous samples, which drive a feature extractor and an FCN-based scorer that output anomaly scores under iterative learning; (b) GCL — a shared feature extractor feeds a generator and a discriminator whose pseudo-labels define each other's losses under iterative optimization.]
Fig. 8. Workflow of two representative FVAD methods: (a) SDOR [129] and (b) GCL [207]. Given the unlabeled videos, SDOR first divides them into pseudo-normal and pseudo-anomalous sets by initial anomaly detection. GCL introduces cross-supervision to train the generator G and the discriminator D as anomaly detectors; the pseudo-labels from G and D are used to compute each other's losses.

Shang et al. [153] observed that existing models are limited by small datasets and proposed mutual distillation to
transfer information from large-scale datasets to small datasets. They proposed a multimodal attention fusion strategy
to fuse RGB images, audio, and flow to obtain a more discriminative representation. The work in [133] introduced an audio-visual scene classification task and released a multimodal dataset; the authors tried different deep networks and fusion strategies to explore the most effective classification model.

5 SUPERVISED VIDEO ANOMALY DETECTION


Supervised video anomaly detection requires frame-level or pixel-level labels to supervise models to distinguish between normal and anomalous events. Therefore, it is often considered a classification task rather than a mainstream GVAED
scheme. On the one hand, collecting fine-grained labeled anomalous samples is time-consuming. On the other hand,
the anomalous behavior occurs gradually, and the degree of anomaly is a relative value, while manual labeling can only
provide discrete 0/1 labels, which cannot adequately describe the severity and temporal continuity of video anomalies.
Existing SVAD methods usually consider VAD a binary classification task under data imbalance conditions. However,
game engines can simulate various types of anomalous events and provide frame-level and pixel-level personalized
annotations. With the penetration of synthetic datasets in vision tasks, supervised training of GVAED models with
virtual anomalies is expected to become possible. Researchers need to focus on the domain adaptation problem posed by
synthetic datasets, i.e., how to cope with the covariate shifts between synthetic data and the real-world surveillance video
and the ensuing performance degradation. Moreover, although the training set contains partially labeled anomalies,
SVAD models still need to consider how to reasonably generalize the anomalies to detect unseen anomalous events in
real-world scenarios. SVAD is an open-set recognition task rather than a supervised binary classification.

6 FULLY-UNSUPERVISED VIDEO ANOMALY DETECTION

Fully-unsupervised Video Anomaly Detection (FVAD) does not limit the composition of the training data and requires
no data annotation. In other words, FVAD tries to learn an anomaly detector from the random raw data, which is a
newly emerged technical route in recent years.
Ionescu et al. [167] introduced the unmasking technique to computer vision tasks, proposing an FVAD framework
that requires no training sequences. They iteratively trained a binary classifier to distinguish two consecutive video
sequences and simultaneously removed the most discriminative features at each step. Inspired by [167], Liu et al. [94]
tried to establish the connection between heuristic unmasking and multiple-classifier two-sample tests to improve its
testing capability. In this regard, they proposed a history sampling method to increase the testing power as well as to
improve the GVAED performance. Li et al. [83] first used a distribution clustering framework to identify the possible
anomalous samples in the training data, and then used the clustered subset of normal data to train the auto-encoder.
An encoder that can describe normality was obtained by repeating normal subset selection and representation learning.
The recent representative FVAD works are Self-trained Deep Ordinal regression (SDOR) [129] and Generative
Cooperative Learning (GCL) [207], which attempted to learn anomaly scorers from unlabeled videos in an end-to-end
manner, as shown in Fig. 8(a) and 8(b). Specifically, SDOR [129] first determined the initial pseudo-normal and abnormal
sets and then computed the abnormal scores using pre-trained ResNet-50 and FCN. The representation module and
the scorer were optimized iteratively in a self-training manner. Moreover, Lin et al. [88] looked at the pseudo label
generation process in SDOR from a causal inference perspective and proposed a causal graph to analyze confounding
effects and eliminate the impact of noisy pseudo labels. In addition, their proposed CIL model improved significantly by
performing counterfactual inference to capture long-range temporal dependencies.
In contrast, GCL [207] attempted to exploit the low-frequency nature of anomalous events. It included a generator G
and a discriminator D, which were supervised by each other in a cooperative rather than adversarial manner. The generator primarily generated representations for normal events, while for anomalous events it used negative learning techniques to distort the representations and generated pseudo-labels to train D. The discriminator estimated the probability of
anomalies and created pseudo labels to improve G. The scarcity and infrequent occurrence of anomalies provide valuable
insights into the development of FVAD. Hu et al. [55] leveraged the rarity of anomalies, operating under the assumption
that the small number of anomalous samples in the training set has a limited impact on the normality of the model
learning process. Inspired by the Masked Auto-Encoder (MAE) [51], their proposed TMAE learned representations
using a visual transformer performing a complementary task. Notably, MAE [51] applied masks primarily to 2D images,
whereas video anomalies are closely linked to temporal information. To address this challenge, TMAE first identified
video foregrounds and constructed temporal cubes to serve as masked objects, ensuring a more comprehensive approach
to anomaly detection in video data.

7 PERFORMANCE COMPARISON
We collect the performance of existing works on publicly available datasets [84, 92, 107, 158, 220] to quantitatively compare their strengths and illustrate the progress of GVAED development. Table 6 presents the frame-level AUC and EER of the early UVAD models on the UCSD Ped1 & Ped2 [84] and CUHK Avenue [107] datasets and the frame-level AUC on
the ShanghaiTech [92] dataset. Since the recent UVAD [14, 46, 97, 101] and FVAD [88, 129, 207] only report frame-level
AUC as the main evaluation metric, we have collated these methods separately in Table 7. The ShanghaiTech dataset
was proposed in 2018 with the FFP [92] model, so methods before this time were usually tested without this dataset.
With the advantage of its data size and quality, ShanghaiTech has become the most widely used UVAD benchmark. An
interesting phenomenon is that the object-level methods outperform other frame-level and patch-level models on the
cross-scene ShanghaiTech dataset. For example, the frame-level AUC of the Multi-task [43] model is as high as 90.2%,
which is 12.1% higher than the state-of-the-art frame-level methods [69]. It shows that for cross-scene GVAED tasks,
using an object detector to separate the foreground object of interest from the scene can effectively avoid interference
of the background. In addition, the multi-stream model learns normality in both temporal and spatial dimensions and
generally outperforms the single-stream model. The usage frequency shows that UCSD Ped2 [84], CUHK Avenue [107],
and ShanghaiTech [92] have become the prevailing benchmarks for UVAD evaluation. Future work should consider
testing and comparing the proposed methods on these three datasets.
Table 6. EER and AUC Comparison of Early Unsupervised Methods on Benchmark Datasets.

Year Method Ped1 AUC Ped1 EER Ped2 AUC Ped2 EER Avenue AUC Avenue EER ShanghaiTech AUC
2015 DRAM [190] 92.1 16.0 90.8 17.0 - - -
2015 STVP [3] 93.9 12.9 94.6 10.6 - - -
2016 CMAC [215] 85.0 - 90.0 - - - -
2016 FF-AE [50] 81.0 27.9 90.0 21.7 70.2 25.1 60.9
2017 DEM [41] 92.5 15.1 - - - - -
2017 CFS [73] 82.0 21.1 84.0 19.2 - - -
2017 WTA-AE [166] 91.9 15.9 92.8 11.2 82.1 24.2 -
2017 EBM [171] 70.3 35.4 86.4 16.5 78.8 27.2 -
2017 CPE [168] 78.2 24.0 80.7 19.0 - - -
2017 LDGK [54] - - 92.2 13.9 - - -
2017 sRNN [110] - - 92.2 - 81.7 - 68.0
2017 GANS [141] 97.4 8.0 93.5 14.0 - - -
2017 OGNG [159] 93.8 - 94.0 - - - -
2018 FFP [92] 83.1 - 95.4 - 85.1 - 72.8
2018 PP-CNN [140] 95.7 8.0 88.4 18.0 - - -
2019 FAED [108] 93.8 14.0 95.0 - - - -
2019 NNC [63] - - - - 88.9 - -
2019 OC-AE [62] - - 97.8 - 90.4 - 84.9
2019 AMC [127] - - 96.2 - 86.9 - -
2019 MLR [170] 82.3 23.5 99.2 2.5 71.5 36.4 -
2019 memAE [46] - - 94.1 - 83.3 - 71.2
2019 MLEP [91] - - - - 92.8 - 76.8
2019 BMAN [71] - - 96.6 - 90.0 - 76.2
2020 Street Scene [137] 77.3 25.9 88.3 18.9 72.0 33.0 -
2020 IPR [162] 82.6 - 96.2 - 83.7 - 73.0
2020 DFSN [138] 86.0 23.3 94.0 14.1 87.2 18.8 -

Table 7. AUC Comparison of Recent Unsupervised and Fully-unsupervised (SDOR [129], CIL [88], and GCL [207]) Methods on Benchmark Datasets.

Year Method Ped2 Avenue ShanghaiTech Year Method Ped2 Avenue ShanghaiTech
2020 MNAD-R [131] 90.2 82.8 69.8 2020 MNAD-P [131] 97.0 88.5 70.5
2020 DD-GAN [35] 95.6 84.9 73.7 2020 SDOR [129] 83.2 - -
2020 ASSAD [36] 97.8 86.4 71.6 2020 FSSA [109] 96.2 85.8 77.9
2020 VEC [201] 97.3 89.6 74.8 2020 Multispace[60] 95.4 86.8 73.6
2020 CDD-AE [14] 96.5 86.0 73.3 2021 CDD-AE+ [13] 96.7 87.1 73.7
2021 Multi-task (object level) [43] 99.8 91.9 89.3 2021 Multi-task (frame level) [43] 92.4 86.9 83.5
2021 Multi-task (late fusion) [43] 99.8 92.8 90.2 2021 HF2-VAD [105] 99.3 91.1 76.2
2021 AST-AE [96] 96.6 85.2 68.8 2021 ROADMAP[178] 96.3 88.3 76.6
2021 CT-D2GAN[40] 97.2 85.9 77.7 2022 AMAE [97] 97.4 88.2 73.6
2022 STM-AE [101] 98.1 89.8 73.8 2022 BiP [15] 97.4 86.7 73.6
2022 AR-AE [69] 98.3 90.3 78.1 2022 TAC-Net[60] 98.1 88.8 77.2
2022 STC-Net [218] 96.7 87.8 73.1 2022 HSNBM [5] 95.2 91.6 76.5
2022 CIL(ResNet50)+DCFD [88] 97.9 85.9 - 2022 CIL(ResNet50)+DCFD+CTCE [88] 99.4 87.3 -
2022 CIL(I3D-RGB)+DCFD+CTCE [88] 98.7 90.3 - 2022 GCL-PT (ResNeXt) [207] - - 78.93

Table 8 presents the performance of WAED methods on the UCF-Crime [158] and ShanghaiTech Weakly [220] datasets.
As mentioned previously, WAED models usually rely on pre-trained feature extractors [11, 165, 176] to obtain feature
representations. Commonly used features include C3D-RGB, I3D-RGB, and I3D-RGB+Optical-Flow. The performance gap of the same model using different features shows that the effectiveness of a WAED model is related to the pre-trained feature extractor, with I3D outperforming the simple 3 × 3 × 3 convolution-based C3D network due to its separate
consideration of temporal information variation. Future WAED work should test the performance of the proposed model
on current commonly used features or provide the performance of existing works on emerging features to demonstrate
that the performance gain comes from the model design rather than benefiting from a more robust feature extraction
network. In addition to detection performance, other metrics are processing speed and deployment cost. GVAED typically
employs the Average Inference Speed (AIS) as an intuitive metric to gauge the overhead cost of a model. However, figures reported in the existing literature often lack direct comparability due to variations in experimental environments and computational platforms.
Table 8. Quantitative Performance Comparison of Weakly-supervised Methods on Public Datasets.

Method Feature UCF-Crime AUC UCF-Crime FAR ShanghaiTech AUC ShanghaiTech FAR
MIR [158] C3D-RGB 75.40 1.90 86.30 0.15
TCN [213] C3D-RGB 78.70 - 82.50 0.10
Zhong [220] C3D-RGB 80.67 3.30 76.44 -
ARNet [172] C3D-RGB - - 85.01 0.57
ARNet [172] I3D-RGB - - 85.38 0.27
ARNet [172] I3D-RGB+Optical-Flow - - 91.24 0.10
MIST [39] C3D-RGB 81.40 2.19 93.13 1.71
MIST [39] I3D-RGB 82.30 0.13 94.83 0.05
RTFM [164] C3D-RGB 83.28 - 91.51 -
RTFM [164] I3D-RGB 84.30 - 97.21 -
SMR [98] I3D-RGB+Optical-Flow 81.70 - - -
DTED [66] C3D-RGB 79.49 0.50 87.42 -

schemes, typically involve intricate data preprocessing and calls to pre-trained models, such as foreground object
detection, optical flow estimation, and spatial-temporal feature extraction with well-trained 3D convolutional networks.
It remains unclear whether the computational cost and processing time associated with these aspects are factored
into the overhead cost of the proposed model. Consequently, reporting inference speed and comparing computational
cost are not widespread practice in GVAED research. The few papers providing such results often lack a detailed
description of the experimental setup. Nevertheless, we diligently collected the AIS of existing works to offer an intuitive
demonstration of the trend toward lighter-weight GVAED research. Acknowledging the impact of image resolution on
model inference speed, we follow [139] to summarize these data while simultaneously documenting the dataset used
for model testing. The results are publicly accessible in our GitHub repository and will be continuously updated.
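For reference, the frame-level AUC and EER reported in this section can be computed from per-frame anomaly scores and binary ground-truth labels, for example with scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def frame_level_metrics(scores, labels):
    """Frame-level AUC and EER from per-frame anomaly scores and 0/1 labels."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    eer = fpr[np.nanargmin(np.abs(fpr - (1.0 - tpr)))]  # point where FPR equals FNR
    return auc, eer
```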

8 CHALLENGES AND TRENDS

8.1 Research Challenges


8.1.1 Mock anomalies vs. Real anomalies: How to bridge domain offsets between mock and real anomalies? GVAED aims
to automatically detect anomalous events in the living environment to provide a safe space for humans. However, the
difficulty of collecting anomalies makes most of the existing datasets formulate abnormal events by human simulation,
such as the UMN [28], CUHK Avenue [107], and ShanghaiTech [92]. The mock anomalies are simpler, and their spatial-temporal patterns differ significantly from those of normal events, so even well-trained models struggle to detect complex anomalies. In addition, the limited set of anomalous event categories conflicts with the diverse nature of real anomalies. As a result, models learned on such datasets perform poorly in real-world scenarios. Therefore, collecting
datasets containing various real anomalies and designing models to bridge the gap between mock and real anomalies is
an essential challenge for GVAED development.

8.1.2 Single-scene vs. Multi-scenes: How to develop cross-scenario GVAED models for the real world? Mainstream
unsupervised datasets [84, 107] and UVAD methods [14, 92, 131] only consider single-scene videos, while the real world
always contains multiple scenes, which constitutes another challenge for UVAD methods. Although the UMN [28] and
ShanghaiTech [92] datasets include multiple scenes, the anomalous events of the former are all crowd dispersal, while
the 13 scenes of the latter are similar. Therefore, most UVAD methods [92, 101] do not consider the scene differences
but learn normality directly from the original video as in other single-scene datasets [84, 107]. Recent researchers
[44, 62] believe that object-level methods are a feasible way to learn scene-invariant normality by extracting specific
objects from the scenes and then analyzing the spatial-temporal patterns of objects and backgrounds separately. The
multi-scene problem is inescapable for model deployment as it is almost impossible to train a scene-specific model
for each terminal device. Developing cross-scene GVAED models using domain adaptation/generalization techniques
[90, 183, 189] to learn scene-invariant normality is a definite challenge.

8.1.3 Real data vs. Synthetic data: How to develop large fine-grained GVAED models using synthetic data? Due to the
rarity and diversity of anomalies, collecting and labeling anomalous events is time-consuming and laborious. Therefore,
researchers [1] have considered using game engines [38, 152] to synthesize anomaly data. We remain optimistic about
this attempt and believe it may lead to new research opportunities for GVAED. While anomaly detection tasks suffer from a lack of data and missing labels, synthetic data can supply various anomalous samples with precise frame-level or even pixel-level annotations, making it possible to develop SVAD models and save data preparation costs for large-scale GVAED model training. However, a concomitant challenge is that covariate shifts between synthetic and
real data may make the trained GVAED models not work in real scenes.

8.1.4 Unimodal vs. Multimodal: How to effectively fuse multimodal data to mine anomaly-related clues? Researchers
[186, 187] are aware of the positive impact of multimodal data (e.g., audio) on GVAED. However, existing works are hindered by the lack of datasets and by uncertainty about effective model structures. XD-Violence [186] is the only mainstream multimodal
GVAED dataset, but it only contains video and audio, and much data is collected from movies and games rather than the
real world. With the popularity of IoT, using various sensors to collect environmental information (e.g., temperature,
brightness, and humidity) can assist cameras in detecting abnormal events. However, mining useful clues from valid
data and developing efficient multimodal GVAED models need further research, such as task-specific feature extraction
from heterogeneous data, semantic alignment of different modalities, anomaly-related multimodal information fusion,
and domain offset bridging in emerging cross-modal GVAED research.

8.1.5 Single-view vs. Multi-view: How to integrate complementary information from multi-view data? In places such
as traffic intersections and parks, the same area is usually covered by multiple camera views, giving rise to another task: anomalous event detection in multi-view videos. Multi-view data can provide more comprehensive environmental awareness and are widely used for re-identification [87, 199], tracking [163], and gaze estimation [86]. However, existing datasets [28, 84, 92, 107] are all single-view, leaving multi-view GVAED research an open gap. The simplest idea is to combine data from all views to train the same model and determine anomalies through a voting or winner-take-all strategy. However, such approaches are costly to train and do not consider the differences and complementarities
between multi-view data. Therefore, multi-view GVAED remains to be investigated.

8.1.6 Offline vs. Online Detection: How to develop light-weight end-to-end online GVAED models? Deployable intelligent video surveillance systems (IVSS) need to process continuously generated video streams online 24/7 and
respond to anomalous events in real-time so that noteworthy clips can be saved in time to reduce storage and transmission
costs. Unfortunately, existing GVAED models are designed for public datasets rather than real-time video streams,
primarily pursuing detection performance while avoiding the online detection challenges. For example, the dominant
prediction-based methods [92, 97, 101, 131] on the UVAD route can only give the prediction error of the current input
in a single-step execution, whereas an informative anomaly score is only obtained after min-max normalization over the prediction errors of all frames. Although the model can directly determine the current
frame as an anomaly with a pre-set error threshold, existing attempts show that the manually selected threshold is
unreliable. WAED [39, 98, 115, 158, 164] can directly output anomaly scores for segments. However, the input to the
scoring module is a discriminative spatial-temporal representation rather than the original video. The representations
usually rely on pre-trained 3D convolution-based feature extractors [11, 165, 176]. The time cost is unacceptable for
resource-limited terminal devices [65]. Therefore, developing online detection models is the primary challenge for
GVAED deployment, determining its application potential in IVSS and streaming media platforms.
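The dataset-level min-max normalization mentioned above can be sketched as follows; because the minimum and maximum prediction errors over all frames must be known in advance, this post-processing step is incompatible with strictly online detection.

```python
import numpy as np

def normalized_scores(prediction_errors, eps=1e-8):
    """Min-max normalize per-frame prediction errors into anomaly scores in [0, 1]."""
    e = np.asarray(prediction_errors, dtype=float)
    return (e - e.min()) / (e.max() - e.min() + eps)  # requires errors of *all* frames
```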

8.2 Development Trends


8.2.1 Data level: Toward real-world GVAED model development for multi-view cross-scene multimodal data. From single-
scene [84, 107] to multi-scene [92, 137], from real-world videos [28] to synthetic data [1], and from unimodal [158] to
multimodal [186], GVAED datasets are moving towards large-scale and realistic scenarios. We see this as a positive
trend that will continue with the growth of online video platforms and tools. On the one hand, real-world scenarios
and anomalies are diverse, so efficient models for real-world applications need to be trained on large-scale datasets
that contain various anomalous events. On the other hand, the Internet has made it possible to collect multi-scene and
multi-view videos, including sufficient rare anomalous behaviors such as violence and crime. Furthermore, multimodal
and synthetic data will be increasingly important in GVAED research. The XD-Violence [186] dataset has demonstrated
the positive impact of multimodal data on GVAED. In the future, with streaming media (e.g., TikTok, Netflix, and
Hulu) and online video sites (e.g., YouTube, iQIYI, and Youku), more modal data can be collected. Besides, virtual game
engines (e.g., Airsim [152] and Carla [38]) can synthesize rare anomalous events and provide fine-grained annotations
on demand. The connection of GVAED with other tasks (e.g., multimodal analysis [181] and few-shot learning [109])
will tend to be close, with the latter inspiring the design of GVAED models under specific data conditions.

8.2.2 Representation level: Adaptive transfer of emerging representation learning and feature extraction means. Deep
learning has enabled spatial-temporal representations to be derived directly from the raw videos in an end-to-end
manner without a human prior [179]. The earlier deep GVAED models benefit from CNNs and pursue complex deep
networks to extract more abstract features. For example, the UVAD models attempt to introduce dual-stream networks to
learn spatial and temporal representations [14, 97, 101], and use 3D convolutional networks to model temporal features
[46, 219]. From C3D [165] to I3D [11], the WAED models [39, 98, 164] benefit from more powerful pre-trained feature extractors and achieve general performance gains on existing datasets [158, 220]. We observe that the representation schemes of WAED will become increasingly sophisticated. New visual representation learning models such as Transformers
[4, 57, 81, 169] will drive WAED development. In contrast, UVAD does not pursue abstract representations. Overly
powerful deep networks may lead to missing anomalous events as normal due to overgeneralization [46, 131]. Future
researchers should consider using clever representation strategies (e.g., causal representation learning [103]) to balance
the model’s powerful representation of normal events and the limited generalization of abnormal events [99]. Powerful
generative models such as graph learning [58, 111, 120] and diffusion models [27] are expected to provide more effective
normality learning tools for UVAD. In addition, researchers should consider introducing emerging techniques (e.g.,
domain adaptation [44, 174] and knowledge distillation [180]) to develop GVAED models for learning scene-invariant
representation from multi-scene and multi-view videos.
8.2.3 Deployment level: Lightweight easy-to-deploy model development for resource-constrained end devices. Model
deployment is an inevitable trend for GVAED development. As mentioned above, the multi-scene nature and the diversity of anomalies in real-world videos pose new challenges for model design and training, such as online detection, lightweight
models, and high view robustness. On the one hand, the computational resources of terminal devices are limited.
Most deep GVAED methods are overly pursuing performance at the expense of running costs. On the other hand,
existing models are trained offline, which cannot perform real-time detection. Model compression [33] and knowledge
distillation [47] can drive the development of lightweight GVAED models. Online evolutive learning [76, 77], edge-cloud
collaboration, and integrated sensing and control technologies [135] enable models to dynamically optimize learnable
parameters in complex environments such as modern industry [99] and intelligent transportation systems [194].

8.2.4 Methodology level: High-efficiency & robust GVAED development by integrating different research pathways. This
survey compares the four main GVAED technical routes: UVAD, WAED, SVAD, and FVAD. UVAD has been regarded as
the mainstream solution, although WAED has gradually come to dominate in recent years. However, the trend of UVAD is unclear due to its performance saturation on limited datasets [84]. In addition, the setting of anomalies in UVAD datasets makes it challenging for UVAD models to work in complex scenes. Self-supervised visual representation techniques (e.g., contrastive learning [18, 48, 52, 60] and deep clustering [10, 210]) may provide new ideas for UVAD. In contrast, WAED has attracted wide attention as a research hotspot due to its excellent performance in crime detection [158]. In addition, multimodal video anomaly detection tasks also follow the WAED route. SVAD was once abandoned due to the lack of
labels and anomalies. However, it may face new research opportunities with the emergence of synthetic datasets [1].
In contrast, FVAD can learn directly from raw video data without the cost of training data filtering and annotations,
making it a hot research topic. The various routes are not completely independent, and existing works [100, 184] have
started to combine the assumptions of different methods to develop more efficient GVAED models.

9 CONCLUSION

This survey is the first to integrate the deep learning-driven technical routes based on different assumptions and
learning frameworks into a unified generalized video anomaly event detection framework. We provide a hierarchical
GVAED taxonomy that systematically organizes the existing literature by supervision, input data, and network structure,
focusing on the recent advances such as weakly-supervised, fully-unsupervised, and multimodal methods. To provide a
comprehensive survey of the extant work, we collect benchmark datasets and available codes, sort out the development
lines of various methods, and perform performance comparisons and strengths analysis. This survey helps clarify
the connections among deep GVAED routes and advance community development. In addition, we analyze research
challenges and future trends in the context of deep learning technology development and possible problems faced by
GVAED model deployment, which can serve as a guide for future researchers and engineers.

ACKNOWLEDGMENTS
This work is supported in part by the China Mobile Research Fund of the Chinese Ministry of Education under Grant No.
KEH2310029, the National Natural Science Foundation of China under Grant No. 62250410368, and the Specific Research
Fund of the Innovation Platform for Academicians of Hainan Province under Grant No. YSPTZX202314. Additional
support is acknowledged from the Shanghai Key Research Laboratory of NSAI and the Joint Laboratory on Networked
AI Edge Computing Fudan University-Changan. The work of Yang Liu was financially supported in part by the China
Scholarship Council (File No. 202306100221). The authors extend their appreciation to the anonymous reviewers for their
valuable comments and suggestions, as well as to the authors of the reviewed papers for their contributions to the field.

REFERENCES
[1] Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and
Mubarak Shah. 2022. UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 20143–20153.
[2] Amit Adam, Ehud Rivlin, Ilan Shimshoni, and Daviv Reinitz. 2008. Robust real-time unusual event detection using multiple fixed-location monitors.
IEEE transactions on pattern analysis and machine intelligence 30, 3 (2008), 555–560.
[3] Borislav Antić and Björn Ommer. 2015. Spatio-temporal Video Parsing for Abnormality Detection. arXiv preprint arXiv:1502.06235 (2015).
[4] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. Vivit: A video vision transformer. In
Proceedings of the IEEE/CVF International Conference on Computer Vision. 6836–6846.
[5] Qianyue Bao, Fang Liu, Yang Liu, Licheng Jiao, Xu Liu, and Lingling Li. 2022. Hierarchical scene normality-binding modeling for anomaly detection
in surveillance videos. In Proceedings of the 30th ACM International Conference on Multimedia. 6103–6112.
[6] Yannick Benezeth, P-M Jodoin, Venkatesh Saligrama, and Christophe Rosenberger. 2009. Abnormal events detection based on spatio-temporal
co-occurences. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2458–2465.
[7] Ane Blázquez-García, Angel Conde, Usue Mori, and Jose A Lozano. 2021. A review on outlier/anomaly detection in time series data. ACM
Computing Surveys (CSUR) 54, 3 (2021), 1–33.
[8] Ruichu Cai, Hao Zhang, Wen Liu, Shenghua Gao, and Zhifeng Hao. 2021. Appearance-motion memory consistency network for video anomaly
detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 938–946.
[9] Yiheng Cai, Jiaqi Liu, Yajun Guo, Shaobin Hu, and Shinan Lang. 2021. Video anomaly detection with multi-scale feature and temporal information
fusion. Neurocomputing 423 (2021), 264–273.
[10] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by
contrasting cluster assignments. Advances in Neural Information Processing Systems 33 (2020), 9912–9924.
[11] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 6299–6308.
[12] S Chandrakala, K Deepak, and G Revathy. 2022. Anomaly detection in surveillance videos: a thematic taxonomy of deep models, review and
performance analysis. Artificial Intelligence Review (2022), 1–50.
[13] Yunpeng Chang, Zhigang Tu, Wei Xie, Bin Luo, Shifu Zhang, Haigang Sui, and Junsong Yuan. 2021. Video anomaly detection with spatio-temporal
dissociation. Pattern Recognition 122 (2021), 108213.
[14] Yunpeng Chang, Zhigang Tu, Wei Xie, and Junsong Yuan. 2020. Clustering driven deep autoencoder for video anomaly detection. In European
Conference on Computer Vision. Springer, 329–345.
[15] Chengwei Chen, Yuan Xie, Shaohui Lin, Angela Yao, Guannan Jiang, Wei Zhang, Yanyun Qu, Ruizhi Qiao, Bo Ren, and Lizhuang Ma. 2022.
Comprehensive Regularization in a Bi-directional Predictive Network for Video Anomaly Detection. In Proceedings of the American association for
artificial intelligence. 1–9.
[16] Dongyue Chen, Pengtao Wang, Lingyi Yue, Yuxin Zhang, and Tong Jia. 2020. Anomaly detection in surveillance video based on bidirectional
prediction. Image and Vision Computing 98 (2020), 103915.
[17] Dongyue Chen, Lingyi Yue, Xingya Chang, Ming Xu, and Tong Jia. 2021. NM-GAN: Noise-modulated generative adversarial network for video
anomaly detection. Pattern Recognition 116 (2021), 107969.
[18] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations.
In International conference on machine learning. PMLR, 1597–1607.
[19] Weiling Chen, Keng Teck Ma, Zi Jian Yew, Minhoe Hur, and David Aik-Aun Khoo. 2023. TEVAD: Improved video anomaly detection with captions.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5548–5558.
[20] Zhaoyu Chen, Bo Li, Jianghe Xu, Shuang Wu, Shouhong Ding, and Wenqiang Zhang. 2022. Towards Practical Certifiable Patch Defense with
Vision Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15148–15158.
[21] Kai Cheng, Yang Liu, and Xinhua Zeng. 2023. Learning Graph Enhanced Spatial-Temporal Coherence for Video Anomaly Detection. IEEE Signal
Processing Letters 30 (2023), 314–318.
[22] Kai Cheng, Xinhua Zeng, Yang Liu, Mengyang Zhao, Chengxin Pang, and Xing Hu. 2023. Spatial-Temporal Graph Convolutional Network Boosted
Flow-Frame Prediction For Video Anomaly Detection. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 1–5.
[23] Kai-Wen Cheng, Yie-Tarng Chen, and Wen-Hsien Fang. 2015. Video anomaly detection and localization using hierarchical feature representation
and Gaussian process regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2909–2917.
[24] Yong Shean Chong and Yong Haur Tay. 2017. Abnormal event detection in videos using spatiotemporal autoencoder. In International symposium
on neural networks. Springer, 189–196.

[25] Peter Christiansen, Lars N Nielsen, Kim A Steen, Rasmus N Jørgensen, and Henrik Karstoft. 2016. DeepAnomaly: Combining background subtraction
and deep learning for detecting obstacles and anomalies in an agricultural field. Sensors 16, 11 (2016), 1904.
[26] Andrew A Cook, Göksel Mısırlı, and Zhong Fan. 2019. Anomaly detection for IoT time-series data: A survey. IEEE Internet of Things Journal 7, 7
(2019), 6481–6494.
[27] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. 2022. Diffusion models in vision: A survey. arXiv preprint
arXiv:2209.04747 (2022).
[28] Xinyi Cui, Qingshan Liu, Mingchen Gao, and Dimitris N Metaxas. 2011. Abnormal detection using interaction energy potentials. In CVPR 2011.
IEEE, 3161–3167.
[29] Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer
vision and pattern recognition (CVPR’05), Vol. 1. Ieee, 886–893.
[30] Navneet Dalal, Bill Triggs, and Cordelia Schmid. 2006. Human detection using oriented histograms of flow and appearance. In European conference
on computer vision. Springer, 428–441.
[31] K Deepak, S Chandrakala, and C Krishna Mohan. 2021. Residual spatiotemporal autoencoder for unsupervised video anomaly detection. Signal,
Image and Video Processing 15, 1 (2021), 215–222.
[32] Hanqiu Deng, Zhaoxiang Zhang, Shihao Zou, and Xingyu Li. 2023. Bi-Directional Frame Interpolation for Unsupervised Video Anomaly Detection.
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2634–2643.
[33] Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. 2020. Model compression and hardware acceleration for neural networks: A comprehensive
survey. Proc. IEEE 108, 4 (2020), 485–532.
[34] Piotr Dollár, Vincent Rabaud, Garrison Cottrell, and Serge Belongie. 2005. Behavior recognition via sparse spatio-temporal features. In 2005 IEEE
international workshop on visual surveillance and performance evaluation of tracking and surveillance. IEEE, 65–72.
[35] Fei Dong, Yu Zhang, and Xiushan Nie. 2020. Dual discriminator generative adversarial network for video anomaly detection. IEEE Access 8 (2020),
88170–88176.
[36] Keval Doshi and Yasin Yilmaz. 2020. Any-shot sequential anomaly detection in surveillance videos. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops. 934–935.
[37] Keval Doshi and Yasin Yilmaz. 2021. Online anomaly detection in surveillance videos with asymptotic bound on false alarm rate. Pattern Recognition
114 (2021), 107865.
[38] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. CARLA: An open urban driving simulator. In
Conference on robot learning. PMLR, 1–16.
[39] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. 2021. Mist: Multiple instance self-training framework for video anomaly detection. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 14009–14018.
[40] Xinyang Feng, Dongjin Song, Yuncong Chen, Zhengzhang Chen, Jingchao Ni, and Haifeng Chen. 2021. Convolutional transformer based dual
discriminator generative adversarial networks for video anomaly detection. In Proceedings of the 29th ACM International Conference on Multimedia.
5546–5554.
[41] Yachuang Feng, Yuan Yuan, and Xiaoqiang Lu. 2017. Learning deep event models for crowd anomaly detection. Neurocomputing 219 (2017), 548–556.
[42] Félix Fuentes-Hurtado, Abdolrahim Kadkhodamohammadi, Evangello Flouty, Santiago Barbarisi, Imanol Luengo, and Danail Stoyanov. 2019.
EasyLabels: weak labels for scene segmentation in laparoscopic videos. International journal of computer assisted radiology and surgery 14, 7 (2019),
1247–1257.
[43] Mariana-Iuliana Georgescu, Antonio Barbalau, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. 2021. Anomaly
detection in video via self-supervised and multi-task learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
12742–12752.
[44] Mariana Iuliana Georgescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. 2021. A background-agnostic framework
with adversarial training for abnormal event detection in video. IEEE transactions on pattern analysis and machine intelligence 44, 9 (2021), 4505–4523.
[45] Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision. 1440–1448.
[46] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. 2019. Memorizing
normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International
Conference on Computer Vision. 1705–1714.
[47] Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. 2021. Knowledge distillation: A survey. International Journal of Computer
Vision 129, 6 (2021), 1789–1819.
[48] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires,
Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural
information processing systems 33 (2020), 21271–21284.
[49] H Haberfehlner, AI Buizer, KL Stolk, SS van de Ven, I Aleo, LA Bonouvrié, J Harlaar, and MM van der Krogt. 2020. Automatic video tracking using
deep learning in dyskinetic cerebral palsy. Gait & Posture 81 (2020), 132–133.
[50] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. 2016. Learning temporal regularity in video
sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition. 733–742.
[51] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16000–16009.
[52] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738.
[53] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference
on computer vision and pattern recognition. 770–778.
[54] Ryota Hinami, Tao Mei, and Shin’ichi Satoh. 2017. Joint detection and recounting of abnormal events by learning deep generic knowledge. In
Proceedings of the IEEE international conference on computer vision. 3619–3627.
[55] Jingtao Hu, Guang Yu, Siqi Wang, En Zhu, Zhiping Cai, and Xinzhong Zhu. 2022. Detecting Anomalous Events from Unlabeled Videos via Temporal
Masked Auto-Encoding. In 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6.
[56] Xing Hu, Yingping Huang, Xiumin Gao, Lingkun Luo, and Qianqian Duan. 2018. Squirrel-cage local binary pattern and its application in video
anomaly detection. IEEE Transactions on Information Forensics and Security 14, 4 (2018), 1007–1022.
[57] Chao Huang, Chengliang Liu, Jie Wen, Lian Wu, Yong Xu, Qiuping Jiang, and Yaowei Wang. 2022. Weakly Supervised Video Anomaly Detection
via Self-Guided Temporal Discriminative Transformer. IEEE Transactions on Cybernetics (2022).
[58] Chao Huang, Yabo Liu, Zheng Zhang, Chengliang Liu, Jie Wen, Yong Xu, and Yaowei Wang. 2022. Hierarchical Graph Embedded Pose Regularity
Learning via Spatio-Temporal Transformer for Abnormal Behavior Detection. In Proceedings of the 30th ACM International Conference on Multimedia.
307–315.
[59] Chao Huang, Jie Wen, Yong Xu, Qiuping Jiang, Jian Yang, Yaowei Wang, and David Zhang. 2022. Self-supervised attentive generative adversarial
networks for video anomaly detection. IEEE transactions on neural networks and learning systems (2022).
[60] Chao Huang, Zhihao Wu, Jie Wen, Yong Xu, Qiuping Jiang, and Yaowei Wang. 2021. Abnormal event detection using deep contrastive learning for
intelligent video surveillance system. IEEE Transactions on Industrial Informatics 18, 8 (2021), 5171–5179.
[61] Chao Huang, Zehua Yang, Jie Wen, Yong Xu, Qiuping Jiang, Jian Yang, and Yaowei Wang. 2021. Self-supervision-augmented deep autoencoder for
unsupervised visual anomaly detection. IEEE Transactions on Cybernetics 52, 12 (2021), 13834–13847.
[62] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. 2019. Object-centric auto-encoders and dummy anomalies
for abnormal event detection in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7842–7851.
[63] Radu Tudor Ionescu, Sorina Smeureanu, Marius Popescu, and Bogdan Alexe. 2019. Detecting abnormal events in video using narrowed normality
clusters. In 2019 IEEE winter conference on applications of computer vision (WACV). IEEE, 1951–1960.
[64] Sabah Abdulazeez Jebur, Khalid A Hussein, Haider Kadhim Hoomod, Laith Alzubaidi, and José Santamaría. 2022. Review on Deep Learning
Approaches for Anomaly Event Detection in Video Surveillance. Electronics 12, 1 (2022), 29.
[65] Bobo Ju, Yang Liu, Liang Song, Guixiang Gan, Zengwen Li, and Linhua Jiang. 2023. A High-Reliability Edge-side Mobile Terminal Shared Computing
Architecture Based on Task Triple-stage Full-cycle Monitoring. IEEE Internet of Things Journal (2023).
[66] Ammar Mansoor Kamoona, Amirali Khodadadian Gostar, Alireza Bab-Hadiashar, and Reza Hoseinnezhad. 2023. Multiple instance-based video
anomaly detection using deep temporal encoding–decoding. Expert Systems with Applications 214 (2023), 119079.
[67] B Ravi Kiran, Dilip Mathew Thomas, and Ranjith Parakkal. 2018. An overview of deep learning based methods for unsupervised and semi-
supervised anomaly detection in videos. Journal of Imaging 4, 2 (2018), 36.
[68] Kwang-Eun Ko and Kwee-Bo Sim. 2018. Deep convolutional framework for abnormal behavior detection in a smart surveillance system. Engineering
Applications of Artificial Intelligence 67 (2018), 226–234.
[69] Viet-Tuan Le and Yong-Guk Kim. 2023. Attention-based residual autoencoder for video anomaly detection. Applied Intelligence 53, 3 (2023), 3240–
3254.
[70] Jooyeon Lee, Woo-Jeoung Nam, and Seong-Whan Lee. 2022. Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection.
In 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 1012–1018.
[71] Sangmin Lee, Hak Gu Kim, and Yong Man Ro. 2019. BMAN: bidirectional multi-scale aggregation networks for abnormal event detection. IEEE
Transactions on Image Processing 29 (2019), 2395–2408.
[72] Roberto Leyva, Victor Sanchez, and Chang-Tsun Li. 2017. The LV dataset: A realistic surveillance video dataset for abnormal event detection. In
2017 5th international workshop on biometrics and forensics (IWBF). IEEE, 1–6.
[73] Roberto Leyva, Victor Sanchez, and Chang-Tsun Li. 2017. Video anomaly detection with compact feature sets for online performance. IEEE
Transactions on Image Processing 26, 7 (2017), 3463–3478.
[74] Di Li, Yang Liu, and Liang Song. 2022. Adaptive Weighted Losses with Distribution Approximation for Efficient Consistency-based Semi-supervised
Learning. IEEE Transactions on Circuits and Systems for Video Technology 32, 11 (2022), 7832–7842.
[75] Daoheng Li, Xiushan Nie, Xiaofeng Li, Yu Zhang, and Yilong Yin. 2022. Context-related video anomaly detection via generative adversarial
network. Pattern Recognition Letters 156 (2022), 183–189.
[76] Di Li and Liang Song. 2022. Multi-Agent Multi-View Collaborative Perception Based on Semi-Supervised Online Evolutive Learning. Sensors 22, 18
(2022), 6893.
[77] Di Li, Xiaoguang Zhu, and Liang Song. 2022. Mutual match for semi-supervised online evolutive learning. Applied Intelligence (2022), 1–15.
[78] Nanjun Li, Faliang Chang, and Chunsheng Liu. 2020. Spatial-temporal cascade autoencoder for video anomaly detection in crowded scenes. IEEE
Transactions on Multimedia 23 (2020), 203–215.
[79] Nannan Li, Jia-Xing Zhong, Xiujun Shu, and Huiwen Guo. 2022. Weakly-supervised anomaly detection in video surveillance via graph convolutional
label noise cleaning. Neurocomputing 481 (2022), 154–167.
[80] Nannan Li, Jia-Xing Zhong, Xiujun Shu, and Huiwen Guo. 2022. Weakly-supervised anomaly detection in video surveillance via graph convolutional
label noise cleaning. Neurocomputing 481 (2022), 154–167.
[81] Shuo Li, Fang Liu, and Licheng Jiao. 2022. Self-training multi-sequence learning with Transformer for weakly supervised video anomaly detection.
In Proceedings of the AAAI Conference on Artificial Intelligence.
[82] Tong Li, Xinyue Chen, Fushun Zhu, Zhengyu Zhang, and Hua Yan. 2021. Two-stream deep spatial-temporal auto-encoder for surveillance video
abnormal event detection. Neurocomputing 439 (2021), 256–270.
[83] Tangqing Li, Zheng Wang, Siying Liu, and Wen-Yan Lin. 2021. Deep unsupervised anomaly detection. In Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision. 3636–3645.
[84] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. 2013. Anomaly detection and localization in crowded scenes. IEEE transactions on pattern
analysis and machine intelligence 36, 1 (2013), 18–32.
[85] Yuanyuan Li, Yiheng Cai, Jiaqi Liu, Shinan Lang, and Xinfeng Zhang. 2019. Spatio-Temporal Unity Networking for Video Anomaly Detection. IEEE
Access 7 (2019), 172425–172432. https://doi.org/10.1109/ACCESS.2019.2954540
[86] Dongze Lian, Lina Hu, Weixin Luo, Yanyu Xu, Lixin Duan, Jingyi Yu, and Shenghua Gao. 2018. Multiview multitask gaze estimation with deep
convolutional neural networks. IEEE transactions on neural networks and learning systems 30, 10 (2018), 3010–3023.
[87] Weipeng Lin, Yidong Li, Xiaoliang Yang, Peixi Peng, and Junliang Xing. 2019. Multi-view learning for vehicle re-identification. In 2019 IEEE
international conference on multimedia and expo (ICME). IEEE, 832–837.
[88] Xiangru Lin, Yuyang Chen, Guanbin Li, and Yizhou Yu. 2022. A Causal Inference Look at Unsupervised Video Anomaly Detection. In Thirty-Sixth
AAAI Conference on Artificial Intelligence. 1620–1629.
[89] Jing Liu, Yang Liu, Di Li, Hanqi Wang, Xiaohong Huang, and Liang Song. 2023. DSDCLA: Driving style detection via hybrid CNN-LSTM with
multi-level attention fusion. Applied Intelligence (2023), 1–18.
[90] Jing Liu, Yang Liu, Wei Zhu, Xiaoguang Zhu, and Liang Song. 2023. Distributional and spatial-temporal robust representation learning for
transportation activity recognition. Pattern Recognition 140 (2023), 109568.
[91] Wen Liu, Weixin Luo, Zhengxin Li, Peilin Zhao, Shenghua Gao, et al. 2019. Margin Learning Embedded Prediction for Video Anomaly Detection
with A Few Anomalies. In IJCAI. 3023–3030.
[92] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. 2018. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the
IEEE conference on computer vision and pattern recognition. 6536–6545.
[93] Yang Liu, Zhengliang Guo, Jing Liu, Chengfang Li, and Liang Song. 2023. Osin: Object-centric scene inference network for unsupervised video
anomaly detection. IEEE Signal Processing Letters 30 (2023), 359–363.
[94] Yusha Liu, Chun-Liang Li, and Barnabás Póczos. 2018. Classifier Two Sample Test for Video Anomaly Detections. In BMVC. 71.
[95] Yang Liu, Di Li, Wei Zhu, Dingkang Yang, Jing Liu, and Liang Song. 2023. MSN-net: Multi-Scale Normality Network for Video Anomaly Detection.
In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
[96] Yang Liu, Shuang Li, Jing Liu, Hao Yang, Mengyang Zhao, Xinhua Zeng, Wei Ni, and Liang Song. 2021. Learning Attention Augmented Spatial-
temporal Normality for Video Anomaly Detection. In 2021 3rd International Symposium on Smart and Healthy Cities (ISHC). IEEE, 137–144.
[97] Yang Liu, Jing Liu, Jieyu Lin, Mengyang Zhao, and Liang Song. 2022. Appearance-Motion United Auto-Encoder Framework for Video Anomaly
Detection. IEEE Transactions on Circuits and Systems II: Express Briefs 69, 5 (2022), 2498–2502.
[98] Yang Liu, Jing Liu, Wei Ni, and Liang Song. 2022. Abnormal Event Detection with Self-guiding Multi-instance Ranking Framework. In 2022
International Joint Conference on Neural Networks (IJCNN). IEEE, 01–07.
[99] Yang Liu, Jing Liu, Kun Yang, Bobo Ju, Siao Liu, Yuzheng Wang, Dingkang Yang, Peng Sun, and Liang Song. 2023. AMP-Net: Appearance-Motion
Prototype Network Assisted Automatic Video Anomaly Detection System. IEEE Transactions on Industrial Informatics (2023), 1–13.
[100] Yang Liu, Jing Liu, Mengyang Zhao, Shuang Li, and Liang Song. 2022. Collaborative Normality Learning Framework for Weakly Supervised Video
Anomaly Detection. IEEE Transactions on Circuits and Systems II: Express Briefs 69, 5 (2022), 2508–2512.
[101] Yang Liu, Jing Liu, Mengyang Zhao, Dingkang Yang, Xiaoguang Zhu, and Liang Song. 2022. Learning Appearance-Motion Normality for Video
Anomaly Detection. In 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6.
[102] Yang Liu, Jing Liu, Xiaoguang Zhu, Donglai Wei, Xiaohong Huang, and Liang Song. 2022. Learning Task-Specific Representation for Video Anomaly
Detection with Spatial-Temporal Attention. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2190–2194.
[103] Yang Liu, Zhaoyang Xia, Mengyang Zhao, Donglai Wei, Yuzheng Wang, Siao Liu, Bobo Ju, Gaoyun Fang, Jing Liu, and Liang Song. 2023. Learning
Causality-Inspired Representation Consistency for Video Anomaly Detection. In Proceedings of the 31st ACM International Conference on Multimedia.
203–212.
[104] Yang Liu, Dingkang Yang, Gaoyun Fang, Yuzheng Wang, Donglai Wei, Mengyang Zhao, Kai Cheng, Jing Liu, and Liang Song. 2023. Stochastic
video normality network for abnormal event detection in surveillance videos. Knowledge-Based Systems (2023), 110986.
[105] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. 2021. A hybrid video anomaly detection framework via memory-augmented
flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 13588–13597.
[106] Vina Lomte, Satish Singh, Siddharth Patil, Siddheshwar Patil, and Durgesh Pahurkar. 2019. A Survey on Real World Anomaly Detection in Live
Video Surveillance Techniques. International Journal of Research in Engineering, Science and Management 2, 2 (2019), 2581–5792.
[107] Cewu Lu, Jianping Shi, and Jiaya Jia. 2013. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE international conference on
computer vision. 2720–2727.
[108] Cewu Lu, Jianping Shi, Weiming Wang, and Jiaya Jia. 2019. Fast abnormal event detection. International Journal of Computer Vision 127, 8 (2019),
993–1011.
[109] Yiwei Lu, Frank Yu, Mahesh Kumar Krishna Reddy, and Yang Wang. 2020. Few-shot scene-adaptive anomaly detection. In European Conference on
Computer Vision. Springer, 125–141.
[110] Weixin Luo, Wen Liu, and Shenghua Gao. 2017. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the
IEEE international conference on computer vision. 341–349.
[111] Weixin Luo, Wen Liu, and Shenghua Gao. 2021. Normal graph: Spatial temporal graph convolutional networks based prediction network for
skeleton based video anomaly detection. Neurocomputing 444 (2021), 332–337.
[112] Weixin Luo, Wen Liu, Dongze Lian, and Shenghua Gao. 2021. Future frame prediction network for video anomaly detection. IEEE Transactions on
Pattern Analysis and Machine Intelligence (2021).
[113] Weixin Luo, Wen Liu, Dongze Lian, Jinhui Tang, Lixin Duan, Xi Peng, and Shenghua Gao. 2019. Video anomaly detection with sparse coding
inspired deep neural networks. IEEE transactions on pattern analysis and machine intelligence 43, 3 (2019), 1070–1084.
[114] Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, and Hanwang Zhang. 2023. Unbiased Multiple Instance Learning for Weakly Supervised
Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8022–8031.
[115] Hui Lv, Chuanwei Zhou, Zhen Cui, Chunyan Xu, Yong Li, and Jian Yang. 2021. Localizing anomalies from weakly-labeled videos. IEEE transactions
on image processing 30 (2021), 4505–4515.
[116] Ke Ma, Michael Doescher, and Christopher Bodden. 2015. Anomaly detection in crowded scenes using dense trajectories. University of Wisconsin-
Madison (2015).
[117] Xiaoxiao Ma, Jia Wu, Shan Xue, Jian Yang, Chuan Zhou, Quan Z Sheng, Hui Xiong, and Leman Akoglu. 2021. A comprehensive survey on graph
anomaly detection with deep learning. IEEE Transactions on Knowledge and Data Engineering (2021).
[118] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. 2010. Anomaly detection in crowded scenes. In 2010 IEEE computer society
conference on computer vision and pattern recognition. IEEE, 1975–1981.
[119] Snehashis Majhi, Ratnakar Dash, and Pankaj Kumar Sa. 2020. Two-Stream CNN architecture for anomalous event detection in real world scenarios.
In International Conference on Computer Vision and Image Processing. Springer, 343–353.
[120] Amir Markovitz, Gilad Sharir, Itamar Friedman, Lihi Zelnik-Manor, and Shai Avidan. 2020. Graph embedded pose clustering for anomaly detection.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10539–10547.
[121] Jefferson Ryan Medel and Andreas Savakis. 2016. Anomaly detection in video using predictive convolutional long short-term memory networks.
arXiv preprint arXiv:1612.00390 (2016).
[122] Harshadkumar S Modi and Dhaval A Parikh. 2022. A Survey on Crowd Anomaly Detection. International Journal of Computing and Digital
Systems 12, 1 (2022), 1081–1096.
[123] Ruwan Nawarathna, JungHwan Oh, Jayantha Muthukudage, Wallapak Tavanapong, Johnny Wong, Piet C De Groen, and Shou Jiang Tang. 2014.
Abnormal image detection in endoscopy videos using a filter bank and local binary patterns. Neurocomputing 144 (2014), 70–91.
[124] Rashmika Nawaratne, Damminda Alahakoon, Daswin De Silva, and Xinghuo Yu. 2019. Spatiotemporal anomaly detection using deep learning for
real-time video surveillance. IEEE Transactions on Industrial Informatics 16, 1 (2019), 393–402.
[125] Rashmiranjan Nayak, Umesh Chandra Pati, and Santos Kumar Das. 2021. A comprehensive review on deep learning-based methods for video
anomaly detection. Image and Vision Computing 106 (2021), 104078.
[126] Khac-Tuan Nguyen, Dat-Thanh Dinh, Minh N Do, and Minh-Triet Tran. 2020. Anomaly detection in traffic surveillance videos with gan-based
future frame prediction. In Proceedings of the 2020 International Conference on Multimedia Retrieval. 457–463.
[127] Trong-Nguyen Nguyen and Jean Meunier. 2019. Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of
the IEEE/CVF international conference on computer vision. 1273–1283.
[128] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. 2021. Deep learning for anomaly detection: A review. ACM Computing
Surveys (CSUR) 54, 2 (2021), 1–38.
[129] Guansong Pang, Cheng Yan, Chunhua Shen, Anton van den Hengel, and Xiao Bai. 2020. Self-trained deep ordinal regression for end-to-end video
anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 12173–12182.
[130] Wen-Feng Pang, Qian-Hua He, Yong-jian Hu, and Yan-Xiong Li. 2021. Violence detection in videos based on fusing visual and audio information.
In ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2260–2264.
[131] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. 2020. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 14372–14381.
[132] Seongheon Park, Hanjae Kim, Minsu Kim, Dahye Kim, and Kwanghoon Sohn. 2023. Normality Guided Multiple Instance Learning for Weakly
Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2665–2674.
[133] Lam Pham, Dat Ngo, Tho Nguyen, Phu Nguyen, Truong Hoang, and Alexander Schindler. 2022. An audio-visual dataset and deep learning
frameworks for crowded scene classification. In Proceedings of the 19th International Conference on Content-based Multimedia Indexing. 23–28.
[134] Yujiang Pu and Xiaoyu Wu. 2022. Audio-Guided Attention Network for Weakly Supervised Violence Detection. In 2022 2nd International Conference
on Consumer Electronics and Computer Engineering (ICCECE). IEEE, 219–223.
[135] Lang Qian, Peng Sun, Jing Liu, Azzedine Boukerche, and Liang Song. 2023. A Novel Bidirectional Optimization Framework for Intelligent Agents
Capable of Online Evolution. arXiv preprint (2023).
[136] Rohit Raja, Prakash Chandra Sharma, Md Rashid Mahmood, and Dinesh Kumar Saini. 2022. Analysis of anomaly detection in surveillance video:
recent trends and future vision. Multimedia Tools and Applications (2022), 1–17.
[137] Bharathkumar Ramachandra and Michael Jones. 2020. Street Scene: A new dataset and evaluation protocol for video anomaly detection. In
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2569–2578.
[138] Bharathkumar Ramachandra, Michael Jones, and Ranga Vatsavai. 2020. Learning a distance function with a Siamese network to localize anomalies
in videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2598–2607.
[139] Bharathkumar Ramachandra, Michael J Jones, and Ranga Raju Vatsavai. 2020. A survey of single-scene video anomaly detection. IEEE transactions
on pattern analysis and machine intelligence 44, 5 (2020), 2293–2312.
[140] Mahdyar Ravanbakhsh, Moin Nabi, Hossein Mousavi, Enver Sangineto, and Nicu Sebe. 2018. Plug-and-play cnn for crowd motion analysis: An
application in abnormal event detection. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1689–1698.
[141] Mahdyar Ravanbakhsh, Moin Nabi, Enver Sangineto, Lucio Marcenaro, Carlo Regazzoni, and Nicu Sebe. 2017. Abnormal event detection in videos
using generative adversarial nets. In 2017 IEEE international conference on image processing (ICIP). IEEE, 1577–1581.
[142] Mahdyar Ravanbakhsh, Enver Sangineto, Moin Nabi, and Nicu Sebe. 2019. Training adversarial discriminators for cross-channel abnormal event
detection in crowds. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1896–1904.
[143] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE conference on computer vision and pattern recognition. 779–788.
[144] Khosro Rezaee, Sara Mohammad Rezakhani, Mohammad R Khosravi, and Mohammad Kazem Moghimi. 2021. A survey on deep learning-based
real-time crowd anomaly detection for secure distributed video surveillance. Personal and Ubiquitous Computing (2021), 1–17.
[145] Mehrsan Javan Roshtkhari and Martin D Levine. 2013. An on-line, real-time learning method for detecting anomalies in videos using spatio-
temporal compositions. Computer vision and image understanding 117, 10 (2013), 1436–1452.
[146] Mohammad Sabokrou, Mahmood Fathy, Mojtaba Hoseini, and Reinhard Klette. 2015. Real-time anomaly detection and localization in crowded
scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 56–62.
[147] Mohammad Sabokrou, Mohsen Fayyaz, Mahmood Fathy, and Reinhard Klette. 2017. Deep-cascade: Cascading 3d deep neural networks for fast
anomaly detection and localization in crowded scenes. IEEE Transactions on Image Processing 26, 4 (2017), 1992–2004.
[148] Venkatesh Saligrama and Zhu Chen. 2012. Video anomaly detection based on local statistical aggregates. In 2012 IEEE Conference on computer
vision and pattern recognition. IEEE, 2112–2119.
[149] Venkatesh Saligrama, Janusz Konrad, and Pierre-Marc Jodoin. 2010. Video anomaly identification. IEEE Signal Processing Magazine 27, 5 (2010), 18–33.
[150] Kelathodi Kumaran Santhosh, Debi Prosad Dogra, and Partha Pratim Roy. 2020. Anomaly detection in road traffic using visual surveillance: A
survey. ACM Computing Surveys (CSUR) 53, 6 (2020), 1–26.
[151] Sam Sattarzadeh, Mahesh Sudhakar, and Konstantinos N Plataniotis. 2021. SVEA: A Small-scale Benchmark for Validating the Usability of Post-hoc
Explainable AI Solutions in Image and Signal Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4158–4167.
[152] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. 2018. Airsim: High-fidelity visual and physical simulation for autonomous vehicles.
In Field and service robotics. Springer, 621–635.
[153] Yimeng Shang, Xiaoyu Wu, and Rui Liu. 2022. Multimodal Violent Video Recognition Based on Mutual Distillation. In Chinese Conference on
Pattern Recognition and Computer Vision (PRCV). Springer, 623–637.
[154] Md Sharif, Lei Jiao, Christian W Omlin, et al. 2022. Deep Crowd Anomaly Detection: State-of-the-Art, Challenges, and Future Research Directions.
arXiv preprint arXiv:2210.13927 (2022).
[155] Prakhar Singh and Vinod Pankajakshan. 2018. A Deep Learning Based Technique for Anomaly Detection in Surveillance Videos. In 2018 Twenty
Fourth National Conference on Communications (NCC). 1–6. https://doi.org/10.1109/NCC.2018.8599969
[156] Liang Song, Xing Hu, Guanhua Zhang, Petros Spachos, Konstantinos N Plataniotis, and Hequan Wu. 2022. Networking systems of AI: on the
convergence of computing and communications. IEEE Internet of Things Journal 9, 20 (2022), 20352–20381.
[157] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. Advances in neural information processing systems
28 (2015).
[158] Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on
computer vision and pattern recognition. 6479–6488.
[159] Qianru Sun, Hong Liu, and Tatsuya Harada. 2017. Online growing neural gas for anomaly detection in changing surveillance scenes. Pattern
Recognition 64 (2017), 187–201.
[160] Shengyang Sun and Xiaojin Gong. 2023. Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 22846–22856.
[161] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2818–2826.
[162] Yao Tang, Lin Zhao, Shanshan Zhang, Chen Gong, Guangyu Li, and Jian Yang. 2020. Integrating prediction and reconstruction for anomaly
detection. Pattern Recognition Letters 129 (2020), 123–130.
[163] Zheng Tang, Renshu Gu, and Jenq-Neng Hwang. 2018. Joint multi-view people tracking and pose estimation for 3D scene reconstruction. In 2018
IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6.
[164] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. 2021. Weakly-supervised video anomaly
detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4975–4986.
[165] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional
networks. In Proceedings of the IEEE international conference on computer vision. 4489–4497.
[166] Hanh TM Tran and David Hogg. 2017. Anomaly detection using a convolutional winner-take-all autoencoder. In Proceedings of the British Machine
Vision Conference 2017. British Machine Vision Association.
[167] Radu Tudor Ionescu, Sorina Smeureanu, Bogdan Alexe, and Marius Popescu. 2017. Unmasking the abnormal events in video. In Proceedings of the
IEEE international conference on computer vision. 2895–2903.
[168] Francesco Turchini, Lorenzo Seidenari, and Alberto Del Bimbo. 2017. Convex polytope ensembles for spatio-temporal anomaly detection. In
International Conference on Image Analysis and Processing. Springer, 174–184.
[169] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is
all you need. Advances in neural information processing systems 30 (2017).
[170] Hung Vu, Tu Dinh Nguyen, Trung Le, Wei Luo, and Dinh Phung. 2019. Robust anomaly detection in videos using multilevel representations. In
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5216–5223.
[171] Hung Vu, Dinh Phung, Tu Dinh Nguyen, Anthony Trevors, and Svetha Venkatesh. 2017. Energy-based models for video anomaly detection. arXiv
preprint arXiv:1708.05211 (2017).
[172] Boyang Wan, Yuming Fang, Xue Xia, and Jiajie Mei. 2020. Weakly supervised video anomaly detection via center-guided discriminative learning.
In 2020 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6.
[173] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. 2013. Dense trajectories and motion boundary descriptors for action
recognition. International journal of computer vision 103, 1 (2013), 60–79.
[174] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip Yu. 2022. Generalizing to
unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering (2022).
[175] Le Wang, Junwen Tian, Sanping Zhou, Haoyue Shi, and Gang Hua. 2023. Memory-augmented appearance-motion network for video anomaly
detection. Pattern Recognition 138 (2023), 109335.
[176] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good
practices for deep action recognition. In European conference on computer vision. Springer, 20–36.
[177] Tian Wang, Meina Qiao, Zhiwei Lin, Ce Li, Hichem Snoussi, Zhe Liu, and Chang Choi. 2018. Generative neural networks for anomaly detection in
crowded scenes. IEEE Transactions on Information Forensics and Security 14, 5 (2018), 1390–1399.
[178] Xuanzhao Wang, Zhengping Che, Bo Jiang, Ning Xiao, Ke Yang, Jian Tang, Jieping Ye, Jingyu Wang, and Qi Qi. 2021. Robust unsupervised video
anomaly detection by multipath frame prediction. IEEE transactions on neural networks and learning systems (2021).
[179] Yuzheng Wang, Zhaoyu Chen, Dingkang Yang, Yang Liu, Siao Liu, Wenqiang Zhang, and Lizhe Qi. 2023. Adversarial contrastive distillation with
adaptive denoising. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
[180] Yuzheng Wang, Zhaoyu Chen, Jie Zhang, Dingkang Yang, Zuhao Ge, Yang Liu, Siao Liu, Yunquan Sun, Wenqiang Zhang, and Lizhe Qi. 2023.
Sampling to Distill: Knowledge Transfer from Open-World Data. arXiv preprint arXiv:2307.16601 (2023).
[181] Donglai Wei, Yang Liu, Xiaoguang Zhu, Jing Liu, and Xinhua Zeng. 2022. MSAF: Multimodal Supervise-Attention Enhanced Fusion for Video
Anomaly Detection. IEEE Signal Processing Letters 29 (2022), 2178–2182.
[182] Dong-Lai Wei, Chen-Geng Liu, Yang Liu, Jing Liu, Xiao-Guang Zhu, and Xin-Hua Zeng. 2022. Look, Listen and Pay More Attention: Fusing Multi-
Modal Information for Video Violence Detection. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 1980–1984.
[183] Garrett Wilson and Diane J Cook. 2020. A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology
(TIST) 11, 5 (2020), 1–46.
[184] Jhih-Ciang Wu, He-Yen Hsieh, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. 2022. Self-supervised Sparse Representation for Video Anomaly
Detection. In European Conference on Computer Vision. Springer, 729–745.
[185] Peng Wu, Jing Liu, and Fang Shen. 2019. A deep one-class neural network for anomalous event detection in complex scenes. IEEE transactions on
neural networks and learning systems 31, 7 (2019), 2609–2622.
[186] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. 2020. Not only look, but also listen: Learning multimodal
violence detection under weak supervision. In European conference on computer vision. Springer, 322–339.
[187] Peng Wu, Xiaotao Liu, and Jing Liu. 2022. Weakly supervised audio-visual violence detection. IEEE Transactions on Multimedia (2022).
[188] Peihao Wu, Wenqian Wang, Faliang Chang, Chunsheng Liu, and Bin Wang. 2023. DSS-Net: Dynamic Self-Supervised Network for Video Anomaly
Detection. IEEE Transactions on Multimedia (2023).
[189] Xiaowei Xiang, Yang Liu, Gaoyun Fang, Jing Liu, and Mengyang Zhao. 2023. Two-Stage Alignments Framework for Unsupervised Domain
Adaptation on Time Series Data. IEEE Signal Processing Letters (2023).
[190] Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. 2015. Learning deep representations of appearance and motion for anomalous event
detection. arXiv preprint arXiv:1510.01553 (2015).
[191] Dan Xu, Yan Yan, Elisa Ricci, and Nicu Sebe. 2017. Detecting anomalous events in videos by learning deep representations of appearance and
motion. Computer Vision and Image Understanding 156 (2017), 117–127.
[192] Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. 2019. Auto-fpn: Automatic network architecture adaptation for object detection
beyond classification. In Proceedings of the IEEE/CVF international conference on computer vision. 6649–6658.
[193] Dingkang Yang, Shuai Huang, Shunli Wang, Yang Liu, Peng Zhai, Liuzhen Su, Mingcheng Li, and Lihua Zhang. 2022. Emotion Recognition for
Multiple Context Awareness. In European Conference on Computer Vision. Springer, 144–162.
[194] Dingkang Yang, Shuai Huang, Zhi Xu, Zhenpeng Li, Shunli Wang, Mingcheng Li, Yuzheng Wang, Yang Liu, Kun Yang, Zhaoyu Chen, et al. 2023.
AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception. arXiv preprint arXiv:2307.13933 (2023).
[195] Dingkang Yang, Yang Liu, Can Huang, Mingcheng Li, Xiao Zhao, Yuzheng Wang, Kun Yang, Yan Wang, Peng Zhai, and Lihua Zhang. 2023. Target
and source modality co-reinforcement for emotion understanding from asynchronous multimodal sequences. Knowledge-Based Systems 265 (2023),
110370.
[196] Kun Yang, Dingkang Yang, Jingyu Zhang, Mingcheng Li, Yang Liu, Jing Liu, Hanqi Wang, Peng Sun, and Liang Song. 2023. Spatio-Temporal
Domain Awareness for Multi-Agent Collaborative Perception. arXiv preprint arXiv:2307.13929 (2023).
[197] Zhiwei Yang, Jing Liu, Zhaoyang Wu, Peng Wu, and Xiaotao Liu. 2023. Video Event Restoration Based on Keyframes for Video Anomaly Detection.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14592–14601.
[198] Muchao Ye, Xiaojiang Peng, Weihao Gan, Wei Wu, and Yu Qiao. 2019. Anopcn: Video anomaly detection via deep predictive coding network. In
Proceedings of the 27th ACM International Conference on Multimedia. 1805–1813.
[199] Qingze Yin, Guodong Ding, Shaogang Gong, Zhenmin Tang, et al. 2021. Multi-view label prediction for unsupervised learning person re-
identification. IEEE Signal Processing Letters 28 (2021), 1390–1394.
[200] Guang Yu, Siqi Wang, Zhiping Cai, Xinwang Liu, En Zhu, and Jianping Yin. 2023. Video Anomaly Detection via Visual Cloze Tests. IEEE Transactions
on Information Forensics and Security (2023).
[201] Guang Yu, Siqi Wang, Zhiping Cai, En Zhu, Chuanfu Xu, Jianping Yin, and Marius Kloft. 2020. Cloze test helps: Effective video anomaly detection
via learning to complete video events. In Proceedings of the 28th ACM International Conference on Multimedia. 583–591.
[202] Jongmin Yu, Younkwan Lee, Kin Choong Yow, Moongu Jeon, and Witold Pedrycz. 2021. Abnormal event detection and localization via adversarial
event prediction. IEEE Transactions on Neural Networks and Learning Systems (2021).
[203] Jiashuo Yu, Jinyu Liu, Ying Cheng, Rui Feng, and Yuejie Zhang. 2022. Modality-Aware Contrastive Instance Learning with Self-Distillation for
Weakly-Supervised Audio-Visual Violence Detection. In Proceedings of the 30th ACM International Conference on Multimedia. 6278–6287.
[204] Hongchun Yuan, Zhenyu Cai, Hui Zhou, Yue Wang, and Xiangzhi Chen. 2021. TransAnomaly: Video Anomaly Detection Using Video Vision
Transformer. IEEE Access 9 (2021), 123977–123986.
[205] Muhammad Zaigham Zaheer, Jin-ha Lee, Marcella Astrid, and Seung-Ik Lee. 2020. Old is gold: Redefining the adversarially learned one-class
classifier training paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14183–14193.
[206] Muhammad Zaigham Zaheer, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. 2020. Claws: Clustering assisted weakly supervised learning with
normalcy suppression for anomalous event detection. In European Conference on Computer Vision. Springer, 358–376.
[207] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. 2022. Generative Cooperative Learning for
Unsupervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14744–14754.
[208] Muhammad Zaigham Zaheer, Arif Mahmood, Hochul Shin, and Seung-Ik Lee. 2020. A self-reasoning framework for anomaly detection using
video-level labels. IEEE Signal Processing Letters 27 (2020), 1705–1709.
[209] Cheng Zhan, Han Hu, Zhi Wang, Rongfei Fan, and Dusit Niyato. 2019. Unmanned aircraft system aided adaptive video streaming: A joint
optimization approach. IEEE Transactions on Multimedia 22, 3 (2019), 795–807.
[210] Xiaohang Zhan, Jiahao Xie, Ziwei Liu, Yew-Soon Ong, and Chen Change Loy. 2020. Online deep clustering for unsupervised representation
learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6688–6697.
[211] Chen Zhang, Guorong Li, Yuankai Qi, Shuhui Wang, Laiyun Qing, Qingming Huang, and Ming-Hsuan Yang. 2023. Exploiting Completeness and
Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 16271–16280.
[212] Dasheng Zhang, Chao Huang, Chengliang Liu, and Yong Xu. 2022. Weakly Supervised Video Anomaly Detection via Transformer-Enabled
Temporal Relation Learning. IEEE Signal Processing Letters (2022).
[213] Jiangong Zhang, Laiyun Qing, and Jun Miao. 2019. Temporal convolutional network with complementary inner bag loss for weakly supervised
anomaly detection. In 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 4030–4034.
[214] Qianqian Zhang, Guorui Feng, and Hanzhou Wu. 2022. Surveillance video anomaly detection via non-local U-Net frame prediction. Multimedia
Tools and Applications (2022), 1–16.
[215] Ying Zhang, Huchuan Lu, Lihe Zhang, and Xiang Ruan. 2016. Combining motion and appearance cues for anomaly detection. Pattern Recognition
51 (2016), 443–452.
[216] Zhenzhen Zhang, Jianjun Hou, Qinglong Ma, and Zhaohong Li. 2015. Efficient video frame insertion and deletion detection based on inconsistency
of correlations between local binary pattern coded frames. Security and Communication networks 8, 2 (2015), 311–320.
[217] Zhe Zhang, Shiyao Ma, Zhaohui Yang, Zehui Xiong, Jiawen Kang, Yi Wu, Kejia Zhang, and Dusit Niyato. 2022. Robust semi-supervised federated
learning for images automatic recognition in internet of drones. IEEE Internet of Things Journal (2022).
[218] Mengyang Zhao, Yang Liu, Jing Liu, and Xinhua Zeng. 2022. Exploiting Spatial-temporal Correlations for Video Anomaly Detection. In 2022 26th
International Conference on Pattern Recognition (ICPR). IEEE, 1727–1733.
[219] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. 2017. Spatio-temporal autoencoder for video anomaly detection. In
Proceedings of the 25th ACM international conference on Multimedia. 1933–1941.
[220] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. 2019. Graph convolutional label noise cleaner: Train a plug-and-play
action classifier for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1237–1246.
[221] Joey Tianyi Zhou, Jiawei Du, Hongyuan Zhu, Xi Peng, Yong Liu, and Rick Siow Mong Goh. 2019. Anomalynet: An anomaly detection network for
video surveillance. IEEE Transactions on Information Forensics and Security 14, 10 (2019), 2537–2550.
[222] Shifu Zhou, Wei Shen, Dan Zeng, Mei Fang, Yuanwang Wei, and Zhijiang Zhang. 2016. Spatial–temporal convolutional neural networks for
anomaly detection and localization in crowded scenes. Signal Processing: Image Communication 47 (2016), 358–368.
[223] Yi Zhu and Shawn Newsam. 2019. Motion-aware feature for improved video anomaly detection. In The British Machine Vision Conference. 1–12.