Anomaly Detection in Weakly Supervised Videos Using Multistage Graphs and General Deep Learning Based Spatial-Temporal Feature Enhancement
ABSTRACT Weakly supervised video anomaly detection (WS-VAD) is a crucial research domain in
computer vision for the implementation of intelligent surveillance systems. Many researchers have been
working to develop WS-VAD systems using various technologies by assessing anomaly scores. However,
they still face challenges because of a lack of effective feature extraction. To mitigate this limitation,
we propose a multi-stage deep-learning model for separating abnormal events from normality to extract the
hierarchical effective features. In the first stage, we extract two stream features using pre-trained techniques:
the first stream employs a ViT-based CLIP module to select top-k features, while the second stream
utilizes a CNN-based I3D module integrated into the Temporal Contextual Aggregation (TCA) mechanism.
These features are concatenated and fed into the second-stage module, where an Uncertainty-regulated
Dual Memory Units (UR-DMU) model is employed to learn representations of regular and abnormal
data simultaneously. The UR-DMU integrates global and local structures, leveraging Graph Convolutional
Networks (GCN) and Global and Local Multi-Head Self Attention (GL-MHSA) modules to capture video
associations. Subsequently, feature reduction is achieved using the multilayer-perceptron (MLP) integration
with the Prompt-Enhanced Learning (PEL) module via the knowledge-based prompt. Finally, we employed
a classifier module to predict the snippet-level anomaly scores. In the training phase, an MIL-based loss function transforms the
snippet-level scores into bag-level predictions so that the model learns high activation in anomalous cases.
Our approach integrates these cutting-edge technologies and methodologies, offering a comprehensive
solution to video-based anomaly detection. Extensive experiments on ShanghaiTech, XD-Violence, and
UCF-Crime datasets validate the superiority of our method over state-of-the-art approaches by a substantial
margin. We believe that our model holds significant promise for real-world applications, demonstrating
superior performance and efficacy in anomaly detection tasks.
INDEX TERMS Temporal contextual aggregation (TCA), uncertainty-regulated dual memory units (UR-DMU),
graph convolutional networks, global/local multi-head self-attention (GL-MHSA), weakly
supervised video anomaly detection (WS-VAD).
I. INTRODUCTION
Fully supervised, unsupervised, and weakly supervised are the three prevailing paradigms in the field of video anomaly event detection (VAED). The fully supervised paradigm is primarily characterized by its exceptional performance [1]. However, it is important to note that the training data for this paradigm necessitates the inclusion of frame-level normal or abnormal annotations, which in turn requires video annotators to identify and label abnormalities within the videos. Given that abnormalities can occur at any given moment, it becomes imperative for the annotators to inspect nearly all frames. Regrettably, the process of accumulating a fully annotated large-scale dataset for supervised VAED can be both non-automated and time-consuming. On the other hand, the unsupervised paradigm involves training models exclusively on samples of normal events. It is based on the common assumption that unseen anomaly videos will exhibit high reconstruction errors [2], [3], [4]. Unfortunately, the performance of unsupervised VAED tends to be substandard. This can be attributed to its limited comprehension of anomalies and its incapacity to encompass various forms of normality variants [5]. Consequently, weakly supervised approaches are widely regarded as the most viable paradigm. They outshine both unsupervised and supervised paradigms due to their competitive performance and cost-effectiveness regarding annotations. These approaches reduce cost by utilizing video-level labels instead of laborious fine-grained annotations [6], [7]. In recent times, weakly supervised VAED (WVAED) has evolved into a well-established technical path of research for VAED [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. The WVAED issue is primarily perceived as a multiple instance learning (MIL) problem [8]. Generally speaking, WVAED models directly generate scores for anomalies by comparing the spatiotemporal features of normal and abnormal events through the MIL technique. The MIL approach deals with training data organized into sets known as positive and negative bags. In the context of MIL, a video is seen as a bag containing numerous instances, where each instance corresponds to a video snippet. A negative bag encompasses the entirety of normal snippets, whereas a positive bag encompasses both normal and abnormal snippets without any indication of the temporal boundaries of abnormal events. The conventional MIL framework assumes that all negative bags exclusively contain negative snippets and that positive bags contain at least one positive snippet. Supervision is solely provided for complete sets, and the individual labels of the snippets within the bags are not given [17]. The outputs of WVAED are inherently more reliable than those of unsupervised VAED due to its ability to comprehend the fundamental variability between normal and abnormal data [18]. However, in the WVAED approach, the frames labelled as abnormal in the positive bag are often influenced by the frames labelled as normal in the negative bag, making it challenging to distinctly identify an abnormality in contrast to normality. Consequently, the detection of anomalous snippets can become problematic. Numerous researchers (e.g., [8], [9], [10], [19], [20]) have endeavoured to address this issue by employing MIL frameworks.

Many of the existing methodologies encode the extracted visual content by utilizing a backbone such as C3D [21] and I3D [22], which have been pre-trained on tasks related to action recognition. Nevertheless, video anomaly detection (VAD) requires representations that are able to clearly depict the events occurring in a given scene. Consequently, these current backbones are unsuitable for VAD due to the existence of a domain gap. In order to overcome this limitation, Joo et al. [20] drew inspiration from recent achievements in vision-language research, specifically the works of [23], [24], and [25], which demonstrated the effectiveness of feature representations derived from contrastive language-image pretraining (CLIP). To achieve this, they employed the visual features encoded by the vision transformer (ViT) from CLIP. However, it is worth noting that the performance of WVAED methods based on MIL heavily relies on pre-trained feature extractors.

The drawback of that study is that it processed the video in individual frames or short clips, missing long-range semantic contextual information. To overcome this problem, Shao et al. proposed a temporal context aggregation (TCA) framework for video representation learning. This approach integrates long-range temporal context among frame-level features through the self-attention mechanism [26], [27]. They used contrastive learning to reduce the loss or error rate in the evaluation. To enhance temporal features further, Tian et al. employed robust temporal feature magnitude (RTFM) learning together with a multiple instance learning (MIL) loss [19]. They reported 84.30% and 97.21% AUC for the UCF-Crime and ShanghaiTech datasets, respectively. To improve the AUC by increasing feature effectiveness, Pu et al. employed TCA to enhance long-range dependency and prompt-enhanced learning (PEL) instead of contrastive learning to increase the correct prediction rate by reducing the error [28]. They employed an MLP with PEL to reduce the features and causal convolution (CC) for the classification. PEL mainly integrates semantic priors utilizing knowledge-based prompts, aiming to increase the recognition rate by boosting the discriminative capacity while ensuring high separability between anomaly subclasses; finally, they calculated the score and the error rate with the MIL loss function, reporting AUC rates of 86.76%, 85.59% and 98.14% for the UCF-Crime, XD-Violence and ShanghaiTech datasets, respectively. To improve the recognition rate, Zhou et al. proposed a new temporal feature extraction scheme using graph-based transformers, namely Uncertainty-Regulated Dual Memory Units (UR-DMU), on top of I3D backbone pre-trained features [29]. They reported 86.97% and 94.02% for the UCF-Crime and XD-Violence datasets, respectively. To improve the performance further, Sharif et al. more recently proposed a two-stream pre-trained feature-based temporal feature enhancement module, where they first extracted CNN-based I3D features in the first stream with a selective top-k score and ViT-based CLIP features in the second stream [30]. Finally, they fused the two features and employed an MLP and a classification module for the classification. They reported 88.97% and 98.66% AUC for the UCF-Crime and ShanghaiTech datasets, respectively.
The drawback of this model is that it did not achieve satisfactory performance for real-time deployment due to a lack of feature effectiveness. Also, it utilized CNN- and ViT-based pre-trained model features and temporal feature enhancement, but it did not consider graph-based feature enhancement or spatial feature enhancement in the module. Likewise, UR-DMU [29] utilized graph-based feature enhancement but did not address time-varying enhancement, and TCA [26], [28] reflected the opposite problem. In addition, UR-DMU [29], TCA [26], [28] and I3D-CLIP [30] all lack the ability to extract every useful kind of feature. These research groups inspired us to extract all possible kinds of features to increase the anomaly detection rate. To overcome these challenges, we propose a multi-stage graph and general deep learning (DL) feature enhancement-based anomaly detection system. In this study, we include CNN- and ViT-based pre-trained features, temporal features, graph-based temporal features and spatial enhancement of the features. The main contributions of the proposed model are given below:

• Stage 1: General Deep Learning Model Based Dual-Stream Feature Extraction: The first stage of our methodology is characterized by the innovative use of two streams, each contributing distinct yet complementary features to the anomaly detection process. Leveraging CLIP and I3D, we extract rich semantic information and spatiotemporal features, respectively, setting a solid foundation for subsequent analysis. Building upon the extracted features, we seamlessly integrate them into the Temporal Contextual Aggregation (TCA) mechanism. This module helps to capture comprehensive contextual information by reusing the similarity matrix and implementing adaptive fusion. This integration facilitates the effective capture of temporal dependencies across video frames, enhancing the model's ability to discern anomalous patterns amidst dynamic scenes.

• Stage 2: Graph-Based UR-DMU Model Integration and Refinement: In the second stage, we introduce the Uncertainty-regulated Dual Memory Units (UR-DMU) model, renowned for its ability to simultaneously learn representations of regular and abnormal data. By incorporating global and local structures through GCN and Global/Local Multi-Head Self Attention (GL-MHSA) modules, our model captures intricate associations within video data. Additionally, refinement through a Multi-Layer Perceptron (MLP) enables non-linear mapping, further enhancing the model's discriminatory capabilities.

• Stage 3: Feature Reduction, Classification and Evaluation with Impact and Promise: In the third stage, we use a feature reduction module based on a two-layer multilayer perceptron (MLP) integrated with PEL to refine and learn discriminative features through knowledge-based prompts. This integration of non-linear mapping further enhances the model's ability to differentiate between normal and anomalous behaviour.

• Classification and Evaluation: Finally, we employ a classifier module to predict the snippet-level anomaly scores. In the training phase, an MIL-based loss function transforms the snippet-level scores into bag-level predictions so that the model learns high activation in anomalous cases. We evaluate the proposed model with three benchmark datasets, namely the UCF-Crime, ShanghaiTech, and XD-Violence datasets. The extensive performance results prove the superiority of the proposed model. Through the integration of these cutting-edge technologies and methodologies, our approach offers a comprehensive solution to video-based anomaly detection. We believe that our model holds significant promise for real-world applications, demonstrating superior performance and efficacy in anomaly detection tasks.
II. LITERATURE REVIEW
The methodologies utilized in WVAED rely on labels at the video level, which consistently adhere to the MIL ranking framework [8]. According to the MIL approach, a regression model is trained using the WVAED method with the assumption that the maximum score of the positive bag is greater than that of the negative bag in order to assign scores for video snippets. The works of [8], [9], [19], [6], and [11] all incorporated pre-trained models based on convolutional neural networks into their experimental setups. In addition, Sultani et al. [8] meticulously curated pre-annotated normal and abnormal video events at the video level to construct the widely recognized UCF-Crime dataset. The dataset was employed for anomaly detection by utilizing a weakly supervised framework. Within the confines of this framework, C3D features [31] were extracted for video segments, and then a ranking loss function was used to train a fully connected neural network (FCNN). The purpose of this function was to compute the loss between the most highly scored rank examples in the positive bag and the negative bag. Tian et al. [19] presented a model that utilized the C3D [31] and I3D [22] models as feature extractors in their WVAED model. They contended that by selecting the top-3 features based on their magnitude, a more pronounced differentiation can be achieved between normal and abnormal videos (AVs). Specifically, in cases where multiple abnormal snippets exist within an anomalous video, the average snippet feature magnitude of the anomalous video surpasses that of normal videos (NVs). Zhang et al. [9] presented a model to extract positive and negative video-segmented C3D features by using a temporal convolution network [31]. Specifically, they trained the network between the previous adjacent segment and the current segment. Further, they used inner and outer bag ranking losses to train the model based on two branches of an FCNN. This loss accounted for the highest- and lowest-scoring parts of the positive bags and negative bags.

Similarly, Zhong et al. [6] and Zhu et al. [11] implemented models that trained a feature-based encoder and classifier simultaneously. Zhong et al. [6] analyzed WVAED and treated it as a supervised learning problem under noisy labels. Extensive experiments were undertaken to evaluate the universal applicability of their model, using both the temporal segment network [32] and C3D [31]. Zhu and Newsam [11] integrated an attention block into their MIL ranking model to account for temporal context. They claimed that motion information features extracted by C3D [31] and I3D [22] outperformed features obtained from individual images using the pre-trained models VGG16 [33] and Inception [34]. ViT-based pre-trained models can be classified into two types: single-stream and dual-stream. In the single-stream approach, text and picture (or video) representations are modeled using a single transformer in a single framework, while the dual-stream model uses a decoupled encoder to encode text and image (or video) separately. Among the most notable ViT feature extractors are CLIP [35], ViLBERT [36], VisualBERT [37], and data-efficient CLIP [38]. For the WVAED problem, Joo et al. [20] recently presented a CLIP-assisted [35] temporal self-attention framework. They conducted experiments on openly accessible datasets to validate their end-to-end WVAED model. Li et al. [39] presented a multi-instance learning network based on transformers to obtain anomaly scores for both videos and snippets. They used the video-level anomaly probability in the inference stage to lessen the snippet-level anomaly score's volatility. Lv et al. [40] introduced an unbiased MIL scheme that trained an unbiased anomaly classifier and a tailored representation for WVAED. In view of the available solutions, it has been observed that, in general, CNN and ViT are typically utilized in isolation. To leverage the benefits offered by both CNN- and ViT-based pre-trained models, an architecture known as CNN-ViT-TSAN, supported by multiple instance learning (MIL), was devised to establish a range of models for addressing the WVAED problem. The drawback of that study is that it processed the video in individual frames or short clips, missing long-range semantic contextual information. To overcome this problem, Shao et al. proposed a TCA framework for video representation learning, which integrates long-range temporal context among frame-level features through the self-attention mechanism [26], [27]. They used contrastive learning to reduce the loss or error rate in the evaluation. To enhance temporal features further, Tian et al. employed robust temporal feature magnitude (RTFM) learning together with an MIL loss [19]. They reported 84.30% and 97.21% AUC for the UCF-Crime and ShanghaiTech datasets, respectively. To improve the AUC by increasing feature effectiveness, Pu et al. employed TCA to enhance long-range dependency and PEL instead of contrastive learning to increase the correct prediction rate by reducing the error [28]. They employed an MLP with PEL to reduce the features and causal convolution (CC) for the classification. PEL mainly integrates semantic priors utilizing knowledge-based prompts, aiming to increase the recognition rate by boosting the discriminative capacity while ensuring high separability between anomaly subclasses; finally, they calculated the score and the error rate with the MIL loss function, reporting AUC rates of 86.76%, 85.59%, and 98.14% for the UCF-Crime, XD-Violence and ShanghaiTech datasets, respectively. To improve the recognition rate, Zhou et al. proposed a new temporal feature extraction scheme using graph-based transformers, namely Uncertainty-Regulated Dual Memory Units (UR-DMU), on top of I3D backbone pre-trained features [29]. They reported 86.97% and 94.02% for the UCF-Crime and XD-Violence datasets, respectively. To improve the performance further, Sharif et al. more recently proposed a two-stream pre-trained feature-based temporal feature enhancement module, where they first extracted CNN-based I3D features in the first stream with a selective top-k score and ViT-based CLIP features in the second stream [30]. Finally, they fused the two features and employed an MLP and a classification module for the classification. They reported 88.97% and 98.66% AUC for the UCF-Crime and ShanghaiTech datasets, respectively. The drawback of this model is that it did not achieve satisfactory performance for real-time deployment due to a lack of feature effectiveness. Also, it utilized CNN- and ViT-based pre-trained model features and temporal feature enhancement, but it did not consider graph-based feature enhancement or spatial feature enhancement in the module. Likewise, UR-DMU [29] utilized graph-based feature enhancement but did not address time-varying enhancement, and TCA [26], [28] reflected the opposite problem. In addition, UR-DMU [29], TCA [26], [28] and I3D-CLIP [30] all lack the ability to extract every useful kind of feature. These research groups inspired us to extract all possible kinds of features to increase the anomaly detection rate. To overcome these challenges, we propose a multi-stage graph and general DL feature enhancement-based anomaly detection system. In this study, we include CNN- and ViT-based pre-trained features, temporal features, graph-based temporal features, and spatial enhancement of the features.
III. DATASET
Anomaly detection datasets play a crucial role in developing and evaluating algorithms aimed at identifying irregular or unexpected events within data streams. These datasets provide diverse scenarios, allowing researchers to train and test their models under various conditions. Many datasets are available for anomaly detection; we used the most widely used benchmark datasets, namely ShanghaiTech [41], UCF-Crime [8], and XD-Violence [29], which offer different scales, backgrounds, and types of anomalies, catering to different research needs. By utilizing these datasets, researchers can benchmark their anomaly detection methods, assess their performance, and contribute to advancing the field of anomaly detection in real-world applications.

A. ShanghaiTech DATASET
This dataset is a medium-scale anomaly dataset comprising 317,398 frames of video clips. These clips capture scenes from various locations within the ShanghaiTech Campus. The dataset includes 13 distinct background scenes, consisting of 307 normal videos (NVs) and 130 anomaly videos. This early dataset [41] serves as a common benchmark for VAED. In this dataset, the training set contains NVs, while the testing set contains both normal and anomalous videos. In order to create a weakly supervised training set that encompasses all 13 background scenes, Zhong et al. [6] reorganized the dataset. Their approach involved selecting a subset of anomalous testing videos and using them as training data. We followed the procedure outlined by Zhong et al. [6] to transform the ShanghaiTech dataset into this weakly supervised setting.

B. UCF-CRIME DATASET
This is a large-scale anomaly detection dataset that includes 1900 untrimmed videos collected from real-world street and indoor surveillance cameras, totalling 128 hours of video. The dataset covers 13 different real-world anomalies: abuse, arrest, arson, assault, accident, burglary, explosion, fighting, robbery, shooting, stealing, shoplifting, and vandalism. Unlike the static background in the ShanghaiTech dataset [41], the UCF-Crime [8] dataset has more complicated and diverse backgrounds. The training set of the UCF-Crime dataset contains 1610 videos, with 800 labeled as normal and 810 labeled as anomalous. The testing set contains 290 videos, with 150 labeled as normal and 140 labeled as anomalous, and includes frame-level labels.

C. XD-VIOLENCE
The XD-Violence dataset comprises a variety of media formats, specifically videos and audio. The dataset encompasses a diverse range of backgrounds, such as movies, games, and live scenes. It consists of a total of 4754 videos, with 3954 videos designated for training and equipped with video-level labels. Additionally, 800 testing videos have been labelled at the frame level [29].
IV. PROPOSED METHODOLOGY
In the preprocessing step, we extracted features from each sample, using 10-crop augmentation for the UCF-Crime and ShanghaiTech datasets and 5-crop augmentation for the XD-Violence dataset using pre-trained models. We divided the untrimmed video into non-overlapping snippets by utilizing a 16-frame sliding window. Then, we introduce a multi-backbone framework, combining a pre-trained CLIP model with an I3D model pre-trained on Kinetics. This dual-backbone approach leverages the strengths of both architectures to enhance the feature extraction process for video anomaly detection. Subsequently, this enhanced feature set is streamlined via TCA, CNN, UR-DMU and a two-layer multilayer perceptron (MLP), optimizing it for further analysis. The procedure consists of three stages. The first stage is constructed with two streams: in the initial stream, we utilise CLIP (Contrastive Language-Image Pre-training) and select the top-k features, which are considered the first-stream features. In the second stream, we use a pre-trained I3D (Inflated 3D ConvNet) network to extract rich spatiotemporal information that is fed into the Temporal Contextual Aggregation (TCA) mechanism for integrating contextual information across frames, effectively capturing temporal dependencies. Then, we employ a CNN module for spatial enhancement of the TCA output to extract spatiotemporal features from video frames as the second-stream feature. We concatenate the CLIP-based first-stream and the I3D-based second-stream features and feed them into the UR-DMU [29] model, which employs dual memory units to learn representations of regular data and discriminative features of abnormal data simultaneously. This model incorporates both global and local structures through GCN [14], [15], [16] and Global/Local Multi-Head Self Attention (GL-MHSA) modules, facilitating the capture of associations in videos. In the third stage, we use a feature reduction module consisting of a two-layer multilayer perceptron (MLP) integrated with PEL to refine and learn discriminative features through knowledge-based prompts. This integration of non-linear mapping further enhances the model's ability to differentiate between normal and anomalous behaviour. Finally, we employ a classifier module to predict the snippet-level anomaly scores. In the training phase, an MIL-based loss function transforms the snippet-level scores into bag-level predictions so that the model learns high activation in anomalous cases. By integrating these cutting-edge technologies, our model offers a comprehensive approach to video-based anomaly detection, promising superior performance in real-world applications.
A. PREPROCESSING
Each untrimmed video V_v with N_v frames is divided into a set of non-overlapping snippets denoted as {γ_i}, i = 1, ..., ⌊N_v/∆⌋, where each snippet contains the same number of frames ∆. In the preprocessing, we followed the existing systems: first, we divided the untrimmed video into non-overlapping snippets by utilizing a 16-frame sliding window (∆ = 16) [28], [29], [42]. Then, we extracted features from each sample, using 10-crop augmentation for the UCF-Crime and ShanghaiTech datasets and 5-crop augmentation for the XD-Violence dataset with the pre-trained models of stage 1 [28], [29], [42].
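To make this preprocessing concrete, the following is a minimal sketch, assuming frames arrive as a NumPy array; the 16-frame snippet length and the 10-crop scheme follow the description above, while the crop size and array names are illustrative only.

```python
import numpy as np

def split_into_snippets(frames: np.ndarray, snippet_len: int = 16) -> np.ndarray:
    """Divide a video (N, H, W, C) into non-overlapping snippets of `snippet_len` frames.

    Trailing frames that do not fill a complete snippet are dropped, so the
    result has shape (floor(N / snippet_len), snippet_len, H, W, C).
    """
    n_snippets = frames.shape[0] // snippet_len
    frames = frames[: n_snippets * snippet_len]
    return frames.reshape(n_snippets, snippet_len, *frames.shape[1:])

def ten_crop(frame: np.ndarray, size: int = 224) -> list:
    """10-crop augmentation: four corners and the centre, plus horizontal flips."""
    h, w = frame.shape[:2]
    coords = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
              ((h - size) // 2, (w - size) // 2)]
    crops = [frame[y:y + size, x:x + size] for y, x in coords]
    return crops + [np.flip(c, axis=1) for c in crops]

# Example with a dummy 100-frame video of 240x320 RGB frames.
video = np.zeros((100, 240, 320, 3), dtype=np.uint8)
snippets = split_into_snippets(video)      # shape (6, 16, 240, 320, 3)
crops = ten_crop(snippets[0, 8])           # 10 augmented views of one frame
print(snippets.shape, len(crops))
```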
B. STAGE 1: PRETRAINED MODEL-BASED FEATURE EXTRACTION
In the first stage, we introduce a multi-backbone framework, combining a pre-trained CLIP model with an I3D model pre-trained on Kinetics. It is important to note that our I3D model is configured to process only RGB input. In this architecture, the I3D RGB model extracts features in a 1024-dimensional space, while the CLIP model provides feature vectors in 512 dimensions. This dual-backbone approach leverages the strengths of both architectures to enhance the feature extraction process for video anomaly detection.

1) ViT TRANSFORMER BASED CLIP FEATURE EXTRACTION STREAM
In the first stream, we employ a pre-trained CLIP model to extract features and then nominate the top scores to select the most relevant video snippets.
a: PRETRAINED CLIP FEATURES
CLIP leverages a unified framework for understanding both text and image data, enabling it to capture rich semantic information from video frames. Vision-language pre-trained models leverage ViTs to capture the correlations between objects or actions depicted in a video and those described in textual content. These sophisticated models excel at extracting intricate relationships between visual and linguistic elements, thereby facilitating comprehensive understanding and analysis across modalities. Many researchers have used the concept of ViT as a backbone for different kinds of transformers, namely ViLBERT [36], CLIP [35], VisualBERT [37] and data-efficient CLIP [38], aiming to develop different kinds of language models and multi-modal vision models. Generally, CLIP [35] serves as a multi-modal vision and language model, harnessing a ViT as its foundational framework for extracting visual features. We consider the middle frame d_j = ⌈∆/2⌉ of each snippet γ_j, which means we do not consider all frames at a time. In our study, we employ CLIP on the frame d_j of snippet γ_j to represent its features as φ_j^v ∈ R^ℵ, where ℵ is the feature dimension, and the final feature vector can be constructed as φ_vit^v = {φ_j^v}, j = 1, ..., T, with φ_vit^v ∈ R^{T×ℵ} [30]. We used pre-trained CLIP models in the first stream to extract effective features from each video. The CLIP model provides feature vectors in 512 dimensions.
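This CLIP feature extraction can be sketched with the Hugging Face transformers implementation of CLIP; the checkpoint name below is an assumption (the exact CLIP variant used is not specified above), and the function simply encodes one middle frame per snippet into a 512-dimensional vector, matching φ_j^v.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper's exact ViT-based CLIP variant may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def clip_snippet_features(middle_frames):
    """Encode the middle frame of every snippet with the CLIP image encoder.

    middle_frames: list of T PIL images (one per snippet).
    Returns a (T, 512) tensor, i.e. the phi_vit feature of the first stream.
    """
    inputs = processor(images=middle_frames, return_tensors="pt")
    return model.get_image_features(**inputs)

# Example with dummy frames.
frames = [Image.new("RGB", (320, 240)) for _ in range(8)]
features = clip_snippet_features(frames)
print(features.shape)   # torch.Size([8, 512])
```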
b: TOP-K SCORE NOMINATOR
The output of the CLIP model is fed into the k-score selection module, which is demonstrated in Figure 2. The top-k score nominator, as described in [42], is a crucial component for selecting the most relevant video snippets. The scores are processed to identify the most relevant snippets, ensuring that the snippets with the highest relevance, indicated by their score values, are selected for further processing. This module involves cloning the input vector, which comes from the CLIP model output and is known as a score vector. Then, we add Gaussian noise and calculate the magnitude. Based on the magnitude value, we select the top-k scores. This top-k score nominator is integral for focusing the model's attention on the most significant parts of the video.
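A minimal sketch of this top-k score nominator follows; the noise scale and the use of the L2 norm as the magnitude are assumptions about details not fixed in the text.

```python
import torch

def topk_score_nominator(clip_feats: torch.Tensor, k: int, noise_std: float = 0.1):
    """Select the k most relevant snippets from CLIP features of shape (T, D).

    Following the description above: the feature tensor is cloned into a score
    vector, Gaussian noise is added, the L2 magnitude of each snippet is used
    as its score, and the top-k snippets are kept (noise_std is an assumption).
    """
    scores = clip_feats.clone() + noise_std * torch.randn_like(clip_feats)
    magnitude = scores.norm(p=2, dim=-1)                     # (T,)
    topk_idx = magnitude.topk(min(k, magnitude.numel())).indices
    topk_idx, _ = torch.sort(topk_idx)                       # keep temporal order
    return clip_feats[topk_idx], topk_idx

feats = torch.randn(32, 512)
selected, idx = topk_score_nominator(feats, k=8)
print(selected.shape, idx.shape)    # torch.Size([8, 512]) torch.Size([8])
```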
2) CNN BASED I3D FEATURE EXTRACTION STREAM
In the second branch of the first-stage module, we employ I3D and then enhance the information using the TCA and CNN modules. The I3D module allows for the extraction of robust spatio-temporal features from video sequences. By incorporating spatial and temporal information, this module effectively captures motion and appearance cues, enabling a comprehensive representation of video content. TCA plays a pivotal role in integrating contextual information across multiple frames. By considering temporal dependencies within video sequences, TCA enhances the model's ability to discern anomalies. This mechanism ensures that the model can effectively capture dynamic changes over time, improving anomaly detection accuracy. The incorporation of a 1D CNN, followed by ReLU activation and dropout regularization, contributes to feature dimensionality reduction while preserving essential information. This process ensures that the extracted features are concise yet informative, facilitating efficient anomaly detection without sacrificing discriminative power.
a: I3D FEATURES
I3D excels in capturing spatio-temporal features from video sequences, providing a robust representation of motion and appearance cues within the temporal context [22]. The CNN, one of the most widely used DL models, has great potential for image classification, and the CNN-based C3D (Convolutional 3D) [21] is among the most common feature extractors. Tran et al. [31] showed that C3D can model appearance and motion information simultaneously and outperform 2D CNN features in various video-analysis tasks. Technically, we calculate the I3D features of the T_v snippets in the dimension ℵ, i.e., φ_cnn^v = {φ_i^v}, i = 1, ..., T_v, with φ_cnn^v ∈ R^{T_v×ℵ} [30], where a specific video V_v contains T_v snippets. In lieu of employing PCA, we opt for a low-variance filter algorithm to reduce the dimensionality of the extracted data. After reducing the dimension, this produces the reduced feature of this stream, whose dimension can be expressed as φ́_cnn^v ∈ R^{T_v×ℵ́}, derived from φ_cnn^v ∈ R^{T_v×ℵ}, where ℵ is the feature dimension extracted from the T_v snippets.
b: TEMPORAL CONTEXT AGGREGATION MODULE (TCA)
To enhance the temporal contextual information of the I3D features, we used the TCA model [28]. TCA facilitates the integration of contextual information across multiple frames, enhancing the model's ability to discern anomalies by considering temporal dependencies effectively. It is mainly used as a video representation learning framework that incorporates long-range temporal information between frame-level features using the self-attention mechanism [26], [28]. It captures temporal relationships from both local and global perspectives. Figure 3 demonstrates the TCA calculation procedure, where X is the output of the I3D module, which is projected into a latent space utilizing various linear layers and finally produces the similarity matrix as below:

M = f_q(X) · f_k(X)^⊤ (1)
A^g = softmax(M / √D_h) (2)
X^g = A^g · f_v(X) (3)

Here, the query, key and value are represented by f_q(·), f_k(·) and f_v(·), ⊤ denotes the transpose operation, and the dimension of the hidden space is represented by D_h. In addition, A^g denotes the global attention, and X^g represents the global context features [26], [28]. We enhance the similarity matrix with the dynamic position encoding (DPE) approach according to the following Equation (4):

G = exp(−|γ(i − j)^2 + β|) (4)

where i and j denote the absolute positions of two snippets, and γ and β are learnable weight and bias terms. In contrast, we also calculate the local attention and local context features according to the formulas below:

A^l = softmax(M̃ / √D_h) (5)
X^l = A^l · f_v(X) (6)

Here A^l denotes the local attention and X^l represents the local context features, where M̃ represents the masked version of the similarity matrix from Equation (1) [26], [28]. Then we combine the global attention head X^g and the local attention head X^l to produce the final feature X^o using Equation (7):

X^o = α · X^g + (1 − α) · X^l (7)

After normalizing the features, we add a skip connection to avoid losing information. Finally, we employ a linear layer and produce the output feature of the TCA module according to Equation (8):

X^c = LN(X + f_h(Norm(X^o))) (8)

where the global weight, local weight and power normalization are represented by α, (1 − α), and Norm(·), respectively.

CNN Module: These features are then processed through a 1D CNN (Conv1d), followed by a ReLU activation function and a dropout rate of 0.1, reducing the feature dimensionality to 512.
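A PyTorch sketch of Eqs. (1)-(8) is given below, assuming single-head attention, scalar learnable γ and β for the DPE term, a simple band mask for the local branch, and F.normalize standing in for the power normalization Norm(·); the original TCA implementation [26], [28] may differ in these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCASketch(nn.Module):
    """Sketch of the temporal context aggregation of Eqs. (1)-(8)."""

    def __init__(self, dim: int, hidden: int = 512, local_window: int = 9, alpha: float = 0.5):
        super().__init__()
        self.fq, self.fk, self.fv = nn.Linear(dim, hidden), nn.Linear(dim, hidden), nn.Linear(dim, hidden)
        self.fh = nn.Linear(hidden, dim)
        self.norm = nn.LayerNorm(dim)
        self.gamma = nn.Parameter(torch.tensor(0.1))   # DPE weight (assumed scalar)
        self.beta = nn.Parameter(torch.tensor(0.0))    # DPE bias (assumed scalar)
        self.alpha, self.hidden, self.local_window = alpha, hidden, local_window

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (T, dim)
        t = x.size(0)
        m = self.fq(x) @ self.fk(x).T                           # Eq. (1): similarity matrix
        pos = torch.arange(t, device=x.device)
        dist2 = (pos[:, None] - pos[None, :]).float() ** 2
        g = torch.exp(-(self.gamma * dist2 + self.beta).abs())  # Eq. (4): dynamic position encoding
        m = m * g
        a_g = F.softmax(m / self.hidden ** 0.5, dim=-1)         # Eq. (2): global attention
        x_g = a_g @ self.fv(x)                                  # Eq. (3): global context
        band = dist2.sqrt() <= self.local_window // 2           # assumed band mask for M~
        m_local = m.masked_fill(~band, float('-inf'))
        a_l = F.softmax(m_local / self.hidden ** 0.5, dim=-1)   # Eq. (5): local attention
        x_l = a_l @ self.fv(x)                                  # Eq. (6): local context
        x_o = self.alpha * x_g + (1 - self.alpha) * x_l         # Eq. (7): adaptive fusion
        # Eq. (8): skip connection, with F.normalize standing in for Norm(.)
        return self.norm(x + self.fh(F.normalize(x_o, dim=-1)))

tca = TCASketch(dim=1024)
out = tca(torch.randn(64, 1024))
print(out.shape)    # torch.Size([64, 1024])
```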
C. FEATURE FUSION
In the first stream, we use the top-k score nominator [42] to select the top-k segments based on their CLIP feature relevance and obtain a refined set of 512-dimensional features denoted by X_T. We obtain the final feature of the second stream from its FC module, denoted by X_CNN. These features are then concatenated, resulting in comprehensive 1024-dimensional features denoted by F_stage-1 using Equation (9):

F_stage-1 = X_T ⊕ X_CNN (9)
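A minimal sketch of the fusion of Eq. (9), assuming the two streams are aligned snippet by snippet; the Conv1d block realizes the 1024 → 512 reduction of the CNN module described above, and the exact layer arrangement is an assumption.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Reduce the TCA-enhanced I3D stream to 512 dims with Conv1d + ReLU + dropout,
    then concatenate it with the 512-dim CLIP stream to form F_stage-1 (Eq. (9))."""

    def __init__(self, i3d_dim: int = 1024, out_dim: int = 512, p_drop: float = 0.1):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv1d(i3d_dim, out_dim, kernel_size=1),
            nn.ReLU(),
            nn.Dropout(p_drop),
        )

    def forward(self, x_clip: torch.Tensor, x_i3d: torch.Tensor) -> torch.Tensor:
        # x_clip: (T, 512) selected CLIP features; x_i3d: (T, 1024) TCA output.
        x_cnn = self.reduce(x_i3d.T.unsqueeze(0)).squeeze(0).T    # (T, 512)
        return torch.cat([x_clip, x_cnn], dim=-1)                 # (T, 1024)

fusion = FeatureFusion()
f_stage1 = fusion(torch.randn(32, 512), torch.randn(32, 1024))
print(f_stage1.shape)    # torch.Size([32, 1024])
```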
D. STAGE 2: UR-DMU BASED FEATURE
To produce the graph-based temporal enhancement feature, we employ the UR-DMU [14], [15], [29] approach, which mainly incorporates dual memory units to simultaneously learn representations of regular data and discriminative features of abnormal data. The main goal is to improve the model's ability to differentiate between normal and anomalous behaviour. It consists of three main components, demonstrated in Figure 4. Global and Local Multi-Head Self Attention (GL-MHSA) is crucial for learning both long- and short-range temporal dependencies of anomalous features. It enhances the transformer structure by integrating global and local structural concepts from graph convolution networks.

S = σ(X M^⊤ / √D), M_aug = S M (10)

where X is the feature obtained from GL-MHSA, M denotes the queried memory banks with N_M memory slots, D is the output dimension, σ is the sigmoid activation, and S ∈ R^{N×N_M} is the query score. Following that, M_aug represents the memory-augmented feature produced by a read operation. We define a dual memory loss consisting of four binary cross-entropy (BCE) losses in order to train the dual memory units:

L_dm = BCE(S^n_{k;n}, y^n_n) + BCE(S^n_{k;a}, y^n_a) + BCE(S^a_{k;n;k}, y^a_n) + BCE(S^a_{k;a;k}, y^a_a) (11)

where S^n_{k;n} is a normal memory score with y^n_n = 1 ∈ R^N, and S^n_{k;a} is an anomaly memory score with y^n_a = 0 ∈ R^N. The means of the top-K results of S^a_{k;n} and S^a_{k;a} along the first dimension are S^a_{k;n;k}, S^a_{k;a;k} ∈ R^N, and y^a_n, y^a_a are labels with value 1. This helps distinguish hard samples better by comparing feature similarities with stored templates. Normal Data Uncertainty Learning (NUL) uses a Gaussian distribution to constrain the latent normal representation; it is an approach not commonly used in weakly supervised video anomaly detection, drawing on concepts from unsupervised anomaly detection methods. For training and testing, pairs of videos with equal amounts of normal and abnormal footage are processed. The model generates a score for each video snippet, using binary cross-entropy (BCE) loss and five auxiliary losses for discrimination between normality and anomaly. During testing, the model utilizes only the mean-encoder network of the DUL module to obtain feature embeddings, which are then used to label the video snippets and finally produce the UR-DMU features, denoted F_urdmu. Then we produce the final feature of stage 2, F_stage-2, by adding the UR-DMU feature F_urdmu to the TCA feature X^c using Equation (12):

F_stage-2 = F_urdmu + X^c (12)
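A simplified sketch of the memory interaction of Eq. (10) and a two-term stand-in for the dual-memory BCE loss of Eq. (11) is shown below; the number of memory slots, the top-k value, and the label assignment are assumptions, and the full UR-DMU loss [29] uses four terms plus additional uncertainty losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryUnitSketch(nn.Module):
    """Query scores against a bank of memory slots and read back an augmented
    feature, in the spirit of Eq. (10)."""

    def __init__(self, dim: int = 1024, num_slots: int = 60):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # assumed slot count
        self.dim = dim

    def forward(self, x: torch.Tensor):
        # x: (N, dim) snippet features from GL-MHSA.
        s = torch.sigmoid(x @ self.memory.T / self.dim ** 0.5)   # Eq. (10): query score S
        m_aug = s @ self.memory                                  # read operation -> M_aug
        return s, m_aug

def dual_memory_bce(scores_normal_mem, scores_abnormal_mem, is_abnormal_video, k=3):
    """Two illustrative BCE terms: the normal-memory response is pushed towards 1
    for normal data and 0 for abnormal data, and vice versa for the abnormal memory."""
    s_n = scores_normal_mem.max(dim=-1).values.topk(k).values.mean()
    s_a = scores_abnormal_mem.max(dim=-1).values.topk(k).values.mean()
    y_n = torch.tensor(0.0 if is_abnormal_video else 1.0)
    y_a = torch.tensor(1.0 if is_abnormal_video else 0.0)
    return F.binary_cross_entropy(s_n, y_n) + F.binary_cross_entropy(s_a, y_a)

normal_mem, abnormal_mem = MemoryUnitSketch(), MemoryUnitSketch()
x = torch.randn(32, 1024)
s_n, _ = normal_mem(x)
s_a, _ = abnormal_mem(x)
print(dual_memory_bce(s_n, s_a, is_abnormal_video=True).item())
```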
E. STAGE 3: FEATURE REDUCTION WITH PEL THROUGH MLP MODULE
To select the effective features from the graph-based UR-DMU features F_stage-2, we employ an MLP with the PEL module, as described below.

a: MLP
To achieve high-level semantic representations by selecting the effective features from the graph-based F_stage-2 features, we employ a two-layer MLP for feature reduction. The MLP serves as a powerful tool for non-linear mapping and feature transformation, enabling the model to learn complex decision boundaries and refine the extracted features for final anomaly detection. This MLP incorporates two Conv1d layers, two GELU activations, and two dropout mechanisms [43]. Prior to the first Conv1d layer, we integrate features from TCA. Following the first Conv1d layer, we append a 512-dimensional feature derived from I3D. Each Conv1d layer is succeeded by a GELU activation function and a dropout operation. This methodology is formulated in Equation (13):

F_MLP-1 = Dropout(GELU(Conv1D(F_stage-2)))
F_stage-3 = Dropout(GELU(Conv1D(F_MLP-1))) (13)

Finally, we utilize a causal convolution layer to produce the anomaly scores, integrating both present and past observations for enhanced reliability. The classifier is represented as:

S = σ(f_t(X_s)), (14)

where f_t(·) denotes the causal convolution layer with a kernel size of ∆t, σ(·) represents the sigmoid function, and s_i signifies the anomaly score of the i-th snippet.
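A sketch of this stage-3 head is given below: two Conv1d + GELU + dropout blocks for Eq. (13), followed by the causal-convolution classifier of Eq. (14). The channel widths and the causal kernel size are assumptions.

```python
import torch
import torch.nn as nn

class StageThreeHead(nn.Module):
    """Feature reduction of Eq. (13) followed by the causal classifier of Eq. (14)."""

    def __init__(self, in_dim: int = 1024, mid_dim: int = 512, out_dim: int = 128,
                 kernel: int = 5, p_drop: float = 0.1):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv1d(in_dim, mid_dim, 1), nn.GELU(), nn.Dropout(p_drop))
        self.block2 = nn.Sequential(nn.Conv1d(mid_dim, out_dim, 1), nn.GELU(), nn.Dropout(p_drop))
        # Causal convolution: pad only on the left so each score depends on the
        # present and past snippets, never on future ones.
        self.pad = nn.ConstantPad1d((kernel - 1, 0), 0.0)
        self.classifier = nn.Conv1d(out_dim, 1, kernel)

    def forward(self, f_stage2: torch.Tensor) -> torch.Tensor:
        # f_stage2: (T, in_dim) -> anomaly scores (T,) in [0, 1].
        x = f_stage2.T.unsqueeze(0)                              # (1, in_dim, T)
        x = self.block2(self.block1(x))                          # Eq. (13)
        scores = torch.sigmoid(self.classifier(self.pad(x)))     # Eq. (14)
        return scores.squeeze(0).squeeze(0)

head = StageThreeHead()
print(head(torch.randn(64, 1024)).shape)    # torch.Size([64])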
Finally, we employ multiple instance learning (MIL) as the loss function [28], [44]. Specifically, we determine the video-level prediction p_i by computing the mean value of the top-k anomaly scores. For positive bags, we set k = ⌊T/16⌋ + 1, and for negative bags, we set k = 1. Given a mini-batch containing B samples with video-level ground truth y_i, the binary cross-entropy is formulated as:

L_ce = −(1/B) Σ_{i=1}^{B} y_i log(p_i). (15)
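The MIL objective of Eq. (15) can be sketched as below; it is implemented here with the standard two-term binary cross-entropy, whereas Eq. (15) as printed keeps only the positive term.

```python
import torch
import torch.nn.functional as F

def mil_bce_loss(snippet_scores: list, video_labels: torch.Tensor) -> torch.Tensor:
    """Video-level prediction p_i = mean of the top-k snippet scores, with
    k = floor(T/16) + 1 for positive bags and k = 1 for negative bags."""
    preds = []
    for scores, y in zip(snippet_scores, video_labels):
        t = scores.numel()
        k = t // 16 + 1 if y > 0.5 else 1
        preds.append(scores.topk(min(k, t)).values.mean())
    preds = torch.stack(preds)
    return F.binary_cross_entropy(preds, video_labels)

scores = [torch.rand(48), torch.rand(64)]          # two videos in a mini-batch
labels = torch.tensor([1.0, 0.0])                  # abnormal, normal
print(mil_bce_loss(scores, labels).item())
```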
b: PROMPT-ENHANCED LEARNING (PEL)
In this study, we employ Prompt-Enhanced Learning (PEL) [28], [42] to enrich visual representations by integrating knowledge-based contextual information, improving anomaly detection in complex scenarios. It involves three key steps: prompt construction, fore-background separation, and cross-modal alignment. The prompt construction mainly selects common relations among the categories to form prompts that focus on the high-occurrence categories and builds a relevant semantic relationship dictionary. Then, based on the output of F_MLP-1 and the cross-entropy loss L_ce information, it produces the video-level background and foreground features. Finally, the effective feature is prompted based on the enhanced fine-grained semantics of the visual features. That means PEL assesses the likelihood of a visual feature matching a particular prompt across several anomaly classes and one normal class. Overall, the PEL module's integration of textual and visual modalities enables a more nuanced and context-aware approach to anomaly detection in video data. Finally, the cross-modal alignment loss is computed using the Kullback-Leibler divergence, compelling the network to discern between the visual content of the video representing abnormal behavior (foreground) and irrelevant content (background). The loss function is formulated as follows:

L_kd = E_{v∼p(v)}[log p_v2t(v) − log q_v2t(v)], (16)

where p_v2t(v) and q_v2t(v) denote the similarity score and semantic consistency label of the video-prompt pair, respectively. For a positive pair, q = 1; otherwise, q = 0. We add the magnitude contrastive (MC) loss [45] to L_kd to enhance the effectiveness of the loss calculation procedure.
V. EVALUATION AND PERFORMANCE
To evaluate the proposed model, we used three benchmark anomaly detection datasets: ShanghaiTech [41], UCF-Crime [8], and XD-Violence [29].

A. TRAINING AND TESTING PROCEDURES
During training, we optimize the objective function L = L_ce + λ L_kd, where λ adjusts the alignment loss. This enables our model to generate discriminative representations of positive and negative snippets, improving generalizability. In the testing phase, we mitigate the impact of transient noise with a score smoothing (SS) strategy using distinct pooling operations, following Equation (17):

s̃_i = (1/κ) Σ_{j=i}^{i+κ−1} s_j (17)

This also helps us to suppress biases and reduce false alarms by smoothing the prediction scores. In addition, we skipped feature-length normalization of the extracted video feature vectors, assuming independence among videos. These vectors underwent TSAN processing, producing reweighted attention features. These features were then fed into the snippet association network and an MLP-based converter to obtain anomaly scores. Each score, ranging from 0 to 1, indicates the anomaly probability of the corresponding snippet. To maintain the original video order for evaluation against ground truth labels, each score was replicated ∆ times to match the video's original frame length.
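A minimal sketch of the score-smoothing step of Eq. (17) follows; the value of κ and the handling of the final snippets are assumptions.

```python
import torch

def smooth_scores(scores: torch.Tensor, kappa: int = 5) -> torch.Tensor:
    """Replace each snippet score by the mean of a kappa-length window of
    scores (a simple moving average, as in Eq. (17))."""
    padded = torch.cat([scores, scores[-1:].repeat(kappa - 1)])   # pad the tail
    windows = padded.unfold(0, kappa, 1)                          # (T, kappa)
    return windows.mean(dim=1)

s = torch.rand(20)
print(smooth_scores(s).shape)    # torch.Size([20])
```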
1) ENVIRONMENTAL SETUP AND EVALUATION METRICS
The system was developed on a machine with a GeForce RTX 4090 24GB GPU, running CUDA version 11.7 and NVIDIA driver version 515, with 64GB of RAM. During training, we employed a learning rate of 0.005 and a batch size of 32. The training process lasted for 300 epochs using the Adam optimizer on the same RTX 4090 machine. For an efficient implementation of graph convolution and attention with low computational cost, the Python environment included OpenCV, pickle, and pandas [46], [47], [48]; these packages facilitated initial data processing and model development [47], [48].

We compare the results using the area under the curve (AUC) of the frame-level receiver operating characteristic (ROC) for UCF-Crime and ShanghaiTech to measure WS-VAD performance. For XD-Violence, on the other hand, the area under the frame-level precision-recall curve (AP) is utilized. In the ablation experiments, the False Alarm Rate (FAR) and an anomaly subset consisting of only abnormal data are also utilized. The FAR reported here differs from the usual implementation: we used the "Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection" implementation as is, in which the FAR is restricted to normal videos, i.e., videos whose frames are all labelled 0. In the ShanghaiTech dataset, this FAR was exactly 0, which is probably due to the high AUC of 98.6%. FAR has not been widely used as an indicator; it was used in two papers, but one of them was not prepared for comparison with the other.
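The evaluation protocol (frame-level ROC-AUC for UCF-Crime/ShanghaiTech and frame-level AP for XD-Violence) can be sketched with scikit-learn as follows; repeating each snippet score 16 times to reach the frame level mirrors the score replication described above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def frame_level_metrics(snippet_scores: np.ndarray, frame_labels: np.ndarray,
                        snippet_len: int = 16):
    """Repeat snippet scores to frame level, then compute frame-level ROC-AUC
    (UCF-Crime / ShanghaiTech) and precision-recall AP (XD-Violence)."""
    frame_scores = np.repeat(snippet_scores, snippet_len)[: len(frame_labels)]
    auc = roc_auc_score(frame_labels, frame_scores)
    ap = average_precision_score(frame_labels, frame_scores)
    return auc, ap

scores = np.random.rand(40)                       # 40 snippet scores
labels = np.random.randint(0, 2, size=40 * 16)    # per-frame ground truth
print(frame_level_metrics(scores, labels))
```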
B. ABLATION STUDY
Table 1 presents the ablation study of the proposed model, which also shows the contribution of the multi-backbone pre-trained model. In the ablation study, we systematically evaluated the impact of the various technologies on weakly supervised video anomaly detection. The presence of a check mark indicates the utilization of the corresponding technology in our experiments. We observed that integrating I3D alone resulted in a notable improvement in performance across all datasets. Incorporating TCA alongside I3D further enhanced detection accuracy. CLIP integration facilitated even better results, particularly on the UCF-Crime dataset. Additionally, employing top-k selection improved performance consistently. The introduction of UR-DMU significantly boosted detection rates, which is particularly evident on the XD-Violence and ShanghaiTech datasets. Furthermore, the inclusion of PEL and MC contributed to further performance gains. Finally, adopting SS alongside all the aforementioned technologies yielded the highest detection accuracy, showcasing the synergistic effect of combining these methodologies.
TABLE 2. Performance result.

C. PERFORMANCE RESULT OF THE PROPOSED STUDY
Table 2 presents performance metrics for the three anomaly detection datasets. The metrics include the Area Under the Curve (AUC), Anomaly AUC, Average Precision (AP), and False Alarm Rate (FAR). For the UCF-Crime dataset, the AUC is 0.9009, with an Anomaly AUC of 0.7456, an AP of 0.4090, and a FAR of 0.0204. Similarly, the XD-Violence dataset shows an AUC of 0.9509, Anomaly AUC of 0.8626, AP of 0.8648, and FAR of 0.0013. Lastly, the ShanghaiTech dataset exhibits an AUC of 0.9869, Anomaly AUC of 0.8228, AP of 0.7780, and FAR of 0.0000.
TABLE 3. State-of-the-art comparison of the proposed model for the UCF-Crime and ShanghaiTech datasets.

D. STATE OF THE ART COMPARISON FOR THE UCF-CRIME AND ShanghaiTech DATASETS
Table 3 presents an overview of various crime detection models developed over multiple years. Each model is evaluated based on its performance on the ShanghaiTech and UCF-Crime datasets using the AUC (Area Under the Curve) metric. In 2018, Sultani et al. [8] introduced models utilizing the C3D and I3D feature extractors. These models demonstrated competitive AUC scores on both datasets, indicating their efficacy in identifying crime-related activities in videos. Zhong et al. [6] further advanced the field in 2019 by exploring the use of the C3D and TSN feature extractors. Their experiments revealed varying performance across the datasets, emphasizing the importance of feature extractor selection in model development. Additionally, Zhang et al. [9] investigated the I3D feature extractor, contributing additional insights into its suitability for crime detection tasks.

In 2020, Zaheer et al. [7], [49] introduced novel feature extractors such as C3D-self and C3D, achieving promising results on both datasets. Wan et al. [50] also contributed to advancements by exploring the I3D feature extractor, further diversifying the range of feature extractors used in crime detection models. The year 2021 marked significant progress in the field, with studies by Purwanto et al. [13], Tian et al. [19], Majhi et al. [51], Wu and Liu [44], Yu et al. [52], Lv et al. [12], and Feng et al. [53] introducing various feature extractors and achieving competitive results. These studies highlighted the continuous evolution and improvement of crime detection models. In 2022, the research landscape expanded further with a surge in model diversity. Studies by Zaheer et al. [3], [30], Joo et al. [20], Cao et al. [63], Li et al. [39], Sharif et al. [30], Yi et al. [57], Yu et al. [27], and Gong et al. [58] introduced novel approaches and feature extractors, pushing the boundaries of performance in video-based crime detection. Finally, the proposed hybrid model showcased exceptional performance, achieving remarkably high AUC scores on both datasets. This model represents a culmination of feature extraction techniques and model architecture advancements, underscoring the potential for further improvements in crime detection technology. Overall, the comparison table provides valuable insights into the evolution of crime detection models over the years, highlighting the importance of feature extraction techniques and model architecture design in achieving superior performance. As the field continues to advance, future models are expected to enhance the capabilities of video-based crime detection systems, contributing to improving public safety and security.

TABLE 4. State-of-the-art comparison of the proposed model for the XD-Violence dataset.

E. STATE OF THE ART COMPARISON FOR THE XD-VIOLENCE DATASET
Table 4 presents the state-of-the-art comparison of the proposed model on the XD-Violence dataset. The proposed model outperforms existing state-of-the-art methods, achieving an impressive average precision (AP) score of 86.26%. Sultani et al. [8] achieved an AP of 73.20% using RGB features, while HL-Net [10] attained a slightly higher 73.67%. Notably, incorporating audio features alongside RGB, HL-Net reached 78.64%. RTFM [19] and MSL [39] followed closely with scores of 77.81% and 78.28%, respectively. Pang et al. [64] and ACF [65] leveraged RGB with audio, achieving 81.69% and 80.13%, respectively. However, the proposed model significantly surpasses these benchmarks, demonstrating its efficacy in violence detection.
VI. CONCLUSION AND FUTURE DIRECTION
In this study, we proposed a graph and general DL approach to extract discriminative features that effectively distinguish abnormal events from normality in weakly supervised video anomaly detection (WS-VAD) tasks. By addressing the limitations of existing approaches and proposing a multi-stage deep-learning model that integrates cutting-edge technologies, we have demonstrated the effectiveness of our method. Through the utilization of a ViT-based CLIP module, a CNN-based I3D module, an Uncertainty-regulated Dual Memory Units (UR-DMU) model, and GCN and Global/Local Multi-Head Self Attention (GL-MHSA) modules, we have successfully extracted and learned representations of regular and abnormal data simultaneously. The refinement of features in our third-stage module, a CNN-based MLP, further enhances the model's ability to differentiate between normal and anomalous behaviour. Besides anomaly detection, we believe that this model can be used to detect crimes and contribute to automatic crime control. Extensive experiments on multiple datasets have validated the superiority of our approach over state-of-the-art methods, showcasing its potential for real-world applications in anomaly detection tasks. We believe that our comprehensive solution offers significant promise, demonstrating enhanced efficacy and performance in video-based anomaly detection.

ABBREVIATIONS
WS-VAD  Weakly supervised video anomaly detection.
UR-DMU  Uncertainty-regulated dual memory units.
MLP     Multilayer perceptron.
TCA     Temporal contextual aggregation.
GCN     Graph convolutional networks.
GL-MHSA Global/local multi-head self-attention.
MIL     Multiple instance learning.
NVs     Normal videos.
AVs     Anomalous videos.
DL      Deep learning.
CNN     Convolutional neural network.
ViT     Vision transformer.
BCE     Binary cross entropy.
REFERENCES
[1] K. Liu and H. Ma, "Exploring background-bias for anomaly detection in surveillance videos," in Proc. 27th ACM Int. Conf. Multimedia, Nice, France, Oct. 2019, pp. 1490–1499.
[2] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. Van Den Hengel, "Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1705–1714.
[3] M. Z. Zaheer, A. Mahmood, M. H. Khan, M. Segu, F. Yu, and S.-I. Lee, "Generative cooperative learning for unsupervised video anomaly detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), New Orleans, LA, USA, Jun. 2022, pp. 14724–14734.
[4] M. H. Sharif, L. Jiao, and C. W. Omlin, "Deep crowd anomaly detection by fusing reconstruction and prediction networks," Electronics, vol. 12, no. 7, p. 1517, Mar. 2023.
[5] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv. (CSUR), vol. 41, no. 3, pp. 1–58, 2009.
[6] J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, "Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, Jun. 2019, pp. 1237–1246.
[7] M. Z. Zaheer, A. Mahmood, M. Astrid, and S.-I. Lee, "CLAWS: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection," in Computer Vision—ECCV. Glasgow, U.K.: Springer, 2020, pp. 358–376.
[8] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 6479–6488.
[9] J. Zhang, L. Qing, and J. Miao, "Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection," in Proc. IEEE Int. Conf. Image Process. (ICIP), Taipei, Taiwan, Sep. 2019, pp. 4030–4034.
[10] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, "Not only look, but also listen: Learning multimodal violence detection under weak supervision," in Computer Vision—ECCV. Glasgow, U.K.: Springer, 2020, pp. 322–339.
[11] Y. Zhu and S. Newsam, "Motion-aware feature for improved video anomaly detection," 2019, arXiv:1907.10211.
[12] H. Lv, C. Zhou, Z. Cui, C. Xu, Y. Li, and J. Yang, "Localizing anomalies from weakly-labeled videos," IEEE Trans. Image Process., vol. 30, pp. 4505–4515, 2021.
[13] D. Purwanto, Y.-T. Chen, and W.-H. Fang, "Dance with self-attention: A new look of conditional random fields on anomaly detection in videos," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Montreal, BC, Canada, Oct. 2021, pp. 173–183.
[14] A. S. M. Miah, Md. A. M. Hasan, and J. Shin, "Dynamic hand gesture recognition using multi-branch attention based graph and general deep learning model," IEEE Access, vol. 11, pp. 4703–4716, 2023.
[15] A. S. M. Miah, M. A. M. Hasan, Y. Okuyama, Y. Tomioka, and J. Shin, "Spatial–temporal attention with graph and general neural network-based sign language recognition," Pattern Anal. Appl., vol. 27, no. 2, p. 37, 2024.
[16] A. S. M. Miah, M. A. M. Hasan, Y. Tomioka, and J. Shin, "Hand gesture recognition for multi-culture sign language using graph and general deep learning network," IEEE Open J. Comput. Soc., vol. 5, pp. 144–155, 2024.
[17] M.-A. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon, "Multiple instance learning: A survey of problem characteristics and applications," Pattern Recognit., vol. 77, pp. 329–353, May 2018.
[18] Y. Liu, D. Yang, Y. Wang, J. Liu, J. Liu, A. Boukerche, P. Sun, and L. Song,
[19] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, "Weakly-supervised video anomaly detection with robust temporal feature magnitude learning," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Montreal, BC, Canada, Oct. 2021, pp. 4955–4966.
[20] H. Kevin Joo, K. Vo, K. Yamazaki, and N. Le, "CLIP-TSA: CLIP-assisted temporal self-attention for weakly-supervised video anomaly detection," 2022, arXiv:2212.05136.
[21] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013.
[22] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 4724–4733.
[23] O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, "StyleCLIP: Text-driven manipulation of StyleGAN imagery," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Montreal, BC, Canada, Oct. 2021, pp. 2065–2074.
[24] K. Vo, S. Truong, K. Yamazaki, B. Raj, M.-T. Tran, and N. Le, "AOE-Net: Entities interactions modeling with adaptive attention mechanism for temporal action proposals generation," Int. J. Comput. Vis., vol. 131, no. 1, pp. 302–323, Jan. 2023.
[25] K. Yamazaki, K. Vo, S. Truong, B. Raj, and N. Le, "VLTinT: Visual-linguistic transformer-in-transformer for coherent video paragraph captioning," 2022, arXiv:2211.15103.
[26] J. Shao, X. Wen, B. Zhao, and X. Xue, "Temporal context aggregation for video retrieval with contrastive learning," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2021, pp. 3267–3277.
[27] S. Yu, C. Wang, L. Xiang, and J. Wu, "TCA-VAD: Temporal context alignment network for weakly supervised video anomaly detection," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Taipei, Taiwan, Jul. 2022, pp. 1–6.
[28] Y. Pu, X. Wu, L. Yang, and S. Wang, "Learning prompt-enhanced context features for weakly-supervised video anomaly detection," 2023, arXiv:2306.14451.
[29] H. Zhou, J. Yu, and W. Yang, "Dual memory units with uncertainty regulation for weakly supervised video anomaly detection," in Proc. AAAI Conf. Artif. Intell., Washington, DC, USA, vol. 37, 2023, pp. 3769–3777.
[30] M. H. Sharif, L. Jiao, and C. W. Omlin, "CNN-ViT supported weakly-supervised video segment level anomaly detection," Sensors, vol. 23, no. 18, p. 7734, Sep. 2023.
[31] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Santiago, Chile, Dec. 2015, pp. 4489–4497.
[32] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks for action recognition in videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 11, pp. 2740–2755, Nov. 2019.
[33] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 1–9.
[35] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, and G. Krueger, "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., vol. 139, 2021, pp. 8748–8763.
[36] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–11.
[37] L. Harold Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, "VisualBERT: A simple and performant baseline for vision and language," 2019, arXiv:1908.03557.
[38] Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan, "Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm," 2021, arXiv:2110.05208.
[39] S. Li, F. Liu, and L. Jiao, "Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection," in Proc. AAAI Conf. Artif. Intell., 2022, vol. 36, no. 2, pp. 1395–1403.
[40] H. Lv, Z. Yue, Q. Sun, B. Luo, Z. Cui, and H. Zhang, "Unbiased multiple instance learning for weakly supervised video anomaly detection," in Proc.
‘‘Generalized video anomaly event detection: Systematic taxonomy and IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Vancouver, BC,
comparison of deep models,’’ 2023, arXiv:2302.05087. Canada, Jun. 2023, pp. 8022–8031.
[41] W. Liu, W. Luo, D. Lian, and S. Gao, ‘‘Future frame prediction for anomaly detection—A new baseline,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 6536–6545.
[42] H. K. Joo, K. Vo, K. Yamazaki, and N. Le, ‘‘CLIP-TSA: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection,’’ in Proc. IEEE Int. Conf. Image Process. (ICIP), Kuala Lumpur, Malaysia, Oct. 2023, pp. 3230–3234.
[43] W. Zhu, P. Qiu, O. M. Dumitrascu, and Y. Wang, ‘‘PDL: Regularizing multiple instance learning with progressive dropout layers,’’ 2023, arXiv:2308.10112.
[44] P. Wu and J. Liu, ‘‘Learning causal temporal relation and feature discrimination for anomaly detection,’’ IEEE Trans. Image Process., vol. 30, pp. 3513–3527, 2021.
[45] Y. Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y.-C. Wu, ‘‘MGFN: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection,’’ in Proc. AAAI Conf. Artif. Intell., Jun. 2023, vol. 37, no. 1, pp. 387–395.
[46] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, and L. Antiga, ‘‘PyTorch: An imperative style, high-performance deep learning library,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–12.
[47] S. Gollapudi, Learn Computer Vision Using OpenCV. Berlin, Germany: Springer, 2019.
[48] T. Dozat, ‘‘Incorporating Nesterov momentum into Adam,’’ in Proc. 4th Int. Conf. Learn. Represent., Workshop, 2016, pp. 1–4.
[49] M. Z. Zaheer, A. Mahmood, H. Shin, and S.-I. Lee, ‘‘A self-reasoning framework for anomaly detection using video-level labels,’’ IEEE Signal Process. Lett., vol. 27, pp. 1705–1709, 2020.
[50] B. Wan, Y. Fang, X. Xia, and J. Mei, ‘‘Weakly supervised video anomaly detection via center-guided discriminative learning,’’ in Proc. IEEE Int. Conf. Multimedia Expo (ICME), London, U.K., Jul. 2020, pp. 1–6.
[51] S. Majhi, S. Das, and F. Brémond, ‘‘DAM: Dissimilarity attention module for weakly-supervised video anomaly detection,’’ in Proc. 17th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Nov. 2021, pp. 1–8.
[52] S. Yu, C. Wang, Q. Mao, Y. Li, and J. Wu, ‘‘Cross-epoch learning for weakly supervised anomaly detection in surveillance videos,’’ IEEE Signal Process. Lett., vol. 28, pp. 2137–2141, 2021.
[53] J.-C. Feng, F.-T. Hong, and W.-S. Zheng, ‘‘MIST: Multiple instance self-training framework for video anomaly detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Nashville, TN, USA, Jun. 2021, pp. 14004–14013.
[54] M. Z. Zaheer, A. Mahmood, M. Astrid, and S.-I. Lee, ‘‘Clustering aided weakly supervised training to detect anomalous events in surveillance videos,’’ IEEE Trans. Neural Netw. Learn. Syst., early access, May 26, 2024, doi: 10.1109/TNNLS.2023.3274611.
[55] C. Cao, X. Zhang, S. Zhang, P. Wang, and Y. Zhang, ‘‘Weakly supervised video anomaly detection based on cross-batch clustering guidance,’’ in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2023, pp. 2723–2728.
[56] W. Tan, Q. Yao, and J. Liu, ‘‘Overlooked video classification in weakly supervised video anomaly detection,’’ in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. Workshops (WACVW), Jan. 2024, pp. 202–210.
[57] S. Yi, Z. Fan, and D. Wu, ‘‘Batch feature standardization network with triplet loss for weakly-supervised video anomaly detection,’’ Image Vis. Comput., vol. 120, Apr. 2022, Art. no. 104397.
[58] Y. Gong, C. Wang, X. Dai, S. Yu, L. Xiang, and J. Wu, ‘‘Multi-scale continuity-aware refinement network for weakly supervised video anomaly detection,’’ in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Taipei, Taiwan, Jul. 2022, pp. 1–6.
[59] S. Majhi, R. Dai, Q. Kong, L. Garattoni, G. Francesca, and F. Bremond, ‘‘Human-scene network: A novel baseline with self-rectifying loss for weakly supervised video anomaly detection,’’ 2023, arXiv:2301.07923.
[60] S. Park, H. Kim, M. Kim, D. Kim, and K. Sohn, ‘‘Normality guided multiple instance learning for weakly supervised video anomaly detection,’’ in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 2664–2673.
[61] S. Sun and X. Gong, ‘‘Long-short temporal co-teaching for weakly supervised video anomaly detection,’’ 2023, arXiv:2303.18044.
[62] L. Wang, X. Wang, F. Liu, M. Li, X. Hao, and N. Zhao, ‘‘Attention-guided MIL weakly supervised visual anomaly detection,’’ Measurement, vol. 209, Mar. 2023, Art. no. 112500.
[63] C. Cao, X. Zhang, S. Zhang, P. Wang, and Y. Zhang, ‘‘Adaptive graph convolutional networks for weakly supervised anomaly detection in videos,’’ IEEE Signal Process. Lett., vol. 29, pp. 2497–2501, 2022.
[64] G. Pang, C. Yan, C. Shen, A. van den Hengel, and X. Bai, ‘‘Self-trained deep ordinal regression for end-to-end video anomaly detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2020, pp. 12170–12179.
[65] D.-L. Wei, C.-G. Liu, Y. Liu, J. Liu, X.-G. Zhu, and X.-H. Zeng, ‘‘Look, listen and pay more attention: Fusing multi-modal information for video violence detection,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Singapore, May 2022, pp. 1980–1984.
[66] D. Wei, Y. Liu, X. Zhu, J. Liu, and X. Zeng, ‘‘MSAF: Multimodal supervise-attention enhanced fusion for video anomaly detection,’’ IEEE Signal Process. Lett., vol. 29, pp. 2178–2182, 2022.
[67] C. Zhang, G. Li, Y. Qi, S. Wang, L. Qing, Q. Huang, and M.-H. Yang, ‘‘Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Vancouver, BC, Canada, Jun. 2023, pp. 16271–16280.
[68] Y. Pu and X. Wu, ‘‘Audio-guided attention network for weakly supervised violence detection,’’ in Proc. 2nd Int. Conf. Consum. Electron. Comput. Eng. (ICCECE), Guangzhou, China, Jan. 2022, pp. 219–223.
[69] J. Yu, J. Liu, Y. Cheng, R. Feng, and Y. Zhang, ‘‘Modality-aware contrastive instance learning with self-distillation for weakly-supervised audio-visual violence detection,’’ in Proc. 30th ACM Int. Conf. Multimedia, Lisboa, Portugal, Oct. 2022, pp. 6278–6287.

JUNGPIL SHIN (Senior Member, IEEE) received the B.Sc. degree in computer science and statistics and the M.Sc. degree in computer science from Pusan National University, South Korea, in 1990 and 1994, respectively, and the Ph.D. degree in computer science and communication engineering from Kyushu University, Japan, in 1999, under a scholarship from the Japanese Government (MEXT). He was an Associate Professor, a Senior Associate Professor, and a Full Professor with the School of Computer Science and Engineering, The University of Aizu, Japan, in 1999, 2004, and 2019, respectively. He has coauthored more than 400 published papers in widely cited journals and conferences. His research interests include pattern recognition, image processing, computer vision, machine learning, human–computer interaction, non-touch interfaces, human gesture recognition, automatic control, Parkinson's disease diagnosis, ADHD diagnosis, user authentication, machine intelligence, bioinformatics, handwriting analysis, recognition, and synthesis. He is a member of ACM, IEICE, IPSJ, KISS, and KIPS. He served as the program chair and as a program committee member for numerous international conferences. He serves as an Editor for IEEE journals, Springer, Sage, Taylor & Francis, Sensors (MDPI), Electronics (MDPI), and Tech Science, as an Editorial Board Member for Scientific Reports, and as a reviewer for several major IEEE and SCI journals.

YUTA KANEKO is currently pursuing the bachelor's degree in computer science and engineering with The University of Aizu (UoA), Japan. He joined the Pattern Processing Laboratory, UoA, in April 2023, under the direct supervision of Dr. Jungpil Shin, where he works on human activity recognition. His research interests include computer vision, pattern recognition, and deep learning.
ABU SALEH MUSA MIAH (Member, IEEE) received the B.Sc. (Eng.) and M.Sc. (Eng.) degrees in computer science and engineering from the Department of Computer Science and Engineering, University of Rajshahi, Rajshahi, Bangladesh, in 2014 and 2015, respectively, and the Ph.D. degree in computer science and engineering from The University of Aizu, Japan, in 2024, under a scholarship from the Japanese Government (MEXT). He assumed the positions of a Lecturer and an Assistant Professor with the Department of Computer Science and Engineering, Bangladesh Army University of Science and Technology (BAUST), Saidpur, Bangladesh, in 2018 and 2021, respectively. He has been a Visiting Researcher (postdoctoral researcher) with The University of Aizu, since April 2024. He has authored or coauthored more than 50 publications in widely cited journals and conferences. His research interests include AI, ML, DL, human activity recognition (HAR), hand gesture recognition (HGR), movement disorder detection, Parkinson's disease (PD), HCI, BCI, and neurological disorder detection.

SATOSHI NISHIMURA (Member, IEEE) received the B.E. degree from Tohoku University, in 1987, and the M.Sc. and D.Sc. degrees in information science from The University of Tokyo, in 1989 and 1995, respectively. He is currently a Senior Associate Professor with The University of Aizu. His research interests include computer graphics and computer music. He is a member of ACM and IPSJ.