Anomaly Detection in Weakly Supervised Videos Using Multistage Graphs and General Deep Learning Based Spatial-Temporal Feature Enhancement
ABSTRACT Weakly supervised video anomaly detection (WS-VAD) is a crucial research domain in
computer vision for the implementation of intelligent surveillance systems. Many researchers have been
working to develop WS-VAD systems using various technologies by assessing anomaly scores. However,
they still face challenges because of a lack of effective feature extraction. To mitigate this limitation,
we propose a multi-stage deep-learning model for separating abnormal events from normality to extract the
hierarchical effective features. In the first stage, we extract two stream features using pre-trained techniques:
the first stream employs a ViT-based CLIP module to select top-k features, while the second stream
utilizes a CNN-based I3D module integrated into the Temporal Contextual Aggregation (TCA) mechanism.
These features are concatenated and fed into the second-stage module, where an Uncertainty-regulated
Dual Memory Units (UR-DMU) model is employed to learn representations of regular and abnormal
data simultaneously. The UR-DMU integrates global and local structures, leveraging Graph Convolutional
Networks (GCN) and Global and Local Multi-Head Self Attention (GL-MHSA) modules to capture video
associations. Subsequently, feature reduction is achieved using the multilayer-perceptron (MLP) integration
with the Prompt-Enhanced Learning (PEL) module via the knowledge-based prompt. Finally, we employed
a classifier module to predict the snippet-level anomaly scores. In the training phase, an MIL-based loss function transforms the
snippet-level scores into bag-level predictions so that the model learns high activation in anomalous cases.
Our approach integrates these cutting-edge technologies and methodologies, offering a comprehensive
solution to video-based anomaly detection. Extensive experiments on ShanghaiTech, XD-Violence, and
UCF-Crime datasets validate the superiority of our method over state-of-the-art approaches by a substantial
margin. We believe that our model holds significant promise for real-world applications, demonstrating
superior performance and efficacy in anomaly detection tasks.
INDEX TERMS Temporal contextual aggregation (TCA), uncertainty-regulated dual memory units (UR-DMU),
graph convolutional networks, global/local multi-head self-attention (GL-MHSA), weakly
supervised video anomaly detection (WS-VAD).
I. INTRODUCTION
Fully supervised, unsupervised, and weakly supervised are the three prevailing paradigms in the field of video anomaly event detection (VAED). The fully supervised paradigm is primarily characterized by its exceptional performance [1]. However, it is important to note that the training data for this paradigm necessitates the inclusion of frame-level normal or abnormal annotations, which in turn requires video annotators to identify and label abnormalities within the videos. Given that abnormalities can occur at any given moment, it becomes imperative for the annotators to inspect nearly all frames. Regrettably, the process of accumulating a fully annotated large-scale dataset for supervised VAED can be both non-automated and time-consuming. On the other hand, the unsupervised paradigm involves training models exclusively on samples of normal events. It is based on the common assumption that unseen anomaly videos will exhibit high reconstruction errors [2], [3], [4]. Unfortunately, the performance of unsupervised VAED tends to be substandard. This can be attributed to its limited comprehension of anomalies and its incapacity to encompass various forms of normality variants [5]. Consequently, weakly supervised approaches are widely regarded as the most viable paradigm. They outshine both unsupervised and supervised paradigms due to their competitive performance and cost-effectiveness regarding annotations. These approaches reduce cost by utilizing video-level labels instead of laborious fine-grained annotations [6], [7]. In recent times, weakly supervised VAED (WVAED) has evolved into a well-established technical path of research for VAED [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. The WVAED issue is primarily perceived as a multiple instance learning (MIL) problem [8]. Generally speaking, WVAED models directly generate scores for anomalies by comparing the spatiotemporal features of normal and abnormal events through the MIL technique. The MIL approach deals with training data organized into sets known as positive and negative bags. In the context of MIL, a video is seen as a bag containing numerous instances, where each instance corresponds to a video snippet. A negative bag encompasses the entirety of normal snippets, whereas a positive bag encompasses both normal and abnormal snippets without any indication of the temporal boundaries of abnormal events. The conventional MIL framework assumes that all negative bags exclusively contain negative snippets and that positive bags contain at least one positive snippet. Supervision is solely provided for complete sets, and the individual labels of the snippets within the bags are not given [17]. The outputs of WVAED are inherently more reliable than those of unsupervised VAED due to its ability to comprehend the fundamental variability between normal and abnormal data [18]. However, in the WVAED approach, the frames labelled as abnormal in the positive bag are often influenced by the frames labelled as normal in the negative bag, making it challenging to distinctly identify an abnormality in contrast to normality. Consequently, the detection of anomalous snippets can become problematic. Numerous researchers (e.g., [8], [9], [10], [19], [20]) have endeavoured to address this issue by employing MIL frameworks.

Many of the existing methodologies encode the extracted visual content by utilizing a backbone such as C3D [21] and I3D [22], which have been pre-trained on tasks related to action recognition. Nevertheless, video anomaly detection (VAD) requires representations that are able to clearly depict the events occurring in a given scene. Consequently, these current backbones are unsuitable for VAD due to the existence of a domain gap. In order to overcome this limitation, Joo et al. [20] drew inspiration from recent achievements in vision-language research, specifically the works of [23], [24], and [25], which demonstrated the effectiveness of feature representations derived from contrastive language-image pretraining (CLIP). To achieve this, they employed the visual features encoded by the vision transformer (ViT) from CLIP. However, it is worth noting that the performance of WVAED methods based on MIL heavily relies on pre-trained feature extractors.

The drawback of that study is that it processed the video in individual frames or short clips, missing long-range semantic contextual information. To overcome this problem, Shao et al. proposed a temporal context aggregation (TCA) framework for video representation learning. This approach integrates long-range temporal context among frame-level features through the self-attention mechanism [26], [27]. They used contrastive learning to reduce the loss or error rate in the evaluation. To enhance temporal features further, Tian et al. employed robust temporal feature magnitude (RTFM) learning together with a multiple instance learning (MIL) loss [19]. They reported 84.30% and 97.21% AUC for the UCF-Crime and ShanghaiTech datasets, respectively. To improve the AUC by increasing feature effectiveness, Pu et al. employed TCA to enhance long-range dependency and prompt-enhanced learning (PEL) instead of contrastive learning to increase the correct prediction rate by reducing the error [28]. They employed an MLP with PEL to reduce the features and causal convolution (CC) for the classification. PEL mainly integrates semantic priors utilizing knowledge-based prompts, aiming to increase the recognition rate by boosting the discriminative capacity while ensuring high separability between anomaly subclasses; finally, they calculated the score and the error rate with the MIL loss function, reporting AUC rates of 86.76%, 85.59% and 98.14% for the UCF-Crime, XD-Violence and ShanghaiTech datasets, respectively. To improve the recognition rate, Zhou et al. proposed a new temporal feature extraction scheme using graph-based transformers, namely Uncertainty-Regulated Dual Memory Units (UR-DMU), on top of I3D backbone pre-trained features [29]. They reported 86.97% and 94.02% for the UCF-Crime and XD-Violence datasets, respectively. To improve the performance further, Sharif et al. more recently proposed a two-stream pre-trained feature-based temporal feature enhancement module, where they first extracted CNN-based I3D features in the first stream with a selective top-k score and ViT-based CLIP features in the second stream [30]. Finally, they fused the two features and employed an MLP and a classification module for the classification. They reported 88.97% and 98.66% AUC for the UCF-Crime and ShanghaiTech datasets, respectively.
The drawback of this model is that it did not achieve satisfactory performance for real-time deployment due to a lack of feature effectiveness. Also, it utilized CNN- and ViT-based pre-trained model features and temporal feature enhancement, but it did not consider graph-based feature enhancement or spatial feature enhancement in the module. Likewise, UR-DMU [29] utilized graph-based feature enhancement but did not address time-varying enhancement, and TCA [26], [28] reflected the opposite problem. In addition, UR-DMU [29], TCA [26], [28] and I3D-CLIP [30] all lack the ability to extract every useful kind of feature. These research groups inspired us to extract all possible kinds of features to increase the anomaly detection rate. To overcome these challenges, we propose a multi-stage graph and general deep learning (DL) feature enhancement-based anomaly detection system. In this study, we include CNN- and ViT-based pre-trained features, temporal features, graph-based temporal features and spatial enhancement of the features. The main contributions of the proposed model are given below:

• Stage 1: General Deep Learning Model Based Dual-Stream Feature Extraction: The first stage of our methodology is characterized by the innovative use of two streams, each contributing distinct yet complementary features to the anomaly detection process. Leveraging CLIP and I3D, we extract rich semantic information and spatiotemporal features, respectively, setting a solid foundation for subsequent analysis. Building upon the extracted features, we seamlessly integrate them into the Temporal Contextual Aggregation (TCA) mechanism. This module helps to capture comprehensive contextual information by reusing the similarity matrix and implementing adaptive fusion. This integration facilitates the effective capture of temporal dependencies across video frames, enhancing the model's ability to discern anomalous patterns amidst dynamic scenes.

• Stage 2: Graph-Based UR-DMU Model Integration and Refinement: In the second stage, we introduce the Uncertainty-regulated Dual Memory Units (UR-DMU) model, renowned for its ability to simultaneously learn representations of regular and abnormal data. By incorporating global and local structures through GCN and Global/Local Multi-Head Self Attention (GL-MHSA) modules, our model captures intricate associations within video data. Additionally, refinement through a Multi-Layer Perceptron (MLP) enables non-linear mapping, further enhancing the model's discriminatory capabilities.

• Stage 3: Feature Reduction, Classification and Evaluation with Impact and Promise: In the third stage, we use a feature reduction module based on a two-layer multilayer perceptron (MLP) integrated with PEL to refine and learn discriminative features through knowledge-based prompts. This integration of non-linear mapping further enhances the model's ability to differentiate between normal and anomalous behaviour.

• Classification and Evaluation: Finally, we employ a classifier module to predict the snippet-level anomaly scores. In the training phase, an MIL-based loss function transforms the snippet-level scores into bag-level predictions so that the model learns high activation in anomalous cases. We evaluate the proposed model with three benchmark datasets, namely the UCF-Crime, ShanghaiTech, and XD-Violence datasets. The extensive performance results prove the superiority of the proposed model. Through the integration of these cutting-edge technologies and methodologies, our approach offers a comprehensive solution to video-based anomaly detection. We believe that our model holds significant promise for real-world applications, demonstrating superior performance and efficacy in anomaly detection tasks.
II. LITERATURE REVIEW
The methodologies utilized in WVAED rely on labels at the video level, which consistently adhere to the MIL ranking framework [8]. According to the MIL approach, a regression model is trained using the WVAED method with the assumption that the maximum score of the positive bag is greater than that of the negative bag in order to assign scores for video snippets. The works of [8], [9], [19], [6], and [11] all incorporated pre-trained models based on convolutional neural networks into their experimental setups. In addition, Sultani et al. [8] meticulously curated pre-annotated normal and abnormal video events at the video level to construct the widely recognized UCF-Crime dataset. The dataset was employed for anomaly detection by utilizing a weakly supervised framework. Within the confines of this framework, C3D features [31] were extracted for video segments, and then a ranking loss function was used to train a fully connected neural network (FCNN). The purpose of this function was to compute the loss between the most highly scored rank examples in the positive bag and the negative bag. Tian et al. [19] presented a model that utilized the C3D [31] and I3D [22] models as feature extractors in their WVAED model. They contended that by selecting the top-3 features based on their magnitude, a more pronounced differentiation can be achieved between normal and abnormal videos (AVs). Specifically, in cases where multiple abnormal snippets exist within an anomalous video, the average snippet feature magnitude of the anomalous video surpasses that of normal videos (NVs). Zhang et al. [9] presented a model to extract positive and negative video-segmented C3D features by using a temporal convolution network [31]. Specifically, they trained the network between the previous adjacent segment and the current segment. Further, they used inner and outer bag ranking losses to train the model based on two branches of an FCNN. This loss accounted for the highest- and lowest-scoring parts of the positive bags and negative bags.

Similarly, Zhong et al. [6] and Zhu et al. [11] implemented models that trained a feature-based encoder and classifier simultaneously. Zhong et al. [6] analyzed WVAED and treated it as a supervised learning problem under noisy labels. Extensive experiments were undertaken to evaluate the universal applicability of their model, using both the temporal segment network [32] and C3D [31]. Zhu and Newsam [11] integrated an attention block into their MIL ranking model to account for temporal context. They claimed that motion information features extracted by C3D [31] and I3D [22] outperformed features obtained from individual images using the pre-trained models VGG16 [33] and Inception [34]. ViT-based pre-trained models can be classified into two types: single-stream and dual-stream. In the single-stream approach, text and picture (or video) representations are modeled using a single transformer in a single framework, while the dual-stream model uses a decoupled encoder to encode text and image (or video) separately. Among the most notable ViT feature extractors are CLIP [35], ViLBERT [36], VisualBERT [37], and data-efficient CLIP [38]. For the WVAED problem, Joo et al. [20] recently presented a CLIP-assisted [35] temporal self-attention framework. They conducted experiments on openly accessible datasets to validate their end-to-end WVAED model. Li et al. [39] presented a multi-instance learning network based on transformers to obtain anomaly scores for both videos and snippets. They used the video-level anomaly probability in the inference stage to lessen the snippet-level anomaly score's volatility. Lv et al. [40] introduced an unbiased MIL scheme that trained an unbiased anomaly classifier and a tailored representation for WVAED. In view of the available solutions, it has been observed that, in general, CNN and ViT are typically utilized in isolation. To leverage the benefits offered by both CNN- and ViT-based pre-trained models, an architecture known as CNN-ViT-TSAN, supported by multiple instance learning (MIL), was devised to establish a range of models for addressing the WVAED problem. The drawback of that study is that it processed the video in individual frames or short clips, missing long-range semantic contextual information. To overcome this problem, Shao et al. proposed a TCA framework for video representation learning, which integrates long-range temporal context among frame-level features through the self-attention mechanism [26], [27]. They used contrastive learning to reduce the loss or error rate in the evaluation. To enhance temporal features further, Tian et al. employed robust temporal feature magnitude (RTFM) learning together with an MIL loss [19]. They reported 84.30% and 97.21% AUC for the UCF-Crime and ShanghaiTech datasets, respectively. To improve the AUC by increasing feature effectiveness, Pu et al. employed TCA to enhance long-range dependency and PEL instead of contrastive learning to increase the correct prediction rate by reducing the error [28]. They employed an MLP with PEL to reduce the features and causal convolution (CC) for the classification. PEL mainly integrates semantic priors utilizing knowledge-based prompts, aiming to increase the recognition rate by boosting the discriminative capacity while ensuring high separability between anomaly subclasses; finally, they calculated the score and the error rate with the MIL loss function, reporting AUC rates of 86.76%, 85.59%, and 98.14% for the UCF-Crime, XD-Violence and ShanghaiTech datasets, respectively. To improve the recognition rate, Zhou et al. proposed a new temporal feature extraction scheme using graph-based transformers, namely Uncertainty-Regulated Dual Memory Units (UR-DMU), on top of I3D backbone pre-trained features [29]. They reported 86.97% and 94.02% for the UCF-Crime and XD-Violence datasets, respectively. To improve the performance further, Sharif et al. more recently proposed a two-stream pre-trained feature-based temporal feature enhancement module, where they first extracted CNN-based I3D features in the first stream with a selective top-k score and ViT-based CLIP features in the second stream [30]. Finally, they fused the two features and employed an MLP and a classification module for the classification. They reported 88.97% and 98.66% AUC for the UCF-Crime and ShanghaiTech datasets, respectively. The drawback of this model is that it did not achieve satisfactory performance for real-time deployment due to a lack of feature effectiveness. Also, it utilized CNN- and ViT-based pre-trained model features and temporal feature enhancement, but it did not consider graph-based feature enhancement or spatial feature enhancement in the module. Likewise, UR-DMU [29] utilized graph-based feature enhancement but did not address time-varying enhancement, and TCA [26], [28] reflected the opposite problem. In addition, UR-DMU [29], TCA [26], [28] and I3D-CLIP [30] all lack the ability to extract every useful kind of feature. These research groups inspired us to extract all possible kinds of features to increase the anomaly detection rate. To overcome these challenges, we propose a multi-stage graph and general DL feature enhancement-based anomaly detection system. In this study, we include CNN- and ViT-based pre-trained features, temporal features, graph-based temporal features, and spatial enhancement of the features.
III. DATASET
Anomaly detection datasets play a crucial role in developing and evaluating algorithms aimed at identifying irregular or unexpected events within data streams. These datasets provide diverse scenarios, allowing researchers to train and test their models under various conditions. Many datasets are available for anomaly detection; we used the most widely used benchmark datasets, namely ShanghaiTech [41], UCF-Crime [8], and XD-Violence [29], which offer different scales, backgrounds, and types of anomalies, catering to different research needs. By utilizing these datasets, researchers can benchmark their anomaly detection methods, assess their performance, and contribute to advancing the field of anomaly detection in real-world applications.

A. ShanghaiTech DATASET
This dataset is a medium-scale anomaly dataset comprising 317,398 frames of video clips. These clips capture scenes from various locations within the ShanghaiTech Campus. The dataset includes 13 distinct background scenes, consisting of 307 normal videos (NVs) and 130 anomaly videos. This early dataset [41] serves as a common benchmark for VAED. In this dataset, the training set contains NVs, while the testing set contains both normal and anomalous videos. In order to create a weakly supervised training set that encompasses all 13 background scenes, Zhong et al. [6] reorganized the dataset. Their approach involved selecting a subset of anomalous testing videos and using them as training data. We followed the procedure outlined by Zhong et al. [6] to transform the ShanghaiTech dataset into this weakly supervised setting.

B. UCF-CRIME DATASET
This is a large-scale anomaly detection dataset that includes 1900 untrimmed videos collected from real-world street and indoor surveillance cameras, totalling 128 hours of video. The dataset covers 13 different real-world anomalies: abuse, arrest, arson, assault, accident, burglary, explosion, fighting, robbery, shooting, stealing, shoplifting, and vandalism. Unlike the static background in the ShanghaiTech dataset [41], the UCF-Crime [8] dataset has more complicated and diverse backgrounds. The training set of the UCF-Crime dataset contains 1610 videos, with 800 labeled as normal and 810 labeled as anomalous. The testing set contains 290 videos, with 150 labeled as normal and 140 labeled as anomalous, and includes frame-level labels.

C. XD-VIOLENCE
The XD-Violence dataset comprises a variety of media formats, specifically videos and audio. The dataset encompasses a diverse range of backgrounds, such as movies, games, and live scenes. It consists of a total of 4754 videos, with 3954 videos designated for training and equipped with video-level labels. Additionally, 800 testing videos have been labelled at the frame level [29].
IV. PROPOSED METHODOLOGY
In the preprocessing step, we extracted features from each sample, using 10-crop augmentation for the UCF-Crime and ShanghaiTech datasets and 5-crop augmentation for the XD-Violence dataset using pre-trained models. We divided the untrimmed video into non-overlapping snippets by utilizing a 16-frame sliding window. Then, we introduce a multi-backbone framework, combining a pre-trained CLIP model with an I3D model pre-trained on Kinetics. This dual-backbone approach leverages the strengths of both architectures to enhance the feature extraction process for video anomaly detection. Subsequently, this enhanced feature set is streamlined via TCA, CNN, UR-DMU and a two-layer multilayer perceptron (MLP), optimizing it for further analysis. The procedure consists of three stages. The first stage is constructed with two streams: in the initial stream, we utilise CLIP (Contrastive Language-Image Pre-training) and select the top-k features, which are considered the first-stream features. In the second stream, we use a pre-trained I3D (Inflated 3D ConvNet) network to extract rich spatiotemporal information that is fed into the Temporal Contextual Aggregation (TCA) mechanism for integrating contextual information across frames, effectively capturing temporal dependencies. Then, we employ a CNN module for spatial enhancement of the TCA output to extract spatiotemporal features from video frames as the second-stream feature. We concatenate the CLIP-based first-stream and the I3D-based second-stream features and feed them into the UR-DMU [29] model, which employs dual memory units to learn representations of regular data and discriminative features of abnormal data simultaneously. This model incorporates both global and local structures through GCN [14], [15], [16] and Global/Local Multi-Head Self Attention (GL-MHSA) modules, facilitating the capture of associations in videos. In the third stage, we use a feature reduction module consisting of a two-layer multilayer perceptron (MLP) integrated with PEL to refine and learn discriminative features through knowledge-based prompts. This integration of non-linear mapping further enhances the model's ability to differentiate between normal and anomalous behaviour. Finally, we employ a classifier module to predict the snippet-level anomaly scores. In the training phase, an MIL-based loss function transforms the snippet-level scores into bag-level predictions so that the model learns high activation in anomalous cases. By integrating these cutting-edge technologies, our model offers a comprehensive approach to video-based anomaly detection, promising superior performance in real-world applications.
A. PREPROCESSING
Each untrimmed video V_v with N_v frames is divided into a set of non-overlapping snippets denoted as {γ_i}, i = 1, ..., ⌊N_v/∆⌋, where each snippet contains the same number of frames ∆. In the preprocessing, we followed the existing systems: first, we divided the untrimmed video into non-overlapping snippets by utilizing a 16-frame sliding window (∆ = 16) [28], [29], [42]. Then, we extracted features from each sample, using 10-crop augmentation for the UCF-Crime and ShanghaiTech datasets and 5-crop augmentation for the XD-Violence dataset with the pre-trained models of stage 1 [28], [29], [42].
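To make this preprocessing concrete, the following is a minimal sketch, assuming frames arrive as a NumPy array; the 16-frame snippet length and the 10-crop scheme follow the description above, while the crop size and array names are illustrative only.

```python
import numpy as np

def split_into_snippets(frames: np.ndarray, snippet_len: int = 16) -> np.ndarray:
    """Divide a video (N, H, W, C) into non-overlapping snippets of `snippet_len` frames.

    Trailing frames that do not fill a complete snippet are dropped, so the
    result has shape (floor(N / snippet_len), snippet_len, H, W, C).
    """
    n_snippets = frames.shape[0] // snippet_len
    frames = frames[: n_snippets * snippet_len]
    return frames.reshape(n_snippets, snippet_len, *frames.shape[1:])

def ten_crop(frame: np.ndarray, size: int = 224) -> list:
    """10-crop augmentation: four corners and the centre, plus horizontal flips."""
    h, w = frame.shape[:2]
    coords = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
              ((h - size) // 2, (w - size) // 2)]
    crops = [frame[y:y + size, x:x + size] for y, x in coords]
    return crops + [np.flip(c, axis=1) for c in crops]

# Example with a dummy 100-frame video of 240x320 RGB frames.
video = np.zeros((100, 240, 320, 3), dtype=np.uint8)
snippets = split_into_snippets(video)      # shape (6, 16, 240, 320, 3)
crops = ten_crop(snippets[0, 8])           # 10 augmented views of one frame
print(snippets.shape, len(crops))
```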
B. STAGE 1: PRETRAINED MODEL-BASED FEATURE EXTRACTION
In the first stage, we introduce a multi-backbone framework, combining a pre-trained CLIP model with an I3D model pre-trained on Kinetics. It is important to note that our I3D model is configured to process only RGB input. In this architecture, the I3D RGB model extracts features in a 1024-dimensional space, while the CLIP model provides feature vectors in 512 dimensions. This dual-backbone approach leverages the strengths of both architectures to enhance the feature extraction process for video anomaly detection.

1) ViT TRANSFORMER BASED CLIP FEATURE EXTRACTION STREAM
In the first stream, we employ a pre-trained CLIP model to extract features and then nominate the top scores to select the most relevant video snippets.
a: PRETRAINED CLIP FEATURES
CLIP leverages a unified framework for understanding both text and image data, enabling it to capture rich semantic information from video frames. Vision-language pre-trained models leverage ViTs to capture the correlations between objects or actions depicted in a video and those described in textual content. These sophisticated models excel at extracting intricate relationships between visual and linguistic elements, thereby facilitating comprehensive understanding and analysis across modalities. Many researchers have used the concept of ViT as a backbone for different kinds of transformers, namely ViLBERT [36], CLIP [35], VisualBERT [37] and data-efficient CLIP [38], aiming to develop different kinds of language models and multi-modal vision models. Generally, CLIP [35] serves as a multi-modal vision and language model, harnessing a ViT as its foundational framework for extracting visual features. We consider the middle frame d_j = ⌈∆/2⌉ of each snippet γ_j, which means we do not consider all frames at a time. In our study, we employ CLIP on the frame d_j of snippet γ_j to represent its features as φ_j^v ∈ R^ℵ, where ℵ is the feature dimension, and the final feature vector can be constructed as φ_vit^v = {φ_j^v}, j = 1, ..., T, with φ_vit^v ∈ R^{T×ℵ} [30]. We used pre-trained CLIP models in the first stream to extract effective features from each video. The CLIP model provides feature vectors in 512 dimensions.
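This CLIP feature extraction can be sketched with the Hugging Face transformers implementation of CLIP; the checkpoint name below is an assumption (the exact CLIP variant used is not specified above), and the function simply encodes one middle frame per snippet into a 512-dimensional vector, matching φ_j^v.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper's exact ViT-based CLIP variant may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def clip_snippet_features(middle_frames):
    """Encode the middle frame of every snippet with the CLIP image encoder.

    middle_frames: list of T PIL images (one per snippet).
    Returns a (T, 512) tensor, i.e. the phi_vit feature of the first stream.
    """
    inputs = processor(images=middle_frames, return_tensors="pt")
    return model.get_image_features(**inputs)

# Example with dummy frames.
frames = [Image.new("RGB", (320, 240)) for _ in range(8)]
features = clip_snippet_features(frames)
print(features.shape)   # torch.Size([8, 512])
```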
b: TOP-K SCORE NOMINATOR
The output of the CLIP model is fed into the k-score selection module, which is demonstrated in Figure 2. The top-k score nominator, as described in [42], is a crucial component for selecting the most relevant video snippets. The scores are processed to identify the most relevant snippets, ensuring that the snippets with the highest relevance, indicated by their score values, are selected for further processing. This module involves cloning the input vector, which comes from the CLIP model output and is known as a score vector. Then, we add Gaussian noise and calculate the magnitude. Based on the magnitude value, we select the top-k scores. This top-k score nominator is integral for focusing the model's attention on the most significant parts of the video.
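A minimal sketch of this top-k score nominator follows; the noise scale and the use of the L2 norm as the magnitude are assumptions about details not fixed in the text.

```python
import torch

def topk_score_nominator(clip_feats: torch.Tensor, k: int, noise_std: float = 0.1):
    """Select the k most relevant snippets from CLIP features of shape (T, D).

    Following the description above: the feature tensor is cloned into a score
    vector, Gaussian noise is added, the L2 magnitude of each snippet is used
    as its score, and the top-k snippets are kept (noise_std is an assumption).
    """
    scores = clip_feats.clone() + noise_std * torch.randn_like(clip_feats)
    magnitude = scores.norm(p=2, dim=-1)                     # (T,)
    topk_idx = magnitude.topk(min(k, magnitude.numel())).indices
    topk_idx, _ = torch.sort(topk_idx)                       # keep temporal order
    return clip_feats[topk_idx], topk_idx

feats = torch.randn(32, 512)
selected, idx = topk_score_nominator(feats, k=8)
print(selected.shape, idx.shape)    # torch.Size([8, 512]) torch.Size([8])
```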
2) CNN BASED I3D FEATURE EXTRACTION STREAM
In the second branch of the first-stage module, we employ I3D and then enhance the information using the TCA and CNN modules. The I3D module allows for the extraction of robust spatio-temporal features from video sequences. By incorporating spatial and temporal information, this module effectively captures motion and appearance cues, enabling a comprehensive representation of video content. TCA plays a pivotal role in integrating contextual information across multiple frames. By considering temporal dependencies within video sequences, TCA enhances the model's ability to discern anomalies. This mechanism ensures that the model can effectively capture dynamic changes over time, improving anomaly detection accuracy. The incorporation of a 1D CNN, followed by ReLU activation and dropout regularization, contributes to feature dimensionality reduction while preserving essential information. This process ensures that the extracted features are concise yet informative, facilitating efficient anomaly detection without sacrificing discriminative power.
a: I3D FEATURES
I3D excels in capturing spatio-temporal features from video sequences, providing a robust representation of motion and appearance cues within the temporal context [22]. The CNN, one of the most widely used DL models, has great potential for image classification, and the CNN-based C3D (Convolutional 3D) [21] is among the most common feature extractors. Tran et al. [31] showed that C3D can model appearance and motion information simultaneously and outperform 2D CNN features in various video-analysis tasks. Technically, we calculate the I3D features of the T_v snippets in the dimension ℵ, i.e., φ_cnn^v = {φ_i^v}, i = 1, ..., T_v, with φ_cnn^v ∈ R^{T_v×ℵ} [30], where a specific video V_v contains T_v snippets. In lieu of employing PCA, we opt for a low-variance filter algorithm to reduce the dimensionality of the extracted data. After reducing the dimension, this produces the reduced feature of this stream, whose dimension can be expressed as φ́_cnn^v ∈ R^{T_v×ℵ́}, derived from φ_cnn^v ∈ R^{T_v×ℵ}, where ℵ is the feature dimension extracted from the T_v snippets.
b: TEMPORAL CONTEXT AGGREGATION MODULE (TCA)
To enhance the temporal contextual information of the I3D features, we used the TCA model [28]. TCA facilitates the integration of contextual information across multiple frames, enhancing the model's ability to discern anomalies by considering temporal dependencies effectively. It is mainly used as a video representation learning framework that incorporates long-range temporal information between frame-level features using the self-attention mechanism [26], [28]. It captures temporal relationships from both local and global perspectives. Figure 3 demonstrates the TCA calculation procedure, where X is the output of the I3D module, which is projected into a latent space utilizing various linear layers and finally produces the similarity matrix as below:

M = f_q(X) · f_k(X)^⊤ (1)
A^g = softmax(M / √D_h) (2)
X^g = A^g · f_v(X) (3)

Here, the query, key and value are represented by f_q(·), f_k(·) and f_v(·), ⊤ denotes the transpose operation, and the dimension of the hidden space is represented by D_h. In addition, A^g denotes the global attention, and X^g represents the global context features [26], [28]. We enhance the similarity matrix with the dynamic position encoding (DPE) approach according to the following Equation (4):

G = exp(−|γ(i − j)^2 + β|) (4)

where i and j denote the absolute positions of two snippets, and γ and β are learnable weight and bias terms. In contrast, we also calculate the local attention and local context features according to the formulas below:

A^l = softmax(M̃ / √D_h) (5)
X^l = A^l · f_v(X) (6)

Here A^l denotes the local attention and X^l represents the local context features, where M̃ represents the masked version of the similarity matrix from Equation (1) [26], [28]. Then we combine the global attention head X^g and the local attention head X^l to produce the final feature X^o using Equation (7):

X^o = α · X^g + (1 − α) · X^l (7)

After normalizing the features, we add a skip connection to avoid losing information. Finally, we employ a linear layer and produce the output feature of the TCA module according to Equation (8):

X^c = LN(X + f_h(Norm(X^o))) (8)

where the global weight, local weight and power normalization are represented by α, (1 − α), and Norm(·), respectively.

CNN Module: These features are then processed through a 1D CNN (Conv1d), followed by a ReLU activation function and a dropout rate of 0.1, reducing the feature dimensionality to 512.
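A PyTorch sketch of Eqs. (1)-(8) is given below, assuming single-head attention, scalar learnable γ and β for the DPE term, a simple band mask for the local branch, and F.normalize standing in for the power normalization Norm(·); the original TCA implementation [26], [28] may differ in these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCASketch(nn.Module):
    """Sketch of the temporal context aggregation of Eqs. (1)-(8)."""

    def __init__(self, dim: int, hidden: int = 512, local_window: int = 9, alpha: float = 0.5):
        super().__init__()
        self.fq, self.fk, self.fv = nn.Linear(dim, hidden), nn.Linear(dim, hidden), nn.Linear(dim, hidden)
        self.fh = nn.Linear(hidden, dim)
        self.norm = nn.LayerNorm(dim)
        self.gamma = nn.Parameter(torch.tensor(0.1))   # DPE weight (assumed scalar)
        self.beta = nn.Parameter(torch.tensor(0.0))    # DPE bias (assumed scalar)
        self.alpha, self.hidden, self.local_window = alpha, hidden, local_window

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (T, dim)
        t = x.size(0)
        m = self.fq(x) @ self.fk(x).T                           # Eq. (1): similarity matrix
        pos = torch.arange(t, device=x.device)
        dist2 = (pos[:, None] - pos[None, :]).float() ** 2
        g = torch.exp(-(self.gamma * dist2 + self.beta).abs())  # Eq. (4): dynamic position encoding
        m = m * g
        a_g = F.softmax(m / self.hidden ** 0.5, dim=-1)         # Eq. (2): global attention
        x_g = a_g @ self.fv(x)                                  # Eq. (3): global context
        band = dist2.sqrt() <= self.local_window // 2           # assumed band mask for M~
        m_local = m.masked_fill(~band, float('-inf'))
        a_l = F.softmax(m_local / self.hidden ** 0.5, dim=-1)   # Eq. (5): local attention
        x_l = a_l @ self.fv(x)                                  # Eq. (6): local context
        x_o = self.alpha * x_g + (1 - self.alpha) * x_l         # Eq. (7): adaptive fusion
        # Eq. (8): skip connection, with F.normalize standing in for Norm(.)
        return self.norm(x + self.fh(F.normalize(x_o, dim=-1)))

tca = TCASketch(dim=1024)
out = tca(torch.randn(64, 1024))
print(out.shape)    # torch.Size([64, 1024])
```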
C. FEATURE FUSION
In the first stream, we use the top-k score nominator [42] to select the top-k segments based on their CLIP feature relevance and obtain a refined set of 512-dimensional features denoted by X_T. We obtain the final feature of the second stream from its FC module, denoted by X_CNN. These features are then concatenated, resulting in comprehensive 1024-dimensional features denoted by F_stage-1 using Equation (9):

F_stage-1 = X_T ⊕ X_CNN (9)
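A minimal sketch of the fusion of Eq. (9), assuming the two streams are aligned snippet by snippet; the Conv1d block realizes the 1024 → 512 reduction of the CNN module described above, and the exact layer arrangement is an assumption.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Reduce the TCA-enhanced I3D stream to 512 dims with Conv1d + ReLU + dropout,
    then concatenate it with the 512-dim CLIP stream to form F_stage-1 (Eq. (9))."""

    def __init__(self, i3d_dim: int = 1024, out_dim: int = 512, p_drop: float = 0.1):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv1d(i3d_dim, out_dim, kernel_size=1),
            nn.ReLU(),
            nn.Dropout(p_drop),
        )

    def forward(self, x_clip: torch.Tensor, x_i3d: torch.Tensor) -> torch.Tensor:
        # x_clip: (T, 512) selected CLIP features; x_i3d: (T, 1024) TCA output.
        x_cnn = self.reduce(x_i3d.T.unsqueeze(0)).squeeze(0).T    # (T, 512)
        return torch.cat([x_clip, x_cnn], dim=-1)                 # (T, 1024)

fusion = FeatureFusion()
f_stage1 = fusion(torch.randn(32, 512), torch.randn(32, 1024))
print(f_stage1.shape)    # torch.Size([32, 1024])
```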
D. STAGE 2: UR-DMU BASED FEATURE
To produce the graph-based temporal enhancement feature, we employ the UR-DMU [14], [15], [29] approach, which mainly incorporates dual memory units to simultaneously learn representations of regular data and discriminative features of abnormal data. The main goal is to improve the model's ability to differentiate between normal and anomalous behaviour. It consists of three main components, demonstrated in Figure 4. Global and Local Multi-Head Self Attention (GL-MHSA) is crucial for learning both long- and short-range temporal dependencies of anomalous features. It enhances the transformer structure by integrating global and local structural concepts from graph convolution networks.

S = σ(X M^⊤ / √D), M_aug = S M (10)

where X is the feature obtained from GL-MHSA, M denotes the queried memory banks with N_M memory slots, D is the output dimension, σ is the sigmoid activation, and S ∈ R^{N×N_M} is the query score. Following that, M_aug represents the memory-augmented feature produced by a read operation. We define a dual memory loss consisting of four binary cross-entropy (BCE) losses in order to train the dual memory units:

L_dm = BCE(S^n_{k;n}, y^n_n) + BCE(S^n_{k;a}, y^n_a) + BCE(S^a_{k;n;k}, y^a_n) + BCE(S^a_{k;a;k}, y^a_a) (11)

where S^n_{k;n} is a normal memory score with y^n_n = 1 ∈ R^N, and S^n_{k;a} is an anomaly memory score with y^n_a = 0 ∈ R^N. The means of the top-K results of S^a_{k;n} and S^a_{k;a} along the first dimension are S^a_{k;n;k}, S^a_{k;a;k} ∈ R^N, and y^a_n, y^a_a are labels with value 1. This helps distinguish hard samples better by comparing feature similarities with stored templates. Normal Data Uncertainty Learning (NUL) uses a Gaussian distribution to constrain the latent normal representation; it is an approach not commonly used in weakly supervised video anomaly detection, drawing on concepts from unsupervised anomaly detection methods. For training and testing, pairs of videos with equal amounts of normal and abnormal footage are processed. The model generates a score for each video snippet, using binary cross-entropy (BCE) loss and five auxiliary losses for discrimination between normality and anomaly. During testing, the model utilizes only the mean-encoder network of the DUL module to obtain feature embeddings, which are then used to label the video snippets and finally produce the UR-DMU features, denoted F_urdmu. Then we produce the final feature of stage 2, F_stage-2, by adding the UR-DMU feature F_urdmu to the TCA feature X^c using Equation (12):

F_stage-2 = F_urdmu + X^c (12)
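A simplified sketch of the memory interaction of Eq. (10) and a two-term stand-in for the dual-memory BCE loss of Eq. (11) is shown below; the number of memory slots, the top-k value, and the label assignment are assumptions, and the full UR-DMU loss [29] uses four terms plus additional uncertainty losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryUnitSketch(nn.Module):
    """Query scores against a bank of memory slots and read back an augmented
    feature, in the spirit of Eq. (10)."""

    def __init__(self, dim: int = 1024, num_slots: int = 60):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # assumed slot count
        self.dim = dim

    def forward(self, x: torch.Tensor):
        # x: (N, dim) snippet features from GL-MHSA.
        s = torch.sigmoid(x @ self.memory.T / self.dim ** 0.5)   # Eq. (10): query score S
        m_aug = s @ self.memory                                  # read operation -> M_aug
        return s, m_aug

def dual_memory_bce(scores_normal_mem, scores_abnormal_mem, is_abnormal_video, k=3):
    """Two illustrative BCE terms: the normal-memory response is pushed towards 1
    for normal data and 0 for abnormal data, and vice versa for the abnormal memory."""
    s_n = scores_normal_mem.max(dim=-1).values.topk(k).values.mean()
    s_a = scores_abnormal_mem.max(dim=-1).values.topk(k).values.mean()
    y_n = torch.tensor(0.0 if is_abnormal_video else 1.0)
    y_a = torch.tensor(1.0 if is_abnormal_video else 0.0)
    return F.binary_cross_entropy(s_n, y_n) + F.binary_cross_entropy(s_a, y_a)

normal_mem, abnormal_mem = MemoryUnitSketch(), MemoryUnitSketch()
x = torch.randn(32, 1024)
s_n, _ = normal_mem(x)
s_a, _ = abnormal_mem(x)
print(dual_memory_bce(s_n, s_a, is_abnormal_video=True).item())
```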
E. STAGE 3: FEATURE REDUCTION WITH PEL THROUGH MLP MODULE
To select the effective features from the graph-based UR-DMU features F_stage-2, we employ an MLP with the PEL module, as described below.

a: MLP
To achieve high-level semantic representations by selecting the effective features from the graph-based F_stage-2 features, we employ a two-layer MLP for feature reduction. The MLP serves as a powerful tool for non-linear mapping and feature transformation, enabling the model to learn complex decision boundaries and refine the extracted features for final anomaly detection. This MLP incorporates two Conv1d layers, two GELU activations, and two dropout mechanisms [43]. Prior to the first Conv1d layer, we integrate features from TCA. Following the first Conv1d layer, we append a 512-dimensional feature derived from I3D. Each Conv1d layer is succeeded by a GELU activation function and a dropout operation. This methodology is formulated in Equation (13):

F_MLP-1 = Dropout(GELU(Conv1D(F_stage-2)))
F_stage-3 = Dropout(GELU(Conv1D(F_MLP-1))) (13)

Finally, we utilize a causal convolution layer to produce the anomaly scores, integrating both present and past observations for enhanced reliability. The classifier is represented as:

S = σ(f_t(X_s)), (14)

where f_t(·) denotes the causal convolution layer with a kernel size of ∆t, σ(·) represents the sigmoid function, and s_i signifies the anomaly score of the i-th snippet.
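A sketch of this stage-3 head is given below: two Conv1d + GELU + dropout blocks for Eq. (13), followed by the causal-convolution classifier of Eq. (14). The channel widths and the causal kernel size are assumptions.

```python
import torch
import torch.nn as nn

class StageThreeHead(nn.Module):
    """Feature reduction of Eq. (13) followed by the causal classifier of Eq. (14)."""

    def __init__(self, in_dim: int = 1024, mid_dim: int = 512, out_dim: int = 128,
                 kernel: int = 5, p_drop: float = 0.1):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv1d(in_dim, mid_dim, 1), nn.GELU(), nn.Dropout(p_drop))
        self.block2 = nn.Sequential(nn.Conv1d(mid_dim, out_dim, 1), nn.GELU(), nn.Dropout(p_drop))
        # Causal convolution: pad only on the left so each score depends on the
        # present and past snippets, never on future ones.
        self.pad = nn.ConstantPad1d((kernel - 1, 0), 0.0)
        self.classifier = nn.Conv1d(out_dim, 1, kernel)

    def forward(self, f_stage2: torch.Tensor) -> torch.Tensor:
        # f_stage2: (T, in_dim) -> anomaly scores (T,) in [0, 1].
        x = f_stage2.T.unsqueeze(0)                              # (1, in_dim, T)
        x = self.block2(self.block1(x))                          # Eq. (13)
        scores = torch.sigmoid(self.classifier(self.pad(x)))     # Eq. (14)
        return scores.squeeze(0).squeeze(0)

head = StageThreeHead()
print(head(torch.randn(64, 1024)).shape)    # torch.Size([64])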
Finally, we employ multiple instance learning (MIL) as the loss function [28], [44]. Specifically, we determine the video-level prediction p_i by computing the mean value of the top-k anomaly scores. For positive bags, we set k = ⌊T/16⌋ + 1, and for negative bags, we set k = 1. Given a mini-batch containing B samples with video-level ground truth y_i, the binary cross-entropy is formulated as:

L_ce = −(1/B) Σ_{i=1}^{B} y_i log(p_i). (15)
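The MIL objective of Eq. (15) can be sketched as below; it is implemented here with the standard two-term binary cross-entropy, whereas Eq. (15) as printed keeps only the positive term.

```python
import torch
import torch.nn.functional as F

def mil_bce_loss(snippet_scores: list, video_labels: torch.Tensor) -> torch.Tensor:
    """Video-level prediction p_i = mean of the top-k snippet scores, with
    k = floor(T/16) + 1 for positive bags and k = 1 for negative bags."""
    preds = []
    for scores, y in zip(snippet_scores, video_labels):
        t = scores.numel()
        k = t // 16 + 1 if y > 0.5 else 1
        preds.append(scores.topk(min(k, t)).values.mean())
    preds = torch.stack(preds)
    return F.binary_cross_entropy(preds, video_labels)

scores = [torch.rand(48), torch.rand(64)]          # two videos in a mini-batch
labels = torch.tensor([1.0, 0.0])                  # abnormal, normal
print(mil_bce_loss(scores, labels).item())
```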
b: PROMPT-ENHANCED LEARNING (PEL)
In this study, we employ Prompt-Enhanced Learning (PEL) [28], [42] to enrich visual representations by integrating knowledge-based contextual information, improving anomaly detection in complex scenarios. It involves three key steps: prompt construction, fore-background separation, and cross-modal alignment. The prompt construction mainly selects common relations among the categories to form prompts that focus on the high-occurrence categories and builds a relevant semantic relationship dictionary. Then, based on the output of F_MLP-1 and the cross-entropy loss L_ce information, it produces the video-level background and foreground features. Finally, the effective feature is prompted based on the enhanced fine-grained semantics of the visual features. That means PEL assesses the likelihood of a visual feature matching a particular prompt across several anomaly classes and one normal class. Overall, the PEL module's integration of textual and visual modalities enables a more nuanced and context-aware approach to anomaly detection in video data. Finally, the cross-modal alignment loss is computed using the Kullback-Leibler divergence, compelling the network to discern between the visual content of the video representing abnormal behavior (foreground) and irrelevant content (background). The loss function is formulated as follows:

L_kd = E_{v∼p(v)}[log p_v2t(v) − log q_v2t(v)], (16)

where p_v2t(v) and q_v2t(v) denote the similarity score and semantic consistency label of the video-prompt pair, respectively. For a positive pair, q = 1; otherwise, q = 0. We add the magnitude contrastive (MC) loss [45] to L_kd to enhance the effectiveness of the loss calculation procedure.
V. EVALUATION AND PERFORMANCE
To evaluate the proposed model, we used three benchmark anomaly detection datasets: ShanghaiTech [41], UCF-Crime [8], and XD-Violence [29].

A. TRAINING AND TESTING PROCEDURES
During training, we optimize the objective function L = L_ce + λ L_kd, where λ adjusts the alignment loss. This enables our model to generate discriminative representations of positive and negative snippets, improving generalizability. In the testing phase, we mitigate the impact of transient noise with a score smoothing (SS) strategy using distinct pooling operations, following Equation (17):

s̃_i = (1/κ) Σ_{j=i}^{i+κ−1} s_j (17)

This also helps us to suppress biases and reduce false alarms by smoothing the prediction scores. In addition, we skipped feature-length normalization of the extracted video feature vectors, assuming independence among videos. These vectors underwent TSAN processing, producing reweighted attention features. These features were then fed into the snippet association network and an MLP-based converter to obtain anomaly scores. Each score, ranging from 0 to 1, indicates the anomaly probability of the corresponding snippet. To maintain the original video order for evaluation against ground truth labels, each score was replicated ∆ times to match the video's original frame length.
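A minimal sketch of the score-smoothing step of Eq. (17) follows; the value of κ and the handling of the final snippets are assumptions.

```python
import torch

def smooth_scores(scores: torch.Tensor, kappa: int = 5) -> torch.Tensor:
    """Replace each snippet score by the mean of a kappa-length window of
    scores (a simple moving average, as in Eq. (17))."""
    padded = torch.cat([scores, scores[-1:].repeat(kappa - 1)])   # pad the tail
    windows = padded.unfold(0, kappa, 1)                          # (T, kappa)
    return windows.mean(dim=1)

s = torch.rand(20)
print(smooth_scores(s).shape)    # torch.Size([20])
```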
1) ENVIRONMENTAL SETUP AND EVALUATION METRICS
The system was developed on a machine with a GeForce RTX 4090 24GB GPU, running CUDA version 11.7 and NVIDIA driver version 515, with 64GB of RAM. During training, we employed a learning rate of 0.005 and a batch size of 32. The training process lasted for 300 epochs using the Adam optimizer on the same RTX 4090 machine. For an efficient implementation of graph convolution and attention with low computational cost, the Python environment included OpenCV, pickle, and pandas [46], [47], [48]; these packages facilitated initial data processing and model development [47], [48].

We compare the results using the area under the curve (AUC) of the frame-level receiver operating characteristic (ROC) for UCF-Crime and ShanghaiTech to measure WS-VAD performance. For XD-Violence, on the other hand, the area under the frame-level precision-recall curve (AP) is utilized. In the ablation experiments, the False Alarm Rate (FAR) and an anomaly subset consisting of only abnormal data are also utilized. The FAR reported here differs from the usual implementation: we used the "Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection" implementation as is, in which the FAR is restricted to normal videos, i.e., videos whose frames are all labelled 0. In the ShanghaiTech dataset, this FAR was exactly 0, which is probably due to the high AUC of 98.6%. FAR has not been widely used as an indicator; it was used in two papers, but one of them was not prepared for comparison with the other.
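The evaluation protocol (frame-level ROC-AUC for UCF-Crime/ShanghaiTech and frame-level AP for XD-Violence) can be sketched with scikit-learn as follows; repeating each snippet score 16 times to reach the frame level mirrors the score replication described above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def frame_level_metrics(snippet_scores: np.ndarray, frame_labels: np.ndarray,
                        snippet_len: int = 16):
    """Repeat snippet scores to frame level, then compute frame-level ROC-AUC
    (UCF-Crime / ShanghaiTech) and precision-recall AP (XD-Violence)."""
    frame_scores = np.repeat(snippet_scores, snippet_len)[: len(frame_labels)]
    auc = roc_auc_score(frame_labels, frame_scores)
    ap = average_precision_score(frame_labels, frame_scores)
    return auc, ap

scores = np.random.rand(40)                       # 40 snippet scores
labels = np.random.randint(0, 2, size=40 * 16)    # per-frame ground truth
print(frame_level_metrics(scores, labels))
```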
B. ABLATION STUDY
Table 1 presents the ablation study of the proposed model, which also shows the contribution of the multi-backbone pre-trained model. In the ablation study, we systematically evaluated the impact of the various technologies on weakly supervised video anomaly detection. The presence of a check mark indicates the utilization of the corresponding technology in our experiments. We observed that integrating I3D alone resulted in a notable improvement in performance across all datasets. Incorporating TCA alongside I3D further enhanced detection accuracy. CLIP integration facilitated even better results, particularly on the UCF-Crime dataset. Additionally, employing top-k selection improved performance consistently. The introduction of UR-DMU significantly boosted detection rates, which is particularly evident on the XD-Violence and ShanghaiTech datasets. Furthermore, the inclusion of PEL and MC contributed to further performance gains. Finally, adopting SS alongside all the aforementioned technologies yielded the highest detection accuracy, showcasing the synergistic effect of combining these methodologies.
TABLE 2. Performance result.

C. PERFORMANCE RESULT OF THE PROPOSED STUDY
Table 2 presents performance metrics for the three anomaly detection datasets. The metrics include the Area Under the Curve (AUC), Anomaly AUC, Average Precision (AP), and False Alarm Rate (FAR). For the UCF-Crime dataset, the AUC is 0.9009, with an Anomaly AUC of 0.7456, an AP of 0.4090, and a FAR of 0.0204. Similarly, the XD-Violence dataset shows an AUC of 0.9509, Anomaly AUC of 0.8626, AP of 0.8648, and FAR of 0.0013. Lastly, the ShanghaiTech dataset exhibits an AUC of 0.9869, Anomaly AUC of 0.8228, AP of 0.7780, and FAR of 0.0000.
TABLE 3. State-of-the-art comparison of the proposed model for the UCF-Crime and ShanghaiTech datasets.

D. STATE OF THE ART COMPARISON FOR THE UCF-CRIME AND ShanghaiTech DATASETS
Table 3 presents an overview of various crime detection models developed over multiple years. Each model is evaluated based on its performance on the ShanghaiTech and UCF-Crime datasets using the AUC (Area Under the Curve) metric. In 2018, Sultani et al. [8] introduced models utilizing the C3D and I3D feature extractors. These models demonstrated competitive AUC scores on both datasets, indicating their efficacy in identifying crime-related activities in videos. Zhong et al. [6] further advanced the field in 2019 by exploring the use of the C3D and TSN feature extractors. Their experiments revealed varying performance across the datasets, emphasizing the importance of feature extractor selection in model development. Additionally, Zhang et al. [9] investigated the I3D feature extractor, contributing additional insights into its suitability for crime detection tasks.

In 2020, Zaheer et al. [7], [49] introduced novel feature extractors such as C3D-self and C3D, achieving promising results on both datasets. Wan et al. [50] also contributed to advancements by exploring the I3D feature extractor, further diversifying the range of feature extractors used in crime detection models. The year 2021 marked significant progress in the field, with studies by Purwanto et al. [13], Tian et al. [19], Majhi et al. [51], Wu and Liu [44], Yu et al. [52], Lv et al. [12], and Feng et al. [53] introducing various feature extractors and achieving competitive results. These studies highlighted the continuous evolution and improvement of crime detection models. In 2022, the research landscape expanded further with a surge in model diversity. Studies by Zaheer et al. [3], [30], Joo et al. [20], Cao et al. [63], Li et al. [39], Sharif et al. [30], Yi et al. [57], Yu et al. [27], and Gong et al. [58] introduced novel approaches and feature extractors, pushing the boundaries of performance in video-based crime detection. Finally, the proposed hybrid model showcased exceptional performance, achieving remarkably high AUC scores on both datasets. This model represents a culmination of feature extraction techniques and model architecture advancements, underscoring the potential for further improvements in crime detection technology. Overall, the comparison table provides valuable insights into the evolution of crime detection models over the years, highlighting the importance of feature extraction techniques and model architecture design in achieving superior performance. As the field continues to advance, future models are expected to enhance the capabilities of video-based crime detection systems, contributing to improving public safety and security.

TABLE 4. State-of-the-art comparison of the proposed model for the XD-Violence dataset.

E. STATE OF THE ART COMPARISON FOR THE XD-VIOLENCE DATASET
Table 4 presents the state-of-the-art comparison of the proposed model on the XD-Violence dataset. The proposed model outperforms existing state-of-the-art methods, achieving an impressive average precision (AP) score of 86.26%. Sultani et al. [8] achieved an AP of 73.20% using RGB features, while HL-Net [10] attained a slightly higher 73.67%. Notably, incorporating audio features alongside RGB, HL-Net reached 78.64%. RTFM [19] and MSL [39] followed closely with scores of 77.81% and 78.28%, respectively. Pang et al. [64] and ACF [65] leveraged RGB with audio, achieving 81.69% and 80.13%, respectively. However, the proposed model significantly surpasses these benchmarks, demonstrating its efficacy in violence detection.
VI. CONCLUSION AND FUTURE DIRECTION
In this study, we proposed a graph and general DL approach to extract discriminative features that effectively distinguish abnormal events from normality in weakly supervised video anomaly detection (WS-VAD) tasks. By addressing the limitations of existing approaches and proposing a multi-stage deep-learning model that integrates cutting-edge technologies, we have demonstrated the effectiveness of our method. Through the utilization of a ViT-based CLIP module, a CNN-based I3D module, an Uncertainty-regulated Dual Memory Units (UR-DMU) model, and GCN and Global/Local Multi-Head Self Attention (GL-MHSA) modules, we have successfully extracted and learned representations of regular and abnormal data simultaneously. The refinement of features in our third-stage module, a CNN-based MLP, further enhances the model's ability to differentiate between normal and anomalous behaviour. Besides anomaly detection, we believe that this model can be used to detect crimes and contribute to automatic crime control. Extensive experiments on multiple datasets have validated the superiority of our approach over state-of-the-art methods, showcasing its potential for real-world applications in anomaly detection tasks. We believe that our comprehensive solution offers significant promise, demonstrating enhanced efficacy and performance in video-based anomaly detection.

ABBREVIATIONS
WS-VAD  Weakly supervised video anomaly detection.
UR-DMU  Uncertainty-regulated dual memory units.
MLP     Multilayer perceptron.
TCA     Temporal contextual aggregation.
GCN     Graph convolutional networks.
GL-MHSA Global/local multi-head self-attention.
MIL     Multiple instance learning.
NVs     Normal videos.
AVs     Anomalous videos.
DL      Deep learning.
CNN     Convolutional neural network.
ViT     Vision transformer.
BCE     Binary cross entropy.
REFERENCES
[1] K. Liu and H. Ma, "Exploring background-bias for anomaly detection in surveillance videos," in Proc. 27th ACM Int. Conf. Multimedia, Nice, France, Oct. 2019, pp. 1490–1499.
[2] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. Van Den Hengel, "Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1705–1714.
[3] M. Z. Zaheer, A. Mahmood, M. H. Khan, M. Segu, F. Yu, and S.-I. Lee, "Generative cooperative learning for unsupervised video anomaly detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), New Orleans, LA, USA, Jun. 2022, pp. 14724–14734.
[4] M. H. Sharif, L. Jiao, and C. W. Omlin, "Deep crowd anomaly detection by fusing reconstruction and prediction networks," Electronics, vol. 12, no. 7, p. 1517, Mar. 2023.
[5] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv. (CSUR), vol. 41, no. 3, pp. 1–58, 2009.
[6] J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, "Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, Jun. 2019, pp. 1237–1246.
[7] M. Z. Zaheer, A. Mahmood, M. Astrid, and S.-I. Lee, "CLAWS: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection," in Computer Vision—ECCV. Glasgow, U.K.: Springer, 2020, pp. 358–376.
[8] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 6479–6488.
[9] J. Zhang, L. Qing, and J. Miao, "Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection," in Proc. IEEE Int. Conf. Image Process. (ICIP), Taipei, Taiwan, Sep. 2019, pp. 4030–4034.
[10] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, "Not only look, but also listen: Learning multimodal violence detection under weak supervision," in Computer Vision—ECCV. Glasgow, U.K.: Springer, 2020, pp. 322–339.
[11] Y. Zhu and S. Newsam, "Motion-aware feature for improved video anomaly detection," 2019, arXiv:1907.10211.
[12] H. Lv, C. Zhou, Z. Cui, C. Xu, Y. Li, and J. Yang, "Localizing anomalies from weakly-labeled videos," IEEE Trans. Image Process., vol. 30, pp. 4505–4515, 2021.
[13] D. Purwanto, Y.-T. Chen, and W.-H. Fang, "Dance with self-attention: A new look of conditional random fields on anomaly detection in videos," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Montreal, BC, Canada, Oct. 2021, pp. 173–183.
[14] A. S. M. Miah, Md. A. M. Hasan, and J. Shin, "Dynamic hand gesture recognition using multi-branch attention based graph and general deep learning model," IEEE Access, vol. 11, pp. 4703–4716, 2023.
[15] A. S. M. Miah, M. A. M. Hasan, Y. Okuyama, Y. Tomioka, and J. Shin, "Spatial–temporal attention with graph and general neural network-based sign language recognition," Pattern Anal. Appl., vol. 27, no. 2, p. 37, 2024.
[16] A. S. M. Miah, M. A. M. Hasan, Y. Tomioka, and J. Shin, "Hand gesture recognition for multi-culture sign language using graph and general deep learning network," IEEE Open J. Comput. Soc., vol. 5, pp. 144–155, 2024.
[17] M.-A. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon, "Multiple instance learning: A survey of problem characteristics and applications," Pattern Recognit., vol. 77, pp. 329–353, May 2018.
[18] Y. Liu, D. Yang, Y. Wang, J. Liu, J. Liu, A. Boukerche, P. Sun, and L. Song,
[19] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, "Weakly-supervised video anomaly detection with robust temporal feature magnitude learning," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Montreal, BC, Canada, Oct. 2021, pp. 4955–4966.
[20] H. Kevin Joo, K. Vo, K. Yamazaki, and N. Le, "CLIP-TSA: CLIP-assisted temporal self-attention for weakly-supervised video anomaly detection," 2022, arXiv:2212.05136.
[21] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013.
[22] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 4724–4733.
[23] O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, "StyleCLIP: Text-driven manipulation of StyleGAN imagery," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Montreal, BC, Canada, Oct. 2021, pp. 2065–2074.
[24] K. Vo, S. Truong, K. Yamazaki, B. Raj, M.-T. Tran, and N. Le, "AOE-Net: Entities interactions modeling with adaptive attention mechanism for temporal action proposals generation," Int. J. Comput. Vis., vol. 131, no. 1, pp. 302–323, Jan. 2023.
[25] K. Yamazaki, K. Vo, S. Truong, B. Raj, and N. Le, "VLTinT: Visual-linguistic transformer-in-transformer for coherent video paragraph captioning," 2022, arXiv:2211.15103.
[26] J. Shao, X. Wen, B. Zhao, and X. Xue, "Temporal context aggregation for video retrieval with contrastive learning," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2021, pp. 3267–3277.
[27] S. Yu, C. Wang, L. Xiang, and J. Wu, "TCA-VAD: Temporal context alignment network for weakly supervised video anomaly detection," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Taipei, Taiwan, Jul. 2022, pp. 1–6.
[28] Y. Pu, X. Wu, L. Yang, and S. Wang, "Learning prompt-enhanced context features for weakly-supervised video anomaly detection," 2023, arXiv:2306.14451.
[29] H. Zhou, J. Yu, and W. Yang, "Dual memory units with uncertainty regulation for weakly supervised video anomaly detection," in Proc. AAAI Conf. Artif. Intell., Washington, DC, USA, vol. 37, 2023, pp. 3769–3777.
[30] M. H. Sharif, L. Jiao, and C. W. Omlin, "CNN-ViT supported weakly-supervised video segment level anomaly detection," Sensors, vol. 23, no. 18, p. 7734, Sep. 2023.
[31] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Santiago, Chile, Dec. 2015, pp. 4489–4497.
[32] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks for action recognition in videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 11, pp. 2740–2755, Nov. 2019.
[33] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 1–9.
[35] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, and G. Krueger, "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., vol. 139, 2021, pp. 8748–8763.
[36] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–11.
[37] L. Harold Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, "VisualBERT: A simple and performant baseline for vision and language," 2019, arXiv:1908.03557.
[38] Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan, "Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm," 2021, arXiv:2110.05208.
[39] S. Li, F. Liu, and L. Jiao, "Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection," in Proc. AAAI Conf. Artif. Intell., 2022, vol. 36, no. 2, pp. 1395–1403.
[40] H. Lv, Z. Yue, Q. Sun, B. Luo, Z. Cui, and H. Zhang, "Unbiased multiple instance learning for weakly supervised video anomaly detection," in Proc.
‘‘Generalized video anomaly event detection: Systematic taxonomy and IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Vancouver, BC,
comparison of deep models,’’ 2023, arXiv:2302.05087. Canada, Jun. 2023, pp. 8022–8031.
[41] W. Liu, W. Luo, D. Lian, and S. Gao, ‘‘Future frame prediction for anomaly detection—A new baseline,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 6536–6545.
[42] H. K. Joo, K. Vo, K. Yamazaki, and N. Le, ‘‘CLIP-TSA: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection,’’ in Proc. IEEE Int. Conf. Image Process. (ICIP), Kuala Lumpur, Malaysia, Oct. 2023, pp. 3230–3234.
[43] W. Zhu, P. Qiu, O. M. Dumitrascu, and Y. Wang, ‘‘PDL: Regularizing multiple instance learning with progressive dropout layers,’’ 2023, arXiv:2308.10112.
[44] P. Wu and J. Liu, ‘‘Learning causal temporal relation and feature discrimination for anomaly detection,’’ IEEE Trans. Image Process., vol. 30, pp. 3513–3527, 2021.
[45] Y. Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y.-C. Wu, ‘‘MGFN: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection,’’ in Proc. AAAI Conf. Artif. Intell., Jun. 2023, vol. 37, no. 1, pp. 387–395.
[46] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, and L. Antiga, ‘‘PyTorch: An imperative style, high-performance deep learning library,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–12.
[47] S. Gollapudi, Learn Computer Vision Using OpenCV. Berlin, Germany: Springer, 2019.
[48] T. Dozat, ‘‘Incorporating Nesterov momentum into Adam,’’ in Proc. 4th Int. Conf. Learn. Represent., Workshop, 2016, pp. 1–4.
[49] M. Z. Zaheer, A. Mahmood, H. Shin, and S.-I. Lee, ‘‘A self-reasoning framework for anomaly detection using video-level labels,’’ IEEE Signal Process. Lett., vol. 27, pp. 1705–1709, 2020.
[50] B. Wan, Y. Fang, X. Xia, and J. Mei, ‘‘Weakly supervised video anomaly detection via center-guided discriminative learning,’’ in Proc. IEEE Int. Conf. Multimedia Expo (ICME), London, U.K., Jul. 2020, pp. 1–6.
[51] S. Majhi, S. Das, and F. Brémond, ‘‘DAM: Dissimilarity attention module for weakly-supervised video anomaly detection,’’ in Proc. 17th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Nov. 2021, pp. 1–8.
[52] S. Yu, C. Wang, Q. Mao, Y. Li, and J. Wu, ‘‘Cross-epoch learning for weakly supervised anomaly detection in surveillance videos,’’ IEEE Signal Process. Lett., vol. 28, pp. 2137–2141, 2021.
[53] J.-C. Feng, F.-T. Hong, and W.-S. Zheng, ‘‘MIST: Multiple instance self-training framework for video anomaly detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Nashville, TN, USA, Jun. 2021, pp. 14004–14013.
[54] M. Z. Zaheer, A. Mahmood, M. Astrid, and S.-I. Lee, ‘‘Clustering aided weakly supervised training to detect anomalous events in surveillance videos,’’ IEEE Trans. Neural Netw. Learn. Syst., early access, May 26, 2024, doi: 10.1109/TNNLS.2023.3274611.
[55] C. Cao, X. Zhang, S. Zhang, P. Wang, and Y. Zhang, ‘‘Weakly supervised video anomaly detection based on cross-batch clustering guidance,’’ in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2023, pp. 2723–2728.
[56] W. Tan, Q. Yao, and J. Liu, ‘‘Overlooked video classification in weakly supervised video anomaly detection,’’ in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. Workshops (WACVW), Jan. 2024, pp. 202–210.
[57] S. Yi, Z. Fan, and D. Wu, ‘‘Batch feature standardization network with triplet loss for weakly-supervised video anomaly detection,’’ Image Vis. Comput., vol. 120, Apr. 2022, Art. no. 104397.
[58] Y. Gong, C. Wang, X. Dai, S. Yu, L. Xiang, and J. Wu, ‘‘Multi-scale continuity-aware refinement network for weakly supervised video anomaly detection,’’ in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Taipei, Taiwan, Jul. 2022, pp. 1–6.
[59] S. Majhi, R. Dai, Q. Kong, L. Garattoni, G. Francesca, and F. Bremond, ‘‘Human-scene network: A novel baseline with self-rectifying loss for weakly supervised video anomaly detection,’’ 2023, arXiv:2301.07923.
[60] S. Park, H. Kim, M. Kim, D. Kim, and K. Sohn, ‘‘Normality guided multiple instance learning for weakly supervised video anomaly detection,’’ in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 2664–2673.
[61] S. Sun and X. Gong, ‘‘Long-short temporal co-teaching for weakly supervised video anomaly detection,’’ 2023, arXiv:2303.18044.
[62] L. Wang, X. Wang, F. Liu, M. Li, X. Hao, and N. Zhao, ‘‘Attention-guided MIL weakly supervised visual anomaly detection,’’ Measurement, vol. 209, Mar. 2023, Art. no. 112500.
[63] C. Cao, X. Zhang, S. Zhang, P. Wang, and Y. Zhang, ‘‘Adaptive graph convolutional networks for weakly supervised anomaly detection in videos,’’ IEEE Signal Process. Lett., vol. 29, pp. 2497–2501, 2022.
[64] G. Pang, C. Yan, C. Shen, A. van den Hengel, and X. Bai, ‘‘Self-trained deep ordinal regression for end-to-end video anomaly detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2020, pp. 12170–12179.
[65] D.-L. Wei, C.-G. Liu, Y. Liu, J. Liu, X.-G. Zhu, and X.-H. Zeng, ‘‘Look, listen and pay more attention: Fusing multi-modal information for video violence detection,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Singapore, May 2022, pp. 1980–1984.
[66] D. Wei, Y. Liu, X. Zhu, J. Liu, and X. Zeng, ‘‘MSAF: Multimodal supervise-attention enhanced fusion for video anomaly detection,’’ IEEE Signal Process. Lett., vol. 29, pp. 2178–2182, 2022.
[67] C. Zhang, G. Li, Y. Qi, S. Wang, L. Qing, Q. Huang, and M.-H. Yang, ‘‘Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Vancouver, BC, Canada, Jun. 2023, pp. 16271–16280.
[68] Y. Pu and X. Wu, ‘‘Audio-guided attention network for weakly supervised violence detection,’’ in Proc. 2nd Int. Conf. Consum. Electron. Comput. Eng. (ICCECE), Guangzhou, China, Jan. 2022, pp. 219–223.
[69] J. Yu, J. Liu, Y. Cheng, R. Feng, and Y. Zhang, ‘‘Modality-aware contrastive instance learning with self-distillation for weakly-supervised audio-visual violence detection,’’ in Proc. 30th ACM Int. Conf. Multimedia, Lisboa, Portugal, Oct. 2022, pp. 6278–6287.

JUNGPIL SHIN (Senior Member, IEEE) received the B.Sc. degree in computer science and statistics and the M.Sc. degree in computer science from Pusan National University, South Korea, in 1990 and 1994, respectively, and the Ph.D. degree in computer science and communication engineering from Kyushu University, Japan, in 1999, under a scholarship from the Japanese Government (MEXT). He was an Associate Professor, a Senior Associate Professor, and a Full Professor with the School of Computer Science and Engineering, The University of Aizu, Japan, in 1999, 2004, and 2019, respectively. He has coauthored more than 400 published papers in widely cited journals and conferences. His research interests include pattern recognition, image processing, computer vision, machine learning, human–computer interaction, non-touch interfaces, human gesture recognition, automatic control, Parkinson's disease diagnosis, ADHD diagnosis, user authentication, machine intelligence, bioinformatics, handwriting analysis, recognition, and synthesis. He is a member of ACM, IEICE, IPSJ, KISS, and KIPS. He served as the program chair and as a program committee member for numerous international conferences. He serves as an Editor for IEEE journals, Springer, Sage, Taylor & Francis, Sensors (MDPI), Electronics (MDPI), and Tech Science, as an Editorial Board Member for Scientific Reports, and as a reviewer for several major IEEE and SCI journals.

YUTA KANEKO is currently pursuing the bachelor's degree in computer science and engineering with The University of Aizu (UoA), Japan. He joined the Pattern Processing Laboratory, UoA, in April 2023, under the direct supervision of Dr. Jungpil Shin, where he works on human activity recognition. His research interests include computer vision, pattern recognition, and deep learning.
ABU SALEH MUSA MIAH (Member, IEEE) received the B.Sc. (Eng.) and M.Sc. (Eng.) degrees in computer science and engineering from the Department of Computer Science and Engineering, University of Rajshahi, Rajshahi, Bangladesh, in 2014 and 2015, respectively, and the Ph.D. degree in computer science and engineering from The University of Aizu, Japan, in 2024, under a scholarship from the Japanese Government (MEXT). He assumed the positions of a Lecturer and an Assistant Professor with the Department of Computer Science and Engineering, Bangladesh Army University of Science and Technology (BAUST), Saidpur, Bangladesh, in 2018 and 2021, respectively. He has been a Visiting Researcher (postdoctoral researcher) with The University of Aizu, since April 2024. He has authored or coauthored more than 50 publications in widely cited journals and conferences. His research interests include AI, ML, DL, human activity recognition (HAR), hand gesture recognition (HGR), movement disorder detection, Parkinson's disease (PD), HCI, BCI, and neurological disorder detection.

SATOSHI NISHIMURA (Member, IEEE) received the B.E. degree from Tohoku University, in 1987, and the M.Sc. and D.Sc. degrees in information science from The University of Tokyo, in 1989 and 1995, respectively. He is currently a Senior Associate Professor with The University of Aizu. His research interests include computer graphics and computer music. He is a member of ACM and IPSJ.