
Anomaly Detection in Surveillance Videos Using Deep Learning

Esraa A. Mahareek 1*, Eman K. El-Sayed 2, Nahed M. El-Desouky 3, Kamal A. El-Dahshan 4

1,2,3 Mathematics Department, Faculty of Science, Al-Azhar University (Girls Branch), Cairo, Egypt
2 School of Computer Science, Canadian International College (CIC), Cairo, Egypt
4 Mathematics Department, Faculty of Science, Al-Azhar University, Cairo, Egypt
* Corresponding author's Email: [email protected]

Abstract: This paper presents a new method for anomaly detection in surveillance videos using deep learning. The proposed method is based on a deep network trained to identify objects and human activities in videos. The method was tested on five real-world large-scale datasets (UCF-Crime, XD-Violence, UBI-Fights, CCTV-Fights, and UCF-101) containing indoor and outdoor video sequences, as well as on synthetic datasets with different object sizes, appearances, and activity types. We use a 3D convolutional neural network (3D-CNN) followed by a convolutional long short-term memory (ConvLSTM) network to extract features from video frames, and then perform classification and recognition based on these features. The results show that the proposed method achieves high accuracy and AUC in both indoor and outdoor scenarios compared to the state-of-the-art methods reported in the comparison.

Keywords: Anomaly detection, 3D convolutional neural network, Surveillance videos, Bidirectional ConvLSTM, Fight detection, Violence detection.

1. Introduction

The ability of monitoring to maintain public safety, and the quick response needed to serve this aim, is a considerable challenge even for people, since protection is the major reason for deploying video surveillance systems. The use of surveillance systems has increased, but human monitoring capacity has not kept up [1]. As a result, a great deal of supervision is needed to spot unusual events that could endanger a person or a business, and a significant amount of labour and time is lost given how rarely anomalous events occur compared to regular ones.

Surveillance video is an important source of data for law enforcement, security, and other organizations. It is an automated system used for monitoring indoor or outdoor environments such as airports, malls, and parking lots. The recorded video streams are converted into images using 2D or 3D cameras. These images are then analysed by computer vision algorithms to detect objects, people, and actions in the scene. Detection of unusual events in these scenes is an important task in video surveillance systems, as it enables detection of and response to unexpected events such as robberies, assaults, vandalism, or traffic collisions.

However, anomalous events are rare compared to normal ones, and monitoring surveillance videos, although very important, is very time consuming, so developing computer vision systems that automatically detect anomalous actions in surveillance videos is highly necessary.

The low resolution and discontinuous nature of many surveillance videos can make it difficult to detect changes in the scene. Traditional approaches to this problem rely on hand-crafted feature extractors to identify anomalous events. These approaches are time-consuming and difficult to maintain as the video format evolves over time. Recent advances in machine learning have made it possible to train algorithms to perform anomaly detection without manually identified features.

In this paper we propose a new method for the automatic detection and classification of anomalies in video recordings using convolutional neural networks, which extract features from video frames and classify them according to different anomaly classes such as assault, robbery, and fighting. In this approach we choose a 3D-CNN to learn short-term spatio-temporal features of anomalies, followed by a ConvLSTM to learn long-term spatio-temporal features, and then combine these networks in a unified architecture to classify surveillance videos, improving training stability and performance.
Several layers of convolutional networks are trained with millions of images in order to learn unique image features that are discriminative for the different anomaly classes, and for each class they are trained to separate the normal from the abnormal frames in a video recording. This is achieved by evaluating the similarity between a feature vector extracted from a normal frame and a feature vector extracted from an anomalous frame belonging to the same class, and then classifying the frame as either normal or abnormal by calculating a similarity score between the two feature vectors. The main disadvantage of this approach is that it requires a large number of training images and very large datasets to be able to train the network to learn useful image features. So, we trained our model on a large dataset, UCF-Crime, which contains more than 128 hours of recorded video divided into 8 anomaly classes and 1 normal class. We evaluate our model's performance on held-out test data, and the results show that it achieves reasonable classification accuracy for different types of anomalous events and outperforms other recent approaches.

Firstly, we describe the datasets used in this paper, how they were pre-processed, and how the 3D-CNN approach was trained and tested to detect different types of anomalies. Then we describe the results obtained on the test data and report the classification accuracy and AUC for each dataset.

This paper is organized as follows: Section 2 presents a literature review of work related to this study. Section 3 describes the 3D-CNN and Section 4 the ConvLSTM network. Section 5 describes the proposed technique and Section 6 the datasets. Sections 7 and 8 briefly describe how the training data is prepared and present the experimental results, followed by a discussion and conclusions.

2. Literature review

In the field of action detection, the use of computer vision to identify certain actions in security-camera footage has grown in popularity, and this work belongs to the computer vision field. Many researchers have been trying to develop efficient machine-learning methods for the automatic video anomaly detection task. Fig. 1 shows the distribution of papers on anomaly detection in the publicly available literature between 2015 and 2021 [2], along with some related keywords.

A model-based technique for anomaly identification in surveillance footage is proposed by Kamoona et al. [3] (2019). The system is divided into two phases, and numerous handcrafted features are employed on this platform. In addition, C3D features have been extracted from video data using deep learning approaches, with anomaly detection performed by an SVM; these techniques were applied by Sultani, Chen, and Shah (2018) [4]. Behaviour modelling is the following stage: in order to learn a representation of usual behaviour, an SVM is trained using a Bag of Visual Words (BOVW) in this phase.

Campus violence is the most dangerous kind of school bullying and is a global societal problem. As AI and remote monitoring capabilities develop, there are several potential methods to detect campus violence, including video-based ones. Ye et al. (2021) [5] use audio and visual data to detect campus violence. Data on campus violence is gathered through role-playing, and every 16 frames of video are used to extract 4096-dimension feature vectors. A 3D CNN is employed for feature extraction and classification, and an overall precision of 92.00 percent is achieved.

The Trajectory-Pooled Deep Convolutional Networks (ConvNet) model, which has 17 convolution-pool-norm layers and two fully connected layers, was employed by Meng et al. (2020) [5]. They apply their algorithm to both crowded and uncrowded datasets, with 92.5% accuracy on the Crowd Violence dataset and 98.6% on the Hockey Fight dataset.

Figure 1. Distribution of papers on violence detection per year [2].

A new method for evaluating whether a movie contains violent scenes is presented by Rendón-Segador et al. (2021) [6]. It is based on a modified 3D DenseNet with a multi-head self-attention layer and a bidirectional ConvLSTM module.

A weakly supervised anomaly localization (WSAL) technique is put forward by Hui Lv et al. [6]; it focuses on temporally localising anomalous portions inside anomalous videos. Inspired by the visual contrast in anomalous videos, the evolution of nearby temporal segments is assessed in order to locate anomalous segments. To do this, a high-order context encoding model is suggested that not only extracts semantic representations but also measures dynamic variations to make efficient use of the temporal context.

Due to the difficulty in accurately capturing both the spatial and temporal information of successive video frames, video classification is more complex than it is for static images.
The 3D convolution operator was suggested by S. Ji et al. [7] for computing features from both spatial and temporal data.

By examining the synergy between dictionary-based representation and self-supervised learning, Wu et al. [8] offer a self-supervised sparse representation (S3R) framework in 2022 that models the concept of anomaly at the feature level.

The Magnitude Contrastive Loss and the Feature Amplification Mechanism are proposed by Chen et al. in 2022 [9] to improve the discriminativeness of feature magnitudes for identifying anomalies; experimental results are reported on the UCF-Crime and XD-Violence benchmark datasets.

3. 3D Convolutional Neural Network

A 3D CNN is a type of neural network composed of several 2D convolutional layers followed by several layers of nonlinear units (the "fully connected" layers), all arranged in several parallel planes (i.e., three-dimensionally). A convolution can be applied along the time dimension to extract temporal patterns in the data, just as convolutional layers do for spatial patterns in image data. But if our data contains both spatial and temporal patterns, as is the case with video data, we should study these two types of patterns together, since they can combine to create more complicated spatio-temporal patterns. The basic idea behind a 3D CNN is to process the image or video sequence in its two kinds of dimensions (spatial and temporal) sequentially in order to obtain the final result.

A 3D CNN extends the CNN by enlarging the convolution kernel. Extraction of video features is effective using a 3D CNN [10]. For a more thorough analysis, the 3D CNN extracts spatial-temporal features from the entire video. The 3D convolution kernel is used to extract regional spatio-temporal neighbourhood information, which is appropriate given the data format of video. The 3D convolution is given by Eq. (1):

v_ij^xyz = ReLU( b_ij + Σ_m Σ_{p=0}^{P_i−1} Σ_{q=0}^{Q_i−1} Σ_{r=0}^{R_i−1} w_ijm^pqr · k_(i−1)m^((x+p)(y+q)(z+r)) )    (1)

where ReLU is the hidden layer's activation function, v_ij^xyz is the value at position (x, y, z) of the j-th feature map in the i-th layer, and b_ij is the bias of the j-th feature map in the i-th layer. w_ijm^pqr is the (p, q, r)-th value of the kernel connected to the m-th feature map of the preceding layer, k_(i−1)m. P_i and Q_i stand for the height and width of the convolution kernel, respectively, and R_i is the size of the 3D kernel along the temporal dimension.
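To make the index notation of Eq. (1) concrete, the following is a minimal NumPy sketch that evaluates one output feature map of a 3D convolution directly from the sums in the equation. It is an illustration only, not the authors' implementation; the "valid" (no padding) convention and the toy shapes are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv3d_feature_map(prev_maps, kernels, bias):
    """Direct (slow) evaluation of Eq. (1) for a single output feature map v_ij.

    prev_maps : (M, X, Y, Z) array -- the M feature maps k_(i-1)m of layer i-1
    kernels   : (M, P, Q, R) array -- one 3D kernel w_ijm per input feature map
    bias      : scalar b_ij
    Returns a (X-P+1, Y-Q+1, Z-R+1) map (valid convolution, no padding).
    """
    M, X, Y, Z = prev_maps.shape
    _, P, Q, R = kernels.shape
    out = np.zeros((X - P + 1, Y - Q + 1, Z - R + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # sum over m, p, q, r of w_ijm^pqr * k_(i-1)m^((x+p)(y+q)(z+r))
                patch = prev_maps[:, x:x + P, y:y + Q, z:z + R]
                out[x, y, z] = np.sum(kernels * patch) + bias
    return relu(out)

# toy example: two 8x8x6 input maps convolved with 3x3x3 kernels
rng = np.random.default_rng(0)
maps = rng.standard_normal((2, 8, 8, 6))
w = rng.standard_normal((2, 3, 3, 3))
print(conv3d_feature_map(maps, w, bias=0.1).shape)  # -> (6, 6, 4)
```

In practice a deep-learning framework performs this computation efficiently; the loop form above only mirrors the summation written in Eq. (1).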
4. Convolutional LSTM neural network (ConvLSTM)

The ConvLSTM was created especially for spatial-temporal sequence prediction problems. A ConvLSTM can extract spatial and temporal features from sets of feature maps more effectively than a standard LSTM [11]. This is because the ConvLSTM, which analyses and forecasts the events in a time series, can also take the spatial information of each feature map into account. Therefore, the ConvLSTM can be used to resolve timing issues more effectively in dynamic anomaly recognition. The ConvLSTM is formulated by the following equations [11]:

i_t = σ(W_xi * X_t + W_hi * H_{t−1} + W_ci ∘ C_{t−1} + b_i)    (2)
f_t = σ(W_xf * X_t + W_hf * H_{t−1} + W_cf ∘ C_{t−1} + b_f)    (3)
C_t = f_t ∘ C_{t−1} + i_t ∘ tanh(W_hc * H_{t−1} + W_xc * X_t + b_c)    (4)
O_t = σ(W_xo * X_t + W_ho * H_{t−1} + W_co ∘ C_t + b_o)    (5)
H_t = O_t ∘ tanh(C_t)    (6)

The inputs are X_1, X_2, ..., X_t, the cell outputs are C_1, C_2, ..., C_t, and the hidden states are H_1, H_2, ..., H_t. The gates i_t, f_t, and O_t are the three-dimensional tensors of the ConvLSTM; their final two dimensions, rows and columns, are spatial. The convolution operator and the Hadamard product are denoted by "*" and "∘", respectively. Batch normalisation and dropout layers are added to the ConvLSTM in this work.
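As an illustration of Eqs. (2)-(6), here is a minimal NumPy sketch of a single ConvLSTM update step with one hidden channel. The use of scipy's convolve2d, the kernel shapes, and the toy sizes are assumptions made for readability; this is not the authors' implementation.

```python
import numpy as np
from scipy.signal import convolve2d

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv_same(x, w):
    """'Same' 2D convolution summed over the channel axis: x (H, W, C), w (k, k, C) -> (H, W)."""
    return sum(convolve2d(x[..., c], w[..., c], mode="same") for c in range(x.shape[-1]))

def convlstm_step(X_t, H_prev, C_prev, W, b):
    """One ConvLSTM update (Eqs. 2-6) with a single hidden channel.
    X_t: (H, W, C_in); H_prev, C_prev: (H, W);
    W: dict of kernels (W['xi']: (k, k, C_in), W['hi']: (k, k, 1), ...) plus
       per-pixel peephole weights W['ci'], W['cf'], W['co']: (H, W); b: scalar biases."""
    Hp = H_prev[..., None]  # add a channel axis for the recurrent convolutions
    i = sigmoid(conv_same(X_t, W["xi"]) + conv_same(Hp, W["hi"]) + W["ci"] * C_prev + b["i"])  # Eq. (2)
    f = sigmoid(conv_same(X_t, W["xf"]) + conv_same(Hp, W["hf"]) + W["cf"] * C_prev + b["f"])  # Eq. (3)
    C = f * C_prev + i * np.tanh(conv_same(Hp, W["hc"]) + conv_same(X_t, W["xc"]) + b["c"])    # Eq. (4)
    o = sigmoid(conv_same(X_t, W["xo"]) + conv_same(Hp, W["ho"]) + W["co"] * C + b["o"])       # Eq. (5)
    H_new = o * np.tanh(C)                                                                     # Eq. (6)
    return H_new, C

# toy step on a 16x16 grid with a 3-channel input frame
rng = np.random.default_rng(0)
rows, cols, c_in, k = 16, 16, 3, 3
X = rng.standard_normal((rows, cols, c_in))
W = {n: rng.standard_normal((k, k, c_in)) * 0.1 for n in ["xi", "xf", "xc", "xo"]}
W.update({n: rng.standard_normal((k, k, 1)) * 0.1 for n in ["hi", "hf", "hc", "ho"]})
W.update({n: rng.standard_normal((rows, cols)) * 0.1 for n in ["ci", "cf", "co"]})
b = {n: 0.0 for n in ["i", "f", "c", "o"]}
H_t, C_t = convlstm_step(X, np.zeros((rows, cols)), np.zeros((rows, cols)), W, b)
print(H_t.shape, C_t.shape)  # -> (16, 16) (16, 16)
```

In a full model this step is applied recurrently over the frame sequence; Keras offers a comparable building block as the ConvLSTM2D layer.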
5. Proposed method

A 3D CNN and a ConvLSTM are coupled to classify the videos. In this section we outline the architecture of the 3DCNN-ConvLSTM model. We propose a 3D convolutional neural network (3DCNN) followed by a convolutional long short-term memory (ConvLSTM) network as the feature extraction model for the dynamic anomaly identification process. The architecture of the 3DCNN-ConvLSTM model is shown in Fig. 2. A stack of consecutive anomaly video frames resized to 16 × 32 × 32 × 3 forms the input layer. Four 3D convolutional layers, each with a different number of filters (32, 32, 64, and 64) but the same 3 × 3 × 3 kernel size, make up the architecture, after which a ConvLSTM layer with 64 units is applied.

A ReLU layer and a batch normalisation layer come after each 3DCNN layer. 3D max-pooling and dropout layers were placed between each pair of 3DCNN layers, with dropout values of 0.3 and 0.5.
A fully connected layer with 512 units produces the output and is followed by the Softmax activation function, which has as many output units as there are anomaly video classes.

Figure 2. General design of the 3DCNN-ConvLSTM model.
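As a rough illustration of this architecture, the following Keras sketch stacks four Conv3D layers (32, 32, 64, and 64 filters with 3 × 3 × 3 kernels), each followed by batch normalisation and ReLU, with max-pooling and dropout (0.3 and 0.5) blocks, a ConvLSTM2D layer with 64 filters, and a 512-unit fully connected layer before the Softmax output. The pooling sizes, the placement of the pooling/dropout blocks, the activation of the 512-unit layer, the optimiser, and the helper name are assumptions; this is a sketch, not the authors' exact implementation.

```python
from tensorflow.keras import layers, models

def build_3dcnn_convlstm(n_classes, clip_shape=(16, 32, 32, 3)):
    """Sketch of the 3DCNN-ConvLSTM classifier described in the text (details assumed)."""
    model = models.Sequential()
    for idx, n_filters in enumerate([32, 32, 64, 64]):
        kwargs = {"input_shape": clip_shape} if idx == 0 else {}
        model.add(layers.Conv3D(n_filters, kernel_size=(3, 3, 3), padding="same", **kwargs))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation("relu"))
        if idx == 1:  # pooling + dropout between the two pairs of Conv3D blocks (assumed placement)
            model.add(layers.MaxPooling3D(pool_size=(2, 2, 2)))
            model.add(layers.Dropout(0.3))
    model.add(layers.MaxPooling3D(pool_size=(2, 2, 2)))
    model.add(layers.Dropout(0.5))
    # ConvLSTM over the remaining frames; 64 filters as described in the text
    model.add(layers.ConvLSTM2D(64, kernel_size=(3, 3), padding="same", return_sequences=False))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model

model = build_3dcnn_convlstm(n_classes=9)  # e.g. 8 anomaly classes + normal (UCF-Crime)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

With a 16 × 32 × 32 × 3 input and two (2, 2, 2) pooling steps, the ConvLSTM2D layer receives a 4 × 8 × 8 × 64 tensor, so the recurrence runs over the four remaining temporal steps before the fully connected classifier.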
To classify a test video, it is collected, divided into 16 consecutive frames, and supplied to the trained model. The features discovered by the model are used to determine a probability score for each frame. The majority-voting scheme is then given the predictions of the 16 frames as input and, based on the probability score of each frame, it predicts the label of the video sequence. The majority-voting formula is given in Eq. (7):

Y = mode{C(X_1), C(X_2), C(X_3), ..., C(X_16)}    (7)

where X_1, X_2, ..., X_16 denote the frames taken from the tested video and Y is the class label of the tested video. The predicted class label for each frame is represented by C(X_1), C(X_2), C(X_3), ..., C(X_16).
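A minimal sketch of the voting step in Eq. (7) is shown below, assuming the per-frame class probabilities have already been produced by the trained model; the array shapes and the helper name are illustrative only.

```python
import numpy as np
from collections import Counter

def predict_video_label(frame_probs):
    """Majority voting of Eq. (7).
    frame_probs: (16, n_classes) array of per-frame probability scores.
    Each frame votes with its argmax class C(X_k); the video label Y is the mode."""
    frame_labels = np.argmax(frame_probs, axis=1)               # C(X_1), ..., C(X_16)
    return Counter(frame_labels.tolist()).most_common(1)[0][0]  # Y = mode{...}

# toy example with 3 classes
rng = np.random.default_rng(1)
probs = rng.random((16, 3))
probs /= probs.sum(axis=1, keepdims=True)
print(predict_video_label(probs))
```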
6. Datasets

Finding and analysing anomalies in video data is becoming increasingly popular. To meet this need, we apply our approach to several significant video datasets to detect and characterize anomalies: UCF-Crime [4], XD-Violence [12], UBI-Fights [13], and NTU CCTV-Fights [14].

The first dataset, UCF-Crime, is a large and varied collection of 128 hours of video. It consists of 1900 long videos covering eight categories of crime, including assault, arson, fighting, burglary, explosion, arrest, abuse, and road accidents. The collection also includes "Normal" videos, meaning those without any recorded crimes. Two tasks can be accomplished with this dataset; first, a general analysis of anomalies is performed, considering all anomalies in one group and all regular activities in another. Figure 3 shows how the videos are distributed per class in UCF-Crime.

Figure 3. Distribution of the percentage of videos in each UCF-Crime class.

The second dataset, XD-Violence [12], is a massive, multi-scene dataset with a duration of 217 hours and a total of 4754 untrimmed videos with audio signals and weak labels.

The third dataset, UBI-Fights, contains 80 hours of footage that is fully annotated at the frame level and focuses on a specific anomaly (fighting) while still offering a wide variety of fight scenarios. It consists of 1000 videos, of which 216 contain fight scenes and 784 depict ordinary daily events. To prevent disruptions to the learning process, all extraneous video segments such as introductions and news clips were deleted. The titles of the videos include indicators of the kind of each video, such as indoor or outdoor environment, RGB or grayscale video, and fixed, rotating, or movable camera.

The final dataset, CCTV-Fights, includes 1,000 videos of real fights captured by CCTV or portable cameras. There are 280 CCTV videos in total, with fights ranging in length from 5 seconds to 12 minutes, with an average of 2 minutes. Additionally, it includes 720 videos of real fights taken from other sources (referred to as Non-CCTV in this document), mostly from mobile cameras but occasionally from dashcams, drones, and helicopters. These videos range in length from 3 seconds to 7 minutes, with an average of 45 seconds; some of them contain multiple fights, which can help the model generalise better. The datasets utilized in this experiment are fully described in Table 1.
Dataset        #videos   #hours   #Violence types   Size
UCF-Crime      1900      128      9                 60 GB
XD-Violence    4754      217      6                 123 GB
CCTV-Fights    1000      17.68    1                 7.2 GB
UBI-Fights     1000      80       2                 7.9 GB
UCF-101        13320     27       101               7 GB
Table 1. Detailed information on each dataset utilized in the comparison.

7. Implementation

We divide each dataset into 75%:25% training and testing splits for evaluation. Each split is further divided into five folds, each containing approximately one third of the total videos for training or validation, and the remaining videos are used for testing.

The deep learning model was implemented on a Windows 10 Pro computer with an Intel Core i7 CPU and 16 GB of RAM. The system was implemented in Python, using the Anaconda environment and the Spyder editor. Keras and TensorFlow were the deep learning libraries used. For handling and pre-processing the data, the Python OpenCV package was used.
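For illustration, the following is a minimal OpenCV sketch of the kind of pre-processing implied above: sampling 16 evenly spaced frames from a video, resizing them to 32 × 32, and stacking them into a 16 × 32 × 32 × 3 clip. The sampling strategy, the colour conversion, and the [0, 1] scaling are assumptions, not the authors' exact pipeline.

```python
import cv2
import numpy as np

def load_clip(video_path, n_frames=16, size=(32, 32)):
    """Sample n_frames evenly spaced frames and return a (n_frames, 32, 32, 3) float array."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = set(np.linspace(0, max(total - 1, 0), n_frames).astype(int))
    frames = []
    for i in range(total):
        ok, frame = cap.read()
        if not ok:
            break
        if i in indices:
            frame = cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2RGB)
            frames.append(frame.astype(np.float32) / 255.0)  # scale pixel values to [0, 1]
    cap.release()
    if not frames:
        raise ValueError(f"no frames read from {video_path}")
    while len(frames) < n_frames:  # pad very short videos by repeating the last frame
        frames.append(frames[-1])
    return np.stack(frames[:n_frames])

# clip = load_clip("some_video.mp4")   # hypothetical path
# print(clip.shape)                    # -> (16, 32, 32, 3)
```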
There are numerous parameters in deep learning models that influence their development and effectiveness. Here we discuss how the number of iterations affects the performance of our network. The number of iterations is one of the most crucial hyperparameters in contemporary deep learning systems. In practice, fewer iterations are needed to train the model, which speeds up computation significantly thanks to GPU parallelism; on the other hand, employing larger iteration numbers resulted in longer training than small iteration numbers, but testing accuracy was higher. The size of the training dataset largely determines the batch size and the number of epochs.

8. Experimental results

Performance evaluation is an important task, so the AUC (Area Under the Curve) is used to assess and depict the performance of the multi-class classification problem. It is one of the most fundamental evaluation criteria for assessing the effectiveness of any classification model. AUC is a measure of separability: it reveals how well the model can differentiate between classes.

The two metrics used for the classification models are accuracy and AUC. A model with high accuracy makes very few erroneous predictions, but the cost of those inaccurate predictions is not considered. Using accuracy for such problems abstracts away the specifics of true and false positives and gives model forecasts an excessive amount of confidence, which can be harmful to the application's goals. Because it captures the trade-off between sensitivity and specificity at the best-chosen threshold, AUC is the preferred statistic in such circumstances. Additionally, accuracy assesses the performance of a model at a single operating point, whereas AUC compares models and assesses the performance of a single model at various thresholds.
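As an illustration of how these two metrics can be computed for a multi-class classifier, a short scikit-learn sketch is given below; scikit-learn and the toy labels are assumptions here, since the paper only states that accuracy and AUC were used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# y_true: integer class labels; y_prob: per-class probability scores from the model
y_true = np.array([0, 2, 1, 2, 0, 1])
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.2, 0.6, 0.2],
                   [0.3, 0.3, 0.4],
                   [0.6, 0.3, 0.1],
                   [0.1, 0.7, 0.2]])

accuracy = accuracy_score(y_true, y_prob.argmax(axis=1))
# one-vs-rest AUC averaged over classes, evaluated across all thresholds
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
print(f"accuracy = {accuracy:.3f}, AUC = {auc:.3f}")
```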
The recognition accuracy and AUC were used to evaluate how well the trained models work. In our experiments, Table 2 (for the UCF-Crime, XD-Violence, and CCTV datasets) and Table 3 (for the UBI-Fights and UCF-101 datasets) report the recognition accuracy and AUC of the proposed 3DCNN+ConvLSTM model at 10, 30, and 50 iterations, all with a batch size of 32.

Figs. 4 and 5 show the training and validation performance for the UCF-Crime dataset at 10 and 30 epochs respectively, while Fig. 6 shows the training and validation performance for the UCF-101 dataset at 100 epochs; the model clearly performed well on the training data, with a training accuracy of almost 100%. The trained model was tested using 25% of each dataset, and the best recognition accuracy rate was 100%, obtained at 50 epochs on the UCF-101 dataset, while the model achieves accuracies of 98.5%, 95.1%, 99%, and 97.1% for the UCF-Crime, XD-Violence, CCTV, and UBI-Fights datasets respectively. The model achieves its maximum recognition accuracy on the five datasets when trained for 50 epochs, which takes 25 hours for the UCF-Crime dataset. The model achieves 92.2%, 87.7%, 94.3%, 93.3%, and 92.3% AUC for the UCF-Crime, XD-Violence, CCTV, UBI-Fights, and UCF-101 datasets respectively. These results are competitive with the recent research included in the comparisons in Tables 4, 5, and 6.

Measure        UCF-Crime   XD-Violence   CCTV
Accuracy_10    89%         81.9%         91.7%
AUC_10         80%         79%           83%
Accuracy_30    93.4%       92.3%         94.1%
AUC_30         85.6%       83.2%         89%
Accuracy_50    98.5%       95.1%         99%
AUC_50         92.2%       87.7%         94.3%
Table 2. Comparison of our model's performance on the UCF-Crime, XD-Violence, and CCTV datasets.

Measure        UBI-Fights   UCF-101
Accuracy_10    89.7%        90.7%
AUC_10         82.6%        87%
Accuracy_30    93.1%        95.1%
AUC_30         89.8%        89.3%
Accuracy_50    97.1%        100%
AUC_50         93.3%        92.3%
Table 3. Comparison of our model's performance on the UBI-Fights and UCF-101 datasets.

Figure 4. The model's training and validation accuracy for the UCF-Crime dataset for 10 epochs.

Figure 5. The model's training and validation accuracy for the UCF-Crime dataset for 30 epochs.

Figure 6. The model's training and validation accuracy for the UCF-101 dataset for 100 epochs.

Table 4 compares the results of models from other studies on the UCF-Crime dataset in order to properly evaluate the model. It demonstrates that our model provides the best AUC result, 92.2%, at 50 epochs, and achieves 87.7% and 95.1% for AUC and accuracy respectively on the CCTV-Fights dataset.

REF.   AUC      Method                                                                                Year
[9]    86.98%   MGFN                                                                                  2022
[8]    85.99%   S3R                                                                                   2022
[6]    85.38%   WSAL                                                                                  2020
[15]   84.89%   Learning Causal Temporal Relation and Feature Discrimination for Anomaly Detection    2021
[16]   84.48%   Multi-stream Network with Late Fuzzy Fusion                                           2022
[17]   84.03%   RTFM                                                                                  2021
[15]   82.67%   DAM                                                                                   2018
ours   92.2%    3DCNN+ConvLSTM                                                                        2023
Table 4. A comparison between the results of our model and other models for the UCF-Crime dataset.

Table 5 compares the results of models from other studies on the XD-Violence dataset in order to properly evaluate the model, and demonstrates that our model provides the best AUC result, 87.7%, at 50 epochs.

REF.   AUC      Method            Year
[18]   83.54%   CMA_LA            2022
[19]   83.4%    MACIL_SD          2022
[8]    80.26%   S3R               2022
[9]    82.11%   MGFN              2022
[17]   77.81%   RTFM              2021
ours   87.7%    3DCNN+ConvLSTM    2023
Table 5. A comparison between the results of our model and other models for the XD-Violence dataset.

Table 6 compares the results of models from other studies on the UBI-Fights dataset in order to properly evaluate the model, and demonstrates that our model provides the best AUC result, 93.3%, at 50 epochs.

REF.   AUC     Method            Year
[1]    90.6%   GMM               2020
[4]    89.2%   Sultani et al.    2018
[20]   61%     S2-VAE            2018
ours   93.3%   3DCNN+ConvLSTM    2023
Table 6. A comparison between the results of our model and other models for the UBI-Fights dataset.

Table 7 compares the results of models from other studies on the UCF-101 dataset in order to properly evaluate the model, and demonstrates that our model provides the best accuracy result, 100%, at 50 epochs. Figures 7 and 8 show, as examples, the feature characteristics over time for an abuse video and an explosion video.

REF.   Accuracy   Method               Year
[21]   98.64%     SMART                2020
[22]   98.6%      OmniSource           2020
[23]   98.2%      Text4Vis             2022
[24]   98.2%      LGD-3D Two-stream    2019
ours   100%       3DCNN+ConvLSTM       2023
Table 7. A comparison between the results of our model and other models for the UCF-101 dataset.

Figure 7. Anomaly detection in an abuse video over time.

Figure 8. Anomaly detection in an explosion video over time.

9. Conclusion

Since deep learning is a potent artificial intelligence technique for video categorization, we proposed an anomaly detection model employing it in this study. The 3DCNN and ConvLSTM models work together to address anomaly detection issues. Applying the suggested approach to five large-scale datasets allowed us to evaluate it. Excellent performance was shown on the five datasets, and the model training accuracy was 100%. The reliability of the recognition was correspondingly 98.5%, 99.2%, and 94.5%. Compared to a 3DCNN alone, the 3DCNN+ConvLSTM produced good performance on the datasets. The results of our study show that the model is more accurate than the other competing models. As an extension of the current work, we intend to create a model for predicting anomalies from surveillance video.

References

[1] B. M. Degardin, "Weakly and Partially Supervised Learning Frameworks for Anomaly Detection," 2020.
[2] B. Omarov, S. Narynov, Z. Zhumanov, A. Gumar, and M. Khassanova, "State-of-the-art violence detection techniques in video surveillance security systems: A systematic review," PeerJ Comput. Sci., vol. 8, 2022, doi: 10.7717/PEERJ-CS.920.
[3] A. M. Kamoona, A. K. Gostar, A. Bab-Hadiashar, and R. Hoseinnezhad, "Sparsity-Based Naive Bayes Approach for Anomaly Detection in Real Surveillance Videos," in ICCAIS 2019 - 8th International Conference on Control, Automation and Information Sciences, Oct. 2019, doi: 10.1109/ICCAIS46528.2019.9074564.
[4] W. Sultani, C. Chen, and M. Shah, "Real-world Anomaly Detection in Surveillance Videos," Jan. 2018. [Online]. Available: http://arxiv.org/abs/1801.04264
[5] L. Ye, T. Liu, T. Han, H. Ferdinando, T. Seppänen, and E. Alasaarela, "Campus violence detection based on artificial intelligent interpretation of surveillance video sequences," Remote Sens., vol. 13, no. 4, pp. 1–17, 2021, doi: 10.3390/rs13040628.
[6] H. Lv, C. Zhou, Z. Cui, C. Xu, Y. Li, and J. Yang, "Localizing Anomalies from Weakly-Labeled Videos," IEEE Trans. Image Process., vol. 30, pp. 4505–4515, 2021, doi: 10.1109/TIP.2021.3072863.
[7] S. Ji, W. Xu, M. Yang, and K. Yu, "3D Convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, 2013, doi: 10.1109/TPAMI.2012.59.
[8] J.-C. Wu, H.-Y. Hsieh, D.-J. Chen, C.-S. Fuh, and T.-L. Liu, "Self-supervised Sparse Representation for Video Anomaly Detection," pp. 729–745, 2022, doi: 10.1007/978-3-031-19778-9_42.
[9] Y. Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y.-C. Wu, "MGFN: Magnitude-Contrastive Glance-and-Focus Network for Weakly-Supervised Video Anomaly Detection," 2022.
[10] E. K. Elsayed and D. R. Fathy, "Semantic Deep Learning to Translate Dynamic Sign Language," Int. J. Intell. Eng. Syst., vol. 14, no. 1, pp. 316–325, Nov. 2020, doi: 10.22266/IJIES2021.0228.30.
[11] X. Shi, Z. Chen, H. Wang, D. Y. Yeung, W. K. Wong, and W. C. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," Adv. Neural Inf. Process. Syst., pp. 802–810, 2015.
[12] P. Wu et al., "Not only Look, But Also Listen: Learning Multimodal Violence Detection Under Weak Supervision," Lect. Notes Comput. Sci., vol. 12375 LNCS, pp. 322–339, 2020, doi: 10.1007/978-3-030-58577-8_20.
[13] B. Degardin and H. Proenca, "Human activity analysis: Iterative weak/self-supervised learning frameworks for detecting abnormal events," IJCB 2020 - IEEE/IAPR Int. Jt. Conf. Biometrics, 2020, doi: 10.1109/IJCB48548.2020.9304905.
[14] M. Perez, A. C. Kot, and A. Rocha, "Detection of Real-world Fights in Surveillance Videos," ICASSP 2019 - IEEE Int. Conf. Acoust. Speech Signal Process., pp. 2662–2666, 2019.
[15] B. Arzani et al., "007: Democratically finding the cause of packet drops," Proc. 15th USENIX Symp. Networked Syst. Des. Implementation, NSDI 2018, pp. 419–435, 2018.
[16] K. V. Thakare, N. Sharma, D. P. Dogra, H. Choi, and I. J. Kim, "A multi-stream deep neural network with late fuzzy fusion for real-world anomaly detection," Expert Syst. Appl., vol. 201, 2022, doi: 10.1016/j.eswa.2022.117030.
[17] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, "Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning," Proc. IEEE Int. Conf. Comput. Vis., pp. 4955–4966, 2021, doi: 10.1109/ICCV48922.2021.00493.
[18] Y. Pu and X. Wu, "Audio-Guided Attention Network for Weakly Supervised Violence Detection," in 2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE), 2022.
[19] J. Yu, J. Liu, Y. Cheng, R. Feng, and Y. Zhang, "Modality-aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection," pp. 6278–6287, 2022, doi: 10.1145/3503161.3547868.
[20] T. Wang et al., "Generative Neural Networks for Anomaly Detection in Crowded Scenes," IEEE Trans. Inf. Forensics Secur., vol. 14, no. 5, pp. 1390–1399, 2019, doi: 10.1109/TIFS.2018.2878538.
[21] S. N. Gowda, M. Rohrbach, and L. Sevilla-Lara, "SMART Frame Selection for Action Recognition," 35th AAAI Conf. Artif. Intell., AAAI 2021, vol. 2B, pp. 1451–1459, 2021, doi: 10.1609/aaai.v35i2.16235.
[22] H. Duan, Y. Zhao, Y. Xiong, W. Liu, and D. Lin, "Omni-Sourced Webly-Supervised Learning for Video Recognition," Lect. Notes Comput. Sci., vol. 12360 LNCS, pp. 670–688, 2020, doi: 10.1007/978-3-030-58555-6_40.
[23] W. Wu, Z. Sun, and W. Ouyang, "Transferring Textual Knowledge for Visual Recognition," 2022. [Online]. Available: http://arxiv.org/abs/2207.01297
[24] Z. Qiu, T. Yao, C. W. Ngo, X. Tian, and T. Mei, "Learning spatio-temporal representation with local and global diffusion," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 12048–12057, 2019, doi: 10.1109/CVPR.2019.01233.