Anomaly Detection in Surveillance
1,2,3 Mathematics Department, Faculty of Science, Al-Azhar University (Girls Branch), Cairo, Egypt
2 School of Computer Science, Canadian International College (CIC), Cairo, Egypt
4 Mathematics Department, Faculty of Science, Al-Azhar University, Cairo, Egypt
* Corresponding author’s Email: [email protected]
Abstract: This paper presents a new method for anomaly detection in surveillance videos using deep learning. The proposed method is based on a deep network trained to identify objects and human activities in videos. The method was tested on five real-world large-scale datasets (UCF-Crime, XD-Violence, UBI-Fights, CCTV-Fights, and UCF-101) containing indoor and outdoor video sequences, as well as on synthetic datasets with different object sizes, appearances, and activity types. We use a 3D convolutional neural network (3D-CNN) followed by a convolutional long short-term memory (ConvLSTM) network to extract features from video frames and then perform classification and recognition based on these features. The results show that the proposed method achieves high accuracy and AUC in both indoor and outdoor scenarios compared to the state-of-the-art methods reported in the comparison.
Keywords: Anomaly detection, 3D convolutional neural network, Surveillance videos, Bidirectional ConvLSTM, Fight detection, Violence detection.
For each class, they are trained to separate the normal from the abnormal frames in a video recording. This is achieved by evaluating the similarity between a feature vector extracted from a normal frame and a feature vector extracted from an anomalous frame belonging to the same class, and then classifying the frame as either normal or abnormal by calculating a similarity score between the two feature vectors. The main disadvantage of this approach is that it requires a large number of training images and very large datasets for the network to learn useful image features. We therefore trained our model on a large dataset, UCF-Crime, which contains more than 128 hours of recorded video divided into 8 anomaly classes and 1 normal class. We evaluate the model's performance on the held-out test data, and the results show that it achieves reasonable classification accuracy for different types of anomaly events and outperforms other recent approaches.

First, we describe the datasets used in this paper, how they were pre-processed, and how a 3D-CNN approach was trained and tested to detect different types of anomalies. Then we describe the results obtained on the test data and report the classification accuracy and AUC for each dataset.

This paper is organized as follows: Section 2 presents a literature review of works related to this research study. Section 3 describes the 3D-CNN and Section 4 the ConvLSTM network. Section 5 describes the proposed technique. The remaining sections describe the datasets, briefly explain how the training data is pre-processed, and present the results, followed by a discussion and conclusions.

2. Literature review

In the field of action detection, using computer vision to identify certain actions in security-camera footage has grown in popularity. This work is related to the computer vision field. Many researchers have been trying to develop efficient machine-learning methods for the automatic video anomaly detection task. Fig. 1 shows the distribution of papers on anomaly detection in the publicly available literature between 2015 and 2021 [2], together with some related keywords.

Figure 1. Distribution of papers on violence detection per year [2].

A model-based technique for anomaly identification in surveillance footage is proposed by Kamoona et al. [3] (2019). The system is divided into two phases. In this framework, numerous handcrafted features have been employed; additionally, C3D features have been extracted from the video data using deep learning approaches, with anomaly detection performed by an SVM. These techniques were applied by Sultani, Chen, and Shah (2018) [4]. Behaviour modelling is the following stage: in order to learn the representation of usual behaviour, an SVM is trained using a Bag of Visual Words (BOVW) in this phase.

Campus violence is the most dangerous kind of school bullying and is a global societal problem. As AI and remote monitoring capabilities develop, there are several potential methods to detect campus violence, including video-based ones. Ye et al. (2021) [5] use audio and visual data to detect campus violence. Data on campus violence are gathered through role-playing, and every 16 frames of video are used to extract 4096-dimensional feature vectors. A 3D CNN is employed for feature extraction and classification, and an overall precision of 92.00 percent is achieved.

The Trajectory-Pooled Deep Convolutional Networks ConvNet model, which has 17 convolution-pool-norm layers and two fully connected layers, was employed by Meng et al. (2020) [5]. They apply their algorithm to both crowded and uncrowded datasets, achieving 92.5% accuracy on the Crowd Violence dataset and 98.6% on the Hockey Fight dataset.

A new method for evaluating whether a movie contains violent scenes is presented by Rendón-Segador et al. (2021) [6]. It is based on a modified 3D DenseNet with a multi-head self-attention layer and a bidirectional ConvLSTM module.

A weakly supervised anomaly localization (WSAL) technique is put forward by Hui Lv et al. [6]; it focuses on temporally localising anomalous portions inside anomalous videos. Inspired by the visual contrast in anomalous videos, the evolution of nearby temporal segments is assessed in order to locate anomalous segments. To do this, a high-order context encoding model is suggested that not only extracts semantic representations but also measures dynamic variations so as to make efficient use of the temporal context.

Due to the difficulty of accurately capturing both the spatial and temporal information of successive video frames, video classification is more complex than classification of static images. The 3D convolution operator was suggested by S. Ji et al. [7] for computing features from both spatial and temporal data.
By examining the synergy between dictionary-based representation and self-supervised learning, Wu et al. [8] offer a self-supervised sparse representation (S3R) framework (2022) that models the concept of anomaly at the feature level.

The Magnitude Contrastive Loss and the Feature Amplification Mechanism are proposed by Chen et al. in 2022 [9] to improve the discriminativeness of feature magnitudes for identifying anomalies; experimental results are reported on the UCF-Crime and XD-Violence benchmark datasets.

3. 3D Convolutional Neural Network

A 3D CNN is a type of neural network composed of several 2D convolutional layers followed by several layers of nonlinear units (the "fully connected" layers), all arranged in several parallel planes (i.e., three-dimensionally). A convolution can be applied along the time dimension to extract temporal patterns in the data, just as convolutional layers extract spatial patterns in image data. However, if our data contains both spatial and temporal patterns, as video data does, we should study these two types of patterns together, since they can combine to create more complicated spatio-temporal patterns. The basic idea behind a 3D CNN is to process the image or video sequence in the two domains (spatial and temporal) sequentially in order to obtain the final result.

The 3DCNN achieves this by extending the CNN and enlarging the convolution kernel. Extraction of video features is effective using a 3D CNN [10]. For a more thorough analysis, the 3DCNN extracts the spatial-temporal features from the entire video. The 3D convolution kernel is used to extract regional spatiotemporal neighbourhood information, which is appropriate given the video data format. Eq. (1) gives the formulation of the 3D convolution:

$v_{ij}^{xyz} = \mathrm{ReLU}\Big(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, k_{(i-1)m}^{(x+p)(y+q)(z+r)}\Big)$  (1)

where ReLU stands for the hidden layer's activation function. The current value at position $(x, y, z)$ in the $i$-th and $j$-th feature graph sets is represented by $v_{ij}^{xyz}$. The bias of the $i$-th layer and the $j$-th feature map is represented by the term $b_{ij}$. The $(p, q, r)$-th value of the kernel associated with the $m$-th feature map in the preceding layer is represented by $w_{ijm}^{pqr}$. $P_i$ and $Q_i$ stand for the height and width of the convolution kernel, respectively, and $R_i$ is the size of the 3D kernel along the temporal dimension.

4. Convolutional LSTM neural network (ConvLSTM)

The ConvLSTM was created especially for spatial-temporal sequence prediction problems. ConvLSTM can extract spatial and temporal features from sets of feature maps more effectively than a standard LSTM [11], because ConvLSTM, which analyses and forecasts events in a time series, can also take the spatial information of a single feature map into account. Therefore, ConvLSTM can be used to resolve timing issues more effectively in dynamic anomaly recognition. The following equations formulate the ConvLSTM [11]:

$i_t = \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + y_i)$  (2)
$f_t = \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + y_f)$  (3)
$C_t = f_t \circ C_{t-1} + i_t \circ \tanh(W_{hc} * H_{t-1} + W_{xc} * X_t + y_c)$  (4)
$O_t = \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + y_o)$  (5)
$H_t = O_t \circ \tanh(C_t)$  (6)

The inputs are $X_1, X_2, \ldots, X_t$, the cell outputs are $C_1, C_2, \ldots, C_t$, and the hidden states are $H_1, H_2, \ldots, H_t$. The gates $i_t$, $f_t$, and $O_t$ of the ConvLSTM are three-dimensional tensors whose final two dimensions are spatial (rows and columns). The convolution operator and the Hadamard product are denoted by "$*$" and "$\circ$", respectively. In this work, a batch normalisation layer and a dropout layer are added to the ConvLSTM.

5. Proposed method

The 3D CNN and ConvLSTM are coupled to classify video. In this section we outline the 3DCNN-ConvLSTM model's architecture. We propose a 3D convolutional neural network (3DCNN) followed by a convolutional long short-term memory (ConvLSTM) network as the feature extraction model for the dynamic anomaly identification process. The 3DCNN-ConvLSTM model's architecture is shown in Fig. 2. The input layer is a stack of consecutive anomaly video frames that have been resized to 16 × 32 × 32 × 3. The architecture is made up of four 3D convolutional layers with different numbers of filters (32, 32, 64, and 64) but the same 3 × 3 × 3 kernel size, followed by a ConvLSTM layer with 64 units.

A ReLU layer and a batch normalisation layer come after each 3DCNN layer. 3D max-pooling and dropout layers were placed between each pair of 3DCNN layers, with dropout values of 0.3 and 0.5. A fully connected layer with 512 units is used to produce the output probabilities; it is followed by a Softmax activation function whose number of output units equals the number of anomaly video classes.
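The ConvLSTM recurrence in Eqs. (2)-(6) can also be written directly in code. The following is a minimal single-step sketch using TensorFlow primitives; the weight-dictionary layout and kernel shapes are illustrative assumptions, and in practice a built-in layer such as Keras's ConvLSTM2D wraps this recurrence.

```python
# Minimal single-step ConvLSTM cell following Eqs. (2)-(6).
# Weight layout and shapes are illustrative assumptions only.
import tensorflow as tf

def convlstm_step(x_t, h_prev, c_prev, w, b):
    """x_t, h_prev, c_prev: (batch, H, W, C) feature maps.
    w: dict of conv kernels W_x*, W_h* with shape (k, k, C, C) and
       peephole maps W_c* with shape (H, W, C); b: dict of biases."""
    conv = lambda inp, kern: tf.nn.conv2d(inp, kern, strides=1, padding="SAME")

    i_t = tf.sigmoid(conv(x_t, w["xi"]) + conv(h_prev, w["hi"]) + w["ci"] * c_prev + b["i"])  # Eq. (2)
    f_t = tf.sigmoid(conv(x_t, w["xf"]) + conv(h_prev, w["hf"]) + w["cf"] * c_prev + b["f"])  # Eq. (3)
    c_t = f_t * c_prev + i_t * tf.tanh(conv(h_prev, w["hc"]) + conv(x_t, w["xc"]) + b["c"])   # Eq. (4)
    o_t = tf.sigmoid(conv(x_t, w["xo"]) + conv(h_prev, w["ho"]) + w["co"] * c_t + b["o"])     # Eq. (5)
    h_t = o_t * tf.tanh(c_t)                                                                  # Eq. (6)
    return h_t, c_t
```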
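As an illustration of the layer stack just described, the sketch below shows how such a 3DCNN-ConvLSTM classifier could be assembled in Keras. It is a minimal reconstruction from the text (16 × 32 × 32 × 3 input clips, four Conv3D layers with 32, 32, 64, and 64 filters and 3 × 3 × 3 kernels, batch normalisation and ReLU after each, max pooling and dropout of 0.3/0.5 between pairs of convolutional layers, a 64-filter ConvLSTM layer, a 512-unit fully connected layer, and a Softmax output); the pooling sizes, padding, optimiser, and loss are assumptions, not the authors' released implementation.

```python
# Sketch of the 3DCNN + ConvLSTM classifier described in Section 5.
# Details not given in the text (pool sizes, optimiser, padding) are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 9                 # 8 anomaly classes + 1 normal class (UCF-Crime subset)
CLIP_SHAPE = (16, 32, 32, 3)    # frames x height x width x channels

def build_3dcnn_convlstm(num_classes=NUM_CLASSES):
    inputs = layers.Input(shape=CLIP_SHAPE)

    x = inputs
    # Four Conv3D blocks: 32, 32, 64, 64 filters, 3x3x3 kernels,
    # each followed by batch normalisation and ReLU.
    for i, filters in enumerate([32, 32, 64, 64]):
        x = layers.Conv3D(filters, kernel_size=(3, 3, 3), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        # Max pooling + dropout between each pair of Conv3D layers
        # (a pool size of (1, 2, 2) is assumed to keep all 16 time steps).
        if i == 1:
            x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
            x = layers.Dropout(0.3)(x)
        elif i == 3:
            x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
            x = layers.Dropout(0.5)(x)

    # ConvLSTM layer with 64 filters over the remaining time dimension.
    x = layers.ConvLSTM2D(64, kernel_size=(3, 3), padding="same",
                          return_sequences=False)(x)

    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",                     # assumed optimiser
                  loss="categorical_crossentropy",      # assumes one-hot labels
                  metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
    return model

if __name__ == "__main__":
    build_3dcnn_convlstm().summary()
```

Training hyper-parameters such as batch size and learning rate are not specified in the text, so the compile step above is purely illustrative.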
The UCF-Crime dataset is used for training and evaluation. Anomalies such as fighting, burglary, explosion, arrest, abuse, and road accidents appear in these videos. The collection also includes "Normal" videos, i.e., videos without any recorded crimes. Two tasks can be accomplished using this dataset. First, a general analysis of anomalies is performed, considering all anomalies in one group and all regular activities in another. Fig. 3 shows how the percentage of videos is distributed across the UCF-Crime classes.
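Since the model consumes stacks of consecutive frames resized to 16 × 32 × 32 × 3, a pre-processing step has to turn each raw video into fixed-length clips. The snippet below is a hypothetical sketch of that step using OpenCV; the clip length and frame size come from Section 5, while the uniform sampling strategy and the [0, 1] normalisation are assumptions rather than the authors' exact pipeline.

```python
# Hypothetical clip-extraction step: sample 16 frames from a video,
# resize them to 32x32, and scale pixel values to [0, 1].
import cv2
import numpy as np

CLIP_LEN = 16
FRAME_SIZE = (32, 32)   # (width, height) for cv2.resize

def video_to_clip(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, FRAME_SIZE))
    cap.release()

    if not frames:
        raise ValueError(f"Could not read any frames from {video_path}")
    if len(frames) < CLIP_LEN:
        # Pad short videos by repeating the last frame (assumption).
        frames.extend([frames[-1]] * (CLIP_LEN - len(frames)))

    # Uniformly sample CLIP_LEN frames across the whole video (assumption).
    idx = np.linspace(0, len(frames) - 1, CLIP_LEN).astype(int)
    clip = np.stack([frames[i] for i in idx]).astype(np.float32) / 255.0
    return clip     # shape: (16, 32, 32, 3)
```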
Table 3 reports the model's performance on the UBI-Fights and UCF-101 datasets after 10, 30, and 50 training epochs.

Metric          UBI-Fights   UCF-101
Accuracy_10     89.7%        90.7%
AUC_10          82.6%        87%
Accuracy_30     93.1%        95.1%
AUC_30          89.8%        89.3%
Accuracy_50     97.1%        100%
AUC_50          93.3%        92.3%
Table 3. Comparison of our model's performance for the UBI-Fights and UCF-101 datasets

Figure 4. The model's training and validation accuracy for the UCF-Crime dataset over 10 epochs.
Figure 5. The model's training and validation accuracy for the UCF-Crime dataset over 30 epochs.
Figure 6. The model's training and validation accuracy for the UCF-101 dataset over 100 epochs.

To evaluate the model properly, Table 4 compares our results with those reported by other studies for the UCF-Crime dataset and shows that our model provides the best AUC, 92.2%, at 50 epochs; the model also achieves 87.7% AUC and 95.1% accuracy on the CCTV-Fights dataset.

REF.   AUC      Method                                                                              Year
[9]    86.98%   MGFN                                                                                2022
[8]    85.99%   S3R                                                                                 2022
[6]    85.38%   WSAL                                                                                2020
[15]   84.89%   Learning Causal Temporal Relation and Feature Discrimination for Anomaly Detection  2021
[16]   84.48%   Multi-stream Network with Late Fuzzy Fusion                                         2022
[17]   84.03%   RTFM                                                                                2021
[15]   82.67%   DAM                                                                                 2018
ours   92.2%    3DCNN+ConvLSTM                                                                      2023
Table 4. A comparison between the results of our model and other models for the UCF-Crime dataset

Table 5 compares the results reported by other studies for the XD-Violence dataset and shows that our model provides the best AUC, 87.7%, at 50 epochs.

REF.   AUC      Method            Year
[18]   83.54%   CMA_LA            2022
[19]   83.4%    MACIL_SD          2022
[8]    80.26%   S3R               2022
[9]    82.11%   MGFN              2022
[17]   77.81%   RTFM              2021
ours   87.7%    3DCNN+ConvLSTM    2023
Table 5. A comparison between the results of our model and other models for the XD-Violence dataset

Table 6 compares the results reported by other studies for the UBI-Fights dataset and shows that our model provides the best AUC, 93.3%, at 50 epochs.

REF.   AUC     Method            Year
[1]    90.6%   GMM               2020
[4]    89.2%   Sultani et al.    2018
[20]   61%     S2-VAE            2018
ours   93.3%   3DCNN+ConvLSTM    2023
Table 6. A comparison between the results of our model and other models for the UBI-Fights dataset

Table 7 compares the results reported by other studies for the UCF-101 dataset and shows that our model provides the best accuracy, 100%, at 50 epochs. Figures 7 and 8 show the feature characteristics over time for abuse and explosion videos, as examples.

REF.   Accuracy   Method               Year
[21]   98.64%     SMART                2020
[22]   98.6%      OmniSource           2020
[23]   98.2%      Text4Vis             2022
[24]   98.2%      LGD-3D Two-stream    2019
ours   100%       3DCNN+ConvLSTM       2023
Table 7. A comparison between the results of our model and other models for the UCF-101 dataset
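Because the comparison tables report both accuracy and AUC, the snippet below illustrates, under assumed variable names, how these two metrics could be computed for a trained clip classifier with scikit-learn; `model`, `x_test`, and `y_test` are placeholders, and the macro-averaged one-vs-rest AUC is one of several reasonable choices rather than the paper's stated protocol.

```python
# Hypothetical evaluation step: clip classification accuracy and ROC AUC.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(model, x_test, y_test):
    """x_test: (N, 16, 32, 32, 3) clips; y_test: (N,) integer class labels."""
    probs = model.predict(x_test)        # (N, num_classes) softmax scores
    y_pred = probs.argmax(axis=1)

    acc = accuracy_score(y_test, y_pred)
    # One-vs-rest, macro-averaged AUC over all classes (an assumed protocol).
    auc = roc_auc_score(y_test, probs, multi_class="ovr", average="macro")
    return acc, auc
```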
The model training accuracy was 100%. The reliability of the recognition was correspondingly 98.5%, 99.2%, and 94.5%. Compared with the 3DCNN alone, the 3DCNN+ConvLSTM produced a decent performance on the datasets. The results of our study show that the model is more accurate than the other competing models. As an extension of the current work, we intend to create a model for predicting anomalies from surveillance video.
References
[1] B. M. Degardin, "Weakly and Partially Supervised Learning Frameworks for Anomaly Detection," 2020.
[8] J.-C. Wu et al., "Self-supervised Sparse Representation for Video Anomaly Detection," pp. 729-745, 2022, doi: 10.1007/978-3-031-19778-9_42.
[9] Y. Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y.-C. Wu, "MGFN: Magnitude-Contrastive Glance-and-Focus Network for Weakly-Supervised Video Anomaly Detection," 2022.
[10] E. K. Elsayed and D. R. Fathy, "Semantic Deep Learning to Translate Dynamic Sign Language," Int. J. Intell. Eng. Syst., vol. 14, no. 1, pp. 316-325, Nov. 2020, doi: 10.22266/IJIES2021.0228.30.
[11] X. Shi, Z. Chen, H. Wang, D. Y. Yeung, W. K. Wong, and W. C. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," Adv. Neural Inf. Process. Syst., pp. 802-810, 2015.
[12] P. Wu et al., "Not only Look, But Also Listen: Learning Multimodal Violence Detection Under Weak Supervision," Lect. Notes Comput. Sci., vol. 12375 LNCS, pp. 322-339, 2020, doi: 10.1007/978-3-030-58577-8_20.
[13] B. Degardin and H. Proenca, "Human activity analysis: Iterative weak/self-supervised learning frameworks for detecting abnormal events," IJCB 2020 - IEEE/IAPR Int. Jt. Conf. Biometrics, 2020, doi: 10.1109/IJCB48548.2020.9304905.
[14] M. Perez, A. C. Kot, and A. Rocha, "Detection of Real-World Fights in Surveillance Videos (CCTV-Fights)," ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc., pp. 2662-2666, 2019.
[15] B. Arzani et al., "007: Democratically finding the cause of packet drops," Proc. 15th USENIX Symp. Networked Syst. Des. Implementation, NSDI 2018, pp. 419-435, 2018.
[16] K. V. Thakare, N. Sharma, D. P. Dogra, H. Choi, and I. J. Kim, "A multi-stream deep neural network with late fuzzy fusion for real-world anomaly detection," Expert Syst. Appl., vol. 201, 2022, doi: 10.1016/j.eswa.2022.117030.
[17] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, "Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning," Proc. IEEE Int. Conf. Comput. Vis., pp. 4955-4966, 2021, doi: 10.1109/ICCV48922.2021.00493.
[18] Y. Pu and X. Wu, "Audio-Guided Attention Network for Weakly Supervised Violence Detection," in 2022 2nd Int. Conf. Consumer Electronics and Computer Engineering (ICCECE), 2022.
[19] J. Yu, J. Liu, Y. Cheng, R. Feng, and Y. Zhang, "Modality-aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection," pp. 6278-6287, 2022, doi: 10.1145/3503161.3547868.
[20] T. Wang et al., "Generative Neural Networks for Anomaly Detection in Crowded Scenes," IEEE Trans. Inf. Forensics Secur., vol. 14, no. 5, pp. 1390-1399, 2019, doi: 10.1109/TIFS.2018.2878538.
[21] S. N. Gowda, M. Rohrbach, and L. Sevilla-Lara, "SMART Frame Selection for Action Recognition," 35th AAAI Conf. Artif. Intell. AAAI 2021, vol. 2B, pp. 1451-1459, 2021, doi: 10.1609/aaai.v35i2.16235.
[22] H. Duan, Y. Zhao, Y. Xiong, W. Liu, and D. Lin, "Omni-Sourced Webly-Supervised Learning for Video Recognition," Lect. Notes Comput. Sci., vol. 12360 LNCS, pp. 670-688, 2020, doi: 10.1007/978-3-030-58555-6_40.
[23] W. Wu, Z. Sun, and W. Ouyang, "Transferring Textual Knowledge for Visual Recognition," 2022. [Online]. Available: http://arxiv.org/abs/2207.01297
[24] Z. Qiu, T. Yao, C. W. Ngo, X. Tian, and T. Mei, "Learning spatio-temporal representation with local and global diffusion," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2019.