KianNet: A Violence Detection Model Using an Attention-Based CNN-LSTM Structure
ABSTRACT Violent behaviour is a persistent threat to any society. Many organizations therefore use surveillance cameras to monitor such events, preserve public safety, and mitigate potential harm. Because it is difficult for human operators to monitor the copious camera feeds manually, automated systems are employed to enhance the accuracy of violence detection and reduce errors. In this paper, we propose a novel model named KianNet that effectively detects violent incidents in recorded footage by combining ResNet50 and ConvLSTM architectures with a multi-head self-attention layer. ResNet50 enables robust feature extraction, while ConvLSTM exploits the temporal dependencies in the video sequences. Furthermore, the multi-head self-attention layer enhances the model's ability to focus on relevant spatiotemporal regions and improves its discriminative capacity. Empirical investigations on the large UCF-Crime and RWF datasets confirm that the proposed model outperforms its competitors.
INDEX TERMS Violence detection, Anomaly detection, Computer Vision, ResNet, ConvLSTM, Attention
mechanism, Multi-head Self-Attention, UCF-Crime, RWF, Vision Saccade.
A 3D CNN architecture typically consists of multiple convolutional and pooling layers that learn to extract spatiotemporal features from video sequences. The output of the convolutional layers is then passed through fully connected layers and activation functions to make the final prediction. 3D CNNs have been successfully applied in various video analysis tasks, including action recognition, gesture recognition, and video-based violence detection. By leveraging spatial and temporal information, 3D CNNs can achieve state-of-the-art performance on these tasks, particularly when dealing with complex and dynamic videos. In a recent study, Tran et al. [19] proposed a 3D CNN model that achieved state-of-the-art performance on the Sports-1M [20] dataset, which contains many violent and non-violent videos.

Also, Sultani et al. [9] introduced an approach based on Multiple Instance Learning (MIL) [21], using 3D convolutional [22] features from various video segments to train a fully-connected neural network using only video-level labels. A ranking loss was then utilized to compare the network's scores between the highest- and lowest-scored instances of each positive bag (containing abnormal videos) and negative bag (containing normal videos).

Recently, Magdy et al. [23] proposed the Violence 4D model for automatic VD from video datasets. Violence 4D is composed of three primary components (dense optical flow, ResNet50, and 4D residual blocks) that leverage the capabilities of four-dimensional convolutional neural networks (V4D CNN). Three other techniques [24], [25], [26] have also been introduced for the VD problem as the latest approaches so far, all of which rely on 3D CNNs for feature extraction.

Another use of 3D CNNs can be seen in the two-stream CNN, a deep learning architecture frequently used in VD tasks. This method became popular because of its ability to capture both spatial and temporal information. The approach processes video frames using two separate streams: a spatial stream that extracts static appearance information from the frames, and a temporal stream that captures the motion information. The spatial stream feeds each frame's raw RGB pixel values into a CNN to extract appearance features. The temporal stream, on the other hand, computes optical flow from the frames and feeds it into a separate CNN to extract motion features. Finally, the output features from both streams are merged to make a final prediction. In a recent study, Pratama et al. [27] proposed a two-stream 3D CNN model that uses RGB and optical flow images for VD.

B. CNN-RNN
Many researchers believe that extracting features with CNNs alone is not sufficient for video data. They maintain that RNNs need to be added to the model so that the extracted features are considered over a time interval. Therefore, they proposed CNN-RNN models for anomaly detection in video datasets. CNN-RNN models are a type of deep learning architecture used for video analysis tasks requiring spatial and temporal information. They are designed to combine the strengths of CNNs and RNNs to capture spatial and temporal features from video sequences. In hybrid CNN-RNN models, the CNN component extracts spatial features from individual frames, while the RNN component captures temporal dependencies between adjacent frames. The CNN component typically consists of several convolutional and pooling layers that learn to extract features from individual frames. The RNN component, on the other hand, takes the output of the CNN component and processes it through a series of recurrent layers that capture temporal dependencies between adjacent frames.

Vosta and Yow [8] proposed a hybrid CNN-RNN model that uses both CNNs and RNNs to extract spatial and temporal features from the video frames. Hybrid CNN-RNN models in video-based violence detection have improved performance compared to models that use only CNNs or RNNs. These models can effectively capture spatial and temporal features, leading to better detection of violent events in videos. Later, by replacing ConvLSTM with ConvGRU, another model called ConvGRU-CNN was introduced in [28] for VD. Another CNN-RNN model for VD was proposed in [29], where the authors added a bi-directional LSTM to a CNN feature extraction model for real-time anomaly detection in surveillance cameras.

C. ATTENTION-BASED
Attention-based models are deep learning architectures that selectively focus on certain parts of the input data while ignoring others [30]. They are designed to improve the performance of neural networks by allowing them to weigh the importance of different input features selectively. In traditional neural networks, all input features are given equal importance, regardless of their relevance to the task. Attention-based models, however, assign different weights to input features based on their importance. This allows the model to selectively attend to the most informative parts of the input while ignoring irrelevant information. In video-based violence detection, attention-based models can help the network focus on the most informative frames or regions within a frame, leading to better performance. For example, some approaches use spatial attention to focus on specific regions within a frame, while others use temporal attention to focus on specific frames within a video. Using attention mechanisms, video-based violence detection models can achieve higher accuracy while reducing the computational cost.

In recent years, several works have taken advantage of attention mechanisms for violence detection, mainly in two categories of attention-based techniques: self-attention ([31], [32]) and MHSA ([33], [13]).

III. MODEL ARCHITECTURE
A. OVERALL ARCHITECTURE
The KianNet architecture has several steps, including Data preprocessing, CNN-RNN, MHSA-ConvLSTM, and Classification.
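To make the overall data flow concrete, the following is a minimal Keras-style sketch of a pipeline in this spirit: per-frame ResNet50 features, a ConvLSTM layer, multi-head self-attention applied across the frame sequence, a second ConvLSTM layer, and a classification head. The frame count, input resolution, filter sizes, and the exact way attention is applied to the ConvLSTM feature maps are illustrative assumptions rather than the configuration reported here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_FRAMES, HEIGHT, WIDTH, N_CLASSES = 16, 224, 224, 2   # assumed shapes, not the paper's exact values

# Frame-wise spatial features from a pretrained ResNet50 backbone.
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(HEIGHT, WIDTH, 3))

inputs = layers.Input(shape=(N_FRAMES, HEIGHT, WIDTH, 3))
x = layers.TimeDistributed(backbone)(inputs)                              # (T, 7, 7, 2048)
x = layers.ConvLSTM2D(64, 3, padding="same", return_sequences=True)(x)    # first ConvLSTM

# Self-attention across the frame sequence: flatten each frame's feature map
# into one token, attend over frames with 8 heads, then restore the map shape.
t, h, w, c = x.shape[1], x.shape[2], x.shape[3], x.shape[4]
tokens = layers.Reshape((t, h * w * c))(x)
attended = layers.MultiHeadAttention(num_heads=8, key_dim=64)(tokens, tokens)
x = layers.Reshape((t, h, w, c))(attended)

x = layers.ConvLSTM2D(64, 3, padding="same")(x)                            # second ConvLSTM
x = layers.Flatten()(x)
outputs = layers.Dense(N_CLASSES, activation="softmax")(x)

kiannet_sketch = models.Model(inputs, outputs)
kiannet_sketch.compile(optimizer="adam",
                       loss="sparse_categorical_crossentropy",
                       metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
```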
TABLE 1. The input and output size of each step in the proposed ResNet50.
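Since the body of Table 1 is not reproduced in this version, a quick way to recover the kind of per-stage input and output sizes it summarizes is to instantiate a ResNet50 backbone and print each layer's output shape; the 224 x 224 RGB input resolution below is an assumption.

```python
import tensorflow as tf

# Assumed 224x224 RGB input; prints the output shape of every ResNet50 layer.
backbone = tf.keras.applications.ResNet50(include_top=False, input_shape=(224, 224, 3))
for layer in backbone.layers:
    print(f"{layer.name:30s} {layer.output.shape}")
```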
Q = XW^Q,  K = XW^K,  V = XW^V    (1)

While a single attention function operates on d_model-dimensional queries, keys, and values, for an MHSA layer d_model is h times the per-head dimensions d_k, d_k, and d_v of the queries, keys, and values. These parameters are packed together into the matrices Q, K, and V, respectively. The attention function is then calculated as shown in Equation 2:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V    (2)
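As a concrete illustration of Equation 1, Equation 2, and the multi-head combination described next in Equation 3, the following NumPy sketch projects an input sequence into per-head queries, keys, and values, applies scaled dot-product attention to each head, and concatenates the heads through an output projection. The sequence length, feature dimension, and random weights are arbitrary example values, not KianNet's learned parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, h=8, rng=np.random.default_rng(0)):
    """X: (seq_len, d_model) sequence of tokens; h: number of attention heads."""
    seq_len, d_model = X.shape
    d_k = d_v = d_model // h                        # d_model = h * d_k, as in the text
    heads = []
    for _ in range(h):
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_v)) / np.sqrt(d_model)
        Q, K, V = X @ W_q, X @ W_k, X @ W_v         # Equation (1), per head
        scores = softmax(Q @ K.T / np.sqrt(d_k))    # Equation (2)
        heads.append(scores @ V)
    W_o = rng.standard_normal((h * d_v, d_model)) / np.sqrt(h * d_v)
    return np.concatenate(heads, axis=-1) @ W_o     # Equation (3)

# Example: 16 frame tokens, each with a 64-dimensional feature vector.
out = multi_head_self_attention(np.random.rand(16, 64), h=8)
print(out.shape)   # (16, 64)
```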
Since MHSA is composed of several single self-attention modules, and each head represents one scaled dot-product attention layer, the MultiHead function concatenates the outputs head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), as Equation 3 illustrates:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (3)

where W_i^Q in R^(d_model x d_k), W_i^K in R^(d_model x d_k), W_i^V in R^(d_model x d_v), and W^O in R^(h*d_v x d_model) are the weight matrices of this process.

Fig. 9 shows the details of the MHSA-ConvLSTM module used in KianNet, where we utilize the output of the first ConvLSTM layer as the input to the MHSA layer. This approach enables the model to concentrate on several objects at once, up to the number of attention heads configured in the MHSA layer [46]. After these attention heads are applied, the feature map for each input frame proceeds through another ConvLSTM layer.

The primary purpose of this step is to revisit and further process the features prioritized by the previous attention layers. Specifically, the second ConvLSTM layer allows the model to consider these emphasized features again, this time over a temporal sequence of frames. The MHSA-ConvLSTM mechanism therefore identifies the most important features within each frame and then tracks and analyzes them across a series of frames. In designing KianNet, we deliberately placed the MHSA layer between two ConvLSTM layers. The primary rationale behind this decision was to cater to scenarios where multiple objects are simultaneously involved in various types of violent behaviour. This integration allows our model to prioritize the most significant objects and analyze them in the context of their previous and subsequent frames. This distinctive configuration gives KianNet an edge over other architectures, improving its precision in detecting violent events.

One of the decisive factors in this attention technique is the number of heads, i.e., the number of attention layers used. Each head computes its own attention scores, allowing the model to focus on different features in the input data. The number of attention heads can be adjusted for a specific task depending on the dataset and techniques.

FIGURE 10. Comparison of the AUC value in binary classification with different numbers of heads.

Fig. 10 displays the AUC over the number of heads (h) in our experiments with the proposed model, KianNet, on the UCF-Crime dataset in binary classification. The highest AUC value, 97.48%, is obtained when h = 8. Consequently, we decided to use eight heads for our further experiments.

We use a multi-head self-attention layer because it can focus on several objects, depending on its number of heads. Although other methods such as CBAM could be used in our model, the inner structure of our model, which contains two ConvLSTMs, provides convolutional layers combined with LSTMs, which work well on sequences of frames. This approach captures the spatial and temporal dynamics within the sequence, enhancing the model's overall understanding and interpretation of actions across time.

E. CLASSIFICATION
The data entering the final stage of the proposed model is a 4-dimensional tensor including n_frames, n_rows, and
TABLE 3. Details of the UCF-Crime dataset's variants: Binary, AllCat, 4MajCat, and NREF.

Binary (No. Videos): Abuse 50, Arrest 50, Arson 50, Assault 50, Burglary 100, Explosion 50, Fighting 50, RoadAccident 150, Robbery 150, Shooting 50, Shoplifting 50, Stealing 100, Vandalism 50, Normal 950.

AllCat (No. Videos): Abuse 50, Arrest 50, Arson 50, Assault 50, Burglary 50, Explosion 50, Fighting 50, RoadAccident 50, Robbery 50, Shooting 50, Shoplifting 50, Stealing 50, Vandalism 50, Normal 50.

4MajCat (No. Videos): Theft 150 (Burglary, Robbery, Shoplifting, Stealing), Vandalism 150 (Arson, Explosion, RoadAccident, Vandalism), Violence behaviours 150 (Abuse, Arrest, Assault, Fighting, Shooting), Normal 150.

NREF (No. Videos): RoadAccident 30, Explosion 50, Fighting 70, Normal 150.
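To make the 4MajCat grouping in Table 3 concrete, the sketch below shows one way to map the original UCF-Crime class names onto the three merged categories plus Normal; the function and variable names are only illustrative.

```python
# Hypothetical helper for relabelling UCF-Crime classes into the 4MajCat variant of Table 3.
FOUR_MAJ_CAT = {
    "Theft": ["Burglary", "Robbery", "Shoplifting", "Stealing"],
    "Vandalism": ["Arson", "Explosion", "RoadAccident", "Vandalism"],
    "Violence behaviours": ["Abuse", "Arrest", "Assault", "Fighting", "Shooting"],
    "Normal": ["Normal"],
}
LABEL_TO_MAJCAT = {orig: major for major, originals in FOUR_MAJ_CAT.items() for orig in originals}

def to_4majcat(original_label: str) -> str:
    """Map an original UCF-Crime class name to its 4MajCat group."""
    return LABEL_TO_MAJCAT[original_label]

print(to_4majcat("Shoplifting"))   # Theft
print(to_4majcat("Fighting"))      # Violence behaviours
```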
TABLE 4. Binary classification on the RWF dataset based on Accuracy.

Author(s)                Model                  Accuracy (%)
Sudhakaran et al. [40]   Convolutional LSTM     77
Tran et al. [22]         C3D                    82.75
Cheng et al. [16]        Flow Gated Net         87.25
Su et al. [50]           SPIL Convolution       89.3
Islam et al. [51]        SepConvLSTM-M          89.75
Pratama et al. [27]      Two-stream 3D CNN      90.50
Kang et al. [52]         2D CNNs + LSTM         92
Chelali et al. [53]      2D Spatio-Temporal     93.80
Magdy et al. [23]        Violence 4D            94.67
Proposed method          KianNet                96.21

2) Experiments on UCF-Crime
The UCF-Crime dataset includes 13 types of anomalies, while the rest are all normal scenes. Many researchers have evaluated their models using AUC in their experiments on the UCF-Crime dataset. This is because AUC is a suitable performance metric due to its threshold independence and its ability to handle imbalanced data, which plays an essential role in UCF-Crime multi-class classification tasks, where each category has a different number of samples.

In Table 5, several VD models are compared using AUC in binary classification on the UCF-Crime dataset. As we can see from Table 5, the MIL-C3D model proposed by Sultani et al. in their paper [9] gained 74% in AUC. Also, Zhong et al. in [17] presented TSN models based on RGB and optical flow, with AUC values of 82% and 78%, respectively. However, one of the best models for violence detection on UCF-Crime was proposed by Ullah et al. in [29], where they applied a multi-layer BD-LSTM technique to achieve 85% in AUC.

TABLE 5. Binary classification on the UCF-Crime dataset based on AUC.

Author(s)               Model                  AUC (%)
Sultani et al. [9]      SVM                    50
Tur et al. [24]         k-diffusion            65.22
Simonyan et al. [54]    VGG-16                 72.66
Liu et al. [25]         PFMF                   74
Biradar et al. [55]     DEARESt                76.66
Zhong et al. [56]       TSN-OpticalFlow        78.08
Zhong et al. [56]       C3D                    81.08
Vosta et al. [8]        ResNetConvLSTM         81.71
Qasim et al. [28]       ConvGRU-CNN            82.65
Tian et al. [57]        RTFM                   84.30
Ullah et al. [29]       Multi-layer BD-LSTM    85.53
Sun et al. [26]         LSTC                   85.88
Zhou et al. [33]        UR-DMU                 86.97
Joo et al. [31]         CLIP-TSA               87.58
Proposed method         KianNet                97.48
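Since AUC is the headline metric in Table 5, the snippet below shows how such a threshold-independent score can be computed from a detector's per-video violence probabilities using scikit-learn's roc_auc_score; the labels and scores are made up for illustration.

```python
from sklearn.metrics import roc_auc_score

# Made-up ground-truth labels (1 = violent/anomalous, 0 = normal) and
# predicted probabilities from a binary violence detector.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.92, 0.10, 0.65, 0.80, 0.30, 0.45, 0.55, 0.05]

print(f"AUC = {roc_auc_score(y_true, y_score):.4f}")
```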
E. ABLATION STUDY
For the ablation study, a pair of experiments was conducted to test how using a multi-head self-attention module followed by a ConvLSTM layer affected violence detection on the UCF-Crime dataset in terms of Accuracy and AUC. Although a more powerful backbone network was used than in previous work, we considered it interesting to check how much the performance improved by using the attention mechanism. When comparing our proposed model with the one without the MHSA-ConvLSTM module, we obtained better results in accuracy and AUC. Table 6 presents the results of the ablation study of the ResNet50ConvLSTM and KianNet architectures, where the MHSA-ConvLSTM module was added to the previous model. As can be seen in Table 6, both accuracy and AUC were consistently better when the attention module was used. The most significant improvement from using KianNet occurred when the model was applied to the binary classification dataset, where the AUC value rose from 81.71 to 97.48 percent. There are also improvements in violence detection performance on the other datasets, NREF, 4MajCat, and AllCat. Another challenge in UCF-Crime is classifying each video into its exact class in AllCat; the classifier should assign each input to one of the 13 crime types or normal. The situation worsens when some categories are too similar to distinguish, like shoplifting and stealing. Nevertheless, KianNet improved the accuracy for this classification marginally, from 22.72% to 23.88%.

TABLE 6. Ablation study of the architectures ResNetConvLSTM and KianNet (ResNet50-ConvLSTM-MHSA-ConvLSTM) on the original and modified UCF-Crime datasets, based on Accuracy and AUC.

Dataset    ResNetConvLSTM AUC (%)   ResNetConvLSTM Accuracy (%)   KianNet AUC (%)   KianNet Accuracy (%)
NREF       79.04                    65.38                         83.14             73.84
4MajCat    73.88                    62.22                         88.91             73.75
AllCat     53.88                    22.72                         63.71             23.88
Binary     81.71                    62.50                         97.48             92.98

V. CONCLUSION AND FUTURE WORK
This paper introduced KianNet, an approach for violence detection from surveillance camera footage. To deal with such video datasets, we used ResNet50 to extract features from each video frame and the ConvLSTM technique to consider the relationship between frame sequences. We also brought vision saccade into our model through MHSA to make the model more attentive, similar to how the human brain works. We conducted extensive experiments using the UCF-Crime dataset (original and modified versions) and the RWF dataset to test our proposed model, KianNet. The results demonstrated KianNet's superior performance over other violence detection techniques in binary classification. This further underlines the potential of our approach for practical implementations in violence detection and prevention.

Although we have proposed a powerful technique for detecting violence in this study, there are still several aspects that could be improved in the future to enhance the model:
• To better understand the actions happening in a video file, we can offer a technique to recognize the action after the feature extraction part, by using YOLOv3 to detect the relevant body parts and then building separate ConvLSTMs to learn the movement patterns of each body part.
• KianNet can also be applied to other areas to analyze and detect several kinds of events. With its unique learning structure and strong performance in detecting violent behaviour from video surveillance, it can be effectively employed in areas such as fall detection in homecare settings or hospitals.
• Another improvement we can make to our technique is using the original image alongside the moving parts obtained from the subtraction of frames to improve the feature extraction.
• Since we work on videos, which usually have sound, it would be much more helpful to use the audio as a separate input to the model to detect violent actions in videos more accurately.

REFERENCES
[1] Jinzhu Lu, Lijuan Tan, and Huanyu Jiang. Review on convolutional neural network (cnn) applied to plant leaf disease classification. Agriculture, 11(8):707, 2021.
[2] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[3] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
[5] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[7] Rahul Dey and Fathi M Salem. Gate-variants of gated recurrent unit (gru) neural networks. In 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS), pages 1597–1600. IEEE, 2017.
[8] Soheil Vosta and Kin-Choong Yow. A cnn-rnn combined structure for real-world violence detection in surveillance cameras. Applied Sciences, 12(3):1021, 2022.
[9] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.
[10] Meng-Hao Guo, Tian-Xing Xu, Jiang-Jiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang, Ralph R Martin, Ming-Ming Cheng, and Shi-Min Hu. Attention mechanisms in computer vision: A survey. Computational visual media, 8(3):331–368, 2022.
[11] Yaran Chen, Dongbin Zhao, Le Lv, and Chengdong Li. A visual attention based convolutional neural network for image classification. In 2016 12th World Congress on Intelligent Control and Automation (WCICA), pages 764–769. IEEE, 2016.
[12] Shuhan Chen, Xiuli Tan, Ben Wang, Huchuan Lu, Xuelong Hu, and Yun Fu. Reverse attention-based residual network for salient object detection. IEEE Transactions on Image Processing, 29:3763–3776, 2020.
[13] Fernando J Rendón-Segador, Juan A Álvarez-García, Fernando Enríquez, and Oscar Deniz. Violencenet: Dense multi-head self-attention with bidirectional convolutional lstm for detecting violence. Electronics, 10(13):1601, 2021.
[14] Weijiang Li, Fang Qi, Ming Tang, and Zhengtao Yu. Bidirectional lstm with self-attention mechanism and multi-channel features for sentiment classification. Neurocomputing, 387:63–77, 2020.
[15] Ting Wu, Junjie Peng, Wenqiang Zhang, Huiran Zhang, Shuhua Tan, Fen Yi, Chuanshuai Ma, and Yansong Huang. Video sentiment analysis with bimodal information-augmented multi-head attention. Knowledge-Based Systems, 235:107676, 2022.
[16] Boyu Chen, Zhihao Zhang, Nian Liu, Yang Tan, Xinyu Liu, and Tong Chen. Spatiotemporal convolutional neural network with convolutional block attention module for micro-expression recognition. Information, 11(8):380, 2020.
[17] Ming Cheng, Kunjing Cai, and Ming Li. Rwf-2000: an open large scale video database for violence detection. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 4183–4190. IEEE, 2021.
[18] Greg Moreau. Police-reported crime statistics in canada, 2021, 2022.
[19] Jun Zhang and Zhijing Liu. Detecting abnormal motion of pedestrian in video. In 2008 International Conference on Information and Automation, pages 81–85. IEEE, 2008.
[20] Jun Zhang and Zhi Jing Liu. Abnormal behavior of pedestrian detection based on fuzzy theory. 2023.
[21] Guoqing Liu, Jianxin Wu, and Zhi-Hua Zhou. Key instance detection in multi-instance learning. In Asian conference on machine learning, pages 253–268. PMLR, 2012.
[22] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
[23] Mai Magdy, Mohamed Waleed Fakhr, and Fahima A Maghraby. Violence 4d: Violence detection in surveillance using 4d convolutional neural networks. IET Computer Vision, 2023.
[24] Anil Osman Tur, Nicola Dall'Asen, Cigdem Beyan, and Elisa Ricci. Exploring diffusion models for unsupervised video anomaly detection. arXiv preprint arXiv:2304.05841, 2023.
[25] Zuhao Liu, Xiao-Ming Wu, Dian Zheng, Kun-Yu Lin, and Wei-Shi Zheng. Generating anomalies for video anomaly detection with prompt-based feature mapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24500–24510, 2023.
[26] Shengyang Sun and Xiaojin Gong. Long-short temporal co-teaching for weakly supervised video anomaly detection. arXiv preprint arXiv:2303.18044, 2023.
[27] Raka Aditya Pratama, Novanto Yudistira, and Fitra Abdurrachman Bachtiar. Violence recognition on videos using two-stream 3d cnn with custom spatiotemporal crop. Multimedia Tools and Applications, pages 1–23, 2023.
[28] Maryam Qasim Gandapur and Elena Verdú. Convgru-cnn: Spatiotemporal deep learning for real-world anomaly detection in video surveillance system. 2023.
[29] Waseem Ullah, Amin Ullah, Ijaz Ul Haq, Khan Muhammad, Muhammad Sajjad, and Sung Wook Baik. Cnn features with bi-directional lstm for real-time anomaly detection in surveillance networks. Multimedia tools and applications, 80:16979–16995, 2021.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[31] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. arXiv preprint arXiv:2212.05136, 2022.
[32] Weichao Zhang, Guanjun Wang, Mengxing Huang, Hongyu Wang, and Shaoping Wen. Generative adversarial networks for abnormal event detection in videos based on self-attention mechanism. IEEE Access, 9:124847–124860, 2021.
[33] Hang Zhou, Junqing Yu, and Wei Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. arXiv preprint arXiv:2302.05160, 2023.
[34] Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. Going deeper in spiking neural networks: Vgg and residual architectures. Frontiers in neuroscience, 13:95, 2019.
[35] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 2010. PMLR.
[36] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017.
[37] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[38] Yang Cong, Junsong Yuan, and Ji Liu. Abnormal event detection in crowded scenes using sparse representation. Pattern Recognition, 46(7):1851–1864, 2013.
[39] Gang Zhou and Youfu Wu. Anomalous event detection based on self-organizing map for supermarket monitoring. In 2009 International Conference on Information Engineering and Computer Science, pages 1–4. IEEE, 2009.
[40] Swathikiran Sudhakaran and Oswald Lanz. Learning to detect violent videos using convolutional long short-term memory. In 2017 14th IEEE international conference on advanced video and signal based surveillance (AVSS), pages 1–6. IEEE, 2017.
[41] Yi-ping Tang, Xiao-jun Wang, and Hai-feng Lu. Intelligent video analysis technology for elevator cage abnormality detection in computer vision. In 2009 Fourth International Conference on Computer Sciences and Convergence Information Technology, pages 1252–1258. IEEE, 2009.
[42] Jie Feng, Chao Zhang, and Pengwei Hao. Online learning with self-organizing maps for anomaly detection in crowd scenes. In 2010 20th International Conference on Pattern Recognition, pages 3599–3602. IEEE, 2010.
[43] Md Haidar Sharif, Sahin Uyaver, and Chabane Djeraba. Crowd behavior surveillance using bhattacharyya distance metric. In International Symposium Computational Modeling of Objects Represented in Images, pages 311–323. Springer, 2010.
[44] Oleg Gorokhov, Mikhail Petrovskiy, and Igor Mashechkin. Convolutional neural networks for unsupervised anomaly detection in text data. In International Conference on Intelligent Data Engineering and Automated Learning, pages 500–507. Springer, 2017.
[45] Bharathkumar Ramachandra and Michael Jones. Street scene: A new dataset and evaluation protocol for video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2569–2578, 2020.
[46] Xianyun Wen and Weibang Li. Time series prediction based on lstm-attention-lstm model. IEEE Access, 2023.
[47] Enrique Bermejo Nievas, Oscar Deniz Suarez, Gloria Bueno García, and Rahul Sukthankar. Violence detection in video using computer vision techniques. In Computer Analysis of Images and Patterns: 14th International Conference, CAIP 2011, Seville, Spain, August 29-31, 2011, Proceedings, Part II 14, pages 332–339. Springer, 2011.
[48] Tal Hassner, Yossi Itcher, and Orit Kliper-Gross. Violent flows: Real-time detection of violent crowd behavior. In 2012 IEEE computer society conference on computer vision and pattern recognition workshops, pages 1–6. IEEE, 2012.
[49] Brayan S Zapata-Impata, Pablo Gil, and Fernando Torres. Learning spatio temporal tactile features with a convlstm for the direction of slip detection. Sensors, 19(3):523, 2019.
[50] Yukun Su, Guosheng Lin, Jinhui Zhu, and Qingyao Wu. Human interaction learning on 3d skeleton point clouds for video violence recognition. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 74–90. Springer, 2020.
[51] Zahidul Islam, Mohammad Rukonuzzaman, Raiyan Ahmed, Md Hasanul Kabir, and Moshiur Farazi. Efficient two-stream network for violence detection using separable convolutional lstm. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.
[52] Min-Seok Kang, Rae-Hong Park, and Hyung-Min Park. Efficient spatio-temporal modeling methods for real-time violence recognition. IEEE Access, 9:76270–76285, 2021.
[53] Mohamed Chelali, Camille Kurtz, and Nicole Vincent. Violence detection from video under 2d spatio-temporal representations. In 2021 IEEE International Conference on Image Processing (ICIP), pages 2593–2597. IEEE, 2021.
[54] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[55] Kuldeep Biradar, Sachin Dube, and Santosh Kumar Vipparthi. Dearest: deep convolutional aberrant behavior detection in real-world scenarios. In 2018 IEEE 13th international conference on industrial and information systems (ICIIS), pages 163–167. IEEE, 2018.
[56] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1237–1246, 2019.
[57] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4975–4986, 2021.

SOHEIL VOSTA received the B.Sc. degree in Computer Science from the University of Isfahan, Iran, in 2015. Two years later, he received the M.Sc. degree in Computer Science-Computational Theory from Tarbiat Modares University, Iran. He is currently pursuing the Ph.D. degree in Software Systems Engineering at the University of Regina, Canada. He has been an active Graduate Student Member for three years and is an ExCom member of the Region 7 South Saskatchewan Section. His research interest started in dimension reduction methods for image processing models and has continued with deep learning and artificial intelligence techniques for video analysis.

KIN-CHOONG YOW received the B.Eng. (Elect.) degree (Hons.) from the National University of Singapore, in 1993, and the Ph.D. degree from the University of Cambridge, U.K., in 1998. He joined the University of Regina in September 2018, where he is currently a Professor with the Faculty of Engineering and Applied Science. Prior to joining the University of Regina, he was an Associate Professor with the Gwangju Institute of Science and Technology (GIST), Republic of Korea, from 2013 to 2018; a Professor with the Shenzhen Institutes of Advanced Technology (SIAT), China, from 2012 to 2013; and an Associate Professor with Nanyang Technological University (NTU), Singapore, from 1998 to 2013, where he served as the Sub-Dean of Computer Engineering from 1999 to 2005 and as the Associate Dean of Admissions from 2006 to 2008. He has published over 100 top-quality international journal articles and conference papers. His research interests include artificial general intelligence and smart environments. He is a member of APEGS and ACM. He has served as a reviewer for a number of premier journals and conferences, including IEEE Wireless Communications and the IEEE Transactions on Education. He has been invited to give presentations at various scientific meetings and workshops, such as ACIRS (2018 to 2019), ICSPIC (2018), and ICATME (2021). He is also the Editor-in-Chief of the Journal of Advances in Information Technology (JAIT).