
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3339379

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2023.0322000

KianNet: A violence detection model using an attention-based CNN-LSTM structure

SOHEIL VOSTA 1 (Student Member, IEEE), KIN-CHOONG YOW 2 (Senior Member, IEEE)
1 Department of Engineering and Applied Science, University of Regina, Regina, SK, S4S 0A2 Canada (e-mail: [email protected])
2 Department of Engineering and Applied Science, University of Regina, Regina, SK, S4S 0A2 Canada (e-mail: [email protected])
Corresponding author: Kin-Choong Yow (e-mail: [email protected]).
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), funding reference number
DDG-2020-00034. Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG),
numéro de référence DDG-2020-00034.

ABSTRACT Violent behaviour is always an important issue that threatens any society. Therefore, many
organizations have used surveillance cameras to monitor such events to preserve public safety and mitigate
potential harm. It is difficult for human operators to monitor the copious camera feed manually, however,
automated systems are employed to enhance the accuracy of violence detection and reduce errors. In
this paper, we propose a novel model named KianNet that effectively detects violent incidents inside
recorded events by combining ResNet50 and ConvLSTM architectures with a multi-head self-attention layer.
The utilization of ResNet50 enables robust feature extraction, while ConvLSTM makes it easier to take
advantage of the temporal dependencies in the video sequences. Furthermore, the multi-head self-attention
layer enhances the model’s ability to focus on relevant spatiotemporal regions and their discriminatory
capacity. Empirical investigations on large datasets UCF-Crime and RWF confirm that the proposed model
outperforms its competitors.

INDEX TERMS Violence detection, Anomaly detection, Computer Vision, ResNet, ConvLSTM, Attention
mechanism, Multi-head Self-Attention, UCF-Crime, RWF, Vision Saccade.

I. INTRODUCTION

With the growing challenges in public safety and security, the demand for comprehensive public safety monitoring via video surveillance cameras has significantly increased.
However, the abundance of video data generated by these surveillance cameras, combined with the limited availability and diversity of anomalous events such as violence, theft, or other types of crime, presents a notable challenge to detecting abnormal behaviours. Manual monitoring of this expansive data is impractical and labour-intensive and tends to cause errors due to human visual fatigue. This highlights the urgent requirement for effective and automated systems for detecting violence.
Like any technological advance, the applications of these systems are manifold. One of the most significant societal implications of automated violence detection systems is improving public safety and proactivity: surveillance systems that can automatically detect signs of violence or aggressive behaviour have the potential to save lives. For instance, if a system can detect a potential act of violence in a public space, rapid response units can be deployed before situations escalate, thereby preventing harm. In contrast to traditional surveillance methods that rely on human monitoring, automated systems can continuously monitor numerous feeds simultaneously, leading to proactive interventions rather than waiting for something severe to happen and then reacting.
One of the principal methodologies utilized in video classification is Supervised Learning, widely used in violence detection (VD) models for distinguishing violent behaviours from normal ones. This method can efficiently use labelled data and learn unique characteristics for each category. However, when it comes to detecting abnormalities in videos, the spatiotemporal nature of the video data makes the task more challenging, because it requires processing a sequence of frames in a time-series format. Therefore, to overcome this challenge, it is crucial to extract significant features from every frame and consider their relationship with adjacent frames over time.
Convolutional Neural Networks (CNNs) have gained popularity in Deep Learning for extracting features from image data because they can learn hierarchical representations of image features. They extract in-depth features from high-dimensional datasets using complex structures and classification techniques, making them ideal for various applications [1].


Although CNNs are widely used in various deep learning tasks like text classification and Natural Language Processing (NLP) [2], they are mainly used in computer vision, for example in Face Recognition [3], Image Classification [4], and Object Detection [5].
On the other hand, Recurrent Neural Networks (RNNs) are known for their ability to model temporal dependencies in time-series data thanks to their ability to process information in both forward and backward directions. This allows the network to recall information from the past and use it to make informed decisions at the current time step. However, as information passes through multiple time steps, the data from the initial sequence may become diluted. To overcome this problem, advanced versions of RNNs, like long short-term memory (LSTM) [6] networks and gated recurrent units (GRUs) [7], have been developed, which can better retain information over more extended periods.

FIGURE 1. Crime severity index over the years [18]

Recent studies have suggested using deep learning architectures that combine CNNs and RNNs to enhance the performance of supervised models for violence detection in video data [8], [9]. This method efficiently extracts spatiotemporal characteristics by utilizing a CNN model to collect critical features from each video frame and then feeding them to an RNN model to analyze their temporal relationships and forecast whether any violent events happened in a video.
Aside from the mentioned methods, AI research has also concentrated on reducing the gap between human and machine behaviour in detecting violence through attention mechanisms. In computer vision, attention mechanisms were introduced to imitate the human visual system and its natural ability to find salient areas in complex scenes. Primarily, they can dynamically adjust the weights of input image features [10]. Attention mechanisms have demonstrated their effectiveness in many visual tasks such as image classification [11], object detection [12], and video understanding [13]. Different types of attention mechanisms have been proposed and utilized in VD, including Self-Attention [14], Multi-Head Self-Attention (MHSA) [15], and the Convolutional Block Attention Module (CBAM) [16].
Over recent years, the safety of people has been a concern in various areas of the world. Due to the global economic crisis and current socio-economic differences, the number of violent crimes has increased. Fig. 1 presents the police-reported crime statistics for Canada between 2013 and 2021, clearly illustrating the upward trend in violent incidents during this period [18]. This underscores the critical need for systems that detect these violent crimes, thus contributing to a more secure society. The complexity of violence detection, which requires identifying anomalous events over time, is a primary focus of this study. The motivation for this paper is to develop a model that employs a novel methodology for detecting abnormal events, focusing on those anomalies that represent violent behaviour, given their significant implications for public safety and security. This manuscript proposes a combination of a CNN (ResNet50) and an RNN (ConvLSTM) that utilizes MHSA modules [13] for VD from surveillance cameras. We explore multiple models using publicly available datasets, such as UCF-Crime [9] and RWF [17]. Subsequently, we propose strategies to enhance the accuracy and robustness of these models in real-world scenarios. The main contributions of this paper are as follows:
• We have designed a structure that uses an MHSA layer followed by a ConvLSTM cell, which brings the information of each attention layer to the next one.
• We have developed a unique VD model, KianNet, that merges MHSA-ConvLSTM with the ResNet50-ConvLSTM architecture for violence detection. This approach captures complex spatiotemporal features in videos, improving violence identification.
• We have comprehensively evaluated our model and showed that it outperforms other state-of-the-art algorithms.
The subsequent sections of this paper entail a comprehensive analysis of relevant works that employ distinct models, along with their respective sub-models, for detecting violence in surveillance cameras (Section 2). After that, we present our proposed model (Section 3) and evaluate its performance through several experiments (Section 4). Finally, we conclude with a discussion of future research ideas in Section 5.

II. RELATED WORKS
The evolution of VD models has gained significant attention in recent years due to the increasing need for automated solutions to address violence in various settings. Researchers have proposed numerous approaches for VD, leveraging visual features extracted from video frames. This section provides an overview of different approaches in VD.

A. 3D-CNN
Three-dimensional CNNs (3D-CNNs) are a type of deep learning architecture used for video analysis tasks requiring spatial and temporal information. A 3D-CNN is an extension of a 2D-CNN that can handle video sequences as inputs.

The 3D-CNN architecture typically consists of multiple convolutional layers and pooling layers that learn to extract spatiotemporal features from video sequences. The output of the convolutional layers is then passed through fully connected layers and activation functions to make the final prediction. 3D CNNs have been successfully applied in various video analysis tasks, including action recognition, gesture recognition, and video-based violence detection. By leveraging spatial and temporal information, 3D CNNs can achieve state-of-the-art performance on these tasks, mainly when dealing with complex and dynamic videos. In a recent study, Tran et al. [19] proposed a 3D CNN model that achieved state-of-the-art performance on the Sports-1M [20] dataset, which contains many violent and non-violent videos.
Also, Sultani et al. [9] introduced an approach based on Multiple Instance Learning (MIL) [21], using 3D convolutional [22] features from various video segments to train a fully-connected neural network using only video-level labels. A ranking loss was then used to compare the network's scores between the highest- and lowest-scored instances of each positive (abnormal) and negative (normal) bag.
Recently, Magdy et al. [23] proposed the Violence 4D model for automatic VD from video datasets. Violence 4D is composed of three primary components (dense optical flow, ResNet50, and 4D residual blocks) that leverage the capabilities of four-dimensional convolutional neural networks (V4D CNN). Three other recent techniques [24], [25], [26] have also been introduced for the VD problem, all of which rely on 3D-CNNs for feature extraction.
Another use of 3D-CNNs can be seen in two-stream CNNs, a deep learning architecture frequently used in VD tasks. This method became popular because of its ability to capture spatial and temporal information. It processes video frames using two separate streams: a spatial stream that extracts static appearance information from the frames and a temporal stream that captures the motion information. The spatial stream feeds each frame's raw RGB pixel values into a CNN architecture to extract appearance features. The temporal stream, on the other hand, computes optical flow from the frames and feeds it into a separate CNN to extract motion features. Finally, the output features from both streams are merged to make a final prediction. In a recent study, Pratama et al. [27] proposed a two-stream 3D CNN model that uses RGB and optical flow images for VD.
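To make the generic 3D-CNN structure described above concrete, the following minimal sketch (Python/Keras) stacks spatiotemporal convolutions, pooling, and a fully connected head for binary violence classification. It is not any of the cited models; the layer widths, kernel sizes, and the 20-frame 112x112 input are assumptions chosen only for illustration.

# Illustrative 3D-CNN for binary violence classification (not any specific cited model).
# All layer sizes and the input resolution are assumptions for illustration.
from tensorflow.keras import layers, models

def build_3d_cnn(n_frames=20, height=112, width=112, channels=3):
    model = models.Sequential([
        layers.Input(shape=(n_frames, height, width, channels)),
        # Spatiotemporal convolutions learn appearance and motion jointly
        layers.Conv3D(32, kernel_size=(3, 3, 3), padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=(1, 2, 2)),
        layers.Conv3D(64, kernel_size=(3, 3, 3), padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=(2, 2, 2)),
        # Fully connected head produces the violent / non-violent prediction
        layers.GlobalAveragePooling3D(),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    return model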
B. CNN-RNN
Many researchers argue that extracting features with CNNs alone is not enough for video data, and that RNNs must be added to the model to consider the extracted features over a time interval. Therefore, they proposed CNN-RNN models for anomaly detection in video datasets. CNN-RNN models are a type of deep learning architecture used for video analysis tasks requiring spatial and temporal information. They are designed to combine the strengths of CNNs and RNNs to capture spatial and temporal features from video sequences. In hybrid CNN-RNN models, the CNN component extracts spatial features from individual frames, while the RNN component captures temporal dependencies between adjacent frames. The CNN component typically consists of several convolutional and pooling layers that learn to extract features from individual frames. The RNN component, on the other hand, takes the output of the CNN component and processes it through a series of recurrent layers that capture temporal dependencies between adjacent frames.
Vosta and Yow [8] proposed a hybrid CNN-RNN model that uses both CNNs and RNNs to extract spatial and temporal features from the video frames. Hybrid CNN-RNN models in video-based violence detection have improved performance compared to models that use only CNNs or RNNs. These models can effectively capture spatial and temporal features, leading to better detection of violent events in videos. Later, by replacing ConvLSTM with ConvGRU, another model called ConvGRU-CNN was introduced in [28] for VD. Another CNN-RNN model in VD was proposed in [29], where the authors added a Bi-Directional LSTM to a CNN feature extraction model for real-time anomaly detection in surveillance cameras.

C. ATTENTION-BASED
Attention-based models are deep learning architectures that selectively focus on certain parts of the input data while ignoring others [30]. They are designed to improve the performance of neural networks by allowing them to weigh the importance of different input features selectively. In traditional neural networks, all input features are given equal importance, regardless of their relevance to the task. Attention-based models, however, assign different weights to input features based on their importance. This allows the model to selectively attend to the most informative parts of the input while ignoring irrelevant information. In video-based violence detection, attention-based models can help the network selectively focus on the most informative frames or regions within a frame, leading to better performance. For example, some approaches use spatial attention to focus on specific regions within a frame, while others use temporal attention to focus on specific frames within a video. Using attention mechanisms, video-based violence detection models can achieve higher accuracy while reducing the computational cost.
In recent years, several works have taken advantage of attention mechanisms for violence detection, mainly in two categories of attention-based techniques: Self-Attention [31], [32] and MHSA [33], [13].

III. MODEL ARCHITECTURE
A. OVERALL ARCHITECTURE
The KianNet architecture has several steps, including Data preprocessing, CNN-RNN, MHSA-ConvLSTM, and Classification.

FIGURE 2. KianNet - The architecture.

Fig. 2 offers a general picture of the KianNet structure, showing how everything fits together.
• Data preprocessing: Each video file is divided into its frames in a desired format, and the difference between frame n and n + 1 is calculated.
• CNN-RNN: The frame differences obtained in the previous step become the input to our ResNet50-ConvLSTM structure, which extracts their features as a time-series sequence.
• MHSA-ConvLSTM: The output of each ConvLSTM is fed to an MHSA layer to find the most important objects the model needs for detecting violence. This attention module is followed by another ConvLSTM cell to consider the recently extracted features in a sequence of frames.
• Classification: After passing through the final ConvLSTM layer, the output undergoes multiple max pooling and fully connected layers to determine whether the input video is normal in binary classification. In multi-class classification, the model instead aims to identify the specific type of violent event for each input video.
Fig. 3 shows the whole structure of KianNet in detail for each primary step: Data Preprocessing, CNN-RNN, MHSA-ConvLSTM, and Classification. Each of these main stages is discussed in the following sections.

B. DATA PREPROCESSING
Video cameras capture videos at their supported frame rate, expressed in FPS (Frames Per Second), which shows how many frames are captured in a second. For instance, a video recorded at 30 FPS for 10 seconds comprises 300 frames (30 frames per second × 10 seconds = 300 frames). Fig. 4 depicts a selected set of frames from a video file on which the model will be trained. In order to obtain a certain number of frames for the model, some frames must be skipped, with the number determined by skipped_frames. For instance, if we require 20 frames as input and the video file contains 300 frames, skipped_frames would equal 15; thus, we choose every 15th frame to create our input sequence. Additionally, it is important to consider that each video in the dataset has a different size. Therefore, every frame needs to be resized to a single common dimension to enhance compatibility with the model that processes the frames. For this research, we have resized each frame to a resolution of 224 × 224 pixels, which allows it to work seamlessly with the ResNet50 model.
Given that this work aims to detect violence in videos, we are not just looking at each frame as a standalone input. Instead, we use the difference between two consecutive frames to highlight the action. Therefore, we subtract each frame from the next one to capture the action between time t and t + 1 and ignore the parts that did not move during that time. The image produced by subtracting frames then becomes the input for the feature extraction model. Instead of analyzing each video frame, we need to find the differences between frames over a period to show the movements. Figure 5 shows the difference between two consecutive frames. Since each frame has a uniform size of (224 × 224), and this size remains constant after subtraction, the input format for ResNet50 will be (n_frames, n_row, n_column, n_channels), which in our case is (20, 224, 224, 3).
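A minimal preprocessing sketch (Python, assuming OpenCV for decoding) of the steps just described is given below: uniform frame sampling, resizing to 224 × 224, and consecutive-frame subtraction. Sampling n_frames + 1 raw frames so that exactly n_frames differences remain is our assumption about how the frame counts line up with the reported (20, 224, 224, 3) input.

# Sketch of the preprocessing stage described above; OpenCV is assumed for decoding.
import cv2
import numpy as np

def preprocess_video(path, n_frames=20, size=(224, 224)):
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Choose evenly spaced frame indices (the "skipped_frames" idea)
    indices = np.linspace(0, total - 1, n_frames + 1).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size).astype(np.float32))
    cap.release()
    frames = np.stack(frames)
    # Frame differencing: highlight what moved between time t and t + 1
    diffs = frames[1:] - frames[:-1]
    return diffs  # shape (n_frames, 224, 224, 3)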
C. CNN-RNN
1) CNN: ResNet50
Using CNN models in deep learning has become increasingly popular for extracting features from image data. These models are built with multiple layers, including convolutional and pooling layers, that can identify the most crucial features of input images. Several CNN architectures are developed each year to handle different subjects and datasets. Some of these models have become more widely used due to their exceptional performance and efficiency. Examples of such models include VGG, Inception, and ResNet, which have various versions available [34].
While a CNN structure with multiple layers can assist the model in identifying intricate features, the network's depth can cause vanishing gradient problems that result in slow convergence or even halt the learning process [35]. One technique to mitigate the vanishing gradient problem and improve the training of CNNs is the use of skip connections and residual blocks, which allow the gradients to flow more directly through the network, bypassing some layers and reducing the impact of the vanishing gradient problem [36].
In Fig. 6, we depict our experiments with different CNN structures in our model to find the one that provides higher accuracy with fewer parameters. The diagram shows that the number of parameters varies between models based on their network depth and the convolutional layers they incorporate. However, accuracy does not always follow this trend. Although ResNet152 has more trainable parameters than ResNet50, it does not achieve higher accuracy on the UCF-Crime dataset. Despite having more parameters, this discrepancy can be attributed to factors like the potential for overfitting, the specifics of the UCF-Crime dataset, and optimization challenges arising from issues such as vanishing or exploding gradients. As a result, we have chosen ResNet50 as the CNN feature extractor for KianNet because of its higher accuracy with fewer parameters. ResNet50 has enough layers to extract the features needed to detect the activities in a video and uses residual blocks that protect the model from the vanishing gradient problem [8].

FIGURE 3. The detailed architecture of KianNet.

FIGURE 4. Divide a video file into its frames.

FIGURE 5. The subtraction of each neighbouring frame.

Besides, ResNet50 has been pre-trained on large image datasets like ImageNet [37], which provides the model with a strong foundation for learning relevant features from images, including those related to violence. This pre-training enables the model to yield higher accuracy in violence detection tasks, especially when trained on limited datasets [38]. Fig. 7 also depicts the ResNet50 structure we used in our proposed model. This structure comprises five stages, each with a varying number of residual blocks, and each block consists of multiple convolutional layers. In addition, to better understand the shape of each layer's input and output, Table 1 provides the details of each step of ResNet50.

FIGURE 6. Comparison of using several CNN models in KianNet.

FIGURE 7. ResNet50 inner structure in our proposed model.

TABLE 1. The input and output size of each step in the proposed ResNet50.

Layer Type                      Input Shape       Output Shape
Input Layer                     (n,224,224,3)     (n,224,224,3)
Conv1 (7x7, stride 2)           (n,224,224,3)     (n,112,112,64)
MaxPooling (3x3, stride 2)      (n,112,112,64)    (n,56,56,64)
Conv Block 1 (3 layers)         (n,56,56,64)      (n,56,56,256)
Identity Block 1, 2             (n,56,56,256)     (n,56,56,256)
Conv Block 2 (3 layers)         (n,56,56,256)     (n,28,28,512)
Identity Block 3, 4, 5          (n,28,28,512)     (n,28,28,512)
Conv Block 3 (3 layers)         (n,28,28,512)     (n,14,14,1024)
Identity Block 6, 7, 8, 9, 10   (n,14,14,1024)    (n,14,14,1024)
Conv Block 4 (3 layers)         (n,14,14,1024)    (n,7,7,2048)
Identity Block 11, 12, 13       (n,7,7,2048)      (n,7,7,2048)
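A sketch of the per-frame feature extraction is shown below (Python/Keras): the ImageNet-pretrained ResNet50 backbone, with its classification head removed, is mapped over the 20 frame differences to produce the (n_frames, 7, 7, 2048) feature tensor reported in Table 1. Wrapping the backbone in a TimeDistributed layer is our implementation assumption, not necessarily the authors' exact code.

# Per-frame ResNet50 feature extraction, a sketch assuming Keras ImageNet weights
# and a TimeDistributed wrapper to map the same backbone over all frames.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_frame_feature_extractor(n_frames=20):
    backbone = ResNet50(include_top=False, weights="imagenet",
                        input_shape=(224, 224, 3))
    inputs = layers.Input(shape=(n_frames, 224, 224, 3))
    # Apply the same ResNet50 to every frame difference independently
    features = layers.TimeDistributed(backbone)(inputs)   # (n_frames, 7, 7, 2048)
    return models.Model(inputs, features)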

2) RNN: ConvLSTM

FIGURE 8. ConvLSTM operations in details.

Given that we work with video datasets composed of sequences of frames, we require a framework that can effectively handle time-series data. RNNs are renowned for managing time-series data in domains such as Natural Language Processing (NLP), speech recognition, and video analysis [39]. However, standard RNNs suffer from gradient vanishing problems just like CNNs. To address this, units such as Long Short-Term Memory (LSTM) were invented to take advantage of a "hidden state" in the network that can store information about previous inputs, making them suitable for tasks requiring context or memory.
LSTM has become a valuable tool for handling time-series data in neural network models. In this paper, we used the ConvLSTM model, a modified type of LSTM designed for dealing with images that works better than the standard vanilla LSTM. ConvLSTM is a neural network architecture that utilizes a convolutional layer at the input gate to extract spatial features from each frame of the sequence while capturing temporal dependencies between the frames using LSTM layers. This combination allows ConvLSTM to efficiently model the spatial-temporal structure in data, resulting in fewer parameters required for training [40]. Among Bi-LSTM, ConvLSTM, and ConvGRU, we chose ConvLSTM for our model. We preferred ConvLSTM over the other two because Bi-LSTM lacks a convolutional layer to handle spatio-temporal information, and ConvGRU does not have an explicit memory cell, which is crucial for capturing long-term information.
In KianNet's structure, the output of the ResNet50 from the previous stage has a size of (n_frames, 7, 7, 2048), which goes to the X(t−1) input of the ConvLSTM layer. We used a convolutional operation with 256 filters with a filter size of 3×3 and a stride of 1 in all the gates (input, forget, output, and the gate controlling the cell state). As a result, the hidden state of the ConvLSTM consists of 256 feature maps. This means that each of the gate mechanisms in the ConvLSTM operates in a convolutional manner, making this model particularly suited for tasks involving spatial data like images or video. The output (hidden state) is a three-dimensional tensor (for each step), maintaining the spatial structure of the input data while encoding temporal dependencies. Therefore, the output shape of our ConvLSTM will be (n_frames, 7, 7, 256).
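The ConvLSTM stage described above can be expressed as a single Keras layer; the sketch below uses the reported configuration (256 filters, 3×3 kernels, stride 1, hidden state returned at every time step), while the "same" padding needed to keep the 7×7 spatial size is our assumption.

# The ConvLSTM stage: 256 filters, 3x3 kernels, stride 1 in every gate,
# returning the hidden state for each time step.
from tensorflow.keras import layers

conv_lstm = layers.ConvLSTM2D(filters=256, kernel_size=(3, 3), strides=(1, 1),
                              padding="same", return_sequences=True)
# features: tensor of shape (batch, n_frames, 7, 7, 2048) from ResNet50
# hidden = conv_lstm(features)   -> shape (batch, n_frames, 7, 7, 256)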
D. MHSA-CONVLSTM
Researchers have been investigating methods to provide machines with consciousness to bridge the gap between humans and machines, thanks to the advancements in artificial intelligence over the past few decades [41]–[43]. One feature of human cognition that has been explored for implementation in machine learning approaches, particularly in computer vision, is mental or vision saccades [44].
Many techniques in the field of computer vision have been proposed to integrate attention mechanisms into deep learning models. Self-Attention [30], MHSA [30], and the Convolutional Block Attention Module (CBAM) [16] are the most commonly used attention models. In our proposed VD model, we used the MHSA technique to train the model to focus on specific points where violent events are more likely to occur.

We analyzed the input feature map in a time-sequence structure and used the MHSA layer to enhance the accuracy of the final classification by selectively concentrating on crucial parts [45].

FIGURE 9. The details of our proposed MHSA-ConvLSTM structure in KianNet.

Our model uses the ConvLSTM output as the input (X) to the proposed attention module. As Equation 1 shows, each input X, reshaped to the size of (n_row × n_column, n_channels), undergoes a transformation to provide Q, K, and V using the learned weight matrices W^Q, W^K, and W^V, respectively:

Q = XW^Q,  K = XW^K,  V = XW^V    (1)

While a single attention function uses d_model-dimensional keys, values, and queries, an MHSA layer instead projects the queries, keys, and values h times to dimensions d_k, d_k, and d_v, respectively. These projections are packed together into the matrices Q, K, and V. The attention function is calculated as shown in Equation 2:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (2)

Since MHSA is composed of several single self-attention modules, and each head represents one scaled dot-product attention layer, the MultiHead function concatenates the heads, head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), as Equation 3 illustrates:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (3)

where W_i^Q ∈ R^(d_model × d_k), W_i^K ∈ R^(d_model × d_k), W_i^V ∈ R^(d_model × d_v), and W^O ∈ R^(h·d_v × d_model) are the weight matrices of this process.

FIGURE 10. Comparison of the value of AUC in binary classification with different numbers of heads.

Fig. 9 shows the details of the MHSA-ConvLSTM module used in KianNet, where we utilize the output from the ConvLSTM layer as input for our MHSA layer. This approach enables the model to concentrate on several objects, as many as the number of attention heads configured in the MHSA layer [46]. Following the application of these attention heads, the feature map for each input frame proceeds through another ConvLSTM layer.
The primary purpose of this step is to revisit and further process the features prioritized by the previous attention layers. Specifically, this second ConvLSTM layer facilitates the model's ability to consider these emphasized features again, but this time over a temporal sequence of frames. Therefore, the MHSA-ConvLSTM mechanism identifies the most important features within each frame and tracks and analyzes them across a series of frames. In designing KianNet, we strategically integrated the MHSA layer between two ConvLSTM layers as our model's distinguishing component. The primary rationale behind this decision was to cater to scenarios where multiple objects are simultaneously involved in various types of violent behaviour. This integration allows our model to prioritize the most significant objects and analyze them in the context of their previous and subsequent frames. This distinctive configuration gives KianNet an edge over other architectures, improving its precision in detecting violent events.
One of the decisive factors in this attention technique is the number of heads, i.e., the number of attention layers (heads) used. Each head computes its own attention scores, allowing the model to focus on different features in the input data. The number of attention heads can be adjusted for each specific task depending on the dataset and techniques. Fig. 10 displays the value of AUC over the number of heads (h) in our experiments with KianNet on the UCF-Crime dataset in binary classification. The blue line indicates the highest AUC value of 97.48% when h = 8. Consequently, we decided to use eight heads for our further experiments.
We use a multi-head self-attention layer because it can focus on several objects based on its number of heads. Although other methods like CBAM could be used in our model, the inner structure of our model, which contains two ConvLSTMs, provides convolutional layers combined with LSTMs, which work well on sequences of frames. This approach captures the spatial and temporal dynamics within the sequence, enhancing the model's overall understanding and interpretation of actions across time.
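A minimal sketch of the MHSA-ConvLSTM block with h = 8 heads follows (Python/Keras). The reshape to (n_row × n_column, n_channels) mirrors Equation 1; using Keras MultiHeadAttention with attention_axes so that attention runs over the 49 spatial positions of each frame independently, and the key dimension of channels/heads, are our implementation assumptions rather than the authors' exact code.

# Sketch of the MHSA-ConvLSTM block: 8-head self-attention over the 49 spatial
# positions of each frame, followed by a second ConvLSTM over the sequence.
from tensorflow.keras import layers

def mhsa_convlstm_block(hidden, n_frames=20, rows=7, cols=7, channels=256, heads=8):
    # hidden: (batch, n_frames, rows, cols, channels) from the first ConvLSTM
    tokens = layers.Reshape((n_frames, rows * cols, channels))(hidden)
    # attention_axes=2 runs attention over the spatial tokens, per frame
    attended = layers.MultiHeadAttention(
        num_heads=heads, key_dim=channels // heads, attention_axes=2)(tokens, tokens)
    attended = layers.Reshape((n_frames, rows, cols, channels))(attended)
    # The second ConvLSTM revisits the attended features across the frame sequence
    return layers.ConvLSTM2D(filters=channels, kernel_size=(3, 3), padding="same",
                             return_sequences=True)(attended)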
E. CLASSIFICATION
The final stage of the proposed model receives a 4-dimensional tensor with dimensions (n_frames, n_rows, n_columns, n_channels).

FIGURE 11. MaxPooling and Fully connected layers in the classification stage of KianNet.

Fig. 11 illustrates the layers of the classification stage. After applying a 3D MaxPooling layer of size (2×2), we flatten the tensor (n_frames, 3, 3, 256) into a vector of size (1, n_frames × 3 × 3 × 256) for the classification objective. We then reduce the dimension by dropping out the less important features. Finally, we only need several fully connected layers of sizes 1000, 256, 10, and n_classes to classify the input video as normal or violent (abnormal).
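The classification head can be sketched as follows (Python/Keras). Pooling only over the spatial axes (so the frame axis is preserved before flattening) and the 0.5 dropout rate are assumptions made for illustration; the dense sizes 1000, 256, 10 and the final n_classes layer follow the description above, with n_classes = 2 for binary classification.

# Sketch of the classification head: spatial max pooling, flatten, dropout,
# then dense layers of sizes 1000, 256, 10 and finally n_classes.
from tensorflow.keras import layers

def classification_head(x, n_classes=2):
    # x: (batch, n_frames, 7, 7, 256) from the final ConvLSTM
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)   # -> (batch, n_frames, 3, 3, 256)
    x = layers.Flatten()(x)                            # -> (batch, n_frames * 3 * 3 * 256)
    x = layers.Dropout(0.5)(x)                         # drop the less important features
    for units in (1000, 256, 10):
        x = layers.Dense(units, activation="relu")(x)
    return layers.Dense(n_classes, activation="softmax")(x)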
IV. EXPERIMENTS
In this section, we present the experimental results of our proposed model, KianNet, and show how it improves violence detection performance on our training video datasets, UCF-Crime [9] and RWF [17].

A. DATA
While finding a dataset that covers all types of violent behaviour may not be possible, several datasets, shown in Table 2, can assist in violence detection.

TABLE 2. Details for several datasets in Violence Detection.

Dataset               Data Scale   Length/Clip (sec)   Resolution
Hockey Fights [47]    1000 Clips   1.6-1.96            360 × 288
Movie Fights [47]     200 Clips    1.6-2               720 × 480
Crowd Violence [48]   246 Clips    1.04-6.52           Variable
UCF-Crime [9]         1900 Clips   60-600              Variable
RWF [17]              2000 Clips   5                   Variable

One of the benchmark datasets in VD, which includes different types of violent behaviour captured by surveillance cameras, is UCF-Crime. Several examples of this dataset are shown in Fig. 12.

FIGURE 12. Example images of UCF-Crime dataset [9].

The UCF-Crime dataset includes 1900 videos of 13 crime categories captured by surveillance cameras in various situations with diverse backgrounds. As a result, a model trained on this dataset is more likely to detect violence accurately when presented with new input videos. UCF-Crime is usually treated as a binary classification dataset, divided into a group of 950 videos covering the 13 types of crime and 950 videos of normal scenes. However, the crime categories can also be treated as individual groups for detecting each specific type of crime. In this case, we only used 700 videos, including 50 from each of the 14 categories.
UCF-Crime was modified in [8] by adding two new subcategories: 4MajCat and NREF.
The 4MajCat sub-dataset divides the dataset into four major categories (Theft, Vandalism, Violent behaviours, and Normal), which distinguishes more clearly between videos.
NREF contains 300 videos of Normal scenes, Road accidents, Explosions, and Fighting, split into 5-second clips. In this sub-dataset, the Normal category is obtained from the trimmed parts of the violent videos, which therefore contain the same objects and backgrounds; this helps the model to be trained more accurately.
Table 3 gives the details of the UCF-Crime dataset when it is used for binary classification (Binary), all categories (AllCat), and the two modified sub-categories, 4MajCat and NREF.
Another dataset captured from real-world scenes using surveillance cameras is RWF, which contains 2000 real-world fighting videos of 5 seconds each. Fig. 13 shows several samples of the RWF dataset.
We chose the two benchmark datasets, UCF-Crime and RWF, because they derive from real-world events. In contrast to other datasets used in VD, such as Hockey Fights, where data is collected in the same environments with many similar objects, UCF-Crime and RWF encompass a wide range of scenarios, positions, situations, and objects. As a result, our proposed model is more likely to perform efficiently in real-world applications when adequately trained on these datasets.
In the following sections, we evaluate our proposed method using the aforementioned datasets and compare its performance with other models to demonstrate its efficacy under different approaches.

TABLE 3. Details of the UCF-Crime dataset's variants: Binary, AllCat, 4MajCat, and NREF.

Binary (no. of videos): Abuse 50, Arrest 50, Arson 50, Assault 50, Burglary 100, Explosion 50, Fighting 50, RoadAccident 150, Robbery 150, Shooting 50, Shoplifting 50, Stealing 100, Vandalism 50, Normal 950.
AllCat (no. of videos): Abuse 50, Arrest 50, Arson 50, Assault 50, Burglary 50, Explosion 50, Fighting 50, RoadAccident 50, Robbery 50, Shooting 50, Shoplifting 50, Stealing 50, Vandalism 50, Normal 50.
4MajCat (no. of videos): Theft (Burglary, Robbery, Shoplifting, Stealing) 150, Vandalism (Arson, Explosion, RoadAccident, Vandalism) 150, Violent behaviours (Abuse, Arrest, Assault, Fighting, Shooting) 150, Normal 150.
NREF (no. of videos): RoadAccident 30, Explosion 50, Fighting 70, Normal 150.

B. PERFORMANCE METRICS
This study uses two primary metrics to evaluate the model's performance: Accuracy and AUC (Area Under Curve). Accuracy measures the percentage of instances correctly classified by a model (Equation 4). It is calculated as the ratio of correctly classified instances to the total number of samples in the dataset. In the case of binary classification, accuracy can be calculated as in Equation 5, where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.

Accuracy = #Correct_predictions / #Total_predictions    (4)

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (5)

While accuracy can be a helpful metric for evaluating the overall performance of a model, it can be misleading in some cases. For instance, in a dataset with imbalanced classes, where one class is much more prevalent than the other, a model that predicts the majority class for every instance can achieve high accuracy, even if it fails to classify instances of the minority class correctly. Another drawback of accuracy is its dependence on a threshold (set to 0.5 by default for binary classification), which is tricky to choose. If the threshold is set too low, the model classifies many scenes as violent, including many normal activities; if it is set too high, many crimes will be missed, increasing the false negative rate.
On the other hand, AUC is a performance metric used to evaluate the quality of a binary classification model's predictions. It measures the ability of the model to distinguish between the normal and abnormal classes in our case. AUC is typically used in the context of the receiver operating characteristic (ROC) curve, which is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds. Therefore, AUC works as a threshold-independent metric. The AUC is calculated by computing the area under the ROC curve and ranges from 0 to 1; the closer the value is to 1, the more robust the classifier.
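Both metrics can be computed from model scores as in the short sketch below (Python); the use of scikit-learn's roc_auc_score is assumed purely for illustration.

# Computing the two metrics of Equations 4-5 and the ROC-based AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    accuracy = np.mean(y_pred == y_true)          # Equations 4 and 5
    auc = roc_auc_score(y_true, y_score)          # threshold-independent
    return accuracy, auc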
FIGURE 13. Example images of RWF dataset [17].

C. SETTINGS
We conducted several experiments with our proposed model, KianNet, using ResNet50, ConvLSTM, and MHSA via the Keras libraries. We set batch_size = 16, learning_rate = 1 × 10^−4, epochs = 50, glorot_uniform as the initial weights, and RMSprop as the optimizer to compare the KianNet model with other methods on the UCF-Crime and RWF datasets.
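These settings correspond roughly to the Keras sketch below; "model", "train_x", and "train_y" are placeholders for the assembled KianNet model and the preprocessed data, and the loss function and validation split are assumptions (glorot_uniform is already the Keras default initializer).

# Training settings as reported above, expressed as a Keras sketch.
from tensorflow.keras.optimizers import RMSprop

model.compile(optimizer=RMSprop(learning_rate=1e-4),
              loss="categorical_crossentropy",   # assumed loss
              metrics=["accuracy"])
history = model.fit(train_x, train_y, batch_size=16, epochs=50,
                    validation_split=0.2)        # assumed split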
D. EVALUATION AND COMPARISON
1) Experiments on RWF
RWF has become one of the benchmark datasets in recent years. Several models, such as 2D CNNs + LSTM [52] and Violence 4D [23], have achieved high accuracy, more than 80%, in classifying violent and non-violent behaviours. We ran KianNet on the RWF dataset to evaluate the capability of our model for violence detection with real-world fighting videos. In our experiments, we trained the model on 80% of the dataset while the rest was reserved for testing. Table 4 compares KianNet with several other models on the RWF dataset based on accuracy, where our proposed model achieved the highest accuracy of 96.21% for violence detection.

TABLE 4. Binary classification on the RWF dataset based on Accuracy.

Author(s)                 Model                  Accuracy (%)
Sudhakaran et al. [40]    Convolutional LSTM     77
Tran et al. [22]          C3D                    82.75
Cheng et al. [16]         Flow Gated Net         87.25
Su et al. [50]            SPIL Convolution       89.3
Islam et al. [51]         SepConvLSTM-M          89.75
Pratama et al. [27]       Two-stream 3D CNN      90.50
Kang et al. [52]          2D CNNs + LSTM         92
Chelali et al. [53]       2D Spatio-Temporal     93.80
Magdy et al. [23]         Violence 4D            94.67
Proposed method           KianNet                96.21

2) Experiments on UCF-Crime
The UCF-Crime dataset includes 13 types of anomalies, while the rest are all normal scenes. Many researchers evaluate their models using AUC in their experiments on the UCF-Crime dataset. This is because AUC is a suitable performance metric due to its threshold independence and its ability to work with imbalanced data, which plays an essential role in UCF-Crime multi-class classification tasks, where the categories have varying numbers of samples.
In Table 5, several VD models are compared using AUC in binary classification on the UCF-Crime dataset. As can be seen from Table 5, the MIL-C3D model proposed by Sultani et al. in their paper [9] gained 74% in AUC. Also, Zhong et al. [56] presented TSN models based on RGB and optical flow, with AUC values of 82% and 78%, respectively. However, one of the best models for violence detection on UCF-Crime was proposed by Ullah et al. [29], who applied a multi-layer BD-LSTM technique to achieve 85% in AUC.

TABLE 5. Binary classification on the UCF-Crime dataset based on AUC.

Author(s)              Model                 AUC (%)
Sultani et al. [9]     SVM                   50
Tur et al. [24]        k-diffusion           65.22
Simonyan et al. [54]   VGG-16                72.66
Liu et al. [25]        PFMF                  74
Biradar et al. [55]    DEARESt               76.66
Zhong et al. [56]      TSN-OpticalFlow       78.08
Zhong et al. [56]      C3D                   81.08
Vosta et al. [8]       ResNetConvLSTM        81.71
Qasim et al. [28]      ConvGRU-CNN           82.65
Tian et al. [57]       RTFM                  84.30
Ullah et al. [29]      Multi-layer BD-LSTM   85.53
Sun et al. [26]        LSTC                  85.88
Zhou et al. [33]       UR-DMU                86.97
Joo et al. [31]        CLIP-TSA              87.58
Proposed method        KianNet               97.48

E. ABLATION STUDY
For the ablation study, a paired experiment was run to test how using a multi-head self-attention module followed by a ConvLSTM layer affected violence detection on the UCF-Crime dataset in terms of Accuracy and AUC. Although a more powerful backbone network was used than in previous work, we considered it interesting to check how much the performance improved by using the attention mechanism. When comparing our proposed model with the one without the MHSA-ConvLSTM module, we obtained better results in both accuracy and AUC. Table 6 presents the results of the ablation study of the ResNet50ConvLSTM architecture and KianNet, where the MHSA-ConvLSTM module was added to the previous model. As can be seen in Table 6, both accuracy and AUC were consistently better when the attention module was used. The most significant improvement from using KianNet occurred when the model was applied to the binary classification dataset, where the AUC value rose from 81.71 to 97.48 percent. There are also improvements in violence detection performance on the other datasets: NREF, 4MajCat, and AllCat. Another challenge in UCF-Crime is classifying each video into its exact class in AllCat; the classifier should assign each input to one of the 13 crime types or normal. The situation worsens when some categories are too similar to distinguish, like shoplifting and stealing. Nevertheless, KianNet improved the accuracy for this classification marginally, from 22.72% to 23.88%.

TABLE 6. Ablation study of the ResNetConvLSTM architecture and KianNet (ResNet50ConvLSTM-MHSA-ConvLSTM) on the UCF-Crime original and modified datasets, based on Accuracy and AUC.

            ResNetConvLSTM              KianNet
Dataset     AUC (%)   Accuracy (%)      AUC (%)   Accuracy (%)
NREF        79.04     65.38             83.14     73.84
4MajCat     73.88     62.22             88.91     73.75
AllCat      53.88     22.72             63.71     23.88
Binary      81.71     62.50             97.48     92.98

V. CONCLUSION AND FUTURE WORK
This paper introduced KianNet, an approach for violence detection from surveillance camera footage. To deal with such video datasets, we used ResNet50 to extract features from each video frame and the ConvLSTM technique to consider the relationships between frame sequences. We also brought vision saccades into our model through MHSA to make the model more conscious of salient regions, similar to how the human brain works. We conducted extensive experiments using the UCF-Crime dataset (original and modified versions) and the RWF dataset to test our proposed model, KianNet. The results demonstrated KianNet's superior performance compared with other violence detection techniques in binary classification. This further underlines the potential of our approach for practical implementations in violence detection and prevention.
Although we have proposed a powerful technique for detecting violence in this study, there are still several aspects that could be improved in the future to enhance the model.
• To better understand the actions happening in a video file, we could add an action-recognition step after the feature extraction part, using YOLOv3 to recognize the extracted body parts and then building a separate ConvLSTM to learn the movement patterns of each body part.
• KianNet can also be applied to other areas to analyze and detect other kinds of events. With its unique learning structure and strong performance in detecting violent behaviour from video surveillance, it can be effectively employed in areas such as fall detection in homecare settings or hospitals.

• Another improvement we could make to our technique is to use the original image alongside the moving parts obtained from the subtraction of frames, to improve the feature extraction.
• Since we work on videos, which usually have sound, it would be helpful to use the sound as a separate input to the model to detect violent actions in videos more accurately.

REFERENCES
[1] Jinzhu Lu, Lijuan Tan, and Huanyu Jiang. Review on convolutional neural network (cnn) applied to plant leaf disease classification. Agriculture, 11(8):707, 2021.
[2] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[3] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
[5] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[7] Rahul Dey and Fathi M Salem. Gate-variants of gated recurrent unit (gru) neural networks. In 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS), pages 1597–1600. IEEE, 2017.
[8] Soheil Vosta and Kin-Choong Yow. A cnn-rnn combined structure for real-world violence detection in surveillance cameras. Applied Sciences, 12(3):1021, 2022.
[9] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.
[10] Meng-Hao Guo, Tian-Xing Xu, Jiang-Jiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang, Ralph R Martin, Ming-Ming Cheng, and Shi-Min Hu. Attention mechanisms in computer vision: A survey. Computational visual media, 8(3):331–368, 2022.
[11] Yaran Chen, Dongbin Zhao, Le Lv, and Chengdong Li. A visual attention based convolutional neural network for image classification. In 2016 12th World Congress on Intelligent Control and Automation (WCICA), pages 764–769. IEEE, 2016.
[12] Shuhan Chen, Xiuli Tan, Ben Wang, Huchuan Lu, Xuelong Hu, and Yun Fu. Reverse attention-based residual network for salient object detection. IEEE Transactions on Image Processing, 29:3763–3776, 2020.
[13] Fernando J Rendón-Segador, Juan A Álvarez-García, Fernando Enríquez, and Oscar Deniz. Violencenet: Dense multi-head self-attention with bidirectional convolutional lstm for detecting violence. Electronics, 10(13):1601, 2021.
[14] Weijiang Li, Fang Qi, Ming Tang, and Zhengtao Yu. Bidirectional lstm with self-attention mechanism and multi-channel features for sentiment classification. Neurocomputing, 387:63–77, 2020.
[15] Ting Wu, Junjie Peng, Wenqiang Zhang, Huiran Zhang, Shuhua Tan, Fen Yi, Chuanshuai Ma, and Yansong Huang. Video sentiment analysis with bimodal information-augmented multi-head attention. Knowledge-Based Systems, 235:107676, 2022.
[16] Boyu Chen, Zhihao Zhang, Nian Liu, Yang Tan, Xinyu Liu, and Tong Chen. Spatiotemporal convolutional neural network with convolutional block attention module for micro-expression recognition. Information, 11(8):380, 2020.
[17] Ming Cheng, Kunjing Cai, and Ming Li. Rwf-2000: an open large scale video database for violence detection. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 4183–4190. IEEE, 2021.
[18] Greg Moreau. Police-reported crime statistics in canada, 2021, 2022.
[19] Jun Zhang and Zhijing Liu. Detecting abnormal motion of pedestrian in video. In 2008 International Conference on Information and Automation, pages 81–85. IEEE, 2008.
[20] Jun Zhang and Zhi Jing Liu. Abnormal behavior of pedestrian detection based on fuzzy theory. 2023.
[21] Guoqing Liu, Jianxin Wu, and Zhi-Hua Zhou. Key instance detection in multi-instance learning. In Asian conference on machine learning, pages 253–268. PMLR, 2012.
[22] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
[23] Mai Magdy, Mohamed Waleed Fakhr, and Fahima A Maghraby. Violence 4d: Violence detection in surveillance using 4d convolutional neural networks. IET Computer Vision, 2023.
[24] Anil Osman Tur, Nicola Dall'Asen, Cigdem Beyan, and Elisa Ricci. Exploring diffusion models for unsupervised video anomaly detection. arXiv preprint arXiv:2304.05841, 2023.
[25] Zuhao Liu, Xiao-Ming Wu, Dian Zheng, Kun-Yu Lin, and Wei-Shi Zheng. Generating anomalies for video anomaly detection with prompt-based feature mapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24500–24510, 2023.
[26] Shengyang Sun and Xiaojin Gong. Long-short temporal co-teaching for weakly supervised video anomaly detection. arXiv preprint arXiv:2303.18044, 2023.
[27] Raka Aditya Pratama, Novanto Yudistira, and Fitra Abdurrachman Bachtiar. Violence recognition on videos using two-stream 3d cnn with custom spatiotemporal crop. Multimedia Tools and Applications, pages 1–23, 2023.
[28] Maryam Qasim Gandapur and Elena Verdú. Convgru-cnn: Spatiotemporal deep learning for real-world anomaly detection in video surveillance system. 2023.
[29] Waseem Ullah, Amin Ullah, Ijaz Ul Haq, Khan Muhammad, Muhammad Sajjad, and Sung Wook Baik. Cnn features with bi-directional lstm for real-time anomaly detection in surveillance networks. Multimedia tools and applications, 80:16979–16995, 2021.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[31] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. arXiv preprint arXiv:2212.05136, 2022.
[32] Weichao Zhang, Guanjun Wang, Mengxing Huang, Hongyu Wang, and Shaoping Wen. Generative adversarial networks for abnormal event detection in videos based on self-attention mechanism. IEEE Access, 9:124847–124860, 2021.
[33] Hang Zhou, Junqing Yu, and Wei Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. arXiv preprint arXiv:2302.05160, 2023.
[34] Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. Going deeper in spiking neural networks: Vgg and residual architectures. Frontiers in neuroscience, 13:95, 2019.
[35] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.
[36] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017.
[37] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[38] Yang Cong, Junsong Yuan, and Ji Liu. Abnormal event detection in crowded scenes using sparse representation. Pattern Recognition, 46(7):1851–1864, 2013.
[39] Gang Zhou and Youfu Wu. Anomalous event detection based on self-organizing map for supermarket monitoring. In 2009 International Conference on Information Engineering and Computer Science, pages 1–4. IEEE, 2009.
[40] Swathikiran Sudhakaran and Oswald Lanz. Learning to detect violent videos using convolutional long short-term memory. In 2017 14th IEEE international conference on advanced video and signal based surveillance (AVSS), pages 1–6. IEEE, 2017.

VOLUME 11, 2023 11

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3339379

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

SOHEIL VOSTA received the B.Sc. degree in Computer Science from the University of Isfahan, Iran, in 2015, and the M.Sc. degree in Computer Science (Computational Theory) from Tarbiat Modares University, Iran, in 2017. He is currently pursuing the Ph.D. degree in Software System Engineering at the University of Regina, Canada. He has been an active IEEE Graduate Student Member for three years and is an ExCom member of the Region 7 South Saskatchewan Section. His research interests began with dimension reduction methods for image processing models and have since focused on deep learning and artificial intelligence techniques for video analysis.
KIN-CHOONG YOW received the B.Eng. (Elect.) degree (Hons.) from the National University of Singapore, in 1993, and the Ph.D. degree from the University of Cambridge, U.K., in 1998. He joined the University of Regina in September 2018, where he is currently a Professor with the Faculty of Engineering and Applied Science. Prior to joining the University of Regina, he was an Associate Professor with the Gwangju Institute of Science and Technology (GIST), Republic of Korea, from 2013 to 2018; a Professor with the Shenzhen Institutes of Advanced Technology (SIAT), China, from 2012 to 2013; and an Associate Professor with Nanyang Technological University (NTU), Singapore, from 1998 to 2013, where he served as the Sub-Dean of Computer Engineering from 1999 to 2005 and the Associate Dean of Admissions from 2006 to 2008. He has published over 100 top-quality international journal articles and conference papers. His research interests include artificial general intelligence and smart environments. He is a member of APEGS and ACM. He has served as a reviewer for a number of premier journals and conferences, including IEEE Wireless Communications and the IEEE Transactions on Education, and has been invited to give presentations at various scientific meetings and workshops, such as ACIRS (2018 to 2019), ICSPIC (2018), and ICATME (2021). He is also the Editor-in-Chief of the Journal of Advances in Information Technology (JAIT).