
Measurement: Sensors 27 (2023) 100718

Contents lists available at ScienceDirect

Measurement: Sensors
journal homepage: www.sciencedirect.com/journal/measurement-sensors

Remote monitoring system using slow-fast deep convolution neural network model for identifying anti-social activities in surveillance applications
Edeh Michael Onyema a, b, *, Sundaravadivazhagn Balasubaramanian c, Kanimozhi Suguna S d, Celestine Iwendi e, B.V.V. Siva Prasad f, Chinecherem Deborah Edeh g

a Department of Vocational and Technical Education, Faculty of Education, Alex Ekwueme Federal University, Ndufu-Alike, Abakaliki, Nigeria
b Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Chennai, India
c Department of Information Technology, University of Technology and Applied Sciences, Al Mussanah, Oman
d Department of Computer Applications, Arulmigu Arthanareeswarar Arts and Science College, Tiruchengode, Tamil Nadu, India
e School of Creative Technologies, University of Bolton, United Kingdom
f Department of Computer Science and Engineering, Anurag University, Hyderabad, India
g Faculty of Law, Enugu State University of Science and Technology, Nigeria

A R T I C L E I N F O

Keywords:
Deep learning
Convolutional neural network
Video processing
Object detection and recognition
Abnormal activity detection
Surveillance monitoring

A B S T R A C T

Remote monitoring is the process of monitoring and observing information from a distance using sensors or electronic equipment. It is used in real-time settings such as traffic, forests, military sites, shops, and hospitals to detect abnormal activities. Earlier research applied video processing methods based on computer vision techniques, but their computational complexity in time and memory is high. This paper designs and implements a novel Slow-Fast Convolution Neural Network (SF-CNN) to identify, detect, and classify abnormal behaviours in surveillance video. The proposed CNN architecture learns the video frames automatically and obtains the most appropriate properties of various objects' behaviour from a large set of videos. The learning process of SF-CNN is carried out in two ways: slow learning and fast learning. The slow learning process is enabled when the frame rate is low, and the fast learning process is enabled when the frame rate is high. Both learning processes capture spatial and temporal information from the input video. Different objects, such as humans, vehicles, and animals, are detected and recognized according to their actions. All the videos contain normal and abnormal activities that vary across contexts. The proposed SF-CNN architecture provides an end-to-end solution for dealing with abnormal movements under multiple constraints. The experiment is carried out on several benchmark datasets, and the performance of the SF-CNN architecture is evaluated. The proposed approach obtained an accuracy of 99.6%, which is higher than existing techniques.

1. Introduction

Anti-social activities are increasing day by day in innumerable fields. Theft and other outlawed activities are considered anti-social; they must be identified immediately so that the area can be protected as quickly as possible. Doing so reduces the loss of data and property and prevents loss of human life. The medical industry, forests, research centers, aerospace, vehicle stations, malls, large buildings, etc., are some areas that need surveillance to avoid abnormal activities. The surveillance system records the activities using CCTV cameras and continuously records them as videos. The output of the surveillance system is a video. The video is processed, and the abnormal activity is identified using the object detection and recognition method. The movement of the objects is classified as normal or abnormal. Today, there are ten reasons business people need a video surveillance system: resolving conflicts, increasing employee productivity, reducing theft, providing a better experience, real-time monitoring, enhancing safety, digital storage, evidence making, access control, and business savings.

* Corresponding author. Department of Vocational and Technical Education, Faculty of Education, Alex Ekwueme Federal University, Ndufu-Alike, Abakaliki,
Nigeria.
E-mail addresses: [email protected] (E. Michael Onyema), [email protected] (S. Balasubaramanian), [email protected]
(K. Suguna S), [email protected] (C. Iwendi), [email protected] (B.V.V.S. Prasad), [email protected] (C.D. Edeh).

https://doi.org/10.1016/j.measen.2023.100718
Received 2 October 2022; Received in revised form 13 January 2023; Accepted 23 February 2023
Available online 4 March 2023
2665-9174/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Though the surveillance monitoring system process is similar everywhere, the applications are different. The cameras' capacity and configuration and the surveillance systems change based on the application. Some surveillance applications are given in Figure-1, which shows surveillance in an office, on a road, in an official building, and behind a house. Different intrusion detection systems have been proposed earlier for security provision, but they detect only after the abnormal event. There is no methodology that can stop strange activities automatically or manually. For preventing or controlling abnormal movements, location information is required together with complete knowledge of the geo-region. This helps to identify the abnormal activity earlier and provides better security for the particular surveillance area. One of the best solutions to enhance existing security is surveillance monitoring.

This paper aims to design and implement deep learning-based abnormal activity detection using a convolution neural network to provide a better solution. Deep learning is a machine learning technique that enables computers to exhibit human-like behaviours that come to humans as second nature. For example, one deep learning technique is used to detect street light signs, and another is taught to distinguish a pedestrian from a cat, a car, etc. Deep learning is also used to correct grammar, spelling, repetitive words, punctuation, and more in given texts and automatically generates new text with no errors. More progress has been achieved in various fields today than earlier [1]. A deep learning model is used for a computer to learn classification tasks over different kinds of data such as voice, images, and texts. The level of performance of deep learning can exceed human efficiency, reaching state-of-the-art accuracy. Multi-layered neural network architectures and an enormous amount of labeled data are used to train these models.

Because of its accuracy, deep learning technology is widely used in a variety of applications at a high and critical level [2]. These techniques are efficient in electronics and help meet user expectations. Recent developments in deep learning outperform humans and are thus used in safety-critical applications such as robotics, AI cars, image caption generation, etc. Though deep learning was first proposed in the 1980s, its usage has become popular only recently, for the following reasons.

Deep learning needs a massive amount of labeled data. For example, in the case of autonomous vehicles (driverless cars), millions of images and thousands of hours of footage are required to train the computer for that crucial job.

It is also well known that deep learning needs substantial hardware and computing power; deep learning networks can require many weeks to train. So, to reduce the training period, development teams employ hybrid computing, i.e., using a GPU's (Graphics Processing Unit) massive parallel processing power rather than cluster or cloud computing alone, to boost performance.

A deep learning model is also known as a deep neural network, as neural network architectures are used in primary deep learning methods. A deep neural network can contain up to about 150 hidden layers, whereas a traditional neural network can accommodate only 2–3 hidden layers. In a neural network, the number of hidden layers is denoted by the term "deep." Deep learning models are trained using a large amount of labeled data and neural networks capable of learning characteristics directly from the data without the need for manual feature extraction [3]. Convolutional neural networks (CNN) are a famous type of deep neural network. This architecture is most suitable for processing 2D data such as images, as a CNN uses 2D convolution layers that convolve learned features with the input data.

A CNN does not require manual feature extraction to classify images, so hand-crafted feature identification is eliminated. Feature extraction is done directly from the images using the CNN, and the structure of the neural model is represented in Fig. 2. A collection of images is used to train the network, and the relevant features are learned simultaneously, eliminating the need for pre-training of the relevant features. In computer vision tasks like object classification, high accuracy is attained using automated feature extraction. A large number of hidden layers is required for CNNs to identify the various features of an image. Hidden layers raise the complexity of the image features that can be learned. For instance, the first and second hidden layers may identify the edges and more complex shapes used to recognize an object.

1.1. Machine learning versus deep learning

Machine learning and deep learning are often used interchangeably, but deep learning is specialized toward human-like artificial intelligence, making it more efficient than machine learning for such tasks. In machine learning, algorithms are manually fed data to parse. An algorithm learns from the given data, derived from images, videos, texts, and other information, to design a model that categorizes the objects in the data. In machine learning, the model that performs a function uses its data and gets better at the task over a period of time.

Fig. 1. Various applications of surveillance system.

Fig. 2. Structure of a neural network.

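To illustrate the claim above that the first hidden layers respond to edges, the following is a minimal sketch (not the authors' code; the Sobel-style kernel and toy image are assumptions) of how a single convolution filter of the kind a CNN learns highlights a vertical edge:

```python
import numpy as np

# Sobel-style vertical-edge kernel, similar to filters a CNN's first
# convolution layer typically learns (illustrative, not from the paper)
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]])

def cross_correlate(img, kernel):
    """Slide the kernel over the image (cross-correlation, as CNN layers compute)."""
    kh, kw = kernel.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half, bright right half -> response peaks at the boundary
img = np.concatenate([np.zeros((5, 4)), np.ones((5, 4))], axis=1)
response = cross_correlate(img, edge_kernel)
```

The response is zero over the uniform regions and large only at the brightness boundary, which is exactly the kind of low-level cue later layers combine into shapes.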

The deep learning model uses a layered structure of algorithms known as an Artificial Neural Network (ANN), inspired by the neural system of the human brain. An ANN can analyze data continuously and draw precise decisions on its own, just as a human brain would. Deep learning keeps improving as the size of the data increases, and a sample layered architecture is given in Fig. 3. Machine learning also provides various models and techniques, from which one can choose based on the application to be solved. A successful deep learning technique requires an enormous amount of data, i.e., millions of images and hours of videos, to train the model, and a Graphics Processing Unit is used for fast processing of the data. If these are not available, it is better to use machine learning algorithms rather than deep learning.

The paper's contribution is that it uses two learning processes to learn the video frames to increase object recognition accuracy: slow learning and fast learning. The paper's novelty is that the deep learning model understands the video data and analyzes it according to the frame rate. If the frame rate is low, it analyses the frames slowly and obtains the spatial semantics. If the frame rate is high, it analyses the frame structures and obtains the temporal semantics. Thus, the proposed deep learning model learns both spatial and temporal information for video processing and object recognition.

1.2. Related works

Before analyzing the proposed approach, it is necessary to conduct a deep study of various existing methods to understand the issues and challenges faced by earlier works. For example, the authors in Ref. [4] stated that pervasive computing uses CCTV cameras for video surveillance; it is considered recent research work in which the video data is analyzed using machine learning approaches. The devices used for video capturing are very common throughout the world, and the human resources used for video analysis are minimal but expensive. In most cases, surveillance cameras are used for surveillance monitoring. The authors in Ref. [5] stated that human factors like tiredness and fatigue lead to poor tracking; humans working with CCTV monitoring also suffer monotony, because in many cases strange or unusual events occur only rarely. The authors in Ref. [6] noted that methods such as anomaly detection, abnormality detection [7], and outlier detection [8] are major topics in different real-time areas like intrusion detection in networks, medical diagnosis, marketing, and automatic surveillance. The authors in Ref. [9] said that providing a security algorithm could break the anomaly detection. Most of the research works focused on delivering security algorithms to avoid anomalies, while some focused on detecting abnormal events. The authors in Ref. [10] used the optical flow method and the Gaussian mixture model to identify and detect strange activities occurring in crowd scenes. The authors in Ref. [11] presented some significant research methods associated with abnormal behavior detection in crowd scenes, which can also be used for vehicle detection under traffic conditions in congested areas.

Thermal cameras were employed in several of the research studies for surveillance monitoring. At the same time, the authors of [12] explained that the sensitivity and quality of thermal videos are lower than those of other videos; thermal videos, according to the authors, tend to be noisy and have poor visual quality. Feature extraction, according to the authors in Ref. [13], is a method for extracting motion as well as spatial information from video frames. According to the authors in Ref. [14], high- and low-level features are strongly integrated in inferring unusual behaviors, which aids the programmer in quickly identifying weird activities and analysing any complicated behaviours within a video. The authors in Ref. [15] were motivated to detect multiple objects in a video based on feature extraction and classification; they obtained 90% accuracy in object detection in their experiment, which was further improved. Numerous researchers are still focusing on and researching surveillance monitoring-based applications. In Ref. [26], the authors proposed Cyborg intelligence for designing an intrusion detection system for cloud network traffic, and various security challenges were projected in Ref. [27]. The authors in Ref. [28] implemented an SF-CNN model for detecting suspicious activity with the support of surveillance applications. RelativeNAS is another advanced CNN model, proposed by the authors in Ref. [29], which describes the performance of combined fast and slow learners; this model aids in searching for objects with reduced cost and error rates. Another application of SF-CNN is presented in the article [30], in which the model is tested for performing action recognition through detection procedures. A detailed description of this two-pathway CNN model is elaborated in Ref. [31]. SoFTNet, a concept-controlled DCNN, and Attention Slow-Fast Fusion Networks, presented in Refs. [32,33], are other applications that use this SF-CNN model.

Fig. 3. Layered Architecture of the proposed System.

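The frame-rate-dependent routing described in the contribution above can be summarized in a minimal sketch; the 30-fps threshold and the function name are illustrative assumptions, not values stated in the paper:

```python
def route_learner(frame_rate_fps, threshold_fps=30):
    """Route input to the slow or fast learner by frame rate.

    The 30-fps threshold is an illustrative assumption; the paper does not
    state the switching value. A low frame rate engages the slow learner
    (spatial semantics); a high frame rate engages the fast learner
    (temporal semantics).
    """
    if frame_rate_fps < threshold_fps:
        return ("slow", "spatial")
    return ("fast", "temporal")
```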

1.3. Limitation and motivation

Though the accuracy of object detection and classification of abnormal activities needs to be improved, some semi-automatic methods consume more computational time, increasing the cost. Most of the earlier research works were proposed for specific applications in surveillance monitoring. Also, the video learning process can extract only a particular type of feature from the video, which cannot bring high accuracy in recognition. The learning rate over the entire dataset is a variable that is changed during program execution to tune the results, which adds time complexity to video processing models. But the industry needs a fast and accurate video recognition model for surveillance applications. Hence, this paper aims to design and implement a slow-fast deep CNN model that learns the video frames with different learning rates and retains various spatial and temporal features to automatically improve object recognition accuracy.

2. Deep learning-based object detection and recognition

From the above discussion, it is clear that the proposed deep learning approach is highly suitable for video processing, object detection and recognition, pattern recognition, and speech recognition applications. This paper recommends using the Convolution Neural Network in deep learning for object detection and abnormality identification. Initially, the input video is obtained from the system/PC connected to the application's wired or wireless CCTV cameras. The video file is automatically stored on the PC if it is a wired connection. An intermediate device, such as a router, is used to send the video to the appropriate PC. In this work, some assumptions are established, which aid in a thorough understanding of the entire procedure.

1. Any number of surveillance cameras can be connected to the application network.
2. The PC interlinked with the camera has the MATLAB software installed, and our proposed algorithm is executed automatically when the user clicks the run button.
3. Our proposed approach knows the video file's location, the video file's name, and the other meta information.
4. The model learns the video frame rate while streaming and acts accordingly.
5. The spatial and temporal features are extracted automatically from the video frames to identify the object and its motions.
6. The abnormal activity occurs in the midst of a video rather than at the start or end.

The deep learning approach is used in this work to detect and identify aberrant actions in a surveillance video. Deep learning is well recognized to be inspired by several neural network architectures with a deep framework for learning features while representing data. A typical neural network model comprises the input layer, the output layer, and a significant number of hidden layers of different kinds. Deep learning offers various large networks with a substantial number of layers. One famous and widely applicable deep learning network is the Convolutional Neural Network (CNN). This paper aims to use a CNN architecture for identifying abnormal activities. The CNN learns and classifies the features automatically from the video/image data. This paper also analyses and evaluates the proposed CNN architecture's performance on human, vehicle, and animal behaviours involving various backgrounds; this kind of multiple-data processing has not been carried out before.

2.1. Proposed approach

The proposed CNN architecture is explained in this section. The video V is divided into frames (images) F, in which various objects and their activities are normal and abnormal. Some specific abnormal activities differ from the usual activities. To improve the efficiency of video/image processing, the images are first preprocessed using a moving 3 × 3 average filter, which removes the noise occurring in the images. It can be represented as in Equation (1).

y_ij = Σ_{k=−m}^{m} Σ_{l=−m}^{m} w_kl x_{i+k,j+l}    (1)

where the input image is represented as x_ij, (i, j) represents the pixels in the image, and y_ij represents the output image. A linear filter of size 3 × 3 is a special case of the general window size given in Equation (2).

(2m + 1) × (2m + 1)    (2)

Equation (1) has the weights w_kl, for every k and l from −m to m, equal to 1.

Each video is made up of a large number of frames, and video processing using the DCNN is given in Fig. 4. For example, a 1-min video has 100 to 110 video frames. The video frames are extracted from the input videos, called a video sequence, either by writing a computer program or manually. Similarly, one action spans many frames, and most of those frames contain the same objects performing the same activities, so processing all of them wastes video processing time. Hence, it is essential to choose the video frames so as to speed up the process and provide better classification efficiency. To improve the classification accuracy, 30% of the video frames are initially used for the training process, labeling the objects as "normal" or "abnormal." The number of abnormal activities is smaller than the number of normal activities, so labeling the abnormal activities is sufficient for the classification, reducing the computational time, memory, and complexity. The time complexity can also be reduced by processing only the selected frames, with the other frames labeled as normal. Only the abnormal frames and their labels are stored in a particular database during the training process, for reference in the testing process. The abnormal activities are identified by wrong/odd features, which differ from the normal video sequences. The remaining 70% of the frames are used for the testing process, and the final extracted features are compared with the trained features.

The proposed CNN architecture comprises three essential components: input, output, and hidden layers. The hidden layers are also called convolutional, middle, or feature detection layers. The output layers are also called classification layers, which have two components: fully connected and SoftMax. All the frames are resized to 32 × 32, reducing the time of the training process. The input layers read all the input images and send them to the middle layers. In the middle layer, three modes of operation are carried out: convolution and pooling with rectification; the Rectified Linear Unit (ReLU) performs the rectification. This proposed SF-CNN uses three convolutional, two fully connected

Fig. 4. Deep convolution neural network-based video processing.


with one SoftMax layer. The architecture of the proposed SF-CNN is shown in Figure-5. Figure-5(a) shows how SF-CNN can change the learning method based on the frame rate, and Figure-5(b) depicts the complete functionality of the CNN.

Some features, such as texture, edge, and corner features, are extracted by the convolution filters and activated in the input image. These features are the most valuable ones that help identify the actions performed in the frames. Feature values that do not match the overall set represent an abnormal action. In the 1st convolution layer, 32 filters (5 × 5 × 3) are used; in 5 × 5 × 3, the 3 represents the color channels of the input images. Symmetrically, a two-pixel pad is added so that the image's edges are also taken into the process. This is important because it saves the edges from elimination in the CNN. Negative values are changed to zero in the ReLU layer to retain only the positive values for processing. The ReLU layer is the fastest of all the layers in training the network. The ReLU layer is always followed by a pooling layer with a 3 × 3 spatial pooling region and a stride of 2 pixels. The data size is then down-sampled to 15 × 15 from the initial 32 × 32. All three convolutions with pooling and ReLU layers are executed repeatedly to extract as many features and as much hidden information from the input image as possible.

A larger number of pooling layers is avoided to prevent over-down-sampling of the data, since essential features may be discarded at an early stage. Once the feature extraction process is completed successfully, the CNN performs the classification process. Two different layers, 1. fully connected and 2. SoftMax, are used in the network for classification. The first fully connected layer comprises 64 neurons obtained from the input image size of 32 × 32, followed by a ReLU layer. The second fully connected layer then generates the output signals to be classified. The entire CNN network combines the input, middle, and final layers. Finally, the SoftMax layer computes the probability distribution over each class. The weights of the convolution layer are initialized from a random distribution with a standard deviation of 0.0001, decreasing the loss during the learning process in the network.

3. SF-CNN

In this paper, the deep CNN is incorporated with the Slow-Fast learning method for analyzing the video segments. It comprises two parallel CNN models for the same input video: a slow learner and a fast learner. Generally, video content contains two different kinds of data: static and dynamic. The static data does not change or changes slowly, but the dynamic data (moving objects) changes continuously. According to Figure-5(a), the video frames obtained from fast streaming are input to the slow frame-rate learner, since the slow learner can learn from the output of the fast learner. The data format used in the SF learner is written as in the following Equation (3).

Fig. 5. Proposed Deep Learning Architectures. (a). SF-CNN framework, (b). Functionality of CNN.

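The 32 × 32 → 15 × 15 down-sampling stated in the layer description above can be checked with the standard convolution/pooling output-size formula (an illustrative calculation, not the authors' MATLAB code):

```python
def out_size(size, kernel, pad=0, stride=1):
    # standard output-size formula for convolution and pooling layers
    return (size + 2 * pad - kernel) // stride + 1

# 5x5 convolution with a symmetric 2-pixel pad keeps the 32x32 frame size
after_conv = out_size(32, kernel=5, pad=2)             # -> 32
# 3x3 pooling region with a stride of 2 pixels down-samples to 15x15
after_pool = out_size(after_conv, kernel=3, stride=2)  # -> 15
```

This confirms why the two-pixel pad matters: without it, the 5 × 5 convolution alone would already shrink the frame and discard edge information before pooling.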

{ }
f astlearning = αT, S2 , βC 4.1. Dataset used

{ }
slowlearning = T, S2 , αβC (3) In this paper, seven different kinds of datasets are used in the
experiment. The full dataset details are given in Table-1, which com­
Equation (3) is used fused to create the SF learning process. The SF- prises human activities, vehicle activities, and animal activities. Each
CNN suggests a different methodology for transforming the data. The frame is an RGB color image with different sizes in the number of pixels.
final one is the most efficient. For example, the size of the images from other datasets is 276 × 236,
T-2-C (Time-2-Channel): data reshaping and transposing: {αT, S2 , 352 × 240, 360 × 288, etc. Because of the frames’ varying sizes, all the
βC}→{T, S2 , αβC}, that is all the α frames as one frame to channel. images are resized into 32 × 32 pixels. All the frames have a portion of
TSS (Time-Strided-Sampling): take each α frame as a sample, and it positives and negative samples. The total videos taken for the training
makes {αT, S2 , βC} = {T, S2 , βC}. and testing process are 100 from the entire dataset. The total number of
TSC (Time-Strided-Convolution): It performs as a three-dimensional images used, which were extracted from 100 videos, is 12000 frames.
convolution of a 5 × 12 kernel with 2 βC output channel and α as stride. Out of this, 5000 frames are normal, and other frames are considered
Finally, both slow and fast learning are combined as SF learning to abnormal.
perform global pooling operator, efficiently reducing dimensionality. In the proposed framework, two different experiment stages are
Then integrate both learners, and the output is fed to the FC layer to carried out. Initially, from experiment-1, the normal versus abnormal
classify it. The FC layer uses the SoftMax layer to classify the object’s classes are classified. Then from experiment-2, all the abnormal classes
behavior as normal or abnormal. The full functionality of the proposed are classified. For training the network, stochastic gradient descent with
SF-CNN architecture is given in the form of an algorithm. It is imple­ momentum method is used. All the parameters inside the network are
mented and experimented with in any computer programming lan­ tuned to obtain all the features which affect the output network. The
guage, and the results are verified. The pseudocode of the SF-CNN number of epochs used in the experiment is 10–100, and the learning
architecture is given as Pseudocode-1. rate is 0.001–0.1. for fine-tuning the hyper-parameters of the network.
One round of operation, including forwarding and backward passing in
the training samples with the learning rate, is considered one epoch.
This SF-CNN can also be called a dual-mode CNN for understanding the video. A deep CNN is used to identify patterns in images and videos: each frame of the video is treated as an image, and object or pattern identification is performed on one frame at a time. The SF-CNN, in contrast, can analyze multiple frames at a time, collecting both the static and the dynamic data in the video.

4. Experimental results and discussion

In this paper, the algorithm is implemented and verified in MATLAB software, which contains a built-in CNN module in the Image Processing toolbox. It provides many functionalities and offers various inbuilt algorithms, such as regression and decision-making methods, that can be chosen for performance verification.

Deep learning needs more inputs to provide a high level of accuracy. In terms of hardware requirements, deep learning needs a high-performance GPU. For experimenting with the proposed CNN, MATLAB-2017 software is installed on an Intel Core i7 machine. The images are represented in binary for the classification process in the experiment. The normal and abnormal images are obtained from various datasets, and the abnormal activities are classified. The normal activities obtained from human-related videos are walking, pointing, hugging, and handshaking. The abnormal activities in human-related videos are kicking, pushing, and punching. The number of classes obtained using the proposed CNN is compared with the classes already labeled in the dataset, and the performance is evaluated. The results are presented as images; each row of Figure-6 shows sample normal and abnormal images.

From the experiment, the normal and abnormal images classified using the proposed CNN are given in Figure-7. By comparing the images in Figure-6 and Figure-7, it is easy to evaluate the performance of the proposed CNN. Abnormal activities in each frame are identified and detected using a binary classification model, and the abnormal activity classification is obtained using the CNN. Every abnormal activity is highlighted with a yellow bounding box in the frames shown in Figure-7. The abnormalities shown there include pushing, kicking, fighting, beating, walking on the wrong path, crowds in the wrong places, a car on the roadside, and cycles, cars, trucks, and jeeps coming onto the pedestrian road. The abnormal identification and detection accuracy can be obtained by assigning the appropriate learning rate in the experiment.

Table 1
Dataset information.

Dataset                              Videos   Total Frames   Frames-Normal   Frames-Abnormal
CMU Graphics Lab Motion [16]         11       2477           1209            1268
UT-Interaction Dataset (UTI) [17]    54       5069           2706            2903
Peliculas Dataset (PEL) [18]         2        368            100             268
Hockey Fighting Dataset (HOF) [19]   12       1800           900             900
Web Dataset (WED) [20]               10       1280           640             640
UCSD-AD [21]                         5        600            480             120
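The dual-pathway idea can be illustrated at the frame-sampling step alone. The sketch below is a minimal illustration under stated assumptions (a clip already decoded into a list of frames, and an assumed slow-pathway stride of 8); it is not the paper's MATLAB implementation:

```python
# Illustrative slow/fast frame sampling (assumed stride; not the paper's code).
def sample_pathways(frames, slow_stride=8):
    fast = frames                  # fast pathway: every frame, for dynamic (motion) cues
    slow = frames[::slow_stride]   # slow pathway: strided frames, for static (appearance) cues
    return slow, fast

clip = list(range(32))             # stand-in for 32 decoded frames
slow, fast = sample_pathways(clip)
print(len(slow), len(fast))        # 4 32
```

With a stride of 8, a 32-frame clip yields 4 slow-pathway frames while the fast pathway sees all 32, which is how the network can gather static and dynamic information from the same clip at once.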
E. Michael Onyema et al. Measurement: Sensors 27 (2023) 100718

Fig. 6. Sample Normal and Abnormal Images collected from various datasets.


Fig. 7. Abnormality Detection using proposed CNN.


In both experiments, the normal and abnormal classes, including all the abnormal sub-classes, are classified more accurately using the proposed CNN, as can be seen from Figure-7. The high performance of the proposed CNN is evident when the results in Figure-6 and Figure-7 are compared. The abnormal detection accuracy is increased since the testing process is always compared against the training process. Hence, for human-interaction classification, the training classes are highly accurate and are used in the testing process.
The performance of the proposed CNN approach can be evaluated by verifying the time and computational complexity. The time complexity of the experiment is calculated first, and the obtained results are shown in Figure-8, which also displays the duration of each phase of the procedure: video segmentation, frame segmentation, object recognition, and classification. From the results, it is understood that video processing requires more time than image processing; hence, once the video is converted into frames, the time required for object detection and classification is low.
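The per-phase time measurement described above can be reproduced with a simple profiling harness. The sketch below is an illustration only: the four stage functions are placeholder workloads standing in for the actual video segmentation, frame segmentation, object recognition, and classification routines, which in the paper are MATLAB code:

```python
import time

def profile(stages):
    """Run each (name, fn) stage once and record its wall-clock duration."""
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings

# Placeholder workloads; the real stages would operate on the video and frames.
stages = [
    ("video_segmentation", lambda: sum(range(200_000))),
    ("frame_segmentation", lambda: sum(range(100_000))),
    ("object_recognition", lambda: sum(range(50_000))),
    ("classification",     lambda: sum(range(20_000))),
]
for name, seconds in profile(stages).items():
    print(f"{name}: {seconds:.6f} s")
```

Timing each stage separately is what makes the observation above visible: the one-time video-to-frame conversion dominates, while per-frame detection and classification are comparatively cheap.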
Some errors may occur in the video data because of the source, power failures or fluctuations, or conversion of the video format. Such errors spoil the entire object detection and classification process and degrade image processing performance in tasks like object detection and classification. Hence, it is necessary to compute the errors present in the input frames at the frame level and the pixel level. Figure-9 shows the error prediction results obtained using the proposed CNN method together with the existing methods of Refs. [22–24] and [25]. The low frame-level and high pixel-level errors of all the methods are calculated and compared. From the comparison, it has been found that the proposed CNN method obtains fewer frame errors at the pixel level. Once the error frames are identified, they are eliminated from the process, improving object detection and classification accuracy. In the experiment, 70% of the frames are applied to the testing process, and the classification accuracy is calculated.

Fig. 9. Dataset based error analysis.

The object classification performance obtained from the experiment is given in Table-2, along with the total number of frames used in the investigation and the number of correctly classified frames. The total number of frames collected from all the datasets is 12134, of which 6035 are normal. The remaining 6099 frames are abnormal, predefined and verified by various earlier research works.

Table 2
Performance calculation.

Data           Total frames       Total frames   Normal   Abnormal
               before the error   after error    frames   frames
Dataset        12180              12134          6035     6099
Proposed CNN   12180              12134          6034     6098

From Table-2, it is evident that 6034 out of 6035 frames are classified as normal and 6098 out of 6099 frames are classified as abnormal using the proposed CNN architecture. For evaluating the performance, the performance factors TP (true positive), TN (true negative), FP (false positive), FN (false negative), sensitivity, specificity, and accuracy are calculated from Table-2. The accuracy is calculated and compared with various existing approaches based on these performance measures and is shown in Fig. 10.

Fig. 10. Classification accuracy.

From the comparison results, it is clearly understood that the proposed CNN architecture outperforms the other techniques. The proposed CNN obtained 99.6%, higher than the approaches discussed in the literature survey.
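The listed performance factors follow directly from a confusion matrix. As a sketch, the per-class counts implied by Table-2 can be plugged in, under the assumption that "abnormal" is treated as the positive class; the resulting per-frame rates are illustrative, while the 99.6% figure above is the paper's own overall measurement:

```python
def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)               # true-positive rate
    specificity = tn / (tn + fp)               # true-negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

# Counts implied by Table-2: 6098/6099 abnormal and 6034/6035 normal frames correct.
sens, spec, acc = metrics(tp=6098, tn=6034, fp=1, fn=1)
print(round(sens, 4), round(spec, 4), round(acc, 4))  # 0.9998 0.9998 0.9998
```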
The classification accuracy depends on the number of epochs and the learning rate. The learning rate of the CNN is varied as 0.001, 0.01, and 0.1 for epochs ranging from 10 to 50. The accuracy is calculated for the different datasets under these learning rates and epoch counts, and the obtained results are given in Table-3. The accuracy increases with the number of epochs; hence, the proposed approach is also executed with the number of epochs increased up to 100, further improving accuracy. According to the findings, the proposed CNN performs better than the other methods currently in use in terms of object recognition, classification accuracy, and time complexity.
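The collapse at the 0.1 learning rate visible in Table-3 is a generic optimization effect, not specific to this CNN. A toy gradient descent on a one-dimensional quadratic (an illustration only; the paper's training is a full CNN in MATLAB) shows the same pattern: small rates converge, while a too-large rate oscillates and never settles:

```python
def distance_after_descent(lr, steps=50, w=0.0, target=3.0):
    """Gradient descent on f(w) = 10 * (w - target)**2; returns final |w - target|."""
    for _ in range(steps):
        grad = 20.0 * (w - target)  # f'(w)
        w -= lr * grad
    return abs(w - target)

for lr in (0.001, 0.01, 0.1):
    print(lr, distance_after_descent(lr))
```

With this curvature, each step multiplies the error by (1 - 20·lr): 0.01 shrinks it rapidly, 0.001 shrinks it slowly, and 0.1 merely flips its sign without shrinking it, mirroring the pattern in Table-3 where 0.01 gives the best accuracies while 0.1 drops to 0% on some datasets.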
Fig. 8. Time complexity calculation.

Sensitivity analysis deals with the ability of the model to identify the frames with normal and abnormal conditions of human activity. The analysis results are presented in Fig. 11, which shows that the proposed CNN model has the highest frame identification ability, 98%, a difference of 22% over the Reddy et al. (2011) [24] model.

Table 3
Accuracy based on learning rate.

Learning   Max. No.    Dataset Accuracy (%)
Rate       of Epochs   CMU     UTI     PEL     HOF    WED     UCSD-AD
0.001      10          99.66   54.9    90.38   58.5   89.21   99.9
           20          100     56.55   90.38   100    100     100
           30          100     57.74   90.38   100    100     100
           40          100     70.18   90.38   100    100     100
           50          100     99.15   90.38   100    100     100
0.01       10          100     54.9    90.38   100    100     100
           20          100     99.6    90.38   100    100     100
           30          100     99.75   87.98   100    100     100
           40          100     99.55   100     100    100     100
           50          100     99.8    100     100    100     100
0.1        10          46.64   54.9    9.62    100    100     100
           20          0       0       9.62    100    100     100
           30          0       54.9    0       100    100     100
           40          0       54.9    90.38   100    100     100
           50          0       0       9.62    100    100     100

Fig. 12. Prediction Time (s).
Fig. 12 represents the prediction time taken by the models in identifying or classifying the frames as normal or abnormal with respect to the occurrence of anti-social activities. The prediction time is measured in milliseconds (ms). The graph shows that the proposed CNN model takes the least time, 50 ms, to predict abnormal human activities, whereas the model of Saligrama and Zhu (2012) [25] takes the longest, 97 ms, for a random frame in a given video.
Efficiency analysis focuses on measuring the speed gain the models achieve during the overall progress of a given task. In this research, the efficiency parameter is considered for evaluating the model in classifying the video frames into normal and abnormal human activities. The results of this evaluation are presented in Fig. 13. Among the models considered for evaluation, the proposed CNN model achieves the highest efficiency, a minimum increase of 20% compared with the existing models.

Fig. 13. Efficiency of the implemented Model (%).

As this research concentrates on sending the video data collected from the surveillance camera to the cloud system to perform the frame classification analysis, it is essential to evaluate the delay caused in data transmission. The delay of data transmission is calculated in milliseconds (ms). Fig. 12 depicts this evaluation of delay, and from the graph it can be observed that the proposed model shows the least delay, 14% lower than the other existing models.

5. Conclusion

The primary goal of this paper is the design and implementation of a unique deep learning algorithm for anomaly detection in surveillance systems. To this end, this paper creates a CNN architecture for learning, extracting information, and classifying the abnormality in surveillance video frames. The specialty of this paper is that abnormality identification is carried out over several datasets. The main motto is to design a common abnormal identification system for any kind of surveillance application, including human, animal, and vehicle monitoring. The proposed deep learning model is trained to improve the accuracy level. The learning rate and the number of epochs are varied in the experiment, and the performances are verified. An increased number of epochs can tune the accuracy to high levels. The results of the experiments showed that the proposed CNN performs better than the alternative approaches. The obtained accuracy is 99.6% for abnormal activity classification. The various classes of abnormality can be classified individually under different situations, making the system fully automatic and suitable for any surveillance system. As a future enhancement of this research, the model can be tested with varying video categories and certain other parameters, such as the duration of the video, the timestamp of the recorded video, and the number of persons available in each frame of the video.

Funding statement

The author(s) received no specific funding for this study.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 11. Sensitivity analysis of the models.


Data availability

Data will be made available on request.

References

[1] Z.Q. Zhao, P. Zheng, S.T. Xu, X. Wu, Object detection with deep learning: a review, IEEE Transact. Neural Networks Learn. Syst. 30 (11) (2019) 3212–3232.
[2] M.M. Najafabadi, F. Villanustre, T.M. Khoshgoftaar, N. Seliya, R. Wald, E. Muharemagic, Deep learning applications and challenges in big data analytics, Journal of Big Data 2 (1) (2015) 1.
[3] S.P. Balamurugan, M. Duraisamy, Deep convolution neural network with gradient boosting tree for COVID-19 diagnosis and classification model, European Journal of Molecular & Clinical Medicine 7 (11) (2020).
[4] Sutrisno Ibrahim, A comprehensive review on intelligent surveillance systems, Communications in Science and Technology 1 (1) (2016).
[5] Roberto Arroyo, J. Javier Yebes, Luis M. Bergasa, Ivan G. Daza, Javier Almazan, Expert video-surveillance system for real-time detection of suspicious behaviors in shopping malls, Expert Syst. Appl. 42 (21) (2015) 7991–8005.
[6] Kun Wang, Stanley Langevin, Corey O'Hern, Mark Shattuck, Serenity Ogle, Adriana Forero, Juliet Morrison, Richard Slayden, Michael Katze, Michael Kirby, Anomaly detection in host signaling pathways for the early prognosis of acute infection, PLoS One 11 (8) (2016).
[7] Yudong Zhang, Genlin Ji, Jiquan Yang, Shuihua Wang, Zhengchao Dong, Preetha Phillips, Ping Sune, Preliminary Research on Abnormal Brain Detection by Wavelet Energy and Quantum-Behaved PSO, Technology and Health Care, 2016, pp. 1–9.
[8] Soumi Ray, Adam Wright, Detecting anomalies in alert firing within clinical decision support systems using anomaly/outlier detection techniques, in: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, USA, 2016, pp. 185–190.
[9] Wei Wang, Lin Chen, Shin Kang, Lingjie Duan, Thwarting intelligent malicious behaviors in cooperative spectrum sensing, IEEE Trans. Mobile Comput. 14 (11) (2015) 2392–2405.
[10] Oscar Rojas, Clesio Tozzi, Abnormal crowd behavior detection based on Gaussian mixture model, in: Proceedings of European Conference on Computer Vision, Springer International Publishing, Cham, 2016, pp. 668–675.
[11] Andrea Pennisi, Domenico Bloisi, Luca Iocchi, Online real-time crowd behavior detection in video sequences, Comput. Vis. Image Understand. 144 (2016) 166–176.
[12] Supriya Mangale, Madhuri Khambete, Camouflaged target detection and tracking using thermal infrared and visible spectrum imaging, in: Proceedings of International Symposium on Intelligent Systems Technologies and Applications, Springer International Publishing, Cham, 2016, pp. 193–207.
[13] Xian Yang, Xuejian Rong, Xiaodong Yang, Yingli Tian, Evaluation of low-level features for real-world surveillance event detection, IEEE Trans. Circ. Syst. Video Technol. 27 (3) (2016) 624–634.
[14] Serhan Coşar, Giuseppe Donatiello, Vania Bogorny, Carolina Garate, Luis Otavio Alvares, François Brémond, Toward abnormal trajectory and event detection in video surveillance, IEEE Trans. Circ. Syst. Video Technol. 27 (3) (2017) 683–695.
[15] Sarita Chaudhary, Mohd Aamir Khan, Charul Bhatnagar, Multiple anomalous activity detection in videos, Procedia Comput. Sci. 125 (2018) 336–345.
[16] CMU Graphics Lab motion capture database, last accessed 2018/1/2, http://mocap.cs.cmu.edu/.
[17] M.S. Ryoo, J.K. Aggarwal, UT-Interaction Dataset, ICPR Contest on Semantic Description of Human Activities (SDHA), http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html.
[18] Peliculas movies fight detection dataset, last accessed 2018/1/5, http://academictorrents.com/details/70e0794e2292fc051a13f05ea6f5b6c16f3d3635/tech&hit=1&filelist=1.
[19] E. Bermejo, O. Deniz, G. Bueno, R. Sukthankar, Violence detection in video using computer vision techniques, in: Proceedings of Computer Analysis of Images and Patterns, 2011.
[20] CRF web dataset, last accessed 2018/1/5, http://crcv.ucf.edu/projects/Abnormal_Crowd/#WebDataset.
[21] UCSD anomaly detection dataset, http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm.
[22] Balasundaram, Chellappan, An intelligent video analytics model for abnormal event detection in online surveillance video, Journal of Real-Time Image Processing (2018), https://doi.org/10.1007/s11554-018-0840-6.
[23] M. Bertini, A. Del Bimbo, L. Seidenari, Multi-scale and real-time non-parametric approach for anomaly detection and localization, Compt. Vis. Image Und. 116 (3) (2012) 320–329.
[24] V. Reddy, C. Sanderson, B.C. Lovell, Improved anomaly detection in crowded scenes via cell-based analysis of foreground speed, size, and texture, in: CVPR Workshops, 2011, pp. 55–61.
[25] V. Saligrama, C. Zhu, Video anomaly detection based on local statistical aggregates, in: CVPR, 2012, pp. 2112–2119.
[26] N. Celandroni, E. Ferro, A. Gotta, et al., A survey of architectures and scenarios in satellite-based wireless sensor networks: system design aspects, Int. J. Satell. Commun. Netw. 31 (1) (2013) 1–38.
[27] Weia Xia, Xijuna Yan, Wei Xiaodong, Design of wireless sensor networks for monitoring at construction sites, Intelligent Automation & Soft Computing 18 (6) (2012) 635–646.
[28] M. Agarwal, P. Parashar, A. Mathur, K. Utkarsh, A. Sinha, Suspicious activity detection in surveillance applications using slow-fast convolutional neural network, in: Advances in Data Computing, Communication and Security, Springer Nature Singapore, 2022, pp. 647–658.
[29] H. Tan, et al., RelativeNAS: relative neural architecture search via slow-fast learning, IEEE Transact. Neural Networks Learn. Syst. (2021) 1–15, https://doi.org/10.1109/tnnls.2021.3096658.
[30] M.-H. Ha, O.T.-C. Chen, Deep neural networks using residual fast-slow refined highway and global atomic spatial attention for action recognition and detection, IEEE Access (2021) 164887–164902, https://doi.org/10.1109/access.2021.3134694.
[31] Z. Jie, W. Muqing, X. Weiyao, A two-pathway convolutional neural network with temporal pyramid network for action recognition, in: 2020 IEEE 6th International Conference on Computer and Communications (ICCC), Dec. 2020, https://doi.org/10.1109/iccc51575.2020.9345152.
[32] T. Zia, N. Bashir, M.A. Ullah, S. Murtaza, SoFTNet: a concept-controlled deep learning architecture for interpretable image classification, Knowl. Base Syst. (Mar. 2022) 108066, https://doi.org/10.1016/j.knosys.2021.108066.
[33] X. Zhang, Y. Tie, L. Qi, Multimodal gesture recognition based on attention slow-fast fusion networks, J. Phys. Conf. Ser. 1757 (1) (2021) 012031, https://doi.org/10.1088/1742-6596/1757/1/012031.
