Approach
1st N Sri Sai Nitya, Department of Electronics and Communication Engineering,
Visvesvaraya National Institute of Technology, Nagpur, India
[email protected]
Abstract—Video surveillance plays a pivotal role in today's world. The integration of artificial intelligence, machine learning, and deep learning has led to systems capable of differentiating between suspicious, non-suspicious, and border-case activities. This paper classifies human activities into suspicious (shoot with gun, punching, pushing, etc.), non-suspicious (cheer up, open bottle, etc.), and border-case activities (follow, step on foot, etc.) using spatiotemporal convolutional neural networks (ST-CNNs). High-level features are extracted using spatiotemporal CNNs, and final predictions are made based on pooling layer results.

Index Terms—Suspicious activity, deep learning, convolutional neural network, video surveillance

I. INTRODUCTION

In today's world, crime has escalated despite the presence of surveillance cameras everywhere. To detect suspicious behavior, a model must be developed that reduces detection time so that action can be taken quickly. In this case, the surveillance camera output is video footage. Splitting the video into images and then processing them is the easiest way to handle it [1], [2] (a short sketch of this step is given at the end of this section). Many machine learning approaches are available today to process images, but as the dataset grows larger their accuracy decreases, so we turned to deep learning algorithms.

In public infrastructures such as parking lots, jails, military bases, mosques, borders, and public transit stations, automated video monitoring can help deter harm due to overcrowding, people fighting each other, people carrying arms that could be used to inflict damage on others, people carrying bombs, robbery, vandalism, and so on [3], [4].

Video monitoring is an important part of enhancing the security of banks and ATMs [5], [6]. The presence of automatic surveillance cameras in banks will aid in the prevention of armed robberies and heists. ATMs are a common target for thieves, and automatic surveillance cameras may help to improve their security [7], [8].

Surveillance cameras may aid in the detection of disruptive conduct among students on campus, such as bullying and fighting [9], [10]. They can also assist in enhancing the campus's anti-theft protection. In the examination hall, automated security cameras are used to identify suspicious behavior by students such as stealing and copying [11].

Safety cameras are increasingly being used in small businesses, factories, and shopping centers [12], [13]. They are used to apprehend shoplifters and robbers, as well as to keep armed robberies at bay [14], [15]. Security cameras are also used to track supplies and inventory held in warehouses and to detect employee bribery and theft [16], [17].

Platforms, routes, roads, tunnels, and parking lots are all monitored by security cameras in railway and bus stations. Terrorists may use these areas as a staging ground for explosive attacks by leaving a bag containing explosives [4], [18]. Automated security cameras can detect discarded bags and warn officials, who can then remove them to protect passengers and facilities [19].

Video monitoring can be used to keep an eye on patients in hospitals and elderly people in their homes [20]. It is capable of detecting abnormal behavior in patients, such as vomiting, fainting, or any other irregular behavior [14]. As a result, given the wide range of applications, we must devise a method for detecting suspicious activities in videos [21], [22].

The remainder of the paper is organized as follows: the literature survey is discussed in Section 2, activity classification and the CNN models are presented in Section 3, the proposed framework is presented in Section 4, details of the dataset are presented in Section 5, results are discussed in Section 6, and we conclude the paper in Section 7.
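As a concrete illustration of the frame-splitting step mentioned at the start of this section, the following is a minimal sketch (not the exact pipeline used in this work) that samples evenly spaced frames from a video with OpenCV; the clip length of 30 frames and the 64 × 64 frame size match the network input described in Section 3, and the file path is a hypothetical placeholder.

```python
import cv2
import numpy as np

def video_to_clip(path, num_frames=30, size=(64, 64)):
    """Read a video, sample num_frames evenly spaced frames,
    and resize each one so the clip matches the network input."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))
    cap.release()
    if len(frames) < num_frames:
        raise ValueError("video too short for the requested clip length")
    # Evenly spaced frame indices over the whole video
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    clip = np.stack([frames[i] for i in idx]).astype(np.float32) / 255.0
    return clip  # shape: (30, 64, 64, 3), pixel values scaled to [0, 1]

# Hypothetical usage:
# clip = video_to_clip("surveillance_clip.mp4")
```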
II. LITERATURE SURVEY

A. Security Camera Research for Detecting Violent Activity

In this part, we go through some of the research that has been done in the field of detecting violent behavior with security cameras. Fighting, vandalism, punching, kicking, scratching, peeping, shooting, and other violent acts are examples. A non-tracking, real-time algorithm that detects suspicious behavior is very useful in crowded and public areas [23]. Instead of object tracking, the algorithm keeps track of low-level measurements in a series of fixed spatial locations. This algorithm has the downside of not providing sequential tracking.

Willim et al. [24] used contextual information to identify suspicious behavior. A data stream clustering algorithm, a device inference algorithm, and a context space model were the three components used. Continuous updating of information from incoming videos was possible using the data stream clustering algorithm. The inference algorithm makes a decision based on a combination of contextual information and machine awareness. The framework used two datasets: two clips from the Queensland University of Technology's Z-Block dataset and 23 clips from the CAVIAR dataset. The AUC of this method is 0.787, with an error of 0.135.

Ghazal et al. [25] showed that videos could be used to detect vandalism such as graffiti and theft. The authors used a history model and an additive Gaussian model for segmentation. A frame difference is applied between the current frame and the historical model. To find the region's main features as well as the color histogram, low-pass filtering with adaptive thresholding is used, along with contour tracing and morphological edge detection. Shape and motion features are used to monitor objects.

Gowsikhaa et al. [26] detected fraudulent practices in exam halls. The student's head pose was used to detect fraudulent activities such as theft, transferring sheets of paper between students, and conversing with other students, among other things. This was done by combining adaptive background subtraction with sequential and periodic modeling of the background. The system, however, could not manage occlusion.

Tripathi et al. [16] provided a model that detects suspicious ATM behaviors (such as forcefully taking money and customer fights), and an alarm is activated if such activity is detected. The videos' main features were extracted using Hu and MHI moments. The features are classified using an SVM classifier, and the dimension of the features is reduced using PCA. A window-size study based on MHI has also been carried out.

B. Research in Theft Detection in Surveillance Cameras

Centered on ontology, Akdemir et al. [19] proposed the identification of human behavior in banks and other places. The authors used design consistency, ontology consistency, minimal coding bias, extensibility, and minimal ontology binding as criteria. The model was put to the test on six videos, four of which depicted robbery and two of which depicted normal behavior. Many of the videos include footage from inside the bank. Color-based motion and appearance are used to keep track of the object in motion. The presented model reliably detects robbery using a single-threaded ontology, but its key flaw is that it is unable to detect robberies in which more than one person is involved. The fuzzy k-means algorithm, based on a ratio histogram, was used by Chuang et al. [20] to recognize suspicious behavior. Using a Gaussian mixture model (GMM), the suspicious activity was correctly identified. The entity is detected in this model using a commonly used ratio histogram. The fuzzy color histogram was used to solve the problem of color similarity. By tracking the transferring state, abnormal behaviors are discovered.

C. Security Camera Research for Detecting Abandoned Objects

Abandoned object detection can be difficult, particularly in densely populated areas where the object may be partially or fully obscured from the cameras' view. Many researchers have focused on detecting abandoned objects using surveillance cameras in order to protect people and public facilities from possible explosives left in a bag. Sacchi and Regazzoni [27] proposed a model that uses security camera footage to detect an object left behind at a train station. If a left-behind object is detected, an alarm is activated in the nearby station and the proper authorities are notified, allowing the danger to be avoided. The model uses multiple access with direct-sequence code sharing to create a noise-tolerant device and ensure a secure connection between remote units and stations. This model is designed to work with monochrome cameras. By using color images, the model could be enhanced to handle false alarms or objects identified by accident, but this comes with a major disadvantage: it increases the computational time of the system, and hence it cannot be used as a real-time system. Ellingsen [28] proposed a model that uses mean pixel intensity and pixel standard deviation to detect fallen artifacts. A foreground image is formed by subtracting a frame from a background image containing multiple objects. This approach is used to find moving objects. The features extracted to locate the object dropped by an individual are region, minor axis, major axis, center of mass, and so on, because these contain more than enough information about it. It is essential to operate with a learning mechanism and an automated feature-vector classifier.

In this paper, we used many videos from real-world surveillance cameras, as well as some videos from the CAVIAR dataset, to train and test our system. Human behaviors are divided into three categories: common, suspicious, and unusual. Sitting, walking, jogging, and hand waving are all common practices. Running, boxing, war, and other such actions are examples of suspicious activities. Convolutional neural networks are used to accomplish this grouping. To begin, high-level features from images are extracted using a convolutional neural network. In doing so, the convolutional network classification is taken into account, the final pooling layer result is extracted, and the final prediction is made.
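As a rough sketch of this two-stage use of the network (an illustration, not the authors' exact code), the snippet below loads a trained Keras model, reads out the final pooling-layer output as a high-level feature vector, and takes the argmax of the softmax output as the prediction. The file name and the pooling-layer name are hypothetical placeholders.

```python
import numpy as np
from tensorflow.keras.models import Model, load_model

# Hypothetical file and layer names; substitute those of the trained network.
model = load_model("st_cnn.h5")
feature_extractor = Model(inputs=model.input,
                          outputs=model.get_layer("pool3").output)

clip = np.random.rand(1, 30, 64, 64, 3).astype("float32")  # one preprocessed clip
features = feature_extractor.predict(clip)   # final pooling-layer features
class_probs = model.predict(clip)            # softmax scores of the full network
predicted_class = int(np.argmax(class_probs, axis=-1)[0])
print(features.shape, predicted_class)
```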
III. PRELIMINARIES

A. Activity Classification

Activities are classified into:
• Suspicious activities: shoot with gun, punching.
• Non-suspicious activities: cheer up, open bottle, tear up paper, etc.
• Border-case activities: follow, carry object, step on foot, etc.
These activities were captured from various angles to ensure a diverse dataset for training and testing.

B. Spatiotemporal Convolutional Neural Network (CNN) Architecture

Spatiotemporal Convolutional Neural Networks (ST-CNNs) extend traditional CNNs by incorporating the temporal dimension, making them well-suited for video-based tasks. They use 3D convolutional kernels to simultaneously process spatial features (height and width of the frame) and temporal features (across multiple frames). This allows ST-CNNs to capture motion patterns and dynamic behaviors, making them highly effective for activity detection in video surveillance.

Fig. 1. Spatiotemporal CNN.

IV. SPATIOTEMPORAL CNN MODEL FOR SUSPICIOUS ACTIVITY DETECTION

Various CNN models have been proposed depending on the target outputs. The spatiotemporal CNN is specifically designed for video-based activity recognition tasks, capturing both spatial and temporal features using 3D convolutional layers. The architecture consists of nine layers, including three 3D convolutional layers, three 3D max-pooling layers, a flattening layer, and two fully connected layers, followed by a softmax classifier. A graphical representation of the spatiotemporal CNN architecture is presented in Fig. 1.

Layer Descriptions
• Layer 1: A 3D convolutional layer with a kernel size of 3 × 3 × 3, stride 1 × 1 × 1, and 32 filters.
  – Input dimensions: 30 × 64 × 64 × 3
  – Output dimensions: 30 × 64 × 64 × 32
  – Parameters: 3 × 3 × 3 × 3 × 32 + 32 = 2,624
  – Activation: ReLU
• Layer 2: A 3D max-pooling layer with kernel size 2 × 2 × 2 and stride 2 × 2 × 2.
  – Input dimensions: 30 × 64 × 64 × 32
  – Output dimensions: 15 × 32 × 32 × 32
• Layer 3: A 3D convolutional layer with a kernel size of 3 × 3 × 3, stride 1 × 1 × 1, and 64 filters.
  – Input dimensions: 15 × 32 × 32 × 32
  – Output dimensions: 15 × 32 × 32 × 64
  – Parameters: 3 × 3 × 3 × 32 × 64 + 64 = 55,360
  – Activation: ReLU
• Layer 4: A 3D max-pooling layer with kernel size 2 × 2 × 2 and stride 2 × 2 × 2.
  – Input dimensions: 15 × 32 × 32 × 64
  – Output dimensions: 7 × 16 × 16 × 64
• Layer 5: A 3D convolutional layer with a kernel size of 3 × 3 × 3, stride 1 × 1 × 1, and 128 filters.
  – Input dimensions: 7 × 16 × 16 × 64
  – Output dimensions: 7 × 16 × 16 × 128
  – Parameters: 3 × 3 × 3 × 64 × 128 + 128 = 221,312
  – Activation: ReLU
• Layer 6: A 3D max-pooling layer with kernel size 2 × 2 × 2 and stride 2 × 2 × 2.
  – Input dimensions: 7 × 16 × 16 × 128
  – Output dimensions: 3 × 8 × 8 × 128
• Layer 7: Flattening the output of the previous layer to a 1D vector.
  – Input dimensions: 3 × 8 × 8 × 128
  – Output dimensions: 24,576
• Layer 8: A fully connected layer with 512 units.
  – Input dimensions: 24,576
  – Output dimensions: 512
  – Parameters: 24,576 × 512 + 512 = 12,583,424
  – Activation: ReLU
• Layer 9: A fully connected layer with six units (one per class).
  – Input dimensions: 512
  – Output dimensions: 6
  – Parameters: 512 × 6 + 6 = 3,078
  – Activation: Softmax
This model effectively captures both spatial features (image-level information) and temporal features (motion patterns across frames), making it well-suited for suspicious activity detection.
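For concreteness, the listing below is a minimal Keras sketch of the nine-layer architecture described above; the framework choice, the "same" padding on the convolutions, and the optimizer and loss settings are assumptions for illustration rather than details taken from the original implementation.

```python
from tensorflow.keras import layers, models

def build_st_cnn(clip_shape=(30, 64, 64, 3), num_classes=6):
    """Three Conv3D + MaxPooling3D blocks, a flattening layer,
    and two fully connected layers ending in a softmax classifier."""
    model = models.Sequential([
        layers.Input(shape=clip_shape),
        # Layers 1-2: 32 filters, then 2x2x2 max pooling
        layers.Conv3D(32, (3, 3, 3), strides=(1, 1, 1), padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=(2, 2, 2), strides=(2, 2, 2)),
        # Layers 3-4: 64 filters, then 2x2x2 max pooling
        layers.Conv3D(64, (3, 3, 3), strides=(1, 1, 1), padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=(2, 2, 2), strides=(2, 2, 2)),
        # Layers 5-6: 128 filters, then 2x2x2 max pooling
        layers.Conv3D(128, (3, 3, 3), strides=(1, 1, 1), padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=(2, 2, 2), strides=(2, 2, 2)),
        # Layers 7-9: flatten, dense 512, softmax over the activity classes
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Illustrative training configuration (not specified in the text).
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_st_cnn()
# model.summary()  # prints per-layer output shapes and parameter counts
```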
Researchers accessing this dataset must register and adhere to its academic usage policies. Its comprehensive nature and well-structured annotations have cemented its role as a cornerstone in 3D action recognition research.
TABLE I
ACCURACY FOR DIFFERENT ACTIVITY RECOGNITION SCENARIOS

Scenario                                                               Accuracy
All 7 activities combined                                              78.4%
Knife and Gun as one combined activity; other 5 activities separately  90.1%
Only recognizing Knife and Gun                                         47%
Recognizing Knife (no Gun) and other 5 activities                      92%
Recognizing Gun (no Knife) and other 5 activities                      97%
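The grouped scenarios in Table I amount to relabelling the activities before scoring. As a hedged illustration (the label names and arrays below are hypothetical, not taken from our test set), the sketch merges Knife and Gun into a single class and recomputes accuracy with scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth and predicted activity labels for a few test clips.
y_true = np.array(["Gun", "Knife", "Follow", "Punch", "Gun", "Push"])
y_pred = np.array(["Knife", "Knife", "Follow", "Punch", "Gun", "Push"])

def merge_knife_gun(labels):
    """Map both Knife and Gun to a single 'Weapon' class; other labels unchanged."""
    return np.where(np.isin(labels, ["Knife", "Gun"]), "Weapon", labels)

separate_acc = accuracy_score(y_true, y_pred)
merged_acc = accuracy_score(merge_knife_gun(y_true), merge_knife_gun(y_pred))
print(f"all classes separate: {separate_acc:.3f}, Knife+Gun merged: {merged_acc:.3f}")
```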