
Multilevel Activity Detection Using Deep Learning Approach

1st N Sri Sai Nitya, Department of Electronics and Communication Engineering, Visvesvaraya National Institute of Technology, Nagpur, India, [email protected]

2nd Gowtham Kumar, Department of Electronics and Communication Engineering, Visvesvaraya National Institute of Technology, Nagpur, India, [email protected]

3rd Vishal Satpute, Department of Electronics and Communication Engineering, Visvesvaraya National Institute of Technology, Nagpur, India, [email protected]

Abstract—Video surveillance plays a pivotal role in today's world. The integration of artificial intelligence, machine learning, and deep learning has led to systems capable of differentiating between suspicious, non-suspicious, and border-case activities. This paper classifies human activities into suspicious (shoot with gun, punching, pushing, etc.), non-suspicious (cheer up, open bottle, etc.), and border-case (follow, step on foot, etc.) activities using spatiotemporal convolutional neural networks (ST-CNNs). High-level features are extracted using spatiotemporal CNNs, and final predictions are made based on pooling layer results.

Index Terms—Suspicious activity, deep learning, convolutional neural network, video surveillance

I. INTRODUCTION

In today's world, crime has escalated despite the presence of surveillance cameras everywhere. To detect suspicious behavior, a model must be developed that reduces the time taken to detect it so that action can be taken quickly. In this case, the surveillance camera output is a video stream. Splitting the video into frames and then processing them is the simplest way to handle it [1], [2]. Many machine learning approaches are available today for processing images, but as the dataset grows larger their accuracy decreases, so we turned to deep learning algorithms.

In public infrastructure such as parking lots, jails, military bases, mosques, borders, and public transit stations, automated video monitoring can help deter harm due to overcrowding, people fighting each other, people carrying arms that could be used to injure others, people carrying bombs, robbery, vandalism, and so on [3], [4].

Video monitoring is an important part of enhancing the security of banks and ATMs [5], [6]. The presence of automatic surveillance cameras in banks will aid in the prevention of armed robberies and heists. ATMs are a common target for thieves, and automatic surveillance cameras may help to improve their security [7], [8].

Surveillance cameras may aid in the detection of disruptive conduct among students on campus, such as bullying and fighting [9], [10]. They can also assist in enhancing the campus's anti-theft protection. In examination halls, automated security cameras are used to identify suspicious behavior by students, such as stealing and copying [11].

Safety cameras are increasingly being used in small businesses, factories, and shopping centers [12], [13]. They are used to apprehend shoplifters and robbers, as well as to keep armed robberies at bay [14], [15]. Security cameras are also used to track supplies and inventory held in warehouses and to detect employee bribery and theft [16], [17].

Platforms, routes, roads, tunnels, and parking lots are all monitored by security cameras in railway and bus stations. Terrorists may use these areas as a staging ground for explosive attacks by leaving behind a bag containing explosives [4], [18]. Automated security cameras can detect discarded bags and warn officials, who can then remove them to protect passengers and facilities [19].

Video monitoring can also be used to keep an eye on patients in hospitals and elderly people in their homes [20]. It is capable of detecting abnormal behavior in patients, such as vomiting, fainting, or any other irregular behavior [14]. Given this wide range of applications, we must devise a method for detecting suspicious activities in videos [21], [22]. The remainder of the paper is organized as follows: the literature survey is discussed in Section II, activity classification and the proposed ST-CNN model are presented in Section III, the dataset is described in Section IV, results are discussed in Section V, and the paper is concluded in Section VI.
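As noted above, the video is first split into frames before any further processing. One way to do this is sketched below: a minimal Python illustration using OpenCV and NumPy (not the authors' released code); the 30-frame clip length and the 64 × 64 frame size mirror the input dimensions used in Section III, and the file name in the usage comment is hypothetical.

```python
import cv2
import numpy as np

def video_to_clips(path, clip_len=30, size=(64, 64)):
    """Split a surveillance video into clips of shape (clip_len, H, W, 3).

    Frames are resized to `size`, converted to RGB, scaled to [0, 1], and
    grouped into non-overlapping clips; a trailing partial clip is dropped.
    """
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video
            break
        frame = cv2.resize(frame, size)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(frame.astype(np.float32) / 255.0)
    cap.release()

    n_clips = len(frames) // clip_len
    if n_clips == 0:
        return np.empty((0, clip_len, size[1], size[0], 3), dtype=np.float32)
    clips = np.stack(frames[: n_clips * clip_len])
    return clips.reshape(n_clips, clip_len, size[1], size[0], 3)

# Example (hypothetical file name):
# clips = video_to_clips("camera_feed.mp4")   # shape: (num_clips, 30, 64, 64, 3)
```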
II. LITERATURE SURVEY

A. Security Camera Research for Detecting Violent Activity

In this part, we go through some of the research that has been done in the field of detecting violent behavior with security cameras. Fighting, vandalism, punching, kicking, scratching, peeping, shooting, and other violent acts are examples. A non-tracking, real-time algorithm that detects suspicious behavior is very useful in crowded and public areas [23]. Instead of tracking objects, the algorithm keeps track of low-level measurements at a series of fixed spatial locations. This algorithm has the downside of not providing sequential tracking.

Wiliem et al. [24] used contextual information to identify suspicious behavior. A data stream clustering algorithm, an inference algorithm, and a context space model were the three components used. Continuous updating of information from incoming videos was possible using the data stream clustering algorithm. The inference algorithm makes a decision based on a combination of contextual information and machine awareness. The framework used two datasets: two clips from the Queensland University of Technology's Z-Block dataset and 23 clips from the CAVIAR dataset. The AUC of this method is 0.787, with an error of 0.135.

Ghazal et al. [25] showed that videos could be used to detect vandalism such as graffiti and theft. The authors used a history model and an additive Gaussian model for segmentation. A frame difference is computed between the current frame and the history model. To find the main features of a region as well as its color histogram, low-pass filtering with adaptive thresholding is used, along with contour tracing and morphological edge detection. Shape and motion features are used to monitor objects.

Gowsikhaa et al. [26] detected fraudulent practices in exam halls. They used the student's head pose to detect fraudulent activities such as theft, transferring sheets of paper between students, and conversing with other students, among other things. They did so by combining adaptive background subtraction with sequential and periodic modeling of the background. Their system, however, could not handle occlusion.

Tripathi et al. [16] provided a model that detects suspicious ATM behaviors (such as forcefully taking money and customer fights) and activates an alarm if such activity is detected. The videos' main features were extracted using Hu and MHI moments. The features are classified using an SVM classifier, and their dimensionality is reduced using PCA. A window-size study based on MHI was also carried out.

B. Research in Theft Detection in Surveillance Cameras

Centered on ontology, Akdemir et al. [19] proposed the identification of human behavior in banks and other places. The authors used design consistency, ontology consistency, minimal coding bias, extensibility, and minimal ontology binding as criteria. The model was put to the test on six videos, four of which depicted robbery and two of which depicted normal behavior. Many of the videos include footage from inside the bank. Color-based motion and appearance are used to keep track of the object in motion. The presented model reliably detects robbery using a single-threaded ontology, but its key flaw is that it cannot detect robberies in which more than one person is involved.

A fuzzy k-means algorithm based on a histogram ratio was used by Chuang et al. [20] to recognize suspicious behavior. Using a Gaussian mixture model (GMM), the suspicious activity was correctly identified. The entity is detected in this model using a commonly used ratio histogram, and a fuzzy color histogram is used to solve the problem of color similarity. By tracking state transitions, abnormal behaviors are discovered.

C. Security Camera Research for Detecting Abandoned Objects

Abandoned object detection can be difficult, particularly in densely populated areas where the object may be partially or fully obscured from the cameras' view. Many researchers have focused on detecting abandoned objects with surveillance cameras in order to protect people and public facilities from possible explosives left in a bag. Sacchi and Regazzoni [27] proposed a model that uses security camera footage to detect an object left behind at a train station. If a left-behind object is detected by the model, an alarm is activated in the nearby station and the proper authorities are notified, allowing the danger to be averted. This model uses direct-sequence code-division multiple access to create a noise-tolerant device and ensure a secure connection between remote units and stations. The model is designed to work with monochrome cameras; it could be enhanced by using color images to reduce false alarms and accidentally identified objects. However, this comes with a major disadvantage: it increases the computational time of the system, and hence it cannot be used as a real-time system.

Ellingsen [28] proposed a model that uses mean pixel intensity and pixel standard deviation to detect dropped objects. A foreground image is formed by subtracting a frame from a background image containing multiple objects. This approach is used to find objects that are moving. The features extracted to locate the object dropped by an individual are the region, minor axis, major axis, and center of mass, among others, because together they carry more than enough information about it. It is essential to build on a learning mechanism and an automated feature-vector classifier.

In this paper, we use many videos from real-world surveillance cameras, as well as some videos from the CAVIAR dataset, to train and test our system. Human behaviors are divided into three categories: common, suspicious, and unusual. Sitting, walking, jogging, and hand waving are all common activities; running, boxing, fighting, and other such activities are examples of suspicious behavior. Convolutional neural networks are used to accomplish this grouping. To begin, high-level features are extracted from the images using a convolutional neural network. In doing so, the convolutional network classification is taken into account: the final pooling layer result is extracted, and the final prediction is made.
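The feature-extraction step described here, taking the output of the network's final pooling layer as a high-level representation, can be sketched as follows. This is a minimal Keras illustration, not the authors' released code; the `model` variable and the layer name "pool3" refer to the ST-CNN defined in Section III and are naming assumptions for this example.

```python
import tensorflow as tf

# Assumes `model` is the trained ST-CNN from Section III, with its last
# 3D max-pooling layer named "pool3" (a naming assumption for this sketch).
def pooled_feature_extractor(model, layer_name="pool3"):
    """Return a sub-model that outputs the final pooling-layer activations."""
    return tf.keras.Model(inputs=model.input, outputs=model.get_layer(layer_name).output)

# Usage (hypothetical):
# features = pooled_feature_extractor(model).predict(clips)   # shape: (N, 3, 8, 8, 128)
# class_probs = model.predict(clips)                          # final softmax predictions
```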
III. PRELIMINARIES

A. Activity Classification

Activities are classified into:
• Suspicious activities: shoot with gun, punching.
• Non-suspicious activities: cheer up, open bottle, tear up paper, etc.
• Border-case activities: follow, carry object, step on foot, etc.
These activities were captured from various angles to ensure a diverse dataset for training and testing.

B. SpatioTemporal Convolutional Neural Network (CNN) Architecture

Spatiotemporal Convolutional Neural Networks (ST-CNNs) extend traditional CNNs by incorporating the temporal dimension, making them well-suited for video-based tasks. They use 3D convolutional kernels to simultaneously process spatial features (the height and width of each frame) and temporal features (across multiple frames). This allows ST-CNNs to capture motion patterns and dynamic behaviors, making them highly effective for activity detection in video surveillance.

Fig. 1. Spatiotemporal CNN.

SpatioTemporal CNN Model for Suspicious Activity Detection

Various CNN models have been proposed depending on the target outputs. The SpatioTemporal CNN is specifically designed for video-based activity recognition tasks, capturing both spatial and temporal features using 3D convolutional layers. The architecture consists of nine layers: three 3D convolutional layers, three 3D max-pooling layers, a flattening layer, and two fully connected layers, followed by a softmax classifier. A graphical representation of the SpatioTemporal CNN architecture is presented in Fig. 1.

Layer Descriptions
• Layer 1: A 3D convolutional layer with a kernel size of 3 × 3 × 3, stride 1 × 1 × 1, and 32 filters.
  – Input dimensions: 30 × 64 × 64 × 3
  – Output dimensions: 30 × 64 × 64 × 32
  – Parameters: 3 × 3 × 3 × 3 × 32 + 32 = 2,624
  – Activation: ReLU
• Layer 2: A 3D max-pooling layer with kernel size 2 × 2 × 2 and stride 2 × 2 × 2.
  – Input dimensions: 30 × 64 × 64 × 32
  – Output dimensions: 15 × 32 × 32 × 32
• Layer 3: A 3D convolutional layer with a kernel size of 3 × 3 × 3, stride 1 × 1 × 1, and 64 filters.
  – Input dimensions: 15 × 32 × 32 × 32
  – Output dimensions: 15 × 32 × 32 × 64
  – Parameters: 3 × 3 × 3 × 32 × 64 + 64 = 55,360
  – Activation: ReLU
• Layer 4: A 3D max-pooling layer with kernel size 2 × 2 × 2 and stride 2 × 2 × 2.
  – Input dimensions: 15 × 32 × 32 × 64
  – Output dimensions: 7 × 16 × 16 × 64
• Layer 5: A 3D convolutional layer with a kernel size of 3 × 3 × 3, stride 1 × 1 × 1, and 128 filters.
  – Input dimensions: 7 × 16 × 16 × 64
  – Output dimensions: 7 × 16 × 16 × 128
  – Parameters: 3 × 3 × 3 × 64 × 128 + 128 = 221,312
  – Activation: ReLU
• Layer 6: A 3D max-pooling layer with kernel size 2 × 2 × 2 and stride 2 × 2 × 2.
  – Input dimensions: 7 × 16 × 16 × 128
  – Output dimensions: 3 × 8 × 8 × 128
• Layer 7: A flattening layer that reshapes the output of the previous layer into a 1D vector.
  – Input dimensions: 3 × 8 × 8 × 128
  – Output dimensions: 24,576
• Layer 8: A fully connected layer with 512 units.
  – Input dimensions: 24,576
  – Output dimensions: 512
  – Parameters: 24,576 × 512 + 512 = 12,583,424
  – Activation: ReLU
• Layer 9: A fully connected layer with six units (one per class).
  – Input dimensions: 512
  – Output dimensions: 6
  – Parameters: 512 × 6 + 6 = 3,078
  – Activation: Softmax

This model effectively captures both spatial features (image-level information) and temporal features (motion patterns across frames), making it well-suited for suspicious activity detection. The final accuracy is evaluated using the cross-entropy loss function and the Adam optimizer with learning rate adjustments.
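For concreteness, the nine-layer architecture and training configuration described above can be expressed as the following Keras sketch. This is a minimal illustration, not the authors' released code; the initial learning rate, the use of sparse categorical cross-entropy, and the ReduceLROnPlateau schedule standing in for "learning rate adjustments" are assumptions, as are the variable names in the commented-out training call.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_st_cnn(input_shape=(30, 64, 64, 3), num_classes=6):
    """Nine-layer spatiotemporal CNN: three Conv3D/MaxPool3D pairs,
    a Flatten layer, and two Dense layers ending in a softmax."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv3D(32, kernel_size=3, strides=1, padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=2, strides=2, name="pool1"),  # 30x64x64 -> 15x32x32
        layers.Conv3D(64, kernel_size=3, strides=1, padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=2, strides=2, name="pool2"),  # 15x32x32 -> 7x16x16
        layers.Conv3D(128, kernel_size=3, strides=1, padding="same", activation="relu"),
        layers.MaxPooling3D(pool_size=2, strides=2, name="pool3"),  # 7x16x16 -> 3x8x8
        layers.Flatten(),                                           # 3*8*8*128 = 24,576 features
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_st_cnn()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),        # assumed initial rate
    loss="sparse_categorical_crossentropy",                        # integer class labels
    metrics=["accuracy"],
)

# "Learning rate adjustments" modeled here with a plateau-based schedule (assumption).
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3)
# model.fit(train_clips, train_labels, validation_data=(val_clips, val_labels),
#           epochs=30, batch_size=8, callbacks=[lr_schedule])
```

A call to model.summary() can be used to check that the per-layer output shapes and parameter counts match those listed above.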
IV. DATASETS

The NTU RGB+D Action Recognition Dataset, curated by the ROSE Lab at Nanyang Technological University, is a benchmark dataset for 3D human activity recognition. It comprises over 56,000 video samples across 60 action classes, capturing diverse human activities ranging from everyday tasks to complex interactions. The dataset includes four modalities: RGB video, depth maps, infrared images, and 3D skeletal data, which encompass up to 25 joints per subject. The multi-modal nature of the dataset, coupled with its diverse action categories, makes it a versatile resource for developing and evaluating machine learning algorithms, particularly in action recognition and temporal modeling tasks.

One of the standout features of this dataset is its support for cross-subject and cross-view evaluation. Videos were recorded from varying camera angles and include participants of different demographics, ensuring a rich variety of data. This makes the dataset highly suitable for creating robust algorithms capable of generalizing to unseen environments. Additionally, the inclusion of skeletal data provides a view-invariant representation of human actions, which is particularly beneficial for 3D-based approaches.

The NTU RGB+D dataset has been pivotal in advancing research in fields such as surveillance, healthcare, and human-computer interaction. It has facilitated numerous challenges, such as the Action Recognition Challenge, encouraging the development of novel techniques in this domain. Researchers accessing this dataset must register and adhere to its academic usage policies. Its comprehensive nature and well-structured annotations have cemented its role as a cornerstone in 3D action recognition research.
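The cross-subject protocol mentioned above splits clips by performer rather than at random. A minimal sketch of such a split is shown below, assuming the standard NTU RGB+D file-naming pattern (SsssCcccPpppRrrrAaaa, where Pppp is the performer ID); the particular set of training subject IDs here is a placeholder and should be taken from the official evaluation protocol.

```python
import re
from pathlib import Path

# Placeholder: the official cross-subject training-subject IDs should be used here.
TRAIN_SUBJECTS = {1, 2, 4, 5, 8}

def cross_subject_split(video_dir):
    """Split NTU RGB+D clips into train/test sets by performer ID.

    File names are assumed to follow the SsssCcccPpppRrrrAaaa pattern,
    e.g. 'S001C002P003R001A045.avi', where Pppp identifies the performer.
    """
    pattern = re.compile(r"S\d{3}C\d{3}P(\d{3})R\d{3}A\d{3}")
    train, test = [], []
    for path in sorted(Path(video_dir).glob("*.avi")):
        match = pattern.search(path.stem)
        if match is None:
            continue                      # skip files that do not follow the pattern
        subject_id = int(match.group(1))
        (train if subject_id in TRAIN_SUBJECTS else test).append(path)
    return train, test

# Example (hypothetical directory): train_files, test_files = cross_subject_split("ntu_rgbd_videos/")
```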

V. RESULTS AND DISCUSSION

Table I summarizes the recognition accuracy obtained for different groupings of the activities.

TABLE I
ACCURACY FOR DIFFERENT ACTIVITY RECOGNITION SCENARIOS

Scenario | Accuracy
All 7 activities combined | 78.4%
Knife and Gun as one combined activity; other 5 activities separately | 90.1%
Only recognizing Knife and Gun | 47%
Recognizing Knife (no Gun) and other 5 activities | 92%
Recognizing Gun (no Knife) and other 5 activities | 97%
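The scenarios in Table I differ only in how the ground-truth and predicted labels are remapped before scoring, for example by merging the Knife and Gun classes into a single class. A small NumPy sketch of this evaluation is given below; the label names and their ordering are assumptions for illustration only.

```python
import numpy as np

# Assumed label ordering for illustration; the actual class indices may differ.
LABELS = ["Knife", "Gun", "Follow", "Punch", "Push", "Cheer", "OpenBottle"]

def accuracy(y_true, y_pred, merge=None):
    """Accuracy after optionally merging a group of classes into one label.

    `merge` is a set of class names treated as a single class,
    e.g. {"Knife", "Gun"} for the 'combined activity' scenario in Table I.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if merge:
        merged_ids = {LABELS.index(name) for name in merge}
        target = min(merged_ids)
        y_true = np.where(np.isin(y_true, list(merged_ids)), target, y_true)
        y_pred = np.where(np.isin(y_pred, list(merged_ids)), target, y_pred)
    return float(np.mean(y_true == y_pred))

# Example with model predictions as integer class indices:
# acc_all      = accuracy(y_true, y_pred)                          # all activities separately
# acc_combined = accuracy(y_true, y_pred, merge={"Knife", "Gun"})  # merged scenario
```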

VI. CONCLUSION AND FUTURE WORK


In conclusion, while the model shows significant potential
for detecting suspicious human behavior, further enhancements
are required to improve overall accuracy and efficiency. This
lays a strong foundation for future work, with scope for
refining the model to achieve more reliable and comprehensive
activity recognition in real-world scenarios.
Fig. 2. Block Diagram.
REFERENCES
[1] P. Thombare, V. Gond, and V. Satpute, "Artificial intelligence for low-level suspicious activity detection," in Applications of Advanced Computing in Systems, pp. 219–226, Springer, 2021.
[2] V. M. Kamble, M. R. Parate, and K. M. Bhurchandi, "No reference noise estimation in digital images using random conditional selection and sampling theory," The Visual Computer, vol. 35, pp. 5–21, 2019.
[3] C. Amrutha, C. Jyotsna, and J. Amudha, "Deep learning approach for suspicious activity detection from surveillance video," in 2020 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), pp. 335–339, IEEE, 2020.
[4] P. Gajbhiye, C. Naveen, and V. R. Satpute, "Virtue: Video surveillance for rail-road traffic safety at unmanned level crossings (incorporating Indian scenario)," in 2017 IEEE Region 10 Symposium (TENSYMP), pp. 1–4, IEEE, 2017.
[5] R. Nale, M. Sawarbandhe, N. Chegogoju, and V. Satpute, "Suspicious human activity detection using pose estimation and LSTM," in 2021 International Symposium of Asian Control Association on Intelligent Robotics and Industrial Automation (IRIA), pp. 197–202, IEEE, 2021.
[6] S. Kadu, N. Cheggoju, and V. R. Satpute, "Noise-resilient compressed domain video watermarking system for in-car camera security," Multimedia Systems, vol. 24, pp. 583–595, 2018.
[7] D.-G. Lee, H.-I. Suk, S.-K. Park, and S.-W. Lee, "Motion influence map for unusual human activity detection and localization in crowded scenes," IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 10, pp. 1612–1623, 2015.
[8] V. Kamble and K. Bhurchandi, "Noise estimation and quality assessment of Gaussian noise corrupted images," in IOP Conference Series: Materials Science and Engineering, vol. 331, p. 012019, IOP Publishing, 2018.
[9] S. Chaudhary, M. A. Khan, and C. Bhatnagar, "Multiple anomalous activity detection in videos," Procedia Computer Science, vol. 125, pp. 336–345, 2018.
[10] A. A. Bhadke, S. Kannaiyan, and V. Kamble, "Symmetric chaos-based image encryption technique on image bit-planes using SHA-256," in Twenty Fourth National Conference on Communications (NCC), pp. 1–6, IEEE, 2018.
[11] A. Dixit, S. Pathak, R. Raj, C. Naveen, and V. R. Satpute, "An efficient fuzzy-based edge estimation for iris localization and pupil detection in human eye for automated cataract detection system," in 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–6, IEEE, 2018.
[12] A. Jirafe, M. Jibhe, and V. Satpute, "Camera handoff for multi-camera surveillance," in Applications of Advanced Computing in Systems, pp. 267–274, Springer, 2021.
[13] C. Naveen and V. R. Satpute, "Image encryption technique using improved A5/1 cipher on image bitplanes for wireless data security," in International Conference on Microelectronics, Computing and Communications (MicroCom), pp. 1–5, IEEE, 2016.
[14] A. Gupta, V. Satpute, K. Kulat, and N. Bokde, "Real-time abandoned object detection using video surveillance," in Proceedings of the International Conference on Recent Cognizance in Wireless Communication & Image Processing, pp. 837–843, Springer, 2016.
[15] V. R. Satpute, K. D. Kulat, and A. G. Keskar, "A novel approach based on variance for local feature analysis of facial images," in IEEE Recent Advances in Intelligent Computational Systems, pp. 210–215, IEEE, 2011.
[16] V. Tripathi, D. Gangodkar, V. Latta, and A. Mittal, "Robust abnormal event recognition via motion and shape analysis at ATM installations," Journal of Electrical and Computer Engineering, vol. 2015, 2015.
[17] A. L. Alappat and V. Kamble, "Image quality assessment using selective contourlet coefficients," in 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–7, IEEE, 2020.
[18] C. Naveen, V. Satpute, and A. Keskar, "An efficient low dynamic range image compression using improved block-based EZW," in 2015 IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions (WCI), pp. 1–6, IEEE, 2015.
[19] U. Akdemir, P. Turaga, and R. Chellappa, "An ontology-based approach for activity recognition from video," in Proceedings of the 16th ACM International Conference on Multimedia, pp. 709–712, 2008.
[20] C.-H. Chuang, J.-W. Hsieh, L.-W. Tsai, P.-S. Ju, and K.-C. Fan, "Suspicious object detection using fuzzy-color histogram," in 2008 IEEE International Symposium on Circuits and Systems, pp. 3546–3549, IEEE, 2008.
[21] A. Pawade, R. Anjaria, and V. Satpute, "Suspicious activity detection for security cameras," in Applications of Advanced Computing in Systems, pp. 211–217, Springer, 2021.
[22] P. Gangal, V. Satpute, K. Kulat, and A. Keskar, "Object detection and tracking using 2D-DWT and variance method," in Students Conference on Engineering and Systems, pp. 1–6, IEEE, 2014.
[23] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz, "Robust real-time unusual event detection using multiple fixed-location monitors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 555–560, 2008.
[24] A. Wiliem, V. Madasu, W. Boles, and P. Yarlagadda, "A suspicious behavior detection using a context space model for smart surveillance systems," Computer Vision and Image Understanding, vol. 116, no. 2, pp. 194–209, 2012.
[25] M. Ghazal, C. Vázquez, and A. Amer, "Real-time automatic detection of vandalism behavior in video sequences," in 2007 IEEE International Conference on Systems, Man and Cybernetics, pp. 1056–1060, IEEE, 2007.
[26] D. Gowsikhaa, S. Abirami, et al., "Suspicious human activity detection from surveillance videos," International Journal on Internet & Distributed Computing Systems, vol. 2, no. 2, 2012.
[27] C. Sacchi and C. S. Regazzoni, "A distributed surveillance system for detection of abandoned objects in unmanned railway environments," IEEE Transactions on Vehicular Technology, vol. 49, no. 5, pp. 2013–2026, 2000.
[28] K. Ellingsen, "Salient event-detection in video surveillance scenarios," in Proceedings of the 1st ACM Workshop on Analysis and Retrieval of Events/Actions and Workflows in Video Streams, pp. 57
