
Exploring Anomaly Detection Techniques for Crime Detection


Ashwin Singh1, Aakanksha Singh1†, Ayush Bajaj1†, Sarang Deb Saha2, Abhishek Sharma3†

1 Communication and Computer Engineering, The LNMIIT, Jamdoli, Jaipur, 302031, Rajasthan, India.
2 Communication Science Engineering, The LNMIIT, Jamdoli, Jaipur, 302031, Rajasthan, India.
3 Department of Electronics and Communication Engineering, The LNMIIT, Jamdoli, Jaipur, 302031, Rajasthan, India.

Contributing authors: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
† These authors contributed equally to this work.

Abstract
Crime anomaly detection is critical for proactive law enforcement and public
safety measures. This paper emphasizes the identification and detection of
anomalous events harbouring criminal intent using deep learning techniques,
one example being Convolutional Neural Networks (CNNs). Leveraging current
research on neural networks, the study explores multiple approaches using
pre-trained neural network architectures, including VGG19, DenseNet121,
ResNet50, and MobileNetV2, to categorize criminal behaviour into multiple
classes such as Abuse, Arrest, Arson, Assault, Burglary, Explosion, Fighting,
Road Accident, Robbery, Shooting, Shoplifting, Stealing, Vandalism and Normal
Events.

The research systematically analyzes the performance of each model using
various metrics to gauge the models' ability to discern anomalies effectively
in the UCF Crime Dataset. It was observed that the DenseNet121 model achieved
the highest accuracy, at 82.91%. The proposed methodology provides a foundation
for future research in refining crime prediction systems, contributing to
advancements in law enforcement technologies.

Keywords: Neural Networks, Crime, Deep Learning, Anomaly Detection, Transfer
learning, CNN

1 Introduction
CCTV surveillance has been around for almost 70 years and has been a common
choice among law enforcement agencies and the general population to monitor
abnormal activities for public or personal safety. The global prominence of CCTV
systems is evident in the exponential growth of the market, with projections
soaring from a substantial $35.47 billion in 2022 to a staggering $105.20 billion
by 2029, reflecting a robust CAGR of 16.8% during the forecast period, 2022-2029
[1]. However, over the years, even with the widespread deployment of CCTV
cameras, the increasing population and rapid urbanization have led to an alarming
surge in criminal activities.
Given the increasing abundance of data collected from surveillance feeds and the
climbing crime rates, it becomes increasingly overwhelming and expensive to rely
on human monitoring of surveillance cameras, since humans are limited by
constraints such as fatigue and other unaccounted errors [2][3].
Hence, in the face of these evolving trends, the need arises for more
sophisticated and intelligent systems for automated prediction and detection of
crimes and for monitoring of video surveillance. The use of Artificial
Intelligence (AI) in video surveillance has become a hot topic for research in
recent years. The amount of research on the use of neural networks for real-time
crime detection has also seen significant growth over the past few years with the
developments in machine learning practices in the 21st century. Machine learning
has become a popular choice among researchers owing to how well ML techniques
scale to large amounts of data. Neural networks (NNs) are a form of machine
learning technique that attempts to model and learn based on the functionality of
the human brain, and they are considered the most powerful clustering technology
available for unstructured data, which includes grid-like data such as images and
video [4][5][6].
Therefore, this paper aims to propose, develop and compare deep-learning networks
to identify crimes using surveillance footage. The different models are compared
using metrics covering accuracy and computational cost. The problem at hand can
be divided into two parts: Step 1 - detection of a crime; Step 2 - identifying
and classifying the type of crime taking place. To accomplish this, the study
focuses on applying transfer learning to well-studied, up-to-date CNN models -
DenseNet121, VGG19, ResNet50 and MobileNetV2 - and comparing their performance on
test images. The models are re-trained on a large dataset containing millions of
labelled images featuring occurrences of crimes from various CCTV footage.
Convolutional Neural Networks (CNNs), or ConvNets, are among the most widely used
deep learning techniques for object detection, recognition and image
classification problems. Over the years, owing to the state-of-the-art accuracy
they achieve on such tasks, CNNs have provided an effective class of models for
better understanding the content present in an image, resulting in better image
recognition, segmentation, detection, and retrieval [7].
The architecture of a CNN contains the following layers:
1. Input Layer: The starting point of the model; accepts images to be passed
further down the network.
2. Convolution Layer: The building block of CNNs; responsible for extracting
features by applying filters through the convolution operation.
3. ReLU layer: Applies the ReLU activation function to the output of the
convolution layer, converting negative values to 0 for faster training.
4. Pooling layer: Performs dimensionality reduction on the feature maps, thereby
reducing the computational load.
5. Fully Connected layer: Connects the information extracted in the previous
steps to the output layer and eventually classifies the input into the desired
label.
6. Softmax: Placed just before the output layer; gives the probabilities of each
class.
The first four layers listed above are called the feature extraction layers and
the remaining two are called the classification layers. A minimal code sketch of
this layer ordering is shown after Fig. 1.

Fig. 1: Basic CNN architecture
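To make the layer roles concrete, the following is a minimal, illustrative Keras
sketch of such an architecture. The filter counts are arbitrary choices for
illustration, and the 64x64 input and 14-class softmax output are assumptions
chosen to mirror the classification task described later; this is not one of the
models used in this study.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # 1. Input layer: 64x64 RGB images (the image size used later in this paper)
    # 2-3. Convolution layer with ReLU activation
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),             # 4. Pooling: dimensionality reduction
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),    # 5. Fully connected layer
    layers.Dense(14, activation="softmax"),  # 6. Softmax over the 14 classes
])
model.summary()
```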

However, traditional CNNs' effectiveness is notably contingent on large-scale
datasets and substantial computational resources. Training a CNN demands an
extensive amount of labelled data and computational power, making it a
resource-intensive and time-consuming process. The paradigm of transfer learning
emerges to address this problem. Transfer learning is an ML technique whereby a
model trained and developed for one task is re-used for similar tasks with
minimal modifications to the output or some of the hidden layers [8]. Transfer
learning offers a significant benefit by alleviating the need for an extensive
dataset when training a deep CNN: by fine-tuning a portion of the parameters of a
model pre-trained in the source domain using limited labelled data from the
target domain, transfer learning can yield better performance on the target
dataset [3]. This fine-tuning is commonly done by freezing the feature extraction
layers, using the pre-trained weights as is (as seen in Figure 1), and modifying
the classification layers to fit the target needs. However, there are instances
where practitioners also modify the feature extraction layers to better align
with the specific requirements of the target application. In our study, we
utilize the former approach.

Fig. 2: Basic intuition behind transfer learning
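As a hedged sketch of the former approach (frozen feature extractor, new
classifier), assuming TensorFlow Keras, a 64x64 input and a 14-class output;
the exact heads used in this study are detailed in Section 4.2.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet121

# Load the feature extraction layers with their pre-trained ImageNet weights
# and freeze them, so only the new classification layers will be trained.
base = DenseNet121(weights="imagenet", include_top=False, input_shape=(64, 64, 3))
base.trainable = False

model = models.Sequential([
    base,                                    # frozen feature extraction layers
    layers.GlobalAveragePooling2D(),
    layers.Dense(14, activation="softmax"),  # new task-specific classifier
])
```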

2 Literature Review
According to Sami Ansari et al.'s 2015 report [9], criminal cases in India have
exhibited a contrasting pattern compared to the global crime trend. Traditional
damage control strategies typically rely on the presence of law enforcement
officials to carefully review Closed Circuit Television (CCTV) recordings. The
utilization of CCTV for surveillance in public areas has proven to be a valuable
instrument in both crime resolution and crime prevention [10]. Often, the
presence of a visible CCTV system deters criminals from carrying out their
illegal activities, or leads to their subsequent capture. Nevertheless, the task
of personally monitoring each video sample to detect suspicious activities
becomes increasingly tedious, intricate, and time-consuming. It requires labour
and round-the-clock, constant attention.
Breakthroughs in deep learning methods in recent years have assisted the
automation of tasks such as anomaly detection. Anomaly detection has a long
history in statistics and artificial intelligence and is a well-studied problem.
The integration of anomaly detection with Convolutional Neural Networks has
witnessed a considerable surge in recent years. Kowshik et al.'s study on
real-time crime detection proposes YOLOv5 as an effective object detection
technique, employing just a single convolutional neural network. In that
publication, YOLOv5 was compared with its predecessors using a proprietary
real-time facial recognition dataset. This lays the framework for the arrival of
deep learning approaches in tackling age-old problems such as anomaly
identification in sensitive surveillance scenarios [11].

Vipin Shukla et al.'s 2015 research on automatic alerts of security threats
offered an approach based on background subtraction coupled with human outline
detection using edge estimator algorithms. The result is then used to examine the
human position in succeeding frames, thereby identifying activities as suspicious
or benign. Although their research proposes recognizing whether an abnormal
behaviour is happening, it does not address how to characterize the nature of
this behaviour [12].

Nandhini T J et al.'s 2023 study explored the problem of criminal objects not
being apparent in places with deficient lighting. Automatic night-time monitoring
sensors are vital to identify crime objects since most of them can be missed if
evaluated by the naked eye. The authors examine the accuracy of recognizing 7
object classes, namely knife, smartphone, car, animals, gun, blood, and currency,
by deploying an object detection model on IR (infrared) photographs. The authors
provide a CNN architecture and train the model on 147 photos, with the accuracy
of recognizing knives being the greatest at 99.8% [13].

The following table summarizes the studies that were reviewed to create a
comprehensive view of the existing literature on real-time crime analysis using
deep learning techniques. By building upon this existing body of literature, this
research paper aims to compile and compare the various methodologies used for
real-time crime detection.

Table 1: Review of existing related studies

3 Methodology
This section outlines the general flow of the development of the models, from
choosing the appropriate dataset to testing the models, all of which is discussed
in detail here.

Fig. 3: Processing steps for crime detection

3.1 Dataset
The data used for training the NN models is a modified and sized-down version of
the open-source UCF-Crime Dataset, obtained from Kaggle [14]. The UCF-Crime
Dataset consists of long untrimmed surveillance videos which cover 13 real-world
anomalies, including Abuse, Arrest, Arson, Assault, Road Accident, Burglary,
Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism.
These anomalies were selected because they have a significant impact on public
safety [14]. The Kaggle dataset contains images (64×64 px) extracted from every
video in the UCF Crime Dataset: every 10th frame is extracted from each
full-length video and the frames are combined per class. This is done so that the
size of the larger UCF dataset can be reduced while preserving the spatial and
temporal information between the images.
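The frame-sampling step can be illustrated with a short OpenCV sketch. This is
only an assumption about how such a reduced image set could be produced (the
function name, output layout and resizing step are illustrative), not the exact
script used to build the Kaggle dataset.

```python
import cv2

def sample_frames(video_path, out_dir, step=10, size=(64, 64)):
    """Keep every `step`-th frame of a video, resized to `size`."""
    cap = cv2.VideoCapture(video_path)
    idx, kept = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video
            break
        if idx % step == 0:             # every 10th frame by default
            frame = cv2.resize(frame, size)
            cv2.imwrite(f"{out_dir}/frame_{kept:06d}.jpg", frame)
            kept += 1
        idx += 1
    cap.release()
    return kept
```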
Fig. 4 presents samples of crimes from the different categories present in the
UCF crime dataset. Fig. 4 (a) shows a man abusing a stray animal, Fig. 4 (b)
shows several policemen attempting to arrest someone after a car crash, Fig. 4
(c) depicts a man pouring gasoline outside the victim's house, Fig. 4 (d) is a
snapshot of an assault in progress with two men trying to hit the victim from
behind, Fig. 4 (e) depicts an ongoing burglary, Fig. 4 (f) shows a large-scale
explosion occurring, Fig. 4 (g) is an instance of someone harassing a couple
followed by a physical altercation, Fig. 4 (h) shows an incident of a road
accident with a vehicle flipped over a person, Fig. 4 (i) shows an instance of
someone robbing the victim at gunpoint, Fig. 4 (j) is footage of a person lying
unconscious after getting shot, Fig. 4 (k) is an instance of shoplifting, Fig. 4
(l) shows two people stealing car parts, Fig. 4 (m) shows someone attempting to
flee the scene after breaking a glass pane and Fig. 4 (n) shows an instance of a
normal occurrence.

3.2 Data Preprocessing
After acquiring the dataset, the data was partitioned into test and training
sets. The next preprocessing steps use the TensorFlow Keras pipeline to resize
the image data to the standard 64×64 dimensions, apply the respective
preprocessing for each of the four models, and generate new training data by
applying data augmentation techniques to the training data using the
'ImageDataGenerator' from the Keras library. Data augmentation improves the
accuracy of the model and helps reduce overfitting by improving the
generalization ability of the model. Generalizability refers to the performance
difference of a model when evaluated on previously seen data (training data)
versus data it has never seen before (testing data); models with poor
generalizability have overfitted the training data [15]. The techniques used here
include horizontal flipping, random width shifts (up to 10%), and random height
shifts (up to 5%). The image pixels were subsequently normalized to the range
[0,1] by dividing each pixel value by 255 to reduce the computational complexity
and speed up network training.
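A minimal sketch of this preprocessing pipeline, assuming the dataset is laid out
with one folder per class; the directory names 'Train/' and 'Test/' are
placeholders, while the augmentation ranges follow the values stated above.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation and normalization as described above: horizontal flips,
# width shifts up to 10%, height shifts up to 5%, and pixel scaling to [0, 1].
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    horizontal_flip=True,
    width_shift_range=0.10,
    height_shift_range=0.05,
)
test_datagen = ImageDataGenerator(rescale=1.0 / 255)  # no augmentation for testing

train_gen = train_datagen.flow_from_directory(
    "Train/", target_size=(64, 64), batch_size=64, class_mode="categorical")
test_gen = test_datagen.flow_from_directory(
    "Test/", target_size=(64, 64), batch_size=64, class_mode="categorical")
```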

Fig. 4: Samples from the different crimes in the UCF crime dataset: (a) Abuse,
(b) Arrest, (c) Arson, (d) Assault, (e) Burglary, (f) Explosion, (g) Fighting,
(h) Road Accident, (i) Robbery, (j) Shooting, (k) Shoplifting, (l) Stealing,
(m) Vandalism, (n) Normal

3.3 Selected deep learning models for crime detection
Transfer learning has been used for training the following models:

3.3.1 DenseNet121
One of the key problems with traditional CNNs is that as the number of layers
increases, i.e. as they get "deeper", the gradient of the loss function starts to
diminish, a phenomenon known as the "vanishing gradient problem". DenseNets
resolve this problem by modifying the standard CNN architecture and simplifying
the connectivity pattern between layers. In a DenseNet architecture, each layer
is connected directly with every other layer [16]. The number of direct
connections for a network with L layers is therefore given by

    Connections = L(L + 1) / 2    (1)

This allows for feature reuse and requires fewer parameters than a traditional
CNN, and it helps reduce overfitting [16]. DenseNet121 is a variant of DenseNet
that contains 121 layers and has been trained on large datasets such as CIFAR-100
and ImageNet. In terms of architecture, each dense block consists of a varying
number of layers featuring two convolutions each: a 1×1 kernel as the bottleneck
layer and a 3×3 kernel to perform the convolution operation, followed by a
transition layer containing a 1×1 convolutional layer and a 2×2 average pooling
layer with a stride of 2 [16]. DenseNet121 has previously been studied for crime
prediction, achieving an AUC score of 82.91% [17].

Fig. 5: DenseNet architecture with two dense blocks

3.3.2 VGG19
VGG19, an acronym for "Visual Geometry Group 19", is one of the most frequently
used image recognition architectures in the present day. The "19" denotes 19
weight layers (16 convolution layers and 3 fully connected layers), accompanied
by 5 max-pooling layers and a softmax layer. VGGNet takes an input image of size
224×224 RGB; the first two layers are convolution layers with a kernel size of
3×3 and stride 1, and these layers use 64 filters each, resulting in a volume of
224×224×64 with same padding. The small size of the convolution filters allows
VGG to have a larger number of weight layers, leading to improved accuracy. This
is followed by a max-pooling layer of size 2×2 with stride 2, reducing the volume
from 224×224×64 to 112×112×64, and so on. The VGG convolution layers are followed
by a ReLU unit, a piecewise linear function that outputs the input if it is
positive and zero otherwise. VGGNet has three fully connected layers, the first
two having 4096 channels each and the third having 1000 channels. VGGNet achieved
92.7% top-5 test accuracy on ImageNet, a dataset consisting of 14 million images
belonging to more than 1000 classes.

3.3.3 ResNet50
Residual Networks are another class of neural networks that address the problems
of vanishing gradients and high training error by introducing residual learning.
In residual learning, instead of trying to learn features directly, the network
tries to learn a residual. The residual can be understood simply as the
difference between the features learned and the input of a layer, which the
network achieves by introducing shortcut or residual connections that allow the
input to bypass one or two layers [18]. The skip connections perform identity
mapping, and their outputs are added to the outputs of the stacked layers.
ResNet-50 is a 50-layer deep convolutional neural network, trained on more than a
million images from the ImageNet database, with an input image size of 224×224.
The architecture is similar to VGGNet, consisting mostly of 3×3 filters; the
shortcut connections described above are inserted into this plain network to form
the residual network. ResNet achieved a top-5 accuracy of 92%.

Fig. 6: Residual Connection Network Block Diagram
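The shortcut connection can be written as a small Keras block. This is a minimal
sketch of the idea, not the exact ResNet-50 bottleneck implementation; it assumes
the input tensor already has `filters` channels so that the addition is valid.

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two stacked 3x3 convolutions bypassed by an identity skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.Add()([shortcut, y])   # identity mapping added to stacked output
    return layers.Activation("relu")(y)
```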

3.3.4 MobileNetV2
As can be inferred from its name, MobileNetV2 is a CNN architecture designed to
perform well on mobile devices. Its predecessor, MobileNetV1, focused on reducing
the computational cost and model size of the network by utilizing depthwise
separable convolution. The basic idea is to replace a full convolution layer with
two separate layers: the first, called a depthwise convolution, performs
lightweight filtering by applying a single convolutional filter per input
channel; the second is a 1×1 convolution, called a pointwise convolution, which
is responsible for building new features. MobileNetV2 also employs "inverted
residuals", building upon the intuition that the bottleneck layer, despite its
lower dimensionality, contains all the necessary information. Using this insight,
MobileNetV2 establishes shortcut connections between the bottleneck layers,
thereby being more memory efficient than its counterparts [19].

Fig. 7: MobileNetV2 Block Diagram
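The depthwise separable convolution described above can be sketched in Keras as a
per-channel depthwise filter followed by a 1×1 pointwise convolution. This is an
illustrative block only, not MobileNetV2's exact inverted-residual layer.

```python
from tensorflow.keras import layers

def depthwise_separable_block(x, out_channels, stride=1):
    """Depthwise (per-channel) filtering followed by a 1x1 pointwise convolution."""
    x = layers.DepthwiseConv2D((3, 3), strides=stride, padding="same")(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(out_channels, (1, 1), padding="same")(x)  # pointwise: builds new features
    return layers.ReLU()(x)
```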

4 Analysis and Report

4.1 Evaluation Metrics
The evaluation metrics used to compare the effectiveness of the models are
discussed here. The metrics include precision, recall, the F1-score and the
ROC-AUC score. Precision is given by Equation 2:
    Precision = TP / (TP + FP)    (2)

True Positives (TP) is the total number of occurrences where the crime/anomaly
was correctly detected, whereas False Positives (FP) is the number of occurrences
where a crime was falsely detected. A low precision means that the model predicts
some false positives and labels some normal occurrences as crimes. This type of
error is unwanted but still allows a human analyser to review and correct the
false alarm. The other useful metric is recall, given by Equation 3:

    Recall = TP / (TP + FN)    (3)

False Negatives (FN) is the number of occurrences where the model was unable to
detect criminal activity. This type of error can be life-threatening since it
would lead to a late response by law enforcement authorities. The F1-score acts
as a binding metric that unifies precision and recall, giving a single score with
which to judge the model's accuracy against a baseline. The F1-score is given by
Equation 4:

    F1 = 2 × Precision × Recall / (Precision + Recall)    (4)

The other powerful metric used is the area under the ROC curve (AUC), which has
gained much popularity for multiclass classification problems. The ROC curve
illustrates the diagnostic ability of a binary classifier system as its
discrimination threshold is varied. It is created by plotting the True Positive
Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
The AUC score offers a quick summary of the ROC curve, shedding light on how well
a classifier can distinguish between different classes: the greater the AUC
score, the better the model.
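These metrics can be computed with scikit-learn. The sketch below is a hedged
illustration assuming one-hot ground-truth labels and per-class predicted
probabilities; the averaging choices are assumptions, not values stated in the
paper.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def evaluate(y_true_onehot, y_prob):
    """Precision, recall, F1 and one-vs-rest ROC-AUC for a multiclass classifier."""
    y_true = np.argmax(y_true_onehot, axis=1)   # integer class labels
    y_pred = np.argmax(y_prob, axis=1)          # predicted class per sample
    return {
        "precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        # One-vs-rest AUC computed over the per-class probability scores.
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
    }
```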

4.2 Results and discussion
The models were implemented using the TensorFlow Keras module. The selection of
hyperparameters was driven by the dataset's relatively large size, aiming to
strike an optimal balance between model convergence, stability, and computational
efficiency. The models were trained for 1 epoch (to avoid prolonged training
times) with a batch size of 64, a common choice for deep learning models [20].
The models were compiled using the SGD (for DenseNet121) and Adam (for VGG19,
ResNet50 and MobileNetV2) optimizers with a learning rate of 0.00003 and the
'categorical crossentropy' loss function. Each model's feature extraction layers
were frozen and initialized with the pre-trained weights, and the fully connected
layer at the top of the network was excluded and replaced to suit our
classification needs. After extracting the features, Global Average Pooling (GAP)
was applied to each model to reduce the feature map. Following GAP, dense layers
with ReLU activation were added to the models (three fully connected dense layers
with 256, 1024, and 512 units in the case of DenseNet121, and a single dense
layer with 512 units for VGG19, ResNet50 and MobileNetV2), each dense layer being
followed by a dropout layer to mitigate overfitting. Finally, a dense output
layer with a softmax activation function was added. This approach of freezing the
feature extraction part of a pre-trained model and modifying the classification
part is efficient, as it saves significant computational resources and time, and
effective, as it can improve model performance. A sketch of this setup is given
below.
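The following is a hedged sketch of this setup for one of the backbones (VGG19 is
used here as an example), with the optimizer, learning rate, loss, epoch count
and batch size taken from the values stated above; the dropout rate of 0.5 is an
assumption, since the text does not specify it.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG19

# Frozen backbone, GAP, a 512-unit dense layer with dropout, softmax classifier.
base = VGG19(weights="imagenet", include_top=False, input_shape=(64, 64, 3))
base.trainable = False  # freeze the feature extraction layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),                     # assumed dropout rate
    layers.Dense(14, activation="softmax"),  # 13 anomaly classes + Normal
])
model.compile(
    optimizer=optimizers.Adam(learning_rate=3e-5),   # SGD was used for DenseNet121
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_gen, epochs=1, validation_data=test_gen)  # 1 epoch, batch size 64 via the generators
```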

Table 2: Metrics Table

Model         Precision   F1-Score   AUC Score   Train Time (s)
DenseNet121   0.5601      0.6407     0.8361      5167
VGG19         0.5659      0.5680     0.5464      4296
ResNet50      0.5659      0.5660     0.6082      4421
MobileNetV2   0.5659      0.5670     0.6525      4589

Table 3: AUC scores for different classes across the models

Class           DenseNet121   VGG19   ResNet50   MobileNetV2
Abuse           0.67          0.23    0.87       0.61
Arrest          0.49          0.53    0.46       0.51
Arson           0.82          0.77    0.66       0.82
Assault         0.65          0.58    0.69       0.46
Burglary        0.79          0.58    0.71       0.72
Explosion       0.77          0.61    0.69       0.73
Fighting        0.39          0.55    0.38       0.47
Normal          0.75          0.65    0.76       0.76
Road Accident   0.66          0.62    0.76       0.80
Robbery         0.58          0.39    0.65       0.60
Shooting        0.65          0.55    0.66       0.68
Shoplifting     0.53          0.56    0.82       0.76
Stealing        0.58          0.55    0.57       0.71
Vandalism       0.57          0.48    0.56       0.51

Fig. 8: Comparison of ROC curves for different crime classes: (a) DenseNet121,
(b) VGG19, (c) ResNet50, (d) MobileNetV2

An AUC score of 1 indicates a perfect classifier, while a score of 0.5 implies
the model has not learnt anything and is instead making random guesses. The
closer the curve is to the top-left corner, the higher the true positive rate
(sensitivity) for a given false positive rate (1 - specificity), which indicates
a better-performing model; conversely, a point along the diagonal line from the
bottom left to the top right indicates that the true positive rate equals the
false positive rate, representing a classifier that performs no better than
random chance.

Fig. 9: Combined ROC curves for the 'Normal' class
Judging by the metrics in Table 2 and the ROC curves in Fig. 8, we can safely
conclude that DenseNet121 performs considerably better than the other models,
followed by MobileNetV2. However, when training times are taken into account,
MobileNetV2 exhibits a 12% reduction in training time compared to DenseNet121,
achieving comparable but slightly lower accuracy. This trade-off needs to be
taken into account when picking a model for a specific use case. Another key
observation is that different classes of anomalies perform better or worse on
different models. 'Arson' performs better than all the other classes across all
models except ResNet50, with the top performers being DenseNet121 and MobileNetV2
at an AUC score of 0.82, whereas the classes 'Fighting' and 'Arrest' turned out
to be the worst performers, with AUC scores close to or below 0.5 across the
models. The findings of [21] point out that models struggle with instances of
explosion and shooting because smoke is a common element in both.
While analysing the performance of the models, it is also useful to compare how
well each model distinguishes clips of normal incidents from any anomaly. To
gauge this, we investigate the ROC curves of the different models for the
'Normal' class specifically. This measure is all the more important given that
the 'Normal' class makes up 75.25% of the total data present in the training set;
since this is a large portion of the dataset, the ability to distinguish between
normal events and the rest of the anomaly classes serves as a cardinal metric.
The results of this analysis are exhibited in Figure 9. The evaluation results
indicate that VGG19 and DenseNet121 outperform the other architectures, achieving
AUC scores of 0.76 and 0.77, respectively. This suggests their efficacy in
effectively discriminating between normal and anomalous occurrences.

5 Conclusion
The main motive of this study was to utilize and compare different frameworks for
real-time anomaly detection on surveillance camera footage snippets from the UCF
crime dataset. We employed four models, namely DenseNet121, VGG19, ResNet50, and
MobileNetV2, and observed that, training time aside, DenseNet121 achieves a much
better performance on the metrics used to gauge all the models. The principal aim
of these findings is to contribute to the existing body of literature on the
application of deep-learning techniques to real-life situations.

The general structure of our framework follows the principles of Convolutional
Neural Networks, thereby proposing a model that works upon weakly-labelled
training videos. Since DenseNet121 has the highest AUC score, it is deemed the
best model at correctly identifying positive and negative classes. Inferring from
the ROC curves, we can conclude that DenseNet121 performs the best, followed by
MobileNetV2. ResNet50 and VGG19 do not clearly outperform each other; rather,
each outshines the others on selected anomalies, for instance ResNet50 on the
'Abuse' category.

Building on the aforementioned point, different models perform better or worse on
different classes of anomalies. 'Arson' performs well across all models, with an
AUC score of 0.82 in DenseNet121 and MobileNetV2, whereas 'Fighting' and 'Arrest'
have below-average AUC scores across all models.

6 Scope For Future Work
The research done in this paper is limited to the UCF crime dataset. As noted in
[21], the UCF crime dataset should focus more on the crime scene frames rather
than having weakly labelled anomaly and no-anomaly clippings; this creates
limitations for the accurate training of our models. UCF-Crime's testing set
comprises 92.4% normal frames and 7.6% abnormal ones [22], which reiterates the
need for evaluation metrics better suited to unbalanced datasets. In addition,
the UCF dataset only accounts for 13 anomaly classes and does not train the model
on how to detect normality patterns in the clippings.
Additionally, due to resource constraints, the models could only be trained for a
single epoch, which may have resulted in erratic accuracy across the different
classes and models; this can be mitigated by training for multiple epochs to
reduce the errors. The next stepping stone to further our study would be to
refine our dataset. This can be achieved by integrating our visual data with
sensory data such as audio and infrared to improve evaluation accuracy. Issues
pertaining to object detection in dimly lit areas are another avenue to be
considered. Furthermore, considering the significant variations in performance
across different classes on specific models, there are openings for further
research in crafting more tailored models. Drawing inspiration from the discussed
architectures, there is room to explore the incorporation of additional elements
such as LSTMs to capture and leverage the temporal information present in CCTV
footage. This approach aims to enhance the adaptability of models to diverse
scenarios and improve overall predictive accuracy.

References
[1] Business Insights, F.: CCTV camera market size, growth: Global report
[2022-2029] (2023). https://www.fortunebusinessinsights.com/cctv-camera-market-107115

[2] Malik, A.A.: Urbanization and crime: A relational analysis (2016).
https://api.semanticscholar.org/CorpusID:22424407

[3] Ansari, S., Verma, A., Dadkhah, K.: Crime rates in India. International
Criminal Justice Review 25 (2015). https://doi.org/10.1177/1057567715596047

[4] Mandalapu, V., Elluri, L., Vyas, P., Roy, N.: Crime prediction using machine
learning and deep learning: A systematic review and future directions. IEEE
Access (2023)

[5] Mena, J.: Machine learning forensics for law enforcement, security, and
intelligence (2011). https://api.semanticscholar.org/CorpusID:113740871

[6] Nguyen, M.T., Truong, L.H., Tran, T.T., Chien, C.-F.: Artificial intelligence
based data processing algorithm for video surveillance to empower industry 3.5.
Computers & Industrial Engineering 148, 106671 (2020)

[7] Sharma, N., Jain, V., Mishra, A.: An analysis of convolutional neural
networks for image classification. Procedia Computer Science 132, 377-384 (2018).
https://doi.org/10.1016/j.procs.2018.05.198. International Conference on
Computational Intelligence and Data Science

[8] Hussain, M., Bird, J., Faria, D.: A study on CNN transfer learning for image
classification (2018)

[9] Ansari, S., Verma, A., Dadkhah, K.M.: Crime rates in India: A trend analysis.
International Criminal Justice Review 25(4), 318-336 (2015).
https://doi.org/10.1177/1057567715596047

[10] Rohit Malpan, M.C.: Impact of CCTV surveillance on crime (2021)

[11] Kowshik, D.Y.R.D. Shoeb: Real time crime detection using deep learning (2023)

[12] Shukla, V., Singh, G., Shah, P.: Automatic alert of security threat through
video surveillance system (2013)

[13] J, N.T., Thinakaran, K.: Detection of crime scene objects using deep
learning techniques. In: 2023 International Conference on Intelligent Data
Communication Technologies and Internet of Things (IDCIoT), pp. 357-361 (2023).
https://doi.org/10.1109/IDCIoT56793.2023.10053440

[14] Real-world Anomaly Detection in Surveillance Videos.
https://www.crcv.ucf.edu/projects/real-world/

[15] Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for
deep learning. J. Big Data 6(1) (2019)

[16] Huang, G., Liu, Z., Maaten, L., Weinberger, K.: Densely connected
convolutional networks (2017). https://doi.org/10.1109/CVPR.2017.243

[17] Hasija, S., Peddaputha, A., Hemanth, M.B., Sharma, S.: Video anomaly
classification using DenseNet feature extractor. In: Tiwari, R., Pavone, M.F.,
Ravindranathan Nair, R. (eds.) Proceedings of International Conference on
Computational Intelligence, pp. 347-357. Springer, Singapore (2023)

[18] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image
Recognition (2015)

[19] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C.: MobileNetV2:
Inverted Residuals and Linear Bottlenecks (2019)

[20] Bengio, Y.: Practical recommendations for gradient-based training of deep
architectures (2012)

[21] Dua, A., Kalra, B., Bhatia, A., Madan, M., Dhull, A., Gigras, Y.: Crime
alert through smart surveillance using deep learning techniques. In: Proceedings
of the 4th International Conference on Information Management & Machine
Intelligence, pp. 1-8 (2022)

[22] Caetano, F., Carvalho, P., Cardoso, J.S.: Unveiling the performance of video
anomaly detection models - a benchmark-based review. Intelligent Systems with
Applications 18, 200236 (2023). https://doi.org/10.1016/j.iswa.2023.200236
