Weapon Detection in Surveillance Videos Using Deep Learning
Abstract. Object detection uses computer vision techniques to identify and locate
objects in an image or video. This capability can help to improve security, as it
can be deployed to detect dangerous weapons using object detection methods.
Driven by the success of deep learning methods, this study aims to develop and
evaluate the use of deep neural networks for weapon detection in surveillance
videos. YOLOv3 with Darknet-53 as the feature extractor is used to detect
two types of weapons, namely pistols and knives. The YOLOv3 Darknet-53 model is
further improved by optimizing the network backbone. This is achieved by adding
a fourth prediction layer and customizing the anchor boxes in order to detect
smaller objects. The proposed model is evaluated on the Sohas weapon detection
dataset. The performance of the model is evaluated in terms of precision, recall,
mean average precision (mAP) and detection speed in frames per second (FPS).
1 Introduction
Dangerous weapons are used in criminal activities and terrorism. Therefore, implementing
weapon detection in surveillance cameras (CCTV) can improve security, as it helps to
automatically detect weapons in the video feed. Current implementations of weapon
detection using deep neural networks provide sufficiently high accuracy but suffer from
poor detection speed and are thus unable to work in real-time applications. This trade-off
between accuracy and speed is the main obstacle to deploying object detection in the real
world, especially for security purposes. Implementing weapon detection is an important
requirement in surveillance video systems. However, weapon detection systems have
difficulty detecting objects in low-resolution surveillance footage. Furthermore,
monitoring CCTV around the clock requires a lot of manpower, and due to human visual
error, some security incidents may be missed.
The aim of this work is to develop and evaluate a deep neural network model for a
weapon detection system that achieves high accuracy and fast inference time. In addition,
it aims to improve the model's ability to detect small weapon objects. The system is
developed on the TensorFlow framework and uses the YOLOv3 Darknet-53 model.
2 Literature Review
Detecting dangerous weapons in surveillance video is difficult. According to [2], the
attention of security personnel monitoring CCTV deteriorates after 20 minutes.
The study in [3] shows that after 12 minutes of continuous video monitoring, a security
guard is likely to miss up to 45% of screen activity, and after 22 minutes, up to 95% of
activity is missed. By applying deep learning, the monitoring of the surveillance video
feed can be automated to detect important events.
Among the proposed object detection solutions, the YOLO architecture is one of the
most widely used methods for real-time weapon detection [1, 4, 5]. YOLO architectures
give a balanced performance, with good accuracy and faster inference time compared to
other CNN models [1]. In addition, [6] uses Faster R-CNN with different feature
extractors, such as Inception-ResNetV2, ResNet50, VGG16 and MobileNetV2, and compares
their performance with the YOLOv2 model. Only Faster R-CNN with Inception-ResNetV2
produces a better mAP than YOLOv2. Faster R-CNN uses a Region Proposal Network (RPN)
to make the prediction more accurate; however, the downside of this architecture is
that it increases the inference time.
To improve weapon detection performance, relevant confusion objects are added to the
training set [1]. Weapons such as pistols and revolvers are small and are likely to be
confused with other small objects. Figure 4 shows samples of small hand-held objects
that are likely to be confused with weapons. The YOLO model cannot achieve good results
on small objects, as it has been pre-trained with high-quality training images [1, 7].
In [8], the YOLOv3 algorithm has shown good performance in detecting both small and
large objects thanks to its network with three prediction scales. The performance of
YOLOv3 can be further improved by choosing a suitable number of candidate anchor boxes
and their aspect ratio dimensions for each scale.

186 M. E. E. Quyyum and M. H. L. Abdullah

Table 1. Comparison of different pre-processing methods and their performance [9]

Fig. 5. The AATpI illustration, where the white box represents a true positive [9]
In the paper titled “Brightness guided pre-processing for automatic cold steel weapon
detection in surveillance videos with deep learning” [9], a method called DaCoLT
(Darkening and Contrast at Learning and Test stages) is proposed to improve the
robustness of the weapon detection model to variations in brightness. Two
data-augmentation stages are used: the method applies training with brightness and
visual quality adjustment. Table 1 compares different pre-processing methods and their
resulting performance.
In order to evaluate the weapon detection signals flagged by the model, the proposed
AATpI (Alarm Activation time per Interval) [10] is used. AATpI is a metric that indicates
how long an automated detection alarm system takes to identify at least k consecutive
true-positive frames. Figure 5 shows the alarm detection system diagram. The crime
alert can then be generated automatically and sent to the authorities via email or
Short Message Service (SMS).
In summary, the YOLOv3 model can be used to improve object detection, as it achieves
good accuracy with fast detection speed. Compared with CNN architectures such as
Faster R-CNN and SSD, however, the complexity of its network increases the
training time.
3 Approach
3.1 Design Requirements
YOLOv3 is an object detection algorithm based on a convolutional neural network that
uses the concept of bounding box regression [16]. Instead of processing regions of
interest (ROI) to detect the target object, it directly predicts bounding boxes and
their classes for the image with a single-stage architecture. The YOLOv3 components
consist of the feature extractor, the prediction layers and grid cells with anchor
boxes. Given an image, the feature extractor computes the salient features, which are
passed to the prediction layers to produce the predicted bounding boxes.
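As a rough sketch (not the authors' implementation), the shape of YOLOv3's three prediction tensors can be derived from the network's strides; the input size of 416 × 416 and the two-class setup below are illustrative assumptions consistent with the standard YOLOv3 configuration:

```python
def yolo_output_shapes(input_size=416, num_classes=2, anchors_per_scale=3,
                       strides=(32, 16, 8)):
    """Return the (grid, grid, depth) shape of each YOLOv3 prediction scale.

    Each anchor predicts 4 box coordinates, 1 objectness score and
    `num_classes` class probabilities, hence depth = anchors * (5 + classes).
    """
    depth = anchors_per_scale * (5 + num_classes)
    return [(input_size // s, input_size // s, depth) for s in strides]

# Two classes (pistol, knife): each scale predicts 3 * (5 + 2) = 21 channels.
print(yolo_output_shapes())  # [(13, 13, 21), (26, 26, 21), (52, 52, 21)]
```

The coarse 13 × 13 grid handles large objects, while the 52 × 52 grid is responsible for the small objects that motivate the modifications in this work.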
The input image is divided into grid cells of equal height and width. Each anchor box
predicts the object label, the probability that an object is present in the cell within
the anchor box, and the bounding box coordinates (w, h, x, y). For instance, in Fig. 9,
the image is divided into 9 (3 × 3) grid cells. Each grid cell, with its anchor boxes,
predicts the presence of an object in the cell. The size of the grid cell influences the
model's ability to detect small objects. Therefore, smaller grid cells are used for
detecting smaller objects, as shown in Fig. 10.
Fig. 10. Comparison between large grid cells and small grid cells
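The grid-cell assignment described above can be sketched as follows: the cell containing a box's centre is responsible for predicting that object. This is a simplified illustration, not the paper's code:

```python
def assign_grid_cell(cx, cy, grid_size):
    """Map a normalised box centre (cx, cy in [0, 1)) to its grid cell.

    Returns (col, row): the cell responsible for predicting the object.
    """
    col = int(cx * grid_size)
    row = int(cy * grid_size)
    return col, row

# In a 3 x 3 grid (as in Fig. 9) a centre at (0.5, 0.2) falls in cell (1, 0);
# a finer 13 x 13 grid localises the same centre more precisely, in (6, 2).
print(assign_grid_cell(0.5, 0.2, 3))   # (1, 0)
print(assign_grid_cell(0.5, 0.2, 13))  # (6, 2)
```

The finer the grid, the smaller the region each cell covers, which is why smaller grid cells help with small weapons.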
Modifications to the YOLOv3 model have been made to improve its performance in
detecting pistols and knives. The modifications consist of using custom anchor boxes,
expanding the dataset and adding an extra prediction layer.
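Custom anchor boxes are commonly obtained by clustering the widths and heights of the training boxes; the YOLO literature uses k-means with a 1 − IoU distance. The following is a minimal sketch of that idea under those assumptions, not the authors' exact procedure, and the box values are invented for illustration:

```python
import random

def iou_wh(box, anchor):
    """IoU of two boxes given as (w, h), assumed to share the same centre."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors using 1 - IoU as the distance."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Assign each box to the anchor it overlaps most with.
            best = max(range(k), key=lambda i: iou_wh(b, anchors[i]))
            clusters[best].append(b)
        # Recompute each anchor as the mean of its cluster.
        anchors = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else anchors[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(anchors)

# Toy example: tall narrow boxes and wide flat boxes separate cleanly.
boxes = [(10, 30), (12, 28), (11, 32), (40, 20), (42, 18), (38, 22)]
print(kmeans_anchors(boxes, k=2))  # [(11.0, 30.0), (40.0, 20.0)]
```

In practice the clustering is run over the full training set with k equal to the total number of anchors across all prediction scales.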
3.2.2 Dataset
In order to evaluate the performance of the detection model, the Sohas weapon detection
dataset [12] is used. The dataset contains 4014 images with weapons. The images are
categorized by the type of handheld weapon object, namely pistol and knife. A total of
3250 images is used for training and 764 images for testing the trained model. The
number of images in the two classes is approximately balanced: 1425 images from the
pistol category and 1825 images from the knife category are used for training, while
the test set consists of 374 pistol images and 390 knife images.
The Sohas dataset lacks the blurry image samples typically found in surveillance video
frames. Therefore, the training and test sets are expanded with blurry images from
surveillance videos to obtain a larger and more diverse dataset. This helps the model
generalize better to images obtained from surveillance videos.
The additional images are obtained by extracting frames from YouTube videos. Five
surveillance videos recorded by Closed-Circuit Television (CCTV) cameras are downloaded
from YouTube. One frame is extracted every ten seconds of video, and only the frames
that contain weapons are used. This dataset, named Dataset 1, contains a total of
479 images, of which 339 are allocated for model training and 140 for testing.
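The one-frame-every-ten-seconds sampling can be expressed as a small helper. This is a sketch of the sampling arithmetic only; in practice a video library such as OpenCV would be used to seek to and decode the selected frames:

```python
def sample_frame_indices(fps, duration_s, interval_s=10):
    """Indices of the frames to extract: one frame every `interval_s` seconds."""
    step = int(round(fps * interval_s))
    total_frames = int(fps * duration_s)
    return list(range(0, total_frames, step))

# A 60-second clip at 25 FPS yields 6 candidate frames (t = 0, 10, ..., 50 s);
# only those frames that actually contain a weapon are kept for the dataset.
print(sample_frame_indices(fps=25, duration_s=60))  # [0, 250, 500, 750, 1000, 1250]
```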
4 Discussion of Findings
The YOLOv3 model is trained and evaluated with the Sohas dataset. In the second
experiment trial, the surveillance video images from YouTube (Dataset 1) are added
to the Sohas dataset for model training and evaluation. The object detection
performance in Table 2 shows improved results when the training images are augmented
with additional surveillance video images.
The next experiment evaluates the performance of the customized YOLOv3 model with one
additional prediction layer (Improved YOLOv3 Darknet-53). A comparison of the
performance of the original YOLOv3 model and the improved model is shown in Table 3.
It is observed that the mean average precision (mAP) improves from 88.97% to 90.20%,
with a slight cost in detection speed (10.25 vs. 12.32 frames per second).
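For reference, mAP is the mean of the per-class average precisions, each computed as the area under the class's PR curve. A minimal all-point-interpolation sketch is shown below; the recall/precision values are invented for illustration and are not the paper's data:

```python
def average_precision(recalls, precisions):
    """Area under the PR curve with all-point interpolation.

    `recalls` must be sorted ascending; precision at each recall level is
    replaced by the maximum precision at any recall >= r before integrating.
    """
    interp = list(precisions)
    # Envelope: make precision monotonically non-increasing left to right.
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, interp):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# Hypothetical per-class curves; mAP is simply their mean.
ap_knife = average_precision([0.5, 1.0], [1.0, 0.5])    # 0.75
ap_pistol = average_precision([0.5, 1.0], [0.8, 0.4])   # 0.60
print((ap_knife + ap_pistol) / 2)  # 0.675
```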
In order to measure the weapon detection performance of the proposed model, the
precision-recall curve is used. The IoU threshold for positive class identification is
set to 0.5. The precision P and recall R are calculated using formulas (1) and (2).

P = TP / (TP + FP)    (1)
Table 3. Comparison of performance of YOLOv3 and Improved YOLOv3 using the Sohas
dataset
R = TP / (TP + FN)    (2)
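Formulas (1) and (2) translate directly into code; the counts below are hypothetical, purely for illustration:

```python
def precision_recall(tp, fp, fn):
    """Precision P = TP / (TP + FP); recall R = TP / (TP + FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed weapons.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"P = {p:.2f}, R = {r:.2f}")  # P = 0.90, R = 0.75
```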
The precision-recall (PR) curves for knife and pistol detection are plotted from the
precision and recall obtained at varying confidence thresholds of the model predictions.
Figure 13 shows the PR curves for the detection of knife and pistol, as well as for the
combined knife and pistol classes. The trained model performs better on knife detection
than on pistol detection. This is because the appearance of a pistol is more similar to
that of commonly handheld objects.
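The IoU behind the 0.5 threshold mentioned above compares a predicted box with its ground-truth box. A standard implementation, with boxes given as (x1, y1, x2, y2) corner coordinates, looks like this:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A prediction shifted by 5 px against a 10x10 ground-truth box:
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # 0.333 -> below 0.5, not a TP
```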
Fig. 14. Object detection for Pistol in different background using Improved YOLOv3 Darknet-53
Fig. 15. Object detection for Knife with different brightness using Improved YOLOv3 Darknet-53
References
1. M. T. Bhatti, M. G. Khan, M. Aslam, and M. J. Fiaz, “Weapon Detection in Real-Time CCTV
Videos Using Deep Learning,” IEEE Access, vol. 9, pp. 34366–34382, 2021, doi: https://doi.
org/10.1109/ACCESS.2021.3059170.
2. M. M. Fernandez-Carrobles, O. Deniz, and F. Maroto, Gun and Knife Detection Based on
Faster R-CNN for Video Surveillance, vol. 11868 LNCS. Springer International Publishing,
2019.
19. Reynoso, R. (2021, May 25). A Complete History of Artificial Intelligence. Retrieved from
g2.com: https://www.g2.com/articles/history-of-artificial-intelligence
20. Rosebrock, A. (2016, July 1). Intersection over Union (IoU) for object detection. Retrieved
from pyimagesearch.com:
21. West, D. M., & Allen, J. R. (2018, April 24). How artificial intelligence is transforming the
world. Retrieved from brookings.edu: https://www.brookings.edu/research/how-artificial-int
elligence-is-transforming-the-world/#:~:text=Summary,transforming%20every%20walk%
20of%20life.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-
NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/),
which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.