
Using Deep Networks for Drone Detection

Cemal Aker, Sinan Kalkan


KOVAN Research Lab.
Computer Engineering, Middle East Technical University
Ankara, Turkey
{cemal, skalkan}@ceng.metu.edu.tr

Abstract

Drone detection is the problem of finding the smallest rectangle that encloses the drone(s) in a video sequence. In this study, we propose a solution using an end-to-end object detection model based on convolutional neural networks. To solve the scarce-data problem for training the network, we propose an algorithm for creating an extensive artificial dataset by combining background-subtracted real images. With this approach, we can achieve precision and recall values that are both high at the same time.

Figure 1: Detection samples from the created dataset, where the green rectangles show the bounding boxes of the drones.

1. Introduction

Drone, as a general definition, is the name coined for unmanned vehicles. In this paper, however, the term refers to a specific type, namely unmanned aerial vehicles (UAVs). With the rapid development of unmanned vehicles and of the technology used to construct them, the number of drones manufactured for military, commercial or recreational purposes increases sharply with each passing day. This situation poses crucial privacy and security threats when cameras or weapons are attached to the drones. Hence, detecting the position and attributes of drones, such as speed and direction, before an undesirable event has become very important.

The unpredictable computer-controlled movements, speed and maneuvering abilities of drones, and their resemblance to birds when observed from a distance, make it challenging to detect, identify and correctly localize them. To solve this problem, one can consider various types of sensors for perceiving the presence of a drone in the environment, including global positioning systems, radio waves, infrared, and audible sound or ultrasound signals. However, it has been reported that these have many limitations for this problem, and it has been suggested that computer vision techniques be used instead [6]. Although deep learning methods have been shown to be very powerful in computer vision tasks, studies on UAV detection have not yet taken advantage of them by placing deep learning at the core of the approach. To this end, this study is the first to evaluate the success of convolutional neural networks (CNNs) as a standalone approach to drone detection.

In this study we use an end-to-end object detection method based on CNNs to predict the location of the drone in video frames. To be able to train the network, we created an artificial dataset by combining real drone and bird images with different background videos. The results show that the variance and scale of the dataset make it possible to perform well on the drone detection problem. With this method, we participated in the Drone-vs-Bird Detection Challenge¹ organized within the International Workshop on Small-drone Surveillance, Detection and Counteraction Techniques, and our trained network ranked third in terms of the lowest prediction penalty described in Section 4.

¹ https://wosdetc.wordpress.com/challenge

2. Related Work

In this section, we review the related studies in two parts.

978-1-5386-2939-0/17/$31.00 ©2017 IEEE. IEEE AVSS 2017, August 2017, Lecce, Italy.
2.1. Object Detection Methods with Computer Vision

The task of object detection is to decide whether there are any predefined objects in a given image or not, and to report the locations and dimensions of the smallest rectangles that bind them if they exist. Early attempts at this task involve representations of objects using handcrafted features, whereas the state-of-the-art techniques utilize deep learning.
Detection with Handcrafted Features: The most successful approaches using handcrafted features require bag of visual words (BoVW) [16] representations of the objects with the help of local feature descriptors such as the scale-invariant feature transform (SIFT) [9], speeded-up robust features (SURF) [1], and histograms of oriented gradients (HOG) [3]. After training a discriminative machine learning model, e.g., a support vector machine (SVM) [2], with such representations, the images are generally scanned for occurrences of the learned objects with the sliding window technique. These methods have two crucial drawbacks. The first is that the features have to be crafted well for the problem domain to highlight and describe the important information in the image. The second is the computational burden of the exhaustive search done by the sliding window technique.
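For concreteness, this classical pipeline can be sketched as follows. This is a minimal illustration, not code from any cited work; the window size, stride and HOG parameters are arbitrary assumptions.

# Sketch of a classical sliding-window detector: HOG features + linear SVM.
# Illustrative only; window size and stride are arbitrary choices.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

WIN = 64      # square window side in pixels (assumption)
STRIDE = 16   # sliding-window step in pixels (assumption)

def describe(patch):
    # HOG descriptor of a grayscale patch.
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def train(pos_patches, neg_patches):
    X = [describe(p) for p in pos_patches + neg_patches]
    y = [1] * len(pos_patches) + [0] * len(neg_patches)
    clf = LinearSVC()
    clf.fit(np.array(X), np.array(y))
    return clf

def detect(image, clf):
    # Exhaustive scan; this is the computational burden noted above.
    boxes = []
    h, w = image.shape
    for top in range(0, h - WIN + 1, STRIDE):
        for left in range(0, w - WIN + 1, STRIDE):
            patch = image[top:top + WIN, left:left + WIN]
            score = clf.decision_function([describe(patch)])[0]
            if score > 0:
                boxes.append((left, top, WIN, WIN, score))
    return boxes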
Detection with Deep Networks: With the remarkable achievements of deep learning methods in image classification tasks, similar approaches have started to be used for attacking the object detection problem. These techniques can be divided into two simple categories: region proposal based and single shot methods. The approaches in the first category differ from the traditional methods by using features learned from data with CNNs, together with selective search or region proposal networks to decrease the number of candidate regions [4, 5, 13]. In the single shot approach, the aim is to compute the bounding boxes of the objects directly, instead of dealing with regions of the image. One method is to extract multi-scale features using CNNs and combine them to predict bounding boxes [7, 8]. Another, named YOLO, divides the final feature map into a 2D grid and predicts a bounding box using each grid cell [11].

2.2. UAV Detection Methods with Computer Vision

Although the problem of detecting UAVs is not a well-studied subject, there are some attempts to mention. Mejias et al. utilized morphological pre-processing and Hidden Markov Model filters to detect and track micro unmanned planes [10]. Gökçe et al. used cascaded boosted classifiers along with several local feature descriptors [6]. In addition to these purely spatial methods, spatio-temporal approaches exist. Rozantsev et al. propose a method that first creates spatio-temporal cubes using a sliding window at different scales, applies motion compensation to stabilize the cubes, and finally utilizes boosted-tree and CNN-based regressors for bounding box detection [14].

3. Method

Our solution is based on a single shot object detection model, YOLOv2 [12], the follow-up study of YOLO. We adapt and fine-tune this model to detect objects of two classes (i.e., drone and bird). Although the problem is detecting drones in the scene, we have included the bird class so that the network can learn features robust enough to distinguish the two. To achieve high accuracy and good generalization with such deep models, one needs a large scale dataset covering many scenarios of the problem. To this end, we created an artificial dataset combining real drones, real birds and real backgrounds. The following paragraphs describe the approach in YOLOv2, the dataset creation procedure, and the training and testing details.

3.1. The deep network

YOLOv2 devises an end-to-end regression solution to the object detection problem. The former layers of the fully convolutional architecture shown in Figure 2 are trained to extract high-level features. The two highest-level feature maps are then combined to obtain the final feature map of the image, which is divided into an S × S grid where the duty of each grid cell is to predict bounding boxes of the form (x, y, w, h, c). In this output, x and y are the coordinates of the box center with respect to the grid cell, w and h are the width and height in proportion to the whole image, and c is the confidence that an object is in the bounding box. The final task of a grid cell is to compute the conditional class probabilities given the probability that the corresponding bounding boxes contain objects. While predicting these bounding boxes, the model utilizes prior information computed by K-means clustering on the widths and heights of the ground truth bounding boxes. The final output size for a grid cell is

Output Size = (N_cls + N_coord + 1) × N_anc,

where N_cls is the number of classes, N_coord is the number of coordinates, N_anc is the number of anchor bounding boxes used as prior knowledge, and the 1 in the parentheses accounts for the confidence value. In our approach, the grid size is set to 15, the number of classes is two, the number of coordinates is four and the number of anchor boxes is five. Hence, the final output has shape 15 × 15 × 35.
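For illustration, this output bookkeeping can be sketched as follows. The box-decoding transforms (sigmoid offsets within a cell, anchor-scaled exponentials for the sizes) follow the usual YOLOv2 parameterization and are an assumption here, since the text does not spell them out; anchors are assumed to be given as fractions of the image size.

# Shape bookkeeping for the grid output described above.
import numpy as np

S, N_CLS, N_COORD, N_ANC = 15, 2, 4, 5      # values used in the paper
DEPTH = (N_CLS + N_COORD + 1) * N_ANC       # (2 + 4 + 1) * 5 = 35

def decode(pred, anchors):
    # pred    : raw network output of shape (S, S, DEPTH)
    # anchors : list of N_ANC (pw, ph) priors from K-means clustering,
    #           assumed normalized to fractions of the image size
    # Returns (x, y, w, h, confidence, class_scores) tuples in [0, 1]
    # image coordinates.
    sig = lambda t: 1.0 / (1.0 + np.exp(-t))
    boxes = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col].reshape(N_ANC, N_CLS + N_COORD + 1)
            for a, (pw, ph) in enumerate(anchors):
                tx, ty, tw, th, to = cell[a, :5]
                x = (col + sig(tx)) / S      # box center, cell-relative
                y = (row + sig(ty)) / S
                w = pw * np.exp(tw)          # size scaled by anchor prior
                h = ph * np.exp(th)
                conf = sig(to)               # objectness confidence
                cls = cell[a, 5:]            # per-class scores
                boxes.append((x, y, w, h, conf, cls))
    return boxes

assert DEPTH == 35  # matches the 15 x 15 x 35 output shape in the text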
Figure 2: Our adaptation of the YOLOv2 network: a fully convolutional stack of 3×3 and 1×1 convolutional layers with 2×2 max pooling (channel widths 32, 64, 256, 256, 512, 1024, 3072; spatial resolution 480 → 240 → 120 → 30 → 15), a combine step merging the two highest-level feature maps, and a reshape to the final 15 × 15 × 35 output. All layers are fine-tuned with the dataset collected in the paper.

3.2. Dataset Preparation

Having presented the model details, we can now come to the most important part of the study, which is dataset preparation. Since drone flights are limited by inadequate battery technology, weather conditions and legislative regulations, there is no publicly available large scale dataset for training deep networks. However, our approach requires an immense amount of data to learn useful features. One possible solution is to create an artificial dataset. To this end, we collected public domain pictures of drones and birds, and videos of coastal areas. After subtracting the backgrounds of the drone and bird images, the foregrounds are randomly placed on frames of the videos. The overall process is summarized in Algorithm 1, and the details of the dataset can be found in Table 1. As can easily be seen, the dataset would need a huge amount of storage if all of the configurations were used. Hence, we eliminated a portion of the configurations, dropping each with probability

p = 1 − (Max. allowed size) / (Total size for all configurations),

to reduce the size of the dataset to a reasonable amount. Samples drawn from the resulting dataset can be seen in Figure 3. These samples show that although they are created artificially, they look like real images of flying drones and birds.

Figure 3: Samples from the artificial dataset, which represent various scenarios with different backgrounds and bird inclusion. Although the dataset includes very small objects, bigger ones have been chosen here for better visibility (best viewed in color).
The last things to mention about our approach are the training and prediction procedures. After creating the artificial dataset, we applied a commonly used technique called fine-tuning, in which the network is first trained with a different, more general dataset for a similar problem. This provides better initial values for the network parameters than random initialization. Training then continues with the actual dataset for the actual problem. This technique is especially useful when the training data is scarce.

After training comes prediction on unseen data. Since the network is trained with two classes, bird detections are eliminated after all predicted bounding boxes are obtained from the last layer. Then a threshold, which can be determined according to accuracy on a validation set (our approach is explained in Section 4), is applied to the objectness confidence values. If this operation eliminates all predicted bounding boxes, the frame either does not include a drone or the drone is not clear enough to detect. Otherwise, the box with the highest confidence is selected as the prediction. We report only the best prediction, since the aforementioned challenge requires detecting the single drone present in the scene. However, the algorithm can easily be extended to multi-drone situations with more intelligent thresholding strategies.

One possible problem with this approach is encountered when the network confuses a bird with the drone: if the objectness confidence of the bird is higher than that of the drone, it is selected as the prediction. To decrease the number of such misinterpretations, we propose a limited ignorance approach, sketched in code below. After determining the bounding box in which the network is most confident, we check its intersection with the rectangle that has the same center as the predicted bounding box in the previous frame but three times its width and height, assuming that the drone cannot move more than its own height or width in a single frame. If the rectangles intersect, we accept the newly predicted box. Otherwise, we ignore the current prediction and report the previous one, provided a limit has not been exceeded yet. After exceeding the limit, we reset it and suspend the technique for the same number of frames; during this period, we report the current predictions directly. Likewise, when there is no predicted bounding box in the previous frame, we directly report the prediction in the current frame.
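A minimal sketch of this prediction-filtering procedure follows. The box representation, the helper names and the value of the limit are our assumptions; the paper does not state the limit it used.

# Sketch of the per-frame prediction filter with the "limited ignorance"
# rule. Boxes are (x, y, w, h, confidence) with (x, y) the box center.
LIMIT = 10  # assumed value; not stated in the paper

def intersects(box, prev):
    # Does `box` intersect the rectangle with the same center as `prev`
    # but three times its width and height?
    x, y, w, h, _ = box
    px, py, pw, ph, _ = prev
    return abs(x - px) <= (w + 3 * pw) / 2 and abs(y - py) <= (h + 3 * ph) / 2

class DronePredictor:
    def __init__(self, threshold):
        self.threshold = threshold
        self.prev = None       # last reported box
        self.ignored = 0       # consecutive ignored predictions
        self.suspended = 0     # frames left with the rule switched off

    def step(self, boxes):
        # boxes: drone-class candidates for the current frame
        # (bird detections already removed).
        boxes = [b for b in boxes if b[4] >= self.threshold]
        best = max(boxes, key=lambda b: b[4]) if boxes else None
        if self.suspended > 0:              # rule switched off: report as-is
            self.suspended -= 1
            self.prev = best
            return best
        if best is None or self.prev is None:
            self.prev = best                # nothing to compare: report as-is
            return best
        if intersects(best, self.prev):     # plausible motion: accept
            self.ignored = 0
            self.prev = best
            return best
        self.ignored += 1                   # implausible jump
        if self.ignored > LIMIT:            # limit exceeded: reset the counter
            self.ignored = 0                # and suspend the rule
            self.suspended = LIMIT
            self.prev = best
            return best
        return self.prev                    # ignore jump, repeat previous box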
Algorithm 1: The algorithm for preparing the dataset.
1  S ← predefined size intervals
2  D ← foregrounds of drone images
3  B ← foregrounds of bird images
4  V ← background videos
5  R ← # of rows that the image will be divided into
6  C ← # of columns that the image will be divided into
7  G ← R × C grid
8  foreach (d, g, s, v) ∈ D × G × S × V do
9      ignore this configuration with probability p = 1 − (Max. allowed size) / (Total size for all configurations), and continue
10     draw a random position p0 in g
11     draw a random size s0 for the smaller edge of the drone from s
12     draw a random frame f0 from v
13     resize d with respect to s0
14     overlay f0 with d in position p0
15     draw (p1, s1, f1) in the same way
16     draw a random bird b0 from B
17     draw (pb,0, sb,0) for the bird, where sb,0 is drawn from the smaller half of S
18     resize d with respect to s1
19     overlay f1 with d in position p1
20     resize b0 with respect to sb,0
21     overlay f1 with b0 in position pb,0
22     draw (p2, s2, f2) in the same way
23     draw a random bird b1 from B
24     draw (pb,1, sb,1) for the bird, where sb,1 is drawn from the greater half of S
25     resize d with respect to s2
26     overlay f2 with d in position p2
27     resize b1 with respect to sb,1
28     overlay f2 with b1 in position pb,1
29     save f0, f1, f2 into the dataset
30 end
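The core of one iteration of Algorithm 1, namely the probabilistic rejection of a configuration and the overlaying of a background-subtracted foreground onto a frame, can be sketched as follows. Image I/O and the grid bookkeeping are omitted, and all helper names are ours; foregrounds are assumed to be RGBA arrays whose alpha channel is zero where the background was removed, and the pasted region is assumed to fit inside the frame.

# Sketch of the key steps of Algorithm 1 (illustrative only).
import numpy as np

rng = np.random.default_rng()

def keep_configuration(max_allowed_size, total_size):
    # Step 9: drop a configuration with probability
    # p = 1 - (max allowed size) / (total size for all configurations).
    p_drop = 1.0 - max_allowed_size / total_size
    return rng.random() >= p_drop

def resize_to(fg, smaller_edge):
    # Nearest-neighbour resize so the smaller edge equals `smaller_edge`.
    h, w = fg.shape[:2]
    scale = smaller_edge / min(h, w)
    rows = (np.arange(int(h * scale)) / scale).astype(int)
    cols = (np.arange(int(w * scale)) / scale).astype(int)
    return fg[rows][:, cols]

def overlay(frame, fg, top, left):
    # Alpha-blend the RGBA foreground onto the RGB frame in place.
    h, w = fg.shape[:2]
    alpha = fg[:, :, 3:4] / 255.0
    region = frame[top:top + h, left:left + w]
    frame[top:top + h, left:left + w] = (
        alpha * fg[:, :, :3] + (1.0 - alpha) * region
    ).astype(frame.dtype)
    return frame

def random_position(cell_top, cell_left, cell_h, cell_w):
    # Steps 10-12: a random position inside grid cell g.
    return (cell_top + int(rng.integers(cell_h)),
            cell_left + int(rng.integers(cell_w)))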
Table 1: Details of the dataset.

Aspect                 Information
# drones               89
# birds                126
# background videos    11
# rows in grid         12
# columns in grid      10
# size intervals       19
size intervals         in [5, 160] (biased towards smaller values)
image resolution       850 × 480
# resulting images     676,534
4. Experiments

This section describes the training details and the experiments conducted on the artificial dataset and on the real dataset provided by the organizers of the challenge. The former are evaluated quantitatively, whereas the latter are evaluated qualitatively due to the lack of ground truth information.

Training details: To apply the fine-tuning mentioned in Section 3, we started with weights pre-trained on the ImageNet dataset [15] for the image classification problem. Then the dataset provided by the challenge organizers and the created one are each divided into training (85%) and validation (15%) parts. The training part of the former is duplicated four times before the training sets are combined, since it is too scarce compared to the artificially created, large scale one. The network is then fine-tuned for 10,000 iterations with a batch size of 128 and batch normalization after all convolutional layers.
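In code, this split-and-duplicate scheme amounts to the following sketch; the sample lists are placeholders and the shuffling seed is arbitrary.

# Sketch of the 85/15 split with 4x duplication of the scarce challenge data.
import random

def split(samples, train_frac=0.85, seed=0):
    samples = samples[:]
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_frac)
    return samples[:cut], samples[cut:]

# Placeholder sample lists; in practice these are annotated frames.
challenge_samples = [f"challenge_{i}.png" for i in range(1000)]
artificial_samples = [f"artificial_{i}.png" for i in range(676534)]

challenge_train, challenge_val = split(challenge_samples)
artificial_train, artificial_val = split(artificial_samples)

train_set = artificial_train + 4 * challenge_train  # duplicate scarce data
val_set = artificial_val + challenge_val            # combined for evaluation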
After the training phase is completed, we combined the two validation sets to evaluate the resulting network. Although we use an input size of 480 × 480 × 3 in training (see Figure 2), we increase the resolution to 800 × 800 × 3 in the testing configuration. This is possible since the network is fully convolutional, and the increase is helpful in detecting small targets.

Evaluation metrics: We use precision-recall curves to evaluate the network. The curves are constructed by varying the detection threshold. Precision is defined as tp / (tp + fp), where tp is the number of true positives and fp is the number of false positives. Recall is defined as tp / (tp + fn), where fn is the number of false negatives.
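Stated in code (a trivial but explicit restatement of the two definitions):

def precision(tp, fp):
    # Fraction of reported detections that are correct.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Fraction of ground-truth drones that are found.
    return tp / (tp + fn) if tp + fn else 0.0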
Figure 4: Precision-Recall (PR) curve showing the performance of the approach on the outdoor test videos.

Figure 5: Change of the prediction penalty with respect to the detection threshold.

We count a predicted bounding box as a true positive if the area of its overlap with the ground truth box is greater than half of the area of their union, i.e., if their intersection over union exceeds 0.5.

Another metric that we use is the prediction penalty, which is the area of the smallest rectangle that includes both the ground truth and the predicted bounding boxes, divided by the area of the ground truth bounding box.
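Both criteria can be stated compactly as follows; boxes are assumed to be axis-aligned (left, top, width, height) tuples.

# True-positive test (IoU > 0.5) and the prediction penalty.

def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def is_true_positive(pred, gt):
    # Overlap greater than half of the union area.
    return iou(pred, gt) > 0.5

def prediction_penalty(pred, gt):
    # Area of the smallest rectangle enclosing both boxes, divided by
    # the area of the ground-truth box.
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    left, top = min(px, gx), min(py, gy)
    right = max(px + pw, gx + gw)
    bottom = max(py + ph, gy + gh)
    return (right - left) * (bottom - top) / (gw * gh)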
Results: Figure 4 presents the performance of the method for detection thresholds in the range [0, 1]. The closer the Precision-Recall (PR) curve is to the top right corner, the better the performance of the method. We can see from the curve that precision and recall of approximately 0.9 can be achieved at the same time, which shows that the approach performs well in detecting the correct bounding boxes.

Figure 5 shows the change of the average penalty with respect to the detection threshold. The penalties are higher at larger thresholds because the detection rate decreases as the threshold increases, and when a drone cannot be found, the top-left pixel is reported as the prediction, which results in a huge penalty. Hence, we chose the smallest possible threshold (zero) for the quantitative evaluation on the test video of the challenge. Although this threshold hurts precision on the artificial dataset, it works well on the provided test video, except that it detects the bird as a drone when the bird is close to the camera and in specific poses that cannot easily be distinguished from a drone by the human eye. Another observation is that when the drone and the bird are too close to each other, the network supposes that the bird is a part of the drone and outputs a bounding box enclosing both of them. The predictions are provided online² as a video rendered at 15 fps.

² http://user.ceng.metu.edu.tr/~cemal/predictions.mp4

5. Conclusion

In this study, we showed that drones can be detected and distinguished from birds using an object detection model based on a CNN. The trained network generalizes well, as it can achieve high precision and recall values at the same time.

For future work we plan to exploit the time domain to improve the performance even further. Since collecting such data is not easy, we plan to devise an algorithm that generates random flight videos instead of randomly generated images.

References

[1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). CVIU, 110(3):346–359, 2008.
[2] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[4] R. Girshick. Fast R-CNN. In ICCV, 2015.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[6] F. Gökçe, G. Üçoluk, E. Şahin, and S. Kalkan. Vision-based detection and distance estimation of micro unmanned aerial vehicles. Sensors, 15(9):23805–23846, 2015.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[9] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[10] L. Mejias, S. McNamara, J. Lai, and J. Ford. Vision-based detection and tracking of aerial targets for UAV collision avoidance. In IROS, 2010.
[11] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[12] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. CoRR, abs/1612.08242, 2016.
[13] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[14] A. Rozantsev, V. Lepetit, and P. Fua. Detecting flying objects using a single moving camera. PAMI, 39:879–892, 2017.
[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[16] J. Sivic, A. Zisserman, et al. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
