1. Introduction

Drone, as a general term, is the name coined for unmanned vehicles. In this paper, however, the term will refer to a specific type, namely unmanned aerial vehicles (UAV). With the rapid development of unmanned vehicles and of the technology used to construct them, the number of drones manufactured for military, commercial or recreational purposes increases sharply with each passing day. This situation poses crucial privacy and security threats when cameras or weapons are attached to the drones. Hence, detecting the position and attributes of drones, such as speed and direction, before an undesirable event has become crucial.

The unpredictable computer-controlled movements, speed and maneuvering abilities of drones, and their resemblance to birds when observed from a distance, make it challenging to detect, identify and correctly localize them. To address this problem, one can consider various types of sensors to perceive the presence of a drone in the environment, such as global positioning systems, radio waves, infrared, and audible sound or ultrasound signals. However, it has been reported that these have many limitations for this problem, and it has been suggested that computer vision techniques be used instead [6]. Although deep learning methods have been shown to be very powerful in computer vision tasks, studies on UAV detection have not yet taken advantage of this by placing deep learning methods at the core of the approach. To this end, this study is the first to evaluate the success of convolutional neural networks (CNN) as a standalone approach to drone detection.

In this study, we used an end-to-end object detection method based on CNNs to predict the location of the drone in video frames. In order to train the network, we created an artificial dataset by combining real drone and bird images with different background videos. The results show that the variance and the scale of the dataset make it possible to perform well on the drone detection problem. With this method, we participated in the Drone-vs-Bird Detection Challenge¹ organized within the International Workshop on Small-drone Surveillance, Detection and Counteraction Techniques, and our trained network ranked third in terms of the lowest prediction penalty described in Section 4.

¹ https://wosdetc.wordpress.com/challenge

2. Related Work

In this section, we review the related studies in two parts.
2.1. Object Detection Methods with Computer Vision

The task of object detection is to decide whether there are any predefined objects in a given image and, if there are, to report the locations and dimensions of the smallest rectangles that bind them. Early attempts at this task involved representations of objects built from handcrafted features, whereas the state-of-the-art techniques utilize deep learning.
Detection with Handcrafted Features: The most successful approaches using handcrafted features require bag of visual words (BoVW) [16] representations of the objects, built with the help of local feature descriptors such as the scale-invariant feature transform (SIFT) [9], speeded-up robust features (SURF) [1], and histograms of oriented gradients (HOG) [3]. After training a discriminative machine learning model, e.g., support vector machines (SVM) [2], on such representations, the images are generally scanned for occurrences of the learned objects with the sliding window technique. These methods have two crucial drawbacks. The first is that the features have to be crafted well for the problem domain, to highlight and describe the important information in the image. The second is the computational burden of the exhaustive search performed by the sliding window technique.
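For concreteness, the following is a minimal sketch of this classical pipeline using scikit-image and scikit-learn. The window size, stride, threshold and the pos_patches/neg_patches training crops are illustrative assumptions, not settings from the surveyed papers; a real detector would also scan over multiple scales and apply non-maximum suppression.

```python
from skimage.feature import hog
from sklearn.svm import LinearSVC

WIN, STEP = (64, 64), 16  # assumed window size and stride

def describe(patch):
    # HOG descriptor of a grayscale patch.
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def train(pos_patches, neg_patches):
    # Fit a linear SVM on HOG features of object / background crops.
    X = [describe(p) for p in pos_patches] + [describe(n) for n in neg_patches]
    y = [1] * len(pos_patches) + [0] * len(neg_patches)
    return LinearSVC().fit(X, y)

def detect(image, clf, thresh=0.5):
    # Exhaustive single-scale sliding-window scan: the costly step
    # criticized above.
    boxes = []
    for top in range(0, image.shape[0] - WIN[0] + 1, STEP):
        for left in range(0, image.shape[1] - WIN[1] + 1, STEP):
            patch = image[top:top + WIN[0], left:left + WIN[1]]
            score = clf.decision_function([describe(patch)])[0]
            if score > thresh:
                boxes.append((left, top, WIN[1], WIN[0], score))
    return boxes
```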
Detection with Deep Networks: With the remarkable achievements of deep learning methods in image classification tasks, similar approaches have started to be used to attack the object detection problem. These techniques can be divided into two broad categories: region proposal based and single shot methods. The approaches in the first category differ from the traditional methods by using features learned from data with CNNs, together with selective search or region proposal networks to decrease the number of candidate regions [4, 5, 13]. In the single shot approach, the aim is to compute the bounding boxes of the objects directly, instead of dealing with regions in the image. One such method extracts multi-scale features using CNNs and combines them to predict bounding boxes [7, 8]. Another, named YOLO, divides the final feature map into a 2D grid and predicts a bounding box using each grid cell [11].
2.2. UAV Detection Methods with Computer Vision

Although the problem of detecting UAVs is not a well-studied subject, there are some attempts worth mentioning. Mejias et al. utilized morphological pre-processing and Hidden Markov Model filters to detect and track micro unmanned planes [10]. Gökçe et al. used cascaded boosted classifiers along with local feature descriptors [6]. In addition to these purely spatial methods, spatio-temporal approaches also exist. Rozantsev et al. propose a method that first creates spatio-temporal cubes using a sliding window at different scales, applies motion compensation to stabilize the cubes, and finally utilizes boosted tree and CNN based regressors for bounding box detection [14].
3. Method

Our solution is based on a single shot object detection model, YOLOv2 [12], which is the follow-up study of YOLO. We adapt and fine-tune this model to detect objects of two classes (i.e., drone and bird). Although the problem is detecting drones in the scene, we have included the bird class so that the network can also learn robust features for distinguishing the two. To achieve high accuracy with such deep models, one needs a large-scale dataset that covers many scenarios of the problem, so as to obtain better generalization. To this end, we created an artificial dataset that combines real drones, real birds and real backgrounds. The following paragraphs describe the approach in YOLOv2, the dataset creation procedure, and the training and testing details.
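The compositing step itself is not detailed at this point, but conceptually it amounts to pasting foreground crops onto background frames and recording the resulting box. The following Pillow-based sketch illustrates the idea; the file arguments, scale range and random placement policy are hypothetical, not the paper's actual procedure.

```python
import random
from PIL import Image

def composite(frame_path, fg_path, out_path):
    # Paste one foreground crop (drone or bird, RGBA with transparent
    # background) onto a background frame and return its ground-truth box.
    frame = Image.open(frame_path).convert("RGB")
    fg = Image.open(fg_path).convert("RGBA")

    # Random scale so objects appear at varied sizes (range is assumed).
    scale = random.uniform(0.05, 0.3)
    w = max(1, int(frame.width * scale))
    h = max(1, int(fg.height * w / fg.width))
    fg = fg.resize((w, h))

    # Random placement; the alpha channel masks out the crop's background.
    x = random.randint(0, max(0, frame.width - w))
    y = random.randint(0, max(0, frame.height - h))
    frame.paste(fg, (x, y), mask=fg)
    frame.save(out_path)

    return (x, y, w, h)  # bounding box in pixel coordinates
```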
3.1. The deep network

YOLOv2 devises an end-to-end regression solution to the object detection problem. The earlier layers of the fully convolutional architecture shown in Figure 2 are trained to extract high-level features. The two highest-level feature maps are then combined to obtain the final feature map of the image, which is divided into an S × S grid, where the duty of each grid cell is to predict bounding boxes of the form (x, y, w, h, c). In this output, x and y are the coordinates of the box center with respect to the grid cell, w and h are the width and height in proportion to the whole image, and c is the confidence that an object is inside the bounding box. The final task of a grid cell is to compute the conditional class probabilities, given the probability that the corresponding bounding boxes contain objects. While predicting these bounding boxes, the model utilizes prior information computed by K-means clustering on the widths and heights of the ground truth bounding boxes. The final output size for a grid cell is:

Output Size = (N_cls + N_coord + 1) × N_anc,

where N_cls is the number of classes, N_coord is the number of coordinates, N_anc is the number of anchor bounding boxes used as prior knowledge, and the 1 in the parentheses is for the confidence value. In our approach, the grid size is set to 15, the number of classes is two, the number of coordinates is four, and the number of anchor boxes is five. Hence, the final output has shape 15 × 15 × 35.
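As a concreteness check on these numbers, the sketch below decodes one grid cell of the 15 × 15 × 35 output under the standard YOLOv2 parameterization (sigmoid for the x, y offsets and the confidence, anchor-scaled exponentials for w, h). The per-cell memory layout and the anchor values are assumptions for illustration; the paper's actual priors come from the K-means clustering described above.

```python
import numpy as np

S, N_CLS, N_COORD, N_ANC = 15, 2, 4, 5
assert (N_CLS + N_COORD + 1) * N_ANC == 35  # per-cell output size

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Placeholder anchor priors (w, h as fractions of the image); the paper
# derives its priors by clustering ground-truth box sizes.
ANCHORS = [(0.05, 0.05), (0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.3, 0.3)]

def decode_cell(cell, row, col):
    # Assumed layout: [tx, ty, tw, th, tc, class scores...] per anchor.
    boxes = []
    for a, (pw, ph) in enumerate(ANCHORS):
        tx, ty, tw, th, tc = cell[a * 7: a * 7 + 5]
        cls = cell[a * 7 + 5: a * 7 + 7]
        x = (col + sigmoid(tx)) / S          # box center relative to image
        y = (row + sigmoid(ty)) / S
        w, h = pw * np.exp(tw), ph * np.exp(th)
        conf = sigmoid(tc)                   # objectness confidence
        probs = np.exp(cls) / np.exp(cls).sum()  # softmax over {drone, bird}
        boxes.append((x, y, w, h, conf, conf * probs))
    return boxes

# Decode every cell of a (15, 15, 35) network output.
out = np.random.randn(S, S, (N_CLS + N_COORD + 1) * N_ANC)
detections = [b for r in range(S) for c in range(S)
              for b in decode_cell(out[r, c], r, c)]
```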
Figure 2: Our adaptation of the YOLOv2 network. All layers are fine-tuned with the dataset collected in the paper. (Diagram omitted; it shows a fully convolutional stack of 3×3 and 1×1 convolutions with 2×2 max pooling, reducing a 480×480 input through feature maps of width 240, 120, 30 and 15 with 32, 64, 256, 256, 512, 1024 and 3072 channels, ending in the 15×15×35 output.)
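As a rough companion to Figure 2, the following PyTorch sketch reproduces the overall shape of such a detector. The layer counts and channel widths are read loosely from the figure and are not the authors' exact architecture; in particular, the passthrough that combines the two highest-level feature maps into 3072 channels is omitted for brevity.

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k):
    # 3x3 or 1x1 convolution, stride 1, preserving spatial size.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=1, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

class DroneYOLO(nn.Module):
    # Approximation of Figure 2; five 2x2 poolings take 480 down to 15.
    def __init__(self, out_ch=35):
        super().__init__()
        pool = nn.MaxPool2d(2, 2)
        self.features = nn.Sequential(
            conv(3, 32, 3), pool,                                   # 480 -> 240
            conv(32, 64, 3), pool,                                  # 240 -> 120
            conv(64, 256, 3), conv(256, 128, 1), conv(128, 256, 3), pool,   # -> 60
            conv(256, 256, 3), conv(256, 128, 1), conv(128, 256, 3), pool,  # -> 30
            conv(256, 512, 3), pool,                                # 30 -> 15
            conv(512, 1024, 3),
        )
        self.head = nn.Conv2d(1024, out_ch, 1)  # 15 x 15 grid of 35 values

    def forward(self, x):
        return self.head(self.features(x))

# Shape check: a 480x480 RGB frame maps to a 15x15x35 prediction grid.
y = DroneYOLO()(torch.zeros(1, 3, 480, 480))
assert y.shape == (1, 35, 15, 15)
```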
Figure 4: Precision-Recall (PR) curve showing the performance of the approach on the outdoor test videos. (Plot omitted; axes: Recall on the x-axis, Precision on the y-axis.)

Figure 5: Change of prediction penalty with respect to detection threshold. (Plot omitted; axes: Detection Threshold on the x-axis, Avg Penalty on the y-axis.)
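Both curves are obtained by sweeping the detection threshold over the network's confidence outputs. As an illustration, the sketch below traces a PR curve like Figure 4's, assuming each detection has already been matched to the ground truth (e.g., by an IoU test) upstream; the challenge's penalty metric behind Figure 5 is described in Section 4 and is not reproduced here.

```python
import numpy as np

def pr_curve(scores, is_tp, n_gt, thresholds=np.linspace(0.0, 1.0, 101)):
    # scores: confidence of each detection; is_tp: whether it matched a
    # ground-truth box (matching assumed done upstream); n_gt: total
    # number of ground-truth boxes across the test videos.
    scores = np.asarray(scores)
    is_tp = np.asarray(is_tp, dtype=bool)
    points = []
    for t in thresholds:
        keep = scores >= t
        tp = np.count_nonzero(is_tp & keep)
        fp = np.count_nonzero(~is_tp & keep)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / n_gt if n_gt else 0.0
        points.append((recall, precision))
    return points
```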