Arne Schumann¹, Lars Sommer²,¹, Johannes Klatte¹, Tobias Schuchert¹, Jürgen Beyerer¹,²

¹ Fraunhofer IOSB, Fraunhoferstrasse 1, 76131 Karlsruhe, Germany
² Vision and Fusion Lab, Karlsruhe Institute of Technology (KIT), Adenauerring 4, 76131 Karlsruhe, Germany

[email protected]
Abstract

Recent progress in the development of unmanned aerial vehicles (UAVs) causes serious safety issues for mass events and safety-sensitive locations like prisons or airports. To address these concerns, robust UAV detection systems are required. In this work, we propose a UAV detection framework based on video images. Depending on whether the video images are recorded by static or moving cameras, we initially detect regions that are likely to contain an object by median background subtraction or a deep learning based object proposal method, respectively. Then, the detected regions are classified as UAVs or distractors, such as birds, by applying a convolutional neural network (CNN) classifier. To train this classifier, we use our own dataset composed of crawled and self-acquired drone images, as well as bird images from a publicly available dataset. We show that, even across a significant domain gap, the resulting classifier can successfully identify UAVs in our target dataset. We evaluate our UAV detection framework on six challenging video sequences that contain UAVs at different distances as well as birds and background motion.

Figure 1. Flying object detections can be filtered by a classifier to reduce the number of false alarms.

1. Introduction

The fast growing market of commercially available unmanned aerial vehicles (UAVs), as well as the recent progress in the development of such UAVs, poses serious threats to public events like festivals or sports events and to safety-sensitive facilities and infrastructures like prisons or airports. For instance, this year's UEFA Champions League final was played under a closed roof for the first time due to concerns over a terrorist drone attack on the stadium.

To address these threats, automatic detection systems are required to detect the presence of UAVs as early as possible. In this work, we propose a detection framework based on video images, which allows us to detect and localize UAVs at large distances. Our detection framework is composed of two core modules: the first module detects regions that are likely to contain a UAV, followed by a classification module that distinguishes each hypothesis into UAV or distractor classes, such as birds (see Figure 1). To detect regions that are likely to contain a UAV, we consider two complementary detection techniques that exhibit promising results on video sequences containing UAVs at different distances. Depending on whether the video images are recorded by static or moving cameras, we apply median background subtraction or a deep learning based proposal method, respectively. To reduce the high number of false alarms, we apply a CNN classifier. However, classifying UAVs in real world data is a challenging task due to varying object dimensions (in the range of less than ten to hundreds of pixels), the large variety of existing UAVs, and often a lack of training data. Furthermore, the classification is impeded by varying illumination conditions, differing backgrounds, and localization errors of the detector.
To address the varying object dimensions, we propose a small network that is optimized to handle low resolution objects such as UAVs at large distances. We use our own dataset to train the CNN classifier. The dataset is composed of crawled and self-acquired UAV images, bird images from a publicly available dataset, and crawled background images, in order to account for the large variety of existing UAVs, other distracting flying objects, and varying illumination conditions and background scenes. We show that, even though parts of the dataset have significantly different characteristics, the trained model generalizes well to our target domain.
2. Related Work
A number of previous works have studied the problem of UAV detection [9]. Such approaches can rely on UAV control or audio signals, but we will focus our discussion on approaches that use camera images and computer vision algorithms. An established pipeline among such approaches is to first detect flying objects and then distinguish between relevant and distracting detections by means of a UAV classifier.

In [3], the authors apply a passive color camera in combination with an active laser range-gated viewing sensor in the short wave infrared (SWIR) band, which allows them to effectively suppress foreground and background around an object. Then, a keypoint-based tracker is used to generate tracks out of the obtained detections. In [4], the authors propose two-frame differencing to detect moving objects. Then, local features (SURF) are used to distinguish whether the object is a drone or not. Hu et al. [6] compute a coherence score for each blob generated by two-frame differencing to mitigate the number of false alarms. Rozantsev et al. [9] present a regression-based approach for motion stabilization of image patches to allow an effective classification on spatio-temporal image cubes extracted by a multi-scale sliding window approach. To classify the extracted cubes, CNN and hand-designed features are evaluated.

To the best of our knowledge, there exist no further publications on CNN based drone classification. However, applying CNNs to classify low resolution imagery has shown impressive results in different domains, especially for face recognition. The impact of different network architectures and of the input image resolution [5] has been studied for low resolution face recognition. In [9], the authors demonstrate the potential of CNN based classification of flying objects with low resolution. There, flying birds in an image sequence at a wind farm are classified into bird or non-bird, which is closely related to drone classification.
3. Drone Detection Framework

Our drone detection framework is composed of two modules: the first module detects regions that are likely to contain a UAV, followed by a classification module that classifies each hypothesis as UAV, bird, or background.

3.1. Flying Object Detection

To detect regions that are likely to contain a UAV, we consider two complementary detection techniques, which exhibit the best detection results on six sequences that contain UAVs [1]. Following the findings in [1], we apply median background subtraction in the case of static cameras, whereas a Region Proposal Network is applied in the case of moving cameras.

3.1.1 Median Background Subtraction

In the case of static cameras, we apply background subtraction to identify moving objects. To this end, the difference image between the current frame and its corresponding background model is computed. Due to the short sequence length, we compute the background model by calculating the pixel-wise intensity median over the entire video sequence. After computing the difference image, a fixed threshold value is used to separate the pixels into foreground (moving objects) and background pixels. Higher threshold values result in more missed detections, whereas lower threshold values cause more false alarms and less accurate localization. Following the findings in [1], we use a threshold value of 0.06 in order to avoid missed detections. Morphological operations are applied to the thresholded image to remove single-pixel detections and to fill in holes or gaps in object contours. Finally, connected component analysis is performed to cluster neighboring pixels. The bounding box around each cluster is considered a detection and forwarded to the classification module.
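The following sketch illustrates this detection stage (a minimal illustration, assuming grayscale frames normalized to [0, 1] and using OpenCV for the morphology and connected-component steps; all function and variable names are ours, not from the paper's implementation):

```python
import cv2
import numpy as np

def detect_moving_objects(frames, threshold=0.06):
    # frames: ndarray of shape (T, H, W), grayscale intensities in [0, 1].
    # Background model: pixel-wise intensity median over the whole sequence.
    background = np.median(frames, axis=0)
    kernel = np.ones((3, 3), np.uint8)
    detections = []
    for frame in frames:
        # Difference image between current frame and background model.
        diff = np.abs(frame - background)
        # Fixed threshold separates foreground (moving) from background pixels.
        mask = (diff > threshold).astype(np.uint8)
        # Opening removes single-pixel detections; closing fills holes
        # and gaps in object contours.
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
        # Connected component analysis clusters neighboring pixels.
        num, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
        # Label 0 is the background; each remaining component's bounding
        # box (x, y, w, h) becomes a detection for the classification module.
        detections.append([tuple(stats[i, :4]) for i in range(1, num)])
    return detections
```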
3.1.2 Region Proposal Network

In the case of moving cameras, we apply the Region Proposal Network (RPN) proposed by Ren et al. [8]. Instead of detecting objects by relying on motion cues, a set of candidate regions that are likely to contain an object is generated for each frame separately.

For this, a small network is shifted in a sliding window approach over the output of the last convolutional layer of a fully convolutional network. The small network is composed of a 3×3 convolutional layer, whose output is fed into a classification layer and a bounding box regression layer. The classification layer provides a confidence value for the presence of an object and the bounding box regression layer provides the corresponding coordinates. The 100 top-ranked proposals based on the confidence value are forwarded to the classification module. The convolutional layer blocks of the VGG-16 network [11] are used as the fully convolutional network, as we achieved the best detection results for this configuration (see [1]). The implementation details are adopted from [1].
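A minimal PyTorch sketch of such an RPN head is shown below. It is illustrative only: we use a single anchor per spatial location and omit anchor decoding and non-maximum suppression, which the full RPN of Ren et al. [8] includes.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    # Sketch of the small network slid over the last conv feature map of
    # VGG-16 (512 channels). Single anchor per location for brevity; the
    # RPN of Ren et al. [8] predicts k anchors per location.
    def __init__(self, in_channels=512):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 1, kernel_size=1)  # objectness confidence
        self.reg = nn.Conv2d(512, 4, kernel_size=1)  # box coordinates

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

def top_proposals(cls_out, reg_out, k=100):
    # Keep only the k top-ranked proposals by confidence for classification.
    scores = cls_out.flatten()                          # (N*H*W,)
    boxes = reg_out.permute(0, 2, 3, 1).reshape(-1, 4)  # matching order
    idx = scores.topk(min(k, scores.numel())).indices
    return boxes[idx]
```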
4. Evaluation

Our target dataset comprises six challenging video sequences: five static camera sequences and one moving camera sequence. For evaluation we use the rate of false positive detections per image and the rate of missed detections across all sequences. We also report confusion matrices to analyze the accuracy of our classifier for the various classes and domains. Our self-created dataset is randomly split into 90% training data and 10% test data to allow for evaluation of our classifier on both source and target domains.
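For illustration, the row-normalized confusion matrices reported in Tables 1 to 4 can be computed as in the following sketch (names are illustrative; y_true and y_pred hold integer labels for the three classes UAV, bird, and clutter):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=3):
    # Rows: actual class, columns: predicted class (UAV, bird, clutter).
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    # Row-normalize so each row shows per-class classification rates.
    return cm / cm.sum(axis=1, keepdims=True)
```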
4.1. Flying Object Classification

Tables 1 to 3 show the confusion matrices of our proposed CNN. The confusion matrix for CaffeNet is given in Table 4. The first three rows of each table show the results on our self-created source dataset and the last two rows show the results on the five static camera sequences of the target chal-
lenge dataset. For this, we manually annotated all occurring birds in the target video sequences. Since the ground truth does not contain background patches, we only report the classification results for the classes UAV and bird on the target data.

Using an input size of 64×64 yields the best results on our source dataset for the UAV class. We assume that the slightly better results are due to the high resolution of the crawled images. However, the results for input sizes
of 16×16 and 32×32, as well as for the pre-trained CaffeNet, are similarly high. Applying the classifiers trained on our dataset to the challenge data leads to only a very slight drop in accuracy for the two common classes. The best results on the five target sequences are achieved for an input image size of 16×16. The results for the class UAV are even improved. This is likely due to a higher number of small objects in the target dataset. Since our aim is to detect UAVs as early as possible, we select the 16×16 net for application with our object detectors.

Table 1. Confusion matrix, input image size 16×16. First three rows: source dataset; last two rows: target sequences.

    Actual \ Predicted    UAV    Bird   Clutter
    UAV                   0.975  0.025  0
    Bird                  0      0.997  0.003
    Clutter               0.023  0.003  0.974
    UAV (target)          0.996  0.004  0
    Bird (target)         0.084  0.886  0.030

Table 2. Confusion matrix, input image size 32×32. First three rows: source dataset; last two rows: target sequences.

    Actual \ Predicted    UAV    Bird   Clutter
    UAV                   0.983  0.017  0
    Bird                  0      0.994  0.006
    Clutter               0.011  0.006  0.983
    UAV (target)          0.981  0.019  0
    Bird (target)         0.084  0.891  0.025

Table 3. Confusion matrix, input image size 64×64. First three rows: source dataset; last two rows: target sequences.

    Actual \ Predicted    UAV    Bird   Clutter
    UAV                   0.992  0.008  0
    Bird                  0      0.991  0.009
    Clutter               0.009  0.003  0.989
    UAV (target)          0.997  0.003  0
    Bird (target)         0.102  0.864  0.034

Table 4. Confusion matrix, CaffeNet. First three rows: source dataset; last two rows: target sequences.

    Actual \ Predicted    UAV    Bird   Clutter
    UAV                   0.989  0.011  0
    Bird                  0.003  0.997  0
    Clutter               0.006  0.008  0.986
    UAV (target)          0.996  0.004  0
    Bird (target)         0.084  0.886  0.030
4.2. Framework Performance

The impact of our classifier on reducing the number of false alarms is shown by plotting the miss rate against the number of false positives per image for both detection techniques applied to the five video sequences, with and without our proposed CNN classifier. Following [1], we use a low IoU threshold of 0.2 to better judge the number of fully missed detections.

For the median image detector we choose an initial low-miss threshold of 0.06, as recommended in [1]. This value results in zero misses at the cost of 91 false positive detections per image. We then aim to improve this trade-off by scoring all detections with our classifier and accepting only the top-k scored detections as candidates for UAVs. Results are depicted in Figure 4. The number of false positives can be reduced at no cost up to k = 15. After this, additional misses occur. However, the rise in miss rate is significantly
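The top-k filtering step itself is simple; the sketch below (illustrative names) keeps the k highest-scoring detections per frame.

```python
def filter_top_k(detections, k=15):
    # detections: list of (box, uav_score) tuples for one frame; the score
    # comes from the CNN classifier. Per the evaluation above, k = 15
    # removes false positives without introducing additional misses.
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    return ranked[:k]
```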