
Deep Cross-Domain Flying Object Classification for Robust UAV Detection

Arne Schumann¹   Lars Sommer²,¹   Johannes Klatte¹   Tobias Schuchert¹   Jürgen Beyerer¹,²

¹ Fraunhofer IOSB, Fraunhoferstrasse 1, 76131 Karlsruhe, Germany
² Vision and Fusion Lab, Karlsruhe Institute of Technology (KIT), Adenauerring 4, 76131 Karlsruhe, Germany

[email protected]

Abstract

Recent progress in the development of unmanned aerial vehicles (UAVs) causes serious safety issues for mass events and safety-sensitive locations like prisons or airports. To address these concerns, robust UAV detection systems are required. In this work, we propose a UAV detection framework based on video images. Depending on whether the video images are recorded by static or moving cameras, we initially detect regions that are likely to contain an object by median background subtraction or by a deep learning based object proposal method, respectively. The detected regions are then classified as UAV or distractors, such as birds, by applying a convolutional neural network (CNN) classifier. To train this classifier, we use our own dataset comprised of crawled and self-acquired drone images, as well as bird images from a publicly available dataset. We show that, even across a significant domain gap, the resulting classifier can successfully identify UAVs in our target dataset. We evaluate our UAV detection framework on six challenging video sequences that contain UAVs at different distances as well as birds and background motion.

Figure 1. Flying object detections can be filtered by a classifier to reduce the amount of false alarms.
1. Introduction

The fast growing market of commercially available unmanned aerial vehicles (UAVs), as well as the recent progress in the development of such UAVs, poses serious threats to public events like festivals or sports events and to safety-sensitive facilities and infrastructures like prisons or airports. For instance, this year's UEFA Champions League final was played under a closed roof for the first time due to concerns over a terrorist drone attack on the stadium.

To address these threats, automatic detection systems are required to detect the presence of UAVs as early as possible. In this work, we propose a detection framework based on video images, which allows us to detect and localize UAVs at large distances. Our detection framework is composed of two core modules: the first module detects regions which are likely to contain a UAV, followed by a classification module that assigns each hypothesis to the UAV class or to distractor classes, such as birds (see Figure 1). To detect regions which are likely to contain a UAV, we consider two complementary detection techniques which exhibit promising results on video sequences containing UAVs at different distances. Depending on whether the video images are recorded by static or moving cameras, we apply median background subtraction or a deep learning based proposal method, respectively. To reduce the resulting high number of false alarms, we apply a CNN classifier. However, classifying UAVs in real world data is a challenging task due to varying object dimensions (in the range of less than ten to hundreds of pixels), the large variety of existing UAVs, and often a lack of training data. Furthermore, the classification is impeded by varying illumination conditions, differing backgrounds, and localization errors of the detector. To address the various object dimensions, we propose a small network that is optimized to handle low resolution objects such as UAVs at large distances. We use our own dataset to train the CNN classifier. The dataset is composed of crawled and self-acquired UAV images, bird images from a publicly available dataset, and crawled background images, to account for the large variety of existing UAVs, other distracting flying objects, and varying illumination conditions and background scenes. We show that, even though parts of the dataset have significantly different characteristics, the trained model generalizes well to our target domain.
2. Related Work

A number of previous works have studied the problem of UAV detection [9]. Such approaches can rely on UAV control or audio signals, but we will focus our discussion on approaches which use camera images and computer vision algorithms. An established pipeline among such approaches is to first detect flying objects and then distinguish between relevant and distracting detections by means of a UAV classifier.

In [3], the authors apply a passive color camera in combination with an active laser range-gated viewing sensor in the short wave infrared (SWIR) band, which allows them to effectively suppress foreground and background around an object. A keypoint-based tracker is then used to generate tracks from the obtained detections. In [4], the authors propose two-frame differencing to detect moving objects; local features (SURF) are then used to decide whether an object is a drone or not. Hu et al. [6] compute a coherence score for each blob generated by two-frame differencing to mitigate the number of false alarms. Rozantsev et al. [9] present a regression-based approach for motion stabilization of image patches to allow an effective classification on spatio-temporal image cubes extracted by a multi-scale sliding window approach. To classify the extracted cubes, CNN and hand-designed features are evaluated.

To the best of our knowledge, there exist no further publications on CNN based drone classification. However, applying CNNs to classify low resolution imagery has shown impressive results in different domains, especially face recognition. The impact of different network architectures and of the input image resolution [5] has been studied for low resolution face recognition. In [9], the authors demonstrate the potential of CNN based classification of flying objects at low resolution: flying birds in an image sequence at a wind farm are classified into bird or non-bird, which is closely related to drone classification.
3. Drone Detection Framework

Our drone detection framework is composed of two modules: the first module detects regions that are likely to contain a UAV, followed by a classification module that assigns each hypothesis to the UAV, bird or background class.

3.1. Flying Object Detection

To detect regions which are likely to contain a UAV, we consider two complementary detection techniques which exhibit the best detection results on six sequences that contain UAVs [1]. Following the findings in [1], we apply median background subtraction in case of static cameras, whereas a Region Proposal Network is applied in case of moving cameras.

3.1.1 Median Background Subtraction

In case of static cameras, we apply background subtraction to identify moving objects. For this, the difference image between the current frame and its corresponding background model is computed. Due to the short sequence length, we compute the background model by calculating the pixel-wise intensity median over the entire video sequence. After computing the difference image, a fixed threshold value is used to separate the pixels into foreground (moving objects) and background pixels. Higher threshold values result in more missed detections, whereas lower threshold values cause more false alarms and less accurate localization. Following the findings in [1], we use a threshold value of 0.06 in order to avoid missed detections. Morphological operations are applied on the thresholded image to remove single pixel detections and to fill in holes or gaps in object contours. Finally, connected component analysis is performed to cluster neighboring pixels. The bounding box around each cluster is considered a detection and forwarded to the classification module.
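As an illustration, this detection stage can be sketched in a few lines of Python with OpenCV. The sketch below follows the steps described above (pixel-wise median background model, fixed threshold of 0.06 on intensities normalized to [0, 1], morphological cleanup, connected component analysis); the structuring element, the minimum box size, and all function names are our own choices, not details from the paper:

```python
import cv2
import numpy as np

def detect_moving_objects(frames, threshold=0.06, min_box_side=2):
    """Median background subtraction as described in Sec. 3.1.1.

    frames: sequence of grayscale frames, uint8, each of shape (H, W).
    Returns a list of (x, y, w, h) bounding boxes per frame.
    """
    stack = np.stack(frames).astype(np.float32) / 255.0
    # Background model: pixel-wise intensity median over the whole sequence.
    background = np.median(stack, axis=0)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))

    boxes_per_frame = []
    for frame in stack:
        # Fixed threshold of 0.06 on the difference image (cf. [1]).
        foreground = (np.abs(frame - background) > threshold).astype(np.uint8)
        # Morphology: drop single-pixel detections, fill holes in contours.
        foreground = cv2.morphologyEx(foreground, cv2.MORPH_OPEN, kernel)
        foreground = cv2.morphologyEx(foreground, cv2.MORPH_CLOSE, kernel)
        # Connected component analysis clusters the remaining pixels.
        num, _, stats, _ = cv2.connectedComponentsWithStats(foreground)
        boxes = [tuple(stats[i, :4]) for i in range(1, num)  # label 0 is background
                 if stats[i, 2] >= min_box_side and stats[i, 3] >= min_box_side]
        boxes_per_frame.append(boxes)
    return boxes_per_frame
```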
3.1.2 Region Proposal Network

In case of moving cameras, we apply the Region Proposal Network (RPN) proposed by Ren et al. [8]. Instead of detecting objects by relying on motion cues, a set of candidate regions that are likely to contain an object is generated for each frame separately. For this, a small network is shifted in a sliding window approach over the output of the last convolutional layer of a fully convolutional network. The small network is composed of a 3×3 convolutional layer, whose output is fed into a classification layer and a bounding box regression layer. The classification layer provides a confidence value for the presence of an object and the bounding box regression layer provides the corresponding coordinates. The 100 top-ranked proposals according to this confidence value are forwarded to the classification module. The convolutional layer blocks of the VGG-16 network [11] are used as the fully convolutional network, as we achieved the best detection results for this configuration (see [1]). The implementation details are adopted from [1].
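Purely for illustration, the proposal head described above can be sketched in PyTorch as follows. The 512 intermediate channels, the anchor count, and the use of torchvision's VGG-16 weights interface are assumptions on our part; this is not the authors' original implementation:

```python
import torch
import torch.nn as nn
import torchvision

class RPNHead(nn.Module):
    """Proposal head of Sec. 3.1.2: a 3x3 conv slid over the last conv
    feature map, feeding a classification (objectness) layer and a
    bounding box regression layer in parallel."""

    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # One objectness score and 4 box regression values per anchor.
        self.cls_score = nn.Conv2d(512, num_anchors, kernel_size=1)
        self.bbox_pred = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls_score(h), self.bbox_pred(h)

# The convolutional blocks of VGG-16 [11] serve as the fully
# convolutional backbone (torchvision weights here, not the paper's).
backbone = torchvision.models.vgg16(weights=None).features
head = RPNHead()
scores, deltas = head(backbone(torch.rand(1, 3, 600, 800)))
# At test time, decode the regression deltas into boxes and forward
# the 100 top-ranked proposals by objectness score (cf. Sec. 3.1.2).
```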

3.2. Drone-vs-Birds Classification Dataset

In real-world scenarios there is usually no large amount of training data available to train the classifier in the second stage of our framework. The same problem arises in our target dataset, which consists of only six video sequences. In [1] it is demonstrated that even small amounts of training data can help by fine-tuning a classifier from a pre-trained net. But a more diverse and larger dataset results in a more robust, better generalizing classifier and allows for training from scratch. To train a robust CNN classifier, we thus created our own dataset, which contains 3,386 drone images, 3,500 bird images and 3,500 background images, as illustrated in Figure 2. The drone images are comprised of images crawled from the web and self-acquired images (first and second rows of Figure 2). Both subsets contain various different types of drones. For our crawled images, we chose query terms to cover relevant influences, such as time of day and varying weather conditions. The self-acquired images contain drones in the range of a few to hundreds of pixels to account for drones at different distances and consequently different object resolutions. We enrich our dataset with bird images from the publicly available Wild Birds in a Wind Farm: Image Dataset for Bird Detection [12], which provides annotated bounding boxes for flying birds in an image sequence acquired at a wind farm. The dataset comprises images of different bird species like hawks and crows. This dataset is randomly subsampled to balance the amount of birds with the amount of UAVs in our training data. In order to create data for the background class, we crawled background images for various scenes at different times of day. To account for different resolutions, sections with different dimensions are randomly cropped from these images. The crawled UAVs have an average bounding box diagonal of 670 pixels while our target dataset has an average diagonal of 60 pixels. The domain gap between the two datasets is thus large. We can to some extent alleviate this issue by relying on the self-recorded UAVs, which have an average diagonal of 59 pixels, and the bird dataset, which has an average diagonal of 45 pixels.

Figure 2. Training dataset consisting of crawled drone images (first row), acquired drone images (second row), bird images of the publicly available Wild Birds in a Wind Farm: Image Dataset for Bird Detection [12] (third row) and crawled background images (fourth row).
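The balancing and background cropping steps might be implemented along the following lines. This is only a sketch: the crop sizes, counts, and use of PIL are our assumptions; the paper does not specify these details:

```python
import random
from PIL import Image

def subsample(paths, n, seed=0):
    """Randomly subsample a class (e.g. the bird images) to n items to
    balance it against the UAV images in the training data."""
    rng = random.Random(seed)
    return rng.sample(paths, n)

def random_background_crops(image, sizes=(16, 32, 64, 128), per_size=2):
    """Cut sections of different dimensions from a crawled background
    image to cover multiple object resolutions."""
    crops = []
    for size in sizes:
        if image.width < size or image.height < size:
            continue
        for _ in range(per_size):
            x = random.randint(0, image.width - size)
            y = random.randint(0, image.height - size)
            crops.append(image.crop((x, y, x + size, y + size)))
    return crops
```

The resulting patches would then be split 90%/10% into training and test data, as described in Section 4.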
3.3. Flying Object Classification

To classify the detected regions that are likely to contain a UAV, we propose a CNN optimized for handling small objects. The basic structure of the CNN is given in Figure 3. The network comprises four convolutional layers, max pooling layers after the first, second and fourth convolutional layers, and two fully connected layers followed by a classification layer that outputs a confidence for the classes UAV, bird and background. The input size of the images is varied from 16×16 through 32×32 to 64×64 pixels to analyze which input image size is most suited to address object dimensions in the range of less than ten to hundreds of pixels. For our experiments, we trained the CNN from scratch on the dataset presented in Section 3.2. The learning rate was set to 0.005 and the number of iterations was set to 50,000. We compare our proposed CNN with CaffeNet [7] pre-trained on ImageNet [10]. That network's architecture is similar to the architecture of our proposed CNN; however, its input size is 227×227, which clearly differs from the input sizes used for our proposed CNN, and our net's number of parameters is an order of magnitude smaller. We fine-tuned the CaffeNet with a learning rate of 0.001 for 50,000 iterations (our classifier is available at s.fhg.de/uavdet-avss17).

Figure 3. Basic structure of the proposed CNN classifier to handle small object dimensions. Input images of size 16×16, 32×32, and 64×64 have been evaluated.
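The text fixes only the layer sequence; the exact filter and unit counts appear in Figure 3 and are not reproduced here. A PyTorch sketch of such a network, with channel and unit counts chosen by us and RGB input assumed, could look like this:

```python
import torch
import torch.nn as nn

class SmallObjectCNN(nn.Module):
    """Sketch of the classifier in Sec. 3.3: four conv layers with max
    pooling after the first, second and fourth, then two fully connected
    layers and a 3-way classification layer (UAV, bird, background).
    Channel/unit counts are illustrative, not the paper's."""

    def __init__(self, input_size=16, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat = input_size // 8  # three 2x2 poolings shrink each side by 8
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * feat * feat, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_classes),  # confidences: UAV / bird / background
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# 16x16 inputs; 32x32 and 64x64 only change the first linear layer.
logits = SmallObjectCNN(input_size=16)(torch.rand(8, 3, 16, 16))
```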
We observe two main challenges for successful UAV classification. Firstly, large variations in resolution occur while a UAV is nearing the camera; the classifier has to be robust to this variation in detail. Secondly, our detectors are designed for a low miss rate and can thus often generate detections that are only coarsely aligned with the object; the classifier has to be able to deal with such misalignments. We address both issues by performing corresponding data augmentation during training. The high resolution data from our crawled images allows us to cover a broad spectrum of resolutions by down- and up-scaling. We also crop training samples with varying degrees of margin in order to increase robustness to misalignments.
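Both augmentations can be sketched briefly. The jitter ranges below are our own choices, since the paper describes the operations only qualitatively:

```python
import random
from PIL import Image

def augment_patch(patch, out_size=16):
    """Resolution and alignment augmentation in the spirit of Sec. 3.3.
    patch: PIL image of the object with some surrounding context."""
    # Down- and up-scaling: shrink the high-resolution crawled image to a
    # random size to simulate the detail lost at large distances.
    low = random.randint(out_size // 2, max(out_size, patch.width))
    patch = patch.resize((low, low), Image.BILINEAR)
    # Varying crop margin: shift the crop window to make the classifier
    # robust to coarsely aligned detections.
    margin = random.randint(0, low // 4)
    dx = random.randint(-margin, margin)
    dy = random.randint(-margin, margin)
    box = (margin + dx, margin + dy, low - margin + dx, low - margin + dy)
    return patch.crop(box).resize((out_size, out_size), Image.BILINEAR)
```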

4. Evaluation

We evaluate our framework on the drones-vs-birds challenge dataset, which consists of five static camera sequences and one moving camera sequence. For evaluation we use the rate of false positive detections per image and the rate of missed detections across all sequences. We also report confusion matrices to analyze the accuracy of our classifier for the various classes and domains. Our self-created dataset is randomly split into 90% training data and 10% test data to allow for evaluation of our classifier on both source and target domains.
4.1. Flying Object Classification

Tables 1 to 3 show the confusion matrices of our proposed CNN. The confusion matrix for CaffeNet is given in Table 4. The first three rows of each table show the results on our self-created source dataset and the last two rows show the results on the five static camera sequences of the target challenge dataset. For this, we manually annotated all occurring birds in the target video sequences. Since the ground truth does not contain background patches, we only report the classification results for the classes UAV and bird on the target data.

Table 1. Confusion matrix, input image size 16×16 (rows 1-3: source test set; rows 4-5: target challenge data).
Actual \ Predicted    UAV     Bird    Clutter
UAV (source)          0.975   0.025   0
Bird (source)         0       0.997   0.003
Clutter (source)      0.023   0.003   0.974
UAV (target)          0.996   0.004   0
Bird (target)         0.084   0.886   0.030

Table 2. Confusion matrix, input image size 32×32.
Actual \ Predicted    UAV     Bird    Clutter
UAV (source)          0.983   0.017   0
Bird (source)         0       0.994   0.006
Clutter (source)      0.011   0.006   0.983
UAV (target)          0.981   0.019   0
Bird (target)         0.084   0.891   0.025

Table 3. Confusion matrix, input image size 64×64.
Actual \ Predicted    UAV     Bird    Clutter
UAV (source)          0.992   0.008   0
Bird (source)         0       0.991   0.009
Clutter (source)      0.009   0.003   0.989
UAV (target)          0.997   0.003   0
Bird (target)         0.102   0.864   0.034

Table 4. Confusion matrix, CaffeNet.
Actual \ Predicted    UAV     Bird    Clutter
UAV (source)          0.989   0.011   0
Bird (source)         0.003   0.997   0
Clutter (source)      0.006   0.008   0.986
UAV (target)          0.996   0.004   0
Bird (target)         0.084   0.886   0.030
Using an input size of 64×64 exhibits the best results on our source dataset for the UAV class. We assume that the slightly better results are due to the high resolution of the crawled images. However, the results for input sizes of 16×16 and 32×32, as well as for the pre-trained CaffeNet, are similarly high. Applying the classifiers trained on our dataset to the challenge data leads to only a very slight drop in accuracy for the two common classes. The best results on the five target sequences are achieved for an input image size of 16×16; the results for the class UAV are even improved. This is likely due to a higher number of small objects in the target dataset. Since our aim is to detect UAVs as early as possible, we select the 16×16 net for application with our object detectors.
4.2. Framework Performance

The impact of our classifier on reducing the number of false alarms is shown by plotting the miss rate against the number of false positives per image for both detection techniques applied to the five video sequences, with and without our proposed CNN classifier. Following [1], we use a low IoU threshold of 0.2 to better judge the number of fully missed detections.

For the median image detector we choose an initial low-miss threshold of 0.06 as recommended in [1]. This value results in 0 misses at the cost of 91 false positive detections per image. We then aim to improve this trade-off by scoring all detections with our classifier and accepting only the top-k scored detections as candidates for UAVs. Results are depicted in Figure 4. The number of false positives per image can be reduced at no cost down to a value of 15; after this, additional misses occur. However, the rise in miss rate is significantly lower than that of the original curve, which is based on the difference image threshold.

Figure 4. Based on an initial image difference threshold value of 0.06, which results in 0 misses at 91 false positives per image, we are able to significantly improve this trade-off up to 15 false positives per image without new misses. This is done by filtering out detections which are classified as non-UAV with high confidence.
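Stated as code, the matching criterion and the top-k filtering step are straightforward. This sketch assumes boxes in (x, y, w, h) form and one classifier confidence per detection; the names are our own, not from the released code:

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes; detections are
    matched to ground truth at the low threshold IoU >= 0.2 (Sec. 4.2)."""
    iw = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def filter_top_k(detections, uav_scores, k):
    """Keep only the k detections with the highest UAV confidence.
    Sweeping k trades false positives against misses; in the paper the
    false positive rate drops from 91 to 15 per image without new misses."""
    ranked = sorted(zip(uav_scores, detections),
                    key=lambda pair: pair[0], reverse=True)
    return [box for _, box in ranked[:k]]
```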

Figure 5 shows the comparison for the RPN. For this, we varied the maximum number of accepted proposals from 1 to 100. For each static camera sequence, we trained the RPN separately on the four other sequences. The number of false positives is much reduced in the range from 100 down to 20 proposals, while the number of missed detections is almost unchanged. However, at high object resolutions the classifier tends to rank parts of the UAV (e.g. the camera mounted below it) higher than the overall UAV, which leads to an increase in miss rate. This problem does not occur with the median approach, because the detector stage does not generate detections of parts of the UAV.

Figure 5. Comparison of the detection results before and after applying our proposed CNN classifier. For this, the maximum number of accepted proposals is varied from 1 to 100 (markers at 1, 10, 30, 100).
4.3. Challenge Results

For participation in the 2017 drones-vs-birds challenge, which consists of a moving camera video, we trained the VGG-conv5 RPN on all five static camera sequences and classified the proposals using the classifier net with input size 16×16. This combination results in a penalty score of 2.64 according to the official challenge metric [2]. The main source of error are three separate missed detections among the 606 frames. Our resulting score was the winning result of the challenge. Qualitative detections are depicted in Figure 6.

Figure 6. Qualitative detection results on the challenge test sequence.
5. Conclusion

We presented a UAV detection framework which is able to accurately detect flying objects and distinguish between relevant UAVs and distractor classes, such as birds. Using the described RPN detector, our framework can handle static and moving camera sequences, and it won the 2017 drones-vs-birds UAV detection challenge [2]. Our results also show that training data from strongly different domains, such as web images, can nevertheless help to greatly improve classification accuracy.

In future work we will focus on including temporal information in the detection process in order to generate even more confident UAV detections and to reduce the number of false positives to a degree that will allow for an early warning detection system with little manual supervision.
References

[1] Flying object detection for automatic UAV recognition. International Workshop on Small-Drone Surveillance, Detection and Counteraction Techniques in conjunction with IEEE AVSS 2017, in submission.
[2] Drones-vs-birds challenge with the International Workshop on Small-Drone Surveillance, Detection and Counteraction Techniques. https://wosdetc.wordpress.com/challenge/, 2017.
[3] F. Christnacher, S. Hengy, M. Laurenzis, A. Matwyschuk, P. Naz, S. Schertzer, and G. Schmitt. Optical and acoustical UAV detection. In SPIE Security + Defence, pages 99880B–99880B. International Society for Optics and Photonics, 2016.
[4] S. R. Ganti and Y. Kim. Implementation of detection and tracking mechanism for small UAS. In Unmanned Aircraft Systems (ICUAS), 2016 International Conference on, pages 1254–1260. IEEE, 2016.
[5] C. Herrmann, D. Willersinn, and J. Beyerer. Low-resolution convolutional neural networks for video face recognition. In Advanced Video and Signal Based Surveillance (AVSS), 2016 13th IEEE International Conference on, pages 221–227. IEEE, 2016.
[6] S. Hu, G. H. Goldman, and C. C. Borel-Donohue. Detection of unmanned aerial vehicles using a visible camera system. Applied Optics, 56(3):B214–B221, 2017.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[8] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[9] A. Rozantsev, V. Lepetit, and P. Fua. Detecting flying objects using a single moving camera. Technical report, Institute of Electrical and Electronics Engineers, 2016.
[10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575, 2014.
[11] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[12] R. Yoshihashi, R. Kawakami, M. Iida, and T. Naemura. Construction of a bird image dataset for ecological investigations. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 4248–4252. IEEE, 2015.
