
Air-to-Air Visual Detection of Micro-UAVs: An Experimental Evaluation of Deep Learning

Ye Zheng, Zhang Chen, Dailin Lv, Zhixing Li, Zhenzhong Lan, and Shiyu Zhao, Member, IEEE

Abstract—This letter studies the problem of air-to-air visual detection of micro unmanned aerial vehicles (UAVs) by monocular cameras. This problem is important for many applications such as vision-based swarming of UAVs, malicious UAV detection, and see-and-avoid systems for UAVs. Although deep learning methods have exhibited superior performance in many object detection tasks, their potential for UAV detection has not been well explored. As the first main contribution of this letter, we present a new dataset, named Det-Fly, which consists of more than 13 000 images of a flying target UAV acquired by another flying UAV. Compared to the existing datasets, the proposed one is more comprehensive in the sense that it covers a wide range of practical scenarios with different background scenes, viewing angles, relative distances, flying altitudes, and lighting conditions. The second main contribution of this letter is to present an experimental evaluation of eight representative deep-learning algorithms based on the proposed dataset. To the best of our knowledge, this is the first comprehensive experimental evaluation of deep learning algorithms for the task of visual UAV detection so far. The evaluation results highlight some key challenges in the problem of air-to-air UAV detection and suggest potential ways to develop new algorithms in the future. The dataset is available at https://github.com/Jake-WU/Det-Fly.

Index Terms—Deep learning, UAV detection, visual detection.

Fig. 1. A DJI M210 platform with an XT2 camera was used to acquire images of a flying target UAV (DJI Mavic).
without colliding with each other is an important problem.
Manuscript received September 24, 2020; accepted January 17, 2021. Date of publication February 1, 2021; date of current version February 16, 2021. This letter was recommended for publication by Associate Editor F. Ruggiero and Editor P. Pounds upon evaluation of the reviewers' comments. This work was supported in part by the National Natural Science Foundation of China under Grant 61903308, and in part by the Westlake University and Bright Dream Joint Institute for Intelligent Robotics. (Corresponding author: Shiyu Zhao.)

Ye Zheng is with the Department of Computer Science & Technology, Zhejiang University, Hangzhou 310027, China, and also with the School of Engineering, Westlake University, Hangzhou 310024, China (e-mail: [email protected]).

Zhang Chen is with the Department of Automation, Tsinghua University, Beijing 100085, China (e-mail: [email protected]).

Dailin Lv and Zhixing Li are with the School of Electronics and Information Engineering, Hangzhou Dianzi University, Hangzhou 310018, China (e-mail: [email protected]; [email protected]).

Zhenzhong Lan and Shiyu Zhao are with the School of Engineering, Westlake University, Hangzhou 310024, China (e-mail: [email protected]; [email protected]).

This article has supplementary downloadable material available at https://doi.org/10.1109/LRA.2021.3056059, provided by the authors.

Digital Object Identifier 10.1109/LRA.2021.3056059

I. INTRODUCTION

Visual detection of micro unmanned aerial vehicles (UAVs) has attracted increasing attention in recent years since it is the core technology for many important applications. For example, visual detection of UAVs is essential to achieve vision-based UAV swarming systems, where each UAV needs to use onboard cameras to measure the relative motion of their neighboring UAVs [1]. In addition, the hostile use of micro UAVs has become a serious threat to public safety and personal privacy nowadays. Visual detection of malicious micro UAVs [2], [3] is a key technology for developing civilian UAV defense systems. Another application is see-and-avoid among UAVs [4]. In particular, as more and more commercial UAVs occupy low-altitude airspace for the purpose of, for example, parcel delivery, how to ensure that UAVs detect other UAVs in time to navigate safely without colliding with each other is an important problem.

The detection of UAVs could be classified into two application scenarios. The first is ground-to-air, where cameras are placed on the ground to detect flying UAVs. The second scenario is air-to-air, where a flying UAV uses its onboard cameras to detect other flying UAVs (see, for example, Fig. 1). This paper focuses on the air-to-air scenario. In addition, although different types of sensors could be used to detect micro UAVs, such as vision, radar [5], and acoustic sensors [6], visual sensors are one of the few suitable options for the air-to-air scenario due to the extremely limited onboard payload of micro UAVs. This paper focuses on the most widely used RGB monocular cameras.

While ground-to-air UAV detection has attracted increasing research attention in recent years (see Section II for a review), the air-to-air case, which is even more challenging, is far from being well solved up to now. In many ground-to-air UAV detection tasks, ground cameras are usually stationary or moving slowly [7], and the background of target UAV images is a clear or cloudy sky. As a comparison, in an air-to-air UAV detection task, a flying UAV may observe the target UAV from top or side viewing angles. As a result, the background of the target UAV image could be extremely complex scenes such as urban and natural fields (see Fig. 2 for example). Moreover, since the onboard camera is flying dynamically, the appearance of the target UAV, such as its shape, scale, and color, may vary dramatically. Since micro UAVs are small in size, their images may be extremely small (e.g., less than 10 × 10 pixels), which thus increases the difficulty of detection.
Fig. 2. Samples of images in the dataset and the corresponding detection results by the eight algorithms. The dataset contains four types of background scenes: sky, mountain, field, and urban. The areas detected by the eight algorithms are given to the right of each sample image with color-coded boxes. If the corresponding area is blank, the algorithm did not detect any target UAV in that image.

The existing approaches for UAV detection could be classified into two streams. The first stream is the conventional approaches, which are composed of two-step operations. The first step is to extract object features represented by, for example, Histogram of Oriented Gradients (HOG) or Scale Invariant Feature Transform (SIFT) descriptors. The second step is to classify the features using machine-learning algorithms such as Support Vector Machine (SVM) or AdaBoost. The second stream is the deep-learning-based approaches, which directly output detection results using end-to-end artificial neural networks. In contrast to the conventional approaches, which use hand-crafted features, deep-learning-based approaches rely on deep convolutional neural network (DCNN) features and consequently have a stronger capability to represent complex objects. However, the disadvantage of using DCNNs is that they have high computational requirements and require large datasets to train. A detailed review of the existing approaches is given in Section II.
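To make this two-step pipeline concrete, the following is a minimal sketch (not code from any of the cited works) of a HOG + SVM detector over sliding-window crops; the window size, HOG parameters, and the uav_crops/background_crops training inputs are illustrative assumptions.

```python
# Sketch of the conventional two-step pipeline: step 1 extracts HOG features
# from fixed-size crops, step 2 classifies them with an SVM. Window size and
# HOG parameters are illustrative assumptions.
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

WIN = 64  # assumed square window size in pixels

def hog_feature(crop_gray):
    # Step 1: hand-crafted feature extraction (HOG).
    return hog(crop_gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

def train_classifier(uav_crops, background_crops):
    # Step 2: train a discriminative classifier on labeled crops
    # (uav_crops / background_crops are assumed 64x64 grayscale arrays).
    X = np.array([hog_feature(c) for c in uav_crops + background_crops])
    y = np.array([1] * len(uav_crops) + [0] * len(background_crops))
    clf = SVC(kernel="linear")
    clf.fit(X, y)
    return clf

def detect(image_gray, clf, stride=32):
    # Slide a window over the image and report windows classified as UAV.
    hits = []
    h, w = image_gray.shape
    for top in range(0, h - WIN + 1, stride):
        for left in range(0, w - WIN + 1, stride):
            crop = image_gray[top:top + WIN, left:left + WIN]
            if clf.predict([hog_feature(crop)])[0] == 1:
                hits.append((left, top, left + WIN, top + WIN))
    return hits
```

As the surrounding text notes, such hand-crafted pipelines tend to break down when the background or target appearance varies strongly, which motivates the deep-learning stream evaluated in this letter.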
Although deep learning methods have exhibited superior performance in many object detection tasks, their potential for UAV detection has not been well explored or evaluated up to now (see Section II-B for a review). As the first step towards establishing a robust approach to air-to-air UAV detection, this paper proposes a new dataset of micro UAV images and presents a comprehensive experimental evaluation of eight representative deep-learning algorithms. It is worth noting that we focus on the case where the target UAVs are known in advance such that a dataset of them could be built up for the purpose of training. This case applies to tasks like vision-based cooperative control of multi-UAV systems, which is our main motivation for UAV detection. Although the algorithms exhibit a certain generalization ability to detect unknown UAVs with similar appearances, other measures such as building up datasets of multiple types of UAVs or target motion sensing [2] may be required.

The novelty and contribution of this work are detailed as follows.


First, this paper presents a dataset of 13 271 images of a flying target UAV (DJI Mavic) acquired by another flying UAV (DJI M210). Compared to the existing air-to-air datasets, the proposed one is more systematically designed and comprehensive in the sense that it covers a wide range of practical scenarios with different background scenes, viewing angles, relative distances, flying altitudes, and lighting conditions. In particular, the environmental background scenes vary from simple ones such as clear sky to complex ones such as mountain, field, and urban. The relative distance of the target UAV varies from 10 m to 100 m, and the flight altitude from 20 m to 110 m. Since lighting conditions are also important factors in flying UAV detection, the time for data collection varies from morning to evening in different periods of the day. The dataset also covers some challenging scenarios with, for example, strong light, motion blur, and partial target occlusion.

Second, this paper presents an experimental evaluation of eight representative deep-learning algorithms based on our proposed dataset: RetinaNet [8], SSD [9], YOLOv3 [10], FPN [11], Faster R-CNN [12], RefineDet [13], Grid R-CNN [14], and Cascade R-CNN [15]. To the best of our knowledge, this is the first comprehensive evaluation of deep learning algorithms for UAV detection tasks. The evaluation results suggest that the overall performance of Cascade R-CNN and Grid R-CNN is superior compared to the others. We also evaluated the impact of some key factors such as background scene complexity, relative viewing angles, and target scales on the detection performance.

The proposed dataset could be used as a benchmark to evaluate different UAV detection algorithms (either conventional or deep-learning-based). The evaluation results highlight some key challenges in the problem of air-to-air UAV detection and suggest potential ways to develop new algorithms in the future.

II. RELATED WORK

This section gives a review of the existing studies on visual detection of micro UAVs. We only consider the case of using monocular cameras.

A. Conventional Approaches

The conventional techniques adopted by existing UAV detection works can be classified into two categories. The first is to use feature extraction methods to obtain target features, and then use a discriminative classifier to determine the target location. The second is to detect moving objects in the image, and then use a generative classifier to determine whether the moving object is the target.

In particular, the work in [16] adopts Haar-wavelet-based AdaBoost to detect UAVs. The approach is demonstrated by flight experiments to be effective in the simple case of a cloudy sky background. The work in [17] proposes a cascade approach to detect UAVs based on Haar-like features, local binary patterns, and HOG. Since it is a combination of different detection methods, this approach has a low running speed. The HOG feature is adopted in [18] for training classical cascade detectors. Although this approach significantly reduces the number of repeated detections by applying non-maximum suppression, the detection accuracy drops rapidly in the case of partial occlusion. Motivated by moving object detection in see-and-avoid tasks, the work in [19] utilizes optical flow matching to integrate spatial and temporal information to track moving targets. This approach requires high-precision motion compensation. The optical flow method is also used to locate moving objects in [20]. The subsequent step is to recognize the moving objects by template matching, which is not robust to variance in the target appearance. The work in [21] also adopts template matching as well as morphological filtering for UAV detection. A real-time detection and tracking strategy is proposed in [22], where the object of interest can be automatically detected in a saliency map by computing a background connectivity cue at each frame. The work in [23] proposes a pyramidal Lucas-Kanade (PLK) algorithm to detect moving targets in a team of cooperative UAVs. The work in [24] detects moving targets by extracting geometric features and dynamic features in the segmentation image, and classifies them by a discriminant function derived from Bayes' theorem.

In summary, although UAV detection has been studied based on many conventional approaches, these approaches are effective only in restricted scenarios where, for example, the background scene is relatively simple or the target appearance does not vary considerably.

B. Deep-Learning Approaches

Although the methods based on deep learning have made great progress in the field of general object detection, they have not been well explored in the field of UAV detection. Up to now, there are only a few studies on visually detecting UAVs by deep learning algorithms. For example, an approach to detect flying objects using motion compensation is proposed in [25], where the features of moving objects are classified by CNNs. This approach leads to high average detection precision, whereas the motion compensation step requires high-precision measurement of the motion of the camera. The work in [26] combines SegNet with bottom-hat morphological processing for detecting large-size aircraft in the air. This approach could detect aircraft within a long range up to 2800 m, but the accuracy is as low as 13.4%. Although some other studies such as [27], [28] also adopt deep learning algorithms such as YOLOv2 to detect UAVs, the performance of different representative deep learning algorithms for UAV detection has not been evaluated or compared.

C. Existing Datasets for UAV Detection

Up to now, there are very few comprehensive datasets for the purpose of training deep learning algorithms for UAV detection. The dataset in [29] comprises 20 video sequences, each of which has about 4000 gray-scale frames of 752 × 480 pixels. The images of the flying target UAV are captured by a camera mounted on another UAV in indoor and outdoor environments. The dataset proposed in [30] consists of two sub-datasets. The first is a Public-Domain drone dataset that contains 30 video sequences with different drone models captured in indoor and outdoor environments. The other one is the USC drone dataset that contains 30 video clips of the same target UAV. This dataset is acquired on the USC campus, and the background of most samples is a clear or cloudy sky, which is relatively simple compared to our proposed dataset.


In order to increase the number of images in the dataset, the authors of the USC dataset developed a model-based automatic data augmentation method to paste clipped drone model images into background images. Although the size of the data can be expanded in this way, the work in [31] shows that networks trained on such data may not be significantly improved by the augmentation. Very recently, a new dataset, named MIDGARD, was presented in [32]. This dataset contains different kinds of backgrounds and varying lighting conditions. It also proposed a new method for automatic annotation by using the authors' previous work on UltraViolet Direction And Ranging [33]. A detailed comparison between MIDGARD and our dataset is given in Section V.

III. THE PROPOSED DATASET

The proposed dataset, named Det-Fly, consists of 13 271 images of a target micro UAV (DJI Mavic). Each image has 3840 × 2160 pixels. Some images of the dataset are sampled from videos at 5 FPS and the others are captured from desired relative poses. All the images are manually annotated by professionals. Some sample images are given in Fig. 2. The dataset is available at https://github.com/Jake-WU/Det-Fly.

Det-Fly covers a wide range of scenarios including different viewing angles, background scenes, relative ranges, and lighting conditions. In particular, Det-Fly involves four types of environmental background: sky, urban, field, and mountain (see Fig. 2). Each type of environmental background occupies nearly the same proportion (about 20%–30%) of the entire dataset. In terms of relative viewing angles, Det-Fly can be split into three categories: front view, top view, and bottom view. The data proportions of the three viewing angles are, respectively, 36.4% (front view), 32.5% (top view), and 31.1% (bottom view).

Fig. 3. Statistical data of the target UAV size in our dataset. The blue points correspond to the UAV images whose width and height are less than 5% of the image size. The orange points represent the data whose sizes are less than 10% of the image size. The remaining red points are samples greater than 10%. Since the attitude of the camera may frequently change during flight, there are various height-width ratios of bounding boxes.

In terms of the image size of the target UAV, the statistical data given in Fig. 3 show that a large portion of the target UAV images in the dataset are small. In particular, nearly half of them are smaller than 5% of the entire image size. When the height and width of a target UAV image are smaller than 10% of the entire image, it could be regarded as a small object, whose detection is a well-known challenging task. In addition, since the lighting conditions are also important factors in flying UAV detection, the time of image collection varies from morning to evening in different periods of a day. The dataset also covers some challenging scenarios such as strong/weak lighting (10.8%), motion blur (11.2%), and partial target occlusion (0.8%).

Some remarks about the proposed dataset are given below. First, each image in this dataset only contains one single target UAV. However, the algorithms trained based on the dataset could naturally detect multiple UAVs, which is required by vision-based UAV swarming. Second, although the dataset covers a wide range of environmental scenarios, it is impossible to cover all possible scenarios. The primary purpose of establishing the dataset is to evaluate different deep learning algorithms. If one is interested in implementing a deep-learning approach in practice in a specific environmental scenario, the dataset should be adjusted to cover either the specific environment where the UAV detection is performed or more environmental scenarios to enhance the generalization ability. Third, this dataset only covers one single type of UAV (DJI Mavic). If one is interested in detecting more types of UAVs, other measures such as building up datasets of multiple types of UAVs or target motion sensing [2] may be required.

IV. EXPERIMENTAL SETUP

In this paper, we finely tune and evaluate eight classic deep-learning-based object detection methods: SSD [9], RetinaNet [8], YOLOv3 [10], RefineDet [13], Faster R-CNN [12], FPN [11], Cascade R-CNN [15], and Grid R-CNN [14]. These methods generate similar performance in terms of small-scale objects on the COCO dataset, in which mean Average Precision (mAP) is used as an evaluation metric.

According to the types of detection algorithms, the selected methods can be divided into two categories: one-stage networks and two-stage networks. A one-stage network does classification and regression directly on the feature map to achieve fast object detection. Among the selected methods, SSD, RetinaNet, YOLOv3, and RefineDet are one-stage networks. A two-stage network consists of a region proposal network (RPN) that proposes several candidate boxes and a classification and regression network that achieves recognition and localization for a specified object. Among the selected methods, Faster R-CNN, FPN, Cascade R-CNN, and Grid R-CNN are two-stage networks.
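To make the two-stage structure concrete, the following is a minimal inference sketch using torchvision's stock Faster R-CNN as a stand-in; the implementations and settings actually used in this paper are summarized in Table I and are not reproduced here.

```python
# Sketch of inference with a two-stage detector (RPN + classification/
# regression head), using torchvision's stock Faster R-CNN as a stand-in.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(num_classes=2)  # classes: background + UAV
model.eval()

# A dummy RGB tensor standing in for a (downscaled) Det-Fly frame.
image = torch.rand(3, 1080, 1920)

with torch.no_grad():
    # The RPN proposes candidate boxes internally; the second stage
    # classifies and refines them into final boxes, labels, and scores.
    prediction = model([image])[0]

keep = prediction["scores"] > 0.5  # assumed confidence threshold
uav_boxes = prediction["boxes"][keep]
```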
The primary hyper-parameters of the algorithms implemented in our work are given in Table I. Since ResNet achieves state-of-the-art performance on ImageNet, we adopt it as the backbone in most of the algorithms. Generally, ResNet has two versions in common use, named ResNet-50 and ResNet-101. In our work, we choose ResNet-50 over ResNet-101 because it is lighter and more suitable for implementation on embedded computers on micro UAVs. Since DarkNet-53 is widely used as the backbone of YOLOv3 and it exhibits similar performance to ResNet-50 [10], [34], we choose DarkNet-53 for YOLOv3 in our experiments.


TABLE I
THE HYPERPARAMETERS IN OUR IMPLEMENTATION OF THE EIGHT ALGORITHMS

The original optimizers are used. The learning rate (LR), momentum, weight decay, and number of iterations are finely tuned based on extensive tests.

Our experiments are implemented on a computer with an Intel i7 CPU, 32 GB of RAM, and an Nvidia RTX 2080Ti GPU, rather than an embedded computer, in order to reduce training time. We train the models on 70% of the images, of which 10% are used for validation, and test them on the remaining 30% of the images. In addition, we use non-maximum suppression (NMS) to remove overlapping bounding boxes, so that an object is only contained in one bounding box. As an important parameter in NMS to evaluate the overlapping rate of predicted bounding boxes, the IoU is defined as

$$A_o = \frac{\mathrm{area}(O_p \cap O_{gt})}{\mathrm{area}(O_p \cup O_{gt})}.$$

In our experiments, the IoU threshold is set to 0.5.
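As a concrete illustration (a sketch, not the authors' implementation), the IoU defined above and the greedy NMS step that uses it can be written as follows; boxes are assumed to be (x1, y1, x2, y2) corner coordinates.

```python
# Sketch of the IoU computation defined above and the greedy NMS step that
# uses it. Boxes are assumed to be (x1, y1, x2, y2) with x1 < x2, y1 < y2.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop boxes overlapping it beyond the
    # threshold, and repeat, so each object keeps a single bounding box.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```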
In the training stage, we set the number of training epochs to eight and save the model parameters at each epoch. If the training loss and validation loss remain stable, we conclude that the detector is well trained. Otherwise, we modify the epoch setting and resume training until the model is well trained.

Precision is a metric that penalizes false detections. The calculation of Precision in this paper is the same as in general visual object detection: all predicted boxes are traversed to calculate it. If the UAV is successfully detected, the predicted bounding box is regarded as a true positive (TP). Otherwise, it is regarded as a false positive (FP). Precision is defined as

$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$

Recall is a metric that penalizes missed detections and is defined as

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
The performance of an object detector can be evaluated by the Precision × Recall (P-R) curve, which trades off false detections against missed detections for varying thresholds. However, P-R curves are often zigzag curves going up and down that tend to cross each other frequently, so it is usually not easy to compare different curves (different detectors) in the same plot. Instead, a numerical metric called Average Precision (AP) can help us compare different detectors. AP is the area under the curve (AUC) of the P-R curve, and areas are easy to compare. Thus, we use AP as the evaluation metric.
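As an illustration of how AP is obtained from the P-R curve (a sketch under assumed inputs, not the evaluation code used in this paper):

```python
# Sketch of computing AP as the area under the Precision-Recall curve.
# `matches` is an assumed list of (score, is_tp) pairs, one per predicted
# box (a prediction counts as a TP when it overlaps a ground-truth box
# with IoU >= 0.5); `num_gt` is the total number of ground-truth boxes.
import numpy as np

def average_precision(matches, num_gt):
    if not matches or num_gt == 0:
        return 0.0
    matches = sorted(matches, key=lambda m: m[0], reverse=True)
    tp = np.cumsum([1.0 if is_tp else 0.0 for _, is_tp in matches])
    fp = np.cumsum([0.0 if is_tp else 1.0 for _, is_tp in matches])
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # Replace the zigzag curve by its monotone envelope before integrating,
    # which removes the up-and-down behavior mentioned above.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([precision[0]], precision))
    return float(np.sum(np.diff(recall) * precision[1:]))
```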
V. EVALUATION RESULTS

A. Average Precision

The APs of the eight algorithms are shown in Table II. Grid R-CNN achieves the best performance (82.4%) among all detectors, while RefineDet achieves the worst (69.5%). Among two-stage networks, Cascade R-CNN achieves the best performance (79.4%), whereas Faster R-CNN, which is the main framework of the two-stage networks in our experiments, achieves the worst (70.5%). For one-stage networks, SSD512 (78.7%) and RetinaNet (77.9%) both perform well, whereas YOLOv3 achieves only 72.3%. Although one-stage networks sacrifice detection performance to obtain high implementation efficiency, SSD512 achieves the same AP as FPN, which suggests that SSD512 could be a good alternative for tasks requiring high computational efficiency.

To further evaluate the performance of the algorithms, we split the testing set into two sets. One set, named Det-Fly-Simple, contains images with a relatively simple background (e.g., clear sky), short sensor-target range, and low flight speed. The other set, named Det-Fly-Complex, consists of more complex backgrounds (e.g., complex urban scenes) and small target sizes. Both sets contribute about 50% of the images of the entire dataset. The evaluation results on Det-Fly-Simple suggest that the two-stage networks Cascade R-CNN and Grid R-CNN achieve the highest AP (more than 82.0%) among all the eight networks. Among one-stage networks, RetinaNet and SSD512 achieve the best performance (nearly 81.0%). Except for RefineDet and YOLOv3, the performance of the other algorithms is higher than 80.0%. Compared with Det-Fly-Simple, the detection performance of most of the algorithms on Det-Fly-Complex drops sharply, by nearly an average of 5.0%, due to the high complexity of Det-Fly-Complex. The mean Precision of the algorithms could only achieve 74.4%. In particular, Grid R-CNN still achieves the best performance and is also the only one exceeding 80.0%. RetinaNet and SSD512, which have similar performance, still perform best within the one-stage networks. In general, two-stage networks perform a little better than one-stage networks in this test.

In summary, Grid R-CNN and Cascade R-CNN show stable and superior performance compared to the others in all evaluation scenarios. The one-stage networks SSD512 and RetinaNet also show stable and good performance. Since they could achieve higher computational speed, SSD512 and RetinaNet may be a good choice for tasks with limited computational resources.


TABLE II
THE AP OF THE EIGHT ALGORITHMS TESTED ON DET-FLY (%)

TABLE III
THE AP FOR DIFFERENT ENVIRONMENTAL BACKGROUND SCENES (%)

* F: field, U: urban, M: mountain, S: sky

TABLE IV
THE AP FOR DIFFERENT CHALLENGING CONDITIONS (%)

* S: Strong/weak light, M: Motion blur, P: Partial occlusion.

Fig. 4. The inference time of all algorithms in our experiment.

B. Network Attributes Affecting UAV Detection

The inference speed of the algorithms is an important aspect for practical implementation, especially in onboard embedded
systems. Fig. 4 shows the average inference time of the eight
deep learning algorithms in our experiments. As can be seen,
one-stage networks have a faster inference speed than two-stage networks. Although Grid R-CNN achieves the best AP performance among all algorithms, it is also the most time-consuming one. The inference time of YOLOv3 (32 ms) is nearly one-fifth of that of Grid R-CNN (157 ms). If computational efficiency is the priority for an application, YOLOv3 is recommended since it is the fastest and its performance is better than that of two other algorithms (RefineDet and Faster R-CNN), as shown in Table II.

All the compared models except YOLOv3 were implemented with the ResNet-50 backbone network in our experiments. Although ResNet-50 has already been run in real time on some embedded devices, one may be interested in the performance with an even lighter backbone network. To this end, we tested SSD512 on our dataset with MobileNetv2 as the backbone. The resulting AP on the dataset is 68.8%, which is nearly 10% less than the result of SSD512 with ResNet-50. However, the inference time of SSD512 with MobileNetv2 (53 ms) is much shorter than that of SSD512 with ResNet-50 (84 ms). Therefore, lighter backbones such as MobileNet may be considered when the onboard computational resource is extremely limited.

The different performance of FPN and Faster R-CNN suggests that the FPN network structure can improve UAV detection capability significantly. Since Grid R-CNN and Cascade R-CNN show superior and more robust performance than Faster R-CNN, this suggests that the grid-guided mechanism and multi-stage structures can generate better-regressed bounding boxes. While using multi-stage structures costs more time, the grid mechanism is highly recommended for future detector design. Furthermore, the performance of RefineDet is weaker than that of SSD512, which may suggest that high-resolution input could also improve the UAV detection capability. In addition, RetinaNet shows good and stable performance among one-stage networks, which suggests that focal loss may be a recommended method to solve the problem of class imbalance.
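Since focal loss is singled out here, a minimal sketch of RetinaNet's focal loss [8] for binary classification is given below, assuming the hyper-parameters γ = 2 and α = 0.25 used in the original paper; it down-weights well-classified examples so that the abundant easy background does not dominate training.

```python
# Sketch of the focal loss [8] for binary classification. It scales the
# per-example cross-entropy by (1 - p_t)^gamma, so well-classified (mostly
# background) examples contribute little, addressing class imbalance.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits and targets are float tensors of the same shape; targets are 0/1.
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```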
C. Image Attributes Affecting UAV Detection

We next evaluate the impact of some key aspects of the images, such as environmental background, target scales, viewing angles, and other challenging conditions, on the detection performance. Since the performance of an algorithm could be affected by many aspects such as insufficient training and different parameters, we take the mean Average Precision (mAP) of these algorithms as the criterion for a fair evaluation.

1) Environmental Background: The complexity of the background scene has a great impact on UAV detection performance. Table III shows the APs of the algorithms for different types of environmental background. In particular, the mAP suggests that the sky (88.3%) is the easiest type of background for UAV detection, while urban (62.0%) is the hardest. This is consistent with our intuition that the complex urban background makes visual UAV detection very challenging.


TABLE V
THE AP OF THE EIGHT ALGORITHMS TESTED ON MIDGARD (%)

Fig. 5. The AP of the algorithms for different target scales. If both the width and height of the annotated bounding box are, respectively, smaller than x (x ∈ {1/40, 1/20, 1/10}) of the width and height of the entire image, then it is classified as < x[W, H]. The AP is calculated by the algorithms with the data in these intervals. The mAP represents the mean AP of the eight algorithms in each scale interval.

Fig. 6. The AP for different viewing angles. The figure is divided into three parts: Top (top view), Fro (front view), and Bot (bottom view). The vertical axis of each part, which is the AP of the algorithms, ranges from 0.5 to 1.0. The mAP of each part is about 0.78 (Top), 0.72 (Fro), and 0.85 (Bot), respectively. The markers in each part represent the performance of the algorithms.

As for the performance of the algorithms, Grid R-CNN shows consistent and high Precision across different types of background scenes, whereas the performance of Faster R-CNN and RefineDet drops rapidly when the background complexity increases.

2) Target Scales: The size of the target UAV in the image has a great impact on detection performance. Fig. 5 shows the APs of all the algorithms with respect to the target size/ratio. As shown in the figure, the APs of all algorithms increase at different rates when the target scale increases. In particular, Grid R-CNN shows the best performance for different target scales, whereas the performance of RefineDet and Faster R-CNN drops rapidly when the target scale becomes small.
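The scale breakdown above can be reproduced with a few lines (a sketch under an assumed corner-coordinate annotation format), following the < x[W, H] rule in the caption of Fig. 5:

```python
# Sketch of binning annotated boxes by relative scale, following the
# "< x[W, H]" rule in the caption of Fig. 5. A box annotation is assumed
# to be (x1, y1, x2, y2) in pixels for an image of size img_w x img_h.
def scale_bin(box, img_w, img_h):
    x1, y1, x2, y2 = box
    w_ratio = (x2 - x1) / img_w
    h_ratio = (y2 - y1) / img_h
    for x in (1 / 40, 1 / 20, 1 / 10):
        if w_ratio < x and h_ratio < x:
            return f"< {x:.3f}[W, H]"
    return ">= 1/10 [W, H]"

# Example: a 95x53 box in a 3840x2160 Det-Fly image falls in the smallest bin.
print(scale_bin((0, 0, 95, 53), 3840, 2160))  # -> "< 0.025[W, H]"
```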
3) Viewing Angles: It is noticed from our experiments that the viewing angle of the target UAV also has an impact on the detection performance. Fig. 6 shows the AP for different viewing angles. It can be seen that the bottom view leads to the highest Precision, whereas the front view leads to the lowest. The reason is that, for the bottom-view cases, the target shows rich geometric information, and in the meantime, the background scene is a blue or cloudy sky. However, for the front-view cases, the target is flat and hence shows less geometric information, and in the meantime, the background could be more complex than in the bottom-view case.

4) Other Challenging Conditions: The dataset covers some challenging conditions such as strong/weak lighting, motion blur, and partial occlusion. The ratios of the images of the three scenarios in our dataset are 10.8%, 11.2%, and 0.8%, respectively. Here, partial occlusion refers to the case where part of the target UAV is out of the field of view. All the images in these cases can be found online in our dataset.

The testing results of the eight algorithms under the three challenging conditions are reported in Table IV. It is notable that partial occlusion causes a much lower AP. Part of the reason is that partially occluded target detection is indeed a challenging task, and in the meantime, the images of this case only occupy a small proportion of the dataset. On the other hand, strong/weak lighting conditions and motion blur do not compromise the performance significantly, which verifies the robustness of the deep learning algorithms.

D. Comparison With the State-of-the-Art Dataset

To the best of our knowledge, MIDGARD is the latest comprehensive dataset designed for deep-learning-based micro-UAV detection [32]. Compared to MIDGARD, the annotation bounding box of each image in Det-Fly is tighter, because the images in Det-Fly are annotated one by one manually by professionals, whereas the images in MIDGARD are automatically annotated based on UVDAR and relative pose estimation. Moreover, Det-Fly covers a wider range of relative target distances. In particular, the longest relative target distance in Det-Fly reaches more than 100 m, but the longest distance in MIDGARD is less than 20 m. Due to the wide range of relative distances, the scale of the target UAV in Det-Fly is more diverse.

The eight algorithms have been trained and tested on MIDGARD. The testing results are shown in Table V. As can be seen, the results on MIDGARD are 10% better than those on Det-Fly. This might be caused by the complexity and diversity of the samples in Det-Fly.


VI. CONCLUSION

This letter presented a new dataset, named Det-Fly, for air-to-air UAV detection and evaluated eight representative deep-learning algorithms based on this dataset. Not only is the overall performance of the algorithms carefully evaluated and compared, but the impact of environmental background, target scales, viewing angles, and other challenging conditions on the detection performance is also analyzed. According to the experimental results, suggestions are given on how to design algorithms that achieve better detection performance in the future.

In the future, to detect unknown UAVs in various environments, the dataset should be further enhanced by adding more types of UAVs and background scenarios. Moreover, an ablation study is necessary to design deep-learning algorithms that are specifically suitable for UAV detection tasks and can be implemented onboard. In addition, interpretability techniques may be adopted to explain why the recommended network structures or methods could improve detection performance. Algorithms that are able to process high-resolution images also need more attention.

REFERENCES

[1] Y. Tang et al., "Vision-aided multi-UAV autonomous flocking in GPS-denied environment," IEEE Trans. Ind. Electron., vol. 66, no. 1, pp. 616–626, Jan. 2019.
[2] J. Xie, J. Yu, J. Wu, Z. Shi, and J. Chen, "Adaptive switching spatial-temporal fusion detection for remote flying drones," IEEE Trans. Veh. Technol., vol. 69, no. 7, pp. 6964–6976, Jul. 2020.
[3] R. Mitchell and I. Chen, "Adaptive intrusion detection of malicious unmanned air vehicles using behavior rule specifications," IEEE Trans. Syst., Man, Cybern. Syst., vol. 44, no. 5, pp. 593–604, May 2014.
[4] J. Zhang, C. Hu, R. G. Chadha, and S. Singh, "Maximum likelihood path planning for fast aerial maneuvers and collision avoidance," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2019, pp. 2805–2812.
[5] J. Ren and X. Jiang, "Regularized 2-D complex-log spectral analysis and subspace reliability analysis of micro-Doppler signature for UAV detection," Pattern Recognit., vol. 69, pp. 225–237, 2017.
[6] A. Bernardini, F. Mangiatordi, E. Pallotti, and L. Capodiferro, "Drone detection by acoustic signature identification," Electron. Imag., vol. 2017, no. 10, pp. 60–64, 2017.
[7] R. Yoshihashi, T. T. Trinh, R. Kawakami, S. You, M. Iida, and T. Naemura, "Differentiating objects by motion: Joint detection and tracking of small flying objects," 2017, arXiv:1709.04666.
[8] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2017, pp. 2980–2988.
[9] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.
[10] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[11] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 936–944.
[12] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[13] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Single-shot refinement neural network for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4203–4212.
[14] X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan, "Grid R-CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7363–7372.
[15] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6154–6162.
[16] F. Lin, K. Peng, X. Dong, S. Zhao, and B. M. Chen, "Vision-based formation for UAVs," in Proc. IEEE Int. Conf. Control Automat., 2014, pp. 1375–1380.
[17] F. Gökçe, G. Üçoluk, E. Şahin, and S. Kalkan, "Vision-based detection and distance estimation of micro unmanned aerial vehicles," Sensors, vol. 15, no. 9, pp. 23805–23846, 2015.
[18] K. R. Sapkota et al., "Vision-based unmanned aerial vehicle detection and tracking for sense and avoid systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2016, pp. 1556–1561.
[19] J. Li, D. H. Ye, T. Chung, M. Kolsch, J. Wachs, and C. Bouman, "Multi-target detection and tracking from a single camera in unmanned aerial vehicles (UAVs)," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2016, pp. 4992–4997.
[20] S. Minaeian, J. Liu, and Y.-J. Son, "Effective and efficient detection of moving targets from a UAV's camera," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 2, pp. 497–506, Feb. 2018.
[21] R. Opromolla, G. Fasano, and D. Accardo, "A vision-based approach to UAV detection and tracking in cooperative applications," Sensors, vol. 18, no. 10, 2018.
[22] Y. Wu, Y. Sui, and G. Wang, "Vision-based real-time aerial object localization and tracking for UAV sensing system," IEEE Access, vol. 5, pp. 23969–23978, 2017.
[23] S. Minaeian, J. Liu, and Y. Son, "Vision-based target detection and localization via a team of cooperative UAV and UGVs," IEEE Trans. Syst., Man, Cybern. Syst., vol. 46, no. 7, pp. 1005–1016, Jul. 2016.
[24] F. Lin, X. Dong, B. M. Chen, K. Lum, and T. H. Lee, "A robust real-time embedded vision system on an unmanned rotorcraft for ground target following," IEEE Trans. Ind. Electron., vol. 59, no. 2, pp. 1038–1049, Feb. 2012.
[25] A. Rozantsev, V. Lepetit, and P. Fua, "Detecting flying objects using a single moving camera," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 5, pp. 879–892, May 2017.
[26] J. James, J. J. Ford, and T. L. Molloy, "Learning to detect aircraft for long-range vision-based sense-and-avoid systems," IEEE Robot. Automat. Lett., vol. 3, no. 4, pp. 4383–4390, Oct. 2018.
[27] C. Aker and S. Kalkan, "Using deep networks for drone detection," in Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill., 2017, pp. 1–6.
[28] A. Schumann, L. Sommer, J. Klatte, T. Schuchert, and J. Beyerer, "Deep cross-domain flying object classification for robust UAV detection," in Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill., 2017, pp. 1–6.
[29] A. Rozantsev, V. Lepetit, and P. Fua, "Flying objects detection from a single moving camera," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4128–4136.
[30] Y. Chen, P. Aggarwal, J. Choi, and C. C. J. Kuo, "A deep learning approach to drone monitoring," in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit Conf., 2017, pp. 686–691.
[31] X. Peng, B. Sun, K. Ali, and K. Saenko, "Learning deep object detectors from 3D models," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1278–1286.
[32] V. Walter, M. Vrba, and M. Saska, "On training datasets for machine learning-based visual relative localization of micro-scale UAVs," in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 10674–10680.
[33] V. Walter, M. Saska, and A. Franchi, "Fast mutual relative localization of UAVs using ultraviolet LED markers," in Proc. Int. Conf. Unmanned Aircr. Syst., 2018, pp. 1217–1226.
[34] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
