
Vision-Based Learning for Drones: A Survey


Jiaping Xiao, Graduate Student Member, IEEE, Rangya Zhang, Yuhang Zhang, and Mir Feroskhan, Member, IEEE

Abstract—Drones, as advanced cyber-physical systems (CPSs), are undergoing a transformative shift with the advent of vision-based learning, a field that is rapidly gaining prominence due to its profound impact on drone autonomy and functionality. Unlike existing task-specific surveys, this work offers a comprehensive overview of vision-based learning for drones, emphasizing its pivotal role in enhancing their operational capabilities across various scenarios. First, the fundamental principles of vision-based learning are elucidated, demonstrating how it significantly improves drones' visual perception and decision-making processes. Vision-based control methods are then categorized into indirect, semi-direct, and end-to-end approaches from the perception-control perspective. Various applications of vision-based drones with learning capabilities are further explored, ranging from single-agent systems to more complex multiagent and heterogeneous system scenarios, while highlighting the challenges and innovations characterizing each domain. Finally, open questions and potential solutions are discussed to guide future research and development in this dynamic and rapidly evolving field. With the growth of large language models (LLMs) and embodied intelligence, vision-based learning for drones provides a promising yet challenging road towards achieving artificial general intelligence (AGI) in the 3D physical world.

Index Terms—Drones, learning systems, robot learning, embodied intelligence.

Fig. 1: Applications of vision-based drones. (a) Parcel delivery; (b) Photography; (c) Precision agriculture; (d) Power grid inspection.

This work was supported by the Economic Development Board (EDB) under its Space Technology Development Programme (STDP) Thematic Grant Call on Space Technologies (Award S23-020019-STDP). (Corresponding author: Mir Feroskhan; Jiaping Xiao.) J. Xiao, R. Zhang, Y. Zhang and M. Feroskhan are with the School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore 639798, Singapore (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

I. INTRODUCTION

Drones are intelligent cyber-physical systems (CPSs) [1], [2] that rely on functional sensing, system communication, real-time computation, and flight control to achieve perception and autonomous flight. Owing to their high autonomy and maneuverability, drones [3] have been widely used in various missions, such as industrial inspection [4], precision agriculture [5], parcel delivery [6], [7] and search-and-rescue [8] (some applications are shown in Fig. 1). As such, drones are set to become an essential feature of the service industry in tomorrow's smart cities. For future applications, many novel drones and functions are under development with the advent of advanced materials (adhesive and flexible materials [9]), the miniaturization of electronic and optical components (sensors, microprocessors), onboard computers (NVIDIA Jetson, Intel NUC, Raspberry Pi, etc.), batteries (LiPo batteries, Sunlight Power [10]), and localization systems (SLAM, UWB, GPS, etc.). Meanwhile, drone functionality is becoming more complex and intelligent due to the rapid advancement of artificial intelligence (AI) and onboard computation capability. From the perspective of innovation, there are three main directions for the future development of drones:

1) The miniaturization of drones. Micro and nano drones can accomplish missions at low cost and without space constraints. Small autonomous drones have spurred scientists to draw more inspiration from biology, such as bees [11] and flapping birds [12].
2) Novel drone designs. Structural and aerodynamic design enables drones to achieve increased maneuverability and improved flight performance. Tilting, morphing, and folding structures and actuators are widely studied in drone design and control [7], [13]–[15].
3) The autonomy of drones. Drone autonomy enables autonomous navigation and task execution, which requires real-time perception and onboard computation capabilities. Drones that integrate visual sensors, efficient online planning, and learning-based decision-making algorithms have notably enhanced autonomy and intelligence [16], [17], even beating world-class champions in drone racing with a vision-based learning system called Swift [18].

Recently, to improve drone autonomy, vision-based learning drones, which combine advanced sensing with learning capabilities, have been attracting growing attention (see the rapid growth trend in Fig. 2). With such capabilities, drones are even progressing towards artificial general intelligence (AGI) in the 3D physical world, especially when integrated with rapidly growing large language models (LLMs) [19], [20] and embodied intelligence [21]. Existing vision-based drone-related surveys focus only on specific tasks and applications, such as autonomous systems
navigation [22], uncrewed aerial vehicle (UAV) navigation [23], [24], vision-based UAV landing [25], obstacle avoidance [26], vision-based inspection with UAVs [27], [28], aerial detection [29] and autonomous drone racing [30], which limits the understanding of vision-based learning in drones from a holistic perspective. Therefore, this survey provides a comprehensive review of vision-based learning for drones to deliver a more general view of current drone autonomy and learning technologies, covering background, visual perception, vision-based control, datasets and simulators, applications and challenges, and open questions with potential solutions. To summarize, our contributions are:

1) We discuss the development of vision-based drones with learning capabilities and analyze their core components, especially visual perception and machine learning (ML) applied in drones. We further highlight object detection with visual perception and how it benefits drone applications.
2) We discuss the current state of vision-based control methods for drones and categorize them into indirect, semi-direct, and end-to-end methods from the perception-control perspective. This perspective helps in understanding vision-based control methods and differentiating their characteristic features.
3) We provide publicly available datasets and simulators for beginners, summarize applications of vision-based learning drones in single-agent systems, multiagent systems (MAS), and heterogeneous systems, and discuss the corresponding challenges in each application.
4) We explore several open questions that can hinder the development and applicability of vision-based learning for drones, and discuss potential solutions for each question.

The rest of this survey is organized as follows: Section II discusses the concept of vision-based learning drones and their core components; Section III summarizes object detection with visual perception and its application to vision-based drones; the vision-based control methods for drones are introduced and categorized in Section IV; Section V provides publicly available datasets and simulators; the applications and challenges of vision-based learning for drones are discussed in Section VI; the open questions faced by vision-based learning drones and potential solutions are listed in Section VII; Section VIII summarizes and concludes this survey.

Fig. 2: Number of related publications in Google Scholar using the keyword "vision based learning drone".

II. BACKGROUND

A. Vision-based Learning Drones

Fig. 3: General framework of vision-based drones.

A typical vision-based drone consists of three parts (see Fig. 3): (1) visual perception, which senses the environment around the drone via monocular or stereo cameras; (2) image processing, which extracts features from an observed image sequence and outputs specific patterns or information, such as navigation information, depth information, and object information; and (3) the flight controller, which generates high-level and low-level commands for the drone to perform assigned missions. Image processing and flight control are generally performed on the onboard computer, while visual perception relies on the performance of the visual sensors. Vision-based drones have been widely used in traditional missions such as environmental exploration [31], navigation [16], and obstacle avoidance [32]. With efficient image processing and simple path planners, they can avoid dynamic obstacles effectively [33].

Currently, vision-based learning drones, which utilize visual sensors and efficient learning algorithms, have achieved remarkable performance in a series of standardized visual perception and decision-making tasks, such as agile flight control [34], navigation [35] and obstacle avoidance [18]. Various cases have showcased the power of learning algorithms in improving the agility and perception capabilities of vision-based drones. For instance, using only depth cameras, inertial measurement units (IMUs) and a lightweight onboard computer, the vision-based learning drone in [34] succeeded in performing high-speed flight in unseen and unstructured wild environments. The controller of the drone in [34] was trained in a high-fidelity simulation and transferred to a physical platform. Following that, the Swift system was developed in [18] to achieve world champion-level autonomous drone racing with a tracking camera and a shallow neural network. Such vision-based learning drones are leading the future of drones thanks to their perception and learning capabilities in complex environments. Similarly, with event cameras and a trained narrow neural network, EVDodgeNet, the vision-based learning drone in [36] was able to dodge multiple dynamic obstacles (balls) during flight. Afterwards, to improve the
perception onboard in the real world, an uncertainty estimation module was trained with the Ajna network [37], which significantly increased the generalization capability of learning-based control for drones. The power of deep learning (DL) in handling uncertain information frees traditional approaches from complex computation that would otherwise require accurate modeling.

B. Visual Perception

Visual perception for drones is the ability of drones to perceive their surroundings and their own states by extracting the features necessary for specific tasks with visual sensors. Light detection and ranging (LIDAR) and cameras are the most commonly used sensors (see Fig. 4) for drones to perceive the surrounding environment.

Fig. 4: Visual perception sensors for drones. (a) Velodyne surrounding LIDAR; (b) Intel D435 RGBD camera; (c) Sony event camera.

1) Light Detection and Ranging: LIDAR is a type of active range sensor that relies on the time of flight (TOF) between the transmitted and received laser beams to estimate the distance between the robot and the reflecting surfaces of objects [38]. Based on the scanning mechanism, LIDAR can be divided into solid-state LIDAR, which has a fixed field of view (FOV) without moving parts, and surrounding LIDAR, which spins to provide a 360-degree horizontal view. Surrounding LIDAR is also referred to as "laser scanning" or "3D scanning"; it creates a 3D representation of the explored environment using eye-safe laser beams. A typical LIDAR (see Fig. 4(a)) consists of laser emitters, laser receivers, and a spinning motor. The vertical FOV of a LIDAR is determined by the number of vertically arrayed lasers. For instance, a vertical array of 16 lasers scanning 30 degrees gives a vertical resolution of 2 degrees in a typical configuration. LIDAR has recently been used on drones for mapping [39], power grid inspection [40], pose estimation [41] and object detection [42]. LIDAR provides sufficient and accurate depth information for drones to navigate in cluttered environments. However, it is bulky and power-hungry and does not fit within the payload restrictions of agile autonomous drones. Meanwhile, the raycast representation used in simulation environments is hard to match to the inputs of a real LIDAR device, which creates many challenges for Sim2Real transfer when a learning approach is considered.
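To make the TOF relation above concrete, the following minimal Python sketch (illustrative only, not code from any surveyed system) converts a round-trip time into range and computes the vertical angular resolution of a 16-beam, 30-degree array:

```python
# Minimal sketch: LIDAR time-of-flight to range, plus vertical angular resolution
# of a spinning LIDAR estimated from its beam count and vertical FOV.

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_to_range(round_trip_time_s: float) -> float:
    """Range to the reflecting surface; the beam travels out and back."""
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0

def vertical_resolution_deg(num_beams: int, vertical_fov_deg: float) -> float:
    """Angular spacing between adjacent beams of a vertical laser array."""
    return vertical_fov_deg / (num_beams - 1)

if __name__ == "__main__":
    print(f"range: {tof_to_range(1e-6):.2f} m")                        # ~149.90 m for a 1 us round trip
    print(f"resolution: {vertical_resolution_deg(16, 30.0):.1f} deg")  # 2.0 deg, as in the text
```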
2) Camera: Compared to LIDAR, cameras provide a cheap and lightweight way for drones to perceive the environment. Cameras are external passive sensors used to monitor the drone's geometric and dynamic relationship to its task, its environment, or the objects that it is handling. They are commonly used perception sensors for drones to sense environmental information, such as the positions of objects and a point cloud map of the environment. In contrast to a motion capture system, which can only broadcast global geometric and dynamic pose information within a limited space from an offboard synchronized system, cameras enable a drone to fly without space constraints. Cameras can provide positioning for drone navigation in GPS-denied environments via visual inertial odometry (VIO) [43]–[45] and visual simultaneous localization and mapping (V-SLAM) [35], [46], [47]. Meanwhile, object detection and depth estimation can be performed with cameras to obtain the relative positions and sizes of obstacles [48]. However, avoiding dynamic obstacles, and even physical attacks such as bird chasing, poses fundamental challenges to visual perception on agile vision-based drones. Motion blur, sparse-texture environments, and unbalanced lighting conditions can cause the loss of feature detection in VIO and object detection. LIDAR and event cameras [49] can partially address these challenges, but they are either too bulky or too expensive for agile drone applications. Considering the agility required for physical attack avoidance, lightweight dual-fisheye cameras are used for visual perception. With dual fisheye cameras, a drone can achieve better navigation capability [50] and omnidirectional visual perception [48], [51]. Sensor fusion and state estimation techniques are required to alleviate the accuracy loss caused by motion blur.

C. Machine Learning

Recently, ML, especially DL, has attracted much attention from various fields and has been widely applied to robotics for environmental exploration [55], [56], navigation in unknown environments [57]–[59], obstacle avoidance, and intelligent control [60]. In the domain of drones, learning-based methods have also achieved promising success, particularly deep reinforcement learning (DRL) [18], [34], [61]–[65]. In [63], curriculum-learning-augmented end-to-end reinforcement learning (RL) was proposed for a UAV to fly through a narrow gap in the real world. A vision-based end-to-end learning method was successfully developed in [34] to fly agile quadrotors through complex wild and human-made environments with only onboard sensing and computation, such as depth information. A visual drone swarm was developed in [65] to perform collaborative target search with adaptive-curriculum-embedded multistage learning. These works verified the remarkable power of learning-based methods in drone applications, pushing the agility and cooperation of drones to a level that classical approaches can hardly reach. Unlike classical approaches, which rely on separate mapping, localization, and planning, learning-based methods map observations, such as visual information or the localization of obstacles, to commands directly without
further planning. This greatly helps drones handle uncertain information during operations. However, learning-based methods require massive experience and training datasets to obtain good generalization capability, which poses another challenge for deployment in unknown environments.

III. OBJECT DETECTION WITH VISUAL PERCEPTION

Object detection is a pivotal module in vision-based learning drones when handling complex missions such as inspection, avoidance, and search and rescue. Object detection aims to find all the objects of interest in an image and determine their positions and sizes [66]. It is one of the core problems in the field of computer vision (CV). Nowadays, the applications of object detection include face detection, pedestrian detection, vehicle detection, and terrain detection in remote sensing images. Object detection has always been one of the most challenging problems in CV due to the different appearances, shapes, and poses of various objects, as well as interference from factors such as illumination and occlusion during imaging. At present, object detection algorithms can be roughly divided into two categories: multi-stage (two-stage) algorithms, which first generate candidate regions and then perform classification, and one-stage algorithms, which directly apply the algorithm to the input image and output the categories and corresponding positions. Beyond that, to retrieve 3D positions, depth estimation has been a popular research subbranch related to object detection, whether using monocular [67] or stereo depth estimation [68]. For a very long time, the core neural network module (backbone) of object detection has been the convolutional neural network (CNN) [69]. The CNN is a classic neural network in image processing that originates from the study of the human optic nerve system. The main idea is to convolve the image with convolution kernels to obtain a series of recombined features, and these recombined features represent the important information of the image. As such, a CNN not only has the ability to recognize images but also effectively decreases the requirement for computing resources. Recently, vision transformers (ViTs) [70], originally proposed for image classification tasks, have been extended to the realm of object detection [71]. These models demonstrate superior performance by utilizing the self-attention mechanism, which processes visual information non-locally [72]. However, a major concern with ViTs is their high computational demand, which presents difficulties in achieving real-time inference, particularly on drone platforms with limited resources.

Fig. 5: Typical multi-stage object detection methods. (a) R-CNN neural network architecture [52]; (b) Fast R-CNN neural network architecture [53]; (c) Faster R-CNN neural network architecture [54].

A. Multi-stage Algorithms

Classic multi-stage algorithms include the region-based CNN (R-CNN) [52], Fast R-CNN [53] and Faster R-CNN [54] (see Fig. 5). Multi-stage algorithms can basically meet the accuracy requirements of real-life scenarios, but the models are more complex and cannot really be applied to scenarios with high-efficiency requirements. In the R-CNN structure [52], it is necessary to first generate some region proposals (RPs), then use convolutional layers for feature extraction, and then classify the regions according to these features. That is, the object detection problem is transformed into an image classification problem. The R-CNN model is very intuitive, but its disadvantage is that it is too slow, and the output is obtained by training multiple Support Vector Machines (SVMs). To solve the problem of slow training speed, the Fast R-CNN model was proposed (Fig. 5(b)). This model makes two improvements to R-CNN: (1) it passes the whole image through the convolutional layers once, so that the features for all RPs can be obtained from a single shared feature map; (2) it replaces the multiple SVMs with a single fully-connected layer followed by a softmax layer. These techniques greatly improve the computation speed but still fail to address the efficiency issue of the Selective Search Algorithm (SSA) for RP generation.

Faster R-CNN improves on Fast R-CNN (see Fig. 5(c)). To solve the problem of the SSA, the SSA that generates RPs in Fast R-CNN is replaced by a Region Proposal Network (RPN), yielding a model that integrates RP generation, feature extraction, object classification and bounding box regression. The RPN is a fully convolutional network that simultaneously predicts object boundaries at each location. The RPN is trained end-to-end to generate high-quality region proposals, which are then detected by Fast R-CNN. At the same time, the RPN and Fast R-CNN share convolutional features.
Meanwhile, in the feature extraction stage, Faster R-CNN uses a CNN. The model achieves 73.2% and 70.4% mean Average Precision (mAP) on the PASCAL VOC 2007 and 2012 datasets, respectively. Faster R-CNN is much faster than Fast R-CNN, its accuracy reached the state of the art (SOTA), and it established a fully end-to-end object detection framework. However, Faster R-CNN still cannot achieve real-time object detection: after obtaining the RPs, it requires heavy computation to classify each RP.

Currently, the R-CNN series is still widely used in drone applications and has achieved remarkable performance in object detection and tracking over various benchmark datasets, such as VisDrone [73] and Det-Fly [74]. Faster R-CNN was used in [75] to identify asbestos slate locations in buildings and obtained an accuracy of 98.9%. [76] achieved low-speed navigation in tree plantations with a trained Faster R-CNN for tree trunk detection and a control strategy to avoid collision. Multi-stage algorithms like Faster R-CNN are beneficial for target tracking and surveillance tasks due to their high accuracy. However, their computational load makes them unsuitable for autonomous landing or real-time obstacle avoidance on drones, where low-latency performance is necessary.
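As a concrete illustration of using an off-the-shelf multi-stage detector of this kind, the sketch below runs a pre-trained torchvision Faster R-CNN on a single frame. It assumes a recent torchvision release; the confidence threshold and input are illustrative choices, not settings from the cited drone works.

```python
# Minimal sketch: off-the-shelf Faster R-CNN inference with torchvision
# (downloads pre-trained COCO weights on first use; threshold is illustrative).
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect(image_chw: torch.Tensor, score_threshold: float = 0.5):
    """image_chw: float tensor, shape (3, H, W), values in [0, 1]."""
    with torch.no_grad():
        output = model([image_chw])[0]  # dict with 'boxes', 'labels', 'scores'
    keep = output["scores"] >= score_threshold
    return output["boxes"][keep], output["labels"][keep], output["scores"][keep]

if __name__ == "__main__":
    frame = torch.rand(3, 480, 640)  # stand-in for a camera frame
    boxes, labels, scores = detect(frame)
    print(boxes.shape, labels.shape, scores.shape)
```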
B. One-stage Algorithms

One-stage algorithms such as the Single Shot Multibox Detector (SSD) model [77] and the YOLO series models [78] are generally slightly less accurate than two-stage algorithms, but they have simpler architectures, which facilitates end-to-end training and makes them more suitable for real-time object detection. The basic process of YOLO (see Fig. 6) is divided into three phases: resizing the image, passing it through a fully convolutional network, and applying non-maximum suppression (NMS). The main advantages of the YOLO model are that it is fast, it makes few background errors thanks to its global processing, and it has good generalization performance. Meanwhile, YOLO formulates the detection task as a unified, end-to-end regression problem, obtaining locations and classifications by processing the image only once. However, YOLO also has some problems, such as its coarse grid, which limits its performance on small objects. The subsequent YOLOv3, YOLOv5, YOLOX [79], YOLOv8, and YOLOv10 improved the network on the basis of the original YOLO and achieved better detection results.

Fig. 6: YOLO network architecture [78].

Fig. 7: SSD network architecture [77].

SSD is another classic one-stage object detection algorithm. The SSD pipeline is to (1) extract features from the image through a CNN, (2) generate feature maps, (3) extract feature maps from multiple layers, (4) generate default boxes at each point of the feature maps, and finally (5) collect all the generated default boxes and filter them using NMS. The neural network architecture of SSD is shown in Fig. 7. SSD uses more convolutional layers than YOLO, which means that SSD has more feature maps. At the same time, SSD feeds feature maps from different convolution stages of a VGG-based backbone to the regressor, which aims to improve detection accuracy on small objects.
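Both detector families above rely on NMS to prune overlapping candidate boxes. The following short sketch is a generic textbook implementation of the greedy procedure, not code taken from any of the cited detectors:

```python
# Greedy non-maximum suppression (NMS) over axis-aligned boxes (x1, y1, x2, y2).

def iou(a, b):
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping ones that overlap them too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.75]
    print(nms(boxes, scores))  # [0, 2]: the second box is suppressed by the first
```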

The multi-stage and one-stage algorithms described above each have their own advantages and disadvantages. Multi-stage algorithms achieve high detection accuracy but incur more computing overhead and repeated detection. A one-stage model generally consists of a basic network (backbone) and a detection head. The former serves as a feature extractor that provides representations of the image at different sizes and abstraction levels; the latter learns classification and localization from these representations and a supervised dataset. The two tasks for which the detection head is responsible, category prediction and position regression, are often carried out in parallel, formulating a multi-task loss function for joint training. There is only one pass of class prediction and position regression, and most of the weights are shared. Hence, one-stage algorithms are more time-efficient at the cost of accuracy. To address these issues, RetinaNet [80] and RefineDet [81] were developed by designing a new focal loss function and an anchor refinement module, respectively. Some ensemble learning methods combining multi-stage and one-stage models have also been proposed to improve performance while keeping the time cost low. For instance, CBNet [82] groups multiple identical backbones, such as those of Faster R-CNN and RetinaNet, in a composite connection. In any case, with the continuous development of DL in the field of CV, object detection algorithms are constantly learning from and improving upon each other. Newly designed visual backbones will greatly benefit the perception requirements of vision-based learning for drones.

C. Vision Transformer

ViTs have recently emerged as the most active research direction in object detection, with models like the Swin Transformer [71], [83], ViTDet [84], DETR [85], and DINO [86] leading the forefront. Unlike conventional CNNs, ViTs leverage self-attention mechanisms to process image patches as sequences, offering a more flexible representation of spatial hierarchies. The core mechanism of these models involves dividing an image into a sequence of patches and applying Transformer encoders [87] to capture complex dependencies between them. This process enables ViTs to efficiently learn global context,
which is pivotal in understanding comprehensive scene layouts and object relations. For instance, the Swin Transformer [71] introduces a hierarchical structure with shifted windows, enhancing the model's ability to capture both local and global features. Subsequently, the Swin Transformer was scaled up to Swin Transformer V2 [83], with the capability of training on high-resolution images (see Fig. 8(a)). DETR further simplifies the object detection pipeline with an end-to-end transformer-based framework by leveraging global attention and set-based prediction (see Fig. 8(b)).

Fig. 8: Typical ViT detectors. (a) Swin Transformer V2 framework [83]; (b) DETR framework [85].
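The patch-to-token mechanism described above can be sketched in a few lines of PyTorch. The dimensions below are illustrative and this is not the architecture of any cited detector:

```python
# Minimal ViT-style backbone: split an image into patches, embed each patch as a
# token, add positional embeddings, and run a Transformer encoder over the sequence.
import torch
import torch.nn as nn

class TinyViTBackbone(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided convolution: one token per 16x16 patch.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

    def forward(self, images):                       # images: (B, 3, H, W)
        tokens = self.to_tokens(images)               # (B, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        return self.encoder(tokens + self.pos_embedding)  # (B, num_patches, dim)

if __name__ == "__main__":
    feats = TinyViTBackbone()(torch.rand(2, 3, 224, 224))
    print(feats.shape)  # torch.Size([2, 196, 192])
```

A detection head (e.g., DETR-style queries or a dense prediction head) would then consume these token features; that part is omitted here.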
The primary advantages of ViTs in object detection are their scalability to large datasets and their superior performance in capturing long-range dependencies. This makes them particularly effective in scenarios where contextual understanding is crucial. Additionally, ViTs demonstrate strong transfer learning capabilities, performing well across various domains with minimal fine-tuning. However, challenges with ViTs include their computational intensity due to the self-attention mechanism, particularly when processing high-resolution images. This can limit their deployment in real-time drone applications where computational resources are constrained [88]. Additionally, ViTs often require large-scale datasets for pre-training to achieve optimal performance, which can be a limitation in data-scarce environments. Despite these challenges, ongoing advancements in ViT architectures, such as the development of efficient attention mechanisms [89] and hybrid CNN-Transformer models [90], continue to enhance their applicability and performance in diverse object detection tasks.

D. Application and Summary

Applying object detection algorithms and related visual encoders to drone applications such as target tracking, search and rescue, and autonomous landing poses several challenges, as drones are small and move fast in different task scenarios. To improve the detection accuracy of Faster R-CNN for aerial tracking, [91] integrated a multi-stream architecture into Faster R-CNN (MS-Faster R-CNN) to retrieve multi-scale information for object detection. To address the target occlusion challenge in multi-drone multi-target tracking, [92] collected an occlusion-aware target tracking dataset, MDMT, and proposed a multi-matching identity authentication network (MIA-Net) to associate occluded targets across multiple drone views. To achieve high-speed autonomous landing for a fixed-wing drone, [93] developed the Vision Transformer Particle Region-based Convolutional Neural Network (VitP-RCNN), combining Mobile Vision Transformers (MobileViT) and a particle filter to continuously track the position of a landing marker at 51 frames per second (FPS).

Besides, massive drone datasets are required for training and testing. Zheng Ye et al. [74] collected an air-to-air drone dataset, "Det-Fly", and evaluated air-to-air object detection of a micro-UAV with eight different object detection algorithms, namely RetinaNet [94], SSD, Faster R-CNN, YOLOv3 [95], FPN [96], Cascade R-CNN [97] and Grid R-CNN [98]. The evaluation results in [74] showed that the overall performance of Cascade R-CNN and Grid R-CNN is superior to the others, whereas YOLOv3 provides the fastest inference speed. Wei Xun et al. [99] conducted another investigation into drone detection, employing the YOLOv3 architecture and deploying the model on the NVIDIA Jetson TX2. They collected a dataset comprising 1435 images featuring various UAVs, including drones, hexacopters, and quadcopters. Using custom-trained weights, the YOLOv3 model demonstrated proficiency in drone detection. However, the deployment of this trained model faced constraints due to the limited computation capacity of the Jetson TX2, which posed challenges for effective real-time application.

It is also necessary to find the best balance between computation speed and accuracy. In agile flight, computation speed is more important than accuracy, since real-time object detection is required to avoid obstacles swiftly. Therefore, a simple gate detector with a six-level U-Net and a filter algorithm are adopted as the basis of the visual perception of Swift [18]. Considering the agility of the drone, a stabilization module is required to obtain more robust and accurate object detection and tracking results in real-time flight. Moreover, in the drone datasets covered by most existing works, each image only includes a single UAV. To classify and detect different classes of drones in multi-drone systems for pursuit-evasion, a new ICG-Drone dataset [88] of multiple types of drones was collected. The trained drone detector and tracker, Drone-YOLOSORT, is further optimized with TensorRT on a Jetson to achieve real-time performance at 20 FPS. Furthermore, the dataset is extended with additional depth images in [48] to capture adversary drones in omnidirectional visual perception with dual fisheye cameras to enhance avoidance capability. Based on that, a two-stage omnidirectional 3D drone detection method is proposed with a YOLOv7 detector and monocular depth estimation, providing a lightweight yet effective solution for drone navigation.

IV. VISION-BASED CONTROL

Vision-based control for robotics has been widely studied in recent years, whether for ground robots or aerial robots. For drones flying in GPS-denied environments, VIO [43]–[45] and V-SLAM [35], [46], [47] have been the preferred choices for navigation. Meanwhile, in a cluttered environment, research
on obstacle avoidance [16], [34], [100], [101] based on visual perception has attracted much attention in the past few years. Obstacle avoidance has been a main task for vision-based control as well as for current learning algorithms for drones. From the perspective of how drones obtain visual perception (the perception end) and how drones generate control commands from visual perception (the control end), existing vision-based control methods can be categorized into indirect methods, semi-direct methods, and end-to-end methods. The relationship between these three categories is illustrated in Fig. 9. In the following, these methods are discussed and evaluated in their three respective categories.

Fig. 9: Based on how visual perception is connected to control, vision-based control methods can be divided into indirect methods, semi-direct methods, and end-to-end methods. (a) Indirect methods divide the mission into perception, mapping and planning with image processing; (b) semi-direct methods extract intermediate features from raw images for a learning-based policy (indirect-direct) or generate maps from raw images and conduct planning (direct-indirect); (c) end-to-end methods map raw images to actions directly via DL.

A. Indirect Methods

Indirect methods [16], [33], [102]–[109] extract features from images or videos to generate visual odometry, depth maps, and 3D point cloud maps, with which drones perform path planning based on traditional optimization algorithms (see Fig. 10). Obstacle states, such as 3D shape, position, and velocity, are detected and mapped before a maneuver is taken. Once online maps are built or obstacles are located, the drone can generate a feasible path or take actions to avoid obstacles. SOTA indirect methods generally divide the mission into several subtasks, namely perception, mapping and planning.

Fig. 10: Indirect methods divide the mission into perception, mapping and planning. (a) An APF method used for drones' dynamic obstacle avoidance [33]. (b) A vision-based drone traversing a cluttered indoor environment with a generated map and online path planning [106].

On the perception side, depth images are usually required to generate the corresponding distance and position information for navigation. A depth image is a grey-level or color image that represents the distance between the surfaces of objects and the viewpoint of the agent: the pixel intensity encodes the distance from the camera, so that in a typical visualization a lighter color denotes a nearer surface while darker areas denote surfaces that are further away. A depth map provides the necessary distance information for drones to make decisions to avoid static and dynamic obstacles. Currently, off-the-shelf RGB-D cameras, such as the Intel RealSense depth camera D415, the ZED 2 stereo camera, and the Structure Core depth camera, are widely used for drone applications. Therefore, traditional obstacle avoidance methods can treat depth information as a direct input. However, for omnidirectional perception in wide-view scenarios, efficient onboard monocular depth estimation is usually required, which remains a challenge for existing methods.
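To connect the depth image to the mapping stage discussed next, the following minimal sketch (illustrative only, with made-up intrinsics, not code from the cited planners) back-projects a depth image into 3D points with a pinhole camera model, the usual first step before building a point cloud or occupancy map:

```python
# Back-project a metric depth image into 3D points in the camera frame.
import numpy as np

def depth_to_points(depth_m: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """depth_m: (H, W) array of metric depths; returns (H*W, 3) points."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

if __name__ == "__main__":
    depth = np.full((480, 640), 2.0)                 # a flat wall 2 m away
    points = depth_to_points(depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
    print(points.shape, points[:, 2].min(), points[:, 2].max())  # (307200, 3) 2.0 2.0
```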
On the mapping side, a point cloud map [110] or an OctoMap [111], representing a set of data points in 3D space, is commonly generated. Each point has its own Cartesian coordinates and can be used to represent a 3D shape or an object. A 3D point cloud map is not tied to the view of the drone but constitutes a global 3D map that provides global environmental information. The point cloud map can be generated from a LIDAR scanner or from many overlapping images combined with depth information. An illustration of an original scene and an OctoMap is shown in Fig. 10(b), where the drone can travel around without colliding with static obstacles.

Planning is a basic requirement for a vision-based drone to avoid obstacles. Within the indirect methods, planning can be further divided into two categories: offline methods based on high-resolution maps and pre-known position information, such as Dijkstra's algorithm [112], A* [113], RRT-Connect [114] and sequential convex optimization [115]; and online methods based on real-time visual perception and decision-making. Online methods can be further categorized into online path planning [16], [106], [107] and artificial potential field (APF) methods [33], [116].

Most vision-based drones rely on online methods. Compared to offline methods, which require an accurate pre-built global map, online methods provide advanced maneuvering capabilities for drones, especially in dynamic environments. Currently, owing to their optimization and prediction capabilities, online path planning methods have become the preferred choice for drone obstacle avoidance. For instance, in the SOTA work [106], Zhou Boyu et al. introduced a robust and efficient motion planning system called Fast-Planner for a vision-based drone to perform high-speed flight in unknown cluttered environments. The key contribution of this work is a robust and efficient planning scheme incorporating path searching, B-spline optimization, and time adjustment to generate feasible and safe trajectories for obstacle avoidance. Using only onboard visual perception and computing, this work demonstrated agile drone navigation in unexplored
indoor and outdoor environments. However, this approach can only achieve maximum speeds of 3 m/s and requires 7.3 ms of computation per step. To improve flight performance and save computation time, Zhou Xin et al. [16] provided a Euclidean Signed Distance Field (ESDF)-free gradient-based planning framework, EGO-Planner, for drone autonomous navigation in unknown, obstacle-rich environments. Compared to Fast-Planner, EGO-Planner achieves higher speeds and saves considerable computation time. However, these online path planning methods require bulky visual sensors, such as RGBD cameras or LIDAR, and a powerful onboard computer for the complex numerical computation needed to obtain a locally or globally optimal trajectory.

In contrast to online path planning, APF methods require fewer computational resources and can cope well with dynamic obstacle avoidance using limited sensor information. The APF algorithm is a robot path planning approach that uses an attractive force to reach the goal position and repulsive forces to avoid obstacles in an unknown environment [117]. Falanga et al. [33] developed an efficient and fast control strategy based on the APF method to avoid fast-approaching dynamic obstacles. The obstacles in [33] are represented as repulsive fields that decay over time, and the repulsive forces are generated from the first-order derivative of the repulsive fields at each time step. However, the computed repulsive forces only reach substantial values when the obstacle is very close, which may lead to unstable and aggressive behavior. Besides, APF methods are heuristic methods that cannot guarantee global optimality and robustness for drones. Hence, it is not ideal to adopt potential field methods to navigate through cluttered environments.
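The attractive/repulsive structure of APF described above can be sketched as follows. The gains and influence radius are illustrative choices, not parameters from the cited work [33]:

```python
# Minimal artificial potential field (APF) sketch: an attractive pull toward the
# goal plus repulsive pushes away from nearby obstacles.
import numpy as np

def apf_command(pos, goal, obstacles, k_att=1.0, k_rep=2.0, influence_radius=3.0):
    """pos, goal: (3,) arrays; obstacles: list of (3,) arrays. Returns a desired velocity."""
    force = k_att * (goal - pos)                      # attractive term
    for obs in obstacles:
        offset = pos - obs
        dist = np.linalg.norm(offset)
        if 1e-6 < dist < influence_radius:
            # Repulsion grows as the obstacle gets closer and vanishes beyond the radius.
            force += k_rep * (1.0 / dist - 1.0 / influence_radius) * offset / dist**3
    return force

if __name__ == "__main__":
    cmd = apf_command(np.array([0.0, 0.0, 1.0]),
                      np.array([5.0, 0.0, 1.0]),
                      [np.array([2.0, 0.5, 1.0])])
    print(cmd)  # pulled toward the goal, pushed away from the obstacle
```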
B. End-to-end Methods

In contrast to indirect methods, which divide the whole mission into multiple sub-tasks, such as perception, mapping, and planning, end-to-end methods [34], [57], [118]–[120] combine CV and RL to map visual observations to actions directly (see Fig. 11). RL [121] is a technique for mapping the state (observation) space to the action space in order to maximize a long-term return with given rewards. The learner is not explicitly told what action to carry out but must figure out which actions yield the highest reward during the exploration process. A typical RL model features agents, the environment, reward functions, and action and state spaces. The policy model reaches convergence via constant interactions between the agents and the environment, where the reward function guides the training process.

Fig. 11: Quadrotor drone flying with an end-to-end RL method [34]. The policy was first trained on a simulation platform and then transferred to real-world flight via Sim2Real.

With end-to-end methods, the visual perception of the drone is encoded by a deep neural network (DNN) into an observation vector for the policy network. The mapping is usually trained offline with abundant data, which requires high-performance computers and simulation platforms. The training dataset is collected from expert flight demonstrations for imitation learning (IL) or from simulations for online training. To better generalize the performance of a trained neural network model, scenario randomization (domain randomization) is essential during the training process. Malik Aqeel Anwar et al. [119] presented an end-to-end RL approach called NAVREN-RL to navigate a quadrotor drone in an indoor environment with expert data and knowledge-based data aggregation. The reward function in [119] was formulated from a ground-truth depth image and a generated depth image. Loquercio et al. [34] developed an end-to-end approach that can autonomously guide a quadrotor drone through complex wild and human-made environments at high speeds with purely onboard visual perception (depth images) and computation. The neural network policy was trained in a high-fidelity simulation environment with massive expert knowledge data. While end-to-end methods provide a straightforward way to generate obstacle avoidance policies for drones, they require massive training data with domain randomization (usually counted in the millions of samples) to obtain acceptable generalization capabilities. Meanwhile, without expert knowledge data, it is challenging for the neural network policy to update its weights when the reward space is sparse. To implement end-to-end methods, the following aspects are commonly considered: the neural network architecture, the training process, and Sim2Real transfer.

1) Neural Network Architecture: The neural network architecture is the core component of end-to-end methods, and it determines the computational efficiency and intelligence level of the policy. In an end-to-end method, the input of the neural network is the raw image data (RGB/RGBD), and the output is the action vector the agent needs to take. The images are encoded into a vector and then concatenated with other normalized observations to form the input vector of a policy network. For the image encoder, there are many pre-trained neural network architectures that can be considered, such as ResNet [122], VGG [123], and the Nature CNN [124]. Zhu et al. [57] developed a target-driven visual navigation approach for robots with end-to-end DRL, where a pre-trained ResNet-50 is used to encode the image into a feature vector (see Fig. 12(a)). In [59], a more data-efficient image encoder with 5 layers was designed for target-driven visual navigation for robots with end-to-end IL (see Fig. 12(b)). Before designing the neural network architecture, it is first necessary to determine the observation space and action space of the task. Training efficiency and space complexity are the two critical aspects that need to be considered in the design process.
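The encoder-plus-policy pattern described above can be sketched in PyTorch as follows. The layer sizes and the 4-dimensional action output are illustrative assumptions, not the architecture of any cited work:

```python
# Minimal visuomotor policy: a small CNN encodes the image, the embedding is
# concatenated with the normalized drone state, and an MLP head outputs the action.
import torch
import torch.nn as nn

class VisuomotorPolicy(nn.Module):
    def __init__(self, image_channels=1, state_dim=9, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(                       # Nature-CNN-style encoder
            nn.Conv2d(image_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                                # infer the flattened size
            feat_dim = self.encoder(torch.zeros(1, image_channels, 84, 84)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(feat_dim + state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),           # e.g. normalized body rates + thrust
        )

    def forward(self, image, state):
        return self.head(torch.cat([self.encoder(image), state], dim=-1))

if __name__ == "__main__":
    policy = VisuomotorPolicy()
    action = policy(torch.rand(2, 1, 84, 84), torch.rand(2, 9))
    print(action.shape)  # torch.Size([2, 4])
```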
Fig. 12: The neural network architectures of end-to-end RL methods. (a) Target-driven navigation with a ResNet-50 encoder [57]; (b) target-driven visual navigation for a robot with a shallow CNN encoder [59].

Fig. 13: The IL training process used in [34] to fly an agile drone through a forest, where expert trajectories are generated and sampled for training and then followed by the trajectory tracking controller.

2) Training Process: The training process is the most time-consuming part of end-to-end learning methods. It requires the use of gradient information from the loss function to update the weights of the neural network and improve the policy model. There are two main families of training algorithms for RL, namely on-policy algorithms and off-policy algorithms. On-policy algorithms attempt to evaluate or improve the policy that is used to make decisions, i.e., the behavior policy used to generate actions is the same as the target policy being learned. When the agent is exploring the environment, on-policy algorithms are more stable than off-policy algorithms. SARSA (which uses State-Action-Reward-State-Action experiences to update the Q-values) [121] is an on-policy RL algorithm that estimates the value function of the policy being carried out. PPO [125] is another efficient on-policy algorithm widely used in RL. By compensating for the fact that more probable actions are going to be taken more often, PPO addresses the issues of high variance and slow convergence in policy gradient methods. To avoid large oscillations of the weights, a "clipped" policy gradient update formulation is designed.
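For reference, the standard forms of the updates mentioned above, together with the off-policy Q-learning update discussed just below, can be written as follows. The notation is the usual textbook one and is not reproduced from the survey itself:

```latex
% SARSA (on-policy) temporal-difference update
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

% Q-learning (off-policy) update with a greedy target
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

% PPO clipped surrogate objective, with ratio r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)
L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \right) \right]
```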
In contrast, off-policy algorithms evaluate or improve a policy different from the one currently used to generate actions. Off-policy algorithms such as Q-learning [126], [127] try to learn a greedy policy at every step. Off-policy algorithms perform better at movement prediction, especially in an unknown environment. However, off-policy algorithms can be very unstable due to non-stationary environments that the offline data seldom covers. IL [128] is another form of off-policy algorithm that tries to mimic demonstrated behavior in a given task. By training from demonstrations, IL can save remarkable exploration costs. [34] adopted IL to speed up the training process; the whole training process is illustrated in Fig. 13. With this privileged expert knowledge, the policy can be trained to find a time-efficient trajectory to avoid obstacles.

For MAS, multiagent reinforcement learning (MARL) is widely studied to encourage agent collaboration. In MARL, credit assignment is a crucial issue that determines the contribution of each agent to the group's success or failure. COMA [129] is a baseline method that uses a centralized critic to estimate the action-value function and, for each agent, computes a counterfactual advantage function to represent the value difference, which can determine the contribution of each agent. However, COMA still does not fully address the model complexity issue. QMIX [130] was developed to address the scalability issue by decomposing the global action-value function into individual agents' value functions. However, QMIX assumes the environment is fully observable and may not be able to handle scenarios with continuous action spaces. Hence, attention-mechanism-enabled MARL is a promising direction to address variable observations. Besides, to balance individual and team rewards, a MARL algorithm with mixed credit assignment, POCA-Mix, was proposed in [131] to achieve collaborative multi-target search with a visual drone swarm. In the follow-up work [132], POMA was developed to improve the scalability of POCA-Mix with attention-enhanced local observation.

3) Sim2Real Transfer: For complex tasks, the policy neural networks are usually trained on simulation platforms such as AirSim [134], Unity [135] or Gazebo [136]. The differences between the simulation and the real environment are non-negligible. Sim2Real is the process of deploying a neural network model trained in a simulation environment on a real physical agent. For deployment, it is essential to validate the
generalization capability of trained neural network models. A wide variety of Sim2Real techniques [137]–[141] have been developed to improve the generalization and transfer capabilities of models. Domain randomization [140] is one such technique for improving the generalization capability of a trained neural network to unseen environments. Domain randomization tries to discover a representation that can be used in a variety of scenes or domains. Existing domain randomization techniques [34], [63] for drone obstacle avoidance include position randomization, depth-image noise randomization, texture randomization, size randomization, etc. Therefore, domain randomization is generally applied in the training process to enhance the generalization capability of the trained model in deployment.

TABLE I: Summary of Vision-Based Control Methods for Drones

Type | Work | Perception End | Control End | Main Contribution | Advantages / Limitations
Indirect | [102] | stereo camera + LIDAR | A* + cascade PID | fast and robust autonomous navigation | accurate for safety-critical tasks / computationally intensive
Indirect | [103] | monocular camera + IMU | piecewise minimum-time optimization | optimal-time path planning | reliable and precise / offline planning
Indirect | [106] | LIDAR + IMU | B-spline trajectory optimization | robust and efficient path planning | high-speed flight in cluttered environments / heavy computation
Indirect | [105] | RGBD camera + IMU | spatial-temporal planning | teach-repeat-replan for aggressive flight | real-time responsiveness / limited in cluttered environments
Indirect | [33] | stereo event cameras | APF with position estimation | agile dynamic obstacle avoidance | rapid perception and response / struggles in cluttered environments
Indirect | [107] | RGBD camera + IMU | path-guided optimization | enhanced dynamic obstacle avoidance | effective in dense environments / struggles with dynamic obstacles
End-to-end | [119] | monocular camera | DRL with expert data | DRL for UAV visual collision avoidance | effective in structured environments / limited generalization capabilities
End-to-end | [118] | monocular camera | two-stage learning | data-efficient learning for UAV visual navigation | reduces training time and resources / limited action space
End-to-end | [34] | RGBD camera | IL + MPC trajectory tracking | learning agile flight in wild environments | improved generalization in Sim2Real / relies on expert trajectories
End-to-end | [65] | monocular camera | MARL for action mapping | target search with DRL from visual observations | simplifies action mapping / data-intensive
Semi-direct | [100] | monocular camera | IL with intermediate features | lightweight learning-based visual navigation | balances generalization and efficiency / requires careful feature design
Semi-direct | [133] | monocular camera | detected position features for DRL | learning-based visual target tracking | improves real-time adaptability / computationally demanding
Semi-direct | [18] | Intel T265 camera | PPO for thrust and body rate control | champion-level drone racing with gate detector | suitable for limited environments / lacks flexibility in dynamic settings
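The episode-level domain randomization described before Table I can be sketched as follows. The randomized quantities mirror those listed above (obstacle positions, depth noise, textures, sizes); the ranges and the environment object (env) are hypothetical, not the configuration of any cited work:

```python
# Sketch of per-episode domain randomization for a hypothetical simulated drone env.
import random
import numpy as np

def randomize_episode(env, rng: random.Random):
    """Perturb the hypothetical simulation 'env' before each training episode."""
    for obstacle in env.obstacles:
        jitter = np.array([rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0), 0.0])
        obstacle.position = obstacle.position + jitter       # position randomization
        obstacle.scale *= rng.uniform(0.8, 1.2)               # size randomization
        obstacle.texture = rng.choice(env.texture_library)    # texture randomization
    env.depth_noise_std = rng.uniform(0.0, 0.05)               # depth-image noise (m)

def add_depth_noise(depth_m: np.ndarray, noise_std: float) -> np.ndarray:
    """Apply the sampled sensor noise to a rendered depth image."""
    return depth_m + np.random.normal(0.0, noise_std, size=depth_m.shape)

if __name__ == "__main__":
    noisy = add_depth_noise(np.full((4, 4), 2.0), noise_std=0.02)
    print(noisy.shape)
```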
Fig. 14: Semi-direct methods extract intermediate features, such as structure statistics and optical flow [100] or bounding boxes [133], for the subsequent RL training to improve the generalization of vision-based learning.

Fig. 15: Via DL, a 3D depth space is generated from a monocular image for obstacle avoidance with a direct-indirect method [142].

C. Semi-direct Methods

Compared to end-to-end methods, which generate actions directly from raw image data, semi-direct methods [18], [51], [100], [133], [143], [144] introduce an intermediate phase between visual perception and action, aiming to improve the generalization and transfer capabilities of the methods over unseen environments (see Fig. 14). There are two ways to design semi-direct architectures: one is to generate the required information from image processing (such as the relative positions of obstacles from object detection and tracking, or a point cloud map) and train the control policy with DRL; the other is to obtain the required states (such as a depth image or a 3D world model) directly from the raw image data with DL and perform the task using numerical or heuristic methods. These two approaches can be denoted as indirect (front end)-direct (back end) methods and direct (front end)-indirect (back end) methods, respectively.

1) Indirect-direct methods: [18], [100], [133] first obtain intermediate features, such as the relative position or velocity of the obstacles, from image processing and then use this intermediate feature information as observations to train the policy neural network via DRL. Indirect-direct methods generally rely on designing suitable intermediate features. In [100], features related to depth cues, such as Radon features (30-dimensional), structure tensor statistics (15-dimensional), Laws' masks (8-dimensional), and optical flow (5-dimensional), were extracted and concatenated into a single feature vector as the visual observation. Together with nine additional features, the control policy was trained with IL to navigate the drone through a dense forest environment. Moulay et al. [133] proposed a semi-direct vision-based learning control policy for UAV pursuit-evasion. First, a deep object detector (YOLOv2) and a search area proposal (SAP) were used to predict the relative position of the target UAV in the next frame for target tracking. Afterward, DRL was adopted to predict the actions the follower UAV needs to perform to track the target UAV. Indirect-direct methods are able to improve the generalization capability of policy neural networks, but at the cost of heavy computation and time overhead.

2) Direct-indirect methods: Direct-indirect methods try to obtain depth images [145], [146] and track obstacles or targets [93], [142] through training, and then use non-learning-based methods, such as path planning or APF, to avoid obstacles. Direct-indirect methods can be applied to a lightweight micro drone with only monocular vision, but they require a large amount of training data to obtain depth images or the 3D poses of obstacles. Michele et al. [142] developed an object detection system with a monocular camera to detect obstacles at very long range and very high speed (see Fig. 15), without particular assumptions on the type of motion. With a DNN trained on real and synthetic image data, fast, robust and consistent depth information can be used for drones' obstacle avoidance. Direct-indirect methods address the ego-drift problem of monocular depth estimation using Structure from Motion (SfM) and provide a direct way to obtain depth information from the image. However, the massive training datasets required and the limited generalization capability are the main challenges for their further application.
11
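The following is a minimal direct-indirect back-end sketch: a learned monocular depth network is assumed to supply a metric depth image, and a classical artificial potential field (APF) turns it into a velocity command. The gains, influence radius, and the column-wise reduction of the depth map are illustrative assumptions rather than the design of any cited system.

```python
# A minimal direct-indirect sketch (assumptions: the potential-field gains and
# the way the depth map is reduced to a repulsive force are illustrative). A
# learned monocular depth network supplies a depth image; a classical APF then
# turns it into a commanded velocity direction.
import numpy as np

def apf_command(depth, goal_dir, k_att=1.0, k_rep=0.5, d0=3.0):
    """depth: HxW metric depth image; goal_dir: unit vector [lateral, forward]."""
    h, w = depth.shape
    # attractive term pulls toward the goal direction
    force = k_att * np.asarray(goal_dir, dtype=float)
    # repulsive term: image columns with close obstacles push the command
    # sideways and backward (classic Khatib-style repulsive gradient magnitude)
    col_min = depth.min(axis=0)                   # nearest depth per image column
    bearings = np.linspace(-0.5, 0.5, w)          # normalized horizontal bearing
    for b, d in zip(bearings, col_min):
        d = max(d, 1e-3)
        if d < d0:                                # obstacle inside influence radius
            mag = k_rep * (1.0 / d - 1.0 / d0) / (d * d)
            push = np.array([-np.sign(b) if b != 0 else 1.0, -1.0])
            force += mag * push * (1.0 - abs(b))  # weight obstacles near the center more
    return force / (np.linalg.norm(force) + 1e-6)  # unit direction [lateral, forward]

if __name__ == "__main__":
    depth = np.full((48, 64), 10.0)
    depth[:, 40:48] = 1.5                         # a close obstacle slightly to the right
    cmd = apf_command(depth, goal_dir=np.array([0.0, 1.0]))
    print("commanded direction (lateral, forward):", cmd)
```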

TABLE II: Comparison of performance for vision-based control methods (SR (%) or distance (m) as the performance metric)
Method | Task | Processor | Performance | FPS
[102] | navigation | NUC i7 | 78.6% | 25
[103] | path planning | i7-5500U | NA | 9.2
[106] | path planning | Jetson TX2 | 77.8% | 15
[105] | path planning | DJI Manifold 2-C | 84.6 m | 18
[33] | obstacle avoidance | Jetson TX2 | 93.0% | 280
[107] | path planning | i7-8550U | 100.0% | 100
[119] | navigation | NVIDIA GTX1080 | 5.8 m | NA
[118] | obstacle avoidance | NVIDIA GTX1080 | 142 m | 5.3
[34] | navigation | Jetson TX2 | 90% | 24
[65] | target search | Jetson Xavier NX | 64.0% | 21
[100] | navigation | NVIDIA GTX1080 | 62% | 10
[133] | target tracking | NA | 80.0% | 30
[18] | drone racing | Jetson TX2 | 91.0% | 20

D. Comparison and Discussion

Overall, the field of vision-based control for drones encompasses a variety of methods, each with its own unique approach to perception and control. A comprehensive overview of these methods, along with key studies in each category, is summarized in Table I, and their performance and runtime are listed in Table II, which provides a detailed comparison of their perception and control strategies. Indirect methods rely on traditional optimization algorithms and depth or 3D point cloud maps for navigation and obstacle avoidance. End-to-end methods leverage DNNs for visual perception from monocular cameras or depth images, and utilize RL for direct action mapping. Semi-direct methods balance computational efficiency and generalization by using intermediate features from image processing together with a combination of DRL and heuristic methods for action generation, but introduce extra computation costs. While traditional indirect methods perform more robustly across different missions if accurate depth maps or point cloud maps are available, learning-based methods can achieve higher onboard runtime frequency (≥ 20 FPS) by learning directly from visual perception and offer better optimization capability to address uncertainties.

V. DATASETS AND SIMULATORS

Datasets and simulators play essential roles in vision-based learning for drones. Collected datasets significantly contribute to the training of commonly used neural networks, such as those for depth estimation, drone detection and tracking, and drone dynamics modeling. Meanwhile, the development of realistic simulators has accelerated the training and verification of designed learning methods before deploying them in real-world scenarios. In the following, publicly available datasets and simulators are discussed and summarized in Table III.

A. Datasets

A wide range of publicly available datasets supports the development of vision-based learning methods for drones. These datasets enable researchers to train and evaluate perception, tracking, and control algorithms under various environmental conditions. The well-known VisDrone dataset [73] provides over 10,000 images and videos captured in urban environments under various weather and lighting conditions. With dense annotations, it is widely used for object detection and tracking tasks from the drone view. Similarly, the Anti-UAV dataset [147] focuses on outdoor UAV tracking, offering 318 RGB-T videos recorded during both day and night, enabling models to handle varying lighting conditions. The SUAV-DATA dataset [148] comprises over 5,000 multi-scale annotated images of small UAVs at different altitudes, addressing the challenges of small object detection in aerial imagery. The Det-Fly dataset [74] complements this by providing air-to-air annotated drone images in flight for drone navigation and target detection, supporting research into collision avoidance and flight planning. The MIDGARD dataset [149] introduces multi-modal data, including RGB, depth, and thermal imagery, along with rich annotations for various tasks. It is an essential resource for studying integrated perception systems in various environments. In comparison, the MDMT dataset [92] focuses on multi-drone, multi-target tracking, with over 39,000 annotated frames involving occlusion and overlapping targets.

For aggressive motion estimation, the UZH-FPV dataset [150] provides high-resolution RGB images, inertial measurements, event-camera data, and precise ground truth poses. Captured during agile drone maneuvers, this dataset has driven advancements in motion planning and state estimation, particularly agile flight with learning-based methods [17]. The NeuroBEM dataset [151] combines aerodynamic modeling with highly aggressive flight data. It includes over an hour of flight data with pose, body rate, and battery voltage information, enabling detailed dynamics modeling for quadrotors. Similarly, the DroneRacing dataset [18] focuses on dynamics identification for high-speed flight, providing raw flight data with states and ground truth annotations, catering specifically to learning-based methods in drone racing competitions.

B. Simulators

Simulators are crucial tools for training and validating vision-based learning algorithms in customizable and risk-free environments. They offer controlled scenarios for evaluating navigation, obstacle avoidance, and collaboration.

AirSim [134] is a highly detailed vehicle simulator widely used for navigation and obstacle avoidance [154]. It offers customizable environments and realistic physics, making it suitable for benchmarking and training vision-based systems. Similarly, Flightmare [152] is designed specifically for drone applications, such as racing and navigation with learning methods [17], [18], [155]. It provides RGBD camera data and drone-specific environments optimized for high-speed testing. The Unity ML-Agents toolkit [135] serves as a versatile platform for training DRL algorithms, especially in MAS. It supports collaborative learning and self-play, can be tailored for complex scenarios, and has been used for collaborative target search [65], [132] and multi-pursuit evasion [156]. On the other hand, gym-pybullet-drones [153] focuses on multi-drone control tasks, providing predefined scenarios for collaborative and individual learning.

These simulators accelerate the development of vision-based learning algorithms, enabling scholars to test and refine their methods before deploying them in real-world environments.
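As an illustration of how such simulators are typically scripted, the following is a minimal AirSim sketch. It assumes an AirSim simulation is already running and the standard airsim Python client is installed; the camera name "0" and the waypoint are placeholders.

```python
# A minimal AirSim scripting sketch (assumption: an AirSim simulation is
# already running and the `airsim` Python package is installed; camera name
# "0" and the waypoint below are placeholders). It grabs one depth image and
# flies to a point, the kind of loop used to collect training data or to test
# a learned policy in simulation.
import airsim
import numpy as np

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)

client.takeoffAsync().join()

# request one depth image from the front camera
responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.DepthPerspective,
                        pixels_as_float=True, compress=False)
])
depth = np.array(responses[0].image_data_float, dtype=np.float32)
depth = depth.reshape(responses[0].height, responses[0].width)
print("nearest obstacle (m):", float(depth.min()))

# fly to a placeholder waypoint (NED coordinates, 5 m altitude) at 3 m/s, then land
client.moveToPositionAsync(10.0, 0.0, -5.0, 3.0).join()
client.landAsync().join()
client.armDisarm(False)
client.enableApiControl(False)
```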

TABLE III: Datasets/Simulators for Vision-Based Drone Applications
Dataset | Type | Applications | Features | Size
VisDrone [73] | Images, video | Object detection, tracking | Real-world scenarios with dense annotations | 10K+ images, diverse weather/lighting
Anti-UAV [147] | Video | Object detection, tracking | Outdoor UAVs at day and night | 318 RGB-T videos
SUAV-DATA [148] | Images | Object detection | Small UAVs at various altitudes with multi-scale data | 5K+ images, multi-scale object annotations
Det-Fly [74] | Images | Object detection | Annotated images for air-to-air drone detection | 13,000 images across 3 viewing angles
MIDGARD [149] | Multi-modal data | Multi-task vision applications | Rich multi-modal annotations for diverse tasks | 1K+ sequences in varied environments
MDMT [92] | Video | Object detection, tracking | Multi-drone multi-target with target occlusion | 39,678 frames with 11,454 IDs
UZH-FPV [150] | Images, states | Aggressive motion estimation | Images, IMU, event-camera data, and ground truth poses | 27+ sequences of >10 km distance
NeuroBEM [151] | ROS bag raw data | Dynamics modeling | Pose, body rate, and battery voltage during agile flight | 1 h 15 min of agile quadrotor flights
DroneRacing [18] | ROS bag raw data | Dynamics identification | Flight raw data with states and ground truth | 445.9 MB flight data
Gym-gazebo [136] | Robot simulator | Navigation, obstacle avoidance | Gym trainer connecting with Gazebo and ROS/ROS2 | 1 MB, Linux only
AirSim [134] | Vehicle simulator | Navigation, obstacle avoidance | Simulated, customizable environments | 69 GB, Linux/Windows
Flightmare [152] | Drone simulator | Drone racing, navigation | Drone-specific environments with RGBD cameras | 893 MB, Linux only
ML-Agents [135] | DRL simulator | Navigation, collaboration | Customized training platform for MAS | 110 MB, Linux/Windows
gym-pybullet-drones [153] | Multi-drone simulator | Control, collaboration | Fixed scenarios for multi-drone control and learning | 78.5 MB, Linux/Windows

By combining the realism of datasets with the adaptability of simulators, scholars can address critical challenges in vision-based drone operations efficiently, such as dynamic navigation, real-time perception, and multiagent collaboration.

VI. APPLICATIONS AND CHALLENGES

A. Single Drone Application

The versatility of single drones is increasingly recognized in a variety of challenging environments. These autonomous systems, with their inherent advantages, are effectively employed in critical areas such as hazardous environment detection and search and rescue operations (see Fig. 16). Single drone applications in vision-based learning primarily involve tasks like obstacle avoidance [34], [36], surveillance [157]–[160], search-and-rescue operations [161]–[163], environmental monitoring [164]–[167], industrial inspection [28], [168], [169] and autonomous racing [170]–[172]. Each field, while benefiting from the unique capabilities of drones, also presents its own set of challenges and areas for development.

1) Obstacle Avoidance: The development of obstacle avoidance capabilities in drones, especially for vision-based control systems, poses significant challenges. Recent studies have primarily focused on static or simple dynamic environments, where obstacle paths are predictable [32], [34], [173]. However, complex scenarios involving unpredictable physical attacks from birds or intelligent adversaries remain largely unaddressed. For instance, [32], [33], [36], [174] have explored basic dynamic obstacle avoidance but do not account for adversarial environments. To effectively handle such threats, drones require advanced features like omnidirectional visual perception and agile maneuvering capabilities. Current research, however, is limited in addressing these needs, underscoring the necessity for further development in drone technology to enhance evasion strategies against smart, unpredictable adversaries.

2) Surveillance: While drones play a pivotal role in surveillance tasks, their deployment is not without challenges. Key challenges include managing high data processing loads and addressing the limitations of onboard computational resources. In addressing these challenges, Singh et al. [157] presented a real-time drone surveillance system used to identify violent individuals in public areas. The proposed study was facilitated by cloud processing of drone images to address the challenge of slow and memory-intensive computations while still maintaining onboard short-term navigation capabilities. Additionally, in the study [160], a drone-based crowd surveillance system was tested to achieve the goal of saving scarce energy of the drone battery. This approach involved offloading video data processing from the drones by employing the Mobile Edge Computing (MEC) method. Nevertheless, while off-board processing diminishes computational demands and energy consumption, it inevitably heightens the need for data transmission. Addressing the challenge of achieving real-time surveillance in environments with limited signal connectivity is an additional critical issue that requires resolution.

3) Search and Rescue: In the field of search and rescue operations, a primary challenge is extracting maximum useful information from limited data sources. This is crucial for improving the efficiency and success rate of these missions. Goodrich et al. [161] address this by developing a contour search algorithm designed to optimize video data analysis, enhancing the capability to identify key elements swiftly. However, incorporating temporal information into this algorithm introduces additional computational demands. These increased requirements present new challenges, such as the need for more powerful processing capabilities and potentially greater energy consumption.

4) Environmental Monitoring: A major challenge for environmental monitoring lies in efficiently collecting high-resolution data while navigating the constraints of battery life, flight duration, and diverse weather conditions. Addressing this, Senthilnath et al. [164] showcased the use of fixed-wing and Vertical Take-Off and Landing (VTOL) drones in vegetation analysis, focusing on the challenge of detailed mapping through spectral-spatial classification methods. In another study, Lu et al. [167] demonstrated the utility of drones for species classification in grasslands, contributing to the development of methodologies for drone-acquired imagery processing, which is crucial for environmental assessment and management. While these studies represent significant steps in drone applications for environmental monitoring, several challenges persist. Future research aims to improve the drones' resilience to diverse environmental conditions, and extend their operational range and duration to comprehensively cover extensive and varied landscapes.

5) Industrial Inspection: In industrial inspection, drones face key challenges like safely navigating complex environments and conducting precise measurements in the presence of various disturbances. Kim et al. [168] addressed the challenge of autonomous navigation by using drones for proximity
measurement among construction entities, enhancing safety in the construction industry. Additionally, Khuc et al. [169] focused on precise structural health inspection with drones, especially in high or inaccessible locations. Despite these advancements in autonomous navigation and measurement accuracy, maintaining data accuracy and reliability in industrial settings with interference from machinery, electromagnetic fields, and dynamic obstacles continues to be a significant challenge, necessitating advanced autonomy and intelligence in this domain.

Fig. 16: Applications of single drone. (a) Surveillance [157]; (b) Search and rescue [175]; (c) Environmental monitoring [167]; (d) Autonomous racing [18].

Fig. 17: Applications of multi-drone systems. (a) Coordinated surveying [176]; (b) Cooperative tracking [177]; (c) Synchronized monitoring [178]; (d) Disaster response [179].

6) Autonomous Racing: In autonomous drone racing, the central challenge is reducing the delays that exist in visual information processing and decision making, and enhancing the adaptability of perception networks. In [170], a novel sensor fusion method was proposed to enable high-speed autonomous racing for mini-drones. This work also addressed issues with occasional large outliers and vision delays commonly encountered in fast drone racing. Another work [171] introduced an innovative approach to drone control, where a DNN was used to fuse trajectories from multiple controllers. In the latest work [18], the vision-based drone outperformed world champions in the racing task, relying purely on onboard perception and a trained neural network. The primary challenge in autonomous drone racing, as identified in these studies, lies in the need for improved adaptability of perception networks to various environments and textures, which is crucial for the high-speed demands of the sport.

Overall, the primary challenges in single drone applications include limited battery life, which restricts operational duration, and the need for effective obstacle avoidance in dynamic environments. Additionally, limitations in data processing capabilities affect real-time decision-making and adaptability. Advanced technological solutions are essential to overcome these challenges, ensuring that single drones can operate efficiently and reliably in diverse scenarios and paving the way for future innovations.

B. Multi-Drone Application

While single drones offer convenience, their limited monitoring range has prompted interest in multi-drone collaboration. This approach seeks to overcome range limitations by leveraging the collective capabilities of multiple drones for broader, more efficient operations. Multi-drone applications (see Fig. 17), encompassing activities such as coordinated surveying [176], [180]–[182], cooperative tracking [177], [183], [184], synchronized monitoring [178], [185], [186], and disaster response [65], [132], [175], [179], [187], [188], bring the added complexity of inter-drone communication, coordination and real-time data integration. These applications leverage the combined capabilities of multiple drones to achieve greater efficiency and coverage than single drone operations.

1) Coordinated Surveying: In coordinated surveying, several challenges are prominent: merging diverse data from individual drones, and addressing computational demands in the cooperative process. These challenges were tackled by some works. In [180], a monocular visual odometry algorithm was used to enable autonomous onboard control with cooperative localization and mapping. This work addressed the challenges of coordinating and merging the different maps constructed by each drone platform, as well as the computational bottlenecks typically associated with 3D RGB-D cooperative SLAM. Micro aerial vehicles also play an outstanding role in the field of coordinated surveying. Similarly, in [182], a sensor fusion scheme was proposed to improve the accuracy of
localization in micro aerial vehicle (MAV) fleets. Ensuring that each MAV effectively contributes to the overall perception of the environment poses a challenge addressed in this work. Both methods depend on effective collaboration and communication among multiple agents. However, maintaining stable communication between drones remains a critical and unresolved issue in multi-drone operations.

2) Cooperative Tracking: In the field of multi-drone object tracking, navigating complex operational environments and overcoming limited communication bandwidth are significant challenges. These topics have also garnered significant research interest. In the study [183], researchers developed a system for cooperative surveillance and tracking in urban settings, specifically tackling the issue of collision avoidance among drones. Additionally, another work by Farmani et al. [184] explores a decentralized tracking system for drones, focusing on overcoming limited communication bandwidth and intermittent connectivity challenges. However, a persistent difficulty in this field is navigating complex outdoor environments. Effective path planning and avoiding obstacles during multi-drone operations remain crucial challenges that require ongoing attention and innovation.

3) Synchronized Monitoring: In synchronized monitoring missions with multiple drones, the focus is on how to effectively allocate tasks among drones and improve overall mission efficiency, as well as overcoming computational limitations. Gu et al. [178] developed a small target detection and tracking model using data fusion, establishing a cooperative network for synchronized monitoring. This study also addresses task distribution challenges and efficiency optimization. Moreover, [185] explores implementing DNN models on computationally limited drones, focusing on reducing classification latency in collaborative drone systems. However, issues like collision avoidance in complex environments and signal interference in multi-drone systems are not comprehensively addressed in these studies.

4) Disaster Response: In the domain of multi-drone systems for disaster response, challenges such as autonomous navigation, communication limitations, and real-time decision-making are paramount. To address these problems, Tang et al. [187] introduced a tracking-learning-detection framework with an advanced flocking strategy for exploration and search missions. However, the study does not examine the crucial aspect of task distribution and optimization for enhancing mission efficiency in multi-drone disaster response. Xiao et al. [132] developed a MARL-based multitarget search approach to improve target assignment and collaboration, while the generalization capability remains to be improved. This shortcoming points to an essential area for further research, as effective task management across various missions is key to utilizing the full capabilities of multi-drone systems, particularly in the dynamic and urgent context of disaster scenarios.

The primary challenges in multi-drone applications include maintaining stable and reliable communication links, collision avoidance between drones, and effective distribution of tasks to optimize the overall mission efficiency. Additionally, issues like signal interference and managing the flight paths of multiple drones simultaneously are significant bottlenecks. Addressing these challenges is vital for unlocking the full potential of multi-drone systems, paving the way for advancements in various application domains.

Fig. 18: Applications of heterogeneous systems. (a) UAV-UGV path planning [190]; (b) UAV-UGV precise landing [195]; (c) UAV-UGV inventory inspection [198]; (d) UAV-UGV object detection [199].

C. Heterogeneous Systems Application

As the complexity and uncertainty of task scenarios escalate, it becomes increasingly challenging for a single robot or even a homogeneous multi-robot system to efficiently adapt to diverse environments. Consequently, in recent years, heterogeneous multi-robot systems (HMRS) have emerged as a focal point of research within the community [189]. In the domain of UAV applications, HMRS mainly refers to communication-enabled networks that integrate various UAVs with other intelligent robotic platforms, such as uncrewed ground vehicles (UGVs) and uncrewed surface vehicles (USVs). This integration facilitates a diverse range of applications, leveraging the unique capabilities of each system to enhance overall operational efficiency. These systems execute a range of tasks, either individually or in a collaborative manner. The inherently heterogeneous scheduling approach of HMRS significantly enhances feasibility and adaptability, thereby effectively tackling a series of demanding tasks across different environments. Consequently, applications of HMRS are rapidly evolving; examples (see Fig. 18) include but are not limited to localization and path planning [190]–[194], precise landing [195]–[197], and comprehensive inspection and detection [198]–[200].

1) Localization and Path Planning: The UAV-UGV cooperation system can address the challenges of GPS denial and limited sensing range in UGVs. In this system, UAVs provide essential auxiliary information for UGV localization and path planning, relying solely on cost-effective cameras
and wireless communication channels. Niu [190] introduced a framework wherein a single UAV served multiple UGVs, achieving optimal path planning based on aerial imagery. This approach outperformed traditional heuristic path planning algorithms. Furthermore, Liu et al. [191] presented a joint UAV-UGV architecture designed to overcome frequent target occlusion issues encountered by single-ground platforms. This architecture enabled accurate and dynamic target localization, leveraging visual inputs from UAVs.

2) Precise Landing: Given the potential necessity for battery recharging and emergency maintenance of UAVs during extended missions, the UAV-UGV heterogeneous system can facilitate UAV landings. This design reduces reliance on manual intervention while enhancing the UAVs' capacity for prolonged, uninterrupted operation. In the study [195], a vision-based heterogeneous system was proposed to address the challenge of UAVs' temporary landings during long-range inspections. This system accomplished precise target geolocation and safe landings in the absence of GPS data by detecting QR codes mounted on UGVs. Additionally, Xu et al. [196] illustrated the application of UAV heterogeneous systems for landing on USVs. A similar approach was explored for UAVs' target localization and landing, leveraging QR code recognition on USVs.

3) Inspection and Detection: Heterogeneous UAV systems present an effective solution to overcome the limitations of background clutter and incoherent target interference often encountered in single-ground vision detection platforms. By leveraging the expansive FOV and swift scanning capabilities of UAVs, in conjunction with the endurance and high accuracy of UGVs, such heterogeneous systems can achieve time-efficient and accurate target inspection and detection in specific applications. For instance, Kalinov et al. [198] introduced a heterogeneous inventory management system, pairing a ground robot with a UAV. In this system, the ground robot determined motion trajectories by deploying the SLAM algorithm, while the UAV, with its high maneuverability, was tasked with scanning barcodes. Furthermore, Pretto [200] developed a heterogeneous farming system to enhance agricultural automation. This innovative system utilized the aerial perspective of the UAV to assist in farmland segmentation and the classification of crops from weeds, significantly contributing to the advancement of automated farming practices.

To sum up, most of the applications above primarily focus on single UAV to single UGV or single UAV to multiple UGV configurations, with few scenarios designed for multiple UAVs interacting with multiple UGVs. It is evident that there remains significant research potential in the realm of vision-based, multiagent-to-multiagent heterogeneous systems. Key areas such as communication and data integration within heterogeneous systems, coordination and control in dynamic and unpredictable environments, and individual agents' autonomy and decision-making capabilities warrant further exploration.

VII. OPEN QUESTIONS AND POTENTIAL SOLUTIONS

Despite significant advancements in the domain of vision-based learning for drones, numerous challenges remain that impede the pace of development and real-world applicability of these methods. These challenges span various aspects, from data collection and simulation accuracy to operational efficiency and safety concerns.

A. Dataset

A major impediment in the field is the absence of a comprehensive, public dataset analogous to Open X-Embodiment [201] in robotic manipulation. This unified dataset should ideally encompass a wide range of scenarios and tasks to facilitate generalizable learning. As shown in Table III, the current reliance on domain-specific datasets like "Anti-UAV" [147] and "SUAV-DATA" [148] limits the scope and applicability of research. A potential solution is the collaborative development of a diverse, multi-purpose dataset by academic and industry stakeholders, incorporating various tasks, environmental, weather, and lighting conditions.

B. Simulator

While simulators are vital for training and validating vision-based learning models, their realism and accuracy often fall short of replicating real-world complexities. This gap hampers the transition from simulation to actual deployment. Meanwhile, there is no unified simulator covering most drone tasks, resulting in repetitive domain-specific simulator development [134], [152], [153]. Drawing inspiration from the self-driving car domain, the integration of off-the-shelf and highly flexible simulators such as CARLA [202] could be a solution. These simulators, known for their advanced features in realistic traffic simulation and diverse environmental conditions, can provide more authentic and varied data for training. Adapting such simulators to drone-specific scenarios could greatly enhance the quality of training and testing environments.

C. Sample Efficiency

Enhancing sample efficiency in ML models for drones is crucial, particularly in environments where data collection is hazardous or impractical. Even though simulators are available for generating training data, there are still challenges in ensuring the realism and diversity of these simulated environments. The gap between simulated and real-world data can lead to performance discrepancies when models are deployed in actual scenarios. Developing algorithms that leverage transfer learning [203], few-shot learning [204], and synthetic data generation [205] could provide significant strides in learning efficiently from limited datasets. These approaches aim to bridge the gap between simulation and reality, enhancing the applicability and robustness of ML models in diverse and dynamic real-world situations.
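To illustrate the transfer-learning route, the following is a minimal PyTorch fine-tuning sketch. The ImageNet-pretrained ResNet-18 backbone and the 2-class drone/no-drone head are illustrative assumptions, not prescriptions from the surveyed works; the point is that only a small head is trained on the scarce drone-specific data.

```python
# A minimal transfer-learning sketch in PyTorch (assumptions: torchvision's
# ImageNet-pretrained ResNet-18 and a 2-class head are illustrative choices;
# requires a recent torchvision for the weights API). Freezing the backbone
# lets a small drone-specific dataset train only the new head, which is one
# way to improve sample efficiency.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# freeze the pretrained feature extractor
for param in model.parameters():
    param.requires_grad = False

# replace the classification head for the downstream drone task
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# one illustrative training step on a dummy batch (stand-in for a small
# labelled drone dataset)
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))

model.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print("fine-tuning step loss:", float(loss))
```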

D. Inference Speed

Balancing inference speed with accuracy is a critical challenge for drones operating in dynamic environments, even though some approaches have reached near real-time inference capability, as listed in Table II. The key lies in optimizing ML models for edge computing, enabling drones to process data and make decisions swiftly. Techniques like model pruning [206], [207], quantization [208], [209], distillation [210], and the development of specialized hardware accelerators can play a pivotal role in this regard.
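The sketch below shows two of these compression steps in PyTorch. The tiny MLP stands in for a perception or policy network, and the pruning ratio and quantization choice are illustrative assumptions rather than settings from the cited works.

```python
# A minimal post-training compression sketch in PyTorch (assumption: the tiny
# MLP below stands in for a perception or policy network; the 30% pruning
# ratio and dynamic INT8 quantization are illustrative). These are two of the
# techniques mentioned above for faster onboard inference.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 4),               # e.g., 4 control outputs
)

# prune 30% of the weights (by L1 magnitude) in each Linear layer
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # make the pruning permanent

# dynamic quantization of the Linear layers to INT8 for CPU/edge inference
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 64))
print("quantized model output:", out)
```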
E. Real World Deployment

Transitioning from controlled simulation environments to real-world deployment (Sim2Real) involves addressing unpredictability in environmental conditions, regulatory compliance, and adaptability to diverse operational contexts. Domain randomization [140] tries to address the Sim2Real issue in a certain way but is limited to predicted scenarios with known domain distributions. Developing robust and adaptive algorithms capable of on-the-fly continuous learning and decision-making, along with rigorous field testing under varied conditions, can aid in overcoming these challenges.

F. Embodied Intelligence in Open World

Existing vision-based learning methods for drones require explicit task descriptions and formal constraints, while in an open world it is hard to provide all necessary formulations at the beginning to find the optimal solution. For instance, in a complex search and rescue mission, the drone can only find the targets first and conduct rescue based on the information collected. In each stage, the task may change, and there is no prior explicit problem at the start. Human interactions are necessary during this mission. With large language models and embodied intelligence, the potential of drone autonomy can be greatly increased. Through interactions in the open world [21], [211] or by providing few-shot imitation [212], vision-based learning can emerge with full autonomy for drone applications.

G. Safety and Security

Ensuring the safety and security of drone operations is paramount, especially in densely populated or sensitive areas. This includes not only physical safety but also cybersecurity concerns [213], [214]. The security aspect extends beyond data protection, including the resilience of drones to adversarial attacks [215]. Such attacks could take various forms, from signal jamming to deceptive inputs aimed at misleading vision-based systems and DRL algorithms [216]. Addressing these concerns requires a multifaceted approach. Firstly, incorporating advanced cryptographic techniques ensures data integrity and secure communication. Secondly, implementing anomaly detection systems can help identify and mitigate unusual patterns indicative of adversarial interference. Moreover, improving the robustness of learning models against adversarial attacks and investigating the explainability of designed models [217] are imperative. Lastly, regular updates and patches to the drone's software, based on the latest threat intelligence, can fortify its defenses against evolving cyber threats.

VIII. CONCLUSION

This comprehensive survey has thoroughly explored the rapidly growing field of vision-based learning for drones, particularly emphasizing their evolving role in multi-drone systems and complex environments such as search and rescue missions and adversarial settings. The investigation revealed that drones are increasingly becoming sophisticated, autonomous systems capable of intricate tasks, largely driven by advancements in AI, ML, and sensor technology. The exploration of micro and nano drones, innovative structural designs, and enhanced autonomy stand out as key trends shaping the future of drone technology. Crucially, the integration of visual perception with ML algorithms, including DRL, opens up new avenues for drones to operate with greater efficiency and intelligence. These capabilities are particularly pertinent in the context of object detection and decision-making processes, vital for complex drone operations. This survey categorized vision-based control methods into indirect, semi-direct, and end-to-end methods, offering an in-depth understanding of how drones perceive and interact with their environment. Applications of vision-based learning drones, spanning from single-agent to multiagent and heterogeneous systems, demonstrate their versatility and potential in various sectors, including agriculture, industrial inspection, and emergency response. However, this expansion also brings forth challenges such as data processing limitations, real-time decision-making, and ensuring robustness in diverse operational scenarios.

This survey highlights open questions and potential solutions in the field, stressing the need for comprehensive datasets, realistic simulators, improved sample efficiency, and faster inference speeds. Addressing these challenges is crucial for the effective deployment of drones in real-world scenarios. Safety and security, especially in the context of adversarial environments, remain paramount concerns that need ongoing attention. While significant progress has been made in vision-based learning for drones, the journey towards fully autonomous, intelligent, and reliable systems, even AGI in the physical world, is ongoing. Future research and development in this field hold the promise of revolutionizing various industries, pushing the boundaries of what is possible with drone technology in complex and dynamic environments.

REFERENCES

[1] R. Rajkumar, I. Lee, L. Sha, and J. Stankovic, "Cyber-physical systems: the next computing revolution," in Design Automation Conference. IEEE, 2010, pp. 731–736.
[2] R. Baheti and H. Gill, "Cyber-physical systems," The impact of control technology, vol. 12, no. 1, pp. 161–166, 2011.
[3] M. Hassanalian and A. Abdelkefi, "Classifications, applications, and design challenges of drones: A review," Progress in Aerospace Sciences, vol. 91, pp. 99–131, 2017.
[4] P. Nooralishahi, C. Ibarra-Castanedo, S. Deane, F. López, S. Pant, M. Genest, N. P. Avdelidis, and X. P. Maldague, "Drone-based non-destructive inspection of industrial sites: A review and case studies," Drones, vol. 5, no. 4, p. 106, 2021.
[5] N. J. Stehr, "Drones: The newest technology for precision agriculture," Natural Sciences Education, vol. 44, no. 1, pp. 89–91, 2015.
[6] P. M. Kornatowski, M. Feroskhan, W. J. Stewart, and D. Floreano, "Downside up: Rethinking parcel position for aerial delivery," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4297–4304, 2020.
[7] P. M. Kornatowski, M. Feroskhan, W. J. Stewart, and D. Floreano, "A morphing cargo drone for safe flight in proximity of humans," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4233–4240, 2020.
[8] D. Câmara, "Cavalry to the rescue: Drones fleet to help rescuers operations over disasters scenarios," in 2014 IEEE Conference on Antenna Measurements & Applications (CAMA). IEEE, 2014, pp. 1–4.

[9] Y.-H. Hsiao, S. Bai, Y. Zhou, H. Jia, R. Ding, Y. Chen, Z. Wang, [31] B. Zhou, H. Xu, and S. Shen, “Racer: Rapid collaborative explo-
and P. Chirarattananon, “Energy efficient perching and takeoff of a ration with a decentralized multi-uav system,” IEEE Transactions on
miniature rotorcraft,” Communications Engineering, vol. 2, no. 1, p. 38, Robotics, vol. 39, no. 3, pp. 1816–1835, 2023.
2023. [32] E. Kaufmann, A. Loquercio, R. Ranftl, A. Dosovitskiy, V. Koltun, and
[10] W. Shen, J. Peng, R. Ma, J. Wu, J. Li, Z. Liu, J. Leng, X. Yan, and D. Scaramuzza, “Deep drone racing: Learning agile flight in dynamic
M. Qi, “Sunlight-powered sustained flight of an ultralight micro aerial environments,” in Conference on Robot Learning. PMLR, 2018, pp.
vehicle,” Nature, vol. 631, no. 8021, pp. 537–543, 2024. 133–145.
[11] M. Graule, P. Chirarattananon, S. Fuller, N. Jafferis, K. Ma, M. Spenko, [33] D. Falanga, K. Kleber, and D. Scaramuzza, “Dynamic obstacle avoid-
R. Kornbluh, and R. Wood, “Perching and takeoff of a robotic insect on ance for quadrotors with event cameras,” Science Robotics, vol. 5,
overhangs using switchable electrostatic adhesion,” Science, vol. 352, no. 40, 2020.
no. 6288, pp. 978–982, 2016. [34] A. Loquercio, E. Kaufmann, R. Ranftl, M. Müller, V. Koltun, and
[12] D. Floreano and R. J. Wood, “Science, technology and the future of D. Scaramuzza, “Learning high-speed flight in the wild,” Science
small autonomous drones,” Nature, vol. 521, no. 7553, pp. 460–466, Robotics, vol. 6, no. 59, p. eabg5810, 2021.
2015. [35] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monoc-
[13] J. Shu and P. Chirarattananon, “A quadrotor with an origami-inspired ular visual-inertial state estimator,” IEEE Transactions on Robotics,
protective mechanism,” IEEE Robotics and Automation Letters, vol. 4, vol. 34, no. 4, pp. 1004–1020, 2018.
no. 4, p. 3820–3827, 2019. [36] N. J. Sanket, C. M. Parameshwara, C. D. Singh, A. V. Kuruttukulam,
[14] E. Ajanic, M. Feroskhan, S. Mintchev, F. Noca, and D. Floreano, C. Fermüller, D. Scaramuzza, and Y. Aloimonos, “Evdodgenet: Deep
“Bioinspired wing a and tail morphing extends drone flight capabil- dynamic obstacle dodging with event cameras,” in 2020 IEEE Interna-
ities,” Sci. Robot., vol. 5, p. eabc2897, 2020. tional Conference on Robotics and Automation (ICRA). IEEE, 2020,
[15] L. Chen, J. Xiao, Y. Zheng, N. A. Alagappan, and M. Feroskhan, pp. 10 651–10 657.
“Design, modeling, and control of a coaxial drone,” IEEE Transactions [37] N. J. Sanket, C. D. Singh, C. Fermüller, and Y. Aloimonos, “Ajna:
on Robotics, vol. 40, pp. 1650–1663, 2024. Generalized deep uncertainty for minimal perception on parsimonious
[16] X. Zhou, J. Zhu, H. Zhou, C. Xu, and F. Gao, “Ego-swarm: A fully robots,” Science Robotics, vol. 8, no. 81, p. eadd5139, 2023.
autonomous and decentralized quadrotor swarm system in cluttered [38] R. Siegwart, I. R. Nourbakhsh, and D. Scaramuzza, Introduction to
environments,” in 2021 IEEE international conference on robotics and autonomous mobile robots. MIT press, 2011.
automation (ICRA). IEEE, 2021, pp. 4101–4107. [39] N. Chen, F. Kong, W. Xu, Y. Cai, H. Li, D. He, Y. Qin, and F. Zhang, “A
[17] E. Kaufmann, A. Loquercio, R. Ranftl, M. Müller, V. Koltun, and self-rotating, single-actuated uav with extended sensor field of view for
D. Scaramuzza, “Deep drone acrobatics,” Proceedings of Robotics: autonomous navigation,” Science Robotics, vol. 8, no. 76, p. eade4538,
Science and Systems XVI, 2020. 2023.
[18] E. Kaufmann, L. Bauersfeld, A. Loquercio, M. Müller, V. Koltun, and [40] H. Guan, X. Sun, Y. Su, T. Hu, H. Wang, H. Wang, C. Peng, and
D. Scaramuzza, “Champion-level drone racing using deep reinforce- Q. Guo, “UAV-lidar aids automatic intelligent powerline inspection,”
ment learning,” Nature, vol. 620, no. 7976, pp. 982–987, 2023. International Journal of Electrical Power and Energy Systems, vol.
130, p. 106987, sep 2021.
[19] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay,
[41] W. Xu, Y. Cai, D. He, J. Lin, and F. Zhang, “Fast-lio2: Fast direct lidar-
D. Fox, J. Thomason, and A. Garg, “Progprompt: Generating situated
inertial odometry,” IEEE Transactions on Robotics, vol. 38, no. 4, pp.
robot task plans using large language models,” in 2023 IEEE Interna-
2053–2073, 2022.
tional Conference on Robotics and Automation (ICRA). IEEE, 2023,
[42] Z. Wang, Z. Zhao, Z. Jin, Z. Che, J. Tang, C. Shen, and Y. Peng,
pp. 11 523–11 530.
“Multi-stage fusion for multi-class 3d lidar detection,” in Proceedings
[20] S. Liu, H. Zhang, Y. Qi, P. Wang, Y. Zhang, and Q. Wu, “Aeri-
of the IEEE/CVF International Conference on Computer Vision, 2021,
alvln: Vision-and-language navigation for uavs,” in Proceedings of the
pp. 3120–3128.
IEEE/CVF International Conference on Computer Vision, 2023, pp.
[43] M. O. Aqel, M. H. Marhaban, M. I. Saripan, and N. B. Ismail, “Review
15 384–15 394.
of visual odometry: types, approaches, challenges, and applications,”
[21] A. Gupta, S. Savarese, S. Ganguli, and L. Fei-Fei, “Embodied intel- SpringerPlus, vol. 5, pp. 1–26, 2016.
ligence via learning and evolution,” Nature communications, vol. 12, [44] J. Delmerico and D. Scaramuzza, “A benchmark comparison of monoc-
no. 1, p. 5721, 2021. ular visual-inertial odometry algorithms for flying robots,” in 2018
[22] Y. Tang, C. Zhao, J. Wang, C. Zhang, Q. Sun, W. X. Zheng, W. Du, IEEE international conference on robotics and automation (ICRA).
F. Qian, and J. Kurths, “Perception and navigation in autonomous IEEE, 2018, pp. 2502–2509.
systems in the era of learning: A survey,” IEEE Transactions on Neural [45] D. Scaramuzza and Z. Zhang, Aerial Robots, Visual-Inertial Odometry
Networks and Learning Systems, vol. 34, no. 12, pp. 9604–9624, 2023. of. Berlin, Heidelberg: Springer Berlin Heidelberg, 2020, pp. 1–9.
[23] Y. Lu, Z. Xue, G.-S. Xia, and L. Zhang, “A survey on vision-based [Online]. Available: https://doi.org/10.1007/978-3-642-41610-1 71-1
uav navigation,” Geo-spatial information science, vol. 21, no. 1, pp. [46] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile
21–32, 2018. and accurate monocular slam system,” IEEE transactions on robotics,
[24] M. Y. Arafat, M. M. Alam, and S. Moh, “Vision-based navigation vol. 31, no. 5, pp. 1147–1163, 2015.
techniques for unmanned aerial vehicles: Review and challenges,” [47] C. Campos, R. Elvira, J. J. G. Rodrı́guez, J. M. M. Montiel, and
Drones, vol. 7, no. 2, p. 89, 2023. J. D. Tardós, “Orb-slam3: An accurate open-source library for visual,
[25] E. Kakaletsis, C. Symeonidis, M. Tzelepi, I. Mademlis, A. Tefas, visual–inertial, and multimap slam,” IEEE Transactions on Robotics,
N. Nikolaidis, and I. Pitas, “Computer vision for autonomous uav flight vol. 37, no. 6, pp. 1874–1890, 2021.
safety: An overview and a vision-based safe landing pipeline example,” [48] P. Pisutsin, J. Xiao, and M. Feroskhan, “Omnidrone-det: Omnidirec-
Acm Computing Surveys (Csur), vol. 54, no. 9, pp. 1–37, 2021. tional 3d drone detection in flight,” in 2024 IEEE 20th International
[26] A. Mcfadyen and L. Mejias, “A survey of autonomous vision-based Conference on Automation Science and Engineering (CASE), 2024, pp.
see and avoid for unmanned aircraft systems,” Progress in Aerospace 2409–2414.
Sciences, vol. 80, pp. 1–17, 2016. [49] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi,
[27] R. Jenssen, D. Roverso et al., “Automatic autonomous vision-based S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis et al., “Event-
power line inspection: A review of current status and the potential role based vision: A survey,” IEEE transactions on pattern analysis and
of deep learning,” International Journal of Electrical Power & Energy machine intelligence, vol. 44, no. 1, pp. 154–180, 2020.
Systems, vol. 99, pp. 107–120, 2018. [50] W. Gao, K. Wang, W. Ding, F. Gao, T. Qin, and S. Shen, “Autonomous
[28] B. F. Spencer Jr, V. Hoskere, and Y. Narazaki, “Advances in computer aerial robot using dual-fisheye cameras,” Journal of Field Robotics,
vision-based civil infrastructure inspection and monitoring,” Engineer- vol. 37, no. 4, pp. 497–514, 2020.
ing, vol. 5, no. 2, pp. 199–222, 2019. [51] V. R. Kumar, S. Yogamani, H. Rashed, G. Sitsu, C. Witt, I. Leang,
[29] A. Bouguettaya, H. Zarzour, A. Kechida, and A. M. Taberkit, “Vehicle S. Milz, and P. Mäder, “Omnidet: Surround view cameras based multi-
detection from uav imagery with deep learning: A review,” IEEE task visual perception network for autonomous driving,” IEEE Robotics
Transactions on Neural Networks and Learning Systems, vol. 33, and Automation Letters, vol. 6, no. 2, pp. 2830–2837, 2021.
no. 11, pp. 6047–6067, 2022. [52] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based
[30] D. Hanover, A. Loquercio, L. Bauersfeld, A. Romero, R. Penicka, convolutional networks for accurate object detection and segmenta-
Y. Song, G. Cioffi, E. Kaufmann, and D. Scaramuzza, “Autonomous tion,” IEEE transactions on pattern analysis and machine intelligence,
drone racing: A survey,” arXiv e-prints, pp. arXiv–2301, 2023. vol. 38, no. 1, pp. 142–158, 2015.

[53] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international [73] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, “Detection
conference on computer vision, 2015, pp. 1440–1448. and tracking meet drones challenge,” IEEE Transactions on Pattern
[54] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7380–7399,
object detection with region proposal networks,” Advances in neural 2021.
information processing systems, vol. 28, 2015. [74] Y. Zheng, Z. Chen, D. Lv, Z. Li, Z. Lan, and S. Zhao, “Air-to-air visual
[55] A. D. Haumann, K. D. Listmann, and V. Willert, “DisCoverage: A detection of micro-uavs: An experimental evaluation of deep learning,”
new paradigm for multi-robot exploration,” in Proceedings - IEEE IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1020–1027,
International Conference on Robotics and Automation, 2010, pp. 929– 2021.
934. [75] D.-M. Seo, H.-J. Woo, M.-S. Kim, W.-H. Hong, I.-H. Kim, and S.-
[56] A. H. Tan, F. P. Bejarano, Y. Zhu, R. Ren, and G. Nejat, “Deep C. Baek, “Identification of asbestos slates in buildings based on faster
reinforcement learning for decentralized multi-robot exploration with region-based convolutional neural network (faster r-cnn) and drone-
macro actions,” IEEE Robotics and Automation Letters, vol. 8, no. 1, based aerial imagery,” Drones, vol. 6, no. 8, p. 194, 2022.
pp. 272–279, 2022. [76] H. Y. Lee, H. W. Ho, and Y. Zhou, “Deep learning-based monocular
[57] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and obstacle avoidance for unmanned aerial vehicle navigation in tree plan-
A. Farhadi, “Target-driven visual navigation in indoor scenes using tations: Faster region-based convolutional neural network approach,”
deep reinforcement learning,” in 2017 IEEE international conference Journal of Intelligent & Robotic Systems, vol. 101, no. 1, p. 5, 2021.
on robotics and automation (ICRA). IEEE, 2017, pp. 3357–3364. [77] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu,
[58] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, and A. C. Berg, “Ssd: Single shot multibox detector,” in European
M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and conference on computer vision. Springer, 2016, pp. 21–37.
R. Hadsell, “Learning to navigate in complex environments,” 5th [78] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
International Conference on Learning Representations, ICLR 2017 - once: Unified, real-time object detection,” in Proceedings of the IEEE
Conference Track Proceedings, 2017. conference on computer vision and pattern recognition, 2016, pp. 779–
[59] Q. Wu, X. Gong, K. Xu, D. Manocha, J. Dong, and J. Wang, “Towards 788.
target-driven visual navigation in indoor scenes via generative imitation [79] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo
learning,” IEEE Robotics and Automation Letters, vol. 6, no. 1, pp. series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
175–182, 2020. [80] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for
[60] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. Mc- dense object detection,” IEEE Transactions on Pattern Analysis and
Grew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schnei- Machine Intelligence, vol. 42, no. 2, pp. 318–327, 2020.
der, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, [81] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-shot refinement
and L. Zhang, “Solving rubik’s cube with a robot hand,” arXiv preprint, neural network for object detection,” in Proceedings of the IEEE
2019. conference on computer vision and pattern recognition, 2018, pp.
[61] A. K. Kamath, S. G. Anavatti, and M. Feroskhan, “A physics-informed 4203–4212.
neural network approach to augmented dynamics visual servoing of [82] T. Liang, X. Chu, Y. Liu, Y. Wang, Z. Tang, W. Chu, J. Chen, and
multirotors,” IEEE Transactions on Cybernetics, vol. 54, no. 11, pp. H. Ling, “Cbnet: A composite backbone network architecture for object
6319–6332, 2024. detection,” IEEE Transactions on Image Processing, vol. 31, pp. 6893–
[62] C. Wu, B. Ju, Y. Wu, X. Lin, N. Xiong, G. Xu, H. Li, and X. Liang, 6906, 2022.
“Uav autonomous target search based on deep reinforcement learning [83] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang,
in complex disaster scene,” IEEE Access, vol. 7, pp. 117 227–117 245, L. Dong et al., “Swin transformer v2: Scaling up capacity and
2019. resolution,” in Proceedings of the IEEE/CVF conference on computer
[63] C. Xiao, P. Lu, and Q. He, “Flying through a narrow gap using end-to- vision and pattern recognition, 2022, pp. 12 009–12 019.
end deep reinforcement learning augmented with curriculum learning [84] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision
and sim2real,” IEEE Transactions on Neural Networks and Learning transformer backbones for object detection,” in European Conference
Systems, vol. 34, no. 5, pp. 2701–2708, 2023. on Computer Vision. Springer, 2022, pp. 280–296.
[64] Y. Song, M. Steinweg, E. Kaufmann, and D. Scaramuzza, “Autonomous [85] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
drone racing with deep reinforcement learning,” in 2021 IEEE/RSJ S. Zagoruyko, “End-to-end object detection with transformers,” in
International Conference on Intelligent Robots and Systems (IROS). European conference on computer vision. Springer, 2020, pp. 213–
IEEE, 2021, pp. 1205–1212. 229.
[65] J. Xiao, P. Pisutsin, and M. Feroskhan, “Collaborative target search [86] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. Ni, and H.-Y.
with a visual drone swarm: An adaptive curriculum embedded multi- Shum, “Dino: Detr with improved denoising anchor boxes for end-
stage reinforcement learning approach,” IEEE Transactions on Neural to-end object detection,” in The Eleventh International Conference on
Networks and Learning Systems, vol. 36, no. 1, pp. 313–327, 2025. Learning Representations, 2022.
[66] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, “Object detection with [87] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
deep learning: A review,” IEEE Transactions on Neural Networks and Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”
Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019. Advances in neural information processing systems, vol. 30, 2017.
[67] S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, “Zoedepth: [88] J. Xiao, J. H. Chee, and M. Feroskhan, “Real-time multi-drone de-
Zero-shot transfer by combining relative and metric depth,” arXiv tection and tracking for pursuit-evasion with parameter search,” IEEE
preprint arXiv:2302.12288, 2023. Transactions on Intelligent Vehicles, pp. 1–11, 2024.
[68] H. Laga, L. V. Jospin, F. Boussaid, and M. Bennamoun, “A survey [89] Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li, “Efficient attention:
on deep learning techniques for stereo-based depth estimation,” IEEE Attention with linear complexities,” in Proceedings of the IEEE/CVF
Transactions on Pattern Analysis and Machine Intelligence, vol. 44, winter conference on applications of computer vision, 2021, pp. 3531–
no. 4, pp. 1738–1764, 2020. 3539.
[69] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, “A survey of convolutional [90] M. Maaz, A. Shaker, H. Cholakkal, S. Khan, S. W. Zamir, R. M.
neural networks: Analysis, applications, and prospects,” IEEE Trans- Anwer, and F. Shahbaz Khan, “Edgenext: efficiently amalgamated cnn-
actions on Neural Networks and Learning Systems, vol. 33, no. 12, pp. transformer architecture for mobile vision applications,” in European
6999–7019, 2022. Conference on Computer Vision. Springer, 2022, pp. 3–20.
[70] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, [91] D. Avola, L. Cinque, A. Diko, A. Fagioli, G. L. Foresti, A. Mecca,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., D. Pannone, and C. Piciarelli, “Ms-faster r-cnn: Multi-stream backbone
“An image is worth 16x16 words: Transformers for image recognition for improved faster r-cnn object detection and aerial tracking from uav
at scale,” arXiv preprint arXiv:2010.11929, 2020. images,” Remote Sensing, vol. 13, no. 9, p. 1670, 2021.
[71] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, [92] Z. Liu, Y. Shang, T. Li, G. Chen, Y. Wang, Q. Hu, and P. Zhu,
“Swin transformer: Hierarchical vision transformer using shifted win- “Robust multi-drone multi-target tracking to resolve target occlusion:
dows,” in Proceedings of the IEEE/CVF international conference on A benchmark,” IEEE Transactions on Multimedia, vol. 25, pp. 1462–
computer vision, 2021, pp. 10 012–10 022. 1476, 2023.
[72] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi, [93] B. Yuan, W. Ma, and F. Wang, “High speed safe autonomous landing
J. Fan, and Z. He, “A survey of visual transformers,” IEEE Transactions marker tracking of fixed wing drone based on deep learning,” IEEE
on Neural Networks and Learning Systems, pp. 1–21, 2023. Access, vol. 10, pp. 80 415–80 436, 2022.
