perception onboard in the real world, an uncertainty estimation module was trained with the Ajna network [37], which significantly increased the generalization capability of learning-based control for drones. The power of deep learning (DL) in handling uncertain information frees traditional approaches from complex computation that requires accurate modeling.

B. Visual Perception

Visual perception for drones is the ability of drones to perceive their surroundings and their own states through the extraction of necessary features for specific tasks with visual sensors. Light detection and ranging (LIDAR) and cameras are commonly used sensors (see Fig. 4) to perceive the surrounding environment for drones.

Fig. 4: Visual perception sensors for drones. (a) Velodyne surrounding LIDAR; (b) Intel D435 RGBD camera; (c) Sony event camera.
1) Light Detection And Ranging: LIDAR is a kind of active range sensor that relies on the calculated time of flight (TOF) between the transmitted and received laser beams to estimate the distance between the robot and the reflecting surfaces of objects [38]. Based on the scanning mechanism, LIDAR can be divided into solid-state LIDAR, which has a fixed field of view (FOV) without moving parts, and surrounding LIDAR, which spins to provide a 360-degree horizontal view. Surrounding LIDAR is also referred to as “laser scanning” or “3D scanning”, which creates a 3D representation of the explored environment using eye-safe laser beams. A typical LIDAR (see Fig. 4(a)) consists of laser emitters, laser receivers, and a spinning motor. The vertical FOV of a LIDAR is determined by the number of vertical arrays of lasers. For instance, a vertical array of 16 lasers scanning 30 degrees gives a vertical resolution of 2 degrees in a typical configuration. LIDAR has recently been used on drones for mapping [39], power grid inspection [40], pose estimation [41] and object detection [42]. LIDAR provides sufficient and accurate depth information for drones to navigate in cluttered environments. However, it is bulky and power-hungry and does not fit within the payload restrictions of agile autonomous drones. Meanwhile, the raycast representation used in simulation environments is hard to match with the inputs of a real LIDAR device, which brings many challenges for Sim2Real transfer when a learning approach is considered.

2) Camera: Cameras are external passive sensors used to monitor the drone’s geometric and dynamic relationship to its task, environment or the objects that it is handling. Cameras are commonly used perception sensors for drones to sense environment information, such as objects’ positions and a point cloud map of the environment. In contrast to a motion capture system, which can only broadcast global geometric and dynamic pose information within a limited space from an offboard synchronized system, cameras enable a drone to fly without space constraints. Cameras can provide positioning for drone navigation in GPS-denied environments via visual inertial odometry (VIO) [43]–[45] and visual simultaneous localization and mapping (V-SLAM) [35], [46], [47]. Meanwhile, object detection and depth estimation can be performed with cameras to obtain the relative positions and sizes of obstacles [48]. However, avoiding dynamic obstacles, and even physical attacks like bird chasing, poses fundamental challenges to the visual perception of agile vision-based drones. Motion blur, sparse-texture environments, and unbalanced lighting conditions can cause the loss of feature detection in VIO and object detection. LIDAR and event cameras [49] can partially address these challenges. However, LIDAR and event cameras are either too bulky or too expensive for agile drone applications. Considering the agility requirement of physical attack avoidance, lightweight dual-fisheye cameras are used for visual perception. With dual fisheye cameras, the drone can achieve better navigation capability [50] and omnidirectional visual perception [48], [51]. Sensor fusion and state estimation techniques are required to alleviate the accuracy loss brought by motion blur.

C. Machine Learning

Recently, ML, especially DL, has attracted much attention from various fields and has been widely applied to robotics for environmental exploration [55], [56], navigation in unknown environments [57]–[59], obstacle avoidance, and intelligent control [60]. In the domain of drones, learning-based methods have also achieved promising success, particularly with deep reinforcement learning (DRL) [18], [34], [61]–[65]. In [63], curriculum-learning-augmented end-to-end reinforcement learning (RL) was proposed for a UAV to fly through a narrow gap in the real world. A vision-based end-to-end learning method was successfully developed in [34] to fly agile quadrotors through complex wild and human-made environments with only onboard sensing and computation capabilities, such as depth information. A visual drone swarm was developed in [65] to perform collaborative target search with adaptive-curriculum-embedded multistage learning. These works verified the remarkable power of learning-based methods in drone applications, which pushes the agility and cooperation of drones to a level that classical approaches can hardly reach. Different from the classical approaches relying on separate mapping, localization, and planning, learning-based methods map the observations, such as the visual information or localization of obstacles, to commands directly without
further planning. This greatly helps drones handle uncertain information in operations. However, learning-based methods require massive experiences and training datasets to obtain good generalization capability, which poses another challenge in deployment over unknown environments.

III. OBJECT DETECTION WITH VISUAL PERCEPTION

Object detection is a pivotal module in vision-based learning drones when handling complex missions such as inspection, avoidance, and search and rescue. Object detection aims to find all the objects of interest in an image and determine their positions and sizes [66]. It is one of the core problems in the field of computer vision (CV). Nowadays, the applications of object detection include face detection, pedestrian detection, vehicle detection, and terrain detection in remote sensing images. Object detection has always been one of the most challenging problems in the field of CV due to the different appearances, shapes, and poses of various objects, as well as the interference of factors such as illumination and occlusion during imaging. At present, object detection algorithms can be roughly divided into two categories: multi-stage (two-stage) algorithms, whose idea is to first generate candidate regions and then perform classification, and one-stage algorithms, whose idea is to directly apply the algorithm to the input image and output the categories and corresponding positions. Beyond that, to retrieve 3D positions, depth estimation has been a popular research subbranch related to object detection, whether using monocular [67] or stereo depth estimation [68]. For a very long time, the core neural network module (backbone) of object detection has been the convolutional neural network (CNN) [69]. CNN is a classic neural network in image processing that originates from the study of the human optic nerve system. The main idea is to convolve the image with convolution kernels to obtain a series of reorganized features, and these reorganized features represent the important information of the image. As such, CNN not only has the ability to recognize the image but also effectively decreases the requirement for computing resources. Recently, vision transformers (ViTs) [70], originally proposed for image classification tasks, have been extended to the realm of object detection [71]. These models demonstrate superior performance by utilizing the self-attention mechanism, which processes visual information non-locally [72]. However, a major concern of ViTs is their high computational demand. This presents difficulties in achieving real-time inference, particularly on drone platforms with limited resources.

Fig. 5: Multi-stage object detection pipelines: (a) R-CNN; (b) Fast R-CNN; (c) Faster R-CNN.

A. Multi-stage Algorithms

Classic multi-stage algorithms include region-based CNN (R-CNN) [52], Fast R-CNN [53] and Faster R-CNN [54] (see Fig. 5). Multi-stage algorithms can basically meet the accuracy requirements in real-life scenarios, but the models are more complex and cannot readily be applied to scenarios with high-efficiency requirements. In the R-CNN structure [52], it is necessary to first generate some region proposals (RPs), then use convolutional layers for feature extraction, and then classify the regions according to these features. That is, the object detection problem is transformed into an image classification problem. The R-CNN model is very intuitive, but its disadvantage is that it is too slow, and the output is obtained via training multiple Support Vector Machines (SVMs). To solve the problem of slow training speed, the Fast R-CNN model was proposed (Fig. 5b). This model makes two improvements to R-CNN: (1) it first extracts features from the whole image with convolutional layers, so that only one forward pass is needed to obtain the features of all RPs; (2) it replaces training multiple SVMs with a single fully-connected layer followed by a softmax layer. These techniques greatly improve the computation speed but still fail to address the efficiency issue of the Selective Search Algorithm (SSA) for RP generation.

Faster R-CNN is an improvement on the basis of Fast R-CNN (see Fig. 5c). In order to solve the problem of SSA, the SSA that generates RPs in Fast R-CNN is replaced by a Region Proposal Network (RPN), yielding a model that integrates RP generation, feature extraction, object classification and object box regression. RPN is a fully convolutional network that simultaneously predicts object boundaries at each location. RPN is trained end-to-end to generate high-quality region proposals, which are then detected by Fast R-CNN. At the same time, RPN and Fast R-CNN share convolutional features.
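As a concrete illustration of the two-stage pipeline described above, the following minimal sketch (not taken from any of the surveyed works) runs an off-the-shelf Faster R-CNN detector from torchvision on a single frame; the 0.5 confidence threshold and the random placeholder frame are arbitrary choices for the example, assuming torchvision >= 0.13.

```python
# Minimal sketch: single-frame inference with a pretrained two-stage detector
# (Faster R-CNN with a ResNet-50 FPN backbone).
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

# Placeholder 3 x 480 x 640 RGB frame in [0, 1]; in practice this would be
# a frame from the drone's camera stream.
frame = torch.rand(3, 480, 640)

with torch.no_grad():
    # Internally the RPN proposes regions on the shared feature map, and the
    # second stage classifies each region and regresses box offsets.
    out = model([frame])[0]

keep = out["scores"] > 0.5  # arbitrary confidence threshold for illustration
print(out["boxes"][keep], out["labels"][keep], out["scores"][keep])
```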
Fig. 6: YOLO network architecture [78].

SSD is another classic one-stage object detection algorithm. The flowchart of SSD is to (1) first extract features from the image through a CNN, (2) generate feature maps, (3) extract feature maps of multiple layers, and then (4) generate default boxes at each point of the feature maps. Finally, (5) all the default boxes are classified and regressed, and non-maximum suppression (NMS) is applied to produce the final detections.

C. Vision Transformer

ViTs have emerged as the most active research field in object detection tasks recently, with models like Swin-Transformer [71], [83], ViTDet [84], DETR [85], and DINO [86] at the forefront. Unlike conventional CNNs, ViTs leverage self-attention mechanisms to process image patches as sequences, offering a more flexible representation of spatial hierarchies. The core mechanism of these models involves dividing an image into a sequence of patches and applying Transformer encoders [87] to capture complex dependencies between them. This process enables ViTs to efficiently learn global context.
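To make the patch-based mechanism concrete, the sketch below (plain PyTorch, not any of the cited detectors) splits an image into 16x16 patches, projects them into tokens, and runs a small Transformer encoder over the resulting sequence; the embedding size and layer count are arbitrary illustrative choices.

```python
# Minimal sketch of the ViT-style patchify-and-attend mechanism.
import torch
import torch.nn as nn

img = torch.rand(1, 3, 224, 224)                         # B x C x H x W
patchify = nn.Conv2d(3, 256, kernel_size=16, stride=16)  # 16x16 patch embedding
tokens = patchify(img).flatten(2).transpose(1, 2)        # B x 196 x 256 token sequence

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
)
# Self-attention lets every patch attend to every other patch, which is the
# non-local, global-context processing described above.
features = encoder(tokens)                               # B x 196 x 256
print(features.shape)
```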
Fig. 9: Based on the connection ways of visual perception and control, vision-based control methods can be divided into indirect methods, semi-direct methods, and end-to-end methods. (a) Indirect methods divide the mission into perception, mapping and planning with image processing; (b) semi-direct methods extract intermediate features from raw images for RL training (indirect-direct), or obtain required states from raw images and conduct planning (direct-indirect); (c) end-to-end methods map raw images to actions directly via DL.

on obstacle avoidance [16], [34], [100], [101] based on visual perception has attracted much attention in the past few years. Obstacle avoidance has been a main task for vision-based control as well as for the current learning algorithms for drones. From the perspective of how drones obtain visual perception (perception end) and how drones generate control commands from visual perception (control end), existing vision-based control methods can be categorized into indirect methods, semi-direct methods, and end-to-end methods. The relationship between these three categories is illustrated in Fig. 9. In the following, these methods will be discussed and evaluated in three categories, respectively.

A. Indirect Methods

Indirect methods [16], [33], [102]–[109] refer to extracting features from images or videos to generate visual odometry, depth maps, and 3D point cloud maps for drones to perform path planning based on traditional optimization algorithms (see Fig. 10). Obstacle states, such as 3D shape, position, and velocity, are detected and mapped before a maneuver is taken. Once online maps are built or obstacles are located, the drone can generate a feasible path or take actions to avoid obstacles. SOTA indirect methods generally divide the mission into several subtasks, namely perception, mapping and planning.

On the perception side, depth images are always required to generate corresponding distance and position information for navigation. A depth image is a grey-level or color image that can represent the distance between the surfaces of objects from the viewpoint of the agent. The illuminance is proportional to the distance from the camera: a lighter color denotes a nearer surface, and darker areas mean further surfaces. A depth map provides the necessary distance information for drones to make decisions to avoid static and dynamic obstacles. Currently, off-the-shelf RGB-D cameras, such as the Intel RealSense depth camera D415, the ZED 2 stereo camera, and the Structure Core depth camera, are widely used for drone applications. Therefore, traditional obstacle avoidance methods can treat depth information as a direct input. However, for omnidirectional perception in wide-view scenarios, efficient onboard monocular depth estimation is always required, which remains a challenge for existing methods.

Fig. 10: Indirect methods divide the mission into perception, mapping and planning. (a) An APF method used in drones’ dynamic obstacle avoidance [33]. (b) A vision-based drone traverses a cluttered indoor environment with a generated map and online path planning [106].

On the mapping side, a point cloud map [110] or OctoMap [111], representing a set of data points in a 3D space, is commonly generated. Each point has its own Cartesian coordinates and can be used to represent a 3D shape or an object. A 3D point cloud map is not from the view of a drone but constructs a global 3D map that provides global environmental information. The point-cloud map can be generated from a LIDAR scanner or from many overlapping images combined with depth information. An illustration of an original scene and an OctoMap are shown in Fig. 10(b), where the drone can travel around without colliding with static obstacles.

Planning is a basic requirement for a vision-based drone to avoid obstacles. Within the indirect methods, planning can be further divided into two categories: one is offline methods based on high-resolution maps and pre-known position information, such as Dijkstra’s algorithm [112], A-star [113], RRT-connect [114] and sequential convex optimization [115]; the other is online methods based on real-time visual perception and decision-making. Online methods can be further categorized into online path planning [16], [106], [107] and artificial potential field (APF) methods [33], [116].

Most vision-based drones rely on online methods. Compared to offline methods, which require an accurate pre-built global map, online methods provide advanced maneuvering capabilities for drones, especially in a dynamic environment. Currently, due to their advantages in optimization and prediction capabilities, online path planning methods have become the preferred choice for drone obstacle avoidance. For instance, in the SOTA work [106], Zhou Boyu et al. introduced a robust and efficient motion planning system called Fast-Planner for a vision-based drone to perform high-speed flight in an unknown cluttered environment. The key contribution of this work is a robust and efficient planning scheme incorporating path searching, B-spline optimization, and time adjustment to generate feasible and safe trajectories for obstacle avoidance. Using only onboard visual perception and computing, this work demonstrated agile drone navigation in unexplored
indoor and outdoor environments. However, this approach can only achieve maximum speeds of 3 m/s and requires 7.3 ms of computation per step. To improve the flight performance and save computation time, Zhou Xin et al. [16] provided a Euclidean Signed Distance Field (ESDF)-free gradient-based planning framework, EGO-Planner, for drone autonomous navigation in unknown obstacle-rich situations. Compared to the Fast-Planner, the EGO-Planner achieved faster speeds and saved a lot of computation time. However, these online path planning methods require bulky visual sensors, such as RGBD cameras or LIDAR, and a powerful onboard computer for the complex numerical calculation to obtain a local or global optimal trajectory.
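Planners such as Fast-Planner and EGO-Planner represent the trajectory with B-splines whose control points are then optimized for smoothness and safety. The fragment below is only a sketch of that representation step, fitting and sampling a cubic B-spline through a few 3D waypoints with SciPy; it does not implement the gradient-based optimization of either planner, and the waypoint values are invented for illustration.

```python
# Minimal sketch: represent a 3D path as a cubic B-spline and sample it densely,
# as done before smoothness/safety optimization in B-spline-based planners.
import numpy as np
from scipy.interpolate import splprep, splev

# A few coarse waypoints (e.g., from a kinodynamic path search), in meters.
waypoints = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 0.5, 1.2],
    [2.0, 0.0, 1.5],
    [3.0, -0.5, 1.5],
    [4.0, 0.0, 1.0],
]).T

tck, _ = splprep(waypoints, s=0.0, k=3)   # cubic spline through the waypoints
u = np.linspace(0.0, 1.0, 100)            # dense parameterization
x, y, z = splev(u, tck)                   # sampled trajectory points
print(np.stack([x, y, z], axis=1)[:3])
```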
In contrast to online path planning, the APF methods require fewer computation resources and can cope well with dynamic obstacle avoidance using limited sensor information. The APF algorithm is a robot path planning approach that uses an attractive force to reach the objective position and repulsive forces to avoid obstacles in an unknown environment [117]. Falanga et al. [33] developed an efficient and fast control strategy based on the APF method to avoid fast-approaching dynamic obstacles. The obstacles in [33] are represented as repulsive fields that decay over time, and the repulsive forces are generated from the first-order derivative of the repulsive fields at each time step. However, the computed repulsive forces only reach substantial values when the obstacle is very close, which may lead to unstable and aggressive behavior. Besides, APF methods are heuristic methods that cannot guarantee global optimality and robustness for drones. Hence, it is not ideal to adopt potential field methods to navigate through cluttered environments.
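To illustrate the force composition that APF methods rely on, the following sketch (a generic textbook formulation, not the specific time-decaying fields of [33]) computes an attractive force toward the goal and repulsive forces from nearby obstacles; the gains and influence radius are arbitrary example values.

```python
# Minimal sketch of a classic artificial potential field (APF) step.
import numpy as np

def apf_force(pos, goal, obstacles, k_att=1.0, k_rep=0.5, d0=2.0):
    """Attractive force toward the goal plus repulsive forces from obstacles
    closer than the influence radius d0 (classic potential-field formulation)."""
    f = k_att * (goal - pos)                    # attractive term
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 1e-6 < d < d0:
            # Repulsion grows sharply as the obstacle gets closer, which is
            # also why the force stays small until obstacles are very near.
            f += k_rep * (1.0 / d - 1.0 / d0) * (1.0 / d**2) * (diff / d)
    return f

pos = np.array([0.0, 0.0, 1.0])
goal = np.array([5.0, 0.0, 1.0])
obstacles = [np.array([2.0, 0.3, 1.0]), np.array([3.5, -0.4, 1.0])]
print(apf_force(pos, goal, obstacles))  # commanded direction before scaling
```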
B. End-to-end Methods

In contrast to the indirect methods, which divide the whole mission into multiple sub-tasks, such as perception, mapping, and planning, end-to-end methods [34], [57], [118]–[120] combine CV and RL to map the visual observations to actions directly (see Fig. 11). RL [121] is a technique for mapping the state (observation) space to the action space in order to maximize a long-term return with given rewards. The learner is not explicitly told which action to carry out but must figure out which action will yield the highest reward during the exploration process. A typical RL model features agents, the environment, reward functions, an action space, and a state space. The policy model achieves convergence via constant interactions between the agents and the environment, where the reward function guides the training process.

Fig. 11: Quadrotor drone flying with the end-to-end RL method [34]. The policy was first trained in the simulation platform and then transferred to real-world flight via Sim2Real.

With end-to-end methods, the visual perception of drones is encoded by a deep neural network (DNN) into an observation vector of the policy network. The mapping process is usually trained offline with abundant data, which requires high-performance computers and simulation platforms. The training dataset is collected from expert flight demonstrations for imitation learning (IL) or from simulations for online training. To better generalize the performance of a trained neural network model, scenario randomization (domain randomization) is essential during the training process.

Malik Aqeel Anwar et al. [119] presented an end-to-end RL approach called NAVREN-RL to navigate a quadrotor drone in an indoor environment with expert data and knowledge-based data aggregation. The reward function in [119] was formulated from a ground truth depth image and a generated depth image. Loquercio et al. [34] developed an end-to-end approach that can autonomously guide a quadrotor drone through complex wild and human-made environments at high speeds with purely onboard visual perception (depth images) and computation. The neural network policy was trained in a high-fidelity simulation environment with massive expert knowledge data. While end-to-end methods provide a straightforward way to generate obstacle avoidance policies for drones, they require massive training data with domain randomization (usually counted in the millions) to obtain acceptable generalization capabilities. Meanwhile, without expert knowledge data, it is challenging for the neural network policy to update its weights when the reward space is sparse. To implement end-to-end methods, the following aspects are commonly considered: the neural network architecture, the training process, and Sim2Real transfer.

1) Neural Network Architecture: The neural network architecture is the core component of end-to-end methods, which determines the computation efficiency and intelligence level of the policy. In the end-to-end method, the input of the neural network architecture is the raw image data (RGB/RGBD), and the output is the action vector an agent needs to take. The images are encoded into a vector and then concatenated with other normalized observations to form an input vector for the policy network. For the image encoder, there are many pre-trained neural network architectures that can be considered, such as ResNet [122], VGG [123], and the nature CNN [124]. Zhu et al. [57] developed a target-driven visual navigation approach for robots with end-to-end DRL, where a pre-trained ResNet-50 is used to encode the image into a feature vector (see Fig. 12(a)). In [59], a more data-efficient image encoder with 5 layers was designed for target-driven visual navigation for robots with end-to-end IL (see Fig. 12(b)). Before designing the neural network architecture, it is first required to determine the observation space and action space of the task. The training efficiency and space complexity are the two critical aspects that need to be considered in the designing process.
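As a minimal sketch of the architecture pattern described above (not the exact networks of [57] or [59]), the code below encodes a depth image with a small CNN, concatenates the embedding with a normalized state vector, and outputs a 4-dimensional action; all layer sizes, the input resolution, and the state/action dimensions are illustrative assumptions.

```python
# Minimal sketch of an end-to-end policy: CNN image encoder + state vector -> action.
import torch
import torch.nn as nn

class VisuomotorPolicy(nn.Module):
    def __init__(self, state_dim=9, action_dim=4):
        super().__init__()
        # Small "nature CNN"-style encoder for a 1 x 64 x 64 depth image.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = self.encoder(torch.zeros(1, 1, 64, 64)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(feat_dim + state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # e.g., normalized thrust/body rates
        )

    def forward(self, depth, state):
        z = self.encoder(depth)                     # image embedding
        return self.head(torch.cat([z, state], dim=1))

policy = VisuomotorPolicy()
action = policy(torch.rand(1, 1, 64, 64), torch.rand(1, 9))
print(action.shape)  # torch.Size([1, 4])
```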
Fig. 13: IL training process used in [34] to fly an agile drone through a forest, where the expert trajectories are generated and sampled for training and then followed by the trajectory tracking controller.
generalization capability of trained neural network models. A wide variety of Sim2Real techniques [137]–[141] have been developed to improve the generalization and transfer capabilities of models. Domain randomization [140] is one of these techniques to improve the generalization capability of the trained neural network to unseen environments. Domain randomization is a method of trying to discover a representation that can be used in a variety of scenes or domains. Existing domain randomization techniques [34], [63] for drones’ obstacle avoidance include position randomization, depth image noise randomization, texture randomization, size randomization, etc. Therefore, domain randomization is generally required in the training process to enhance the generalization capability of the trained model in deployment.
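A domain randomization loop can be sketched as follows: at the start of every training episode, the scene parameters listed above (obstacle positions, textures, sizes, depth noise) are resampled so the policy never overfits to one rendering of the world. The environment handle `sim` and the parameter ranges below are hypothetical placeholders, not the API of any specific simulator.

```python
# Minimal sketch of per-episode domain randomization for a drone simulator.
# `sim` is a hypothetical environment handle exposing the setters used below.
import random

def randomize_domain(sim, rng=random):
    # Position randomization: scatter obstacles within the flight volume.
    for obstacle in sim.obstacles:
        obstacle.position = [rng.uniform(-5.0, 5.0),
                             rng.uniform(-5.0, 5.0),
                             rng.uniform(0.5, 3.0)]
        # Size and texture randomization.
        obstacle.scale = rng.uniform(0.5, 2.0)
        obstacle.texture = rng.choice(["brick", "wood", "foliage", "metal"])
    # Depth-image noise randomization (standard deviation in meters).
    sim.depth_noise_std = rng.uniform(0.0, 0.05)
    # Lighting randomization.
    sim.light_intensity = rng.uniform(0.3, 1.5)

# Typical usage inside a training loop:
# for episode in range(num_episodes):
#     randomize_domain(sim)
#     rollout_and_update_policy(sim)
```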
C. Semi-direct Methods

Compared to the end-to-end methods, which generate actions from raw image data directly, semi-direct methods [18], [51], [100], [133], [143], [144] introduce an intermediate phase for drones to take actions from visual perception, aiming to improve the generalization and transfer capabilities of the methods over unseen environments (see Fig. 14). There are two ways to design semi-direct method architectures: one is to generate the required information from image processing (such as the relative positions of obstacles from object detection and tracking, or the point cloud map) and train the control policy with DRL; the other is to obtain the required states (such as a depth image or a 3D world model) directly from the raw image data with DL and perform the tasks using numerical or heuristic methods. These two methods can be denoted as indirect (front end)-direct (back end) methods and direct (front end)-indirect (back end) methods.

Fig. 14: Semi-direct methods extract intermediate features such as structure statistics and optical flow [100] or bounding boxes [133] for the following RL training to improve the generalization of vision-based learning.

1) Indirect-direct methods: Indirect-direct methods [18], [100], [133] first obtain intermediate features, such as the relative position or velocity of the obstacles, from image processing and then use this intermediate feature information as observations to train the policy neural network via DRL. Indirect-direct methods generally rely on designing suitable intermediate features. In [100], features related to depth cues, such as Radon features (30 dimensional), structure tensor statistics (15 dimensional), Laws’ masks (8 dimensional), and optical flow (5 dimensional), were extracted and concatenated into a single feature vector as the visual observation. Together with nine additional features, the control policy was trained with IL to navigate the drone through a dense forest environment. Moulay et al. [133] proposed a semi-direct vision-based learning control policy for UAV pursuit-evasion. Firstly, a deep object detector (YOLOv2) and a search area proposal (SAP) were used to predict the relative position of the target UAV in the next frame for target tracking. Afterward, DRL was adopted to predict the actions the follower UAV needs to perform to track the target UAV. Indirect-direct methods are able to improve the generalization capability of policy neural networks, but at the cost of heavy computation and time overhead.

Fig. 15: Via DL, a 3D depth space is generated from the monocular image for obstacle avoidance with the direct-indirect method [142].

2) Direct-indirect methods: Direct-indirect methods try to obtain depth images [145], [146] and track obstacles/targets [93], [142] through training, and then use non-learning-based methods, such as path planning or APF, to avoid obstacles. Direct-indirect methods can be applied to a microlight drone with only monocular vision, but they require a lot of training data to obtain depth images or 3D poses of the obstacles. Michele et al. [142] developed an object detection system with a monocular camera to detect obstacles at very long range and very high speed (see Fig. 15), without particular assumptions on the type of motion. With a DNN trained on real and synthetic image data, fast, robust and consistent depth information can be used for drones’ obstacle avoidance. Direct-indirect methods address the ego drift problem of monocular depth estimation using Structure from Motion (SfM) and provide a direct way to get depth information from the image. However, the massive training dataset and limited generalization capability are the main challenges for their further applications.
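As one way to realize the “direct” front end of such a pipeline, the sketch below uses the publicly available MiDaS monocular depth model via torch.hub to turn a single RGB frame into a relative depth map, which a conventional planner or APF module could then consume; the model variant and post-processing are illustrative choices, not the networks used in [142], [145], [146].

```python
# Minimal sketch: monocular relative depth from one RGB frame with MiDaS,
# as a drop-in "direct" perception front end for a non-learning planner.
import torch
import numpy as np

model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")  # lightweight variant
model.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

frame = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)  # placeholder RGB frame
batch = transform(frame)

with torch.no_grad():
    prediction = model(batch)                          # inverse relative depth
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=frame.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()

# `depth` is relative (up to scale/shift); a planner would need metric scaling,
# e.g., from SfM or a known object size, before computing distances.
print(depth.shape)
```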
TABLE II: Comparison of Performance for Vision-Based Control Methods (SR (%) or distance (m) for performance)

Method  Task                 Processor           Performance  FPS
[102]   navigation           NUC i7              78.6%        25
[103]   path planning        i7-5500U            NA           9.2
[106]   path planning        Jetson TX2          77.8%        15
[105]   path planning        DJI Manifold 2-C    84.6 m       18
[33]    obstacle avoidance   Jetson TX2          93.0%        280
[107]   path planning        i7-8550U            100.0%       100
[119]   navigation           NVIDIA GTX1080      5.8 m        NA
[118]   obstacle avoidance   NVIDIA GTX1080      142 m        5.3
[34]    navigation           Jetson TX2          90%          24
[65]    target search        Jetson Xavier NX    64.0%        21
[100]   navigation           NVIDIA GTX1080      62%          10
[133]   target tracking      NA                  80.0%        30
[18]    drone racing         Jetson TX2          91.0%        20

D. Comparison and Discussion

Overall, the field of vision-based control for drones encompasses a variety of methods, each with its own unique approach to perception and control. A comprehensive overview of these methods, along with key studies in each category, is summarized in Table I, and their performance and runtime are listed in Table II, which provides a detailed comparison of their perception and control strategies. Indirect methods rely on traditional optimization algorithms and depth or 3D point cloud maps for navigation and obstacle avoidance. End-to-end methods leverage DNNs for visual perception from monocular cameras or depth images, and utilize RL for direct action mapping. Semi-direct methods balance computational efficiency and generalization by using intermediate features from image processing and a combination of DRL and heuristic methods for action generation, but introduce extra computation costs. While traditional indirect methods perform more robustly across different missions if accurate depth maps or point cloud maps are available, learning-based methods can achieve higher runtime frequency onboard (≥ 20 FPS) by learning directly from visual perception, and offer better optimization capability to address uncertainties.

V. DATASETS AND SIMULATORS

Datasets and simulators play essential roles in vision-based learning for drones. Collected datasets significantly contribute to the training of commonly used neural networks, such as those for depth estimation, drone detection and tracking, and drone dynamics modeling. Meanwhile, the development of realistic simulators has accelerated the training and verification of designed learning methods before deploying them in real-world scenarios. In the following, publicly available datasets and simulators are discussed and summarized in Table III.

A. Datasets

A wide range of publicly available datasets supports the development of vision-based learning methods for drones. These datasets enable researchers to train and evaluate perception, tracking, and control algorithms under various environmental conditions. The well-known VisDrone dataset [73] provides over 10,000 images and videos captured in urban environments under various weather and lighting conditions. With dense annotations, it is widely used for object detection and tracking tasks from the drone view. Similarly, the Anti-UAV dataset [147] focuses on outdoor UAV tracking, offering 318 RGB-T videos recorded during both day and night, enabling models to handle varying lighting conditions. The SUAV-DATA dataset [148] comprises over 5,000 multi-scale annotated images of small UAVs at different altitudes, addressing the challenges of small object detection in aerial imagery. The Det-Fly dataset [74] complements this by providing air-to-air annotated images of drones in flight for drone navigation and target detection, supporting research into collision avoidance and flight planning. The MIDGARD dataset [149] introduces multi-modal data, including RGB, depth, and thermal imagery, along with rich annotations for various tasks. It is an essential resource for studying integrated perception systems in various environments. In comparison, the MDMT dataset [92] focuses on multi-drone, multi-target tracking, with over 39,000 annotated frames involving occlusion and overlapping targets.

For aggressive motion estimation, the UZH-FPV dataset [150] provides high-resolution RGB images, inertial measurements, event-camera data, and precise ground truth poses. Captured during agile drone maneuvers, this dataset has driven advancements in motion planning and state estimation, particularly agile flight with learning-based methods [17]. The NeuroBEM dataset [151] combines aerodynamic modeling with highly aggressive flight data. It includes over an hour of flight data with pose, body rate, and battery voltage information, enabling detailed dynamics modeling for quadrotors. Similarly, the DroneRacing dataset [18] focuses on dynamics identification for high-speed flight, providing raw flight data with states and ground truth annotations, catering specifically to learning-based methods in drone racing competitions.

B. Simulators

Simulators are crucial tools for training and validating vision-based learning algorithms in customizable and risk-free environments. They offer controlled scenarios for evaluating navigation, obstacle avoidance, and collaboration.

AirSim [134] is a highly detailed vehicle simulator widely used for navigation and obstacle avoidance [154]. It offers customizable environments and realistic physics, making it suitable for benchmarking and training vision-based systems. Similarly, Flightmare [152] is designed specifically for drone applications, such as racing and navigation with learning methods [17], [18], [155]. It provides RGBD camera data and drone-specific environments optimized for high-speed testing. The Unity ML-Agents toolkit [135] serves as a versatile platform for training DRL algorithms, especially in multiagent systems (MAS). It supports collaborative learning and self-play and can be tailored for complex scenarios; it has been used for collaborative target search [65], [132] and multi-pursuit evasion [156]. On the other hand, gym-pybullet-drones [153] focuses on multi-drone control tasks, providing predefined scenarios for collaborative and individual learning.

These simulators accelerate the development of vision-based learning algorithms, enabling scholars to test and refine their methods before deploying them in real-world environments.
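Regardless of the specific platform, these simulators expose a Gym-style reset/step interface, so a training or evaluation loop generally follows the pattern sketched below; CartPole-v1 is used here only as a runnable stand-in for a drone environment such as those provided by Flightmare or gym-pybullet-drones.

```python
# Minimal sketch of the Gym-style interaction loop that drone simulators expose.
import gymnasium as gym

env = gym.make("CartPole-v1")   # stand-in; substitute a drone environment here
obs, info = env.reset(seed=0)

for step in range(200):
    action = env.action_space.sample()          # placeholder for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:                 # crash / goal reached / timeout
        obs, info = env.reset()

env.close()
```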
By combining the realism of datasets with the adaptability of simulators, scholars can efficiently address critical challenges in vision-based drone operations, such as dynamic navigation, real-time perception, and multiagent collaboration.

VI. APPLICATIONS AND CHALLENGES

A. Single Drone Application

The versatility of single drones is increasingly recognized in a variety of challenging environments. These autonomous systems, with their inherent advantages, are effectively employed in critical areas such as hazardous environment detection and search and rescue operations (see Fig. 16). Single drone applications in vision-based learning primarily involve tasks like obstacle avoidance [34], [36], surveillance [157]–[160], search-and-rescue operations [161]–[163], environmental monitoring [164]–[167], industrial inspection [28], [168], [169] and autonomous racing [170]–[172]. Each field, while benefiting from the unique capabilities of drones, also presents its own set of challenges and areas for development.

1) Obstacle Avoidance: The development of obstacle avoidance capabilities in drones, especially for vision-based control systems, poses significant challenges. Recent studies have primarily focused on static or simple dynamic environments, where obstacle paths are predictable [32], [34], [173]. However, complex scenarios involving unpredictable physical attacks from birds or intelligent adversaries remain largely unaddressed. For instance, [32], [33], [36], [174] have explored basic dynamic obstacle avoidance but do not account for adversarial environments. To effectively handle such threats, drones require advanced features like omnidirectional visual perception and agile maneuvering capabilities. Current research, however, is limited in addressing these needs, underscoring the necessity for further development in drone technology to enhance evasion strategies against smart, unpredictable adversaries.

2) Surveillance: While drones play a pivotal role in surveillance tasks, their deployment is not without challenges. Key challenges include managing high data processing loads and addressing the limitations of onboard computational resources. In addressing these challenges, Singh et al. [157] presented a real-time drone surveillance system used to identify violent individuals in public areas. The proposed study was facilitated by cloud processing of drone images to address the challenge of slow and memory-intensive computations while still maintaining onboard short-term navigation capabilities. Additionally, in the study [160], a drone-based crowd surveillance system was tested with the goal of saving the drone’s scarce battery energy. This approach involved offloading video data processing from the drones by employing the Mobile Edge Computing (MEC) method. Nevertheless, while off-board processing diminishes computational demands and energy consumption, it inevitably heightens the need for data transmission. Achieving real-time surveillance in environments with limited signal connectivity is an additional critical issue that requires resolution.

3) Search and Rescue: In the field of search and rescue operations, a primary challenge is extracting maximum useful information from limited data sources. This is crucial for improving the efficiency and success rate of these missions. Goodrich et al. [161] address this by developing a contour search algorithm designed to optimize video data analysis, enhancing the capability to identify key elements swiftly. However, incorporating temporal information into this algorithm introduces additional computational demands. These increased requirements present new challenges, such as the need for more powerful processing capabilities and potentially greater energy consumption.

4) Environmental Monitoring: A major challenge for environmental monitoring lies in efficiently collecting high-resolution data while navigating the constraints of battery life, flight duration, and diverse weather conditions. Addressing this, Senthilnath et al. [164] showcased the use of fixed-wing and Vertical Take-Off and Landing (VTOL) drones in vegetation analysis, focusing on the challenge of detailed mapping through spectral-spatial classification methods. In another study, Lu et al. [167] demonstrated the utility of drones for species classification in grasslands, contributing to the development of methodologies for processing drone-acquired imagery, which is crucial for environmental assessment and management. While these studies represent significant steps in drone applications for environmental monitoring, several challenges persist. Future research aims to improve drones’ resilience to diverse environmental conditions and extend their operational range and duration to comprehensively cover extensive and varied landscapes.

5) Industrial Inspection: In industrial inspection, drones face key challenges like safely navigating complex environments and conducting precise measurements in the presence of various disturbances. Kim et al. [168] addressed the challenge of autonomous navigation by using drones for proximity
measurement among construction entities, enhancing safety in the construction industry. Additionally, Khuc et al. [169] focused on precise structural health inspection with drones, especially in high or inaccessible locations. Despite these advancements in autonomous navigation and measurement accuracy, maintaining data accuracy and reliability in industrial settings with interference from machinery, electromagnetic fields, and dynamic obstacles continues to be a significant challenge, necessitating advanced autonomy and intelligence in this domain.

6) Autonomous Racing: In autonomous drone racing, the central challenge is reducing the delays in visual information processing and decision making, and enhancing the adaptability of perception networks. In [170], a novel sensor fusion method was proposed to enable high-speed autonomous racing for mini-drones. This work also addressed issues with occasional large outliers and vision delays commonly encountered in fast drone racing. Another work [171] introduced an innovative approach to drone control, where a DNN was used to fuse trajectories from multiple controllers. In the latest work [18], the vision-based drone outperformed world champions in the racing task, relying purely on onboard perception and a trained neural network. The primary challenge in autonomous drone racing, as identified in these studies, lies in the need for improved adaptability of perception networks to various environments and textures, which is crucial for the high-speed demands of the sport.

Overall, the primary challenges in single drone applications include limited battery life, which restricts operational duration, and the need for effective obstacle avoidance in dynamic environments. Additionally, limitations in data processing capabilities affect real-time decision-making and adaptability. Advanced technological solutions are essential to overcome these challenges, ensuring that single drones can operate efficiently and reliably in diverse scenarios and paving the way for future innovations.

B. Multi-Drone Application

While single drones offer convenience, their limited monitoring range has prompted interest in multi-drone collaboration. This approach seeks to overcome range limitations by leveraging the collective capabilities of multiple drones for broader, more efficient operations. Multi-drone applications (see Fig. 17), encompassing activities such as coordinated surveying [176], [180]–[182], cooperative tracking [177], [183], [184], synchronized monitoring [178], [185], [186], and disaster response [65], [132], [175], [179], [187], [188], bring the added complexity of inter-drone communication, coordination and real-time data integration. These applications leverage the combined capabilities of multiple drones to achieve greater efficiency and coverage than single drone operations.

1) Coordinated Surveying: In coordinated surveying, several challenges are prominent: merging diverse data from individual drones and addressing the computational demands of the cooperative process. These challenges have been tackled by several works. In [180], a monocular visual odometry algorithm was used to enable autonomous onboard control with cooperative localization and mapping. This work addressed the challenges of coordinating and merging the different maps constructed by each drone platform and, moreover, the computational bottlenecks typically associated with 3D RGB-D cooperative SLAM. Micro aerial vehicles also play an important role in the field of coordinated surveying. Similarly, in [182], a sensor fusion scheme was proposed to improve the accuracy of
and wireless communication channels. Niu [190] introduced a framework wherein a single UAV served multiple UGVs, achieving optimal path planning based on aerial imagery. This approach outperformed traditional heuristic path planning algorithms. Furthermore, Liu et al. [191] presented a joint UAV-UGV architecture designed to overcome the frequent target occlusion issues encountered by single-ground platforms. This architecture enabled accurate and dynamic target localization, leveraging visual inputs from UAVs.

2) Precise Landing: Given the potential necessity for battery recharging and emergency maintenance of UAVs during extended missions, the UAV-UGV heterogeneous system can facilitate UAV landings. This design reduces reliance on manual intervention while enhancing the UAVs’ capacity for prolonged, uninterrupted operation. In the study [195], a vision-based heterogeneous system was proposed to address the challenge of UAVs’ temporary landings during long-range inspections. This system accomplished precise target geolocation and safe landings in the absence of GPS data by detecting QR codes mounted on UGVs. Additionally, Xu et al. [196] illustrated the application of UAV heterogeneous systems for landing on USVs. A similar approach was explored for UAVs’ target localization and landing, leveraging QR code recognition on USVs.

3) Inspection and Detection: Heterogeneous UAV systems present an effective solution to overcome the limitations of background clutter and incoherent target interference often encountered in single-ground vision detection platforms. By leveraging the expansive FOV and swift scanning capabilities of UAVs, in conjunction with the endurance and high accuracy of UGVs, such heterogeneous systems can achieve time-efficient and accurate target inspection and detection in specific applications. For instance, Kalinov et al. [198] introduced a heterogeneous inventory management system, pairing a ground robot with a UAV. In this system, the ground robot determined motion trajectories by deploying a SLAM algorithm, while the UAV, with its high maneuverability, was tasked with scanning barcodes. Furthermore, Pretto et al. [200] developed a heterogeneous farming system to enhance agricultural automation. This innovative system utilized the aerial perspective of the UAV to assist in farmland segmentation and the classification of crops from weeds, significantly contributing to the advancement of automated farming practices.

To sum up, most of the applications above primarily focus on single-UAV-to-single-UGV or single-UAV-to-multiple-UGV configurations, with few scenarios designed for multiple UAVs interacting with multiple UGVs. It is evident that there remains significant research potential in the realm of vision-based, multiagent-to-multiagent heterogeneous systems. Key areas such as communication and data integration within heterogeneous systems, coordination and control in dynamic and unpredictable environments, and individual agents’ autonomy and decision-making capabilities warrant further exploration.

VII. OPEN QUESTIONS AND POTENTIAL SOLUTIONS

Despite significant advancements in the domain of vision-based learning for drones, numerous challenges remain that impede the pace of development and the real-world applicability of these methods. These challenges span various aspects, from data collection and simulation accuracy to operational efficiency and safety concerns.

A. Dataset

A major impediment in the field is the absence of a comprehensive, public dataset analogous to Open X-Embodiment [201] in robotic manipulation. Such a unified dataset should ideally encompass a wide range of scenarios and tasks to facilitate generalizable learning. As shown in Table III, the current reliance on domain-specific datasets like Anti-UAV [147] and SUAV-DATA [148] limits the scope and applicability of research. A potential solution is the collaborative development of a diverse, multi-purpose dataset by academic and industry stakeholders, incorporating various tasks and environmental, weather, and lighting conditions.

B. Simulator

While simulators are vital for training and validating vision-based learning models, their realism and accuracy often fall short of replicating real-world complexities. This gap hampers the transition from simulation to actual deployment. Meanwhile, there is no unified simulator covering most drone tasks, resulting in repetitive, domain-specific simulator development [134], [152], [153]. Drawing inspiration from the self-driving car domain, the integration of off-the-shelf and highly flexible simulators such as CARLA [202] could be a solution. These simulators, known for their advanced features in realistic traffic simulation and diverse environmental conditions, can provide more authentic and varied data for training. Adapting such simulators to drone-specific scenarios could greatly enhance the quality of training and testing environments.

C. Sample Efficiency

Enhancing sample efficiency in ML models for drones is crucial, particularly in environments where data collection is hazardous or impractical. Even though simulators are available for generating training data, there are still challenges in ensuring the realism and diversity of these simulated environments. The gap between simulated and real-world data can lead to performance discrepancies when models are deployed in actual scenarios. Developing algorithms that leverage transfer learning [203], few-shot learning [204], and synthetic data generation [205] could provide significant strides in learning efficiently from limited datasets. These approaches aim to bridge the gap between simulation and reality, enhancing the applicability and robustness of ML models in diverse and dynamic real-world situations.

D. Inference Speed

Balancing inference speed with accuracy is a critical challenge for drones operating in dynamic environments, even though some approaches have reached near real-time inference capability, as listed in Table II. The key lies in optimizing ML
16
models for edge computing, enabling drones to process data particularly emphasizing their evolving role in multi-drone
and make decisions swiftly. Techniques like model pruning systems and complex environments such as search and res-
[206], [207], quantization [208], [209], distillation [210] and cue missions and adversarial settings. The investigation re-
the development of specialized hardware accelerators can play vealed that drones are increasingly becoming sophisticated,
a pivotal role in this regard. autonomous systems capable of intricate tasks, largely driven
by advancements in AI, ML, and sensor technology. The
E. Real World Deployment exploration of micro and nano drones, innovative structural
designs, and enhanced autonomy stand out as key trends shap-
Transitioning from controlled simulation environments to
ing the future of drone technology. Crucially, the integration of
real-world deployment (Sim2Real) involves addressing un-
visual perception with ML algorithms, including DRL, opens
predictability in environmental conditions, regulatory compli-
up new avenues for drones to operate with greater efficiency
ance, and adaptability to diverse operational contexts. Domain
and intelligence. These capabilities are particularly pertinent in
randomization [140] tries to address the Sim2Real issue in
the context of object detection and decision-making processes,
a certain way but is limited to predicted scenarios with
vital for complex drone operations. This survey categorized
known domain distributions. Developing robust and adap-
vision-based control methods into indirect, semi-direct, and
tive algorithms capable of on-the-fly continuous learning and
end-to-end methods, offering an in-depth understanding of
decision-making, along with rigorous field testing under varied
how drones perceive and interact with their environment.
conditions, can aid in overcoming these challenges.
Applications of vision-based learning drones, spanning from
single-agent to multiagent and heterogeneous systems, demon-
F. Embodied Intelligence in Open World strate their versatility and potential in various sectors, includ-
Existing vision-based learning methods for drones require ing agriculture, industrial inspection, and emergency response.
explicit task descriptions and formal constraints, while in an However, this expansion also brings forth challenges such
open world, it is hard to provide all necessary formulations at as data processing limitations, real-time decision-making, and
the beginning to find the optimal solution. For instance, in a ensuring robustness in diverse operational scenarios.
complex search and rescue mission, the drone can only find This survey highlights open questions and potential so-
the targets first and conduct rescue based on the information lutions in the field, stressing the need for comprehensive
collected. In each stage, the task may change, and there is datasets, realistic simulators, improved sample efficiency, and
no prior explicit problem at the start. Human interactions are faster inference speeds. Addressing these challenges is crucial
necessary during this mission. With large language models and for the effective deployment of drones in real-world scenarios.
embodied intelligence, the potential of drone autonomy can Safety and security, especially in the context of adversarial
be greatly increased. Through interactions in the open world environments, remain paramount concerns that need ongo-
[21], [211] or provide few-shot imitation [212], vision-based ing attention. While significant progress has been made in
learning can emerge with full autonomy for drone applications. vision-based learning for drones, the journey towards fully
autonomous, intelligent, and reliable systems, even AGI in the
G. Safety and Security

Ensuring the safety and security of drone operations is paramount, especially in densely populated or sensitive areas. This includes not only physical safety but also cybersecurity concerns [213], [214]. The security aspect extends beyond data protection to the resilience of drones against adversarial attacks [215]. Such attacks can take various forms, from signal jamming to deceptive inputs aimed at misleading vision-based systems and DRL algorithms [216]. Addressing these concerns requires a multifaceted approach. First, incorporating advanced cryptographic techniques ensures data integrity and secure communication. Second, anomaly detection systems can help identify and mitigate unusual patterns indicative of adversarial interference. Moreover, improving the robustness of learning models against adversarial attacks and investigating the explainability of the designed models [217] are imperative. Lastly, regular software updates and patches, informed by the latest threat intelligence, can fortify the drone's defenses against evolving cyber threats.
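As one concrete way to probe the robustness concern raised above, the minimal sketch below applies a single-step, FGSM-style perturbation to a camera frame and checks whether a vision-based policy's output changes under a small L-infinity budget. The `TinyPolicy` network, the `epsilon` budget, and the random input frame are illustrative assumptions rather than components of any surveyed system; a real evaluation would target the deployed perception or control network with task-appropriate threat models [216].

```python
# Minimal FGSM-style robustness probe for a vision-based policy (illustrative only;
# TinyPolicy and the epsilon budget are assumptions, not a surveyed system).
import torch
import torch.nn as nn


class TinyPolicy(nn.Module):
    """Stand-in for a vision-based detection or control network."""

    def __init__(self, num_actions: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16, num_actions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))


def fgsm_perturb(policy: nn.Module, image: torch.Tensor, label: torch.Tensor,
                 epsilon: float = 4 / 255) -> torch.Tensor:
    """Return an L-infinity bounded adversarial image via the fast gradient sign method."""
    image = image.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(policy(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()  # one-step ascent on the loss
    return adversarial.clamp(0.0, 1.0).detach()        # keep pixels in [0, 1]


if __name__ == "__main__":
    policy = TinyPolicy()
    frame = torch.rand(1, 3, 64, 64)                   # a camera frame in [0, 1]
    nominal = policy(frame).argmax(dim=1)              # output before the attack
    attacked = policy(fgsm_perturb(policy, frame, nominal)).argmax(dim=1)
    print("output changed under attack:", bool((nominal != attacked).item()))
```

Such perturbation checks complement, rather than replace, the anomaly detection and cryptographic measures discussed above.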
VIII. CONCLUSION

This comprehensive survey has explored the rapidly growing field of vision-based learning for drones, covering applications including agriculture, industrial inspection, and emergency response. However, this expansion also brings forth challenges such as data processing limitations, real-time decision-making, and ensuring robustness in diverse operational scenarios.

This survey highlights open questions and potential solutions in the field, stressing the need for comprehensive datasets, realistic simulators, improved sample efficiency, and faster inference speeds. Addressing these challenges is crucial for the effective deployment of drones in real-world scenarios. Safety and security, especially in the context of adversarial environments, remain paramount concerns that need ongoing attention. While significant progress has been made in vision-based learning for drones, the journey towards fully autonomous, intelligent, and reliable systems, even AGI in the physical world, is ongoing. Future research and development in this field hold the promise of revolutionizing various industries, pushing the boundaries of what is possible with drone technology in complex and dynamic environments.

REFERENCES

[1] R. Rajkumar, I. Lee, L. Sha, and J. Stankovic, "Cyber-physical systems: the next computing revolution," in Design Automation Conference. IEEE, 2010, pp. 731–736.
[2] R. Baheti and H. Gill, "Cyber-physical systems," The impact of control technology, vol. 12, no. 1, pp. 161–166, 2011.
[3] M. Hassanalian and A. Abdelkefi, "Classifications, applications, and design challenges of drones: A review," Progress in Aerospace Sciences, vol. 91, pp. 99–131, 2017.
[4] P. Nooralishahi, C. Ibarra-Castanedo, S. Deane, F. López, S. Pant, M. Genest, N. P. Avdelidis, and X. P. Maldague, "Drone-based non-destructive inspection of industrial sites: A review and case studies," Drones, vol. 5, no. 4, p. 106, 2021.
[5] N. J. Stehr, "Drones: The newest technology for precision agriculture," Natural Sciences Education, vol. 44, no. 1, pp. 89–91, 2015.
[6] P. M. Kornatowski, M. Feroskhan, W. J. Stewart, and D. Floreano, "Downside up: rethinking parcel position for aerial delivery," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4297–4304, 2020.
[7] P. M. Kornatowski, M. Feroskhan, W. J. Stewart, and D. Floreano, "A morphing cargo drone for safe flight in proximity of humans," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4233–4240, 2020.
[8] D. Câmara, "Cavalry to the rescue: Drones fleet to help rescuers operations over disasters scenarios," in 2014 IEEE Conference on Antenna Measurements & Applications (CAMA). IEEE, 2014, pp. 1–4.
[9] Y.-H. Hsiao, S. Bai, Y. Zhou, H. Jia, R. Ding, Y. Chen, Z. Wang, and P. Chirarattananon, "Energy efficient perching and takeoff of a miniature rotorcraft," Communications Engineering, vol. 2, no. 1, p. 38, 2023.
[10] W. Shen, J. Peng, R. Ma, J. Wu, J. Li, Z. Liu, J. Leng, X. Yan, and M. Qi, "Sunlight-powered sustained flight of an ultralight micro aerial vehicle," Nature, vol. 631, no. 8021, pp. 537–543, 2024.
[11] M. Graule, P. Chirarattananon, S. Fuller, N. Jafferis, K. Ma, M. Spenko, R. Kornbluh, and R. Wood, "Perching and takeoff of a robotic insect on overhangs using switchable electrostatic adhesion," Science, vol. 352, no. 6288, pp. 978–982, 2016.
[12] D. Floreano and R. J. Wood, "Science, technology and the future of small autonomous drones," Nature, vol. 521, no. 7553, pp. 460–466, 2015.
[13] J. Shu and P. Chirarattananon, "A quadrotor with an origami-inspired protective mechanism," IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3820–3827, 2019.
[14] E. Ajanic, M. Feroskhan, S. Mintchev, F. Noca, and D. Floreano, "Bioinspired wing and tail morphing extends drone flight capabilities," Sci. Robot., vol. 5, p. eabc2897, 2020.
[15] L. Chen, J. Xiao, Y. Zheng, N. A. Alagappan, and M. Feroskhan, "Design, modeling, and control of a coaxial drone," IEEE Transactions on Robotics, vol. 40, pp. 1650–1663, 2024.
[16] X. Zhou, J. Zhu, H. Zhou, C. Xu, and F. Gao, "Ego-swarm: A fully autonomous and decentralized quadrotor swarm system in cluttered environments," in 2021 IEEE international conference on robotics and automation (ICRA). IEEE, 2021, pp. 4101–4107.
[17] E. Kaufmann, A. Loquercio, R. Ranftl, M. Müller, V. Koltun, and D. Scaramuzza, "Deep drone acrobatics," Proceedings of Robotics: Science and Systems XVI, 2020.
[18] E. Kaufmann, L. Bauersfeld, A. Loquercio, M. Müller, V. Koltun, and D. Scaramuzza, "Champion-level drone racing using deep reinforcement learning," Nature, vol. 620, no. 7976, pp. 982–987, 2023.
[19] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, "Progprompt: Generating situated robot task plans using large language models," in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 11523–11530.
[20] S. Liu, H. Zhang, Y. Qi, P. Wang, Y. Zhang, and Q. Wu, "Aerialvln: Vision-and-language navigation for uavs," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15384–15394.
[21] A. Gupta, S. Savarese, S. Ganguli, and L. Fei-Fei, "Embodied intelligence via learning and evolution," Nature Communications, vol. 12, no. 1, p. 5721, 2021.
[22] Y. Tang, C. Zhao, J. Wang, C. Zhang, Q. Sun, W. X. Zheng, W. Du, F. Qian, and J. Kurths, "Perception and navigation in autonomous systems in the era of learning: A survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 12, pp. 9604–9624, 2023.
[23] Y. Lu, Z. Xue, G.-S. Xia, and L. Zhang, "A survey on vision-based uav navigation," Geo-spatial Information Science, vol. 21, no. 1, pp. 21–32, 2018.
[24] M. Y. Arafat, M. M. Alam, and S. Moh, "Vision-based navigation techniques for unmanned aerial vehicles: Review and challenges," Drones, vol. 7, no. 2, p. 89, 2023.
[25] E. Kakaletsis, C. Symeonidis, M. Tzelepi, I. Mademlis, A. Tefas, N. Nikolaidis, and I. Pitas, "Computer vision for autonomous uav flight safety: An overview and a vision-based safe landing pipeline example," ACM Computing Surveys (CSUR), vol. 54, no. 9, pp. 1–37, 2021.
[26] A. Mcfadyen and L. Mejias, "A survey of autonomous vision-based see and avoid for unmanned aircraft systems," Progress in Aerospace Sciences, vol. 80, pp. 1–17, 2016.
[27] R. Jenssen, D. Roverso et al., "Automatic autonomous vision-based power line inspection: A review of current status and the potential role of deep learning," International Journal of Electrical Power & Energy Systems, vol. 99, pp. 107–120, 2018.
[28] B. F. Spencer Jr, V. Hoskere, and Y. Narazaki, "Advances in computer vision-based civil infrastructure inspection and monitoring," Engineering, vol. 5, no. 2, pp. 199–222, 2019.
[29] A. Bouguettaya, H. Zarzour, A. Kechida, and A. M. Taberkit, "Vehicle detection from uav imagery with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6047–6067, 2022.
[30] D. Hanover, A. Loquercio, L. Bauersfeld, A. Romero, R. Penicka, Y. Song, G. Cioffi, E. Kaufmann, and D. Scaramuzza, "Autonomous drone racing: A survey," arXiv e-prints, pp. arXiv–2301, 2023.
[31] B. Zhou, H. Xu, and S. Shen, "Racer: Rapid collaborative exploration with a decentralized multi-uav system," IEEE Transactions on Robotics, vol. 39, no. 3, pp. 1816–1835, 2023.
[32] E. Kaufmann, A. Loquercio, R. Ranftl, A. Dosovitskiy, V. Koltun, and D. Scaramuzza, "Deep drone racing: Learning agile flight in dynamic environments," in Conference on Robot Learning. PMLR, 2018, pp. 133–145.
[33] D. Falanga, K. Kleber, and D. Scaramuzza, "Dynamic obstacle avoidance for quadrotors with event cameras," Science Robotics, vol. 5, no. 40, 2020.
[34] A. Loquercio, E. Kaufmann, R. Ranftl, M. Müller, V. Koltun, and D. Scaramuzza, "Learning high-speed flight in the wild," Science Robotics, vol. 6, no. 59, p. eabg5810, 2021.
[35] T. Qin, P. Li, and S. Shen, "Vins-mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[36] N. J. Sanket, C. M. Parameshwara, C. D. Singh, A. V. Kuruttukulam, C. Fermüller, D. Scaramuzza, and Y. Aloimonos, "Evdodgenet: Deep dynamic obstacle dodging with event cameras," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 10651–10657.
[37] N. J. Sanket, C. D. Singh, C. Fermüller, and Y. Aloimonos, "Ajna: Generalized deep uncertainty for minimal perception on parsimonious robots," Science Robotics, vol. 8, no. 81, p. eadd5139, 2023.
[38] R. Siegwart, I. R. Nourbakhsh, and D. Scaramuzza, Introduction to autonomous mobile robots. MIT press, 2011.
[39] N. Chen, F. Kong, W. Xu, Y. Cai, H. Li, D. He, Y. Qin, and F. Zhang, "A self-rotating, single-actuated uav with extended sensor field of view for autonomous navigation," Science Robotics, vol. 8, no. 76, p. eade4538, 2023.
[40] H. Guan, X. Sun, Y. Su, T. Hu, H. Wang, H. Wang, C. Peng, and Q. Guo, "UAV-lidar aids automatic intelligent powerline inspection," International Journal of Electrical Power and Energy Systems, vol. 130, p. 106987, 2021.
[41] W. Xu, Y. Cai, D. He, J. Lin, and F. Zhang, "Fast-lio2: Fast direct lidar-inertial odometry," IEEE Transactions on Robotics, vol. 38, no. 4, pp. 2053–2073, 2022.
[42] Z. Wang, Z. Zhao, Z. Jin, Z. Che, J. Tang, C. Shen, and Y. Peng, "Multi-stage fusion for multi-class 3d lidar detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3120–3128.
[43] M. O. Aqel, M. H. Marhaban, M. I. Saripan, and N. B. Ismail, "Review of visual odometry: types, approaches, challenges, and applications," SpringerPlus, vol. 5, pp. 1–26, 2016.
[44] J. Delmerico and D. Scaramuzza, "A benchmark comparison of monocular visual-inertial odometry algorithms for flying robots," in 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 2502–2509.
[45] D. Scaramuzza and Z. Zhang, Aerial Robots, Visual-Inertial Odometry of. Berlin, Heidelberg: Springer Berlin Heidelberg, 2020, pp. 1–9. [Online]. Available: https://doi.org/10.1007/978-3-642-41610-1_71-1
[46] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "Orb-slam: a versatile and accurate monocular slam system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[47] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel, and J. D. Tardós, "Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam," IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021.
[48] P. Pisutsin, J. Xiao, and M. Feroskhan, "Omnidrone-det: Omnidirectional 3d drone detection in flight," in 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE), 2024, pp. 2409–2414.
[49] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis et al., "Event-based vision: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 154–180, 2020.
[50] W. Gao, K. Wang, W. Ding, F. Gao, T. Qin, and S. Shen, "Autonomous aerial robot using dual-fisheye cameras," Journal of Field Robotics, vol. 37, no. 4, pp. 497–514, 2020.
[51] V. R. Kumar, S. Yogamani, H. Rashed, G. Sitsu, C. Witt, I. Leang, S. Milz, and P. Mäder, "Omnidet: Surround view cameras based multi-task visual perception network for autonomous driving," IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 2830–2837, 2021.
[52] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142–158, 2015.
[53] R. Girshick, "Fast r-cnn," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[54] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," Advances in neural information processing systems, vol. 28, 2015.
[55] A. D. Haumann, K. D. Listmann, and V. Willert, "DisCoverage: A new paradigm for multi-robot exploration," in Proceedings - IEEE International Conference on Robotics and Automation, 2010, pp. 929–934.
[56] A. H. Tan, F. P. Bejarano, Y. Zhu, R. Ren, and G. Nejat, "Deep reinforcement learning for decentralized multi-robot exploration with macro actions," IEEE Robotics and Automation Letters, vol. 8, no. 1, pp. 272–279, 2022.
[57] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 3357–3364.
[58] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and R. Hadsell, "Learning to navigate in complex environments," 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, 2017.
[59] Q. Wu, X. Gong, K. Xu, D. Manocha, J. Dong, and J. Wang, "Towards target-driven visual navigation in indoor scenes via generative imitation learning," IEEE Robotics and Automation Letters, vol. 6, no. 1, pp. 175–182, 2020.
[60] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang, "Solving rubik's cube with a robot hand," arXiv preprint, 2019.
[61] A. K. Kamath, S. G. Anavatti, and M. Feroskhan, "A physics-informed neural network approach to augmented dynamics visual servoing of multirotors," IEEE Transactions on Cybernetics, vol. 54, no. 11, pp. 6319–6332, 2024.
[62] C. Wu, B. Ju, Y. Wu, X. Lin, N. Xiong, G. Xu, H. Li, and X. Liang, "Uav autonomous target search based on deep reinforcement learning in complex disaster scene," IEEE Access, vol. 7, pp. 117227–117245, 2019.
[63] C. Xiao, P. Lu, and Q. He, "Flying through a narrow gap using end-to-end deep reinforcement learning augmented with curriculum learning and sim2real," IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 5, pp. 2701–2708, 2023.
[64] Y. Song, M. Steinweg, E. Kaufmann, and D. Scaramuzza, "Autonomous drone racing with deep reinforcement learning," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 1205–1212.
[65] J. Xiao, P. Pisutsin, and M. Feroskhan, "Collaborative target search with a visual drone swarm: An adaptive curriculum embedded multi-stage reinforcement learning approach," IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 1, pp. 313–327, 2025.
[66] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[67] S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, "Zoedepth: Zero-shot transfer by combining relative and metric depth," arXiv preprint arXiv:2302.12288, 2023.
[68] H. Laga, L. V. Jospin, F. Boussaid, and M. Bennamoun, "A survey on deep learning techniques for stereo-based depth estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 1738–1764, 2020.
[69] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, "A survey of convolutional neural networks: Analysis, applications, and prospects," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 12, pp. 6999–7019, 2022.
[70] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[71] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.
[72] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi, J. Fan, and Z. He, "A survey of visual transformers," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21, 2023.
[73] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, "Detection and tracking meet drones challenge," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7380–7399, 2021.
[74] Y. Zheng, Z. Chen, D. Lv, Z. Li, Z. Lan, and S. Zhao, "Air-to-air visual detection of micro-uavs: An experimental evaluation of deep learning," IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1020–1027, 2021.
[75] D.-M. Seo, H.-J. Woo, M.-S. Kim, W.-H. Hong, I.-H. Kim, and S.-C. Baek, "Identification of asbestos slates in buildings based on faster region-based convolutional neural network (faster r-cnn) and drone-based aerial imagery," Drones, vol. 6, no. 8, p. 194, 2022.
[76] H. Y. Lee, H. W. Ho, and Y. Zhou, "Deep learning-based monocular obstacle avoidance for unmanned aerial vehicle navigation in tree plantations: Faster region-based convolutional neural network approach," Journal of Intelligent & Robotic Systems, vol. 101, no. 1, p. 5, 2021.
[77] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "Ssd: Single shot multibox detector," in European conference on computer vision. Springer, 2016, pp. 21–37.
[78] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[79] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "Yolox: Exceeding yolo series in 2021," arXiv preprint arXiv:2107.08430, 2021.
[80] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 318–327, 2020.
[81] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Single-shot refinement neural network for object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4203–4212.
[82] T. Liang, X. Chu, Y. Liu, Y. Wang, Z. Tang, W. Chu, J. Chen, and H. Ling, "Cbnet: A composite backbone network architecture for object detection," IEEE Transactions on Image Processing, vol. 31, pp. 6893–6906, 2022.
[83] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong et al., "Swin transformer v2: Scaling up capacity and resolution," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12009–12019.
[84] Y. Li, H. Mao, R. Girshick, and K. He, "Exploring plain vision transformer backbones for object detection," in European Conference on Computer Vision. Springer, 2022, pp. 280–296.
[85] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European conference on computer vision. Springer, 2020, pp. 213–229.
[86] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. Ni, and H.-Y. Shum, "Dino: Detr with improved denoising anchor boxes for end-to-end object detection," in The Eleventh International Conference on Learning Representations, 2022.
[87] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in neural information processing systems, vol. 30, 2017.
[88] J. Xiao, J. H. Chee, and M. Feroskhan, "Real-time multi-drone detection and tracking for pursuit-evasion with parameter search," IEEE Transactions on Intelligent Vehicles, pp. 1–11, 2024.
[89] Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li, "Efficient attention: Attention with linear complexities," in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 3531–3539.
[90] M. Maaz, A. Shaker, H. Cholakkal, S. Khan, S. W. Zamir, R. M. Anwer, and F. Shahbaz Khan, "Edgenext: efficiently amalgamated cnn-transformer architecture for mobile vision applications," in European Conference on Computer Vision. Springer, 2022, pp. 3–20.
[91] D. Avola, L. Cinque, A. Diko, A. Fagioli, G. L. Foresti, A. Mecca, D. Pannone, and C. Piciarelli, "Ms-faster r-cnn: Multi-stream backbone for improved faster r-cnn object detection and aerial tracking from uav images," Remote Sensing, vol. 13, no. 9, p. 1670, 2021.
[92] Z. Liu, Y. Shang, T. Li, G. Chen, Y. Wang, Q. Hu, and P. Zhu, "Robust multi-drone multi-target tracking to resolve target occlusion: A benchmark," IEEE Transactions on Multimedia, vol. 25, pp. 1462–1476, 2023.
[93] B. Yuan, W. Ma, and F. Wang, "High speed safe autonomous landing marker tracking of fixed wing drone based on deep learning," IEEE Access, vol. 10, pp. 80415–80436, 2022.
[94] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[95] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[96] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
[97] Z. Cai and N. Vasconcelos, "Cascade r-cnn: Delving into high quality object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6154–6162.
[98] X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan, "Grid r-cnn," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7363–7372.
[99] D. T. Wei Xun, Y. L. Lim, and S. Srigrarom, "Drone detection using yolov3 with transfer learning on nvidia jetson tx2," in 2021 Second International Symposium on Instrumentation, Control, Artificial Intelligence, and Robotics (ICA-SYMP), 2021, pp. 1–6.
[100] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive uav control in cluttered natural environments," in 2013 IEEE international conference on robotics and automation. IEEE, 2013, pp. 1765–1772.
[101] L. Xie, S. Wang, A. Markham, and N. Trigoni, "Towards monocular vision based obstacle avoidance through deep reinforcement learning," arXiv preprint arXiv:1706.09829, 2017.
[102] K. Mohta, M. Watterson, Y. Mulgaonkar, S. Liu, C. Qu, A. Makineni, K. Saulnier, K. Sun, A. Zhu, J. Delmerico et al., "Fast, autonomous flight in gps-denied and cluttered environments," Journal of Field Robotics, vol. 35, no. 1, pp. 101–120, 2018.
[103] F. Gao, W. Wu, J. Pan, B. Zhou, and S. Shen, "Optimal time allocation for quadrotor trajectory generation," in IEEE International Conference on Intelligent Robots and Systems. Institute of Electrical and Electronics Engineers Inc., 2018, pp. 4715–4722.
[104] F. Gao, W. Wu, W. Gao, and S. Shen, "Flying on point clouds: Online trajectory generation and autonomous navigation for quadrotors in cluttered environments," Journal of Field Robotics, vol. 36, no. 4, pp. 710–733, 2019.
[105] F. Gao, L. Wang, B. Zhou, X. Zhou, J. Pan, and S. Shen, "Teach-repeat-replan: A complete and robust system for aggressive flight in complex environments," IEEE Transactions on Robotics, vol. 36, no. 5, pp. 1526–1545, 2020.
[106] B. Zhou, F. Gao, L. Wang, C. Liu, and S. Shen, "Robust and efficient quadrotor trajectory generation for fast autonomous flight," IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3529–3536, 2019.
[107] B. Zhou, J. Pan, F. Gao, and S. Shen, "Raptor: Robust and perception-aware trajectory replanning for quadrotor fast flight," IEEE Transactions on Robotics, 2021.
[108] L. Quan, Z. Zhang, X. Zhong, C. Xu, and F. Gao, "Eva-planner: Environmental adaptive quadrotor planning," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 398–404.
[109] Y. Zhang, Q. Yu, K. H. Low, and C. Lv, "A self-supervised monocular depth estimation approach based on uav aerial images," in 2022 IEEE/AIAA 41st Digital Avionics Systems Conference (DASC). IEEE, 2022, pp. 1–8.
[110] R. B. Rusu, Z. C. Marton, N. Blodow, M. Dolha, and M. Beetz, "Towards 3d point cloud based object maps for household environments," Robotics and Autonomous Systems, vol. 56, no. 11, pp. 927–941, 2008.
[111] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Autonomous Robots, 2013, software available at https://octomap.github.io. [Online]. Available: https://octomap.github.io
[112] E. W. Dijkstra, "A note on two problems in connexion with graphs," Numerische Mathematik, vol. 1, pp. 269–271, 1959.
[113] P. E. Hart, N. J. Nilsson, and B. Raphael, "A formal basis for the heuristic determination of minimum cost paths," IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.
[114] J. J. Kuffner and S. M. LaValle, "Rrt-connect: An efficient approach to single-query path planning," in Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), vol. 2. IEEE, 2000, pp. 995–1001.
[115] F. Augugliaro, A. P. Schoellig, and R. D'Andrea, "Generation of collision-free trajectories for a quadrocopter fleet: A sequential convex programming approach," in 2012 IEEE/RSJ international conference on Intelligent Robots and Systems. IEEE, 2012, pp. 1917–1922.
[116] I. Iswanto, A. Ma'arif, O. Wahyunggoro, and A. Imam, "Artificial potential field algorithm implementation for quadrotor path planning," Int. J. Adv. Comput. Sci. Appl, vol. 10, no. 8, pp. 575–585, 2019.
[117] O. Khatib, "Real-time obstacle avoidance for manipulators and mobile robots," in Proceedings. 1985 IEEE International Conference on Robotics and Automation, vol. 2. IEEE, 1985, pp. 500–505.
[118] X. Dai, Y. Mao, T. Huang, N. Qin, D. Huang, and Y. Li, "Automatic obstacle avoidance of quadrotor uav via cnn-based learning," Neurocomputing, vol. 402, pp. 346–358, 2020.
[119] M. A. Anwar and A. Raychowdhury, "Navren-rl: Learning to fly in real environment via end-to-end deep reinforcement learning using monocular images," in 2018 25th International Conference on Mechatronics and Machine Vision in Practice (M2VIP). IEEE, 2018, pp. 1–6.
[120] Y. Zhang, K. H. Low, and C. Lyu, "Partially-observable monocular autonomous navigation for uav through deep reinforcement learning," in AIAA AVIATION 2023 Forum, 2023, p. 3813.
[121] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[122] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[123] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[124] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[125] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[126] C. J. Watkins and P. Dayan, "Q-learning," Machine learning, vol. 8, no. 3, pp. 279–292, 1992.
[127] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double q-learning," in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016.
[128] Z. Zhu, K. Lin, B. Dai, and J. Zhou, "Off-policy imitation learning from observations," Advances in Neural Information Processing Systems, vol. 33, pp. 12402–12413, 2020.
[129] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, "Counterfactual multi-agent policy gradients," Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
[130] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, "Monotonic value function factorisation for deep multi-agent reinforcement learning," The Journal of Machine Learning Research, vol. 21, no. 1, pp. 7234–7284, 2020.
[131] J. Xiao, Y. X. M. Tan, X. Zhou, and M. Feroskhan, "Learning collaborative multi-target search for a visual drone swarm," in 2023 IEEE Conference on Artificial Intelligence (CAI), 2023, pp. 5–7.
[132] J. Xiao, P. Pisutsin, and M. Feroskhan, "Toward collaborative multitarget search and navigation with attention-enhanced local observation," Advanced Intelligent Systems, p. 2300761, 2024.
[133] M. A. Akhloufi, S. Arola, and A. Bonnet, "Drones chasing drones: Reinforcement learning and deep search area proposal," Drones, vol. 3, no. 3, p. 58, 2019.
[134] S. Shah, D. Dey, C. Lovett, and A. Kapoor, "Airsim: High-fidelity visual and physical simulation for autonomous vehicles," in Field and service robotics. Springer, 2018, pp. 621–635.
[135] A. Juliani, V.-P. Berges, E. Teng, A. Cohen, J. Harper, C. Elion, C. Goy, Y. Gao, H. Henry, M. Mattar et al., "Unity: A general platform for intelligent agents," arXiv preprint arXiv:1809.02627, 2018.
[136] I. Zamora, N. G. Lopez, V. M. Vilches, and A. H. Cordero, "Extending the openai gym for robotics: a toolkit for reinforcement learning using ros and gazebo," arXiv preprint arXiv:1608.05742, 2016.
[137] P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba, "Transfer from simulation to real world through learning deep inverse dynamics model," arXiv preprint arXiv:1610.03518, 2016.
[138] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, "Sim-to-real: Learning agile locomotion for quadruped robots," arXiv preprint arXiv:1804.10332, 2018.
[139] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in International conference on machine learning. PMLR, 2017, pp. 1126–1135.
[140] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2017, pp. 23–30.
[141] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., "Learning dexterous in-hand manipulation," The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020.
[142] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, "Fast robust monocular depth estimation for obstacle detection with fully convolutional networks," in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 4296–4303.
[143] S. Geyer and E. Johnson, "3d obstacle avoidance in adversarial environments for unmanned aerial vehicles," in AIAA Guidance, Navigation, and Control Conference and Exhibit, 2006, p. 6542.
[144] Y. Zhang, J. Xiao, and M. Feroskhan, "Learning cross-modal visuomotor policies for autonomous drone navigation," IEEE Robotics and Automation Letters, vol. 10, no. 6, pp. 5425–5432, 2025.
[145] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 270–279.
[146] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into self-supervised monocular depth estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3828–3838.
[147] N. Jiang, K. Wang, X. Peng, X. Yu, Q. Wang, J. Xing, G. Li, G. Guo, Q. Ye, J. Jiao, J. Zhao, and Z. Han, "Anti-uav: A large-scale benchmark for vision-based uav tracking," IEEE Transactions on Multimedia, vol. 25, pp. 486–500, 2023.
[148] Y. Zhao, Z. Ju, T. Sun, F. Dong, J. Li, R. Yang, Q. Fu, C. Lian, and P. Shan, "Tgc-yolov5: An enhanced yolov5 drone detection model based on transformer, gam & ca attention mechanism," Drones, vol. 7, no. 7, p. 446, 2023.
[149] V. Walter, M. Vrba, and M. Saska, "On training datasets for machine learning-based visual relative localization of micro-scale UAVs," in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 10674–10680.
[150] J. Delmerico, T. Cieslewski, H. Rebecq, M. Faessler, and D. Scaramuzza, "Are we ready for autonomous drone racing? the uzh-fpv drone racing dataset," in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 6713–6719.
[151] L. Bauersfeld, E. Kaufmann, P. Foehn, S. Sun, and D. Scaramuzza, "Neurobem: Hybrid aerodynamic quadrotor model," Proceedings of Robotics: Science and Systems XVII, p. 42, 2021.
[152] Y. Song, S. Naji, E. Kaufmann, A. Loquercio, and D. Scaramuzza, "Flightmare: A flexible quadrotor simulator," in Conference on Robot Learning. PMLR, 2021, pp. 1147–1157.
[153] J. Panerati, H. Zheng, S. Zhou, J. Xu, A. Prorok, and A. P. Schoellig, "Learning to fly—a gym environment with pybullet physics for reinforcement learning of multi-agent quadcopter control," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, pp. 7512–7519.
[154] B. Alvey, D. T. Anderson, A. Buck, M. Deardorff, G. Scott, and J. M. Keller, "Simulated photorealistic deep learning framework and workflows to accelerate computer vision and unmanned aerial vehicle research," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3889–3898.
[155] Y. Song and D. Scaramuzza, "Policy search for model predictive control with application for agile drone flight," IEEE Transactions on Robotics, 2021.
[156] J. Xiao and M. Feroskhan, "Learning multipursuit evasion for safe targeted navigation of drones," IEEE Transactions on Artificial Intelligence, vol. 5, no. 12, pp. 6210–6224, 2024.
[157] A. Singh, D. Patil, and S. Omkar, "Eye in the sky: Real-time drone surveillance system (dss) for violent individuals identification using scatternet hybrid deep learning network," in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 1629–1637.
[158] W. Li, H. Li, Q. Wu, X. Chen, and K. N. Ngan, "Simultaneously detecting and counting dense vehicles from drone images," IEEE Transactions on Industrial Electronics, vol. 66, no. 12, pp. 9651–9662, 2019.
[159] H. Zhou, H. Kong, L. Wei, D. Creighton, and S. Nahavandi, "On detecting road regions in a single uav image," IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 7, pp. 1713–1722, 2016.
[160] N. H. Motlagh, M. Bagaa, and T. Taleb, "Uav-based iot platform: A crowd surveillance use case," IEEE Communications Magazine, vol. 55, no. 2, pp. 128–134, 2017.
[161] M. A. Goodrich, B. S. Morse, D. Gerhardt, J. L. Cooper, M. Quigley, J. A. Adams, and C. Humphrey, "Supporting wilderness search and rescue using a camera-equipped mini uav," Journal of Field Robotics, vol. 25, no. 1-2, pp. 89–110, 2008.
[162] E. Lygouras, N. Santavas, A. Taitzoglou, K. Tarchanidis, A. Mitropoulos, and A. Gasteratos, "Unsupervised human detection with an embedded vision system on a fully autonomous uav for search and rescue operations," Sensors, vol. 19, no. 16, p. 3542, 2019.
[163] T. Tomic, K. Schmid, P. Lutz, A. Domel, M. Kassecker, E. Mair, I. L. Grixa, F. Ruess, M. Suppa, and D. Burschka, "Toward a fully autonomous uav: Research platform for indoor and outdoor urban search and rescue," IEEE Robotics & Automation Magazine, vol. 19, no. 3, pp. 46–56, 2012.
[164] J. Senthilnath, M. Kandukuri, A. Dokania, and K. Ramesh, "Application of uav imaging platform for vegetation analysis based on spectral-spatial methods," Computers and Electronics in Agriculture, vol. 140, pp. 8–24, 2017.
[165] M. R. Khosravi and S. Samadi, "Bl-alm: A blind scalable edge-guided reconstruction filter for smart environmental monitoring through green iomt-uav networks," IEEE Transactions on Green Communications and Networking, vol. 5, no. 2, pp. 727–736, 2021.
[166] C. Donmez, O. Villi, S. Berberoglu, and A. Cilek, "Computer vision-based citrus tree detection in a cultivated environment using uav imagery," Computers and Electronics in Agriculture, vol. 187, p. 106273, 2021.
[167] B. Lu and Y. He, "Species classification using unmanned aerial vehicle (uav)-acquired high spatial resolution imagery in a heterogeneous grassland," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 128, pp. 73–85, 2017.
[168] D. Kim, M. Liu, S. Lee, and V. R. Kamat, "Remote proximity monitoring between mobile construction resources using camera-mounted uavs," Automation in Construction, vol. 99, pp. 168–182, 2019.
[169] T. Khuc, T. A. Nguyen, H. Dao, and F. N. Catbas, "Swaying displacement measurement for structural monitoring using computer vision and an unmanned aerial vehicle," Measurement, vol. 159, p. 107769, 2020.
[170] S. Li, E. van der Horst, P. Duernay, C. De Wagter, and G. C. de Croon, "Visual model-predictive localization for computationally efficient autonomous racing of a 72-g drone," Journal of Field Robotics, vol. 37, no. 4, pp. 667–692, 2020.
[171] M. Muller, G. Li, V. Casser, N. Smith, D. L. Michels, and B. Ghanem, "Learning a controller fusion network by online trajectory filtering for vision-based uav racing," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[172] M. Muller, V. Casser, N. Smith, D. L. Michels, and B. Ghanem, "Teaching uavs to race: End-to-end regression of agile controls in simulation," in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0.
[173] R. Penicka and D. Scaramuzza, "Minimum-time quadrotor waypoint flight in cluttered environments," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5719–5726, 2022.
[174] E. Kaufmann, M. Gehrig, P. Foehn, R. Ranftl, A. Dosovitskiy, V. Koltun, and D. Scaramuzza, "Beauty and the beast: Optimal methods meet learning for drone racing," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 690–696.
[175] L. Xing, X. Fan, Y. Dong, Z. Xiong, L. Xing, Y. Yang, H. Bai, and C. Zhou, "Multi-uav cooperative system for search and rescue based on yolov5," International Journal of Disaster Risk Reduction, vol. 76, p. 102972, 2022.
[176] B. Lin, L. Wu, and Y. Niu, "End-to-end vision-based cooperative target geo-localization for multiple micro uavs," Journal of Intelligent & Robotic Systems, vol. 106, no. 1, p. 13, 2022.
[177] M. E. Campbell and W. W. Whitacre, "Cooperative tracking using vision measurements on seascan uavs," IEEE Transactions on Control Systems Technology, vol. 15, no. 4, pp. 613–626, 2007.
[178] J. Gu, T. Su, Q. Wang, X. Du, and M. Guizani, "Multiple moving targets surveillance based on a cooperative network for multi-uav," IEEE Communications Magazine, vol. 56, no. 4, pp. 82–89, 2018.
[179] Y. Cao, F. Qi, Y. Jing, M. Zhu, T. Lei, Z. Li, J. Xia, J. Wang, and G. Lu, "Mission chain driven unmanned aerial vehicle swarms cooperation for the search and rescue of outdoor injured human targets," Drones, vol. 6, no. 6, p. 138, 2022.
[180] G. Loianno, J. Thomas, and V. Kumar, "Cooperative localization and mapping of mavs using rgb-d sensors," in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 4021–4028.
[181] P. Tong, X. Yang, Y. Yang, W. Liu, and P. Wu, "Multi-uav collaborative absolute vision positioning and navigation: A survey and discussion," Drones, vol. 7, no. 4, p. 261, 2023.
[182] N. Piasco, J. Marzat, and M. Sanfourche, "Collaborative localization and formation flying using distributed stereo-vision," in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 1202–1207.
[183] D. Liu, X. Zhu, W. Bao, B. Fei, and J. Wu, "Smart: Vision-based method of cooperative surveillance and tracking by multiple uavs in the urban environment," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 24941–24956, 2022.
[184] N. Farmani, L. Sun, and D. J. Pack, "A scalable multitarget tracking system for cooperative unmanned aerial vehicles," IEEE Transactions on Aerospace and Electronic Systems, vol. 53, no. 4, pp. 1947–1961, 2017.
[185] M. Jouhari, A. K. Al-Ali, E. Baccour, A. Mohamed, A. Erbad, M. Guizani, and M. Hamdi, "Distributed cnn inference on resource-constrained uavs for surveillance systems: Design and optimization," IEEE Internet of Things Journal, vol. 9, no. 2, pp. 1227–1242, 2021.
[186] W. J. Yun, S. Park, J. Kim, M. Shin, S. Jung, D. A. Mohaisen, and J.-H. Kim, "Cooperative multiagent deep reinforcement learning for reliable surveillance via autonomous multi-uav control," IEEE Transactions on Industrial Informatics, vol. 18, no. 10, pp. 7086–7096, 2022.
[187] Y. Tang, Y. Hu, J. Cui, F. Liao, M. Lao, F. Lin, and R. S. Teo, "Vision-aided multi-uav autonomous flocking in gps-denied environment," IEEE Transactions on Industrial Electronics, vol. 66, no. 1, pp. 616–626, 2018.
[188] J. Scherer, S. Yahyanejad, S. Hayat, E. Yanmaz, T. Andre, A. Khan, V. Vukadinovic, C. Bettstetter, H. Hellwagner, and B. Rinner, "An autonomous multi-uav system for search and rescue," in Proceedings of the first workshop on micro aerial vehicle networks, systems, and applications for civilian use, 2015, pp. 33–38.
[189] Y. Rizk, M. Awad, and E. W. Tunstel, "Cooperative heterogeneous multi-robot systems: A survey," ACM Computing Surveys (CSUR), vol. 52, no. 2, pp. 1–31, 2019.
[190] G. Niu, L. Wu, Y. Gao, and M.-O. Pun, "Unmanned aerial vehicle (uav)-assisted path planning for unmanned ground vehicles (ugvs) via disciplined convex-concave programming," IEEE Transactions on Vehicular Technology, vol. 71, no. 7, pp. 6996–7007, 2022.
[191] D. Liu, W. Bao, X. Zhu, B. Fei, Z. Xiao, and T. Men, "Vision-aware air-ground cooperative target localization for uav and ugv," Aerospace Science and Technology, vol. 124, p. 107525, 2022.
[192] J. Li, G. Deng, C. Luo, Q. Lin, Q. Yan, and Z. Ming, "A hybrid path planning method in unmanned air/ground vehicle (uav/ugv) cooperative systems," IEEE Transactions on Vehicular Technology, vol. 65, no. 12, pp. 9585–9596, 2016.
[193] L. Zhang, F. Gao, F. Deng, L. Xi, and J. Chen, "Distributed estimation of a layered architecture for collaborative air–ground target geolocation in outdoor environments," IEEE Transactions on Industrial Electronics, vol. 70, no. 3, pp. 2822–2832, 2022.
[194] Y. Chen and J. Xiao, "Target search and navigation in heterogeneous robot systems with deep reinforcement learning," Machine Intelligence Research, pp. 1–12, 2025.
[195] G. Niu, Q. Yang, Y. Gao, and M.-O. Pun, "Vision-based autonomous landing for unmanned aerial and ground vehicles cooperative systems," IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 6234–6241, 2021.
[196] Z.-C. Xu, B.-B. Hu, B. Liu, X. Wang, and H.-T. Zhang, "Vision-based autonomous landing of unmanned aerial vehicle on a motional unmanned surface vessel," in 2020 39th Chinese Control Conference (CCC). IEEE, 2020, pp. 6845–6850.
[197] C. Hui, C. Yousheng, L. Xiaokun, and W. W. Shing, "Autonomous takeoff, tracking and landing of a uav on a moving ugv using onboard monocular vision," in Proceedings of the 32nd Chinese control conference. IEEE, 2013, pp. 5895–5901.
[198] I. Kalinov, A. Petrovsky, V. Ilin, E. Pristanskiy, M. Kurenkov, V. Ramzhaev, I. Idrisov, and D. Tsetserukou, "Warevision: Cnn barcode detection-based uav trajectory optimization for autonomous warehouse stocktaking," IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6647–6653, 2020.
[199] S. Minaeian, J. Liu, and Y.-J. Son, "Vision-based target detection and localization via a team of cooperative uav and ugvs," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, no. 7, pp. 1005–1016, 2015.
[200] A. Adaptable, "Building an aerial–ground robotics system for precision farming," IEEE Robotics & Automation Magazine, 2021.
[201] Q. Vuong, S. Levine, H. R. Walke, K. Pertsch, A. Singh, R. Doshi, C. Xu, J. Luo, L. Tan, D. Shah et al., "Open x-embodiment: Robotic learning datasets and rt-x models," in 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023.
[202] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "Carla: An open urban driving simulator," in Conference on robot learning. PMLR, 2017, pp. 1–16.
[203] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, "A comprehensive survey on transfer learning," Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
[204] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, "Generalizing from a few examples: A survey on few-shot learning," ACM Computing Surveys (CSUR), vol. 53, no. 3, pp. 1–34, 2020.
[205] J. Fonseca and F. Bacao, "Tabular and latent space synthetic data generation: a literature review," Journal of Big Data, vol. 10, no. 1, p. 115, 2023.
[206] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," in International Conference on Learning Representations, 2018.
[207] Y. Jiang, S. Wang, V. Valls, B. J. Ko, W.-H. Lee, K. K. Leung, and L. Tassiulas, "Model pruning enables efficient federated learning on edge devices," IEEE Transactions on Neural Networks and Learning Systems, 2022.
[208] Y. Zhou, S.-M. Moosavi-Dezfooli, N.-M. Cheung, and P. Frossard, "Adaptive quantization for deep neural network," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[209] L. Deng, G. Li, S. Han, L. Shi, and Y. Xie, "Model compression and hardware acceleration for neural networks: A comprehensive survey," Proceedings of the IEEE, vol. 108, no. 4, pp. 485–532, 2020.
[210] L. Wang and K.-J. Yoon, "Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 6, pp. 3048–3068, 2021.
[211] R. Gong, Q. Huang, X. Ma, H. Vo, Z. Durante, Y. Noda, Z. Zheng, S.-C. Zhu, D. Terzopoulos, L. Fei-Fei et al., "Mindagent: Emergent gaming interaction," arXiv preprint arXiv:2309.09971, 2023.
[212] A. Bhoopchand, B. Brownfield, A. Collister, A. Dal Lago, A. Edwards, R. Everett, A. Fréchette, Y. G. Oliveira, E. Hughes, K. W. Mathewson et al., "Learning few-shot imitation as cultural transmission," Nature Communications, vol. 14, no. 1, p. 7536, 2023.
[213] J. Xiao and M. Feroskhan, "Cyber attack detection and isolation for a quadrotor uav with modified sliding innovation sequences," IEEE Transactions on Vehicular Technology, vol. 71, no. 7, pp. 7202–7214, 2022.
[214] T. T. Nguyen and V. J. Reddi, "Deep reinforcement learning for cyber security," IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 8, pp. 3779–3795, 2023.
[215] J. Xiao, X. Fang, Q. Jia, and M. Feroskhan, "Learning resilient formation control of drones with graph attention network," IEEE Internet of Things Journal, pp. 1–1, 2025.
[216] I. Ilahi, M. Usama, J. Qadir, M. U. Janjua, A. Al-Fuqaha, D. T. Hoang, and D. Niyato, "Challenges and countermeasures for adversarial attacks on deep reinforcement learning," IEEE Transactions on Artificial Intelligence, vol. 3, no. 2, pp. 90–109, 2021.
[217] A. Rawal, J. McCoy, D. B. Rawat, B. M. Sadler, and R. S. Amant, "Recent advances in trustworthy explainable artificial intelligence: Status, challenges, and perspectives," IEEE Transactions on Artificial Intelligence, vol. 3, no. 6, pp. 852–866, 2021.