
Proceedings of the 28th World Multi-Conference on Systemics, Cybernetics and Informatics (WMSCI 2024)

Autonomous Navigation of Drones Using Explainable Deep Reinforcement Learning in Complex Environments

ISBN: 978-1-950492-79-4 | ISSN: 2771-0947 | https://doi.org/10.54808/WMSCI2024.01.44

Pawan Kumar SOURIPALLI
Mechanical Engineering Department
Visvesvaraya National Institute of Technology, Nagpur, India
[email protected]

Laeba Jeelani SAYED
Mechanical Engineering Department
Visvesvaraya National Institute of Technology, Nagpur, India
[email protected]

Dr. Shital S. CHIDDARWAR
Mechanical Engineering Department
Visvesvaraya National Institute of Technology, Nagpur, India
[email protected]

ABSTRACT

Autonomous navigation of Unmanned Aerial Vehicles (UAVs) in complex environments is still a challenging field. Recognizing UAV real-time perception as a sequential decision-making challenge, researchers increasingly adopt learning-based methods, leveraging machine learning to enhance navigation in complex environments. In this paper, a novel deep reinforcement learning (DRL) model is proposed for the smooth navigation of a UAV. The paper provides an overview of existing techniques, laying the foundation for our proposed work, which not only addresses certain limitations but also demonstrates superior performance in complex environments. The simulation environment is built using Unreal Engine, and the connections have been established using AirSim APIs. The TD3 algorithm is chosen for its exceptional adaptability in continuous action spaces owing to its off-policy, value-based approach, resulting in improved stability and sample efficiency, whereas the PPO algorithm is chosen for its on-policy method, which leads to stable learning without the need for value function estimation. Our model undergoes training in a customized mountainous landscape environment, and the results, obtained after rigorous training, are thoroughly analyzed. The state-action pairs of our trained TD3 agent are explained using LIME and SHAP techniques. The paper concludes by presenting promising directions for further exploration and advancement in this evolving field.

Keywords: Unmanned Aerial Vehicle, Deep Reinforcement Learning (DRL), Twin Delayed DDPG (TD3), Proximal Policy Optimization (PPO), AirSim, Unreal Engine, Explainable Artificial Intelligence (XAI), Local Interpretable Model-agnostic Explanations (LIME), Shapley Additive Explanations (SHAP), Application Program Interface (API).

1. INTRODUCTION

Unmanned Aerial Vehicles (UAVs) have revolutionized tasks from delivery to surveillance, significantly impacted by advancements in Artificial Intelligence (AI) and Information Technology (IT). They offer unparalleled flexibility and time-saving capabilities in applications such as safe drone operation models in urban air traffic flows, hovermap drone systems, and AI-driven 5G UAV systems [1, 2, 3]. Their autonomous navigation is crucial and involves path planning, obstacle avoidance, and control, which can be addressed through real-time perception or through the utilization of existing environmental data. To address complex scenarios, the former approach increasingly adopts learning-based techniques such as Machine Learning (ML) [4].

Deep Reinforcement Learning (DRL), a subset of ML, combines Deep Learning (DL) for neural network training and Reinforcement Learning (RL) for sequential decision-making through a Markov Decision Process (MDP). Based on prevalent analysis, DRL algorithms combined with PID controllers reduce collision rates in UAV control, but the results highlight the need for sample-efficient algorithms. Relevant Experience Learning (REL) and non-sparse rewards handle large state and action spaces, while PPO and LSTM networks emphasize the importance of sensor data for accurate models [5]. Other RL algorithms such as TEXPLORE have been utilized for autonomous navigation, but they fell short due to the absence of comparative analysis and scalability issues in real-world applications [6]. The lack of real-world training scenarios poses a hurdle for validating algorithms, as seen in the analysis of incremental learning with PPO [7].

The opacity of DRL methods, which act as black boxes, necessitates explainability to render the model transparent. Explainability can be approached globally or locally. For complex models, this involves using a model-agnostic approach such as LIME or SHAP, or a model-specific approach such as a Random Forest Regressor, where the former can be applied to any ML model and the latter is tailored to a particular type of model or algorithm [8].

The essence of our contribution can be outlined as follows:

1. Comparison of TD3 and PPO algorithms with MLP neural networks for drone navigation in complex environments, established in Unreal Engine and connected through AirSim.

2. Evaluation of the TD3 agent's state-action pairs using SHAP and LIME explainability techniques to address the black-box issue.

2. PRELIMINARIES

A. TD3 Algorithm
Twin Delayed DDPG (TD3) is an off-policy, model-free DRL approach used to train the model that provides smooth control commands for UAV navigation. TD3 builds on DDPG, introducing three crucial techniques to overcome the Q-value overestimation problem: delayed policy updates, target policy smoothing, and clipped double Q-learning. Unlike the DDPG algorithm, the TD3 algorithm minimizes the mean squared Bellman error while simultaneously learning two Q-functions, Q1 and Q2 [8].

Figure 1: TD3 Networks

During each learning time step, the attributes of the actor and critic are updated, while a stochastic noise model is applied to perturb the action selected by the policy. TD3's ability to handle continuous action spaces, reduce overestimation bias, and encourage exploration makes it an effective choice for training policies in complex and dynamic environments.
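For reference, the update described above can be written compactly as follows. This is the standard TD3 formulation, supplied here for readability; it is not reproduced from the paper's own equations:

\tilde{a} = \pi_{\phi'}(s') + \epsilon, \quad \epsilon \sim \mathrm{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)
y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a})
L(\theta_i) = \mathbb{E}\big[(Q_{\theta_i}(s, a) - y)^2\big], \quad i = 1, 2

Here the noise \epsilon implements target policy smoothing, the minimum over the two target critics implements clipped double Q-learning, and L(\theta_i) is the mean squared Bellman error minimized for both Q-functions.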
B. PPO Algorithm
PPO follows the general framework used by many RL algorithms, as shown by Equation 1, where the expected reward (Q-value) for taking action (a) in state (s) and then following a policy (π) sums the expected rewards over all possible future states and actions, weighted by the probabilities defined by the policy. The foundation lies in substituting flexible constraints, regarded as penalties, for rigid ones. An approximation of the second-order optimization of a differential equation is found by solving a first-order differential equation with the help of the new, more manageable constraint.
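The recursion described here is the standard Bellman expectation form of the action value, written out below as a reference form (the original Equation 1 does not survive legibly in this copy):

Q^{\pi}(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a')

where P(s' | s, a) denotes the transition probabilities and γ the discount factor, matching the verbal description of summing expected future rewards weighted by the policy's probabilities.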
___________________________________________________
Algorithm: Proximal Policy Optimization (PPO) Pseudocode
___________________________________________________
Initialize the total number of iterations (I), the number of actors (J), and the time steps (T);
for iteration = 1 to I do
    for actor = 1 to J do
        Run the policy πθold with the prior parameter value θold for T time steps;
        Calculate the advantage estimates Ât for the time steps;
    end
    Optimize the objective function L(θ) with respect to θ and call the result θopt;
    Set θold = θopt;
end
___________________________________________________

The process involves initialization, iterations, and the execution of the policy in the environment to gather data such as states, actions, rewards, and subsequent states. It assesses the benefit of selecting a particular action at a certain time step compared to the average value of actions, then refines the policy parameters by optimizing an objective function. This function, which defines the goal of the optimization problem, aims to maximize the cumulative reward in the context of reinforcement learning. By adjusting the policy parameters, it seeks to maximize expected returns and incrementally enhance the policy's performance.

3. DRL FRAMEWORK

Figure 2: The Deep Reinforcement Learning workflow involves the agent (Top Block) acquiring the precursory state and reward from the environment (Bottom Block), after which the agent generates actions accordingly.

With the use of a machine learning technique called deep reinforcement learning (DRL), computers can learn from their own behavior in a manner akin to how people learn from experience. This kind of machine learning is far more centered on interaction-based, goal-directed learning than previous methods. The learning entity is not instructed on which steps to take; instead, it must try potential courses of action and determine which ones yield the highest reward or bring it closest to the objective.

A. State Space
A state is a precise location and time, an instantaneous configuration that positions the agent in relation to other important objects such as goals and impediments. It represents the physical and immediate circumstance in which the agent finds itself. It may also be the environment's feedback in the form of the current or future situation.

B. Action Space
In the DRL framework, the action space comprises the set of possible actions available to an agent at each state. The agent selects an action informed by observations, with the objective of optimizing cumulative long-term rewards in alignment with a specified policy.
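As an illustration of how such state and action spaces could be declared in code, the following sketch uses Gymnasium Box spaces. The composition of the six-feature state vector and the bounds are assumptions pieced together from the feature names discussed in Section 7 and the limits later listed in Table 1, not the authors' implementation; the depth image that also feeds the network is omitted for brevity.

import numpy as np
from gymnasium import spaces

# Six state features (assumed ordering): horizontal distance to goal, vertical
# distance to goal, relative yaw, linear velocity in the XY plane, linear
# velocity in Z, and angular (yaw) velocity.
observation_space = spaces.Box(
    low=np.array([0.0, -50.0, -180.0, 0.0, -2.0, -50.0], dtype=np.float32),
    high=np.array([400.0, 50.0, 180.0, 5.0, 2.0, 50.0], dtype=np.float32),
)

# Three continuous actions: velocity setpoint in the horizontal plane,
# vertical velocity, and yaw rate (limits assumed from Table 1).
action_space = spaces.Box(
    low=np.array([0.5, -2.0, -50.0], dtype=np.float32),
    high=np.array([5.0, 2.0, 50.0], dtype=np.float32),
)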
C. Reward
The reward function is used to define a goal in a reinforcement learning problem. It is the mapping of each perceived state (or state-action pair) of the environment to a single number, specifying the intrinsic desirability of that state.

4. REWARD MECHANISM

Within a Deep Reinforcement Learning (DRL) framework, the reward function is a crucial element. It serves as the metric for agent learning, imbuing each interaction with significance. A well-designed reward function is not merely a metric; it becomes the perspective through which the agent perceives its environment, evaluates its actions, and refines its strategies. We crafted our reward function for both completed episodes and those still in progress.

For the first case, the reward assignment is simple: if the destination is reached, the reward is +10; if the agent crashes, it is penalized with -20; and if the agent goes out of the boundaries, it receives -10.

reward = 0
reward_reach = 10
reward_crash = -20
reward_outside = -10

For the second case, where the episode is still in the running phase, we incorporate multiple factors into the reward function.

R_dist = d_g(t-1) - d_g(t)    (2)

Reward based on the change in distance to the goal, computed as the variation between the previous and current distances to the goal (the distance moved towards the goal).

R_state = |d_desired(xy) - d_xy| + |d_desired(z) - d_z|    (3)

Penalizing the gap between the desired state and the actual state.

R_obs = 1 - max(0, min(1, (d - d_crash)/5))    (4)

Reward for maintaining a safe distance from the obstacles and avoiding a crash.

R_action = R_velocity + R_yaw_error    (5)

R_velocity = |(V_max - V)/V_max|    (6)

R_yaw_error = |yaw_error / 90|    (7)

Penalizing the error between the instructed action and the action performed.

R_t = K1*R_dist + K2*R_state + K3*R_obs + K4*R_action    (8)

where K1, K2, K3, and K4 are weight parameters.

R_final = R_t (if the episode is still running)
R_final = R_e (if the episode is completed)

The computation of both the final reward and the cumulative reward acts as input for our model optimization process. The final reward acts as a real-time feedback system to make swift adjustments, while the cumulative reward provides a broader perspective showing the consistency in performance.
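A direct transcription of the terminal rewards and Equations (2)-(8) into code might look like the sketch below. The helper arguments and the values of the weights K1-K4 are placeholders, since the paper does not report them; sign conventions for the individual terms are likewise left to those weights, as in the text.

# Terminal rewards for completed episodes, as given above.
REWARD_REACH = 10.0
REWARD_CRASH = -20.0
REWARD_OUTSIDE = -10.0

# Weights K1..K4 are unspecified in the paper; the values here are placeholders.
K1, K2, K3, K4 = 1.0, 1.0, 1.0, 1.0
D_CRASH = 2.0   # crash distance (Table 1)
V_MAX = 5.0     # maximum horizontal velocity (Table 1)

def step_reward(d_goal_prev, d_goal, d_xy, d_z, d_desired_xy, d_desired_z,
                d_obstacle, v, yaw_error_deg):
    """Reward for a still-running episode, following Eqs. (2)-(8)."""
    r_dist = d_goal_prev - d_goal                                    # Eq. (2)
    r_state = abs(d_desired_xy - d_xy) + abs(d_desired_z - d_z)      # Eq. (3)
    r_obs = 1.0 - max(0.0, min(1.0, (d_obstacle - D_CRASH) / 5.0))   # Eq. (4)
    r_velocity = abs((V_MAX - v) / V_MAX)                            # Eq. (6)
    r_yaw = abs(yaw_error_deg / 90.0)                                # Eq. (7)
    r_action = r_velocity + r_yaw                                    # Eq. (5)
    return K1 * r_dist + K2 * r_state + K3 * r_obs + K4 * r_action   # Eq. (8)

def final_reward(done, reached, crashed, running_reward):
    """R_final: the running reward while the episode continues,
    otherwise the terminal reward R_e for the completed episode."""
    if not done:
        return running_reward
    if reached:
        return REWARD_REACH
    if crashed:
        return REWARD_CRASH
    return REWARD_OUTSIDE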
5. TRAINING PROCESS

A. Setting up the environment
The simulation environment is built on AirSim, which is proficient at generating environments with minimal gaps between the real world and the simulation. In order to stabilize the UAV, this simulator offers a low-level controller and a high-fidelity environment with a ground-truth depth image.

Figure 3: Connection establishment between AirSim and Visual Studio 2022

The establishment of a connection between AirSim and Visual Studio 2022 represents a significant advancement in the field of autonomous navigation. This modern linkage not only introduces a new method for model training but does so with notable efficiency through seamless integration using the AirSim APIs.


Through this connection, the convergence of the AirSim simulation with the VS2022 development environment brings us notably closer to real-world scenarios. This innovative coupling offers a training platform that closely emulates the intricate challenges of genuine environments.
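For readers unfamiliar with the toolchain, a connection of this kind is typically driven from Python through AirSim's RPC client. The snippet below is a generic sketch of that pattern rather than the authors' code: it connects to the simulator, polls the drone state and a depth image, and sends a velocity setpoint held for one 0.1 s time step (Table 1).

import airsim

# Connect to the AirSim instance running inside the Unreal Engine project.
client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()

# One interaction step: read the drone state and a ground-truth depth image.
state = client.getMultirotorState()
responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.DepthPerspective,
                        pixels_as_float=True)
])
depth_image = airsim.get_pfm_array(responses[0])

# Velocity command (vx, vy, vz in m/s) held for the 0.1 s time step.
client.moveByVelocityAsync(vx=1.0, vy=0.0, vz=0.0, duration=0.1).join()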


B. Hardware Setup
We trained our model using a powerful GPU (NVIDIA Quadro RTX 6000) and an AMD Ryzen Threadripper 2950X 16-core processor. During the training process, the destination points are randomized.

Table 1
Parameters for the Simulation Environment

No.  Parameter                                     Value
1    Time step (dt)                                0.1
2    Maximum acceleration in the horizontal plane  2.0
3    Maximum velocity in the horizontal plane      5.0
4    Minimum velocity in the horizontal plane      0.5
5    Maximum velocity in the vertical plane        2.0
6    Maximum yaw rate                              50
7    Crash distance                                2
8    Accept radius                                 2

Table 2
Hyperparameters for the DRL Algorithms

No.  Hyperparameter       Value
1    Gamma                0.99
2    Learning rate        1e-3
3    Learning starts      1000
4    Buffer size          50000
5    Batch size           128
6    Train frequency      100
7    Gradient steps       200
8    Action noise sigma   0.1

The navigation network is trained in AirSim, a simulator built on Unreal Engine. This simulator provides an extremely realistic environment with a ground-truth depth image and a simple controller to keep the UAV stable. A customized landscape mountain environment is created for training, featuring a square environment with a side length of 256 units. At the beginning of each episode, the quadrotor takes off from a random start position in the environment. The goal position is set randomly within the boundaries of the rectangular coordinates. The episode ends either when the quadrotor reaches the goal position within an acceptable radius or when it crashes into obstacles. In order to generate the velocity setpoint in the three-dimensional environment, the neural network receives both the quadrotor's state information and the depth image at each time step.

Our model was trained using the TD3 reinforcement learning algorithm integrated with MLP deep neural networks. This strategic integration facilitated the sequential decision-making process within the framework of a Markov Decision Process (MDP).

Careful attention was devoted to the tuning of hyperparameters, ensuring a good trade-off between exploration and exploitation. This meticulous tuning strategy not only fostered a conducive environment for learning but also expedited the model's convergence towards an optimal function.

Our approach fosters a robust framework for autonomous decision-making, and the fine-tuning of hyperparameters reinforces the model's adaptability and efficiency.
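The hyperparameter names in Table 2 line up with those of off-the-shelf DRL libraries. Assuming an implementation along the lines of Stable-Baselines3 (an assumption on our part; the paper does not name a library), the TD3 agent could be configured roughly as follows, with a stand-in environment in place of the AirSim wrapper.

import gymnasium as gym
import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

# Stand-in environment; in the paper's setting this would be a Gymnasium
# wrapper around the AirSim simulation described above.
env = gym.make("Pendulum-v1")

n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))  # Table 2, row 8

model = TD3(
    "MlpPolicy", env,        # MLP networks, as stated in the paper
    gamma=0.99,              # Table 2, row 1
    learning_rate=1e-3,      # row 2
    learning_starts=1000,    # row 3
    buffer_size=50_000,      # row 4
    batch_size=128,          # row 5
    train_freq=100,          # row 6
    gradient_steps=200,      # row 7
    action_noise=action_noise,
)
model.learn(total_timesteps=100_000)  # roughly the step budget reported in Section 7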
Figure 4: Customized landscape environment built in AirSim.

6. EXPLAINABILITY

Explainability enhances transparency in DRL models by revealing black-box elements and clarifying the relationships between actions and state features. We employed SHAP and LIME to explain our model.

A. Shapley Additive Explanations (SHAP)
SHAP employs Shapley values to provide a comprehensive understanding of feature importance, both locally and globally. In our XDRL-based autonomous drone navigation model, SHAP enhances interpretability and optimization by offering clear visual insights into model decisions at the individual data point level.
coordinates. The episode ends either when the quadrotor scenarios. Its adaptability to various machine learning
reaches the goal position within an acceptable radius or when it techniques ensures precise decision-making in complex
crashes into obstacles. In order to generate the velocity setpoint environments.
in the three-dimensional environment, the neural network gets
both the quadrotor's state information and the depth image at
each time step.


Figure 5A: Graphical user interface obtained in the training process of the TD3 algorithm (reward per episode)
Figure 5B: Graphical user interface obtained in the training process of the TD3 algorithm (cumulative reward)
Figure 6A: Graphical user interface obtained in the training process of the TD3 algorithm (yaw rate)
Figure 6B: Graphical user interface obtained in the training process of the TD3 algorithm (trajectory of the agent)

7. RESULTS AND DISCUSSIONS

Evaluation metric     TD3             PPO
Mean episode reward   78              72
Crash rate            16%             22%
Convergence time      2400 episodes   3300 episodes
Success rate          77%             68%

After designing the reward function, the model is trained extensively for nearly 100,000 steps with 197,000 updates. The TD3 algorithm converged after 2400 episodes with a success rate of 77%. When compared with the PPO algorithm, we observed several positive aspects, including an increase in the mean episode reward, a decrease in the crash rate, and a significant increase in the success rate.

Figure 5A shows the plot of rewards per episode, where the model assigns rewards based on performance in each episode. Figure 5B presents the plot of cumulative rewards, indicating the consistency in model performance and optimization. Figure 6A illustrates the oscillations in yaw, which are decreasing and approaching a saturation state, demonstrating that the drone has achieved stabilization after progressive iterations. Figure 6B depicts the drone's trajectory, showing how efficiently the model avoids obstacles, maintains a safe distance, and moves swiftly towards the destination. Complex environments pose increased challenges for the model, which typically results in longer convergence times compared to training in simpler environments.


Figure 7A: Feature analysis using the LIME method for velocity in the XY plane (Vxy).
Figure 7B: Feature analysis using the SHAP method for velocity in the XY plane (Vxy).
Figure 8A: Feature analysis using the LIME method for vertical speed (Vz).
Figure 8B: Feature analysis using the SHAP method for vertical speed (Vz).
Figure 9A: Feature analysis using the LIME method for yaw rate.
Figure 9B: Feature analysis using the SHAP method for yaw rate.

We have utilized the SHAP and LIME methods to analyze feature contributions to the model's decisions. Figures 7A and 7B display feature analysis plots for velocity in the XY plane, Figures 8A and 8B for vertical speed, and Figures 9A and 9B for steering speed (yaw rate). These analyses reveal the influence of different features on the model's predictions for horizontal, vertical, and rotational movements. Random forest regression was employed to establish correlations between the six state features and the three action features, offering insights into the model's decision-making process.

The table below summarizes the correlation of the independent (state) features with each dependent (action) feature, listing the actions and their major contributors among the states:

Action                             Major contributing state features
Velocity in the horizontal plane   Angular velocity, linear velocity (xy)
Vertical speed                     Vertical distance, relative yaw, linear velocity (z)
Yaw rate (steering speed)          Linear velocity (z), vertical distance, angular velocity
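The random-forest step mentioned above can be reproduced in outline with scikit-learn: fit one regressor per action feature on the six state features and rank the impurity-based feature importances. The data here are synthetic stand-ins for the logged rollouts, and the feature names are ours.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
# Stand-ins for logged rollout data: (N, 6) state features and (N, 3) actions.
states = rng.uniform(-1.0, 1.0, size=(500, 6))
actions = np.tanh(states @ rng.normal(size=(6, 3)))

state_names = ["dist_xy", "dist_z", "rel_yaw", "vel_xy", "vel_z", "ang_vel"]
action_names = ["vel_horizontal", "vel_vertical", "yaw_rate"]

# One regressor per action feature; impurity-based importances indicate which
# state features each action depends on most strongly.
for j, action in enumerate(action_names):
    reg = RandomForestRegressor(n_estimators=200, random_state=0)
    reg.fit(states, actions[:, j])
    ranked = sorted(zip(state_names, reg.feature_importances_),
                    key=lambda p: p[1], reverse=True)
    print(action, "->", ranked[:3])   # top contributing state features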


The major contributors among the states for the action velocity in the horizontal plane are angular velocity and linear velocity (xy). For the action vertical speed, the features with the greatest effect are vertical distance, relative yaw, and linear velocity (z); and for yaw rate, linear velocity in the Z direction, vertical distance, and angular velocity show good correlation.

8. CONCLUSION

In this paper, autonomous navigation of UAVs is addressed using an Explainable Deep Reinforcement Learning technique. The simulation environment was built on AirSim, and connections were established between VS2022 and Unreal Engine. We used the standard TD3 and PPO algorithms and molded them according to the requirements they must serve. A well-crafted reward function was prepared considering all the crucial factors, and the hyperparameters were tuned efficiently to obtain the desired outputs. The training process was intensive, producing a success rate of 77% for the TD3 algorithm, a significant improvement compared to the PPO algorithm. The use of explainability techniques has greatly improved the transparency of our model: we now understand why certain actions occur in specific states, giving us insight into the decision-making process. The future scope of this work is the integration of defogging techniques, which will make our model robust enough to tackle foggy conditions.

9. REFERENCES

[1] D.D. Nguyen, J. Rohacs, D. Rohacs, "Autonomous Flight Trajectory Control System for Drones in Smart City Traffic Management", International Journal of Geo-Information, 2021, Vol. 10, pp. 338.
[2] E. Jones, J. Sofonia, C. Canales, S. Hrabar, F. Kendoul, "Applications for the Hovermap autonomous drone system in underground mining operations", Journal of the Southern African Institute of Mining and Metallurgy, 2020, Vol. 120, pp. 49-56.
[3] H. Kim, J. Ben-Othman, L. Mokdad, J. Son, C. Li, "Research Challenges and Security Threats to AI-Driven 5G Virtual Emotion Applications Using Autonomous Vehicles, Drones, and Smart Devices", IEEE Network, 2020, Vol. 34, No. 6, pp. 288-294.
[4] A.T. Azar, A. Koubaa, N.A. Mohamed, H.A. Ibrahim, Z.F. Ibrahim, M. Kazim, A. Ammar, B. Benjdira, A.M. Khamis, I.A. Hameed, et al., "Drone Deep Reinforcement Learning: A Review", Electronics, 2021, Vol. 10, pp. 999.
[5] Z. Hu, X. Gao, K. Wan, Y. Zhai, Q. Wang, "Relevant experience learning: A deep reinforcement learning method for UAV autonomous motion planning in complex unknown environments", Chinese Journal of Aeronautics, 2021, Vol. 34, No. 12.
[6] N. Imanberdiyev, C. Fu, E. Kayacan, I.-M. Chen, "Autonomous navigation of UAV by using real-time model-based reinforcement learning", 14th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2016, pp. 1-6.
[7] V.J. Hodge, R. Hawkins, R. Alexander, "Deep reinforcement learning for drone navigation using sensor data", Neural Computing & Applications, 2021, Vol. 33, pp. 2015-2033.
[8] L. He, N. Aouf, B. Song, "Explainable Deep Reinforcement Learning for UAV Autonomous Navigation", Aerospace Science and Technology, 2021, Vol. 118, pp. 107052.
[9] A.T. Azar, F.E. Serrano, N.A. Kamal, A. Koubaa, "Robust Kinematic Control of Unmanned Aerial Vehicles with Non-holonomic Constraints", International Conference on Advanced Intelligent Systems and Informatics, 2020, pp. 839-850.
[10] Y. Song, L.-T. Hsu, "Tightly coupled integrated navigation system via factor graph for UAV indoor localization", Aerospace Science and Technology, 2021, Vol. 108, pp. 106370.
[11] M. Shah, N. Aouf, "3D cooperative Pythagorean hodograph path planning and obstacle avoidance for multiple UAVs", IEEE 9th International Conference on Cybernetic Intelligent Systems, 2010, pp. 1-6.
[12] Y. Shin, E. Kim, "Hybrid path planning using positioning risk and artificial potential fields", Aerospace Science and Technology, 2021, Vol. 112, pp. 106640.
[13] N. Imanberdiyev, C. Fu, E. Kayacan, I.-M. Chen, "Autonomous navigation of UAV by using real-time model-based reinforcement learning", 14th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2016, pp. 1-6.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning", Nature, 2015, Vol. 518, pp. 529-533.
[15] A.B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., "Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI", Information Fusion, 2019, Vol. 58, pp. 82-115.
[16] A. Singla, S. Padakandla, S. Bhatnagar, "Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge", IEEE Transactions on Intelligent Transportation Systems, 2020, Vol. 20, No. 1, pp. 107-118.
[17] O. Bouhamed, H. Ghazzai, H. Besbes, Y. Massoud, "Autonomous UAV navigation: A DDPG-based deep reinforcement learning approach", Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2020, pp. 1-5.
[18] U. Challita, W. Saad, C. Bettstetter, "Interference management for cellular-connected UAVs: A deep reinforcement learning approach", IEEE Transactions on Wireless Communications, 2019, Vol. 18, pp. 2125-2140.
[19] C. Yan, X. Xiang, C. Wang, "Towards Real-Time Path Planning through Deep Reinforcement Learning for a UAV in Dynamic Environments", Journal of Intelligent and Robotic Systems, 2019, Vol. 98, pp. 297-309.

