ABSTRACT

Autonomous navigation of Unmanned Aerial Vehicles (UAVs) in complex environments remains a challenging field. Recognizing UAV real-time perception as a sequential decision-making challenge, researchers increasingly adopt learning-based methods, leveraging machine learning to enhance navigation in complex environments. In this paper, a novel deep reinforcement learning (DRL) model is proposed for the smooth navigation of a UAV. The paper provides an overview of existing techniques, laying the foundation for our proposed work, which not only addresses certain limitations but also demonstrates superior performance in complex environments. The simulation environment is built using Unreal Engine, and the connections are established using the AirSim APIs. The TD3 algorithm is chosen for its exceptional adaptability in continuous action spaces, owing to its off-policy, value-based approach, which yields improved stability and sample efficiency, whereas the PPO algorithm is chosen for its on-policy method, which leads to stable learning without the need for value function estimation. Our model undergoes training in a customized landscape mountainous environment, and the results, obtained after rigorous training, are thoroughly analyzed. The state-action pairs of our trained TD3 agent are explained using the LIME and SHAP techniques. The paper concludes by presenting promising directions for further exploration and advancement in this evolving field.

Keywords: Unmanned Aerial Vehicle (UAV), Deep Reinforcement Learning (DRL), Twin Delayed DDPG (TD3), Proximal Policy Optimization (PPO), AirSim, Unreal Engine, Explainable Artificial Intelligence (XAI), Local Interpretable Model-agnostic Explanations (LIME), Shapley Additive Explanations (SHAP), Application Program Interface (API).

1. INTRODUCTION

Unmanned Aerial Vehicles (UAVs) have revolutionized tasks from delivery to surveillance, significantly impacted by advancements in Artificial Intelligence (AI) and Information Technology (IT). They offer unparalleled flexibility and time-saving capabilities in applications such as safe drone operation models in urban air traffic flows, Hovermap drone systems, and AI-driven 5G UAV systems [1, 2, 3]. Their autonomous navigation is crucial for path planning, obstacle avoidance, and control, which can be addressed through real-time perception or through the utilization of existing environmental data. To address complex scenarios, the former approach increasingly adopts learning-based techniques such as Machine Learning (ML) [4].

Deep Reinforcement Learning (DRL), a subset of ML, combines Deep Learning (DL) for neural network training with Reinforcement Learning (RL) for sequential decision-making through a Markov Decision Process (MDP). Based on prevalent analysis, DRL algorithms combined with PID controllers reduce collision rates in UAV control but highlight the need for sample-efficient algorithms. Relevant Experience Learning (REL) and non-sparse rewards handle large state and action spaces, while PPO and LSTM networks emphasize the importance of sensor data for accurate models [5]. Other RL algorithms, such as TEXPLORE, have been utilized for autonomous navigation but fell short due to the absence of comparative analysis and scalability issues in real-world applications [6]. The lack of real-world training scenarios poses a hurdle for validating algorithms, as seen in the analysis of incremental learning with PPO [7].

The opacity of DRL methods, which act as black boxes, necessitates explainability to render the model transparent. Explainability can be approached globally or locally.
For complex models, this involves using a model-agnostic approach such as LIME and SHAP, or a model-specific approach such as a Random Forest Regressor, where the former can be applied to any ML model and the latter is tailored to a particular type of model or algorithm [8].

The essence of our contribution can be outlined as follows:

1. Comparison of the TD3 and PPO algorithms with MLP neural networks for drone navigation in complex environments built in Unreal Engine, utilizing connections established through AirSim.

2. Evaluation of the trained TD3 agent's state-action pairs using the SHAP and LIME explainability techniques to address the black-box issue.
2. PRELIMINARIES

A. TD3 Algorithm
Twin Delayed DDPG (TD3) is an off-policy, model-free DRL approach used to train the model that provides smooth control commands for UAV navigation. TD3 improves upon DDPG by introducing three crucial techniques to overcome the Q-value overestimation problem: delayed policy updates, target policy smoothing, and clipped double Q-learning. Unlike the DDPG algorithm, the TD3 algorithm minimizes the mean squared Bellman error while simultaneously learning two Q-functions, Q1 and Q2 [8].
Figure 1 : TD3 Networks

During each learning time step, the attributes of the actor and critic are updated, while a stochastic noise model is applied to perturb the action selected by the policy. TD3's ability to handle continuous action spaces, reduce overestimation bias, and encourage exploration makes it an effective choice for training policies in complex and dynamic environments.
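To make the three techniques above concrete, the following minimal PyTorch-style sketch shows how the clipped double-Q target, target policy smoothing, and delayed policy update are typically computed; the network, optimizer, and replay-buffer handles are illustrative assumptions, not the exact implementation used in this work.

# Sketch of one TD3 update step (illustrative names; networks/optimizers assumed to exist).
import torch
import torch.nn.functional as F

GAMMA, TAU = 0.99, 0.005               # discount factor and soft target-update rate
POLICY_NOISE, NOISE_CLIP = 0.2, 0.5    # target policy smoothing parameters
POLICY_DELAY = 2                       # delayed policy update interval

def td3_update(step, batch, actor, actor_targ, q1, q2, q1_targ, q2_targ,
               actor_opt, critic_opt, max_action=1.0):
    s, a, r, s_next, done = batch      # tensors sampled from the replay buffer

    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(a) * POLICY_NOISE).clamp(-NOISE_CLIP, NOISE_CLIP)
        a_next = (actor_targ(s_next) + noise).clamp(-max_action, max_action)
        # Clipped double Q-learning: take the minimum of the two target critics.
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        target = r + GAMMA * (1.0 - done) * q_next

    # Both critics minimise the mean squared Bellman error against one shared target.
    critic_loss = F.mse_loss(q1(s, a), target) + F.mse_loss(q2(s, a), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy update: refresh the actor and target networks every POLICY_DELAY steps.
    if step % POLICY_DELAY == 0:
        actor_loss = -q1(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        for net, net_targ in ((actor, actor_targ), (q1, q1_targ), (q2, q2_targ)):
            for p, p_targ in zip(net.parameters(), net_targ.parameters()):
                p_targ.data.mul_(1.0 - TAU).add_(TAU * p.data)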
B. PPO Algorithm
PPO follows the general framework used by many RL algorithms, as shown by Equation 1, where the expected reward (Q-value) for taking action (a) in state (s) and then following a policy (π) sums the expected rewards over all possible future states and actions, weighted by the probabilities defined by the policy. The foundation lies in substituting flexible constraints, regarded as penalties, for rigid ones. An approximation of the second-order optimization of a differential equation is found by solving a first-order differential equation with the help of the new, more manageable constraint.

___________________________________________________
Algorithm : Proximal Policy Optimization (PPO) Pseudocode.
___________________________________________________
Initialize the total number of iterations (I), the number of actors (J), and the time steps (T);
for iteration = 1 to I do
    for actor = 1 to J do
        Run the policy π_θ with the previous parameters θ_old for T time steps;
        Calculate the advantage estimates Â_t for those time steps;
    end
    Optimize the objective function L(θ) with respect to θ and call the result θ_opt;
    Set θ_old = θ_opt;
end
___________________________________________________

The process involves initialization, iterations, and the execution of the policy in the environment to gather data such as states, actions, rewards, and subsequent states. It assesses the benefit of selecting a particular action at a certain time step compared to the average value of actions, then refines the policy parameters by optimizing an objective function. This function, which defines the goal of the optimization problem, aims to maximize the cumulative reward in the context of reinforcement learning. By adjusting the policy parameters, it seeks to maximize expected returns and incrementally enhance the policy's performance.
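As a minimal sketch of the objective L(θ) optimized in the pseudocode above, the snippet below computes PPO's clipped surrogate loss from advantage estimates; it is an illustrative PyTorch fragment under assumed names, not the exact code used in this work.

# Sketch of the PPO clipped surrogate objective (policy network and rollout tensors assumed).
import torch

CLIP_EPS = 0.2   # illustrative clip range

def ppo_policy_loss(policy, states, actions, advantages, logp_old):
    # Probability ratio between the current policy and the one that gathered the data.
    logp_new = policy.log_prob(states, actions)   # assumed helper on the policy object
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate objective: the flexible "penalty" that replaces a hard constraint.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * advantages
    return -torch.min(unclipped, clipped).mean()  # minimised by a gradient-based optimiser

# Each outer iteration: collect T steps per actor, estimate advantages A_t,
# run several gradient steps on ppo_policy_loss, then set theta_old = theta_opt.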
3. DRL FRAMEWORK

Figure 2 : The Deep Reinforcement Learning workflow involves the agent (Top Block) acquiring the precursory state and reward from the environment (Bottom Block), after which the agent generates actions accordingly.
For the second case, where the episode is still in the running phase, we incorporate multiple factors into the reward function.

R_dist = d_g(t-1) - d_g(t)    (2)

R_dist rewards the change in distance to the goal. It is computed as the difference between the previous and current distances to the goal, i.e., the distance the UAV has moved towards the goal.
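A minimal sketch of how the distance term in Equation (2) can be computed at each step is given below; the position and goal variables are illustrative placeholders, since the remaining terms of the reward function are not reproduced here.

# Sketch: distance-based reward term R_dist = d_g(t-1) - d_g(t).
import numpy as np

def distance_reward(prev_pos, curr_pos, goal):
    """Positive when the UAV moves towards the goal, negative when it moves away."""
    d_prev = np.linalg.norm(goal - prev_pos)   # distance to goal at time t-1
    d_curr = np.linalg.norm(goal - curr_pos)   # distance to goal at time t
    return d_prev - d_curr

# Example: moving 1 unit straight towards the goal yields a reward of +1.
goal = np.array([10.0, 0.0, -5.0])
print(distance_reward(np.array([0.0, 0.0, -5.0]), np.array([1.0, 0.0, -5.0]), goal))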
Figure 3 : Connection Establishment between AirSim and Visual Studio 2022

The establishment of a connection between AirSim and Visual Studio 2022 represents a significant advancement in the field of autonomous navigation. This modern linkage not only introduces a new method for model training but does so with notable efficiency through seamless integration using the AirSim APIs.
Through this connection, the convergence of the AirSim simulation with the VS2022 development environment brings us notably closer to real-world scenarios. This innovative coupling offers a training platform that closely emulates the intricate challenges of genuine environments.
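To illustrate the kind of integration described above, the snippet below connects to an AirSim multirotor from a Python script (such as one developed in VS2022) and issues a velocity setpoint; it uses the standard AirSim Python client and is a sketch rather than this project's actual training script.

# Sketch: connecting to AirSim and issuing a velocity setpoint.
# Requires the `airsim` package and an Unreal Engine environment running AirSim.
import airsim

client = airsim.MultirotorClient()      # connects to the simulator on the local machine
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)

client.takeoffAsync().join()

# Command a velocity setpoint (vx, vy, vz in m/s) for a fixed duration,
# mirroring the action interface used by the DRL agent.
client.moveByVelocityAsync(2.0, 0.0, -1.0, duration=1.0).join()

# Sensor feedback available to the agent: kinematic state and a depth image.
state = client.getMultirotorState()
depth = client.simGetImages([airsim.ImageRequest(
    "0", airsim.ImageType.DepthPerspective, True)])[0]   # pixels_as_float=True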
No   Parameter                                   Value
1    Time step (dt)                              0.1
2    Maximum acceleration in horizontal plane    2.0
3    Maximum velocity in horizontal plane        5.0
4    Minimum velocity in horizontal plane        0.5
5    Maximum velocity in vertical plane          2.0
6    Maximum yaw rate                            50
7    Crash distance                              2
8    Accept radius                               2
Table 2 : Hyperparameters for the DRL Algorithms

No.  Hyperparameter        Value
1    Gamma                 0.99
2    Learning rate         1e-3
3    Learning starts       1000
4    Buffer size           50000
5    Batch size            128
6    Train frequency       100
7    Gradient steps        200
8    Action noise sigma    0.1
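The listing below shows how the hyperparameters in Table 2 map onto a standard off-the-shelf TD3 implementation with an MLP policy; Stable-Baselines3 and the environment handle `drone_env` are illustrative assumptions, not necessarily the exact tooling used in this work.

# Sketch: Table 2 hyperparameters applied to a TD3 + MLP policy (Stable-Baselines3 shown
# as an example; `drone_env` is a Gym-style wrapper around the AirSim scenario and is
# assumed to exist).
import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

n_actions = drone_env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))   # row 8

model = TD3(
    "MlpPolicy", drone_env,
    gamma=0.99,            # row 1
    learning_rate=1e-3,    # row 2
    learning_starts=1000,  # row 3
    buffer_size=50_000,    # row 4
    batch_size=128,        # row 5
    train_freq=100,        # row 6
    gradient_steps=200,    # row 7
    action_noise=action_noise,
    verbose=1,
)
model.learn(total_timesteps=100_000)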
The navigation network is trained in AirSim, a simulator built on Unreal Engine. This simulator provides an extremely realistic environment with a ground-truth depth image and a simple controller to keep the UAV stable. A customized landscape mountain environment is created for training, featuring a square area with a side length of 256 units. At the beginning of each episode, the quadrotor takes off from a random start position in the environment. The goal position is set randomly within the boundaries of the rectangular coordinates. The episode ends either when the quadrotor reaches the goal position within an acceptable radius or when it crashes into obstacles. In order to generate the velocity setpoint in the three-dimensional environment, the neural network receives both the quadrotor's state information and the depth image at each time step.
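A compact sketch of the episode logic just described, written as a Gymnasium-style environment, is shown below; the kinematics and collision check are simplified placeholders, and the class is illustrative rather than the environment actually used for training (which steps AirSim and consumes depth images).

# Sketch: episode structure described above as a Gymnasium-style environment.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class LandscapeNavEnv(gym.Env):
    SIDE, ACCEPT_RADIUS, DT, V_MAX = 256.0, 2.0, 0.1, 5.0   # values from the parameter table

    def __init__(self):
        self.action_space = spaces.Box(-self.V_MAX, self.V_MAX, shape=(3,))   # velocity setpoint
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(6,))      # [position, goal]

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Random start and goal positions inside the 256 x 256 square.
        self.pos = self.np_random.uniform(0.0, self.SIDE, size=3)
        self.goal = self.np_random.uniform(0.0, self.SIDE, size=3)
        return np.concatenate([self.pos, self.goal]).astype(np.float32), {}

    def step(self, action):
        prev_dist = np.linalg.norm(self.goal - self.pos)
        self.pos = self.pos + np.clip(action, -self.V_MAX, self.V_MAX) * self.DT
        dist = np.linalg.norm(self.goal - self.pos)
        reward = prev_dist - dist                 # distance term of the reward (Eq. 2)
        reached = dist < self.ACCEPT_RADIUS       # goal reached within the accept radius
        crashed = self.pos[2] < 0.0               # placeholder collision/crash check
        obs = np.concatenate([self.pos, self.goal]).astype(np.float32)
        return obs, float(reward), bool(reached or crashed), False, {}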
Our model was trained using the TD3 reinforcement learning algorithm integrated with MLP deep neural networks. This strategic integration facilitated the sequential decision-making process within the framework of a Markov Decision Process (MDP).

Figure 4 : Customized landscape environment built in AirSim.

6. EXPLAINABILITY

Explainability enhances transparency in DRL models by revealing black-box elements and clarifying the relationships between actions and state features. We employed SHAP and LIME to explain our model.

A. Shapley Additive Explanations (SHAP)
SHAP employs Shapley values to provide a comprehensive understanding of feature importance, both locally and globally. In our XDRL-based autonomous drone navigation model, SHAP enhances interpretability and optimization by offering clear visual insights into model decisions at the individual data point level.

B. Local Interpretable Model-agnostic Explanations (LIME)
LIME generates interpretable surrogate models around specific data points to clarify individual predictions. In the context of autonomous drone navigation using XDRL, LIME offers localized insights into feature contributions, enhancing both model performance and interpretability across dynamic scenarios. Its adaptability to various machine learning techniques ensures precise decision-making in complex environments.
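To indicate how these two techniques are typically applied to a trained agent, the snippet below builds a SHAP and a LIME explainer around one component of the policy's action output; the wrapper function, the `states` array, the `feature_names` list, and the `model` handle (assumed to expose a Stable-Baselines3-style predict()) are illustrative assumptions.

# Sketch: explaining a trained policy's actions with SHAP and LIME.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer

def predict_vxy(obs_batch):
    """Return the horizontal-velocity component of the policy's action for a batch of states."""
    actions, _ = model.predict(np.asarray(obs_batch), deterministic=True)
    return actions[:, 0]

# SHAP: Shapley values over a background sample give local and global feature importance.
background = shap.sample(states, 100)
shap_values = shap.KernelExplainer(predict_vxy, background).shap_values(states[:200])
shap.summary_plot(shap_values, states[:200], feature_names=feature_names)

# LIME: a local surrogate model explains one state-action pair at a time.
lime_explainer = LimeTabularExplainer(states, feature_names=feature_names, mode="regression")
explanation = lime_explainer.explain_instance(states[0], predict_vxy, num_features=5)
print(explanation.as_list())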
Figure 5A : Graphical user interface obtained in the training process of the TD3 algorithm (Reward per episode)
Figure 5B : Graphical user interface obtained in the training process of the TD3 algorithm (Cumulative reward)
Figure 6A : Graphical user interface obtained in the training process of the TD3 algorithm (Yaw rate)
Figure 6B : Graphical user interface obtained in the training process of the TD3 algorithm (Trajectory of the agent)
7. RESULTS AND DISCUSSIONS

Evaluation metric      TD3             PPO
Mean episode reward    78              72
Crash rate             16%             22%
Convergence time       2400 episodes   3300 episodes
Success rate           77%             68%
After designing the reward function, the model was trained extensively for nearly 100,000 steps with 197,000 updates. The TD3 algorithm converged after 2400 episodes with a success rate of 77%. Compared with the PPO algorithm, we observed several positive aspects, including a higher mean episode reward, a lower crash rate, and a significantly higher success rate.

Figure 5A shows the plot of rewards per episode, where the model assigns rewards based on performance in each episode. Figure 5B presents the plot of cumulative rewards, indicating the consistency in model performance and optimization. Figure 6A illustrates the oscillations in yaw, which are decreasing and approaching a saturation state, demonstrating that the drone achieved stabilization after progressive iterations. Figure 6B depicts the drone's trajectory, showing how efficiently the model avoids obstacles, maintains a safe distance, and moves swiftly towards the destination. Complex environments pose increased challenges for the model, which typically results in longer convergence times compared to training in simpler environments.
Figure 7A : Feature analysis using the LIME method for velocity in the XY plane (Vxy).
Figure 7B : Feature analysis using the SHAP method for velocity in the XY plane (Vxy).
Figure 8A : Feature analysis using the LIME method for vertical speed (Vz).
Figure 8B : Feature analysis using the SHAP method for vertical speed (Vz).
Figure 9A : Feature analysis using the LIME method for yaw rate.
Figure 9B : Feature analysis using the SHAP method for yaw rate.
For the horizontal-velocity action, the major contributors among the state features are the angular velocity and the linear velocity in the XY plane. For the vertical-speed action, the features with the greatest influence are the vertical distance, the relative yaw, and the linear velocity in the Z direction; for the yaw-rate action, the linear velocity in the Z direction, the vertical distance, and the angular velocity show a strong correlation.
8. CONCLUSION

In this paper, autonomous navigation of UAVs is addressed using an explainable deep reinforcement learning technique. The simulation environment was built on AirSim, and connections were established between VS2022 and Unreal Engine. We used the standard TD3 and PPO algorithms and molded them according to the requirements they must serve. A well-crafted reward function was prepared considering all the crucial factors, and the hyperparameters were tuned efficiently to obtain the desired outputs. The training process was intensive, producing a success rate of 77% for the TD3 algorithm, a significant improvement over the PPO algorithm. The use of explainability techniques has greatly improved the transparency of our model: we now understand why certain actions occur in specific states, giving us insight into the decision-making process. The future scope of this work is the integration of defogging techniques, which will make our model robust enough to tackle foggy conditions.
9. REFERENCES

[1] D.D. Nguyen, J. Rohacs, D. Rohacs, "Autonomous Flight Trajectory Control System for Drones in Smart City Traffic Management", International Journal of Geo-Information, 2021, Vol. 10, pp. 338.
[2] E. Jones, J. Sofonia, C. Canales, S. Hrabar, F. Kendoul, "Applications for the Hovermap autonomous drone system in underground mining operations", Journal of the Southern African Institute of Mining and Metallurgy, 2020, Vol. 120, pp. 49-56.
[3] H. Kim, J. Ben-Othman, L. Mokdad, J. Son, C. Li, "Research Challenges and Security Threats to AI-Driven 5G Virtual Emotion Applications Using Autonomous Vehicles, Drones, and Smart Devices", IEEE Network, 2020, Vol. 34, No. 6, pp. 288-294.
[4] A.T. Azar, A. Koubaa, N.A. Mohamed, H.A. Ibrahim, Z.F. Ibrahim, M. Kazim, A. Ammar, B. Benjdira, A.M. Khamis, I.A. Hameed, et al., "Drone Deep Reinforcement Learning: A Review", Electronics, 2021, Vol. 10, pp. 999.
[5] Z. Hu, X. Gao, K. Wan, Y. Zhai, Q. Wang, "Relevant experience learning: A deep reinforcement learning method for UAV autonomous motion planning in complex unknown environments", Chinese Journal of Aeronautics, 2021, Vol. 34, No. 12.
[6] N. Imanberdiyev, C. Fu, E. Kayacan, I.-M. Chen, "Autonomous navigation of UAV by using real-time model-based reinforcement learning", 14th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2016, pp. 1-6.
[7] V.J. Hodge, R. Hawkins, R. Alexander, "Deep reinforcement learning for drone navigation using sensor data", Neural Computing & Applications, 2021, Vol. 33, pp. 2015-2033.
[8] L. He, N. Aouf, B. Song, "Explainable Deep Reinforcement Learning for UAV Autonomous Navigation", Aerospace Science and Technology, 2021, Vol. 118, pp. 107052.
[9] A.T. Azar, F.E. Serrano, N.A. Kamal, A. Kouba, "Robust Kinematic Control of Unmanned Aerial Vehicles with Non-holonomic Constraints", International Conference on Advanced Intelligent System and Informatics, 2020, pp. 839-850.
[10] Y. Song, L.-T. Hsu, "Tightly coupled integrated navigation system via factor graph for UAV indoor localization", Aerospace Science and Technology, 2021, Vol. 108, pp. 106370.
[11] M. Shah, N. Aouf, "3D cooperative Pythagorean hodograph path planning and obstacle avoidance for multiple UAVs", IEEE 9th International Conference on Cybernetic Intelligent Systems, 2010, pp. 1-6.
[12] Y. Shin, E. Kim, "Hybrid path planning using positioning risk and artificial potential fields", Aerospace Science and Technology, 2021, Vol. 112, pp. 106640.
[13] N. Imanberdiyev, C. Fu, E. Kayacan, I.-M. Chen, "Autonomous navigation of UAV by using real-time model-based reinforcement learning", 14th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2016, pp. 1-6.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning", Nature, 2015, Vol. 518, pp. 529-533.
[15] A.B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., "Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI", Information Fusion, 2019, Vol. 58, pp. 82-115.
[16] A. Singla, S. Padakandla, S. Bhatnagar, "Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge", IEEE Transactions on Intelligent Transportation Systems, 2020, Vol. 20, No. 1, pp. 107-118.
[17] O. Bouhamed, H. Ghazzai, H. Besbes, Y. Massoud, "Autonomous UAV navigation: A DDPG-based deep reinforcement learning approach", Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2020, pp. 1-5.
[18] U. Challita, W. Saad, C. Bettstetter, "Interference management for cellular-connected UAVs: A deep reinforcement learning approach", IEEE Transactions on Wireless Communications, 2019, Vol. 18, pp. 2125-2140.
[19] C. Yan, X. Xiang, C. Wang, "Towards Real-Time Path Planning through Deep Reinforcement Learning for a UAV in Dynamic Environments", Journal of Intelligent and Robotic Systems, 2019, Vol. 98, pp. 297-309.