

Design and Experimental Validation of Deep Reinforcement Learning-Based Fast Trajectory Planning and Control for Mobile Robot in Unknown Environment

Runqi Chai, Member, IEEE, Hanlin Niu, Member, IEEE, Joaquin Carrasco, Member, IEEE, Farshad Arvin, Senior Member, IEEE, Hujun Yin, Senior Member, IEEE, and Barry Lennox, Senior Member, IEEE

Abstract— This article is concerned with the problem of planning optimal maneuver trajectories and guiding the mobile robot toward target positions in uncertain environments for exploration purposes. A hierarchical deep learning-based control framework is proposed which consists of an upper level motion planning layer and a lower level waypoint tracking layer. In the motion planning phase, a recurrent deep neural network (RDNN)-based algorithm is adopted to predict the optimal maneuver profiles for the mobile robot. This approach is built upon a recently proposed idea of using deep neural networks (DNNs) to approximate the optimal motion trajectories, which has been validated to achieve fast approximation performance. To further enhance the network prediction performance, a recurrent network model capable of fully exploiting the inherent relationship between preoptimized system state and control pairs is advocated. In the lower level, a deep reinforcement learning (DRL)-based collision-free control algorithm is established to achieve the waypoint tracking task in an uncertain environment (e.g., the existence of unexpected obstacles). Since this approach allows the control policy to directly learn from human demonstration data, the time required by the training process can be significantly reduced. Moreover, a noisy prioritized experience replay (PER) algorithm is proposed to improve the exploring rate of the control policy. The effectiveness of applying the proposed deep learning-based control is validated by executing a number of simulation and experimental case studies. The simulation result shows that the proposed DRL method outperforms the vanilla PER algorithm in terms of training speed. Experimental videos are also uploaded, and the corresponding results confirm that the proposed strategy is able to fulfill the autonomous exploration mission with improved motion planning performance, enhanced collision avoidance ability, and less training time.

Index Terms— Deep reinforcement learning (DRL), mobile robot, motion control, noisy prioritized experience replay (PER), optimal motion planning, recurrent neural network, unexpected obstacles.

Manuscript received 15 October 2021; revised 4 July 2022; accepted 18 September 2022. Date of publication 10 October 2022; date of current version 5 April 2024. This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) under Project EP/S03286X/1, in part by EPSRC Robotics and Artificial Intelligence for Nuclear (RAIN) under Project EP/R026084/1, and in part by EPSRC Robotics for Nuclear Environments (RNE) under Project EP/P01366X/1. (Corresponding author: Hanlin Niu.)

Runqi Chai was with the Department of Electrical and Electronic Engineering, The University of Manchester, Manchester M13 9PL, U.K. He is now with the School of Automation, Beijing Institute of Technology, Beijing 100081, China (e-mail: [email protected]).

Hanlin Niu, Joaquin Carrasco, Hujun Yin, and Barry Lennox are with the Department of Electrical and Electronic Engineering, The University of Manchester, M13 9PL Manchester, U.K. (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Farshad Arvin is with the Department of Computer Science, Durham University, Durham DH1 3LE, U.K. (e-mail: [email protected]).

This article has supplementary downloadable material available at https://doi.org/10.1109/TNNLS.2022.3209154, provided by the authors.

Digital Object Identifier 10.1109/TNNLS.2022.3209154

2162-237X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

THE development of unmanned ground vehicles (UGVs) or mobile robots has received considerable attention over the last two decades due to its potential benefits in terms of sensing the unknown environment, alleviating urban traffic congestion, reducing harmful emissions, and enhancing road safety. Generally speaking, four key modules are involved in a robotic system: environmental perception [1], motion planning [2], motion control [3], [4], and vehicle-to-vehicle communication [5]. Among them, the planning and control modules are mainly responsible for navigating the route and making maneuver decisions. Consequently, they are usually recognized as a key indicator of the intelligence level of a robotic system. However, there are challenges when it comes to planning and steering the motion of mobile robots in an optimal and timely way, considering the complexity of the environment and various physical limitations of the robot. These challenges, together with other mission-dependent requirements, have stimulated both scholars and engineers to explore promising solutions to this kind of problem.

A. Related Works

The mission investigated in this research work focuses on applying the mobile robot to explore unknown target regions. Numerous research works have been carried out in developing advanced path planning and control algorithms for dealing with this problem [6], [7], [8], [9]. For instance, Zuo et al. [7] proposed a two-level path planning scheme incorporating the standard A-star method and an approximate policy iteration algorithm to generate a smooth trajectory for mobile robots.


Similarly, Kontoudis and Vamvoudakis [8] proposed a hybrid path planning approach combining rapidly exploring random trees and Q-learning techniques to achieve collision-free navigation. In [9], a trajectory planner without the time-consuming process of building the Euclidean signed distance field (ESDF) is proposed. EGO-planner explicitly constructs an objective function that keeps the trajectory away from obstacles, while ESDF-based methods have to calculate collision information with a heavy computation burden. Although EGO-planner reduces a significant amount of computing time compared with ESDF-based planners [10], [11], it still needs to parameterize the trajectory by a uniform B-spline curve first and then optimize it using smoothness, collision, and feasibility penalty terms. Moreover, it needs an additional time reallocation procedure to ensure dynamical feasibility and generate a locally optimal collision-free path for unknown obstacles. In this article, we propose an end-to-end DRL method to deal with unknown obstacles: our DRL method uses LiDAR sensor data as the input of a neural network and generates collision avoidance commands directly, without requiring a manually designed optimization approach such as EGO-planner. The DRL method is also integrated with our proposed global optimal path planning method for known obstacles.

These works belong to the class of sample-and-search methods, where a finite set of mesh grids is first applied to discretize the searching space and then a satisfactory connection between the initial pose and the target pose is selected. Another potentially effective strategy is to construct an optimal control model consisting of the vehicle dynamics and other vehicle-related constraints. Subsequently, well-developed optimization-based control algorithms, such as swarm intelligence-oriented algorithms [12], dynamic programming-based methods [13], [14], [15], and nonlinear model predictive control (NMPC)-based approaches, are utilized to plan and steer the motion trajectories [16], [17]. Specifically, by taking into account public traffic restrictions, a model predictive navigation and control framework was established in [16] to plan and steer the UGV trajectory. Besides, Wan et al. [17] proposed an MPC scheme capable of anticipating the motion of the front vehicle such that the autonomous car-following control task can be successfully fulfilled. Compared to sample-and-search methods, the application of optimization-based planning/control approaches has several advantages: 1) different mission/environment-related requirements (e.g., constraints) can be explicitly considered and 2) since the motion trajectory is directly planned, the path velocity decomposition becomes unnecessary. Hence, these types of algorithms have received increasing attention in the context of autonomous exploration.

B. Motivations

Different from the aforementioned research works, this article aims to design, test, and validate a fast yet optimal motion planning and control architecture for mobile robots operating under an unstructured and partly unknown environment (e.g., target region exploration, rapid rescue, and autonomous parking). For the considered mission scenarios, the main challenges result from three aspects.

1) Convergence failures can frequently be detected in the optimization process with the consideration of mechanical limitations, nonlinear path/dynamic constraints, and nonconvex collision avoidance constraints.
2) The (re)optimization process requires a large amount of onboard computing power.
3) The actual operation environment of the mobile robot is uncertain, meaning that obstacles may suddenly appear within the sensing range of the robot. This will certainly create a hazard and lead to significant safety risks.

These three issues become more prominent when it comes to real-time applications.

C. Main Contributions

The main contribution of this work lies in dealing with the three challenges stated previously and developing a hierarchical deep learning-based control framework for the autonomous exploration problem. To be more specific, for the first two issues, we make an attempt to address them by exploring the idea of planning the optimal motion trajectory via a deep neural network (DNN)-based direct recalling. This type of strategy aims to train DNNs on a preoptimized state-action dataset. Following that, the trained networks will be applied as an online motion planner to iteratively predict optimal maneuver actions. Such a direct recalling strategy can acquire near-optimal motion planning performance and better real-time applicability in comparison to other optimization-based techniques, such as NMPC and dynamic programming [18], [19], [20], [21], [22]. However, most of these works assumed that the network inputs are independent of each other and applied a fully connected network structure to learn the optimal mapping relationship. This indicates that the inherent relations between the optimized state and control action pairs may not be sufficiently exploited. Therefore, building upon our previous studies [23], we enrich the power of the DNN-based motion approximator by using a gated recurrent unit recurrent DNN (GRURDNN) model.

As for the third issue, we make an attempt to address it by taking advantage of deep reinforcement learning (DRL)-based techniques [24], [25]. More precisely, DRL has been used on autonomous vehicles for collision avoidance using LiDAR [26] or camera data [27]. However, previous research works rely heavily on the training data, and the exploration usually starts with random actions to gather experience data for replaying. These random actions are likely to result in low-quality training data and slow convergence speed. In this research, we propose a noisy prioritized experience replay (PER) algorithm to improve the training speed of the standard PER algorithm. The proposed noisy PER algorithm is integrated into the deep deterministic policy gradient (DDPG) algorithm. Compared to the vanilla PER algorithm implemented in our previous work [28], this modification can introduce more exploring actions in the early training stage and achieve a higher training speed. The effectiveness of this approach will be validated in the experimental section of this article.


D. Organization

The remainder of this article is arranged as follows. In Section II, a constrained trajectory optimization model is first formulated so as to describe the mobile robot autonomous exploration mission profile. Following that, Section III constructs the designed GRURDNN optimal trajectory approximator, while the DRL-based collision avoidance approach is presented in Section IV such that unexpected obstacles can be avoided during the maneuver process. Simulation verification analysis, along with experimental validation results on a real-world test platform, is demonstrated in Section V to confirm the effectiveness and highlight the advantages of the proposed design. Some concluding remarks are summarized and presented in Section VI.

II. PROBLEM FORMULATION

For the considered autonomous exploration task, we are interested in transferring a mobile robot from its current position to an unknown target position for an exploration purpose. During the movement, it is required that the collision risk between the mobile robot and obstacles (e.g., expected objects on the environmental map as well as unexpected ones) can be removed while simultaneously minimizing the mission completion time. To formulate this problem, a constrained nonlinear trajectory optimization model is established in this section.

A. Dynamic Model and System Constraints of a Mobile Robot

The kinematic equations of motion for the mobile robot are characterized as [13]

  ṗ_x(t) = v(t) cos(θ(t))
  ṗ_y(t) = v(t) sin(θ(t))
  θ̇(t) = ω(t)                                                        (1)
  v̇(t) = a_v(t)
  ω̇(t) = a_ω(t).

Equation (1) can be abbreviated as ẋ(t) = f(x(t), u(t)), in which the state and control variables of the mobile robot are defined as x(t) = [p_x(t), p_y(t), θ(t), v(t), ω(t)]^T ∈ R^{5×1} and u(t) = [a_v(t), a_ω(t)]^T ∈ R^{2×1}, respectively. (p_x, p_y) stands for the position of the mobile robot, and θ is the orientation angle. The linear and angular velocities are denoted as v and ω, whereas a_v and a_ω are the corresponding accelerations.

Based on (1), one can derive the nonholonomic constraint ṗ_x(t) sin(θ) − ṗ_y(t) cos(θ) = 0, indicating that the mobile robot cannot perform lateral movement [29]. However, forward maneuvers can be performed as one has ṗ_x(t) cos(θ) + ṗ_y(t) sin(θ) = v(t).

For the exploration problem, the boundary conditions for the mobile robot states at t_0 = 0 and t_f can be specified as

  x(t_0) = x_0,  x(t_f) = x_f                                         (2)

where x_0, x_f ∈ R^{5×1} are scenario-dependent. Due to the existence of mechanical/environmental limitations, the system state and control variables must vary within their allowable ranges during the entire movement. This can be written as

  x(t) ∈ X,  u(t) ∈ U                                                 (3a)
  X = {x(t) ∈ R^{5×1} : |x(t)| ≤ x̄_max}                              (3b)
  U = {u(t) ∈ R^{2×1} : |u(t)| ≤ ū_max}                              (3c)

where x̄_max = [p_xmax, p_ymax, θ_max, v_max, ω_max]^T and ū_max = [a_vmax, a_ωmax]^T are, respectively, the maximum allowable values for the state and control variables.

Remark 1: Note that, in (1), the last two dynamic equations are introduced such that the continuity and smoothness of θ(t) and v(t) can be improved. Meanwhile, it is relatively simple to consider the linear and angular acceleration constraints by imposing explicit constraints on the system control variables [see, e.g., (3c)]. Note that the continuity and smoothness of the velocity vector can be further improved by considering higher time derivatives of position after acceleration, namely, "jerk" or "snap," as used, respectively, in [30] and [31].
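For reference, the kinematic model (1) and the box constraints (3) can be illustrated with a short numerical sketch. The step size, bound values, and forward-Euler discretization below are illustrative assumptions for this sketch rather than the exact settings used by the authors (the real robot limits are those listed later in Table I).

```python
import numpy as np

# Assumed placeholder bounds corresponding to (3b)-(3c); not the authors' exact values.
X_MAX = np.array([10.0, 10.0, np.pi, 0.26, 1.82])   # [px, py, theta, v, omega]
U_MAX = np.array([0.5, 1.0])                        # [a_v, a_omega]

def f(x, u):
    """Kinematic model (1): x = [px, py, theta, v, omega], u = [a_v, a_omega]."""
    px, py, theta, v, omega = x
    return np.array([v * np.cos(theta), v * np.sin(theta), omega, u[0], u[1]])

def step(x, u, dt):
    """One forward-Euler step of (1), with the box constraints (3) enforced by clipping."""
    u = np.clip(u, -U_MAX, U_MAX)
    x_next = x + dt * f(x, u)
    return np.clip(x_next, -X_MAX, X_MAX)

# Example: propagate a constant command for one second.
x = np.zeros(5)
for _ in range(100):
    x = step(x, np.array([0.2, 0.1]), dt=0.01)
```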


B. Collision Avoidance for Expected Obstacles

In real-world scenarios, certain constraints should be imposed to avoid the collision risk between the mobile robot and obstacles existing on the exploration map. Suppose that the position information of the on-map obstacles is available for the motion planner [21], [22], [32]. We can define the regions occupied by the mobile robot and the obstacles as 𝔸(x) ⊂ R² and O^(m) ⊂ R², with m = 1, 2, . . . , N_o being the obstacle index. Then, a collision-free trajectory should satisfy the following condition:

  O^(m) ∩ 𝔸(x) = ∅,  ∀m = 1, . . . , N_o                              (4)

where the full dimensional mobile robot 𝔸(x) and the on-map obstacles O^(m) are modeled as convex polytopes

  𝔸(x) = A(x)W + B(x)                                                 (5a)
  O^(m) = {z ∈ R² | P^(m) z ≤ q^(m)}                                  (5b)

with W being an initial compact set regulated by (G, g) (e.g., W = {z ∈ R² | Gz ≤ g}), and, similarly, (P^(m), q^(m)) regulates the shape of the mth known obstacle. Besides, A(x) and B(x) are the rotation matrix and the translation vector that determine the orientation and position of the mobile robot. By introducing a safety margin parameter d_min, the collision between the robot and the mth obstacle can be avoided if the following inequality holds true:

  dist(𝔸(x), O^(m)) > d_min ≥ 0                                       (6)

with dist(𝔸(x), O^(m)) defined by

  dist(𝔸(x), O^(m)) := min_{r^(m)} { ‖r^(m)‖ : (𝔸(x) + r^(m)) ∩ O^(m) ≠ ∅ }.   (7)

Equation (6) can be directly adhered to the mobile robot trajectory optimization model as a path constraint during the entire mission t ∈ [0, t_f]. However, it may result in computational issues, such as numerical difficulties and convergence failure, for gradient-based optimization algorithms. Thanks to the proposition established in [21] and validated in [22], condition (6) can be transformed to a smoother form such that the computational complexity of the optimization process can be alleviated significantly. To be more specific, imposing (6) is equivalent to

  Find λ^(m) ≥ 0, μ^(m) ≥ 0 such that, ∀t ∈ [0, t_f],
    (P^(m) B(x) − q^(m))^T λ^(m) − g^T μ^(m) > d_min
    A(x)^T P^(m)T λ^(m) + G^T μ^(m) = 0
    ‖P^(m)T λ^(m)‖ ≤ 1                                                (8)

in which (λ^(m), μ^(m)) denote the dual variables.

C. Trajectory Optimization Formulation

According to the mobile robot dynamic model, mechanical constraints, boundary conditions, and collision avoidance constraints introduced previously, we are now in a place to establish a finite-horizon constrained trajectory optimization formulation to describe the overall exploration mission. That is, a collision-free optimal exploration trajectory for the mobile robot can be obtained by addressing the following optimization model:

  min_{x, u, λ^(m), μ^(m)}  J = Φ(x(t_f), t_f) + ∫_{t_0}^{t_f} L(x(t), u(t)) dt
  subject to  ∀m ∈ {1, . . . , N_o}, ∀t ∈ [0, t_f]:
    ẋ(t) = f(x(t), u(t))
    (P^(m) B(x) − q^(m))^T λ^(m) − g^T μ^(m) > d_min
    A(x)^T P^(m)T λ^(m) + G^T μ^(m) = 0
    ‖P^(m)T λ^(m)‖ ≤ 1
    x(t_0) = x_0,  x(t_f) = x_f
    x(t) ∈ X,  u(t) ∈ U
    λ^(m) ≥ 0,  μ^(m) ≥ 0                                             (9)

in which J stands for the objective function consisting of a terminal cost term Φ(x(t_f), t_f) and a process cost term L(x(t), u(t)). Since we are aiming to fulfill the exploration mission in the shortest time, the objective function is set to J = t_f. Note that t_f is not given in advance and is included in the constraints of optimization model (9).

Remark 2: Certain advantages can be obtained if the optimization model (9) is applied to plan the exploration trajectory of mobile robots: 1) since ‖P^(m)T λ^(m)‖ ≤ 1 is a convex constraint, an improved solution time and convergence rate can be expected; 2) this formulation considers convex obstacles, and no further assumptions are applied to the geometric shape of the obstacles; 3) additional mission-dependent conditions/requirements can be easily considered in (9) by modeling them as algebraic inequalities or equalities; and 4) though we are aiming for time-optimal exploration motions, it is simple to modify this optimization model to consider other mission objectives.

Algorithm 1 Optimal Parking Maneuver Dataset Generation
1: procedure
2:   Initialize E = ∅;
3:   Form a set of temporal nodes {t_k}_{k=1}^{N_k};
4:   Apply MCMC to sample a set of noise values {ϵ_x^i}_{i=1}^{N_i};
5:   /* Main loop */
6:   for i := 1, 2, . . . do
7:     while |E| < N_e do
8:       (a) Specify x_0 = x_0 + ϵ_x^i;
9:       (b) Use {t_k}_{k=1}^{N_k} to construct the discrete-time optimization model of (9):
             min_{x_k, u_k, λ_k^(m), μ_k^(m)}  J = Σ_{k=0}^{N_k} L(x_k, u_k) + Φ(x_{N_k}, t_{N_k})
             s.t. ∀m ∈ {1, . . . , N_o}, ∀k ∈ {0, . . . , N_k}:
               x_{k+1} = f(x_k, u_k, Δt_k)
               (P^(m) B(x_k) − q^(m))^T λ_k^(m) − g^T μ_k^(m) > d_min
               A(x_k)^T P^(m)T λ_k^(m) + G^T μ_k^(m) = 0
               ‖P^(m)T λ_k^(m)‖ ≤ 1
               x_0 = x_0 + ϵ_x^i,  x_{N_k} = x_f
               x_k ∈ X,  u_k ∈ U
               λ_k^(m) ≥ 0,  μ_k^(m) ≥ 0                              (10)
10:      (c) Perform Algorithm 2 to generate an initial feasible guess trajectory history (x^(0), u^(0));
11:      (d) Address the static nonlinear programming problem (10) via a numerical optimization algorithm [33] with (x^(0), u^(0)) as the initial guess;
12:      if the optimization tolerance is satisfied then
13:        (e) Output the optimal time history and perform E = E ∪ {(x*_k, u*_k)};
14:      else
15:        (f) Discard the unconverged solution and perform E = E ∪ ∅;
16:      end if
17:      (g) Update the index i ← i + 1;
18:    end while
19:  end for
20:  Divide E into subsets via (17);
21:  Output E, E_tr, E_te, and E_v.
22: end procedure
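To make step (d) of Algorithm 1 concrete, the following is a minimal sketch of how the discretized problem (10) could be transcribed and solved. The CasADi/IPOPT tooling, the forward-Euler transcription, and all numerical values are illustrative assumptions and not the authors' exact implementation, which is not specified at this level of detail; the dual collision-avoidance constraints (8) are only indicated by a comment to keep the sketch short.

```python
import casadi as ca
import numpy as np

Nk = 100                                   # number of time steps (cf. Section V-A)
opti = ca.Opti()

X = opti.variable(5, Nk + 1)               # states [px, py, theta, v, omega]
U = opti.variable(2, Nk)                   # controls [a_v, a_omega]
tf = opti.variable()                       # free final time, objective J = tf
dt = tf / Nk

def f(x, u):
    # kinematics (1)
    return ca.vertcat(x[3] * ca.cos(x[2]), x[3] * ca.sin(x[2]), x[4], u[0], u[1])

# Placeholder boundary conditions and bounds (2)-(3); scenario values come from Table II.
x0 = np.zeros(5)
xf = np.array([8.0, 8.0, 0.0, 0.0, 0.0])
x_max = np.array([10.0, 10.0, np.pi, 1.0, 1.0])
u_max = np.array([0.5, 0.5])

opti.minimize(tf)
opti.subject_to(tf > 0)
for k in range(Nk):
    opti.subject_to(X[:, k + 1] == X[:, k] + dt * f(X[:, k], U[:, k]))  # Euler transcription
    opti.subject_to(opti.bounded(-u_max, U[:, k], u_max))
for k in range(Nk + 1):
    opti.subject_to(opti.bounded(-x_max, X[:, k], x_max))
opti.subject_to(X[:, 0] == x0)
opti.subject_to(X[:, Nk] == xf)

# The dual-variable collision constraints (8) would be added here for each obstacle,
# introducing lambda_k^(m) >= 0 and mu_k^(m) >= 0 as additional decision variables.

opti.set_initial(tf, 20.0)
opti.solver("ipopt")
sol = opti.solve()
```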


III. GRURDNN-BASED OPTIMAL MOTION PLANNING

A GRURDNN-based real-time motion planner capable of approximating the optimal exploration trajectories is constructed in this section. This approach can be viewed as an extended version of the fully connected DNN-based motion planner advocated in our previous work [23]. The main novelty of the proposed one is that it implements a recurrent network structure with a gated recurrent unit mechanism to further exploit the inherent relationships between different network inputs (e.g., state variables of the mobile robot). To form the GRURDNN-based optimal motion planner, four main steps are required, which will be discussed in Sections III-A–III-C.

A. Forming the Exploration Trajectory Dataset

Prior to constructing the network model, a trajectory dataset E containing N_e optimized vehicle state and control evolution histories is pregenerated for a particular mobile robot exploration scenario (e.g., |E| = N_e). This can be achieved by utilizing the Markov chain Monte Carlo (MCMC) sampling method. That is, the trajectory optimization model (9) is iteratively solved with x(t_0) = x_0 + ϵ_x. Here, ϵ_x is a random noise value. Note that, in real-world applications, it is reasonable to assume that the initial configuration of the mobile robot is noise-perturbed due to the existence of localization errors.

To present the process of forming E in detail, pseudocode is established (see Algorithm 1). The final output dataset is divided into three subsets, which will be used to train, test, and validate the designed GRURDNN network. That is,

  E = E_tr ∪ E_te ∪ E_v                                               (17)

where E_tr, E_te, and E_v denote the training, testing, and validation sets, with N_tr = |E_tr|, N_te = |E_te|, and N_v = |E_v| being the sizes of E_tr, E_te, and E_v, respectively. Note that, in (10), the state and control variables at time instant t_k are denoted as x_k and u_k, respectively. The time step is defined as Δt_k = t_{k+1} − t_k with t_1 = 0 and t_{N_k} = t_f.

Algorithm 2 Initial Feasible Trajectory Generation
1: procedure
2:   Create the exploration map based on (p_xmax, p_ymax) and expected obstacle information;
3:   Grid the exploration map uniformly with a specific resolution level;
4:   Dilate the irregular obstacles to meet the safety margin d_min [34];
5:   Apply the A-star algorithm to explore a shortest collision-free trajectory X_p^(0) = (p_x^(0), p_y^(0)) between the initial and target points;
6:   for k := 0, 1, . . . , N_k do
7:     (a) Calculate other state profiles via
           θ^(0)(k) = sin⁻¹[ (p_y^(0)(k+1) − p_y^(0)(k)) / ‖X_p^(0)(k+1) − X_p^(0)(k)‖₂ ]
           v^(0)(k) = ‖X_p^(0)(k+1) − X_p^(0)(k)‖₂ / Δt_k               (11)
           ω^(0)(k) = |θ^(0)(k+1) − θ^(0)(k)| / Δt_k
8:     (b) Modify the orientation angle via
           θ^(0)(k) = θ_max   if θ^(0)(k) > θ_max
           θ^(0)(k) = −θ_max  if θ^(0)(k) < −θ_max                      (12)
9:     (c) Modify the linear velocity via
           v^(0)(k) = v̄^(0)(k)   if (v^(0)(k+1) − v^(0)(k))/Δt_k > a_vmax
                      v̲^(0)(k)   if (v^(0)(k+1) − v^(0)(k))/Δt_k < −a_vmax
                      v_max      if v^(0)(k) > v_max
                      −v_max     if v^(0)(k) < −v_max                   (13)
        with
           v̄^(0)(k+1) = v^(0)(k) + a_vmax Δt_k
           v̲^(0)(k+1) = v^(0)(k) − a_vmax Δt_k                          (14)
10:    (d) Modify the angular velocity via
           ω^(0)(k) = ω̄^(0)(k)   if (ω^(0)(k+1) − ω^(0)(k))/Δt_k > a_ωmax
                      ω̲^(0)(k)   if (ω^(0)(k+1) − ω^(0)(k))/Δt_k < −a_ωmax
                      ω_max      if ω^(0)(k) > ω_max
                      −ω_max     if ω^(0)(k) < −ω_max                   (15)
        with
           ω̄^(0)(k+1) = ω^(0)(k) + a_ωmax Δt_k
           ω̲^(0)(k+1) = ω^(0)(k) − a_ωmax Δt_k                          (16)
11:  end for
12:  Output the initial collision-free trajectory x^(0) = [p_x^(0), p_y^(0), θ^(0), v^(0), ω^(0)]^T ∈ R^{5×(N_k+1)}
13: end procedure

Remark 3: Compared to the trajectory dataset generation strategy designed in [23] and [35], an improved reference trajectory generation strategy is designed and embedded in Algorithm 1 (see Algorithm 2 for a detailed illustration). This strategy adopts the A-star algorithm to find the initial reference trajectory. It guarantees that the gradient-based numerical optimizer can quickly start the searching process at a collision-free path. As a result, the success rate of optimal solution detection, along with the time required to form the optimized trajectory dataset, is likely to be further improved. Note that the designs in [23] and [35] are likely to take longer computational time, as a preoptimization process has to be applied.

Remark 4: Performing Lines 6–14 is to generate the initial reference trajectory for θ^(0), v^(0), and ω^(0) such that they can satisfy the dynamic inequality constraints (3b). This is achieved by reshaping θ^(0), v^(0), and ω^(0) according to (12), (13), and (15). This modification may make the trajectory fail to meet the system dynamics. It is worth noting that the primary aim of this initial trajectory generation strategy is to quickly find a collision-free maneuver path to warmly start the gradient-based optimization process. It does not need to exactly satisfy the system dynamics at all time nodes.

Remark 5: Note that the exploration trajectory dataset is generated from known static obstacle locations and is, therefore, environment-specific. However, datasets generated in this way have the possibility to be applicable to other scenarios through methods such as transfer learning.
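As a concrete illustration of the reconstruction steps in Algorithm 2 (lines 6–14), the sketch below derives heading, linear velocity, and angular velocity profiles from an A-star position path by finite differences and then rate-limits and clips them toward the bounds in (3). The path, time step, and bound values are placeholders, and the A-star search itself is assumed to be provided by a standard grid planner.

```python
import numpy as np

def initial_guess_from_path(path_xy, dt, theta_max, v_max, w_max, av_max, aw_max):
    """Rough reconstruction of (11)-(16): finite-difference state profiles from an A-star path."""
    p = np.asarray(path_xy, dtype=float)                 # shape (Nk+1, 2), from a grid A-star search
    d = np.diff(p, axis=0)                               # per-step displacements
    seg = np.linalg.norm(d, axis=1) + 1e-9               # segment lengths
    theta = np.arcsin(np.clip(d[:, 1] / seg, -1.0, 1.0))            # heading, cf. (11)
    v = seg / dt                                                     # linear velocity, cf. (11)
    omega = np.abs(np.diff(theta, append=theta[-1])) / dt           # angular velocity, cf. (11)

    theta = np.clip(theta, -theta_max, theta_max)                    # (12)
    # Rate-limit then clip magnitudes, mimicking (13)-(16).
    for prof, rate, lim in ((v, av_max, v_max), (omega, aw_max, w_max)):
        for k in range(len(prof) - 1):
            delta = np.clip(prof[k + 1] - prof[k], -rate * dt, rate * dt)
            prof[k + 1] = prof[k] + delta
        np.clip(prof, -lim, lim, out=prof)
    return np.vstack([p[:-1].T, theta, v, omega])                    # x^(0), shape (5, Nk)

# Example with a hypothetical straight-line path sampled at Nk = 100 segments.
path = np.linspace([0.0, 0.0], [5.0, 3.0], 101)
x0_guess = initial_guess_from_path(path, dt=0.2, theta_max=np.pi,
                                   v_max=0.26, w_max=1.82, av_max=0.5, aw_max=1.0)
```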


B. Structuring and Training the Network

Recurrent DNN (RDNN) motion planning networks with the GRU mechanism are structured and trained on the pregenerated training dataset E_tr such that the mapping relations between the optimized states and control actions can be approximated. The motivation for the use of a DNN with a recurrent structure relies on its capability to capture the dynamic temporal behavior of the time series data, as it introduces memory blocks in neurons. Moreover, compared to the widely used long short-term memory (LSTM) mechanism, the implementation of GRU can result in a smaller number of network structural parameters, thereby easing the training process to some degree.

Fig. 1. Architecture of GRURDNN.

As for the mobile robot autonomous exploration mission, two GRURDNNs, including a linear acceleration control network N_av and an angular acceleration control network N_aω, are built in order to approximate the time-optimal actions u*_k with x_k as the network input. More precisely, we have

  a*_v(t_k) :≈ N_av(x_k)
  a*_ω(t_k) :≈ N_aω(x_k).                                             (18)

For simplicity reasons, the above equation is abbreviated as u_k :≈ N_u(x_k) for the rest of this article. As shown in Fig. 1, supposing that the GRURDNN network has N_L layers and N_n neurons at each layer, then, at time node t_k, the output of the gth neuron in the jth layer o_{j,g,t_k} can be written as

  z_{j,g,t_k} = σ_s(W_z x_{j,g,t_k} + U_z o_{j,g,t_{k−1}} + b_z)
  h_{j,g,t_k} = σ_s(W_h x_{j,g,t_k} + U_h o_{j,g,t_{k−1}} + b_h)
  ĥ_{j,g,t_k} = σ_h(W_ĥ x_{j,g,t_k} + U_ĥ (o_{j,g,t_{k−1}} ⊙ h_{j,g,t_k}) + b_c)      (19)
  o_{j,g,t_k} = z_{j,g,t_k} ⊙ ĥ_{j,g,t_k} + (1 − z_{j,g,t_k}) ⊙ o_{j,g,t_{k−1}}

where j ∈ {1, . . . , N_L} and g ∈ {1, . . . , N_n}. ⊙ stands for the Hadamard product operator. (z_(·), h_(·), ĥ_(·), o_(·)) represent, respectively, the update gate, the reset gate, the candidate activation, and the output vectors. (W_(·), U_(·), b_(·)) denote the weight matrices and bias vectors. σ_s(·) and σ_h(·) are the activation functions, which can be written as

  σ_s(x) = e^x / (1 + e^x)
  σ_h(x) = (e^x − e^{−x}) / (e^x + e^{−x}).                           (20)

To train the network and update the weight matrices and bias vectors, a performance index should be defined. As suggested in [36], the mean square error (mse) measure is applied to evaluate the network approximation ability

  L(W_(·), b_(·)) = (1/N_tr) Σ_{i=1}^{N_tr} (1/N_k) Σ_{j=1}^{N_k} ( o(t_j ΔT) − x(t_j ΔT) )²     (21)

in which ΔT = t_f / N_k, whereas o(t_j ΔT) and x(t_j ΔT) are, respectively, the actual and target network outputs. To train the network, the stochastic gradient descent (SGD) algorithm, incorporating the adaptive learning rate technique [37], is used.

C. Approximating the Optimized Exploration Trajectory

It should be noted that the construction of the optimized trajectory dataset and GRURDNN networks, along with the training process, can all be performed offline. Subsequently, the trained network can serve as the motion planner to approximate the optimal trajectory in real time. General procedures of this online approximation process are detailed in the following pseudocode (see Algorithm 3).

Algorithm 3 Approximating the Time-Optimal Exploration Trajectory
1: procedure
2:   At each time point t_k, k = 0, 1, . . ., do
3:     (a) Measure the actual state of the mobile robot x_k;
4:     (b) Approximate the optimal motion command u_k via u_k :≈ N_u(x_k);
5:     (c) Apply N_u(x_k) to propagate the nonlinear dynamics: x_{k+1} = f(x_k, N_u(x_k), Δt_k);
6:     (d) Update the time point t_k ← t_{k+1}, and go back to (a).
7: end procedure
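To give a flavor of how the planner network of Section III-B could be structured, trained with an mse loss as in (21), and then queried online as in Algorithm 3, the sketch below uses TensorFlow/Keras (the deep learning library mentioned later in Section V-E). The layer sizes, sequence length, optimizer choice, and dataset shapes are illustrative assumptions, and a single network with a two-dimensional output is used here for brevity, whereas the article trains two separate networks N_av and N_aω.

```python
import numpy as np
import tensorflow as tf

SEQ_LEN, STATE_DIM, CTRL_DIM = 4, 5, 2          # assumed: 4 sequential states in, [a_v, a_omega] out

def build_planner(num_layers=5, units=64):
    """GRU-based regression network mapping a short state history to the control command."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(SEQ_LEN, STATE_DIM))])
    for i in range(num_layers):
        # Intermediate GRU layers return sequences; the last one returns a single vector.
        model.add(tf.keras.layers.GRU(units, return_sequences=(i < num_layers - 1)))
    model.add(tf.keras.layers.Dense(CTRL_DIM))
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")   # mse loss as in (21)
    return model

# Placeholder arrays standing in for the preoptimized dataset E_tr from Algorithm 1.
x_train = np.random.randn(1000, SEQ_LEN, STATE_DIM).astype(np.float32)
u_train = np.random.randn(1000, CTRL_DIM).astype(np.float32)

planner = build_planner()
planner.fit(x_train, u_train, epochs=5, batch_size=64, validation_split=0.1)

# Online use as in Algorithm 3: feed the latest state history and read off u_k = N_u(x_k).
u_k = planner.predict(x_train[:1])
```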


IV. COLLISION-FREE DRL-BASED ONLINE CONTROL

In order to avoid unexpected collision risks when the mobile robot is following the optimal trajectory planned by the proposed GRURDNN-based algorithm, we propose a collision-free DRL-based online control algorithm. The proposed algorithm is a DRL-based mapless collision avoidance algorithm that does not need accurate map information and also has a fast computing speed, allowing it to cope with unexpected obstacles. The proposed training method enables the control policy to learn from human demonstration data instead of exploring random data. A simple yet efficient noisy weight method is also proposed to improve the training speed.

A. State and Action Space

Different from the upper motion planning level, the state space at time step t is denoted by a vector s_t that consists of the LiDAR data vector l_t, the velocity data vector v_{t−1} at the previous time step, and the relative target position vector p_t. The LiDAR data are normalized to the range of [0, 1]. p_t is represented in polar coordinates. The action space at time step t is denoted by v_t and consists of the linear velocity v_lt and the angular velocity v_at.

B. Reward Space

A mobile robot is controlled to follow the designed optimal trajectory while also taking into account unexpected obstacles. The reward function is designed to navigate the mobile robot to the next waypoint while avoiding unexpected obstacles. The reward function is defined as

  r_sum = r_gd + r_sc + r_v                                           (22)

where r_sum represents the sum reward, r_gd describes the goal distance reward, r_sc is the safety clearance reward, and r_v represents the velocity control reward. r_gd can be calculated by

  r_gd = r_a       if d_rg < d_rgmin
       = k_p d_p   otherwise                                          (23)

where the mobile robot receives the arrival reward r_a when the distance between the robot and the goal, d_rg, is less than the threshold d_rgmin. Otherwise, r_gd is proportional to d_p, which represents the distance the robot moved toward the goal in the last time step. k_p is a user-configurable parameter. The safety clearance reward r_sc can be calculated by

  r_sc = −r_cp   if d_ro < d_romin
       = 0       otherwise.                                           (24)

When the distance between the robot and an obstacle, d_ro, is less than the threshold d_romin, a negative reward/punishment −r_cp is applied. The velocity control reward r_v, the angular velocity reward r_av, and the linear velocity reward r_lv are given by

  r_v = r_av + r_lv                                                   (25)
  r_av = −r_ap   if |v_a| > v_amax
       = 0       otherwise                                            (26)
  r_lv = −r_lp   if v_l < v_lmin
       = 0       otherwise                                            (27)

where v_amax denotes the angular velocity threshold and v_lmin represents the linear velocity threshold. As we want the mobile robot to move to the goal waypoint at a high linear speed and a low angular speed, if the angular velocity v_a is larger than v_amax or the linear velocity v_l is less than v_lmin, the robot will receive the punishment value −r_ap or −r_lp, respectively.

Fig. 2. Actor network architecture: the input layer concatenates laser scanner data, relative target position, and velocity, followed by three dense layers. The number in each block indicates the dimension of the corresponding vector or layer.

Fig. 3. Critic network architecture: the input layer concatenates laser scanner data, relative target position data, velocity data, and action data, followed by three dense layers. The number in each block indicates the dimension of the corresponding vector or layer.

C. Actor/Critic Network Structure

The architectures of the actor network and the critic network are shown in Figs. 2 and 3, respectively. In Fig. 2, the input layer concatenates laser scanner data (24-D vector), relative target position (2-D vector), and velocity data (2-D vector). The input layer is followed by three dense layers, and each layer includes 512 nodes and the rectified linear unit (ReLU) function. The output layer is the velocity command, including the linear velocity generated by a sigmoid function and the angular velocity generated by a hyperbolic tangent function. The input layer and the output layer of the actor network architecture are merged and used as the input layer of the critic network, as shown in Fig. 3. The critic network also includes three dense layers, and each dense layer includes 512 nodes and the ReLU function. The Q value is finally generated by a linear activation function.

D. Human Teleoperation Data Collection

Instead of making the robot learn from scratch, this method allows the robot to learn from human demonstration data in the beginning. The human demonstration data are obtained by recording the teleoperation data of one user. The user can control the robot to navigate around obstacles and arrive at the goal by using the keyboard to change the speed and heading of the turtlebot in a simulated environment. The state data s_t at time step t consist of LiDAR data l_t, velocity data v_t, and relative position data p_t. The velocity command v_t is sent by using the keyboard. After one time step, the next state s_{t+1} can be recorded. The reward value r is evaluated using (22). Then, (s_t, v_t, s_{t+1}, r_t) is stored in the experience dataset for later sampling.

E. Noisy Prioritized Experience Replay Algorithm

To improve the training speed of the PER algorithm, we propose a noisy PER algorithm to increase the exploration rate of the control policy. We integrate the noisy PER algorithm with DDPG to enable the control policy to sample the human demonstration data more frequently in the beginning and to sample data with higher priority more frequently afterward. The noisy PER algorithm is given as follows:

  P_i = p_i^α / Σ_k p_k^α                                             (28)

where P_i represents the sampling probability of the ith transition, α is the distribution factor, and p_i stands for the priority of a set of transition data, formulated as follows:

  p_i = δ_i² + λ |∇_a Q(s_i, a_i | θ^Q)|² + ε + ε_D                   (29)

where the temporal difference (TD) error is denoted by δ, the second term λ|∇_a Q(s_i, a_i | θ^Q)|² denotes the actor loss, λ is a contribution factor, and ε represents a small positive sampling probability for each transition that ensures each transition can be sampled at least once. ε_D represents the extra sampling probability of the demonstration data, introduced to improve the sampling frequency of the demonstration data. Each transition is weighted by

  ω_i^critic = (1/(N P_i))^β                                          (30)
  ω_i^actor = (1/(N P_i))^β (1 + μ)                                   (31)

in which ω_i^critic and ω_i^actor represent the sampling weights for updating the critic network and the actor network, respectively, indicating the importance of each transition, N stands for the batch size, and β is a constant value for adjusting the sampling weight. The sampling weight of the actor network has a noisy term μ, which can be regarded as the probability density function of the standard normal distribution. Different from the noisy parameter method [38], we add the noisy term μ to the weight of the actor network directly; this does not increase the computational complexity and yet ensures that the actor network explores more actions.

F. Overall Framework

The processes of the proposed GRURDNN-based motion planner and the DRL-based online control algorithm are summarized in Fig. 4 to better illustrate the hierarchical deep learning-based control framework.

Fig. 4. Whole process of the proposed method.

V. SIMULATION VERIFICATION AND EXPERIMENTAL VALIDATION

In this section, simulation verification and experimental validation results on different autonomous exploration test cases are displayed to evaluate the performance of the proposed hierarchical deep learning-based control framework.

A. Parameter/Scenario Definitions

Three simulation test cases are executed in this study. Table I demonstrates the physical parameters/constraints of the mobile robot.

TABLE I
PHYSICAL PARAMETERS/CONSTRAINTS OF THE MOBILE ROBOT

TABLE II
INITIAL/FINAL CONFIGURATION FOR DIFFERENT TEST CASES

In terms of scenario-related parameters, such as the initial and terminal configurations of the mobile robot, they are assigned in Table II. In addition, N_o = 8, 9, 12 obstacles are randomly generated for Scenarios 1–3, respectively.

Other GRURDNN-related parameters are specified as N_k = 100 and d_min = 0.1 m. Perturbations acting on (p_x(0), p_y(0), θ(0)) of the three mission cases were supposed to be normally distributed on [−1, 1 m], [−0.5, 0.5 m], and [−0.0873, 0.0873 rad], respectively. As investigated in [23], the specification of network structural parameters, such as N_l, N_n, and the number of sequential states N_se, is likely to have a significant impact on the approximation performance. Hence, extensive test trials and analyses were performed to determine a proper combination of these parameters using the strategy proposed in [35]. Based on the obtained results, we construct the GRURDNN network with (N_l, N_n, N_se) = (5, 64, 4). A network with a more complex structure cannot bring apparent improvement with respect to the approximation ability for the considered problem. It should be noted that the best combination of these parameters may vary among problems, but the same determination strategy can be applied.

Fig. 5. Autonomous exploration results: Scenario 1.
Fig. 6. Autonomous exploration results: Scenario 2.
Fig. 7. Autonomous exploration results: Scenario 3.
Fig. 8. Autonomous exploration results: Scenario 4.

B. Performance Verification of the Motion Planner

The performance of applying the proposed GRURDNN-based motion prediction algorithm to the mobile robot autonomous exploration problem is tested and verified in this section. Due to the nature of the GRURDNN-based optimal motion predictor, an exploration trajectory can always be generated. Therefore, the solution feasibility and the optimality become the two main aspects to evaluate the GRURDNN approximation performance.

Figs. 5–8 illustrate the obtained exploration trajectories, the variation histories between the actual and preplanned optimal solutions (e.g., e = [e_x, e_y, e_θ]^T), and the velocity-/acceleration-related profiles. First, attention is given to the feasibility of the obtained GRURDNN solutions. From the profiles shown in the left column of Figs. 5–8, a collision-free maneuver trajectory
can be produced by applying the proposed method for all three test scenarios. That is, the mobile robot is able to move from its initial position to the final target point without colliding with obstacles existing on the exploration map. Besides, from the linear and angular velocity/acceleration profiles, it is obvious that the variable path constraints are satisfied during the entire maneuver process, which further confirms the feasibility of the obtained exploration solution.

Next, we focus on the optimality of the obtained exploration solutions. It can be seen from the optimization formulation (9) that the command controls (a_v, a_ω) appear in the motion dynamics linearly. Moreover, they do not appear in the path constraints explicitly. As a result, a bang-singular-bang mode can be expected in the command control profiles over t ∈ [0, t_f] if a time-optimal solution is found. According to the acceleration and the corresponding linear velocity profiles displayed in the right column of Figs. 5–8, we can see that the proposed motion planner is able to generate a nearly time-optimal exploration trajectory for all the test cases.

Furthermore, to highlight the advantages of applying the GRURDNN, comparative studies were carried out. The methods selected for comparison are the fully connected DNN-based motion planner established in [23] and the NMPC tracking scheme constructed in [18] and [20]. The corresponding results are demonstrated in Figs. 5–8, from which it is obvious that all the methods studied in this article can produce feasible exploration solutions for the considered mission cases. In terms of algorithm performance, although differences can be identified among the obtained solutions, the GRURDNN is more likely to produce an exploration trajectory closer to the preplanned optimal solution. To further evaluate the performance of the different algorithms, we pay attention to the objective function value, the accumulated variation (e.g., ∫_{t_0}^{t_f} e(t) dt), and the average execution time required for each time step, t_cpu. These results are tabulated in Table III.

TABLE III
PERFORMANCE EVALUATION OF DIFFERENT ALGORITHMS

It can be seen from Table III that the proposed method can better approximate the preplanned optimal solution than the fully connected DNN-based method suggested in [23], as lower values of J* and ∫_{t_0}^{t_f} e(t) dt are obtained. This can be attributed to the recurrent network structure with the GRU mechanism. However, the price that we paid for this performance enhancement is that the computation time required for each step tends to experience a slight increase, which is still at an acceptable level. On the contrary, the performance of the NMPC algorithm is much worse than that of the two DNN-based direct recalling strategies. Due to the consideration of different system constraints and multiple collision avoidance constraints, some NMPC online reoptimization processes might not be fully solved. Thus, the control performance is likely to become worse, and the required computation time tends to increase. In summary, the above results and discussions not only confirm the feasibility of the proposed GRURDNN motion planner but also demonstrate its enhanced approximation performance. Consequently, it is advantageous to implement the proposed GRURDNN-based approach to guide the mobile robot to fulfill the time-optimal autonomous exploration problem.
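For completeness, the two comparison metrics just mentioned, the accumulated variation ∫ e(t) dt and the average per-step execution time t_cpu, could be computed as in the short sketch below; the array names, shapes, and sample values are illustrative only.

```python
import numpy as np

def accumulated_variation(t, e):
    """Integral of the tracking-error norm over [t0, tf] by the trapezoidal rule."""
    return np.trapz(np.linalg.norm(e, axis=1), t)   # e: (N, 3) array of [ex, ey, etheta]

def average_step_time(step_times):
    """Mean execution time per planning/control step, i.e., t_cpu."""
    return float(np.mean(step_times))

# Example with placeholder logs.
t = np.linspace(0.0, 20.0, 201)
e = 0.01 * np.random.randn(201, 3)
print(accumulated_variation(t, e), average_step_time([0.0012, 0.0011, 0.0013]))
```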


Fig. 9. Experimental mobile robot: TurtleBot3 Waffle Pi.

C. Experimental Setup

It is worth noting that simply considering collision-free constraints for expected obstacles might result in significant safety issues for the mobile robot. This is because, in real-world scenarios, an object can suddenly appear around the robot due to the uncertainty of the environment or inaccurate map information. Therefore, we perform real-world experiments in an uncertain environment where unexpected obstacles will suddenly appear around the robot.

In the real-world experiment, we used the Turtlebot3 Waffle Pi mobile robot, as shown in Fig. 9, to validate the proposed hierarchical GRURDNN-DRL control framework. Turtlebot3 supports the open-source robot operating system (ROS), and the proposed algorithms were implemented under ROS. The computing platform of Turtlebot3 is the Raspberry Pi 3 board, which receives and executes the command from the host computer with an Nvidia GTX 1080 GPU and an Intel Core i9 CPU (2.9 GHz). Turtlebot3 is equipped with a 360° laser scanner for sensing the environment. The LiDAR detecting range was configured as 1 m in this experiment.

Fig. 10. Gazebo environment 1.
Fig. 11. Reward comparison in environment 1: DDPG+PER versus the proposed algorithm.
Fig. 12. Q value comparison in environment 1: DDPG+PER versus the proposed algorithm.
Fig. 13. Algorithm validation in real environment 1.
Fig. 14. Final trajectory of environment 1.

D. Validation of the Collision-Free DRL-Based Control

The DRL-based collision avoidance algorithm was trained in the Gazebo simulator, as shown in Fig. 10. The turtlebot mobile robot needs to reach the target while avoiding all the obstacles, illustrated via cubes, cuboids, cylinders, and so on. The target is represented by a red ball in the top-right corner. The laser detection range is shown as blue rays. It was assumed that the mobile robot will only detect the obstacles in front of it, and its detection range is 180°. In environment 1, the target is placed at a fixed position, as shown in Fig. 10. In environment 2, when the mobile robot collides with an obstacle, the red ball position will be reset randomly outside of the maze, and the mobile robot will be placed in the middle of the maze by default. In terms of the reward function parameters, the threshold values d_rgmin = 0.2 m, d_romin = 0.20 m, v_amax = 0.8 rad/s, and v_lmin = 0.052 m/s are set according to the geometry and capacity of Turtlebot3. Note that these values can be reconfigured for other robotic platforms. We set N = 512, λ = 0.2, ε = 0.2, and ε_D = 0.4 in the DRL algorithm based on empirical experience. The reward and the Q value of the proposed algorithm are compared with those of the vanilla PER algorithm in environment 1, as shown in Figs. 11 and 12, respectively. The proposed algorithm is also validated in the real environment, as shown in Figs. 13 and 14. The Gazebo environment 2 is shown in Fig. 15. The reward and Q value of environment 2 are shown in Figs. 16 and 17, respectively. The validation in the real environment is shown in Figs. 18–20. In environment 1, the proposed algorithm achieves an average reward value of 5 in less than 10 000 steps, whereas the vanilla PER needs around 20 000 training steps. The Q value shows the same trend as the reward. In environment 2, it is found that the average reward of the proposed algorithm arrives at 6 using only 16 000 steps, while the vanilla PER algorithm needs 31 000 steps. The reason is that the proposed algorithm introduces more exploring actions by using the noisy term in the sampling weight in order to improve the training speed of the PER algorithm. The Q value shows the same trend as the reward. It can be found that the proposed algorithm outperforms the vanilla PER algorithm in terms of training speed.


Fig. 15. Gazebo environment 2.
Fig. 16. Reward comparison in environment 2: DDPG+PER versus the proposed algorithm.
Fig. 17. Q value comparison in environment 2: DDPG+PER versus the proposed algorithm.
Fig. 18. Algorithm validation in real environment 2.
Fig. 19. Manually placed obstacle during the mission.
Fig. 20. Exploration trajectory of the mission in environment 2.

E. Validation of the Proposed Algorithms in Real Environment

The trained model was transferred to the real robot directly without fine-tuning and tested in a living room environment. Both a static environment (environment 1) and a suddenly changed environment (environment 2) were tested. The Turtlebot published its velocity, heading, and laser data to the ROS master, which is a laptop running the DRL model, and the ROS master publishes the velocity and heading commands to the mobile robot. The living room map was prescanned using the gmapping ROS node, and an optimal path was calculated by the proposed deep learning algorithm. The mobile robot will follow the preplanned path to reach the target. If an obstacle appears around the robot and blocks the way to the target position, the DRL-based algorithm will be triggered to avoid collision and navigate the robot to the next target waypoint.

As shown in Figs. 13 and 18, the rviz software, the third-person view, and the onboard camera view were recorded. The robot localization was realized using the ROS built-in adaptive Monte Carlo localization (AMCL) algorithm. The turtlebot mobile robot needs to avoid all the static obstacles and also an unexpected obstacle, which is manually placed during the mission. The static obstacles are prescanned using the ROS built-in gmapping algorithm, and they are represented as a
black area on the map. The red dots represent laser points. As we can see, only a certain range of obstacles can be scanned around the turtlebot during the mission. The final target is illustrated as a green circle.

Fig. 21. Comparison between the GRURDNN path and the real trajectory in environment 2.

In environment 1, all the obstacles were static, and the obstacle positions were similar to Scenario 4 in Section V-B. The path was generated by the proposed path planning algorithm, and the mobile robot followed the path to avoid the static obstacles. As shown in Fig. 14, the final path was similar to the path generated in Scenario 4, which validated the proposed path planning algorithm. In environment 2, after the mobile robot avoided the fourth obstacle, we manually placed one obstacle in front of the robot, and the obstacle edge was scanned as a red line, as shown in Fig. 19. It is worth noting that the placed obstacle was not marked as a black area, indicating that the placed obstacle was unknown to the mobile robot. The DRL algorithm was then triggered to navigate the robot so as to reach the final target. Note that, due to a limitation of the rviz software, the turtlebot path was originally recorded as light and thin green lines, as shown in Fig. 19. For better illustration, we highlighted the actual robot trajectory in green, as shown in Fig. 20.

The odometry data of the turtlebot were also recorded and are shown in Fig. 21. The real trajectory is shown as the blue line, and the preplanned GRURDNN path is represented as the green line. The maximum distance between the planned path and the real trajectory is around 4.5 cm, which is acceptable for the mobile robot operating in the 25 m² area. When unknown obstacles occur on the map, shown as a red rectangle, the DRL algorithm is triggered to navigate the mobile robot to avoid them. As the DRL method is an end-to-end method and does not need to compute ESDF information or solve an optimization problem, its computing time is only 1.2 ms, which is comparable with the state-of-the-art EGO-planner at 0.81 ms. Note that all of our code is developed in Python, as the deep learning library TensorFlow is also developed using Python. EGO-planner is developed using C++, which is much more computationally efficient than Python. We believe that our DRL method can be equivalent to or even faster than EGO-planner if it is redeveloped in C++ in the future. In total, it can be observed that the mobile robot was able to avoid both prescanned and unexpected obstacles throughout the way to the destination.

Compared with a conventional control algorithm that needs a specific rule for the mobile robot to follow (for example, we could specify that the translational velocity is proportional to the inverse of the angular velocity), the DRL algorithm does not need a specific rule to follow, and it can explore the optimal control policy by itself. For instance, in the experimental video, we can notice that when we placed the unexpected obstacle in front of the turtlebot, it slowed down immediately and made a big turn to avoid the obstacle by itself, and we did not design a specific control law for the translational and angular velocities.

An experimental video was also recorded to support this conclusion. Interested readers can find it in the Supplementary Material or via the link: https://youtu.be/Wbpj8RxhsmA.

VI. CONCLUSION

In this article, the problem of time-optimal autonomous exploration for mobile robots operating in an unknown environment has been investigated. An upper–lower hierarchical deep learning-based control framework is designed, which leverages a GRURDNN motion planning algorithm and a DRL-based online collision avoidance approach. The proposed GRURDNN motion planning scheme was trained offline on a pregenerated optimized maneuver trajectory dataset such that it can predict the optimal motion in real time. In this way, the time-consuming online optimization process can be omitted. Subsequently, the DRL-based online control scheme is applied to deal with unexpected obstacles that are frequently identified in real-world scenarios. After performing a number of simulation and experimental studies, some concluding remarks can be summarized.

1) It is advantageous to embed the GRU mechanism in the DNN-based motion planner, as the inherent relationships between optimized robot state and control time histories are better exploited.
2) The designed DRL-based online control scheme is able to directly learn the control policy from human demonstration data and generate more exploration than the vanilla PER algorithm, thereby reducing the time required by the training process.
3) The proposed hierarchical deep learning-based control framework has the capability of fulfilling different autonomous exploration cases effectively and efficiently in the presence of both expected and unexpected obstacles.

Consequently, we believe that the proposed control scheme can serve as a promising tool for solving the mobile robot autonomous exploration problem. In future work, the proposed strategy can be extended to address multimobile robot trajectory planning problems, where collision avoidance among the mobile robots should be considered. Moreover, transfer learning can be used to extend the network as a motion planner for different scenarios. In terms of DRL, the 2-D LiDAR data can be extended to 3-D point cloud data that can be used as the input of the neural network to perceive the 3-D environment. To improve the training efficiency, federated learning using multiple agents can also be applied.


Runqi Chai (Member, IEEE) received the Ph.D. degree in aerospace engineering from Cranfield University, Cranfield, U.K., in August 2018.
He was a Research Fellow with Cranfield University from 2018 to 2021. He was also a Visiting Researcher with The University of Manchester, Manchester, U.K., in 2021. He is currently a Professor with the Beijing Institute of Technology, Beijing, China. His research interests include trajectory optimization, networked control systems, multiagent control systems, and autonomous vehicle motion planning and control.


Hanlin Niu (Member, IEEE) received the B.Eng. degree in mechanics from Tianjin University, Tianjin, China, in 2012, and the Ph.D. degree in aeronautical engineering from Cranfield University, Cranfield, U.K., in 2018.
From July 2017 to January 2020, he was a Research Associate in robotics with Cardiff University. Since January 2020, he has been a Research Associate in robotics with The University of Manchester, Manchester, U.K., being involved in the Robotics and AI in Nuclear Project, one of the four big robotics and AI projects funded by the Engineering and Physical Sciences Research Council. His research interests include path planning, path following, collision avoidance, deep reinforcement learning of autonomous vehicles, and teleoperation of the robotic arm.
Dr. Niu is currently a member of the Robotics for Nuclear Environments Project (EPSRC no. EP/P01366X/1).

Hujun Yin (Senior Member, IEEE) received the B.Eng. degree in electronic engineering and the M.Sc. degree in signal processing from Southeast University in 1983 and 1986, respectively, and the Ph.D. degree in neural networks from the University of York, York, U.K., in 1997.
He joined The University of Manchester (UMIST before the merge), Manchester, U.K., in 1996, as a Post-Doctoral Research Associate, where he was appointed as a Lecturer in 1998 and subsequently promoted to a Senior Lecturer, a Reader, and a Professor. He has also been the Head of business engagement in AI and data of the Faculty of Science and Engineering, UMIST, since 2019.
Dr. Yin is a Turing Fellow of the Alan Turing Institute (ATI) and a member of the EPSRC Peer Review College. He has been the Vice-Chair of the IEEE Computational Intelligence Society (CIS) UK Ireland Chapter since 2019.

Joaquin Carrasco (Member, IEEE) was born in


Abarán, Spain, in 1978. He received the B.Sc. degree
in physics and the Ph.D. degree in control engineer-
ing from the University of Murcia, Murcia, Spain,
in 2004 and 2009, respectively.
From 2009 to 2010, he was with the Insti-
tute of Measurement and Automatic Control, Leib-
niz Universität Hannover, Hannover, Germany.
From 2010 to 2011, he was a Research Associate
with the Control Systems Centre, School of Electri-
cal and Electronic Engineering, The University of
Manchester, Manchester, U.K. He is currently a Reader with the Control
Systems Centre, Department of Electrical and Electronic Engineering, The
University of Manchester. His current research interests include absolute
stability, multiplier theory, and robotics applications.

Farshad Arvin (Senior Member, IEEE) received the M.Sc. degree in computer systems engineering from University Putra Malaysia, Malaysia, in 2010, and the Ph.D. degree in computer science from the University of Lincoln, Lincoln, U.K., in 2015.
He was a Senior Lecturer in robotics with The University of Manchester, Manchester, U.K. He visited several leading institutes, including the Artificial Life Laboratory, University of Graz, Graz, Austria, the Institute of Microelectronics, Tsinghua University, Beijing, China, and the Italian Institute of Technology (iit), Genoa, Italy, as a Senior Visiting Research Scholar. He is currently an Associate Professor of robotics with the Department of Computer Science, Durham University, Durham, U.K. His research interests include swarm robotics and autonomous multiagent systems.

Barry Lennox (Senior Member, IEEE) received the B.Eng. degree in chemical engineering and the Ph.D. degree in control systems from Newcastle University, Newcastle upon Tyne, U.K., in 1991 and 1996, respectively.
He was the Director of the Robotics and Artificial Intelligence for Nuclear (RAIN) Hub and is an expert in applied control and its use in robotics and process operations, with considerable experience in transferring leading-edge technology into industry. He is currently a Professor of applied control and nuclear engineering decommissioning.
Dr. Lennox is a fellow of the Royal Academy of Engineering, the Institution of Engineering and Technology (IET), and the Institute of Measurement and Control (InstMC) and a Chartered Engineer. He holds a Royal Academy of Engineering Chair in Emerging Technologies.

