Design and Experimental Validation of Deep Reinforcement Learning-Based Fast Trajectory Planning and Control For Mobile Robot in Unknown Environment
Abstract— This article is concerned with the problem of planning optimal maneuver trajectories and guiding the mobile robot toward target positions in uncertain environments for exploration purposes. A hierarchical deep learning-based control framework is proposed which consists of an upper-level motion planning layer and a lower-level waypoint tracking layer. In the motion planning phase, a recurrent deep neural network (RDNN)-based algorithm is adopted to predict the optimal maneuver profiles for the mobile robot. This approach is built upon a recently proposed idea of using deep neural networks (DNNs) to approximate the optimal motion trajectories, which has been validated to achieve fast approximation performance. To further enhance the network prediction performance, a recurrent network model capable of fully exploiting the inherent relationship between preoptimized system state and control pairs is advocated. In the lower level, a deep reinforcement learning (DRL)-based collision-free control algorithm is established to achieve the waypoint tracking task in an uncertain environment (e.g., the existence of unexpected obstacles). Since this approach allows the control policy to directly learn from human demonstration data, the time required by the training process can be significantly reduced. Moreover, a noisy prioritized experience replay (PER) algorithm is proposed to improve the exploring rate of the control policy. The effectiveness of applying the proposed deep learning-based control is validated by executing a number of simulation and experimental case studies. The simulation result shows that the proposed DRL method outperforms the vanilla PER algorithm in terms of training speed. Experimental videos are also uploaded, and the corresponding results confirm that the proposed strategy is able to fulfill the autonomous exploration mission with improved motion planning performance, enhanced collision avoidance ability, and less training time.

Index Terms— Deep reinforcement learning (DRL), mobile robot, motion control, noisy prioritized experience replay (PER), optimal motion planning, recurrent neural network, unexpected obstacles.

Manuscript received 15 October 2021; revised 4 July 2022; accepted 18 September 2022. Date of publication 10 October 2022; date of current version 5 April 2024. This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) under Project EP/S03286X/1, in part by EPSRC Robotics and Artificial Intelligence for Nuclear (RAIN) under Project EP/R026084/1, and in part by EPSRC Robotics for Nuclear Environments (RNE) under Project EP/P01366X/1. (Corresponding author: Hanlin Niu.)
Runqi Chai was with the Department of Electrical and Electronic Engineering, The University of Manchester, Manchester M13 9PL, U.K. He is now with the School of Automation, Beijing Institute of Technology, Beijing 100081, China (e-mail: [email protected]).
Hanlin Niu, Joaquin Carrasco, Hujun Yin, and Barry Lennox are with the Department of Electrical and Electronic Engineering, The University of Manchester, M13 9PL Manchester, U.K. (e-mail: hanlin.niu@manchester.ac.uk; [email protected]; [email protected]; [email protected]).
Farshad Arvin is with the Department of Computer Science, Durham University, Durham DH1 3LE, U.K. (e-mail: [email protected]).
This article has supplementary downloadable material available at https://doi.org/10.1109/TNNLS.2022.3209154, provided by the authors.
Digital Object Identifier 10.1109/TNNLS.2022.3209154

I. INTRODUCTION

THE development of unmanned ground vehicles (UGVs) or mobile robots has received considerable attention over the last two decades due to their potential benefits in terms of sensing the unknown environment, alleviating urban traffic congestion, reducing harmful emissions, and enhancing road safety. Generally speaking, four key modules are involved in a robotic system: environmental perception [1], motion planning [2], motion control [3], [4], and vehicle-to-vehicle communication [5]. Among them, the planning and control modules are mainly responsible for navigating the route and making maneuver decisions. Consequently, they are usually recognized as a key indicator of the intelligence level of a robotic system. However, there are challenges when it comes to planning and steering the motion of mobile robots in an optimal and timely way, considering the complexity of the environment and various physical limitations of the robot. These challenges, together with other mission-dependent requirements, have stimulated both scholars and engineers to explore promising solutions to this kind of problem.

A. Related Works

The mission investigated in this research work focuses on applying the mobile robot to explore unknown target regions. Numerous research works have been carried out in developing advanced path planning and control algorithms for dealing with this problem [6], [7], [8], [9]. For instance, Zuo et al. [7] proposed a two-level path planning scheme incorporating the standard A-star method and an approximate policy iteration algorithm to generate a smooth trajectory for mobile robots. Similarly, Kontoudis and Vamvoudakis [8]
proposed a hybrid path planning approach combining rapidly exploring random trees and Q-learning techniques to achieve collision-free navigation. In [9], a trajectory planner that avoids the time-consuming process of building the Euclidean signed distance field (ESDF) is proposed. EGO-planner explicitly constructs an objective function that keeps the trajectory away from obstacles, while ESDF-based methods have to calculate collision information with a heavy computation burden. Although EGO-planner reduces a significant amount of computing time compared with ESDF-based planners [10], [11], it still needs to parameterize the trajectory by a uniform B-spline curve first and then optimize it using smoothness, collision, and feasibility penalty terms. Moreover, it needs an additional time reallocation procedure to ensure dynamical feasibility and generate a locally optimal collision-free path for unknown obstacles. In this article, we propose an end-to-end DRL method to deal with unknown obstacles: our DRL method uses LiDAR sensor data as the input of the neural network and generates collision avoidance commands directly, without a manually designed optimization approach such as EGO-planner. The DRL method is also integrated with our proposed global optimal path planning method for known obstacles.

These works belong to the class of sample-and-search methods, where a finite set of mesh grids is first applied to discretize the searching space. Then, a satisfactory connection between the initial pose and the target pose is selected. Another potentially effective strategy is to construct an optimal control model consisting of the vehicle dynamics and other vehicle-related constraints. Subsequently, well-developed optimization-based control algorithms, such as swarm intelligence-oriented algorithms [12], dynamic programming-based methods [13], [14], [15], and nonlinear model predictive control (NMPC)-based approaches, are utilized to plan and steer the motion trajectories [16], [17]. Specifically, by taking into account public traffic restrictions, a model predictive navigation and control framework was established in [16] to plan and steer the UGV trajectory. Besides, Wan et al. [17] proposed an MPC scheme capable of anticipating the motion of the front vehicle such that the autonomous car-following control task can be successfully fulfilled. Compared to sample-and-search methods, the application of optimization-based planning/control approaches has several advantages: 1) different mission/environment-related requirements (e.g., constraints) can be explicitly considered and 2) since the motion trajectory is directly planned, the path velocity decomposition becomes unnecessary. Hence, these types of algorithms have received increasing attention in the context of autonomous exploration.

B. Motivations

Different from the aforementioned research works, this article aims to design, test, and validate a fast yet optimal motion planning and control architecture for mobile robots operating under an unstructured and partly unknown environment (e.g., target region exploration, rapid rescue, and autonomous parking). For the considered mission scenarios, the main challenges may result from three aspects.

1) Convergence failures can frequently be detected in the optimization process with the consideration of mechanical limitations, nonlinear path/dynamic constraints, and nonconvex collision avoidance constraints.

2) The (re)optimization process requires a large amount of onboard computing power.

3) The actual operation environment of the mobile robot is uncertain, meaning that obstacles may suddenly appear within the sensing range of the robot. This will certainly create a hazard and lead to significant safety risks.

These three issues become more prominent when it comes to real-time applications.

C. Main Contributions

The main contribution of this work lies in dealing with the three challenges stated previously and developing a hierarchical deep learning-based control framework for the autonomous exploration problem. To be more specific, for the first two issues, we make an attempt to address them by exploring the idea of planning the optimal motion trajectory via a deep neural network (DNN)-based direct recalling. This type of strategy aims to train DNNs on a preoptimized state-action dataset. Following that, the trained networks will be applied as an online motion planner to iteratively predict optimal maneuver actions. Such a direct recalling strategy can acquire near-optimal motion planning performance and better real-time applicability in comparison to other optimization-based techniques, such as NMPC and dynamic programming [18], [19], [20], [21], [22]. However, most of them assumed the network inputs are independent of each other and applied a fully connected network structure to learn the optimal mapping relationship. This indicates that the inherent relations between the optimized state and control action pairs may not be sufficiently exploited. Therefore, building upon our previous studies [23], we enrich the power of the DNN-based motion approximator by using a gated recurrent unit recurrent DNN (GRURDNN) model.

As for the third issue, we make an attempt to address it by taking advantage of deep reinforcement learning (DRL)-based techniques [24], [25]. More precisely, DRL has been used on autonomous vehicles for collision avoidance using LiDAR [26] or camera data [27]. However, previous research works rely heavily on the training data, and the exploration usually starts with random actions to gather experience data for replaying. These random actions are likely to result in low-quality training data and slow convergence speed. In this research, we propose a noisy prioritized experience replay (PER) algorithm to improve the training speed of the standard PER algorithm. The proposed noisy PER algorithm is integrated into the deep deterministic policy gradient (DDPG) algorithm. Compared to the vanilla PER algorithm implemented in our previous work [28], this modification can introduce more exploring actions in the early training stage and achieve a higher training speed. The effectiveness of this approach will be validated in the experimental section of this article.
According to the proposition established in [21] and validated in [22], condition (6) can be transformed to a smoother form such that the computational complexity of the optimization process can be alleviated significantly. To be more specific, imposing (6) is equivalent to

    Find $\lambda^{(m)} \geq 0,\ \mu^{(m)} \geq 0$
    subject to $\forall t \in [0, t_f]$:
        $(P^{(m)} B(x) - q^{(m)})^{T} \lambda^{(m)} - g^{T} \mu^{(m)} > d_{\min}$
        $A^{T}(x)\, P^{(m)T} \lambda^{(m)} + G^{T} \mu^{(m)} = 0$
        $\| P^{(m)T} \lambda^{(m)} \| \leq 1$                                    (8)

in which $(\lambda^{(m)}, \mu^{(m)})$ denotes the dual variables.

C. Trajectory Optimization Formulation

According to the mobile robot dynamic model, mechanical constraints, boundary conditions, and collision avoidance constraints introduced previously, we are now in a place to state the overall trajectory optimization model:

    $\min\ J = \Phi(x(t_f), t_f) + \int_{t_0}^{t_f} L(x(t), u(t))\, dt$
    s.t. $\dot{x}(t) = f(x(t), u(t), t)$
         $\forall m \in \{1, \ldots, N_o\},\ \forall t \in [0, t_f]$:
         $(P^{(m)} B(x) - q^{(m)})^{T} \lambda^{(m)} - g^{T} \mu^{(m)} > d_{\min}$
         $A^{T}(x)\, P^{(m)T} \lambda^{(m)} + G^{T} \mu^{(m)} = 0$
         $\| P^{(m)T} \lambda^{(m)} \| \leq 1$
         $x(t_0) = x_0,\ x(t_f) = x_f$
         $x(t) \in X,\ u(t) \in U$
         $\lambda^{(m)} \geq 0,\ \mu^{(m)} \geq 0$                               (9)

in which $J$ stands for the objective function consisting of a terminal cost term $\Phi(x(t_f), t_f)$ and a process cost term $L(x(t), u(t))$. Since we are aiming to fulfill the exploration mission in the shortest time, the objective function is set to $J = t_f$. Note that $t_f$ is not given in advance and is included in the constraints of optimization model (9).

Remark 2: Certain advantages can be obtained if the optimization model (9) is applied to plan the exploration trajectory of mobile robots: 1) since $\| P^{(m)T} \lambda^{(m)} \| \leq 1$ is a convex constraint, an improved solution time and convergence rate can be expected; 2) this formulation considers convex obstacles, and no further assumptions are applied to the geometric shape of obstacles; 3) additional mission-dependent conditions/requirements can be easily considered in (9) by modeling them as algebraic inequalities or equalities; and 4) though we are aiming for time-optimal exploration motions, it is simple to modify this optimization model to consider other mission objectives.

Algorithm 1 Optimal Exploration Maneuver Dataset Generation
1: procedure
2:   Initialize $E = \emptyset$;
3:   Form a set of temporal nodes $\{t_k\}_{k=1}^{N_k}$;
4:   Apply MCMC to sample a set of noise values $\{\delta_x^{i}\}_{i=1}^{N_i}$;
5:   /* Main loop */
6:   for $i := 1, 2, \ldots$ do
7:     while $|E| < N_e$ do
8:       (a) Specify $x_0 = x_0 + \delta_x^{i}$
9:       (b) Use $\{t_k\}_{k=1}^{N_k}$ to construct the discrete-time
10:          optimization model of (9):
             $\min_{x_k, u_k, \lambda_k^{(m)}, \mu_k^{(m)}}\ J = \sum_{k=0}^{N_k} L(x_k, u_k) + \Phi(x_{N_k}, t_{N_k})$
             s.t. $\forall m \in \{1, \ldots, N_o\},\ \forall k \in \{0, \ldots, N_k\}$
                  $x_{k+1} = f(x_k, u_k, t_k)$
                  $(P^{(m)} B(x_k) - q^{(m)})^{T} \lambda_k^{(m)} - g^{T} \mu_k^{(m)} > d_{\min}$
                  $A^{T}(x_k)\, P^{(m)T} \lambda_k^{(m)} + G^{T} \mu_k^{(m)} = 0$
                  $\| P^{(m)T} \lambda_k^{(m)} \| \leq 1$
     ⋮
18:       else
19:       (f) Discard the unconverged solution and perform $E = E \cup \emptyset$
20:       end if
21:       (g) Update the index $i \leftarrow i + 1$;
22:     end while
23:   end for
24:   Divide $E$ into subsets via Eq. (17);
25:   Output $E$, $E_{tr}$, $E_{te}$, and $E_{v}$.
26: end procedure
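To make the dataset-generation loop in Algorithm 1 concrete, the following Python sketch outlines one possible implementation. The solver call `solve_discretized_ocp`, the plain Gaussian perturbation used in place of the MCMC sampler, the dataset size, and the 80/10/10 split are illustrative assumptions standing in for the NLP transcription of (9); they are not the authors' implementation.

```python
import numpy as np

def generate_dataset(x0_nominal, solve_discretized_ocp, n_target=5000,
                     noise_scale=0.05, rng=np.random.default_rng(0)):
    """Algorithm 1 (sketch): collect converged optimal trajectories for
    perturbed initial states.  `solve_discretized_ocp` is a placeholder
    for an NLP solver applied to the discretized version of model (9);
    it is assumed to return (trajectory, converged)."""
    dataset = []
    while len(dataset) < n_target:
        # (a) perturb the nominal initial state (localization-noise model)
        x0 = x0_nominal + noise_scale * rng.standard_normal(x0_nominal.shape)
        # (b)-(e) transcribe and solve the discrete-time optimal control problem
        trajectory, converged = solve_discretized_ocp(x0)
        if converged:
            dataset.append(trajectory)   # keep the optimized state/control pairs
        # (f) otherwise the unconverged solution is simply discarded
    # split into training/testing/validation subsets, cf. Eq. (17)
    n = len(dataset)
    train = dataset[: int(0.8 * n)]
    test = dataset[int(0.8 * n): int(0.9 * n)]
    val = dataset[int(0.9 * n):]
    return train, test, val
```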
III. GRURDNN-BASED OPTIMAL MOTION PLANNING

A GRURDNN-based real-time motion planner capable of approximating the optimal exploration trajectories is constructed in this section. This approach can be viewed as an extended version of the fully connected DNN-based motion planner advocated in our previous work [23]. The main novelty of the proposed one is that it implements a recurrent network structure with a gated recurrent unit mechanism to further exploit the inherent relationships between different network inputs (e.g., state variables of the mobile robot). To form the GRURDNN-based optimal motion planner, four main steps are required, which will be discussed in Sections III-A–III-C.

A. Forming the Exploration Trajectory Dataset

Prior to constructing the network model, a trajectory dataset $E$ containing $N_e$ optimized vehicle state and control evolution histories is pregenerated for a particular mobile robot exploration scenario (e.g., $|E| = N_e$). This can be achieved by utilizing the Markov chain Monte Carlo (MCMC) sampling method. That is, the trajectory optimization model (9) is iteratively solved with $x(t_0) = x_0 + \delta_x$. Here, $\delta_x$ is a random noise value. Note that, in real-world applications, it is reasonable to assume that the initial configuration of the mobile robot is noise-perturbed due to the existence of localization errors.

To present the process of forming $E$ in detail, pseudocode is established (see Algorithm 1). The final output dataset is divided into three subsets, which will be used to train, test, and validate the designed GRURDNN network. That is,

    $E = E_{tr} \cup E_{te} \cup E_{v}$                                    (17)

where $E_{tr}$, $E_{te}$, and $E_{v}$ denote the training, testing, and validation sets, with $N_{tr} = |E_{tr}|$, $N_{te} = |E_{te}|$, and $N_{v} = |E_{v}|$ being the size of $E_{tr}$, $E_{te}$, and $E_{v}$, respectively.

Algorithm 2 Initial Feasible Trajectory Generation
1: procedure
2:   Create the exploration map based on $(p_x^{\max}, p_y^{\max})$ and expected obstacle information;
3:   Grid the exploration map uniformly with a specific resolution level;
4:   Dilate the irregular obstacles to meet the safety margin $d_{\min}$ [34];
5:   Apply the A-star algorithm to explore a shortest collision-free trajectory $X_p^{(0)} = (p_x^{(0)}, p_y^{(0)})$ between the initial and target points;
6:   for $k := 0, 1, \ldots, N_k$ do
7:     (a) Calculate other state profiles via
         $\theta^{(0)}(k) = \sin^{-1}\!\big[\, (p_x^{(0)}(k+1) - p_x^{(0)}(k)) \,/\, \| X_p^{(0)}(k+1) - X_p^{(0)}(k) \|_2 \big]$          (12)
8–9:   Modify the linear velocities via
         $v^{(0)}(k) = \begin{cases} \bar{v}^{(0)}(k), & \text{if } (v^{(0)}(k+1) - v^{(0)}(k))/\Delta t_k > a_v^{\max} \\ \underline{v}^{(0)}(k), & \text{if } (v^{(0)}(k+1) - v^{(0)}(k))/\Delta t_k < -a_v^{\max} \\ v^{\max}, & \text{if } v^{(0)}(k) > v^{\max} \\ -v^{\max}, & \text{if } v^{(0)}(k) < -v^{\max} \end{cases}$          (13)
10:    with
         $\bar{v}^{(0)}(k+1) = v^{(0)}(k) + a_v^{\max}\,\Delta t_k$, $\quad \underline{v}^{(0)}(k+1) = v^{(0)}(k) - a_v^{\max}\,\Delta t_k$          (14)
11:    (d) Modify the angular velocities via
         $\omega^{(0)}(k) = \begin{cases} \bar{\omega}^{(0)}(k), & \text{if } (\omega^{(0)}(k+1) - \omega^{(0)}(k))/\Delta t_k > a_\omega^{\max} \\ \underline{\omega}^{(0)}(k), & \text{if } (\omega^{(0)}(k+1) - \omega^{(0)}(k))/\Delta t_k < -a_\omega^{\max} \\ \omega^{\max}, & \text{if } \omega^{(0)}(k) > \omega^{\max} \\ -\omega^{\max}, & \text{if } \omega^{(0)}(k) < -\omega^{\max} \end{cases}$          (15)
12:    with
         $\bar{\omega}^{(0)}(k+1) = \omega^{(0)}(k) + a_\omega^{\max}\,\Delta t_k$, $\quad \underline{\omega}^{(0)}(k+1) = \omega^{(0)}(k) - a_\omega^{\max}\,\Delta t_k$          (16)
13:  end for
14:  Output the initial collision-free trajectory $x^{(0)} = [p_x^{(0)}, p_y^{(0)}, \theta^{(0)}, v^{(0)}, \omega^{(0)}]^T \in \mathbb{R}^{5 \times (N_k + 1)}$
15: end procedure

Note that, by using Algorithm 2 to provide an initial guess, the optimizer can quickly start the searching process at a collision-free path. As a result, the success rate of optimal solution detection, along with the time required to form the optimized trajectory dataset, is likely to be further improved. Note that the designs in [23] and [35] are likely to take a longer computational time, as a preoptimization process has to be applied.

Remark 4: Performing Lines 6–14 is to generate the initial reference trajectory for $\theta^{(0)}$, $v^{(0)}$, and $\omega^{(0)}$ such that they can satisfy the dynamic inequality constraints (3b). This is achieved by re-shaping $\theta^{(0)}$, $v^{(0)}$, and $\omega^{(0)}$ according to (12), (13), and (15). This modification may make the trajectory fail to meet system dynamics. It is worth noting that the primary aim of this initial trajectory generation strategy is to quickly find a collision-free maneuver path to warmly start the gradient-based optimization process. It does not need to exactly satisfy the system dynamics at all time nodes.

Remark 5: Note that the exploration trajectory dataset is generated from known static obstacle locations and is, therefore, environment-specific. However, datasets generated in this way have the possibility to be applicable to other scenarios through methods such as transfer learning.

B. Structuring and Training the Network

Recurrent DNN (RDNN) motion planning networks with the GRU mechanism are structured and trained on the pregenerated training dataset $E_{tr}$ such that the mapping relations between the optimized states and control actions can be approximated. The motivation for the use of a DNN with a recurrent structure relies on its capability to capture the dynamic temporal behavior of time series data, as it introduces memory blocks in neurons. Moreover, compared to the widely used long short-term memory (LSTM) mechanism, the implementation of GRU can result in a smaller number of network parameters.
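As a concrete illustration of the GRU-based recurrent structure described above, the following TensorFlow/Keras sketch builds a small sequence regressor from state histories to control commands. The layer widths, sequence length, and optimizer settings are illustrative assumptions; the exact GRURDNN architecture and training procedure used in the paper are not reproduced here, so this should be read as a sketch rather than the authors' implementation.

```python
import tensorflow as tf

def build_grurdnn(seq_len=10, state_dim=5, control_dim=2):
    """Sketch of a GRU-based recurrent motion approximator: it maps a short
    history of (pre)optimized robot states to the next control action."""
    inputs = tf.keras.Input(shape=(seq_len, state_dim))           # state sequence [p_x, p_y, theta, v, omega]
    x = tf.keras.layers.GRU(128, return_sequences=True)(inputs)   # recurrent feature extraction
    x = tf.keras.layers.GRU(128)(x)                                # summary of the whole sequence
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    outputs = tf.keras.layers.Dense(control_dim, activation="linear")(x)  # e.g., (a_v, a_omega)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")                    # supervised regression on E_tr
    return model

# Usage sketch: fit on the pregenerated dataset E_tr of optimized state/control pairs.
# model = build_grurdnn()
# model.fit(x_train, u_train, validation_data=(x_val, u_val), epochs=200, batch_size=64)
```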
Fig. 2. Actor network architecture: the input layer concatenates laser scanner data, relative target position, and velocity, followed by three dense layers. The number in each block means the corresponding dimensional number of each vector or layer.

Fig. 3. Critic network architecture: the input layer concatenates laser scanner data, relative target position data, velocity data, and action data, followed by three dense layers. The number in each block means the corresponding dimensional number of each vector or layer.

where $r_{sum}$ represents the sum reward, $r_{gd}$ describes the goal distance reward, $r_{sc}$ is the safety clearance reward, and $r_{v}$ represents the velocity control reward. $r_{gd}$ can be calculated by

    $r_{gd} = \begin{cases} r_a, & \text{if } d_{rg} < d_{rg}^{\min} \\ k_p\, d_p, & \text{otherwise} \end{cases}$                                    (23)

where the mobile robot receives the arrival reward $r_a$ when the distance between robot and goal $d_{rg}$ is less than the threshold $d_{rg}^{\min}$. Otherwise, $r_{gd}$ is proportional to $d_p$, which represents the distance the robot moves toward the goal at the last time step. $k_p$ is a user-configurable parameter. The safety clearance reward $r_{sc}$ can be calculated by

    $r_{sc} = \begin{cases} -r_{cp}, & \text{if } d_{ro} < d_{ro}^{\min} \\ 0, & \text{otherwise.} \end{cases}$                                    (24)

When the distance between robot and obstacle $d_{ro}$ is less than the threshold $d_{ro}^{\min}$, the negative reward/punishment $-r_{cp}$ is applied. The velocity control reward $r_v$, the angular velocity reward $r_{av}$, and the linear velocity reward $r_{lv}$ are given by

    $r_v = r_{av} + r_{lv}$                                    (25)
    $r_{av} = \begin{cases} -r_{ap}, & \text{if } |v_a| > v_a^{\max} \\ 0, & \text{otherwise} \end{cases}$                                    (26)
    $r_{lv} = \begin{cases} -r_{lp}, & \text{if } v_l < v_l^{\min} \\ 0, & \text{otherwise} \end{cases}$                                    (27)

where $v_a^{\max}$ denotes the angular velocity threshold and $v_l^{\min}$ represents the linear velocity threshold. As we want the mobile robot to move to the goal waypoint at a high linear speed and a low angular speed, if the angular velocity $v_a$ is bigger than $v_a^{\max}$ or the linear velocity $v_l$ is less than $v_l^{\min}$, the robot will receive the punishment value $-r_{ap}$ or $-r_{lp}$, respectively.

C. Actor/Critic Network Structure

The architectures of the actor network and the critic network are shown in Figs. 2 and 3, respectively. In Fig. 2, the input layer concatenates laser scanner data (24-D vector), relative target position (2-D vector), and velocity data (2-D vector). The input layer is followed by three dense layers, and each layer includes 512 nodes and the rectified linear unit (ReLU) function. The output layer is the velocity command, including the linear velocity generated by a sigmoid function and the angular velocity generated by a hyperbolic tangent function. The input layer and the output layer of the actor network architecture are merged and used as the input layer of the critic network, as shown in Fig. 3. The critic network also includes three dense layers, and each dense layer includes 512 nodes and the ReLU function. The Q value is finally generated by a linear activation function.

D. Human Teleoperation Data Collection

Instead of making a robot learn from scratch, this method allows the robot to learn from human demonstration data in the beginning. The human demonstration data are received by recording the teleoperation data of one user. The user can control the robot to navigate around obstacles and arrive at the goal by using the keyboard to change the speed and heading of the turtlebot in a simulated environment. The state data $s_t$ at time step $t$ consist of LiDAR data $l_t$, velocity data $v_t$, and relative position data $p_t$. The velocity command $v_t$ is sent by using the keyboard. After one time step, the next state $s_{t+1}$ can be recorded. The reward value $r$ is evaluated using (22). Then, $(s_t, v_t, s_{t+1}, r_t)$ is stored in the experience dataset for later sampling.

E. Noisy Prioritized Experience Replay Algorithm

To improve the training speed of the PER algorithm, we propose a noisy PER algorithm to increase the exploration rate of the control policy. We integrate the noisy PER algorithm and DDPG to enable the control policy to sample the human demonstration data more frequently in the beginning and sample data with higher priority more frequently afterward. The noisy PER algorithm is given as follows:

    $P_i = \dfrac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$                                    (28)

where $P_i$ represents the sampling probability of the $i$th transition, $\alpha$ is the distribution factor, and $p_i$ stands for the priority
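Based on the layer description above (24-D laser scan, 2-D relative goal, 2-D velocity, three 512-unit ReLU dense layers, sigmoid linear-velocity and tanh angular-velocity heads, and a linear Q output), a minimal TensorFlow/Keras sketch of the two networks is given below. It follows the stated dimensions, but everything else is an assumption about implementation details, not the authors' code.

```python
import tensorflow as tf

def build_actor():
    """Actor: (laser 24-D, relative goal 2-D, velocity 2-D) -> (v_linear, v_angular)."""
    laser = tf.keras.Input(shape=(24,), name="laser")
    goal = tf.keras.Input(shape=(2,), name="relative_goal")
    vel = tf.keras.Input(shape=(2,), name="velocity")
    x = tf.keras.layers.Concatenate()([laser, goal, vel])           # 28-D input layer
    for _ in range(3):                                               # three 512-node dense layers
        x = tf.keras.layers.Dense(512, activation="relu")(x)
    v_lin = tf.keras.layers.Dense(1, activation="sigmoid")(x)        # linear velocity head
    v_ang = tf.keras.layers.Dense(1, activation="tanh")(x)           # angular velocity head
    action = tf.keras.layers.Concatenate(name="action")([v_lin, v_ang])
    return tf.keras.Model([laser, goal, vel], action)

def build_critic():
    """Critic: actor inputs merged with the action, producing a scalar Q value."""
    laser = tf.keras.Input(shape=(24,))
    goal = tf.keras.Input(shape=(2,))
    vel = tf.keras.Input(shape=(2,))
    action = tf.keras.Input(shape=(2,))
    x = tf.keras.layers.Concatenate()([laser, goal, vel, action])
    for _ in range(3):
        x = tf.keras.layers.Dense(512, activation="relu")(x)
    q_value = tf.keras.layers.Dense(1, activation="linear")(x)       # Q value
    return tf.keras.Model([laser, goal, vel, action], q_value)
```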
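The proportional sampling rule in (28) can be realized with a small replay buffer such as the sketch below. The added `exploration_bonus` term, which boosts the priority of human-demonstration transitions during early training and decays over time, is our reading of the "noisy" modification described in Section IV-E; the exact noise schedule used by the authors is not reproduced here and should be treated as an assumption.

```python
import numpy as np

class NoisyPrioritizedReplay:
    """Proportional prioritized replay, cf. Eq. (28): P_i = p_i^alpha / sum_k p_k^alpha."""

    def __init__(self, alpha=0.6, rng=np.random.default_rng(0)):
        self.alpha, self.rng = alpha, rng
        self.transitions, self.priorities, self.is_demo = [], [], []

    def add(self, transition, priority, from_demo=False):
        self.transitions.append(transition)
        self.priorities.append(float(priority))
        self.is_demo.append(from_demo)

    def sample(self, batch_size, step, bonus_decay=1e-4):
        p = np.asarray(self.priorities)
        # assumed "noisy" term: favor demonstration data early, decaying with training step
        exploration_bonus = np.where(self.is_demo, np.exp(-bonus_decay * step), 0.0)
        scaled = (p + exploration_bonus) ** self.alpha
        probs = scaled / scaled.sum()                 # proportional sampling as in Eq. (28)
        idx = self.rng.choice(len(self.transitions), size=batch_size, p=probs)
        return [self.transitions[i] for i in idx], idx

    def update_priorities(self, idx, td_errors, eps=1e-6):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(float(err)) + eps   # priority from TD error magnitude
```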
TABLE I. Physical Parameters/Constraints of the Mobile Robot.
TABLE II. Initial/Final Configuration for Different Test Cases.
B. Performance Verification of the Motion Planner

The performance of applying the proposed GRURDNN-based motion prediction algorithm for the mobile robot autonomous exploration problem is tested and verified in this section. Due to the nature of the GRURDNN-based optimal motion predictor, an exploration trajectory can always be generated. Therefore, the solution feasibility and the optimality become two main aspects to evaluate the GRURDNN approximation performance.

Figs. 5–8 illustrate the obtained exploration trajectories, variation histories between the actual and preplanned optimal solutions (e.g., $e = [e_x, e_y, e_\theta]^T$), and velocity-/acceleration-related profiles. First, attention is given to the feasibility of the obtained GRURDNN solutions. From the profiles shown in the left column of Figs. 5–8, a collision-free maneuver trajectory
can be produced by applying the proposed method for all three test scenarios. That is, the mobile robot is able to move from its initial position to the final target point without colliding with obstacles existing on the exploration map. Besides, from the linear and angular velocity/acceleration profiles, it is obvious that the variable path constraints are satisfied during the entire maneuver process, which further confirms the feasibility of the obtained exploration solution.

TABLE III. Performance Evaluation of Different Algorithms.
Next, we focus on the optimality of the obtained exploration
solutions. It can be seen from the optimization formula-
tion (9) that the command controls (av , aω ) appear in the
motion dynamics linearly. Moreover, they do not appear in
the path constraint explicitly. As a result, a bang-singular-bang
mode can be expected in the command control profiles over
t ∈ [0, t f ] if a time-optimal solution is found. According to
the acceleration of the corresponding linear velocity profiles
displayed in the right column of Figs. 5–8, we can see that
the proposed motion planner is able to generate a near time-optimal exploration trajectory for all the test
cases.
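The bang-singular-bang expectation can be made precise with a standard Pontryagin-type argument. The sketch below assumes control-affine dynamics with box-bounded accelerations, which is consistent with how $(a_v, a_\omega)$ enter the model; the notation for the drift and input vector fields is ours, not the paper's.

```latex
% Minimum-time Hamiltonian for \dot{x} = f_0(x) + g_v(x)\,a_v + g_\omega(x)\,a_\omega,
% with |a_v| \le a_v^{\max} and |a_\omega| \le a_\omega^{\max}:
\begin{aligned}
H &= 1 + \lambda^{T}\!\left(f_0(x) + g_v(x)\,a_v + g_\omega(x)\,a_\omega\right),\\
a_v^{*} &= -a_v^{\max}\,\operatorname{sign}\!\big(\lambda^{T} g_v(x)\big),\qquad
a_\omega^{*} = -a_\omega^{\max}\,\operatorname{sign}\!\big(\lambda^{T} g_\omega(x)\big).
\end{aligned}
```

Since $H$ is affine in the accelerations, the minimizing controls sit at their bounds whenever the switching functions $\lambda^{T} g_v(x)$ and $\lambda^{T} g_\omega(x)$ are nonzero, with singular arcs possible on intervals where they vanish; hence the optimal acceleration commands follow a bang-singular-bang pattern.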
Furthermore, to highlight the advantages of applying
GRURDNN, comparative studies were carried out. The
methods selected for comparison are the fully connected
DNN-based motion planner established in [23] and the NMPC
tracking scheme constructed in [18] and [20]. The corre-
sponding results are demonstrated in Figs. 5–8, from where
it is obvious that all the methods studied in this article
can produce feasible exploration solutions for the considered
mission cases. In terms of algorithm performance, although
differences can be identified among obtained solutions, the
GRURDNN is more likely to produce an exploration trajectory
closer to the preplanned optimal solution. To further evaluate
the performance of different algorithms, we pay attention to
the objective function value, the accumulation of variation (e.g., $\int_{t_0}^{t_f} e(t)\,dt$), and the average execution time required for each time step $t_{cpu}$. These results are tabulated in Table III.
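For completeness, the two tabulated metrics can be computed from logged data roughly as follows; the sampling grid and the trapezoidal quadrature are assumptions about post-processing, not a description of the authors' evaluation script.

```python
import numpy as np

def accumulated_variation(t, e):
    """Approximate the accumulated deviation  integral of ||e(t)|| dt  from samples.
    t: (N,) time stamps; e: (N, 3) errors [e_x, e_y, e_theta] per sample."""
    return np.trapz(np.linalg.norm(e, axis=1), t)

def mean_step_time(step_durations):
    """Average per-step execution time t_cpu from measured solver/inference times."""
    return float(np.mean(step_durations))
```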
It can be seen from Table III that the proposed method
can better approximate the preplanned optimal solution than the fully connected DNN-based method suggested in [23], as lower values of $J^*$ and $\int_{t_0}^{t_f} e(t)\,dt$ are obtained. This can be attributed to the recurrent network structure with the GRU mechanism. However, the price that we paid for this performance enhancement is that the computation time required for each step tends to experience a slight increase, which is still at an acceptable level. On the contrary, the performance of the NMPC algorithm is much worse than the two DNN-based direct recalling strategies. Due to the consideration of different system constraints and multiple collision avoidance constraints, some NMPC online reoptimization processes might fail to converge reliably. Thus, the control performance is likely to become worse, and the required computation time tends to increase. In summary, the above results and discussions not only confirm the feasibility of the proposed GRURDNN motion planner but also demonstrate its enhanced approximation performance. Consequently, it is advantageous to implement the proposed GRURDNN-based approach to guide the mobile robot to fulfill the time-optimal autonomous exploration problem.

Fig. 9. Experimental mobile robot: TurtleBot3 Waffle Pi.

C. Experimental Setup

It is worth noting that simply considering collision-free constraints for expected obstacles might result in significant safety issues for the mobile robot. This is because, in real-world scenarios, an object can suddenly appear around the robot due to the uncertainty of the environment or inaccurate map information. Therefore, we perform real-world experiments in an uncertain environment where unexpected obstacles will suddenly appear around the robot.

In the real-world experiment, we used the Turtlebot3 Waffle Pi mobile robot, as shown in Fig. 9, to validate the proposed hierarchical GRURDNN-DRL control framework. Turtlebot3 supports the open-source robot operating system (ROS), and the proposed algorithms were implemented under ROS. The computing platform of Turtlebot3 is the Raspberry Pi 3 board, which receives and executes the command from the host computer with an Nvidia GTX 1080 GPU and an Intel Core i9 CPU (2.9 GHz). Turtlebot3 is equipped with a 360° laser scanner
for sensing the environment. The LiDAR detecting range was configured as 1 m in this experiment.

Fig. 10. Gazebo environment 1.
Fig. 12. Q value comparison in environment 1: DDPG+PER versus the proposed algorithm.
Fig. 13. Algorithm validation in real environment 1.
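The experimental sections describe the robot streaming laser and odometry data to a host computer that runs the DRL policy and publishes velocity commands back to the robot. A minimal rospy sketch of such a control node is given below; the topic names, rate, and policy loading follow standard ROS conventions and are assumptions, not a listing of the authors' node.

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import LaserScan
from nav_msgs.msg import Odometry
from geometry_msgs.msg import Twist

class DRLController(object):
    """Subscribe to /scan and /odom, run the trained policy, publish /cmd_vel."""

    def __init__(self, policy):
        self.policy = policy                    # e.g., a loaded actor network
        self.scan, self.odom = None, None
        rospy.Subscriber("/scan", LaserScan, self.on_scan)
        rospy.Subscriber("/odom", Odometry, self.on_odom)
        self.cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)

    def on_scan(self, msg):
        self.scan = msg

    def on_odom(self, msg):
        self.odom = msg

    def spin(self, rate_hz=10):
        rate = rospy.Rate(rate_hz)
        while not rospy.is_shutdown():
            if self.scan is not None and self.odom is not None:
                v_lin, v_ang = self.policy(self.scan, self.odom)   # policy inference
                cmd = Twist()
                cmd.linear.x, cmd.angular.z = v_lin, v_ang
                self.cmd_pub.publish(cmd)
            rate.sleep()

if __name__ == "__main__":
    rospy.init_node("drl_waypoint_controller")
    # DRLController(policy=load_trained_actor()).spin()   # hypothetical policy loader
```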
E. Validation of the Proposed Algorithms in Real Environment

The trained model was transferred to the real robot directly without fine-tuning and tested in a living room environment. Both a static environment (environment 1) and a suddenly changed environment (environment 2) were tested. The Turtlebot published its velocity, heading, and laser data to the ROS master, which is a laptop running the DRL model, and the ROS master published the velocity and heading commands to the mobile robot. The living room map was prescanned using the gmapping ROS node, and an optimal path was calculated by the proposed deep learning algorithm. The mobile robot then follows the preplanned path to reach the target. If an obstacle appears around the robot and blocks the way to the target position, the DRL-based algorithm will be triggered to avoid collision and navigate the robot to the next target waypoint.

As shown in Figs. 13 and 18, the rviz software, third-person view, and onboard camera view were recorded. The robot localization was realized using the ROS built-in adaptive Monte Carlo localization (AMCL) algorithm. The turtlebot mobile robot needs to avoid all the static obstacles and also an unexpected obstacle, which is manually placed during the mission. The static obstacles are prescanned using the ROS built-in gmapping algorithm, and they are represented as a
black area on the map. The red dots represent laser points. As we can see, only a certain range of obstacles can be scanned around the turtlebot during the mission. The final target was illustrated as a green circle.

In environment 1, all the obstacles were static, and the obstacle positions were similar to Scenario 4 in Section V-B. The path was generated by the proposed path planning algorithm, and the mobile robot followed the path to avoid the static obstacles. As shown in Fig. 14, the final path was similar to the path generated in Scenario 4, which validated the proposed path planning algorithm. In environment 2, after the mobile robot avoided the fourth obstacle, we manually placed one obstacle in front of the robot, and the obstacle edge was scanned as a red line, as shown in Fig. 19. It is worth noting that the placed obstacle was not marked as a black area, indicating that the placed obstacle was unknown to the mobile robot. The DRL algorithm was then triggered to navigate the robot, so as to reach the final target. Note that, due to the limitation of the rviz software, the turtlebot path was originally recorded as light and thin green lines, as shown in Fig. 19. For better illustration, we highlighted the actual robot trajectory in green, as shown in Fig. 20.

The odometry data of the turtlebot were also recorded and shown in Fig. 21. The real trajectory is shown as the blue line, and the preplanned GRURDNN path is represented as the green line. The maximum distance between the planned path and the real trajectory is around 4.5 cm, which is acceptable for the mobile robot operating in the 25 m² area. When unknown obstacles occur on the map, shown as a red rectangle, the DRL algorithm was triggered to navigate the mobile robot to avoid them. As the DRL method is an end-to-end method and does not need to compute ESDF information or solve an optimization problem, its computing time is only 1.2 ms, which is comparable with the state-of-the-art EGO-planner at 0.81 ms. Note that all of our code is developed in Python, as the deep learning library TensorFlow is also developed using Python. EGO-planner is developed using C++, which is much more computationally efficient than Python. We believe that our DRL method can be equivalent to or even faster than EGO-planner if it is redeveloped in C++ in the future. Overall, it can be observed that the mobile robot was able to avoid both prescanned and unexpected obstacles throughout the way to the destination.

VI. CONCLUSION

In this article, the problem of time-optimal autonomous exploration for mobile robots operating in an unknown environment has been investigated. An upper–lower hierarchical deep learning-based control framework is designed, which leverages a GRURDNN motion planning algorithm and a DRL-based online collision avoidance approach. The proposed GRURDNN motion planning scheme was trained offline on a pregenerated optimized maneuver trajectory dataset such that it can predict the optimal motion in real time. In this way, the time-consuming online optimization process can be omitted. Subsequently, the DRL-based online control scheme is applied to deal with unexpected obstacles that are frequently identified in real-world scenarios. After performing a number of simulation and experimental studies, some concluding remarks can be summarized.

1) It is advantageous to embed the GRU mechanism in the DNN-based motion planner, as the inherent relationships between optimized robot state and control time histories are better exploited.

2) The designed DRL-based online control scheme is able to directly learn the control policy from human demonstration data and generate more exploration than the vanilla PER algorithm, thereby reducing the time required by the training process.

3) The proposed hierarchical deep learning-based control framework has the capability of fulfilling different autonomous exploration cases effectively and efficiently in the presence of both expected and unexpected obstacles.

Consequently, we believe that the proposed control scheme can serve as a promising tool for solving the mobile robot autonomous exploration problem. In future work, the proposed strategy can be extended to address multimobile robot trajectory planning problems, where collision avoidance among the mobile robots should be considered. Moreover, transfer learning can be used to extend the network as a motion planner for different scenarios. In terms of DRL, 2-D LiDAR data can be extended to 3-D point cloud data that can be used as input of the neural network to perceive the 3-D environment. To improve the training efficiency, federated learning using multiple agents can also be applied.
Hanlin Niu (Member, IEEE) received the B.Eng. degree in mechanics from Tianjin University, Tianjin, China, in 2012, and the Ph.D. degree in aeronautical engineering from Cranfield University, Cranfield, U.K., in 2018.
From July 2017 to January 2020, he was a Research Associate in robotics with Cardiff University. Since January 2020, he has been a Research Associate in robotics with The University of Manchester, Manchester, U.K., being involved in the Robotics and AI in Nuclear Project, one of the four big robotics and AI projects funded by the Engineering and Physical Sciences Research Council. His research interests include path planning, path following, collision avoidance, deep reinforcement learning of autonomous vehicles, and teleoperation of the robotic arm.
Dr. Niu is currently a member of the Robotics for Nuclear Environments Project (EPSRC no. EP/P01366X/1).

Hujun Yin (Senior Member, IEEE) received the B.Eng. degree in electronic engineering and the M.Sc. degree in signal processing from Southeast University in 1983 and 1986, respectively, and the Ph.D. degree in neural networks from the University of York, York, U.K., in 1997.
He joined The University of Manchester (UMIST before the merge), Manchester, U.K., in 1996, as a Post-Doctoral Research Associate, where he was appointed as a Lecturer in 1998 and subsequently promoted to a Senior Lecturer, a Reader, and a Professor. He has also been the Head of business engagement in AI and data of the Faculty of Science and Engineering, UMIST, since 2019.
Dr. Yin is a Turing Fellow of the Alan Turing Institute (ATI) and a member of the EPSRC Peer Review College. He has been the Vice-Chair of the IEEE Computational Intelligence Society (CIS) UK Ireland Chapter since 2019.