
Received 26 July 2022, accepted 4 August 2022, date of publication 8 August 2022, date of current version 16 August 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3197628

Mobile Robot Path Planning Using a QAPF Learning Algorithm for Known and Unknown Environments

ULISES OROZCO-ROSAS 1, (Member, IEEE), KENIA PICOS 1, JUAN J. PANTRIGO 2, ANTONIO S. MONTEMAYOR 2, AND ALFREDO CUESTA-INFANTE 2

1 CETYS Universidad, Tijuana, Baja California 22210, Mexico
2 Universidad Rey Juan Carlos, Móstoles, 28933 Madrid, Spain
Corresponding author: Ulises Orozco-Rosas ([email protected])
This work was supported in part by the Coordinación Institucional de Investigación and the Centro de Innovación y Diseño (CEID) of
Centro de Enseñanza Técnica y Superior (CETYS Universidad); in part by the Consejo Nacional de Ciencia y Tecnología (CONACYT),
Mexico; in part by the Spanish Government Research Funding under Grant RTI2018-098743-B-I00 (MICINN/FEDER) and Grant
PID2021-128362OB-I00 (MICINN/FEDER); and in part by the Comunidad De Madrid Research Funding under Grant Y2018/EMT-5062.

ABSTRACT This paper presents the computation of feasible paths for mobile robots in known and unknown
environments using a QAPF learning algorithm. Q-learning is a reinforcement learning algorithm that has
increased in popularity in mobile robot path planning in recent times due to its self-learning capability
without requiring an a priori model of the environment. However, notwithstanding such an advantage,
Q-learning shows slow convergence to the optimal solution. To address this limitation, the concept of partially
guided Q-learning is employed, wherein the artificial potential field (APF) method is utilized to improve
the classical Q-learning approach. Therefore, the proposed QAPF learning algorithm for path planning can
enhance learning speed and improve final performance using the combination of Q-learning and the APF
method. Criteria used to measure planning effectiveness include path length, path smoothness, and learning
time. Experiments demonstrate that the QAPF algorithm successfully achieves better learning values that
outperform the classical Q-learning approach in all the test environments presented in terms of the criteria
mentioned above in offline and online path planning modes. The QAPF learning algorithm reached an
improvement of 18.83% in path length for the online mode, an improvement of 169.75% in path smoothness
for the offline mode, and an improvement of 74.84% in training time over the classical approach.

INDEX TERMS Path planning, Q-learning, artificial potential field, reinforcement learning, mobile robots.

I. INTRODUCTION
The path planning problem is a fundamental issue in mobile robot navigation because of the need of having algorithms to convert high-level specifications of tasks from humans into low-level descriptions of how to move [1]. There are some good reasons to develop efficient planning algorithms. For example, the necessity to get machines that can solve some tasks that are difficult to solve for humans involves modeling planning problems, designing efficient algorithms, and developing robust implementations. Another reason is that planning algorithms have been successfully used in various industries and academic disciplines like robotics, manufacturing, and aerospace applications, among others [2].

The path planning problem consists in how the mobile robot (MR) determines to move in an environment to its predefined goal point without colliding with the obstacles presented in the environment. This concerns the computation of a collision-free path between the start point and the goal point [3]. Path planning varies according to different environments that the MR faces, such as a known environment, partially known environment, or unknown environment. Among the types of environments, the partially known environment is the most practical, where some areas within the environment have already been known before the navigation.

The associate editor coordinating the review of this manuscript and approving it for publication was Jiachen Yang.


Furthermore, path planning can be classified into static and dynamic, depending on the nature of the obstacles. In static path planning, the position and orientation of the obstacles are unchanged with time, while in dynamic path planning, the obstacles are free to move in the environment [4].

The path planning problem has been addressed with diverse methods [5]. One approach is reinforcement learning, since it can be employed for MR path planning in known and unknown environments. Q-learning is a reinforcement learning algorithm that uses a scalar reinforcement signal, or reward, to interact with a complex environment [6]. Q-learning maps situations to actions to maximize a numerical reward through systematic learning. To obtain its own experience, the agent manages a trade-off between exploration and exploitation: it not only has to exploit what it already knows through the greedy action in the current experience, but it also must explore which action will work better in the future [7]. The mobile robot (MR), which is known as the agent, receives a reward for a collision-free action and receives a penalty when it collides with the obstacles [5].

In Q-learning, the reward concept is a key difference with respect to both supervised and unsupervised solutions. The reward is less informative than it is in supervised learning, where the agent is given the correct actions to perform. Unfortunately, information regarding correct actions is not always available. However, the reward is more informative than in unsupervised learning, where no explicit comments are made regarding performance [8].

Q-learning has been employed in mobile robot path planning due to its self-learning capability without requiring an a priori model of the environment. However, notwithstanding such an advantage, Q-learning presents slow convergence to the optimal solution [9]. To address this limitation, the concept of partially guided Q-learning through APF weighting is employed, wherein the artificial potential field (APF) method is utilized to improve the classical Q-learning approach. The APF method is employed in path planning due to its effectiveness in providing smooth and safe paths. However, it presents disadvantages such as the local minima problem or the goal being non-reachable with obstacles nearby, among others [10]. Therefore, the proposed QAPF learning algorithm for path planning can enhance learning speed and improve final performance using the combination of Q-learning and the APF method. In this manner, the QAPF learning algorithm overcomes or mitigates the disadvantages presented by both conventional methods.

The main contribution of this paper is the development of a new robust algorithm based on Q-learning and the APF method to solve path planning problems for mobile robots. The proposed QAPF learning algorithm produces short, smooth, and collision-free paths. The proposed QAPF learning algorithm is extensively validated against a path planning algorithm based on the classical Q-learning (CQL) approach. Both planning algorithms are tested in benchmark environments and compared concerning measures such as path length, path smoothness, and computation time. The contributions of this paper can be summarized as follows:
• A reinforcement learning-based algorithm combined with the artificial potential field method, called the QAPF learning algorithm, is proposed to solve mobile robot path planning problems.
• The QAPF learning algorithm includes three operations: exploration, exploitation, and APF weighting, to overcome the limitations presented by the classical Q-learning approach in path planning.
• A set of experiments and studies demonstrates that the proposed QAPF learning algorithm successfully solves diverse path planning problems arising from numerous complex environments that the mobile robot can face.

The organization of this paper is as follows. Section II summarizes the literature review and related work. Section III describes the path planning problem, the Q-learning algorithm, and the APF method. Section IV explains in detail the proposed QAPF learning algorithm. Section V presents the results of simulations that demonstrate the applicability of the proposal. Finally, the conclusions of this paper are drawn in Section VI.

II. RELATED WORK
In the literature, there are several proposals to address path planning problems using reinforcement learning [11], [12]. Some proposals based on reinforcement learning have been combined with other techniques to improve performance. In [13], the combination of reinforcement learning with an improved artificial potential field is presented to solve path planning problems. Another example, now combining reinforcement learning with a metaheuristic, can be found in [14]. In that work, a reinforcement learning-based grey wolf optimizer (RLGWO) algorithm is presented to solve unmanned aerial vehicle (UAV) path planning problems.

Furthermore, there are numerous proposals to address path planning problems using Q-learning. A few of them, which are worth mentioning, are described next. In [15], a modified Q-learning algorithm is presented to solve the path planning problem in product assembly. In that work, the proposed algorithm speeds up convergence by adding a dynamic reward function, optimizes the initial Q table by introducing knowledge and experience through a case-based reasoning algorithm, and prevents entry into trapped areas through an obstacle avoiding method.

Another example is presented by Maoudj and Hentout [16]. They propose an efficient Q-learning (EQL) algorithm to overcome the limitations of slow and weak convergence and ensure an optimal collision-free path in the least possible time. In this regard, in that work a reward function is proposed to initialize the Q table and provide the robot with prior knowledge of the environment, followed by an efficient selection strategy proposal to accelerate the learning process through search space reduction while ensuring rapid convergence toward an optimized solution.


In [17], a flexible Q-learning-based model is proposed to handle continuous space problems for decentralized multi-agent robot navigation in cluttered environments. In that research, an agent-level decentralized collision avoidance low-cost model is presented. Furthermore, a method to merge non-overlapping Q-learning features is employed to reduce its size significantly and make it possible to solve more complicated scenarios with the same memory size.

Bulut proposed an improved epsilon-greedy Q-learning (IEGQL) algorithm to enhance efficiency and productivity regarding path length and computational cost [18]. The IEGQL presents a reward function that ensures the environment's knowledge in advance for a mobile robot, and mathematical modeling is presented to provide the optimal selection besides ensuring a rapid convergence.

Some proposals have combined Q-learning with other algorithms to improve performance. An example is found in [19], where a combination of the Manhattan distance and Q-learning (CMD-QL) is presented to improve the convergence speed. In the CMD-QL, the Q table is firstly initialized with the Manhattan distance to enhance the learning efficiency of the initial stage of Q-learning; secondly, the selection strategy of the ε-greedy action is improved to balance the exploration-exploitation relationship of the mobile agent's actions. The CMD-QL was tested under known, partially known, and unknown environments, respectively. The results revealed that the CMD-QL can converge to the optimal path faster than the classical Q-learning method.

Another example, now combining Q-learning with a metaheuristic, can be found in [20], where Sadhu et al. proposed a modification of the Firefly Algorithm by including the Q-learning framework into it, called QFA. The QFA has been employed to plan the path of the robot arm end-effector such that it can reach a pre-assigned goal position by traversing the minimum possible distance while dodging the obstacles present in the environment.

Another work is presented by Low et al. [5]. They present an improved Q-learning based on the Flower Pollination Algorithm (FPA) for mobile robot path planning. Through the integration of the prior knowledge obtained from the FPA into the classical Q-learning, the initialization of the Q-values serves as a good exploration foundation to accelerate the learning process of the mobile robot.

There are also some proposals employing deep Q-learning, as in [21], where Wu et al. presented a tailored design of state and action spaces and a dueling deep Q-network. That work consists of a deep reinforcement learning method for the autonomous navigation and obstacle avoidance of unmanned surface vehicles. In that work, the proposal outperforms related methods in the efficiency of exploration and the speed of convergence in static and dynamic environments.

III. PRELIMINARIES
In this section, we present the problem definition and the fundamentals of the proposed QAPF learning algorithm for mobile robot path planning. We start with a general definition of the path planning problem. Next, we explain in detail the Q-learning algorithm. Last, we conclude this section with the fundamentals of the artificial potential field method.

A. PATH PLANNING PROBLEM DEFINITION
Path planning is a problem that requires finding a continuous path, QG, between a given start point, q0, and the goal point, qf, for a particular system, subject to a variety of constraints [22]. Under this general definition, in this paper, we have defined the problem in a simplified form, which is described as follows.

The mobile robot (MR) environment is defined as a two-dimensional map that includes a set of obstacles Oj, j = 1, 2, . . . , n, where n is the number of obstacles in the environment. The instantaneous position of the MR is denoted by q, which is represented by its coordinates, physical radius, and angle of orientation, i.e., q(x, y, φ, θ). In this definition, we have assumed a circular occupancy area for the MR. The path planning goal is to find a feasible sequence QG of configurations that can drive the MR from a start point, q0, to a goal point, qf, in the x-y plane. Fig. 1 depicts the definition.

FIGURE 1. Path planning problem definition.
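For illustration, the environment and robot configuration defined above map naturally onto two small data structures. The sketch below is ours, not the authors' code: the class names and the contains helper are assumptions made for clarity, using the rectangular obstacle format Oj(xj, yj, lj, wj) that the paper employs later for its test environments.

```python
from dataclasses import dataclass

@dataclass
class Obstacle:
    """Axis-aligned rectangular obstacle Oj(xj, yj, lj, wj): left-bottom
    vertex at (x, y), with length l and width w."""
    x: float
    y: float
    l: float
    w: float

    def contains(self, px: float, py: float, margin: float = 0.0) -> bool:
        # True if the point (px, py) lies inside the obstacle, optionally
        # inflated by the robot radius to account for its circular footprint.
        return (self.x - margin <= px <= self.x + self.l + margin
                and self.y - margin <= py <= self.y + self.w + margin)

@dataclass
class RobotConfig:
    """Mobile robot configuration q(x, y, phi, theta)."""
    x: float
    y: float
    phi: float = 0.2    # physical radius in meters (value used in Section V)
    theta: float = 0.0  # orientation angle

# A path QG is then simply a list of configurations from q0 to qf.
```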
B. Q-LEARNING
Reinforcement learning, inspired by animal learning in psychology, learns optimal decision-making strategies from experience. It defines any decision maker as an agent and everything outside the agent as the environment. The agent aims to maximize the accumulated reward and obtains a reward value as a feedback signal for training through interaction with the environment [23]. Reinforcement learning is categorized into two groups: model-based methods and model-free methods.

Model-based methods always utilize a model of the environment to predict rewards for unseen state-action pairs. Model-free methods can be divided into two branches: policy-based methods and value-based methods. Specifically, the policy-based methods aim to generate a policy, in which the input is a state, and the output is an action.


These works apply deterministic policies, which generate an action directly. For value-based methods, the action with the maximum Q-value over all the possible actions is selected as the best action [24]. Therefore, Q-learning is a value-based method. During the learning, the agent performs an action with the highest expected Q-values to estimate the optimal policy [14].

Q-learning [25] is model-free reinforcement learning, which combines theories, including the Bellman equations and the Markov decision process (MDP), with temporal-difference (TD) learning. It is a way for agents to learn how to act optimally in controlled Markovian domains. Its form is one-step Q-learning, which is defined by Eq. (1).

Q(s, a) = (1 − α)Q(s, a) + α(r + γ max(Q(s′, ∀a′)))   (1)

where Q(s, a) estimates the action value after applying an action a in state s, α is the learning rate, γ is the discount factor, and r is the immediate reward received [26]. The Q-learning main components are as follows.
1) Agent: The learner that can interact with the environment via a set of sensors and actuators.
2) Environment: Everything that interacts with an agent, i.e., everything outside the agent.
3) Policy: A mapping from the perceived states set, S, to the actions set, A.
4) Reward function: A mapping from state-action pairs to a scalar number.
5) Q-value: The total amount of reward an agent can expect to accumulate over the future, starting from that state.

Q-learning works by successively improving its evaluations of the quality of actions at particular states. Learning proceeds similarly to Sutton's method of temporal differences: an agent tries an action at a particular state and evaluates its consequences in terms of the immediate reward or penalty it receives and its estimate of the value of the state to which it is taken. By trying all actions in all states repeatedly, it learns which are best overall, judged by long-term discounted reward [25].

In Q-learning, the agent's experience consists of a sequence of distinct stages or episodes. In the nth episode, the agent:
1) Observes its current state st
2) Selects and performs an action at
3) Observes the subsequent state st+1
4) Receives an immediate payoff rt
Then, the agent adjusts its Q values using a learning rate α, according to Eq. (1); note that this description assumes a look-up table representation for Qm×n(st, at).

A learning agent is composed of two fundamental parts, a learning element and a performance element. The design of a learning element is dictated by what type of performance element is used, which functional component is to be learned, how that functional component is represented, and what kind of feedback is available [8].
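As a minimal, runnable illustration of the one-step update in Eq. (1), the sketch below assumes a tabular Q stored as a NumPy array indexed by integer state and action; this encoding is an assumption for illustration, not the authors' implementation.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.8):
    """One-step Q-learning update, Eq. (1):
    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
    return Q

# Example: a toy table with 4 states and 2 actions.
Q = np.zeros((4, 2))
q_update(Q, s=0, a=1, r=0, s_next=2)
```

The default values alpha = 0.3 and gamma = 0.8 anticipate the settings reported later in Section V.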
C. ARTIFICIAL POTENTIAL FIELD
The main idea of the artificial potential field (APF) method [27] is to establish an attractive potential field force around the goal point, as well as to establish a repulsive potential force around obstacles [28]. By this idea, the APF method employs attractive and repulsive components to draw the MR to its goal while keeping it away from obstacles. Therefore, the total artificial potential field, U(q), includes two terms, the attractive potential function, Uatt(q), and the repulsive potential function, Urep(q). The total artificial potential field, U(q), is then the sum of these two potential functions, as indicated in Eq. (2).

U(q) = Uatt(q) + Urep(q)   (2)

The attractive potential function is described by Eq. (3), where q represents the current MR position. The goal point is represented by qf and katt is a positive scalar-constant that represents the attractive proportional gain of the function.

Uatt(q) = (1/2) katt (q − qf)²   (3)

The repulsive potential function is denoted by Eq. (4), where ρ0 represents the limit distance of influence of the repulsive potential field and ρ is the shortest distance from the MR to the obstacle. The influence of the repulsive potential field, Urep(q), is presented in two cases. The first case is presented when the MR is under the influence of the obstacle, that is, if the distance from the robot to the obstacle, ρ, is less than or equal to the limit distance of influence, ρ0. Otherwise, in the second case, the repulsive potential field will be zero and the MR will be free of the influence of that obstacle. The selection of the distance ρ0 depends on the MR maximum speed. The krep is a positive scalar-constant that represents the repulsive proportional gain of the function. Therefore, the repulsive potential function has a limited range of influence, and it prevents the movement of the MR from being affected by a distant obstacle [29].

Urep(q) = (1/2) krep (1/ρ − 1/ρ0)²  if ρ ≤ ρ0;  Urep(q) = 0  if ρ > ρ0   (4)

The artificial potential field method is widely employed in path planning for mobile robots due to its simplicity, mathematical elegance, and effectiveness in providing smooth and safe planning [10]. Therefore, in this paper, the artificial potential field method is utilized to improve the classical Q-learning approach. In that sense, the proposed QAPF learning algorithm for path planning can enhance learning speed and improve final performance using the combination of Q-learning and the APF method.
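The potential functions of Eqs. (2)–(4) translate directly into code. The sketch below is a simplified reading of the method, not the authors' implementation: it takes the clearance ρ to each obstacle as a precomputed input and sums the repulsive term over all obstacles within range, which is one common convention, and the influence distance ρ0 = 1.0 is an arbitrary placeholder rather than a value from the paper.

```python
import math

def u_att(q, q_goal, k_att=0.25):
    """Attractive potential, Eq. (3): 0.5 * k_att * ||q - q_goal||^2."""
    dx, dy = q[0] - q_goal[0], q[1] - q_goal[1]
    return 0.5 * k_att * (dx * dx + dy * dy)

def u_rep(rho, rho_0=1.0, k_rep=0.60):
    """Repulsive potential, Eq. (4), given the shortest distance rho to an obstacle."""
    if rho <= 0.0:
        return math.inf                     # on or inside the obstacle
    if rho <= rho_0:
        return 0.5 * k_rep * (1.0 / rho - 1.0 / rho_0) ** 2
    return 0.0                              # outside the influence range

def u_total(q, q_goal, clearances, rho_0=1.0):
    """Total field, Eq. (2): U(q) = Uatt(q) + Urep(q) over the nearby obstacles."""
    return u_att(q, q_goal) + sum(u_rep(rho, rho_0) for rho in clearances)
```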
IV. QAPF LEARNING ALGORITHM FOR PATH PLANNING
To improve the disadvantage of slow convergence in the classical Q-learning, the QAPF learning algorithm is proposed in this work, which includes three operations: exploration, exploitation, and APF weighting.


The proposed QAPF learning algorithm combines Q-learning and the artificial potential field (APF) method to improve the performance of the MR path planning. The objective of this section is to explain in detail the proposed QAPF learning algorithm. Therefore, firstly the classical Q-learning (CQL) pseudocode (Algorithm 1) is presented. Then, the proposed QAPF learning pseudocode (Algorithm 2) is explained. Lastly, the path generator pseudocode (Algorithm 3) is provided. Both the CQL algorithm and the QAPF learning algorithm employ Algorithm 3 to generate the path.

A. Q-LEARNING ALGORITHM
Algorithm 1 presents the CQL pseudocode for MR path planning. The CQL algorithm employs the next input parameters: the goal point, qf, and the environment information, Oj. The objective of the CQL algorithm is to obtain the learning values Qm×n to build the path, QG, through Algorithm 3. The parameter that Algorithm 1 returns is the array Qm×n.

Algorithm 1 CQL
Input: goal point qf and environment information Oj
Output: learning values Qm×n
1  initialize Qm×n(st, at) ← {0}
2  for each episode do
3      set st ← a random state from the states set S
4      while st ≠ qf and safe do
5          choose the best at in st by using Qm×n
6          perform action at and receive reward r
7          find out the new state st+1
8          update Qm×n(st, at) using Eq. 1
9          st ← st+1
10     end
11 end
12 return Qm×n

Algorithm 1 in line 1 initializes to zero the learning values Qm×n. From lines 2 to 11, there is the learning iterative process. The stop condition for the learning iterative process is presented when the maximum number of episodes, Nep, is reached.

In line 3, a random state is assigned to the current state, st. From lines 4 to 10, the explore-exploit process is performed. The stop condition is presented when the current state, st, is equal to the goal point, qf, or the Boolean flag safe is False. The objective of the flag safe is to serve as a stop condition if a collision is presented with the obstacles or when the MR steps into the limits of the workspace.

In line 5, the best at in st is chosen using Qm×n. In line 6, the action, at, is performed and a reward is received. Next, in line 7, the new state, st+1, is computed. Then, in line 8, the learning value, Qm×n(st, at), is updated using Eq. 1. Last, in line 9, the new state, st+1, is assigned to the current state st. In the end, Algorithm 1 returns the resultant learning values Qm×n to build the path, QG, through Algorithm 3.
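A compact Python rendering of Algorithm 1 is given below for reference. The helpers random_state, step, and is_goal stand in for the grid environment and are assumptions; only the loop structure and the Eq. (1) update follow the pseudocode.

```python
import numpy as np

def train_cql(n_states, n_actions, n_episodes, random_state, step, is_goal,
              alpha=0.3, gamma=0.8):
    """Classical Q-learning (Algorithm 1, sketch).
    step(s, a) -> (s_next, reward, safe); is_goal(s) -> bool."""
    Q = np.zeros((n_states, n_actions))            # line 1: Q initialized to zero
    for _ in range(n_episodes):                    # line 2: for each episode
        s, safe = random_state(), True             # line 3: random start state
        while not is_goal(s) and safe:             # line 4: until goal or unsafe
            a = int(np.argmax(Q[s]))               # line 5: best action from Q
            s_next, r, safe = step(s, a)           # lines 6-7: act and observe
            Q[s, a] = (1 - alpha) * Q[s, a] + \
                      alpha * (r + gamma * np.max(Q[s_next]))   # line 8: Eq. (1)
            s = s_next                             # line 9
    return Q                                       # line 12
```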
B. QAPF LEARNING ALGORITHM
Algorithm 2 presents the QAPF pseudocode for MR path planning. The QAPF learning algorithm employs the next input parameters: the goal point, qf, that the MR must achieve and the environment information, which is composed of n rectangular obstacles in Oj(xj, yj, lj, wj) format. The main objective of the proposed QAPF learning algorithm is to obtain the learning values Qm×n to build the path, QG = [q0, q1, . . . , qf], through Algorithm 3, i.e., a collision-free path that will drive the MR to achieve the goal point with the minimum path length. The parameter that Algorithm 2 returns is the array Qm×n.

Algorithm 2 in line 1 initializes to zero the learning values Qm×n. From lines 2 to 21, there is the learning iterative process of the proposed QAPF algorithm to find the learning values Qm×n to generate the path through Algorithm 3. The stop condition for the learning iterative process is presented when the maximum number of episodes, Nep, is reached.

Algorithm 2 QAPF
Input: goal point qf and environment information Oj
Output: learning values Qm×n
1  initialize Qm×n(st, at) ← {0}
2  for each episode do
3      set st ← a random state from the states set S
4      while st ≠ qf and safe do
5          if ζ < uniform random number then
6              compute probability pi using Eq. 5
7              compute APF weighting using Eq. 6
8              choose at in st by using APF weighting
9          else
10             if ζ < uniform random number then
11                 choose the best at in st by using Qm×n
12             else
13                 choose random at in st
14             end
15         end
16         perform action at and receive reward r
17         find out the new state st+1
18         update Qm×n(st, at) using Eq. 1
19         st ← st+1
20     end
21 end
22 return Qm×n

In line 3, a random state is assigned to the current MR state, st. From lines 4 to 20, there is the decision iterative process to operate on artificial potential field (APF) weighting, exploitation, or exploration. The stop condition for the decision process is presented when the current state, st, is equal to the goal point, qf, or the Boolean flag safe is False. The main objective of the flag safe is to serve as a stop condition if a collision is presented with the obstacles or when the MR steps into the limits of the workspace.


In line 5, a uniform random number is generated. If the uniform random number is greater than the decision rate, ζ ∈ (0, 1], the APF weighting procedure will be performed. Otherwise, the explore-exploit approach will be performed.

For the APF weighting, we employ the Moore neighborhood, which is defined on a two-dimensional square lattice and is composed of a central cell, in this case, the current MR state st, and the eight cells that surround it. The probabilities assigned to the neighbor cells of st are inversely proportional to their total artificial potential field. A neighbor cell with the lowest total artificial potential field has the greatest probability of being assigned to st+1, while the neighbor cell with the highest total artificial potential field has the lowest probability of being assigned to st+1. A random number determines which neighbor cell is selected.

Line 6 computes the probability pi for i = 1, . . . , k, where k is the number of neighbor cells. The probability pi is defined by Eq. (5), which computes the inverse of the total artificial potential field, Eq. (2), for the state of each neighbor cell qi.

pi = 1 / U(qi)   (5)

In line 7, the standard (unit) APF weighting function is computed. The APF weighting function σ : Rᵏ → (0, 1)ᵏ is defined by Eq. (6) for i = 1, . . . , k and p = (p1, . . . , pk) ∈ Rᵏ.

σ(p)i = pi / Σ_{j=1}^{k} pj   (6)

The cumulative probabilities obtained by the APF weighting function are used in choosing the action at in the state st (line 8). First, the probabilities contained in p are sorted. Then, a random number between zero and one is generated. Starting at the top of the list, the first neighbor cell with a cumulative probability that is greater than the random number is selected for the state st+1.
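The APF weighting step (Algorithm 2, lines 6–8) can be sketched as follows. The function receives the Moore neighbors of st and a callable U that evaluates the total potential of Eq. (2); both are assumed inputs, and the roulette-wheel selection mirrors the sorted cumulative-probability rule described above.

```python
import random

def apf_weighting_choice(neighbors, U):
    """Pick the next state among the Moore neighbors of st with probability
    inversely proportional to their total artificial potential field."""
    p = [1.0 / U(q) for q in neighbors]                  # Eq. (5): pi = 1 / U(qi)
    sigma = [pi / sum(p) for pi in p]                    # Eq. (6): normalize to (0, 1)
    ranked = sorted(zip(sigma, neighbors),
                    key=lambda t: t[0], reverse=True)    # sort probabilities (line 8)
    u, cumulative = random.random(), 0.0
    for prob, q in ranked:
        cumulative += prob
        if cumulative > u:                               # first cell whose cumulative
            return q                                     # probability exceeds u
    return ranked[-1][1]                                 # numerical safety fallback
```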
If the APF weighting procedure is not executed, the explore-exploit approach will be performed (line 9). In line 10, a uniform random number is generated. If the uniform random number is greater than the decision rate, ζ, the exploitation procedure will be performed. Otherwise, the exploration procedure is performed.

Exploitation highlights the direction of search to control the search within the neighborhood of the best solutions obtained by exploration. In that sense, in line 11, the best action, at, in the state, st, is chosen using the learning values, Qm×n.

If the exploitation procedure is not executed, the exploration approach will be performed (line 12). Exploration involves the process of determining different candidate solutions by randomly exploring the search space. In that way, in line 13, a random action, at, in the state, st, is executed.

Once the action, at, has been chosen through the APF weighting, exploitation, or exploration procedure, in line 16 the action, at, is performed and a reward is received. The agent's sole objective is to maximize the total reward that it receives over the long run. The reward signal thus defines what are the good and bad events for the agent [30].

Next, in line 17, the new state, st+1, is computed using Eq. 7. Each action, at, when applied from the current state, st, produces a new state, st+1, as specified by the state transition function, f, that is f(st, at) = st + at, in which st ∈ S and at ∈ A.

st+1 = f(st, at)   (7)

Then, in line 18, the learning value, Qm×n(st, at), is updated using Eq. 1, where Qm×n(st, at) estimates the action value after applying an action at in state st. Last, in line 19, the new state, st+1, is assigned to the current state st to continue with the iterative process until the goal point, qf, is reached or an unsecured condition occurs.

In the end, Algorithm 2 returns the resultant learning values Qm×n to build the path QG through Algorithm 3. Fig. 2 summarizes the process described by Algorithm 2. The learner and decision maker is the mobile robot (MR). Everything outside the MR is considered as the environment. These interact continually, with the MR selecting actions through the operations of exploration, exploitation, or APF weighting, and the environment responding to these actions and presenting new situations to the MR. The environment also gives rise to rewards, special numerical values that the MR seeks to maximize over time through its choice of actions [30].

FIGURE 2. Generalized framework of the proposed QAPF learning algorithm.
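Putting the pieces together, the action-selection block of Algorithm 2 (lines 5–15) reduces to the decision rule sketched below, where ζ is the decision rate and apf_action is assumed to wrap the APF weighting choice shown earlier; this is an illustration of the control flow, not the authors' implementation.

```python
import random

def qapf_select_action(s, Q, actions, apf_action, zeta=0.2):
    """QAPF action selection: APF weighting, exploitation, or exploration."""
    if zeta < random.random():                       # line 5: APF weighting branch
        return apf_action(s)                         # lines 6-8
    if zeta < random.random():                       # line 10: exploitation branch
        return max(actions, key=lambda a: Q[s, a])   # line 11: greedy on Qmxn
    return random.choice(actions)                    # line 13: exploration
```

With ζ = 0.2, the APF-weighted branch is taken roughly 80% of the time, which is how the potential field partially guides the learning.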


The classical Q-learning approach employs random numbers or zeros to initialize the learning values Qm×n. In this work, we use zeros for both the CQL algorithm and the QAPF learning algorithm. Therefore, the learning efficiency in the initial stage is low. Moreover, the balance between exploration and exploitation of actions is a fundamental factor affecting the convergence speed of the classical Q-learning approach. If the exploration rate is too high, it will cause some high-value behaviors not to be effectively used and will affect the path planning; if the exploration rate is too low, it will make some high-value behaviors unable to be effectively trained in a short time and will slow down the learning convergence [19]. To overcome this problem, the proposed QAPF learning algorithm employs the concept of partially guided Q-learning through the APF weighting to virtualize the a priori environment into a total artificial potential field and compute the appropriate learning values, Qm×n, and, in consequence, to speed up the convergence.

The convergence is presented when the learning values Qm×n remain unchanged or are under a certain bound set in advance. In that sense, the stop condition in this work for the learning process is presented when the maximum number of episodes, Nep, is reached. Therefore, the convergence of the learning values Qm×n means that the MR can plan a feasible path based on its experience [9]. However, the MR will not necessarily find the global shortest path. It will find a nearly optimal or optimal path in the best of cases. Consequently, the shortest path is defined on the basis of the experience the MR has.

C. PATH GENERATOR ALGORITHM
The MR uses Algorithm 3 to generate the path from the start point, q0, to the goal point, qf, under a known, partially known, or unknown environment. Algorithm 3 uses the learning values Qm×n to generate the best path, QG, in the offline and online path planning modes. During the path generation, the environment is verified to see if new obstacles not considered at the beginning were added or dynamic obstacles have changed their position. If the environment changes, the environment information, Oj, is updated.

Algorithm 3 Path Generator
Input: start point q0, goal point qf, environment information Oj, and learning values Qm×n
Output: path QG
1  i ← 0
2  set st ← q0
3  while st ≠ qf and safe do
4      i ← i + 1
5      verification of the environment
6      if environment has changed then
7          update environment information Oj
8      end
9      choose the best at in st by using Qm×n
10     perform action at
11     find out the new state st+1
12     st ← st+1
13     qi ← st
14 end
15 QG ← [q0, q1, . . . , qi]
16 return QG

Algorithm 3 employs the input parameters: the start point, q0, the goal point, qf, the environment information, Oj, and the learning values Qm×n. The objective of the path generator algorithm is to build the path, QG, that is composed of objective points from the start point to the goal point. Therefore, the parameter that Algorithm 3 returns is the array QG = [q0, q1, . . . , qf].

Algorithm 3 in line 1 initializes to zero an index, i, that will be employed by the objective points. In line 2, the start point, q0, is assigned to the current state st. From lines 3 to 14, there is the path generator iterative process to build the path QG. The stop condition is presented when the current state, st, is equal to the goal point, qf, or the Boolean flag safe is False.

In line 4, the index, i, is increased by one. Next, the verification of the environment is performed (line 5). If the environment has changed, then the environment information, Oj, is updated (line 7). In line 9, the best action, at, in the current state, st, is chosen using the learning values, Qm×n. Once the action, at, has been chosen, the action, at, is performed (line 10). Now, in line 11, the new state, st+1, is computed using Eq. 7. Last, in line 12, the new state, st+1, is assigned to the current state st, and the current state is assigned to the objective point qi (line 13). In the end, Algorithm 3 returns a collision-free path between the start point, q0, and the goal point, qf, if the flag safe remains True, and an effective path if the current state, st, is equal to the goal point, qf, in the end.
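For reference, the path-following loop of Algorithm 3 can be rendered as the sketch below. The helpers sense_environment and step (the transition of Eq. (7)) are assumed, and a step budget is added as a safety guard that the pseudocode does not require.

```python
import numpy as np

def generate_path(q0, qf, Q, step, sense_environment, max_steps=10_000):
    """Path generator (Algorithm 3, sketch): follow the learned Q greedily,
    re-checking the environment at every step so new obstacles are honored."""
    path, s, safe = [q0], q0, True                 # lines 1-2
    for _ in range(max_steps):
        if s == qf or not safe:                    # line 3: stop condition
            break
        obstacles = sense_environment()            # lines 5-8: verify / update Oj
        a = int(np.argmax(Q[s]))                   # line 9: best action from Qmxn
        s, safe = step(s, a, obstacles)            # lines 10-12: act and transition
        path.append(s)                             # line 13: record objective point
    return path                                    # lines 15-16: QG = [q0, ..., qi]
```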
V. EXPERIMENTS AND RESULTS
In this section, we describe the experiments, and we present the results of a comparative study of the proposed QAPF learning algorithm versus the CQL algorithm in a set of ten test environments to evaluate the performance of both planning algorithms in terms of path length, path smoothness, learning time, and path planning time in offline mode for known environments and online mode for unknown or partially unknown environments.

A. TEST CONDITIONS
In all the experiments, we considered that the MR configuration is defined as q(x, y, φ, θ), where x and y are the coordinate points, φ = 0.2 is the physical radius of the MR in meters, and θ is the angle of orientation that is always oriented to the next coordinate point to visit. Hence, the initial point is centered over the starting point q0 and oriented to the first coordinate to visit.

Table 1 presents the test environments configuration. The table contains the goal point, qf, that the MR must attain and the test environment layout. Each test environment is configured by n rectangular obstacles Oj(xj, yj, lj, wj), for 1 ≤ j ≤ n, where the left-bottom vertex of each obstacle is placed at the coordinate points (xj, yj) and its length and width are indicated by lj and wj, respectively. These test environments were designed to evaluate the performance and accuracy of the QAPF learning algorithm. The benchmark maps presented in [31] inspired the test environments, and they have been labeled as Map01, Map02, . . . , and Map10.


TABLE 1. Test environments configuration with the goal point and obstacles information.

The test environments Map01 to Map10 cover well-known difficult path planning problems, e.g., path-following prediction problems, problematic areas to reach because the goal is too close to an obstacle, and trap sites due to local minima, among other problems [32], [33]. The test environments described in Table 1 can be graphically observed in Fig. 3, where each test environment is configured with rectangular obstacles in red and a blue dot indicating the goal point qf, as well as its corresponding APF surface aside. These environments present challenging problems for testing path-planning algorithms; thereby we used them in this work to evaluate the QAPF learning algorithm. These test environments represent just a sample of the types of environments that the MR can expect to find in typical real-world scenarios. All the test environments have a physical dimension of 10 × 10 meters and the input coordinates (x, y) are quantized into 161 × 161 = 25,921 states. Hence, each state in the grid has a physical separation of 0.0625 meters.

In this work, all the experiments were carried out on an Intel Core i9 CPU (3.60 GHz) with 16 GB of RAM running the Ubuntu Focal Fossa distribution of Linux with Python 3.7 and OpenCV 4.5. The experiments are composed of a learning phase and a testing phase. To make a fair comparison among the algorithms, we set the same parameters.

For the learning phase, we have the following remarks:

Remark 1: The reward function is defined by Eq. (8), where the reward is r = 100 when the agent arrives at the goal state qf. The reward is r = −1 when the agent collides with obstacles or when it steps into the limits of the workspace, and then in both cases, the agent will be instantly reset to a valid random state. For other steps, the agent receives a reward r = 0.

r = 100 if st = qf and safe = True;  r = −1 if st ≠ qf and safe = False;  r = 0 otherwise   (8)

Remark 2: The learning rate, α ∈ (0, 1), regulates the range of the latest received data that will override the previous data. When α = 0, the MR will learn nothing, whereas the MR will only read the latest received information when α = 1 [5]. When the value of α is relatively small, the Q-value in the function Q(s, a) will eventually converge to the optimal Q-value Q* after the MR has traveled all the states and considered all the possible actions in the given environment [25]. As such, the learning rate, α, is empirically set to 0.3 in this work.

Remark 3: The discount factor, γ ∈ [0, 1), determines the type of reward that the MR receives. When γ = 0, the MR will only consider immediate reward, whereas the MR will consider future reward when γ approaches 1. In that sense, the discount factor, γ, is set to 0.8 in this work, following [5], [34].

Remark 4: The decision rate, ζ ∈ (0, 1], determines the process involved to choose the action at in the state st. These processes include the APF weighting procedure and the explore-exploit approach explained in Section IV. The decision rate, ζ, is empirically set to 0.2 in this work.

Remark 5: The APF weighting employs the total artificial potential field in its process. Therefore, the attractive and repulsive proportional gains, {ka, kr | 0 < ka, kr < 10} [29], are defined as follows. The attractive proportional gain, ka, is empirically set to 0.25, and the repulsive proportional gain, kr, is empirically set to 0.60 in this work.
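Gathered in one place, the learning-phase settings of Remarks 1–5 and the reward of Eq. (8) can be written as follows; the dictionary layout is ours, not part of the paper.

```python
# Learning-phase settings used in this work (Remarks 1-5).
PARAMS = {
    "alpha": 0.3,   # learning rate (Remark 2)
    "gamma": 0.8,   # discount factor (Remark 3)
    "zeta": 0.2,    # decision rate (Remark 4)
    "k_att": 0.25,  # attractive proportional gain (Remark 5)
    "k_rep": 0.60,  # repulsive proportional gain (Remark 5)
}

def reward(at_goal: bool, safe: bool) -> int:
    """Reward function of Eq. (8)."""
    if at_goal and safe:
        return 100          # goal state reached
    if not at_goal and not safe:
        return -1           # collision or workspace limit: reset to a random state
    return 0                # any other step
```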
And for the testing phase, we have the following remarks:

Remark 6: The path length, dist, is defined as the sum of distances between the configuration states from the start point q0 to the goal point qf [35], and it is calculated by Eq. (9).

dist = Σ_{i=0}^{Nconf−1} L(i, i+1)   (9)

where L(i, i+1) = √((xi+1 − xi)² + (yi+1 − yi)²) is the distance between configuration states i = (xi, yi) and i+1 = (xi+1, yi+1), and Nconf is the number of configuration states.

Remark 7: Path smoothness, smooth, aims to measure how snaky the path is, and it is calculated by Eq. (10).

smooth = (1/Nconf) Σ_{i=0}^{Nconf−1} |β(i, i+1)|   (10)

where β(i, i+1) is the angle in every change of direction θ between configuration states i = (xi, yi) and i+1 = (xi+1, yi+1). If the direction θ has no change, β(i, i+1) = 0; otherwise, β(i, i+1) = arctan2((yi+1 − yi), (xi+1 − xi)).
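Both testing-phase metrics translate directly into code; the sketch below takes a path as a list of (x, y) configuration states and follows Eqs. (9) and (10) as stated.

```python
import math

def path_length(path):
    """Eq. (9): sum of Euclidean segment lengths L(i, i+1)."""
    return sum(math.hypot(path[i + 1][0] - path[i][0],
                          path[i + 1][1] - path[i][1])
               for i in range(len(path) - 1))

def path_smoothness(path):
    """Eq. (10): (1/Nconf) * sum of |beta(i, i+1)|, where beta is the segment
    heading arctan2(dy, dx) whenever the direction changes, and 0 otherwise."""
    n_conf = len(path)
    if n_conf < 2:
        return 0.0
    total, prev_heading = 0.0, None
    for i in range(n_conf - 1):
        dx = path[i + 1][0] - path[i][0]
        dy = path[i + 1][1] - path[i][1]
        heading = math.atan2(dy, dx)
        if prev_heading is not None and not math.isclose(heading, prev_heading):
            total += abs(heading)       # direction changed: |beta(i, i+1)|
        prev_heading = heading
    return total / n_conf
```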


FIGURE 3. Test environments with rectangular obstacles in red and a blue dot indicating the goal point; each one is presented in an area of 10 × 10 meters with its corresponding APF surface aside.


TABLE 2. Results in terms of path length (distance in meters) for the learning phase considering a different number of episodes; the best result (lower is better) for each test environment is in bold. For '- - -' the learning algorithm was unable to achieve a feasible result.

FIGURE 4. Learning results comparison. Each map shows the best path QG obtained by the CQL algorithm and the QAPF learning algorithm for 5 × 10⁴, 1 × 10⁵, 5 × 10⁵, 1 × 10⁶, and 2 × 10⁶ episodes, respectively, from left to right.

B. LEARNING PHASE RESULTS
To assess the learning performance of the proposed QAPF learning algorithm versus the CQL algorithm, Table 2 presents the results in terms of path length for a different number of episodes employed during the learning phase in each test environment. For the learning phase, from 5 × 10⁴ to 3 × 10⁶ episodes were employed to reach the best results; after 3 × 10⁶ episodes, no better results or improvement is obtained in any of the test environments employed.

Eq. (9) was employed to compute all the results in Table 2. The results show that the QAPF learning algorithm presents better performance in terms of path length in the learning phase over the CQL algorithm in all the test environments. For example, for the QAPF learning algorithm in Map10, the shortest path length is 9.0078 meters, which is reached with 1 × 10⁶ episodes of training. However, for the CQL algorithm at the same number of episodes, the path length reached is 9.9915 meters. The difference between the results is 0.9837 meters, which represents an important advantage of the QAPF learning algorithm. The advantage of the QAPF learning algorithm can be observed in the results presented in Table 2. Therefore, it can be concluded that the QAPF learning algorithm outperforms the CQL algorithm in most of the stages of learning in terms of path length in the training phase.
Fig. 4 shows a visualization of the learning results presented in Table 2 for the CQL algorithm and the QAPF learning algorithm through the different numbers of episodes in Map10. It can be observed that for 5 × 10⁴ episodes the CQL algorithm is unable to find a path; instead, the QAPF learning algorithm reaches a feasible solution with the same number of episodes. Also, it can be observed that each algorithm improves its solution through the increase in the number of episodes, and the QAPF learning algorithm presents better results for most of the stages of learning.

Fig. 5 shows an overall view of the best results presented in Table 2 for the learning phase in terms of path length, smoothness, and training time. In Fig. 5(a), it can be observed that the QAPF learning algorithm overcomes the CQL algorithm in all the test environments. To obtain the results presented in Fig. 5(b), Eq. (10) was employed. The path smoothness results show a better performance for the QAPF learning algorithm over the CQL algorithm, and for the first two test environments the QAPF learning algorithm shows a wide advantage. Fig. 5(c) presents the training time results for a single episode. The mean and the standard deviation are lower for the QAPF learning algorithm in all the test environments. It can be concluded that the QAPF learning algorithm presents an efficient performance for the learning phase due to the better results presented in all the test environments.

FIGURE 5. Best results (lower is the best) for the learning phase in a single mission in each test environment.

In summary, the learning phase results are shown in Fig. 5. Regarding path length, the CQL algorithm achieves an average of 14.7104 meters and the QAPF learning algorithm achieves an average of 13.8469 meters. Therefore, the difference is 0.8635 meters, reaching a 6.24% improvement with the QAPF learning algorithm. Regarding path smoothness, the CQL algorithm achieves an average of 7.4297 units and the QAPF learning algorithm achieves an average of 2.9360 units. Therefore, the difference is 4.4937 units, yielding a 153.05% improvement with the QAPF learning algorithm. Finally, regarding training time, the CQL algorithm achieves an average of 5.35 ms and the QAPF learning algorithm achieves an average of 3.06 ms for a single episode. Therefore, the difference is 2.29 ms, yielding a 74.84% improvement with the QAPF learning algorithm.

C. TESTING PHASE: OFFLINE PATH PLANNING RESULTS
In offline path planning, the aim is to find a feasible collision-free path between the start point q0 and the goal point qf in a known static environment composed of obstacles. The environments described in Table 1 were used to test the QAPF learning algorithm in offline path planning mode. Fig. 6 shows the best path QG obtained in each test environment from Map01 to Map10. These paths have the shortest path length obtained by the QAPF learning algorithm, as shown in Table 2, where the best results (shortest path) are in bold.

For a quantitative comparison between the proposed QAPF learning algorithm and the CQL algorithm in offline path planning mode, for each test environment, twenty different path planning problems were generated by modifying the start point at random. Fig. 7 shows an example of the offline path planning for twenty different random start points in Map10.

Fig. 8 shows the offline path planning results for twenty different missions (in which each one has a different start point) in terms of path length, path smoothness, and computation time for each test environment. In Fig. 8(a), it can be observed that the QAPF learning algorithm overcomes the CQL algorithm in all the test environments. Also, the path smoothness results (Fig. 8(b)) show a better performance for the QAPF learning algorithm over the CQL algorithm; in this case, the QAPF learning algorithm shows a wide advantage in most of the test environments. Fig. 8(c) presents the path planning computation time results for twenty different missions in each test environment. The mean and the standard deviation are slightly lower for the QAPF learning algorithm in all the test environments. It can be concluded that the QAPF learning algorithm presents an efficient performance for the testing phase in offline path planning due to the better results presented in all the test environments.


FIGURE 6. Path planning results for the different test environments. Each map shows the best path QG obtained by the QAPF learning
algorithm in offline path planning mode.

FIGURE 7. Offline path planning for twenty different random start points
in Map10. The map shows the paths QG obtained by the QAPF learning
algorithm, the different start points in blue, and the goal in green.

In summary, the results for the testing phase in offline mode are shown in Fig. 8. Regarding path length, the CQL algorithm achieves an average of 9.2027 meters and the QAPF learning algorithm achieves an average of 8.5599 meters. Therefore, the difference is 0.6428 meters, reaching a 7.51% improvement with the QAPF learning algorithm. Regarding path smoothness, the CQL algorithm achieves an average of 15.9806 units and the QAPF learning algorithm achieves an average of 5.9242 units. Therefore, the difference is 10.0564 units, yielding a 169.75% improvement with the QAPF learning algorithm. Finally, regarding path planning computation time, the CQL algorithm achieves an average of 6.56 ms and the QAPF learning algorithm achieves an average of 6.28 ms. Therefore, the difference is 0.28 ms, reaching a 4.46% improvement with the QAPF learning algorithm.

FIGURE 8. Offline path planning results (lower is the best) for twenty different missions in each test environment.

D. TESTING PHASE: ONLINE PATH PLANNING RESULTS
In real-world missions, the MR commonly faces unknown or partially known environments. These environment conditions require that the MR can respond and take decisions, a task that is called online path planning. Fig. 9 exemplifies the online path planning experiment considering three new static obstacles. For this experiment, the test environment Map01 is employed, see Fig. 9(a). First, the path planning is performed


FIGURE 9. Online path planning.

FIGURE 10. Online path planning with dynamic obstacles.

in offline mode because we know the environment information described by the original test environment Map01, see Fig. 9(b). The minimum path length found to reach the target position is 7.0418 meters, using the best results, as described in Table 2. At position (6.0800, 6.0000), a new static obstacle is added to change the environment configuration. After a while, when the MR has moved 2.8472 meters, it reaches the position (5.6875, 6.5000). The MR senses the new obstacle; it calculates the obstacle position to update the environment layout map, as shown in Fig. 9(c). The MR path planning algorithm based on the QAPF learning algorithm now has a different environment layout. Therefore, it is necessary to update the path and continue with the movement to reach the goal point.

Next, at position (6.8000, 4.9800), a second new static obstacle is added to change the environment configuration.


After a while, when the MR has moved an additional distance of 1.7393 meters, it touches the position (6.6250, 5.5625). The MR senses the second new obstacle; it calculates the obstacle position to update the environment layout, as shown in Fig. 9(d). Then, at position (5.9800, 4.0000), a third new static obstacle is added to change the environment configuration. After a while, when the MR has moved an additional distance of 2.6339 meters, it touches the position (6.5625, 4.0000). The MR senses the third new obstacle; it calculates the obstacle position to update the environment layout, as shown in Fig. 9(e). Finally, the MR follows the new path to reach the goal point. The complete path to accomplish the goal point is shown in Fig. 9(f). The total path length from the original start point to the goal point is 2.8472 + 1.7393 + 2.6339 + 2.5955 = 9.8159 meters.

Now, we present an experiment where the QAPF learning algorithm is employed for online path planning dealing with new dynamic obstacles. This experiment looks at the case of new dynamic obstacles that can be persons or other mobile robots moving through the environment. Contrary to the case of the unknown static obstacles, here the new obstacle will not remain static; it will be moving with a defined trajectory that is unknown to the MR.

For this experiment, we use the test environment Map03, see Table 1. Fig. 10 shows the online path planning experiment considering unknown dynamic obstacles. First, the path planning is performed in offline mode because we have the environment information described by the original test environment Map03, see Fig. 10(b). The minimum path length found to achieve the goal point is 8.1516 meters, using the QAPF learning algorithm, see Table 2. Now the MR navigation can start. After a while, when the MR has traveled 0.7955 meters, it reaches the position (5.5625, 8.4375). The MR senses a new obstacle (the first mobile robot in red), which is located at (5.9800, 7.9400) at that moment; the MR calculates the obstacle position to update the environment layout map, as shown in Fig. 10(c).

The MR now has a different environment layout. Consequently, it is necessary to update the path to achieve the goal point. After a while, when the MR has traveled 3.4893 meters, it reaches the position (7.1250, 6.4375). The MR senses the second new dynamic obstacle, which is located at (7.1000, 5.7600) at that moment. The MR calculates the obstacle position to update the environment layout, as shown in Fig. 10(d). Then, when the MR has traveled 4.2722 meters, it reaches the position (6.5000, 4.0000). The MR senses a third new dynamic obstacle, located at (5.8800, 3.9600) at that moment. The MR calculates the obstacle position to update the environment layout, as shown in Fig. 10(e). Finally, the MR follows the new path to reach the target position; the complete path is shown in Fig. 10(f). The total path length from the original start point to the goal point is 0.7955 + 3.4893 + 4.2722 + 1.8258 = 10.3828 meters.

Fig. 11 shows the online path planning results in terms of path length, path smoothness, and computation time for twenty different missions in each test environment.

FIGURE 11. Online path planning results (lower is the best) for twenty different missions in each test environment.

In Fig. 11(a), it can be observed that the QAPF learning algorithm overcomes the CQL algorithm in all the test environments. Besides, the path smoothness results (Fig. 11(b)) show a better performance for the QAPF learning algorithm over the CQL algorithm; in this case, the QAPF learning algorithm shows a wide advantage in most of the test environments. Fig. 11(c) presents the path planning computation time results for twenty different missions in each test environment. The mean and the standard deviation are slightly lower for the QAPF learning algorithm in all the test environments. It can be concluded that the QAPF learning algorithm presents an efficient performance for the testing phase in online path planning due to the better results presented in all the test environments.

In summary, the results for the testing phase in online mode are shown in Fig. 11. Regarding path length, the CQL algorithm achieves an average of 9.4827 meters and the QAPF learning algorithm achieves an average of 7.9800 meters. Therefore, the difference is 1.5027 meters, reaching an 18.83% improvement with the QAPF learning algorithm. Regarding path smoothness, the CQL algorithm achieves an average of 24.5259 units and the QAPF learning algorithm achieves an average of 10.3578 units. Therefore,
VI. CONCLUSION
In this work, the proposed QAPF learning algorithm successfully solves path planning problems in offline and online modes for known and unknown environments. The combination of the Q-learning approach and the APF method overcomes the drawbacks of the classical Q-learning approach, such as slow learning speed, high time consumption, and the inability to learn in some known and unknown environments. All the simulation results demonstrate that the proposed QAPF learning algorithm can enhance learning speed and improve path planning in terms of path length and path smoothness.
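As an illustration of how an APF term can partially guide Q-learning, the sketch below biases the exploratory moves of an epsilon-greedy policy toward successor cells of lower artificial potential instead of choosing them uniformly at random. The names and gain values used here (potential, select_action, k_att, k_rep, d0) are hypothetical, and the sketch is not the exact formulation of the QAPF learning algorithm; it only conveys the idea of partially guided exploration on a grid map.

import random

def potential(cell, goal, obstacles, k_att=1.0, k_rep=100.0, d0=2.0):
    # Attractive term grows with the squared distance to the goal; the repulsive
    # term activates within the influence radius d0 of each obstacle (lower is better).
    att = k_att * ((cell[0] - goal[0]) ** 2 + (cell[1] - goal[1]) ** 2)
    rep = 0.0
    for ox, oy in obstacles:
        d = max(((cell[0] - ox) ** 2 + (cell[1] - oy) ** 2) ** 0.5, 1e-6)
        if d < d0:
            rep += 0.5 * k_rep * (1.0 / d - 1.0 / d0) ** 2
    return att + rep

def select_action(q_values, cell, actions, goal, obstacles, epsilon=0.1):
    # Exploit the learned Q-values most of the time; when exploring, prefer the
    # neighboring cell with the lowest artificial potential rather than a uniform
    # random move, so early episodes are drawn toward the goal.
    if random.random() > epsilon:
        return max(actions, key=lambda a: q_values.get((cell, a), 0.0))
    return min(actions, key=lambda a: potential((cell[0] + a[0], cell[1] + a[1]),
                                                goal, obstacles))

With guidance of this kind, early episodes already tend to move along the potential gradient toward the goal, which is consistent with the faster convergence reported for the QAPF learning algorithm.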
For the training phase, the QAPF learning algorithm reached an improvement of 6.24% in path length, 153.05% in path smoothness, and 74.84% in training time over the classical approach. For the testing phase in offline mode, the QAPF learning algorithm reached an improvement of 7.51% in path length, 169.75% in path smoothness, and 4.46% in path planning computation time over the classical approach. Finally, for the testing phase in online mode, the QAPF learning algorithm reached an improvement of 18.83% in path length, 136.79% in path smoothness, and 16.78% in path planning computation time over the classical approach. The APF weighting benefits the QAPF learning algorithm over the classical Q-learning approach, owing to the effectiveness of the APF method in providing smooth and safe planning.

We employed different test scenarios to evaluate the QAPF learning algorithm. The results obtained in the different known and partially unknown environments show that the proposed QAPF learning algorithm efficiently meets the three requirements of the path planning problem: safety, length, and smoothness, which makes it appropriate for finding competitive solutions for MR navigation in complex and real scenarios.

The path planning results demonstrate that the QAPF learning algorithm produces better solutions in all the test environments. Furthermore, these results are achieved with a lower number of episodes in the learning phase when the QAPF learning algorithm is employed. Another advantage of the proposed QAPF learning algorithm is its low variability in training time, which makes it highly reliable for MR path planning.

In this regard, the QAPF learning algorithm could be useful for many MR applications in local and global path planning, including industrial and domestic MRs, self-driving cars, exploration vehicles, unmanned aerial vehicles, and autonomous underwater vehicles.

There are several possible directions to extend this work in the future. Firstly, it would be interesting to focus on multi-agent (multi-MR) path planning in complicated dynamic and uncertain situations. Secondly, other types of reinforcement learning, such as Deep Q-Networks, can be considered; by combining reinforcement learning with deep neural networks, it is possible to address a wider range of path planning problems. Lastly, the QAPF learning algorithm can be extended to work in 3D space, which could be useful for many information-gathering applications (e.g., drones), disaster relief, and exploration.
ULISES OROZCO-ROSAS (Member, IEEE) received the M.S. and Ph.D. degrees in digital systems from the Instituto Politécnico Nacional, Mexico, in 2014 and 2017, respectively. From 2017 to 2018, he held a postdoctoral position with the Department of Computer Science, Universidad Rey Juan Carlos, Spain. He is currently an Associate Professor with the School of Engineering, CETYS Universidad, and a member of the National System of Researchers at CONACYT. His research interests include machine learning, computational intelligence, parallel and heterogeneous computing, autonomous vehicles, and mobile robots.

KENIA PICOS received the Ph.D. degree from the Instituto Politécnico Nacional, Centro de Investigación y Desarrollo de Tecnología Digital, in 2017. She is currently a full-time Professor with the School of Engineering, CETYS Universidad, Tijuana, Mexico. She is a member of the National System of Researchers (Sistema Nacional de Investigadores) at CONACYT. Her current research interests include computer vision, object recognition, three-dimensional object tracking, pose estimation, and parallel computing with graphics processing units.

JUAN J. PANTRIGO received the M.S. degree in fundamental physics from Universidad de Extremadura, in 1998, and the Ph.D. degree from Universidad Rey Juan Carlos, in 2005. He is currently an Associate Professor with Universidad Rey Juan Carlos and a member of the CAPO Research Group, Department of Computer Science. His research interests include high-dimensional space-state tracking problems, computer vision, metaheuristic optimization, machine learning, and hybrid approaches.

ANTONIO S. MONTEMAYOR received the M.S. degree in physics from Universidad Autónoma de Madrid and the Ph.D. degree in computer science from Universidad Rey Juan Carlos, in 2006. He is currently an Associate Professor with Universidad Rey Juan Carlos and a PI of the CAPO Research Group, Department of Computer Science and Statistics, Madrid, Spain. His research interests include computer vision, artificial intelligence, and GPU computing.

ALFREDO CUESTA-INFANTE received the M.S. degree in physics from Universidad Complutense de Madrid and the Ph.D. degree in computer science from Universidad Nacional de Educación a Distancia. He is currently an Associate Professor with Universidad Rey Juan Carlos and a member of the CAPO Research Group, Department of Computer Science. His research interests include reinforcement and imitation learning, computer vision, and synthetic data generation.