Mobile Robot Path Planning Using A QAPF Learning Algorithm For Known and Unknown Environments
ABSTRACT This paper presents the computation of feasible paths for mobile robots in known and unknown
environments using a QAPF learning algorithm. Q-learning is a reinforcement learning algorithm that has
increased in popularity in mobile robot path planning in recent times, due to its self-learning capability
without requiring an a priori model of the environment. Notwithstanding this advantage, Q-learning shows slow
convergence to the optimal solution. To address this limitation, the concept of partially
guided Q-learning is employed, wherein the artificial potential field (APF) method is utilized to improve
the classical Q-learning approach. Therefore, the proposed QAPF learning algorithm for path planning can
enhance learning speed and improve final performance using the combination of Q-learning and the APF
method. Criteria used to measure planning effectiveness include path length, path smoothness, and learning
time. Experiments demonstrate that the QAPF algorithm successfully achieves better learning values that
outperform the classical Q-learning approach in all the test environments presented in terms of the criteria
mentioned above in offline and online path planning modes. The QAPF learning algorithm reached an
improvement of 18.83% in path length for the online mode, an improvement of 169.75% in path smoothness
for the offline mode, and an improvement of 74.84% in training time over the classical approach.
INDEX TERMS Path planning, Q-learning, artificial potential field, reinforcement learning, mobile robots.
environment have already been known before the navigation. Furthermore, path planning can be classified into static and dynamic, depending on the nature of the obstacles. In static path planning, the position and orientation of the obstacles are unchanged with time, while in dynamic path planning, the obstacles are free to move in the environment [4].

The path planning problem has been addressed with diverse methods [5]. One approach is reinforcement learning, because it can be employed for MR path planning in known and unknown environments. Q-learning is a reinforcement learning algorithm that uses a scalar reinforcement signal, or reward, to interact with a complex environment [6]. Q-learning maps situations to actions to maximize a numerical reward through systematic learning. To obtain its own experience, the agent offers a trade-off between exploration and exploitation: it not only has to exploit what it already knows through greedy actions in the current experience, but must also explore what action will work better in the future [7]. The mobile robot (MR), which is known as the agent, receives a reward for a collision-free action and receives a penalty when it collides with the obstacles [5].

In Q-learning, the reward concept is a key difference concerning both supervised and unsupervised solutions. The reward is less informative than it is in supervised learning, where the agent is given the correct actions to perform. Unfortunately, information regarding correct actions is not always available. However, the reward is more informative than in unsupervised learning, where no explicit comments are made regarding performance [8].

Q-learning has been employed in mobile robot path planning due to its self-learning capability without requiring an a priori model of the environment. However, Q-learning presents slow convergence to the optimal solution, notwithstanding such an advantage [9]. To address this limitation, the concept of partially guided Q-learning through APF weighting is employed, wherein the artificial potential field (APF) method is utilized to improve the classical Q-learning approach. The APF method is employed in path planning due to its effectiveness in providing smooth and safe paths. However, it presents disadvantages like the local minima problem or the goal non-reachable with obstacles nearby, among others [10]. Therefore, the proposed QAPF learning algorithm for path planning can enhance learning speed and improve final performance using the combination of Q-learning and the APF method. In this manner, the QAPF learning algorithm overcomes or mitigates the disadvantages presented by both conventional methods.

The main contribution of this paper is the development of a new robust algorithm based on Q-learning and the APF method to solve path planning problems for mobile robots. The proposed QAPF learning algorithm produces short, smooth, and collision-free paths. The proposed QAPF learning algorithm is extensively validated against a path planning algorithm based on the classical Q-learning (CQL) approach. Both planning algorithms are tested in benchmark environments and compared concerning measures such as path length, path smoothness, and computation time. The contributions of this paper can be summarized as follows:
• A reinforcement learning-based algorithm combined with the artificial potential field method, called the QAPF learning algorithm, is proposed to solve mobile robot path planning problems.
• The QAPF learning algorithm includes three operations: exploration, exploitation, and APF weighting, to overcome the limitations presented by the classical Q-learning approach in path planning.
• A set of experiments and studies demonstrate that the proposed QAPF learning algorithm successfully solves diverse path planning problems arising from numerous complex environments that the mobile robot can face.

The organization of this paper is as follows. Section II summarizes the literature review and related work. Section III describes the path planning problem, the Q-learning algorithm, and the APF method. Section IV explains in detail the proposed QAPF learning algorithm. Section V presents the results of simulations that demonstrate the applicability of the proposal. Finally, the conclusions of this paper are drawn in Section VI.

II. RELATED WORK
In the literature, there are several proposals to address path planning problems using reinforcement learning [11], [12]. Some proposals based on reinforcement learning have been combined with other techniques to improve performance. In [13], the combination of reinforcement learning with an improved artificial potential field is presented to solve path planning problems. Another example, now combining reinforcement learning with a metaheuristic, can be found in [14]. In that work, a reinforcement learning-based grey wolf optimizer (RLGWO) algorithm is presented to solve unmanned aerial vehicle (UAV) path planning problems.

Furthermore, there are numerous proposals to address path planning problems using Q-learning. A few of them, which are worth mentioning, are described next. In [15], a modified Q-learning algorithm is presented to solve the path planning problem in product assembly. In that work, the proposed algorithm speeds up convergence by adding a dynamic reward function, optimizes the initial Q table by introducing knowledge and experience through the case-based reasoning algorithm, and prevents entry into the trapped area through the obstacle avoiding method.

Another example is presented by Maoudj and Hentout [16]. They propose an efficient Q-learning (EQL) algorithm to overcome the limitations of slow and weak convergence and ensure an optimal collision-free path in the least possible time. In this regard, in that work a reward function is proposed to initialize the Q table and provide the robot with prior knowledge of the environment, followed by an efficient selection strategy proposal to accelerate the learning process through search space reduction while ensuring a rapid convergence toward an optimized solution.
These works apply deterministic policies, which generate an action directly. For value-based methods, the action with the maximum Q-value over all the possible actions is selected as the best action [24]. Therefore, Q-learning is a value-based method. During learning, the agent performs the action with the highest expected Q-value to estimate the optimal policy [14].

Q-learning [25] is model-free reinforcement learning, which combines theories, including the Bellman equations and the Markov decision process (MDP), with temporal-difference (TD) learning. It is a way for agents to learn how to act optimally in controlled Markovian domains. Its form is one-step Q-learning, which is defined by Eq. (1).

$Q(s, a) = (1 - \alpha)\,Q(s, a) + \alpha\bigl(r + \gamma \max_{a'} Q(s', a')\bigr)$   (1)

where Q(s, a) estimates the action value after applying an action a in state s, α is the learning rate, γ is the discount factor, and r is the immediate reward received [26]. The main components of Q-learning are as follows.
1) Agent: The learner that can interact with the environment via a set of sensors and actuators.
2) Environment: Everything that interacts with an agent, i.e., everything outside the agent.
3) Policy: A mapping from the perceived states set, S, to the actions set, A.
4) Reward function: A mapping from state-action pairs to a scalar number.
5) Q-value: The total amount of reward an agent can expect to accumulate over the future, starting from that state.

Q-learning works by successively improving its evaluations of the quality of actions at particular states. Learning proceeds similarly to Sutton's method of temporal differences: an agent tries an action at a particular state and evaluates its consequences in terms of the immediate reward or penalty it receives and its estimate of the value of the state to which it is taken. By trying all actions in all states repeatedly, it learns which are best overall, judged by long-term discounted reward [25].

In Q-learning, the agent's experience consists of a sequence of distinct stages or episodes. In the nth episode, the agent:
1) Observes its current state st
2) Selects and performs an action at
3) Observes the subsequent state st+1
4) Receives an immediate payoff rt
Then, the agent adjusts its Q-values using a learning rate α, according to Eq. (1); note that this description assumes a look-up table representation for Qm×n(st, at).

A learning agent is composed of two fundamental parts, a learning element and a performance element. The design of a learning element is dictated by what type of performance element is used, which functional component is to be learned, how that functional component is represented, and what kind of feedback is available [8].

C. ARTIFICIAL POTENTIAL FIELD
The main idea of the artificial potential field (APF) method [27] is to establish an attractive potential field around the goal point, as well as a repulsive potential field around the obstacles [28]. By this idea, the APF method employs attractive and repulsive components to draw the MR to its goal while keeping it away from obstacles. Therefore, the total artificial potential field, U(q), includes two terms, the attractive potential function, Uatt(q), and the repulsive potential function, Urep(q). The total artificial potential field, U(q), is then the sum of these two potential functions, as indicated in Eq. (2).

$U(q) = U_{att}(q) + U_{rep}(q)$   (2)

The attractive potential function is described by Eq. (3), where q represents the current MR position. The goal point is represented by qf, and katt is a positive scalar constant that represents the attractive proportional gain of the function.

$U_{att}(q) = \frac{1}{2} k_{att} (q - q_f)^2$   (3)

The repulsive potential function is denoted by Eq. (4), where ρ0 represents the limit distance of influence of the repulsive potential field and ρ is the shortest distance from the MR to the obstacle. The influence of the repulsive potential field, Urep(q), is presented in two cases. The first case is presented when the MR is under the influence of the obstacle, that is, if the distance from the robot to the obstacle, ρ, is less than or equal to the limit distance of influence, ρ0. Otherwise, in the second case, the repulsive potential field will be zero and the MR will be free of the influence of that obstacle. The selection of the distance ρ0 depends on the MR maximum speed. The krep is a positive scalar constant that represents the repulsive proportional gain of the function. Therefore, the repulsive potential function has a limited range of influence, and it prevents the movement of the MR from being affected by a distant obstacle [29].

$U_{rep}(q) = \begin{cases} \frac{1}{2} k_{rep} \left(\frac{1}{\rho} - \frac{1}{\rho_0}\right)^2, & \text{if } \rho \le \rho_0 \\ 0, & \text{if } \rho > \rho_0 \end{cases}$   (4)

The artificial potential field method is widely employed in path planning for mobile robots due to its simplicity, mathematical elegance, and effectiveness in providing smooth and safe planning [10]. Therefore, in this paper, the artificial potential field method is utilized to improve the classical Q-learning approach. In that sense, the proposed QAPF learning algorithm for path planning can enhance learning speed and improve final performance using the combination of Q-learning and the APF method.

IV. QAPF LEARNING ALGORITHM FOR PATH PLANNING
To overcome the disadvantage of slow convergence in classical Q-learning, the QAPF learning algorithm is proposed in this work, which includes three operations: exploration, exploitation, and APF weighting.
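As a concrete reference for the quantities defined above, the following minimal Python sketch implements the one-step Q-learning update of Eq. (1) and the potential functions of Eqs. (2)–(4). The tabular state/action indexing, the helper names, and the default ρ0 are illustrative assumptions rather than the authors' implementation; the default values of α, γ, katt, and krep follow the experimental settings reported in Section V.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.8):
    """One-step Q-learning update, Eq. (1):
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
    return Q

def attractive_potential(q, q_goal, k_att=0.25):
    """Attractive potential, Eq. (3): U_att(q) = 1/2 * k_att * ||q - q_goal||^2."""
    q, q_goal = np.asarray(q, dtype=float), np.asarray(q_goal, dtype=float)
    return 0.5 * k_att * float(np.sum((q - q_goal) ** 2))

def repulsive_potential(rho, rho_0=1.0, k_rep=0.60):
    """Repulsive potential, Eq. (4); rho is the shortest distance from the MR to the
    obstacle and rho_0 the limit distance of influence (placeholder default)."""
    if rho <= rho_0:
        return 0.5 * k_rep * (1.0 / rho - 1.0 / rho_0) ** 2
    return 0.0

def total_potential(q, q_goal, rho, rho_0=1.0):
    """Total artificial potential field, Eq. (2): U(q) = U_att(q) + U_rep(q)."""
    return attractive_potential(q, q_goal) + repulsive_potential(rho, rho_0)
```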
Algorithm 1
Input: goal point qf and environment information Oj
Output: learning values Qm×n
1 initialize Qm×n(st, at) ← {0}
2 for each episode do
3   set st ← a random state from the states set S
4   while st ≠ qf and safe do
5     choose the best at in st by using Qm×n
6     perform action at and receive reward r
7     find out the new state st+1
8     update Qm×n(st, at) using Eq. 1
9     st ← st+1
10  end
11 end
12 return Qm×n

In line 3, a random state is assigned to the current state, st. From lines 4 to 10, the explore-exploit process is performed. The stop condition is presented when the current state, st, is equal to the goal point, qf, or the Boolean flag safe is False. The objective of the flag safe is to serve as a stop condition if a collision is presented with the obstacles or when the MR steps into the limits of the workspace.

In line 5, the best at in st is chosen using Qm×n. In line 6, the action, at, is performed and a reward is received. Next, in line 7, the new state, st+1, is computed. Then, in line 8, the learning value, Qm×n(st, at), is updated using Eq. 1. Last, in line 9, the new state, st+1, is assigned to the current state st. In the end, Algorithm 1 returns the resultant learning values Qm×n to build the path, QG, through Algorithm 3.

Algorithm 2 (lines 3–22)
3   set st ← a random state from the states set S
4   while st ≠ qf and safe do
5     if ζ < uniform random number then
6       compute probability pi using Eq. 5
7       compute APF weighting using Eq. 6
8       choose at in st by using APF weighting
9     else
10      if ζ < uniform random number then
11        choose the best at in st by using Qm×n
12      else
13        choose random at in st
14      end
15    end
16    perform action at and receive reward r
17    find out the new state st+1
18    update Qm×n(st, at) using Eq. 1
19    st ← st+1
20  end
21 end
22 return Qm×n

In line 3, a random state is assigned to the current MR state, st. From lines 4 to 20, there is the decision iterative process to operate on artificial potential field (APF) weighting, exploitation, or exploration. The stop condition for the decision process is presented when the current state, st, is equal to the goal point, qf, or the Boolean flag safe is False. The main objective of the flag safe is to serve as a stop condition if a collision is presented with the obstacles or when the MR steps into the limits of the workspace.
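A minimal Python sketch of one training episode following the structure of Algorithm 2 is shown below. The environment interface (random_state, step, n_actions) and the helper apf_weighted_action are illustrative assumptions; in particular, since Eqs. (5) and (6) are not reproduced in this excerpt, the APF-weighted branch is only a stand-in to be replaced by the paper's weighting procedure. The default ζ, α, and γ follow the values reported in Section V.

```python
import random
import numpy as np

def choose_action(Q, s, zeta, apf_weighted_action, n_actions):
    """Three-way decision of Algorithm 2: APF weighting, exploitation, or exploration."""
    if zeta < random.random():            # line 5: APF weighting branch (Eqs. 5-6 in the paper)
        return apf_weighted_action(s)     # lines 6-8
    if zeta < random.random():            # line 10: exploitation branch
        return int(np.argmax(Q[s]))       # line 11: best action according to Q
    return random.randrange(n_actions)    # line 13: random exploration

def run_episode(Q, env, q_goal, zeta=0.2, alpha=0.3, gamma=0.8):
    """One episode of the QAPF learning loop (Algorithm 2, lines 3-21)."""
    s = env.random_state()                                   # line 3
    safe = True
    while s != q_goal and safe:                              # line 4
        a = choose_action(Q, s, zeta, env.apf_weighted_action, env.n_actions)
        r, s_next, safe = env.step(s, a)                     # lines 16-17: act, get reward and next state
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))  # line 18: Eq. (1)
        s = s_next                                           # line 19
    return Q
```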
effectively trained in a short time and slow down the learning convergence [19]. To overcome this problem, the proposed QAPF learning algorithm employs the concept of partially guided Q-learning through the APF weighting to virtualize the a priori environment into a total artificial potential field and compute the appropriate learning values, Qm×n, and, in consequence, to speed up the convergence.

The convergence is presented when the learning values Qm×n remain unchanged or are under a certain bound set in advance. In that sense, the stop condition in this work for the learning process is presented when the maximum number of episodes, Nep, is reached. Therefore, the convergence of the learning values Qm×n means that the MR can plan a feasible path based on its experience [9]. However, the MR will not necessarily find the global shortest path. It will find a nearly optimal or optimal path in the best of cases. Consequently, the shortest path is defined on the basis of the experience the MR has.

C. PATH GENERATOR ALGORITHM
The MR uses Algorithm 3 to generate the path from the start point, q0, to the goal point, qf, under a known, partially known, or unknown environment. Algorithm 3 uses the learning values Qm×n to generate the best path, QG, in the offline and online path planning modes. During the path generation, the environment is verified to see if new obstacles not considered at the beginning were added or dynamic obstacles have changed their position. If the environment changes, the environment information, Oj, is updated.

Algorithm 3 Path Generator
Input: start point q0, goal point qf, environment information Oj, and learning values Qm×n
Output: path QG
1 i ← 0
2 set st ← q0
3 while st ≠ qf and safe do
4   i ← i + 1
5   verification of the environment
6   if environment has changed then
7     update environment information Oj
8   end
9   choose the best at in st by using Qm×n
10  perform action at
11  find out the new state st+1
12  st ← st+1
13  qi ← st
14 end
15 QG ← [q0, q1, . . . , qi]
16 return QG

Algorithm 3 employs the input parameters: the start point, q0, the goal point, qf, the environment information, Oj, and the learning values Qm×n. The objective of the path generator algorithm is to build the path, QG, that is composed of objective points from the start point to the goal point. Therefore, the parameter that Algorithm 3 returns is the array QG = [q0, q1, . . . , qf].

Algorithm 3 in line 1 initializes to zero an index, i, that will be employed by the objective points. In line 2, the start point, q0, is assigned to the current state st. From lines 3 to 14, there is the path generator iterative process to build the path QG. The stop condition is presented when the current state, st, is equal to the goal point, qf, or the Boolean flag safe is False.

In line 4, the index, i, is increased by one. Next, the verification of the environment is performed (line 5). If the environment has changed, then the environment information, Oj, is updated (line 7). In line 9, the best action, at, in the current state, st, is chosen using the learning values, Qm×n. Once the action, at, has been chosen, the action, at, is performed (line 10). Now, in line 11, the new state, st+1, is computed using Eq. 7. Last, in line 12, the new state, st+1, is assigned to the current state st, and the current state is assigned to the objective point qi (line 13). In the end, Algorithm 3 returns a collision-free path between the start point, q0, and the goal point, qf, if the flag safe remains True, and an effective path if the current state, st, is equal to the goal point, qf, in the end.

V. EXPERIMENTS AND RESULTS
In this section, we describe the experiments, and we present the results of a comparative study of the proposed QAPF learning algorithm versus the CQL algorithm in a set of ten test environments to evaluate the performance of both planning algorithms in terms of path length, path smoothness, learning time, and path planning time, in offline mode for known environments and online mode for unknown or partially unknown environments.

A. TEST CONDITIONS
In all the experiments, we considered that the MR configuration is defined as q(x, y, φ, θ), where x and y are the coordinate points, φ = 0.2 is the physical radius of the MR in meters, and θ is the angle of orientation that is always oriented to the next coordinate point to visit. Hence, the initial point is centered over the starting point q0 and oriented to the first coordinate to visit.

Table 1 presents the test environments configuration. The table contains the goal point, qf, that the MR must attain and the test environment layout. Each test environment is configured by n rectangular obstacles Oj(xj, yj, lj, wj), for 1 ≤ j ≤ n, where the left-bottom vertex of each obstacle is placed at the coordinate points (xj, yj) and its length and width are indicated by lj and wj, respectively. These test environments were designed to evaluate the performance and accuracy of the QAPF learning algorithm. The benchmark maps presented in [31] inspired the test environments, which have been labeled as Map01, Map02, . . . , and Map10.

TABLE 1. Test environments configuration with the goal point and obstacles information.

The test environments Map01 to Map10 cover well-known difficult path planning problems, e.g., path-following prediction problems, problematic areas to reach because the goal
is too close to an obstacle, and trap sites due to local minima, among other problems [32], [33]. The test environments described in Table 1 can be graphically observed in Fig. 3, where each test environment is configured with rectangular obstacles in red and a blue dot indicating the goal point qf, as well as its corresponding APF surface aside. These environments present challenging problems for testing path-planning algorithms; thereby we used them in this work to evaluate the QAPF learning algorithm. These test environments represent just a sample of the types of environments that the MR can expect to find in typical real-world scenarios. All the test environments have a physical dimension of 10 × 10 meters, and the input coordinates (x, y) are quantized into 161 × 161 = 25,921 states. Hence, each state in the grid has a physical separation of 0.0625 meters.

FIGURE 3. Test environments with rectangular obstacles in red and a blue dot indicating the goal point, each one presented in an area of 10 × 10 meters with its corresponding APF surface aside.

In this work, all the experiments were carried out on an Intel Core i9 CPU (3.60 GHz) with 16 GB of RAM running the Ubuntu Focal Fossa distribution of Linux with Python 3.7, and OpenCV 4.5. The experiments are composed of a learning phase and a testing phase. To make a fair comparison among the algorithms, we set the same parameters.

For the learning phase, we have the following remarks:

Remark 1: The reward function is defined by Eq. (8), where the reward is r = 100 when the agent arrives at the goal state qf. The reward is r = −1 when the agent collides with obstacles or when it steps into the limits of the workspace; in both cases, the agent is instantly reset to a valid random state. For other steps, the agent receives a reward r = 0.

$r = \begin{cases} 100, & \text{if } s_t = q_f \text{ and } safe = \text{True} \\ -1, & \text{if } s_t \ne q_f \text{ and } safe = \text{False} \\ 0, & \text{otherwise} \end{cases}$   (8)

Remark 2: The learning rate, α ∈ (0, 1), regulates the range of the latest received data that will override the previous data. When α = 0, the MR will learn nothing, whereas the MR will only read the latest received information when α = 1 [5]. When the value of α is relatively small, the Q-value in the function Q(s, a) will eventually converge to the optimal Q-value, Q*, after the MR has traveled all the states and considered all the possible actions in the given environment [25]. As such, the learning rate, α, is empirically set to 0.3 in this work.

Remark 3: The discount factor, γ ∈ [0, 1), determines the type of reward that the MR receives. When γ = 0, the MR will only consider immediate reward, whereas the MR will consider future reward when γ approaches 1. In that sense, the discount factor, γ, is set to 0.8 in this work, following [5], [34].

Remark 4: The decision rate, ζ ∈ (0, 1], determines the process involved to choose the action at in the state st. These processes include the APF weighting procedure and the explore-exploit approach explained in Section IV. The decision rate, ζ, is empirically set to 0.2 in this work.

Remark 5: The APF weighting employs the total artificial potential field in its process. Therefore, the attractive and repulsive proportional gains, {ka, kr | 0 < ka, kr < 10} [29], are defined as follows. The attractive proportional gain, ka, is empirically set to 0.25, and the repulsive proportional gain, kr, is empirically set to 0.60 in this work.

And for the testing phase, we have the following remarks:

Remark 6: The path length dist is defined as the sum of distances between the configuration states from the start point q0 to the goal point qf [35], and it is calculated by Eq. (9).

$dist = \sum_{i=0}^{N_{conf}-1} L(i, i+1)$   (9)

where $L(i, i+1) = \sqrt{(x_{i+1} - x_i)^2 + (y_{i+1} - y_i)^2}$ is the distance between configuration states i = (xi, yi) and i + 1 = (xi+1, yi+1), and Nconf is the number of configuration states.

Remark 7: Path smoothness, smooth, aims to measure how snaky the path is, and it is calculated by Eq. (10).

$smooth = \frac{1}{N_{conf}} \sum_{i=0}^{N_{conf}-1} |\beta(i, i+1)|$   (10)

where β(i, i+1) is the angle in every change of direction θ between configuration states i = (xi, yi) and i + 1 = (xi+1, yi+1). If the direction θ has no change, β(i, i+1) = 0; otherwise, β(i, i+1) = arctan2((yi+1 − yi), (xi+1 − xi)).

B. LEARNING PHASE RESULTS
To assess the learning performance of the proposed QAPF learning algorithm versus the CQL algorithm, Table 2 presents the results in terms of path length for a different number of episodes employed during the learning phase in each test environment. For the learning phase, from 5 × 10^4 to 3 × 10^6 episodes were employed to reach the best results; after 3 × 10^6 episodes, no better results or improvement is obtained in any of the test environments employed.

TABLE 2. Results in terms of path length (distance in meters) for the learning phase considering a different number of episodes; the best result (lower is better) for each test environment is in bold. For '- - -' the learning algorithm was unable to achieve a feasible result.

FIGURE 4. Learning results comparison. Each map shows the best path QG obtained by the CQL algorithm and the QAPF learning algorithm for 5 × 10^4, 1 × 10^5, 5 × 10^5, 1 × 10^6, and 2 × 10^6 episodes, respectively, from left to right.

Eq. (9) was employed to compute all the results in Table 2. The results show that the QAPF learning algorithm presents better performance in terms of path length in the learning phase over the CQL algorithm in all the test environments. For example, for the QAPF learning algorithm in Map10, the shortest path length is 9.0078 meters, which is reached with 1 × 10^6 episodes of training. However, for the CQL algorithm at the same number of episodes, the path length reached is 9.9915 meters. The difference between the results is 0.9837 meters, which represents an important advantage of the QAPF learning algorithm. The advantage of the QAPF learning algorithm can be observed in the results presented in Table 2. Therefore, it can be concluded that the QAPF learning algorithm outperforms the CQL algorithm in
lower for the QAPF learning algorithm in all the test environ-
ments. It can be concluded that the QAPF learning algorithm
presents an efficient performance for the learning phase due
to better results presented in all the test environments.
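The path-length and smoothness figures compared throughout this section follow Eqs. (9) and (10). A minimal sketch of both metrics is shown below, assuming a path given as a list of (x, y) configuration states; the handling of the first segment in the smoothness metric follows a literal reading of Eq. (10) and may differ slightly from the authors' implementation.

```python
import math

def path_length(path):
    """Eq. (9): sum of Euclidean distances L(i, i+1) between consecutive configuration states."""
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(path, path[1:]))

def path_smoothness(path):
    """Eq. (10): mean of |beta(i, i+1)| over the N_conf configuration states, where beta is the
    segment heading arctan2(dy, dx) when the direction changes and 0 when it stays the same."""
    betas, prev_heading = [], None
    for (x1, y1), (x2, y2) in zip(path, path[1:]):
        heading = math.atan2(y2 - y1, x2 - x1)
        betas.append(0.0 if heading == prev_heading else heading)
        prev_heading = heading
    return sum(abs(b) for b in betas) / len(path)

# Example: an L-shaped path of three states.
print(path_length([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]))      # 2.0 meters
print(path_smoothness([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]))  # (0 + pi/2) / 3
```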
In summary, the learning phase results are shown in Fig. 5. Regarding path length, the CQL algorithm achieves an
average of 14.7104 meters and the QAPF learning algorithm
achieves an average of 13.8469 meters. Therefore, the dif-
ference is 0.8635 meters, reaching a 6.24% of improvement
with the QAPF learning algorithm. Regarding path smooth-
ness, the CQL algorithm achieves an average of 7.4297 units
and the QAPF learning algorithm achieves an average of
2.9360 units. Therefore, the difference is 4.4937 units, yield-
ing a 153.05% of improvement with the QAPF learning
algorithm. Finally, regarding training time, the CQL algo-
rithm achieves an average of 5.35 ms and the QAPF learning
algorithm achieves an average of 3.06 ms for a single episode.
Therefore, the difference is 2.29 ms, yielding a 74.84% of
improvement with the QAPF learning algorithm.
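The percentage improvements quoted above are consistent with a relative improvement computed as (CQL − QAPF) / QAPF × 100; the short check below, which assumes that formula, reproduces the reported learning-phase figures from the averages given in the previous paragraph (small deviations come from rounding of the averages).

```python
def improvement(cql, qapf):
    """Relative improvement of QAPF over CQL, in percent: (CQL - QAPF) / QAPF * 100."""
    return (cql - qapf) / qapf * 100.0

# Learning-phase averages reported above (Fig. 5 summary).
print(f"{improvement(14.7104, 13.8469):.2f} %")  # path length      -> ~6.24 %
print(f"{improvement(7.4297, 2.9360):.2f} %")    # path smoothness  -> ~153.05 %
print(f"{improvement(5.35, 3.06):.2f} %")        # training time    -> ~74.84 %
```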
FIGURE 6. Path planning results for the different test environments. Each map shows the best path QG obtained by the QAPF learning
algorithm in offline path planning mode.
FIGURE 7. Offline path planning for twenty different random start points
in Map10. The map shows the paths QG obtained by the QAPF learning
algorithm, the different start points in blue, and the goal in green.
in offline mode because we know the environment information described by the original test environment Map01, see Fig. 9(b). The minimum path length found to reach the target position is 7.0418 meters, using the best results, as described in Table 2. At position (6.0800, 6.0000), a new static obstacle is added to change the environment configuration. After a while, when the MR has moved 2.8472 meters, it reaches the position (5.6875, 6.5000). The MR senses the new obstacle; it calculates the obstacle position to update the environment layout map, as shown in Fig. 9(c). The MR path planning algorithm based on the QAPF learning algorithm now has a different environment layout. Therefore, it is necessary to update the path and continue with the movement to reach the goal point.

Next, at position (6.8000, 4.9800), a second new static obstacle is added to change the environment configuration.
The difference is 14.1681 units, reaching 136.79% of improvement with the QAPF learning algorithm. Finally, regarding path planning computation time, the CQL algorithm achieves an average of 8.56 ms and the QAPF learning algorithm achieves an average of 7.33 ms. Therefore, the difference is 1.23 ms, reaching a 16.78% of improvement with the QAPF learning algorithm.

VI. CONCLUSION
In this work, the proposed QAPF learning algorithm successfully solves path planning problems in offline and online modes for known and unknown environments. The combination of the Q-learning approach and the APF method can overcome the drawbacks of the classical Q-learning approach, such as slow learning speed, time consumption, and impossible learning in known and unknown environments. All the simulation results demonstrate that the proposed QAPF learning algorithm can enhance learning speed and improve path planning in terms of path length and path smoothness.

For the training phase, the QAPF learning algorithm reached an improvement of 6.24% in path length, 153.05% in path smoothness, and 74.84% in training time over the classical approach. For the testing phase in offline mode, the QAPF learning algorithm reached an improvement of 7.51% in path length, 169.75% in path smoothness, and 4.46% in path planning computation time over the classical approach. For the testing phase in online mode, the QAPF learning algorithm reached an improvement of 18.83% in path length, 136.79% in path smoothness, and 16.78% in path planning computation time over the classical approach. The effects of the APF weighting in the results of the QAPF learning algorithm are beneficial over the classical Q-learning approach due to the advantages presented by the APF method, such as the effectiveness in providing smooth and safe planning.

We employed different test scenarios to evaluate the QAPF learning algorithm. The results obtained in the different known and partially unknown environments show that the proposed QAPF learning algorithm achieves the three requirements efficiently to solve the path planning problem: safety, length, and smoothness, which makes the QAPF learning algorithm appropriate to find competitive results for MR navigation in complex and real scenarios.

The path planning results demonstrate that the QAPF learning algorithm produces better solutions in all the test environments. Furthermore, the path planning results are achieved with a lower number of episodes in the learning phase when the QAPF learning algorithm is employed. Another advantage of the proposed QAPF learning algorithm is low variability in training time, making it highly reliable for MR path planning.

In this regard, the QAPF learning algorithm could be useful for many applications in MRs for local and global path planning, including industrial and domestic MRs, self-driving cars, exploration vehicles, unmanned aerial vehicles, and autonomous underwater vehicles.

There are several possible directions to extend this work in the future. Firstly, it could be motivating to focus on multi-agent (multi MR) path planning in complicated dynamic and uncertain situations. Secondly, other types of reinforcement learning, such as Deep Q-Networks, can be considered; through the combination of reinforcement learning and deep neural networks, it can be possible to solve a wide range of path planning problems. Lastly, the QAPF learning algorithm can be extended to work in a 3D space, which could be useful for many applications for gathering information (i.e., drones), disaster relief, and exploration.

REFERENCES
[1] S. M. La Valle, Planning Algorithms. New York, NY, USA: Cambridge Univ. Press, 2006.
[2] M. A. Contreras-Cruz, V. Ayala-Ramirez, and U. H. Hernandez-Belmonte, "Mobile robot path planning using artificial bee colony and evolutionary programming," Appl. Soft Comput., vol. 30, pp. 319–328, May 2015.
[3] A. A. Saadi, A. Soukane, Y. Meraihi, A. B. Gabis, S. Mirjalili, and A. Ramdane-Cherif, "UAV path planning using optimization approaches: A survey," Arch. Comput. Methods Eng., vol. 2022, pp. 1–52, Apr. 2022.
[4] O. Montiel, U. Orozco-Rosas, and R. Sepúlveda, "Path planning for mobile robots using bacterial potential field for avoiding static and dynamic obstacles," Expert Syst. Appl., vol. 42, no. 12, pp. 5177–5191, 2015.
[5] E. S. Low, P. Ong, and K. C. Cheah, "Solving the optimal path planning of a mobile robot using improved Q-learning," Robot. Auto. Syst., vol. 115, pp. 143–161, May 2019.
[6] L. Chang, L. Shan, C. Jiang, and Y. Dai, "Reinforcement based mobile robot path planning with improved dynamic window approach in unknown environment," Auto. Robots, vol. 45, no. 1, pp. 51–76, Jan. 2021.
[7] Y.-H. Wang, T.-H.-S. Li, and C.-J. Lin, "Backward Q-learning: The combination of Sarsa algorithm and Q-learning," Eng. Appl. Artif. Intell., vol. 26, no. 9, pp. 2184–2193, Oct. 2013.
[8] M. Zolfpour-Arokhlo, A. Selamat, S. Z. M. Hashim, and H. Afkhami, "Modeling of route planning system based on Q value-based dynamic programming with multi-agent reinforcement learning algorithms," Eng. Appl. Artif. Intell., vol. 29, pp. 163–177, Mar. 2014.
[9] M. Zhao, H. Lu, S. Yang, and F. Guo, "The experience-memory Q-learning algorithm for robot path planning in unknown environment," IEEE Access, vol. 8, pp. 47824–47844, 2020.
[10] U. Orozco-Rosas, K. Picos, and O. Montiel, "Hybrid path planning algorithm based on membrane pseudo-bacterial potential field for autonomous mobile robots," IEEE Access, vol. 7, pp. 156787–156803, 2019.
[11] Z. Cui and Y. Wang, "UAV path planning based on multi-layer reinforcement learning technique," IEEE Access, vol. 9, pp. 59486–59497, 2021.
[12] D. L. Cruz and W. Yu, "Path planning of multi-agent systems in unknown environment with neural kernel smoothing and reinforcement learning," Neurocomputing, vol. 233, pp. 34–42, Apr. 2017.
[13] Q. Yao, Z. Zheng, L. Qi, H. Yuan, X. Guo, M. Zhao, Z. Liu, and T. Yang, "Path planning method with improved artificial potential field—A reinforcement learning perspective," IEEE Access, vol. 8, pp. 135513–135523, 2020.
[14] C. Qu, W. Gai, M. Zhong, and J. Zhang, "A novel reinforcement learning based grey wolf optimizer algorithm for unmanned aerial vehicles (UAVs) path planning," Appl. Soft Comput., vol. 89, Apr. 2020, Art. no. 106099.
[15] X. Guo, G. Peng, and Y. Meng, "A modified Q-learning algorithm for robot path planning in a digital twin assembly system," Int. J. Adv. Manuf. Technol., vol. 119, nos. 5–6, pp. 3951–3961, Mar. 2022.
[16] A. Maoudj and A. Hentout, "Optimal path planning approach based on Q-learning algorithm for mobile robots," Appl. Soft Comput., vol. 97, Dec. 2020, Art. no. 106796.
[17] V. B. Ajabshir, M. S. Guzel, and E. Bostanci, "A low-cost Q-learning-based approach to handle continuous space problems for decentralized multi-agent robot navigation in cluttered environments," IEEE Access, vol. 10, pp. 35287–35301, 2022.
[18] V. Bulut, "Optimal path planning method based on epsilon-greedy Q-learning algorithm," J. Brazilian Soc. Mech. Sci. Eng., vol. 44, no. 3, p. 106, Mar. 2022.
[19] T. Gao, C. Li, G. Liu, N. Guo, D. Wang, and Y. Li, "Hybrid path planning algorithm of the mobile agent based on Q-learning," Autom. Control Comput. Sci., vol. 56, no. 2, pp. 130–142, Apr. 2022.
[20] A. K. Sadhu, A. Konar, T. Bhattacharjee, and S. Das, "Synergism of firefly algorithm and Q-learning for robot arm path planning," Swarm Evol. Comput., vol. 43, pp. 50–68, Dec. 2018.
[21] X. Wu, H. Chen, C. Chen, M. Zhong, S. Xie, Y. Guo, and H. Fujita, "The autonomous navigation and obstacle avoidance for USVs with ANOA deep reinforcement learning method," Knowl.-Based Syst., vol. 196, May 2020, Art. no. 105201.
[22] U. Orozco-Rosas, O. Montiel, and R. Sepúlveda, "Pseudo-bacterial potential field based path planner for autonomous mobile robot navigation," Int. J. Adv. Robot. Syst., vol. 12, no. 7, p. 81, 2015.
[23] K. Zhu and T. Zhang, "Deep reinforcement learning based mobile robot navigation: A review," Tsinghua Sci. Technol., vol. 26, no. 5, pp. 674–691, Oct. 2021.
[24] F. Liu, R. Tang, X. Li, W. Zhang, Y. Ye, H. Chen, H. Guo, Y. Zhang, and X. He, "State representation modeling for deep reinforcement learning based recommendation," Knowl.-Based Syst., vol. 205, Oct. 2020, Art. no. 106170.
[25] C. J. C. H. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, nos. 3–4, pp. 279–292, 1992.
[26] A. K. Sadhu and A. Konar, "An efficient computing of correlated equilibrium for cooperative Q-learning-based multi-robot planning," IEEE Trans. Syst., Man, Cybern., Syst., vol. 50, no. 8, pp. 2779–2794, Aug. 2020.
[27] O. Khatib, "Real-time obstacle avoidance for manipulators and mobile robots," in Proc. IEEE Int. Conf. Robot. Autom., vol. 2, Mar. 1985, pp. 500–505.
[28] O. Montiel, R. Sepúlveda, and U. Orozco-Rosas, "Optimal path planning generation for mobile robots using parallel evolutionary artificial potential field," J. Intell. Robot. Syst., vol. 79, no. 2, pp. 237–257, Aug. 2015.
[29] U. Orozco-Rosas, O. Montiel, and R. Sepúlveda, "Mobile robot path planning using membrane evolutionary artificial potential field," Appl. Soft Comput. J., vol. 77, pp. 236–251, Apr. 2019.
[30] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
[31] Motion Planning Maps, Intelligent and Mobile Robotics Group, Department of Cybernetics, Czech Technical University in Prague. Accessed: Jun. 27, 2022. [Online]. Available: http://imr.ciirc.cvut.cz/planning/maps.xml
[32] F. Bayat, S. Najafinia, and M. Aliyari, "Mobile robots path planning: Electrostatic potential field approach," Expert Syst. Appl., vol. 100, pp. 68–78, Jun. 2018.
[33] B. Wang, S. Li, J. Guo, and Q. Chen, "Car-like mobile robot path planning in rough terrain using multi-objective particle swarm optimization algorithm," Neurocomputing, vol. 282, pp. 42–51, Mar. 2018.
[34] L. Khriji, F. Touati, K. Benhmed, and A. Al-Yahmedi, "Mobile robot navigation based on Q-learning technique," Int. J. Adv. Robot. Syst., vol. 8, no. 1, p. 4, Mar. 2011.
[35] X. Wang, G. Zhang, J. Zhao, H. Rong, F. Ipate, and R. Lefticaru, "A modified membrane-inspired algorithm based on particle swarm optimization for mobile robot path planning," Int. J. Comput. Commun. Control, vol. 10, no. 5, pp. 732–745, Oct. 2015.

KENIA PICOS received the Ph.D. degree from the Instituto Politécnico Nacional, Centro de Investigación y Desarrollo de Tecnología Digital, in 2017. She is currently a full-time Professor with the School of Engineering, CETYS Universidad, Tijuana, Mexico. She is a member of the National System of Researchers (Sistema Nacional de Investigadores) at CONACYT. Her current research interests include computer vision, object recognition, three-dimensional object tracking, pose estimation, and parallel computing with graphics processing units.

JUAN J. PANTRIGO received the M.S. degree in fundamental physics from Universidad de Extremadura, in 1998, and the Ph.D. degree from Universidad Rey Juan Carlos, in 2005. He is currently an Associate Professor with Universidad Rey Juan Carlos and a member of the CAPO Research Group, Department of Computer Science. His research interests include high-dimensional space-state tracking problems, computer vision, metaheuristic optimization, machine learning, and hybrid approaches.

ANTONIO S. MONTEMAYOR received the M.S. degree in physics from Universidad Autónoma de Madrid and the Ph.D. degree in computer science from Universidad Rey Juan Carlos, in 2006. He is currently an Associate Professor with Universidad Rey Juan Carlos and a PI of the CAPO Research Group, Department of Computer Science and Statistics, Madrid, Spain. His research interests include computer vision, artificial intelligence, and GPU computing.