Path Planning For Unmanned Surface Vehicle Based On Improved Q-Learning Algorithm
Ocean Engineering
Research paper
ARTICLE INFO

Handling Editor: Prof. A.I. Incecik

Keywords: Q-learning; Unmanned surface vehicle; Path planning; Reinforcement learning; RBF neural network

ABSTRACT

Path planning is a key factor for the unmanned surface vehicle (USV) to achieve efficient navigation. In this paper, to solve the global path planning and obstacle avoidance problems for the USV, an improved Q-Learning algorithm called neural network smoothing and fast Q-Learning (NSFQ) is proposed. The proposed algorithm comprises three main improvements. Firstly, the radial basis function (RBF) neural network is combined with the Q-Learning algorithm to approximate the action value function Q, which improves the convergence speed of the Q-Learning algorithm. Secondly, to ensure that the planned path conforms to the maneuvering characteristics of the USV, the heading angle, motion characteristics, ship length, and safety of the USV are taken into account by the proposed algorithm. Based on these factors, the action space and reward function are optimized, the state space is reconstructed, and the safety threshold is proposed. Finally, a third-order Bezier curve is used to smooth the initial path, so that the USV can maintain its heading stability during navigation. Based on simulation results, the proposed NSFQ algorithm outperforms the A* and RRT algorithms in terms of evaluation indicators such as heading angle, angular velocity, path length, sailing time, and path smoothness.
1. Introduction

With the continuous advancement of artificial intelligence technology, the research and application of unmanned driving fields, such as unmanned vehicles, unmanned aerial vehicles (UAVs), and unmanned surface vehicles (USVs), have undergone tremendous development (Wang et al., 2018). Among them, the research on unmanned vehicles is mature, while the research on USVs is currently relatively limited due to the complexity and uncertainty of the marine environment. USVs are widely used in marine resource exploration (Jin et al., 2018b), marine rescue (Li et al., 2023), marine environmental monitoring (Fu and Khan, 2020), collaborative combat (Owen et al., 2021), and other fields (Cheng et al., 2021; Jin et al., 2018a). In practical applications, path planning is one of the key factors determining whether the USV can achieve effective navigation, and to some extent it indicates the level of intelligence of the USV (Sun et al., 2020).

1.1. Related work

The path planning of the USV refers to planning an obstacle-free path from the initial position to the target position in a marine environment with obstacles based on certain constraints and target requirements (Mac et al., 2016).

Currently, the A* algorithm, the rapidly exploring random tree (RRT) algorithm, the artificial potential field (APF), the particle swarm optimization (PSO) algorithm, the ant colony algorithm (ACA), the genetic algorithm (GA), and other algorithms (Ozturk et al., 2022; Yao et al., 2023) have been widely used in USV path planning. The A* algorithm (Singh et al., 2018) is based on the Dijkstra algorithm (Wang et al., 2019a) with the addition of heuristic functions (Hart et al., 1968). Although the algorithm improves the computational efficiency, the planned path has many turns, low smoothness, and a small safety distance. The RRT algorithm (Zhao et al., 2022) is a path planning algorithm based on random sampling that has strong search capability and fast search speed. However, because the sampling points are randomly generated, the generated path is not smooth enough, has too many unnecessary turns, and is too long. The APF algorithm (Sang et al., 2021) is a dynamic path planning algorithm with good real-time performance and strong adaptability to dynamic environments. It is often used in combination with other
* Corresponding author. College of Intelligent Systems Science and Engineering, Harbin Engineering University, Nantong Street 145, Harbin, 150001, China.
E-mail address: [email protected] (P. Wu).
https://doi.org/10.1016/j.oceaneng.2023.116510
Received 2 September 2023; Received in revised form 18 November 2023; Accepted 1 December 2023
Available online 21 December 2023
0029-8018/© 2023 Elsevier Ltd. All rights reserved.
Y. Wang et al. Ocean Engineering 292 (2024) 116510
algorithms. However, the algorithm has the disadvantage of being prone to falling into local optimal solutions. In addition, more intelligent optimization algorithms have been proposed and applied to path planning, including representative swarm intelligence algorithms such as ACA (Ntakolia and Lyridis, 2022), GA (Xin et al., 2019), and PSO. These swarm intelligence algorithms have advantages in solving complex problems, but they also have long computation time and slow convergence speed, and they easily fall into local minima.

All of the above algorithms have their own advantages and disadvantages, and they all need to assume complete environmental information. However, there is rarely prior knowledge of marine environments (Wang et al., 2018). Reinforcement learning algorithms do not require any human knowledge or default rules (Lan et al., 2022). The Q-Learning algorithm is one of the classic reinforcement learning algorithms and has strong robustness and adaptability to uncertain environments (Watkins and Dayan, 1992). The classical Q-Learning (CQL) algorithm has shortcomings, such as long learning time, low exploration efficiency, and slow convergence speed (Duan and Chen, 2019; Xu and Yuan, 2019). Many studies have improved the classical Q-Learning algorithm (Chen et al., 2019) to address its limitations, and such improvement is also the focus of this paper.

The literature on improving the Q-Learning algorithm can be broadly divided into the following three categories. Firstly, many studies have improved the basic elements of Q-Learning algorithms, such as the reward function (Marthi, 2007), the action space (Bianchi et al., 2008), and the value function Q (Low et al., 2019), to improve the learning and exploration efficiency of the Q-Learning algorithm. However, these improvements did not significantly improve the convergence speed or learning time of the algorithm. Secondly, prior knowledge can be introduced to provide the Q-Learning algorithm with additional information to improve convergence speed, so many scholars have also utilized prior knowledge to improve the Q-Learning algorithm (Hao et al., 2023). Yang et al. (2022) proposed a global path planning algorithm based on the double deep Q network (DDQN), utilizing prior knowledge to select the action space and showing that the generated path has better performance. In addition, neural networks can be incorporated into the Q-Learning algorithm to enhance its performance and efficiency (Wang et al., 2019b). Wang (2021) introduced artificial neural networks into the Q-Learning algorithm, using a back propagation (BP) neural network to approximate the Q function to solve the global path planning problem of automated guided vehicles (AGV).

However, most of the improvements and optimizations proposed by the aforementioned scholars in response to the shortcomings of the Q-Learning algorithm are only applicable to path planning problems for mobile robots, and are not suitable for path planning of the USV. The dynamic characteristics, efficiency, and rules of the ship are not taken into account by the above algorithms. Moreover, the USV needs to achieve autonomous driving and avoid the input of human knowledge as much as possible during path planning. The Q-Learning algorithm does not require any human knowledge or default rules and is characterized by strong robustness and adaptability to uncertain environments. Meanwhile, the paths planned by the above algorithms do not meet the requirements of USV maneuverability and safety, and the planned path length and smoothness are not ideal. Therefore, in this paper, a neural network smoothing and fast Q-Learning (NSFQ) algorithm is proposed to solve the global path planning and obstacle avoidance problems for the USV. The heading angle, angular velocity, length, and other indicators of the USV, as well as its maneuverability and motion characteristics, are taken into account by the proposed algorithm.

1.2. Contributions

The NSFQ algorithm proposed in this paper is applied to solve the global path planning and obstacle avoidance problems for the USV, with the following main contributions.

1) To improve the convergence speed of the Q-Learning algorithm, the radial basis function (RBF) neural network is combined with the Q-Learning algorithm to approximate the action value function Q. Moreover, the heading angle and turning performance of the USV are taken into account by the proposed algorithm, the action space and reward function are optimized, and the state space is reconstructed.
2) In response to the obstacle avoidance problem of the USV, a safety threshold equal to twice the length of the USV is proposed to ensure the safety of the USV. In addition, a third-order Bezier curve is used to smooth the initial path so that the USV can maintain its heading stability during navigation.
3) By comparing with other algorithms (Wang et al., 2019c; Zhao et al., 2022), this paper demonstrates that the NSFQ algorithm outperforms other algorithms in terms of evaluation indicators such as path length, sailing time, heading angle, angular velocity, and path smoothness.

To better introduce the NSFQ algorithm and demonstrate the effectiveness and superiority of the proposed algorithm, the remaining sections of this paper are organized as follows. In Section 2, the model of the USV and the shortcomings of the classical Q-Learning algorithm are outlined. In Section 3, the specific content of the proposed NSFQ algorithm is introduced. In Section 4, the comparative experiments and simulation results of USV path planning in different simulation environments are presented. In Section 5, conclusions are drawn based on the simulation experiments and results.

2. Preliminaries

In this section, the model of the USV is first introduced, then the Q-Learning algorithm is summarized, and some problems of the Q-Learning algorithm in USV path planning are outlined. This prepares for the NSFQ algorithm proposed in Section 3.

2.1. Model of the USV

The horizontal kinematics model of the USV can be represented by Eq. (1).

$$
\begin{cases}
\dot{x} = u\cos\psi - v\sin\psi \\
\dot{y} = u\sin\psi + v\cos\psi \\
\dot{\psi} = r
\end{cases}
\tag{1}
$$

where $x$, $y$ are the position of the USV on the horizontal plane in the northeast coordinate system; $\psi$ is the heading angle of the USV; $u$, $v$ and $r$ are the longitudinal speed, the transverse speed, and the heading angular velocity of the USV in the hull coordinate system, respectively. The kinematic coordinate system of the USV is established as shown in Fig. 1.

2.2. Classical Q-Learning algorithm

The Q-Learning algorithm is a classical reinforcement learning algorithm, which is an off-policy learning method to estimate the state-action value function. Its basic idea is that the USV perceives the current environmental state $s_t$ and evaluates the reward $R_t$ obtained by a possible action $a_t$, aiming to maximize the cumulative reward of the USV in its interaction with the environment.

The reinforcement learning principle can be seen in Fig. 2. $s_t$ is the state of the USV at time $t$, and $a_t$ is the action performed by the USV at time $t$ in the environment. Generally, the action is selected by the ε-greedy strategy, and ε is the exploration factor. The next state $s_{t+1}$ of the environment is obtained by the action $a_t$, meanwhile, the environment
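As an illustration of the classical Q-Learning loop summarized in Section 2.2, the following minimal tabular sketch shows ε-greedy action selection and one value update. It assumes the standard Watkins update rule, $Q(s,a) \leftarrow Q(s,a) + \alpha[R + \gamma \max_{a'} Q(s',a') - Q(s,a)]$; the function names, the dictionary-based Q-table, and the parameter values are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """With probability epsilon explore at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One standard Q-Learning update of Q(s, a); returns the new value."""
    best_next = max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
    return q[(s, a)]


# Hypothetical toy usage: grid cells as states, four actions as in Eq. (4).
q_table = defaultdict(float)   # unseen (state, action) pairs default to 0.0
actions = [1, 2, 3, 4]
q_update(q_table, s=(0, 0), a=1, reward=1.0, s_next=(0, 1), actions=actions)
greedy_action = epsilon_greedy(q_table, (0, 0), actions, epsilon=0.0)
```

With `epsilon=0.0` the selection is purely greedy; larger values trade exploitation for exploration.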
$$\psi_A = [0^\circ,\ 90^\circ,\ 180^\circ,\ -90^\circ] \tag{4}$$

To shorten the path length of the USV while taking its heading angle control and navigation safety into account, the action space of the Q-Learning algorithm is improved. The improved action space adds search behavior in the diagonal directions, so the action space of the USV is increased to 8 discrete actions.

The improved action space of the USV can be defined by Eq. (5):

$$Action = [1, 2, 3, 4, 5, 6, 7, 8] \tag{5}$$

where 1, 2, 3, 4, 5, 6, 7 and 8 respectively represent USV forward, turn left 45°, turn left 90°, turn left 135°, backward, turn right 135°, turn right 90°, and turn right 45°.

At this time, the heading angle of the USV can be obtained by Eq. (6):

$$\psi_A = [0^\circ,\ 45^\circ,\ 90^\circ,\ 135^\circ,\ 180^\circ,\ -135^\circ,\ -90^\circ,\ -45^\circ] \tag{6}$$

The action space before and after the improvement is compared in Fig. 3. The left figure represents the action space of the Q-Learning algorithm, and the right figure represents the improved action space.

Fig. 3. Schematic diagram of action space before and after improvement.

3.1.2. State space

The classical Q-Learning algorithm uses grid-based modeling, so the size of its state space can be determined by the number of grids. However, in this paper, the grid-based modeling method is not employed in the NSFQ algorithm. Therefore, the state space of the algorithm must be reconstructed.

During the navigation of the USV, the longitudinal speed of the USV is $u$, and the transverse speed is $v = 0$. Therefore, if the starting point $(x(0), y(0))$, the ending point $(x(e), y(e))$ and the heading angle $\psi$ at a certain time are known, the position of the USV at any time can be obtained. In this case, the position of the USV follows from Eq. (1) as:

$$
\begin{cases}
x(t+1) = x(t) + \int_{t}^{t+1} u\cos\psi \,\mathrm{d}t \\
y(t+1) = y(t) + \int_{t}^{t+1} u\sin\psi \,\mathrm{d}t \\
r = \dot{\psi}
\end{cases}
\tag{7}
$$

The position coordinates of the USV at each moment can be calculated by Eq. (7). The state space of the Q-Learning algorithm is composed of the position coordinates of the USV during navigation. The heading angle at each moment is taken into account by the NSFQ algorithm, so its state space is composed of the position coordinates and heading angle of the USV at each moment. The state space is defined as:

$$S = [x, y, \psi] \tag{8}$$

where $P = [x, y]$ is the position coordinates of the USV, and $\psi$ is the heading angle of the USV.

3.1.3. The action value function $Q(s_t, a_t)$

The Q-table is used by the classical Q-Learning algorithm to describe the state-action value function $Q(s_t, a_t)$, and to realize the learning and storage of the Q-table (Wei and Jin, 2019). However, the working environment of the USV is relatively complex, and the number of learning parameters increases exponentially with the state dimension. That is, a large amount of memory is occupied by the Q-table, which reduces the computational efficiency and causes the "dimension disaster" (Wang, 2021).

In this paper, the RBF neural network is used instead of the Q-table. It is used to approximate the action value function $Q(s_t, a_t)$ of the Q-Learning algorithm. The RBF neural network has strong function approximation ability, making it faster and more efficient to approximate the real Q function. As a result, the RBF neural network does not suffer from the problem of catastrophic forgetting. The network structure of the RBF-based Q-Learning algorithm is shown in Fig. 4.

The RBF neural network is a three-layer static feed-forward network, including an input layer, a hidden layer, and an output layer.

The first layer is the input layer, which is composed of signal source nodes. Its input is composed of the state variables $(s_1, s_2, \ldots, s_N)$ and an action variable $a$. The number of inputs is $M = N + 1$, and the input vector is:

$$X = [s_1, s_2, \ldots, s_N, a] \tag{9}$$

The second layer is the hidden layer, and $\varphi_k(X)$, $(k = 1, 2, \ldots, M)$ is the basis function. Generally, the M-dimensional Gaussian function is selected, and the expression of the k-th RBF node is given by Eq. (10):

$$\varphi_k(X) = \exp\left( -\sum_{k=1}^{M} \frac{(s_k - \mu_k)^2}{2\sigma_k^2} \right) \tag{10}$$

where $\sigma_k$ is the variance of the Gaussian excitation function of the k-th hidden neuron, and $\mu_k$ is the Gaussian function clustering center of the k-th hidden layer node. Assume that the target point coordinate is $(x_e, y_e)$ and the current position coordinate of the USV is $(x_k, y_k)$. Their expressions can be obtained by Eq. (11) to Eq. (13).

$$Distance = \sqrt{(x_k - x_e)^2 + (y_k - y_e)^2} \tag{11}$$

$$\sigma_k = \frac{Distance_{max}}{\sqrt{2N}} \tag{12}$$

$$\mu_k = \frac{1}{M} \sum_{k=1}^{M} s_k \tag{13}$$

The third layer is the output layer, which reacts to the impact of the input mode. The action value function $Q(s_t, a_t)$ of the Q-Learning algorithm is approximated by only one output node. The expression of the

Fig. 4. Approximation of action value function Q based on RBF neural network.
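To make the three-layer structure of Fig. 4 concrete, the sketch below evaluates a Gaussian hidden layer and a single linear output node. It is only an illustration under stated assumptions: it uses the common vector-centre form $\varphi_k(X) = \exp(-\lVert X - c_k \rVert^2 / (2\sigma_k^2))$ rather than the exact indexing of Eq. (10), and the centres, widths, and weights are hypothetical values, not parameters from the paper.

```python
import math


def rbf_features(x, centers, sigmas):
    """Gaussian hidden-layer activations, one value per hidden node."""
    features = []
    for center, sigma in zip(centers, sigmas):
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        features.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return features


def q_value(x, centers, sigmas, weights):
    """Single output node: weighted sum of the hidden activations."""
    return sum(w * p for w, p in zip(weights, rbf_features(x, centers, sigmas)))


# Hypothetical toy network with three hidden nodes; input X = [x, y, psi, a].
centers = [[0.0, 0.0, 0.0, 1.0],
           [5.0, 5.0, 0.0, 2.0],
           [10.0, 10.0, 0.0, 3.0]]
sigmas = [4.0, 4.0, 4.0]
weights = [0.5, -0.2, 1.0]

phi = rbf_features(centers[0], centers, sigmas)
q = q_value(centers[0], centers, sigmas, weights)
```

An input that coincides with a centre activates that node fully, and the activation decays smoothly with distance, which is what lets a small number of nodes replace a large Q-table.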
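The state transition implied by Eqs. (6) to (8) can be sketched as follows. The sketch assumes that the heading is held constant over one time step, so that the integrals in Eq. (7) reduce to a product, and that each action index selects the corresponding entry of Eq. (6) as the heading in the north-east frame. Both are simplifying assumptions, and the function name `step` and its default parameters are hypothetical.

```python
import math

# Heading angles of the eight discrete actions, in degrees, as in Eq. (6).
HEADINGS_DEG = [0, 45, 90, 135, 180, -135, -90, -45]


def step(state, action, u=1.0, dt=1.0):
    """Advance S = [x, y, psi] by one step for an action in 1..8 (v = 0)."""
    x, y, _ = state
    psi = math.radians(HEADINGS_DEG[action - 1])
    # Constant heading over the step: integral of u*cos(psi) dt = u*cos(psi)*dt.
    return (x + u * math.cos(psi) * dt, y + u * math.sin(psi) * dt, psi)


start = (0.0, 0.0, 0.0)
forward = step(start, action=1)
left_90 = step(start, action=3)
```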
where $R_{Obstacle}$ is the obstacle area, $R_{Safe}$ is the safe area, and $R_{Goal}$ is the target point. $\mu$ is a very small positive number, and its size should be determined by the planned path length. $L_{s \to Goal}$ is the distance from the current position to the target point. It can be seen from Eq. (16) that the reward value of the safe area near the target point is relatively large, while the reward value of the safe area far from the target point is relatively small.

3.2. The safety threshold

To ensure the safety of the USV, a safety threshold ρ is proposed that is twice the length of the USV. The envelope line with an increased safety threshold will be used as the prohibited area to cover obstacles. The prohibited area is the area covered by obstacles, which is the area where the navigation of the USV is prohibited. In marine environments, most obstacles are irregularly shaped, which can greatly complicate the environmental modeling process. To simplify the processing, circular and quadrilateral envelope surfaces are used to envelop these obstacles.

3.2.1. Circular prohibited area

Circular envelope surfaces can be used to simplify the complexity of modeling for relatively small irregular obstacles. Fig. 5 shows the circular prohibited area before and after the safety threshold ρ is introduced. The circular prohibited area without introducing the safety threshold ρ is represented by the solid line circle, while the prohibited area after introducing the safety threshold ρ is represented by the dashed line circle.

Fig. 5. The circular prohibited area with safety threshold ρ.

The circular prohibited area without introducing the safety threshold ρ is expressed by Eq. (17):

$$(x - x_0)^2 + (y - y_0)^2 \le r^2 \tag{17}$$

The circular prohibited area after introducing the safety threshold ρ is expressed by Eq. (20) as follows:

$$R = r + \rho \tag{18}$$

$$\rho = 2 \cdot L_{USV} \tag{19}$$

$$(x - x_0)^2 + (y - y_0)^2 \le R^2 \tag{20}$$

where $r$ is the radius of the solid line circle without the safety threshold ρ, $R$ is the radius of the dashed line circle with the safety threshold ρ, $(x_0, y_0)$ is the center of the prohibited area, and $L_{USV}$ is the length of the USV.

3.2.2. Quadrilateral prohibited area

Quadrilateral envelope surfaces can be used to simplify the complexity of modeling for relatively large irregular obstacles. Fig. 6 shows the quadrilateral prohibited area before and after the safety threshold ρ is introduced. The quadrilateral prohibited area without introducing the safety threshold ρ is represented by the solid line quadrilateral, while the quadrilateral prohibited area after introducing the safety threshold ρ is represented by the dashed line quadrilateral.

Fig. 6. The quadrilateral prohibited area with safety threshold ρ.

The solid line quadrilateral is enclosed by four intersecting lines $l_1$, $l_2$, $l_3$ and $l_4$. The four intersecting lines can be expressed as:

$$\prod_{i=1}^{4} l_i : \left[ y_i(t) - k_i x(t) - b_i \right] = 0 \tag{21}$$

where $i = 1, 2, 3, 4$ is the serial number of the quadrilateral straight line, $k_1$, $k_2$, $k_3$ and $k_4$ are the slopes of the straight lines $l_1$, $l_2$, $l_3$ and $l_4$ respectively, and $b_1$, $b_2$, $b_3$ and $b_4$ are the intercepts of the lines $l_1$, $l_2$, $l_3$ and $l_4$ respectively on the y-axis.

The quadrilateral prohibited area without introducing the safety threshold ρ is expressed by Eq. (22):
$$
\begin{cases}
y(t) - y_1(t) \ge 0 \\
y(t) - y_2(t) \le 0 \\
y(t) - y_3(t) \ge 0 \\
y(t) - y_4(t) \le 0
\end{cases}
\tag{22}
$$

The dashed line quadrilateral introducing the safety threshold ρ is enclosed by four intersecting lines $l_{n1}$, $l_{n2}$, $l_{n3}$ and $l_{n4}$. The four new intersecting lines can be expressed as:

$$\prod_{i=1}^{4} l_{ni} : \left[ y_{ni}(t) - k_i x(t) - b_{ni} \right] = 0 \tag{23}$$

where $b_{n1}$, $b_{n2}$, $b_{n3}$ and $b_{n4}$ are the intercepts of the lines $l_{n1}$, $l_{n2}$, $l_{n3}$ and $l_{n4}$ respectively on the y-axis.

Due to the introduction of the safety threshold ρ, the intercept of each straight line on the y-axis changes by:

$$\Delta b = \rho / \cos(\arctan k_i) \tag{24}$$

The intercepts of the new straight lines $l_{n1}$, $l_{n2}$, $l_{n3}$ and $l_{n4}$ on the y-axis can be expressed by Eq. (25):

$$
\begin{cases}
b_{n1} = b_1 - \Delta b \\
b_{n2} = b_2 + \Delta b \\
b_{n3} = b_3 - \Delta b \\
b_{n4} = b_4 + \Delta b
\end{cases}
\tag{25}
$$

The quadrilateral prohibited area after introducing the safety threshold ρ is expressed by Eq. (26) as follows:

$$
\begin{cases}
y(t) - y_{n1}(t) \ge 0 \\
y(t) - y_{n2}(t) \le 0 \\
y(t) - y_{n3}(t) \ge 0 \\
y(t) - y_{n4}(t) \le 0
\end{cases}
\tag{26}
$$

When the slope of an edge does not exist, the inequality constraint of the edge becomes a constraint on $x(t)$.

Eq. (22) can be transformed into:

$$\prod_{i=1}^{4} \min\left\{ 0,\ \left[ (-1)^i \left( x(t) - x_i(t) \right) \right] \right\} \le 0 \tag{27}$$

Eq. (26) can be transformed into:

$$\prod_{i=1}^{4} \min\left\{ 0,\ \left[ (-1)^i \left( x(t) - x_i(t) - (-1)^i \rho \right) \right] \right\} \le 0 \tag{28}$$

where $\min(\cdot)$ is the minimum function, and $x_i(t)$ is the expression for a line $l_i$ whose slope does not exist.

3.3. Path smoothing

The action space is enhanced by incorporating heading information, increasing the number of discrete actions to 8 for the actual search action of the USV. Due to the limitation of the action space, the planned initial path is non-smooth, with numerous unnecessary turns and broken line connections. Taking into account the heading angle of the USV and its continuity in actual navigation, the Bezier curve is used in this paper to smooth the initial path.

Based on Eq. (29), the expressions for the first-order to third-order Bezier curves are shown in Table 1.

$$
P_i^k =
\begin{cases}
P_i^0, & k = 0 \\
(1 - t)P_i^{k-1} + tP_{i+1}^{k-1}, & k = 1, 2, \ldots, n
\end{cases}
\qquad i = 0, 1, \ldots, n - k
\tag{29}
$$

where $t \in [0, 1]$ is the proportional coefficient, $P_i^0$ is the i-th initial control point, and $P_i^k$ is the i-th k-order control point $(k = 1, 2, \ldots, n)$.

The higher the order of the Bezier curve, the greater the degree of optimization and the smoother the path. However, at the same time, the path is more prone to collision. Taking into account both the safety of the USV and the smoothness requirements of the path, the third-order Bezier curve is used to smooth the initial path in this paper.

The equation for each point on a third-order Bezier curve is defined as follows:

$$B(t) = (1 - t)^3 P_0^0 + 3t(1 - t)^2 P_1^0 + 3t^2(1 - t) P_2^0 + t^3 P_3^0, \quad t \in [0, 1]$$

where $P_0^0$, $P_1^0$, $P_2^0$ and $P_3^0$ represent four consecutive initial control points.

The optimization process of the third-order Bezier curve is shown in Fig. 7, with four control points selected. The initial control points $P_1^0$ and $P_2^0$ can be smoothed by the third-order Bezier curve. The black line is the initial path, and the red curve is the path smoothed by the third-order Bezier curve.

3.4. USV path planning process based on NSFQ algorithm

The flow chart of USV path planning based on the NSFQ algorithm is shown in Fig. 8, and the specific steps are as follows.

Step 1: Generate the obstacle environment. Randomly generated obstacle maps are used in USV path planning. Moreover, the starting and ending coordinates are determined.
Step 2: The envelope line of the safety threshold ρ is introduced as the prohibited area to cover obstacles.
Step 3: Initialization. The relevant parameter values of the algorithm are set, including the learning rate α, discount factor γ, exploration factor ε, maximum number of iterations N, maximum number of exploration steps per iteration, and other related parameter values.
Step 4: The action space of the USV is designed by Eq. (6). The RBF neural network is introduced to approximate the real Q-value, and the reward function R is designed according to the target point and obstacle information by Eq. (16).
Step 5: The initial state $s_{start}$ is composed of the starting point and initial heading of the USV according to Eq. (7).
Step 6: According to the ε-greedy strategy, the specific action $a$ of the USV in state $s_t$ is selected. The corresponding instant reward R for this action and the next state $s_{t+1}$ are obtained, and the position coordinates and heading of the USV in the next state $s_{t+1}$ are acquired.
Step 7: The RBF neural network is used to approximate the Q-value by Eq. (14), and the Q-value is updated by Eq. (2).
Step 8: The current state is transitioned into $s_t \leftarrow s_{t+1}$ when the Q-value update of the current state is completed.
Step 9: The state $s$ is judged. If the state $s$ of the USV is located in the obstacle area or the target area, proceed to step 5 for a new round of learning and increment the iteration number by 1. If it is located in a feasible area, proceed to step 6 to continue learning.
Step 10: Determine whether the maximum number of iterations has been reached. If not, the iterations have not yet ended, so proceed to step 6. If the maximum number of iterations has been reached, proceed to step 11.
Step 11: The optimal path strategy $\pi^*(s) = \arg\max_a Q(s, a)$ is output once the Q-value converges.
Step 12: A third-order Bezier curve is used to smooth the initial path.

Table 1
Calculation formula for Bezier curves.
Control point | First-order | Second-order | Third-order
$P_0^0$ | | |
$P_1^0$ | $P_0^1 = (1 - t)P_0^0 + tP_1^0$ | |
$P_2^0$ | $P_1^1 = (1 - t)P_1^0 + tP_2^0$ | $P_0^2 = (1 - t)P_0^1 + tP_1^1$ |
$P_3^0$ | $P_2^1 = (1 - t)P_2^0 + tP_3^0$ | $P_1^2 = (1 - t)P_1^1 + tP_2^1$ | $P_0^3 = (1 - t)P_0^2 + tP_1^2$
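The smoothing used in Step 12 can be illustrated with a short sketch that evaluates a third-order Bezier curve in two ways: by the closed-form cubic polynomial and by the repeated linear interpolation of Eq. (29) and Table 1. The two should agree for any $t \in [0, 1]$. The control points and function names below are hypothetical examples, and how the paper chooses control points along the initial path is not reproduced here.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Closed-form point on a third-order Bezier curve at parameter t."""
    s = 1.0 - t
    b0, b1, b2, b3 = s ** 3, 3 * t * s ** 2, 3 * t ** 2 * s, t ** 3
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))


def de_casteljau(points, t):
    """Repeated linear interpolation between neighbours, as in Eq. (29)."""
    pts = [tuple(p) for p in points]
    while len(pts) > 1:
        pts = [tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]


def smooth_segment(p0, p1, p2, p3, samples=20):
    """Sample the curve at evenly spaced parameters from t = 0 to t = 1."""
    return [cubic_bezier(p0, p1, p2, p3, i / samples)
            for i in range(samples + 1)]


# Hypothetical control points taken from a broken-line initial path.
control = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
midpoint = cubic_bezier(*control, 0.5)
path = smooth_segment(*control)
```

The curve starts at the first control point and ends at the last one, so consecutive smoothed segments can be joined at shared end points.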
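Step 2 relies on the prohibited areas of Section 3.2. For the circular case, the membership test of Eqs. (17) to (20) can be sketched as below. The function names and the numeric example are hypothetical; only the relations $\rho = 2 L_{USV}$, $R = r + \rho$, and the disc inequality come from the text.

```python
def safety_threshold(usv_length):
    """Safety threshold rho, twice the USV length, as in Eq. (19)."""
    return 2.0 * usv_length


def in_circular_prohibited_area(x, y, x0, y0, r, usv_length):
    """True if (x, y) lies inside the inflated circle of Eq. (20)."""
    big_r = r + safety_threshold(usv_length)   # Eq. (18)
    return (x - x0) ** 2 + (y - y0) ** 2 <= big_r ** 2


# Hypothetical obstacle of radius 5 at the origin and a USV of length 4,
# which gives rho = 8 and an inflated radius R = 13.
near_obstacle = in_circular_prohibited_area(6.0, 0.0, 0.0, 0.0, 5.0, 4.0)
clear_water = in_circular_prohibited_area(13.1, 0.0, 0.0, 0.0, 5.0, 4.0)
```

A point just outside the raw obstacle but within the margin is still rejected, which is the intended effect of the threshold.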
Table 2
The complexity quantification of the simulation environment.
Simulation environment Env.1 Env.2 Env.3 Env.4
4. Simulation results
Table 3
Comparison of evaluation indicators for different algorithms.
Statistics A* RRT CQL NSFQ
Table 4
Comparison of path length performance for different algorithms.
Environment Path length vs A* Path length vs RRT Path length vs CQL
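$$\psi_0 = \arctan\frac{y_{t+1} - y_t}{x_{t+1} - x_t} \tag{33}$$

$$
\psi =
\begin{cases}
\psi_0, & x_{t+1} - x_t > 0 \\
\pi + \psi_0, & x_{t+1} - x_t < 0 \text{ and } y_{t+1} - y_t \ge 0 \\
-\pi + \psi_0, & x_{t+1} - x_t < 0 \text{ and } y_{t+1} - y_t < 0 \\
\pi/2, & x_{t+1} - x_t = 0 \text{ and } y_{t+1} - y_t > 0 \\
-\pi/2, & x_{t+1} - x_t = 0 \text{ and } y_{t+1} - y_t < 0
\end{cases}
\tag{34}
$$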
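The quadrant handling in Eqs. (33) and (34) matches the standard two-argument arctangent. The sketch below implements the piecewise definition directly and compares it with `math.atan2`; the function name is hypothetical, and the behavior for a zero displacement, which the equations leave undefined, is an assumption made here.

```python
import math


def heading_piecewise(dx, dy):
    """Heading angle in radians following the cases of Eqs. (33)-(34)."""
    if dx == 0:
        if dy > 0:
            return math.pi / 2
        if dy < 0:
            return -math.pi / 2
        # Assumption: a zero displacement has no defined heading.
        raise ValueError("zero displacement: heading is undefined")
    psi0 = math.atan(dy / dx)            # Eq. (33)
    if dx > 0:
        return psi0
    return math.pi + psi0 if dy >= 0 else -math.pi + psi0


# Displacements (dx, dy) covering all four quadrants and both axes.
samples = [(1, 1), (-1, 1), (-1, -1), (1, -1), (0, 2), (0, -2), (-1, 0), (3, 0)]
differences = [abs(heading_piecewise(dx, dy) - math.atan2(dy, dx))
               for dx, dy in samples]
```

In practice `math.atan2(dy, dx)` can therefore be used in place of the case analysis.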
This paper considers the dynamic characteristics of the USV and simplifies the USV with length $L_s$ and width $B_s$ to a circular shape with length $L_s$ as the diameter, as shown in Fig. 9. The size of the safety circle needs to be dynamically adjusted for the USV at different speeds. The $\mu$ value is determined by Eq. (40):

$$\mu = 1 + \frac{\sigma u}{u_{max}} \tag{40}$$

where $u$ is the speed of the USV, $u_{max}$ is the maximum speed of the USV, and $\sigma$ is a coefficient, which can be adjusted according to the specific situation.

Table 5
Comparison of heading angle standard deviation for different algorithms.
Environment | Heading angle Std. Deviation vs A* | Heading angle Std. Deviation vs RRT | Heading angle Std. Deviation vs CQL
Env.1 | −38.922 % | 9.610 % | 38.715 %
Env.2 | 13.720 % | 39.828 % | 56.932 %
Env.3 | 35.995 % | 58.014 % | 71.028 %
Env.4 | 36.319 % | 61.370 % | 73.700 %
Mean | 11.778 % | 42.206 % | 60.094 %

Table 6
Comparison of maximum angular velocity r variations for different algorithms.
Environment | Max. r variations vs A* | Max. r variations vs RRT | Max. r variations vs CQL
Env.1 | 41.788 % | 51.470 % | 70.894 %
Env.2 | 78.995 % | 45.780 % | 57.381 %
Env.3 | 91.556 % | 72.215 % | 82.658 %
Env.4 | 85.425 % | 47.883 % | 71.118 %
Mean | 74.441 % | 54.337 % | 70.513 %

Table 8
Comparison of key evaluation indicators for different algorithms.
Evaluation indicators | VS A* | VS RRT
Path Length | 3.466 % | 7.020 %
Sailing time | 3.466 % | 7.020 %
Angle Std. Deviation | 11.778 % | 18.093 %
Max. r variations | 73.490 % | 71.857 %
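The Mean rows of Tables 5 and 6 are consistent with a plain arithmetic mean over the four environments, which the short check below reproduces from the per-environment values. The helper `reduction_percent` shows one plausible way such a percentage could be derived from raw indicator values; that formula is an assumption, since the definition of the reduction is not given in this part of the paper.

```python
# Per-environment reductions in percent (Env.1 to Env.4), from Tables 5 and 6.
TABLE_5 = {
    "A*": [-38.922, 13.720, 35.995, 36.319],
    "RRT": [9.610, 39.828, 58.014, 61.370],
    "CQL": [38.715, 56.932, 71.028, 73.700],
}
TABLE_6 = {
    "A*": [41.788, 78.995, 91.556, 85.425],
    "RRT": [51.470, 45.780, 72.215, 47.883],
    "CQL": [70.894, 57.381, 82.658, 71.118],
}
REPORTED_MEANS = {
    "Table 5": {"A*": 11.778, "RRT": 42.206, "CQL": 60.094},
    "Table 6": {"A*": 74.441, "RRT": 54.337, "CQL": 70.513},
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def reduction_percent(baseline, proposed):
    """Assumed definition: relative reduction of an indicator, in percent."""
    return 100.0 * (baseline - proposed) / baseline


computed_means = {
    "Table 5": {alg: mean(vals) for alg, vals in TABLE_5.items()},
    "Table 6": {alg: mean(vals) for alg, vals in TABLE_6.items()},
}
```

The negative Env.1 entry against A* in Table 5 pulls that column's mean down, which is why the average improvement over A* is much smaller than over RRT or CQL.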
starting point of the USV, and the red hexagon represents the ending point of the USV. The white area is the feasible area; the black area is the obstacle area, which is the prohibited area, simulating obstacles that the USV may encounter in the marine environment, such as islands, rivers, and ports. The dashed envelope area is the prohibited area after considering the safety threshold ρ.

In Fig. 10, the red curve represents the path trajectory of the USV using the NSFQ algorithm, the blue broken line represents the path trajectory of the USV using the RRT algorithm, the green broken line represents the path trajectory of the USV using the A* algorithm, and the purple broken line represents the path trajectory of the USV using the classical Q-Learning (CQL) algorithm. In Fig. 10 (a)~(c), there are more obstacles at the starting point, in Fig. 10 (b), there are more obstacles at the ending point, and in Fig. 10 (d), there are more obstacles at both the starting and ending points, which simulates the situation of the USV entering and leaving ports.

Fig. 10. Path planning of USV using different methods in different simulation environments.

In the four different simulation environments of Fig. 10, the NSFQ algorithm can effectively solve the global path planning and obstacle avoidance problems of the USV. The A* algorithm uses grid-based modeling, so the randomly generated obstacle map needs to be divided into several grids, and the feasible area and prohibited area need to be identified. Therefore, after the simulation environment is converted into a grid map, multiple continuous sharp turns may occur during path planning of the USV, which does not meet the maneuvering characteristics of the USV in actual navigation, as shown in Fig. 11.

The RRT algorithm does not require specific modeling of the simulation environment during path planning, but the feasible path planned by the RRT algorithm is not optimal, and the path length is relatively long. In addition, the path planned by the RRT algorithm is winding and has too many unnecessary turns, which also does not meet the turning performance and maneuvering characteristics of the USV during navigation, as shown in Fig. 10 (a)~(d). Compared with the RRT and A* algorithms, the path planned by the NSFQ algorithm is smoother, with better path planning results. The turning performance, maneuvering characteristics, and heading stability of the USV during navigation and obstacle avoidance are taken into account by the algorithm.

In order to comprehensively verify the performance of the proposed algorithm, the evaluation indicators introduced in Section 4.1 are used to evaluate the algorithm. The evaluation indicators are path length, sailing time, heading angle, and angular velocity, as shown in Table 3.

Combining the data in Table 3 with Fig. 10, it can be concluded that the proposed algorithm outperforms the A* and RRT algorithms in terms of path length, as shown in Table 4. Compared with the A*, RRT and CQL algorithms, the mean path length of the proposed algorithm is reduced by 1.302 %, 26.542 % and 10.009 % in the four different simulation environments, respectively. Because the planned path is shorter, the USV consumes less energy and needs less sailing time.

The variation curves of the heading angle ψ of the USV using different methods in different simulation environments are shown in Fig. 12. The heading angle of the USV using the A* and RRT algorithms changes direction too often and may undergo sudden changes. Due to the high speed and inertia of the USV, its turning is not flexible, and the change of its heading angle should be smooth to maintain the stability of the USV heading. In Fig. 12, the change curve of the USV heading angle using the NSFQ algorithm is relatively smooth, and the heading is stable.

The comparison of the heading angle standard deviation for different algorithms is shown in Table 5. It can be concluded that the proposed algorithm outperforms the A* and RRT algorithms in terms of the heading angle standard deviation. Compared with the A*, RRT and CQL algorithms, the mean heading angle standard deviation of the proposed
Y. Wang et al. Ocean Engineering 292 (2024) 116510
algorithm is reduced by 11.778 %, 42.206 % and 60.094 % in the four different simulation environments, respectively. It is worth noting that the A* algorithm has the smallest heading angle standard deviation in Env.1, where it is better than the NSFQ algorithm. However, as the simulation environments become increasingly complex, the superiority of the NSFQ algorithm becomes more apparent. The NSFQ algorithm has the smallest heading angle standard deviation in Env.2, Env.3 and Env.4.
The variation curves of the angular velocity r of the USV using different methods in different simulation environments are shown in Fig. 13. The magnitude of the angular velocity of the USV represents the rate of change of its heading angle. The angular velocity of the USV should not be too large, as an excessive angular velocity indicates a significant change in its heading angle: the USV makes large rudder turns and has poor heading stability during navigation, which can lead to the risk of capsizing.
The comparison of the maximum angular velocity variations for different algorithms is shown in Table 6. It can be concluded that the proposed algorithm outperforms the A* and RRT algorithms in terms of the maximum angular velocity variation. Compared with the A*, RRT and CQL algorithms, the mean maximum angular velocity variation of the proposed algorithm is reduced by 74.441 %, 54.337 % and 70.513 % in the four different simulation environments, respectively.
However, in practice, the environment is not completely known; it is
partially known or completely unknown. In dynamic and uncertain marine environments, it is difficult or even impossible to obtain information about the various obstacles before path planning (Cheng et al., 2021). This requires the USV to be equipped with sensors (radar, lidar, electro-optical (EO) camera, infrared (IR) camera systems, etc.) to obtain information on the position, heading and speed of the surrounding vessels (Han et al., 2020). In an unknown obstacle environment, static and dynamic obstacles around the USV are detected by sensors, the map is updated in real time, and the planner switches to real-time online planning to avoid the obstacles.
The obstacles on the water surface around the USV are not only static; there are also moving obstacles, such as other sailing ships. It is assumed that the direction and speed of dynamic obstacles are known. The results of the proposed method are shown in Fig. 14.
In summary, the USV can effectively avoid obstacles in both static and dynamic obstacle environments. The NSFQ algorithm proposed in this paper can effectively solve the global path planning and obstacle avoidance problems of the USV. The proposed algorithm is superior to the A* and RRT algorithms in evaluation indicators such as path length, sailing time, heading angle, and angular velocity, which demonstrates the effectiveness and superiority of the NSFQ algorithm.

4.4. Simulation results in marine environment

Finally, in order to verify the feasibility and efficiency of the NSFQ algorithm, a map of a certain sea area is selected as the marine simulation environment for path planning, as shown in Fig. 14 (a). The sea area is dominated by islands, with narrow waterways and complicated paths, and the obstacles are scattered and numerous.
During the path planning process, image binarization is performed on the map to reduce the computational cost of processing the image and the storage requirements, as shown in Fig. 14 (b). Image binarization converts the pixel values in the image into two possible values: 0 (white, representing the feasible area) and 1 (black, representing the prohibited area), which makes it easier for computers to process and parse map data, thereby improving the efficiency of path planning algorithms.
The path planning simulation diagrams using different methods for the USV in the marine simulation environment are shown in Fig. 15. The variation of heading angle and angular velocity of the USV using different methods in the marine environment is shown in Fig. 16 (a). The variation of heading angle and angular velocity of the USV using different methods in the marine environment is shown in Fig. 17. Compared with the A* and RRT algorithms, the NSFQ algorithm has a smoother path and better path planning effectiveness. The specific evaluation indicators are shown in Table 7.
The dynamic characteristics of the USV are influenced by wind, waves, and currents. Wind and waves largely do not need to be considered, because the ship's own controller can keep its heading unaffected. Currents also have a limited impact on the USV, and their effect is discussed in this paper. A flow field simulation model with a known direction parallel to the x-axis and a speed of 1 m/s is used, as shown in Fig. 16 (b).
The comparison of key evaluation indicators for different algorithms is shown in Table 8. It can be concluded that, compared with the A* and RRT algorithms, the path length of the NSFQ algorithm is reduced by 3.466 % and 7.020 % in the marine simulation environment, respectively. The sailing time is reduced by 3.466 % and 7.020 %, respectively. The heading angle standard deviation is reduced by 11.778 % and 18.093 %, respectively. The maximum angular velocity variation is reduced by 73.490 % and 71.857 %, respectively. These results demonstrate the effectiveness and superiority of the NSFQ algorithm.
In the marine simulation environment, the NSFQ algorithm proposed in this paper is superior to the A* and RRT algorithms in evaluation indicators such as path length, sailing time, heading angle, and angular velocity, thereby proving the effectiveness and superiority of the NSFQ
Fig. 14. Path planning of USV using different methods in dynamic simulation environments.
algorithm. However, the NSFQ algorithm still has limitations: its computation time is longer than that of the traditional algorithms.

5. Conclusion

In this paper, an NSFQ algorithm is proposed for the global path planning and obstacle avoidance problems of the USV. In the proposed algorithm, the RBF neural network is used instead of the Q-table, which makes approximating the Q function faster and more efficient. The next position of the USV can be inferred from the current position by using the model of the USV. The initial heading angle of the USV can be obtained from the action space. The position and initial heading angle of the USV at each moment are reconstructed into the state space. Meanwhile, the position and heading angle of the USV are also fundamental factors in the action space and reward function. Moreover, a safety threshold, equal to twice the length of the USV, is introduced to ensure the safety of the USV. A third-order Bezier curve is used to smooth the initial path so that the USV can maintain its heading stability during navigation and obstacle avoidance.
Finally, to verify the effectiveness and superiority of the NSFQ algorithm, different simulation environments are established, which are
Fig. 16. Path planning of the USV using different methods in marine environment.
Fig. 17. The variation of heading angle and angular velocity of the USV.
the randomly generated obstacle maps and the simulation map of the marine environment. The data analysis results show that the path length, sailing time, heading angle, angular velocity, and smoothness obtained by the proposed algorithm are significantly better than those of the A* and RRT algorithms. This work has two limitations. First, the heading speed must be set in advance and cannot be adjusted in real time during navigation, so further research is needed in the future. Second, all obstacles in this paper are known, and further research is needed on how to avoid obstacles in unknown environments using the proposed algorithm.

CRediT authorship contribution statement

Yuanhui Wang: Supervision, Validation, Conceptualization, Formal analysis, Funding acquisition, Resources. Changzhou Lu: Conceptualization, Software, Validation, Writing – original draft, Writing – review & editing. Peng Wu: Supervision, Validation, Project administration. Xiaoyue Zhang: Data curation, Project administration, Validation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

References

Bianchi, R., Ribeiro, C., Costa, A., 2008. Accelerating autonomous learning by using heuristic selection of actions. J. Heuristics 14 (2), 135–168.
Chen, C., Chen, X.Q., Ma, F., Zeng, X.J., Wang, J., 2019. A knowledge-free path planning approach for smart ships based on reinforcement learning. Ocean Eng. 189, 9.
Cheng, C., Sha, Q., He, B., Li, G., 2021. Path planning and obstacle avoidance for AUV: a review. Ocean Eng. 235, 109355.
Duan, J.M., Chen, Q.L., 2019. Prior knowledge based Q-learning path planning algorithm. Electron. Opt. Control 26 (9), 29–33.
Fu, J.J., Khan, F., 2020. Monitoring and modeling of environmental load considering dependence and its impact on the failure probability. Ocean Eng. 199.
Han, J., Cho, Y., Kim, J., Kim, J., Son, N.S., 2020. Autonomous collision detection and avoidance for ARAGON USV: Development and field tests 37 (6), 987–1002.
Hao, B., Du, H., Yan, Z.P., 2023. A path planning approach for unmanned surface vehicles based on dynamic and fast Q-learning. Ocean Eng. 270.
Hart, P.E., Nilsson, N.J., Raphael, B., 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4 (2), 100–107.
Jin, J., Zhang, J., Shao, F., Lyu, Z., 2018a. A novel ocean bathymetry technology based on an unmanned surface vehicle. Acta Oceanol. Sin. 37 (9), 99–106.
Jin, J., Zhang, J., Shao, F., Zhichao, L., Wang, D., 2018b. A novel ocean bathymetry technology based on an unmanned surface vehicle. Acta Oceanol. Sin. 37 (9), 99–106.
Lan, W., Jin, X., Chang, X., Wang, T.L., Zhou, H., Tian, W., Zhou, L.L., 2022. Path planning for underwater gliders in time-varying ocean current using deep reinforcement learning. Ocean Eng. 262.
Li, X.W., Gao, M., Kang, Z., Sun, H.X., Liu, Y.C., Yao, C.Y., Zhang, A.M., 2023. Collaborative search and rescue based on swarm of H-MASSs using consensus theory. Ocean Eng. 278.
Low, E.S., Ong, P., Cheah, K.C., 2019. Solving the optimal path planning of a mobile robot using improved Q-learning. Robot. Autonom. Syst. 115, 143–161.
Mac, T.T., Copot, C., Tran, D.T., Keyser, R.D., 2016. Heuristic approaches in robot path planning: a survey. Robot. Autonom. Syst. 86, 13–28.
Marthi, Bhaskara, 2007. Automatic shaping and decomposition of reward functions. Proceedings of the 24th International Conference on Machine Learning. ACM, USA, pp. 601–608.
Ntakolia, C., Lyridis, D.V., 2022. A comparative study on Ant Colony Optimization algorithm approaches for solving multi-objective path planning problems in case of unmanned surface vehicles. Ocean Eng. 255.
Owen, I., Lee, R., Wall, A., Fernandez, N., 2021. The NATO generic destroyer a shared geometry for collaborative research into modelling and simulation of shipboard helicopter launch and recovery. Ocean Eng. 228.
Ozturk, U., Akda, M., Ayabakan, T., 2022. A review of path planning algorithms in maritime autonomous surface ships: navigation safety perspective. Ocean Eng. 251.
Sang, H.Q., You, Y.S., Sun, X.J., Zhou, Y., Liu, F., 2021. The hybrid path planning algorithm based on improved A* and artificial potential field for unmanned surface vehicle formations. Ocean Eng. 223.
Singh, Y., Sharma, S., Sutton, R., Hatton, D., Khan, A., 2018. A constrained A* approach towards optimal path planning for an unmanned surface vehicle in a maritime environment containing dynamic obstacles and ocean currents. Ocean Eng. 169, 187–201.
Sun, Y.S., Wang, L.F., Wu, J., Ran, X.R., 2020. A general overview of path planning methods for autonomous underwater vehicle. Ship Science and Technology 42 (7), 1–7.
Wang, C.B., Zhang, X.Y., Zou, Z.Q., Wang, S.B., 2018. On path planning of unmanned ship based on Q-learning. Ship & Ocean Engineering 47 (5), 168–171.
Wang, D.X., 2021. AGV path planning based on improved Q-learning algorithm. Electronic Design Engineering 29 (4), 7–10+15.
Wang, H., Mao, W., Eriksson, L., 2019a. A Three-Dimensional Dijkstra’s algorithm for multi-objective ship voyage optimization. Ocean Eng. 186.
Wang, J., Zhang, P.L., Zhao, Z.Y., Cheng, X.P., 2019b. Path planning method based on neural network and Q(λ)-learning. Automation and Instrumentation 34 (9), 1–4.
Wang, Z.Y., Zeng, G.H., Huang, B., Fang, Z.J., 2019c. Global optimal path planning for robots with improved A* algorithm. J. Comput. Appl. 39 (9), 2517–2522.
Watkins, C., Dayan, P., 1992. Q-learning. Mach. Learn. 8 (3–4), 279–292.
Wei, Y.L., Jin, W.Y., 2019. Intelligent Vehicle Path Planning Based on Neural Network Q-learning Algorithm. Fire Contr. Command Contr. 44 (02), 46–49.
Xin, J.F., Zhong, J.B., Yang, F.R., Cui, Y., Sheng, J.L., 2019. An improved genetic algorithm for path-planning of unmanned surface vehicle. Sensors 19 (11).
Xu, X.S., Yuan, J., 2019. Path planning for mobile robot based on improved reinforcement learning algorithm. Journal of Chinese Inertial Technology 27 (3), 314–320.
Yang, X.F., Shi, Y.L., Liu, W., Ye, H., Zhong, W.B., Rong, X.Z., 2022. Global path planning algorithm based on double DQN for multi-tasks amphibious unmanned surface vehicle. Ocean Eng. 266.
Yao, P., Lou, Y.T., Zhang, K.M., 2023. Multi-USV cooperative path planning by window update based self-organizing map and spectral clustering. Ocean Eng. 275.
Zhao, C., Zhu, Y.F., Du, Y.C., Liao, F.X., Chan, C.Y., 2022. A novel direct trajectory planning approach based on generative adversarial networks and rapidly-exploring random tree. IEEE Trans. Intell. Transport. Syst. 23 (10), 17910–17921.