Zhou Et Al. - 2022 - A Review of Motion Planning Algorithms For Intelli
Zhou Et Al. - 2022 - A Review of Motion Planning Algorithms For Intelli
https://doi.org/10.1007/s10845-021-01867-z
Received: 11 February 2021 / Accepted: 12 October 2021 / Published online: 25 November 2021
© The Author(s) 2021
Abstract
Principles of typical motion planning algorithms are investigated and analyzed in this paper. These algorithms include
traditional planning algorithms, classical machine learning algorithms, optimal value reinforcement learning, and policy
gradient reinforcement learning. Traditional planning algorithms investigated include graph search algorithms, sampling-
based algorithms, interpolating curve algorithms, and reaction-based algorithms. Classical machine learning algorithms
include multiclass support vector machine, long short-term memory, Monte-Carlo tree search and convolutional neural
network. Optimal value reinforcement learning algorithms include Q learning, deep Q-learning network, double deep Q-
learning network, dueling deep Q-learning network. Policy gradient algorithms include policy gradient method, actor-critic
algorithm, asynchronous advantage actor-critic, advantage actor-critic, deterministic policy gradient, deep deterministic
policy gradient, trust region policy optimization and proximal policy optimization. New general criteria are also introduced
to evaluate the performance and application of motion planning algorithms by analytical comparisons. The convergence
speed and stability of optimal value and policy gradient algorithms are specially analyzed. Future directions are presented
analytically according to principles and analytical comparisons of motion planning algorithms. This paper provides researchers
with a clear and comprehensive understanding about advantages, disadvantages, relationships, and future of motion planning
algorithms in robots, and paves ways for better motion planning algorithms in academia, engineering, and manufacturing.
Keywords Motion planning · Path planning · Intelligent robots · Reinforcement learning · Deep learning
123
388 Journal of Intelligent Manufacturing (2022) 33:387–424
Fig. 1 Three types of robotic platform. The first and second figures represents a differential-wheel chassis. The third and fourth figures rep-
represent wheel-based chassis (Minguez et al., 2008). The first figure resent four-leg dog “SpotMini” from Boston Dynamic and the robotic
represents an Ackerman-type (car-like) chassis, while the second figure arm (Meyes et al., 2017)
Baidu had been tested successfully in highways of Beijing yaw solely without changing their position (x, y). Robots with
in 2017 (Fan et al., 2018), and man-manipulated buses had differential wheels are also sensitive to the speed difference
already been replaced by autonomous buses from Huawei of two front wheels. The sensitivity depends on the rate of
in some specific areas of Shenzhen. Other companies in tra- the gearing steer mechanism that yields the speed reduction
ditional vehicle manufacturing, like Audi and Toyota, also and angular moment rotation. It means it is flexible to move
have their own experimental autonomous vehicles. Among in low-speed indoor scenarios but very dangerous to move in
research institutes and universities, Navlab (navigation lab) high-speed situations if something wrong in the speed con-
of Carnegie Mellon, Oxford University and MIT are leading trol of two front wheels, because little speed changes of two
research institutes. Up to 2020, European countries like Bel- front wheels in differential chassis can be exaggerated and
gium, France, Italy, and UK are planning to operate transport accident follows.
systems for autonomous vehicles. Twenty-nine US states had It is popular to use legs in the chassis of robots in recent
passed laws in permitting autonomous vehicles. Autonomous years. Typical examples are human-like and animal-like
vehicle is therefore expected to widely spread in near future (dog-like, Fig. 1) robots from Boston Dynamic. The robotic
with improvement of traffic laws. arm (Fig. 1) is also a popular platform to deploy motion
planning algorithms. In summary, wheels, arms, and legs are
Motion planning and robotic platform Robots use
choices of chassis to implement motion planning algorithms
motion planning algorithms to plan their trajectories both
which are widely used in academic and industrial scenar-
at global and local level. Human-like and dog-like robots
ios including commercial autonomous driving, service robot,
from Boston Dynamic and autonomous robotic car from MIT
surgery robot and industrial arms.
(Everett et al., 2018) are good examples. All of them lever-
age motion planning algorithms to enable robots to freely Architecture of robots Classical hierarchical robotic
walk in dense and dynamic scenarios both indoor and out- architecture (Meystel, 1990) in Fig. 2a is composed by three
door. Chassis of robots has two types of wheels, including stages: sense, plan and act (Murphy, 2000). Robots with
Ackerman-type wheel and differential wheel (Fig. 1). this architecture can be successfully used in simple appli-
In Ackerman-type robots, two front wheels steer the robot, cations. It can generate long-term action plans, however,
while two rear wheels drive the robot. The Ackerman-type researchers are unsatisfied with the slow speed of this archi-
chassis has two servos. Two front wheels share a same servo, tecture in the update of world model and navigation plan,
and it means these two wheels can steer with a same steering when coping with the environment with uncertainty. Reactive
angle or range ϕ (Fig. 1). Two rear wheels share another servo architecture (Brooks, 1986) in Fig. 2b, therefore, is intro-
to control the speed of robots. The robot using differential duced to cope with uncertain scenarios. Reactive architecture
wheel, however, is completely different with Ackerman-type is designed to output instant response by the sense-act struc-
robot in functions of servo. The chassis with differential ture (Murphy, 2000). Reactive strategies or algorithms (e.g.,
wheels generally has two servos, and each wheel is controlled potential fields) originate from the intuitive response of ani-
by one servo for forwarding. Steering is realized by giving mals, and they are computationally inexpensive. However,
different speeds to each wheel. Steering range in Ackerman- the robot based on reactive architecture is short-sighted. It
type robots is limited because two front wheels steer with a cannot generate long-term plans to fulfill challenging tasks.
same angle ϕ. The Ackerman-type wheel is therefore suit- Hybrid deliberative/reactive architecture in Fig. 2c fuses
able to be used in high-speed outdoor scenarios because of advantages of hierarchical and reactive architectures, and
stability. Robots with differential wheels, however, can steer it is also successfully used in autonomous robots (Arkin
in an angle ∈ (0, 2π ], and it means robots can change their et al., 1987; Murphy, 2000). Hybrid deliberative/reactive
123
Journal of Intelligent Manufacturing (2022) 33:387–424 389
123
390 Journal of Intelligent Manufacturing (2022) 33:387–424
Cartographer
Planner Mission Planner
Mission planner Sequencer
Carto- Performance Monitoring Agent
grapher Navigator
Pilot
Homeostatic
Control
Deliberative Layer
Reactive Layer
Motor
Sensor
Motor schema manager Behavioral
Manager
ps2 ms2
Σ Actuators
ps3 ms3
(c) Hybrid deliberative/reactive architecture for autonomous robots (Arkin et al., 1987; Murphy, 2000)
Trainer
robots Interactions
Autonomous
robotic Goal info
World model Parser/
architecture Environment
(networks)
Navigator
Actuators
Sensors info
Feedbacks (rewards)
Abstract
Feature Environment Environment Time-sequential Decision
functional
extraction perception understanding navigation execution
modules
Fig. 2 continued
other hand, motion planning should achieve long-term opti- Classification of planning algorithms Robotic planning
mal planning goals as path planning when robots interact algorithms can be divided into two categories: traditional
with the environment. algorithms and ML-based algorithms according to their prin-
ciples and the era they were invented. Traditional algorithms
are composed by four groups including graph search algo-
123
Journal of Intelligent Manufacturing (2022) 33:387–424 391
Fig. 3 Path planning and motion planning. The left figure denotes a mover’s problem that not only consider planning a path from global
planned path based on shortest distance and time, and path is gener- level, but also consider kinetics features, speeds and poses of the piano
ated from high or global level. The right figure denotes famous piano
rithms (e.g., A*), sampling-based algorithms like rapidly- convergence). A breakthrough was made when Google Deep-
exploring random tree (RRT), interpolating curve algorithms Mind introduced nature DQN (Mnih et al., 2013, 2015),
(e.g., line and circle), and reaction-based algorithms (e.g., in which reply buffer is to reuse old data to improve the
DWA). ML based planning algorithms include classical ML efficiency. Performance in robustness, however, is limited
algorithms like support vector machine (SVM), optimal because of noise that impacts the estimation of state-action
value RL like deep Q-learning network (DQN) and policy value (Q value). Double DQN (Hasselt et al., 2016; Sui et al.,
gradient RL (e.g., actor-critic algorithm). Categories of plan- 2018) and dueling DQN (Wang et al., 2015) are therefore
ning algorithms are summarized in Fig. 4. invented to cope with problems caused by noise. Double
DQN utilizes another network to evaluate the estimation of
Development of ML-based algorithms Classical ML,
Q value in DQN to reduce noise, while advantage value (A
like SVM, are used to implement simple motion planning at
value) is utilized in dueling DQN to obtain better Q value, and
an earlier stage, but its performance is poor because SVM
noise is mostly reduced. The Q learning, DQN, double DQN
is short-sighted for its one-step prediction. It requires well-
and dueling DQN are all based on optimal values (Q value
prepared vector as inputs that cannot fully represent features
and A value) to select optimal time-sequential actions. These
of image-based dataset. Significant improvement to extract
algorithms are therefore called optimal value algorithms.
high-level features from images were made after the inven-
Implementation of optimal value algorithms, however, is
tion of convolutional neural network (CNN) (Lecun et al.,
computationally expensive.
1998). CNN is widely used in many image-related tasks
Optimal value algorithms are latter replaced by policy
including motion planning, but it cannot cope with com-
gradient method (Sutton et al., 1999), in which gradi-
plex time-sequential motion planning problems. These better
ent approach (Zhang, 2019) is directly utilized to upgrade
suit Markov chain (Chan et al., 2012) and long short-term
policy that is used to generate optimal actions. Policy gra-
memory (LSTM) (Inoue et al., 2019). Neural networks are
dient method is more stable in network convergence, but
then combined with LSTM or algorithms that are based on
it lacks efficiency in speed of network convergence. Actor-
Markov chain (e.g., Q learning (Smart & Kaelbling, 2002))
critic algorithm ((Cormen et al., 2009; Konda & Tsitsiklis,
to implement time-sequential motion planning. However,
2001)) improves the speed of convergence by the actor-critic
the efficiency is limited (e.g., poor performance in network
architecture. However, improvement in convergence speed is
123
392 Journal of Intelligent Manufacturing (2022) 33:387–424
123
Journal of Intelligent Manufacturing (2022) 33:387–424 393
(a)
(b)
Fig. 6 Steps of the Dijkstra algorithm (a) and road networks in web composed by nodes and edges, therefore graph search algorithms like
maps (b) (Indrajaya et al., 2015; Mariescu & Franti., 2018). Web maps A* and Dijkstra’s algorithms can be used in these graphs
are based on GPS data. Road network is mapped into the graph that is
search algorithms are composed by many algorithms. The nearest neighbor j of the node I; and estimate the distance
most popular are Dijkstra’s algorithm (Dijkstra, 1959) and of nodesj and i; (3) estimate the distance between the node j
A* algorithm (Hart et al., 1968). and the goal node. The overall estimated cost is the sum of
these three factors:
Dijkstra’s algorithm is one of earliest optimal algorithms
based on best-first search technique to find the shortest paths
Ci cstar t,i + min j di, j + d j,goal (1)
among nodes in a graph. Finding the shortest paths in a road
network is a typical example. Steps of the Dijkstra algorithm
where Ci represents overall estimated cost of node i, cstar t,i
(Fig. 6) include: (1) converting the road network to a graph,
the estimated cost from the origin to the node i, di, j the
and distances between nodes in the graph are expected to be
estimated distance from the node i to its nearest node j, and
found by exploration; (2) picking the unvisited node with the
d j,goal the estimated distance from the node j to the node
lowest distance from the source node; (3) calculating the dis-
of goal. A* algorithm has a long history in path planning in
tance from the picked node to each unvisited neighbor and
robots. A common application of the A* algorithm is mobile
update the distance of all neighbor nodes if the distance to
rovers planning via an occupancy grid map (Fig. 7) using the
the picked node is smaller than the previous distance; (4)
Euclidean distance (Wang, 2005). There are many variants
marking the visited node when the calculation of distance to
of A* algorithm, like dynamic A* and dynamic D* (Stentz,
all neighbors is done. Previous steps repeat until the shortest
1994), Field D* (Ferguson & Stentz, 2006), Theta* (Daniel
distance between origin and destination is found. Dijkstra’s
et al., 2014), Anytime Repairing A* (ARA*) and Anytime
algorithm can be divided into two versions: forward ver-
D* (Likhachev et al., 2008), hybrid A* (Montemerlo et al.,
sion and backward version. Calculation of overall cost in
2008), and AD* (Ferguson et al., 2008). Other graph search
the backward version, called cost-to-come, is accomplished
algorithms have a difference with common robotic grid map.
by estimating the minimum distance from selected node to
For example, the state lattice algorithm (Ziegler & Stiller,
destination, while estimation of overall cost in the forward
2009) uses one type of grid map with a specific shape (Fig. 7),
version, called cost-to-go, is realized by estimating the mini-
while the grid in normal robotic map is in a square-grid shape
mum distance from selected node to the initial node. In most
(Fig. 7).
cases, nodes are expanded according to the cost-to-go.
A* algorithm is based on the best-first search, and it uti- Sampling-based algorithms
lizes heuristic function to find the shortest path by estimating
the overall cost. The algorithm is different from the Dijkstra’s Sampling-based algorithms randomly sample a fixed
algorithm in the estimation of the path cost. The cost estima- workspace to generate sub-optimal paths. The RRT and the
tion of a node i in a graph by A* is as follows: (1) estimate probabilistic roadmap method (PRM) are two algorithms
the distance between the initial node and node i; (2) find the that are commonly utilized in motion planning. The RRT
algorithm is more popular and widely used for commercial
123
394 Journal of Intelligent Manufacturing (2022) 33:387–424
Fig. 7 The left figure represents a specific grid map in the State Lattice algorithm (Ziegler & Stiller, 2009), while the right figure represents a normal
square-grid (occupancy grid) map in the robot operating system (ROS)
Fig. 9 Trajectories generated by mathematical rules (Bautista et al., 2014; Farouki & Sakkalis, 1994; Funke et al., 2012; Reeds & Shepp, 1990; Xu
et al., 2012)
123
Journal of Intelligent Manufacturing (2022) 33:387–424 395
Fig. 10 Different types of potential filed. (a–e) denote five primitive potential fields: uniform, perpendicular, attraction, repulsion, and tangen-
tial. (f) denotes a potential field combined by attraction (goal) and repulsion (obstacle) (Murphy, 2000)
PFM (Khatib, 1986) is about using vectors to represent maneuver (velocity) is selected from RAV to avoid static and
behaviors and using vector summation to combine vectors moving obstacles (Fiorini & Shiller, 1998). To compute a
from different behaviors to produce an emergent behavior RAV (Fig. 11): (1) Velocity obstacle (VO) must be obtained.
(Murphy, 2000). Potential field is a differentiable real-valued VO is a velocity set or space, and the selection of velocity
function U whose value can be seen as energy, and its gradient from VO will lead to collision. (2) A set of reachable veloc-
can be seen as a force. If potential field function U is defined ities (RV) should be obtained. This is achieved by mapping
artificially, it is called artificial potential field (APF). Its gra- the actuator constraints to acceleration constraints (Fiorini &
dient ∇U (x), where x denotes a robot configuration (e.g., Shiller, 1998). (3) RAV is obtained by computing the differ-
positions of robots), is a vector which points at a local direc- ence between RV and VO.
tion that maximally increases U (Tobaruela, 2012). Hence, To select a proper avoidance maneuver (Fig. 11), exhaus-
robots in potential field or combined potential field (Fig. 10) tive global search method and heuristic search method in
will be forced to move along the gradient of potential field RAV are suitable for off-line and on-line cases, respectively:
to maximize U . (1) A search tree can be obtained by expanding the tree on
Shortcomings of PFM include: (1) local minima if poten- RAV. A proper avoidance maneuver can be selected from
tial field converges to a minimum that is not global minimum. the search tree according to assigned cost on the branch of
(2) oscillation of motion when robots navigate among very search tree. Cost is relevant with some objective functions
close obstacles at high speed. (3) impossibility to go through (e.g., distance traveled, motion time and energy). The search
small openings. These shortcomings can be solved or par- tree is expanded off-line, therefore near-optimal trajectories
tially solved by potential field variants (e.g., generalized that lead to shortest time or distance can be obtained (Fiorini
potential fields method (GPFM) (Krogh, 1984), virtual force & Shiller, 1998). (2) The heuristic search costs less time on
field (VFF) (Borenstein & Koren, 1989), vector field his- search process, and it is designed to select specific veloci-
togram (VFH) (Borenstein & Koren, 1991) and harmonic ties that can realize special goals (e.g., the highest avoidance
potential field (HPF) (Masoud, 2007)) in real-world engi- velocity towards goals, the maximum avoidance velocity, and
neering and manufacturing. velocities that ensure desired trajectory structures).
VOM (Fiorini & Shiller, 1998) relies on current positions However, collisions with obstacles still exist when using
and velocities of robots and obstacles to compute a reachable velocity obstacle method in complex scenarios like dense
avoidance velocity space (RAV), and then a proper avoidance and dynamic cases. Hence, some optimized velocity obstacle
123
396 Journal of Intelligent Manufacturing (2022) 33:387–424
Fig. 11 The principle of VOM. (a)–(c) denote the VO, RV, and RAV. goals, the maximum avoidance velocity, and velocities that ensure
(d) denotes the exhaustive search in the search tree. (e)–(g) denote desired trajectory structures (Fiorini & Shiller, 1998)
heuristic search method to select the highest avoidance velocity towards
Classical ML
methods, like reciprocal velocity obstacle (RVO) (Berg et al.,
2008, 2011; Guy et al., 2009), are introduced to better avoid Here basic principles of four classical but pervasive ML algo-
collisions. rithms for motion planning are presented. These algorithms
DWA (Fox et al., 1997) is about choosing a proper trans- include three supervised learning algorithms (SVM, LSTM
lational and rotational velocity (v, w) that will maximize an and CNN) and one RL that is the Monte-Carlo tree search
objective function within dynamic window. Objective func- (MCTS).
tion includes a measure of progress towards a goal location, SVM (Evgeniou & Pontil, 1999) is a well-known super-
the forward velocity of the robot, and the distance to the vised learning algorithm for classification. The basic prin-
next obstacle on the trajectory. Proper velocity (v, w) is ciple of SVM is about drawing an optimal separating
selected within the dynamic window (a search space of veloc- hyperplane between inputted data by training a maximum
ity) which consists of the velocities reachable within a short margin classifier (Evgeniou & Pontil, 1999). Inputted data is
time interval. This is achieved by: (1) computing a two- in the form of vector that is mapped into high-dimensional
123
Journal of Intelligent Manufacturing (2022) 33:387–424 397
123
398 Journal of Intelligent Manufacturing (2022) 33:387–424
Fig. 14 Four processes of MCTS. These processes repeat until the convergence of state values in the tree
Steering angles
Images
Inputs Labels
Initial
parameters
CNN
Parameters
CNN layers
(weight matrix)
New
parameters Feature vector
Softmax
Optimizer
Loss
function
Fig. 15 Training steps of CNN. The trajectory is planned by human in of feature to probabilities p ∈ (0, 1). The optimizer represents gradi-
data collection in which steering angles of robots are recorded as labels ent descent approach, e.g., stochastic gradient descent (SGD) (Zhang,
of data. Robots learn behavior strategies in training and move along 2019)
the planned trajectory in the test. The softmax function maps values
Optimal value RL nature DQN, double DQN and dueling DQN. Motion plan-
ning is realized by attaching destination and safe paths with
Here basic concepts of RL are recalled firstly, and then the big reward (numerical value), while obstacles are attached
principles of Q learning, nature DQN, double DQN and duel- with penalties (negative reward). Optimal path is found
ing DQN are given. according to total rewards from initial place to destination. To
Classical ML algorithm like CNN is competent only in better understand optimal value RL, it is necessary to recall
static obstacle avoidance by one-step prediction, therefore several fundamental concepts: Markov chain, Markov deci-
it cannot cope with time-sequential obstacle avoidance. RL sion process (MDP), model-based dynamic programming,
algorithms, e.g., optimal value RL, fit time-sequential tasks. model-free RL, Monte-Carlo method (MC), temporal dif-
Typical examples of these algorithms include Q learning, ference method (TD), and State-action-reward-state-action
123
Journal of Intelligent Manufacturing (2022) 33:387–424 399
(a) (b)
Fig. 16 a represents the relationship of basic concepts of RL. b represents the principle of MDP
(SARSA). MDP is based on Markov chain (Chan et al., the time step t is defined as the expectation of accumulative
2012), and it can be divided into two categories: model-based rewards G t by
dynamic programming and model-free RL. Mode-free RL
can be divided into MC and TD that includes SARSA and Q V (s) E G t Rt+1 + γ Rt+1 + . . . + γ T −1 RT |St s
learning algorithms. Relationship of these concepts is shown (5)
in Fig. 16.
Markov chain Variable set X {X n n > 0} is called where γ represent a discount factor (γ [0, 1]). MC uses G t −
Markov chain (Chan et al., 2012) if X meets V (s) to update its state value V MC (s) by
123
400 Journal of Intelligent Manufacturing (2022) 33:387–424
whileQ learning directly uses maximum estimated action Nature deep Q-learning network
value max Q Q(St+1 , At+1 ) at time step t + 1 to update its
action value by DQN (Mnih et al., 2013) is a combination of Q leaning and
deep neural network (e.g., CNN). DQN uses CNN to approx-
Q Q L (St , At ) imate Q values by its weight θ . Hence, Q table in Q learning
← Q (St , At ) changes to Q value network that can be converged in a faster
speed in complex motion planning. DQN became a research
+ α Rt+1 + γ max Q (St+1 , At+1 ) − Q (St , At ) focus when it was invented by Google DeepMind (Mnih et al.,
At+1 (9)
2013, 2015), and performance of DQN approximates or even
(2) SARSA adopts selected action At+1 directly to update surpasses the performance of human being in Atari games
its next action value, but Q learning algorithm use ε-greedy (e.g., Pac-man and Enduro in Fig. 18) and real-world motion
to select a new action to update its next action value. planning tasks (Bae et al., 2019; Isele et al., 2017). DQN
SARSA uses ε-greedy method to sample all potential utilizes CNN to approximate Q values (Fig. 19) by
action values of next step and selects a “safe” action even-
tually, while Q learning pays attention to the maximum Q ∗ (s, a) ≈ Q(s, a; θ ) (10)
estimated action value of the next step and selects optimal
actions eventually. Steps of SARSA is shown in Algorithm 1
(Sutton & Barto, 1998), while Q learning algorithm as Algo-
rithm 2 (Sutton & Barto, 1998) and Fig. 17. Implementations
of robotic motion planning by Q learning are as (Panov et al.,
2018; Qureshi et al., 2018; Smart & Kaelbling, 2002).
123
Journal of Intelligent Manufacturing (2022) 33:387–424 401
No Yes
Convergence?
s,a s,a,r,s’
Action selection Action execuation Q value update End
Q, s
s (images)
Q value initialization Environment
Fig. 17 Steps of Q learning algorithm. Input of Q learning is in the vector format normally. Q value is obtained via Q value table or network as
approximator. Extra preprocessing is needed to extract features from image if input is in image format
Loss value measures the distance between expected Fig. 19 Q(s,a0 ), Q(s,a1 ), Q(s,a2 ) and Q(s,at ) denote Q values of all
potential actions
value and real value. In DQN, expected value is (r +
γ maxQ(s’,a’;θ ’)) that is similar to labels in supervised learn-
ing, while Q(s,a;θ ) is the observed real value. weights of in each step, while weight of targeted network θ ’ is updated in
targeted network and Q value network share a same weight θ . a long period of time. Hence, θ is updated frequently while
The difference is that weight of Q value network θ is updated θ ’ is more stable. It is necessary to keep targeted network
stable, otherwise Q value network will be hard to converge.
Detailed steps of DQN are shown as Algorithm 3 (Mnih
et al., 2013) and Fig. 20.
123
402 Journal of Intelligent Manufacturing (2022) 33:387–424
No Yes
Convergence?
123
Journal of Intelligent Manufacturing (2022) 33:387–424 403
Images of
environment
Convolutional
layers
FC layers
Fig. 22 Q(s,a) and A(s,a) saliency maps (red-tinted overlay) on the Atari
Fig. 21 The architecture of dueling DQN, in which Q value Q(s,a) is game (Enduro). Q(s,a) learns to pay attention to the road, but pay less
decoupled into two parts, including V value V (s) and A value A(s,a) attention to obstacles in the front. A(s,a) learns to pay much attention
to dynamic obstacles in the front (Wang et al., 2015)
Q (s, a; θ , α, β)
Policy gradient RL
V (s; θ , β)
⎧ ⎫
⎨ 1 ⎬ Here the principles of policy gradient method and actor-critic
+ A (s, a; θ , α) − A st , at− ; θ , α algorithm are given firstly. It is followed by recalling the
⎩ |A| − ⎭
at A (17) principles of their optimized variants: (1) A3C and A2C; (2)
DPG and DDPG; (3) TROP and PPO.
Thus, a stable V-A pair is obtained although original Optimal value RL uses neural network to approximate
semantic definition of A value (Eq. 13) is changed (Wang optimal values to indirectly select actions. This process is
et
⎧ al., 2015). In other words: (1) advantage ⎫ constraint simplified as a ← argmaxa R(s, a) + Q(s, a; θ ). Noise
⎨ ⎬ leads to over-estimation of Q(s, a; θ ), therefore the selected
1
A(s, a; θ , α) − |A | A st , at− ; θ , α 0 is used to actions are suboptimal, and network θ is hard to converge.
⎩ a− A
⎭
t
Policy gradient algorithm uses neural network θ as policy
constrain the update of A value network α; (2) Q value
network θ is therefore obtained by Q(s, a; θ , α, β) V 2 A. Suran. Dueling Double Deep Q Learning using Tensorflow
(s; θ , β). Q value network θ is updated according to V value 2.x. Web. Jul 10, 2020. https://towardsdatascience.com/dueling-double-
that is more accurate and it is easy to obtain via accumulative deep-q-learning-using-tensorflow-2-x-7bbbcec06a2a.
123
404 Journal of Intelligent Manufacturing (2022) 33:387–424
Fig. 23 Training and test steps of policy gradient algorithms. In the be updated in training. Policy refers to target policy normally. Robots
training, time-sequential actions are generated by the behavior policy. learn trajectories via target policy (neural network as approximator) and
Note that policy is divided to behavior policy and target policy. Behav- trained policy is obtained. In the test, optimal time-sequential actions
ior policy is about selecting actions for training and behavior policy will are generated directly by trained policy πθ : s → a until destination is
not be updated, while target policy is also used to select actions but it will reached
πθ : s → a to directly select actions to avoid this problem. lead to a better policy. Increment of network is the gradient
Brief steps of policy gradient algorithm are shown in Fig. 23. value of objective, and that is
∇θ J (θ ) ∫ ∇θ πθ (τ ) R (τ ) dτ
Policy gradient method ∫ πθ (τ ) ∇θ log πθ (τ ) R (τ ) dτ
Eτ ∼πθ (τ ) ∇θ log πθ (τ ) R (τ ) (19)
Policy is a probability distribution P{a|s,θ } π θ (a|s)
π (a|s;θ ) that is used to select action a in state s, where weight
An example of PG is Monte-carlo reinforce (Williams,
θ is the parameter matrix that is used as an approximation
1992). Data τ for training are generated from simulation
of policy π (a|s). Policy gradient method (PG) (Sutton et al.,
by stochastic policy. Previous objective and its gradient
1999) seeks an optimal policy and uses it to find optimal
(Eq. 18–19) are replaced by
actions. how to find this optimal policy? Given a episode τ
(s1 ,a1 ,…,sT ,aT ), the probability to output actions in τ is πθ
1 i i
N T
T
J (θ ) ≈ r st , a t (20)
(τ ) p(s1 ) πθ (at |st ) p(st |st−1 , at−1 ). The aim of the PG N
t2 i1 t1
is to find optimal parameter θ ∗ arg maxθ Eτ ∼πθ (τ ) [R(τ )] T T
1
N
T
∇θ J (θ ) ≈ ∇θ log πθ at , st
i i
r st , a t
i i
where episode reward R(τ ) r (st , at )) is the accumu- N
t1 i1 t1 t1
lative rewards in episode τ . Objective of PG is defined as the (21)
expectation of rewards in episode τ by
where N is the number of episodes, T the length of tra-
jectory. A target policy πθ is used to generate episodes for
J (θ ) Eτ ∼πθ (τ ) [R(τ )] ∫ πθ (τ )R(τ )dτ (18) training. For example, Gaussian distribution function
is used
as behavior policy to select actions by a ∼ N μ(s), σ 2 .
Network f (s; θ ) is then used to approximate expectation of
To find higher expectation of rewards, gradient operation Gaussian
distribution
by μ(s) f (s; θ ). Itmeans a ∼ N
is used on objective to find the increment of network that may mean f s; θ {wi , bi }iL ) , stdev σ 2 and μ(s; θ )
123
Journal of Intelligent Manufacturing (2022) 33:387–424 405
3 V, V’ Update of
a, r, s’ weight ԝ
1
∇θ J (θ ) − −1 ( f (st ) − at )
2
df
dθ
(22) Actor
neural
2 Critic
neural w 5
network network
where ddθf is obtained by backward-propagation. According
to Eq. 21–22, its objective gradient is 1s Environment
s
T
1
N 1 d f
∇θ J (θ ) ≈ − −1 f sti − ati
N 2 dθ Fig. 24 Training steps of AC
i1 t1
T
× r sti , ati (23) whereβ represents learning rate. Objective of policy network
t1
is defined by
Once objective gradient is obtained, network is updated
by gradient ascent method. That is J (θ ) πθ (at |st ) · e (30)
Objective of critic network is defined by A3C In contrast to AC, the A3C (Everett et al., 2018) has
three features (1) multi-thread computing; (2) multi-step
J (w) e2 (27) rewards; (3) policy entropy. Multi-thread computing means
multiple interactions with the environment to collect data
Objective gradient is therefore obtained by minimizing and update networks. Multi-step rewards are used in critic
the mean-square error network, therefore the TD-errore of A3C is obtained by
T
∇w J (w) ∇w e2 (28) e γ i−t ri + V (st+n ) − V (st ) (33)
it
Critic network is updated by gradient ascent method
(Zhang, 2019). That is Hence, the speed of convergence is improved. Here γ is a
discount factor, and n is the number of steps. Data collection
w ← w + β∇w J (w) (29) by policy π (st ; θ ) will cause over-concentration, because
123
406 Journal of Intelligent Manufacturing (2022) 33:387–424
initial policy is with poor performance therefore actions are In on-policy policy gradient algorithm (e.g., PG), its objec-
selected from small area of workspace. This causes poor qual- tive is defined as
ity of the input, therefore convergence speed of network is
poor. Policy entropy increases the ability of policy in action J (θ ) Eτ ∼πθ (τ ) [R (τ )] ∫ πθ (τ ) R (τ ) dτ
exploitation to reduce over-concentration. Objective gradient τ ∼πθ (τ )
of A3C therefore changes to ∫ ρ π (s) ∫ πθ (a|s) R (s, a) dads
s∼S a∼A
Es∼ρ π ,a∼πθ [R (s, a)] (36)
∇θ J (θ ) A3C ∇θ J (θ ) AC + β∇θ H (π (st ; θ )) (34)
where ρ π is the distribution of state transition.
The objec-
where β is a discount factor and H (π (st ; θ ) is the policy tive gradient of PG ∇θ J (θ ) Eτ ∼πθ (τ ) ∇θ log πθ (τ )R(τ )
entropy. includes a vector C ∇θ log πθ (τ ) and a scalar R R
A2C A2C (Everett et al., 2018) is the alternative of A3C (τ ). Vector C is the trend of policy update, while scalar R is
algorithm. Each thread in A3C algorithm can be utilized the range of this trend. Hence, the scalar R acts as a critic
to collect data, train critic and policy networks, and send that decides how policy is updated. Action value Q π (s, a) is
updated weights to global model. Each thread in A2C how- defined as the expectation of discounted rewards by
ever can only be used to collect data. Weights in A2C are ∞
γ
updated synchronously compared with the asynchronous Q π (s, a) E r1 γ k−t r (sk , ak )|S1 s, A1 a; π
update of A3C, and experiments demonstrate that syn- kt
chronous update of weights is better than asynchronous way (37)
in weights update ((Babaeizadeh et al., 2016; Mnih et al.,
2016)). Their mechanisms in weight update are shown in Q π (s, a) is an alternative of scalar R, and it is better than
Fig. 25. R as the critic. Hence, objective gradient of PG changes to
T
πθ (a|s) π (ak |sk ; θ )
ρ β (s) θ k + αEs∼ρ μk ∇θ μθ (s) ∇a Q μ (s, a) |aμθ (s) (40)
k
kt (35)
βθ (a|s) T
kt β(ak |sk ; θ )
There are small changes in state distribution ρ μ of deter-
Importance-sampling ratio measures the similarity of two ministic policy during the update of network θ , but this
policies. These policies must be with large similarity in the change will not impact the update of network. Hence, net-
definition of important sampling. Particularly, behavior pol- work of deterministic policy is updated by
icy β θ is the same as policy πθ in on-policy algorithms. This
means πθ βθ and ρ β (s) ρ π (s) 1. θ ← θ + αEs∼ρ μ ∇θ μθ (s)∇a Q μ (s, a)|aμθ (s) (41)
123
Journal of Intelligent Manufacturing (2022) 33:387–424 407
Global Weight
2nd thread Environment 2nd thread Data Model
model update
Training
because
wt ← wt + αw δt ∇w Q w (st , at ) (46)
∇θ J (μθ ) ∇θ ∫ ρ μ (s) R (s, μθ (s)) ds
S θt ← θt + αθ ∇θ μθ (s)∇a Q w (s, a)|aμθ (s) (47)
∇θ Es∼ρ μ [R (s, μθ (s))]
∫ ρ μ (s) ∇θ μθ (s) ∇a Q μ (s, a) |aμθ (s) However, no constrains is used on network w in the
S approximation of Q value and this will lead to a large bias.
Es∼ρ μ ∇θ μθ (s) ∇a Q μ (s, a) |aμθ (s) (42) How to obtain a Q w (s, a) without bias? Compatible
function approximation (CFA) can eliminate the bias by
Once Q μ (s, a) is obtained, θ can be updated after obtain- adding two constraints on w (proof is given in Silver et al.
ing objective gradient. (2014)): (1) ∇a Q w (s, a)|
aμθ (s) ∇θ μθ (s) w; (2) M S E
How to find Q μ (s, a)? Note that discounted reward Q μ (θ , w) E[(s; θ , w)] (s; θ , w) (s; θ , w) → 0 where
(s, a) is a critic in stochastic policy gradient mentioned ∈ (s; θ , w) ∇a Q w (s, a)|aμθ (s) −∇a Q μ (s, a)|aμθ (s) . In
before. If network w is used as approximator, Q μ (s, a) is other words, Q w (s, a) should meet
obtained by
Q w (s, a) (a − μθ (s)) ∇θ μθ (s) w + V v (s) (48)
μ w
Q (s, a) ≈ Q (s, a) (43)
where state value V v (s) may be any differentiable base-
stochastic policy gradient algorithm includes two networks, line function (Silver et al., 2014). Here v and ∅ are feature
in which w is the critic that approximates Q value and and parameter of state value (V v (s) v ∅(s)). Parame-
θ is used as actor to select actions in test (actions are ter ∅ is also the feature of advantage function (Aw (s, a)
∅(s, a) w), and ∅(s, a) is defined as ∅(s, a) ∇θ μθ (s)(a−
def
selected by behavior policy β in training). Stochastic policy
gradient in this case is called off-policy deterministic actor- μθ (s)). Hence, a low-bias Qw (s,a) is obtained using OPDAC-
critic (OPDAC) or OPDAC-Q. Objective gradient of OPDAC Q and CFA. This new algorithm with less bias is called
therefore changes from the Eq. 42 to Compatible OPDAC-Q (COPDAC-Q) (Silver et al., 2014),
in which weights are updated as Eq. 49–51
∇θ Jβ (μθ ) ≈ Es∼ρ β ∇θ μθ (s)∇a Q w (s, a)|aμθ (s) (44)
vt ← vt + av δt ∅(st ) (49)
where β represents the behavior policy. Two networks are
updated by wt ← wt + αw δt ∅(st , at ) wt + αw δt ∇w Aw (st , at ) (50)
123
408 Journal of Intelligent Manufacturing (2022) 33:387–424
Actor CFA
Deterministic
policy μθ(s) Gradient Q
Critic Qμ(s,a) Qw(s,a) Q learning
learning
where δt is the same as the Eq. 45. Here av , αw and αθ Critic network θ Q is updated by minimizing the loss value
are learning rates. Note that linear function approximation (MSE loss)
method (Silver et al., 2014) is used to obtain advantage func-
1 Q 2
tion Aw (s, a) that is used to replace the value function Q w Loss J θ (54)
(s, a) because Aw (s, a) is efficient than Q w (s, a) in weight N
i
update. Linear function approximation however may lead to
divergence of Q w (s, a) in critic δ. Critic δ can be replaced where N is the number of tuples < s,a,r,s’ > sampled from
by the gradient Q-learning critic (Sutton et al., 2009) to replay buffer. Target function of policy network is defined by
reduce the divergence. Algorithm that combines COPDAC- 1
Q and gradient Q-learning critic is called COPDAC Gradient J θμ Q si , ai ; θ μ (55)
N
Q-learning (COPDAC-GQ). Details of gradient Q-learning i
critic and COPDAC-GQ algorithm can be found in ((Silver
and objective gradient is obtained by
et al., 2014; Sutton et al., 2009)).
By analytical illustration above, 2 examples (COPDAC-Q 1
and COPDAC-GQ) of DPG algorithm are obtained. In short, ∇θ μ J θ μ ∼
∇θ μ μ si ; θ μ ∇a Q si , a; θ Q |aμ(si )
N
key points of DPG are to: (1) find a no-biased Q w (s, a) as i
(56)
critic; (2) train a deterministic policy μθ (s) to select actions.
networks of DPG are updated as AC via the policy ascent
Hence, policy network θ μ is updated according to gradient
approach. Brief steps of DPG are shown in Fig. 26.
ascent method by
DDPG (Lillicrap et al., 2019) is the combination of replay
buffer, deterministic policy μ(s) and actor-critic architecture.
θ μ ← θ μ + α∇θ μ J θ μ (57)
θ Q is used as critic network to approximate action value Q
si , ai ; θ Q . θ μ is used as policy network to approximate where α is a learning rate. New target networks
deterministic policy μ(s; θ μ ). TD target y of DDPG is defined
by θ Q ← τ θ Q + (1 − τ )θ Q (58)
yi ri + γ Q si+1 , μ si+1 ; θ μ ; θ Q (52)
θ μ ← τ θ μ + (1 − τ )θ μ (59)
where θ Q and θ μ are copies of θ Q and θ μ as target networks where τ is a learning rate, are obtained by “soft” update
that are updated with low frequency. The objective of critic method that improves the stability of network convergence.
network is defined by Detailed steps of DDPG are shown in Algorithm 4 (Lill-
icrap et al., 2019) and Fig. 27. Examples can be found in
J θ Q yi − Q si , ai ; θ Q (53) ((Jorgensen & Tamar, 2019; Munos et al., 2016)) in which
DDPG is used in robotic arms.
123
Journal of Intelligent Manufacturing (2022) 33:387–424 409
TRPO and PPO η(θ ) and η(θold ) denote expectations of new and old poli-
cies, respectively. Their ∞relationship is defined
by η(θ ) η
PPO (Long et al., 2018; Schulman et al., 2017b) is the opti- t
(θold ) + Es0 ,a0 ,s1 ,a1 ... γ Aθold (st , at ) where γ is a dis-
mized version of TRPO (Schulman et al., 2017a). Hence, t0
here the principle of TRPO is given before recalling that of count factor, and Aθold (st , at ) is the advantage value that is
defined by Aθ (s, a) Q θ (s, a) − Vθ (s). Thus, η(θ ) η
PPO.
TRPO Previous policy gradient algorithms update their (θold )+ ρθ (s) θ (a|s)Aθold (s, a) where ρθ (s) is the prob-
s a
policies by θ ← θ + ∇θ J (θ ). However, new policy is ability distribution of new policy, but ρθ (s) is unknown
improved unstably with fluctuation. The goal of TRPO is therefore it is impossible to obtain new policy η(θ ). Approx-
to improve its policy monotonously, therefore stability of imation of new policy’s expectation L θold (θ ) is defined by
convergence is improved by finding a new policy with the
objective that is defined by
L θold (θ ) η (θold ) + ρθold (s) θ (a|s) Aθold (s, a)
K L (θold , θ ) ≤ δ
J (θ ) L θold (θ ), s.t.D max (60) s a
θ (a|s)
where L θold (θ ) is the approximation of new policy’s expec- η (θold ) + ρθold (s)
θold (a|s)
K L (θold , θ ) the KL divergence between old policy
tation, D max s a
θold and new policy θ , and δ a trust region constraint of KL
· θold (a|s) · Aθold (s, a)
divergence. The objective gradient ∇θ J (θ ) is obtained by
(61)
maximizing the objective J (θ ). θ (a|s)
η (θold ) + E Aθold (s, a)
θold (a|s)
K L (θold , θ )
η(θ ) ≥ L θold (θ ) − C · D max (62)
123
410 Journal of Intelligent Manufacturing (2022) 33:387–424
Update of and
weight θμ using sampled Q
targeted function y y
5 2 Replay
θμ
a, r, s’ buffer <s,a,r,s’>
r,θQ θμ’ θQ’
4 Update of
θμ’ weight θQ by minimizing
the loss
1 Actor Critic
3
neural neural θQ
network network
s Environment s
Fig. 27 Steps of DDPG. DDPG combines the replay buffer, actor-critic experience tuple is saved in replay buffer. Third, experiences are sam-
architecture, and deterministic policy. First, action is selected by policy pled from replay buffer for training. Fourth, critic network is updated.
network and reward is obtained. State transits to next state. Second, Finally, policy network is updated
where penalty coefficient C 2γ 2 , γ [0, 1] and ∈ the divergence instead of penalty coefficientC. Fixed trust
(1−γ )
maximum advantage. Hence, it is possible to obtain η(θ ) by region constraint δ leads to a reasonable step size in policy
K L (θold , θ ) or
maximi zeθ L θold (θ ) − C · D max update therefore stability in convergence is improved and
convergence speed is acceptable. However, objective of
θ (a|s) TRPO is obtained in implementation by conjugate gradient
maximi zeθ E Aθold (s, a) − C · D max (θold , θ )
θold (a|s) KL
method (Schulman et al., 2017b) that is computationally
(63) expensive.
PPO optimizes objective maximi zeθ E
However, penalty coefficient C (constrain of KL diver- θ(a|s)
θold (a|s) Aθold (s, a) − C · D K L (θold , θ ) from two aspects:
max
gence) will lead to small step size in policy update. Hence, a
θ(a|s)
trust region constraint δ is used to constrain KL divergence (1) probability ratio r (θ ) θold (a|s) in objective is con-
by strained in interval [1 − , 1 + ] by introducing “surrogate”
objective
θ (a|s)
maximi zeθ E Aθ (s, a) , s.t.D max
K L (θold , θ ) ≤ δ
θold (a|s) old L C L I P (θ ) E{min[r (θ )A, cli p(r (θ ), 1 − , 1 + )]} (66)
(64)
where ∈ is a hyperparameter, to penalize changes of pol-
therefore step size in policy update is enlarged robustly. New icy that move r (θ ) away from 1 (Schulman et al., 2017b);
improved policy is obtained in trust region by maximizing (2) penalty coefficient C is replaced by adaptive penalty
K L (θold , θ ) ≤ δ. This objective can
objective L θold (θ ), s.t. D max coefficient β that increases or decreases according to the
be simplified further (Schulman et al., 2017a) and new policy expectation of KL divergence in new update. To be exact,
θ is obtained by
dtarg β
ifd , β ← ; i f d dtarg × 1.5, β ← β × 2 (67)
maximi zeθ ∇θ L θold (θ )|θθold • (θ − θold ) 1.5 2
123
Journal of Intelligent Manufacturing (2022) 33:387–424 411
123
412 Journal of Intelligent Manufacturing (2022) 33:387–424
Graph search algorithm Dijkstra’s 1 Graph or map 1. Best-first search (large Trajectory
search space)
2. Heuristic function in cost
estimation
A* 1,2
Sampling based algorithm PRM 1 1. Random search
(suboptimal path)
2. Non-holonomic
constraint
RRT 1,2
Interpolating Curve Line and circle 1. Mathematical rules
algorithm Clothoid curves 2. Path smoothing using
CAGD
Polynomial curves
Bezier curves
Spline curves
Reaction based algorithm PFM (APF) Robot configurations (e.g., 1. Different potential field Moving directions
position) functions U for different
targets (e.g., goal,
obstacle)
2. Combined U and
gradient of U
VOM Positions and velocities 1. VO, RV and RAV Selected velocity among
(robot and obstacles) 2. Exhaustive/global search, possible velocities
and heuristic search
according to objective
function U
DWA Robot’s position, distances 1. Vs , Va , Vd , Vr
to goal/obstacles, (v, w), 2. Velocity selection
and kinematics of robot according to objective
function U
Optimal value RL This group here includes Q learning, differences are: first, DQN obtains action values by neural
DQN, double DQN, and dueling DQN. Features of algo- network Q(s, a; θ ), while Q learning obtains action values
rithms here include the replay buffer, objectives of algorithm, by querying the Q-table; second, double DQN uses another
and method of weight update. Comparisons of these algo- neural network θ − to evaluate
actions to obtain
selected bet-
rithms are listed in Table 3. According to that: (1) Q learning ter action values by Q s , argmax a Q s , a ; θ ; θ − ; third,
normally uses well-prepared vector as the input, while DQN, dueling DQN obtains action values by dividing them to
double DQN and dueling DQN use images as their input advantage values and state values. The constraint Ea∼π (s)
because these algorithms use convolutional layer to process [A(s, a)] 0 is used on the advantage value, therefore
value is obtainedby Q(s, a; θ , α, β)
high-dimensional images. (2) Outputs of these algorithms a better action
are time-sequential actions by performing trained model. (3) −
V (s; θ , β) + A(s, a; θ , α) − |A
1
|
−
at ∈A A(s ,
t ta ; θ , α) .
DQN, double DQN and dueling DQN use replay buffer to
Networks that approximate action value in these algorithms
reuse the experience, while Q learning collects experiences
are updated by minimizing MSE with gradient descent
and learns from then in an online way. (4) DQN, double
approach.
DQN and dueling DQN use MSE e2 as their objectives. Their
123
Journal of Intelligent Manufacturing (2022) 33:387–424 413
123
414 Journal of Intelligent Manufacturing (2022) 33:387–424
Policy gradient RL Algorithms in this group include PG, updated in a “soft” way by θ Q ← τ θ Q + (1 − τ )θ Q and
AC, A3C, A2C, DPG, DDPG, TRPO, and PPO. Features of θ μ ← τ θ μ + (1 − τ )θ μ .
them include actor-critic architecture, multi-thread method,
replay buffer, objective of algorithm, and method of weight Analytical comparisons of motion planning
update. Comparisons of these algorithms are listed in Table algorithms
4. According to that: (1) Inputs of policy gradient RL can
be image or vector, and image is used as inputs under Here analytical comparisons of motion planning algorithms
the condition that convolutional layer is used as prepro- are made according to general criteria we summarized.
cessing component to convert high-dimensional image to These criteria consist of six aspects: local or global planning,
low-dimensional feature. (2) Outputs of policy gradient RL path length, optimal velocity, reaction speed, safe distance
are time-sequential actions by performing trained policy and time-sequential path. The speed and stability of network
π (s) : s → a. (3) Actor-critic architecture is not used in convergence for optimal value RL and policy gradient RL are
PG, while other policy gradient RL are implemented with then compared analytically because convergence speed and
actor-critic architecture. (4) A3C and A2C use multi-thread stability of RL in robotic motion planning are recent research
method to collect data and update their network, while focus.
other policy gradient RL are based on single thread in data
collection and network update. (5) DPG and DDPG use
Comparisons according to general criteria
replay buffer to reuse data in an offline way, while other
policy gradient RL learn online. (6) The objective of PG
Local (reactive) or global planning This criterion denotes
is defined as the expectation of accumulative rewards in
the area where the algorithm is used in most case. Table 5
the episode by Eτ ∼πθ (τ ) [R(τ )]. Critic objectives of AC,
lists planning algorithms and the criteria they fit. Accord-
A3C, A2C, DPG and DDPG are defined as MSE e2 , and
ing to Table 5: (1) Graph search algorithms plan their path
their critic networks are updated by minimizing the MSE.
globally by search methods (e.g., depth-first search, best-first
However, their actor objectives are different because: first,
search) to obtain a collision-free trajectory on the graph or
actor objective of AC is defined as πθ (at |st ) · e; second,
map. (2) Sampling-based algorithms samples local or global
policy entropy is added on πθ (at |st ) · e to encourage the
workspace by sampling methods (e.g., random tree) to find
exploration, therefore actor objectives of A3C and A2C are
collision-free trajectories. (3) Interpolating curve algorithms
defined by πθ (at |st ) · e + β H (π (st ; θ ); third, DPG and DDPG
draw fixed and short trajectories by mathematical rules to
use action value as their actor objectives by Q(si , ai ; θ μ )
avoid local obstacles. (4) Reaction based algorithms plan
that are approximated by neural network, and their pol-
local paths or reactive actions according to their objective
icy networks (actor networks) are updated by obtaining
functions. (5) MSVM and CNN make one-step prediction by
objective gradient ∇θ μ μ(si ; θ μ )∇a Q(si , a; θ Q )|aμ(si ) .
trained classifiers to decides their local motions. (6) LSTM,
Critic objectives of TRPO and PPO are defined as the
MCTS, optimal value RL and policy gradient RL can make
advantage value by At δt + (γ λ)δt+1 + . . . + γ λT −t δ(sT )
time-sequential motion planning from the start to destination
where δt rt + γ V (st+1 ; w) − V (st ; w), and their critic
by performing their trained models. These models include
networks w are updated by minimizing δt2 . Actor objectives
the stack structure model of LSTM, tree model of MCTS
of TRPO and PPO are different: objective of TRPO is
θ(a|s) and matrix weight model of RL. These algorithms fit global
defined as E[ θold (a|s) Aθold (s, a)], s.t.D K L (θold , θ ) ≤ δ in
max
motion planning tasks theoretically if size of workspace is
which a fixed trust region constraint δ is used to ensure
not large, because it is hard to train a converged model
the monotonous update of policy network θ , while PPO
in large workspace. In most case, models of these algo-
uses “surrogate” [1 − , 1 + ] and adaptive penalty β to
rithms are trained in local workspace to make time-sequential
ensure a better monotonous update of policy network,
predictions by performing their trained model or policy π
therefore PPO has two objectives that are defined as
θ(a|s) (s) : s → a.
L K L P E N (θ ) E[ θold (a|s) Aθold (s, a) − β · D K L (θold , θ )]
max
(KL-penalized objective) and L C L I P+V F+S (θ ) Path length This criterion denotes the length of planned
E[L C L I P (θ ) + c1 L V F (θ ) + c2 S(πθ |s)] (surrogate objective) path that is described as optimal path (the shortest path),
where L C L I P (θ ) E{min[r (θ )A, cli p(r (θ ), 1 − , 1 + )]}. suboptimal path, and fixed path. Path length of algorithms
(7) Policy network of policy gradient RL are all updated are listed in Table 5. According to that: (1) Graph search
by gradient ascent method θ ← θ + α∇θ J (θ ). Policy algorithms can find a shortest path by performing search
networks of A3C and A2C are updated in asynchronous and methods (e.g., best-first search) in the graph or map. (2)
synchronous ways, respectively. Networks of DDPG are Sampling-based algorithms plan a suboptimal path. Their
sampling method (e.g., random tree) leads to the insufficient
sampling that only covers a part of cases and suboptimal
123
Journal of Intelligent Manufacturing (2022) 33:387–424 415
path is obtained. (3) Interpolating curve algorithms plan their can reach the destination with minimum time in travelled
path according to mathematical rules that lead to a fixed path. This criterion is described as optimal velocity and sub-
length of path. (4) Reaction-based algorithms can find the optimal velocity. Table 5 lists performance of algorithms
shortest path if their objective functions are related to short- in the optimal velocity. According to that: (1) Performance
est distance travelled, and then robots will move towards of graph search algorithms, sampling-based algorithms and
directions that maximize objective functions. (5) MSVM, interpolating algorithms in the velocity tuning cannot be eval-
LSTM and CNN plan their path by performing models that uated, because these algorithms are only designed for the
are trained with human-labeled dataset, therefore suboptimal path planning to find a collision-free trajectory in the graph
path is obtained. MCTS can receive the feedback (reward) or map. (2) Among reaction-based algorithms, the velocity of
from the environment to update its model (tree), therefore it robots in PFM can be tuned according to obtained potential
is possible to find a shortest path like other RL algorithms. field, while velocities of robots in VOM and DWA are dynam-
(6) RL algorithms (optimal value RL and policy gradient RL) ically selected among their possible velocities according to
can generate optimal path under the condition that reason- their objective functions. Hence, reaction-based algorithms
able penalty is used to punish moved steps in the training, can realize optimal velocity. (3) MSVM, LSTM, and CNN
therefore optimal path is obtained by performing the trained can output actions that are in the format v [vx , v y ] where
RL policy. vx and v y are velocity in x and y axis, if algorithms are
trained with these vector labels. However, these velocity-
Optimal velocity This criterion denotes the ability to tune
related labels are all hard-coded artificially. Time to reach
the velocity when algorithms plan their path, therefore robot
123
416 Journal of Intelligent Manufacturing (2022) 33:387–424
destination heavily relies on this artificial factor, therefore these algorithms cannot realize optimal velocity. MCTS can realize optimal velocity theoretically if the size of the action space is small; however, no case where MCTS is used in velocity-related tasks has been found. A large action space leads to a large search tree, and it is hard to train such a large tree model. In the normal case, MCTS is used for one-step prediction via its trained tree model, therefore the state can transit into the next state, but the velocity in this state transition is not considered. Hence, its ability in velocity tuning cannot be evaluated. (4) Optimal value RL and policy gradient RL can realize optimal velocity by attaching a penalty to consumed time in the training. These algorithms automatically learn how to choose the best velocity in the action space so that as little time as possible is consumed, therefore robots can realize optimal velocity by performing the trained policy. Note that in this case, actions in optimal value RL and policy gradient RL must be in the format [vx, vy], and an action space that contains many action choices must be pre-defined artificially.

Reaction speed This criterion denotes the computational cost or time cost of the algorithm to react to dynamic obstacles. The computational cost of some algorithms (e.g., traditional non-AI algorithms) would be easy to obtain by counting the number of lines and functions in the source code (Kinnunen et al., 2011). However, the problems of this approach are that all source codes should be found first, and the result would still depend on the chosen programming language and the quality of the implementation. It is hard to obtain clear computational costs, thus reaction speed is here briefly described analytically using three levels: slow, medium and fast. Table 5 lists the reaction speed of the algorithms. According to that: (1) Graph search algorithms and sampling-based algorithms rely on planned trajectories in the graph or map to avoid obstacles. However, the graph or map is normally updated at a slow frequency, therefore the reaction speed of these algorithms is slow. (2) Interpolating curve algorithms plan their path according to mathematical rules with limited and predictable computation time, therefore the reaction speed of these algorithms is medium. (3) Reaction-based algorithms should first update their potential field (for PFM) or search space (for VOM and DWA) in small local areas to select moving directions or proper velocities. These update processes take less time than those of graph search algorithms. Hence, the reaction speed of reaction-based algorithms is medium. (4) Classical ML algorithms, optimal value RL and policy gradient RL react to obstacles by performing a trained model or policy π(s): s → a that maps the state of the environment to a probability distribution P(a|s) or to actions directly (e.g., DDPG). This process is fast and its time cost can be ignored, therefore the reaction speed of these algorithms is fast.

Safe distance This criterion denotes the ability to keep a safe distance to obstacles. Safe distance is described at three levels: fixed distance, suboptimal distance and optimal distance. Table 5 lists the performance of the algorithms. According to that: (1) Graph search algorithms and sampling-based algorithms keep a fixed distance to static obstacles by hard-coded settings in robotic applications. However, a high collision rate is inevitable in dynamic environments because of the slow update frequency of the graph or map. (2) Interpolating curve algorithms keep a fixed distance to static and dynamic obstacles according to mathematical rules. (3) Reaction-based algorithms can theoretically keep safe and dynamic distances to obstacles according to their potential field functions (for PFM) or objective functions (for VOM and DWA). However, their updates of potential fields or search spaces cost a short period of time that cannot be ignored, which increases the rate of collision with obstacles, especially in high-speed, dense, and dynamic scenarios. Hence, they keep a suboptimal distance to obstacles. (4) MSVM, LSTM and CNN keep a suboptimal distance to static and dynamic obstacles. The suboptimal distance is obtained by performing a model that is trained with a human-labeled dataset. MCTS can output a collision-free time-sequential path and theoretically keep a safe distance to obstacles, as other RL algorithms do. (5) Optimal value RL and policy gradient RL keep a safe and dynamic distance to static and dynamic obstacles with a computational cost that can be ignored, by performing a trained policy π(s): s → a. This policy is trained under the condition that a penalty is used to punish close distances between the robot and obstacles in the training, therefore the algorithms automatically learn how to keep an optimal distance to obstacles while the robot moves towards the destination.

Time-sequential path This criterion denotes whether an algorithm fits time-sequential tasks or not. Table 5 lists the algorithms that fit time-sequential planning. According to that: (1) Graph search algorithms, sampling-based algorithms, interpolating curve algorithms, and reaction-based algorithms plan their path according to the graph or map, mathematical rules, or the obtained potential field or search spaces, regardless of the environmental state in each time step. Hence, these algorithms cannot fit time-sequential tasks. (2) MSVM and CNN output actions by one-step prediction that has no relation to the environmental state in each time step. (3) LSTM and MCTS store the environmental state of each time step in their cells and nodes respectively, and their models are updated by learning from these time-related experiences. Time-sequential actions are outputted by performing the trained models, therefore these algorithms fit time-sequential tasks. (4) Optimal value RL and policy gradient RL train their policy networks by learning from the environmental state in each time step. Time-sequential
actions are outputted by performing the trained policy, therefore these algorithms fit time-sequential tasks.

... and the KL-penalized objective used by PPO, L^KLPEN(θ): max_θ E[ π_θ(a|s) / π_θold(a|s) · A_θold(s, a) − β · D_KL(θ_old, θ) ].
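The penalty-based training mentioned repeatedly in the criteria above (punishing moved steps for a short path, consumed time for an optimal velocity, and close distances for a safe distance) can be made concrete with a small reward-shaping sketch. Everything below is an illustrative assumption rather than a formulation from the works cited here: the coefficients, the helper arguments, and the discretized [vx, vy] action space are hypothetical.

    import numpy as np

    # Illustrative, pre-defined action space of velocity vectors [vx, vy] (m/s),
    # as required when optimal value RL chooses velocities from discrete actions.
    SPEEDS = [0.2, 0.6, 1.0]
    HEADINGS = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
    ACTION_SPACE = [(s * np.cos(h), s * np.sin(h)) for s in SPEEDS for h in HEADINGS]

    def shaped_reward(robot_pos, goal_pos, obstacle_dists, dt,
                      step_penalty=-0.01, time_penalty_rate=-0.05,
                      safe_radius=0.5, collision_penalty=-1.0, goal_reward=1.0):
        """Per-step reward combining the three penalties discussed in the text:
        a step penalty (shorter path preferred), a time penalty proportional to
        the elapsed time dt (faster velocities preferred), and a proximity
        penalty near obstacles (safe distance preferred). All values are
        illustrative and would be tuned per task."""
        reward = step_penalty + time_penalty_rate * dt
        d_min = float(np.min(obstacle_dists))
        if d_min <= 0.0:                      # collision with an obstacle
            return reward + collision_penalty
        if d_min < safe_radius:               # too close: graded penalty
            reward += -0.5 * (safe_radius - d_min) / safe_radius
        if np.linalg.norm(np.asarray(goal_pos, dtype=float)
                          - np.asarray(robot_pos, dtype=float)) < 0.2:
            reward += goal_reward             # destination reached
        return reward

In a sketch like this, the coefficients trade the criteria against each other; for example, a larger time penalty pushes the learned policy toward the fastest velocities in the pre-defined action space, while a larger proximity penalty enforces a wider safe distance.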
steps that include data collection, data preprocessing, motion planning and decision making (Fig. 28).

Data collection To realize the mentioned task, these questions should be considered first: (1) how to collect enough data? (2) how to collect high-quality data? To collect enough data in a short time, multi-thread methods or cloud technology should be considered; existing techniques seem sufficient to solve this question well. To collect high-quality data, existing works use a prioritized replay buffer (Oh et al., 2018) to reuse high-quality data to train the network. Imitation learning (Codevilla et al., 2018; Oh et al., 2018) is also used for network initialization with expert experience, therefore the network can converge faster (e.g., deep V learning (Chen et al., 2016, 2019)). Existing methods in data collection work well, therefore it is hard to make further optimization.

Data preprocessing Data fusion and data translation should be considered after data is obtained. Multi-sensor data fusion algorithms (Durrant-Whyte & Henderson, 2008) fuse data that is collected from the same or different types of sensors. Data fusion is realized at the pixel, feature, and decision levels, therefore a partial understanding of the environment is avoided. Another way to avoid a partial understanding of the environment is data translation, which interprets data into a new format, therefore algorithms can better understand the relationship between the robot and other obstacles (e.g., attention weights (Chen et al., 2019) and relation graphs (Chen et al., 2020)). However, algorithms for data fusion and translation cannot fit all cases, therefore further work is needed according to the environment of the application.

Motion planning In this step, the selection and optimization of motion planning algorithms should be considered: (1) if traditional motion planning algorithms (e.g., A*, RRT) are selected for the task mentioned before, a global trajectory from the start to the destination will be obtained, but this process is computationally expensive because of the large search space. To solve this problem, the combination of traditional algorithms with other ML algorithms (e.g., CNN, DQN) may be a good choice. For example, RRT can be combined with DQN (Fig. 29) by using action values to predict directions of tree expansion, instead of the heuristic or random search, as sketched below.
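A minimal sketch of this RRT + DQN combination follows. The q_network interface, the discrete set of growth directions, and the probability of trusting the learned values are all illustrative assumptions; this is a sketch of the idea, not an implementation from the cited literature, and Fig. 29 is not reproduced here.

    import math
    import random

    # Candidate growth directions (8 headings) playing the role of a discrete
    # DQN action space; q_network is a hypothetical trained model that returns
    # one action value per direction for a given tree-node state.
    DIRECTIONS = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)) for k in range(8)]

    def guided_rrt(start, goal, q_network, collision_free, n_iters=2000,
                   step=0.3, use_q_prob=0.7, goal_tol=0.5):
        """RRT in which tree expansion is guided by learned action values
        instead of purely heuristic or random steering."""
        nodes, parents = [start], {0: None}
        for _ in range(n_iters):
            if random.random() < use_q_prob:
                # Exploit: grow the most recently added node (a simplification)
                # along the direction with the highest predicted action value.
                idx = len(nodes) - 1
                q_values = q_network(nodes[idx])
                dx, dy = DIRECTIONS[max(range(len(DIRECTIONS)),
                                        key=lambda a: q_values[a])]
            else:
                # Explore: classical RRT step toward a random sample.
                sample = (random.uniform(-10.0, 10.0), random.uniform(-10.0, 10.0))
                idx = min(range(len(nodes)),
                          key=lambda i: math.dist(nodes[i], sample))
                norm = math.dist(nodes[idx], sample) or 1.0
                dx = (sample[0] - nodes[idx][0]) / norm
                dy = (sample[1] - nodes[idx][1]) / norm
            new = (nodes[idx][0] + step * dx, nodes[idx][1] + step * dy)
            if collision_free(nodes[idx], new):
                parents[len(nodes)] = idx
                nodes.append(new)
                if math.dist(new, goal) < goal_tol:
                    return nodes, parents          # goal region reached
        return nodes, parents

Blending value-guided expansion with occasional random sampling, as above, keeps the probabilistic completeness of RRT while letting the learned action values shrink the effective search space.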
(2) It seems impossible to use supervised learning to realize the task mentioned above safely and quickly: a global path cannot be obtained by supervised learning that outputs one-step predictions. (3) A global path also cannot be obtained by optimal value RL or policy gradient RL, but their performance in safety and efficiency is good locally when performing a trained RL policy, which leads to quick reaction, a safe distance to obstacles, and the shortest travelled path or time. However, it is time-consuming to train an RL policy because of deficiencies in network convergence. Existing works made some optimizations to improve convergence (e.g., DDPG, PPO) in games and physical robots to shorten the training time of RL, but there is still a long way to go in real-world engineering and manufacturing for commercial purposes. A recent trend to improve network convergence is to create hybrid architectures that fuse high-performance components (e.g., replay buffer, actor-critic architecture, policy entropy, multi-thread methods). Apart from optimizations of motion planning algorithms, hardware planning may be a possible direction to improve the performance of future motion planning systems. Hardware planning refers to the reconfiguration and adjustment of the hardware of the robotic system, therefore robots can be reconfigured into different shapes (Salemi et al., 2006; Wang et al., 2020) to improve their performance in motion planning.

Decision Traditional algorithms (e.g., A*) feature global trajectory planning, while optimal value RL and policy gradient RL feature safe and quick motion planning locally. It is a good direction to realize the task mentioned above by combining traditional algorithms with RL. This is achieved by fusing the commands generated by each algorithm or functional module (e.g., functional modules in ROS) via algorithm-level and system-level data fusion. Multi-sensor fusion techniques can not only fuse sensor information at the pixel and feature levels as the inputs, but also fuse different types of decisional commands at the decision level in a loosely-coupled way. Hence, the overall path of the robot is expected to approximate the shortest path, and safety and efficiency can be ensured simultaneously. The recent state of the art borrows group decision making theory for sensor fusion or data fusion to solve the problem of how the weights or impacts of each decisional command are determined from the consensus level and confidence level (Ji et al., 2020). Some pioneering ideas about how to better combine or integrate algorithms can also be seen in Zhang et al. (2019). Apart from hybrid algorithms or decisional commands, joint decision making may be helpful to improve the performance of future motion planning systems. This is achieved via collaborative operations between software-based planning (algorithm-based planning) and hardware planning.

Fig. 28 Possible research directions: motion planning with traditional algorithms (search space reducing) and supervised learning (combination with other algorithms); hardware planning (self-reconfiguration to have different shapes); joint decision (collaborative operations of software and hardware)

Finally, the robustness and resilience of the motion planning system should be considered if robots are expected to perform better in engineering and manufacturing for commercial use. Robustness refers to the ability to resist outside noise or attacks (e.g., cyber-attacks in ROS), while resilience refers to the ability of the robot to recover its function after the robot is partially damaged (Wang et al., 2020). To conclude, Fig. 28 lists possible research directions, but attention to improving the performance of robotic motion planning is expected to be paid to: (1) data fusion and translation of inputted features; (2) the optimization of traditional planning algorithms to reduce the search space by combining traditional algorithms with supervised learning or RL; (3) the optimization of network convergence for RL.
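As a concrete illustration of the decision-level command fusion discussed in the Decision step above, the sketch below blends velocity commands from different planning modules using confidence and consensus weights. The weighting scheme is a hypothetical stand-in for the group-decision-making formulations cited above (e.g., Ji et al., 2020), whose exact models are not reproduced here; the function name and numbers are illustrative.

    import numpy as np

    def fuse_commands(commands, confidences):
        """Loosely-coupled, decision-level fusion of velocity commands [vx, vy]
        from different modules (e.g., an A*-based global planner and an
        RL-based local planner). Each command is weighted by its source's
        confidence and by its consensus with the other commands (closer to the
        group mean gives a higher consensus weight). Illustrative scheme only."""
        cmds = np.asarray(commands, dtype=float)     # shape: (n_sources, 2)
        conf = np.asarray(confidences, dtype=float)  # shape: (n_sources,)
        mean_cmd = cmds.mean(axis=0)
        # Consensus: inverse distance to the group mean (epsilon for stability).
        consensus = 1.0 / (np.linalg.norm(cmds - mean_cmd, axis=1) + 1e-6)
        weights = conf * consensus
        weights /= weights.sum()
        return weights @ cmds                        # fused [vx, vy]

    # Example: a cautious global command fused with a faster local RL command.
    fused = fuse_commands(commands=[[0.4, 0.0], [0.9, 0.1]], confidences=[0.6, 0.8])

A loosely-coupled scheme like this leaves each planning module unchanged and only arbitrates their outputs, which matches the modular structure of ROS-style systems described above.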
Declarations
References

recognition. In F. Fogelman Soulie & J. Herault (Eds.), Neurocomputing: Algorithms, architectures and applications. Springer-Verlag.
Brooks, R. A. (1986). A robust layered control system for a mobile robot. IEEE Transactions on Robotics and Automation, 1(1), 1–10.
Cai, M., Lin, Y., Han, B., Liu, C., & Zhang, W. (2017). On a simple and efficient approach to probability distribution function aggregation. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(9), 2444–2453.
Chan, K. C., Lenard, C. T., & Mills, T. M. (2012). An introduction to Markov chains. In 49th Annual Conference of Mathematical Association of Victoria, Melbourne, pp. 40–47.
Chao, Y., Xiang, X., & Wang, C. (2020). Towards real-time path planning through deep reinforcement learning for UAV in dynamic environment. Journal of Intelligent and Robotic Systems, 98, 297–309.
Chen, Y., Liu, M., Everett, M., & How, J. P. (2016). Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. arXiv:1609.07845v2 [cs.MA].
Chen, C., Liu, Y., Kreiss, S., & Alahi, A. (2019). Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning. International Conference on Robotics and Automation (ICRA), pp. 6015–6022.
Chen, C., Hu, S., Nikdel, P., Mori, G., & Savva, M. (2020). Relational graph learning for crowd navigation. arXiv:1909.13165v3 [cs.RO].
Codevilla, F., Müller, M., López, A., Koltun, V., & Dosovitskiy, A. (2018). End-to-end driving via conditional imitation learning. IEEE International Conference on Robotics and Automation (ICRA), pp. 4693–4700.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to algorithms. The MIT Press.
Coulom, R. (2006). Efficient selectivity and backup operators in Monte-Carlo tree search. Computers and Games, 5th International Conference, CG 2006, Turin, Italy, May 29–31, 2006.
Daniel, K., Nash, A., Koenig, S., & Felner, A. (2014). Theta*: Any-angle path planning on grids. Journal of Artificial Intelligence Research, 39(1), 533–579.
Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1, 269–271.
Dos Santos Mignon, A., & De Azevedo Da Rocha, R. L. (2017). An adaptive implementation of ε-greedy in reinforcement learning. Procedia Computer Science, 109, 1146–1151.
Durrant-Whyte, H., & Henderson, T. C. (2008). Multi-sensor data fusion. In B. Siciliano & O. Khatib (Eds.), Springer handbook of robotics. Springer.
Everett, M., Chen, Y., & How, J. P. (2018). Motion planning among dynamic, decision-making robots with deep reinforcement learning. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, pp. 3052–3059.
Evgeniou, T., & Pontil, M. (1999). Support vector machines: Theory and applications. In Advanced Course on Artificial Intelligence, Springer, Berlin, Heidelberg, pp. 249–257.
Fan, H., Zhu, F., Liu, C., Zhang, L., Zhuang, L., Li, D., Zhu, W., Hu, J., Li, H., & Kong, Q. (2018). Baidu Apollo EM motion planner. arXiv:1807.08048.
Farouki, R. T., & Sakkalis, T. (1994). Pythagorean-hodograph space curves. Advances in Computational Mathematics, 2(1), 41–66.
Ferguson, D., Howard, T. M., & Likhachev, M. (2008). Motion planning in urban environments. Journal of Field Robotics, 25(11–12), 939–960.
Ferguson, D., & Stentz, A. (2006). Using interpolation to improve path planning: The field D* algorithm. Journal of Field Robotics, 23(2), 79–101.
Fiorini, P., & Shiller, Z. (1998). Motion planning in dynamic environments using velocity obstacles. International Journal of Robotics Research, 17(7), 760–772.
Fortunato, M., Azar, M. G., Piot, B., et al. (2017). Noisy networks for exploration. arXiv:1706.10295 [cs.LG].
Fox, D., Burgard, W., & Thrun, S. (1997). The dynamic window approach to collision avoidance. IEEE Robotics and Automation Magazine, 4(1), 23–33.
Funke, J., Theodosis, P., Hindiyeh, R., et al. (2012). Up to the limits: Autonomous Audi TTS. IEEE Intelligent Vehicles Symposium (IV), Alcala De Henares, pp. 541–547.
G., Samuel, G. (2017). Google sibling Waymo launches fully autonomous ride-hailing service. The Guardian, 7.
Gao, W., Hsu, D., Lee, W. S., Shen, S., & Subramanian, K. (2017). Intention-net: Integrating planning and deep learning for goal-directed autonomous navigation. arXiv:1710.05627 [cs.AI].
Gilhyun, R. (2018). Applying asynchronous deep classification networks and gaming reinforcement learning-based motion planners to mobile robots. IEEE Robotics and Automation Society, pp. 6268–6275.
Guy, S. J., Chhugani, J., Kim, C., Satish, N., Lin, M., Manocha, D., & Dubey, P. (2009). ClearPath: Highly parallel collision avoidance for multi-agent simulation. Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 177–187.
Hart, P. E., Nilsson, N. J., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2), 100–107.
Hasselt, H. V., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 770–778.
Hessel, M., Modayil, J., Van, H. H., et al. (2017). Rainbow: Combining improvements in deep reinforcement learning. arXiv:1710.02298 [cs.AI].
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Indrajaya, M. A., Affandi, A., & Pratomo, I. (2015). Design of geographic information system for tracking and routing using Dijkstra algorithm for public transportation. 2015 1st International Conference on Wireless and Telematics (ICWT), pp. 1–4.
Inoue, M., Yamashita, T., & Nishida, T. (2019). Robot path planning by LSTM network under changing environment. In S. Bhatia, S. Tiwari, K. Mishra, & M. Trivedi (Eds.), Advances in computer communication and computational sciences. Springer.
Isele, D., Cosgun, A., Subramanian, K., & Fujimura, K. (2017). Navigating occluded intersections with autonomous vehicles using deep reinforcement learning. arXiv:1705.01196 [cs.AI].
Jeon, J. H., Cowlagi, R. V., Peter, S. C., et al. (2013). Optimal motion planning with the half-car dynamical model for autonomous high-speed driving. American Control Conference (ACC), pp. 188–193.
Ji, C., Lu, X., & Zhang, W. (2020). A biobjective optimization model for expert opinions aggregation and its application in group decision making. IEEE Systems Journal, 15(2), 2834–2844.
Jorgensen, T., & Tama, A. (2019). Harnessing reinforcement learning for neural motion planning. arXiv:1906.00214 [cs.RO].
Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. ICML, 2, 267–274.
Kalos, M. H., & Whitlock, P. A. (2008). Monte Carlo methods. Wiley-VCH. ISBN 978-3-527-40760-6.
Kavraki, L. E., Svestka, P., Latombe, J. C., & Overmars, M. H. (2002). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4), 566–580.
Khatib, O. (1986). Real-time obstacle avoidance for manipulators and mobile robots. International Journal of Robotics Research, 5(1), 90–98.
Kim, B., Kaelbling, L. P., & Lozano-Perez, T. (2019). Adversarial actor-critic method for task and motion planning problems using planning experience. AAAI Conference on Artificial Intelligence (AAAI), 33(01), 8017–8024.
Kinnunen, T., Sidoroff, I., Tuononen, M., & Fränti, P. (2011). Comparison of clustering methods: A case study of text-independent speaker modeling. Pattern Recognition Letters, 32(13), 1604–1617.
Konda, V. R., & Tsitsiklis, J. N. (2001). Actor-critic algorithms. Society for Industrial and Applied Mathematics, vol. 42.
Krogh, B. (1984). A generalized potential field approach to obstacle avoidance control. Proceedings of SME Conference on Robotics Research: The Next Five Years and Beyond, Bethlehem, PA.
LaValle, S. M., & Kuffner, J. J. (1999). Randomized kinodynamic planning. The International Journal of Robotics Research, 20(5), 378–400.
Lecun, Y., Bottou, L., Bengio, Y., & Patrick, H. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Lei, X., Zhang, Z., & Dong, P. (2018). Dynamic path planning of unknown environment based on deep reinforcement learning. Journal of Robotics, 2018, 1–10.
Likhachev, M., Ferguson, D., Gordon, G., Stentz, A., & Thrun, S. (2008). Anytime search in dynamic graphs. Artificial Intelligence, 172(14), 1613–1643.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2019). Continuous control with deep reinforcement learning. arXiv:1509.02971 [cs.LG].
Long, P., Fan, T., Liao, X., Liu, W., Zhang, H., & Pan, J. (2018). Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. arXiv:1709.10082v3 [cs.RO].
Mariescu, I. R., & Franti, P. (2018). Cell Net: Inferring road networks from GPS trajectories. ACM Transactions on Spatial Algorithms and Systems, 4(3), 1–22.
Masoud, A. A. (2007). Decentralized self-organizing potential field-based control for individually motivated mobile agents in a cluttered environment: A vector-harmonic potential field approach. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 37(3), 372–390.
Meyes, R., Tercan, H., Roggendorf, S., Thomas, T., Christian, B., Markus, O., Christian, B., Sabina, J., & Tobias, M. (2017). Motion planning for industrial robots using reinforcement learning. 50th CIRP Conference on Manufacturing Systems, 63, 107–112.
Meystel, A. (1990). Knowledge based nested hierarchical control. Advances in Automation and Robotics, 2, 63–152.
Minguez, J., Lamiraux, F., & Laumond, J. P. (2008). Motion planning and obstacle avoidance. Springer International Publishing.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). Playing Atari with deep reinforcement learning. arXiv:1312.5602 [cs.LG].
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. arXiv:1602.01783 [cs.LG].
Montemerlo, M., Becker, J., Bhat, S., et al. (2008). Junior: The Stanford entry in the urban challenge. Journal of Field Robotics, 25(9), 569–597.
Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. arXiv:1606.02647 [cs.LG].
Murphy, R. R. (2000). Introduction to AI robotics. MIT Press.
Oh, J., Guo, Y., Singh, S., & Lee, H. (2018). Self-imitation learning. arXiv:1806.05635v1 [cs.LG].
Panov, A. I., Yakovlev, K. S., & Suvorov, R. (2018). Grid path planning with deep reinforcement learning: Preliminary results. Procedia Computer Science, 123, 347–353.
Paxton, C., Raman, V., Hager, G. D., & Kobilarov, M. (2017). Combining neural networks and tree search for task and motion planning in challenging environments. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, pp. 6059–6066.
Qureshi, A. H., Simeonov, A., Bency, M. J., & Yip, M. C. (2018). Motion planning networks. arXiv:1806.05767 [cs.RO].
Reeds, J. A., & Shepp, L. A. (1990). Optimal paths for a car that goes both forward and backward. Pacific Journal of Mathematics, 145(2), 367–393.
Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. University of Cambridge.
Salemi, B., Moll, M., & Shen, W. M. (2006). SUPERBOT: A deployable, multi-functional and modular self-reconfigurable robotic system. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3636–3641.
Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. International Conference on Learning Representations (ICLR).
Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2017). Trust region policy optimization. arXiv:1502.05477v5 [cs.LG].
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347v2 [cs.LG].
Silver, D., Huang, A., Maddison, C. J., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, vol. 32, pp. 387–395.
Smart, W. D., & Kaelbling, L. P. (2002). Effective reinforcement learning for mobile robots. IEEE International Conference on Robotics and Automation, vol. 4.
Stentz, A. (1994). Optimal and efficient path planning for partially-known environments. Robotics and Automation, pp. 203–220.
Sui, Z., Pu, Z., Yi, J., & Tian, X. (2018). Path planning of multiagent constrained formation through deep reinforcement learning. 2018 International Joint Conference on Neural Networks (IJCNN).
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
Sutton, R., McAllester, D. A., Singh, S., & Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, MIT Press, Cambridge, MA, USA, pp. 1057–1063.
Sutton, R. S., Maei, H. R., Precup, D., et al. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. 26th International Conference on Machine Learning, Montreal, Canada.
Tobaruela, J. A. (2012). Reactive and path-planning methods for mobile robot navigation. PhD dissertation, Universitat de les Illes Balears. http://hdl.handle.net/11201/151880.
Tsitsiklis, J. N. (2003). On the convergence of optimistic policy iteration. Journal of Machine Learning Research, 3(1), 59–72.
Van den Berg, J., Lin, M., & Manocha, D. (2008). Reciprocal velocity obstacles for real-time multi-agent navigation. IEEE International Conference on Robotics and Automation, pp. 1928–1935.
Van den Berg, J., Guy, S. J., Lin, M., & Manocha, D. (2011). Reciprocal n-body collision avoidance. Robotics Research, pp. 3–19.
Wang, L. (2005). On the Euclidean distance of image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1334–1339.
Wang, F., Qian, Z., Yan, Z., Yuan, C., & Zhang, W. (2020). A novel resilient robot: Kinematic analysis and experimentation. IEEE Access, 8, 2885–2892.
Wang, Z., Freitas, N., & Lanctot, M. (2015). Dueling network architectures for deep reinforcement learning. arXiv:1511.06581 [cs.LG].
Weston, J., & Watkins, C. (1998). Multi-class support vector machines. Technical report, Department of Computer Science, Royal Holloway, University of London, May 20.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
Xu, W., Wei, J., Dolan, J. M., Zhao, H., & Zha, H. (2012). A real-time motion planner with trajectory optimization for autonomous vehicles. IEEE International Conference on Robotics and Automation, pp. 2061–2067.
Zhang, W. J., Wang, J. W., & Lin, Y. (2019). Integrated design and operation management for enterprise systems. Enterprise Information Systems, 13(4), 424–429.
Zhang, J. (2019). Gradient descent based optimization algorithms for deep learning models training. arXiv:1903.03614v1 [cs.LG].
Ziegler, J., & Stiller, C. (2009). Spatiotemporal state lattices for fast trajectory planning in dynamic on-road driving scenarios. IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1879–1884.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.