
Journal of Intelligent Manufacturing (2022) 33:387–424

https://doi.org/10.1007/s10845-021-01867-z

A review of motion planning algorithms for intelligent robots


Chengmin Zhou1 · Bingding Huang2 · Pasi Fränti1

Received: 11 February 2021 / Accepted: 12 October 2021 / Published online: 25 November 2021
© The Author(s) 2021

Abstract
Principles of typical motion planning algorithms are investigated and analyzed in this paper. These algorithms include
traditional planning algorithms, classical machine learning algorithms, optimal value reinforcement learning, and policy
gradient reinforcement learning. Traditional planning algorithms investigated include graph search algorithms, sampling-
based algorithms, interpolating curve algorithms, and reaction-based algorithms. Classical machine learning algorithms
include multiclass support vector machine, long short-term memory, Monte-Carlo tree search and convolutional neural
network. Optimal value reinforcement learning algorithms include Q learning, deep Q-learning network, double deep Q-
learning network, and dueling deep Q-learning network. Policy gradient algorithms include the policy gradient method, actor-critic
algorithm, asynchronous advantage actor-critic, advantage actor-critic, deterministic policy gradient, deep deterministic
policy gradient, trust region policy optimization and proximal policy optimization. New general criteria are also introduced
to evaluate the performance and application of motion planning algorithms by analytical comparisons. The convergence
speed and stability of optimal value and policy gradient algorithms are specially analyzed. Future directions are presented
analytically according to principles and analytical comparisons of motion planning algorithms. This paper provides researchers
with a clear and comprehensive understanding about advantages, disadvantages, relationships, and future of motion planning
algorithms in robots, and paves the way for better motion planning algorithms in academia, engineering, and manufacturing.

Keywords Motion planning · Path planning · Intelligent robots · Reinforcement learning · Deep learning

Introduction

Intelligent robots nowadays serve people from different backgrounds in dense and dynamic shopping malls, train stations and airports (Bai et al., 2015) like Daxing in Beijing and Changi in Singapore. Intelligent robots guide pedestrians to find coffee houses, departure gates and exits via accurate motion planning, and assist pedestrians in luggage delivery. Another example of the intelligent robot is the parcel delivery robot from e-commerce tech giants like JD in China and Amazon in the US. Researchers in tech giants make it possible for robots to navigate themselves autonomously and avoid dynamic and uncertain obstacles by applying motion planning algorithms to accomplish parcel delivery tasks. In short, the intelligent robot gradually plays a significant role in the service industry, agricultural production, the manufacturing industry and dangerous scenarios like nuclear radiation environments to replace human manipulation, so that the risk of injury is reduced and efficiency is improved.

Research on motion planning is going through a flourishing period due to the development and popularity of deep learning (DL) and reinforcement learning (RL), which have better performance in coping with complex non-linear problems. The complexity of these problems generally refers to uncertainty, ambiguity, and incompleteness (Cai et al., 2017), especially the uncertainty that is the most challenging issue in robotic motion planning. Many universities, tech giants, and research groups all over the world therefore devote much importance, time, and energy to developing new motion planning techniques by applying DL algorithms or integrating traditional motion planning algorithms with advanced machine learning (ML) algorithms. The autonomous vehicle is an example. Among tech giants, Google initiated their self-driving project named Waymo in 2016 (Samuel, 2017). In 2017, Tesla pledged a fully self-driving capable vehicle (Bilbeisi & Kesse, 2017).

Corresponding author: Pasi Fränti, [email protected]
1 Machine Learning Group, School of Computing, University of Eastern Finland, Joensuu, Finland
2 College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China

Fig. 1 Three types of robotic platform. The first and second figures represent wheel-based chassis (Minguez et al., 2008): the first figure represents an Ackerman-type (car-like) chassis, while the second figure represents a differential-wheel chassis. The third and fourth figures represent the four-leg dog "SpotMini" from Boston Dynamics and the robotic arm (Meyes et al., 2017)

An autonomous car from Baidu had been tested successfully on highways of Beijing in 2017 (Fan et al., 2018), and man-manipulated buses had already been replaced by autonomous buses from Huawei in some specific areas of Shenzhen. Other companies in traditional vehicle manufacturing, like Audi and Toyota, also have their own experimental autonomous vehicles. Among research institutes and universities, Navlab (navigation lab) of Carnegie Mellon, Oxford University and MIT are leading research institutes. Up to 2020, European countries like Belgium, France, Italy, and the UK were planning to operate transport systems for autonomous vehicles. Twenty-nine US states had passed laws permitting autonomous vehicles. The autonomous vehicle is therefore expected to spread widely in the near future with the improvement of traffic laws.

Motion planning and robotic platform Robots use motion planning algorithms to plan their trajectories at both the global and local levels. Human-like and dog-like robots from Boston Dynamics and the autonomous robotic car from MIT (Everett et al., 2018) are good examples. All of them leverage motion planning algorithms to enable robots to move freely in dense and dynamic scenarios, both indoors and outdoors. The chassis of robots has two types of wheels: the Ackerman-type wheel and the differential wheel (Fig. 1).

In Ackerman-type robots, two front wheels steer the robot, while two rear wheels drive the robot. The Ackerman-type chassis has two servos. The two front wheels share the same servo, which means these two wheels steer with the same steering angle or range ϕ (Fig. 1). The two rear wheels share another servo to control the speed of the robot. The robot using differential wheels, however, is completely different from the Ackerman-type robot in the function of its servos. The chassis with differential wheels generally has two servos, and each wheel is controlled by one servo for forwarding. Steering is realized by giving different speeds to each wheel. The steering range in Ackerman-type robots is limited because the two front wheels steer with the same angle ϕ. The Ackerman-type wheel is therefore suitable for high-speed outdoor scenarios because of its stability. Robots with differential wheels, however, can steer in any angle in (0, 2π], which means they can change their yaw solely without changing their position (x, y). Robots with differential wheels are also sensitive to the speed difference of the two wheels. The sensitivity depends on the ratio of the gearing steer mechanism that yields the speed reduction and angular moment rotation. This means such robots are flexible in low-speed indoor scenarios but very dangerous in high-speed situations if something goes wrong in the speed control of the two wheels, because small speed changes of the two wheels of a differential chassis can be exaggerated and an accident follows.

It has become popular to use legs in the chassis of robots in recent years. Typical examples are the human-like and animal-like (dog-like, Fig. 1) robots from Boston Dynamics. The robotic arm (Fig. 1) is also a popular platform to deploy motion planning algorithms. In summary, wheels, arms, and legs are choices of chassis to implement motion planning algorithms, which are widely used in academic and industrial scenarios including commercial autonomous driving, service robots, surgery robots and industrial arms.

Architecture of robots The classical hierarchical robotic architecture (Meystel, 1990) in Fig. 2a is composed of three stages: sense, plan and act (Murphy, 2000). Robots with this architecture can be successfully used in simple applications. It can generate long-term action plans; however, researchers are unsatisfied with the slow speed of this architecture in updating the world model and navigation plan when coping with an uncertain environment. The reactive architecture (Brooks, 1986) in Fig. 2b is therefore introduced to cope with uncertain scenarios. The reactive architecture is designed to output instant responses through a sense-act structure (Murphy, 2000). Reactive strategies or algorithms (e.g., potential fields) originate from the intuitive responses of animals, and they are computationally inexpensive. However, a robot based on the reactive architecture is short-sighted; it cannot generate long-term plans to fulfill challenging tasks. The hybrid deliberative/reactive architecture in Fig. 2c fuses the advantages of hierarchical and reactive architectures, and it is also successfully used in autonomous robots (Arkin et al., 1987; Murphy, 2000).


The hybrid deliberative/reactive architecture uses the deliberative layer to realize high-level long-term planning, while the reactive layer is used to realize local reactive planning. The hybrid deliberative/reactive architecture is still widely used in robots nowadays. A typical example is that: (1) in the deliberative layer, world maps of the environment are constructed using information from sensors like light detection and ranging (LiDAR), and planning algorithms (e.g., A*) then plan high-level paths; (2) in the reactive layer, reactive strategies, like the dynamic window approach (DWA) for local planning and PID for speed control, are used to make local plans or instant reactions to cope with dynamic and uncertain scenarios. Finally, high-level planning, local planning and instant reactions are evaluated in the behavior manager to generate a better combined plan.

However, the autonomous robotic architecture is evolving towards the simpler architecture in Fig. 2d with the development of DL and RL algorithms in recent years. For example, in recent works (Chen et al., 2016, 2019, 2020; Everett et al., 2018; Long et al., 2018): (1) the goal's information (e.g., position of goals), the sensors' information (e.g., distances to other robots) and the attributes of robots (e.g., radius) are combined to form the features of robots; (2) more features of robots are obtained by interacting with the environment, and feedbacks (rewards) are obtained accordingly; (3) these features are recorded by the networks as the world model that will be updated according to feedbacks, and this is followed by obtaining a converged model; (4) the previous procedures are defined as the trainer to obtain the world model, and then time-sequential actions are generated in the navigator by performing the trained world model to navigate robots to destinations; (5) however, these time-sequential actions cannot be recognized by the actuators of robots (e.g., motors), therefore it is necessary to use the parser to parse them into proper formats that can be executed by the actuators. This architecture of current autonomous robots can be simply described as five functional modules: feature extraction, environment perception, environment understanding, time-sequential navigation, and decision execution. It can also be simply divided into three stages: sense and train, plan, and act.

The advantages of the recent autonomous robotic architecture are simplicity, safety, and efficiency. Unlike Fig. 2c, there is no clear boundary between the deliberative and reactive layers in Fig. 2d. Time-sequential planning can realize long-term planning goals, local planning goals or quick responses with safety (e.g., safe distances to other objects) and efficiency (e.g., shortest path, shortest time) at the same time. Disadvantages of the recent autonomous robotic architecture are the expensive computation cost and poor network convergence, especially when networks are trained with data from large outdoor scenarios.

Fig. 2 Architectures of autonomous robots. (a–c) denote the classical autonomous robotic architecture, while (d) denotes the recent trend of autonomous robotic architecture that features DL and RL

Motion planning and path planning Motion planning is the extension of path planning. They are almost the same term, but a few differences exist. For example, path planning aims at finding the path between the origin and destination in the workspace by strategies like shortest distance or shortest time (Fig. 3); the path is therefore planned at the global metric or topological level. Motion planning, however, aims at generating interactive trajectories in the workspace when robots interact with a dynamic environment; motion planning therefore needs to consider the kinetic features, velocities and poses of robots and dynamic objects nearby (Fig. 3) when robots move towards the goal. Note that the workspace here is an area where an algorithm works, or where the task exists.

To conclude, on one hand, motion planning must consider short-term optimal or suboptimal reactive strategies to make instant or reactive responses. This is achieved by rotary or linear control in hardware (e.g., motor, servo) from the perspective of robotic and control engineering.


Fig. 2 continued. (c) Hybrid deliberative/reactive architecture for autonomous robots (Arkin et al., 1987; Murphy, 2000), with a deliberative layer (mission planner, cartographer, sequencer, performance monitoring agent, navigator, pilot, homeostatic control) and a reactive layer (motor schema manager, behavioral manager, sensors and actuators). (d) Recent trend of autonomous robotic architecture: three stages (sense and train, plan, act) realized by the trainer (world model networks built from goal information, sensor information and feedbacks/rewards), the navigator, and the parser driving the actuators; the abstract functional modules are feature extraction, environment perception, environment understanding, time-sequential navigation, and decision execution.

On the other hand, motion planning should achieve long-term optimal planning goals as path planning does when robots interact with the environment.

Classification of planning algorithms Robotic planning algorithms can be divided into two categories: traditional algorithms and ML-based algorithms, according to their principles and the era in which they were invented.


Fig. 3 Path planning and motion planning. The left figure denotes a planned path based on shortest distance and time, and the path is generated from a high or global level. The right figure denotes the famous piano mover's problem that not only considers planning a path from the global level, but also considers the kinetic features, speeds and poses of the piano

Fig. 4 Two categories of robotic planning algorithm: traditional algorithms (graph search algorithms, sampling-based algorithms, interpolating curve algorithms, and reaction-based algorithms) and ML-based algorithms (classical ML algorithms, optimal value RL, and policy gradient RL)

Traditional algorithms are composed of four groups, including graph search algorithms (e.g., A*), sampling-based algorithms like the rapidly-exploring random tree (RRT), interpolating curve algorithms (e.g., line and circle), and reaction-based algorithms (e.g., DWA). ML-based planning algorithms include classical ML algorithms like the support vector machine (SVM), optimal value RL like the deep Q-learning network (DQN), and policy gradient RL (e.g., the actor-critic algorithm). Categories of planning algorithms are summarized in Fig. 4.

Development of ML-based algorithms Classical ML algorithms, like SVM, were used to implement simple motion planning at an earlier stage, but their performance is poor because SVM is short-sighted due to its one-step prediction. It requires well-prepared vectors as inputs, which cannot fully represent the features of image-based datasets. Significant improvements in extracting high-level features from images were made after the invention of the convolutional neural network (CNN) (Lecun et al., 1998). CNN is widely used in many image-related tasks including motion planning, but it cannot cope with complex time-sequential motion planning problems. These better suit the Markov chain (Chan et al., 2012) and long short-term memory (LSTM) (Inoue et al., 2019). Neural networks are then combined with LSTM or algorithms that are based on the Markov chain (e.g., Q learning (Smart & Kaelbling, 2002)) to implement time-sequential motion planning. However, the efficiency is limited (e.g., poor performance in network convergence). A breakthrough was made when Google DeepMind introduced the nature DQN (Mnih et al., 2013, 2015), in which the replay buffer is used to reuse old data to improve efficiency. Performance in robustness, however, is limited because of noise that impacts the estimation of the state-action value (Q value). Double DQN (Hasselt et al., 2016; Sui et al., 2018) and dueling DQN (Wang et al., 2015) were therefore invented to cope with problems caused by noise. Double DQN utilizes another network to evaluate the estimation of the Q value in DQN to reduce noise, while the advantage value (A value) is utilized in dueling DQN to obtain a better Q value, and noise is mostly reduced. Q learning, DQN, double DQN and dueling DQN are all based on optimal values (Q value and A value) to select optimal time-sequential actions. These algorithms are therefore called optimal value algorithms. Implementation of optimal value algorithms, however, is computationally expensive.

Optimal value algorithms were later replaced by the policy gradient method (Sutton et al., 1999), in which the gradient approach (Zhang, 2019) is directly utilized to upgrade the policy that is used to generate optimal actions. The policy gradient method is more stable in network convergence, but it lacks efficiency in the speed of network convergence. The actor-critic algorithm (Cormen et al., 2009; Konda & Tsitsiklis, 2001) improves the speed of convergence by the actor-critic architecture. However, the improvement in convergence speed is


achieved by sacrificing the stability of convergence, therefore the network of the actor-critic algorithm is hard to converge in earlier-stage training. The asynchronous advantage actor-critic (A3C) (Gilhyun, 2018; Mnih et al., 2016), advantage actor-critic (A2C)1 (Babaeizadeh et al., 2016), trust region policy optimization (TRPO) (Schulman et al., 2017a) and proximal policy optimization (PPO) (Schulman et al., 2017b) algorithms were then invented to cope with this shortcoming. A multi-thread technique (Mnih et al., 2016) is utilized in A3C and A2C to accelerate the speed of convergence, while TRPO and PPO improve the policy of the actor-critic algorithm by introducing a trust region constraint in TRPO, and a "surrogate" objective and adaptive penalty in PPO, to improve the speed and stability of convergence. Data, however, is dropped after training, and new data must therefore be collected to train the network until the convergence of the network.

Off-policy gradient algorithms, including the deterministic policy gradient (DPG) (Silver et al., 2014) and deep DPG (DDPG) (Lillicrap et al., 2019; Munos et al., 2016), are invented to reuse data through the replay buffer. DDPG fuses the actor-critic architecture and the deterministic policy to enhance the convergence speed. In summary, classical ML, optimal value RL, and policy gradient RL are the typical ML algorithms in robotic motion planning, and the development of these ML-based motion planning algorithms is shown in Fig. 5.

Fig. 5 Development of ML-based robotic motion planning algorithms (classical ML: SVM, CNN, LSTM, MCTS; optimal value RL: Q learning, DQN, double DQN, dueling DQN; policy gradient RL: policy gradient, actor-critic, A3C, A2C, TRPO, PPO, DPG, DDPG). These algorithms evolve from classical ML to optimal value RL and policy gradient RL. Classical ML cannot address the time-sequential planning problem but RL copes with it well. Optimal value RL suffers from slow and unstable convergence but policy gradient RL performs better in network convergence
In this paper, state-of-the-art ML-based algorithms are investigated and analyzed to provide researchers with a comprehensive and clear understanding of the functions, structures, advantages, and disadvantages of planning algorithms. We also summarize new criteria to evaluate the performance of planning algorithms. Potential directions for making practical optimizations in motion planning algorithms are discussed simultaneously. The contributions of this paper include: (1) a survey of traditional planning algorithms; (2) detailed investigations of classical ML, optimal value RL and policy gradient RL for robotic motion planning; (3) analytical comparisons of these algorithms according to new evaluation criteria; (4) analysis of future directions.

This paper is organized as follows: Sections "Traditional planning algorithms", "Classical ML", "Optimal value RL" and "Policy gradient RL" present the principles and applications of traditional planning algorithms, classical ML, optimal value RL and policy gradient RL in robotic motion planning; Section VI presents analytical comparisons of these algorithms and criteria for performance evaluation; Section VII analyzes future directions of robotic motion planning.

Traditional planning algorithms

Traditional planning algorithms can be divided into four groups: graph-search, sampling-based, interpolating curve, and reaction-based algorithms. They will be described in detail in the following sections.

Graph-search algorithms

Graph-search algorithms can be divided into depth-first search, breadth-first search, and best-first search (Dijkstra, 1959). The depth-first search algorithm builds a search tree as deep and fast as possible from the origin to the destination until a proper path is found. The breadth-first search algorithm shares similarities with the depth-first search algorithm by building a search tree. The search tree in the breadth-first search algorithm, however, is built by extending the tree as broad and quick as possible until a proper path is found. The best-first search algorithm adds a numerical criterion (value or cost) to each node and edge in the search tree. According to that, the search process is guided by the calculation of values in the search tree to decide: (1) whether the search tree should be expanded; (2) which branch in the search tree should be extended. The process of building search trees repeats until a proper path is found.

1 OpenAI Baselines: ACKTR and A2C. Web. August 18, 2017. https://openai.com/blog/baselines-acktr-a2c.


Fig. 6 Steps of the Dijkstra algorithm (a): graph establishment, vertex selection, distance calculation, distance update, and vertex marking; and road networks in web maps (b) (Indrajaya et al., 2015; Mariescu & Franti, 2018). Web maps are based on GPS data. The road network is mapped into a graph composed of nodes and edges, therefore graph search algorithms like A* and Dijkstra's algorithms can be used in these graphs

Graph search algorithms are composed of many algorithms. The most popular are Dijkstra's algorithm (Dijkstra, 1959) and the A* algorithm (Hart et al., 1968).

Dijkstra's algorithm is one of the earliest optimal algorithms based on the best-first search technique to find the shortest paths among nodes in a graph. Finding the shortest paths in a road network is a typical example. Steps of the Dijkstra algorithm (Fig. 6) include: (1) converting the road network to a graph, in which the distances between nodes are expected to be found by exploration; (2) picking the unvisited node with the lowest distance from the source node; (3) calculating the distance from the picked node to each unvisited neighbor and updating the distance of all neighbor nodes if the distance through the picked node is smaller than the previous distance; (4) marking the node as visited when the calculation of the distance to all neighbors is done. The previous steps repeat until the shortest distance between the origin and destination is found. Dijkstra's algorithm can be divided into two versions: the forward version and the backward version. The calculation of the overall cost in the backward version, called cost-to-come, is accomplished by estimating the minimum distance from the selected node to the destination, while the estimation of the overall cost in the forward version, called cost-to-go, is realized by estimating the minimum distance from the selected node to the initial node. In most cases, nodes are expanded according to the cost-to-go.

The A* algorithm is based on the best-first search, and it utilizes a heuristic function to find the shortest path by estimating the overall cost. The algorithm differs from Dijkstra's algorithm in the estimation of the path cost. The cost estimation of a node i in a graph by A* is as follows: (1) estimate the cost between the initial node and node i; (2) find the nearest neighbor j of node i and estimate the distance between nodes j and i; (3) estimate the distance between node j and the goal node. The overall estimated cost is the sum of these three factors:

$C_i = c_{start,i} + \min_j \left( d_{i,j} + d_{j,goal} \right)$   (1)

where $C_i$ represents the overall estimated cost of node i, $c_{start,i}$ the estimated cost from the origin to node i, $d_{i,j}$ the estimated distance from node i to its nearest node j, and $d_{j,goal}$ the estimated distance from node j to the goal node. The A* algorithm has a long history in path planning for robots. A common application of the A* algorithm is mobile rover planning via an occupancy grid map (Fig. 7) using the Euclidean distance (Wang, 2005). There are many variants of the A* algorithm, like dynamic A* and dynamic D* (Stentz, 1994), Field D* (Ferguson & Stentz, 2006), Theta* (Daniel et al., 2014), Anytime Repairing A* (ARA*) and Anytime D* (Likhachev et al., 2008), hybrid A* (Montemerlo et al., 2008), and AD* (Ferguson et al., 2008). Other graph search algorithms differ in the robotic grid map they use. For example, the state lattice algorithm (Ziegler & Stiller, 2009) uses a grid map with a specific shape (Fig. 7), while the grid in a normal robotic map has a square-grid shape (Fig. 7).
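To make the best-first idea concrete, the following is a minimal sketch of an A*-style search on a square occupancy grid with a Euclidean heuristic. It is an illustration only: the grid, start, goal and unit step cost are assumptions, and it uses the standard f = g + h estimate rather than the three-factor cost of Eq. 1.

```python
import heapq, itertools, math

def astar(grid, start, goal):
    """Best-first (A*-style) search on an occupancy grid (0 = free, 1 = obstacle)."""
    rows, cols = len(grid), len(grid[0])
    h = lambda n: math.dist(n, goal)          # heuristic: Euclidean distance to goal
    tie = itertools.count()                   # tie-breaker so the heap never compares nodes
    open_set = [(h(start), next(tie), start)]
    g_cost, parent, closed = {start: 0.0}, {start: None}, set()
    while open_set:
        _, _, node = heapq.heappop(open_set)
        if node in closed:
            continue
        closed.add(node)
        if node == goal:                      # reconstruct the path backwards
            path = [node]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = (r + dr, c + dc)
            if 0 <= nb[0] < rows and 0 <= nb[1] < cols and grid[nb[0]][nb[1]] == 0:
                ng = g_cost[node] + 1.0       # unit cost per grid move (assumed)
                if ng < g_cost.get(nb, float("inf")):
                    g_cost[nb] = ng
                    parent[nb] = node
                    heapq.heappush(open_set, (ng + h(nb), next(tie), nb))
    return None

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))
```

Setting h(n) = 0 in the same routine reduces it to Dijkstra's algorithm, which is one way to see the relation between the two methods.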


Fig. 7 The left figure represents a specific grid map in the State Lattice algorithm (Ziegler & Stiller, 2009), while the right figure represents a normal
square-grid (occupancy grid) map in the robot operating system (ROS)

Sampling-based algorithms

Sampling-based algorithms randomly sample a fixed workspace to generate sub-optimal paths. The RRT and the probabilistic roadmap method (PRM) are two algorithms that are commonly utilized in motion planning. The RRT algorithm is more popular and widely used for commercial and industrial purposes. It constructs a tree that attempts to explore the workspace rapidly and uniformly via a random search (LaValle & Kuffner, 1999). The RRT algorithm can consider non-holonomic constraints, such as the maximum turning radius and momentum of the vehicle (Bautista et al., 2015). An example of trajectories generated by RRT is shown in Fig. 8. The PRM algorithm (Kavraki et al., 2002) is normally used in a static scenario. It is divided into two phases: the learning phase and the query phase. In the learning phase, a collision-free probabilistic roadmap is constructed and stored as a graph. In the query phase, a path that connects the original and targeted nodes is searched from the probabilistic roadmap. An example of a trajectory generated by PRM is shown in Fig. 8.

Fig. 8 Trajectories planned by the RRT and PRM. The left figure represents trajectories planned by the RRT algorithm (Jeon et al., 2013), and the right figure represents the trajectory planned by the PRM algorithm (Kavraki et al., 2002)

Interpolating curve algorithms

The interpolating curve algorithm is defined as a process that constructs or inserts a set of mathematical rules to draw trajectories. The interpolating curve algorithm is based on techniques, e.g., computer aided geometric design (CAGD), to draw a smooth path. Mathematical rules are used for path smoothing and curve generation. Typical path smoothing and curve generation rules include line and circle (Reeds & Shepp, 1990), clothoid curves (Funke et al., 2012), polynomial curves (Xu et al., 2012), Bezier curves (Bautista et al., 2014) and spline curves (Farouki & Sakkalis, 1994). Examples of trajectories are shown in Fig. 9.

Reaction-based algorithms

Unlike graph-search algorithms that take a longer time to plan high-level or global-level paths, reaction-based algorithms are about making reactions or doing local path planning quickly and intuitively, as in the description of the algorithms in the reactive architecture (Murphy, 2000). Here three reaction-based algorithms that are widely used in engineering and manufacturing are presented: the potential field method (PFM), the velocity obstacle method (VOM), and DWA.

Fig. 9 Trajectories generated by mathematical rules (Bautista et al., 2014; Farouki & Sakkalis, 1994; Funke et al., 2012; Reeds & Shepp, 1990; Xu
et al., 2012)


Fig. 10 Different types of potential field. (a–e) denote five primitive potential fields: uniform, perpendicular, attraction, repulsion, and tangential. (f) denotes a potential field combined by attraction (goal) and repulsion (obstacle) (Murphy, 2000)

PFM (Khatib, 1986) is about using vectors to represent behaviors and using vector summation to combine vectors from different behaviors to produce an emergent behavior (Murphy, 2000). A potential field is a differentiable real-valued function U whose value can be seen as energy, and its gradient can be seen as a force. If the potential field function U is defined artificially, it is called an artificial potential field (APF). Its gradient ∇U(x), where x denotes a robot configuration (e.g., positions of robots), is a vector which points in the local direction that maximally increases U (Tobaruela, 2012). Hence, robots in a potential field or combined potential field (Fig. 10) will be forced to move along the gradient of the potential field to maximize U.

Shortcomings of PFM include: (1) local minima, if the potential field converges to a minimum that is not the global minimum; (2) oscillation of motion when robots navigate among very close obstacles at high speed; (3) the impossibility of going through small openings. These shortcomings can be solved or partially solved by potential field variants (e.g., the generalized potential fields method (GPFM) (Krogh, 1984), virtual force field (VFF) (Borenstein & Koren, 1989), vector field histogram (VFH) (Borenstein & Koren, 1991) and harmonic potential field (HPF) (Masoud, 2007)) in real-world engineering and manufacturing.

VOM (Fiorini & Shiller, 1998) relies on the current positions and velocities of robots and obstacles to compute a reachable avoidance velocity space (RAV), and then a proper avoidance maneuver (velocity) is selected from the RAV to avoid static and moving obstacles (Fiorini & Shiller, 1998). To compute a RAV (Fig. 11): (1) the velocity obstacle (VO) must be obtained; VO is a velocity set or space, and the selection of a velocity from VO will lead to collision. (2) A set of reachable velocities (RV) should be obtained; this is achieved by mapping the actuator constraints to acceleration constraints (Fiorini & Shiller, 1998). (3) The RAV is obtained by computing the difference between RV and VO.

To select a proper avoidance maneuver (Fig. 11), the exhaustive global search method and the heuristic search method in the RAV are suitable for off-line and on-line cases, respectively: (1) A search tree can be obtained by expanding the tree on the RAV. A proper avoidance maneuver can be selected from the search tree according to the assigned cost on the branches of the search tree. The cost is relevant to some objective functions (e.g., distance traveled, motion time and energy). The search tree is expanded off-line, therefore near-optimal trajectories that lead to the shortest time or distance can be obtained (Fiorini & Shiller, 1998). (2) The heuristic search costs less time on the search process, and it is designed to select specific velocities that can realize special goals (e.g., the highest avoidance velocity towards the goal, the maximum avoidance velocity, and velocities that ensure desired trajectory structures).

However, collisions with obstacles still exist when using the velocity obstacle method in complex scenarios like dense and dynamic cases.
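As a concrete illustration of the VO/RAV computation described above, the following is a minimal sketch (not the authors' implementation): candidate velocities that lie inside the velocity obstacle of a single moving circular obstacle are discarded, the rest form the reachable avoidance velocities, and a heuristic picks the velocity that points most directly at the goal. The 2-D geometry, the sampling of the reachable set and the goal-heading criterion are illustrative assumptions.

```python
import numpy as np

def in_velocity_obstacle(v_robot, p_robot, v_obs, p_obs, r_sum):
    """True if relative velocity v_robot - v_obs leads to collision with a
    circular obstacle of combined radius r_sum (simplified VO test)."""
    d = np.asarray(p_obs, float) - np.asarray(p_robot, float)   # robot -> obstacle
    u = np.asarray(v_robot, float) - np.asarray(v_obs, float)   # relative velocity
    if np.linalg.norm(d) <= r_sum:                               # already overlapping
        return True
    t = np.dot(d, u)
    if t <= 0:                                                   # moving away
        return False
    closest = np.linalg.norm(d - (t / np.dot(u, u)) * u)
    return closest < r_sum                                       # relative-motion ray hits the disc

def pick_avoidance_velocity(candidates, p_robot, goal, v_obs, p_obs, r_sum):
    """RAV = reachable velocities minus the VO; heuristic search keeps the
    velocity with the best heading towards the goal."""
    to_goal = np.asarray(goal, float) - np.asarray(p_robot, float)
    to_goal = to_goal / np.linalg.norm(to_goal)
    rav = [v for v in candidates
           if not in_velocity_obstacle(v, p_robot, v_obs, p_obs, r_sum)]
    return max(rav, key=lambda v: np.dot(v, to_goal), default=None)

# Illustrative use: sampled reachable velocities of a robot at the origin.
candidates = [np.array([vx, vy]) for vx in np.linspace(-1, 1, 9)
                                  for vy in np.linspace(-1, 1, 9)]
best = pick_avoidance_velocity(candidates, p_robot=[0, 0], goal=[5, 0],
                               v_obs=[0, 0], p_obs=[2, 0], r_sum=0.8)
print(best)
```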


Fig. 11 The principle of VOM. (a)–(c) denote the VO, RV, and RAV. (d) denotes the exhaustive search in the search tree. (e)–(g) denote the heuristic search method to select the highest avoidance velocity towards the goal, the maximum avoidance velocity, and velocities that ensure desired trajectory structures (Fiorini & Shiller, 1998)

Hence, some optimized velocity obstacle methods, like the reciprocal velocity obstacle (RVO) (Berg et al., 2008, 2011; Guy et al., 2009), are introduced to better avoid collisions.

DWA (Fox et al., 1997) is about choosing a proper translational and rotational velocity (v, w) that will maximize an objective function within the dynamic window. The objective function includes a measure of progress towards a goal location, the forward velocity of the robot, and the distance to the next obstacle on the trajectory. A proper velocity (v, w) is selected within the dynamic window (a search space of velocities) which consists of the velocities reachable within a short time interval. This is achieved by: (1) computing a two-dimensional possible velocity search space Vs that is related to the circular trajectories (curvatures) uniquely determined by (v, w); (2) computing the admissible velocities Va that ensure the robot can stop before it reaches the closest obstacle on the corresponding curvature; (3) computing the dynamic window velocities Vd, since all curvatures outside the dynamic window cannot be reached within the next time interval; (4) selecting a proper velocity from the resulting search space, which consists of the resulting velocities Vr defined by Vr = Vs ∩ Va ∩ Vd (Fox et al., 1997). The relationship of these velocity search spaces is shown in Fig. 12, where the resulting search space is the white area in the dynamic window.

Fig. 12 The relationship of the possible velocity search space Vs, admissible velocities Va, dynamic window velocities Vd, and resulting velocities Vr (Stentz, 1994)
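A minimal sketch of the DWA velocity selection just described (not the original implementation): candidate (v, w) pairs are sampled inside an assumed dynamic window, inadmissible pairs are discarded with a simple stopping-distance test, and the rest are scored by an objective combining heading to the goal, clearance and forward velocity. The window limits, weights and obstacle model are illustrative assumptions.

```python
import numpy as np

def dwa_select(pose, v_now, w_now, goal, obstacles,
               v_lim=(0.0, 1.0), w_lim=(-1.5, 1.5),
               acc_v=0.5, acc_w=2.0, dt=0.5, horizon=2.0):
    """Pick (v, w) from Vr = Vs ∩ Va ∩ Vd by maximizing a weighted objective."""
    x, y, yaw = pose
    best, best_score = (0.0, 0.0), -np.inf
    # Vd: velocities reachable from (v_now, w_now) within dt, clipped to Vs.
    v_min, v_max = max(v_lim[0], v_now - acc_v * dt), min(v_lim[1], v_now + acc_v * dt)
    w_min, w_max = max(w_lim[0], w_now - acc_w * dt), min(w_lim[1], w_now + acc_w * dt)
    for v in np.linspace(v_min, v_max, 7):
        for w in np.linspace(w_min, w_max, 11):
            px, py, pyaw, clearance = x, y, yaw, np.inf
            for _ in range(int(horizon / dt)):        # roll out the circular trajectory
                pyaw += w * dt
                px += v * np.cos(pyaw) * dt
                py += v * np.sin(pyaw) * dt
                for ox, oy in obstacles:
                    clearance = min(clearance, np.hypot(ox - px, oy - py))
            # Va: admissible only if the robot can stop before the closest obstacle.
            if clearance < (v ** 2) / (2.0 * acc_v):
                continue
            heading = -abs(np.arctan2(goal[1] - py, goal[0] - px) - pyaw)
            score = 1.0 * heading + 0.5 * min(clearance, 3.0) + 0.2 * v   # assumed weights
            if score > best_score:
                best, best_score = (v, w), score
    return best

print(dwa_select(pose=(0.0, 0.0, 0.0), v_now=0.5, w_now=0.0,
                 goal=(4.0, 1.0), obstacles=[(2.0, 0.0)]))
```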


Classical ML

Here the basic principles of four classical but pervasive ML algorithms for motion planning are presented. These algorithms include three supervised learning algorithms (SVM, LSTM and CNN) and one RL algorithm, the Monte-Carlo tree search (MCTS).

SVM (Evgeniou & Pontil, 1999) is a well-known supervised learning algorithm for classification. The basic principle of SVM is about drawing an optimal separating hyperplane between inputted data by training a maximum margin classifier (Evgeniou & Pontil, 1999). Inputted data is in the form of a vector that is mapped into a high-dimensional space where classified vectors are obtained by performing the trained classifier. SVM is used in 2-class classification, which cannot suit real-world tasks, but its variant multiclass SVM (MSVM) (Weston & Watkins, 1998) works.

LSTM (Hochreiter & Schmidhuber, 1997; Inoue et al., 2019) is a variant of the recurrent neural network (RNN). LSTM can remember inputted data (vectors) in its cells. Because of the limited storage capacity of a cell, a part of the data will be dropped when cells are updated with past and new data, and then a part of the data will be remembered and transferred to the next time step. These functions in cells are achieved by neural networks as described in Fig. 13. In robotic motion planning, the robots' features and labels in each time step are fed into the neural networks in cells for training, therefore decisions for motion planning are made by performing the trained network.

Fig. 13 Cells of LSTM that are implemented using neural networks (N. Arbel. How LSTM networks solve the problem of vanishing gradients. Web. Dec 21, 2018. https://medium.com/datadriveninvestor/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients). $c_t$ denotes the cell's state in time step t. $h_t$ denotes the output that will be transferred to the next state as its input, therefore the format of the input is the vector $[h_{t-1}, x_t]$. Cell states are controlled and updated by three gates (forget gate, input gate and output gate) that are implemented using neural networks with weights $W_f$, $W_c + W_i$, and $W_o$ respectively

MCTS is a classical RL algorithm, and it is the combination of the Monte-Carlo method (Kalos & Whitlock, 2008) and the search tree (Coulom, 2006). MCTS is widely used in games (e.g., Go and chess) for motion prediction (Paxton et al., 2017; Silver et al., 2016). The mechanism of MCTS is composed of four processes: selection, expansion, simulation, and backpropagation, as in Fig. 14. In robotic motion planning, a node of MCTS represents a possible state of the robot and stores the state value of the robot in each step. First, selection is made to choose some possible nodes in the tree based on known state values. Second, the tree expands to an unknown state by the tree policy (e.g., random search). Third, simulation of the expansion is made on the newly expanded node by the default policy (e.g., random search) until the terminal state of the robot is reached and a reward R is obtained. Finally, backpropagation is made from the newly expanded node back to the root node, and the state values in these nodes are updated according to the received reward. These four processes are repeated until the convergence of the state values in the tree. The robot can therefore plan its motion according to the state values in the tree. MCTS fits discrete-action tasks (e.g., AlphaGo (Silver et al., 2016)), and it also fits time-sequential tasks like autonomous driving.

CNN (Lecun et al., 1998) has become a research focus of ML after LeNet5 (Lecun et al., 1998) was introduced and successfully applied to handwritten digit recognition. CNN is one of the essential types of neural network because it is good at extracting high-level features from high-dimensional high-resolution images by convolutional layers. CNN makes the robot avoid obstacles and plans the motions of the robot according to human experience by models trained in forward propagation and back propagation processes, especially the back propagation. In the back propagation, a model with a weight matrix/vector θ is updated to record the features of obstacles. Note that $\theta = \{w_i, b_i\}_i^L$ where w and b represent weight and bias, i represents the serial number of the w-b pairs, and L represents the length of the weight.

Training steps of CNN are shown in Fig. 15. Images of objects (obstacles) are used as inputs of CNN. Outputs are probability distributions obtained by the softmax function (Bridle, 1990). The loss value $Loss_{CE}$ is the cross-entropy (CE), and it is obtained by

$Loss_{CE} = -\sum_i p_i \cdot \log q_i$   (2)

where p denotes the probability distribution of the output (observed real value), q represents the probability distribution of the expectation ($p, q \in (0, 1)$), and i represents the serial number of each batch of images in training. The loss function measures the difference (distance) between the observed real value p and the expected value q. The mean-square error (MSE) is an alternative to CE, and MSE is defined by $Loss_{MSE} = \sum_i (p_i - q_i)^2$, where $p_i$ represents observed values while $q_i$ represents predicted values or the expectation. The weight is updated in the optimizer by minimizing the loss value using the gradient descent approach (Zhang, 2019), therefore the new weight $w_i^{new}$ is obtained by

$w_i^{new} = w_i - \eta \cdot \frac{\partial Loss}{\partial w_i}$   (3)

where w represents the weight, η represents a learning rate ($\eta \in (0, 1)$) and i represents the serial number of each batch of images in training. Improved variants of CNN are also widely used in motion planning, e.g., residual networks (Gao et al., 2017; He et al., 2016).

Fig. 14 Four processes of MCTS. These processes repeat until the convergence of state values in the tree
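The four MCTS processes in Fig. 14 can be sketched as a generic UCT-style skeleton (an assumption-laden illustration, not the implementation used in the cited works): the environment is abstracted by hypothetical step(state, action), is_terminal(state) and rollout_reward(state) hooks, and random rollouts supply the reward that is backpropagated.

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def mcts(root_state, actions, step, is_terminal, rollout_reward, n_iter=500, c=1.4):
    """Generic UCT loop: selection, expansion, simulation, backpropagation."""
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1. Selection: descend by the UCB1 score while the node is fully expanded.
        while node.children and len(node.children) == len(actions):
            node = max(node.children, key=lambda n: n.value / (n.visits + 1e-9)
                       + c * math.sqrt(math.log(node.visits + 1) / (n.visits + 1e-9)))
        # 2. Expansion: add one unexplored child unless the state is terminal.
        if not is_terminal(node.state):
            a = actions[len(node.children)]
            node.children.append(Node(step(node.state, a), parent=node))
            node = node.children[-1]
        # 3. Simulation: random rollout from the new node until a terminal state.
        state = node.state
        while not is_terminal(state):
            state = step(state, random.choice(actions))
        reward = rollout_reward(state)
        # 4. Backpropagation: update state values from the new node back to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits, default=None)  # most visited child
```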

Fig. 15 Training steps of CNN (inputs and labels, CNN layers with the weight matrix, softmax, loss function and optimizer). The trajectory is planned by a human in data collection, in which the steering angles of robots are recorded as labels of the data. Robots learn behavior strategies in training and move along the planned trajectory in the test. The softmax function maps values of features to probabilities p ∈ (0, 1). The optimizer represents a gradient descent approach, e.g., stochastic gradient descent (SGD) (Zhang, 2019)

Optimal value RL

Here the basic concepts of RL are recalled first, and then the principles of Q learning, nature DQN, double DQN and dueling DQN are given.

Classical ML algorithms like CNN are competent only in static obstacle avoidance by one-step prediction, therefore they cannot cope with time-sequential obstacle avoidance. RL algorithms, e.g., optimal value RL, fit time-sequential tasks. Typical examples of these algorithms include Q learning, nature DQN, double DQN and dueling DQN. Motion planning is realized by attaching the destination and safe paths with a big reward (numerical value), while obstacles are attached with penalties (negative reward). The optimal path is found according to the total rewards from the initial place to the destination. To better understand optimal value RL, it is necessary to recall several fundamental concepts: Markov chain, Markov decision process (MDP), model-based dynamic programming, model-free RL, the Monte-Carlo method (MC), the temporal difference method (TD), and State-action-reward-state-action (SARSA).
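The reward-shaping idea above (a large reward at the destination, penalties at obstacles) can be illustrated with a tiny grid-world reward table; the grid size and reward values below are assumptions for illustration only.

```python
import numpy as np

# 5x5 grid world: reward +10 at the destination, -5 at obstacle cells,
# and a small step penalty elsewhere to encourage short paths (values assumed).
rewards = np.full((5, 5), -0.1)
rewards[4, 4] = 10.0                 # destination
for obstacle in [(1, 1), (2, 3), (3, 1)]:
    rewards[obstacle] = -5.0

def reward(state):
    """Reward obtained when the robot enters grid cell `state`."""
    return rewards[state]

print(reward((4, 4)), reward((1, 1)), reward((0, 0)))   # 10.0 -5.0 -0.1
```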



Fig. 16 a represents the relationship of basic concepts of RL. b represents the principle of MDP

MDP is based on the Markov chain (Chan et al., 2012), and it can be divided into two categories: model-based dynamic programming and model-free RL. Model-free RL can be divided into MC and TD, which includes the SARSA and Q learning algorithms. The relationship of these concepts is shown in Fig. 16.

Markov chain The variable set $X = \{X_n \mid n > 0\}$ is called a Markov chain (Chan et al., 2012) if X meets

$p(X_{t+1} \mid X_t, \ldots, X_1) = p(X_{t+1} \mid X_t)$   (4)

This means the occurrence of event $X_{t+1}$ depends only on event $X_t$ and has no correlation to any earlier events.

Markov decision process MDP (Chan et al., 2012) is a sequential decision process based on the Markov chain. This means the state and action of the next step depend only on the state and action of the current step. MDP is described as a tuple <S, A, P, R>. S represents the state, and here it refers to the state of the robot and obstacles. A represents an action taken by the robot. State S transits into another state under a state-transition probability P, and a reward R from the environment is obtained. The principle of MDP is shown in Fig. 16. First, the robot in state s interacts with the environment and generates an action based on the policy π(s): s → a. The robot then obtains the reward r from the environment, and the state transits into the next state s'. The reach of the next state s' marks the end of one loop and the start of the next loop.

Model-free RL and model-based dynamic programming Problems in MDP can be solved using model-based dynamic programming and model-free RL methods. Model-based dynamic programming is used in a known environment, while model-free RL is utilized to solve problems in an unknown environment.

Temporal difference and Monte Carlo methods Model-free RL includes MC and TD. A sequence of actions is called an episode. Given an episode $<S_1, A_1, R_2, S_2, A_2, R_3, \ldots, S_t, A_t, R_{t+1}, \ldots, S_T>$, the state value V(s) in the time step t is defined as the expectation of the accumulative rewards $G_t$ by

$V(s) = E[G_t \mid S_t = s] = E[R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T \mid S_t = s]$   (5)

where γ represents a discount factor ($\gamma \in [0, 1]$). MC uses $G_t - V(s)$ to update its state value $V_{MC}(s)$ by

$V_{MC}(s) \leftarrow V(s) + \alpha (G_t - V(s))$   (6)

where "←" represents the update process in which the new value will replace the previous value, and α is a learning rate. TD uses $R_{t+1} + \gamma V(s_{t+1}) - V(s)$ to update its state value $V_{TD}(s)$ by

$V_{TD}(s) \leftarrow V(s) + \alpha \left( R_{t+1} + \gamma V(s_{t+1}) - V(s) \right)$   (7)

where α is a learning rate, and $R_{t+1} + \gamma V(s_{t+1})$ is the TD target, in which the estimated state value $V(s_{t+1})$ is obtained by the bootstrapping method (Tsitsiklis, 2003). This means MC updates its state value after the termination of an episode, while TD updates its state value in every step. The TD method is therefore more efficient than MC in state value updates.

Q learning

TD includes SARSA (Rummery & Niranjan, 1994) and Q learning (Smart & Kaelbling, 2002; Sutton & Barto, 1998). Given an episode $<S_1, A_1, R_2, S_2, A_2, R_3, \ldots, S_t, A_t, R_{t+1}, \ldots, S_T>$, SARSA and Q learning use the ε-greedy method (Santos Mignon, 2017) to select an action $A_t$ at time step t. There are two differences between SARSA and Q learning: (1) SARSA uses ε-greedy again to select an estimated action value $Q(S_{t+1}, A_{t+1})$ at time step t + 1 to update its action value by

$Q_{SARSA}(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)$   (8)

while Q learning directly uses the maximum estimated action value $\max_{A_{t+1}} Q(S_{t+1}, A_{t+1})$ at time step t + 1 to update its action value by

$Q_{QL}(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_{A_{t+1}} Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)$   (9)

(2) SARSA adopts the selected action $A_{t+1}$ directly to update its next action value, but the Q learning algorithm uses ε-greedy to select a new action to update its next action value. SARSA uses the ε-greedy method to sample all potential action values of the next step and eventually selects a "safe" action, while Q learning pays attention to the maximum estimated action value of the next step and eventually selects optimal actions. Steps of SARSA are shown in Algorithm 1 (Sutton & Barto, 1998), while the Q learning algorithm is shown as Algorithm 2 (Sutton & Barto, 1998) and Fig. 17. Implementations of robotic motion planning by Q learning are given in (Panov et al., 2018; Qureshi et al., 2018; Smart & Kaelbling, 2002).
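A minimal tabular sketch of the two updates (Eqs. 8–9), assuming a generic environment whose env.step(action) returns (next_state, reward, done); the environment interface, ε, α and γ values are illustrative assumptions, not part of the original paper.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def td_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9,
              method="q_learning", a_next=None):
    """One temporal-difference update of the action-value table Q.
    SARSA (Eq. 8) bootstraps on the action actually selected for s_next,
    Q learning (Eq. 9) bootstraps on the greedy (maximum) action value."""
    if method == "sarsa":
        target = r + gamma * Q[(s_next, a_next)]
    else:
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Usage outline (env is an assumed MDP interface):
# Q = defaultdict(float)
# s = env.reset()
# a = epsilon_greedy(Q, s, actions)
# s2, r, done = env.step(a)
# a2 = epsilon_greedy(Q, s2, actions)          # needed only for SARSA
# td_update(Q, s, a, r, s2, actions, method="sarsa", a_next=a2)
```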


Fig. 17 Steps of the Q learning algorithm (Q value initialization, action selection, action execution, Q value update, repeated until convergence). The input of Q learning is normally in the vector format. The Q value is obtained via a Q value table or a network as approximator. Extra preprocessing is needed to extract features from the image if the input is in image format

Fig. 18 Two examples of motion planning in early-stage arcade games: Enduro (left) and Pac-man (right)

Nature deep Q-learning network

DQN (Mnih et al., 2013) is a combination of Q learning and a deep neural network (e.g., CNN). DQN uses CNN to approximate Q values by its weight θ. Hence, the Q table in Q learning changes to a Q value network that can converge at a faster speed in complex motion planning. DQN became a research focus when it was invented by Google DeepMind (Mnih et al., 2013, 2015), and the performance of DQN approximates or even surpasses the performance of human beings in Atari games (e.g., Pac-man and Enduro in Fig. 18) and real-world motion planning tasks (Bae et al., 2019; Isele et al., 2017). DQN utilizes CNN to approximate Q values (Fig. 19) by

$Q^*(s, a) \approx Q(s, a; \theta)$   (10)

In contrast with Q learning, DQN features three components: CNN, the replay buffer (Schaul et al., 2016) and the targeted network. CNN extracts features from images that are the inputs. Outputs can be the Q value of the current state Q(s, a) and the Q value of the next state Q(s', a'), therefore experiences <s, a, r, s'> are obtained and temporarily stored in the replay buffer. This is followed by training DQN using the experiences in the replay buffer. In this process, a targeted network θ' is leveraged to minimize the loss value by

$Loss = \left( r + \gamma \max_{a'} Q(s', a'; \theta') - Q(s, a; \theta) \right)^2$   (11)

The loss value measures the distance between the expected value and the real value. In DQN, the expected value is $r + \gamma \max Q(s', a'; \theta')$, which is similar to the labels in supervised learning, while Q(s, a; θ) is the observed real value. The weights of the targeted network and the Q value network share the same weight θ. The difference is that the weight of the Q value network θ is updated in each step, while the weight of the targeted network θ' is updated over a long period of time. Hence, θ is updated frequently while θ' is more stable. It is necessary to keep the targeted network stable, otherwise the Q value network will be hard to converge. Detailed steps of DQN are shown as Algorithm 3 (Mnih et al., 2013) and Fig. 20.

Fig. 19 The Q value network: convolutional layers and full-connection (FC) layers map images of the environment to Q values. Q(s,a0), Q(s,a1), Q(s,a2) and Q(s,at) denote the Q values of all potential actions
et al., 2013) and Fig. 20.


Fig. 20 Steps of the DQN algorithm: Q value calculation, action selection and execution, storage of experiences <s, a, r, s'> in the replay buffer, next-action selection with the targeted network, and updates of the parameters θ and θ', repeated until convergence

Double deep Q-learning network

Noise in DQN leads to bias and a false selection of the next action a', therefore leading to over-estimation of the next action value $Q(s', a'; \theta')$. To reduce the over-estimation caused by noise, researchers invented the double DQN (Hasselt et al., 2016), in which another independent targeted network with weight $\theta^-$ is introduced to evaluate the selected action a'. Hence, the equation of the targeted network therefore changes from $y_{DQN} = r + \gamma \max_{a'} Q(s', a'; \theta')$ to

$y_{doubleDQN} = r + \gamma Q\left(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\right)$   (12)

Steps of double DQN are the same as those of DQN. Examples of applications are (Chao et al., 2020; Lei et al., 2018; Sui et al., 2018), in which double DQN is used in games and physical robots based on ROS.

Dueling deep Q-learning network

The state value $V^\pi(s)$ measures "how good the robot is" in the state s, where π denotes the policy π: s → a, while the action value $Q^\pi(s, a)$ denotes "how good the robot is" after the robot takes action a in state s using policy π. The advantage value (A value) denotes the difference of $Q^\pi(s, a)$ and $V^\pi(s)$ by

$A(s, a) = Q(s, a) - V(s)$   (13)

therefore the A value measures "how good the action a is" in state s if the robot takes the action a. In the neural network case (Fig. 21), weights α, β, θ are added, therefore

$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$   (14)

where θ is the weight of the neural network and it is the shared weight of the Q, V and A values. Here α denotes the weight of the A value, and β the weight of the V value. $V(s; \theta, \beta)$ is a scalar, and $A(s, a; \theta, \alpha)$ is a vector. There are, however, too many V-A value pairs if the Q value is simply divided into two components


Fig. 21 The architecture of dueling DQN, in which images of the environment pass through convolutional layers and FC layers, and the Q value Q(s,a) is decoupled into two parts, including the V value V(s) and the A value A(s,a)

Fig. 22 Q(s,a) and A(s,a) saliency maps (red-tinted overlay) on the Atari game (Enduro). Q(s,a) learns to pay attention to the road, but pays less attention to obstacles in the front. A(s,a) learns to pay much attention to dynamic obstacles in the front (Wang et al., 2015)

without constraints, and only one V-A pair is qualified. Thus, it is necessary to constrain the V value or A value to obtain a fixed V-A pair. According to the relationship of $Q^\pi(s, a)$ and $V^\pi(s)$, where $V^\pi(s) = E_{a \sim \pi(s)}[Q^\pi(s, a)]$, the expectation of the A value is

$E_{a \sim \pi(s)}[A(s, a)] = 0$   (15)

Equation 15 can be used as a rule to constrain the A value for obtaining a stable V-A pair. The expectation of the A value is obtained by using $A(s_t, a_t)$ to subtract the mean A value that is obtained from all possible actions, therefore

$E[A(s_t, a_t)] = A(s_t, a_t) - \frac{1}{|A|} \sum_{a_t^- \in A} A(s_t, a_t^-)$   (16)

where A represents the action space at time step t, |A| the number of actions, and $a_t^-$ one of the possible actions in A at time step t. The expectation of the A value keeps zero for $t \in [0, T]$, despite the fluctuation of $A(s_t, a_t)$ over different action choices. Researchers use the expectation of the A value to replace the current A value by

$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|A|} \sum_{a_t^- \in A} A(s_t, a_t^-; \theta, \alpha) \right)$   (17)

Thus, a stable V-A pair is obtained, although the original semantic definition of the A value (Eq. 13) is changed (Wang et al., 2015). In other words: (1) the advantage constraint $A(s, a; \theta, \alpha) - \frac{1}{|A|} \sum_{a_t^- \in A} A(s_t, a_t^-; \theta, \alpha) = 0$ is used to constrain the update of the A value network α; (2) the Q value network θ is therefore obtained by $Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta)$. The Q value network θ is updated according to the V value, which is more accurate and easy to obtain via the accumulative rewards defined by Eq. 5. Hence, a better estimation of the action value is obtained by performing a better Q value network θ. DQN obtained the action value Q(s, a) directly by using the network to approximate the action value, and this process introduces over-estimation of the action value. Dueling DQN obtains a better action value Q(s, a) by constraining the advantage value A(s, a). Finally, three weights (θ, α, β) are obtained after training, and the Q value network θ has less bias, but the A value is better than the action value to represent "how good the action is" (Fig. 22).

Further optimizations are the distributional DQN (Bellemare et al., 2017), noise network (Fortunato et al., 2017), dueling double DQN2 and the rainbow model (Hessel et al., 2017). Distributional DQN is like the dueling DQN, as noise is reduced by optimizing the architecture of DQN. The noise network is about improving the ability in exploration by a more exquisite and smooth approach. Dueling double DQN and the rainbow model are hybrid algorithms. The rainbow model fuses several suitable components: double networks, the replay buffer, the dueling network, multi-step learning, the distributional network, and the noise network.

2 A. Suran. Dueling Double Deep Q Learning using Tensorflow 2.x. Web. Jul 10, 2020. https://towardsdatascience.com/dueling-double-deep-q-learning-using-tensorflow-2-x-7bbbcec06a2a.
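A small sketch of the dueling aggregation in Eq. 17 (illustrative, not the authors' network): a shared trunk feeds separate V and A streams, and the advantage is re-centred by its mean over actions before being added to V, so that the expected advantage stays zero.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)), as in Eq. 17."""
    def __init__(self, obs_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())   # shared weight (theta)
        self.value = nn.Linear(hidden, 1)               # V stream (weight beta)
        self.advantage = nn.Linear(hidden, n_actions)   # A stream (weight alpha)

    def forward(self, obs):
        h = self.trunk(obs)
        v = self.value(h)                               # shape (batch, 1)
        a = self.advantage(h)                           # shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)      # re-centred advantage keeps E[A] = 0

q = DuelingQNet()(torch.randn(2, 8))
print(q.shape)   # torch.Size([2, 4])
```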


Fig. 23 Training and test steps of policy gradient algorithms. In the training, time-sequential actions are generated by the behavior policy. Note that the policy is divided into the behavior policy and the target policy. The behavior policy is about selecting actions for training and the behavior policy will not be updated, while the target policy will be updated in training. Policy normally refers to the target policy. Robots learn trajectories via the target policy (neural network as approximator) and the trained policy is obtained. In the test, optimal time-sequential actions are generated directly by the trained policy πθ: s → a until the destination is reached

πθ : s → a to directly select actions to avoid this problem. lead to a better policy. Increment of network is the gradient
Brief steps of policy gradient algorithm are shown in Fig. 23. value of objective, and that is

∇θ J (θ )  ∫ ∇θ πθ (τ ) R (τ ) dτ
Policy gradient method  ∫ πθ (τ ) ∇θ log πθ (τ ) R (τ ) dτ

 Eτ ∼πθ (τ ) ∇θ log πθ (τ ) R (τ ) (19)
Policy is a probability distribution P{a|s,θ }  π θ (a|s) 
π (a|s;θ ) that is used to select action a in state s, where weight
An example of PG is Monte-carlo reinforce (Williams,
θ is the parameter matrix that is used as an approximation
1992). Data τ for training are generated from simulation
of policy π (a|s). Policy gradient method (PG) (Sutton et al.,
by stochastic policy. Previous objective and its gradient
1999) seeks an optimal policy and uses it to find optimal
(Eq. 18–19) are replaced by
actions. how to find this optimal policy? Given a episode τ
 (s1 ,a1 ,…,sT ,aT ), the probability to output actions in τ is πθ
1    i i
N T

T
J (θ ) ≈ r st , a t (20)
(τ )  p(s1 ) πθ (at |st ) p(st |st−1 , at−1 ). The aim of the PG N
t2 i1 t1
is to find optimal parameter θ ∗  arg maxθ Eτ ∼πθ (τ ) [R(τ )]  T  T 
1  
N     

T
∇θ J (θ ) ≈ ∇θ log πθ at , st
i i
r st , a t
i i
where episode reward R(τ )  r (st , at )) is the accumu- N
t1 i1 t1 t1
lative rewards in episode τ . Objective of PG is defined as the (21)
expectation of rewards in episode τ by
where N is the number of episodes, T the length of tra-
jectory. A target policy πθ is used to generate episodes for
J (θ )  Eτ ∼πθ (τ ) [R(τ )]  ∫ πθ (τ )R(τ )dτ (18) training. For example, Gaussian distribution function
 is used
as behavior policy to select actions by a ∼ N μ(s), σ 2 .
Network f (s; θ ) is then used to approximate expectation of
To find higher expectation of rewards, gradient operation Gaussian
 distribution
 by μ(s)  f (s; θ ). Itmeans a ∼ N
is used on objective to find the increment of network that may mean  f s; θ  {wi , bi }iL ) , stdev  σ 2 and μ(s; θ ) 

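A minimal numpy sketch of the estimator in Eqs. 20–21 is given below. It assumes a linear-softmax policy (an assumption for illustration, not from the paper), for which ∇θ log πθ(a|s) has the closed form x ⊗ (onehot(a) − π(·|s)), and uses random toy states and rewards:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_actions, N, T = 6, 3, 8, 10     # feature dim, actions, episodes, horizon
theta = np.zeros((d, n_actions))     # policy parameters

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(x, a, probs):
    """d/dtheta log pi(a|s) for a linear-softmax policy."""
    indicator = np.zeros(n_actions)
    indicator[a] = 1.0
    return np.outer(x, indicator - probs)

grad = np.zeros_like(theta)
for _ in range(N):                                   # N episodes (Eq. 21)
    sum_grad_log = np.zeros_like(theta)
    episode_return = 0.0
    for _ in range(T):                               # T steps per episode
        x = rng.normal(size=d)                       # toy state features
        probs = softmax(x @ theta)
        a = rng.choice(n_actions, p=probs)
        r = rng.normal()                             # toy reward r(s_t, a_t)
        sum_grad_log += grad_log_pi(x, a, probs)
        episode_return += r
    grad += sum_grad_log * episode_return            # (sum of grad log pi) * R(tau)
grad /= N

alpha = 0.01
theta = theta + alpha * grad                         # gradient ascent update (Eq. 24)
```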
For example, a Gaussian distribution function is used as the behavior policy to select actions by a ∼ N(μ(s), σ²). A network f(s; θ) is then used to approximate the expectation of the Gaussian distribution by μ(s) = f(s; θ). This means a ∼ N(mean = f(s; θ = {w_i, b_i}_{i}^{L}), stdev = σ²) and μ(s; θ) = [mean, stdev], where w and b represent the weight and bias of the network and L is the number of w-b pairs. Its objective is defined as J(θ) = ‖f(s_t; θ) − a_t‖², therefore the objective gradient is

∇θ J(θ) = −(1/σ²)(f(s_t) − a_t)(df/dθ)    (22)

where df/dθ is obtained by backward propagation. According to Eqs. 21–22, its objective gradient is

∇θ J(θ) ≈ −(1/N) Σ_{i=1}^{N} [ (1/σ²)(f(s_t^i) − a_t^i)(df/dθ) ] × [ Σ_{t=1}^{T} r(s_t^i, a_t^i) ]    (23)

Once the objective gradient is obtained, the network is updated by the gradient ascent method, that is

θ ← θ + ∇θ J(θ)    (24)

Actor-critic algorithm

The update of the policy in PG is based on the expectation of accumulative rewards in episode τ, E_{τ∼πθ(τ)}[R(τ)]. This leads to high variance that causes a low speed of network convergence, although convergence stability is improved. The actor-critic algorithm (AC) (Cormen et al., 2009; Kim et al., 2019; Konda & Tsitsiklis, 2001) reduces the variance by using the one-step reward in the TD-error e for the network update. The TD-error is defined by

e = r_t + V(s_{t+1}) − V(s_t)    (25)

To enhance convergence speed, AC uses the actor-critic architecture that includes an actor network (policy network) and a critic network. The critic network is used in the TD-error, therefore the TD-error changes into

e = r_t + V(s_{t+1}; w) − V(s_t; w)    (26)

The objective of the critic network is defined by

J(w) = e²    (27)

The objective gradient is therefore obtained by minimizing the mean-square error

∇w J(w) = ∇w e²    (28)

The critic network is updated by the gradient ascent method (Zhang, 2019), that is

w ← w + β∇w J(w)    (29)

where β represents the learning rate. The objective of the policy network is defined by

J(θ) = πθ(a_t|s_t) · e    (30)

Hence, the objective gradient of the policy network is obtained by

∇θ J(θ) = ∇θ log πθ(a_t|s_t) · e    (31)

and the policy network is updated by

θ ← θ + α∇θ J(θ)    (32)

where α is the learning rate of the actor network. Detailed steps of AC are shown in Fig. 24: (1) action a_t at time step t is selected by the policy network θ; (2) the selected action is executed, the reward is obtained, and the state transits into the next state; (3) the state value is obtained by the critic network and the TD-error is obtained; (4) the policy network is updated according to the objective gradient of the policy network; (5) the critic network is updated by minimizing the objective of the critic network. This process repeats until the convergence of the policy and critic networks.

Fig. 24 Training steps of AC
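The following minimal sketch (illustrative only, with a linear state-value critic and a linear-softmax actor, which are assumptions not taken from the paper) performs one actor-critic update step following Eqs. 26–32:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_actions = 6, 3
w = np.zeros(d)                    # critic parameters, V(s; w) = w . x(s)
theta = np.zeros((d, n_actions))   # actor parameters (linear-softmax policy)
alpha, beta, gamma = 0.01, 0.05, 0.99

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ac_step(x, x_next, r, w, theta):
    """One actor-critic update given state features x, x_next and reward r."""
    probs = softmax(x @ theta)
    a = rng.choice(n_actions, p=probs)

    # TD-error with the critic network (Eq. 26).
    e = r + gamma * (w @ x_next) - (w @ x)

    # Critic update derived from the squared TD-error objective (Eqs. 27-29).
    w = w + beta * e * x

    # Actor update: grad log pi(a|s) scaled by the TD-error (Eqs. 31-32).
    indicator = np.zeros(n_actions)
    indicator[a] = 1.0
    grad_log_pi = np.outer(x, indicator - probs)
    theta = theta + alpha * e * grad_log_pi
    return w, theta

# Toy transition to exercise the update.
x, x_next, r = rng.normal(size=d), rng.normal(size=d), 1.0
w, theta = ac_step(x, x_next, r, w, theta)
```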
A3C and A2C

A3C In contrast to AC, A3C (Everett et al., 2018) has three features: (1) multi-thread computing; (2) multi-step rewards; (3) policy entropy. Multi-thread computing means multiple interactions with the environment to collect data and update networks. Multi-step rewards are used in the critic network, therefore the TD-error e of A3C is obtained by

e = Σ_{i=t}^{T} γ^{i−t} r_i + V(s_{t+n}) − V(s_t)    (33)

Hence, the speed of convergence is improved. Here γ is a discount factor and n is the number of steps. Data collection by the policy π(s_t; θ) causes over-concentration, because the initial policy has poor performance and actions are therefore selected from a small area of the workspace. This causes poor quality of the input, therefore the convergence speed of the network is poor. Policy entropy increases the ability of the policy in action exploration to reduce over-concentration. The objective gradient of A3C therefore changes to

∇θ J(θ)_{A3C} = ∇θ J(θ)_{AC} + β∇θ H(π(s_t; θ))    (34)

where β is a discount factor and H(π(s_t; θ)) is the policy entropy.

A2C A2C (Everett et al., 2018) is the alternative of the A3C algorithm. Each thread in the A3C algorithm can be utilized to collect data, train the critic and policy networks, and send updated weights to the global model. Each thread in A2C, however, can only be used to collect data. Weights in A2C are updated synchronously, compared with the asynchronous update of A3C, and experiments demonstrate that the synchronous update of weights is better than the asynchronous way (Babaeizadeh et al., 2016; Mnih et al., 2016). Their mechanisms of weight update are shown in Fig. 25.

DPG and DDPG

Here some prerequisites are recalled: on-policy algorithm, off-policy algorithm, importance-sampling ratio, and stochastic policy gradient algorithm. Then the principles of DPG and DDPG are given.

Prerequisites In the data generation and training processes, if the behavior policy and target policy of an algorithm are the same policy πθ, the algorithm is called an on-policy algorithm. An on-policy algorithm, however, may lead to low-quality data in data generation and a slow speed of network convergence. This problem can be solved by using one policy (behavior policy) βθ for data generation and another policy (target policy) πθ for learning and making decisions. Algorithms using different policies for data generation and learning are therefore called off-policy algorithms. Although the policies in an off-policy algorithm are different, their relationship can still be measured by the importance-sampling ratio, defined by

ρβ(s) = πθ(a|s)/βθ(a|s) = ∏_{k=t}^{T} π(a_k|s_k; θ) / ∏_{k=t}^{T} β(a_k|s_k; θ)    (35)

The importance-sampling ratio measures the similarity of two policies, and these policies must have large similarity in the definition of importance sampling. Particularly, the behavior policy βθ is the same as the policy πθ in on-policy algorithms. This means πθ = βθ and ρβ(s) = ρπ(s) = 1.

In an on-policy policy gradient algorithm (e.g., PG), the objective is defined as

J(θ) = E_{τ∼πθ(τ)}[R(τ)] = ∫_{τ∼πθ(τ)} πθ(τ)R(τ)dτ = ∫_{s∼S} ρπ(s) ∫_{a∼A} πθ(a|s)R(s, a) da ds = E_{s∼ρπ, a∼πθ}[R(s, a)]    (36)

where ρπ is the distribution of the state transition. The objective gradient of PG, ∇θ J(θ) = E_{τ∼πθ(τ)}[∇θ log πθ(τ)R(τ)], includes a vector C = ∇θ log πθ(τ) and a scalar R = R(τ). The vector C gives the direction of the policy update, while the scalar R decides the size of the step along this direction. Hence, the scalar R acts as a critic that decides how the policy is updated. The action value Qπ(s, a) is defined as the expectation of discounted rewards by

Qπ(s, a) = E[ Σ_{k=t}^{∞} γ^{k−t} r(s_k, a_k) | S_1 = s, A_1 = a; π ]    (37)

Qπ(s, a) is an alternative to the scalar R, and it is better than R as the critic. Hence, the objective gradient of PG changes to

∇θ J(θ) = ∇θ ∫_{s∼S} ρπ(s) ∫_{a∼A} πθ(a|s)Qπ(s, a) da ds = E_{s∼ρπ, a∼πθ}[∇θ log πθ(a|s)Qπ(s, a)]    (38)

and the policy is updated using the objective gradient with the action value Qπ(s, a). Hence, algorithms are called stochastic policy gradient algorithms if the action value Qπ(s, a) is used as the critic.

DPG DPG refers to algorithms in which a deterministic policy μθ(s) is trained to select actions, instead of the policy πθ(a|s) in AC. A policy is a deterministic policy μθ(s) if it directly maps the state to the action a ← μθ(s), while a stochastic policy πθ(a|s) maps state and action to a probability P(a|s) (Silver et al., 2014). The update of the deterministic policy is defined as

μ_{k+1}(s) = argmax_a Q^{μ_k}(s, a)    (39)

If a network θ is used as the approximator of the deterministic policy, the update of the network changes to

θ_{k+1} = θ_k + αE_{s∼ρ^{μ_k}}[∇θ Q^{μ_k}(s, μθ(s))] = θ_k + αE_{s∼ρ^{μ_k}}[∇θ μθ(s) ∇_a Q^{μ_k}(s, a)|_{a=μθ(s)}]    (40)

There are small changes in the state distribution ρ^μ of the deterministic policy during the update of the network θ, but this change does not impact the update of the network. Hence, the network of the deterministic policy is updated by

θ ← θ + αE_{s∼ρ^μ}[∇θ μθ(s)∇_a Q^μ(s, a)|_{a=μθ(s)}]    (41)
Fig. 25 The weight update processes of the A3C and A2C

because

∇θ J(μθ) = ∇θ ∫_S ρ^μ(s)R(s, μθ(s)) ds = ∇θ E_{s∼ρ^μ}[R(s, μθ(s))] = ∫_S ρ^μ(s) ∇θ μθ(s) ∇_a Q^μ(s, a)|_{a=μθ(s)} ds = E_{s∼ρ^μ}[∇θ μθ(s) ∇_a Q^μ(s, a)|_{a=μθ(s)}]    (42)

Once Q^μ(s, a) is obtained, θ can be updated after obtaining the objective gradient.

How can Q^μ(s, a) be found? Note that the discounted reward Q^μ(s, a) is a critic in the stochastic policy gradient mentioned before. If a network w is used as the approximator, Q^μ(s, a) is obtained by

Q^μ(s, a) ≈ Q^w(s, a)    (43)

The stochastic policy gradient algorithm then includes two networks, in which w is the critic that approximates the Q value and θ is used as the actor to select actions in the test (actions are selected by the behavior policy β in the training). The stochastic policy gradient in this case is called off-policy deterministic actor-critic (OPDAC) or OPDAC-Q. The objective gradient of OPDAC therefore changes from Eq. 42 to

∇θ Jβ(μθ) ≈ E_{s∼ρ^β}[∇θ μθ(s)∇_a Q^w(s, a)|_{a=μθ(s)}]    (44)

where β represents the behavior policy. The two networks are updated by

δ_t = r_t + γQ^w(s_{t+1}, μθ(s_{t+1})) − Q^w(s_t, a_t)    (45)

w_t ← w_t + α_w δ_t ∇_w Q^w(s_t, a_t)    (46)

θ_t ← θ_t + α_θ ∇θ μθ(s)∇_a Q^w(s, a)|_{a=μθ(s)}    (47)

However, no constraint is used on the network w in the approximation of the Q value, and this leads to a large bias. How can a Q^w(s, a) without bias be obtained? Compatible function approximation (CFA) can eliminate the bias by adding two constraints on w (the proof is given in Silver et al. (2014)): (1) ∇_a Q^w(s, a)|_{a=μθ(s)} = ∇θ μθ(s)ᵀw; (2) MSE(θ, w) = E[ε(s; θ, w)ᵀ ε(s; θ, w)] → 0, where ε(s; θ, w) = ∇_a Q^w(s, a)|_{a=μθ(s)} − ∇_a Q^μ(s, a)|_{a=μθ(s)}. In other words, Q^w(s, a) should satisfy

Q^w(s, a) = (a − μθ(s))ᵀ ∇θ μθ(s)ᵀ w + V^v(s)    (48)

where the state value V^v(s) may be any differentiable baseline function (Silver et al., 2014). Here v and ∅(s) are the parameter and feature of the state value (V^v(s) = vᵀ∅(s)). The feature ∅(s, a) also defines the advantage function (A^w(s, a) = ∅(s, a)ᵀw), with ∅(s, a) := ∇θ μθ(s)(a − μθ(s)). Hence, a low-bias Q^w(s, a) is obtained using OPDAC-Q and CFA. This new algorithm with less bias is called compatible OPDAC-Q (COPDAC-Q) (Silver et al., 2014), in which the weights are updated as in Eqs. 49–51:

v_t ← v_t + α_v δ_t ∅(s_t)    (49)

w_t ← w_t + α_w δ_t ∅(s_t, a_t) = w_t + α_w δ_t ∇_w A^w(s_t, a_t)    (50)

θ_t ← θ_t + α_θ ∇θ μθ(s_t)(∇θ μθ(s_t)ᵀ w_t)    (51)
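A minimal sketch of the update rules in Eqs. 45–47 is given below for the simplest possible case (an assumption for illustration, not from the paper): a scalar action, a linear deterministic policy μθ(s) = θᵀs, and a linear critic Q^w(s, a) = w_sᵀs + w_a·a, for which the required gradients are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
theta = rng.normal(size=d) * 0.1   # actor: mu_theta(s) = theta . s (scalar action)
w_s = np.zeros(d)                  # critic: Q_w(s, a) = w_s . s + w_a * a
w_a = 0.0
alpha_w, alpha_theta, gamma = 0.05, 0.01, 0.99

def mu(s, theta):
    return float(theta @ s)

def q(s, a, w_s, w_a):
    return float(w_s @ s + w_a * a)

# One transition collected by a behavior policy (here: the policy plus noise).
s = rng.normal(size=d)
a = mu(s, theta) + rng.normal() * 0.3
r = rng.normal()
s_next = rng.normal(size=d)

# TD-error using the deterministic policy at the next state (Eq. 45).
delta = r + gamma * q(s_next, mu(s_next, theta), w_s, w_a) - q(s, a, w_s, w_a)

# Critic update: grad_w Q = [s, a] for this linear critic (Eq. 46).
w_s = w_s + alpha_w * delta * s
w_a = w_a + alpha_w * delta * a

# Actor update: grad_theta mu = s and grad_a Q = w_a (Eq. 47).
theta = theta + alpha_theta * s * w_a
```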
Fig. 26 Brief steps of DPG algorithm

where δ_t is the same as in Eq. 45. Here α_v, α_w and α_θ are learning rates. Note that the linear function approximation method (Silver et al., 2014) is used to obtain the advantage function A^w(s, a), which replaces the value function Q^w(s, a) because A^w(s, a) is more efficient than Q^w(s, a) in the weight update. Linear function approximation, however, may lead to divergence of Q^w(s, a) in the critic δ. The critic δ can be replaced by the gradient Q-learning critic (Sutton et al., 2009) to reduce the divergence. The algorithm that combines COPDAC-Q and the gradient Q-learning critic is called COPDAC gradient Q-learning (COPDAC-GQ). Details of the gradient Q-learning critic and the COPDAC-GQ algorithm can be found in (Silver et al., 2014; Sutton et al., 2009).

By the analytical illustration above, two examples (COPDAC-Q and COPDAC-GQ) of the DPG algorithm are obtained. In short, the key points of DPG are to: (1) find an unbiased Q^w(s, a) as the critic; (2) train a deterministic policy μθ(s) to select actions. The networks of DPG are updated as in AC via the gradient ascent approach. Brief steps of DPG are shown in Fig. 26.

DDPG (Lillicrap et al., 2019) is the combination of the replay buffer, the deterministic policy μ(s) and the actor-critic architecture. θ^Q is used as the critic network to approximate the action value Q(s_i, a_i; θ^Q), and θ^μ is used as the policy network to approximate the deterministic policy μ(s; θ^μ). The TD target y of DDPG is defined by

y_i = r_i + γQ(s_{i+1}, μ(s_{i+1}; θ^{μ′}); θ^{Q′})    (52)

where θ^{Q′} and θ^{μ′} are copies of θ^Q and θ^μ used as target networks that are updated with low frequency. The objective of the critic network is defined by

J(θ^Q) = y_i − Q(s_i, a_i; θ^Q)    (53)

The critic network θ^Q is updated by minimizing the loss value (MSE loss)

Loss = (1/N) Σ_i J(θ^Q)²    (54)

where N is the number of tuples <s, a, r, s′> sampled from the replay buffer. The target function of the policy network is defined by

J(θ^μ) = (1/N) Σ_i Q(s_i, a_i; θ^μ)    (55)

and the objective gradient is obtained by

∇_{θ^μ} J(θ^μ) ≈ (1/N) Σ_i ∇_{θ^μ} μ(s_i; θ^μ) ∇_a Q(s_i, a; θ^Q)|_{a=μ(s_i)}    (56)

Hence, the policy network θ^μ is updated according to the gradient ascent method by

θ^μ ← θ^μ + α∇_{θ^μ} J(θ^μ)    (57)

where α is a learning rate. New target networks

θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′}    (58)

θ^{μ′} ← τθ^μ + (1 − τ)θ^{μ′}    (59)

where τ is a learning rate, are obtained by the "soft" update method that improves the stability of network convergence. Detailed steps of DDPG are shown in Algorithm 4 (Lillicrap et al., 2019) and Fig. 27. Examples can be found in (Jorgensen & Tamar, 2019; Munos et al., 2016), in which DDPG is used on robotic arms.
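A minimal numpy sketch of the DDPG target computation and the "soft" target-network update (Eqs. 52, 58–59) is shown below. The linear critic and policy are placeholders (assumptions for illustration), since the point is only the use of target copies and the soft update:

```python
import numpy as np

rng = np.random.default_rng(4)
d, tau, gamma = 4, 0.005, 0.99

# Online networks and their target copies (here: simple linear parameterizations).
theta_mu = rng.normal(size=d) * 0.1            # policy mu(s) = theta_mu . s
theta_q = rng.normal(size=d + 1) * 0.1         # critic Q(s, a) = theta_q . [s, a]
theta_mu_targ = theta_mu.copy()
theta_q_targ = theta_q.copy()

def mu(s, p):
    return float(p @ s)

def q(s, a, p):
    return float(p @ np.concatenate([s, [a]]))

# A sampled tuple <s, a, r, s'> from the replay buffer (toy values).
s, a, r, s_next = rng.normal(size=d), 0.3, 1.0, rng.normal(size=d)

# TD target uses BOTH target networks (Eq. 52).
y = r + gamma * q(s_next, mu(s_next, theta_mu_targ), theta_q_targ)

# Critic loss term for this sample (Eqs. 53-54).
loss_i = (y - q(s, a, theta_q)) ** 2

# "Soft" update of the target networks (Eqs. 58-59).
theta_q_targ = tau * theta_q + (1 - tau) * theta_q_targ
theta_mu_targ = tau * theta_mu + (1 - tau) * theta_mu_targ
```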
TRPO and PPO

PPO (Long et al., 2018; Schulman et al., 2017b) is the optimized version of TRPO (Schulman et al., 2017a). Hence, the principle of TRPO is given here before recalling that of PPO.

TRPO Previous policy gradient algorithms update their policies by θ ← θ + ∇θ J(θ). However, the new policy is improved unstably, with fluctuation. The goal of TRPO is to improve its policy monotonously, so that the stability of convergence is improved, by finding a new policy with the objective defined by

J(θ) = L_{θold}(θ), s.t. D^max_{KL}(θold, θ) ≤ δ    (60)

where L_{θold}(θ) is the approximation of the new policy's expectation, D^max_{KL}(θold, θ) the KL divergence between the old policy θold and the new policy θ, and δ a trust region constraint on the KL divergence. The objective gradient ∇θ J(θ) is obtained by maximizing the objective J(θ).

η(θ) and η(θold) denote the expectations of the new and old policies, respectively. Their relationship is defined by η(θ) = η(θold) + E_{s_0, a_0, s_1, a_1, ...}[ Σ_{t=0}^{∞} γ^t A_{θold}(s_t, a_t) ], where γ is a discount factor and A_{θold}(s_t, a_t) is the advantage value defined by Aθ(s, a) = Qθ(s, a) − Vθ(s). Thus, η(θ) = η(θold) + Σ_s ρθ(s) Σ_a θ(a|s)A_{θold}(s, a), where ρθ(s) is the state distribution of the new policy; but ρθ(s) is unknown, therefore it is impossible to obtain the new policy's expectation η(θ) directly. The approximation of the new policy's expectation, L_{θold}(θ), is defined by

L_{θold}(θ) = η(θold) + Σ_s ρ_{θold}(s) Σ_a θ(a|s)A_{θold}(s, a)
           = η(θold) + Σ_s ρ_{θold}(s) Σ_a [θ(a|s)/θold(a|s)] · θold(a|s) · A_{θold}(s, a)
           = η(θold) + E[ (θ(a|s)/θold(a|s)) A_{θold}(s, a) ]    (61)

where ρ_{θold}(s) is known. The relationship of L_{θold}(θ) and η(θ) (Kakade & Langford, 2002) is proved to be

η(θ) ≥ L_{θold}(θ) − C · D^max_{KL}(θold, θ)    (62)
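The quantities in Eqs. 61–62 can be estimated from samples. The sketch below (illustrative only, with made-up sampled probabilities and advantages standing in for two softmax policy networks) computes the surrogate term E[(θ(a|s)/θold(a|s)) A_{θold}(s, a)] and a simple estimate of the per-state KL divergence whose maximum TRPO constrains:

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples, n_actions = 64, 4

# Full action distributions of the old and new policies on sampled states.
logits_old = rng.normal(size=(n_samples, n_actions))
logits_new = logits_old + 0.05 * rng.normal(size=(n_samples, n_actions))
p_old = np.exp(logits_old) / np.exp(logits_old).sum(axis=1, keepdims=True)
p_new = np.exp(logits_new) / np.exp(logits_new).sum(axis=1, keepdims=True)

# Actions actually taken under the old policy and their advantage estimates.
actions = np.array([rng.choice(n_actions, p=p) for p in p_old])
adv = rng.normal(size=n_samples)

# Surrogate term of Eq. 61: importance ratio times advantage.
idx = np.arange(n_samples)
ratio = p_new[idx, actions] / p_old[idx, actions]
surrogate = np.mean(ratio * adv)

# Per-state KL(old || new); TRPO constrains its maximum by delta (Eq. 60).
kl_per_state = np.sum(p_old * np.log(p_old / p_new), axis=1)
max_kl = kl_per_state.max()

print(surrogate, max_kl)
```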
Fig. 27 Steps of DDPG. DDPG combines the replay buffer, actor-critic architecture, and deterministic policy. First, the action is selected by the policy network and the reward is obtained; the state transits to the next state. Second, the experience tuple is saved in the replay buffer. Third, experiences are sampled from the replay buffer for training. Fourth, the critic network is updated. Finally, the policy network is updated

where the penalty coefficient C = 2εγ/(1 − γ)², γ ∈ [0, 1], and ε is the maximum advantage. Hence, it is possible to obtain η(θ) by maximizing L_{θold}(θ) − C · D^max_{KL}(θold, θ), or

maximize_θ E[ (θ(a|s)/θold(a|s)) A_{θold}(s, a) ] − C · D^max_{KL}(θold, θ)    (63)

However, the penalty coefficient C (the penalty on the KL divergence) leads to a small step size in the policy update. Hence, a trust region constraint δ is used to constrain the KL divergence by

maximize_θ E[ (θ(a|s)/θold(a|s)) A_{θold}(s, a) ], s.t. D^max_{KL}(θold, θ) ≤ δ    (64)

therefore the step size in the policy update is enlarged robustly. The new improved policy is obtained in the trust region by maximizing the objective L_{θold}(θ), s.t. D^max_{KL}(θold, θ) ≤ δ. This objective can be simplified further (Schulman et al., 2017a), and the new policy θ is obtained by

maximize_θ ∇θ L_{θold}(θ)|_{θ=θold} · (θ − θold), s.t. (1/2)‖θ − θold‖² ≤ δ    (65)

PPO The objective of TRPO is maximize_θ E[(θ(a|s)/θold(a|s)) A_{θold}(s, a)], s.t. D^max_{KL}(θold, θ) ≤ δ, in which a fixed trust region constraint δ is used to constrain the KL divergence instead of the penalty coefficient C. The fixed trust region constraint δ leads to a reasonable step size in the policy update, therefore stability in convergence is improved and the convergence speed is acceptable. However, the objective of TRPO is obtained in implementation by the conjugate gradient method (Schulman et al., 2017b), which is computationally expensive.

PPO optimizes the objective maximize_θ E[(θ(a|s)/θold(a|s)) A_{θold}(s, a) − C · D^max_{KL}(θold, θ)] from two aspects: (1) the probability ratio r(θ) = θ(a|s)/θold(a|s) in the objective is constrained to the interval [1 − ε, 1 + ε] by introducing the "surrogate" objective

L^{CLIP}(θ) = E{ min[ r(θ)A, clip(r(θ), 1 − ε, 1 + ε)A ] }    (66)

where ε is a hyperparameter, to penalize changes of the policy that move r(θ) away from 1 (Schulman et al., 2017b); (2) the penalty coefficient C is replaced by an adaptive penalty coefficient β that increases or decreases according to the expectation of the KL divergence in the new update. To be exact,

if d < d_targ/1.5, β ← β/2; if d > d_targ × 1.5, β ← β × 2    (67)

where d = E[D^max_{KL}(θold, θ)] and d_targ denotes the target value of the KL divergence in each policy update. The KL-penalized objective is therefore obtained by
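The clipped surrogate of Eq. 66 and the adaptive-β rule of Eq. 67 are easy to express directly. The sketch below (illustrative only, with made-up ratios, advantages and KL estimate) mirrors them:

```python
import numpy as np

rng = np.random.default_rng(6)
eps, beta, d_targ = 0.2, 1.0, 0.01

# Probability ratios r(theta) and advantage estimates for a sampled batch.
ratio = np.exp(0.1 * rng.normal(size=128))   # stand-in for pi_new / pi_old
adv = rng.normal(size=128)

# Clipped "surrogate" objective (Eq. 66).
unclipped = ratio * adv
clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
l_clip = np.mean(np.minimum(unclipped, clipped))

# Adaptive KL penalty coefficient (Eq. 67), given an estimate d of the KL divergence.
d = 0.004                                    # toy KL estimate for this update
if d < d_targ / 1.5:
    beta /= 2.0
elif d > d_targ * 1.5:
    beta *= 2.0

print(l_clip, beta)
```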
L^{KLPEN}(θ) = E[ (θ(a|s)/θold(a|s)) A_{θold}(s, a) − β · D^max_{KL}(θold, θ) ]    (68)

In the implementation with a neural network, a loss function is required that combines the policy surrogate and the value function error (Schulman et al., 2017b), and an entropy term is also used in the objective to encourage exploration. Hence, the combined surrogate objective is obtained by

L^{CLIP+VF+S}(θ) = E[ L^{CLIP}(θ) + c_1 L^{VF}(θ) + c_2 S(πθ|s) ]    (69)

where c_1, c_2, S and L^{VF}(θ) denote two coefficients, the entropy bonus and the squared-error loss, respectively. The objectives of PPO (L^{CLIP+VF+S}(θ) and L^{KLPEN}(θ)) are optimized by SGD, which costs less computing resource than the conjugate gradient method. PPO is implemented with the actor-critic architecture, therefore it converges faster than TRPO.

Analytical comparisons

To provide a clear understanding of the advantages and disadvantages of different motion planning algorithms, we divide these algorithms into four groups: traditional algorithms, classical ML algorithms, optimal value RL and policy gradient RL. Comparisons are made according to the principles of the algorithms mentioned in sections II, III, IV and V. First, direct comparisons of the algorithms in each group are made to provide a clear understanding of their input, output, and key features. Second, analytical comparisons of all motion planning algorithms are made to provide a comprehensive understanding of the performance and applications of the algorithms, according to the criteria summarized. Third, analytical comparisons of the convergence of RL-based motion planning algorithms are made separately, because RL-based algorithms are the recent research focus.

Direct comparisons of motion planning algorithms

Traditional algorithms This group includes graph search algorithms, sampling-based algorithms, interpolating curve algorithms and reaction-based algorithms. Table 1 lists their input, output, and key features. According to Table 1: (1) Graph search algorithms, sampling-based algorithms, and interpolating curve algorithms use a graph or map of the workspace as the input and generate global trajectories directly, while reaction-based algorithms use different types of information as the input. (2) Graph search algorithms find the shortest and collision-free trajectories by search methods (e.g., best-first search). For example, Dijkstra's algorithm is based on best-first search. However, the search process is computationally expensive because the search space is large, therefore a heuristic function is used to reduce the search space and the shortest path is found by estimating the overall cost (e.g., A*). (3) Sampling-based algorithms randomly sample collision-free trajectories in the search space (e.g., PRM), and constraints (e.g., the non-holonomic constraint) are needed for some algorithms (e.g., RRT) in the sampling process. (4) Interpolating curve algorithms plan their path by mathematical rules, and the planned path is then smoothed by CAGD. (5) In reaction-based algorithms, the moving directions of robots in PFM (APF) are the gradients of the converged and combined potential field function U. The velocity of robots in VOM is selected from the RAV that is related to the VO and RV. Exhaustive search and heuristic search can be used in the velocity selection process of VOM, and the selected velocity must maximize the objective function U (e.g., distance traveled and motion time). Before the velocity selection process of DWA, it is necessary to reduce the search space of velocities to obtain the resulting search space Vr = Vs ∩ Va ∩ Vd. The proper velocity of the robot in DWA is selected from Vr, and the selected velocity must maximize the objective function U; a toy sketch of this selection rule is given below. Note that U includes a measure of progress towards the goal location, the forward velocity of the robot, and the distance to the next obstacle on the trajectory. (6) The outputs of graph search algorithms, sampling-based algorithms, and interpolating curve algorithms are trajectories. The outputs of PFM (APF) are the moving directions of robots according to the gradient of the converged and combined potential field function U, while the outputs of VOM and DWA are the selected velocity among possible velocities.

Classical ML algorithms This group includes MSVM, LSTM, MCTS and CNN. These algorithms are listed in Table 2. According to that: (1) MSVM, LSTM and MCTS use a well-prepared vector as the input, while CNN can directly use an image as its input. (2) LSTM and MCTS can output time-sequential actions because of their structures (e.g., tree) that can store and learn time-sequential features. MSVM and CNN cannot output time-sequential actions because they output a one-step prediction by performing a trained classifier. (3) MSVM plans the motion of the robot by training a maximum margin classifier. LSTM stores and processes inputs in its cell, which is a stack structure, and actions are then outputted by performing the trained LSTM model. MCTS is the combination of the Monte-Carlo method and a search tree: environmental states and values are stored and updated in the nodes of the tree, and actions are outputted by performing the trained MCTS model. CNN converts high-dimensional images to low-dimensional features by convolutional layers; these low-dimensional features are used to train a CNN model, and actions are outputted by performing the trained CNN model.
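As a toy illustration of the DWA-style velocity selection described above (the window bounds, weights and scoring terms here are invented for the example and are not the original DWA formulation), a velocity pair is chosen from a discretized resulting search space Vr by maximizing a weighted objective U:

```python
import numpy as np

# Invented example quantities: candidate velocities within the speed limits (Vs),
# a stand-in membership test for Va and Vd, and a weighted objective U.
v_candidates = np.linspace(0.0, 1.0, 11)      # linear velocities
w_candidates = np.linspace(-1.0, 1.0, 21)     # angular velocities
c_head, c_clear, c_vel = 0.8, 0.2, 0.1

def in_dynamic_window(v, w):
    """Stand-in test for reachable and admissible velocities (Va and Vd)."""
    return abs(v - 0.5) <= 0.3 and abs(w) <= 0.8

def heading_score(v, w):
    return 1.0 - abs(w)            # toy: prefer heading straight to the goal

def clearance_score(v, w):
    return 1.0 - 0.5 * v           # toy: slower motion keeps more clearance

def velocity_score(v, w):
    return v                       # prefer faster forward motion

best, best_u = None, -np.inf
for v in v_candidates:
    for w in w_candidates:
        if not in_dynamic_window(v, w):        # restrict to Vr = Vs ∩ Va ∩ Vd
            continue
        u = c_head * heading_score(v, w) + c_clear * clearance_score(v, w) \
            + c_vel * velocity_score(v, w)
        if u > best_u:
            best, best_u = (v, w), u

print(best, best_u)
```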
Table 1 Comparisons of traditional planning algorithms

Graph search algorithms (e.g., Dijkstra's, A*). Input: graph or map. Key features: (1) best-first search (large search space); (2) heuristic function in cost estimation (A*). Output: trajectory.
Sampling-based algorithms (e.g., PRM, RRT). Input: graph or map. Key features: (1) random search (suboptimal path); (2) non-holonomic constraint (RRT). Output: trajectory.
Interpolating curve algorithms (line and circle, clothoid curves, polynomial curves, Bezier curves, spline curves). Input: graph or map. Key features: (1) mathematical rules; (2) path smoothing using CAGD. Output: trajectory.
Reaction-based algorithms:
  PFM (APF). Input: robot configurations (e.g., position). Key features: (1) different potential field functions U for different targets (e.g., goal, obstacle); (2) combined U and gradient of U. Output: moving directions.
  VOM. Input: positions and velocities (robot and obstacles). Key features: (1) VO, RV and RAV; (2) exhaustive/global search and heuristic search according to the objective function U. Output: selected velocity among possible velocities.
  DWA. Input: robot's position, distances to goal/obstacles, (v, w), and kinematics of the robot. Key features: (1) Vs, Va, Vd, Vr; (2) velocity selection according to the objective function U. Output: selected velocity among possible velocities.
Table 2 Comparison of classical ML algorithms

MSVM. Input: vector. Key features: maximum margin classifier. Output: none-sequential actions.
LSTM. Input: vector. Key features: cell (stack structure). Output: time-sequential actions.
MCTS. Input: vector. Key features: Monte-Carlo method / tree structure. Output: time-sequential actions.
CNN. Input: image. Key features: convolutional layers / weight matrix. Output: none-sequential actions.
Optimal value RL This group includes Q learning, DQN, double DQN, and dueling DQN. Features of the algorithms here include the replay buffer, the objective of the algorithm, and the method of weight update. Comparisons of these algorithms are listed in Table 3. According to that: (1) Q learning normally uses a well-prepared vector as the input, while DQN, double DQN and dueling DQN use images as their input because these algorithms use convolutional layers to process high-dimensional images. (2) The outputs of these algorithms are time-sequential actions obtained by performing the trained model. (3) DQN, double DQN and dueling DQN use a replay buffer to reuse experience, while Q learning collects experiences and learns from them in an online way. (4) DQN, double DQN and dueling DQN use the MSE e² as their objectives. Their differences are: first, DQN obtains action values by the neural network Q(s, a; θ), while Q learning obtains action values by querying the Q-table; second, double DQN uses another neural network θ⁻ to evaluate the selected actions and obtain better action values by Q(s′, argmax_{a′} Q(s′, a′; θ); θ⁻); third, dueling DQN obtains action values by dividing them into advantage values and state values. The constraint E_{a∼π(s)}[A(s, a)] = 0 is imposed on the advantage value, therefore a better action value is obtained by Q(s, a; θ, α, β) = V(s; θ, β) + [A(s, a; θ, α) − (1/|A|) Σ_{a_t^− ∈ A} A(s_t, a_t^−; θ, α)]. The networks that approximate the action value in these algorithms are updated by minimizing the MSE with the gradient descent approach.
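The difference between the DQN and double-DQN targets described above can be written compactly. The sketch below (illustrative, with random numbers standing in for the outputs of the two networks θ and θ⁻) computes both targets for a batch of transitions:

```python
import numpy as np

rng = np.random.default_rng(7)
batch, n_actions, gamma = 32, 5, 0.99

r = rng.normal(size=batch)                            # rewards
q_next_online = rng.normal(size=(batch, n_actions))   # Q(s', .; theta)
q_next_target = rng.normal(size=(batch, n_actions))   # Q(s', .; theta-)

# DQN target: the same network both selects and evaluates the next action.
y_dqn = r + gamma * q_next_online.max(axis=1)

# Double DQN target: theta selects the action, theta- evaluates it.
a_star = q_next_online.argmax(axis=1)
y_double = r + gamma * q_next_target[np.arange(batch), a_star]
```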
Table 3 Comparison of optimal-value RL

Q learning. Input: vector. Output: time-sequential actions. Replay buffer: no. Objective: e², where e = r + γ max_{a′} Q(s′, a′) − Q(s, a). Method of weight update: gradient descent.
DQN. Input: image. Output: time-sequential actions. Replay buffer: yes. Objective: e², where e = r + γ max_{a′} Q(s′, a′; θ) − Q(s, a; θ). Method of weight update: gradient descent.
Double DQN. Input: image. Output: time-sequential actions. Replay buffer: yes. Objective: e², where e = r + γ Q(s′, argmax_{a′} Q(s′, a′; θ); θ⁻) − Q(s, a; θ). Method of weight update: gradient descent.
Dueling DQN. Input: image. Output: time-sequential actions. Replay buffer: yes. Objective: e², where e = r + γ max_{a′} Q(s′, a′; θ) − Q(s, a; θ) and Q(s, a; θ, α, β) = V(s; θ, β) + [A(s, a; θ, α) − (1/|A|) Σ_{a_t^− ∈ A} A(s_t, a_t^−; θ, α)]. Method of weight update: gradient descent.
Table 4 Comparison of policy gradient RL

PG. Input: image/vector. Output: time-sequential actions. Actor-critic architecture: no. Multi-thread method: no. Replay buffer: no. Objective: 1. Method of weight update: gradient ascent.
AC. Input: image/vector. Output: time-sequential actions. Actor-critic architecture: yes. Multi-thread method: no. Replay buffer: no. Objective: critic 2, actor 3. Method of weight update: gradient ascent.
A3C. Input: image/vector. Output: time-sequential actions. Actor-critic architecture: yes. Multi-thread method: yes. Replay buffer: no. Objective: critic 4, actor 5. Method of weight update: gradient ascent, asynchronous update.
A2C. Input: image/vector. Output: time-sequential actions. Actor-critic architecture: yes. Multi-thread method: yes. Replay buffer: no. Objective: critic 6, actor 7. Method of weight update: gradient ascent, synchronous update.
DPG. Input: image/vector. Output: time-sequential actions. Actor-critic architecture: yes. Multi-thread method: no. Replay buffer: yes. Objective: critic 8, actor 9. Method of weight update: gradient ascent.
DDPG. Input: image/vector. Output: time-sequential actions. Actor-critic architecture: yes. Multi-thread method: no. Replay buffer: yes. Objective: critic 10, actor 11. Method of weight update: gradient ascent, soft update.
TRPO. Input: image/vector. Output: time-sequential actions. Actor-critic architecture: yes. Multi-thread method: no. Replay buffer: no. Objective: critic 12, actor 13. Method of weight update: gradient ascent.
PPO. Input: image/vector. Output: time-sequential actions. Actor-critic architecture: yes. Multi-thread method: no. Replay buffer: no. Objective: critic 14, actor 15. Method of weight update: gradient ascent.

Objectives referenced in the table:
1. E_{τ∼πθ(τ)}[R(τ)]
2. e², e = r_t + V(s_{t+1}; w) − V(s_t; w)
3. πθ(a_t|s_t) · e
4. e², e = Σ_{i=t}^{T} γ^{i−t} r_i + V(s_{t+n}; w) − V(s_t; w)
5. πθ(a_t|s_t) · e + βH(π(s_t; θ))
6. e², e = Σ_{i=t}^{T} γ^{i−t} r_i + V(s_{t+n}; w) − V(s_t; w)
7. πθ(a_t|s_t) · e + βH(π(s_t; θ))
8. e², e = r_t + γQ^w(s_{t+1}, μθ(s_{t+1})) − Q^w(s_t, a_t)
9. Q^w(s_t, a_t), objective gradient: ∇θ μθ(s)∇_a Q^w(s, a)|_{a=μθ(s)}
10. e², e = r_i + γQ(s_{i+1}, μ(s_{i+1}; θ^{μ′}); θ^{Q′}) − Q(s_i, a_i; θ^Q)
11. Q(s_i, a_i; θ^μ), objective gradient: ∇_{θ^μ} μ(s_i; θ^μ)∇_a Q(s_i, a; θ^Q)|_{a=μ(s_i)}
12. A_t = δ_t + (γλ)δ_{t+1} + ⋯ + (γλ)^{T−t}δ(s_T), δ_t = r_t + γV(s_{t+1}; w) − V(s_t; w)
13. E[(θ(a|s)/θold(a|s)) A_{θold}(s, a)], s.t. D^max_{KL}(θold, θ) ≤ δ
14. A_t = δ_t + (γλ)δ_{t+1} + ⋯ + (γλ)^{T−t}δ(s_T), δ_t = r_t + γV(s_{t+1}; w) − V(s_t; w)
15. (1) L^{KLPEN}(θ) = E[(θ(a|s)/θold(a|s)) A_{θold}(s, a) − β · D^max_{KL}(θold, θ)]; (2) L^{CLIP+VF+S}(θ) = E[L^{CLIP}(θ) + c_1 L^{VF}(θ) + c_2 S(πθ|s)], where L^{CLIP}(θ) = E{min[r(θ)A, clip(r(θ), 1 − ε, 1 + ε)A]}
Policy gradient RL The algorithms in this group include PG, AC, A3C, A2C, DPG, DDPG, TRPO, and PPO. Their features include the actor-critic architecture, the multi-thread method, the replay buffer, the objective of the algorithm, and the method of weight update. Comparisons of these algorithms are listed in Table 4. According to that: (1) The inputs of policy gradient RL can be images or vectors; images are used as inputs under the condition that a convolutional layer is used as a preprocessing component to convert high-dimensional images to low-dimensional features. (2) The outputs of policy gradient RL are time-sequential actions obtained by performing the trained policy π(s) : s → a. (3) The actor-critic architecture is not used in PG, while the other policy gradient RL algorithms are implemented with the actor-critic architecture. (4) A3C and A2C use the multi-thread method to collect data and update their networks, while the other policy gradient RL algorithms are based on a single thread in data collection and network update. (5) DPG and DDPG use the replay buffer to reuse data in an offline way, while the other policy gradient RL algorithms learn online. (6) The objective of PG is defined as the expectation of accumulative rewards in the episode by E_{τ∼πθ(τ)}[R(τ)]. The critic objectives of AC, A3C, A2C, DPG and DDPG are defined as the MSE e², and their critic networks are updated by minimizing the MSE. Their actor objectives, however, are different: first, the actor objective of AC is defined as πθ(a_t|s_t) · e; second, policy entropy is added to πθ(a_t|s_t) · e to encourage exploration, therefore the actor objectives of A3C and A2C are defined by πθ(a_t|s_t) · e + βH(π(s_t; θ)); third, DPG and DDPG use the action value Q(s_i, a_i; θ^μ), approximated by a neural network, as their actor objectives, and their policy networks (actor networks) are updated by obtaining the objective gradient ∇_{θ^μ} μ(s_i; θ^μ)∇_a Q(s_i, a; θ^Q)|_{a=μ(s_i)}. The critic objectives of TRPO and PPO are defined as the advantage value by A_t = δ_t + (γλ)δ_{t+1} + ⋯ + (γλ)^{T−t}δ(s_T), where δ_t = r_t + γV(s_{t+1}; w) − V(s_t; w), and their critic networks w are updated by minimizing δ_t². Their actor objectives are different: the objective of TRPO is defined as E[(θ(a|s)/θold(a|s)) A_{θold}(s, a)], s.t. D^max_{KL}(θold, θ) ≤ δ, in which a fixed trust region constraint δ is used to ensure the monotonous update of the policy network θ, while PPO uses the "surrogate" interval [1 − ε, 1 + ε] and an adaptive penalty β to ensure a better monotonous update of the policy network. PPO therefore has two objectives, defined as L^{KLPEN}(θ) = E[(θ(a|s)/θold(a|s)) A_{θold}(s, a) − β · D^max_{KL}(θold, θ)] (KL-penalized objective) and L^{CLIP+VF+S}(θ) = E[L^{CLIP}(θ) + c_1 L^{VF}(θ) + c_2 S(πθ|s)] (surrogate objective), where L^{CLIP}(θ) = E{min[r(θ)A, clip(r(θ), 1 − ε, 1 + ε)A]}. (7) The policy networks of policy gradient RL are all updated by the gradient ascent method θ ← θ + α∇θ J(θ). The policy networks of A3C and A2C are updated in asynchronous and synchronous ways, respectively. The networks of DDPG are updated in a "soft" way by θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′} and θ^{μ′} ← τθ^μ + (1 − τ)θ^{μ′}.

Analytical comparisons of motion planning algorithms

Here analytical comparisons of motion planning algorithms are made according to the general criteria we summarized. These criteria consist of six aspects: local or global planning, path length, optimal velocity, reaction speed, safe distance and time-sequential path. The speed and stability of network convergence for optimal value RL and policy gradient RL are then compared analytically, because the convergence speed and stability of RL in robotic motion planning are a recent research focus.

Comparisons according to general criteria

Local (reactive) or global planning This criterion denotes the area where the algorithm is used in most cases. Table 5 lists the planning algorithms and the criteria they fit. According to Table 5: (1) Graph search algorithms plan their path globally by search methods (e.g., depth-first search, best-first search) to obtain a collision-free trajectory on the graph or map. (2) Sampling-based algorithms sample the local or global workspace by sampling methods (e.g., random tree) to find collision-free trajectories. (3) Interpolating curve algorithms draw fixed and short trajectories by mathematical rules to avoid local obstacles. (4) Reaction-based algorithms plan local paths or reactive actions according to their objective functions. (5) MSVM and CNN make one-step predictions by trained classifiers to decide their local motions. (6) LSTM, MCTS, optimal value RL and policy gradient RL can perform time-sequential motion planning from the start to the destination by performing their trained models. These models include the stack-structure model of LSTM, the tree model of MCTS and the weight-matrix model of RL. These algorithms fit global motion planning tasks theoretically if the size of the workspace is not large, because it is hard to train a converged model in a large workspace. In most cases, the models of these algorithms are trained in a local workspace to make time-sequential predictions by performing the trained model or policy π(s) : s → a.

Path length This criterion denotes the length of the planned path, described as optimal path (the shortest path), suboptimal path, or fixed path. The path lengths of the algorithms are listed in Table 5. According to that: (1) Graph search algorithms can find the shortest path by performing search methods (e.g., best-first search) in the graph or map. (2) Sampling-based algorithms plan a suboptimal path: their sampling methods (e.g., random tree) lead to insufficient sampling that only covers a part of the cases, and a suboptimal path is obtained.

Table 5 Analytical comparisons according to general criteria

Graph search algorithms. Local/global: global. Path length: optimal path (shortest path). Optimal velocity: –*. Reaction speed: slow. Safe distance: fixed distance / high collision rate. Time-sequential path: no.
Sampling-based algorithms. Local/global: local/global. Path length: suboptimal path. Optimal velocity: –. Reaction speed: slow. Safe distance: fixed distance / high collision rate. Time-sequential path: no.
Interpolating curve algorithms. Local/global: local. Path length: fixed path. Optimal velocity: –. Reaction speed: medium. Safe distance: fixed distance. Time-sequential path: no.
Reaction-based algorithms. Local/global: local. Path length: optimal path. Optimal velocity: optimal velocity. Reaction speed: medium. Safe distance: suboptimal distance. Time-sequential path: no.
MSVM. Local/global: local. Path length: suboptimal path. Optimal velocity: suboptimal velocity. Reaction speed: fast. Safe distance: suboptimal distance. Time-sequential path: no.
LSTM. Local/global: local/global. Path length: suboptimal path. Optimal velocity: suboptimal velocity. Reaction speed: fast. Safe distance: suboptimal distance. Time-sequential path: yes.
MCTS. Local/global: local/global. Path length: optimal path. Optimal velocity: –. Reaction speed: fast. Safe distance: optimal distance. Time-sequential path: yes.
CNN. Local/global: local. Path length: suboptimal path. Optimal velocity: suboptimal velocity. Reaction speed: fast. Safe distance: suboptimal distance. Time-sequential path: no.
Q learning, DQN, double DQN, dueling DQN, PG, AC, A3C, A2C, DPG, DDPG, TRPO, PPO. Local/global: local/global. Path length: optimal path. Optimal velocity: optimal velocity. Reaction speed: fast. Safe distance: optimal distance. Time-sequential path: yes.

* The mark "–" denotes a performance that cannot be evaluated
(3) Interpolating curve algorithms plan their path according to mathematical rules, which leads to a fixed path length. (4) Reaction-based algorithms can find the shortest path if their objective functions are related to the shortest distance travelled, and the robots then move towards the directions that maximize the objective functions. (5) MSVM, LSTM and CNN plan their path by performing models that are trained with a human-labeled dataset, therefore a suboptimal path is obtained. MCTS can receive feedback (reward) from the environment to update its model (tree), therefore it is possible to find the shortest path, as with other RL algorithms. (6) RL algorithms (optimal value RL and policy gradient RL) can generate the optimal path under the condition that a reasonable penalty is used to punish the number of moved steps in training; the optimal path is then obtained by performing the trained RL policy.

Optimal velocity This criterion denotes the ability to tune the velocity when algorithms plan their path, so that the robot can reach the destination with minimum time on the travelled path. This criterion is described as optimal velocity and suboptimal velocity. Table 5 lists the performance of the algorithms in optimal velocity. According to that: (1) The performance of graph search algorithms, sampling-based algorithms and interpolating curve algorithms in velocity tuning cannot be evaluated, because these algorithms are only designed for path planning, i.e., to find a collision-free trajectory in the graph or map. (2) Among reaction-based algorithms, the velocity of robots in PFM can be tuned according to the obtained potential field, while the velocities of robots in VOM and DWA are dynamically selected among their possible velocities according to their objective functions. Hence, reaction-based algorithms can realize optimal velocity. (3) MSVM, LSTM, and CNN can output actions in the format v = [v_x, v_y], where v_x and v_y are the velocities along the x and y axes, if the algorithms are trained with these vector labels. However, these velocity-related labels are all hard-coded artificially, and the time to reach the destination heavily relies on these artificial factors.
Therefore, these algorithms cannot realize optimal velocity. MCTS can realize optimal velocity theoretically if the size of the action space is small. However, no case has been found where MCTS is used in velocity-related tasks. A large action space leads to a large search tree, and it is hard to train such a large tree model. In the normal case, MCTS is used for one-step prediction via its trained tree model, therefore the state can transit into the next state but the velocity of this state transition is not considered. Hence, its ability in velocity tuning cannot be evaluated. (4) Optimal value RL and policy gradient RL can realize optimal velocity by attaching a penalty to the consumed time in training. These algorithms can automatically learn how to choose the best velocity in the action space so that as little time as possible is consumed, therefore robots can realize optimal velocity by performing the trained policy. Note that in this case, the actions in optimal value RL and policy gradient RL must be in the format [v_x, v_y], and an action space that contains many action choices must be pre-defined artificially.

Reaction speed This criterion denotes the computational cost or time cost of the algorithm to react to dynamic obstacles. The computational cost of some algorithms (e.g., traditional non-AI algorithms) would be easy to obtain by counting the number of lines and functions in the source code (Kinnunen et al., 2011). However, the problems of this approach are that all source codes should be found first, and the result would still depend on the chosen programming language and the quality of the implementation. It is hard to obtain clear computational costs, thus reaction speed is here briefly described analytically using three levels: slow, medium and fast. Table 5 lists the reaction speed of the algorithms. According to that: (1) Graph search algorithms and sampling-based algorithms rely on planned trajectories in the graph or map to avoid obstacles. However, the graph or map is normally updated at a slow frequency, therefore the reaction speed of these algorithms is slow. (2) Interpolating curve algorithms plan their path according to mathematical rules with limited and predictable computation time, therefore the reaction speed of these algorithms is medium. (3) Reaction-based algorithms should first update their potential field (for PFM) and search space (for VOM and DWA) in small local areas to select moving directions or proper velocities. These update processes take less time than those of graph search algorithms. Hence, the reaction speed of reaction-based algorithms is medium. (4) Classical ML algorithms, optimal value RL and policy gradient RL react to obstacles by performing a trained model or policy π(s) : s → a that maps the state of the environment to a probability distribution P(a|s) or to actions directly (e.g., DDPG). This process is fast and its time cost can be ignored, therefore the reaction speed of these algorithms is fast.

Safe distance This criterion denotes the ability to keep a safe distance to obstacles. Safe distance is described at three levels: fixed distance, suboptimal distance and optimal distance. Table 5 lists the performance of the algorithms. According to that: (1) Graph search algorithms and sampling-based algorithms keep a fixed distance to static obstacles by hard-coded settings in robotic applications. However, a high collision rate is inevitable in dynamic environments because of the slow update frequency of the graph or map. (2) Interpolating curve algorithms keep a fixed distance to static and dynamic obstacles according to mathematical rules. (3) Reaction-based algorithms can theoretically keep safe and dynamic distances to obstacles according to their potential field functions (for PFM) or objective functions (for VOM and DWA). However, their updates of potential fields or search spaces cost a short period of time that cannot be ignored, which increases the rate of collision with obstacles, especially in high-speed, dense, and dynamic scenarios. Hence, they keep a suboptimal distance to obstacles. (4) MSVM, LSTM and CNN keep a suboptimal distance to static and dynamic obstacles. The suboptimal distance is obtained by performing a model that is trained with a human-labeled dataset. MCTS can output a collision-free time-sequential path and theoretically keep a safe distance to obstacles, as other RL algorithms do. (5) Optimal value RL and policy gradient RL keep a safe and dynamic distance to static and dynamic obstacles with a computational cost that can be ignored, by performing a trained policy π(s) : s → a. This policy is trained under the condition that a penalty is used to punish close distances between the robot and obstacles in training, therefore the algorithms automatically learn how to keep an optimal distance to obstacles while the robot moves towards the destination.

Time-sequential path This criterion denotes whether an algorithm fits time-sequential tasks or not. Table 5 lists the algorithms that fit time-sequential planning. According to that: (1) Graph search algorithms, sampling-based algorithms, interpolating curve algorithms, and reaction-based algorithms plan their path according to the graph or map, mathematical rules, or the obtained potential field or search spaces, regardless of the environmental state in each time step. Hence, these algorithms do not fit time-sequential tasks. (2) MSVM and CNN output actions by one-step predictions that have no relation to the environmental state in each time step. (3) LSTM and MCTS store the environmental state of each time step in their cells and nodes, respectively, and their models are updated by learning from these time-related experiences. Time-sequential actions are outputted by performing the trained models, therefore these algorithms fit time-sequential tasks. (4) Optimal value RL and policy gradient RL train their policy networks by learning from the environmental state in each time step. Time-sequential actions are outputted by performing the trained policy, therefore these algorithms fit time-sequential tasks.
Comparisons of convergence speed and stability

Convergence speed Here poor, reasonable, good, and excellent are used to describe the performance in convergence speed. Table 6 lists the performance of optimal value RL and policy gradient RL. According to that: (1) Q learning only fits simple motion planning with a small Q-table; it is hard for Q learning with a large Q-table to converge in a complex environment. Over-estimation of the Q value also leads to poor performance of Q learning if a neural network is used to approximate the Q value. (2) DQN suffers from over-estimation of the Q value when a CNN is used to approximate Q values, but DQN learns from the experience in the replay buffer, which makes the network reuse high-quality experience efficiently. Hence, the convergence speed of DQN is reasonable. (3) Double DQN uses another network θ⁻ to evaluate the actions that are selected by θ. A new Q value with less over-estimation is obtained by Q(s′, argmax_{a′} Q(s′, a′; θ); θ⁻), therefore the convergence speed is improved. (4) Dueling DQN finds a better Q value by: first, dividing the action value into the state value and the advantage value, Q(s, a) = V(s, a) + A(s, a); second, constraining the advantage value A(s, a) by E_{a∼π(s)}[A(s, a)] = 0, so that the new action value is obtained by Q(s, a; θ, α, β) = V(s; θ, β) + {A(s, a; θ, α) − (1/|A|) Σ_{a_t^− ∈ A} A(s_t, a_t^−; θ, α)}. Hence, the performance of dueling DQN in convergence speed is good. (5) PG updates its policy according to the episode rewards by E_{τ∼πθ(τ)}[R(τ)], therefore poor performance in convergence speed is inevitable. (6) AC uses the critic network to evaluate the actions selected by the actor network, thereby speeding up the convergence. (7) A3C and A2C use the multi-thread method to improve convergence speed directly, and policy entropy is also used to encourage exploration; these methods indirectly enhance the convergence speed. (8) The performance of DPG and DDPG in convergence speed is good because: first, their critics are unbiased critic networks obtained by CFA and gradient Q learning; second, their policies are deterministic policies μθ(s), which converge faster than the stochastic policy πθ(a_t|s_t); third, their networks are updated offline with the replay buffer; fourth, noise is used in DDPG to encourage exploration. (9) TRPO makes a great improvement in convergence speed by adding a trust region constraint to policies by D^max_{KL}(θold, θ) ≤ δ, therefore its networks are updated monotonously by maximizing its objective E[(θ(a|s)/θold(a|s)) A_{θold}(s, a)], s.t. D^max_{KL}(θold, θ) ≤ δ. (10) PPO moves further in improving convergence speed by introducing the "surrogate" objective L^{CLIP+VF+S}(θ) = E[L^{CLIP}(θ) + c_1 L^{VF}(θ) + c_2 S(πθ|s)] and the KL-penalized objective L^{KLPEN}(θ) = E[(θ(a|s)/θold(a|s)) A_{θold}(s, a) − β · D^max_{KL}(θold, θ)].

Convergence stability Table 7 lists the convergence stability of optimal value RL and policy gradient RL. According to that: (1) Q learning updates its action value every step, therefore bias is introduced. Over-estimation of the Q value leads to a suboptimal update direction of the Q value if a neural network is used as the approximator. Hence, the convergence stability of Q learning is poor. (2) DQN improves the convergence stability with the replay buffer, from which a batch of experiences is sampled for training so that the network is updated according to the batch loss. (3) Double DQN and dueling DQN can find a better action value than that of DQN by the evaluation network and the advantage network respectively, therefore the networks of these algorithms are updated towards a better direction. (4) PG updates its network according to the accumulative episode reward. This reduces the bias caused by one-step rewards, but it introduces high variance. Hence, the network of PG is updated with stability, but it is still hard to converge. (5) The performance of the actor and critic networks of AC is poor in early-stage training. This leads to a fluctuating update of the networks in the beginning, although the networks are updated by the gradient ascent method θ ← θ + ∇θ J(θ). (6) A3C and A2C update their networks by multi-step rewards Σ_{i=t}^{T} γ^{i−t} r_i, which reduces the bias and improves convergence stability, although it introduces some variance. The gradient ascent method also helps convergence stability, therefore their performance in convergence stability is reasonable. (7) The unbiased critic, the gradient ascent method and the replay buffer contribute to good performance in convergence stability for DPG and DDPG. Additionally, the target networks of DDPG are updated in a soft way by θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′} and θ^{μ′} ← τθ^μ + (1 − τ)θ^{μ′}, which also contributes to convergence stability. (8) The networks of TRPO and PPO are updated monotonously. TRPO achieves this goal by the trust region constraint D^max_{KL}(θold, θ) ≤ δ, while PPO uses the "surrogate" objective L^{CLIP+VF+S}(θ) = E[L^{CLIP}(θ) + c_1 L^{VF}(θ) + c_2 S(πθ|s)] and the KL-penalized objective L^{KLPEN}(θ) = E[(θ(a|s)/θold(a|s)) A_{θold}(s, a) − β · D^max_{KL}(θold, θ)]. Hence, the performance of their networks in convergence stability is good.

Future directions of robotic motion planning

Here a common but complex real-world motion planning task is first given: how to realize long-distance motion planning with safety and efficiency (e.g., long-distance luggage delivery by robots)? Research questions and directions are then obtained by analyzing this task according to its processing steps, which include data collection, data preprocessing, motion planning and decision making (Fig. 28).

Table 6 Comparison of speed in convergence for RL

Q learning. Speed of convergence: poor. Reasons: over-estimation of the action value.
DQN. Speed of convergence: reasonable. Reasons: replay buffer.
Double DQN. Speed of convergence: good. Reasons: replay buffer; another network for evaluation.
Dueling DQN. Speed of convergence: good. Reasons: replay buffer; division of the action value Q(s, a) = V(s, a) + A(s, a).
PG. Speed of convergence: poor. Reasons: high variance of E_{τ∼πθ(τ)}[R(τ)].
AC. Speed of convergence: reasonable. Reasons: actor-critic architecture.
A3C. Speed of convergence: good. Reasons: actor-critic architecture; multi-thread method; policy entropy.
A2C. Speed of convergence: good. Reasons: actor-critic architecture; multi-thread method; policy entropy.
DPG. Speed of convergence: good. Reasons: replay buffer; actor-critic architecture; deterministic policy; unbiased critic network.
DDPG. Speed of convergence: good. Reasons: replay buffer; actor-critic architecture; deterministic policy; unbiased critic network; exploration noise.
TRPO. Speed of convergence: good. Reasons: actor-critic architecture; fixed trust region constraint.
PPO. Speed of convergence: excellent. Reasons: actor-critic architecture; "surrogate" objective; KL-penalized objective.
Data collection To realize the mentioned task, these questions should be considered first: (1) how to collect enough data? (2) how to collect high-quality data? To collect enough data in a short time, the multi-thread method or cloud technology should be considered; existing techniques seem sufficient to solve this question well. To collect high-quality data, existing works use a prioritized replay buffer (Oh et al., 2018) to reuse high-quality data to train the network. Imitation learning (Codevilla et al., 2018; Oh et al., 2018) is also used for network initialization with expert experience, so that the network can converge faster (e.g., deep V learning (Chen et al., 2016, 2019)). Existing methods in data collection work well, therefore it is hard to make further optimizations.

Data preprocessing Data fusion and data translation should be considered after data is obtained. Multi-sensor data fusion algorithms (Durrant-Whyte & Henderson, 2008) fuse data collected from the same or different types of sensors. Data fusion is realized at the pixel, feature, and decision levels, therefore a partial understanding of the environment is avoided. Another way to avoid a partial understanding of the environment is data translation, which interprets data into a new format so that algorithms can better understand the relationship between the robot and other obstacles (e.g., attention weights (Chen et al., 2019) and relation graphs (Chen et al., 2020)). However, algorithms for data fusion and translation cannot fit all cases, therefore further work is needed according to the environment of the application.

Motion planning In this step, the selection and optimization of motion planning algorithms should be considered: (1) If traditional motion planning algorithms (e.g., A*, RRT) are selected for the task mentioned before, a global trajectory from the start to the destination will be obtained, but this process is computationally expensive because of the large search space. To solve this problem, the combination of traditional algorithms and other ML algorithms (e.g., CNN, DQN) may be a good choice. For example, RRT can be combined with DQN (Fig. 29) by using the action value to predict the directions of tree expansion, instead of heuristic or random search; a toy sketch of this idea is given below.
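The following toy sketch is entirely hypothetical (the grid world, the stand-in Q-value function and the node structure are invented for illustration and are not taken from the cited works); it only shows the idea of letting an action-value estimate bias the expansion direction of an RRT-like tree instead of choosing it at random:

```python
import numpy as np

rng = np.random.default_rng(8)
directions = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])  # discrete expansion moves
goal = np.array([9.0, 9.0])

def q_estimate(node, moves):
    """Hypothetical stand-in for a trained DQN: score each move by how much
    it reduces the distance to the goal, plus a little noise."""
    dists = np.linalg.norm(node + moves - goal, axis=1)
    return -dists + 0.1 * rng.normal(size=len(moves))

tree = [np.array([0.0, 0.0])]                   # RRT-like tree, stored as a list of nodes
for _ in range(50):
    node = tree[rng.integers(len(tree))]        # pick a node to expand
    q = q_estimate(node, directions)
    step = directions[int(np.argmax(q))]        # Q-guided instead of random direction
    new_node = node + step
    # A real planner would also run a collision check here before accepting the node.
    tree.append(new_node)

print(len(tree), tree[-1])
```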
Table 7 Comparison of stability in convergence for RL

Q learning. Stability of convergence: poor. Reasons: update in each step; over-estimation of the action value.
DQN. Stability of convergence: reasonable. Reasons: replay buffer.
Double DQN. Stability of convergence: reasonable. Reasons: replay buffer; evaluation network.
Dueling DQN. Stability of convergence: reasonable. Reasons: replay buffer; advantage network.
PG. Stability of convergence: good. Reasons: update by trajectory rewards; gradient ascent update.
AC. Stability of convergence: poor (in the beginning). Reasons: update in each step; gradient ascent update.
A3C. Stability of convergence: reasonable. Reasons: update by multi-step rewards; gradient ascent update.
A2C. Stability of convergence: reasonable. Reasons: update by multi-step rewards; gradient ascent update.
DPG. Stability of convergence: good. Reasons: unbiased critic; gradient ascent update; replay buffer.
DDPG. Stability of convergence: good. Reasons: unbiased critic; gradient ascent update; replay buffer; soft update.
TRPO. Stability of convergence: good. Reasons: monotonous update; gradient ascent update.
PPO. Stability of convergence: good. Reasons: monotonous update; gradient ascent update.
(2) It seems impossible to use supervised learning to realize the task mentioned above safely and quickly, because a global path cannot be obtained from supervised learning that outputs only one-step predictions. (3) A global path cannot be obtained by optimal value RL or policy gradient RL either, but their local performance in safety and efficiency is good: performing a trained RL policy leads to quick reaction, a safe distance to obstacles, and the shortest travelled path or time. However, training an RL policy is time-consuming because of deficiencies in network convergence. Existing works have made some optimizations to improve convergence (e.g., DDPG, PPO) in games and on physical robots to shorten the training time of RL, but there is still a long way to go for real-world engineering and manufacturing for commercial purposes. A recent trend for improving network convergence is to create hybrid architectures that fuse high-performance components (e.g., replay buffer, actor-critic architecture, policy entropy, multi-thread methods). Apart from optimizations of motion planning algorithms, hardware planning may be another direction for improving the performance of future motion planning systems. Hardware planning refers to the reconfiguration and adjustment of the hardware of the robotic system, so that robots can be reconfigured into different shapes (Salemi et al., 2006; Wang et al., 2020) to improve their performance in motion planning.

Decision Traditional algorithms (e.g., A*) feature global trajectory planning, while optimal value RL and policy gradient RL feature safe and quick motion planning locally. A good direction for the task mentioned above is therefore to combine a traditional algorithm with RL. This is achieved by fusing the commands generated by each algorithm or functional module (e.g., functional modules in ROS) via algorithm-level and system-level data fusion. Multi-sensor fusion techniques can not only fuse sensor information at the pixel and feature levels as inputs, but also fuse different types of decisional commands at the decision level in a loosely coupled way. Hence, the overall path of the robot is expected to approximate the shortest path, while safety and efficiency are ensured simultaneously. The recent state of the art borrows group decision making theory for sensor fusion or data fusion to determine the weight or impact of each decisional command from the consensus level and confidence level (Ji et al., 2020).
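As a concrete illustration of loosely coupled, decision-level fusion, the sketch below computes a confidence-weighted average of the velocity commands proposed by different modules. The module names and weights are hypothetical placeholders; how such weights are derived from consensus and confidence levels is exactly the group-decision-making question addressed in Ji et al. (2020).

```python
def fuse_commands(commands, weights):
    """Decision-level fusion: `commands` maps a module name to its proposed
    (linear, angular) velocity command, and `weights` maps the same names to
    non-negative confidence weights. Returns the weighted-average command."""
    total = sum(weights[name] for name in commands)
    v = sum(weights[name] * cmd[0] for name, cmd in commands.items()) / total
    w = sum(weights[name] * cmd[1] for name, cmd in commands.items()) / total
    return v, w

# Hypothetical example: a global A* follower and a local RL policy disagree;
# the fused command leans towards the module with the higher confidence.
commands = {"global_a_star": (0.8, 0.00),   # keep heading along the global path
            "local_rl":      (0.4, 0.35)}   # slow down and turn to avoid a pedestrian
weights  = {"global_a_star": 0.4, "local_rl": 0.6}
print(fuse_commands(commands, weights))     # -> (0.56, 0.21)
```

In a ROS-style system, each entry would come from a separate node publishing its proposed command, and the fused result would be the command actually sent to the velocity controller.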


Fig. 28 Processing steps for the motion planning task. The figure organizes the open questions by step: data collection (quantity: methods to collect sufficient data; quality: methods to collect high-quality data), data preprocessing (data fusion: input fusion strategies; data translation: feature interpretation), motion planning (traditional algorithms: search space reduction; supervised learning: combination with other algorithms; RL: speed and stability of convergence; hardware planning: self-reconfiguration into different shapes), and decision making (decision fusions at the algorithm and system levels; joint decision: collaborative operations of software and hardware; robustness and resilience)

Some pioneering ideas about how to better combine or integrate algorithms can also be seen in Zhang et al. (2019). Apart from hybrid algorithms or fused decisional commands, joint decision making may help to improve the performance of future motion planning systems. This is achieved via collaborative operations between software-based planning (algorithm-based planning) and hardware planning.

Finally, the robustness and resilience of the motion planning system should be considered if robots are expected to perform better in engineering and manufacturing for commercial use. Robustness refers to the ability to resist outside noise or attacks (e.g., cyber-attacks on ROS), while resilience refers to the ability of the robot to recover its function after being partially damaged (Wang et al., 2020). To conclude, Fig. 28 lists possible research directions, but attention to improving the performance of robotic motion planning should be paid to: (1) data fusion and translation of input features; (2) optimization of traditional planning algorithms to reduce the search space by combining them with supervised learning or RL; (3) optimization of network convergence for RL.


Fig. 29 Fusion of DQN or CNN with RRT
Conclusion

This paper carefully analyzes the principles of robotic motion planning algorithms in sections II–VI. These algorithms include traditional planning algorithms, classical ML, optimal value RL and policy gradient RL. Direct comparisons of these algorithms are made in section VII according to their principles. Hence, a clear understanding of the mechanisms of motion planning algorithms is provided. Analytical comparisons of these algorithms are also made in section VII according to the newly summarized criteria, which include local or global planning, path length, optimal velocity, reaction speed, safe distance, and time-sequential path. Hence, the general performance of these algorithms and their potential application domains are obtained. The convergence speed and stability of optimal value RL and policy gradient RL are specially compared in section VII because they are a recent research focus in robotic motion planning. Hence, a detailed and clear understanding of the network convergence of these algorithms is provided. Finally, a common motion planning task is analyzed: long-distance motion planning with safety and efficiency (e.g., long-distance luggage delivery by robots), according to its processing steps (data collection, data preprocessing, motion planning and decision making). Hence, potential research directions are obtained, and we hope they are useful to pave the way for further improvements of robotic motion planning algorithms and motion planning systems in academia, engineering, and manufacturing.

Author contributions Conceptualization: [CZ], [PF], [BH]; Methodology: [CZ]; Formal analysis and investigation: [CZ]; Writing—original draft preparation: [CZ]; Writing—review and editing: [PF], [BH]; Funding acquisition; Resources: [CZ], [BH], [PF]; Supervision: [PF], [BH].

Funding Open access funding provided by University of Eastern Finland (UEF) including Kuopio University Hospital. The authors did not receive support from any organization for the submitted work.

Declarations

Conflict of interests All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Arkin, R. C., Riseman, E. M., & Hansen, A. (1987). AuRA: An architecture for vision-based robot navigation. Proceedings of the DARPA Image Understanding Workshop, Los Angeles, CA, February 1987, pp. 417–413.
Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., & Kautz, J. (2016). Reinforcement learning through asynchronous advantage actor-critic on a GPU. arXiv:1611.06256 [cs.LG].
Bae, H., Kim, G., Kim, J., Qian, D., & Lee, S. (2019). Multi-robot path planning method using reinforcement learning. Applied Sciences, 9, 3057.
Bai, H., Cai, S., Ye, N., Hsu, D., & Lee, W. S. (2015). Intention-aware online POMDP planning for autonomous driving in a crowd. 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, pp. 454–460.
Bautista, G. D., Perez, J., Milanés, V., & Nashashibi, F. (2015). A review of motion planning techniques for automated vehicles. IEEE Transactions on Intelligent Transportation Systems, 17(4), 1–11.
Bautista, D. G., Perez, J., Lattarulo, R. A., Milanes, V., & Nashashibi, F. (2014). Continuous curvature planning with obstacle avoidance capabilities in urban scenarios. IEEE International Conference on Intelligent Transportation Systems, pp. 1430–1435.
Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. arXiv:1707.06887 [cs.LG].
Bilbeisi, K. M., & Kesse, M. (2017). Tesla: A successful entrepreneurship strategy. Morrow, GA: Clayton State University.
Borenstein, J., & Koren, Y. (1989). Real-time obstacle avoidance for fast mobile robots. IEEE Transactions on Systems, Man, and Cybernetics, 19(5), 1179–1187.
Borenstein, J., & Koren, Y. (1991). The vector field histogram: Fast obstacle avoidance for mobile robots. IEEE Transactions on Robotics and Automation, 7(3), 278–288.
Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogleman Soulie & J. Herault (Eds.), Neurocomputing: Algorithms, architectures and applications. Springer-Verlag.


Brooks, R. A. (1986). A robust layered control system for a mobile robot. IEEE Transactions on Robotics and Automation, 1(1), 1–10.
Cai, M., Lin, Y., Han, B., Liu, C., & Zhang, W. (2017). On a simple and efficient approach to probability distribution function aggregation. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(9), 2444–2453.
Chan, K. C., Lenard, C. T., & Mills, T. M. (2012). An introduction to Markov chains. In 49th Annual Conference of Mathematical Association of Victoria, Melbourne, pp. 40–47.
Chao, Y., Xiang, X., & Wang, C. (2020). Towards real-time path planning through deep reinforcement learning for UAV in dynamic environment. Journal of Intelligent and Robotic Systems, 98, 297–309.
Chen, Y., Liu, M., Everett, M., & How, J. P. (2016). Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. arXiv:1609.07845v2 [cs.MA].
Chen, C., Liu, Y., Kreiss, S., & Alahi, A. (2019). Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning. International Conference on Robotics and Automation (ICRA), pp. 6015–6022.
Chen, C., Hu, S., Nikdel, P., Mori, G., & Savva, M. (2020). Relational graph learning for crowd navigation. arXiv:1909.13165v3 [cs.RO].
Codevilla, F., Müller, M., López, A., Koltun, V., & Dosovitskiy, A. (2018). End-to-end driving via conditional imitation learning. IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 4693–4700.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to algorithms. The MIT Press.
Coulom, R. (2006). Efficient selectivity and backup operators in Monte-Carlo tree search. Computers and Games, 5th International Conference, CG 2006, Turin, Italy, May 29–31, 2006.
Daniel, K., Nash, A., Koenig, S., & Felner, A. (2014). Theta*: Any-angle path planning on grids. Journal of Artificial Intelligence Research, 39(1), 533–579.
Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1, 269–271.
Dos Santos Mignon, A., & De Azevedo Da Rocha, R. L. (2017). An adaptive implementation of ε-greedy in reinforcement learning. Procedia Computer Science, 109, 1146–1151.
Durrant-Whyte, H., & Henderson, T. C. (2008). Multi-sensor data fusion. In B. Siciliano & O. Khatib (Eds.), Springer handbook of robotics. Springer handbooks. Springer.
Everett, M., Chen, Y., & How, J. P. (2018). Motion planning among dynamic, decision-making robots with deep reinforcement learning. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, pp. 3052–3059.
Evgeniou, T., & Pontil, M. (1999). Support vector machines: Theory and applications. In Advanced Course on Artificial Intelligence, Springer, Berlin, Heidelberg, pp. 249–257.
Fan, H., Zhu, F., Liu, C., Zhang, L., Zhuang, L., Li, D., Zhu, W., Hu, J., Li, H., & Kong, Q. (2018). Baidu Apollo EM motion planner. arXiv:1807.08048.
Farouki, R. T., & Sakkalis, T. (1994). Pythagorean-hodograph space curves. Advances in Computational Mathematics, 2(1), 41–66.
Ferguson, D., Howard, T. M., & Likhachev, M. (2008). Motion planning in urban environments. Journal of Field Robotics, 25(11–12), 939–960.
Ferguson, D., & Stentz, A. (2006). Using interpolation to improve path planning: The field D* algorithm. Journal of Field Robotics, 23(2), 79–101.
Fiorini, P., & Shiller, Z. (1998). Motion planning in dynamic environments using velocity obstacles. International Journal of Robotics Research, 17(7), 760–772.
Fortunato, M., Azar, M. G., Piot, B., et al. (2017). Noisy networks for exploration. arXiv:1706.10295 [cs.LG].
Fox, D., Burgard, W., & Thrun, S. (1997). The dynamic window approach to collision avoidance. IEEE Robotics and Automation Magazine, 4(1), 23–33.
Funke, J., Theodosis, P., Hindiyeh, R., et al. (2012). Up to the limits: Autonomous Audi TTS. IEEE Intelligent Vehicles Symposium (IV), Alcala De Henares, 2012, 541–547.
G., Samuel, G. (2017). Google sibling Waymo launches fully autonomous ride-hailing service. The Guardian 7.
Gao, W., Hsu, D., Lee, W. S., Shen, S., & Subramanian, K. (2017). Intention-net: Integrating planning and deep learning for goal-directed autonomous navigation. arXiv:1710.05627 [cs.AI].
Gilhyun, R. (2018). Applying asynchronous deep classification networks and gaming reinforcement learning-based motion planners to mobile robots. IEEE Robotics and Automation Society, pp. 6268–6275.
Guy, S. J., Chhugani, J., Kim, C., Satish, N., Lin, M., Manocha, D., & Dubey, P. (2009). ClearPath: Highly parallel collision avoidance for multi-agent simulation. Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 177–187.
Hart, P. E., Nilsson, N. J., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2), 100–107.
Hasselt, H. V., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 770–778.
Hessel, M., Modayil, J., Van, H. H., et al. (2017). Rainbow: Combining improvements in deep reinforcement learning. arXiv:1710.02298 [cs.AI].
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Indrajaya, M. A., Affandi, A., & Pratomo, I. (2015). Design of geographic information system for tracking and routing using Dijkstra algorithm for public transportation. In 2015 1st International Conference on Wireless and Telematics (ICWT), pp. 1–4.
Inoue, M., Yamashita, T., & Nishida, T. (2019). Robot path planning by LSTM network under changing environment. In S. Bhatia, S. Tiwari, K. Mishra, & M. Trivedi (Eds.), Advances in computer communication and computational sciences. Springer.
Isele, D., Cosgun, A., Subramanian, K., & Fujimura, K. (2017). Navigating occluded intersections with autonomous vehicles using deep reinforcement learning. arXiv:1705.01196 [cs.AI].
Jeon, J. H., Cowlagi, R. V., Peter, S. C., et al. (2013). Optimal motion planning with the half-car dynamical model for autonomous high-speed driving. American Control Conference (ACC), pp. 188–193.
Ji, C., Lu, X., & Zhang, W. (2020). A biobjective optimization model for expert opinions aggregation and its application in group decision making. IEEE Systems Journal, 15(2), 2834–2844.
Jorgensen, T., & Tama, A. (2019). Harnessing reinforcement learning for neural motion planning. arXiv:1906.00214 [cs.RO].
Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. ICML, 2, 267–274.
Kalos, M. H., & Whitlock, P. A. (2008). Monte Carlo methods. Wiley-VCH. ISBN 978-3-527-40760-6.
Kavraki, L. E., Svestka, P., Latombe, J. C., & Overmars, M. H. (2002). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4), 566–580.


Khatib, O. (1986). Real-time obstacle avoidance for manipulators and mobile robots. International Journal of Robotics Research, 5(1), 90–98.
Kim, B., Kaelbling, L. P., & Lozano-Perez, T. (2019). Adversarial actor-critic method for task and motion planning problems using planning experience. AAAI Conference on Artificial Intelligence (AAAI), 33(01), 8017–8024.
Kinnunen, T., Sidoroff, I., Tuononen, M., & Fränti, P. (2011). Comparison of clustering methods: A case study of text-independent speaker modeling. Pattern Recognition Letters, 32(13), 1604–1617.
Konda, V. R., & Tsitsiklis, J. N. (2001). Actor-critic algorithms. Society for Industrial and Applied Mathematics, vol. 42.
Krogh, B. (1984). A generalized potential field approach to obstacle avoidance control. Proceedings of SME Conference on Robotics Research: The Next Five Years and Beyond, Bethlehem, PA.
LaValle, S. M., & Kuffner, J. J. (1999). Randomized kinodynamic planning. The International Journal of Robotics Research, 20(5), 378–400.
Lecun, Y., Bottou, L., Bengio, Y., & Patrick, H. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Lei, X., Zhang, Z., & Dong, P. (2018). Dynamic path planning of unknown environment based on deep reinforcement learning. Journal of Robotics, 2018, 1–10.
Likhachev, M., Ferguson, D., Gordon, G., Stentz, A., & Thrun, S. (2008). Anytime search in dynamic graphs. Artificial Intelligence, 172(14), 1613–1643.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2019). Continuous control with deep reinforcement learning. arXiv:1509.02971 [cs.LG].
Long, P., Fan, T., Liao, X., Liu, W., Zhang, H., & Pan, J. (2018). Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. arXiv:1709.10082v3 [cs.RO].
Mariescu, I. R., & Fränti, P. (2018). Cell Net: Inferring road networks from GPS trajectories. ACM Transactions on Spatial Algorithms and Systems, 4(3), 1–22.
Masoud, A. A. (2007). Decentralized self-organizing potential field-based control for individually motivated mobile agents in a cluttered environment: A vector-harmonic potential field approach. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 37(3), 372–390.
Meyes, R., Tercan, H., Roggendorf, S., Thomas, T., Christian, B., Markus, O., Christian, B., Sabina, J., & Tobias, M. (2017). Motion planning for industrial robots using reinforcement learning. In 50th CIRP Conference on Manufacturing Systems, 63, 107–112.
Meystel, A. (1990). Knowledge based nested hierarchical control. Advances in Automation and Robotics, 2, 63–152.
Minguez, J., Lamiraux, F., & Laumond, J. P. (2008). Motion planning and obstacle avoidance. Springer International Publishing.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). Playing Atari with deep reinforcement learning. arXiv:1312.5602 [cs.LG].
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. arXiv:1602.01783 [cs.LG].
Montemerlo, M., Becker, J., Bhat, S., et al. (2008). Junior: The Stanford entry in the urban challenge. Journal of Field Robotics, 25(9), 569–597.
Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. arXiv:1606.02647 [cs.LG].
Murphy, R. R. (2000). Introduction to AI robotics. MIT Press.
Oh, J., Guo, Y., Singh, S., & Lee, H. (2018). Self-imitation learning. arXiv:1806.05635v1 [cs.LG].
Panov, A. I., Yakovlev, K. S., & Suvorov, R. (2018). Grid path planning with deep reinforcement learning: Preliminary results. Procedia Computer Science, 123, 347–353.
Paxton, C., Raman, V., Hager, G. D., & Kobilarov, M. (2017). Combining neural networks and tree search for task and motion planning in challenging environments. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, pp. 6059–6066.
Qureshi, A. H., Simeonov, A., Bency, M. J., & Yip, M. C. (2018). Motion planning networks. arXiv:1806.05767 [cs.RO].
Reeds, J. A., & Shepp, L. A. (1990). Optimal paths for a car that goes both forward and backward. Pacific Journal of Mathematics, 145(2), 367–393.
Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. University of Cambridge.
Salemi, B., Moll, M., & Shen, W. M. (2006). SUPERBOT: A deployable multi-functional and modular self-reconfigurable robotic system. Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3636–3641.
Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. International Conference on Learning Representations (ICLR).
Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2017). Trust region policy optimization. arXiv:1502.05477v5 [cs.LG].
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347v2 [cs.LG].
Silver, D., Huang, A., Maddison, C. J., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, vol. 32, pp. 387–395.
Smart, W. D., & Kaelbling, L. P. (2002). Effective reinforcement learning for mobile robots. IEEE International Conference on Robotics and Automation, vol. 4.
Stentz, A. (1994). Optimal and efficient path planning for partially-known environments. Robotics and Automation, pp. 203–220.
Sui, Z., Pu, Z., Yi, J., & Tian, X. (2018). Path planning of multiagent constrained formation through deep reinforcement learning. 2018 International Joint Conference on Neural Networks (IJCNN).
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
Sutton, R., McAllester, D. A., Singh, S., & Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, MIT Press, Cambridge, MA, USA, pp. 1057–1063.
Sutton, R. S., Maei, H. R., Precup, D., et al. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In 26th International Conference on Machine Learning, Montreal, Canada.
Tobaruela, J. A. (2012). Reactive and path-planning methods for mobile robot navigation. PhD dissertation, Universitat de les Illes Balears. http://hdl.handle.net/11201/151880.
Tsitsiklis, J. N. (2003). On the convergence of optimistic policy iteration. Journal of Machine Learning Research, 3(1), 59–72.
Van den Berg, J., Lin, M., & Manocha, D. (2008). Reciprocal velocity obstacles for real-time multi-agent navigation. IEEE International Conference on Robotics and Automation, 2008, 1928–1935.
Van den Berg, J., Guy, S. J., Lin, M., & Manocha, D. (2011). Reciprocal n-body collision avoidance. Robotics Research, 2011, pp. 3–19.
Wang, L. (2005). On the Euclidean distance of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1334–1339.


Wang, F., Qian, Z., Yan, Z., Yuan, C., & Zhang, W. (2020). A novel resilient robot: Kinematic analysis and experimentation. IEEE Access, 8, 2885–2892.
Wang, Z., Freitas, N., & Lanctot, M. (2015). Dueling network architectures for deep reinforcement learning. arXiv:1511.06581 [cs.LG].
Weston, J., & Watkins, C. (1998). Multi-class support vector machines. Technical report, Department of Computer Science, Royal Holloway, University of London, May 20.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
Xu, W., Wei, J., Dolan, J. M., Zhao, H., & Zha, H. (2012). A real-time motion planner with trajectory optimization for autonomous vehicles. IEEE International Conference on Robotics and Automation, pp. 2061–2067.
Zhang, W. J., Wang, J. W., & Lin, Y. (2019). Integrated design and operation management for enterprise systems. Enterprise Information Systems, 13(4), 424–429.
Zhang, J. (2019). Gradient descent based optimization algorithms for deep learning models training. arXiv:1903.03614v1 [cs.LG].
Ziegler, J., & Stiller, C. (2009). Spatiotemporal state lattices for fast trajectory planning in dynamic on-road driving scenarios. IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1879–1884.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
