Robust Reinforcement Learning in Motion Planning
Satinder P. Singh*
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA 02139
[email protected]
Abstract
"'This work was done while the first author was finishing his Ph.D in computer science
at the University of Massachusetts, Amherst.
An agent using reinforcement learning (RL) (Sutton et al., 1991; Barto et al., to appear)
to approximate solutions to optimal control problems has to search, or explore,
to improve its policy for selecting actions. Although exploration does not directly
affect performance (Moore & Atkeson, 1993) in off-line learning with a model of
the environment, exploration in on-line learning can lead the agent to perform
worse than is acceptable. In some cases, exploration might have unsafe, or even
catastrophic, results, often modeled in terms of reaching 'failure' states of the agent's
environment. To make on-line RL more practical, especially if it involves expensive
hardware, task-specific minimal levels of performance should be ensured during
learning, a topic not addressed by prior RL research.
Although the need for exploration cannot be entirely removed, domain knowledge
can sometimes be used to define the set of actions from which the RL agent composes
a control policy so that exploration is conducted in a space that excludes most of
the unacceptable policies. We illustrate this approach using a simulated dynamic
mobile robot in two different environments.
The robustness and greatly accelerated learning resulting from the above factors
can more than offset the cost of designing the actions. However, care has to be
taken in defining the action space to ensure that the resulting policy space contains
at least one policy that is close to optimal.
First consider geometric path planning, i.e., the problem of finding safe paths for a
robot with no dynamical constraints in an environment with stationary obstacles.
A safe path in our context is one that avoids all obstacles and terminates in a
desired configuration. Connolly (1992) has developed a method that generates safe
paths by solving Laplace's equation in configuration space with boundary conditions
determined by obstacle and goal configurations (see also Connolly & Grupen, 1993).
Laplace's equation is the partial differential equation
\[
\nabla^2 \psi \;=\; \sum_{i=1}^{n} \frac{\partial^2 \psi}{\partial x_i^2} \;=\; 0, \tag{1}
\]
whose solution is a harmonic function, $\psi$, with no interior local minima. In practice,
a finite difference approximation to Equation 1 is solved numerically via Gauss-Seidel
relaxation on a mesh over configuration space. Safe paths are generated by gradient
descent on the resulting approximate harmonic function. In the general motion
planning problem, we are interested in finding control policies that not only keep
the robot safe but also extremize some performance criterion, e.g., minimum time,
minimum jerk, etc.
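As a concrete illustration of this procedure, the sketch below (not the authors' implementation) relaxes a finite-difference approximation to Equation 1 by Gauss-Seidel iteration on a 2-D occupancy grid. The boundary values (1 on obstacle cells, 0 on goal cells), grid resolution, and convergence tolerance are assumptions chosen for the example.

```python
import numpy as np

def harmonic_function(obstacles, goal, tol=1e-6, max_sweeps=10_000):
    """Gauss-Seidel relaxation of Laplace's equation on a 2-D grid.

    obstacles, goal: boolean arrays marking boundary cells.  Obstacle
    cells are clamped to 1, goal cells to 0 (an assumed Dirichlet-style
    convention); every free interior cell is repeatedly replaced by the
    average of its four neighbours until the largest change per sweep
    falls below tol.
    """
    phi = np.ones_like(obstacles, dtype=float)
    phi[goal] = 0.0
    free = ~(obstacles | goal)
    for _ in range(max_sweeps):
        delta = 0.0
        for i in range(1, phi.shape[0] - 1):
            for j in range(1, phi.shape[1] - 1):
                if free[i, j]:
                    new = 0.25 * (phi[i - 1, j] + phi[i + 1, j]
                                  + phi[i, j - 1] + phi[i, j + 1])
                    delta = max(delta, abs(new - phi[i, j]))
                    phi[i, j] = new   # in-place update: Gauss-Seidel
        if delta < tol:
            break
    return phi
```

A safe path can then be traced from any free cell by repeatedly stepping in the direction of steepest descent of the returned $\phi$ until the goal region is reached.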
Two types of boundary conditions, called Dirichlet and Neumann boundary conditions, give rise to two different harmonic functions, $\Phi_D$ and $\Phi_N$, that generate different types of safe paths. Robots following paths generated from $\Phi_D$ tend to be repelled perpendicularly from obstacles. In contrast, robots following paths generated from $\Phi_N$ tend to skirt obstacles by moving parallel to their boundaries. Although the state space in the motion planning problem for a dynamic robot in a planar environment is $\{x, \dot{x}, y, \dot{y}\}$, harmonic functions are derived in two-dimensional position space. These functions are inexpensive to compute (relative to the expense of solving the optimal control problem) because they are independent of the robot dynamics and criterion of optimal control. The closed-loop control policies that follow the gradient of the Dirichlet or Neumann solutions, respectively denoted $\pi_D$ and $\pi_N$, are defined approximately as follows: $\pi_D(s) = \nabla\Phi_D(\bar{s})$ and $\pi_N(s) = \nabla\Phi_N(\bar{s})$, where $\bar{s}$ is the projection of the state $s$ onto position space.¹
Instead of formulating the motion planning problem as an RL task in which a control policy maps states into primitive control actions, consider the formulation in which a policy maps each state $s$ to a mixing parameter $k(s)$, so that the actual action is $[1 - k(s)]\,\pi_D(s) + k(s)\,\pi_N(s)$, where $0 \le k(s) \le 1$. Figure 1B shows the structure of this kind of policy. In Appendix B, we present conditions guaranteeing that for a robot with no dynamical constraints, this policy space contains only acceptable policies, i.e., policies that cause the robot to reach the goal configuration without contacting an obstacle. Although this guarantee does not strictly hold when the robot has dynamical constraints, e.g., when there are bounds on acceleration, this formulation still reduces the risk of unacceptable behavior.
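As a sketch of how such a mixed action might be computed on the gridded harmonic functions (an illustration only; the central-difference gradient, the array indexing, and the state layout $(x, \dot{x}, y, \dot{y})$ are assumptions), consider:

```python
import numpy as np

def descent_direction(phi, x, y):
    """Central-difference estimate of the direction of steepest descent
    of a gridded harmonic function at integer cell (x, y)."""
    gx = -(phi[x + 1, y] - phi[x - 1, y]) / 2.0
    gy = -(phi[x, y + 1] - phi[x, y - 1]) / 2.0
    return np.array([gx, gy])

def mixed_action(phi_D, phi_N, k, state):
    """pi(s) = [1 - k(s)] * pi_D(s) + k(s) * pi_N(s), with 0 <= k(s) <= 1.

    Only the position components of the state are used, mirroring the
    projection of s onto position space."""
    x, y = int(state[0]), int(state[2])       # assumed state = (x, xdot, y, ydot)
    kk = float(np.clip(k(state), 0.0, 1.0))
    return ((1.0 - kk) * descent_direction(phi_D, x, y)
            + kk * descent_direction(phi_N, x, y))
```

In practice the resulting direction would serve as the reference signal to the PD controller mentioned in the footnote, rather than as a raw command.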
3 Simulation Results
In this paper we present a brief summary of simulation results for the two envi-
ronments shown in Figures 2A and 3A. See Singh (1993) for details. The first
¹ In practice, the gradients of the harmonic functions act as reference signals to a PD controller.
Figure 1: A) The two-layer network architecture used to store the Q-values, with inputs $x, \dot{x}, y, \dot{y}$, and $k$. B) The structure of the mixed policy: the state $s$ is mapped to a mixing coefficient $k(s)$ that combines the Dirichlet policy (weight $1 - k(s)$) and the Neumann policy (weight $k(s)$).
The learning task is to approximate minimum time paths from every point inside
the environment to the goal region without contacting the boundary wall. A rein-
forcement learning algorithm called Q-learning (Watkins, 1989) (see Appendix A)
was used to learn the mixing function, $k$. Figure 1A shows the 2-layer neural net-
work architecture used to store the Q-values. The robot was trained in a series
of trials, each trial starting with the robot placed at a randomly chosen state and
ending when the robot enters the goal region. The points marked by stars in Fig-
ures 2A and 3A were the starting locations for which statistics were collected to
produce learning curves.
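A minimal rendering of this trial structure is sketched below. The discrete set of candidate $k$ values, the $\varepsilon$-greedy exploration, the cost of one per time step (a standard choice for minimum-time formulations), and the `env`/`q` interfaces are assumptions introduced for the example, not details taken from the paper.

```python
import random

def run_trials(env, q, k_values, n_trials, epsilon=0.1):
    """Episodic training: each trial starts at a randomly chosen state and
    ends when the robot enters the goal region.  The action is the mixing
    coefficient k, picked from a small discrete set of candidate values."""
    for _ in range(n_trials):
        s = env.random_start_state()
        while not env.in_goal(s):
            if random.random() < epsilon:                     # explore
                k = random.choice(k_values)
            else:                                             # exploit
                k = max(k_values, key=lambda a: q.value(s, a))
            s_next = env.step(s, k)       # apply the mixed harmonic-gradient action
            q.update(s, k, -1.0, s_next)  # payoff of -1 per step => minimum time
            s = s_next
```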
Figures 2B, 2C, 3A and 3B show three robot trajectories from two randomly chosen
start states; the black-filled circles mark the Dirichlet trajectory (labeled D), the
white-filled circles mark the Neumann trajectory (labeled N), and the grey-filled
circles mark the trajectory after learning (labeled Q). Trajectories are shown by
taking snapshots of the robot at every time step; the velocity of the robot can
be judged by the spacing between successive circles on the trajectory. Figure 2D
shows the mixing function for zero-velocity states in the two-room environment,
while Figure 3C shows the mixing function for zero velocity states in the horseshoe
environment. The darker the region, the higher the proportion of the Neumann
policy in the mixture. In the two-room environment, the agent learns to follow
the Neumann policy in the left-hand room and to follow the Dirichlet policy in the
right-hand room.
Figure 2E shows the average time to reach the goal region as a function of the
number of trials in the two-room environment. The solid-line curve shows the
performance of the Q-learning algorithm. The horizontal lines show the average
time to reach the goal region for the designated pure policies. Figure 3D similarly
presents the results for the horseshoe environment. Note that in both cases the
RL agent learns a policy that is better than either the pure Dirichlet or the pure
Neumann policies. The relative advantage of the learned policy is greater in the
two-room environment than in the horseshoe environment.
On the two-room environment we also compared Q-learning using harmonic functions, as described above, with Q-learning using primitive accelerations of the mobile
robot as actions. The results are summarized along three dimensions: a) speed of
learning: the latter system took more than 20,000 trials to achieve the same level
of performance achieved by the former in 100 trials, b) safety: the simulated robot
using the latter system crashed into the walls more than 200 times, and c) asymp-
totic performance: the final solution found by the latter system was 6% better than
the one found by the former.
4 Conclusion
A Q-learning

After each action, the estimated Q-value of the state-action pair is updated according to

\[
Q(s_t, a) \leftarrow (1 - \alpha)\, Q(s_t, a) + \alpha \left[ R(s_t, a) + \gamma \max_{a'} Q(s_{t+1}, a') \right], \tag{2}
\]

where $R(s_t, a)$ is the payoff, $0 \le \gamma \le 1$ is the discount factor, and $\alpha$ is a learning rate parameter. See Watkins (1989) for further details.
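A tabular version of this update rule (a sketch only; the paper actually stores the Q-values in a two-layer network, and the dictionary representation and default parameter values here are assumptions) might look like:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a')).

    Q is assumed to be a dict keyed by (state, action) pairs, with every
    pair already initialised (e.g. to 0.0)."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
```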
Figure 2: A) The two-room environment; the stars mark the start states for which statistics were collected, and the goal region is labeled GOAL. E) Average time to reach the goal as a function of the number of trials for Q-learning, the pure Neumann policy, and the pure Dirichlet policy.

Figure 3: A) The horseshoe environment; the stars mark the start states for which statistics were collected. D) Average time to reach the goal as a function of the number of trials for Q-learning, the pure Neumann policy, and the pure Dirichlet policy.
B Safety
Let $L$ denote the surface whose gradients at any point are given by the closed-loop policy under consideration. Then for there to be no minima in $L$, the gradient of $L$ should not vanish in the workspace, i.e., $(1 - k(s))\nabla\Phi_D(s) + k(s)\nabla\Phi_N(s) \neq 0$. The only way it can vanish is if, for all $i$,

\[
\frac{k(s)}{1 - k(s)} \;=\; -\,\frac{[\nabla\Phi_D(s)]_i}{[\nabla\Phi_N(s)]_i}, \tag{3}
\]

where $[\cdot]_i$ is the $i$th component of vector $[\cdot]$. The algorithm can explicitly check for that possibility and prevent it. Alternatively, note that due to the finite precision in any practical implementation, it is extremely unlikely that Equation 3 will hold in any state. Also note that $\pi(s)$ for any point $s$ on the boundary will always point away from the boundary because it is the convex sum of two vectors, one of which is normal to the boundary and the other of which is parallel to the boundary.
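The explicit check mentioned above can be performed directly on the two gradient vectors at a state; the sketch below (the tolerance and array representation are assumptions) tests whether a particular value of $k(s)$ would make the mixed gradient vanish:

```python
import numpy as np

def mixed_gradient_vanishes(grad_D, grad_N, k, eps=1e-12):
    """True if (1 - k) * grad_D + k * grad_N is (numerically) the zero vector,
    i.e. if k / (1 - k) = -[grad_D]_i / [grad_N]_i holds for every component i."""
    mixed = ((1.0 - k) * np.asarray(grad_D, dtype=float)
             + k * np.asarray(grad_N, dtype=float))
    return bool(np.all(np.abs(mixed) < eps))
```

If the test ever returned true, the learner could, for example, perturb $k(s)$ slightly before issuing the action.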
Acknowledgements
This work was supported by a grant ECS-9214866 to Prof. A. G. Barto from the
National Science Foundation, and by grants IRI-9116297, IRI-9208920, and CDA-
8922572 to Prof. R. Grupen from the National Science Foundation.
References
Barto, A.G., Bradtke, S.J., & Singh, S.P. (to appear). Learning to act using real-
time dynamic programming. Artificial Intelligence.
Connolly, C. (1992). Applications of harmonic functions to robotics. In The 1992
International Symposium on Intelligent Control. IEEE.
Connolly, C. & Grupen, R. (1993). On the applications of harmonic functions to
robotics. Journal of Robotic Systems, 10(7), 931-946.
Mahadevan, S. & Connell, J. (1990). Automatic programming of behavior-based
robots using reinforcement learning. Technical report, IBM Research Division,
T. J. Watson Research Center, Yorktown Heights, NY.
Moore, A.W. & Atkeson, C.G. (1993). Prioritized sweeping: Reinforcement learning
with less data and less real time. Machine Learning, 13(1).
Singh, S.P. (1992). Transfer of learning by composing solutions for elemental se-
quential tasks. Machine Learning, 8(3/4), 323-339.
Singh, S.P. (1993). Learning to Solve Markovian Decision Processes. PhD thesis,
Department of Computer Science, University of Massachusetts. Also CMPSCI
Technical Report 93-77.
Sutton, R.S., Barto, A.G., & Williams, R.J. (1991). Reinforcement learning is direct
adaptive optimal control. In Proceedings of the American Control Conference,
pages 2143-2146, Boston, MA.
Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge
Univ., Cambridge, England.