Robust Reinforcement Learning in Motion Planning
Satinder P. Singh*
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA 02139
[email protected]
Abstract
"'This work was done while the first author was finishing his Ph.D in computer science
at the University of Massachusetts, Amherst.
An agent using reinforcement learning (RL) (Sutton et al., 1991; Barto et al., to appear)
to approximate solutions to optimal control problems has to search, or explore,
to improve its policy for selecting actions. Although exploration does not directly
affect performance (Moore & Atkeson, 1993) in off-line learning with a model of
the environment, exploration in on-line learning can lead the agent to perform
worse than is acceptable. In some cases, exploration might have unsafe, or even
catastrophic, results, often modeled in terms of reaching 'failure' states of the agent's
environment. To make on-line RL more practical, especially if it involves expensive
hardware, task-specific minimal levels of performance should be ensured during
learning, a topic not addressed by prior RL research.
Although the need for exploration cannot be entirely removed, domain knowledge
can sometimes be used to define the set of actions from which the RL agent composes
a control policy so that exploration is conducted in a space that excludes most of
the unacceptable policies. We illustrate this approach using a simulated dynamic
mobile robot in two different environments.
The robustness and greatly accelerated learning resulting from the above factors
can more than offset the cost of designing the actions. However, care has to be
taken in defining the action space to ensure that the resulting policy space contains
at least one policy that is close to optimal.
First consider geometric path planning, i.e., the problem of finding safe paths for a
robot with no dynamical constraints in an environment with stationary obstacles.
A safe path in our context is one that avoids all obstacles and terminates in a
desired configuration. Connolly (1992) has developed a method that generates safe
paths by solving Laplace's equation in configuration space with boundary conditions
determined by obstacle and goal configurations (see also Connolly & Grupen, 1993).
Laplace's equation is the partial differential equation
\[
\nabla^2 \psi \;=\; \sum_{i=1}^{n} \frac{\partial^2 \psi}{\partial x_i^2} \;=\; 0, \tag{1}
\]
whose solution is a harmonic function, $\psi$, with no interior local minima. In practice,
a finite difference approximation to Equation 1 is solved numerically via Gauss-Seidel
relaxation on a mesh over configuration space. Safe paths are generated by gradient
descent on the resulting approximate harmonic function. In the general motion
planning problem, we are interested in finding control policies that not only keep
the robot safe but also extremize some performance criterion, e.g., minimum time,
minimum jerk, etc.
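As a concrete illustration of this procedure, the sketch below (not the authors' implementation) relaxes a finite-difference approximation to Equation 1 by Gauss-Seidel iteration on a 2-D occupancy grid. The boundary values (1 on obstacle cells, 0 on goal cells), grid resolution, and convergence tolerance are assumptions chosen for the example.

```python
import numpy as np

def harmonic_function(obstacles, goal, tol=1e-6, max_sweeps=10_000):
    """Gauss-Seidel relaxation of Laplace's equation on a 2-D grid.

    obstacles, goal: boolean arrays marking boundary cells.  Obstacle
    cells are clamped to 1, goal cells to 0 (an assumed Dirichlet-style
    convention); every free interior cell is repeatedly replaced by the
    average of its four neighbours until the largest change per sweep
    falls below tol.
    """
    phi = np.ones_like(obstacles, dtype=float)
    phi[goal] = 0.0
    free = ~(obstacles | goal)
    for _ in range(max_sweeps):
        delta = 0.0
        for i in range(1, phi.shape[0] - 1):
            for j in range(1, phi.shape[1] - 1):
                if free[i, j]:
                    new = 0.25 * (phi[i - 1, j] + phi[i + 1, j]
                                  + phi[i, j - 1] + phi[i, j + 1])
                    delta = max(delta, abs(new - phi[i, j]))
                    phi[i, j] = new   # in-place update: Gauss-Seidel
        if delta < tol:
            break
    return phi
```

A safe path can then be traced from any free cell by repeatedly stepping in the direction of steepest descent of the returned $\phi$ until the goal region is reached.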
Two types of boundary conditions, called Dirichlet and Neumann boundary conditions, give rise to two different harmonic functions, $\Phi_D$ and $\Phi_N$, that generate different types of safe paths. Robots following paths generated from $\Phi_D$ tend to be repelled perpendicularly from obstacles. In contrast, robots following paths generated from $\Phi_N$ tend to skirt obstacles by moving parallel to their boundaries. Although the state space in the motion planning problem for a dynamic robot in a planar environment is $\{x, \dot{x}, y, \dot{y}\}$, harmonic functions are derived in two-dimensional position space. These functions are inexpensive to compute (relative to the expense of solving the optimal control problem) because they are independent of the robot dynamics and criterion of optimal control. The closed-loop control policies that follow the gradient of the Dirichlet or Neumann solutions, respectively denoted $\pi_D$ and $\pi_N$, are defined approximately as follows: $\pi_D(s) = \nabla\Phi_D(\bar{s})$ and $\pi_N(s) = \nabla\Phi_N(\bar{s})$, where $\bar{s}$ is the projection of the state $s$ onto position space.¹
Instead of formulating the motion planning problem as an RL task in which a control policy maps states into primitive control actions, consider the formulation in which a policy maps each state $s$ to a mixing parameter $k(s)$, so that the actual action is $[1 - k(s)]\,\pi_D(s) + k(s)\,\pi_N(s)$, where $0 \le k(s) \le 1$. Figure 1B shows the structure of this kind of policy. In Appendix B, we present conditions guaranteeing that for a robot with no dynamical constraints, this policy space contains only acceptable policies, i.e., policies that cause the robot to reach the goal configuration without contacting an obstacle. Although this guarantee does not strictly hold when the robot has dynamical constraints, e.g., when there are bounds on acceleration, this formulation still reduces the risk of unacceptable behavior.
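As a sketch of how such a mixed action might be computed on the gridded harmonic functions (an illustration only; the central-difference gradient, the array indexing, and the state layout $(x, \dot{x}, y, \dot{y})$ are assumptions), consider:

```python
import numpy as np

def descent_direction(phi, x, y):
    """Central-difference estimate of the direction of steepest descent
    of a gridded harmonic function at integer cell (x, y)."""
    gx = -(phi[x + 1, y] - phi[x - 1, y]) / 2.0
    gy = -(phi[x, y + 1] - phi[x, y - 1]) / 2.0
    return np.array([gx, gy])

def mixed_action(phi_D, phi_N, k, state):
    """pi(s) = [1 - k(s)] * pi_D(s) + k(s) * pi_N(s), with 0 <= k(s) <= 1.

    Only the position components of the state are used, mirroring the
    projection of s onto position space."""
    x, y = int(state[0]), int(state[2])       # assumed state = (x, xdot, y, ydot)
    kk = float(np.clip(k(state), 0.0, 1.0))
    return ((1.0 - kk) * descent_direction(phi_D, x, y)
            + kk * descent_direction(phi_N, x, y))
```

In practice the resulting direction would serve as the reference signal to the PD controller mentioned in the footnote, rather than as a raw command.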
3 Simulation Results
In this paper we present a brief summary of simulation results for the two envi-
ronments shown in Figures 2A and 3A. See Singh (1993) for details. The first
¹ In practice, the gradients of the harmonic functions act as reference signals to a PD controller.
Figure 1: A) The two-layer network architecture used to store the Q-values, with inputs $x, \dot{x}, y, \dot{y}$, and $k$. B) The structure of the mixed policy: the state $s$ is mapped to a mixing coefficient $k(s)$ that combines the Dirichlet policy (weight $1 - k(s)$) and the Neumann policy (weight $k(s)$).
The learning task is to approximate minimum time paths from every point inside
the environment to the goal region without contacting the boundary wall. A rein-
forcement learning algorithm called Q-learning (Watkins, 1989) (see Appendix A)
was used to learn the mixing function, $k$. Figure 1A shows the 2-layer neural net-
work architecture used to store the Q-values. The robot was trained in a series
of trials, each trial starting with the robot placed at a randomly chosen state and
ending when the robot enters the goal region. The points marked by stars in Fig-
ures 2A and 3A were the starting locations for which statistics were collected to
produce learning curves.
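A minimal rendering of this trial structure is sketched below. The discrete set of candidate $k$ values, the $\varepsilon$-greedy exploration, the cost of one per time step (a standard choice for minimum-time formulations), and the `env`/`q` interfaces are assumptions introduced for the example, not details taken from the paper.

```python
import random

def run_trials(env, q, k_values, n_trials, epsilon=0.1):
    """Episodic training: each trial starts at a randomly chosen state and
    ends when the robot enters the goal region.  The action is the mixing
    coefficient k, picked from a small discrete set of candidate values."""
    for _ in range(n_trials):
        s = env.random_start_state()
        while not env.in_goal(s):
            if random.random() < epsilon:                     # explore
                k = random.choice(k_values)
            else:                                             # exploit
                k = max(k_values, key=lambda a: q.value(s, a))
            s_next = env.step(s, k)       # apply the mixed harmonic-gradient action
            q.update(s, k, -1.0, s_next)  # payoff of -1 per step => minimum time
            s = s_next
```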
Figures 2B, 2C, 3A and 3B show three robot trajectories from two randomly chosen
start states; the black-filled circles mark the Dirichlet trajectory (labeled D), the
white-filled circles mark the Neumann trajectory (labeled N), and the grey-filled
circles mark the trajectory after learning (labeled Q). Trajectories are shown by
taking snapshots of the robot at every time step; the velocity of the robot can
be judged by the spacing between successive circles on the trajectory. Figure 2D
shows the mixing function for zero-velocity states in the two-room environment,
while Figure 3C shows the mixing function for zero velocity states in the horseshoe
environment. The darker the region, the higher the proportion of the Neumann
policy in the mixture. In the two-room environment, the agent learns to follow
the Neumann policy in the left-hand room and to follow the Dirichlet policy in the
right-hand room.
Figure 2E shows the average time to reach the goal region as a function of the
number of trials in the two-room environment. The solid-line curve shows the
performance of the Q-learning algorithm. The horizontal lines show the average
time to reach the goal region for the designated pure policies. Figure 3D similarly
presents the results for the horseshoe environment. Note that in both cases the
RL agent learns a policy that is better than either the pure Dirichlet or the pure
Neumann policies. The relative advantage of the learned policy is greater in the
two-room environment than in the horseshoe environment.
On the two-room environment we also compared Q-learning using harmonic functions, as described above, with Q-learning using primitive accelerations of the mobile
robot as actions. The results are summarized along three dimensions: a) speed of
learning: the latter system took more than 20,000 trials to achieve the same level
of performance achieved by the former in 100 trials, b) safety: the simulated robot
using the latter system crashed into the walls more than 200 times, and c) asymp-
totic performance: the final solution found by the latter system was 6% better than
the one found by the former.
4 Conclusion
A Q-learning

After each action, the estimated Q-value of the state-action pair is updated according to

\[
Q(s_t, a) \leftarrow (1 - \alpha)\, Q(s_t, a) + \alpha \left[ R(s_t, a) + \gamma \max_{a'} Q(s_{t+1}, a') \right], \tag{2}
\]

where $R(s_t, a)$ is the payoff, $0 \le \gamma \le 1$ is the discount factor, and $\alpha$ is a learning rate parameter. See Watkins (1989) for further details.
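A tabular version of this update rule (a sketch only; the paper actually stores the Q-values in a two-layer network, and the dictionary representation and default parameter values here are assumptions) might look like:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a')).

    Q is assumed to be a dict keyed by (state, action) pairs, with every
    pair already initialised (e.g. to 0.0)."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
```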
Figure 2: A) The two-room environment; the stars mark the start states for which statistics were collected, and the goal region is labeled GOAL. E) Average time to reach the goal as a function of the number of trials for Q-learning, the pure Neumann policy, and the pure Dirichlet policy.

Figure 3: A) The horseshoe environment; the stars mark the start states for which statistics were collected. D) Average time to reach the goal as a function of the number of trials for Q-learning, the pure Neumann policy, and the pure Dirichlet policy.
B Safety
Let $L$ denote the surface whose gradients at any point are given by the closed-loop policy under consideration. Then for there to be no minima in $L$, the gradient of $L$ should not vanish in the workspace, i.e., $(1 - k(s))\nabla\Phi_D(s) + k(s)\nabla\Phi_N(s) \neq 0$. The only way it can vanish is if, for all $i$,

\[
\frac{k(s)}{1 - k(s)} \;=\; -\,\frac{[\nabla\Phi_D(s)]_i}{[\nabla\Phi_N(s)]_i}, \tag{3}
\]

where $[\cdot]_i$ is the $i$th component of vector $[\cdot]$. The algorithm can explicitly check for that possibility and prevent it. Alternatively, note that due to the finite precision in any practical implementation, it is extremely unlikely that Equation 3 will hold in any state. Also note that $\pi(s)$ for any point $s$ on the boundary will always point away from the boundary because it is the convex sum of two vectors, one of which is normal to the boundary and the other of which is parallel to the boundary.
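The explicit check mentioned above can be performed directly on the two gradient vectors at a state; the sketch below (the tolerance and array representation are assumptions) tests whether a particular value of $k(s)$ would make the mixed gradient vanish:

```python
import numpy as np

def mixed_gradient_vanishes(grad_D, grad_N, k, eps=1e-12):
    """True if (1 - k) * grad_D + k * grad_N is (numerically) the zero vector,
    i.e. if k / (1 - k) = -[grad_D]_i / [grad_N]_i holds for every component i."""
    mixed = ((1.0 - k) * np.asarray(grad_D, dtype=float)
             + k * np.asarray(grad_N, dtype=float))
    return bool(np.all(np.abs(mixed) < eps))
```

If the test ever returned true, the learner could, for example, perturb $k(s)$ slightly before issuing the action.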
Acknowledgements
This work was supported by a grant ECS-9214866 to Prof. A. G. Barto from the
National Science Foundation, and by grants IRI-9116297, IRI-9208920, and CDA-
8922572 to Prof. R. Grupen from the National Science Foundation.
References
Barto, A.G., Bradtke, S.J., & Singh, S.P. (to appear). Learning to act using real-
time dynamic programming. Artificial Intelligence.
Connolly, C. (1992). Applications of harmonic functions to robotics. In The 1992
International Symposium on Intelligent Control. IEEE.
Connolly, C. & Grupen, R. (1993). On the applications of harmonic functions to
robotics. Journal of Robotic Systems, 10(7), 931-946.
Mahadevan, S. & Connell, J. (1990). Automatic programming of behavior-based
robots using reinforcement learning. Technical report, IBM Research Division,
T. J. Watson Research Center, Yorktown Heights, NY.
Moore, A.W. & Atkeson, C.G. (1993). Prioritized sweeping: Reinforcement learning
with less data and less real time. Machine Learning, 13(1).
Singh, S.P. (1992). Transfer of learning by composing solutions for elemental se-
quential tasks. Machine Learning, 8(3/4), 323-339.
Singh, S.P. (1993). Learning to Solve Markovian Decision Processes. PhD thesis,
Department of Computer Science, University of Massachusetts. Also CMPSCI
Technical Report 93-77.
Sutton, R.S., Barto, A.G., & Williams, R.J. (1991). Reinforcement learning is direct
adaptive optimal control. In Proceedings of the American Control Conference,
pages 2143-2146, Boston, MA.
Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge
Univ., Cambridge, England.