Advanced Robotics
Fall 2009
Lecture 1: Introduction
Pieter Abbeel
UC Berkeley EECS
http://www.cs.berkeley.edu/~pabbeel/cs287-fa09
Page 1
Announcements
Communication:
Announcements: webpage
Email: pabbeel@cs.berkeley.edu
Enrollment:
Class Details
Prerequisites:
Page 2
Class Goals
Lecture outline
Page 3
Driverless cars
Page 4
Four-legged locomotion
[Kolter, Abbeel & Ng]
Two-legged locomotion
[Tedrake +al.]
Page 5
Mapping
Mapping
Page 6
Mobile Manipulation
[Quigley, Gould, Saxena, Ng + al.]
SLAM, localization, motion planning for navigation and grasping, grasp point
selection, (visual category recognition, speech recognition and synthesis)
Outline of Topics
Page 7
1. Control
1. Control (ctd)
Page 8
1. Control (ctd)
2. Estimation
Page 9
4. Reinforcement learning
Page 10
5. Misc. Topics
Simulation / FEM
POMDPs
k-armed bandits
separation principle
Reading materials
Control
Estimation
Reinforcement learning
Misc. topics
Page 11
Page 12
Announcements
Page 1
Control
Estimation
Manipulation/Grasping
Reinforcement Learning
Misc. Topics
Case Studies
Control in CS287
Overarching goal:
Page 2
Today's lecture
Tedrake, 1.2
Today's lecture
Page 3
[Figure: torque-actuated simple pendulum with mass m, length l, damping b, input torque u]
Pendulum dynamics:  I \ddot{\theta}(t) + b \dot{\theta}(t) + m g l \sin\theta(t) = u(t)
Page 4
With I = m l^2:  I \ddot{\theta}(t) + b \dot{\theta}(t) + m g l \sin\theta(t) = u(t)
The Matlab code that generated all discussed simulations will be posted on www.
Can we do better than this?
Simulations shown for initial conditions \theta(0) = 0, \dot{\theta}(0) = 0 and \theta(0) = \pi/4, \dot{\theta}(0) = 0.
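For concreteness, here is a minimal Matlab sketch that simulates the damped pendulum above with ode45 from the two initial conditions shown. The parameter values and the zero-torque input are illustrative assumptions of this sketch; the actual course code is the version posted on the website.
% Minimal simulation sketch of  I*thetadd + b*thetad + m*g*l*sin(theta) = u(t)
% Parameter values below are illustrative assumptions, not the course's.
m = 1; l = 1; b = 0.1; g = 9.81; I = m*l^2;
u = @(t) 0;                                  % zero torque; replace with a feedback law later
f = @(t, x) [ x(2);
              ( u(t) - b*x(2) - m*g*l*sin(x(1)) ) / I ];   % x = [theta; thetadot]
[t1, x1] = ode45(f, [0 10], [0; 0]);         % theta(0) = 0,    thetadot(0) = 0
[t2, x2] = ode45(f, [0 10], [pi/4; 0]);      % theta(0) = pi/4, thetadot(0) = 0
plot(t1, x1(:,1), t2, x2(:,1)); xlabel('t'); ylabel('\theta(t)');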
Page 5
Feedforward control
I \ddot{\theta}(t) + b \dot{\theta}(t) + m g l \sin\theta(t) = u(t),   with \theta(0) = 0, \dot{\theta}(0) = 0
Feedforward control
Simulation results:  I \ddot{\theta}(t) + c \dot{\theta}(t) + m g l \sin\theta(t) = u(t)
Can we do better than this?
Page 6
\theta(0) = 0, \dot{\theta}(0) = 0
Thus far: the single pendulum  I \ddot{\theta}(t) + b \dot{\theta}(t) + m g l \sin\theta(t) = u(t)
More generally, manipulator dynamics:  H(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) = B(q)u
A system is fully actuated in a state (q, \dot{q}, t) if, when in that state, it can be controlled to instantaneously accelerate in any direction.
Many systems of interest are of the form:
\ddot{q} = f_1(q, \dot{q}, t) + f_2(q, \dot{q}, t) u
Fully actuated in (q, \dot{q}, t):  rank f_2(q, \dot{q}, t) = dim q
Underactuated in (q, \dot{q}, t):  rank f_2(q, \dot{q}, t) < dim q
Page 7
For a fully actuated system we can pick any desired acceleration \ddot{q}(t) = v(t) and achieve it with the control law
u(t) = f_2^{-1}(q, \dot{q}, t) ( v(t) - f_1(q, \dot{q}, t) )
[The resulting system \ddot{q} = v is linear. The literature on control for linear systems is very extensive and hence this can be useful. This is an example of feedback linearization. More on this in future lectures.]
n-DOF manipulator:  H(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) = B(q)u
Here f_2 = H^{-1}B with H full rank; for B = I we have rank(H^{-1}B) = rank(B) = dim q, so the manipulator is fully actuated.
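As a concrete instance of the control law u = f_2^{-1}(v - f_1), here is a Matlab sketch for the pendulum. Parameter values are assumed, and the choice of v as a simple PD law toward a target angle is one possible choice for illustration, not prescribed by the slides.
% Feedback linearization sketch for the pendulum (illustrative values).
m = 1; l = 1; b = 0.1; g = 9.81; I = m*l^2;
q_des = pi; Kp = 10; Kd = 2*sqrt(Kp);        % desired angle and PD gains (assumed)
v  = @(q, qd) Kp*(q_des - q) + Kd*(0 - qd);  % desired acceleration qdd = v
% For  I*qdd = u - b*qd - m*g*l*sin(q):  f1 = (-b*qd - m*g*l*sin(q))/I,  f2 = 1/I
uFL = @(q, qd) I * v(q, qd) + b*qd + m*g*l*sin(q);   % u = f2^{-1} (v - f1)
f = @(t, x) [ x(2); ( uFL(x(1), x(2)) - b*x(2) - m*g*l*sin(x(1)) ) / I ];
[t, x] = ode45(f, [0 10], [0; 0]);
plot(t, x(:,1));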
Page 8
Car
Acrobot
Cart-pole
Helicopter
I \ddot{\theta}(t) + b \dot{\theta}(t) + m g l \sin\theta(t) = u(t)
Page 9
I \ddot{\theta}(t) + b \dot{\theta}(t) + m g l \sin\theta(t) = u(t)
steady-state error
I \ddot{\theta}(t) + b \dot{\theta}(t) + m g l \sin\theta(t) = u(t)
[\theta = 180 degrees is an unstable equilibrium point]
Page 10
Proportional control
Task: hold arm at 45 degrees
However, still:
Proportional control
Task: swing arm up to 180
degrees and hold there
Page 11
Current status
PD control
u(t) = K_p ( q_{desired}(t) - q(t) ) + K_d ( \dot{q}_{desired}(t) - \dot{q}(t) )
Page 12
Page 13
PID
u(t) = K_p ( q_{desired}(t) - q(t) ) + K_d ( \dot{q}_{desired}(t) - \dot{q}(t) ) + K_i \int_0^t ( q_{desired}(\tau) - q(\tau) ) d\tau
At a steady state, differentiating the control gives
0 = K_p ( \dot{q}_{desired} - \dot{q} ) + K_d ( \ddot{q}_{desired} - \ddot{q} ) + K_i ( q_{desired}(t) - q(t) ) = K_i ( q_{desired}(t) - q(t) ),
so the integral term drives the steady-state error to zero.
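A minimal simulation sketch of the PID law above applied to the pendulum; the integral of the error is carried as an extra state. The gains and parameters below are illustrative, untuned assumptions.
% PID control of the pendulum; state x = [theta; thetadot; integral of error].
m = 1; l = 1; b = 0.1; g = 9.81; I = m*l^2;
q_des = pi/4; Kp = 20; Kd = 5; Ki = 10;      % gains are assumed, untuned values
f = @(t, x) [ x(2);
              ( Kp*(q_des - x(1)) + Kd*(0 - x(2)) + Ki*x(3) ...
                - b*x(2) - m*g*l*sin(x(1)) ) / I;
              q_des - x(1) ];                % d/dt of the error integral
[t, x] = ode45(f, [0 20], [0; 0; 0]);
plot(t, x(:,1));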
Recap so far
Model inaccuracy
Perturbations + instability
Steady state error reduced by (roughly) a factor Kp, but large Kp can be
problematic in the presence of delay --> add integral term
Ignores momentum --> add derivative term
Remaining questions:
Any guarantees?
Page 14
PID tuning
Notation:  K_I = k_p / T_i,   K_D = k_p T_d
Page 15
Kc = 100;
Tc = 0.63s;
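If Kc and Tc here are the ultimate gain and ultimate period from the Ziegler-Nichols experiment, the classic tuning table gives Kp = 0.6 Kc, Ti = Tc/2, Td = Tc/8; with the numbers above that would be roughly Kp = 60, Ti = 0.315 s, Td = 0.079 s, i.e. K_I = Kp/Ti ~ 190 and K_D = Kp*Td ~ 4.7 (a starting point, to be refined by hand).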
Page 16
Matters in practice!
Page 17
?
Note: many underactuated systems do use PID type controllers in their core (e.g.,
helicopter governor, gyro)
Page 18
Page 1
\ddot{q} = f_1(q, \dot{q}, t) + f_2(q, \dot{q}, t) u  is fully actuated in (q, \dot{q}, t) iff  rank f_2(q, \dot{q}, t) = dim q.
Given a fully actuated system and a (smooth) target trajectory q*
Model inaccuracy
Perturbations + instability
Steady state error reduced by (roughly) a factor Kp, but large Kp can be problematic in the
presence of delay --> add integral term
Ignores momentum --> add derivative term
+ K_i \int_0^t ( q^*(\tau) - q(\tau) ) d\tau
PID constants require tuning: often by hand, Ziegler-Nichols and TLC provide
good starting points, (policy search could automate this)
If control inputs do not directly relate to the degrees of freedom, we can use
feedback linearization to get that form:
\ddot{q}(t) = v(t)
Today's lecture
Page 2
--
Optional:
+ K_i \int_0^t ( q^*(\tau) - q(\tau) ) d\tau
Matters in practice!
Page 3
Lyapunov
Guarantees?
Equilibrium state. A state x* is an equilibrium state of the system \dot{x} = f(x) if f(x*) = 0.
Stability. The equilibrium state x* is said to be stable if, for any R > 0,
there exists r > 0 such that if ||x(0) - x*|| < r, then ||x(t) - x*|| < R for all
t >= 0. Otherwise, the equilibrium point is unstable.
Page 4
Stability illustration
Page 5
Proving stability
Page 6
Proof:
Page 7
Proof:
Suppose the trajectory x(t) does not converge to zero.
V(x(t)) is decreasing and nonnegative, so it converges to, say, \epsilon as t -> \infty.
Since x(t) does not converge to 0, we must have \epsilon > 0, so for all t,  \epsilon <= V(x(t)) <= V(x(0)).
The set C = {z | \epsilon <= V(z) <= V(x(0))} is closed and bounded, hence compact. So
\dot{V} (assumed continuous) attains its supremum on C, i.e., sup_{z \in C} \dot{V} = -a < 0.
Since \dot{V}(x(t)) <= -a for all t, we have
V(x(T)) = V(x(0)) + \int_0^T \dot{V}(x(t)) dt <= V(x(0)) - aT,
which for T large enough contradicts \epsilon <= V(x(T)). Hence x(t) must converge to zero.
Page 8
Example 1
\ddot{q} + \dot{q} + \sin q = u
u = \sin q + K_p (q^* - q) + K_d (0 - \dot{q})
Page 9
Example 1 (solution)
\ddot{q} + \dot{q} + g(q) = u
u = g(q) + K_p (q^* - q) + K_d (0 - \dot{q})
We choose V = 1/2 K_p (q - q^*)^2 + 1/2 \dot{q}^2.
This gives for \dot{V}:
\dot{V} = K_p (q - q^*) \dot{q} + \dot{q} \ddot{q}
        = K_p (q - q^*) \dot{q} + \dot{q} ( K_p (q^* - q) - K_d \dot{q} - \dot{q} )
        = -(1 + K_d) \dot{q}^2
Hence V satisfies: (i) V >= 0 and V = 0 iff (q, \dot{q}) = (q^*, 0), (ii) \dot{V} <= 0. Since the arm
cannot get stuck at any position q != q^* with \dot{q} = 0 (which can be easily shown
by noting that the acceleration is non-zero in such situations), the robot arm must
settle down at q = q^* and \dot{q} = 0, according to the invariant set theorem. Thus
the system is globally asymptotically stable.
Page 10
Page 11
Lyapunov recap
Page 12
Energy pumping nicely described in Tedrake Chapter 3. Enjoy the optional read!
Forthcoming lectures
Dynamic programming
Discretization
Examples:
Page 13
Announcement
Page 1
Dynamic programming
Discretization
Examples:
Continuous discrete [Chow and Tsitsiklis, 1991; Munos and Moore, 2001] [[[Kushner
and Dupuis 2001 (optional)]]]
Error bounds:
Solution through value iteration [Tedrake Ch.6, Sutton and Barto Ch.1-4]
Value function: Chow and Tsitsiklis; Kushner and Dupuis; function approximation [Gordon 1995;
Tsitsiklis and Van Roy, 1996]
Value function close to optimal resulting policy good
Page 2
Continuous-time optimal control problem:  \dot{x}(t) = f(x(t), u(t), t),  cost function g(x, u, t)
Markov decision process (MDP):  (S, A, P, H, g)
S: set of states
A: set of actions
P: dynamics model,  P(x_{t+1} = x' | x_t = x, u_t = u)
H: horizon
g: S x A -> R  cost function
Policy \pi = (\pi_0, \pi_1, ..., \pi_H),  \pi_k: S -> A
Goal: find  \pi^* = \arg\min_\pi J^\pi
Page 3
J^\pi(x) = E[ \sum_{t=0}^{H} g(x(t), u(t)) | x_0 = x, \pi ]
Dynamic programming recursion, from the end of the horizon backwards:
J_H(x) = \min_u g(x, u)
J_{H-1}(x) = \min_u g(x, u) + \sum_{x'} P(x'|x, u) J_H(x')
...
J_k(x) = \min_u g(x, u) + \sum_{x'} P(x'|x, u) J_{k+1}(x')
...
J_0(x) = \min_u g(x, u) + \sum_{x'} P(x'|x, u) J_1(x')
And the optimal policy:  \pi_k(x) = \arg\min_u g(x, u) + \sum_{x'} P(x'|x, u) J_{k+1}(x')
Running time: O(|S|^2 |A| H), vs. naive search over all policies, which would
require evaluation of |A|^{|S| H} policies.
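A compact Matlab sketch of this backward recursion on a small tabular MDP; all problem data below (transition probabilities and costs) is randomly generated purely for illustration.
% Finite-horizon value iteration on a tabular MDP (illustrative data).
nS = 3; nA = 2; H = 10;
P = rand(nS, nS, nA);  P = P ./ sum(P, 2);    % P(s, s', a) = P(s'|s,a), random example
G = rand(nS, nA);                              % cost g(s, a), random example
J = zeros(nS, H+2);  Pi = zeros(nS, H+1);      % J(:, k+1) stores J_k; J_{H+1} = 0
for k = H:-1:0                                 % backwards in time
  for s = 1:nS
    Qsa = zeros(nA, 1);
    for a = 1:nA
      Qsa(a) = G(s, a) + squeeze(P(s, :, a)) * J(:, k+2);
    end
    [J(s, k+1), Pi(s, k+1)] = min(Qsa);        % J_k(s) and the greedy action
  end
end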
Infinite-horizon discounted setting:  (S, A, P, \gamma, g)
\gamma: discount factor
Policy \pi = (\pi_0, \pi_1, ...),  \pi_k: S -> A
Value of a policy \pi:  J^\pi(x) = E[ \sum_{t=0}^{\infty} \gamma^t g(x(t), u(t)) | x_0 = x, \pi ]
Page 4
Value iteration:
For i = 0, 1, ...
  For all s \in S:  J^{(i+1)}(s) <- \min_u g(s, u) + \gamma \sum_{s'} P(s'|s, u) J^{(i)}(s')
Facts:
J^{(i)} -> J^* for i -> \infty
There is an optimal stationary policy \pi^* = (\pi^*, \pi^*, ...) which satisfies:
\pi^*(x) = \arg\min_u g(x, u) + \gamma \sum_{x'} P(x'|x, u) J^*(x')
Page 5
Can also derive HJB equation for the stochastic setting. Keywords for
finding out more: Controlled diffusions / diffusion jump processes.
References:
Page 6
Discretized MDP:
Discretization: example 1
Discrete states: { \xi_1, \xi_2, ..., \xi_6, ... }
Express the continuous successor state as a convex combination of nearby grid points:
P(\xi_2 | s, a) = p_A;   P(\xi_3 | s, a) = p_B;   P(\xi_6 | s, a) = p_C;   s.t.  s' = p_A \xi_2 + p_B \xi_3 + p_C \xi_6
Expected next value:  \sum_{s'} P(s'|s, a) J(s')  \approx  \sum_i P(\xi_i ; s, a) J(\xi_i)
Page 7
Page 8
Discretization: example 2
Discrete states: { \xi_1, \xi_2, ... }
Snap the successor state to the nearest grid point:  P(\xi_2 | s, a) = 1
Expected next value:  \sum_{s'} P(s'|s, a) J(s')  \approx  \sum_i P(\xi_i ; s, a) J(\xi_i)
This is nearest neighbor; one could also use a weighted combination of nearest neighbors.
Page 9
Discretization: example 3
s
P (i |j , u) =
Discrete states: { , , }
sj
P (s |s,u)1{s i }ds
sj
P (s |s,u)ds
Page 10
Continuous time
One might want to discretize time in a variable way such that one
discrete time transition roughly corresponds to a transition into
neighboring grid points/regions
Discounting: a continuous-time discount exp(-\beta t) corresponds to a per-step discount \gamma = exp(-\beta \Delta t) in discrete time.
Continuous-time example (double integrator, minimum time to the origin):
g(q, \dot{q}) = 0 if q = \dot{q} = 0, and 1 otherwise
The optimal policy is bang-bang:  u = -1 if \dot{q} sign(q) >= \sqrt{2 sign(q) q},  u = +1 otherwise.
[See Tedrake 6.6.3 for further details.]
Page 11
Kuhn triang., h = 1
optimal
Page 12
Continuous time:  \ddot{q} = u
In discrete time:  q_{t+1} = q_t + \dot{q}_t \Delta t,   \dot{q}_{t+1} = \dot{q}_t + u_t \Delta t
Cost function:  g(q, \dot{q}, u) = q^2 + u^2
Page 13
Kuhn triang., h = 1
optimal
Page 14
Nearest neighbor, h = 1
optimal
dt=0.1
Page 15
dt= 0.01
dt= 0.1
h = 0.1
h = 0.02
Discretization guarantees
Typical guarantees:
Combine with:
Page 16
Page 17
Function approximation
General idea
Page 18
Today
Great references:
Gordon, 1995, Stable function approximation in dynamic programming
Tsitsiklis and Van Roy, 1996, Feature based methods for large scale
dynamic programming
Bertsekas and Tsitsiklis, Neuro-dynamic programming, Chap. 6
Page 1
(S, A, P, \gamma, g);  \gamma: discount factor
Policy \pi = (\pi_0, \pi_1, ...),  \pi_k: S -> A
Value of a policy \pi:  J^\pi(x) = E[ \sum_{t=0}^{\infty} \gamma^t g(x(t), u(t)) | x_0 = x, \pi ]
Facts:
J^{(i)} -> J^* for i -> \infty
There is an optimal stationary policy \pi^* = (\pi^*, \pi^*, ...) which satisfies:
\pi^*(s) = \arg\min_u g(s, u) + \gamma \sum_{s'} P(s'|s, u) J^*(s')
Page 2
projection: find some \theta^{(i+1)} such that J_{\theta^{(i+1)}} approximates the backed-up values on the sample states \bar{S}, e.g. by least squares:
\theta^{(i+1)} = \arg\min_\theta \sum_{s \in \bar{S}} ( J_\theta(s) - (T J^{(i)})(s) )^2
Page 3
Potential guarantees?
i : J(i) J (i)
can we provide any guarantees?
: floss (J , J )
can we provide any guarantees?
Simple example
Function approximator: [1 2] *
Page 4
Simple example
J_\theta = [ J_\theta(x_1) ; J_\theta(x_2) ] = [ \theta ; 2\theta ]
Value iteration with least-squares projection gives  \theta^{(i)} = ( (6/5) \gamma )^i \theta^{(0)},
which diverges if \gamma > 5/6, even though the function approximation class can
represent the true value function.
Bellman operator
Value iteration back-up:  J^{(i+1)}(s) = \min_u \sum_{s'} P(s'|s, u) [ g(s, u) + \gamma J^{(i)}(s') ]
Bellman operator T
T : R^{|S|} -> R^{|S|} is defined as:
(T J)(s) = \min_u \sum_{s'} P(s'|s, u) [ g(s, u) + \gamma J(s') ]
Page 5
Contractions
Definition. The operator F is a \gamma-contraction w.r.t. some norm || . || if
\forall X, X':  || F X - F X' || <= \gamma || X - X' ||
Proof of Theorem 1
Useful fact.
Cauchy sequences: If for x_0, x_1, x_2, ..., we have that
\forall \epsilon > 0, \exists K: || x_M - x_N || < \epsilon for all M, N > K,
then we call x_0, x_1, x_2, ... a Cauchy sequence.
If x_0, x_1, x_2, ... is a Cauchy sequence, and x_i \in R^n, then there exists x^* \in R^n
such that \lim_{i \to \infty} x_i = x^*.
Proof.
Assume N > M. Then
|| F^M X - F^N X || = || \sum_{i=M}^{N-1} ( F^i X - F^{i+1} X ) ||
                   <= \sum_{i=M}^{N-1} || F^i X - F^{i+1} X ||
                   <= \sum_{i=M}^{N-1} \gamma^i || X - F X ||
                   <= || X - F X || \sum_{i=M}^{\infty} \gamma^i
                    = || X - F X || \gamma^M / (1 - \gamma).
As || X - F X || \gamma^M / (1 - \gamma) goes to zero for M going to infinity, for any
\epsilon > 0, for || F^M X - F^N X || < \epsilon to hold for all M, N > K, it suffices to pick K
large enough.
Hence X, F X, F^2 X, ... is a Cauchy sequence and converges.
Page 6
Suppose X_1 and X_2 are both fixed points of F, i.e. F X_1 = X_1 and F X_2 = X_2.
This implies
|| F X_1 - F X_2 || = || X_1 - X_2 ||.
At the same time we have from the contractive property of F
|| F X_1 - F X_2 || <= \gamma || X_1 - X_2 ||.
Combining both gives us
|| X_1 - X_2 || <= \gamma || X_1 - X_2 ||.
Since \gamma < 1, this implies
X_1 = X_2.
Therefore, the fixed point of F is unique.
Page 7
For i = 1, 2, ...
  for s = 1, ..., |S|:
    J(s) <- \min_{u \in A} g(s, u) + \gamma \sum_{s'} P(s'|s, u) J(s')
Page 8
such that every state s \in S occurs infinitely often. Define the operators T_{s(k)} as follows:
(T_{s(k)} J)(s) = (T J)(s)  if s(k) = s,  and  (T_{s(k)} J)(s) = J(s)  otherwise.
Asynchronous value iteration initializes J and then applies, in sequence, T_{s(0)}, T_{s(1)}, ...
projection: find some \theta^{(i+1)} such that  J_{\theta^{(i+1)}} \approx T J_{\theta^{(i)}}
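A sketch of this back-up + projection loop with a linear function approximator J_\theta(s) = \phi(s)^T \theta. All problem data is randomly generated for illustration, and least-squares projection is one possible choice of "find \theta^{(i+1)}".
% Fitted value iteration: back-up on sample states, then least-squares projection.
nS = 50; nA = 3; d = 5; gamma = 0.95;
Phi = randn(nS, d);                       % feature matrix, rows phi(s)' (illustrative)
P = rand(nS, nS, nA); P = P ./ sum(P, 2); % dynamics (illustrative)
G = rand(nS, nA);                         % cost g(s,a) (illustrative)
theta = zeros(d, 1);
for it = 1:100
  J = Phi * theta;                        % current value estimates
  TJ = zeros(nS, 1);
  for s = 1:nS
    Qsa = zeros(nA, 1);
    for a = 1:nA
      Qsa(a) = G(s, a) + gamma * squeeze(P(s, :, a)) * J;
    end
    TJ(s) = min(Qsa);                     % back-up (T J)(s)
  end
  theta = Phi \ TJ;                       % least-squares projection onto span(Phi)
end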
Page 9
Composing operators
Examples:
Page 10
For an averager  (\Pi J)(s) = \beta_{s0} + \sum_{s'} \beta_{s s'} J(s')  with \beta_{s s'} >= 0 and \sum_{s'} \beta_{s s'} <= 1:
| (\Pi J_1)(s) - (\Pi J_2)(s) | = | \sum_{s'} \beta_{s s'} ( J_1(s') - J_2(s') ) |
                              <= \max_{s'} | J_1(s') - J_2(s') |
                               = || J_1 - J_2 ||_\infty
This holds true for all s, hence we have
|| \Pi J_1 - \Pi J_2 ||_\infty <= || J_1 - J_2 ||_\infty
Linear regression
Page 11
s
ss J2 (s )|
2
1
Page 12
Today
Performance boosts
Speed-ups
Great references:
Gordon, 1995, Stable function approximation in dynamic programming
Tsitsiklis and Van Roy, 1996, Feature based methods for large scale dynamic programming
Bertsekas and Tsitsiklis, Neuro-dynamic programming, Chap. 6
Page 1
(S, A, P, \gamma, g)
Policy \pi = (\pi_0, \pi_1, ...),  \pi_k: S -> A
Value of a policy \pi:  J^\pi(x) = E[ \sum_{t=0}^{\infty} \gamma^t g(x(t), u(t)) | x_0 = x, \pi ]
Facts:
J^{(i)} -> J^* for i -> \infty
There is an optimal stationary policy \pi^* = (\pi^*, \pi^*, ...) which satisfies:
\pi^*(s) = \arg\min_u g(s, u) + \gamma \sum_{s'} P(s'|s, u) J^*(s')
Page 2
projection: find some \theta^{(i+1)} such that  J_{\theta^{(i+1)}}(s) \approx (T J_{\theta^{(i)}})(s)  for all s \in \bar{S}
[Diagram: abstract view of value iteration with function approximation. Each iteration alternates a back-up step, \bar{J}^{(i+1)} = T J_{\theta^{(i)}} computed for the sample states s \in \bar{S} (interpolating to evaluate successor states), with a projection step that fits J_{\theta^{(i+1)}} to the backed-up values.]
Page 3
P(x2|x1,u) = 1; P(x2|x2,u) = 1
g(x1,u) = 0; g(x2,u) = 0;
Function approximator: [1 2] *
VI w/ least squares function approximation
diverges for > 5/6 [see last lecture for details]
Contractions
|| T^{(k)} J - J^* ||_\infty = || T^{(k)} J - T^{(k)} J^* ||_\infty <= \gamma || T^{(k-1)} J - T^{(k-1)} J^* ||_\infty <= ... <= \gamma^k || J - J^* ||_\infty
Page 4
[Guarantee to be shown:  || J^\pi - J^* ||_\infty <= \frac{2\gamma}{1-\gamma} || J - J^* ||_\infty  for \pi greedy with respect to J.]
Proof
Page 5
Theorem. If || J - T J ||_\infty <= \epsilon, then we have that  || J - J^* ||_\infty <= \frac{\epsilon}{1-\gamma}.
Proof:
|| J - J^* ||_\infty = || J - T J + T J - T^2 J + T^2 J - T^3 J + ... - J^* ||_\infty
                    <= || J - T J ||_\infty + || T J - T^2 J ||_\infty + || T^2 J - T^3 J ||_\infty + ...
                    <= || J - T J ||_\infty ( 1 + \gamma + \gamma^2 + ... )
                    <= \frac{\epsilon}{1-\gamma}
Suppose the back-up is computed only approximately at every iteration, i.e., for all iterations i:
|| \hat{J}^{(i+1)} - T \hat{J}^{(i)} ||_\infty <= \epsilon.
Or, more generally, we have a noisy sequence of back-ups:  \hat{J}^{(i+1)} = T \hat{J}^{(i)} + w^{(i)},  || w^{(i)} ||_\infty <= \epsilon.
Then  || T^i \hat{J}^{(0)} - \hat{J}^{(i)} ||_\infty <= \epsilon ( 1 + \gamma + \gamma^2 + ... + \gamma^{i-1} ) <= \frac{\epsilon}{1-\gamma}.
Proof by induction:
Base case:  || \hat{J}^{(1)} - T \hat{J}^{(0)} ||_\infty <= \epsilon.
Induction: we also have for any i > 1:
|| T^i \hat{J}^{(0)} - \hat{J}^{(i)} ||_\infty = || T ( T^{i-1} \hat{J}^{(0)} ) - T \hat{J}^{(i-1)} - w^{(i-1)} ||_\infty
                                             <= \gamma || T^{i-1} \hat{J}^{(0)} - \hat{J}^{(i-1)} ||_\infty + \epsilon
                                             <= \epsilon ( 1 + \gamma ( 1 + \gamma + \gamma^2 + ... + \gamma^{i-2} ) )
Page 6
Theorem. Let \pi be the greedy policy with respect to J, and suppose || J - J^* ||_\infty <= \epsilon. Then
|| J^\pi - J^* ||_\infty <= \frac{2\gamma\epsilon}{1-\gamma}.
Here J^\pi = E[ \sum_{t=0}^{\infty} \gamma^t g(s_t, \pi(s_t)) ].
Proof
Recall:  (T J)(s) = \min_u g(s, u) + \gamma \sum_{s'} P(s'|s, u) J(s')
Similarly define:  (T_\pi J)(s) = g(s, \pi(s)) + \gamma \sum_{s'} P(s'|s, \pi(s)) J(s')
We have T J^* = J^* and (same result for the MDP with only one policy available) T_\pi J^\pi = J^\pi. Since \pi is greedy with respect to J, we also have T_\pi J = T J.
A very typical proof follows, with the main ingredient being adding and subtracting the same terms to make terms pairwise easier to compare/bound:
|| J^\pi - J^* ||_\infty <= || T_\pi J^\pi - T_\pi J ||_\infty + || T_\pi J - J^* ||_\infty
                         <= \gamma || J^\pi - J ||_\infty + || T J - T J^* ||_\infty
                         <= \gamma ( || J^\pi - J^* ||_\infty + || J^* - J ||_\infty ) + \gamma || J - J^* ||_\infty
                         <= \gamma || J^\pi - J^* ||_\infty + 2\gamma\epsilon,
and rearranging gives the claimed bound.
Page 7
Iterate: J T J
Guarantees when:
Sample states:
Page 8
Practicalities
Page 9
Speed-ups
Parallelization
Prioritized sweeping
Richardson extrapolation
Kuhn triangulation
Prioritized sweeping
Prioritized sweeping idea: focus updates on states for which the update is
expected to be most significant
Place states into priority queue and perform updates accordingly
For details: See Moore and Atkeson, 1993, Prioritized sweeping: RL with less
data and less real time
Page 10
Richardson extrapolation
Similarly:
Then we can get rid of the order h error term by using the following
approximation which combines both:
Kuhn triangulation
Page 11
Page 12
PHD
Page 1
Announcements
Yes.
Announcements
Ideally: we are able to identify a topic that relates both to your ongoing PhD research and the course.
You are very welcome to come up with your own project ideas, yet
make sure to pass them by me **before** you submit your abstract.
Feel free to stop by office hours or set an appointment (via email) to
discuss potential projects.
Page 2
Announcements
Milestones:
In practice
Page 3
Today
Linear Quadratic (LQ) setting --- special case: can solve continuous
optimal control problem exactly
Great reference:
[optional] Anderson and Moore, Linear Quadratic Methods --- standard reference for LQ
setting
g(x_t, u_t) = x_t^T Q x_t + u_t^T R u_t,   with Q > 0, R > 0.
For a square matrix X we have X > 0 if and only if for all non-zero vectors z we
have z^T X z > 0. Hence there is a non-zero cost for any state different from the
all-zeros state, and any input different from the all-zeros input.
Page 4
Value iteration
LQR:
J_{i+1}(x) = \min_u  x^T Q x + u^T R u + J_i(A x + B u)
Page 5
J_1(x) = \min_u  x^T Q x + u^T R u + J_0(A x + B u)
       = \min_u  x^T Q x + u^T R u + (A x + B u)^T P_0 (A x + B u)
Setting the gradient with respect to u to zero gives the minimizing input (1); substituting it back in gives the quadratic value function (2).
In summary, for
J_0(x) = x^T P_0 x,   x_{t+1} = A x_t + B u_t,   g(x, u) = u^T R u + x^T Q x,
we get
J_1(x) = x^T P_1 x,   with optimal input u = K_1 x,
K_1 = -(R + B^T P_0 B)^{-1} B^T P_0 A,
P_1 = Q + K_1^T R K_1 + (A + B K_1)^T P_0 (A + B K_1),
and at the next iteration  K_2 = -(R + B^T P_1 B)^{-1} B^T P_1 A,  etc.
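A sketch of the resulting backward Riccati recursion in Matlab; A, B, Q, R and the horizon below are placeholder values for illustration.
% Finite-horizon LQR via backward Riccati recursion (illustrative system).
A = [1 0.1; 0 1];  B = [0; 0.1];          % double-integrator-like example (assumed)
Q = eye(2);  R = 0.1;  H = 50;
P = Q;                                    % P_0 (terminal cost taken to be Q here)
K = cell(H, 1);
for i = 1:H
  K{i} = -(R + B'*P*B) \ (B'*P*A);        % K = -(R + B'P B)^{-1} B'P A
  P = Q + K{i}'*R*K{i} + (A + B*K{i})'*P*(A + B*K{i});   % Riccati update
end
% Control at the current state x, with H steps to go:  u = K{H} * x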
Page 6
= Axt + But
g(xt , ut ) = x
t Qxt + ut Rut
Affine systems
Page 7
g(xt , ut ) = x
t Qxt + ut Rut
= Axt + But + wt
g(xt , ut ) = x
t Qxt + ut Rut
Page 8
Nonlinear system:  x_{t+1} = f(x_t, u_t)
Suppose we have a target point (x^*, u^*) s.t.  x^* = f(x^*, u^*).
Linearizing the dynamics around (x^*, u^*) gives:
x_{t+1} \approx f(x^*, u^*) + \frac{\partial f}{\partial x}(x^*, u^*) (x_t - x^*) + \frac{\partial f}{\partial u}(x^*, u^*) (u_t - u^*)
with  A = \frac{\partial f}{\partial x}(x^*, u^*)  and  B = \frac{\partial f}{\partial u}(x^*, u^*).
Equivalently, in the variables z_t = x_t - x^*, v_t = u_t - u^*:  z_{t+1} = A z_t + B v_t   [= standard LQR]
Standard LQR:  x_{t+1} = A x_t + B u_t,   g(x_t, u_t) = x_t^T Q x_t + u_t^T R u_t
Page 9
Standard LQR:  x_{t+1} = A x_t + B u_t,   g(x_t, u_t) = x_t^T Q x_t + u_t^T R u_t
Standard LQR:  x_{t+1} = A x_t + B u_t,   g(x_t, u_t) = x_t^T Q x_t + u_t^T R u_t
To penalize the change in input, augment the state with the previous input, z_t = [ x_t ; u_{t-1} ], and use the input change v_t = u_t - u_{t-1} as the new control:
z_{t+1} = [ x_{t+1} ; u_t ] = [ A  B ; 0  I ] [ x_t ; u_{t-1} ] + [ B ; I ] v_t
cost = z_t^T Q' z_t + v_t^T R' v_t,   with  Q' = [ Q  0 ; 0  R ]
Page 10
Time-varying LQR:  x_{t+1} = A_t x_t + B_t u_t,   g(x_t, u_t) = x_t^T Q_t x_t + u_t^T R_t u_t
Value iteration runs backwards in time; at iteration i (i.e., time H - i):
K_{H-i} = -( R_{H-i} + B_{H-i}^T P_{i-1} B_{H-i} )^{-1} B_{H-i}^T P_{i-1} A_{H-i},  and P_i is updated accordingly.
Page 11
Problem statement:
minu0 ,u1 ,...,uH1
H1
t=0
H1
t=0
Page 12
min \sum_{t=0}^{H} g(x_t, u_t)   subject to   x_{t+1} = f(x_t, u_t)  \forall t
Page 13
Iterative LQR (iLQR): given the trajectory (x_t^{(i)}, u_t^{(i)}) from iteration i, linearize the dynamics around it:
x_{t+1} \approx f(x_t^{(i)}, u_t^{(i)}) + \frac{\partial f}{\partial x}(x_t^{(i)}, u_t^{(i)}) (x_t - x_t^{(i)}) + \frac{\partial f}{\partial u}(x_t^{(i)}, u_t^{(i)}) (u_t - u_t^{(i)})
Subtracting the same term on both sides gives the format we want:
x_{t+1} - x_{t+1}^{(i)} \approx f(x_t^{(i)}, u_t^{(i)}) - x_{t+1}^{(i)} + \frac{\partial f}{\partial x}(x_t^{(i)}, u_t^{(i)}) (x_t - x_t^{(i)}) + \frac{\partial f}{\partial u}(x_t^{(i)}, u_t^{(i)}) (u_t - u_t^{(i)})
In the variables  z_t = [ x_t - x_t^{(i)} ; 1 ]  and  v_t = u_t - u_t^{(i)}  this is a time-varying linear system  z_{t+1} = A_t z_t + B_t v_t,  with
A_t = [ \frac{\partial f}{\partial x}(x_t^{(i)}, u_t^{(i)})   f(x_t^{(i)}, u_t^{(i)}) - x_{t+1}^{(i)} ; 0   1 ],   B_t = [ \frac{\partial f}{\partial u}(x_t^{(i)}, u_t^{(i)}) ; 0 ]
Cost used in practice (keeping the solution close to the current trajectory, where the linearization is valid):
(1 - \alpha) g(x_t, u_t) + \alpha ( || x_t - x_t^{(i)} ||_2^2 + || u_t - u_t^{(i)} ||_2^2 )
Page 14
In matrix form: with  z_t = [ x_t - x_t^{(i)} ; 1 ]  and  v_t = u_t - u_t^{(i)},
z_{t+1} = A_t z_t + B_t v_t,   A_t = [ \frac{\partial f}{\partial x}(x_t^{(i)}, u_t^{(i)})   f(x_t^{(i)}, u_t^{(i)}) - x_{t+1}^{(i)} ; 0   1 ],   B_t = [ \frac{\partial f}{\partial u}(x_t^{(i)}, u_t^{(i)}) ; 0 ]
Page 15
Derivation of the  z_{t+1} = A_t z_t + B_t v_t  form:
x_{t+1} - x_{t+1}^{(i)} \approx f(x_t^{(i)}, u_t^{(i)}) - x_{t+1}^{(i)}
                              + \frac{\partial f}{\partial x}(x_t^{(i)}, u_t^{(i)}) (x_t - x_t^{(i)}) + \frac{\partial f}{\partial u}(x_t^{(i)}, u_t^{(i)}) (u_t - u_t^{(i)})
                             = A_t [ (x_t - x_t^{(i)}) ; 1 ] + B_t (u_t - u_t^{(i)})
for
A_t = [ \frac{\partial f}{\partial x}(x_t^{(i)}, u_t^{(i)})   f(x_t^{(i)}, u_t^{(i)}) - x_{t+1}^{(i)} ]   and   B_t = \frac{\partial f}{\partial u}(x_t^{(i)}, u_t^{(i)}).
The cost function can be used as is:  (x_t - x_t^*)^T Q (x_t - x_t^*) + (u_t - u_t^*)^T R (u_t - u_t^*).
Page 16
min
u
Yes!
At convergence of iLQR and DDP, we end up with linearizations around
the (state,input) trajectory the algorithm converged to
In practice: the system might not be on this trajectory due to
perturbations / the initial state being off / the dynamics model being off / ...
Solution: at time t, when asked to generate control input ut, we could re-solve the control problem using iLQR or DDP over time steps t
through H
Replanning the entire trajectory is often impractical in practice: replan over a shorter
horizon h instead = receding horizon control
Page 17
Multiplicative noise
Cart-pole
H(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) = B u
H(q) = [ m_c + m_p        m_p l \cos\theta ;
         m_p l \cos\theta  m_p l^2 ]
C(q,\dot{q}) = [ 0   -m_p l \dot{\theta} \sin\theta ;
                 0    0 ]
G(q) = [ 0 ; m_p g l \sin\theta ]
B = [ 1 ; 0 ]
Page 19
Page 20
MPC:
For t = 0, 1, 2, ...
1. Solve
   min_{u_t, u_{t+1}, ..., u_H}  \sum_{k=t}^{H} g(x_k, u_k)
   s.t.  x_{k+1} = f(x_k, u_k, 0),   k = t, t+1, ..., H-1
2. Execute only the first input u_t, observe the resulting state, and repeat.
Receding-horizon variant: replace the sum and the constraints by their counterparts over the shorter window:
   min  \sum_{k=t}^{t+h} g(x_k, u_k)
   s.t.  x_{k+1} = f(x_k, u_k, 0),   k = t, t+1, ..., t+h-1
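Schematically, the receding-horizon loop looks like the sketch below; solve_finite_horizon and true_system_step are placeholder names (not functions from the course) standing in for whatever trajectory optimizer (e.g. iLQR) and real plant are used.
% Receding horizon control loop (schematic sketch; placeholder functions).
x = x0;                                        % current state (assumed given)
for t = 0:T-1
  % Plan over the window [t, t+h] from the current state using the nominal model f:
  [u_plan, x_plan] = solve_finite_horizon(f, g, x, h);   % placeholder optimizer
  u = u_plan(:, 1);                            % execute only the first planned input
  x = true_system_step(x, u);                  % the real (disturbed) system moves on
end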
Page 1
Single shooting
H
g(xk , uk )
k=t
s.t.
xk+1 = f(xk , uk , 0) k = t, t + 1, . . . , H 1
Note: When taking derivatives, one ends up repeatedly applying the chain rule and the
same Jacobians keep re-occurring
Beneficial to not waste time re-computing same Jacobians; pretty straightforward,
various specifics with their own names. (E.g., back-propagation.)
Page 2
min  \sum_{k=t}^{H} g(x_k, u_k)
s.t.  x_{k+1} = f(x_k, u_k, 0),   k = t, t+1, ..., H-1
      h_k(x_k, u_k) <= 0,         k = t, t+1, ..., H-1
Goal: solve, with both the inputs and the states as optimization variables,
min_{u_t, ..., u_H, x_t, ..., x_H}  \sum_{k=t}^{H} g(x_k, u_k)
s.t.  x_{k+1} = f(x_k, u_k, 0),   k = t, t+1, ..., H-1
      h_k(x_k, u_k) <= 0,         k = t, t+1, ..., H-1
Corresponds to:
Page 3
Quasi-Newton methods
Further readings
Tedrake Chapter 9.
Diehl, Ferreau and Haverbeke, 2008, Nonlinear MPC overview
paper
Francesco Borelli (M.E., UC Berkeley): taught course in Spring
2009 on (linear) MPC
Packages:
We have ignored:
Page 4
Non-minimum phase
example
[Slotine and Li, p. 195, Example II.2]
Page 5
Feedback linearization
Feedback linearization
Page 6
Feedback linearization
Feedback linearization
Page 7
Feedback linearization
Further readings:
Announcements
Page 8
A system x_{t+1} = f(x_t, u_t) is controllable if for all x_0 and all x^*, there exists a time k and
a control sequence u_0, ..., u_{k-1} such that x_k = x^*.
Fact. The linear system x_{t+1} = A x_t + B u_t with x_t \in R^n is controllable iff
[B  AB  A^2 B  ...  A^{n-1} B] is full rank.
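This rank condition is easy to check numerically; A and B below are placeholders for the system matrices.
% Controllability check for x_{t+1} = A x_t + B u_t.
n = size(A, 1);
Ctrb = B;
for k = 1:n-1
  Ctrb = [Ctrb, A^k * B];    % [B, AB, A^2 B, ..., A^{n-1} B]
end
controllable = (rank(Ctrb) == n);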
Lagrangian dynamics
Page 9
Page 10
Reinforcement Learning
Page 1
For known initial state --- tree of sufficient statistics could suffice
Multi-armed bandits
Slot machines
Clinical trials
Advertising
Merchandising
Page 2
Multi-armed bandits
Information state
Page 3
Semi-MDP
Transition model:
Objective:
Bellman update:
Optimal stopping
1. continue
2. stop and accumulate reward g for current time and
for all future times.
Page 4
Optimal stopping
Hence, for fixed g, we can find the value of each state in the optimal
stopping problem by dynamic programming
However, we are interested in g*(s) for all s:
Optimal stopping
One approach:
Page 5
Reward rate
Reward rate
Page 6
(i)
1. Find the optimal stopping cost g (st ) for each
bandits current state
Page 7
Key requirements
Page 8
Further readings
Gittins, J.C., D.M. Jones. 1974. A dynamic allocation index for the sequential design
of experiments.[Gittins indices]
Different family of approaches: regret-based
j
Page 9
Reinforcement Learning
Page 1
Examples
goal:  max_\pi E[ \sum_t \gamma^t R(s_t, a_t) | \pi ]
Cleaning robot
Walking robot
Pole balancing
Server management
Page 2
Solving MDPs
R(s) = -0.02
R(s) = -0.04
R(s) = -0.1
R(s) = -2.0
Page 3
Value iteration
Policy iteration
Value Iteration
Algorithm:
Given Vi, calculate the values for all states for depth i+1:
Page 4
V3
Page 5
Convergence
Infinity norm:  ||V||_\infty = \max_s |V(s)|
Fact. Value iteration converges to the optimal value function V^* which satisfies
the Bellman equation:
\forall s \in S:  V^*(s) = \max_a \sum_{s'} T(s, a, s') ( R(s, a, s') + \gamma V^*(s') )
12
Page 6
Policy Iteration
Alternative approach:
13
Policy Iteration
14
Page 7
Comparison
Value iteration:
Policy iteration:
Asynchronous versions:
Page 8
Outline
Page 1
Page 2
The dual LP
Page 3
max_{\lambda >= 0}  \sum_{s,a,s'} \lambda(s,a) T(s,a,s') R(s,a,s')
s.t.  \forall s:  \sum_a \lambda(s,a) = c(s) + \gamma \sum_{s',a} \lambda(s',a) T(s',a,s)
Meaning of \lambda(s,a)?
Meaning of c(s)?
LP approach recap
The optimal value function satisfies:
\forall s:  V^*(s) = \max_a \sum_{s'} T(s,a,s') [ R(s,a,s') + \gamma V^*(s') ].
Relaxing the equality to  V(s) >= \max_a \sum_{s'} T(s,a,s') [ R(s,a,s') + \gamma V(s') ]  for all s   (1)
still has the optimal value function as one of its solutions, but we
might have introduced new solutions. So we look for an objective function that
will favor the optimal value function over other solutions of (1). To this extent,
we observed the following monotonicity property of the Bellman operator T:
\forall s: V_1(s) >= V_2(s)  implies  \forall s: (T V_1)(s) >= (T V_2)(s).
Any solution to (1) satisfies V >= T V, hence also T V >= T^2 V, hence also
T^2 V >= T^3 V, ..., T^{k-1} V >= T^k V -> V^*. Stringing these together, we get for
any solution V of (1) that the following holds:
V >= V^*.
Hence to find V^* as the solution to (1), it suffices to add an objective function
which favors the smallest solution:
min_V  c^T V   s.t.  \forall s, a:  V(s) >= \sum_{s'} T(s,a,s') [ R(s,a,s') + \gamma V(s') ].   (2)
The dual of this LP is:
max_{\lambda >= 0}  \sum_{s,a,s'} \lambda(s,a) T(s,a,s') R(s,a,s')
s.t.  \forall s:  \sum_a \lambda(s,a) = c(s) + \gamma \sum_{s',a} \lambda(s',a) T(s',a,s)
Page 4
Announcements
Value iteration:
Policy iteration:
Policy improvement:
Linear programming:
Page 5
Problem: We need
to estimate these
too!
Page 6
Sample of V(s):
Update to V(s):
Same update:
Sample of V(s):
Update to V(s):
Same update:
Page 7
Page 8
Step-size conditions:  \sum_{k=0}^{\infty} \alpha_k = \infty,   \sum_{k=0}^{\infty} \alpha_k^2 < \infty
Examples:
1/k
C/(C+k)
Experience replay
Page 9
Outline
Model-free approaches
Recap TD(0)
Sarsa
Q learning
TD Gammon
Page 1
Page 2
When experiencing st, at, st+1, rt+1, at+1 perform the following sarsa
update:
How about Q(s,a) for action a inconsistent with the policy at state s?
Converges (w.p. 1) to Q function for current policy for all states and
actions *if* all states and actions are visited infinitely often (assuming
proper step-sizing)
Exploration aspect
Page 3
Page 4
Sarsa converges w.p. 1 to an optimal policy and action-value function as long as all state-action pairs are visited an
infinite number of times and the policy converges in the limit
to the greedy policy (which can be arranged, e.g., by having
\epsilon-greedy policies with \epsilon = 1/t).
Q learning
Q(s_t, a_t) <- (1 - \alpha) Q(s_t, a_t) + \alpha [ r(s_t, a_t, s_{t+1}) + \gamma \max_a Q(s_{t+1}, a) ]
Compare to sarsa, which uses the action a_{t+1} actually taken by the current policy:
Q(s_t, a_t) <- (1 - \alpha) Q(s_t, a_t) + \alpha [ r(s_t, a_t, s_{t+1}) + \gamma Q(s_{t+1}, a_{t+1}) ]
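A tabular sketch of the Q-learning update inside an \epsilon-greedy interaction loop. The environment step env_step, the state/action counts, and the parameters are placeholders/assumptions of this sketch.
% Tabular Q-learning with epsilon-greedy exploration (schematic sketch).
nS = 20; nA = 4; gamma = 0.95; alpha = 0.1; epsilon = 0.1;
Q = zeros(nS, nA);
s = 1;                                          % assumed start state
for step = 1:100000
  if rand < epsilon, a = randi(nA);             % explore
  else, [~, a] = max(Q(s, :)); end              % exploit
  [s_next, r] = env_step(s, a);                 % placeholder for the environment
  Q(s, a) = (1 - alpha) * Q(s, a) + alpha * ( r + gamma * max(Q(s_next, :)) );
  s = s_next;
end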
Page 5
Q learning
Q-Learning Properties
Page 6
Exploration / Exploitation
Page 7
Exploration Functions
When to explore
Exploration function
Page 8
t:
t+1:
t+2:
+also:
Page 9
Note that at the next time step we update V (st+1 ). This (crudely speaking)
results in having a better estimate of the value function for state st+1 . TD()
takes advantage of the availability of this better estimate to improve the update
we performed for V (st ) in the previous step of the algorithm. Concretely, TD()
performs another update on V (st ) to account for our improved estimate of
V (st+1 ) as follows:
V(s_t) <- V(s_t) + \alpha \lambda \delta_{t+1}
where \lambda is a fudge factor that determines how heavily we weight changes in
the value function for s_{t+1}.
Similarly, at time t + 2 we perform the following set of updates:
V(s_{t+2}) <- V(s_{t+2}) + \alpha \delta_{t+2},  with  \delta_{t+2} = R(s_{t+2}) + \gamma V(s_{t+3}) - V(s_{t+2}),
together with \lambda-weighted updates to V(s_{t+1}), V(s_t), ...
TD()
Page 10
TD:
Sample targets =
R(s_t) + \gamma V(s_{t+1})
R(s_t) + \gamma R(s_{t+1}) + \gamma^2 V(s_{t+2})
R(s_t) + \gamma R(s_{t+1}) + \gamma^2 R(s_{t+2}) + \gamma^3 V(s_{t+3})
...
R(s_t) + \gamma R(s_{t+1}) + \gamma^2 R(s_{t+2}) + ... + \gamma^{T-t} R(s_T)
TD(\lambda) mixes these targets with weights determined by \lambda \in [0,1]
Forward view equivalent to backward view
Page 11
Sarsa()
Watkins Q()
Page 12
Replacing traces
Recall TD()
Page 13
Recap RL so far
Page 14
Assignment #1
Page 1
Recap RL so far
Radial basis functions:  \phi_i(s) = \exp( -\frac{1}{2} (s - s_i)^T \Sigma^{-1} (s - s_i) )
Polynomials:
Fourier basis
ji ji
ji
Page 2
Example: tetris
state: board configuration + shape of the falling piece: ~2^200 states!
V_\theta(s) = \sum_{i=1}^{22} \theta_i \phi_i(s)
Ten basis functions, 0, . . . , 9, mapping the state to the height h[k] of each
of the ten columns.
Nine basis functions, 10, . . . , 18, each mapping the state to the absolute
difference between heights of successive columns: |h[k+1] h[k]|, k = 1, .
. . , 9.
One basis function, 19, that maps state to the maximum column height:
maxk h[k]
One basis function, 20, that maps state to the number of holes in the
board.
One basis function, 21, that is equal to 1 in every state.
[Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis 1996 (TD);Kakade 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]
Objective
Page 3
encountered states s
2
V (s) V (s)
Iterate
Draw a state s according to P
Update:
Page 4
time t:
\theta_{t+1} <- \theta_t + \alpha [ R(s_t, a_t, s_{t+1}) + \gamma V_{\theta_t}(s_{t+1}) - V_{\theta_t}(s_t) ] \nabla_\theta V_{\theta_t}(s_t)
time t+1:
\theta_{t+2} <- \theta_{t+1} + \alpha [ R(s_{t+1}, a_{t+1}, s_{t+2}) + \gamma V_{\theta_{t+1}}(s_{t+2}) - V_{\theta_{t+1}}(s_{t+1}) ] \nabla_\theta V_{\theta_{t+1}}(s_{t+1})
Combined (TD(\lambda) with an eligibility trace e_t): the trace accumulates the gradients,
e_{t+1} = \gamma \lambda e_t + \nabla_\theta V_\theta(s_{t+1}),  and each TD error updates \theta along the current trace.
Page 5
Guarantees
TD(\lambda) w/ linear function approximation: [Tsitsiklis and Van Roy, 1997]
***If*** samples are generated from traces of execution of the policy \pi, and for
an appropriate choice of step-sizes \alpha, TD(\lambda) converges and at convergence: [D =
expected discounted state visitation frequencies under policy \pi]
|| V_\theta - V^\pi ||_D <= \frac{1}{1 - \gamma} || \Pi_D V^\pi - V^\pi ||_D
Sarsa(
) w/linear function approximation: same as TD
Q w/linear function approximation: [Melo and Ribeiro, 2007] Convergence
to reasonable Q value under certain assumptions, including: s, a(s, a)1 1
[Could also use infinity norm contraction function approximators to attain
convergence --- see earlier lectures. However, this class of function
approximators tends to be more restrictive.]
Off-policy counterexamples
Page 6
Off-policy counterexamples
Tsitsiklis and Van Roy counterexample: complete backup with off-policy linear regression [i.e., uniform least
squares, rather than weighted by state visitation rates]
Back-up:  \min_\theta \sum_s D(s) ( (T^\pi V)(s) - \phi(s)^T\theta )^2
Key observations:
\forall V_1, V_2:  || \Pi_D V_1 - \Pi_D V_2 ||_D <= || V_1 - V_2 ||_D
Page 7
Weighted norm:  || x ||_D^2 = \sum_i D(i) x(i)^2
Bellman operator:  (T^\pi J)(s) = g(s, \pi(s)) + \gamma \sum_{s'} P(s'|s, \pi(s)) J(s')
T^{(\lambda)} operator:  (T^{(\lambda)} J)(s) = (1 - \lambda) \sum_{m=0}^{\infty} \lambda^m E[ \sum_{k=0}^{m} \gamma^k g(s_k) + \gamma^{m+1} J(s_{m+1}) ]
At convergence:  || V_\theta - V^\pi ||_D <= \frac{1}{1 - \gamma} || \Pi_D V^\pi - V^\pi ||_D
Page 8
Empirical comparison
Backgammon
Page 9
Backgammon
Page 10
Input features
1 piece unit1=1;
Neural net
\phi(i), i = 1, ..., 198
hidden units:  h(j) = \frac{1}{1 + \exp( -\sum_i w_{ij} \phi(i) )}
output:  \frac{1}{1 + \exp( -\sum_j w_j h(j) )}
Neural nets
Learning
Page 12
Results
Page 13
(T^\pi V)(s) = \sum_{s'} P(s'|s, \pi(s)) [ R(s, \pi(s), s') + \gamma V(s') ]
Back-up, then perform regression to fit the features to the backed-up values:
\min_\theta \sum_{(s,a,s')} ( \phi(s)^T\theta - [ R(s,a,s') + \gamma \phi(s')^T\theta^{(old)} ] )^2
Page 1
Iterate:
\theta^{(new)} = \arg\min_\theta \sum_{(s,a,s')} ( \phi(s)^T\theta - [ R(s,a,s') + \gamma \phi(s')^T\theta^{(old)} ] )^2
Solution:
\theta^{(new)} = (\Phi^T \Phi)^{-1} \Phi^T ( R + \gamma \Phi' \theta^{(old)} )
Fixed point?
(\Phi^T \Phi) \theta = \Phi^T ( R + \gamma \Phi' \theta )
\Phi^T ( \Phi - \gamma \Phi' ) \theta = \Phi^T R
\theta = ( \Phi^T ( \Phi - \gamma \Phi' ) )^{-1} \Phi^T R
Page 2
LSTD(0)
Collect state-action-state triples (s_i, a_i, s_i') according to a policy \pi
\Phi = [ \phi(s_1)^T ; \phi(s_2)^T ; ... ; \phi(s_m)^T ],   \Phi' = [ \phi(s_1')^T ; \phi(s_2')^T ; ... ; \phi(s_m')^T ],   R = [ R(s_1,a_1,s_1') ; R(s_2,a_2,s_2') ; ... ; R(s_m,a_m,s_m') ]
Then  V^\pi(s) \approx \phi(s)^T \theta   for   \theta = ( \Phi^T ( \Phi - \gamma \Phi' ) )^{-1} \Phi^T R
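The closed-form solution above is a few lines of Matlab once the sampled features are stacked; Phi, PhiNext and Rvec below are placeholder variable names built from the collected triples.
% LSTD(0): theta = ( Phi' (Phi - gamma PhiNext) )^{-1} Phi' R
% Phi(i,:)     = phi(s_i)'      (features of visited states)
% PhiNext(i,:) = phi(s_i')'     (features of successor states)
% Rvec(i)      = R(s_i, a_i, s_i')
gamma = 0.95;
A = Phi' * (Phi - gamma * PhiNext);
b = Phi' * Rvec;
theta = A \ b;                       % value estimate: V(s) ~= phi(s)' * theta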
Iterate
Tweaks:
Page 3
\Phi_m = [ \phi(s_1)^T ; ... ; \phi(s_m)^T ],   \Phi'_m = [ \phi(s_1')^T ; ... ; \phi(s_m')^T ],   R_m = [ R(s_1,a_1,s_1') ; ... ; R(s_m,a_m,s_m') ]
V^\pi(s) \approx \phi(s)^T \theta_m   for   \theta_m = ( \Phi_m^T ( \Phi_m - \gamma \Phi'_m ) )^{-1} \Phi_m^T R_m
Recursive LSTD (RLSTD): maintain  A_m = \Phi_m^T ( \Phi_m - \gamma \Phi'_m )  and  b_m = \Phi_m^T R_m,  so  \theta_m^{RLSTD} = A_m^{-1} b_m.
When sample m+1 arrives:
A_{m+1} = A_m + \phi(s_{m+1}) ( \phi(s_{m+1}) - \gamma \phi(s_{m+1}') )^T,   b_{m+1} = b_m + \phi(s_{m+1}) r_{m+1}
Sherman-Morrison formula:
A_{m+1}^{-1} = A_m^{-1} - \frac{ A_m^{-1} \phi(s_{m+1}) ( \phi(s_{m+1}) - \gamma \phi(s_{m+1}') )^T A_m^{-1} }{ 1 + ( \phi(s_{m+1}) - \gamma \phi(s_{m+1}') )^T A_m^{-1} \phi(s_{m+1}) }
Page 4
Can be applied to the non-linear setting too: simply linearize the non-linear function
approximator around the current estimate of \theta; not globally optimal, but likely
still better than naive gradient descent
(+ prior --> Extended Kalman filter)
Page 5
Page 6
TD methods recap
Batch version
Applications of TD methods
Backgammon
Cartpole balancing
Acrobot swing-up
Bicycle riding
Page 7
Large MDPs:
TD methods:
Imitation learning
POMDPS
Hierarchical methods
Fine tuning policies through running trials on a real system, Robotic success stories
Partial observability
Reward shaping
Stochastic approximation
Page 8
Imitation learning
If expert available, could use expert trace s1, a1, s2, a2,
s3, a3, to learn something from the expert
Trajectory primitives:
Page 9
Behavioral cloning
If expert available, could use expert trace s1, a1, s2, a2, s3, a3, to
learn the expert policy : S A
Class of policies to learn:
Advantages:
Minuses:
Alvinn
Page 10
Alvinn
Issues:
Page 11
Transformed images
Transformed images
original
extrap1
Page 12
extrap2
Image buffering:
Road types:
Results
Ernst Dickmanns
Page 13
7 stages
2. Level out and fly to a distance of 32,000 feet from the starting point.
Page 14
Sammut + al
Sammut+al
Page 15
Tetris
V_\theta(s) = \sum_{i=1}^{22} \theta_i \phi_i(s)
Ten basis functions, 0, . . . , 9, mapping the state to the height h[k] of each
of the ten columns.
Nine basis functions, 10, . . . , 18, each mapping the state to the absolute
difference between heights of successive columns: |h[k+1] h[k]|, k = 1, .
. . , 9.
One basis function, 19, that maps state to the maximum column height:
maxk h[k]
One basis function, 20, that maps state to the number of holes in the
board.
One basis function, 21, that is equal to 1 in every state.
[Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis 1996 (TD);Kakade 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]
Page 16
Page 17
Page 1
22
i=1 i i (s)
Ten basis functions, 0, . . . , 9, mapping the state to the height h[k] of each
of the ten columns.
Nine basis functions, 10, . . . , 18, each mapping the state to the absolute
difference between heights of successive columns: |h[k+1] h[k]|, k = 1, .
. . , 9.
One basis function, 19, that maps state to the maximum column height:
maxk h[k]
One basis function, 20, that maps state to the number of holes in the
board.
One basis function, 21, that is equal to 1 in every state.
[Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis 1996 (TD);Kakade 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]
Page 2
Page 3
+ C
i,j
i,j
(i)
(i)
exp( (s+ ))
(i)
exp( (s+ )+
(i)
exp( (sj )
Scientific inquiry
Page 4
Problem setup
Input:
No reward function
Inverse RL:
Can we recover R ?
Vs. Behavioral cloning (which directly learns the teachers policy using supervised
learning)
Lecture outline
Inverse RL intro
Case studies
Page 5
Max margin
Basic principle
Find R* such that
E[ \sum_{t=0}^{\infty} \gamma^t R^*(s_t) | \pi^* ]  >=  E[ \sum_{t=0}^{\infty} \gamma^t R^*(s_t) | \pi ]   for all \pi
Equivalently:
\sum_{s \in S} R^*(s) E[ \sum_{t=0}^{\infty} \gamma^t 1\{s_t = s\} | \pi^* ]  >=  \sum_{s \in S} R^*(s) E[ \sum_{t=0}^{\infty} \gamma^t 1\{s_t = s\} | \pi ]
Page 6
E[ \sum_{t=0}^{\infty} \gamma^t R(s_t) | \pi ] = E[ \sum_{t=0}^{\infty} \gamma^t w^T \phi(s_t) | \pi ]
                                              = w^T E[ \sum_{t=0}^{\infty} \gamma^t \phi(s_t) | \pi ]
                                              = w^T \mu(\pi)
Substituting into  E[ \sum_{t=0}^{\infty} \gamma^t R^*(s_t) | \pi^* ] >= E[ \sum_{t=0}^{\infty} \gamma^t R^*(s_t) | \pi ]  gives us:
w^{*T} \mu(\pi^*) >= w^{*T} \mu(\pi)   for all \pi
Page 7
t=0
R (st )| ] E[
t=0
R (st )|]
Recap of challenges
Let R(s) = w^T \phi(s), where w \in R^n and \phi: S -> R^n.
Find w^* such that  w^{*T} \mu(\pi^*) >= w^{*T} \mu(\pi)  for all \pi
Challenges:
Page 8
Ambiguity
Ambiguity
s.t.  w^T \mu(\pi^*) >= w^T \mu(\pi) + 1   \forall \pi
s.t.  w^T \mu(\pi^*) >= w^T \mu(\pi) + m(\pi^*, \pi)   \forall \pi
Page 9
Expert suboptimality
s.t.  w^T \mu(\pi^*) >= w^T \mu(\pi) + m(\pi^*, \pi)   \forall \pi
Expert suboptimality: add slack variables,
min_{w, \xi^{(i)}}  ||w||_2^2 + C \sum_i \xi^{(i)}
s.t.  w^T \mu(\pi^{*(i)}) >= w^T \mu(\pi) + m(\pi^{*(i)}, \pi) - \xi^{(i)}
Page 10
\forall i, \forall \pi^{(i)}
Constraint generation
Initialize \Pi^{(i)} = {} for all i and then iterate:
Solve
min_{w, \xi^{(i)}}  ||w||_2^2 + C \sum_i \xi^{(i)}
s.t.  w^T \mu(\pi^{*(i)}) >= w^T \mu(\pi) + m(\pi^{*(i)}, \pi) - \xi^{(i)}   \forall i, \forall \pi \in \Pi^{(i)}
For each i, compute the optimal policy for the current w, add it to \Pi^{(i)}, and repeat until no (significantly) violated constraints remain.
Page 11
E[ \sum_{t=0}^{\infty} \gamma^t R(s_t) | \pi ] = w^T \mu(\pi)
[Figure: illustration in feature-expectation space of max margin (w_{mm}), structured max margin (w_{smm}), and constraint generation: policies \pi^{(0)}, \pi^{(1)}, \pi^{(2)}, ... are generated as constraints and the weight vectors w^{(2)}, w^{(3)} = w_{mm} converge to the max-margin solution.]
Page 12
Page 13
If expert suboptimal:
Lecture outline
Inverse RL intro
Max-margin
Feature matching
Case studies
Page 14
Recall:  Q^*(s, a; R) = R(s) + \gamma \sum_{s'} P(s'|s, a) V^*(s'; R)
Model the expert's action choices as  P(a | s) = \frac{1}{Z(s; R, \alpha)} \exp( \alpha Q^*(s, a; R) )
Likelihood of the demonstrations:
P((s_1, a_1)) ... P((s_m, a_m)) = \frac{1}{Z(s_1; R, \alpha)} \exp( \alpha Q^*(s_1, a_1; R) ) ... \frac{1}{Z(s_m; R, \alpha)} \exp( \alpha Q^*(s_m, a_m; R) )
Ramachandran and Amir, AAAI2007: MCMC method to sample from this distribution
Neu and Szepesvari, UAI2007: gradient method to find local optimum of the likelihood
Lecture outline
Inverse RL intro
Case studies:
Highway driving,
Route inference,
Quadruped locomotion
Page 15
Abbeel and Ng, ICML 2004; Syed and Schapire, NIPS 2007
Highway driving
Teacher in Training World
Input:
Page 16
Learned
behavior
Driving
demonstration
Learned
behavior
Staying on-road,
Forward vs. reverse driving,
Amount of switching between forward and reverse,
Lane keeping,
On-road vs. off-road,
Curvature of paths.
Page 17
Experimental setup
Page 18
Sloppy driving-style
Page 19
Time
Money
Stress
Skill
Ziebart+al, 2007/8/9
Time
Fuel
Safety
Stress
Skill
Mood
Distance
Speed
Type
Lanes
Turns
Context
Ziebart+al, 2007/8/9
Page 20
Data Collection
25 Taxi Drivers
Length
Speed
Road
Type
Lanes
Accidents
Construction
Congestion
Time of day
Destination Prediction
Page 21
Ziebart+al, 2007/8/9
Quadruped
Experimental setup
Page 22
Without learning
Page 23
Inverse RL history
Page 24
Inverse RL history
Page 25
Page 1
Problem setup
Input:
No reward function
Inverse RL:
Can we recover R ?
Vs. Behavioral cloning (which directly learns the teachers policy using supervised
learning)
Staying on-road,
Forward vs. reverse driving,
Amount of switching between forward and reverse,
Lane keeping,
On-road vs. off-road,
Curvature of paths.
Page 2
Experimental setup
Page 3
Sloppy driving-style
Quadruped
Page 4
Experimental setup
Without learning
Page 5
Announcements
Grading:
Page 6
Lecture outline
Policy search
Page 7
Locus length
Page 8
After learning
Page 9
U(\theta) = E[ \sum_{t=0}^{H} R(s_t, u_t, s_{t+1}) | \theta ]
Numerical optimization: find the gradient w.r.t. \theta and take a step in the
gradient direction.
\frac{\partial U}{\partial \theta_i} = \sum_{t=0}^{H} \frac{\partial R}{\partial s}(s_t, u_t, s_{t+1}) \frac{\partial s_t}{\partial \theta_i} + \frac{\partial R}{\partial u}(s_t, u_t, s_{t+1}) \frac{\partial u_t}{\partial \theta_i} + \frac{\partial R}{\partial s'}(s_t, u_t, s_{t+1}) \frac{\partial s_{t+1}}{\partial \theta_i}
\frac{\partial s_t}{\partial \theta_i} = \frac{\partial f}{\partial s}(s_{t-1}, u_{t-1}) \frac{\partial s_{t-1}}{\partial \theta_i} + \frac{\partial f}{\partial u}(s_{t-1}, u_{t-1}) \frac{\partial u_{t-1}}{\partial \theta_i}
\frac{\partial u_t}{\partial \theta_i} = \frac{\partial \pi}{\partial \theta_i}(s_t, \theta) + \frac{\partial \pi}{\partial s}(s_t, \theta) \frac{\partial s_t}{\partial \theta_i}
Computing these recursively, starting at time 0, is a discrete-time instantiation
of Real Time Recurrent Learning (RTRL).
U(\theta) = E[ \sum_{t=0}^{H} R(s_t, u_t) ] = \sum_{t=0}^{H} \sum_{s,u} P_t(s_t = s, u_t = u; \theta) R(s, u)
Numerical optimization: find the gradient w.r.t. \theta and take a step in the
gradient direction.
\frac{\partial U}{\partial \theta_i} = \sum_{t=0}^{H} \sum_{s,u} \frac{\partial P_t}{\partial \theta_i}(s_t = s, u_t = u; \theta) R(s, u)
Using the chain rule we obtain the following recursive procedure:
P_t(s_t = s, u_t = u; \theta) = \sum_{s',u'} P_{t-1}(s_{t-1} = s', u_{t-1} = u') T(s', u', s) \pi(u|s; \theta)
\frac{\partial P_t}{\partial \theta_i}(s_t = s, u_t = u; \theta) = \sum_{s',u'} \frac{\partial P_{t-1}}{\partial \theta_i}(s_{t-1} = s', u_{t-1} = u'; \theta) T(s', u', s) \pi(u|s; \theta)
                                                                + \sum_{s',u'} P_{t-1}(s_{t-1} = s', u_{t-1} = u'; \theta) T(s', u', s) \frac{\partial \pi}{\partial \theta_i}(u|s; \theta)
Page 10
Policy gradient
Analytical gradients:
- Deterministic, known dynamics: OK (taking derivatives can be time consuming and error-prone)
- Deterministic, unknown dynamics: N/A
- Stochastic, known dynamics: OK, but often computationally impractical
- Stochastic, unknown dynamics: N/A
Finite differences
We can compute the gradient g using standard finite difference methods, as
follows:
\frac{\partial U}{\partial \theta_j}(\theta) = \frac{ U(\theta + \epsilon e_j) - U(\theta - \epsilon e_j) }{ 2\epsilon }
where e_j = (0, ..., 0, 1, 0, ..., 0)^T is the j-th standard basis vector (1 in the j-th entry).
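As a sketch: U below is a placeholder for a routine that estimates the expected return of policy parameters theta (e.g. by averaging roll-outs); fixing the random seed across the two evaluations is the variance-reduction trick mentioned in the table that follows.
% Central finite-difference estimate of grad U(theta) (schematic sketch).
epsilon = 1e-2;                          % perturbation size (assumed)
n = numel(theta);
g = zeros(n, 1);
for j = 1:n
  e = zeros(n, 1);  e(j) = 1;            % j-th standard basis vector
  g(j) = ( U(theta + epsilon*e) - U(theta - epsilon*e) ) / (2*epsilon);
end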
Page 12
Policy gradient
Analytical:
- Deterministic, known dynamics: OK (taking derivatives can be time consuming and error-prone)
- Deterministic, unknown dynamics: N/A
- Stochastic, known dynamics: OK, but often computationally impractical
- Stochastic, unknown dynamics: N/A
Finite differences:
- Deterministic, known dynamics: OK (sometimes computationally more expensive than analytical)
- Deterministic, unknown dynamics: OK
- Stochastic, known dynamics: OK. With N = #roll-outs: naive O(N^{-1/4}), or O(N^{-2/5}); fixing the random seed gives O(N^{-1/2}) [1]
- Stochastic, unknown dynamics: same as known dynamics, but no fixing of the random seed
[1] P. Glynn, "Likelihood ratio gradient estimation: an overview," in Proceedings of the 1987 Winter Simulation Conference, Atlanta, GA, 1987, pp. 366-375.
Page 13
Assumption:
Stochasticity:
Page 14
Page 15
U(\theta) = E[ \sum_{t=0}^{H} R(s_t, u_t); \theta ] = \sum_\tau P(\tau; \theta) R(\tau)
Page 16
\nabla_\theta U(\theta) = \nabla_\theta \sum_\tau P(\tau; \theta) R(\tau)
                       = \sum_\tau \nabla_\theta P(\tau; \theta) R(\tau)
                       = \sum_\tau P(\tau; \theta) \frac{ \nabla_\theta P(\tau; \theta) }{ P(\tau; \theta) } R(\tau)
                       = \sum_\tau P(\tau; \theta) \nabla_\theta \log P(\tau; \theta) R(\tau)
Approximate with the empirical estimate for m sample paths under policy \pi_\theta:
\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) R(\tau^{(i)})
P(\tau^{(i)}; \theta) = \prod_{t=0}^{H} P(s_{t+1}^{(i)} | s_t^{(i)}, u_t^{(i)}) \pi_\theta(u_t^{(i)} | s_t^{(i)})
\nabla_\theta \log P(\tau^{(i)}; \theta) = \nabla_\theta [ \sum_{t=0}^{H} \log P(s_{t+1}^{(i)} | s_t^{(i)}, u_t^{(i)}) + \sum_{t=0}^{H} \log \pi_\theta(u_t^{(i)} | s_t^{(i)}) ]
                                        = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(u_t^{(i)} | s_t^{(i)})
(the dynamics-model terms do not depend on \theta and drop out; only the policy terms remain)
\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) R(\tau^{(i)})
Here:  \nabla_\theta \log P(\tau^{(i)}; \theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(u_t^{(i)} | s_t^{(i)})
Unbiased means:  E[\hat{g}] = \nabla_\theta U(\theta)
We can obtain a gradient estimate from a single trial run! While the math we
did is sound (and constitutes the commonly used derivation), this could seem
a bit surprising at first. Let's perform another derivation which might give us
some more insight.
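A sketch of this estimator for a softmax policy over discrete actions. The trajectory data structure and the feature map phi are placeholders, and the softmax form is one concrete choice of \pi_\theta for illustration, not mandated by the slides.
% Likelihood-ratio (REINFORCE) gradient estimate from m sampled trajectories.
% traj{i}.s, traj{i}.u, traj{i}.r : states, actions, rewards of roll-out i (placeholders)
% phi(s) : feature column vector; policy pi_theta(u|s) ~ exp( theta(:,u)' * phi(s) )
g = zeros(size(theta));
for i = 1:m
  R = sum(traj{i}.r);                            % total reward of the roll-out
  for t = 1:numel(traj{i}.u)
    s = traj{i}.s(:, t);  u = traj{i}.u(t);
    p = exp(theta' * phi(s));  p = p / sum(p);   % action probabilities
    glog = -phi(s) * p';                         % grad_theta log pi(u|s), part 1
    glog(:, u) = glog(:, u) + phi(s);            % add the indicator term for the taken action
    g = g + glog * R;
  end
end
g = g / m;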
Page 18
Page 1
Policy gradient
Analytical:
- Deterministic, known dynamics: OK (taking derivatives can be time consuming and error-prone)
- Deterministic, unknown dynamics: N/A
- Stochastic, known dynamics: OK, but often computationally impractical
- Stochastic, unknown dynamics: N/A
Finite differences:
- Deterministic, known dynamics: OK (sometimes computationally more expensive than analytical)
- Deterministic, unknown dynamics: OK
- Stochastic, known dynamics: OK. With N = #roll-outs: naive O(N^{-1/4}), or O(N^{-2/5}); fixing the random seed gives O(N^{-1/2}) [1]
- Stochastic, unknown dynamics: same as known dynamics, but no fixing of the random seed
Likelihood ratio method:
- Deterministic, known dynamics: OK
- Deterministic, unknown dynamics: OK
- Stochastic, known dynamics: OK, O(N^{-1/2}) [1]
- Stochastic, unknown dynamics: OK, O(N^{-1/2}) [1]
[1] P. Glynn, "Likelihood ratio gradient estimation: an overview," in Proceedings of the 1987 Winter Simulation Conference, Atlanta, GA, 1987, pp. 366-375.
Assumption:
Stochasticity:
Page 2
Recap: likelihood ratio gradient
U(\theta) = \sum_\tau P(\tau; \theta) R(\tau)
\nabla_\theta U(\theta) = \sum_\tau \nabla_\theta P(\tau; \theta) R(\tau) = \sum_\tau P(\tau; \theta) \nabla_\theta \log P(\tau; \theta) R(\tau)
Approximate with the empirical estimate for m sample paths under policy \pi_\theta:
\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) R(\tau^{(i)})
\nabla_\theta \log P(\tau^{(i)}; \theta) = \nabla_\theta \log [ \prod_{t=0}^{H} P(s_{t+1}^{(i)} | s_t^{(i)}, u_t^{(i)}) \pi_\theta(u_t^{(i)} | s_t^{(i)}) ] = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(u_t^{(i)} | s_t^{(i)})
(the dynamics-model terms drop out; only the policy terms remain)
Unbiased:  E[\hat{g}] = \nabla_\theta U(\theta)
Page 4
Envelopes riddle
Page 5
Envelopes riddle
Envelopes riddle
MDP: a single decision from start state 0: pick envelope 1 or envelope 2 and receive its reward.
Policy:  \pi_\theta(1|0) = \frac{\exp(\theta)}{1+\exp(\theta)},   \pi_\theta(2|0) = \frac{1}{1+\exp(\theta)}
\nabla_\theta \log P(\tau = 1; \theta) R(\tau = 1) = \frac{1}{1+\exp(\theta)} R(1)
\nabla_\theta \log P(\tau = 2; \theta) R(\tau = 2) = -\frac{\exp(\theta)}{1+\exp(\theta)} R(2)
Page 6
Envelopes riddle
\nabla_\theta \log P(\tau = 1; \theta) R(\tau = 1) = \frac{1}{1+\exp(\theta)} R(1)
\nabla_\theta \log P(\tau = 2; \theta) R(\tau = 2) = -\frac{\exp(\theta)}{1+\exp(\theta)} R(2)
This gradient update is simply making the recently observed path more
likely; and how much more likely depends on the observed R for the
observed path
Rather than letting the update depend simply on R, if we had a baseline b which is
an estimate of the expected reward under the current policy, then we could
update scaled by (R - b) instead,
i.e. the baseline enables updating such that better-than-average paths
become more likely, and worse-than-average paths become less likely:
\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) ( R(\tau^{(i)}) - b )
Page 7
\sum_\tau P(\tau; \theta) = 1
=>  \frac{\partial}{\partial \theta_j} \sum_\tau P(\tau; \theta) = 0
=>  \sum_\tau \frac{\partial}{\partial \theta_j} P(\tau; \theta) = 0
=>  \sum_\tau P(\tau; \theta) \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) = 0
=>  E[ \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) ] = 0
=>  E[ \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) b_j ] = 0
Hence
\frac{\partial}{\partial \theta_j} U(\theta) = E[ \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) R(\tau) ] = E[ \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) ( R(\tau) - b_j ) ]
\hat{g}_j = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta_j} \log P(\tau^{(i)}; \theta) ( R(\tau^{(i)}) - b_j )
It is unbiased, i.e.:  E[\hat{g}_j] = \frac{\partial}{\partial \theta_j} U(\theta)
Choose b_j to minimize the variance of the estimator:
\min_{b_j} E[ ( \hat{g}_j - E[\hat{g}_j] )^2 ] = \min_{b_j} E[ \hat{g}_j^2 ] - ( E[\hat{g}_j] )^2
Since E[\hat{g}_j] = \frac{\partial}{\partial \theta_j} U(\theta) is independent of b_j, it suffices to minimize E[ \hat{g}_j^2 ].
Page 8
\min_{b_j} E[ \hat{g}_j^2 ]
= \min_{b_j} E[ ( \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) )^2 ( R(\tau) - b_j )^2 ]
= \min_{b_j} E[ ( \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) )^2 ( R(\tau)^2 + b_j^2 - 2 b_j R(\tau) ) ]
= \min_{b_j}  E[ ( \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) )^2 R(\tau)^2 ]   (independent of b_j)
            + b_j^2 E[ ( \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) )^2 ]
            - 2 b_j E[ ( \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) )^2 R(\tau) ]
Setting the derivative with respect to b_j to zero gives the optimal baseline:
b_j = \frac{ E[ ( \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) )^2 R(\tau) ] }{ E[ ( \frac{\partial}{\partial \theta_j} \log P(\tau; \theta) )^2 ] }
Writing out  \hat{g}_j = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta_j} \log P(\tau^{(i)}; \theta) ( R(\tau^{(i)}) - b_j )  gives
\hat{g}_j = \frac{1}{m} \sum_{i=1}^{m} ( \sum_{t=0}^{H-1} \frac{\partial}{\partial \theta_j} \log \pi_\theta(u_t^{(i)} | s_t^{(i)}) ) ( \sum_{t=0}^{H-1} R(s_t^{(i)}, u_t^{(i)}) - b_j )
Future actions do not depend on past rewards (assuming a fixed policy). This
can be formalized as
E[ \frac{\partial}{\partial \theta_j} \log \pi_\theta(u_t | s_t) R(s_k, u_k) ] = 0   for k < t.
Removing these terms with zero expected value from our gradient estimate we
obtain:
\hat{g}_j = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \frac{\partial}{\partial \theta_j} \log \pi_\theta(u_t^{(i)} | s_t^{(i)}) ( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b_j )
Page 9
Actor-Critic
Our gradient estimate:
\hat{g}_j = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \frac{\partial}{\partial \theta_j} \log \pi_\theta(u_t^{(i)} | s_t^{(i)}) ( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b_j )
The term \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) is a sample-based estimate of Q^\pi(s_t^{(i)}, u_t^{(i)}). If
we simultaneously run a temporal difference (TD) learning method to estimate
Q^\pi, then we could substitute its estimate \hat{Q}^\pi for the sample-based estimate:
\hat{g}_j = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \frac{\partial}{\partial \theta_j} \log \pi_\theta(u_t^{(i)} | s_t^{(i)}) ( \hat{Q}^\pi(s_t^{(i)}, u_t^{(i)}) - b_j )
Natural gradient
Page 10
Natural gradient
Page 11
ABSTRACT
Although neurons as computational elements are 7 orders of magnitude slower than their artificial
counterparts, the primate brain grossly outperforms robotic algorithms in all but the most
structured tasks. Parallelism alone is a poor explanation, and much recent functional modelling of
the central nervous system focuses on its modular, heavily feedback-based computational
architecture, the result of accumulation of subsystems throughout evolution. We discuss this
architecture from a global functionality point of view, and show why evolution is likely to favor
certain types of aggregate stability. We then study synchronization as a model of computations at
different scales in the brain, such as pattern matching, restoration, priming, temporal binding of
sensory data, and mirror neuron response. We derive a simple condition for a general dynamical
system to globally converge to a regime where diverse groups of fully synchronized elements
coexist, and show accordingly how patterns can be transiently selected and controlled by a very
small number of inputs or connections. We also quantify how synchronization mechanisms can
protect general nonlinear systems from noise. Applications to some classical questions in
robotics, control, and systems neuroscience are discussed.
The development makes extensive use of nonlinear contraction theory, a comparatively
recent analysis tool whose main features will be briefly reviewed.
Page 1
Dynamic gait:
Why hard?
Page 2
John E. Wilson. Walking toy. Technical report, United States Patent Office, October 15 1936.
Tad McGeer. Passive dynamic walking. International Journal of Robotics Research, 9(2):62.82,
April 1990.
Page 3
44 cm
Dynamics
q = f(q, q,
u, d(t))
F (x , x) = P (
xn+1 = x |
xn = x; )
Stochasticity due to
Sensor noise
Disturbances d(t)
Page 4
Goal: stabilize the limit cycle trajectory that the passive robot follows when walking
down the ramp, making it invariant to slope.
Reward function:
R(x(n)) = -\frac{1}{2} || x(n) - x^* ||_2^2
x* is taken from the gait of the walker down a slope of 0.03 radians
Action space:
At the beginning of each step cycle (=when a foot touches down) we choose an
action in the discrete time RL formulation
Our action choice is a feedback control policy to be deployed during the step, in
this particular example it is a column vector w
Choosing this action means that throughout the following step cycle, the following
continous-time feedback controls will be exerted:
u(t) = \sum_i w_i \phi_i(\bar{x}(t)) = w^T \phi(\bar{x}(t))
Goal: find the (constant) action choice w which maximizes expected sum of rewards
Policy class
w \sim N(\mu, \sigma^2 I)
Which gives us:
\pi_\mu(w | x) = \frac{1}{ (2\pi)^{d/2} \sigma^d } \exp( -\frac{1}{2\sigma^2} (w - \mu)^T (w - \mu) )
Page 5
Policy update
Likelihood ratio based gradient estimate from a single trace of H footsteps:
H1
H1
g =
log (w(n)|
x(n))
R(
x(k)) b
We have:
n=0
k=n
1
(w )
2 2
Rather than waiting till horizon H is reached, we can perform the updates
online as follows: (here is a step-size parameter, b(n) is the amount of baseline
we allocate to time nsee next slide)
log (w|
x) =
e(n)
(n + 1)
1
(w(n) (n))
2 2
= (n) + e(n)(R(
x(n)) b(n))
= e(n 1) +
1
(w(n) (n))
2 2
Page 6
(n + 1)
b(n)
1
(w (n))
2 2
(n) + e(n)(R(
x(n)) b(n))
V (
x(n)) V (
x(n + 1))
e(n 1) +
TD(0) updates:
(n)
v(n + 1)
= R(
x(n)) + V (
x(n + 1)) V (
x(n))
= v(n) + v (n)(
x(n))
Page 7
Return maps
after learning
before learning
[Note: this is a projection from 2x9-1 dim to 1dim]
Page 8
Toddler movie
Page 9
Pieter Abbeel
UC Berkeley EECS
Natural gradient
Page 1
Natural gradient
(1)
Page 2
(2)
Newton's direction
Newton's method approximates the function f(\theta) by a quadratic function
through a Taylor expansion around the current point \theta^{(k)}:
f(\theta) \approx f(\theta^{(k)}) + \nabla f(\theta^{(k)})^T (\theta - \theta^{(k)}) + \frac{1}{2} (\theta - \theta^{(k)})^T H(\theta^{(k)}) (\theta - \theta^{(k)})
Here H_{ij}(\theta^{(k)}) = \frac{\partial^2 f}{\partial \theta_i \partial \theta_j}(\theta^{(k)}) is the Hessian at \theta^{(k)}.
The local optimum of the 2nd order approximation is found by setting its
gradient equal to zero, which gives:
Newton step direction = (\theta - \theta^{(k)}) = -H^{-1}(\theta^{(k)}) \nabla f(\theta^{(k)})
Under a linear reparameterization \theta = A \theta', the Hessian becomes A^T H A (1) and the gradient becomes A^T \nabla f (2), so the Newton direction is invariant to the parameterization while the plain gradient is not.
Page 3
Natural gradient
Natural gradient
Page 4
Natural gradient
\arg\max_{\delta\theta : ||\delta\theta||_2 \leq \epsilon} f(\theta + \delta\theta) \approx \arg\max_{\delta\theta : ||\delta\theta||_2 \leq \epsilon} f(\theta) + \nabla f(\theta)^T \delta\theta
= \epsilon \frac{ \nabla f(\theta) }{ || \nabla f(\theta) ||_2 }
KL( P(\tau; \theta_1) || P(\tau; \theta_2) ) = \sum_\tau P(\tau; \theta_1) \log \frac{ P(\tau; \theta_1) }{ P(\tau; \theta_2) }
Example: a coin with Prob(heads) = \frac{\exp(\theta)}{1+\exp(\theta)}. With p = \frac{\exp(\theta_1)}{1+\exp(\theta_1)} and q = \frac{\exp(\theta_2)}{1+\exp(\theta_2)}:
KL = p \log \frac{p}{q} + (1 - p) \log \frac{1-p}{1-q}
Page 5
and Prob(heads) = \frac{\exp(\theta)}{1+\exp(\theta)},  so  p = \frac{\exp(\theta_1)}{1+\exp(\theta_1)},  q = \frac{\exp(\theta_2)}{1+\exp(\theta_2)}.
Second-order expansion of the KL divergence:
KL( P(.; \theta) || P(.; \theta + \delta\theta) ) = \sum_x P(x; \theta) [ \log P(x; \theta) - \log P(x; \theta + \delta\theta) ]
                                             \approx \frac{1}{2} \delta\theta^T G(\theta) \delta\theta,
where  G(\theta) = \sum_x P(x; \theta) \nabla_\theta \log P(x; \theta) \nabla_\theta \log P(x; \theta)^T  is the Fisher information matrix.
Derivation sketch: expand \log P(x; \theta + \delta\theta) to second order in \delta\theta. The zeroth-order terms cancel; the first-order term vanishes because \sum_x P(x; \theta) \nabla_\theta \log P(x; \theta) = \nabla_\theta \sum_x P(x; \theta) = 0; and in the second-order term, \sum_x P(x;\theta) \nabla_\theta^2 \log P(x;\theta) = \sum_x \nabla_\theta^2 P(x;\theta) - \sum_x P(x;\theta) \nabla_\theta \log P(x;\theta) \nabla_\theta \log P(x;\theta)^T = 0 - G(\theta), which leaves the \frac{1}{2} \delta\theta^T G(\theta) \delta\theta approximation.
Page 6
Natural gradient gN
Page 7
Natural gradient gN
g_N \propto \arg\max_{\delta\theta : \frac{1}{2}\delta\theta^T G(\theta)\delta\theta \leq \epsilon} f(\theta + \delta\theta)
    \approx \arg\max_{\delta\theta : \frac{1}{2}\delta\theta^T G(\theta)\delta\theta \leq \epsilon} f(\theta) + \nabla f(\theta)^T \delta\theta
    \propto G(\theta)^{-1} \nabla f(\theta)
Page 8
Rather than following the gradient, which depends on the choice of parameterization for the set of probability distributions that we are searching over,
follow the natural gradient g_N:
g_N = G(\theta)^{-1} \nabla_\theta f(P_\theta)
Here G(\theta) is the Fisher information matrix, which can be computed as follows:
G(\theta) = \sum_{x \in X} P_\theta(x) \nabla_\theta \log P_\theta(x) \nabla_\theta \log P_\theta(x)^T
For policy search, with U(\theta) = \sum_\tau P(\tau; \theta) R(\tau), the natural gradient is  g_N = G(\theta)^{-1} \nabla_\theta U(\theta).
Both the Fisher information matrix G and the gradient need to be estimated
from samples. We have seen many ways to estimate the gradient from samples.
It remains to show how to estimate G:
G(\theta) = \sum_\tau P(\tau; \theta) \nabla_\theta \log P(\tau; \theta) \nabla_\theta \log P(\tau; \theta)^T
         \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \nabla_\theta \log P(\tau^{(i)}; \theta)^T
As we have seen earlier, we can compute \nabla_\theta \log P(\tau^{(i)}; \theta) even
without access to the dynamics model:
\nabla_\theta \log P(\tau^{(i)}; \theta) = \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} | s_t^{(i)})
Page 9
Example
Page 10
Announcements
Format: pdf
Topic: RL
Start early!
Animals use a combination of sensory modalities to control their movement including visual,
mechanosensory and chemosensory information. Mechanosensory systems that can detect
inertial forces are capable of responding much more rapidly than visual systems and, as such,
are thought to play a critical role in rapid course correction during flight. This talk focusses on
two gryoscopic organs: halteres of flies and antennae of moths. Both have mechanical and
neural components play critical roles in encoding relatively tiny Coriolis forces associated with
body rotations, both of which will be reviewed along with new data that suggests each have
complex circuits that connect visual systems to mechanosensory systems. But, insects are
bristling with mechanosensory structures, including the wings themselves. It is not clear whether
these too could serve an IMU function in addition to their obvious aerodynamic roles.
Page 11
ABSTRACT
Although neurons as computational elements are 7 orders of magnitude slower than their artificial
counterparts, the primate brain grossly outperforms robotic algorithms in all but the most
structured tasks. Parallelism alone is a poor explanation, and much recent functional modelling of
the central nervous system focuses on its modular, heavily feedback-based computational
architecture, the result of accumulation of subsystems throughout evolution. We discuss this
architecture from a global functionality point of view, and show why evolution is likely to favor
certain types of aggregate stability. We then study synchronization as a model of computations at
different scales in the brain, such as pattern matching, restoration, priming, temporal binding of
sensory data, and mirror neuron response. We derive a simple condition for a general dynamical
system to globally converge to a regime where diverse groups of fully synchronized elements
coexist, and show accordingly how patterns can be transiently selected and controlled by a very
small number of inputs or connections. We also quantify how synchronization mechanisms can
protect general nonlinear systems from noise. Applications to some classical questions in
robotics, control, and systems neuroscience are discussed.
The development makes extensive use of nonlinear contraction theory, a comparatively
recent analysis tool whose main features will be briefly reviewed.
Page 12
Approximate LP
Exact Bellman LP
min_V  \sum_s c(s) V(s)
s.t.  V(s) >= \sum_{s'} T(s, a, s') ( R(s, a, s') + \gamma V(s') )   \forall s, a
Approximate LP
min_\theta  \sum_{s \in S} c(s) \phi(s)^T\theta
s.t.  \phi(s)^T\theta >= \sum_{s'} T(s, a, s') ( R(s, a, s') + \gamma \phi(s')^T\theta )   \forall s \in S, a
Approximate LP guarantees
min_\theta  \sum_{s \in S} c(s) \phi(s)^T\theta
s.t.  \phi(s)^T\theta >= \sum_{s'} T(s, a, s') ( R(s, a, s') + \gamma \phi(s')^T\theta )   \forall s \in S, a
Theorem. [de Farias and Van Roy] If one of the basis functions satisfies
\phi_i(s) = 1 for all s \in S, then the LP has a feasible solution and the optimal
solution \tilde{\theta} satisfies:
|| \Phi\tilde{\theta} - V^* ||_{1,c} <= \frac{2}{1-\gamma} \min_\theta || \Phi\theta - V^* ||_\infty
Constraint sampling
Consider a convex optimization problem with a very large number of constraints:
min c x
s.t. gi (x) b i = 1, 2, . . . , m
where x n , gi convex, and m >> n.
We obtain the sampled approximation by sampling the sequence {i1 , i2 , . . . , iN }
IID according to some measure over the constraints . This gives us:
min
s.t.
c x
gj (x) b j = i1 , i2 , . . . , iN
Let x
N be the optimal solution to the sampled convex problem.
This result can be leveraged to show that the solution to the sampled approximate LP is close to V with high probablity. (de Farias and Van Roy,
2001)
Reward shaping
Page 14
Shaped reward = R + F,  with potential-based shaping  F(s, a, s') = \gamma \Phi(s') - \Phi(s)
Intuition of proof
In the new MDP, for a trace s0, a0, s1, a1, ..., we obtain:
   R(s0, a0, s1) + \gamma \Phi(s1) - \Phi(s0)
 + \gamma ( R(s1, a1, s2) + \gamma \Phi(s2) - \Phi(s1) )
 + \gamma^2 ( R(s2, a2, s3) + \gamma \Phi(s3) - \Phi(s2) )
 + ...
 = -\Phi(s0) + R(s0, a0, s1) + \gamma R(s1, a1, s2) + \gamma^2 R(s2, a2, s3) + ...
For any policy we have (M: original MDP, M': MDP w/ shaped reward):
V_{M'}(s0) = V_M(s0) - \Phi(s0)
Page 15
A good potential?
Let \Phi = V*. If we initialize V = 0, a single Bellman back-up in the shaped MDP gives:
V(s) <- \max_a \sum_{s'} T(s, a, s') ( R(s, a, s') + \gamma \Phi(s') - \Phi(s) + \gamma V(s') )
      = -V^*(s) + \max_a \sum_{s'} T(s, a, s') ( R(s, a, s') + \gamma V^*(s') )
      = -V^*(s) + V^*(s)
      = 0
Example
10x10 grid world, 1 goal state = absorbing, other states R=-1;
Prob(action successful) = 80%
Shaping function: 0(s) = - (manhattan distance to goal) / 0.8
Plot shows performance of Sarsa(0) with epsilon=0.1 greedy, learning rate = .2
(a) no shaping vs. (b) = 0.5 * 0 vs. (c) = 0
Page 16
Partial observability
Main ideas:
Page 17
Can we
Repeat forever
Based on all data seen so far, partition the state space is a set of
known states and a set of unknown states. [A state is known when
each action in that state has been observed sufficiently often.]
If currently in a known state:
Lump all unknown states together in one absorbing meta-state. Give the
meta-state a reward of 1. Give all known states a reward of zero. Find the
optimal policy in this new MDP.
If the optimal policy has a value of zero (or low enough): exit. [No need
for exploration anymore.]
Otherwise: Execute the optimal action for the current state.
Take the action that has been taken least often in this state.
Page 18
Hierarchical RL
Page 19
RL summary
Function approximation:
Imitation learning:
Page 20
Pieter Abbeel
UC Berkeley EECS
Overview
Thus far:
Page 1
Examples
Helicopter
Sensors:
Sensors:
Probability review
For any random variables X, Y we have:
Marginalization:
P(X=x) = \sum_y P(X=x, Y=y)
Note: no assumptions beyond X, Y being random variables are made for any of these to
hold true (and when we divide by something, that something is not zero)
Page 2
Probability review
For any random variables X, Y, Z, W we have:
Chain rule:
P(X=x, Y=y, Z=z, W=w) = P(X=x) P(Y=y | X=x ) P(Z=z | X=x, Y=y) P(W=w| X=x, Y=y, Z=z)
Marginalization:
P(X=x | W=w) = \sum_{y,z} P(X=x, Y=y, Z=z | W=w)
Note: no assumptions beyond X, Y, Z, W being random variables are made for any of these to
hold true (and when we divide by something, that something is not zero)
Independence
Page 3
Conditional independence
Chain rule (which holds true for all distributions, no assumptions needed):
P(X=x,Y=y,Z=z,W=w) = P(X=x)P(Y=y|X=x)P(Z=z|X=x,Y=y)P(W=w|X=x,Y=y,Z=z)
For binary variables the representation requires 1 + 2*1 + 4*1 + 8*1 = 24-1 numbers
(just like a full joint probability table)
P(X=x,Y=y,Z=z,W=w) = P(X=x)P(Y=y|X=x)P(Z=z|Y=y)P(W=w|Z=z)
For binary variables the representation requires 1 + 2*1 + 2*1 + 2*1 = 1+(4-1)*2
numbers --- significantly less!!
Markov Models
Models a distribution over a set of random variables X1, X2, , XT where the
index is typically associated with some notion of time.
Markov models make the assumption:  P(Xt = xt | Xt-1 = xt-1, Xt-2 = xt-2, ..., X1 = x1) = P(Xt = xt | Xt-1 = xt-1)
Hence:  P(X1 = x1, X2 = x2, ..., XT = xT) = \prod_t P(Xt = xt | Xt-1 = xt-1)    (1)
X1
X2
X3
X4
Page 4
XT
X1
X2
X3
X4
XT
Z1
Z2
Z3
Z4
ZT
X2
X3
X4
XT
Z1
Z2
Z3
Z4
ZT
HMM assumptions: the chain-rule factors simplify as follows:
P(X1 = x1)                        ->  P(X1 = x1)
P(Z1 = z1 | X1 = x1)              ->  P(Z1 = z1 | X1 = x1)
P(X2 = x2 | X1 = x1, Z1 = z1)     ->  P(X2 = x2 | X1 = x1)
P(Z2 = z2 | X2 = x2, ...)         ->  P(Z2 = z2 | X2 = x2)
...                               ->  P(ZT = zT | XT = xT)
Page 5
Mini quiz
Example
Page 6
Robot localization:
Filtering / Monitoring
The Kalman filter was invented in the 60s and first implemented as
a method of trajectory estimation for the Apollo program. [See
course website for a historical account on the Kalman filter. From
Gauss to Kalman]
Page 7
Prob
t=0
Prob
t=1
Page 8
Prob
t=2
Prob
t=3
Page 9
Prob
t=4
Prob
t=5
Page 10
X1
Time update
X1
X2
Z1
Time update
Page 11
Xt
Xt+1
Observation update
Xt+1
Assume we have:
Et+1
Then:
Algorithm
For t = 1, 2,
Time update
Observation update
Page 12
Example HMM
Transition model:   P(R_t = true | R_{t-1} = true) = 0.7,   P(R_t = true | R_{t-1} = false) = 0.3
Observation model:  P(U_t = true | R_t = true) = 0.9,       P(U_t = true | R_t = false) = 0.2
recursive update
Normalization:
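A sketch of the recursive update and normalization for this two-state example; the umbrella observation sequence and the uniform prior below are made up for illustration.
% Forward algorithm (filtering) for the rain/umbrella HMM.
T_R = [0.7 0.3; 0.3 0.7];     % T_R(i,j) = P(R_t = j | R_{t-1} = i); states: 1 = rain, 2 = no rain
pU  = [0.9; 0.2];             % P(U_t = true | R_t), per state
obs = [1 1 0 1];              % example umbrella observations (1 = umbrella seen), assumed
bel = [0.5; 0.5];             % prior over R_0 (assumed uniform)
for t = 1:numel(obs)
  bel = T_R' * bel;                                % time update
  if obs(t), lik = pU; else, lik = 1 - pU; end     % observation likelihood
  bel = lik .* bel;
  bel = bel / sum(bel);                            % normalization
end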
Page 13
The forward algorithm first sums over x1, then over x2 and so forth,
which allows it to efficiently compute the likelihood at all times t, indeed:
Relevance:
U2
U3
U4
X1
X2
X3
X4
X5
Z1
Z2
Z3
Z4
E5
Page 14
U2
U3
U4
X1
X2
X3
X4
X5
Z1
Z2
Z3
Z4
E5
Smoothing
The distribution over states at time t given all evidence until time t:
How about?
T < t : can simply run the forward algorithm until time t, but stop
incorporating evidence from time T+1 onwards
T > t : need something else
Page 15
Smoothing
Smoothing
Can also be read off from the Bayes net graph / conditional
independence assumptions:
Page 16
Backward algorithm
fl1 (xl1 )
xl
Smoother algorithm
Find
Page 17
Bayes filters
Recursively compute
Tractable cases:
Univariate Gaussian
PX(x)
Page 18
Properties of Gaussians
Mean:
Variance:
Classical CLT:
Page 19
Multi-variate Gaussians
[Figures: densities of 2-D Gaussians for various parameters, e.g. μ = [1; 0], [−.5; 0], [−1; −1.5], [0; 0] with Σ = I, and μ = [0; 0] with Σ = .6 I, 2 I, [1 .5; .5 1], [1 .8; .8 1], [1 −.5; −.5 1], [1 −.8; −.8 1], [3 .8; .8 1].]
x_t = A_t x_{t−1} + B_t u_t + ε_t
with a measurement
z_t = C_t x_t + δ_t
Page 23
Initial belief: bel(x_0) = N(x_0; μ_0, Σ_0)
Dynamics: x_t = A_t x_{t−1} + B_t u_t + ε_t,  i.e.  p(x_t | u_t, x_{t−1}) = N(x_t; A_t x_{t−1} + B_t u_t, R_t)
Time update (prediction; bel⁻ denotes the predicted, pre-measurement belief):
 bel⁻(x_t) = ∫ p(x_t | u_t, x_{t−1}) bel(x_{t−1}) dx_{t−1}
           = ∫ N(x_t; A_t x_{t−1} + B_t u_t, R_t) · N(x_{t−1}; μ_{t−1}, Σ_{t−1}) dx_{t−1}
Writing each Gaussian factor in the generic form (2π)^{−d/2} |S|^{−1/2} exp( −(1/2)(x − m)^T S^{−1} (x − m) ) and carrying out the integral f(μ_{t−1}, Σ_{t−1}, x_t) gives
 bel⁻(x_t) = N(x_t; μ̄_t, Σ̄_t)  with  μ̄_t = A_t μ_{t−1} + B_t u_t  and  Σ̄_t = A_t Σ_{t−1} A_t^T + R_t
Page 25
Properties of Gaussians
We just showed: if X ~ N(μ, Σ) and Y = A X + B, then Y ~ N(A μ + B, A Σ A^T).
Self-quiz
Page 26
Measurement: z_t = C_t x_t + δ_t,  i.e.  p(z_t | x_t) = N(z_t; C_t x_t, Q_t)
Observation update (correction):
 bel(x_t) ∝ p(z_t | x_t) · bel⁻(x_t) = N(z_t; C_t x_t, Q_t) · N(x_t; μ̄_t, Σ̄_t)
Writing each factor in the generic Gaussian form (2π)^{−d/2} |S|^{−1/2} exp( −(1/2)(x_t − m)^T S^{−1} (x_t − m) ) and completing the square gives
 bel(x_t) = N(x_t; μ_t, Σ_t)  with  μ_t = μ̄_t + K_t (z_t − C_t μ̄_t)  and  Σ_t = (I − K_t C_t) Σ̄_t
Correction:
 K_t = Σ̄_t C_t^T (C_t Σ̄_t C_t^T + Q_t)^{−1}
 μ_t = μ̄_t + K_t (z_t − C_t μ̄_t)
 Σ_t = (I − K_t C_t) Σ̄_t
 Return μ_t, Σ_t
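A compact Matlab sketch of one full Kalman filter step (prediction followed by correction); the function name and argument order are illustrative assumptions.
% One Kalman filter step for x_t = A x_{t-1} + B u_t + eps_t, z_t = C x_t + delta_t,
% with process covariance R and measurement covariance Q.
function [mu, Sigma] = kalman_filter_step(mu, Sigma, u, z, A, B, C, R, Q)
  % time update (prediction)
  mu_bar    = A * mu + B * u;
  Sigma_bar = A * Sigma * A' + R;
  % observation update (correction)
  K     = Sigma_bar * C' / (C * Sigma_bar * C' + Q);   % Kalman gain
  mu    = mu_bar + K * (z - C * mu_bar);
  Sigma = (eye(length(mu_bar)) - K * C) * Sigma_bar;
end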
Page 28
Page 29
Pieter Abbeel
UC Berkeley EECS
Overview
Page 1
[Figure: HMM with hidden states X1, …, XT and observations Z1, …, ZT (recap).]
Filtering in HMM
For t = 1, 2, …
Time update
Observation update
Page 2
[Figure: Bayes net with controls U2–U4, states X1–X5, observations Z1–Z4, and node E5 (recap).]
Linear dynamical system (recap):
 x_t = A_t x_{t−1} + B_t u_t + ε_t
with a measurement
 z_t = C_t x_t + δ_t
Page 3
Correction:
 K_t = Σ̄_t C_t^T (C_t Σ̄_t C_t^T + Q_t)^{−1}
 μ_t = μ̄_t + K_t (z_t − C_t μ̄_t)
 Σ_t = (I − K_t C_t) Σ̄_t
 Return μ_t, Σ_t
Why interesting?
Page 4
% Simulate the linear dynamical system x_{t+1} = A x_t + w_t, y_t = C x_t + v_t.
A = [0.99 0.0074; -0.0136 0.99];
C = [1 1; -1 1];
T = 100;                      % number of time steps (assumed; not shown on the slide)
w = 0.1 * randn(2, T);        % process noise samples (assumed magnitude)
v = 0.1 * randn(2, T);        % measurement noise samples (assumed magnitude)
x = zeros(2, T); y = zeros(2, T);
x(:,1) = [-3; 2];
for t = 1:T-1
  x(:,t+1) = A * x(:,t) + w(:,t);
  y(:,t)   = C * x(:,t) + v(:,t);
end
% + plot
Page 5
rank( [C; C A; C A^2; …; C A^{n−1}] ) = n,  where n = dim(x)    (1)
Intuition: if there is no noise, we observe y0, y1, …, and the unknown
initial state x0 satisfies:
 y0 = C x0
 y1 = C A x0
 ...
 yK = C A^K x0
This system of equations has a unique solution x0 iff the matrix [C; CA; …; CA^K]
has full column rank. Because any power of a matrix higher than n can be written in
terms of lower powers of the same matrix, condition (1) is sufficient to check
(i.e., the column rank will not grow anymore after having reached K = n−1).
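A small Matlab check of condition (1) for the simulated system above (the variable name Obs and the loop are an illustrative sketch):
% Build the observability matrix [C; C*A; ...; C*A^(n-1)] and check its rank.
n = size(A, 1);
Obs = [];
for k = 0:n-1
  Obs = [Obs; C * A^k];
end
if rank(Obs) == n
  disp('observable: x0 can be recovered from noise-free outputs');
else
  disp('not observable');
end
% (With the Control System Toolbox, rank(obsv(A, C)) == n is equivalent.)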
Simple self-quiz
Page 6
Announcement: PS2
Page 7
Announcements
PS1: will get back to you over the weekend, likely Saturday
Milestone: ditto
x_t = g(u_t, x_{t−1}) + noise
z_t = h(x_t) + noise
Page 8
Non-linear Function
Throughout: the "Gaussian approximation of P(y)" is the Gaussian that minimizes the KL divergence from P(y).
It turns out that this is the Gaussian with the same mean and covariance as P(y).
Page 9
Page 10
Prediction (linearize the dynamics model around μ_{t−1}):
 g(u_t, x_{t−1}) ≈ g(u_t, μ_{t−1}) + ∂g(u_t, μ_{t−1})/∂x_{t−1} · (x_{t−1} − μ_{t−1})
 g(u_t, x_{t−1}) ≈ g(u_t, μ_{t−1}) + G_t (x_{t−1} − μ_{t−1})
Correction (linearize the measurement model around μ̄_t):
 h(x_t) ≈ h(μ̄_t) + ∂h(μ̄_t)/∂x_t · (x_t − μ̄_t)
 h(x_t) ≈ h(μ̄_t) + H_t (x_t − μ̄_t)
Page 11
EKF Algorithm
Extended_Kalman_filter(μ_{t−1}, Σ_{t−1}, u_t, z_t):
 Prediction:
  μ̄_t = g(u_t, μ_{t−1})                          (KF: μ̄_t = A_t μ_{t−1} + B_t u_t)
  Σ̄_t = G_t Σ_{t−1} G_t^T + R_t                   (KF: Σ̄_t = A_t Σ_{t−1} A_t^T + R_t)
 Correction:
  K_t = Σ̄_t H_t^T (H_t Σ̄_t H_t^T + Q_t)^{−1}      (KF: K_t = Σ̄_t C_t^T (C_t Σ̄_t C_t^T + Q_t)^{−1})
  μ_t = μ̄_t + K_t (z_t − h(μ̄_t))                 (KF: μ_t = μ̄_t + K_t (z_t − C_t μ̄_t))
  Σ_t = (I − K_t H_t) Σ̄_t                         (KF: Σ_t = (I − K_t C_t) Σ̄_t)
 Return μ_t, Σ_t
 where G_t = ∂g(u_t, μ_{t−1})/∂x_{t−1} and H_t = ∂h(μ̄_t)/∂x_t
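A Matlab sketch of one EKF step; g and h are function handles for the models and G_fun, H_fun return their Jacobians (all names are illustrative assumptions).
% One EKF step; the Jacobians are supplied by the caller (analytically or numerically).
function [mu, Sigma] = ekf_step(mu, Sigma, u, z, g, h, G_fun, H_fun, R, Q)
  % prediction
  G         = G_fun(u, mu);            % G_t, evaluated at the previous mean
  mu_bar    = g(u, mu);
  Sigma_bar = G * Sigma * G' + R;
  % correction
  H     = H_fun(mu_bar);               % H_t, evaluated at the predicted mean
  K     = Sigma_bar * H' / (H * Sigma_bar * H' + Q);
  mu    = mu_bar + K * (z - h(mu_bar));
  Sigma = (eye(length(mu_bar)) - K * H) * Sigma_bar;
end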
Perception
Page 12
[Figure: robot R and landmarks A–H.]
State: (nR, eR, θR, nA, eA, nB, eB, nC, eC, nD, eD, nE, eE, nF, eF, nG, eG, nH, eH)
Transition model:
Page 13
h(nR, eR, θR, nk, ek) = [xk; yk] = R(θR) ( [nk; ek] − [nR; eR] ), i.e., landmark k expressed in the robot frame.
[courtesy of J. Leonard]
Page 14
[Figure: raw odometry vs. estimated trajectory; courtesy of John Leonard.]
Defining landmarks
Closing the loop problem: how to know you are closing a loop?
Keep track of the likelihood score of each EKF and discard the ones
with low likelihood score
Page 15
EKF Summary
[Figures: EKF vs. UKF approximations of a Gaussian propagated through a nonlinear function.]
Page 17
Y = f(X)
Assume X ~ N(μ, Σ).
Then: approximate the mean and covariance of Y by propagating a set of sigma points through f and computing their weighted sample mean and covariance (unscented transform).
Self-quiz
Page 19
Dynamics update:
 1. X_{t−1} = (μ_{t−1}, μ_{t−1} ± γ·sqrt(Σ_{t−1}))      (sigma points of the current belief)
 2. X*_t = g(u_t, X_{t−1})
 3. μ̄_t = ∑_{i=0}^{2n} w_m^[i] X*_t^[i]
 4. Σ̄_t = ∑_{i=0}^{2n} w_c^[i] (X*_t^[i] − μ̄_t)(X*_t^[i] − μ̄_t)^T + R_t
Observation update:
 5. X̄_t = (μ̄_t, μ̄_t ± γ·sqrt(Σ̄_t))                     (redrawn sigma points)
 6. Z̄_t = h(X̄_t)
 7. ẑ_t = ∑_{i=0}^{2n} w_m^[i] Z̄_t^[i]
 8. S_t = ∑_{i=0}^{2n} w_c^[i] (Z̄_t^[i] − ẑ_t)(Z̄_t^[i] − ẑ_t)^T + Q_t
 9. Σ̄_t^{x,z} = ∑_{i=0}^{2n} w_c^[i] (X̄_t^[i] − μ̄_t)(Z̄_t^[i] − ẑ_t)^T
 10. K_t = Σ̄_t^{x,z} S_t^{−1}
 11. μ_t = μ̄_t + K_t (z_t − ẑ_t)
 12. Σ_t = Σ̄_t − K_t S_t K_t^T
 13. return μ_t, Σ_t
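A Matlab sketch of one UKF step following these steps; the simple sigma-point weighting (λ = 1, equal mean and covariance weights) is an assumption, and other weightings such as the scaled unscented transform are common.
% One UKF step. mu is a column vector; g(u, x) and h(x) are handles returning column vectors.
function [mu, Sigma] = ukf_step(mu, Sigma, u, z, g, h, R, Q)
  n = numel(mu); lambda = 1;
  w = [lambda, repmat(0.5, 1, 2*n)] / (n + lambda);   % weights (sum to 1)
  % dynamics update (steps 1-4)
  X  = sigma_points(mu, Sigma, lambda);
  Xs = zeros(n, 2*n+1);
  for i = 1:2*n+1, Xs(:,i) = g(u, X(:,i)); end
  mu_bar = Xs * w';
  Sigma_bar = R;
  for i = 1:2*n+1
    d = Xs(:,i) - mu_bar;
    Sigma_bar = Sigma_bar + w(i) * (d * d');
  end
  % observation update (steps 5-12)
  Xb = sigma_points(mu_bar, Sigma_bar, lambda);
  m  = numel(z);
  Zb = zeros(m, 2*n+1);
  for i = 1:2*n+1, Zb(:,i) = h(Xb(:,i)); end
  z_hat = Zb * w';
  S = Q; Sxz = zeros(n, m);
  for i = 1:2*n+1
    dz = Zb(:,i) - z_hat; dx = Xb(:,i) - mu_bar;
    S   = S   + w(i) * (dz * dz');
    Sxz = Sxz + w(i) * (dx * dz');
  end
  K     = Sxz / S;
  mu    = mu_bar + K * (z - z_hat);
  Sigma = Sigma_bar - K * S * K';
end
% Sigma points: mu and mu +/- the columns of a matrix square root of (n+lambda)*Sigma.
function X = sigma_points(mu, Sigma, lambda)
  n = numel(mu);
  A = chol((n + lambda) * Sigma)';    % lower-triangular square root
  X = [mu, repmat(mu, 1, n) + A, repmat(mu, 1, n) - A];
end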
UKF Summary
Page 21
Announcements
Pieter Abbeel
UC Berkeley EECS
Page 1
[Figure: HMM with hidden states X1, …, XT and observations Z1, …, ZT (recap).]
Filtering in HMM
For t = 1, 2, …
 Time update
 Observation update
Page 2
Kalman filter (recap):
 x_t = A_t x_{t−1} + B_t u_t + ε_t
 with a measurement
 z_t = C_t x_t + δ_t
Correction:
 K_t = Σ̄_t C_t^T (C_t Σ̄_t C_t^T + Q_t)^{−1}
 μ_t = μ̄_t + K_t (z_t − C_t μ̄_t)
 Σ_t = (I − K_t C_t) Σ̄_t
 Return μ_t, Σ_t
Page 3
Nonlinear systems
 x_t = g(u_t, x_{t−1}) + noise
 z_t = h(x_t) + noise
Page 4
Prediction (recap): linearize the dynamics around μ_{t−1}:
 g(u_t, x_{t−1}) ≈ g(u_t, μ_{t−1}) + G_t (x_{t−1} − μ_{t−1}),  G_t = ∂g(u_t, μ_{t−1})/∂x_{t−1}
Correction (recap): linearize the measurement model around μ̄_t:
 h(x_t) ≈ h(μ̄_t) + H_t (x_t − μ̄_t),  H_t = ∂h(μ̄_t)/∂x_t
EKF Algorithm (recap)
Extended_Kalman_filter(μ_{t−1}, Σ_{t−1}, u_t, z_t):
 Prediction:  μ̄_t = g(u_t, μ_{t−1}),   Σ̄_t = G_t Σ_{t−1} G_t^T + R_t
 Correction:  K_t = Σ̄_t H_t^T (H_t Σ̄_t H_t^T + Q_t)^{−1},   μ_t = μ̄_t + K_t (z_t − h(μ̄_t)),   Σ_t = (I − K_t H_t) Σ̄_t
 Return μ_t, Σ_t
 with H_t = ∂h(μ̄_t)/∂x_t
Page 5
and G_t = ∂g(u_t, μ_{t−1})/∂x_{t−1}
[Figure: EKF vs. UKF comparison (recap).]
Page 6
Unscented transform (recap): Y = f(X); assume X ~ N(μ, Σ); then approximate the mean and covariance of Y from sigma points propagated through f.
Page 7
Self-quiz
Page 8
Dynamics update (recap):
 1. X_{t−1} = (μ_{t−1}, μ_{t−1} ± γ·sqrt(Σ_{t−1}))      (sigma points of the current belief)
 2. X*_t = g(u_t, X_{t−1})
 3. μ̄_t = ∑_{i=0}^{2n} w_m^[i] X*_t^[i]
 4. Σ̄_t = ∑_{i=0}^{2n} w_c^[i] (X*_t^[i] − μ̄_t)(X*_t^[i] − μ̄_t)^T + R_t
Observation update:
 5. X̄_t = (μ̄_t, μ̄_t ± γ·sqrt(Σ̄_t))                     (redrawn sigma points)
 6. Z̄_t = h(X̄_t)
 7. ẑ_t = ∑_{i=0}^{2n} w_m^[i] Z̄_t^[i]
 8. S_t = ∑_{i=0}^{2n} w_c^[i] (Z̄_t^[i] − ẑ_t)(Z̄_t^[i] − ẑ_t)^T + Q_t
 9. Σ̄_t^{x,z} = ∑_{i=0}^{2n} w_c^[i] (X̄_t^[i] − μ̄_t)(Z̄_t^[i] − ẑ_t)^T
 10. K_t = Σ̄_t^{x,z} S_t^{−1}
 11. μ_t = μ̄_t + K_t (z_t − ẑ_t)
 12. Σ_t = Σ̄_t − K_t S_t K_t^T
 13. return μ_t, Σ_t
Page 9
UKF Summary
Basic principle
Page 10
FastSLAM
Page 11
Page 12
Observation update
Importance sampling
Interested in estimating:
 E_{X~P}[f(X)] = ∫ f(x) P(x) dx
              = ∫ f(x) P(x) (Q(x) / Q(x)) dx        (valid if Q(x) = 0 implies P(x) = 0)
              = ∫ f(x) (P(x) / Q(x)) Q(x) dx
              = E_{X~Q}[ (P(X) / Q(X)) f(X) ]
              ≈ (1/m) ∑_{i=1}^m (P(x^(i)) / Q(x^(i))) f(x^(i)),   with the x^(i) sampled from Q
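A tiny Matlab illustration of this estimator on a toy problem (the target, proposal, and function below are made-up choices, not from the slides):
% Importance sampling estimate of E_{X~P}[X^2] with P = N(0,1), proposal Q = N(0,4).
% The true value is 1.
m = 100000;
x = 2 * randn(1, m);                             % samples from Q = N(0, 4)
P = @(x) exp(-x.^2 / 2) / sqrt(2*pi);            % density of N(0, 1)
Q = @(x) exp(-x.^2 / 8) / sqrt(2*pi*4);          % density of N(0, 4)
w = P(x) ./ Q(x);                                % importance weights
estimate = mean(w .* x.^2)                       % should be close to 1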
Page 13
For t = 1, 2, …
 Dynamics update:
  For i = 1, 2, …, N
   Sample x^(i)_{t+1} from P(X_{t+1} | X_t = x^(i)_t)
 Observation update:
  For i = 1, 2, …, N
   w^(i)_{t+1} = w^(i)_t · P(z_{t+1} | X_{t+1} = x^(i)_{t+1})
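A Matlab sketch of one particle filter step with resampling at every iteration; sample_dyn and obs_lik are assumed function handles for the transition sampler and the observation likelihood.
% One particle filter step. X is an n-by-N matrix whose columns are particles.
function X = particle_filter_step(X, u, z, sample_dyn, obs_lik)
  N = size(X, 2);
  w = zeros(1, N);
  for i = 1:N
    X(:,i) = sample_dyn(X(:,i), u);     % dynamics update: sample from P(x_t+1 | x_t, u)
    w(i)   = obs_lik(z, X(:,i));        % observation update: weight by P(z | x)
  end
  w = w / sum(w);                       % normalize weights
  edges = cumsum(w); edges(end) = 1;    % resample N particles proportionally to w
  idx = zeros(1, N);
  for i = 1:N
    idx(i) = find(edges >= rand, 1, 'first');
  end
  X = X(:, idx);
end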
Page 14
Resampling
 1. Algorithm particle_filter(S_{t−1}, u_t, z_t):
 2. S_t = ∅,  η = 0
 3. For i = 1 … n                    (generate new samples)
 4.   Sample an index j(i) from the discrete distribution given by the weights in S_{t−1}
 5.   Sample x_t^i from p(x_t | u_t, x_{t−1}) conditioned on x_{t−1}^{j(i)} and u_t
 6.   w_t^i = p(z_t | x_t^i)          (compute importance weight)
 7.   η = η + w_t^i                   (update normalization factor)
 8.   Insert (x_t^i, w_t^i) into S_t
 9. For i = 1 … n
 10.  w_t^i = w_t^i / η               (normalize weights)
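One common way to implement the resampling step with lower variance is systematic (low-variance) resampling; here is a Matlab sketch, added for illustration rather than taken from the slide.
% Systematic (low-variance) resampling: returns the indices of the selected particles.
function idx = low_variance_resample(w)
  n = numel(w);
  w = w / sum(w);
  r = rand / n;                 % single random offset in [0, 1/n)
  c = w(1); i = 1;
  idx = zeros(1, n);
  for m = 1:n
    U = r + (m - 1) / n;        % equally spaced pointers
    while U > c
      i = i + 1;
      c = c + w(i);
    end
    idx(m) = i;
  end
end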
Page 15
Particle Filters
Sensor (observation) update: Bel(x) ∝ p(z | x) Bel⁻(x)
 Implemented by importance weighting each particle: w = p(z | x) Bel⁻(x) / Bel⁻(x) = p(z | x)   (up to normalization)
Page 16
Robot Motion (time update):
 Bel⁻(x) ← ∫ p(x | u, x′) Bel(x′) dx′
Sensor update:
 Bel(x) ∝ p(z | x) Bel⁻(x),  with particle weight w = p(z | x)
Page 17
Robot Motion (time update):
 Bel⁻(x) ← ∫ p(x | u, x′) Bel(x′) dx′
Resampling issue
Loss of sample diversity: the same particle may be duplicated many times
Page 18
Advantages:
Page 19
Regularization
Page 20
[Figures: localization runs from the "Start" position, using a sonar sensor vs. a laser sensor.]
Page 21
[Figures: image-only slides 55–72.]
Summary PF Localization
Page 32
Announcements
Pieter Abbeel
UC Berkeley EECS
Page 1
Types of SLAM-Problems
Grid-based / scan matching
 [Lu & Milios, 97; Gutmann, 98; Thrun, 98; Burgard, 99; Konolige & Gutmann, 00; Thrun, 00; Arras, 99; Haehnel, 01; …]
Landmark-based
 [Leonard et al., 98; Castellanos et al., 99; Dissanayake et al., 2001; Montemerlo et al., 2002; …]
State variables:
 Robot pose
 Positions of the landmarks (for the landmark-based formulation)
KF-type approaches are a good fit because they can keep track of the correlations between landmarks.
Note: one could then use the robot's path + sensor log to build a map assuming known robot poses.
Page 2
Grid-based SLAM
Page 3
[Figure: a range measurement z updates the cells it passes through (at range z − d1) and the cells around the beam endpoint (z + d2, z + d3).]
Binary Bayes filter in log-odds form (per grid cell m^[xy], assuming a uniform occupancy prior so the prior term drops):
 log [ P(m^[xy] = 1 | z_1, …, z_t) / P(m^[xy] = 0 | z_1, …, z_t) ]
   = log [ P(m^[xy] = 1 | z_t) / P(m^[xy] = 0 | z_t) ] + log [ P(m^[xy] = 1 | z_1, …, z_{t−1}) / P(m^[xy] = 0 | z_1, …, z_{t−1}) ]
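A Matlab sketch of this per-cell update; inv_sensor_p (the inverse sensor model returning P(m = 1 | z_t) for the cell) is an assumed name.
% Log-odds occupancy update for one cell, starting from l = 0 (uniform prior).
function l = logodds_update(l, inv_sensor_p)
  l = l + log(inv_sensor_p / (1 - inv_sensor_p));
end
% Recover the occupancy probability: p = 1 - 1 / (1 + exp(l))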
Incremental Updating of Occupancy Grids (Example)
Page 4
Bel(m^[xy]) = hits(x, y) / ( hits(x, y) + misses(x, y) )
Page 5
Page 6
Example
Out of 1000 beams only 60% are reflected from a cell and
40% intercept it without ending in it.
Accordingly, the reflection probability will be 0.6.
Suppose p(occ | z) = 0.55 when a beam ends in a cell and
p(occ | z) = 0.45 when a cell is intercepted by a beam that
does not end in it.
Accordingly, after n measurements the odds of occupancy will be
 (0.55 / 0.45)^{0.6 n} · (0.45 / 0.55)^{0.4 n} = (11/9)^{0.6 n} · (9/11)^{0.4 n} = (11/9)^{0.2 n}
which grows without bound as n increases, so the cell's occupancy probability converges to 1 even though only 60% of the beams are reflected by it.
Page 7
Rao-Blackwellization
 p(x_{1:t}, m | z_{1:t}, u_{1:t}) = p(x_{1:t} | z_{1:t}, u_{1:t}) · p(m | x_{1:t}, z_{1:t})
 SLAM posterior (poses and map) = robot path posterior × mapping with known poses
Factorization first introduced by Murphy in 1999
Page 9
Rao-Blackwellized Mapping
Each particle represents a possible trajectory of the robot.
Each particle maintains its own map and updates it using "mapping with known poses".
Each particle survives with a probability proportional to the likelihood of the observations relative to its own map.
[Figure: maps of particle 1, particle 2, and particle 3, each attached to its own trajectory.]
Page 10
Problem
 In the case of grid maps each map is quite big, and each particle maintains its own map; therefore one needs to keep the number of particles small.
Solution: compute better proposal distributions!
Idea: improve the pose estimate before applying the particle filter.
x̂_t = argmax_{x_t} { p(z_t | x_t, m̂_{t−1}) · p(x_t | u_{t−1}, x̂_{t−1}) }
 where the first factor scores the current measurement against the map built so far, and the second factor encodes the robot motion.
Page 11
[Figures: map built from raw odometry vs. map built with scan matching.]
Page 12
Loop Closure
Page 13
[Figures: likelihood field computed from map m; scan-matching example.]
Page 14
FastSLAM recap
Rao-Blackwellized representation:
Original FastSLAM:
Page 15
Graph-based Formulation
Goal:
Find a configuration of the nodes that minimizes the error
introduced by the constraints
J_GraphSLAM = x_0^T Ω_0 x_0
  + ∑_t (x_t − g(u_t, x_{t−1}))^T R_t^{−1} (x_t − g(u_t, x_{t−1}))
  + ∑_t ∑_i (z_t^i − h(x_t, m, c_t^i))^T Q_t^{−1} (z_t^i − h(x_t, m, c_t^i))
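As a toy illustration of minimizing constraint error over node configurations, here is a tiny 1-D pose graph solved by linear least squares; all numbers are made up for illustration.
% Nodes x0..x3. Constraints: prior x0 = 0, odometry x_{k+1} - x_k = 1,
% and a loop-closure-style constraint x3 - x0 = 2.9, all with unit information.
A = [ 1  0  0  0;
     -1  1  0  0;
      0 -1  1  0;
      0  0 -1  1;
     -1  0  0  1];
b = [0; 1; 1; 1; 2.9];
x = (A' * A) \ (A' * b)     % configuration minimizing the total squared constraint error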
Page 16
scans: 59668
total acquisition time: 4,699.71 seconds
traveled distance: 2,587.71 meters
total rotations: 262.07 radians
size: 180 x 110 meters
processing time: < 30 minutes
Page 17
GraphSLAM
Autonomous Blimp
Page 18
Page 19