0% found this document useful (0 votes)

285 views381 pages

Advanced Robotics

This document summarizes the first lecture of the CS 287: Advanced Robotics course taught by Pieter Abbeel at UC Berkeley in Fall 2009. The lecture introduced the course topics and goals, which include control, estimation, manipulation, reinforcement learning, and case studies. Example robotic systems were discussed like driverless cars, autonomous helicopters, legged robots, and mobile manipulation. The class aims to understand state-of-the-art robotics techniques and algorithms through lectures, assignments, and a final project.

Uploaded by

akozy

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

285 views381 pages

Advanced Robotics

Uploaded by

akozy

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 381

CS 287: Advanced Robotics

Fall 2009
Lecture 1: Introduction
Pieter Abbeel
UC Berkeley EECS

www

http://www.cs.berkeley.edu/~pabbeel/cs287-fa09

Instructor: Pieter Abbeel

Lectures: Tuesdays and Thursdays, 12:30pm-2:00pm,

405 Soda Hall
Office Hours: Thursdays 2:00-3:00pm, and by email
arrangement. In 746 Sutardja Dai Hall

Page 1

Announcements

Communication:

Announcements: webpage

Email: [email protected]

Office hours: Thursday 2-3pm + by email

arrangement, 746 SDH

Enrollment:

Undergrads stay after lecture and see me

Class Details

Prerequisites:

Familiarity with mathematical proofs, probability, algorithms,

linear algebra, calculus.

Ability to implement algorithmic ideas in code.

Strong interest in robotics

Work and grading

Four large assignments (4 * 15%)

One smaller assignment (5%)

Open-ended final project (35%)

Collaboration policy: Students may discuss assignments with each

other. However, each student must code up their solutions
independently and write down their answers independently.

Page 2

Class Goals

Learn the issues and techniques underneath state of the

art robotic systems
Build and experiment with some of the prevalent
algorithms
Be able to understand research papers in the field

Main conferences: ICRA, IROS, RSS, ISER, ISRR

Main journals: IJRR, T-RO, Autonomous Robots

Try out some ideas / extensions of your own

Lecture outline

Logistics --- questions? [textbook slide forthcoming]

A few sample robotic success stories

Outline of topics to be covered

Page 3

Driverless cars

Darpa Grand Challenge

First long-distance driverless car competition

2004: CMU vehicle drove 7.36 out of 150 miles

2005: 5 teams finished, Stanford team won

Darpa Urban Challenge (2007)

Urban environment: other vehicles present

6 teams finished (CMU won)

Ernst Dickmanns / Mercedes Benz: autonomous car on European

highways

Human in car for interventions

Paris highway and 1758km trip Munich -> Odense, lane
changes at up to 140km/h; longest autonomous stretch: 158km

Kalman filtering, Lyapunov, LQR, mapping, (terrain & object recognition)

Autonomous Helicopter Flight

[Coates, Abbeel & Ng]

Kalman filtering, model-predictive control, LQR, system ID, trajectory learning

Page 4

Four-legged locomotion
[Kolter, Abbeel & Ng]

inverse reinforcement learning, hierarchical RL, value iteration, receding

horizon control, motion planning

Two-legged locomotion
[Tedrake +al.]

TD learning, policy search, Poincare map, stability

Page 5

Mapping

[Video from W. Burgard and D. Haehnel]

baseline : Raw odometry data + laser range finder scans

Mapping

[Video from W. Burgard and D. Haehnel]

FastSLAM: particle filter + occupancy grid mapping

Page 6

Mobile Manipulation
[Quigley, Gould, Saxena, Ng + al.]

SLAM, localization, motion planning for navigation and grasping, grasp point
selection, (visual category recognition, speech recognition and synthesis)

Outline of Topics

Control: underactuation, controllability, Lyapunov, dynamic

programming, LQR, feedback linearization, MPC
Estimation: Bayes filters, KF, EKF, UKF, particle filter, occupancy
grid mapping, EKF slam, GraphSLAM, SEIF, FastSLAM
Manipulation and grasping: force closure, grasp point selection,
visual servo-ing, more sub-topics tbd
Reinforcement learning: value iteration, policy iteration, linear
programming, Q learning, TD, value function approximation, Sarsa,
LSTD, LSPI, policy gradient, inverse reinforcement learning, reward
shaping, hierarchical reinforcement learning, inference based
methods, exploration vs. exploitation
Brief coverage of: system identification, simulation, pomdps, karmed bandits, separation principle
Case studies: autonomous helicopter, Darpa Grand/Urban
Challenge, walking, mobile manipulation.

Page 7

1. Control

Overarching theme: mathematically capture

What makes control problems hard

What techniques do we have available to tackle the
hard problems

E.g.: Helicopters have underactuated, non-minimum

phase, highly non-linear and stochastic (within our
modeling capabilities) dynamics.
Hard or easy to control?

1. Control (ctd)

Under-actuated vs. fully actuated

Example: acrobot swing-up and balance task

Page 8

1. Control (ctd)

Other mathematical formalizations of what makes some

control problems easy/hard:

Linear vs. non-linear

Minimum-phase vs. non-minimum phase

Deterministic vs. stochastic

Solution and proof techniques we will study:

Lyapunov, dynamic programming, LQR, feedback

linearization, MPC

2. Estimation

Bayes filters: KF, EKF, UKF, particle filter

One of the key estimation problems in robotics:
Simultaneous Localization And Mapping (SLAM)
Essence: compute posterior over robot pose(s) and
environment map given

(i) Sensor model

(ii) Robot motion model

Challenge: Computationally impractical to compute

exact posterior because this is a very high-dimensional
distribution to represent
[You will benefit from 281A for this part of the course.]

Page 9

3. Grasping and Manipulation

Extensive mathematical theory on grasping: force

closure, types of contact, robustness of grasp
Empirical studies showcasing the relatively small
vocabulary of grasps being used by humans (compared
to the number of degrees of freedom in the human
hand)
Perception: grasp point detection

4. Reinforcement learning

Learning to act, often in discrete state spaces

value iteration, policy iteration, linear programming, Q

learning, TD, value function approximation, Sarsa,
LSTD, LSPI, policy gradient, inverse reinforcement
learning, reward shaping, hierarchical reinforcement
learning, inference based methods, exploration vs.
exploitation

Page 10

5. Misc. Topics

system identification: frequency domain vs. time domain

Simulation / FEM

Pomdps

k-armed bandits

separation principle

Reading materials

Control

Estimation

Reinforcement learning

Probabilistic Robotics, Thrun, Burgard and Fox.

Manipulation and grasping

Tedrake lecture notes 6.832:

https://svn.csail.mit.edu/russt_public/6.832/underactuated.pdf

Sutton and Barto, Reinforcement Learning (free online)

Misc. topics

Page 11

Next lecture we will start with our study of control!

Page 12

CS 287: Advanced Robotics

Fall 2009
Lecture 2: Control 1: Feedforward, feedback, PID, Lyapunov direct method
Pieter Abbeel
UC Berkeley EECS

Announcements

Office hours: Thursdays 2-3pm + by email arrangement,

746 SDH

SDH 7th floor should be unlocked during office hours

on Thursdays

Questions about last lecture?

Page 1

CS 287 Advanced Robotics

Control

Estimation

Manipulation/Grasping

Reinforcement Learning

Misc. Topics

Case Studies

Control in CS287

Overarching goal:

Understand what makes control problems hard

What techniques do we have available to tackle the
hard (and the easy) problems

Any applicability of control outside robotics? Yes, many!

Process industry, feedback in nature, networks and

computing systems, economics,

[See, e.g., Chapter 1 of Astrom and Murray, http://www.cds.caltech.edu/~murray/amwiki/Main_Page, for more

details---_optional_ reading. Fwiw: Astrom and Murray is a great read on mostly classical feedback control
and is freely available at above link.]

We will not have time to study these application

areas within CS287 [except for perhaps in your final
project!]

Page 2

Todays lecture

Feedforward vs. feedback

PID (Proportional Integral Derivative)

Lyapunov direct method --- a method that can be helpful

in proving guarantees about controllers
Reading materials:

Astrom and Murray, 10.3

Tedrake, 1.2

Optional: Slotine and Li, Example 3.21.

Based on a survey of over eleven thousand controllers in the refining, chemicals

and pulp and paper industries, 97% of regulatory controllers utilize PID feedback.
L. Desborough and R. Miller, 2002 [DM02]. [Quote from Astrom and Murray,
2009]

Todays lecture

Practical result: can build a trajectory controller for a

fully actuated robot arm

Our abstraction: torque control input to motor, read out

angle [in practice: voltages and encoder values]

Page 3

Intermezzo: Unconventional (?) robot arm use

Single link manipulator (aka the simple pendulum)

u
l

+ b(t)
+ mgl sin (t) = u(t)
I (t)

Page 4

I = ml2

Single link manipulator

u
l

+ b(t)
+ mgl sin (t) = u(t)
I (t)

How to hold arm at = 45 degrees?

The Matlab code that generated all discussed simulations will be posted on www.

Single link manipulator

Simulation results:
+ c(t)
+ mgl sin (t) = u(t), u = mgl sin
I (t)
4

Can we do better
than this?

(0) = 0, (0)
=0

(0) = 4 , (0)
=0

Page 5

Feedforward control
u
l

+ b(t)
+ mgl sin (t) = u(t)
I (t)

How to make arm follow a trajectory *(t) ?

u(t) = I (t) + c (t) + mgl sin (t)

(0) = 0, (0)
=0

Feedforward control
Simulation results:
+ c(t)
+ mgl sin (t) = u(t)
I (t)

u(t) = I (t) + c (t) + mgl sin (t)

Can we do better
than this?

Page 6

(0) = 0, (0)
=0

n DOF (degrees of freedom) manipulator?

Thus far:

+ b(t)
+ mgl sin (t) = u(t)
I (t)

n DOF manipulator: standard manipulator equations

H(q)
q + C(q, q)
+ G(q) = B(q)u

H : inertial matrix, full rank

B : identity matrix if every joint is actuated

Given trajectory q(t), can readily solve for

feedforward controls u(t) for all times t

Fully-Actuated vs. Underactuated

A system is fully actuated when in a certain state (q,\dot{q},t) if, when in that
state, it can be controlled to instantaneously accelerate in any direction.
Many systems of interest are of the form:

q = f1 (q, q,
t) + f2 (q, q,
t)u

Defn. Fully actuated: A control system described by Eqn. (1) is fully-actuated

in state (q,\dot{q},t) if it is able to command an instantaneous acceleration in
an arbitrary direction in q:

rankf2 (q, q,
t) = dimq

Defn. Underactuated: A control system described by Eqn. (1) is

underactuated in configuration (q,\dot{q},t) if it is not able to command an
instantaneous acceleration in an arbitrary direction in q:

rankf2 (q, q,
t) < dimq

[See also, Tedrake, Section 1.2.]

Page 7

(1)

Fully-Actuated vs. Underactuated

q = f1 (q, q,
t) + f2 (q, q,
t)u fully actuated in (q, q,
t) iff rankf2 (q, q,
t) = dimq.

Hence, for any fully actuated system, we can follow a trajectory by

simply solving for u(t):

u(t) = f21 (q, q,

t) (
q f1 (q, q,
t))

[We can also transform it into a linear system through a change of

variables from u to v:

q(t) = v(t)
u(t) = f21 (q, q,
t) (v(t) f1 (q, q,
t))
The literature on control for linear systems is very extensive and
hence this can be useful. This is an example of feedback
linearization. More on this in future lectures.]

Fully-Actuated vs. Underactuated

q = f1 (q, q,
t) + f2 (q, q,
t)u fully actuated in (q, q,
t) iff rankf2 (q, q,
t) = dimq.

n DOF manipulator

H(q)
q + C(q, q)
+ G(q) = B(q)u
f2 = H 1 B, H full rank, B = I, hence rank(H 1 B) = rank(B)

All joints actuated rank(B) = n fully actuated

Only p < n joints actuated rank(B) = p underactuated

Page 8

Example underactuated systems

Car

Acrobot

Cart-pole

Helicopter

Fully actuated systems: is our feedforward control

solution sufficient in practice?
u
l

+ b(t)
+ mgl sin (t) = u(t)
I (t)

Task: hold arm at 45 degrees.

What if parameters off? --- by 5%, 10%, 20%,

What is the effect of perturbations?

Page 9

Fully actuated systems: is our feedforward control

solution sufficient in practice?
u
l

+ b(t)
+ mgl sin (t) = u(t)
I (t)

Task: hold arm at 45 degrees.

Mass off by 10%:

steady-state error

Fully actuated systems: is our feedforward control

solution sufficient in practice?
u
l

+ b(t)
+ mgl sin (t) = u(t)
I (t)

Task: swing arm up to 180 degrees and hold there

Perturbation after 1sec:

Does not recover

[ = 180 is an unstable
equilibrium point]

Page 10

Proportional control
Task: hold arm at 45 degrees

u(t) = ufeedforward (t) + Kp (qdesired (t) q(t))

u(t) = ufeedforward (t)

Feedback can provide

Robustness to model errors

However, still:

Overshoot issues --- ignoring momentum/velocity!

Steady-state error --- simply crank up the gain?

Proportional control
Task: swing arm up to 180
degrees and hold there

u(t) = ufeedforward (t)

u(t) = ufeedforward (t) + Kp (qdesired (t) q(t))

Page 11

Current status

Feedback can provide

Robustness to model errors

Stabilization around states which are unstable in open-loop

Overshoot issues --- ignoring momentum/velocity!

Steady-state error --- simply crank up the gain?

PD control
u(t) = Kp (qdesired (t) q(t)) + Kd (qdesired q(t))

Page 12

Eliminate steady state error by cranking up Kp ?

+ b(t)
+ mgl sin (t) = u(t)
I (t)

Task: hold arm at 45 degrees

u(t) = ufeedforward (t) + Kp (qdesired (t) q(t))

In steady-state, q = q = 0 and we get:

mgl sin = ufeedforward + Kp ( )
Using some trigoniometry and assuming is close to we get:
=

ufeedforward mgl sin

Kp + mgl cos

Eliminate steady state error by cranking up Kp ?

u(t + t) = Kp (qdesired (t) q(t))

Page 13

PID
u(t) = Kp (qdesired (t) q(t)) + Kd(qdesired q(t))

+ Ki

t
0

(qdesired ( ) q( ))d

Zero error in steady-state: [assumes steady-state is achieved!]

q = q = 0, u = 0, hence, taking derivatives of above:

u = Kp (qdesired q(t))

+ Kd (
qdesired q(t)) + Ki (qdesired (t) q(t))
0 = Kk (qdesired (t) q(t))

Recap so far

Given a fully actuated system and a (smooth) target trajectory

Can solve dynamics equations for required control inputs =

feedforward controls
Feedforward control is insufficient in presence of

Proportional feedback control can alleviate some of the above issues.

Model inaccuracy
Perturbations + instability
Steady state error reduced by (roughly) factor Kp, but large Kp can be
problematic in presence of delay Add integral term
Ignores momentum Add derivative term

Remaining questions:

How to choose PID constants? Aka tuning

Any guarantees?

Page 14

PID tuning

Typically done by hand (3 numbers to play with) [policy

search should be able to automate this in many settings]
Ziegler-Nichols method (1940s) provides recipe for
starting point

Frequency response method

Step response method

Recipe results from

(a) Extensively hand-tuning controllers for many settings

(b) Fitting a function that maps from easy to measure
parameters to the three gains
[See also Astrom and Murray Section 10.3]

PID tuning: Ziegler-Nichols frequency domain method

Set derivative and integral gain to zero

Drive up the proportional gain until steady oscillation
occurs, record the corresponding gain kc and period Tc
Use the following table to set the three gains:

Notation: KI =

kp
,
Ti

KD = kp Td

Page 15

PID tuning: Ziegler-Nichols step response method

1. Record open-loop stepresponse characteristics

2. Read gains out from

above table

Frequency domain Ziegler-Nichols for single link

u
l

Kc = 100;

Tc = 0.63s;

Page 16

ZN and TLC results

Tyreus-Luyben tuning chart:

Kp = kc /2.2, Ti = 2.2Tc , Td = Tc /6.3
Tends to:
increase robustness,
decrease oscillation.

Aside: Integrator wind-up

Recipe: Stop integrating error when the controls

saturate
Reason: Otherwise it will take a long time to react in the
opposite direction in the future.

Matters in practice!

[See also Astrom and Murray, Section 10.4]

Page 17

Recap of main points

To control a fully actuated system:

Compute feedforward by solving for u(t)

However, feedforward is insufficient when:

Feedback can address these issues

Model is imperfect (i.e., always when dealing with real systems)

System is unstable

Standard feedback: PID

In practice, often even only feedback (i.e., without feedforward) can

already provide ok results in these settings, no model needed, which
can be very convenient

In this lecture no solution provided for underactuated systems

Note: many underactuated systems do use PID type controllers in their core (e.g.,
helicopter governor, gyro)

Page 18

CS 287: Advanced Robotics

Fall 2009
Lecture 3: Control 2: Fully actuated wrap-up/recap, Lyapunov direct method,
Optimal control
Pieter Abbeel
UC Berkeley EECS

Page 1

Fully actuated recap

q = f1 (q, q,
t) + f2 (q, q,
t)u fully actuated in (q, q,
t) iff rankf2 (q, q,
t) = dimq.
Given a fully actuated system and a (smooth) target trajectory q*

Can solve dynamics equations for required control inputs = feedforward

controls
uff (t) = f21 (q , q , t) (
q f1 (q , q , t))
Feedforward control is insufficient in presence of

Model inaccuracy
Perturbations + instability

Proportional feedback control can alleviate some of the above issues.

Steady state error reduced by (roughly) factor Kp, but large Kp can be problematic in
presence of delay Add integral term
Ignores momentum Add derivative term

u(t) = uff (t) + Kp (q (t) q(t)) + Kd (q (t) q(t))

+ Ki

t
0

(q ( ) q( ))d

PID constants require tuning: often by hand, Ziegler-Nichols and TLC provide
good starting points, (policy search could automate this)
If control inputs do not directly relate to the degrees of freedom, we can use
feedback linearization to get that form:

q(t) = v(t)

u(t) = f21 (q, q,

t) (v(t) f1 (q, q,
t))

Todays lecture

Fully actuated recap [done]

Aside on integrator wind-up

Lyapunov direct method --- a method that can be helpful

in proving guarantees about controllers
Optimal control

Page 2

Readings for todays lecture

Optional:

Tedrake Chapter 3 [optional, nice read on energy pumping

control strategies]
Slotine and Li, Example 3.21, Global asymptotic stability of a
robot position controller [optional]

Aside: Integrator wind-up

u(t) = uff (t) + Kp (q (t) q(t)) + Kd (q (t) q(t))

+ Ki

t
0

(q ( ) q( ))d

Recipe: Stop integrating error when the controls

saturate
Reason: Otherwise it will take a long time to react in the
opposite direction in the future.

Matters in practice!

[See also Astrom and Murray, Section 10.4]

Page 3

Lyapunov

Lyapunov theory is used to make conclusions about

trajectories of a system without finding the trajectories
(i.e., solving the differential equation)
A typical Lyapunov theorem has the form:

if there exists a function V : Rn R that satisfies some

conditions on V and \dot{V}
then, trajectories of system satisfy some property

If such a function V exists we call it a Lyapunov function

Lyapunov function V can be thought of as generalized
energy function for system

Guarantees?
Equilibrium state. A state x is an equilibrium state of the system x =
f (x) if f (x ) = 0.

Stability. The equilibrium state x is said to be stable if, for any R > 0,
there exists r > 0, such that if x(0) x < r, then x(t) x < R for all
t 0. Otherwise, the equilibrium point is unstable.

Asymptotic stability. An equilibrium point x is asymptotically stable if

it is stable, and if in addition there exists some r > 0 such that x(0) x < r
implies that x(t) x as t .

Page 4

Stability illustration

Simple physical examples

Page 5

Proving stability

To prove stability, we need to show something about the

solution of a non-linear differential equation for all initial
conditions within a certain radius of the equilibrium
point.

Challenge: typically no closed form solution to

differential equation!

How to analyze / prove stability ??

Alexandr Mikhailovich Lyapunov

In late 19th century introduced one of

the most useful and general
approaches for studying stability of
non-linear systems.
[Lyapunovs PhD thesis: 1892]

Page 6

[from Boyd, ee363]

Proof:

[from Boyd, ee363]

Page 7

[from Boyd, ee363]

Proof:
Suppose trajectory x(t) does not converge to zero.
V (x(t)) is decreasing and nonnegative, so it converges to, say, as t .
Since x(t) does not converge to 0, we must have > 0, so for all t,
V (x(t)) V (x(0)).
C = {z| V (z) V (x(0))} is closed and bounded, hence compact. So
V (assumed continuous) attains its supremum on C, i.e., supzC V = a < 0.
Since V (x(t)) a for all t, we have
V (x(T )) = V (x(0)) +

V (x(t))dt V (x(0)) aT

which for T > V (x(0))/a implies V (x(T )) < 0, a contradition.

So every trajectory x(t) converges to 0, i.e., x = f (x) is globally asymptotically stable.

[from Boyd, ee363]

Page 8

Global invariant set Theorem. Assume that

V (x) as x .
V (x) 0 over the whole state space.
Let R be the set of all points where V (x) = 0, and let M be the largest invariant
set in R. Then all solutions globally asymptotically converge to M as t .

Example 1
q + q + sin q = u
u = sin q + Kp (q q) + Kd (0 q)

Page 9

Example 1 (solution)
q + q + g(q) = u
u = g(q) + Kd (q q) + Kd (0 q)

We choose V = 12 Kp (q q )2 + 12 q2 .
This gives for V :
V

=
=

Kp (q q )q + qq
Kp (q q )q + q (Kp (q q) Kd q q))

(1 + Kd )q

Hence V satisfies: (i) V (q) 0 and = 0 iff q = q , (ii) V 0. Since the arm
cannot get stuck at any position such that q = 0 (which can be easily shown
by noting that acceleration is non-zero in such situations), the robot arm must
settle down at q = 0 and q = 0, according to the invariant set theorem. Thus
the system is globally asymptotically stable.

Example 2: PD controllers are stable for

fully actuated manipulators

(Slotine and Li, Example 3.21.)

Page 10

[from Boyd, ee363]

Page 11

[from Boyd, ee363]

Lyapunov recap

Enables providing stability guarantees w/o solving the

differential equations for all possible initial conditions!

Tricky part: finding a Lyapunov function

A lot more to it than we can cover in lecture, but this

should provide you with a starting point whenever you
might need something like this in the future

Page 12

Intermezzo on energy pumping

An interesting and representative approach from nonlinear control:

Energy pumping to swing up:

Write out energy of system E(q, \dot{q})

Write out time derivative of energy: \{dot}E (q, \dot{q}, u)
Choose u as a function of q and \dot{q} such that energy is steered
towards required energy to reach the top

Local controller to stabilize at top --- we will see this aspect in

detail later.

Energy pumping nicely described in Tedrake Chapter 3. Enjoy the optional read!

Forthcoming lectures

Optimal control: provides general computational approach

to tackle control problems---both under- and fully actuated.

Dynamic programming

Discretization

Dynamic programming for linear systems

Extensions to nonlinear settings:

Local linearization
Differential dynamic programming
Feedback linearization

Model predictive control (MPC)

Examples:

Page 13

CS 287: Advanced Robotics

Fall 2009
Lecture 4: Control 3: Optimal control---discretization (function
approximation)
Pieter Abbeel
UC Berkeley EECS

Announcement

Tuesday Sept 15: no lecture

Page 1

Today and forthcoming lectures

Optimal control: provides general computational approach
to tackle control problems---both under- and fully actuated.

Dynamic programming

Discretization

Dynamic programming for linear systems

Extensions to nonlinear settings:

Local linearization
Differential dynamic programming
Feedback linearization

Model predictive control (MPC)

Examples:

Today and Thursday

Optimal control formalism [Tedrake, Ch. 6, Sutton and Barto Ch.1-4]

Discrete Markov decision processes (MDPs)

Solution methods for continuous problems:

HJB equation [[[Tedrake, Ch. 7 (optional)]]]

Markov chain approximation method [Chow and Tsitsiklis, 1991; Munos and Moore, 2001]
[[[Kushner and Dupuis 2001 (optional)]]]

Continuous discrete [Chow and Tsitsiklis, 1991; Munos and Moore, 2001] [[[Kushner
and Dupuis 2001 (optional)]]]
Error bounds:

Solution through value iteration [Tedrake Ch.6, Sutton and Barto Ch.1-4]

Value function: Chow and Tsitsiklis; Kushner and Dupuis; function approximation [Gordon 1995;
Tsitsiklis and Van Roy, 1996]
Value function close to optimal resulting policy good

Speed-ups and Accuracy/Performance improvements

Page 2

Optimal control formulation

Given:
dynamics : x(t)

= f(x(t), u(t), t)
cost function : g(x, u, t)

Task: find a policy u(t) = (x, t) which optimizes:

J (x0 ) = h(x(T )) +

g(x(t), u(t), t)dt

Applicability: g and f often easier to specify than

Finite horizon discrete time

Markov decision process (MDP)

S: set of states

A: set of actions

P: dynamics model

H: horizon

g: S x A R cost function

(S, A, P, H, g)

P (xt+1 = x |xt = x, ut = u)

Policy = (0 , 1 , . . . , H ), k : S A

Cost-to-go of a policy : J (x) = E[

Goal: find

arg min J

Page 3

t=0

g(x(t), u(t))|x0 = x, ]

Dynamic programming (aka value iteration)

Let Jk = mink ,...,H E[ H
t=k g(xt , ut )], then we have:

JH
(x) =

JH1
(x) =

min g(x(H), u(H))

min g(x, u) +
P (x |x, u)JH
(x )
u

...

Jk (x) =

min g(x, u) +

J0 (x) =

...

min g(x, u) +
P (x |x, u)J1 (x )

k (x) = arg min g(x, u) +

P (x |x, u)Jk+1
(x )

And

P (x |x, u)Jk+1
(x );

Running time: O(|S|2 |A| H) vs. nave search over all policies would
require evaluation of |A||S|H policies

Discounted infinite horizon

Markov decision process (MDP)

(S, A, P, , g)

: discount factor

= (0 , 1 , . . .), k : S A

t
Value of a policy : J (x) = E[
t=0 g(x(t), u(t))|x0 = x, ]

Policy

Goal: find arg min V

Page 4

Discounted infinite horizon

Dynamic programming (DP) aka Value iteration (VI):

For i=0,1,
For all s S

J (i+1) (s) min

P (s |s, u) g(s, a) + J (i) (s )

Facts:

J (i) J for i
There is an optimal stationary policy: = ( , , . . .) which satisfies:

(x) = arg min g(x, u) +
P (x |x, u)J (x)
u

Continuous time and state-action space

Hamilton-Jacobi-Bellman equation / approach:

Continuous equivalent of discrete case we already discussed

We will see 2 slides.

Variational / Markov chain approximation method:

Numerically solve a continuous problem by directly

approximating the continuous MDP with a discrete MDP

We will study this approach in detail.

Page 5

Hamilton-Jacobi-Bellman (HJB) [*]

Can also derive HJB equation for the stochastic setting. Keywords for
finding out more: Controlled diffusions / diffusion jump processes.

For special cases, can assist in finding / verifying analytical solutions

However, for most cases, need to resort to numerical solution methods for
the corresponding PDE --- or directly approximate the control problem with
a Markov chain

References:

Tedrake Ch. 7; Bertsekas, Dynamic Programming and Optimal Control.

Oksendal, Stochastic Differential Equations: An Introduction with
Applications

Oksendal and Sulem, Applied Stochastic Control of Jump Diffusions

Michael Steele, Stochastic Calculus and Financial Applications

Markov chain approximations: Kushner and Dupuis, 1992/2001

Page 6

Markov chain approximation (discretization)

Original MDP (S, A, P, R, )

Discretized MDP:

Grid the state-space: the vertices are the discrete

states.
Reduce the action space to a finite set.

Sometimes not needed:

When Bellman back-up can be computed exactly over
the continuous action space
When we know only certain controls are part of the
optimal policy (e.g., when we know the problem has a
bang-bang optimal solution)

Transition function remains to be resolved!

Discretization: example 1

Discrete states: { , , }

P (2 |s, a) = pA ;
P (3 |s, a) = pB ;
P (6 |s, a) = pC ;
s.t. s = pA 2 + pB 3 + pC 6

Results in discrete MDP, which we know how to solve.

Policy when in continuous state:

(s) = arg mina g(s, a) +

Note: need not be triangular.

P (s |s, a)

i P (i ; s

)J(i)

[See also: Munos and Moore, 2001.]

Page 7

Discretization: example 1 (ctd)

Discretization turns deterministic transitions into

stochastic transitions
If MDP already stochastic

Repeat procedure to account for all possible

transitions and weight accordingly

If a (state, action) pair can results in infinitely many

different next states:

Sample next states from the next-state distribution

Discretization: example 1 (ctd)

Discretization results in finite state stochastic MDP,

hence we know value iteration will converge

Alternative interpretation: the Bellman back-ups in the

finite state MDP are

(a) back-ups on a subset of the full state space

(b) use linear interpolation to compute the required
next-state cost-to-go functions whenever the next
state is not in the discrete set

= value iteration with function approximation

Page 8

Discretization: example 2
s
a

Discrete states: { , , }

P (2 |s, a) = 1;

Similarly define transition

probabilities for all i

Results in discrete MDP, which we know how to solve.

Policy when in continuous state:

(s) = arg mina g(s, a) +

P (s |s, a)

i P (i ; s

)J(i)

This is nearest neighbor; could also use weighted combination of nearest neighbors.

Discretization: example 2 (ctd)

Discretization results in finite state (stochastic) MDP,

hence we know value iteration will converge

Alternative interpretation: the Bellman back-ups in the

finite state MDP are

(a) back-ups on a subset of the full state space

(b) use nearest neighbor interpolation to compute the
required next-state cost-to-go functions whenever
the next state is not in the discrete set

= value iteration with function approximation

Page 9

Discretization: example 3
s

P (i |j , u) =

Discrete states: { , , }

P (s |s,u)1{s i }ds

P (s |s,u)ds

After entering a region, the state gets uniformly reset to

any state from that region.

[Chow and Tsitsiklis, 1991]

Discretization: example 3 (ctd)

Discretization results in a similar MDP as for example 2

Main difference: transition probabilities are computed

based upon a region rather than the discrete states

Page 10

Continuous time

One might want to discretize time in a variable way such that one
discrete time transition roughly corresponds to a transition into
neighboring grid points/regions
Discounting:

exp(t)

t depends on the state and action

See, e.g., Munos and Moore, 2001 for details.

Note: Numerical methods research refers to this connection between
time and space as the CFL (Courant Friedrichs Levy) condition.
Googling for this term will give you more background info.
!! 1 nearest neighbor tends to be especially sensitive to having the
correct match [Indeed, with a mismatch between time and space 1
nearest neighbor might end up mapping many states to only
transition to themselves no matter which action is taken.]

Example: Double integrator---minimum time

Continuous time:

Objective: reach origin in minimum time

g(q, q,
u) =

q = u, t : u(t) [1, +1]

0
1

if q = q = 0
otherwise

Can be solved analytically: optimal policy is bang-bang:

the control system should accelerate maximally towards
the origin until a critical point at which it should hit the
brakes in order to come perfectly to rest at the origin.
This results in:
u=

1
if q sign(q) 2sign(q)q
1 otherwise
[See Tedrake 6.6.3 for further details.]

Page 11

Example: Double integrator---minimum

time---optimal solution

Example: Double integrator---minimum time

Kuhn triang., h = 1

optimal

Kuhn triang., h = 0.1

Kuhn triang., h = 0.02

Page 12

Resulting cost, Kuhn triang.

Green = continuous time optimal policy for mintime problem

For simulation we used: dt = 0.01; and goal area = within .01 of zero for q and \dot{q}.
This results in the continuous time optimal policy not being exactly optimal for the
discrete time case.

Example: Double integrator---quadratic cost

Continuous time:

In discrete time:

Cost function:

q = u

qt+1

= qt + qt t

qt+1

= qt + ut

g(q, q,
u) = q 2 + u2

Page 13

Example: Double integrator---quadratic cost

Kuhn triang., h = 1

optimal

Kuhn triang., h = 0.1

Kuhn triang., h = 0.02

Resulting cost, Kuhn

Page 14

Example: Double integrator---quadratic cost

Nearest neighbor, h = 1

optimal

Nearest neighbor, h = 0.1

Nearest neighbor, h = 0.02

dt=0.1

Resulting cost, nearest neighbor

Page 15

Nearest neighbor quickly degrades when

time and space scale are mismatched

dt= 0.01

dt= 0.1

h = 0.1

h = 0.02

Discretization guarantees

Typical guarantees:

Assume: smoothness of cost function, transition

model
For h 0, the discretized value function will
approach the true value function

Combine with:

Greedy policy w.r.t. value function V which is close to

V* is a policy that attains value close to V*

Page 16

Discretization proof techniques

Chow and Tsitsiklis, 1991:

Kushner and Dupuis, 2001:

Show that one discretized back-up is close to one complete

back-up + then show sequence of back-ups is also close

Show that sample paths in discrete stochastic MDP approach

sample paths in continuous (deterministic) MDP [also proofs
for stochastic continuous, bit more complex]

Function approximation based proof

Applies more generally to solving large-scale MDPs

Great descriptions: Gordon, 1995; Tsitsiklis and Van Roy, 1996

Example result (Chow and Tsitsiklis,1991)

Page 17

Function approximation

General idea

Value iteration back-up on some states Vi+1

Fit parameterized function to Vi+1

Discretization as function approximation

Nearest neighbor discretization = piecewise constant

Piecewise linear over triangles discretization

Page 18

CS 287: Advanced Robotics

Fall 2009
Lecture 5: Control 4: Optimal control / Reinforcement learning--- function
approximation in dynamic programming
Pieter Abbeel
UC Berkeley EECS

Today

Optimal control/Reinforcement learning provide a general

approach to tackle temporal decision making problems.

Often the state space is too large to perform exact Dynamic

Programming (DP) / Value Iteration (VI)

Today: Dynamic programming with function approximation

Great references:
Gordon, 1995, Stable function approximation in dynamic programming
Tsitsiklis and Van Roy, 1996, Feature based methods for large scale
dynamic programming
Bertsekas and Tsitsiklis, Neuro-dynamic programming, Chap. 6

Page 1

Recall: Discounted infinite horizon

Markov decision process (MDP)

(S, A, P, , g)

: discount factor

= (0 , 1 , . . .), k : S A

t
Value of a policy : J (x) = E[
t=0 g(x(t), u(t))|x0 = x, ]

Policy

Goal: find arg min V

Recall: Discounted infinite horizon

Dynamic programming (DP) aka Value iteration (VI):

For i=0,1,
For all s S

J (i+1) (s) min g(s, u) +

P (s |s, u)J (i) (s )

Facts:

J (i) J for i
There is an optimal stationary policy: = ( , , . . .) which satises:

(s) = arg min g(s, u) +
P (s |s, u)J (s)
u

Issue in practice: Bellmans curse of dimensionality: number of

states grows exponentially in the dimensionality of the state space

Page 2

DP/VI with function approximation

Pick some S S [typically the idea is that |S | << |S|].
Iterate for i = 0, 1, 2, . . .:

back-ups:s S : J(i+1) (s) min g(s, u) +
P (s |s, u)J(i) (s )
uA

projection: nd some

(i+1)

such that s S

J (i+1) (s) J(i+1) (s)

Projection enables generalization to s S \ S , which in

turn enables the Bellman back-ups in the next iteration.
parameterizes the class of functions used for
approximation of the cost-to-go function

Function approximation examples

Piecewise linear over triangles (tetrahedrons, etc.)

Piecewise constant over sets of states (=nearest

neighbor, often called state aggregation)

Fit a neural net to J

(i+1) = arg min

loss(J(i+1) (s) f (s))

Least squares fit to J

(i+1) = arg min

(i+1) (s) (s) )2

sS (J

Bezier patches (=particular choice of convex weighting)

[[TODO: work out examples in more detail.]]

Page 3

Potential guarantees?

If we have bounded error during function approximation

in each iteration, i.e.,

i : J(i) J (i)
can we provide any guarantees?

If we have a function approximation architecture that

can capture the true cost-to-go function within some
approximation error, i.e.,

: floss (J , J )
can we provide any guarantees?

Simple example

Function approximator: [1 2] *

Page 4

Simple example
J =

J(1) (x1 )
J(1) (x2 )

1
2

= 0 + J(0) (x2 ) = 2(0)

Function approximation with least squares t:

1
2(0)
(1)
(0)
2
2
Least squares t results in:
(1) =

6 (0)

Repeated back-ups and function approximations result in:

(i) =

(0)

which diverges if > 56 even though the function approximation class can
represent the true value function.]

Bellman operator

Dynamic programming (DP) aka Value iteration (VI):

Set J(0) = 0
For i=0,1,
For all s S

J (i+1) (s) min

P (s |s, u) g(s, a) + J (i) (s )

Bellman operator T
T : |S| |S| is dened as:

(T J )(s) = min g(s, a) +

P (s |s, u)J(s )

Hence running value iteration can be compactly written as:

Set J = 0
Repeat J T J .
Note: T is a non-linear operator (b/c of the min).

Page 5

Contractions
Definition. The operator F is a -contraction w.r.t. some norm if
X, X : F X F X X X

Theorem 1. The sequence X, F X, F 2 X, ... converges for every X.

Theorem 2. F has a unique xed point X which satises F X = X and all

sequences X, F X, F 2 X, ... converge to this unique xed point X .

Proof of Theorem 1
Useful fact.
Cauchy sequences: If for x0 , x1 , x2 , . . ., we have that
, K : xM xN < f or M, N > K
then we call x0 , x1 , x2 , ... a Cauchy sequence.
If x0 , x1 , x2 , . . . is a Cauchy sequence, and xi n , then there exists x n
such that limi xi = x .
Proof.
Assume N > M .
N 1
F M X F N X = i=M (F i X F i+1 X)
N 1
i
i+1
X

i=M F X F
N 1 i

F
X
i=M
N 1
=
X F X i=M i
=

X F X 1
.

As X F X 1
goes to zero for M going to infnity, we have that for any
> 0 for F M X F N X to hold for all M, N > K, it suces to pick K
large enough.
Hence X, F X, . . . is a Cauchy sequence and converges.

Page 6

Proof of uniqueness of fixed point

Suppose F has two xed points. Lets say
F X1

X1 ,

F X2

X2 ,

this implies,
F X1 F X2 = X1 X2 .
At the same time we have from the contractive property of F
F X1 F X2 X1 X2 .
Combining both gives us
X1 X2 X1 X2 .
Hence,
X1 = X2 .
Therefore, the xed point of F is unique.

The Bellman operator is a contraction

Definition. The innity norm of a vector x n is dened as
x = max |xi |
i

Fact. The Bellman operator T is a -contraction with respect to the innity

norm, i.e.,
T J1 T J2 J1 J2

Corollary. From any starting point, value iteration/Dynamic programming

converges to a unique xed point J which satises J = T J .

Page 7

Proof Bellman operator is a contraction

Value iteration variations

Gauss-Seidel value iteration

For i=1, 2,
for s=1,,|S|
J(s) min g(s, u) +
uA

P (s |s, u)J(s )

Compare to regular value iteration:

J (i+1) (s) min g(s, u) +
P (s |s, u)J (i) (s )
uA

Exercise: Show that Gauss-Seidel value iteration converges to J*. [Hint:

proceed by showing the combined operator which does the sequential
update for all states s=1,,|S| is a infinity norm contraction.]

Page 8

Value iteration variations

Asynchronous value iteration

Pick an innite sequence of states,
s(0) , s(1) , s(2) , ...

such that every state s S occurs innitely often. Dene the operators Ts(k) as
follows:

(T J )(s),
if s(k) = s
(Ts(k) J)(s) =
J(s),
otherwise
Asynchronous value iteration initializes J and then applies, in sequence,
Ts(0) , Ts(1) , . . ..

Exercise: Show that asynchronous value iteration

converges to J*.

DP/VI with function approximation

Pick some S S [typically the idea is that |S | << |S|].
Iterate for i = 0, 1, 2, . . .:

back-ups:s S : J(i+1) (s) min
P (s |s, u) g(s, u) + J(i) (s )
uA

projection: nd some

(i+1)

such that s S J (i+1) (s) J(i+1) (s)

New notation: projection operator maps from n into the subset

of n which can be represented by the function approximator class

J(i+1) T J(i)

While theoretical convergence analysis does not depend on this,

the projection operator has to operate based upon only knowing
J at the points s S , otherwise not practically feasible for large
scale problems

Page 9

Composing operators

Definition. An operator G is a non-expansion with

respect to a norm || . || if
GJ1 GJ2 J1 J2

Fact. If the operator F is a contraction with respect to

a norm || . || and the operator G is a non-expansion with
respect to the same norm, then the sequential
application of the operators G and F is a -contraction,
i.e.,
GF J1 GF J2 J1 J2

Corollary. The operator T is a -contraction w.r.t. the

infinity norm if is a non-expansion w.r.t. the infinity
norm.

Averager function approximators

are non-expansions

Examples:

nearest neighbor (aka state aggregation)

linear interpolation over triangles (tetrahedrons, )

Page 10

Averager function approximators

are non-expansions
Theorem. The mapping associated with any averaging method is a nonexpansion in the innity norm.
Proof: Let J1 and J2 be two vectors in n . Consider a particular entry s of
J1 and J2 :
|(J1 )(s) (J2 )(s)| = |s0 +
= |

ss J1 (s ) s0 +

ss (J1 (s ) J2 (s ))|

max
|J1 (s ) J2 (s )|

= J1 J2
This holds true for all s, hence we have
J1 J2 J1 J2

Linear regression

[Example taken from Gordon, 1995.]

Page 11

ss J2 (s )|

Guarantees for fixed point

Theorem. Let J be the optimal value function for a nite MDP with discount
factor . Let the projection operator be a non-expansion w.r.t. the innity
norm and let J be any xed point of . Suppose J J . Then T
converges to a value function J such that:
J J 2 +

2
1

I.e., if we pick a non-expansion function approximator

which can approximate J* well, then we obtain a good
value function estimate.

To apply to discretization: use continuity assumptions to

show that J* can be approximated well by chosen
discretization scheme

Page 12

CS 287: Advanced Robotics

Fall 2009
Lecture 5: Control 4: Optimal control / Reinforcement learning--- function
approximation in dynamic programming
Pieter Abbeel
UC Berkeley EECS

Today

Recap + continuation of value iteration with function approximation

Performance boosts

Speed-ups

Intermezzo: Extremely crude outline of (part of) the reinforcement learning

field [as it might assist when reading some of the references]

Great references:
Gordon, 1995, Stable function approximation in dynamic programming
Tsitsiklis and Van Roy, 1996, Feature based methods for large scale dynamic programming
Bertsekas and Tsitsiklis, Neuro-dynamic programming, Chap. 6

Page 1

Recall: Discounted infinite horizon

(S, A, P, , g)

Markov decision process (MDP)

: discount factor

= (0 , 1 , . . .), k : S A

t
Value of a policy : J (x) = E[
t=0 g(x(t), u(t))|x0 = x, ]

Policy

Goal: find arg min J

Recall: Discounted infinite horizon

Dynamic programming (DP) aka Value iteration (VI):

For i=0,1,
For all s S

J (i+1) (s) min g(s, u) +

P (s |s, u)J (i) (s )

Facts:

J (i) J for i
There is an optimal stationary policy: = ( , , . . .) which satises:

(s) = arg min g(s, u) +
P (s |s, u)J (s)
u

Issue in practice: Bellmans curse of dimensionality: number of

states grows exponentially in the dimensionality of the state space

Page 2

DP/VI with function approximation

Pick some S S [typically the idea is that |S | << |S|].
Iterate for i = 0, 1, 2, . . .:

back-ups:s S : J(i+1) (s) min g(s, u) +
P (s |s, u)J(i) (s )
uA

projection: nd some

(i+1)

such that s S

J(i+1) (s) = (J(i+1) )(s) J(i+1) (s)

Projection enables generalization to s S \ S , which in

turn enables the Bellman back-ups in the next iteration.
parameterizes the class of functions used for
approximation of the cost-to-go function

Example --- piecewise linear

actual computation

abstract
J(1)=TJ(0)

back-up

back-up for
s S

J(1)=TJ(0)
x x
x

x
s

s
J(1)

project

Interpolate to
J(1)
successor
x
states of S

x xx x

x
s

s
J(2)=TJ(1)

back-up

s
J(2)

project

back-up for
s S

J(2)=TJ(1)
x

x
x

s
Interpolate to
(2)
successor J
states of S
x

xx x

x
s

Page 3

Recall: VI with function approximation need not converge!

P(x2|x1,u) = 1; P(x2|x2,u) = 1
g(x1,u) = 0; g(x2,u) = 0;

Function approximator: [1 2] *
VI w/ least squares function approximation
diverges for > 5/6 [see last lecture for details]

Contractions

Fact. The Bellman operator, T, is a -contraction w.r.t.

the infinity norm, i.e.,
J1 , J2 :
T J1 T J2

J1 J2

Theorem. The Bellman operator has a unique fixed

point J* = TJ* and for all J we have that T(k)J converges
to J* for k going to infinity.
Note:

T (k) J J

=
T (k) J T (k) J

T (k1) J T (k1) J

k
J J

I.e., with every back-up, the innity norm distance to J decreases.

Page 4

Guarantees for fixed point

Theorem. Let J be the optimal value function for a nite MDP with discount
factor . Let the projection operator be a non-expansion w.r.t. the innity
norm and let J be any xed point of . Suppose
J J
. Then T
converges to a value function J such that:

J J

2
1

Proof

[See also Gordon 1995]

Page 5

Can we generally verify goodness of some

estimate J despite not having access to J*
Fact. Assume we have some J for which we have that
J T J
. Then

we have that
J J
1
.
Proof:

J J

J T J + T J T 2 J + T 2 J T 3 J + ... J

+
T J T 2 J

+
T 2 J T 3 J
+ ... +
T J J

J T J

+ + 2 + ...

Of course, in most (perhaps all) large scale settings in

which function approximation is desirable, it will be hard
to compute the bound on the infinity norm

What if the projection fails to be a non-expansion

Assume only introduces a little bit of noise, i.e.,

iterations i :
T J(i) T J(i)

Or, more generally, we have a noisy sequence of back-ups:

J (i+1) T J (i) + w(i) with the noise w(i) satisfying:

w(i)

Fact.
J (i) T i J
(1++. . .+ i1 ) and as a consequence lim supi
J (i)

J
1
.

Proof by induction:
Base case: We have
J (1) T J (0)
.
Induction: We also have for any i > 1:

T i J (0) J (i)

=
T T i1 J (0) T J (i1) w (i1)

+
T i1 J (0) J (i1)

+ ((1 + + 2 + . . . + (i2) ))

Page 6

Guarantees for greedy policy w.r.t.

approximate value function
Definition. is the greedy policy w.r.t. J if for all states s:

(s) arg min g(s, u) +
P (s |s, u)J (s )
u

Fact. Suppose that J satises

J J
. If is a greedy policy based on
J, then
2

J J

1

t
Here J = E[
t=0 g(st , (st ))].

[See also Bertsekas and Ttsitsiklis, 6.1.1]

Proof
Recall:
(T J )(s) = min g(s, u) +
u

P (s |s, u)J (s )

Similarly dene:
(T J )(s) = g(s, (s)) +

P (s |s, (s))J (s )

We have T J = J and (same result for MDP with only 1 policy available)
T J = J .
A very typical proof follows, with the main ingredients adding and subtracting the same terms to make terms pairwise easier to compare/bound:

J J

T J J

T J T J
+
T J J

J J
+
T J J

J J
+
J J
+
J J

J J
+ 2,

and the result follows.

Page 7

Recap function approximation

DP/VI with function approximation:

Iterate: J T J

Need not converge!

Guarantees when:

The projection is an infinity norm non-expansion

Bounded error in each projection/function
approximation step

In later lectures we will also study the policy iteration

and linear programming approaches

Reinforcement learning---very crude map

Exact methods w/full model available (e.g. Value

iteration/DP, policy iteration, LP)

Approximate DP w/model available

Sample states:

Use all sampled data in batch often reducible to

exact methods on an approximate transition model
Use incremental updates stochastic approximation
techniques might prove convergence to desired
solution

Page 8

Improving performance with a given value function

1. Multi-stage lookahead aka Receding/Moving horizon

Rather than using greedy policy w.r.t. approximate value

function, with

(st ) = arg minu g(s, u) + s P (s |s, u)J (s )
Two-stage lookahead:

At time t perform back-ups for all s which are successor states of st

Then use these backed up values to perform the back-up for st

N stage lookahead: similarly,perform back-ups to N-stages of

successor states of st backward in time
Cant guarantee N-stage lookahead provides better
performance [Can guarantee tighter infinity norm bound on
attained value function estimates by N-stage lookahead.]
Example application areas in which it has improved
performance chess, backgammon

Improving performance with a given value function

2. Roll-out policies

Given a policy , choose the current action u by

evaluating the cost encurred by taking action u
followed by executing the policy from then onwards
Guaranteed to perform better than the baseline policy
on top of which it builds (thanks to general
guarantees of policy iteration algorithm)

Baseline policy could be obtained with any method

Practicalities

Todo --- fill in

VI lends itself to parallellization

Multi-grid, Coarse-to-fine grid, Variable resolution grid

Prioritized sweeping

Richardson extrapolation

Kuhn triangulation

Prioritized sweeping

Dynamic programming (DP) / Value iteration (VI):

For i=0,1,
For all s S

J (i+1) (s) min g(s, u) +

P (s |s, u)J (i) (s )

Prioritized sweeping idea: focus updates on states for which the update is
expected to be most significant
Place states into priority queue and perform updates accordingly

For every Bellman update: compute the difference J^{(i+1)} J^{(i)}

Then update the priority of the states s from which one could transition into s
based upon the above difference and the transition probability of
transitioning into s

For details: See Moore and Atkeson, 1993, Prioritized sweeping: RL with less
data and less real time

Page 10

Richardson extrapolation

Generic method to improve the rate of convergence of a sequence

Assume h is the grid-size parameter in a discretization scheme

Assume we can approximate J(h)(x) as follows:

J (h) (x) = J(x) + J1 (x)h + o(h)

Similarly:

J (h/2) (x) = J(x) + J1 (x)h/2 + o(h)

Then we can get rid of the order h error term by using the following
approximation which combines both:

2J (h/2) (x) J (h) (x) = J(x) + o(h)

Kuhn triangulation

Allows efficient computation of the vertices participating in

a points barycentric coordinate system and of the convex
interpolation weights (aka the barycentric coordinates)

See Munos and Moore, 2001 for further details.

Page 11

Kuhn triangulation (from Munos and Moore)

Page 12

CS 287: Advanced Robotics

Fall 2009
Lecture 5: Control 4: Optimal control / Reinforcement learning--- function
approximation in dynamic programming
Pieter Abbeel
UC Berkeley EECS

Today

Recap + continuation of value iteration with function approximation

Performance boosts

Speed-ups

Intermezzo: Extremely crude outline of (part of) the reinforcement learning

field [as it might assist when reading some of the references]

Great references:
Gordon, 1995, Stable function approximation in dynamic programming
Tsitsiklis and Van Roy, 1996, Feature based methods for large scale dynamic programming
Bertsekas and Tsitsiklis, Neuro-dynamic programming, Chap. 6

Page 1

Recall: Discounted infinite horizon

(S, A, P, , g)

Markov decision process (MDP)

: discount factor

= (0 , 1 , . . .), k : S A

t
Value of a policy : J (x) = E[
t=0 g(x(t), u(t))|x0 = x, ]

Policy

Goal: find arg min J

Recall: Discounted infinite horizon

Dynamic programming (DP) aka Value iteration (VI):

For i=0,1,
For all s S

J (i+1) (s) min g(s, u) +

P (s |s, u)J (i) (s )

Facts:

J (i) J for i
There is an optimal stationary policy: = ( , , . . .) which satises:

(s) = arg min g(s, u) +
P (s |s, u)J (s)
u

Issue in practice: Bellmans curse of dimensionality: number of

states grows exponentially in the dimensionality of the state space

Page 2

DP/VI with function approximation

Pick some S S [typically the idea is that |S | << |S|].
Iterate for i = 0, 1, 2, . . .:

back-ups:s S : J(i+1) (s) min g(s, u) +
P (s |s, u)J(i) (s )
uA

projection: nd some

(i+1)

such that s S

J(i+1) (s) = (J(i+1) )(s) J(i+1) (s)

Projection enables generalization to s S \ S , which in

turn enables the Bellman back-ups in the next iteration.
parameterizes the class of functions used for
approximation of the cost-to-go function

Example --- piecewise linear

actual computation

abstract
J(1)=TJ(0)

back-up

back-up for
s S

J(1)=TJ(0)
x x
x

x
s

s
J(1)

project

Interpolate to
J(1)
successor
x
states of S

x xx x

x
s

s
J(2)=TJ(1)

back-up

s
J(2)

project

back-up for
s S

J(2)=TJ(1)
x

x
x

s
Interpolate to
(2)
successor J
states of S
x

xx x

x
s

Page 3

Recall: VI with function approximation need not converge!

P(x2|x1,u) = 1; P(x2|x2,u) = 1
g(x1,u) = 0; g(x2,u) = 0;

Function approximator: [1 2] *
VI w/ least squares function approximation
diverges for > 5/6 [see last lecture for details]

Contractions

Fact. The Bellman operator, T, is a -contraction w.r.t.

the infinity norm, i.e.,
J1 , J2 :
T J1 T J2

J1 J2

Theorem. The Bellman operator has a unique fixed

point J* = TJ* and for all J we have that T(k)J converges
to J* for k going to infinity.
Note:

T (k) J J

=
T (k) J T (k) J

T (k1) J T (k1) J

k
J J

I.e., with every back-up, the innity norm distance to J decreases.

Page 4

Guarantees for fixed point

Theorem. Let J be the optimal value function for a nite MDP with discount
factor . Let the projection operator be a non-expansion w.r.t. the innity
norm and let J be any xed point of . Suppose
J J
. Then T
converges to a value function J such that:

J J

2
1

Proof

[See also Gordon 1995]

Page 5

Can we generally verify goodness of some

estimate J despite not having access to J*
Fact. Assume we have some J for which we have that
J T J
. Then

we have that
J J
1
.
Proof:

J J

J T J + T J T 2 J + T 2 J T 3 J + ... J

+
T J T 2 J

+
T 2 J T 3 J
+ ... +
T J J

J T J

+ + 2 + ...

Of course, in most (perhaps all) large scale settings in

which function approximation is desirable, it will be hard
to compute the bound on the infinity norm

What if the projection fails to be a non-expansion

Assume only introduces a little bit of noise, i.e.,

iterations i :
T J(i) T J(i)

Or, more generally, we have a noisy sequence of back-ups:

J (i+1) T J (i) + w(i) with the noise w(i) satisfying:

w(i)

Fact.
J (i) T i J
(1++. . .+ i1 ) and as a consequence lim supi
J (i)

J
1
.

Proof by induction:
Base case: We have
J (1) T J (0)
.
Induction: We also have for any i > 1:

T i J (0) J (i)

=
T T i1 J (0) T J (i1) w (i1)

+
T i1 J (0) J (i1)

+ ((1 + + 2 + . . . + (i2) ))

Page 6

Guarantees for greedy policy w.r.t.

approximate value function
Definition. is the greedy policy w.r.t. J if for all states s:

(s) arg min g(s, u) +
P (s |s, u)J (s )
u

Fact. Suppose that J satises

J J
. If is a greedy policy based on
J, then
2

J J

1

t
Here J = E[
t=0 g(st , (st ))].

[See also Bertsekas and Ttsitsiklis, 6.1.1]

Proof
Recall:
(T J )(s) = min g(s, u) +
u

P (s |s, u)J (s )

Similarly dene:
(T J )(s) = g(s, (s)) +

P (s |s, (s))J (s )

J J

T J J

T J T J
+
T J J

J J
+
T J J

J J
+
J J
+
J J

J J
+ 2,

and the result follows.

Page 7

Recap function approximation

DP/VI with function approximation:

Iterate: J T J

Need not converge!

Guarantees when:

The projection is an infinity norm non-expansion

Bounded error in each projection/function
approximation step

In later lectures we will also study the policy iteration

and linear programming approaches

Reinforcement learning---very crude map

Exact methods w/full model available (e.g. Value

iteration/DP, policy iteration, LP)

Approximate DP w/model available

Sample states:

Use all sampled data in batch often reducible to

exact methods on an approximate transition model
Use incremental updates stochastic approximation
techniques might prove convergence to desired
solution

Page 8

Improving performance with a given value function

1. Multi-stage lookahead aka Receding/Moving horizon

Rather than using greedy policy w.r.t. approximate value

function, with

(st ) = arg minu g(s, u) + s P (s |s, u)J (s )
Two-stage lookahead:

At time t perform back-ups for all s which are successor states of st

Then use these backed up values to perform the back-up for st

N stage lookahead: similarly,perform back-ups to N-stages of

Improving performance with a given value function

2. Roll-out policies

Given a policy , choose the current action u by

Baseline policy could be obtained with any method

Practicalities

Todo --- fill in

VI lends itself to parallellization

Multi-grid, Coarse-to-fine grid, Variable resolution grid

Prioritized sweeping

Richardson extrapolation

Kuhn triangulation

Prioritized sweeping

Dynamic programming (DP) / Value iteration (VI):

For i=0,1,
For all s S

J (i+1) (s) min g(s, u) +

P (s |s, u)J (i) (s )

Prioritized sweeping idea: focus updates on states for which the update is
expected to be most significant
Place states into priority queue and perform updates accordingly

For every Bellman update: compute the difference J^{(i+1)} J^{(i)}

Then update the priority of the states s from which one could transition into s
based upon the above difference and the transition probability of
transitioning into s

For details: See Moore and Atkeson, 1993, Prioritized sweeping: RL with less
data and less real time

Page 10

Richardson extrapolation

Generic method to improve the rate of convergence of a sequence

Assume h is the grid-size parameter in a discretization scheme

Assume we can approximate J(h)(x) as follows:

J (h) (x) = J(x) + J1 (x)h + o(h)

Similarly:

J (h/2) (x) = J(x) + J1 (x)h/2 + o(h)

Then we can get rid of the order h error term by using the following
approximation which combines both:

2J (h/2) (x) J (h) (x) = J(x) + o(h)

Kuhn triangulation

Allows efficient computation of the vertices participating in

a points barycentric coordinate system and of the convex
interpolation weights (aka the barycentric coordinates)

See Munos and Moore, 2001 for further details.

Page 11

Kuhn triangulation (from Munos and Moore)

Page 12

CS 287: Advanced Robotics

Fall 2009
Lecture 6: Control 5: Optimal control --- [Function approximation in dynamic
programming---special case: quadratic]
Pieter Abbeel
UC Berkeley EECS

PHD

Page 1

Announcements

Will there be lecture this Thursday (Sept 24)?

Yes.

No office hours this Thursday (as I am examining

students for prelims).
Feel free to schedule an appointment by email instead.

Announcements

Final project contents:

Original investigation into an area that relates sufficiently closely to

the course.

Could be algorithmic/theoretical idea

Could be application of existing algorithm(s) to a platform or domain in
which these algorithms carry promise but have not been applied
Alternatively: Significant improvement for an existing (or new)
assignment for this course or for an existing (or new) assignment for
188 which has close ties to this course.

Ideally: we are able to identify a topic that relates both to your ongoing PhD research and the course.
You are very welcome to come up with your own project ideas, yet
make sure to pass them by me **before** you submit your abstract.
Feel free to stop by office hours or set an appointment (via email) to
discuss potential projects.

Page 2

Announcements

Final project logistics:

Final result: 6-8 page paper.

Milestones:

Should be structured like a conference paper, i.e., focus on the

problem setting, why it matters, what is interesting/unsolved about it,
your approach, results, analysis, and so forth. Cite and briefly survey
prior work as appropriate, but dont re-write prior work when not
directly relevant to understand your approach.
Oct. 9th, 23:59: **Approved-by-me** abstracts due: 1 page description
of project + goals for milestone. Make sure to sync up with me before
then!
Nov 9th, 23:59: 1 page milestone report due
Dec 3rd, In-class project presentations [tentatively]
Dec 11th, 23:59: Final paper due

1 or 2 students/project. If you are two students on 1 final project,

I will expect twice as much research effort has gone into it!

Bellmans curse of dimensionality

n-dimensional state space

Number of states grows exponentially in n

In practice

Discretization is considered only computationally

feasible up to 5 or 6 dimensional state spaces even
when using

Variable resolution discretization

Very fast implementations

Page 3

Today

Linear Quadratic (LQ) setting --- special case: can solve continuous
optimal control problem exactly

Great reference:
[optional] Anderson and Moore, Linear Quadratic Methods --- standard reference for LQ
setting

Linear Quadratic Regulator (LQR)

The LQR setting assumes a linear dynamical system:
xt+1 = Axt + But ,
xt : state at time t
ut : input at time t
It assumes a quadratic cost function:

g(xt , ut ) = x
t Qxt + ut Rut

with Q 0, R 0.
For a square matrix X we have X 0 if and only if for all vectors z we
have z Xz > 0. Hence there is a non-zero cost for any state dierent from the
all-zeros state, and any input dierent from the all-zeros input.

Page 4

While LQ assumptions might seem very restrictive,

we will see the method can be made applicable
for non-linear systems, e.g., helicopter.

Value iteration

Back-up step for i+1 steps to go:

LQR:

= min x Qx + u Ru + Ji (Ax + Bu)
u

Page 5

LQR value iteration: J1

Initialize J0 (x) = x P0 x.
J1 (x)

= min x Qx + u Ru + J0 (Ax + Bu)
u

= min x Qx + u Ru + (Ax + Bu) P0 (Ax + Bu)
u

(1)

To nd the minimum over u, we set the gradient w.r.t. u equal to zero:

u [. . .] = 2Ru + 2B P0 (Ax + Bu) = 0,
hence: u = (R + B P0 B)1 B P0 Ax

(2)

(2) into (1): J1 (x) = x P1 x

for: P1
K1

= Q + K1 RK1 + (A + BK1 ) P0 (A + BK1 )

= (R + B P0 B)1 B P0 A.

LQR value iteration: J1 (ctd)

In summary:

J0 (x) = x P0 x
xt+1 = Axt + But
g(x, u) = u Ru + x Qx
J1 (x) =

x P1 x

for: P1

Q + K1 RK1 + (A + BK1 ) P0 (A + BK1 )

(R + B P0 B)1 B P0 A.

J1(x) is quadratic, just like J0(x).

Value iteration update is the same for all times and can be done
in closed form!
J2 (x) = x P2 x
for: P2 = Q + K2 RK2 + (A + BK2 ) P1 (A + BK2 )
K2

= (R + B P1 B)1 B P1 A.

Page 6

Value iteration solution to LQR

Set P0 = 0.
for i = 1, 2, 3, . . .
Ki
Pi

= (R + B Pi1 B)1 B Pi1 A

= Q + Ki RKi + (A + BKi ) Pi1 (A + BKi )

The optimal policy for a i-step horizon is given by:

(x) = Ki x
The cost-to-go function for a i-step horizon is given by:
Ji (x) = x Pi x.

LQR assumptions revisited

xt+1

= Axt + But

g(xt , ut ) = x
t Qxt + ut Rut

= for keeping a linear system at the all-zeros state.

Extensions which make it more generally applicable:

Affine systems

System with stochasticity

Regulation around non-zero fixed point for non-linear systems

Penalization for change in control inputs

Linear time varying (LTV) systems

Trajectory following for non-linear systems

Page 7

LQR Ext0: Affine systems

xt+1 = Axt + But + c

g(xt , ut ) = x
t Qxt + ut Rut

Optimal control policy remains linear, optimal cost-to-go

function remains quadratic
Two avenues to do derivation:

1. Work through the DP update as we did for standard setting

2. Redefine the state as: z_t = [x_t; 1], then we have:

xt+1
A c
xt
B
zt+1 =
=
+
ut = A zt + B ut
1
0 1
1
0

LQR Ext1: stochastic system

xt+1

= Axt + But + wt

g(xt , ut ) = x
t Qxt + ut Rut

wt , t = 0, 1, . . . are zero mean and independent

Exercise: work through similar derivation as we did for

the deterministic case.
Result:

Same optimal control policy

Cost-to-go function is almost identical: has one additional term
which depends on the variance in the noise (and which cannot
be influenced by the choice of control inputs)

Page 8

LQR Ext2: non-linear systems

Nonlinear system:

xt+1 = f(xt , ut )

We can keep the system at the state x* iff

u s.t. x = f(x , u )
Linearizing the dynamics around x* gives:
f
f
xt+1 f (x , u ) +
(x , u )(xt x ) +
(x , u )(ut u )
x
u
B
A
Equivalently:

xt+1 x A(xt x ) + B(ut u )

Let zt = xt x* , let vt = ut u*, then:

zt+1 = Azt + Bvt ,

cost = zt Qzt + vt Rvt

[=standard LQR]

vt = Kzt ut u = K(xt x ) ut = u + K(xt x )

LQR Ext3: penalize for change in control inputs

Standard LQR:

xt+1

= Axt + But

g(xt , ut ) = x
t Qxt + ut Rut

When run in this format on real systems: often high frequency

control inputs get generated. Typically highly undesirable and
results in poor control performance.
Why?

Solution: frequency shaping of the cost function. (See, e.g.,

Anderson and Moore.)
Simple special case which works well in practice: penalize for
change in control inputs. ---- How ??

Page 9

LQR Ext3: penalize for change in control inputs

Standard LQR:
= Axt + But

xt+1

g(xt , ut ) = x
t Qxt + ut Rut

How to incorporate the change in controls into the

cost/reward function?

Soln. method A: explicitly incorporate into the state and the

reward function, and re-do the derivation based upon value
iteration.
Soln. method B: change of variables to fit into the standard LQR
setting.

LQR Ext3: penalize for change in control inputs

Standard LQR:
xt+1 = Axt + But

g(xt , ut ) = x
t Qxt + ut Rut

Introducing change in controls u:

xt+1
ut+1

A B
0 I

xt+ =

cost = (x Q x + u R u)

xt
ut1

Q =

B
I

Q 0
0 R

R = penalty for change in controls

[If R=0, then equivalent to standard LQR.]

Page 10

LQR Ext4: Linear Time Varying (LTV) Systems

xt+1
g(xt , ut )

= At xt + Bt ut

= x
t Qt xt + ut Rt ut

LQR Ext4: Linear Time Varying (LTV) Systems

Set P0 = 0.
for i = 1, 2, 3, . . .
Ki

= (RHi + BHi
Pi1 BHi )1 BHi
Pi1 AHi

= QHi + Ki RHi Ki + (AHi + BHi Ki ) Pi1 (AHi + BHi Ki )

The optimal policy for a i-step horizon is given by:

(x) = Ki x
The cost-to-go function for a i-step horizon is given by:
Ji (x) = x Pi x.

Page 11

LQR Ext5: Trajectory following for non-linear systems

A state sequence x, x, , xH* is a feasible target

trajectory iff
u0 , u1 , . . . , uH1 : t {0, 1, . . . , H 1} : xt+1 = f (xt , ut )

Problem statement:
minu0 ,u1 ,...,uH1

H1
t=0

(xt xt ) Q(xt xt ) + (ut ut ) R(ut ut )

s.t. xt+1 = f(xt , ut )

Transform into linear time varying case (LTV):

f
f
xt+1 f (xt , ut ) +
(xt , ut )(xt xt ) +
(xt , ut )(ut ut )
x
u
Bt
At

xt+1 xt At (xt xt ) + Bt (ut ut )

LQR Ext5: Trajectory following for non-linear systems

Transformed into linear time varying case (LTV):

minu0 ,u1 ,...,uH1

H1
t=0

(xt xt ) Q(xt xt ) + (ut ut ) R(ut ut )

s.t. xt+1 xt+1 At (xt xt ) + Bt (ut ut )

Now we can run the standard LQR back-up iterations.

Resulting policy at i time-steps from the end:

uHi uHi = Ki (xHi xHi )

The target trajectory need not be feasible to apply this technique,

however, if it is infeasible then the linearizations are not around the
(state,input) pairs that will be visited

Page 12

Most general cases

Methods which attempt to solve the generic optimal

control problem
minu

g(xt , ut )

t=0

subject to

xt+1 = f (xt , ut ) t

by iteratively approximating it and leveraging the fact

that the linear quadratic formulation is easy to solve.

Iteratively apply LQR

Initialize the algorithm by picking either (a) A control policy (0) or (b) A
(0)
(0)
(0)
(0)
(0)
(0)
sequence of states x0 , x1 , . . . , xH and control inputs u0 , u1 , . . . , uH . With
initialization (a), start in Step (1). With initialization (b), start in Step (2).
Iterate the following:
(1) Execute the current policy (i) and record the resulting state-input tra(i)
(i)
(i)
(i)
(i)
(i)
jectory x0 , u0 , x1 , u1 , . . . , xH , uH .
(2) Compute the LQ approximation of the optimal control around the obtained state-input trajectory by computing a rst-order Taylor expansion
of the dynamics model, and a second-order Taylor expansion of the cost
function.
(3) Use the LQR back-ups to solve for the optimal control policy (i+1) for
the LQ approximation obtained in Step (2).
(4) Set i = i + 1 and go to Step (1).

Page 13

Iterative LQR: in standard LTV format

Standard LTV is of the form zt+1 = At zt + Bt vt , g(z, v) = z Qz + v Rv.
(i)
(i)
Linearizing around (xt , ut ) in iteration i of the iterative LQR algorithm
gives us (up to rst order!):
(i)

(i)

xt+1 = f (xt , ut ) +

f (i) (i)
f (i) (i)
(i)
(i)
(x , ut )(xt xt ) +
(x , ut )(ut ut )
x t
u t

Subtracting the same term on both sides gives the format we want:
(i)

(i)

xt+1 xt+1 = f (xt , ut )xt+1 +

f (i) (i)
f (i) (i)
(i)
(i)
(x , ut )(xt xt )+ (xt , ut )(ut ut )
x t
u

Hence we get the standard format if using:

(i)

[xt xt

(ut ut )
f (i) (i)

(i)
(i)
(i)
x (xt , ut ) f (xt , ut ) xt+1
0
1
f (i) (i)
(x
,
u
)
t
t
u
0

(i)

Iteratively apply LQR: convergence

Need not converge as formulated!

Reason: the optimal policy for the LQ approximation

might end up not staying close to the sequence of
points around which the LQ approximation was
computed by Taylor expansion.
Solution: in each iteration, adjust the cost function so
this is the case, i.e., use the cost function
(i)

(i)

(1 )g(xt , ut ) + (xt xt 22 + ut ut 22 )

Assuming g is bounded, for close enough to one,

the 2nd term will dominate and ensure the
linearizations are good approximations around the
solution trajectory found by LQR.

Page 14

Iteratively apply LQR: practicalities

f is non-linear, hence this is a non-convex optimization

problem. Can get stuck in local optima! Good
initialization matters.
g could be non-convex: Then the LQ approximation fails
to have positive-definite cost matrices.

Iterative LQR: in standard LTV format

Standard LTV is of the form zt+1 = At zt + Bt vt , g(z, v) = z Qz + v Rv.
(i)
(i)
Linearizing around (xt , ut ) in iteration i of the iterative LQR algorithm
gives us (up to rst order!):
(i)

(i)

xt+1 = f (xt , ut ) +

f (i) (i)
f (i) (i)
(i)
(i)
(x , ut )(xt xt ) +
(x , ut )(ut ut )
x t
u t

Subtracting the same term on both sides gives the format we want:
(i)

(i)

xt+1 xt+1 = f (xt , ut )xt+1 +

f (i) (i)
f (i) (i)
(i)
(i)
(x , ut )(xt xt )+ (xt , ut )(ut ut )
x t
u

Hence we get the standard format if using:

(i)

[xt xt

(i)
(ut ut )
f (i) (i)
x (xt , ut )

(i)
(i)
f
u (xt , ut )

(i)

f (xt , ut ) xt+1
1

A similar derivation is needed to nd Q and R.

Page 15

Iterative LQR for trajectory following

While there is no need to follow this particular route, this is a (imho) particularly convenient way of turning the linearized and quadraticized approximation
in the iLQR iterations into the standard LQR format for the setting of trajectory
following with a quadratic penalty for deviation from the trajectory.
(i)
(i)
Let xt , ut be the state and control around which we linearize. Let xt , ut
be the target controls then we have:

xt+1

xt+1 xt+1

xt+1

xt+1 xt+1 ; 1

xt+1

f (i) (i)
f (i) (i)
(i)
(i)
(x , ut )(xt xt ) +
(x , ut )(ut ut )
x t
u t
f (i) (i)
f (i) (i)
(i)
(i)
(i)
(i)
f (xt , ut ) xt+1 +
(x , ut )(xt xt xt + xt ) +
(x , ut )(ut ut ut + ut )
x t
u t
f (i) (i)
f (i) (i)
(i)
(i)
(i)
f (xt , ut ) xt+1 +
(x , ut )(xt xt ) +
(x , ut )(xt xt )
x t
x t
f (i) (i)
f (i) (i)
(i)
+ (xt , ut )(ut ut ) +
(x , ut )(ut ut )
u
u t
A[(xt xt ); 1] + B(ut ut )
(i)

(i)

f (xt , ut ) +

For

(i)
(i)
f
(xt , ut )
x

f (xt , ut ) xt+1 +

(i)
(i)
f
(xt , ut )(xt
x

(i)

xt ) +

and
B=

(i)
(i)
f
(xt , ut )
u

(i)
(i)
f
(xt , ut )(ut
u

(i)

ut )

The cost function can be used as is: (xt xt ) Q(xt xt )+(ut ut ) R(ut ut ).

Differential Dynamic Programming (DDP)

Often loosely used to refer to iterative LQR procedure.

More precisely: Directly perform 2nd order Taylor expansion of the
Bellman back-up equation [rather than linearizing the dynamics and
2nd order approximating the cost]
Turns out this retains a term in the back-up equation which is
discarded in the iterative LQR approach
[Its a quadratic term in the dynamics model though, so even if cost is
convex, resulting LQ problem could be non-convex ]

[Typically cited book: Jacobson and Mayne, Differential dynamic

programming, 1970]

Page 16

Differential dynamic programming

Ji+1 (x) =

min
u

2nd order expansion of g around (x , u )

+Ji (f (x , u ))
dJ
+ (f(x, u) f(x , u ))
dx
d2 J
+(f (x, u) f (x , u )) 2 (f (x, u) f (x , u ))
dx

To keep entire expression 2nd order:

Use Taylor expansions of f and then remove all resulting
terms which are higher than 2nd order.
Turns out this keeps 1 additional term compared to
iterative LQR

Can we do even better?

Yes!
At convergence of iLQR and DDP, we end up with linearizations around
the (state,input) trajectory the algorithm converged to
In practice: the system could not be on this trajectory due to
perturbations / initial state being off / dynamics model being off /

Solution: at time t when asked to generate control input ut, we could resolve the control problem using iLQR or DDP over the time steps t
through H
Replanning entire trajectory is often impractical in practice: replan over
horizon h. = receding horizon control

This requires providing a cost to go J^{(t+h)} which accounts for all

future costs. This could be taken from the offline iLQR or DDP run

Page 17

Multiplicative noise

In many systems of interest, there is noise entering the

system which is multiplicative in the control inputs, i.e.:
xt+1 = Axt + (B + Bw wt )ut

Exercise: LQR derivation for this setting

[optional related reading:Todorov and Jordan, nips 2003]

Cart-pole

H(q)
q + C(q, q)
+ G(q) = B(q)u

mp l cos
mp l 2

0 mp l sin
C(q, q)
=
0 0

0
G(q) =
mp gl sin

1
B =
0
H(q) =

[See also Section 3.3 in Tedrake notes.]

Page 18

mc + mp
mp l cos

Cart-pole --- LQR

Q = diag([1;1;1;1]); R = 0; [x, theta, xdot, thetadot]

Page 19

Cart-pole --- LQR

Q = diag([1;1;1;1]); R = 1; [x, theta, xdot, thetadot]

Lyapunovs linearization method

We will not cover any details, but here is the basic result:
Assume x* is an equilibrium point for f(x), i.e., x* = f(x*).
If x* is an asymptotically stable equilibrium point for the
linearized system, then it is asymptotically stable for the
non-linear system.
If x* is unstable for the linear system, its unstable for the
non-linear system.
If x* is marginally stable for the linear system, no
conclusion can be drawn.
This provides additional justification for using linear control
design techniques for non-linear systems.
[See, e.g., Slotine and Li, or Boyd lecture notes (pointers available on course website) if you want to find out more.]

Page 20

CS 287: Advanced Robotics

Fall 2009
Lecture 8: Control 7: MPC --- Feedback Linearization --- Controllability --Lagrangian Dynamics
Pieter Abbeel
UC Berkeley EECS

Model predictive control (MPC)

Optimal control problem

Given a system with (stochastic) dynamics: xt+1 = f(xt , ut , wt ) Find the
optimal policy which minimizes the expected cost:
H

min E[
g(xt , ut )|]

t=0

MPC:
For t = 0, 1, 2, . . .
1. Solve
minut ,ut+1 ,...,uH

g(xk , uk )

k=t

s.t.

xk+1 = f (xk , uk , 0) k = t, t + 1, . . . , H 1

2. Execute ut from the solution found in (1).

In practice, one often ends up having to solve:

minut ,ut+1 ,...,ut+h1

t+h

g(xk , uk ) + g(xt+h , ut+h )

k=t

s.t.

xk+1 = f (xk , uk , 0) k = t, t + 1, . . . , t + h 1

Page 1

Single shooting

At core of MPC, need to quickly solve problems of the form:

minut ,ut+1 ,...,uH

g(xk , uk )

k=t

s.t.

xk+1 = f(xk , uk , 0) k = t, t + 1, . . . , H 1

Single shooting methods directly solve for

i.e., they solve:

Underneath, this typically boils down to iterating:

For the current

simulate and find the state sequence

Take the 1st (and 2nd) derivatives w.r.t.

Note: When taking derivatives, one ends up repeatedly applying the chain rule and the
same Jacobians keep re-occurring
Beneficial to not waste time re-computing same Jacobians; pretty straightforward,
various specifics with their own names. (E.g., back-propagation.)

Single shooting drawback

Numerical conditioning of the problem:

Influence on objective function of earlier actions vs.

later actions

What happens in case of a non-linear, unstable system?

Page 2

Multiple shooting/Direct collocation

Keep the state at each time in the optimization problem:

minut ,ut+1 ,...,uH ,xt ,xt+1 ,...,xH

g(xk , uk )

k=t

s.t.

xk+1 = f (xk , uk , 0) k = t, t + 1, . . . , H 1
hk (xk , uk ) 0 k = t, t + 1, . . . , H 1

Larger optimization problem, yet sparse structure.

Special case: Linear MPC: f linear, h, g convex
convex opt. problem, easily solved

Sequential Quadratic Programming (SQP)

Goal: solve
minut ,ut+1 ,...,uH ,xt ,xt+1 ,...,xH

g(xk , uk )

k=t

s.t.

xk+1 = f (xk , uk , 0) k = t, t + 1, . . . , H 1
hk (xk , uk ) 0 k = t, t + 1, . . . , H 1

SQP: Iterates over

Linearize f around current point (u

u, x), quadraticize g, h around
current point
Solve the resulting Quadratic Programming problem to find the
updated current point (u, x)

Corresponds to:

Write out the first-order necessary conditions for optimality (the

Karuhn-Kuhn-Tucker (KKT) conditions)
Apply Newtons method to solve the (typically non-linear) KKT
equations

Page 3

Sequential Quadratic Programming (SQP)

Not only method, but happens to be quite popular

Packages available, such as SNOPT, SOCS.

Many choices underneath:

Quasi-Newton methods

Compared to single shooting:

Easier initialization (single shooting relies on control sequence)

Easy to incorporate constraints on state / controls

More variables, yet good algorithms leverage sparsity to offset this

SNOPT, ACADO, SOCS,

We have ignored:

Continuous time aspects

Details of optimization methods underneath --- matters in
practice b/c the faster the longer horizon
Theoretical guarantees

Page 4

Related intermezzo: Nonlinear control

applied to kite-based power generation
[Diehl + al.]

Many companies pursuing this: Makani, KiteGen, SkySails, AmpyxPower,

Number from Diehl et al.: For a 500m2 kite and 10m/s wind speed (in sim)
can produce an average power of more than 5MW
Technically interesting aspect in particular work of Diehl et al.: incorporate
open-loop stability into the optimization problem.

Only possible for non-linear systems

The criterion quantifies how much deviation from the nominal trajectory would
amplify/decrease in one cycle

Non-minimum phase
example
[Slotine and Li, p. 195, Example II.2]

Page 5

Feedback linearization

Page 6

Feedback linearization

Page 7

Feedback linearization

Slotine and Li, Chapter 6

Isidori, Nonlinear control systems, 1989.

Announcements

Reminder: No office hours today.

[Feel free to schedule over email instead]

Page 8

Controllability [defn., linear systems]

A system xt+1 = f(xt, ut) if for all x0 and all x, there exists a time k and
a control sequence u0, , uk-1 such that xt = x.

Fact. The linear system xt+1 = Axt + But with xt n is controllable iff
[B AB A2 B . . . An B] is full rank.

Lagrangian dynamics

[From: Tedrake Appendix A]

Page 9

Lagrangian dynamics: example

Page 10

CS 287: Advanced Robotics

Fall 2009
Lecture 9: Reinforcement Learning 1: Bandits
Pieter Abbeel
UC Berkeley EECS

Reinforcement Learning

Model: Markov decision process (S, A, T, R, )

Goal: Find that maximizes expected sum of rewards

T and R might be unknown

[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Page 1

Exploration vs. exploitation

= classical dilemma in reinforcement learning

A conceptual solution: Bayesian approach:

State space = { x : x = probability distribution over T, R}

For known initial state --- tree of sufficient statistics could suffice

Transition model: describes transitions in new state space

Reward = standard reward

Today: one particular setting in which the Bayesian

solution is in fact computationally practical

Multi-armed bandits

Slot machines

Clinical trials

Advertising

Merchandising

Page 2

Multi-armed bandits

Information state

Page 3

Semi-MDP

Transition model:

Objective:

Bellman update:

Optimal stopping

A specialized version of the Semi-MDP is is the Optimal

stopping problem. At each of the times, we have two
choices:

1. continue
2. stop and accumulate reward g for current time and
for all future times.

The optimal stopping problem has the following Bellman

update:

Page 4

Optimal stopping

Optimal stopping Bellman update:

Hence, for fixed g, we can find the value of each state in the optimal
stopping problem by dynamic programming
However, we are interested in g*(s) for all s:

Note: is a random variable, which denotes the stopping time. It is the

policy in this setting.
Any stopping policy can be represented as a set of states in which we
decided to stop. The random variable takes on the value = time when we
first visit a state in the stopping set.

Optimal stopping

One approach:

Solve the optimal stopping problem for many values

of g, and for each state keep track of the smallest
value of g which causes stopping

Page 5

Reward rate

Expected reward rate

Basic idea to find g*

Page 6

Finding the optimal stopping costs

Solving the multi-armed bandit

(i)
1. Find the optimal stopping cost g (st ) for each
bandits current state

2. When asked to act, pull an arm i such that

Page 7

Key requirements

Reward at time t only depends on state of Mi at time t

When pulling Mi, only state of Mi changes

Note: M_i need not just be a bandit; we just need to be

able to compute its optimal stopping cost

Example: cashiers nightmare

P(i|j)

P(i|j): probability of joining queue

i after being served in queue j
c_i : cost of a customer being in
queue i

Page 8

Lai and Robbins, 1985

Auer +al, UCB algorithm (1998)

Type of result: after n plays, the regret is bounded by an expression O(log n)

After n plays the regret is defined by:

n
j E[Tj (n)]where = max j
j

Loosening and strengthening assumptions, e.g.,

Guha, S., K. Munagala. 2007. Approximation algorithms for budgeted learning

problems. STOC 07.

Various Robert Kleinberg publications

contextual bandit setting

Page 9

CS 287: Advanced Robotics

Fall 2009
Lecture 11: Reinforcement Learning
Pieter Abbeel
UC Berkeley EECS

Reinforcement Learning

Model: Markov decision process (S, A, T, R, )

Goal: Find that maximizes expected sum of rewards

T and R might be unknown

[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Page 1

Examples
goal: max E [ t t R(st, at) | ]

MDP (S, A, T, , R),

Cleaning robot

Walking robot

Pole balancing

Games: tetris, backgammon

Server management

Shortest path problems

Model for animals, people

Canonical Example: Grid World

The agent lives in a grid
Walls block the agents path
The agents actions do not
always go as planned:
80% of the time, the action
North takes the agent North
(if there is no wall there)
10% of the time, North takes
the agent West; 10% East
If there is a wall in the direction
the agent would have been
taken, the agent stays put

Big rewards come at the end

Page 2

Solving MDPs

In deterministic single-agent search problem, want an

optimal plan, or sequence of actions, from start to a goal
In an MDP, we want an optimal policy *: S A

A policy gives an action for each state

An optimal policy maximizes expected utility if followed

Defines a reflex agent

Example Optimal Policies

R(s) = -0.02

R(s) = -0.04

R(s) = -0.1

R(s) = -2.0

Page 3

Outline current and next few lectures

Recap and extend exact methods

Value iteration

Policy iteration

Generalized policy iteration

Linear programming [later]

Additional challenges we will address by building on top

of the above:

Unknown transition model and reward function

Very large state spaces

Value Iteration

Algorithm:

Start with V0(s) = 0 for all s.

Given Vi, calculate the values for all states for depth i+1:

This is called a value update or Bellman update/back-up

Repeat until convergence

Page 4

Example: Bellman Updates

Example: Value Iteration

Information propagates outward from terminal

states and eventually all states have correct
value estimates

Page 5

Convergence
Infinity norm: V = maxs |V (s)|

Fact. Value iteration converges to the optimal value function V which satisfies
the Bellman equation:

s S : V (s) = max
T (s, a, s )(R(s, a, s ) + V (s ))
a

Or in operator notation: V = T V where T denotes the Bellman operator.

Fact. If an estimate V satisfies V T V then we have that

V V

Practice: Computing Actions

Which action should we chose from state s:

Given optimal values V*?

= greedy action with respect to V*

= action choice with one step lookahead w.r.t. V*

Page 6

Policy Iteration

Alternative approach:

Step 1: Policy evaluation: calculate value function for

a fixed policy (not optimal!) until convergence
Step 2: Policy improvement: update policy using onestep lookahead with resulting converged (but not
optimal!) value function
Repeat steps until policy converges

This is policy iteration

Its still optimal!

Can converge faster under some conditions

Policy Iteration

Policy evaluation: with fixed current policy , find values

with simplified Bellman updates:

Iterate until values converge

Policy improvement: with fixed utilities, find the best

action according to one-step look-ahead

Page 7

Comparison

Value iteration:

Policy iteration:

Several passes to update utilities with frozen policy

Occasional passes to update policies

Generalized policy iteration:

Every pass (or backup) updates both utilities (explicitly, based

on current utilities) and policy (possibly implicitly, based on
current policy)

General idea of two interacting processes revolving around an

approximate policy and an approximate value

Asynchronous versions:

Any sequences of partial updates to either policy entries or

utilities will converge if every state is visited infinitely often
15

Page 8

CS 287: Advanced Robotics

Fall 2009
Lecture 12: Reinforcement Learning
Pieter Abbeel
UC Berkeley EECS

Outline

LP approach for finding the optimal value function of

MDPs
Model-free approaches

Page 1

Solving an MDP with linear programming

Page 2

Solving an MDP with linear programming

The dual LP

Page 3

The dual LP: interpretation

max
0

T (s, a, s )(s, a)R(s, a, s )

s,a,s

s.t. s

(s, a) = c(s) +

(s , a)T (s , a, s)

s ,a

Meaning (s,a) ?

Meaning c(s) ?

LP approach recap
The optimal value function satisfies:

s : V (s) = max
T (s, a, s ) [R(s, a, s) + V (s )] .
a

We can relax these non-linear equality constraints to inequality constraints:

s : V (s) max
T (s, a, s ) [R(s, a, s) + V (s )] .
a

Equivalently, (x maxi yi is equivalent to i x yi ), we have:

s, a : V (s)
T (s, a, s ) [R(s, a, s) + V (s )] . (1)
s

The relaxation still has the optimal value function as one of its solutions, but we
might have introduced new solutions. So we look for an objective function that
will favor the optimal value function over other solutions of (1). To this extent,
we observed the following monotonicity property of the Bellman operator T :
s V1 (s) V2 (s) implies : s (T V1 )(s) (T V2 )(s)
Any solution to (1) satisfies V T V , hence also: T V T 2 V , hence also:
T 2 V T 3 V ... T 1 V T V = V . Stringing these together, we get for
any solution V of (1) that the following holds:
V V
Hence to find V as the solution to (1), it suffices to add an objective function
which favors the smallest solution:

min c V s.t.s, a : V (s)
T (s, a, s) [R(s, a, s ) + V (s)] . (2)
V

If c(s) > 0 for all s, the unique solution to (2) is V .

Taking the Lagrange dual of (2), we obtain another interesting LP:

max
T (s, a, s)(s, a)R(s, a, s )
0

s,a,s

s.t. s

(s, a) = c(s) +

(s , a)T (s , a, s)

s ,a

Page 4

Announcements

PS 1: posted on class website, due Monday October 26.

Final project abstracts due tomorrow.

Value iteration:

Start with V0(s) = 0 for all s. Iterate until convergence:

Policy iteration:

Policy evaluation: Iterate until values converge

Policy improvement:

Generalized policy iteration:

Any interleaving of policy evaluation and policy improvement

Note: for particular choice of interleaving value iteration

Linear programming:

Page 5

What if T and R unknown

Model-based reinforcement learning

Estimate model from experience

Solve the MDP as if the model were correct

Model-free reinforcement learning

Adaptations of the exact algorithms which only require (s, a, r, s)

traces [some of them use (s, a, r, s, a)]
No model is built in the process

Sample Avg to Replace Expectation?

Who needs T and R? Approximate the

expectation with samples (drawn from T!)

Problem: We need
to estimate these
too!

Page 6

Sample Avg to Replace Expectation?

We could estimate V(s) for all states simultaneously:

Sample of V(s):
Update to V(s):
Same update:

Old updates will use very poor estimates of V(s)

This will surely affect our estimates of V(s) initially,

but will this also affect our final estimate?

Sample Avg to Replace Expectation?

Big idea: why bother learning T?

Update V(s) each time we experience (s,a,s)

Likely s will contribute updates more often

Temporal difference learning ( TD or TD(0) )

Policy still fixed!

Move values toward value of whatever
successor occurs: running average!

Sample of V(s):
Update to V(s):
Same update:

Page 7

Exponential Moving Average

Weighted averages emphasize certain samples

Exponential moving average

Makes recent samples more important

Forgets about the past (which contains mistakes in TD)

Easy to compute from the running average

Decreasing learning rate can give converging averages

TD(0) for estimating V

Note: this is really V

Page 8

Convergence guarantees for TD(0)

Convergence with probability 1 for the states which are

visited infinitely often if the step-size parameter
decreases according to the usual stochastic
approximation conditions

k =

k=0

k2 <

k=0

Examples:

1/k

C/(C+k)

Experience replay

If limited number of trials available: could repeatedly go

through the data and perform the TD updates again

Under this procedure, the values will converge to the

values under the empirical transition and reward model.

Page 9

CS 287: Advanced Robotics

Fall 2009
Lecture 13: Reinforcement Learning
Pieter Abbeel
UC Berkeley EECS

Outline

Model-free approaches

Recap TD(0)

Sarsa

Q learning

TD(), sarsa(), Q()

Function approximation and TD

TD Gammon

Page 1

TD(0) for estimating V

Stochastic version of the policy evaluation update:

Note: this is really V

Problems with TD Value Learning

TD value leaning is model-free for policy evaluation

However, if we want to turn our value estimates into a
policy---as required for a policy update step---were sunk:

Idea: learn Q-values directly

Makes action selection model-free too!

Page 2

Update Q values directly

When experiencing st, at, st+1, rt+1, at+1 perform the following sarsa
update:

Q (st , at ) (1 )Q (st , at ) + [r(st , at , st+1 ) + Q (st+1 , at+1 )]

Q (st , at ) + [r(st , at , st+1 ) + Q (st+1 , at+1 ) Q (st , at )]

Will find the Q values for the current policy .

How about Q(s,a) for action a inconsistent with the policy at state s?

Converges (w.p. 1) to Q function for current policy for all states and
actions *if* all states and actions are visited infinitely often (assuming
proper step-sizing)

Exploration aspect

To ensure convergence for all Q(s,a) we need to visit

every (s,a) infinitely often
The policy needs to include some randomness

Simplest: random actions ( greedy)

Every time step, flip a coin

With probability , act randomly
With probability 1-, act according to some current policy

This results in a new policy

We end up finding the Q values for this new policy

Page 3

Does policy iteration still work when we

execute epsilon greedy policies?

Policy iteration iterates:

Evaluate value of current policy V

Improve policy by choosing the greedy policy w.r.t. V

Answer: Using the epsilon greedy policies can be

interpreted as running policy iteration w.r.t. a related
MDP which differs slighty in its transition model: with
probability the transition is according to a random
action in the new MDP

Need not wait till convergence with the

policy improvement step

Recall: Generalized policy iteration methods: interleave

policy improvement and policy evaluation and
guaranteed to converge to the optimal policy as long as
value for every state updated infinitely often

Sarsa: continuously update the policy by choosing

actions greedy w.r.t. the current Q function

Page 4

Sarsa: updates Q values directly

Sarsa converges w.p. 1 to an optimal policy and actionvalue function as long as all state-action pairs are visited an
infinite number of times and the policy converges in the limit
to the greedy policy (which can be arranged, e.g., by having
greedy policies with = 1 / t).

Q learning

Directly approximate the optimal Q function Q*:

Q(st , at ) (1 )Q(st , at ) + r(st , at , st+1 ) + max Q (st+1 , a)
a

Compare to sarsa:

Q (st , at ) (1 )Q (st , at ) + [r(st , at , st+1 ) + Q (st+1 , at+1 )]

Page 5

Q learning

Q-Learning Properties

Will converge to optimal Q function if

Every (s,a) visited infinitely often

is chosen to decay according to standard stochastic
approximation requirements

Neat property: learns optimal Q-values regardless of

policy used to collect the experience

Off policy method

Strictly better than TD, sarsa? Some caveats.

Page 6

Behaviour of Q-learning vs. sarsa

Reward = 0 at goal; -100 in cliff region; -1 everywhere else

= 0.1

Exploration / Exploitation

Several schemes for forcing exploration

Simplest: random actions ( greedy)

Every time step, flip a coin

With probability , act randomly
With probability 1-, act according to current policy

Problems with random actions?

You do explore the space, but keep thrashing around once

learning is done
Takes a long time to explore certain spaces
One solution: lower over time
Another solution: exploration functions

Page 7

Exploration Functions

When to explore

Random actions: explore a fixed amount

Better idea: explore areas whose badness is not (yet)
established

Exploration function

Takes a value estimate and a count, and returns an optimistic

utility, e.g.
(exact form not important---for
optimality guarantees: it should guarantee that every (s,a) is
visited infinitely often _or_ that Q(s,a) is always optimistic)

TD() --- motivation (grid world)

Page 8

TD() --- motivation

TD() backward view

t+1:

V (st ) V (st ) + [R(st ) + V (st+1 ) V (st )]

V (st+1 ) V (st+1 ) + [R(st+1 ) + V (st+2 ) V (st+1 )]

+also perform: V (st ) V (st ) + t+1

t+2:

+also:

V (st+2 ) V (st+2 ) + [R(st+2 ) + V (st+3 ) V (st+2 )]

V (st+1 ) V (st+1 ) + t+2

V (st ) V (st ) + 2 2 t+2

Page 9

TD() --- backward view wordy

V (st ) V (st ) + [R(st ) + V (st+1 ) V (st)]

t

Similarly, the update at the next time step is

V (st+1 ) V (st+1 ) + (R(st+1 ) + V (st+2 ) V (st+1 )

t+1

Note that at the next time step we update V (st+1 ). This (crudely speaking)
results in having a better estimate of the value function for state st+1 . TD()
takes advantage of the availability of this better estimate to improve the update
we performed for V (st ) in the previous step of the algorithm. Concretely, TD()
performs another update on V (st ) to account for our improved estimate of
V (st+1 ) as follows:
V (st ) V (st ) + t+1
where is a fudge factor that determines how heavily we weight changes in
the value function for st+1 .
Similarly, at time t + 2 we perform the following set of updates:
V (st+2 ) V (st+2 ) + [R(st+2 ) + V (st+2 ) V (st+3 )] ...

t+2

V (st+1 ) V (st+1 ) + t+2

V (st) V (st ) + 2 2 t+2

e(st )

The term e(st ) is called the eligibility vector.

TD()

Page 10

TD() --- example

Random walk over 19 states. Left and rightmost states are
sinks. Rewards always zero, except when entering right sink.
0

TD() --- forward view

V (st ) (1 )V (st ) + sample

TD:

Sample =

R(st ) + V (st+1 )
R(st ) + R(st+1 ) + 2 V (st+2 )
R(st ) + R(st+1 ) + 2 R(st+2 ) + 3 V (st+3 )

...
R(st ) + R(st+1 ) + 2 R(st+2 ) + . . . + T R(sT )

[0,1]
Forward view equivalent to backward view

Page 11

Sarsa()

Watkins Q()

Page 12

Replacing traces

What if a state is visited at two different times t1 and t2 ?

Recall TD()

Replacing traces: example 1

Random walk over 19 states. Left and rightmost states are
sinks. Rewards always zero, except when entering right sink.
0

Page 13

Replacing traces: example 2

Recap RL so far

When model is available:

VI, PI, GPI, LP

When model is not available:

Model-based RL: collect data, estimate model, run one of the

above methods for estimated model
Model-free RL: learn V, Q directly from experience:

TD(), sarsa(), Q()

What about large MDPs for which we cannot represent

all states in memory or cannot collect experience from
all states?
Function Approximation

Page 14

CS 287: Advanced Robotics

Fall 2009
Lecture 14: Reinforcement Learning with Function Approximation and TD
Gammon case study
Pieter Abbeel
UC Berkeley EECS

Assignment #1

Roll-out: nice example paper: X. Yan, P. Diaconis, P. Rusmevichientong,

and B. Van Roy, ``Solitaire: Man Versus Machine,'' Advances in Neural
Information Processing Systems 17, MIT Press, 2005.

Page 1

Recap RL so far

When model is available:

VI, PI, GPI, LP

When model is not available:

Model-based RL: collect data, estimate model, run one of the

above methods for estimated model
Model-free RL: learn V, Q directly from experience:

TD(), sarsa(): on policy updates

Q: off policy updates

What about large MDPs for which we cannot represent

all states in memory or cannot collect experience from
all states?
Function Approximation

Generalization and function approximation

Represent the value function using a parameterized

function V(s), e.g.:

Neural network: is a vector with the weights on the

connections in the network
Linear function approximator: V(s) = (s)

i (s) = exp

2 (s

si ) 1 (s si )

Radial basis functions

Tilings: (often multiple) partitions of the state space

Polynomials:

Fourier basis

[Note: most algorithms kernelizable]

Often also specifically designed for the problem at hand

ji ji

i (s) = x11 x22 . . . xnn

{1, sin(2 Lx11 ), cos(2 Lx11 ), sin(2 Lx22 ), cos(2 Lx22 ), . . .}

Page 2

Example: tetris
state: board configuration + shape of the falling piece ~2200
states!

action: rotation and translation applied to the falling piece

V (s) =

i=1 i i (s)

22 features aka basis functions i

Ten basis functions, 0, . . . , 9, mapping the state to the height h[k] of each
of the ten columns.
Nine basis functions, 10, . . . , 18, each mapping the state to the absolute
difference between heights of successive columns: |h[k+1] h[k]|, k = 1, .
. . , 9.
One basis function, 19, that maps state to the maximum column height:
maxk h[k]
One basis function, 20, that maps state to the number of holes in the
board.
One basis function, 21, that is equal to 1 in every state.

[Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis 1996 (TD);Kakade 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]

Objective

A standard way to find in supervised learning, optimize

MSE:
min

P (s) (V (s) V (s))2

Is this the correct objective?

Page 3

Evaluating the objective

When performing policy evaluation, we can obtain

samples by simply executing the policy and
recording the empirical (discounted) sum of rewards
min

encountered states s

2
V (s) V (s)

TD methods: use vt = rt + V (st+1 ) as a

substitute for V (s)

Stochastic gradient descent

Stochastic gradient descent to optimize MSE objective:

Iterate
Draw a state s according to P
Update:

12 (V (s) V (s))2 = + (V (s) V (s)) V (s)

TD(0): use vt = rt + V (st+1 ) as a substitute for V (s)

t+1 t + R(st , at , st+1 ) + Vt (st+1 ) Vt (st ) Vt (st )

Page 4

TD() with function approximation

time t:

t+1 t + R(st , at , st+1 ) + Vt (st+1 ) Vt (st ) Vt (st )

time t+1:

t+2 t+1 + R(st+1 , at+1 , st+2 ) + Vt+1 (st+2 ) Vt+1 (st+1 ) Vt+1 (st+1 )

t+2 t+2 + t+1 et

t+1

[improving previous update]

Combined:
t+1
et+1

= R(st+1 , at+1 , st+2 ) + V (st+2 ) Vt+1 (st+1 )

= et + t+1 Vt+1 (st+1 )

t+2

= t+1 + t+1 et+1

TD() with function approximation

Can similarly adapt sarsa() and Q() eligibility vectors

for function approximation

Page 5

Guarantees

Monte Carlo based evaluation:

Provides unbiased estimates under current policy

Will converge to true value of current policy

Temporal difference based evaluations:

TD(
) w/linear function approximation: [Tsitsiklis and Van Roy, 1997]
***If*** samples are generated from traces of execution of the policy , and for
appropriate choice of step-sizes , TD() converges and at convergence: [D =
expected discounted state visitation frequencies under policy ]
V V D

1
D V V D
1

Sarsa(
) w/linear function approximation: same as TD
Q w/linear function approximation: [Melo and Ribeiro, 2007] Convergence
to reasonable Q value under certain assumptions, including: s, a(s, a)1 1
[Could also use infinity norm contraction function approximators to attain
convergence --- see earlier lectures. However, this class of function
approximators tends to be more restrictive.]

Off-policy counterexamples

Bairds counterexample for off policy updates:

Page 6

Off-policy counterexamples

Tsitsiklis and Van Roy counterexample: complete backup with off-policy linear regression [i.e., uniform least
squares, rather than waited by state visitation rates]

Intuition behind TD(0) with linear

function approximation guarantees

Stochastic approximation of the following operations:

(T V )(s) =

Back-up:

Weighted linear regression:

T (s, (s), s ) [R(s, (s), s ) + V (s )]

min

s D(s)((T

V )(s) (s))2

with solution: = ( D)1 D(T V )

Key observations:

V1 , V2 : T V1 T V2 D V1 V2 D , here : xD =

V1 , V2 : D V1 D V2 D V1 V2 D

Page 7

D(i)x(i)2

Intuition behind TD() guarantees

Bellman operator:

(T J)(s) =

T operator:

P (s |s, (s)) [g(s) + J (s )] = E [g(s) + J(s )]

T J (s) = (1 )

m=0

m E

k=0

t g(sk ) + m+1 J(sm+1 )

T operator is contraction w.r.t. \| \|_D for all [0,1]

Should we use TD than well Monte Carlo?

At convergence:

V V D

1
D V V D
1

Page 8

Empirical comparison

(See, Sutton&Barto p.221 for details)

Backgammon

15 pieces, try go reach other side

Move according to roll of dice

If hitting an opponent piece: it gets reset to the middle row

Cannot hit points with two or more opponent pieces

Page 9

Backgammon

30 pieces, 24+2 possible locations

For typical state and dice roll: often 20 moves

TD Gammon [Tesauro 92,94,95]

Reward = 1 for winning the game

= 0 other states

Function approximator: 2 layer neural network

Page 10

Input features

For each point on the backgammon board, 4 input units

indicate the number of white pieces as follows:

1 piece unit1=1;

2 pieces unit1=1, unit2=1;

3 pieces unit1=1, unit2=1, unit3=1;

n>3 pieces unit1=1, unit2=1, unit3=1, unit4 = (n-3)/2

Similarly for black

[This already makes for 2424 = 192 input units.]

Last six: number of pieces on the bar (w/b), number of

pieces that completed the game (w/b), whites move,
blacks move

Neural net

(i), i = 1, . . . , 198

Each hidden unit computes:

h(j) = (

i wij (i))

1+exp(

Output unit computes:

o = ( j wj h(j)) =

1+exp

Overall: o = f ((1), . . . , (198); w)

Page 11

wij (i))

wj h(j)

Neural nets

Popular at that time for function approximation / learning

in general
Derivatives/Gradients are easily derived analytically

Turns out they can be computed through backward

error propagation --- name error backpropagation

Susceptible to local optima!

Learning

Initialize weights randomly

TD() [ = 0.7, = 0.1]

Source of games: self-play, greedy w.r.t. current value

function [results suggest game has enough
stochasticity built in for exploration purposes]

Page 12

Results

After 300,000 games as good as best previous

computer programs

TD Gammon 1.0: add Neurogammon features

Neurogammon: highly tuned neural network trained

on large corpus of exemplary moves

Substantially better than all previous computer

players; human expert level

TD Gammon 2.0, 2.1: selective 2-ply search

TD Gammon 3.0: selective 3-ply search, 160 hidden
units

Page 13

CS 287: Advanced Robotics

Fall 2009
Lecture 15: LSTD, LSPI, RLSTD, imitation learning
Pieter Abbeel
UC Berkeley EECS

TD(0) with linear function

approximation guarantees

Stochastic approximation of the following operations:

(T V )(s) =

Back-up:

Weighted linear regression:

T (s, (s), s ) [R(s, (s), s ) + V (s )]

min

D(s)((T V )(s) (s))2

Batch version (for large state spaces):

Let {(s,a,s)} have been sampled according to D

Iterate:

Back-up for sampled (s,a,s): V (s) [R(s, a, s ) + V (s )] = R(s, a, s ) + (s )

Perform regression:

min

(V (s) (s))2

(R(s, a, s ) + (old) (s ) (s))2

(s,a,s )

Page 1

TD(0) with linear function

approximation guarantees

Iterate:
(new) = arg min

(R(s, a, s ) + (old) (s ) (s))2

(s,a,s )

Can we find the fixed point directly?

Rewrite the least squares problem in matrix notation:

(new) = arg min R + (old) 22

Solution:

(new) = ( )1 (R + (old) )

TD(0) with linear function

approximation guarantees
(new) = ( )1 (R + (old) )

Solution:

Fixed point?

( )

= ( )1 (R + )
= (R + )
= R

1
=
R

Page 2

LSTD(0)
Collect state-action-state triples (si,ai,si) according to a policy

Build the matrices:

(s1 )
(s1 )

(s2 ) (s )
2
, =
=

...
...
(sm )
(sm )

R(s1 , a1 , s1 )

, R = R(s2 , a2 , s2 )

...
R(sm , am , sm )

Find an approximation of the value function

V (s) (s)

1
for =
R

LSTD(0) in policy iteration

Iterate

Collect state-action-state triples (si,ai,si) according to current

policy
Use LSTD(0) to compute V

Tweaks:

Can re-use triples (si,ai,si) from previous policies as long as they

are consistent with the current policy
Can redo the derivation with Q functions rather than V
In case of stochastic policies, can weight contribution of a triple
according to Prob(ai|si) under the current policy

Doing all three results in Least squares policy iteration,

(Lagoudakis and Parr, 2003).

Page 3

LSTD(0) --- batch vs. incremental updates

Collect state-action-state triples (si,ai,si) according to a policy

Build the matrices:

(s1 )
(s1 )
(s2 )
(s2 )

=
, m =
...
...
(sm )
(sm )

R(s1 , a1 , s1 )

, Rm = R(s2 , a2 , s2 )

...
R(sm , am , sm )

Find an approximation of the value function

V (s) m
(s)

1
for m = m (m m )
m R m

One more datapoint m+1 :

1
m+1 =
m Rm + m+1 rm+1
m (m m ) + m+1 (m m )

Sherman-Morrison formula:

RLSTD

Recursively compute approximation of the value function by

leveraging the Sherman-Morrison formula
A1
m
bm
m

= A1
m bm

One more datapoint m+1 :

A1
m+1
bm+1

=
m (m m )
= m Rm

1
A1
m m+1 (m+1 m+1 ) Am
1
1 + (m+1 m+1 ) Am m+1
= bm + m+1 rm+1

= A1
m

Note: there exist orthogonal matrix techniques to do the same thing

but in a numerically more stable fashion (essentially: keep track of
the QR decomposition of Am)

Page 4

RLSTD: for non-linear function approximators?

RLSTD with linear function approximation with a Gaussian prior on \theta

Kalman filter

Can be applied to non-linear setting too: simply linearize the non-linear function
approximator around the current estimate of \theta; not globally optimal, but likely
still better than nave gradient descent
(+prior Extended Kalman filter)

Recursive Least Squares (1)

[From: Boyd, ee263]

Page 5

Recursive Least Squares (2)

[From: Boyd, ee263]

Recursive Least Squares (3)

[From: Boyd, ee263]

Page 6

TD methods recap

Model-free RL: learn V, Q directly from experience:

TD(), sarsa(): on policy updates

Q: off policy updates

Large MDPs: include function Approximation

Some guarantees for linear function approximation

Batch version

No need to tweak various constants

Same solution can be obtained incrementally by using recursive
updates! This is generally true for least squares type systems.

Applications of TD methods

Backgammon

Standard RL testbeds (all in simulation)

Cartpole balancing

Acrobot swing-up

Gridworld --- Assignment #2

Bicycle riding

Tetris --- Assignment #2

As part of actor-critic methods (=policy gradient + TD)

Fine-tuning / Learning some robotics tasks

Many financial institutions use some linear TD for

pricing of options

Page 7

RL: our learning status

Small MDPs: VI, PI, GPI, LP

Large MDPs:

Value iteration + function approximation

TD methods:

Which important ideas are we missing (and will I try to cover

between today and the next 3-5 lectures) ?

Imitation learning

POMDPS

Hierarchical methods

Fine tuning policies through running trials on a real system, Robotic success stories

Partial observability

Guarantees, Generally applicable idea of constraint sampling

Policy gradient, Actor-Critic (=TD+policy gradient in one)

Learn from observing an expert

Linear programming w/function approximation and constraint sampling

TD, sarsa, Q with function approximation

Simplicity, limited storage can be a convenience
LSTD, LSPI, RLSTD
Built upon in and compared to in many current RL papers
Main current direction: feature selection

You should be able to read/understand many RL papers

Iterate: Bellman back-up, project,

Incorporate your knowledge to enable scaling to larger systems

Reward shaping

Can we choose reward functions such as to enable faster learning?

Exploration vs. exploitation

Stochastic approximation

How/When should we explore?

Basic intuition behind how/when sampled versions work?

Page 8

Imitation learning

Imitation learning: what to learn?

If expert available, could use expert trace s1, a1, s2, a2,
s3, a3, to learn something from the expert

Behavioral cloning: use supervised learning to

directly learn a policy SA.

Inverse reinforcement learning:

No model of the system dynamics required

No MDP / optimal control solution algorithm required

Learn the reward function

Often most compact and transferrable task description

Trajectory primitives:

Use expert trajectories as motion primitives / components for

motion planning
Use expert trajectories as starting points for trajectory optimization

Page 9

Behavioral cloning

If expert available, could use expert trace s1, a1, s2, a2, s3, a3, to
learn the expert policy : S A
Class of policies to learn:

Neural net, decision tree, linear regression, logistic regression,

svm, deep belief net,

Advantages:

No model of the system dynamics required

No MDP / optimal control solution algorithm required

Minuses:

Only works if we can come up with a good policy class

Typically more applicable to reactive tasks, less so to tasks that involve

planning

No leveraging of dynamics model if available.

Alvinn

Task: steer a vehicle

CMU Navlab Autonomous

Navigation Testbed

Input: 30x32 image.

Page 10

Alvinn

Richness of training data?

Training data from good driver does not well represent

situations from which it should be able to recover

Might over-train on the simple data

Solution? Intentionally swerve off-center?

Issues:

Inconvenience to switch on/off the learning

Might require a lot of swerving (which could be especially
undesirable in traffic)

Page 11

Transformed images

original

extrap1

Page 12

extrap2

Few other details

Steering direction for transformed images:

Image buffering:

pure pursuit model : constant steering arc will bring

it back in the center at distance T

Keeps 200 images in buffer

One backpropagation pass over all images in each
round of training
Replacement to favor neutral steering

Road types:

Results

Achieved 98.2% autonomous driving on a 5000 km (3000-mile) "No hands

across America" trip.

Throttle and brakes were human-controlled.

Note: other autonomous driving projects:

Ernst Dickmanns

Darpa Grand and Urban Challenge

Page 13

Sammut+al, Learning to fly (ICML1992)

Task (in Silicon Graphics Flight Sim)

(crudely) Take off, fly through some waypoints, land

Training data: 30 flights (/pilot)

Recorded features: on_ground, g_limit exceeded, wing_stall, twist , elevation,

azimuth, roll_speed, elevation_speed, azimuth_speed, airspeed, climbspeed, E/W
distance from centre of runway, altitude, N/S distance from northern end of runway,
fuel, rollers, elevator, rudder, thrust, flaps

Data from each flight segmented into seven stages

In each stage: Four separate decision trees (C4.5), one for each of the
elevator, rollers, thrust and flaps.
Succeeded in synthesizing control rules for a complete flight, including a
safe landing. The rules fly the Cessna in a manner very similar to that of
the pilot whose data were used to construct the rules.
Pilots who are frugal in their use of the controls give few examples of what
to do when things go wrong.

7 stages

1. Take off and fly to an altitude of 2,000 feet.

2. Level out and fly to a distance of 32,000 feet from the starting point.

3. Turn right to a compass heading of approximately 330. The subjects were

actually told to head toward a particular point in the scenery that corresponds to
that heading.
4. At a North/South distance of 42,000 feet, turn left to head back towards the
runway. The scenery contains grid marks on the ground. The starting point for
the turn is when the last grid line was reached. This corresponds to about 42,000
feet. The turn is considered complete when the azimuth is between 140and
180.
5. Line up on the runway. The aircraft was considered to be lined up when the
aircraft's azimuth is less than 5off the heading of the runway and the twist is
less that 10from horizontal.
6. Descend to the runway, keeping in line. The subjects were given the hint that
they should have an aiming point near the beginning of the runway.
7. Land on the runway.

Page 14

Sammut + al

Example decision tree:

Stage 3: Turn right to a compass heading of
approximately 330

twist <= -23 : left_roll_3

twist > -23 :
| azimuth <= -25 : no_roll
| azimuth > -25 : right_roll_2

Sammut+al

Page 15

Tetris

state: board configuration + shape of the falling piece ~2200

states!
action: rotation and translation applied to the falling piece

V (s) =

i=1 i i (s)

22 features aka basis functions i

[Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis 1996 (TD);Kakade 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]

Behavioral cloning in tetris

Page 16

Page 17

CS 287: Advanced Robotics

Fall 2009
Lecture 16: imitation learning
Pieter Abbeel
UC Berkeley EECS

Page 1

Behavioral cloning example

state: board configuration + shape of the falling piece ~2200

states!
action: rotation and translation applied to the falling piece

Behavioral cloning example

V (s) =

i=1 i i (s)

22 features aka basis functions i

[Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis 1996 (TD);Kakade 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]

Page 2

Behavioral cloning example

Page 3

Behavioral cloning example

Training data: Example choices of next states chosen by the demonstrator:
(i)
s+
(i)
Alternative choices of next states that were available: sj
Max-margin formulation
min,0

+ C

i,j

(i)

subject to i, j : (s+ ) (sj ) + 1 i,j

Probabilistic/Logistic formulation
(i)

Assumes experts choose for result s(i) with probability

exp( (s+ ))
(i)
exp( (s+ )+

(i)

exp( (sj )

Hence the maximum likelihood estimate is given by:

(i)

exp( (s+ ))
max
log
C

(i)
(i)

exp( (s+ )) + j exp( (sj )

Motivation for inverse RL

Scientific inquiry

Apprenticeship learning/Imitation learning through

inverse RL

Model animal and human behavior

E.g., bee foraging, songbird vocalization. [See intro of Ng
and Russell, 2000 for a brief overview.]

Presupposition: reward function provides the most succinct and

transferable definition of the task
Has enabled advancing the state of the art in various robotic
domains

Modeling of other agents, both adversarial and

cooperative

Page 4

Problem setup

Input:

State space, action space

Transition model Psa(st+1 | st, at)

No reward function

Inverse RL:

Can we recover R ?

Apprenticeship learning via inverse RL

Teachers demonstration: s0, a0, s1, a1, s2, a2,

(= trace of the teachers policy *)

Can we then use this R to find a good policy ?

Vs. Behavioral cloning (which directly learns the teachers policy using supervised
learning)

Inverse RL: leverages compactness of the reward function

Behavioral cloning: leverages compactness of the policy class considered, does not
require a dynamics model

Lecture outline

Inverse RL intro

Mathematical formulations for inverse RL

Case studies

Page 5

Three broad categories of formalizations

Max margin

Feature expectation matching

Interpret reward function as parameterization of a policy class

Basic principle

Find a reward function R* which explains the expert behaviour.

E[
t R (st )| ] E[
t R (st )|]
Find R* such that
t=0

Equivalently, find R* such that

R(s)
1{st = s| }
R(s)
1{st = s| }
sS

t=0

A convex feasibility problem in R*, but many challenges:

R=0 is a solution, more generally: reward function ambiguity

We typically only observe expert traces rather than the entire expert policy
* --- how to compute LHS?

Assumes the expert is indeed optimal --- otherwise infeasible

Computationally: assumes we can enumerate all policies

Page 6

Feature based reward function

ffLet R(s) = w (s), where w n , and : S n .

E[
t R(st )|] =
t=0

Feature based reward function

ffLet R(s) = w (s), where w n , and : S n .

R(st )|]

= E[

t=0

t w (st )|]

t=0

= w E[

t (st )|]

t=0

= w ()

Expected cumulative discounted sum of

feature values or feature expectations

Subbing into E[
gives us:

t=0

R (st )| ] E[ t=0 t R (st )|]

Find w such that w ( ) w ()

Page 7

Feature based reward function

t=0

R (st )| ] E[

t=0

R (st )|]

Let R(s) = w (s), where w n , and : S n .

Find w such that w ( ) w ()

Feature expectations can be readily estimated from sample trajectories.

The number of expert demonstrations required scales with the number
of features in the reward function.
The number of expert demonstration required does not depend on

Complexity of the experts optimal policy *

Size of the state space

Recap of challenges
Let R(s) = w (s), where w n , and : S n .
Find w such that w ( ) w ()

Challenges:

Assumes we know the entire expert policy * assumes we can

estimate expert feature expectations
R=0 is a solution (now: w=0), more generally: reward function
ambiguity
Assumes the expert is indeed optimal---became even more of an
issue with the more limited reward function expressiveness!
Computationally: assumes we can enumerate all policies

Page 8

Ambiguity

We currently have: Find w such that w ( ) w ()

Standard max margin:

Structured prediction max margin:

Ambiguity

Standard max margin:

min w22
w

s.t. w ( ) w () + 1

Structured prediction max margin:

min w22
w

s.t. w ( ) w () + m( , )

Justification: margin should be larger for policies that are

very different from *.
Example: m(, *) = number of states in which * was
observed and in which and * disagree

Page 9

Expert suboptimality

Structured prediction max margin:

min w22
w

s.t. w ( ) w () + m( , )

Expert suboptimality

Structured prediction max margin with slack variables:

min w22 + C
w,

s.t. w ( ) w () + m( , )

Can be generalized to multiple MDPs (could also be same

MDP with different initial state)
min w22 + C

w, (i)

(i)

s.t. w (

(i)

) w ( (i) ) + m( (i) , (i) ) (i)

Page 10

i, (i)

Complete max-margin formulation

min w22 + C
w

(i)

s.t. w ( (i) ) w ( (i) ) + m( (i) , (i) ) (i)

i, (i)

[Ratliff, Zinkevich and Bagnell, 2006]

Resolved: access to *, ambiguity, expert suboptimality

One challenge remains: very large number of

constraints

Ratliff+al use subgradient methods.

In this lecture: constraint generation

Constraint generation
Initialize (i) = {} for all i and then iterate

Solve
min w22 + C
w

(i)

s.t. w (

(i)

) w ( (i) ) + m( (i) , (i) ) (i)

i, (i) (i)

For current value of w, find the most violated constraint

for all i by solving:
max w ( (i) ) + m( (i) , (i) )
(i)

= find the optimal policy for the current estimate of the

reward function (+ loss augmentation m)

For all i add (i) to (i)

If no constraint violations were found, we are done.

Page 11

Visualization in feature expectation space

Every policy has a corresponding feature expectation vector (),

which for visualization purposes we assume to be 2D

structured max margin (?)

t R(st )|] = w ()

t=0

max margin

(*)
wmm

wsmm

Constraint generation

Every policy has a corresponding feature expectation vector (),

which for visualization purposes we assume to be 2D

constraint generation:

max E[

(2)

(*)

t Rw (st )|] = max w ()

t=0

w(3) = wmm
w(2)

(1)

(0)
1

Page 12

Three broad categories of formalizations

Max margin (Ratliff+al, 2006)

Feature boosting [Ratliff+al, 2007]

Hierarchical formulation [Kolter+al, 2008]

Feature expectation matching (Abbeel+Ng, 2004)

Two player game formulation of feature matching
(Syed+Schapire, 2008)
Max entropy formulation of feature matching (Ziebart+al,2008)

Interpret reward function as parameterization of a policy class.

(Neu+Szepesvari, 2007; Ramachandran+Amir, 2007)

Feature expectation matching

Inverse RL starting point: find a reward function such

that the expert outperforms other policies

Let R(s) = w (s), where w n , and : S n .

Find w such that w ( ) w ()

Observation in Abbeel and Ng, 2004: for a policy to be

guaranteed to perform as well as the expert policy *, it
suffices that the feature expectations match:
() ( ) small implies w ( ) w () small

How to find such a policy ?

Page 13

Feature expectation matching

If expert suboptimal:

Abbeel and Ng, 2004: resulting policy is a mixture of policies

which have expert in their convex hull---In practice: pick the best
one of this set and pick the corresponding reward function.

Syed and Schapire, 2008 recast the same problem in game

theoretic form which, at cost of adding in some prior knowledge,
results in having a unique solution for policy and reward
function.

Ziebart+al, 2008 assume the expert stochastically chooses

between paths where each paths log probability is given by its
expected sum of rewards.

Lecture outline

Inverse RL intro

Mathematical formulations for inverse RL

Max-margin

Feature matching

Reward function parameterizing the policy class

Case studies

Page 14

Reward function parameterizing the

policy class

Recall:

V (s; R) = R(s) + max

Q (s, a; R) = R(s) +

P (s |s, a)V (s; R)

Lets assume our expert acts according to:

(a|s; R, ) =

1
exp(Q (s, a; R))
Z(s; R, )

Then for any R and , we can evaluate the likelihood of

seeing a set of state-action pairs as follows:

P ((s1 , a1 )) . . . P ((sm , am )) =

1
1
exp(Q (s1 , a1 ; R)) . . .
exp(Q (sm , am ; R))
Z(s1 ; R, )
Z(sm ; R, )

Note: non-convex formulation --- due to non-linear equality constraint for V!

Ramachandran and Amir, AAAI2007: MCMC method to sample from this distribution

Neu and Szepesvari, UAI2007: gradient method to find local optimum of the likelihood

Lecture outline

Inverse RL intro

Mathematical formulations for inverse RL

Case studies:

Highway driving,

Parking lot navigation,

Route inference,

Quadruped locomotion

Page 15

Simulated highway driving

Abbeel and Ng, ICML 2004; Syed and Schapire, NIPS 2007

Highway driving
Teacher in Training World

[Abbeel and Ng 2004]

Learned Policy in Testing World

Input:

Dynamics model / Simulator Psa(st+1 | st, at)

Teachers demonstration: 1 minute in training world
Note: R* is unknown.
Reward features: 5 features corresponding to lanes/shoulders; 10 features
corresponding to presence of other car in current lane at different distances

Page 16

[Abbeel and Ng 2004]

More driving examples

Driving
demonstration

Learned
behavior

Driving
demonstration

Learned
behavior

In each video, the left sub-panel shows a

demonstration of a different driving
style, and the right sub-panel shows
the behavior learned from watching the
demonstration.

Parking lot navigation

Reward function trades off:

Staying on-road,
Forward vs. reverse driving,
Amount of switching between forward and reverse,
Lane keeping,
On-road vs. off-road,
Curvature of paths.

[Abbeel et al., IROS 08]

Page 17

Experimental setup

Demonstrate parking lot navigation on train parking lots.

Run our apprenticeship learning algorithm to find the reward

function.
Receive test parking lot map + starting point and
destination.
Find the trajectory that maximizes the learned reward
function for navigating the test parking lot.

Nice driving style

Page 18

Sloppy driving-style

Only 35% of routes are

fastest (Letchner, Krumm, &
Horvitz 2006)

Page 19

Time

Money

Stress

Skill

Ziebart+al, 2007/8/9

Time
Fuel
Safety
Stress
Skill
Mood

Distance
Speed
Type
Lanes
Turns
Context

Ziebart+al, 2007/8/9

Page 20

Data Collection

25 Taxi Drivers

Length
Speed
Road
Type
Lanes

Accidents
Construction
Congestion
Time of day

Over 100,000 miles

Destination Prediction

Page 21

Ziebart+al, 2007/8/9

Quadruped

Reward function trades off 25 features.

Hierarchical max margin [Kolter, Abbeel & Ng, 2008]

Experimental setup

Demonstrate path across the training terrain

Run our apprenticeship learning algorithm to find the

reward function
Receive testing terrain---height map.

Find the optimal policy with respect to the learned

reward function for crossing the testing terrain.
Hierarchical max margin [Kolter, Abbeel & Ng, 2008]

Page 22

Without learning

With learned reward function

Page 23

Inverse RL history

1964, Kalman posed the inverse optimal control

problem and solved it in the 1D input case
1994, Boyd+al.: a linear matrix inequality (LMI)
characterization for the general linear quadratic setting
2000, Ng and Russell: first MDP formulation, reward
function ambiguity pointed out and a few solutions
suggested
2004, Abbeel and Ng: inverse RL for apprenticeship
learning---reward feature matching
2006, Ratliff+al: max margin formulation

Page 24

Inverse RL history

2007, Ratliff+al: max margin with boosting---enables large

vocabulary of reward features
2007, Ramachandran and Amir, and Neu and Szepesvari: reward
function as characterization of policy class
2008, Kolter, Abbeel and Ng: hierarchical max-margin
2008, Syed and Schapire: feature matching + game theoretic
formulation
2008, Ziebart+al: feature matching + max entropy
2008, Abbeel+al: feature matching -- application to learning parking
lot navigation style
Active inverse RL? Inverse RL w.r.t. minmax control, partial
observability, learning stage (rather than observing optimal policy),
?

Page 25

Consider the following scenario:

There are two envelopes, each of which has
an unknown amount of money in it. You get
to choose one of the envelopes. Given this is
all you get to know, how should you choose?

Consider the changed scenario:

Same as above, but before you get to
choose, you can ask me to disclose the
amount in one of the envelopes. Without any
distributional assumptions on the amounts of
money, is there a strategy that could improve
your expected pay-off over simply picking an
envelope at random?

CS 287: Advanced Robotics

Fall 2009
Lecture 17: Policy search
Pieter Abbeel
UC Berkeley EECS

Page 1

Problem setup

Input:

State space, action space

Transition model Psa(st+1 | st, at)

No reward function

Inverse RL:

Can we recover R ?

Apprenticeship learning via inverse RL

Teachers demonstration: s0, a0, s1, a1, s2, a2,

(= trace of the teachers policy *)

Can we then use this R to find a good policy ?

Vs. Behavioral cloning (which directly learns the teachers policy using supervised
learning)

Inverse RL: leverages compactness of the reward function

Behavioral cloning: leverages compactness of the policy class considered, does not
require a dynamics model

Parking lot navigation

Reward function trades off:

Staying on-road,
Forward vs. reverse driving,
Amount of switching between forward and reverse,
Lane keeping,
On-road vs. off-road,
Curvature of paths.

[Abbeel et al., IROS 08]

Page 2

Experimental setup

Demonstrate parking lot navigation on train parking lots.

Run our apprenticeship learning algorithm to find the reward

function.
Receive test parking lot map + starting point and
destination.
Find the trajectory that maximizes the learned reward
function for navigating the test parking lot.

Nice driving style

Page 3

Sloppy driving-style

Quadruped

Reward function trades off 25 features.

Hierarchical max margin [Kolter, Abbeel & Ng, 2008]

Page 4

Experimental setup

Demonstrate path across the training terrain

Run our apprenticeship learning algorithm to find the

reward function
Receive testing terrain---height map.

Find the optimal policy with respect to the learned

reward function for crossing the testing terrain.
Hierarchical max margin [Kolter, Abbeel & Ng, 2008]

Without learning

Page 5

With learned reward function

Announcements

Assignment 1 was due yesterday at 23:59pm

For late day policy details, see class webpage.

Reminder: Project milestone (1 page progress report) due Friday November 6.

Grading:

3 assignments: 25%, 25%, 5%

Final project: 45%

Final project presentations:

Dec 1 & Dec 3

November 24: Zico Kolter guest lecture

Page 6

As part of the course

requirements, I expect you
to attend these 3 lectures

Lecture outline

Inverse RL case studies wrap-up

Policy search

Assume some policy class parameterized by a vector

, search for the optimal setting of
Perhaps the largest number of success stories of RL

Policy class for helicopter hover

x, y, z: x points forward along the helicopter, y sideways to the right, z
downward.
nx , ny , nz : rotation vector that brings helicopter back tolevel position (expressed in the helicopter frame).
ucollective = 1 f1 (z z) + 2 z
uelevator = 3 f2 (x x) + 4 f4 (x)
+ 5 q + 6 ny
uaileron = 7 f3 (y y) + 8 f5 (y)
+ 9 p + 10 nx
urudder = 11 r + 12 nz

Page 7

[Policy search was done in simulation]

Ng + al, ISER 2004

Policy class for quadruped

locomotion on flat terrain

12 parameters define the Aibos gait:

The front locus (3 parameters: height,

x-pos., y-pos.)

The rear locus (3 parameters)

Locus length

Locus skew multiplier in the x-y plane

(for turning)

The height of the front of the body

The height of the rear of the body

The time each foot takes to move

through its locus
The fraction of time each foot spends
on the ground

Kohl and Stone, ICRA2004

Page 8

Kohl and Stone, ICRA 2004

Before learning (hand-tuned)

After learning

[Policy search was done through trials on the actual robot.]

Policy class for tetris

Let S = s1 , s2 , . . . , sn be the set of board situations available depending on
the choice of placement of the current block. Then the probability of choosing
the action that leads to board situation si is taken is given by:

exp (si )
n

j=1 exp ( (sj ))

Page 9

Deterministic, known dynamics

max U () = max E

R(st , at , st+1 )|

t=0

Numerical optimization: find the gradient w.r.t. and take a step in the
gradient direction.
H

R
U
st R
ut R
st+1
=
(st , ut , st+1 )
+
(st , ut , st+1 )
+ (st , ut , st+1 )
i
s

s
i
i
i
t=0

st
f
st1 f
ut1
=
(st1 , ut1 )
+
(st1 , ut1 )
i
s
i
s
i
ut

st
=
(st , ) +
(st , )
i
i
s
i
Computing these recursively, starting at time 0 is a discrete time instantiation
of Real Time Recurrent Learning (RTRL)

Stochastic, known dynamics

max U () = max E

R(st , at , st+1 )| = max

t=0

Pt (st = s, ut = u; )R(s, u)

t=0 s,u

Numerical optimization: find the gradient w.r.t. and take a step in the
gradient direction.
H

Pt
U
=
(st = s, ut = u; )R(s, u)
i
i
t=0 s,u
Using the chain rule we obtain the following recursive procedure:

Pt (st = s, ut = u; ) =
Pt1 (st1 = s , ut1 = u )T (s , u , s)(u|s; )
s ,u

Pt
i (st

= s, ut = u; ) =

s ,u

Pt1
(st1 = s , ut1 = u ; )T (s , u , s)(u|s; )
i

+Pt1 (st1 = s , ut1 = u ; )T (s , u , s)

(u|s; )
i

Computationally impractical for most large state/action spaces!

Page 10

Policy gradient
Deterministic

Analytical

Stochastic

Known
Dynamics

Unknown
Dynamics

Known
Dynamics

Unknown
Dynamics

OK
Taking
derivatives--potentially time
consuming and
error-prone

N/A

OK
Often
computationally
impractical

N/A

Finite differences
We can compute the gradient g using standard finite difference methods, as
follows:
U
U ( + ej ) U ( ej )
() =
j
2
Where:

ej =

0
0
..
.
0
1
0
..
.
0

Page 11

jth entry

Finite differences --- generic points

Locally around our current estimate , we can approximate the utility U ( (i) )
at an alternative point (i) with a linear approximation:
U () U () + g T ( (i) )
Lets say we have a set of small perturbations (0) , (1) , . . . , (m) for which we
are willing to evaluate the utility U ( (i) ). We can get an estimate of the gradient
through solving the following set of equations for the gradient g through least
squares:
U( (0) ) = U () + ( (0) ) g
U( (1) ) = U () + ( (1) ) g
...
U ( (m) ) = U () + ( (m) ) g

Issue in stochastic setting

Noise could dominate the gradient evaluation

Page 12

Fixing the randomness

Intuition by example: wind influence on a helicopter is

stochastic, but if we assume the same wind pattern
across trials, this will make the different choices of
more readily comparable.
General instantiation:

Fix the random seed

Result: deterministic system

If the stochasticity can be captured in an appropriate

way, this will result in a significantly more efficient
gradient computation

Details of appropriate: Ng & Jordan, 2000

Policy gradient
Deterministic

Stochastic

Known
Dynamics

Unknown
Dynamics

Known
Dynamics

Unknown
Dynamics

Analytical

OK
Taking
derivatives--potentially time
consuming and
error-prone

N/A

OK
Often
computationally
impractical

N/A

Finite
differences

OK
Sometimes
computationally
more expensive
than analytical

OK
N = #roll-outs:
Naive: O(N-1/4),
or O(N-2/5)
Fix random
seed: O(N-1/2)
[1]

Same as known
dynamics, but
no fixing of
random seed.

[1] P. Glynn, Likelihood ratio gradient estimation: an overview, in Proceedings of the 1987
Winter Simulation Conference, Atlanta, GA, 1987, pp. 366375.

Page 13

Likelihood ratio method

Assumption:

Stochastic policy (ut | st)

Stochasticity:

Required for the methodology

+ Helpful to ensure exploration
- Optimal policy is often not stochastic (though it can be!!)

Likelihood ratio method

Page 14

Likelihood ratio method

Page 15

Likelihood ratio method

We let denote
a state-action sequence s0 , u0 , . . . , sH , uH . We overload
notation: R( ) = H
t=0 R(st , ut ).
U () = E[

R(st , ut ); ] =

P ( ; )R( )

t=0

In our new notation, our goal is to find :

max U () = max
P ( ; )R( )

Page 16

Likelihood ratio method

U () =

P ( ; )R( )

Taking the gradient w.r.t. gives

U () =
P ( ; )R( )
=

P ( ; )R( )

P ( ; )

P ( ; )R( )

P ( ; )

P ( ; )
R( )
P ( ; )

P ( ; ) log P ( ; )R( )

Approximate with the empirical estimate for m sample paths under policy
:
m

U () g =

1
log P ( (i) ; )R( (i) )
m i=1

Likelihood ratio method

log P ( (i) ; ) = log

=
=

(i)
(i)
(i)
(i) (i)
P (st+1 |st , ut ) (ut |st )

t=0
(i)
(i)
(i)
log P (st+1 |st , ut )

t=0

H

t=0

(i)

log (ut |st )

t=0

policy

dynamics model

(i)

log (ut |st )

no dynamics model required!!

Page 17

(i) (i)
log (ut |st )

Likelihood ratio method

The following expression provides us with an unbiased estimate of the gradient,
and we can compute it without access to a dynamics model:
m

g =

1
log P ( (i) ; )R( (i) )
m i=1

Here:
log P ( (i) ; ) =

H

t=0

Unbiased means:

(i)

log (ut |st )

no dynamics model required!!

E[
g ] = U ()

We can obtain a gradient estimate from a single trial run! While the math we
did is sound (and constitutes the commonly used derivation), this could seem
a bit surprising at first. Lets perform another derivation which might give us
some more insight.

Page 18

Boston Dynamics PetMan

CS 287: Advanced Robotics

Fall 2009
Lecture 18: Policy search
Pieter Abbeel
UC Berkeley EECS

Page 1

Policy gradient
Deterministic

Stochastic

Known
Dynamics

Unknown
Dynamics

Known
Dynamics

Unknown
Dynamics

Analytical

OK
Taking
derivatives--potentially time
consuming and
error-prone

N/A

OK
Often
computationally
impractical

N/A

Finite
differences

OK
Sometimes
computationally
more expensive
than analytical

OK
N = #roll-outs:
Naive: O(N-1/4), or
O(N-2/5)
Fix random seed:
O(N-1/2) [1]

Same as known
dynamics, but no
fixing of random
seed.

Likelihood ratio
method

OK
O(N-1/2) [1]

[1] P. Glynn, Likelihood ratio gradient estimation: an overview, in Proceedings of the 1987
Winter Simulation Conference, Atlanta, GA, 1987, pp. 366375.

Likelihood ratio method

Assumption:

Stochastic policy (ut | st)

Stochasticity:

Required for the methodology

+ Helpful to ensure exploration
- Optimal policy within the policy class is not always
stochastic (though it can be!!)

Page 2

Likelihood ratio method

We let denote
a state-action sequence s0 , u0 , . . . , sH , uH . We overload
notation: R( ) = H
t=0 R(st , ut ).
H

U () = E[
R(st , ut ); ] =
P ( ; )R( )

t=0

Our goal is to find :

max U () = max

P ( ; )R( )

Likelihood ratio method derivation

U() =

P ( ; )R( )

Taking the gradient w.r.t. gives

U () =
P ( ; )R( )
=

P ( ; )
P ( ; )

P ( ; )

dynamics model
(i)

(i)

log P (st+1 |st , ut ) +

H

t=0

P ( ; )
R( )
P ( ; )

P ( ; ) log P ( ; )R( )

Approximate with the empirical estimate for m sample paths under policy
:
m

1
log P ( (i) ; )R( (i) )
m i=1

Page 3

policy

H

t=0

(i)

log (ut |st )

t=0

P ( ; )R( )

U () g =

H

t=0

P ( ; )R( )

(i)
(i)
(i)
(i) (i)
log P ( (i) ; ) = log P (st+1 |st , ut ) (ut |st )

t=0

(i)

log (ut |st )

no dynamics model required!!

(i)

log (ut |st )

Likelihood ratio method: result recap

The following expression provides us with an unbiased estimate of the gradient,
and we can compute it without access to a dynamics model:
m

g =

1
log P ( (i) ; )R( (i) )
m i=1

Here:
log P ( (i) ; ) =

H

t=0

(i)

log (ut |st )

no dynamics model required!!

Unbiased means:
E[
g ] = U ()

Likelihood ratio method in practice

As formulated thus far: yes, unbiased estimator, but very

noisy, hence would take very long
Set of critical fixes that have led to real-world practicality:

Add a free parameter to the estimator called baseline and set it

such that the variance of the estimator is minimized
Exploit temporal structure + incorporate value function estimates (=
actor-critic learning)
Dont step in the direction of the gradient, follow the natural
gradient direction instead

Page 4

Consider the following scenario:

There are two envelopes, each of which has
an unknown amount of money in it. You get
to choose one of the envelopes. Given this is
all you get to know, how should you choose?

Consider the changed scenario:

Envelopes riddle

Page 5

Envelopes riddle

MDP:

horizon of 1, always start in state 0

Transition to state 1 or 2 according to choice made

Observe the reward in the visited state

Policy

(1|0)

(2|0)

exp()
1 + exp()
1
1 + exp()

Choose to see an envelopes contents according to

Perform a gradient update:

log P ( = 1; )R( = 1) =

1
R(1)
1 + exp()

log P ( = 2; )R( = 2) =

exp()
R(2)
1 + exp()

Page 6

Envelopes riddle

Perform a gradient update:

log P ( = 1; )R( = 1) =

(2|0)

exp()
1 + exp()
1
1 + exp()

1
R(1)
1 + exp()

log P ( = 2; )R( = 2) =

(1|0)

exp()
R(2)
1 + exp()

This gradient update is simply making the recently observed path more
likely; and how much more likely depends on the observed R for the
observed path
rather than let it depend simply on R, if we had a baseline b which is
an estimate of the expected reward under the current policy, then could
update scaled by (R-b) instead,
i.e. the baseline enables updating such that better than average paths
become more likely, less than average paths become less likely

Likelihood ratio method with baseline

Gradient estimate with baseline:

g =

1
log P ( (i) ; )(R( (i) ) b)
m i=1

This will (crudely speaking) increase the log-likelihood of paths with

higher than baseline reward, and decrease the log-likelihood of
observed paths with lower than baseline reward.

Is this still an unbiased gradient estimate?

Unbiased means: E[g ] = U()

Page 7

Even with baseline, we obtain an unbiased

estimate of the gradient

P ( ; ) = 1

P ( ; ) = 0
i

P ( ; ) = 0
j

P ( ; )
P ( ; ) = 0

P ( ; ) j

P ( ; )
log P ( ; ) = 0
j

E
log P ( ; ) = 0
j

E
log P ( ; )bj = 0
j

U () = E
log P ( ; )R( )
j
j

= E
log P ( ; )(R( ) bj )
j

m
1
(i)
log P ( (i) ; )(R( ) bj )

m
j
i=1

Natural choices for b:

Estimate of utility U()
Choose bj to minimize
the variance of the
gradient estimates.

Our gradient estimate:

gj =

log P ( (i) ; ) (R( ) bj ) ,

It is unbiased, i.e.:
E
gj =

U ()
j

Its variance is given by:

2
E (
gj E [
gj ])

which we would like to minimize over bj :

2
2
gj E [
gj ]) = E
gj ) 2E [
gj ]]
min E (
gj2 + E (E
gj E [
bj

= E
gj2 + (E
gj ) 2E [
gj ] E [
gj ]

= E
gj2

(E
gj )2

= U()
independent of bj

Page 8

min E
gj2
bj

= min E
log P ( ; ) (R( ) bj )
bj
j

2

= min E
log( ; ) R( )2 + b2j 2bj R( )
bj
j

2
2

log P ( ; ) R( )2 +E
log P ( ; ) b2j
= min E
bj
j
j

independent of bj

2E
log P ( ; )2 bj R( )
i

2
2

2
2
= min bj E
log P ( ; )
2bj E
log P ( ; )
R( )
bj
j
j

= 0 2bj E [...] 2E [...] = 0

bj

2

E
R( )
j log P ( ; )

bj =
2

E
log
P
(
;
)
j

Could estimate optimal baseline from samples.

Exploiting temporal structure

Our gradient estimate:

(i)
log P ( ; ) R( (i) ) bj ,
j

H1

m
H1

1
(i) (i)
(i)
(i)
log (ut |st )
R(st , ut ) bj
m i=1 t=0 j
t=0
m

1
m i=1

Future actions do not depend on past rewards (assuming a fixed policy). This
can be formalized as

E
log (ut |st )R(sk , uk ) = 0 k < t
j
Removing these terms with zero expected value from our gradient estimate we
obtain:
H1

m H1

1
(i) (i)
(i)
(i)
gj =
log (ut |st )
R(sk , uk ) bj
m i=1 t=0 j
k=t

Page 9

Actor-Critic
Our gradient estimate:
m H1
1
(i) (i)
gj =
log (ut |st )
m i=1 t=0 j

(i)
(i)
R(sk , uk )

k=t

H1
(i)
(i)
(i)
(i)
The term k=t R(sk , uk ) is a sample based estimate of Q (st , uk ). If
we simultaneously run a temporal difference (TD) learning method to estimate
Q , then we could substitute its estimate for Q instead of the sample based
estimate!

Our gradient estimate becomes:

gj =

m H1

1
(i) (i)
(s(i) , u(i) ) bj
log (ut |st ) Q
t
t
m
j
t=0
i=1

Natural gradient

Is the gradient the correct direction?

Page 10

Natural gradient

Is the gradient the correct direction?

Page 11

MODULARITY, POLYRHYTHMS, AND WHAT ROBOTICS AND CONTROL MAY YET

LEARN FROM THE BRAIN
Jean-Jacques Slotine, Nonlinear Systems Laboratory, MIT
Thursday, Nov 5th, 4:00 p.m., 3110 Etcheverry Hall

ABSTRACT
Although neurons as computational elements are 7 orders of magnitude slower than their artificial
counterparts, the primate brain grossly outperforms robotic algorithms in all but the most
structured tasks. Parallelism alone is a poor explanation, and much recent functional modelling of
the central nervous system focuses on its modular, heavily feedback-based computational
architecture, the result of accumulation of subsystems throughout evolution. We discuss this
architecture from a global functionality point of view, and show why evolution is likely to favor
certain types of aggregate stability. We then study synchronization as a model of computations at
different scales in the brain, such as pattern matching, restoration, priming, temporal binding of
sensory data, and mirror neuron response. We derive a simple condition for a general dynamical
system to globally converge to a regime where diverse groups of fully synchronized elements
coexist, and show accordingly how patterns can be transiently selected and controlled by a very
small number of inputs or connections. We also quantify how synchronization mechanisms can
protect general nonlinear systems from noise. Applications to some classical questions in
robotics, control, and systems neuroscience are discussed.
The development makes extensive use of nonlinear contraction theory, a comparatively
recent analysis tool whose main features will be briefly reviewed.

CS 287: Advanced Robotics

Fall 2009
Lecture 19:
Actor-Critic/Policy gradient for learning to walk in 20 minutes
Natural gradient
Pieter Abbeel
UC Berkeley EECS

Page 1

Case study: learning bipedal walking

Dynamic gait:

Why hard?

A bipedal walking gait is considered dynamic if the

ground projection of the center of mass leaves the
convex hull of the ground contact points during some
portion of the walking cycle.

Achieving stable dynamic walking on a bipedal robot

is a difficult control problem because bipeds can only
control the trajectory of their center of mass through
the unilateral, intermittent, uncertain force contacts
with the ground.

fully actuated walking

Passive dynamic walkers

Page 2

Passive dynamic walkers

The energy lost due to friction and collisions when the

swing leg returns to the ground are balanced by the
gradual conversion of potential energy into kinetic
energy as the walker moves down the slope.

Can we actuate them to have them walk on flat terrains?

John E. Wilson. Walking toy. Technical report, United States Patent Office, October 15 1936.

Tad McGeer. Passive dynamic walking. International Journal of Robotics Research, 9(2):62.82,
April 1990.

Learning to walk in 20 minutes --- Tedrake,

Zhang, Seung 2005

Page 3

Learning to walk in 20 minutes --- Tedrake,

Zhang, Seung 2005
passive hip joint [1DOF]

Arms: coupled to the

opposite leg to reduce
yaw moment

44 cm

freely swinging load [1DOF]

2 x 2 (roll, pitch)
position controlled
servo motors [4 DOF]
9DOFs:
* 6 internal DOFs
* 3 DOFs for the robots orientation
(always assumed in contact with
ground at a single point, absolute
(x,y) ignored)

Natural gait down 0.03 radians ramp:

0.8Hz, 6.5cm steps

Dynamics
q = f(q, q,
u, d(t))

q: vector of joint angles

u: control vector (4D)

d(t): time-varying vector of random disturbances

Discrete footstep-to-footstep dynamics: consider state at

touchdown of robots left leg

F (x , x) = P (
xn+1 = x |
xn = x; )

Stochasticity due to

Sensor noise
Disturbances d(t)

Page 4

Reinforcement learning formulation

Goal: stabilize the limit cycle trajectory that the passive robot follows when walking
down the ramp, making it invariant to slope.
Reward function:

1
R(x(n)) = x(n) x 22
2

x* is taken from the gait of the walker down a slope of 0.03 radians

Action space:

At the beginning of each step cycle (=when a foot touches down) we choose an
action in the discrete time RL formulation
Our action choice is a feedback control policy to be deployed during the step, in
this particular example it is a column vector w
Choosing this action means that throughout the following step cycle, the following
continous-time feedback controls will be exerted:

u(t) =

wi i (
x(t)) = w (
x(t))

Goal: find the (constant) action choice w which maximizes expected sum of rewards

Policy class

To apply the likelihood gradient ratio method, we need to

define a stochastic policy class. A natural choice is to
choose our action vector w to be sampled from a Gaussian:

w N (, 2 I)
Which gives us:

1
(w|x) =
exp
(2)d d

1
(w ) (w )
2
2

[Note: it does not depend on x, this is the case b/c the

actions we consider are feedback policies themselves!]

The policy optimization becomes optimizing the mean of

this Gaussian. [In other papers people have also included
the optimization of the variance parameter.]

Page 5

Policy update
Likelihood ratio based gradient estimate from a single trace of H footsteps:
H1

H1

g =
log (w(n)|
x(n))
R(
x(k)) b
We have:

n=0

k=n

1
(w )
2 2
Rather than waiting till horizon H is reached, we can perform the updates
online as follows: (here is a step-size parameter, b(n) is the amount of baseline
we allocate to time nsee next slide)
log (w|
x) =

e(n)
(n + 1)

1
(w(n) (n))
2 2
= (n) + e(n)(R(
x(n)) b(n))

= e(n 1) +

To reduce variance, can discount the eligibilities:

e(n) = e(n 1) +

1
(w(n) (n))
2 2

Choosing the baseline b(n)

A good choice for the baseline is such that it corresponds to an estimate of
the expected reward we should have obtained under the current policy.
Assuming we have estimates of the value function V under the current policy,
we can estimate such a baseline as follows:
b(n) = V (
x(n)) V (
x(n + 1))
To estimate V we can use TD(0) with function approximation. Using linear
value function approximation, we have:

V (
x) =
vi i (
x).
i

This gives us the following update equations to learn V with TD(0):

(n) = R(
x(n)) + V (
x(n + 1)) V (
x(n))
v(n + 1) = v(n) + v (n)(
x(n))

Page 6

The complete actor critic learning algorithm

Before each foot step, sample the feedback control policy parameters w(n)
from N ((n), 2 I).
During the foot step, execute the following controls in continuous time:
u(t) = w(n) (
x(t)).
After the foot step is completed, compute the reward function R(
x(n)) and
perform the following updates:
Policy updates:
e(n)

(n + 1)

b(n)

1
(w (n))
2 2
(n) + e(n)(R(
x(n)) b(n))
V (
x(n)) V (
x(n + 1))
e(n 1) +

TD(0) updates:
(n)
v(n + 1)

= R(
x(n)) + V (
x(n + 1)) V (
x(n))
= v(n) + v (n)(
x(n))

Manual dimensionality reduction

Decompose the control problem in the frontal and

sagittal planes
Due to simplicity of sagittal plane control---hand set.
Left with control of the ankle roll actuators to control in
the frontal plane

Let roll control input only depend on roll and droll/dt

Basis functions: non-overlapping tile encoding

Policy: 35 tiles (5 in roll x 7 in droll/dt )

Value: 11 tiles (a function in droll/dt only because the value
is evaluated at the discrete time when roll hits a particular
value)

Page 7

Experimental setup and results

When the learning begins, the policy parameters, w, are

set to 0 and the baseline parameters, v, are initialized
so that \hat{V}(x) R(x) / (1-)
Train the robot on flat terrain.
Reset with simple hand-designed controller that gets it
into a random initial state every 10s.
Results:

After 1 minute: foot clearance on every step

After 20 minutes: converged to a robust gait (=960
steps at 0.8Hz)

Return maps

after learning

before learning
[Note: this is a projection from 2x9-1 dim to 1dim]

Page 8

Toddler movie

On tread-mill: passive walking. On ground: learning.

Page 9

CS 287: Advanced Robotics

Fall 2009
Lecture 20:
Natural gradient
Reward shaping
Approximate LP with function approximation
POMDP
Hierarchical RL

Pieter Abbeel
UC Berkeley EECS

Natural gradient

Is the gradient the correct direction?

Page 1

Natural gradient

Is the gradient the correct direction?

Gradient and linear transformations

Consider the optimization of the function f () through gradient descent. In
iteration k we would perform an update of the following form:
(k+1) = (k) + f ((k) ).

(1)

Consider a new coordinate system x = A1 . We could work in the new

coordinate system instead, and optimize f (Ax) over the variable x. A gradient
descent step is given by:
x(k+1) = x(k) + x f (Ax(k) ) = x(k) + A f (Ax(k) )
If x(k) = A1 (k) , do we have x(k+1) = (k+1) ?
No!
The value in coordinates that corresponds to xk+1 is given by
Ax(k+1) = Ax(k) + AA f (Ax(k) ) = (k) + AA f ( (k) ) = (k+1)

Page 2

(2)

Newtons direction
Newtons method approximates the function f () by a quadratic function
through a Taylor expansion around the current point k :
1
f () f(k ) + f ((k) ) ( (k) ) + ( (k) ) H((k) )( (k) )
2
Here Hij ((k) ) =

2f
(k)
)
i j (

is a matrix with the 2nd derivatives of f evaluated

at (k) .
The local optimum of the 2nd order approximation is found by setting its
gradient equal to zero, which gives:
Newton step direction = ( (k) ) = H 1 ((k) ) f ((k) )

The Newton step direction is affine invariant

Newtons step direction for f () is given by:
H 1 ((k) ) f ( (k) ).

(1)

For f (Ax), with x = A1 , we have

Hessian

= A HA

gradient

= A f

Hence we have for the Newton step direction in the x coordinates:

1
A HA
A f (Ax(k) ) = A1 f (Ax(k) )

(2)

Translating this into coordinates gives us AA1 f (Ax(k) ) = f (Ax(k) ),

which is identical to the step direction directly computed in coordinates.

Page 3

Natural gradient

Gradient depends on choice of coordinate system.

Newtons method is invariant to affine coordinate
transformations, but not to general coordinate
transformations.

Can we achieve more invariance than simply affine

invariance?

Natural gradient

Lets first re-interpret the gradient:

The gradient is the direction of steepest ascent:

Page 4

Natural gradient

Lets first re-interpret the gradient:

The gradient is the direction of steepest ascent:

For small we have:

arg

max

f ( + ) arg
= arg
=

max

f () + f ()

max

f ()

:2
:2

f ()

f ()2

When expressing our problem in a different coordinate system f

remains the same, but the || . || <= 1 constraint means something
different in different coordinate systems.

Can we find a norm constraint that is independent of the coordinate

system????

A distance which is independent of the

parameterization of the policy class

Kullback-Leibler divergence between distributions over paths induced by

the policies:

KL(P ( ; 1 )||P ( ; 2 )) =

P ( ; 1 ) log

P ( ; 1 )
P ( ; 2 )

E.g., 2 Bernoulli distributions:

Prob(heads)=p and Prob(heads)=q

KL(P (; p)||P (; q)) = p log

p
1p
+ (1 p) log
q
1q

Alternative parameterization: Prob(heads) =

KL(P (; )P (; )) =

exp()
log
1 + exp()

exp()
1+exp()

exp()
1+exp()
exp()
1+exp()

1
log
1 + exp()

= KL(P (; p)P (; q)) if p =

Page 5

and Prob(heads) =

exp()
1+exp()

1
1+exp()
1
1+exp()

exp()
exp()
,q =
1 + exp()
1 + exp()

A distance which is independent of the

parameterization of the policy class

Kullback-Leibler divergence between distributions

over paths induced by the policies:
KL(P ( ; 1 )||P ( ; 2 )) =

P ( ; 1 ) log

P ( ; 1 )
P ( ; 2 )

Second-order Taylor approximation

KL(P ( ; )||P ( ; + ))

P ( ; ) log P ( ; ) log P ( ; )

= G()

G() = Fisher information matrix, independent of the

choice of parameterization of the class of distributions.

2nd order Taylor expansion of KL divergence

KL(P (X; )P (X; + )

P (x; )
P (x; ) log
P (x; + )
x

P (x; )
d
1
d2

P (x; ) log

log P (x; ) 2 log P (x; )

P
(x;
)
d
2
d
x

d
1
d2
=
P (x; ) log P (x; ) 2 log P (x; )
d
2
d
x

dP (x;)
dP (x;)
d2
d
P (x; ) d

2 P (x; )
P (x; )
d
d
1
d
=
P (x; )

P (x; )

P (x; )
2
P (x; )2
x
x
dP (x;) dP (x;)
d
1 d2
1
d
d
=
P (x; )
P (x; ) +
P (x; )

2
d
2
d
2
P (x; )
P (x; )
x
x
x

d
1
d2
=
P (x; )

P (x; )
d x
2
d 2 x

1
d
d
+
P (x; )
log P (x; )
log P (x; )

2
d
d
x

2
d
1
d
=
1

1
d
2
d 2

1
d
d
+
P (x; )
log P (x; )
log P (x; )

2
d
d
x

1
d
d

= 0 0 +
P (x; )
log P (x; )
log P (x; )
2
d
d
x
=

1
G()
2

Page 6

Natural gradient gN

Natural gradient: general setting

Page 7

Natural gradient in policy search

Natural gradient gN

= the direction with highest increase in the objective per

change in KL divergence
gN

= arg
arg
= arg

max

:KL(P ( ;)||P ( ;+)

f ( + )

max

f () + f ()

max

f ()

: 12 G()
: 12 G()

= G()1 f ()

Page 8

Natural gradient: general setting

Problem setting: optimize an objective which depends on a probability distribution P
max f (P )

Rather than following the gradient, which depends on the choice of parameterization for the set of probability distributions that we are searching over,
follow the natural gradient gN :
gN = G()1 f (P )
Here G() is the Fisher information matrix, and can be computed as follows:

P (x) log P (x) log P (x)
G() =
xX

Natural gradient in policy search

Objective:
U () =

P ( ; )R( )

Natural gradient gN :

gN = G()1 U ()
Both the Fisher information matrix G and the gradient need to be estimate
from samples. We have seen many ways to estimate the gradient from samples.
Remains to show how to estimate G.
G() =

P ( ) log P ( ; ) log P ( ; )

1
log P ( (i) ; ) log P ( (i) ; )
m i=1

As we have seen earlier, we can compute the expression log P ( (i) ) even
without access to the dynamics model:
log P ( (i) ; )

H
1

log T (st , ut , st+1 ) + log (ut |st )

t=0

H
1

log (ut |st )

t=0

Page 9

Example

Kober and Peters, NIPS 2009

Page 10

Announcements

Project milestone due tomorrow 23:59pm

= 1 page progress update.

Format: pdf

Assignment #2: out tomorrow night, due in 2 weeks

Topic: RL

Start early!

Late day policy: 7 days total; -20pts (out of 100pts of the

thing you are submitting late) per day beyond that
Assignment #3 will be released in 2 weeks and will be
very small compared to #1 and #2.

Thomas Daniel, University of Washington

A Tale of Two (or Three?) Gyroscopes: Inertial measurement units (IMUs) in flying insects
Date: Thursday, November 5, 2009
Time: 4:00 PM
Place: 2040 Valley Life Sciences Building

Animals use a combination of sensory modalities to control their movement including visual,
mechanosensory and chemosensory information. Mechanosensory systems that can detect
inertial forces are capable of responding much more rapidly than visual systems and, as such,
are thought to play a critical role in rapid course correction during flight. This talk focusses on
two gryoscopic organs: halteres of flies and antennae of moths. Both have mechanical and
neural components play critical roles in encoding relatively tiny Coriolis forces associated with
body rotations, both of which will be reviewed along with new data that suggests each have
complex circuits that connect visual systems to mechanosensory systems. But, insects are
bristling with mechanosensory structures, including the wings themselves. It is not clear whether
these too could serve an IMU function in addition to their obvious aerodynamic roles.

Page 11

MODULARITY, POLYRHYTHMS, AND WHAT ROBOTICS AND CONTROL MAY YET

LEARN FROM THE BRAIN
Jean-Jacques Slotine, Nonlinear Systems Laboratory, MIT
Thursday, Nov 5th, 4:00 p.m., 3110 Etcheverry Hall

Reinforcement learning: remaining topics

approximate LP, pomdp's, reward shaping, exploration

vs. exploitation, hierarchical methods

Page 12

Approximate LP
Exact Bellman LP

minV

c(s)V (s)

s.t.

V (s)

T (s, a, s ) (R(s, a, s ) + V (s )) s, a

Approximate LP

min

c(s) (s)

s.t.

(s)

T (s, a, s ) R(s, a, s ) + (s ) s S , a

Approximate LP guarantees

When retaining all constraints, yet introducing function approximation.

min

c(s) (s)

s.t.

(s)

T (s, a, s ) R(s, a, s ) + (s ) s S, a

Theorem. [de Farias and Van Roy] If one of the basis function satises
i (s) = 1 for all s S, then the LP has a feasible solution and the optimal
solution satises:
V 1,c

2
min V
1

Page 13

Constraint sampling
Consider a convex optimization problem with a very large number of constraints:
min c x
s.t. gi (x) b i = 1, 2, . . . , m
where x n , gi convex, and m >> n.
We obtain the sampled approximation by sampling the sequence {i1 , i2 , . . . , iN }
IID according to some measure over the constraints . This gives us:
min
s.t.

c x
gj (x) b j = i1 , i2 , . . . , iN

Let x
N be the optimal solution to the sampled convex problem.

Theorem. [Calafiore and Campi, 2005] For arbitrary > 0, > 0, if

n
1, then
N
Prob ( ({i|gi (
xN ) > bi }) ) 1
where the probability is taken over the random sampling of constraints.

This result can be leveraged to show that the solution to the sampled approximate LP is close to V with high probablity. (de Farias and Van Roy,
2001)

Reward shaping

What freedom do we have in specifying the reward

function? Can we choose it such that learning is faster?
Examples:

+ Tetris: set the reward equal to the distance

between highest filled square and the top of the
board vs. a reward of 1 for placing a block
- Bicycle control task: provide a positive reward for
motion towards the goal
- Soccer task: provide a positive reward for touching
the ball

Page 14

Potential based shaping

Let F(s,a,s) = (s) - (s)

Shaped reward = R + F

Theorem [Ng, Harada & Russell, 1999]

Potential based reward shaping is a necessary and
sufficient condition to guarantee that the optimal
policy in the shaped MDP M = (S, A, T, , R+F)
is also an optimal policy in the original MDP M =
(S, A, T, , R)
[In fact even stronger: all policies retain their
relative value.]

Intuition of proof

In the new MDP, for a trace s0, a0, s1, a1, we obtain:
R(s0, a0, s1) + (s1) - (s0)
+ (R(s1, a1, s2) + (s) - (s1))
+ 2 (R(s2, a2, s3) + (s3) - (s2))
+

- (s0) + R(s0, a0, s1) + R(s1, a1, s2) + 2 R(s2, a2, s3) +

For any policy we have: (M: original MDP, M: MDP w/shaped reward)
VM(s0) = VM(s0) - (s0)

Page 15

A good potential?

Let = V*

Then in one update we have:

V (s)

=
=

max
a

T (s, a, s ) (R(s, a, s ) + F (s, a, s ) + V (s ))

T (s, a, s ) (R(s, a, s ) + V (s ) V (s) + V (s ))

V (s) + max
a

T (s, a, s ) (R(s, a, s ) + V (s ) + V (s ))

If we initialize V = 0, we obtain:
V (s) V (s) + max
a

=
=

T (s, a, s ) (R(s, a, s ) + V (s ))

V (s) + V (s)
0

V=0 satisfies the Bellman equation; this particular choice of potential

function / reward shaping, we can find the solution to the shaped MDP
very quickly

Example
10x10 grid world, 1 goal state = absorbing, other states R=-1;
Prob(action successful) = 80%
Shaping function: 0(s) = - (manhattan distance to goal) / 0.8
Plot shows performance of Sarsa(0) with epsilon=0.1 greedy, learning rate = .2
(a) no shaping vs. (b) = 0.5 * 0 vs. (c) = 0

[From Ng, Harada & Russell 1999]

Page 16

Relationship to value function intialization

Potential based shaping can be shown to be equivalent

to initializing the value function to the shaping potential.

[see course website for the technical note describing this]

If restricting ourselves to potential based shaping, can
implement it in 2 ways.

Partial observability

= no direct observation of the state

Instead: might have noisy measurements

Partially Observable Markov Decision Process

(POMDP)

Main ideas:

Based on the noisy measurements and knowledge

about the dynamics, keep track of a probability
distribution for the current state
Define a new MDP for which the probability
distribution over current state is considered the state

Page 17

Exploration vs. exploitation

One of the milestone results: Kearns and Singh, Explicit

Exploration and Exploitation (E3), 2002
Question/Problem:

Given an MDP with unknown transition model.

Can we

(a) explicitly decide to take actions that will assist us in

building a transition model of the MDP, and
(b) detect when we have a sufficiently accurate model and
start exploiting?

Basic idea of E3 algorithm

Repeat forever

Based on all data seen so far, partition the state space is a set of
known states and a set of unknown states. [A state is known when
each action in that state has been observed sufficiently often.]
If currently in a known state:

Lump all unknown states together in one absorbing meta-state. Give the
meta-state a reward of 1. Give all known states a reward of zero. Find the
optimal policy in this new MDP.
If the optimal policy has a value of zero (or low enough): exit. [No need
for exploration anymore.]
Otherwise: Execute the optimal action for the current state.

If currently in a unknown state:

Take the action that has been taken least often in this state.

Page 18

Technical aspects underneath E3

Simulation lemma: if the transition models and reward models of

two MDPs are sufficiently close, then the optimal policy in one will
also be close to optimal in the other
After having seen a state-action pair sufficiently often, with high
probability the data based transition model estimate will be accurate

Their analysis provides a finite time result (as opposed to

asymptotic, such as for Q learning, sarsa, etc.)

Various extensions since:

Brafman and Tenneholtz, Rmax

Kakade + al, Metric E3

Kearns and Koller, E3 in MDP w/transition model ~ temporal

Bayes net

Hierarchical RL

Main idea: use hierarchical domain knowledge to speed

up RL
I posted one representative paper onto the course
website, should be a reasonable starting point if you
wanted to find out more.

Page 19

RL summary

Exact methods: VI, PI, GPI, LP

Model-free methods: TD, Q, sarsa

Function approximation:

Batch versions: LSTD (recursive version: RLSTD), LSPI

Contractions infinity norm, 2norm weighted by state visitation

frequency
Approximate LP

Policy gradient methods:

analytical, finite difference, likelihood ratio

Gradient <-> natural gradient

Imitation learning:

Behavioral cloning <-> inverse RL

Page 20

CS 287: Advanced Robotics

Fall 2009
Lecture 21:
HMMs, Bayes filter, smoother, Kalman filters

Pieter Abbeel
UC Berkeley EECS

Overview

Thus far:

Optimal control and reinforcement learning

We always assumed we got to observe the state at each time
and the challenge was to choose a good action

Current and next set of lectures

The state is not observed

Instead, we get some sensory information about the state

Challenge: compute a probability distribution over the state

which accounts for the sensory information (evidence) which
we have observed.

Page 1

Examples
Helicopter

A choice of state: position, orientation, velocity, angular rate

Sensors:

GPS : noisy estimate of position (sometimes also velocity)

Inertial sensing unit: noisy measurements from (i) 3-axis gyro [=angular rate sensor],
(ii) 3-axis accelerometer [=measures acceleration + gravity; e.g., measures (0,0,0) in
free-fall], (iii) 3-axis magnetometer

Mobile robot inside building

A choice of state: position and heading

Sensors:

Odometry (=sensing motion of actuators): e.g., wheel encoders

Laser range finder: measures time of flight of a laser beam between departure and
return (return is typically happening when hitting a surface that reflects the beam back
to where it came from)

Probability review
For any random variables X, Y we have:

Definition of conditional probability:

P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)

Chain rule: (follows directly from the above)

P(X=x, Y=y) = P(X=x) P(Y=y | X=x )
= P(Y=y) P(X=x | Y=y)

Bayes rule: (really just a re-ordering of terms in the above)

P(X=x | Y=y) = P(Y=y | X=x) P(X=x) / P(Y=y)

Marginalization:
P(X=x) = y P(X=x, Y=y)

Note: no assumptions beyond X, Y being random variables are made for any of these to
hold true (and when we divide by something, that something is not zero)

Page 2

Probability review
For any random variables X, Y, Z, W we have:

Conditional probability: (can condition on a third variable z throughout)

P(X=x | Y=y, Z=z) = P(X=x, Y=y | Z=z) / P(Y=y | Z=z)

Chain rule:
P(X=x, Y=y, Z=z, W=w) = P(X=x) P(Y=y | X=x ) P(Z=z | X=x, Y=y) P(W=w| X=x, Y=y, Z=z)

Bayes rule: (can condition on other variable z throughout)

P(X=x | Y=y, Z=z) = P(Y=y | X=x, Z=z) P(X=x | Z=z) / P(Y=y | Z=z)

Marginalization:
P(X=x | W=w) = y,z P(X=x, Y=y, Z=z | W=w)

Note: no assumptions beyond X, Y, Z, W being random variables are made for any of these to
hold true (and when we divide by something, that something is not zero)

Independence

Two random variables X and Y are independent iff

for all x, y : P(X=x, Y=y) = P(X=x) P(Y=y)

Representing a probability distribution over a set of random

variables X1, X2, , XT in its most general form can be expensive.

However, if we assumed the random variables were independent,

then we could very compactly represent the joint distribution as
follows:

E.g., if all Xi are binary valued, then there would be a total of 2T

possible instantiations and it would require 2T-1 numbers to
represent the probability distribution.

P(X1=x1, X2=x2, , XT=xT) = P(X1=x1) P(X2=x2) P(XT=xT)

Thanks to the independence assumptions, for the binary case,
we went from requiring 2T-1 parameters, to only requiring T
parameters!

Unfortunately independence is often too strong an assumption

Page 3

Conditional independence

Two random variables X and Y are conditionally independent given a third

random variable Z iff
for all x, y, z : P(X=x, Y=y | Z=z) = P(X=x | Z=z) P(Y=y | Z=z)

Chain rule (which holds true for all distributions, no assumptions needed):

P(X=x,Y=y,Z=z,W=w) = P(X=x)P(Y=y|X=x)P(Z=z|X=x,Y=y)P(W=w|X=x,Y=y,Z=z)

For binary variables the representation requires 1 + 2*1 + 4*1 + 8*1 = 24-1 numbers
(just like a full joint probability table)

Now assume Z independent of X given Y, and assume W independent of X and

Y given Z, then we obtain:

P(X=x,Y=y,Z=z,W=w) = P(X=x)P(Y=y|X=x)P(Z=z|Y=y)P(W=w|Z=z)
For binary variables the representation requires 1 + 2*1 + 2*1 + 2*1 = 1+(4-1)*2
numbers --- significantly less!!

Markov Models

Models a distribution over a set of random variables X1, X2, , XT where the
index is typically associated with some notion of time.
Markov models make the assumption:

Chain rule: (always holds true, not just in Markov models!)

P(X1 = x1, X2 = x2, , XT = xT) = t P(Xt = xt | Xt-1 = xt-1, Xt-2 = xt-2, , X1 = x1)

Now apply the Markov conditional independence assumption:

Xt is independent of X1, , Xt-2 when given Xt-1

P(X1 = x1, X2 = x2, , XT = xT) = t P(Xt = xt | Xt-1 = xt-1)

(1)

in binary case: 1 + 2*(T-1) numbers required to represent the joint

distribution over all variables (vs. 2T 1)

Graphical representation: a variable Xt receives an arrow from the variables

appearing in its conditional probability in the expression for the joint distribution
(1) [called a Bayesian network or Bayes net representation]

Page 4

Hidden Markov Models

Underlying Markov model over states Xt

For each state Xt there is a random variable Zt which is a sensory

measurement of Xt

Assumption 1: Xt independent of X1, , Xt-2 given Xt-1

Assumption 2: Zt is assumed conditionally independent of the other

variables given Xt

This gives the following graphical (Bayes net) representation:

Hidden Markov Models

P(X1=x1, Z1=z1, X2=x2, Z2=z2, , XT=xT, ZT=zT) =

Chain rule: (no assumptions)

HMM assumptions:

P(X1 = x1)

P(Z1 = z1 | X1 = x1 )

P(X2 = x2 | X1 = x1 , Z1 = z1)

P(X2 = x2 | X1 = x1)

P(Z2 = z2 | X1 = x1, Z1 = z1, X2 = x2 )

P(Z2 = z2 | X2 = x2 )

P(XT = xT | X1 = x1, Z1 = z1, , XT-1 = xT-1, ZT-1 = zT-1 )

P(XT = xT | XT-1 = xT-1)

P(ZT = zT | X1 = x1, Z1 = z1, , XT-1 = xT-1, ZT-1 = zT-1 , XT = xT)

P(ZT = zT | XT = xT)

Page 5

Mini quiz

What would the graph look like for a Bayesian network

with no conditional independence assumptions?

Our particular choice of ordering of variables in the

chain rule enabled us to easily incorporate the HMM
assumptions. What if we had chosen a different
ordering in the chain rule expansion?

Example

The HMM is defined by:

Initial distribution:
Transitions:
Observations:

Page 6

Real HMM Examples

Robot localization:

Observations are range readings (continuous)

States are positions on a map (continuous)

Speech recognition HMMs:

Observations are acoustic signals (continuous valued)

States are specific positions in specific words (so, tens of
thousands)

Machine translation HMMs:

Observations are words (tens of thousands)

States are translation options

Filtering / Monitoring

Filtering, or monitoring, is the task of tracking the distribution

P(Xt | Z1 = z1 , Z2 = z2 , , Zt = zt)
over time. This distribution is called the belief state.

We start with P(X0) in an initial setting, usually uniform

As time passes, or we get observations, we update the belief state.

The Kalman filter was invented in the 60s and first implemented as
a method of trajectory estimation for the Apollo program. [See
course website for a historical account on the Kalman filter. From
Gauss to Kalman]

Page 7

Example: Robot Localization

Example from
Michael Pfeiffer

Prob

t=0

Sensor model: never more than 1 mistake

Know the heading (North, East, South or West)
Motion model: may not execute action with small prob.

Example: Robot Localization

Prob

t=1

Page 8

Example: Robot Localization

Prob

t=2

Example: Robot Localization

Prob

t=3

Page 9

Example: Robot Localization

Prob

t=4

Example: Robot Localization

Prob

t=5

Page 10

Inference: Base Cases

Incorporate observation

Time update

Assume we have current belief P(X | evidence to date)

Then, after one time step passes:

Page 11

Xt+1

Observation update
Xt+1

Assume we have:

Et+1

Then:

Algorithm

Init P(x1) [e.g., uniformly]

Observation update for time 0:

For t = 1, 2,

Time update

Observation update

For continuous state / observation spaces: simply replace

summation by integral

Page 12

Example HMM

Rt-1

P(Rt)

P(Ut)

0.7

0.9

0.3

0.2

The Forward Algorithm

Time/dynamics update and observation update in one:

recursive update

Normalization:

Can be helpful for numerical reasons

However: lose information!

Can renormalize (for numerical reasons) + keep track of the

normalization factor (to enable recovering all information)

Page 13

The likelihood of the observations

The forward algorithm first sums over x1, then over x2 and so forth,
which allows it to efficiently compute the likelihood at all times t, indeed:

Relevance:

Compare the fit of several HMM models to the data

Could optimize the dynamics model and observation model to maximize the
likelihood
Run multiple simultaneous trackers --- retain the best and split again
whenever applicable (e.g., loop closures in SLAM, or different flight
maneuvers)

With control inputs

Page 14

With control inputs

Control inputs known:

They can be simply seen as selecting a particular dynamics function

Control inputs unknown:

Assume a distribution over them

Above drawing assumes open-loop controls. This is rarely the case in

practice. [Markov assumption is rarely the case either. Both assumptions
seem to have given quite satisfactory results.]

Smoothing

Thus far, filtering, which finds:

The distribution over states at time t given all evidence until time t:

The likelihood of the evidence up to time t:

How about?

T < t : can simply run the forward algorithm until time t, but stop
incorporating evidence from time T+1 onwards
T > t : need something else

Page 15

Smoothing

Sum as written has a number of terms exponential in T

Key idea: order in which variables are being summed out affects
computational complexity

Forward algorithm exploits summing out x1, x2, , xt-1 in this

order
Can similarly run a backward algorithm, which sums out xT, xT-1,
, xt+1 in this order

Smoothing

Forward algorithm computes this

Can be easily verified from the equations:

Backward algorithm computes this

The factors in the right parentheses only contain Xt+1, , XT,

hence they act as a constant when summing out over X1, , Xt-1
and can be brought outside the summation

Can also be read off from the Bayes net graph / conditional
independence assumptions:

X1, , Xt-1 are conditionally of Xt+1, , XT given Xt

Page 16

Backward algorithm

Sum out xT:

Can recursively compute for l=T, T-1, :

fl1 (xl1 )

P (xl |xl1 )P (el |xl )fl (xl )

Smoother algorithm

Run forward algorithm, which gives

Run backward algorithm, which gives

P(xt , z1, , zt) for all t

ft(xt) for all t

Find

P(xt , z1, , zT) = P(xt , z1, , zt) ft(xt)

If desirable, can renormalize and find P(xt | z1, , zT)

Page 17

Bayes filters

Recursively compute

P(xt , z1:t-1) = xt-1 P(xt | xt-1) P(xt-1 | z1:t-1)

P(xt , z1:t) = P(xt, z1:t-1) P(zt | xt)

Tractable cases:

State space finite and sufficiently small

(what we have in some sense considered so far)

Systems with linear dynamics and linear

observations and Gaussian noise
Kalman filtering

Univariate Gaussian

Gaussian distribution with mean , and standard

deviation :

PX(x)

Page 18

Properties of Gaussians

Mean:

Variance:

Central limit theorem (CLT)

Classical CLT:

Let X1, X2, be an infinite sequence of independent

random variables with E Xi = , E(Xi - )2 = 2
Define Zn = ((X1 + + Xn) n ) / ( n1/2)
Then for the limit of n going to infinity we have that Zn
is distributed according to N(0,1)

Crude statement: things that are the result of the

addition of lots of small effects tend to become
Gaussian.

Page 19

Multi-variate Gaussians

Multi-variate Gaussians: examples

= [1; 0]
= [1 0; 0 1]

= [-.5; 0]
= [1 0; 0 1]

Page 20

= [-1; -1.5]
= [1 0; 0 1]

Multi-variate Gaussians: examples

= [0; 0]

= [1 0 ; 0 1]

= [0; 0]
= [.6 0 ; 0 .6]

= [0; 0]
= [2 0 ; 0 2]

Multi-variate Gaussians: examples

= [0; 0]
= [1 0; 0 1]

= [0; 0]
= [0; 0]
= [1 0.5; 0.5 1] = [1 0.8; 0.8 1]

Page 21

Multi-variate Gaussians: examples

= [0; 0]
= [1 0; 0 1]

= [0; 0]
= [0; 0]
= [1 0.5; 0.5 1] = [1 0.8; 0.8 1]

Multi-variate Gaussians: examples

= [0; 0]
= [1 -0.5 ; -0.5 1]

= [0; 0]
= [1 -0.8 ; -0.8 1]

Page 22

= [0; 0]
= [3 0.8 ; 0.8 1]

Discrete Kalman Filter

Estimates the state x of a discrete-time controlled process
that is governed by the linear stochastic difference
equation

xt = At xt 1 + Bt ut + t
with a measurement

zt = Ct xt + t

Components of a Kalman Filter

Matrix (nxn) that describes how the state evolves

from t to t-1 without controls or noise.

Matrix (nxl) that describes how the control ut

changes the state from t to t-1.

Matrix (kxn) that describes how to map the state xt

to an observation zt.

Random variables representing the process and

measurement noise that are assumed to be
independent and normally distributed with
covariance Rt and Qt respectively.

Page 23

Linear Gaussian Systems: Initialization

Initial belief is normally distributed:

bel ( x0 ) = N (x0 ; 0 , 0 )

Linear Gaussian Systems: Dynamics

Dynamics are linear function of state and control

plus additive noise:

xt = At xt 1 + Bt ut + t
p ( xt | ut , xt 1 ) = N (xt ; At xt 1 + Bt ut , Rt )
bel ( xt ) = p( xt | ut , xt 1 )

bel ( xt 1 ) dxt 1

~ N ( xt ; At xt 1 + Bt ut , Rt ) ~ N ( xt 1 ; t 1 , t 1 )
54

Page 24

Linear Gaussian Systems: Dynamics

bel ( xt ) = p( xt | ut , xt 1 )

bel ( xt 1 ) dxt 1

~ N (xt ; At xt 1 + Bt ut , Rt ) ~ N ( xt 1 ; t 1 , t 1 )

bel ( xt ) = exp ( xt At xt 1 Bt ut )T Rt1 ( xt At xt 1 Bt ut )

exp ( xt 1 t 1 )T t11 ( xt 1 t 1 ) dxt 1

= At t 1 + Bt ut
bel ( xt ) = t
T
t = At t 1 At + Rt
55

Proof: completion of squares

To integrate out xt-1, re-write the integrand in the format:

f (t1 , t1 , xt )

1
d

(2) 2 |S| 2

exp

1
(xt1 m)S 1 (xt1 m)
2

dxt1

This integral is readily computed (integral of a

multivariate Gaussian times a constant = that constant)
to be

f(t1 , t1 , xt )

Inspection of f will show that it is a multi-variate

Gaussian in xt with the mean and covariance as shown
on previous slide.

Page 25

Properties of Gaussians

We just showed:

X ~ N ( , )
T
Y ~ N ( A + B, AA )
Y = AX + B

We stay in the Gaussian world as long as we start with

Gaussians and perform only linear transformations.
Now we know this, we could find Y and Y without
computing integrals by directly computing the expected
values:
E[Y ] = E[AX + B] = AE[X] + B = A + B
Y Y

= E[(Y E[Y ])(Y E[Y ]) ] = E[(AX + B A B)(AX + B A B) ]

= E[A(X )(X ) A ] = AE[(X )(X ) ]A = AA

Self-quiz

Page 26

Linear Gaussian Systems: Observations

Observations are linear function of state plus

additive noise:

zt = Ct xt + t
p ( zt | xt ) = N ( zt ; Ct xt , Qt )
bel ( xt ) =

p ( zt | xt )

bel ( xt )

~ N ( zt ; Ct xt , Qt )

~ N xt ; t , t

)
59

Linear Gaussian Systems: Observations

bel ( xt ) =

p ( zt | xt )

bel ( xt )

~ N ( zt ; Ct xt , Qt )

~ N xt ; t , t

bel ( xt ) = exp ( zt Ct xt )T Qt1 ( zt Ct xt ) exp ( xt t )T t1 ( xt t )

2
2

= t + K t ( zt Ct t )
bel ( xt ) = t
t = ( I K t Ct ) t

with

K t = t CtT (Ct t CtT + Qt ) 1

Page 27

Proof: completion of squares

Re-write the expression for bel(xt) in the format:

t , C t , Qt )
f(
t ,

1
d

(2) 2 |S| 2

exp

1
(xt m)S 1 (xt m)
2

f is the normalization factor

The expression in parentheses is a multi-variate
Gaussian in xt. Its parameters m and S can be
identified to satisfy the expressions for the mean and
covariance on the previous slide.

Kalman Filter Algorithm

Algorithm Kalman_filter( t-1, t-1, ut, zt):
Prediction:
t = At t 1 + Bt ut
t = At t 1 AtT + Rt

Correction:
K t = t CtT (Ct t CtT + Qt ) 1
t = t + K t ( z t Ct t )
t = ( I K t Ct ) t

Return t, t

Page 28

How to derive these updates

Simply work through the integrals

Key trick: completion of squares

If your derivation results in a different format apply

matrix inversion lemma to prove equivalence
(A + U CV )1 = A1 A1 U (C 1 + V A1 U)1 V A1

Page 29

CS 287: Advanced Robotics

Fall 2009
Lecture 22:
HMMs, Kalman filters

Pieter Abbeel
UC Berkeley EECS

Overview

Current and next set of lectures

The state is not observed

Instead, we get some sensory information about the state

Challenge: compute a probability distribution over the state

which accounts for the sensory information (evidence) which
we have observed.

Page 1

Hidden Markov Models

Underlying Markov model over states Xt

Assumption 1: Xt independent of X1, , Xt-2 given Xt-1

For each state Xt there is a random variable Zt which is a sensory

measurement of Xt

Assumption 2: Zt is assumed conditionally independent of the other

variables given Xt

This gives the following graphical (Bayes net) representation:

Filtering in HMM

Init P(x1) [e.g., uniformly]

Observation update for time 0:

For t = 1, 2,

Time update

Observation update

For discrete state / observation spaces: simply replace integral by

summation

Page 2

With control inputs

They can be simply seen as selecting a particular dynamics function

Control inputs unknown:

Control inputs known:

Assume a distribution over them

Above drawing assumes open-loop controls. This is rarely the case in

practice. [Markov assumption is rarely the case either. Both assumptions
seem to have given quite satisfactory results.]

Discrete-time Kalman Filter

Estimates the state x of a discrete-time controlled process
that is governed by the linear stochastic difference
equation

xt = At xt 1 + Bt ut + t
with a measurement

zt = Ct xt + t

Page 3

Kalman Filter Algorithm

Algorithm Kalman_filter( t-1, t-1, ut, zt):
Prediction:
t = At t 1 + Bt ut
t = At t 1 AtT + Rt

Correction:
K t = t CtT (Ct t CtT + Qt ) 1
t = t + K t ( z t Ct t )
t = ( I K t Ct ) t

Return t, t

Intermezzo: information filter

From an analytical point of view == Kalman filter

Difference: keep track of the inverse covariance rather than the
covariance matrix
[matter of some linear algebra manipulations to get into this form]

Why interesting?

Inverse covariance matrix = 0 is easier to work with than covariance

matrix = infinity (case of complete uncertainty)
Inverse covariance matrix is often sparser than the covariance matrix
--- for the insiders: inverse covariance matrix entry (i,j) = 0 if xi is
conditionally independent of xj given some set {xk, xl, }
Downside: when extended to non-linear setting, need to solve a
linear system to find the mean (around which one can then linearize)
See Probabilistic Robotics pp. 78-79 for more in-depth pros/cons

Page 4

Matlab code data generation example

A = [ 0.99

0.0074; -0.0136

0.99]; C = [ 1 1 ; -1 +1];

x(:,1) = [-3;2];

Sigma_w = diag([.3 .7]); Sigma_v = [2 .05; .05 1.5];

w = randn(2,T); w = sqrtm(Sigma_w)w; v = randn(2,T); v = sqrtm(Sigma_v)v;

for t=1:T-1
x(:,t+1) = A * x(:,t) + w(:,t);
y(:,t) = C*x(:,t) + v(:,t);
end

% now recover the state from the measurements

P_0 = diag([100 100]); x0 =[0; 0];

% run Kalman filter and smoother here

% + plot

Kalman filter/smoother example

Page 5

Kalman filter property

If system is observable (=dual of controllable!) then Kalman filter will

converge to the true state.
System is observable iff
O = [C ; CA ; CA2 ; ; CAn-1] is full column rank

(1)

Intuition: if no noise, we observe y0, y1, and we have that the unknown
initial state x0 satisfies:
y0 = C x0
y1 = CA x0
...
yK = CAK x0
This system of equations has a unique solution x0 iff the matrix [C; CA; CAK]
has full column rank. B/c any power of a matrix higher than n can be written in
terms of lower powers of the same matrix, condition (1) is sufficient to check
(i.e., the column rank will not grow anymore after having reached K=n-1).

Simple self-quiz

The previous slide assumed zero control inputs at all

times. Does the same result apply with non-zero control
inputs? What changes in the derivation?

Page 6

Kalman Filter Summary

Highly efficient: Polynomial in measurement

dimensionality k and state dimensionality n:
O(k2.376 + n2)

Optimal for linear Gaussian systems!

Most robotics systems are nonlinear!

Extended Kalman filter (EKF)

Unscented Kalman filter (UKF)

[And also: particle filter (next lecture)]

Announcement: PS2

I provided a game log of me playing for your convenience.

However, to ensure you fully understand, I suggest you follow the
following procedure:

Code up a simple heuristic policy to collect samples from the

state space for question 1. Then use these samples as your
state samples for ALP and for approximate value iteration.
Play the game yourself for question 2 and have it learn to clone
your playing style.

You dont need to follow the above procedure, but I strongly

suggest to in case you have any doubt about how the algorithms in
Q1 and Q2 operate, b/c following the above procedure will force
you more blatantly to see the differences (and can only increase
you ability to hand in a good PS2).

Page 7

Announcements

PS1: will get back to you over the weekend, likely Saturday

Milestone: ditto

PS2: dont underestimate it!

Office hours: canceled today. Feel free to set appointment

over email. Also away on Friday actually. Happy to come in
on Sat or Sun afternoon by appointment.
Tuesday 4-5pm 540 Cory: Hadas Kress-Gazit (Cornell)
High-level tasks to correct low-level robot control
In this talk I will present a formal approach to creating robot controllers that ensure the robot
satisfies a given high level task. I will describe a framework in which a user specifies a
complex and reactive task in Structured English. This task is then automatically translated,
using temporal logic and tools from the formal methods world, into a hybrid controller. This
controller is guaranteed to control the robot such that its motion and actions satisfy the
intended task, in a variety of different environments.

Nonlinear Dynamic Systems

Most realistic robotic problems involve nonlinear

functions

xt = g (ut , xt 1 ) + noise
zt = h( xt ) + noise

Page 8

Linearity Assumption Revisited

Non-linear Function

Throughout: Gaussian of P(y) is the Gaussian which minimizes the KL-divergence with P(y). It
turns out that this means the Gaussian with the same mean and covariance as P(y).

Page 9

EKF Linearization (1)

EKF Linearization (2)

Page 10

EKF Linearization (3)

EKF Linearization: First Order Taylor

Series Expansion

Prediction:
g (ut , xt 1 ) g (ut , t 1 ) +

g (ut , t 1 )
( xt 1 t 1 )
xt 1

g (ut , xt 1 ) g (ut , t 1 ) + Gt ( xt 1 t 1 )

Correction:
h( xt ) h( t ) +

h( t )
( xt t )
xt

h( xt ) h( t ) + H t ( xt t )

Page 11

EKF Algorithm
Extended_Kalman_filter( t-1, t-1, ut, zt):
Prediction:

t = g (ut , t 1 )

t = At t 1 + Bt ut
t = At t 1 AtT + Rt

T
t

t = Gt t 1G + Rt

Correction:
K t = t CtT (Ct t CtT + Qt ) 1
t = t + K t ( zt Ct t )

K t = t H tT ( H t t H tT + Qt ) 1
t = t + K t ( zt h( t ))
t = ( I K t H t ) t

Return t, t

t = ( I K t Ct ) t

Ht =

h( t )
xt

Gt =

g (ut , t 1 )
xt 1

Simultaneous Localization and Mapping (SLAM)

Robot navigating in unknown environment

Perception

Environment: typically through laser-range finders,

sonars, cameras
Odometry (its own motion): inertial sensing, wheel
encoders, (if outdoors and clear sky-view: gps)

Page 12

Simultaneous Localization and Mapping (SLAM)

R
A

G
C
State: (nR, eR, R, nA, eA, nB, eFB, nC, eC, nD, eD, nE, eE, nF,
eF, nG, eG, nH, eH)
Transition model:

Robot motion model; Landmarks stay in place

Simultaneous Localization and Mapping (SLAM)

In practice: robot is not aware of all landmarks from the

beginning
Moreover: no use in keeping track of landmarks the
robot has not received any measurements about

Incrementally grow the state when new landmarks get

encountered.

Page 13

Simultaneous Localization and Mapping (SLAM)

Landmark measurement model: robot measures [ xk; yk ],

the position of landmark k expressed in coordinate frame
attached to the robot:

h(nR, eR, R, nk, ek) = [xk; yk] = R() ( [nk; ek] - [nR; eR] )

Often also some odometry measurements

E.g., wheel encoders

As they measure the control input being applied, they
are often incorporated directly as control inputs (why?)

EKF SLAM Application

[courtesy by J. Leonard]

Page 14

EKF SLAM Application

odometry

estimated trajectory
[courtesy by John Leonard]

EKF-SLAM: practical challenges

Defining landmarks

Camera: interest point detectors, textures, color,

Often need to track multiple hypotheses

Data association/Correspondence problem: when seeing features

that constitute a landmark --- Which landmark is it?

Closing the loop problem: how to know you are closing a loop?

Can split off multiple EKFs whenever there is ambiguity;

Laser range finder: Distinct geometric features (e.g. use RANSAC to

find lines, then use corners as features)

Keep track of the likelihood score of each EKF and discard the ones
with low likelihood score

Computational complexity with large numbers of landmarks.

Page 15

EKF Summary

Highly efficient: Polynomial in measurement

dimensionality k and state dimensionality n:
O(k2.376 + n2)
Not optimal!
Can diverge if nonlinearities are large!
Works surprisingly well even when all assumptions are
violated!
Note duality with linearizing a non-linear system and then
running LQR back-ups to obtain the optimal linear
controller!
36

Linearization via Unscented Transform

EKF

UKF

Page 16

UKF Sigma-Point Estimate (2)

EKF

UKF

UKF Sigma-Point Estimate (3)

EKF

UKF

Page 17

UKF Sigma-Point Estimate (4)

UKF intuition why it can perform better

Assume we know the distribution over X and it has a mean \bar{x}

Y = f(X)

[Julier and Uhlmann, 1997]

Page 18

UKF intuition why it can perform better

Assume

1. We represent our distribution over x by a set of sample

points.
2. We propagate the points directly through the function f.

Then:

We dont have any errors in f !!

The accuracy we obtain can be related to how well the
first, second, third, moments of the samples correspond
to the first, second, third, moments of the true
distribution over x.

Self-quiz

When would the UKF significantly outperform the EKF?

Page 19

Original unscented transform

Picks a minimal set of sample points that match 1st, 2nd

and 3rd moments of a Gaussian:

\bar{x} = mean, Pxx = covariance, i ith row, x n

\kappa : extra degree of freedom to fine-tune the higher
order moments of the approximation; when x is
Gaussian, n+\kappa = 3 is a suggested heuristic
[Julier and Uhlmann, 1997]

Unscented Kalman filter

Dynamics update:

Can simply use unscented transform and estimate

the mean and variance at the next time from the
sample points

Observation update:

Use sigmapoints from unscented transform to

compute the covariance matrix between xt and zt.
Then can do the standard update.

Page 20

Algorithm Unscented Kalman filter(t1 , t1 , ut , zt ):

1. Xt1 = t1 t1 + t1 t1 t1
2. Xt = g(t , Xt1 )
2n [i] [i]
3.
t =
wm Xt
i=0

[i]
t = 2n wc[i] (X [i]
4.
t )(Xt
t ) + Rt
t
i=0

t
t
5. Xt =
t
t +
t

6. Zt = h(Xt )
2n [i] [i]
7. zt = i=0 wm Zt

2n [i] [i]
[i]
8. St =
wc Z zt Z zt
+ Qt
i=0

x,z = 2n wc[i]

9.
t
i=0

[i]
[i]
Xt
t Zt zt

x,z S 1
10. Kt =
t
t
11. t =
t + Kt (zt zt )
t Kt St Kt
12. t =
13. return t , t

[Table 3.4 in Probabilistic Robotics]

UKF Summary

Highly efficient: Same complexity as EKF, with a constant

factor slower in typical practical applications
Better linearization than EKF: Accurate in first two terms
of Taylor expansion (EKF only first term)

Derivative-free: No Jacobians needed

Still not optimal!

Page 21

Announcements

Final project: 45% of the grade, 10% presentation, 35%

write-up

Presentations: in lecture Dec 1 and 3

If you have constraints, inform me by email by Wednesday night, we
will assign the others at random on Thursday

PS2: due Friday 23:59pm.

Tuesday 4-5pm 540 Cory: Hadas Kress-Gazit (Cornell)

High-level tasks to correct low-level robot control
In this talk I will present a formal approach to creating robot controllers that ensure the robot
satisfies a given high level task. I will describe a framework in which a user specifies a
complex and reactive task in Structured English. This task is then automatically translated,
using temporal logic and tools from the formal methods world, into a hybrid controller. This
controller is guaranteed to control the robot such that its motion and actions satisfy the
intended task, in a variety of different environments.

CS 287: Advanced Robotics

Fall 2009
Lecture 23:
HMMs: Kalman filters, particle filters

Pieter Abbeel
UC Berkeley EECS

Page 1

Hidden Markov Models

Filtering in HMM

Init P(x1) [e.g., uniformly]

Observation update for time 0:

For t = 1, 2,

Time update

Observation update

For discrete state / observation spaces: simply replace integral by

summation

Page 2

Discrete-time Kalman Filter

Estimates the state x of a discrete-time controlled process
that is governed by the linear stochastic difference
equation

xt = At xt 1 + Bt ut + t
with a measurement

zt = Ct xt + t

Kalman Filter Algorithm

Algorithm Kalman_filter( t-1, t-1, ut, zt):
Prediction:
t = At t 1 + Bt ut
t = At t 1 AtT + Rt

Correction:
K t = t CtT (Ct t CtT + Qt ) 1
t = t + K t ( z t Ct t )
t = ( I K t Ct ) t

Return t, t

Page 3

Nonlinear systems

xt = g (ut , xt 1 ) + noise
zt = h( xt ) + noise

Extended Kalman filter (EKF)

Page 4

EKF Linearization: First Order Taylor

Series Expansion

Prediction:
g (ut , xt 1 ) g (ut , t 1 ) +

g (ut , t 1 )
( xt 1 t 1 )
xt 1

g (ut , xt 1 ) g (ut , t 1 ) + Gt ( xt 1 t 1 )

Correction:
h( xt ) h( t ) +

h( t )
( xt t )
xt

h( xt ) h( t ) + H t ( xt t )

EKF Algorithm
Extended_Kalman_filter( t-1, t-1, ut, zt):
Prediction:

t = g (ut , t 1 )

t = At t 1 + Bt ut
t = At t 1 AtT + Rt

T
t

t = Gt t 1G + Rt

Correction:
K t = t CtT (Ct t CtT + Qt ) 1
t = t + K t ( zt Ct t )

K t = t H tT ( H t t H tT + Qt ) 1
t = t + K t ( zt h( t ))
t = ( I K t H t ) t

Return t, t

t = ( I K t Ct ) t

Ht =

h( t )
xt

Page 5

Gt =

g (ut , t 1 )
xt 1

Linearization via Unscented Transform

EKF

UKF

UKF Sigma-Point Estimate (4)

Page 6

UKF intuition why it can perform better

Assume we know the distribution over X and it has a mean \bar{x}

Y = f(X)

[Julier and Uhlmann, 1997]

UKF intuition why it can perform better

Assume

1. We represent our distribution over x by a set of sample

points.
2. We propagate the points directly through the function f.

Then:

We dont have any errors in f !!

The accuracy we obtain can be related to how well the
first, second, third, moments of the samples correspond
to the first, second, third, moments of the true
distribution over x.

Page 7

Self-quiz

When would the UKF significantly outperform the EKF?

Original unscented transform

Picks a minimal set of sample points that match 1st, 2nd

and 3rd moments of a Gaussian:

\bar{x} = mean, Pxx = covariance, i ith row, x n

\kappa : extra degree of freedom to fine-tune the higher
order moments of the approximation; when x is
Gaussian, n+\kappa = 3 is a suggested heuristic
[Julier and Uhlmann, 1997]

Page 8

Unscented Kalman filter

Dynamics update:

Can simply use unscented transform and estimate

the mean and variance at the next time from the
sample points

Observation update:

Use sigmapoints from unscented transform to

compute the covariance matrix between xt and zt.
Then can do the standard update.

Algorithm Unscented Kalman filter(t1 , t1 , ut , zt ):

1. Xt1 = t1 t1 + t1 t1 t1

2. Xt = g(t , Xt1 )
2n [i] [i]
3.
t =
wm Xt
i=0

[i]
t = 2n wc[i] (X [i]
4.
t )(Xt
t ) + Rt
t
i=0

t
t
5. Xt =
t
t +
t

6. Zt = h(Xt )
2n [i] [i]
7. zt = i=0 wm Zt

2n [i] [i]
[i]
8. St = i=0 wc Zt zt Zt zt
+ Qt

[i]
x,z = 2n wc[i] X [i]
9.
t Zt zt
t
t
i=0

x,z S 1
10. Kt =
t
t

11. t =
t + Kt (zt zt )
t Kt St Kt
12. t =
13. return t , t

[Table 3.4 in Probabilistic Robotics]

Page 9

UKF Summary

Highly efficient: Same complexity as EKF, with a constant

factor slower in typical practical applications
Better linearization than EKF: Accurate in first two terms
of Taylor expansion (EKF only first term) + capturing more
aspects of the higher order terms

Derivative-free: No Jacobians needed

Still not optimal!

Particle filters motivation

Particle filters are a way to efficiently represent

non-Gaussian distribution

Basic principle

Set of state hypotheses (particles)

Survival-of-the-fittest

Page 10

Sample-based Localization (sonar)

FastSLAM

[particle filter + Rao-Blackwellization +

occupancy grid mapping + scan matching based odometry]

Page 11

Sample-based Mathematical Description of

Probability Distributions

Particle sets can be used to approximate functions

The more particles fall into an interval, the higher the

probability of that interval

If a continuous density needs to be extracted can place

a narrow Gaussian at the location of each particle

How to efficiently draw samples in an HMM?

Dynamics update with sample representation

of distribution: sample from P(xt+1 | xt)

Page 12

Observation update

Goal: go from a sample-based representation of

P(xt+1 | z1, , zt)
to a sample-based representation of
P(xt+1 | z1, , zt, zt+1) =
C * P(xt+1 | z1, , zt) * P(zt+1 | xt+1)

Importance sampling

Interested in estimating:

EXP [f (X)] =
f (x)P (x)dx
x
Q(x)
=
f (x)P (x)
dx ifQ(x) = 0 P (x) = 0
Q(x)
x

P (x)
=
f (x)
Q(x)dx
Q(x)
x
P (X)
= EXQ [
f (X)]
Q(X)
m
1 P (x(i) )

f (x(i) ) with x(i) Q

(i) )
m
Q(x
i=1
Hence we could sample from an alternative distribution Q and
simply re-weight the samples == Importance Sampling

Page 13

Sequential Importance Sampling (SIS) Particle Filter

Sample x(1)1, x(2)1, , x(N)1 from P(X1)

Set w(i)1= 1 for all i=1,,N

For t=1, 2,

Dynamics update:

For i=1, 2, , N
(i)
(i)
Sample x t from P(Xt+1 | Xt = x t)
Observation update:

For i=1, 2, , N
(i)
(i)
(i)
w t+1 = w t* P(zt+1 | Xt+1 = x t+1)

At any time t, the distribution is represented by the weighted set of samples

{ <x(i)t, w(i)t> ; i=1,,N}

SIS particle filter major issue

The resulting samples are only weighted by the

evidence
The samples themselves are never affected by the
evidence

Fails to concentrate particles/computation in the high

probability areas of the distribution P(xt | z1, , zt)

Page 14

Resampling

At any time t, the distribution is represented by the

weighted set of samples
{ <x(i)t, w(i)t> ; i=1,,N}

Sample N times from the set of particles

The probability of drawing each particle is given by its
importance weight

More particles/computation focused on the parts of the

state space with high probability mass

1. Algorithm particle_filter( St-1, ut-1 zt):

2. St = ,

3. For i = 1K n

Generate new samples

Sample index j(i) from the discrete distribution given by wt-1

Sample xti from p ( xt | xt 1 , ut 1 ) using xtj(1i ) and ut 1

wti = p ( zt | xti )

Compute importance weight

= + wti

Update normalization factor

St = S t {< xti , wti >}

Insert

9. For i = 1K n
10.

wti = wti /

Normalize weights

Page 15

Particle Filters

Sensor Information: Importance Sampling

Bel ( x) p( z | x) Bel ( x)
p ( z | x) Bel ( x)
w

= p( z | x)
Bel ( x)

Page 16

Robot Motion
Bel ( x)

p( x | u x' ) Bel ( x' )

d x'

Sensor Information: Importance Sampling

Bel ( x) p( z | x) Bel ( x)
p ( z | x) Bel ( x)
w

= p( z | x)
Bel ( x)

Page 17

Robot Motion
Bel ( x)

p( x | u x' ) Bel ( x' )

Resampling issue

Loss of samples

Page 18

d x'

Low Variance Resampling

Advantages:

More systematic coverage of space of samples

If all samples have same importance weight, no
samples are lost
Lower computational complexity

Page 19

Regularization

If no dynamics noise all particles will start to coincide

regularization: resample from a (narrow) Gaussian
around the particle

Mobile Robot Localization

Each particle is a potential pose of the robot

Proposal distribution is the motion model of the robot

(prediction step)

The observation model is used to compute the importance

weight (correction step)

Page 20

Motion Model Reminder

Start

Proximity Sensor Model Reminder

Sonar sensor

Laser sensor

Note: sensor model is not Gaussian at all!

Page 21

Sample-based Localization (sonar)

Page 22

Page 23

Page 24

Page 25

Page 26

Page 27

Page 28

Page 29

Page 30

Summary Particle Filters

Particle filters are an implementation of

recursive Bayesian filtering
They represent the posterior by a set of
weighted samples

They can model non-Gaussian distributions

Proposal to draw new samples

Weight to account for the differences between

the proposal and the target
Monte Carlo filter, Survival of the fittest,
Condensation, Bootstrap filter
77

Page 31

Summary PF Localization

In the context of localization, the particles are

propagated according to the motion model.
They are then weighted according to the likelihood of
the observations.
In a re-sampling step, new particles are drawn with a
probability proportional to the likelihood of the
observation.

Page 32

Announcements

PS2: due Friday 23:59pm.

Final project: 45% of the grade, 10% presentation, 35%

write-up

Presentations: in lecture Dec 1 and 3 --- schedule:

CS 287: Advanced Robotics

Fall 2009
Lecture 24:
SLAM

Pieter Abbeel
UC Berkeley EECS

Page 1

Types of SLAM-Problems

Grid maps or scans

[Lu & Milios, 97; Gutmann, 98: Thrun 98; Burgard, 99; Konolige & Gutmann, 00; Thrun, 00; Arras, 99; Haehnel,
01;]

Landmark-based

[Leonard et al., 98; Castelanos et al., 99: Dissanayake et al., 2001; Montemerlo et al., 2002;

Recap Landmark based SLAM

State variables:

Robot pose

Coordinates of each of the landmarks

Robot dynamics model: P(xt+1 | xt, ut)

Sensor model: P(zt+1 | xt, m)

Probability of landmark observations given the state

Can run EKF, SEIF, various other approaches

Result: path of robot, location of landmarks

KF-type approaches are a good fit b/c they can keep track of
correlations between landmarks
Note: Could then use path of robot + sensor log and build a map
assuming known robot poses

Page 2

Grid-based SLAM

Can we solve the SLAM problem if no pre-defined landmarks

are available?

As with landmarks, the map depends on the poses of the

robot during data acquisition

If the poses are known, grid-based mapping is easy

(mapping with known poses)

Occupancy Grid Maps

Introduced by Moravec and Elfes in 1985

Represent environment by a grid.

Estimate the probability that a location is

occupied by an obstacle.
Key assumptions

Occupancy of individual cells (m[xy]) is independent

Bel (mt ) = P (mt | u1 , z 2 K , ut 1 , zt )

= Bel (mt[ xy ] )
x, y

Robot positions are known!

Page 3

Occupancy Value Depending on the

Measured Distance
z+d1
z

z+d2
z+d3

z-d1

log

P (m[xy] = 1)
P (m[xy] = 1)
P (m[xy] = 1|zt )

log
+
log
P (m[xy] = 0)
P (m[xy] = 0)
P (m[xy] = 0|zt )
8

Incremental Updating
of Occupancy Grids (Example)

Page 4

Alternative: Simple Counting

For every cell count

hits(x,y): number of cases where a beam ended at

<x,y>
misses(x,y): number of cases where a beam passed
through <x,y>

Bel (m[ xy ] ) =

hits( x, y )
hits( x, y ) + misses( x, y )

Value of interest: P(reflects(x,y))

Difference between Occupancy Grid Maps and Counting

The counting model determines how often a cell reflects a

beam.
The occupancy model represents whether or not a cell is
occupied by an object.
Although a cell might be occupied by an object, the
reflection probability of this object might be very small.

Page 5

Example Occupancy Map

Example Reflection Map

glass panes

Page 6

Example

Out of 1000 beams only 60% are reflected from a cell and
40% intercept it without ending in it.
Accordingly, the reflection probability will be 0.6.
Suppose p(occ | z) = 0.55 when a beam ends in a cell and
p(occ | z) = 0.45 when a cell is intercepted by a beam that
does not end in it.
Accordingly, after n measurements we will have
0.55

0.45

n*0.6

0.45
*

0.55

n*0.4

11
=
9

n*0.6

11
*
9

n*0.4

11
=
9

n*0.2

Whereas the reflection map yields a value of 0.6, the

occupancy grid value converges to 1.
15

Mapping using Raw Odometry

Page 7

Distribution over robot poses and maps

Standard particle filter represents the distribution by a

set of samples

Rao-Blackwellization
poses

map

observations & movements

Factorization first introduced by Murphy in 1999

Page 8

Rao-Blackwellization
poses

map

observations & movements

SLAM posterior
Robot path posterior
Mapping with known poses
Factorization first introduced by Murphy in 1999

Rao-Blackwellization

This is localization, use MCL

Use the pose estimate
from the MCL part and apply
mapping with known poses
20

Page 9

Rao-Blackwellized Mapping
Each particle represents a possible trajectory of the
robot
Each particle
maintains its own map and
updates it upon mapping with known poses
Each particle survives with a probability proportional to
the likelihood of the observations relative to its own
map

Particle Filter Example

3 particles

map of particle 3

map of particle 1

map of particle 2

Page 10

Problem
Each map is quite big in case of grid maps
Since each particle maintains its own map
Therefore, one needs to keep the number of particles
small
Solution:
Compute better proposal distributions!
Idea:
Improve the pose estimate before applying the particle
filter

Pose Correction Using Scan Matching

Maximize the likelihood of the i-th pose and map relative to
the (i-1)-th pose and map

t 1) p(xt | ut 1, xt 1)}
xt = argmax{p(zt | xt , m
xt

current measurement

robot motion

map constructed so far

Page 11

Motion Model for Scan Matching

Raw Odometry
Scan Matching

Map: Intel Research Lab Seattle

FastSLAM with Scan-Matching

Page 12

Map of the Intel Lab

15 particles
four times faster
than real-time
P4, 2.8GHz
5cm resolution
during scan
matching
1cm resolution in
final map

FastSLAM with Scan-Matching

Loop Closure
32

Page 13

Scan matching: likelihood field

Likelihood field

Map m

=map convolved with a Gaussian

Scan Matching

Extract likelihood field from scan and use it to match

different scan.

Page 14

FastSLAM recap

Rao-Blackwellized representation:

Particle instantiates entire path of robot

Map associated with each path

Scan matching: improves proposal distribution

Original FastSLAM:

Map associated with each particle was a Gaussian

distribution over landmark positions

DP-SLAM: extension which has very efficient map

management, enabling having a relatively large number
of particles [Eliazar and Parr, 2002/2005]

SLAM thus far

Landmark based vs. occupancy grid

Probability distribution representation:

EKF vs. particle filter vs. Rao-Blackwellized particle

filter

EKF, SEIF, FastSLAM are all online

Currently popular 4th alternative: GraphSLAM

Page 15

Graph-based Formulation

Use a graph to represent the problem

Every node in the graph corresponds to a pose of the
robot during mapping
Every edge between two nodes corresponds to the
spatial constraints between them

Goal:
Find a configuration of the nodes that minimize the error
introduced by the constraints
JGraphSLAM

= x
0 0 x0 +
+

(xt g(ut , xt1 )) Rt1 (xt g(ut , xt1 ))

(zti

i
i
h(xt , m, cit )) Q1
t (zt h(xt , m, ct ))

The KUKA Production Site

Page 16

The KUKA Production Site

scans
total acquisition time
traveled distance
total rotations
size
processing time

59668
4,699.71 seconds
2,587.71 meters
262.07 radians
180 x 110 meters
< 30 minutes

Page 17

GraphSLAM

Visual SLAM for Flying Vehicles

Bastian Steder, Giorgio Grisetti, Cyrill Stachniss, Wolfram Burgard

Autonomous Blimp

Page 18

Recap tentative syllabus

Control: underactuation, controllability, Lyapunov, dynamic

programming, LQR, feedback linearization, MPC
Reinforcement learning: value iteration, policy iteration, linear
programming, Q learning, TD, value function approximation, Sarsa,
LSTD, LSPI, policy gradient, imitation learning, inverse
reinforcement learning, reward shaping, exploration vs. exploitation
Estimation: Bayes filters, KF, EKF, UKF, particle filter, occupancy
grid mapping, EKF slam, GraphSLAM, SEIF, FastSLAM
Manipulation and grasping: force closure, grasp point selection,
visual servo-ing, more sub-topics tbd
Case studies: autonomous helicopter, Darpa Grand/Urban
Challenge, walking, mobile manipulation.
Brief coverage of: system identification, simulation, pomdps, karmed bandits, separation principle

Page 19

Smart Health Care System Level 1
33% (3)
Smart Health Care System Level 1
16 pages
Internet of Things - Architecture and Protocols - Unit 2
100% (1)
Internet of Things - Architecture and Protocols - Unit 2
57 pages
Unit2 Iot
100% (1)
Unit2 Iot
89 pages
Res Unit-I Solar Energy
No ratings yet
Res Unit-I Solar Energy
163 pages
Advances in Engineering Materials: R. K. Tyagi Pallav Gupta Prosenjit Das Rajiv Prakash
No ratings yet
Advances in Engineering Materials: R. K. Tyagi Pallav Gupta Prosenjit Das Rajiv Prakash
377 pages
Robotics
No ratings yet
Robotics
96 pages
Animal Detection 1
No ratings yet
Animal Detection 1
56 pages
B20-ml Basedbotnet Attack in IoT Devices
No ratings yet
B20-ml Basedbotnet Attack in IoT Devices
66 pages
Introduction To Mobile Robots and Locomotion
No ratings yet
Introduction To Mobile Robots and Locomotion
23 pages
Lyapunov Based-Control
100% (1)
Lyapunov Based-Control
311 pages
Formulatinghypotheses 110911135920 Phpapp02
No ratings yet
Formulatinghypotheses 110911135920 Phpapp02
53 pages
Vikash Seminar Report
100% (1)
Vikash Seminar Report
31 pages
Thesis of Path Planning of Mobile Robots Using Genetic Algorithms
No ratings yet
Thesis of Path Planning of Mobile Robots Using Genetic Algorithms
151 pages
Mach-Effect Thruster Model: Acta Astronautica September 2017
No ratings yet
Mach-Effect Thruster Model: Acta Astronautica September 2017
36 pages
Amorphous Computing and Swarm Intelligence
100% (1)
Amorphous Computing and Swarm Intelligence
35 pages
Voice Controlled Car: BS Documentation by Hammad Malik (F16-1244) Arslan Ali (F16-1160) Hassam Akram (F16-1153)
100% (1)
Voice Controlled Car: BS Documentation by Hammad Malik (F16-1244) Arslan Ali (F16-1160) Hassam Akram (F16-1153)
16 pages
Tourism-Shekh Alam
No ratings yet
Tourism-Shekh Alam
30 pages
Building Microservice Architectures Neal Ford PDF
No ratings yet
Building Microservice Architectures Neal Ford PDF
80 pages
Face Mask Detection
No ratings yet
Face Mask Detection
44 pages
Karanja Evanson Mwangi Cit Masters Report Libre PDF
No ratings yet
Karanja Evanson Mwangi Cit Masters Report Libre PDF
136 pages
Applsci 13 04144 v2
No ratings yet
Applsci 13 04144 v2
26 pages
Robotics Roadmap
No ratings yet
Robotics Roadmap
5 pages
WSN in Agriculture Using ML
No ratings yet
WSN in Agriculture Using ML
23 pages
EGMN 420 001 32729 CAE Design Syllabus 2017.08.23 221700
No ratings yet
EGMN 420 001 32729 CAE Design Syllabus 2017.08.23 221700
65 pages
Acoustic Emission Sensing in Wireless Sensor Networks
No ratings yet
Acoustic Emission Sensing in Wireless Sensor Networks
143 pages
IOT - PLAN of ACTION
100% (1)
IOT - PLAN of ACTION
3 pages
Eye Directive Wheelchair Seminar
No ratings yet
Eye Directive Wheelchair Seminar
18 pages
NXP LPC1768 & Keil Quadcopter Project Lab Manual PDF
No ratings yet
NXP LPC1768 & Keil Quadcopter Project Lab Manual PDF
99 pages
News&Notes Entire Magazine
No ratings yet
News&Notes Entire Magazine
40 pages
"Hyperloop - New Transportation System": Seminar Report On
No ratings yet
"Hyperloop - New Transportation System": Seminar Report On
21 pages
Agriculture Robot Seminar Report
No ratings yet
Agriculture Robot Seminar Report
21 pages
Cloud Robotics IOT
No ratings yet
Cloud Robotics IOT
14 pages
Internet of Behaviour
No ratings yet
Internet of Behaviour
21 pages
National Institute of Technology, Raipur: Smart Irrigation System Using Lora
No ratings yet
National Institute of Technology, Raipur: Smart Irrigation System Using Lora
22 pages
Click To Get Netflix Cookies #10
No ratings yet
Click To Get Netflix Cookies #10
23 pages
Swachh Hasth-A Water Cleaning Robot
No ratings yet
Swachh Hasth-A Water Cleaning Robot
5 pages
2012 TV Firmware Upgrade Instruction T-MST10PAUSC
0% (1)
2012 TV Firmware Upgrade Instruction T-MST10PAUSC
5 pages
Facial Emotion Recognition Using Machine Learning
No ratings yet
Facial Emotion Recognition Using Machine Learning
10 pages
Workshop Lab: Arduino
No ratings yet
Workshop Lab: Arduino
8 pages
A Smart Medicine Box For Medication Management Using IoT
No ratings yet
A Smart Medicine Box For Medication Management Using IoT
6 pages
Agriculture Robot: Department of Mechanical Engineering
No ratings yet
Agriculture Robot: Department of Mechanical Engineering
23 pages
Abaqus Analysis User's Guide (6.13)
No ratings yet
Abaqus Analysis User's Guide (6.13)
11 pages
Earthquake Alarm
No ratings yet
Earthquake Alarm
11 pages
Suony Wind Mill Presentation
No ratings yet
Suony Wind Mill Presentation
17 pages
IoT Robotics
No ratings yet
IoT Robotics
7 pages
Vision Based Lane Detection For Unmanned Ground Vehicle (UGV)
No ratings yet
Vision Based Lane Detection For Unmanned Ground Vehicle (UGV)
19 pages
Plant Disease Detection Using Convolutional Neural Network
No ratings yet
Plant Disease Detection Using Convolutional Neural Network
14 pages
Plant Disease Detection Robot Using Raspberry Pi
No ratings yet
Plant Disease Detection Robot Using Raspberry Pi
10 pages
Deep Learning Report Waqas PDF
No ratings yet
Deep Learning Report Waqas PDF
16 pages
00 Robotics Lec00
No ratings yet
00 Robotics Lec00
14 pages
Conditioning Circuits
No ratings yet
Conditioning Circuits
57 pages
Introduction To Web Browser Internals
No ratings yet
Introduction To Web Browser Internals
5 pages
ECE 445 Biomedical Instrumentation Welcome: Dr. Wen Li 1216 Engineering Building
No ratings yet
ECE 445 Biomedical Instrumentation Welcome: Dr. Wen Li 1216 Engineering Building
11 pages
Four Wave Mixing
No ratings yet
Four Wave Mixing
17 pages
Basit Ali MS Research Proposal
No ratings yet
Basit Ali MS Research Proposal
10 pages
Formal Languages and Automata: Lab 4. Exercises
No ratings yet
Formal Languages and Automata: Lab 4. Exercises
2 pages
Robotics Engineering Detailed Guide
No ratings yet
Robotics Engineering Detailed Guide
1 page
Tide and Wave Energy
No ratings yet
Tide and Wave Energy
8 pages
Model Predictive Control Toolbox
No ratings yet
Model Predictive Control Toolbox
11 pages
Cryptocurrency Exchange Development
No ratings yet
Cryptocurrency Exchange Development
10 pages
Visio 2010 Developer Training 03 - Programming With The Visio Object Model
100% (1)
Visio 2010 Developer Training 03 - Programming With The Visio Object Model
333 pages
Objectives: Experiment No. 2 Driving Stepper Motor
No ratings yet
Objectives: Experiment No. 2 Driving Stepper Motor
6 pages
Object Detection and Avoidance in Unmanned Ground Vehicle Using Arduino1
No ratings yet
Object Detection and Avoidance in Unmanned Ground Vehicle Using Arduino1
4 pages
Advanced Robot Creation
No ratings yet
Advanced Robot Creation
12 pages
16 Robotics Visions Warm Intelligence Traffic Safety
No ratings yet
16 Robotics Visions Warm Intelligence Traffic Safety
9 pages
E Puck Based Testing
No ratings yet
E Puck Based Testing
8 pages
Robotics and Computer Vision in Swarm Intelligence and Traffic Safety
No ratings yet
Robotics and Computer Vision in Swarm Intelligence and Traffic Safety
10 pages
Rayleigh-Taylor Instability - Wikipedia, The Free Encyclopedia
No ratings yet
Rayleigh-Taylor Instability - Wikipedia, The Free Encyclopedia
8 pages
Sentiment Analysis Report
No ratings yet
Sentiment Analysis Report
4 pages
Lecture - 03 Stability of Equilibrium Points
No ratings yet
Lecture - 03 Stability of Equilibrium Points
35 pages
THC Hydra Article r3
100% (1)
THC Hydra Article r3
3 pages
AWS Lab Workbook v1.0
93% (14)
AWS Lab Workbook v1.0
127 pages
Analog Sensors PDF
No ratings yet
Analog Sensors PDF
110 pages
Windows 7 Book
100% (1)
Windows 7 Book
174 pages
Files - Python Questions and Answers - Sanfoundry
100% (1)
Files - Python Questions and Answers - Sanfoundry
15 pages
Lect 1 Intro
No ratings yet
Lect 1 Intro
22 pages
Digital Sensors: Dr. Ashraf Saleem
No ratings yet
Digital Sensors: Dr. Ashraf Saleem
11 pages
EC303 - Chapter 1 Computer Architecture Organization
No ratings yet
EC303 - Chapter 1 Computer Architecture Organization
61 pages
PEGA PRPC 05 Activities
No ratings yet
PEGA PRPC 05 Activities
36 pages
BC670 PDF
No ratings yet
BC670 PDF
225 pages
Webservices Questions What Is Web Services?
No ratings yet
Webservices Questions What Is Web Services?
6 pages
Generalized SAP BI Unit Test Case Templates
No ratings yet
Generalized SAP BI Unit Test Case Templates
6 pages
PetaSAN-Quick Start
No ratings yet
PetaSAN-Quick Start
29 pages
Induction Program: QA Orientation
No ratings yet
Induction Program: QA Orientation
15 pages
9-6 Monitor Users Guide
No ratings yet
9-6 Monitor Users Guide
119 pages
Bpmn2 0 Poster en
100% (1)
Bpmn2 0 Poster en
1 page
Unit 20 Implementing Managing and Maintaining A Microsoft Windows Server Network Infrastructure
No ratings yet
Unit 20 Implementing Managing and Maintaining A Microsoft Windows Server Network Infrastructure
4 pages
Ax40 Enus Ins 02
No ratings yet
Ax40 Enus Ins 02
28 pages
Pod2g Jailbreak Techniques, WWJC 2012
No ratings yet
Pod2g Jailbreak Techniques, WWJC 2012
56 pages
Lecture - 07 Stability of Feedback Systems
No ratings yet
Lecture - 07 Stability of Feedback Systems
31 pages
Two-Dimensional Systems
No ratings yet
Two-Dimensional Systems
21 pages
Wireless ECG Monitoring System Using PIC 16F877A and nRF905
50% (2)
Wireless ECG Monitoring System Using PIC 16F877A and nRF905
161 pages
Binary Search Algorithm
100% (1)
Binary Search Algorithm
12 pages
Dynamic Programming: 17.1. The Coin Changing Problem
No ratings yet
Dynamic Programming: 17.1. The Coin Changing Problem
3 pages
SAP BI 7.0 - InfoCube Partitioning
No ratings yet
SAP BI 7.0 - InfoCube Partitioning
5 pages
Bison
No ratings yet
Bison
108 pages
Assignment 2 PDF
No ratings yet
Assignment 2 PDF
45 pages
UPC1
No ratings yet
UPC1
24 pages
WMS Gaming Article - v1 PDF
No ratings yet
WMS Gaming Article - v1 PDF
13 pages
Constants in C
No ratings yet
Constants in C
4 pages
Aastra-Web Recovery Procedure PDF
No ratings yet
Aastra-Web Recovery Procedure PDF
3 pages