
CS229 Lecture notes

Dan Boneh & Andrew Ng
Scribe: Guillaume Genthial

Part XIV
LQR, DDP and LQG
Linear Quadratic Regulation, Differential Dynamic Programming and Linear
Quadratic Gaussian

1 Finite-horizon MDPs
In the previous set of notes about Reinforcement Learning, we defined Markov
Decision Processes (MDPs) and covered Value Iteration / Policy Iteration in
a simplified setting. More specifically we introduced the optimal Bellman

equation that defines the optimal value function V^{\pi^*} of the optimal policy \pi^*:

V^{\pi^*}(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V^{\pi^*}(s')

Recall that from the optimal value function, we were able to recover the
optimal policy π ∗ with
\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^*(s')

In this set of lecture notes we’ll place ourselves in a more general setting:

1. We want to write equations that make sense for both the discrete and
the continuous case. We’ll therefore write

E_{s' \sim P_{sa}}\left[ V^{\pi^*}(s') \right]

instead of

\sum_{s' \in S} P_{sa}(s') V^{\pi^*}(s')

meaning that we take the expectation of the value function at the next
state. In the finite case, we can rewrite the expectation as a sum over
states. In the continuous case, we can rewrite the expectation as an
integral. The notation s' \sim P_{sa} means that the state s' is sampled from the distribution P_{sa}.

2. We’ll assume that the rewards depend on both states and actions. In
other words, R : S × A → R. This implies that the previous mechanism
for computing the optimal action is changed into

\pi^*(s) = \arg\max_{a \in A} \left[ R(s, a) + \gamma\, E_{s' \sim P_{sa}}\left[ V^{\pi^*}(s') \right] \right]


3. Instead of considering an infinite-horizon MDP, we'll assume that we have a finite-horizon MDP that will be defined as a tuple

(S, A, P_{sa}, T, R)

with T > 0 the time horizon (for instance T = 100). In this setting,
our definition of payoff is going to be (slightly) different:

R(s_0, a_0) + R(s_1, a_1) + \cdots + R(s_T, a_T)

instead of (infinite horizon case)

R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \dots = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)

What happened to the discount factor γ? Remember that the introduction of γ was (partly) justified by the necessity of making sure that

the infinite sum would be finite and well-defined. If the rewards are
bounded by a constant R̄, the payoff is indeed bounded by


\left| \sum_{t=0}^{\infty} R(s_t) \gamma^t \right| \le \bar{R} \sum_{t=0}^{\infty} \gamma^t

and we recognize a geometric sum! Here, as the payoff is a finite sum, the discount factor γ is not necessary anymore.

In this new setting, things behave quite differently. First, the optimal
policy π ∗ might be non-stationary, meaning that it changes over time.
In other words, now we have

\pi^{(t)} : S \to A

where the superscript (t) denotes the policy at time step t. The dynamics of the finite-horizon MDP following policy \pi^{(t)} proceed as follows: we start in some state s_0 and take some action a_0 := \pi^{(0)}(s_0) according to our policy at time step 0. The MDP transitions to a successor s_1, drawn according to P_{s_0 a_0}. Then, we get to pick another action a_1 := \pi^{(1)}(s_1) following our new policy at time step 1, and so on.

Why does the optimal policy happen to be non-stationary in the finite-horizon setting? Intuitively, as we have a finite number of actions to
take, we might want to adopt different strategies depending on where
we are in the environment and how much time we have left. Imagine
a grid with 2 goals with rewards +1 and +10. At the beginning, we
might want to take actions to aim for the +10 goal. But if after some
steps, dynamics somehow pushed us closer to the +1 goal and we don’t
have enough steps left to be able to reach the +10 goal, then a better
strategy would be to aim for the +1 goal...
4. This observation allows us to use time-dependent dynamics

s_{t+1} \sim P^{(t)}_{s_t a_t}

meaning that the transition distribution P^{(t)}_{s_t a_t} changes over time. The same thing can be said about R^{(t)}. Note that this setting is a better

model for real life. In a car, the gas tank empties, traffic changes,
etc. Combining the previous remarks, we’ll use the following general
formulation for our finite horizon MDP

\left( S, A, P^{(t)}_{sa}, T, R^{(t)} \right)

Remark: notice that the above formulation would be equivalent to adding the time into the state.
The value function at time t for a policy π is then defined in the same
way as before, as an expectation over trajectories generated following
policy π starting in state s.

V_t(s) = E\left[ R^{(t)}(s_t, a_t) + \cdots + R^{(T)}(s_T, a_T) \,\middle|\, s_t = s, \pi \right]

Now, the question is

In this finite-horizon setting, how do we find the optimal value function

V_t^*(s) = \max_{\pi} V_t^{\pi}(s)

It turns out that Bellman’s equation for Value Iteration is made for Dy-
namic Programming. This may come as no surprise as Bellman is one of
the fathers of dynamic programming and the Bellman equation is strongly
related to the field. To understand how we can simplify the problem by
adopting an iteration-based approach, we make the following observations:

1. Notice that at the end of the game (for time step T ), the optimal value
is obvious

\forall s \in S: \quad V_T^*(s) := \max_{a \in A} R^{(T)}(s, a) \qquad (1)

2. For another time step 0 ≤ t < T, if we suppose that we know the optimal value function for the next time step V_{t+1}^*, then we have

\forall t < T,\ s \in S: \quad V_t^*(s) := \max_{a \in A}\left[ R^{(t)}(s, a) + E_{s' \sim P^{(t)}_{sa}}\left[ V_{t+1}^*(s') \right] \right] \qquad (2)

With these observations in mind, we can come up with a clever algorithm to solve for the optimal value function:

1. compute V_T^* using equation (1).

2. for t = T − 1, . . . , 0:

   compute V_t^* from V_{t+1}^* using equation (2)
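To make the backward recursion concrete, here is a minimal tabular sketch (the function and array names are our own assumptions, not from the notes), with the time-dependent rewards R[t][s, a] and transitions P[t][s, a, s'] stored as numpy arrays:

import numpy as np

def finite_horizon_value_iteration(P, R):
    """Backward dynamic programming for a finite-horizon MDP.

    P: array of shape (T, n_states, n_actions, n_states), P[t, s, a, s'] = P^(t)_{sa}(s')
    R: array of shape (T + 1, n_states, n_actions), R[t, s, a] = R^(t)(s, a)
    Returns the optimal values V[t, s] and the greedy (non-stationary) policies pi[t, s].
    """
    T = P.shape[0]
    n_states, n_actions = R.shape[1], R.shape[2]
    V = np.zeros((T + 1, n_states))
    pi = np.zeros((T + 1, n_states), dtype=int)

    # equation (1): V_T^*(s) = max_a R^(T)(s, a)
    V[T] = R[T].max(axis=1)
    pi[T] = R[T].argmax(axis=1)

    # equation (2): V_t^*(s) = max_a [ R^(t)(s, a) + E_{s' ~ P^(t)_{sa}}[ V_{t+1}^*(s') ] ]
    for t in range(T - 1, -1, -1):
        Q = R[t] + P[t] @ V[t + 1]        # shape (n_states, n_actions)
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi

The rows pi[t] give exactly the non-stationary optimal policy π^(t) discussed above.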

Side note We can interpret standard value iteration as a special case of this general case, but without keeping track of time. It turns out that in the standard setting, if we run value iteration for T steps, we get a γ^T approximation of the optimal value function (geometric convergence). See problem set 4 for a proof of the following result:

Theorem Let B denote the Bellman update and \|f\|_\infty := \sup_x |f(x)|. If V_t denotes the value function at the t-th step, then

\|V_{t+1} - V^*\|_\infty = \|B(V_t) - V^*\|_\infty
                        \le \gamma \|V_t - V^*\|_\infty
                        \le \gamma^t \|V_1 - V^*\|_\infty

In other words, the Bellman operator B is a γ-contracting operator.
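As a quick numeric sanity check of this contraction property (a sketch of our own, not from the notes), on a small random MDP with state-only rewards as in the Bellman equation of Section 1:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)           # P[s, a, s'] are transition probabilities
R = rng.random(n_states)                    # state-only rewards R(s)

def bellman(V):
    # B(V)(s) = R(s) + gamma * max_a sum_{s'} P[s, a, s'] V(s')
    return R + gamma * (P @ V).max(axis=1)

V_star = np.zeros(n_states)                 # approximate V* by iterating B many times
for _ in range(1000):
    V_star = bellman(V_star)

V = np.zeros(n_states)
for t in range(5):
    lhs = np.abs(bellman(V) - V_star).max()     # ||B(V_t) - V*||_inf
    rhs = gamma * np.abs(V - V_star).max()      # gamma * ||V_t - V*||_inf
    assert lhs <= rhs + 1e-10
    V = bellman(V)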

2 Linear Quadratic Regulation (LQR)


In this section, we’ll cover a special case of the finite-horizon setting described
in Section 1, for which the exact solution is (easily) tractable. This model
is widely used in robotics, and a common technique in many problems is to
reduce the formulation to this framework.
First, let’s describe the model’s assumptions. We place ourselves in the
continuous setting, with

S = \mathbb{R}^n, \quad A = \mathbb{R}^d
and we’ll assume linear transitions (with noise)

s_{t+1} = A_t s_t + B_t a_t + w_t

where A_t \in \mathbb{R}^{n \times n}, B_t \in \mathbb{R}^{n \times d} are matrices and w_t \sim N(0, \Sigma_t) is some Gaussian noise (with zero mean). As we'll show in the following paragraphs,

it turns out that the noise, as long as it has zero mean, does not impact the
optimal policy!
We’ll also assume quadratic rewards

R^{(t)}(s_t, a_t) = -s_t^\top U_t s_t - a_t^\top W_t a_t

where U_t \in \mathbb{R}^{n \times n}, W_t \in \mathbb{R}^{d \times d} are positive definite matrices (meaning that the reward is always negative).

Remark Note that the quadratic formulation of the reward is equivalent to saying that we want our state to be close to the origin (where the reward is higher). For example, if U_t = I_n (the identity matrix) and W_t = I_d, then R_t = -\|s_t\|^2 - \|a_t\|^2, meaning that we want to take smooth actions (small norm of a_t) to go back to the origin (small norm of s_t). This could model a car trying to stay in the middle of the lane without making impulsive moves...

Now that we have defined the assumptions of our LQR model, let’s cover
the 2 steps of the LQR algorithm

step 1 suppose that we don't know the matrices A, B, Σ. To estimate them, we can follow the ideas outlined in the Value Approximation section of the RL notes. First, collect transitions from an arbitrary policy. Then, use linear regression to find

\arg\min_{A,B} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \left\| s_{t+1}^{(i)} - \left( A s_t^{(i)} + B a_t^{(i)} \right) \right\|^2

Finally, use a technique seen in Gaussian Discriminant Analysis to learn Σ (a minimal sketch of this estimation step is given right after this list).

step 2 assuming that the parameters of our model are known (given or esti-
mated with step 1), we can derive the optimal policy using dynamic
programming.
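As the sketch promised in step 1 (our own names; the collected transitions are assumed flattened into arrays of (s_t, a_t, s_{t+1}) triples, and we fit a single time-invariant A, B, Σ for simplicity):

import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of s_{t+1} ≈ A s_t + B a_t, plus the Gaussian noise covariance.

    states:      array of shape (N, n) of observed s_t
    actions:     array of shape (N, d) of observed a_t
    next_states: array of shape (N, n) of observed s_{t+1}
    """
    X = np.hstack([states, actions])                     # (N, n + d)
    # solve min_{A,B} sum_i || s_{t+1}^(i) - (A s_t^(i) + B a_t^(i)) ||^2
    Theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    Theta = Theta.T                                       # (n, n + d), i.e. [A  B]
    n = states.shape[1]
    A, B = Theta[:, :n], Theta[:, n:]
    residuals = next_states - X @ Theta.T
    Sigma = residuals.T @ residuals / len(states)         # empirical noise covariance
    return A, B, Sigma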

In other words, coming back to step 2: given

\begin{cases} s_{t+1} = A_t s_t + B_t a_t + w_t \\ R^{(t)}(s_t, a_t) = -s_t^\top U_t s_t - a_t^\top W_t a_t \end{cases} \qquad A_t, B_t, U_t, W_t, \Sigma_t \text{ known}

we want to compute V_t^*. If we go back to Section 1, we can apply dynamic programming, which yields

1. Initialization step

   For the last time step T,

   V_T^*(s_T) = \max_{a_T \in A} R^{(T)}(s_T, a_T)
              = \max_{a_T \in A} \left( -s_T^\top U_T s_T - a_T^\top W_T a_T \right)
              = -s_T^\top U_T s_T \qquad \text{(maximized for } a_T = 0)

2. Recurrence step

   Let t < T. Suppose we know V_{t+1}^*.

   Fact 1: It can be shown that if V_{t+1}^* is a quadratic function in s_{t+1}, then V_t^* is also a quadratic function. In other words, there exists some matrix \Phi_t and some scalar \Psi_t such that

   if V_{t+1}^*(s_{t+1}) = s_{t+1}^\top \Phi_{t+1} s_{t+1} + \Psi_{t+1}
   then V_t^*(s_t) = s_t^\top \Phi_t s_t + \Psi_t

   For time step t = T, we had \Phi_T = -U_T and \Psi_T = 0.


Fact 2: We can show that the optimal policy is just a linear function of
the state.

Knowing V_{t+1}^* is equivalent to knowing \Phi_{t+1} and \Psi_{t+1}, so we just need to explain how we compute \Phi_t and \Psi_t from \Phi_{t+1} and \Psi_{t+1} and the other parameters of the problem.

V_t^*(s_t) = s_t^\top \Phi_t s_t + \Psi_t
           = \max_{a_t} \left[ R^{(t)}(s_t, a_t) + E_{s_{t+1} \sim P^{(t)}_{s_t a_t}}\left[ V_{t+1}^*(s_{t+1}) \right] \right]
           = \max_{a_t} \left[ -s_t^\top U_t s_t - a_t^\top W_t a_t + E_{s_{t+1} \sim N(A_t s_t + B_t a_t, \Sigma_t)}\left[ s_{t+1}^\top \Phi_{t+1} s_{t+1} + \Psi_{t+1} \right] \right]

where the second line is just the definition of the optimal value function and the third line is obtained by plugging in the dynamics of our model along with the quadratic assumption. Notice that the last expression is a quadratic function in a_t and can thus be (easily) optimized.¹ We get the optimal action a_t^*

a_t^* = \left[ \left( W_t - B_t^\top \Phi_{t+1} B_t \right)^{-1} B_t^\top \Phi_{t+1} A_t \right] \cdot s_t
      = L_t \cdot s_t

where

L_t := \left( W_t - B_t^\top \Phi_{t+1} B_t \right)^{-1} B_t^\top \Phi_{t+1} A_t

¹ Use the identity E\left[ w_t^\top \Phi_{t+1} w_t \right] = \mathrm{tr}(\Sigma_t \Phi_{t+1}) with w_t \sim N(0, \Sigma_t).

which is an impressive result: our optimal policy is linear in s_t. Given a_t^*, we can solve for \Phi_t and \Psi_t. We finally get the discrete Riccati equations

\Phi_t = A_t^\top \left( \Phi_{t+1} - \Phi_{t+1} B_t \left( B_t^\top \Phi_{t+1} B_t - W_t \right)^{-1} B_t^\top \Phi_{t+1} \right) A_t - U_t
\Psi_t = \mathrm{tr}\left( \Sigma_t \Phi_{t+1} \right) + \Psi_{t+1}

Fact 3: we notice that \Phi_t depends on neither \Psi nor the noise \Sigma_t! As L_t is a function of A_t, B_t, W_t and \Phi_{t+1}, it implies that the optimal policy also does not depend on the noise! (But \Psi_t does depend on \Sigma_t, which implies that V_t^* depends on \Sigma_t.)

Then, to summarize, the LQR algorithm works as follows

1. (if necessary) estimate parameters A_t, B_t, \Sigma_t

2. initialize \Phi_T := -U_T and \Psi_T := 0.

3. iterate from t = T − 1 down to 0, updating \Phi_t and \Psi_t from \Phi_{t+1} and \Psi_{t+1} using the discrete Riccati equations. If there exists a policy that drives the state towards zero, then convergence is guaranteed!

Using Fact 3, we can be even more clever and make our algorithm run (slightly) faster! As the optimal policy does not depend on \Psi_t, and the update of \Phi_t depends only on \Phi_{t+1}, it is sufficient to update only \Phi_t!
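To make the backward pass concrete, here is a minimal sketch (our own function and variable names), with the matrices supplied as Python lists indexed by time and using the gain and Riccati updates derived above:

import numpy as np

def lqr_backward_pass(A, B, U, W, Sigma):
    """Backward Riccati recursion for the finite-horizon LQR problem above.

    A, B, Sigma: lists of length T with A_t, B_t, Sigma_t for t = 0, ..., T-1
    U, W:        lists of length T+1 with U_t, W_t for t = 0, ..., T
    Returns the gains L (a_t^* = L[t] @ s_t) and the value parameters Phi, Psi.
    """
    T = len(A)
    Phi = [None] * (T + 1)
    Psi = [None] * (T + 1)
    L = [None] * T

    Phi[T], Psi[T] = -U[T], 0.0                        # initialization: Phi_T = -U_T, Psi_T = 0
    for t in range(T - 1, -1, -1):
        M = W[t] - B[t].T @ Phi[t + 1] @ B[t]          # positive definite, hence invertible
        L[t] = np.linalg.solve(M, B[t].T @ Phi[t + 1] @ A[t])
        # discrete Riccati updates for Phi_t and Psi_t
        Phi[t] = A[t].T @ (Phi[t + 1] - Phi[t + 1] @ B[t] @ np.linalg.solve(
            B[t].T @ Phi[t + 1] @ B[t] - W[t], B[t].T @ Phi[t + 1])) @ A[t] - U[t]
        Psi[t] = np.trace(Sigma[t] @ Phi[t + 1]) + Psi[t + 1]
    return L, Phi, Psi

Following Fact 3, the Psi updates (and the Sigma argument) can be dropped entirely if only the gains L_t are needed.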

3 From non-linear dynamics to LQR


It turns out that a lot of problems can be reduced to LQR, even if dynamics
are non-linear. While LQR is a nice formulation because we are able to come
up with a nice exact solution, it is far from being general. Let’s take for
instance the case of the inverted pendulum. The transitions between states
look like

\begin{pmatrix} x_{t+1} \\ \dot{x}_{t+1} \\ \theta_{t+1} \\ \dot{\theta}_{t+1} \end{pmatrix} = F\left( \begin{pmatrix} x_t \\ \dot{x}_t \\ \theta_t \\ \dot{\theta}_t \end{pmatrix}, a_t \right)
where the function F depends on the cosine of the angle, etc. Now, the question we may ask is

Can we linearize this system?

3.1 Linearization of dynamics


Let's suppose that at time t, the system spends most of its time in some state \bar{s}_t and the actions we perform are around \bar{a}_t. For the inverted pendulum, if we have reached some kind of optimum, this is true: our actions are small and we don't deviate much from the vertical.
We are going to use Taylor expansion to linearize the dynamics. In the
simple case where the state is one-dimensional and the transition function F
does not depend on the action, we would write something like

s_{t+1} = F(s_t) \approx F(\bar{s}_t) + F'(\bar{s}_t) \cdot (s_t - \bar{s}_t)


In the more general setting, the formula looks the same, with gradients
instead of simple derivatives

s_{t+1} \approx F(\bar{s}_t, \bar{a}_t) + \nabla_s F(\bar{s}_t, \bar{a}_t) \cdot (s_t - \bar{s}_t) + \nabla_a F(\bar{s}_t, \bar{a}_t) \cdot (a_t - \bar{a}_t) \qquad (3)

and now, s_{t+1} is linear in s_t and a_t, because we can rewrite equation (3) as

s_{t+1} \approx A s_t + B a_t + \kappa


where \kappa is some constant and A, B are matrices. Now, this writing looks awfully similar to the assumptions made for LQR. We just have to get rid of the constant term \kappa! It turns out that the constant term can be absorbed into s_t by artificially increasing the dimension by one. This is the same trick that we used at the beginning of the class for linear regression...
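As a rough sketch of this linearization step (our own names; the notes use exact gradients, here we approximate ∇_s F and ∇_a F with central finite differences on a black-box simulator F):

import numpy as np

def linearize_dynamics(F, s_bar, a_bar, eps=1e-5):
    """Linearize the dynamics around (s_bar, a_bar), as in equation (3).

    Returns A ≈ ∇_s F, B ≈ ∇_a F and kappa such that
    s_{t+1} ≈ A s_t + B a_t + kappa near the expansion point.
    """
    n, d = len(s_bar), len(a_bar)
    A = np.zeros((n, n))
    B = np.zeros((n, d))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        A[:, i] = (F(s_bar + e, a_bar) - F(s_bar - e, a_bar)) / (2 * eps)
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        B[:, j] = (F(s_bar, a_bar + e) - F(s_bar, a_bar - e)) / (2 * eps)
    kappa = F(s_bar, a_bar) - A @ s_bar - B @ a_bar
    return A, B, kappa

The constant can then be absorbed with the extra-dimension trick: working with the augmented state (s_t, 1), we have [s_{t+1}; 1] ≈ [[A, kappa], [0, 1]] [s_t; 1] + [B; 0] a_t, which is exactly the purely linear form assumed by LQR.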

3.2 Differential Dynamic Programming (DDP)


The previous method works well for cases where the goal is to stay around
some state s∗ (think about the inverted pendulum, or a car having to stay
in the middle of a lane). However, in some cases, the goal can be more
complicated.
We’ll cover a method that applies when our system has to follow some
trajectory (think about a rocket). This method is going to discretize the
trajectory into discrete time steps, and create intermediary goals around
which we will be able to use the previous technique! This method is called
Differential Dynamic Programming. The main steps are

step 1 come up with a nominal trajectory using a naive controller that approximates the trajectory we want to follow. In other words, our controller is able to approximate the gold trajectory with

s_0^*, a_0^* \to s_1^*, a_1^* \to \dots

step 2 linearize the dynamics around each trajectory point s_t^*, in other words

s_{t+1} \approx F(s_t^*, a_t^*) + \nabla_s F(s_t^*, a_t^*)(s_t - s_t^*) + \nabla_a F(s_t^*, a_t^*)(a_t - a_t^*)

where st , at would be our current state and action. Now that we have
a linear approximation around each of these points, we can use the
previous section and rewrite

s_{t+1} = A_t \cdot s_t + B_t \cdot a_t

(notice that in that case, we use the non-stationary dynamics setting


that we mentioned at the beginning of these lecture notes)
Note We can apply a similar derivation for the reward R^{(t)}, with a second-order Taylor expansion.

R(s_t, a_t) \approx R(s_t^*, a_t^*) + \nabla_s R(s_t^*, a_t^*)(s_t - s_t^*) + \nabla_a R(s_t^*, a_t^*)(a_t - a_t^*)
             + \frac{1}{2}(s_t - s_t^*)^\top H_{ss}(s_t - s_t^*) + (s_t - s_t^*)^\top H_{sa}(a_t - a_t^*)
             + \frac{1}{2}(a_t - a_t^*)^\top H_{aa}(a_t - a_t^*)

where H_{xy} refers to the block of the Hessian of R with respect to x and y, evaluated at (s_t^*, a_t^*) (omitted for readability). This expression can be rewritten as

R_t(s_t, a_t) = -s_t^\top U_t s_t - a_t^\top W_t a_t

for some matrices U_t, W_t, with the same trick of adding an extra dimension of ones. To convince yourself, notice that

\begin{pmatrix} 1 & x \end{pmatrix} \cdot \begin{pmatrix} a & b \\ b & c \end{pmatrix} \cdot \begin{pmatrix} 1 \\ x \end{pmatrix} = a + 2bx + cx^2

step 3 Now, you can convince yourself that our problem has been rewritten exactly in the LQR framework. Let's just use LQR to find the optimal policy \pi_t. As a result, our new controller will (hopefully) be better!
Note: Some problems might arise if the LQR trajectory deviates too
much from the linearized approximation of the trajectory, but that can
be fixed with reward-shaping...

step 4 Now that we have a new controller (our new policy \pi_t), we use it to produce a new trajectory

s_0^*, \pi_0(s_0^*) \to s_1^*, \pi_1(s_1^*) \to \dots \to s_T^*

note that when we generate this new trajectory, we use the real F and
not its linear approximation to compute transitions, meaning that

s_{t+1}^* = F(s_t^*, a_t^*)

then, go back to step 2 and repeat until some stopping criterion is met.
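A minimal sketch of this loop (our own names; fit_lqr is a hypothetical helper that performs steps 2 and 3, i.e. linearizes the dynamics and quadratizes the reward around the nominal trajectory and runs the LQR backward pass; the gains are applied to deviations from the nominal trajectory, which is equivalent to the extra-dimension formulation above):

def ddp(F, s0, a_init, fit_lqr, n_iters=10):
    """Sketch of the DDP outer loop from steps 1-4 above.

    F(s, a)                   -> next state (the real, non-linear dynamics)
    a_init                    -> list of length T: the naive initial action sequence (step 1)
    fit_lqr(states, actions)  -> list of gains L_t from the local LQR problem (steps 2-3)
    """
    T = len(a_init)
    actions = list(a_init)
    states = [s0]
    for t in range(T):                         # step 1: nominal trajectory from the naive controller
        states.append(F(states[t], actions[t]))

    for _ in range(n_iters):
        L = fit_lqr(states, actions)           # steps 2-3: local LQR around the current trajectory

        # step 4: roll out the real F with the new controller a_t = a_t^* + L_t (s_t - s_t^*)
        new_states, new_actions = [s0], []
        for t in range(T):
            a_t = actions[t] + L[t] @ (new_states[t] - states[t])
            new_actions.append(a_t)
            new_states.append(F(new_states[t], a_t))
        states, actions = new_states, new_actions    # back to step 2 with the improved trajectory
    return states, actions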



4 Linear Quadratic Gaussian (LQG)


Often, in the real world, we don't get to observe the full state s_t. For example, an autonomous car could receive an image from a camera, which is merely an observation, and not the full state of the world. So far, we assumed that the state was available. As this might not hold true for many real-world problems, we need a new tool to model this situation: Partially
Observable MDPs.
A POMDP is an MDP with an extra observation layer. In other words,
we introduce a new variable o_t that follows some conditional distribution given the current state s_t:

o_t \mid s_t \sim O(o \mid s)
Formally, a finite-horizon POMDP is given by a tuple

(S, O, A, P_{sa}, T, R)

Within this framework, the general strategy is to maintain a belief state (distribution over states) based on the observations o_1, \dots, o_t. Then, a policy
in a POMDP maps belief states to actions.

In this section, we'll present an extension of LQR to this new setting. Assume that we observe y_t \in \mathbb{R}^m with m < n such that

\begin{cases} y_t = C \cdot s_t + v_t \\ s_{t+1} = A \cdot s_t + B \cdot a_t + w_t \end{cases}

where C \in \mathbb{R}^{m \times n} is a compression matrix and v_t is the sensor noise (also Gaussian, like w_t). Note that the reward function R^{(t)} is left unchanged, as a function of the state (not the observation) and action. Also, as the distributions are Gaussian, the belief state is also going to be Gaussian. In this new framework, let's give an overview of the strategy we are going to adopt to find the optimal policy:

step 1 first, compute the distribution over the possible states (the belief state), based on the observations we have. In other words, we want to compute the mean s_{t|t} and the covariance \Sigma_{t|t} of

s_t \mid y_1, \dots, y_t \sim N\left( s_{t|t}, \Sigma_{t|t} \right)

To perform the computation efficiently over time, we'll use the Kalman filter algorithm (used on board the Apollo Lunar Module!).

step 2 now that we have the distribution, we'll use the mean s_{t|t} as the best approximation for s_t

step 3 then set the action a_t := L_t s_{t|t} where L_t comes from the regular LQR algorithm.

Intuitively, to understand why this works, notice that s_{t|t} is a noisy approximation of s_t (equivalent to adding more noise to LQR), but we proved that the optimal LQR policy is independent of the noise!
Step 1 needs to be explicated. We’ll cover a simple case where there is
no action dependence in our dynamics (but the general case follows the same
idea). Suppose that
\begin{cases} s_{t+1} = A \cdot s_t + w_t, & w_t \sim N(0, \Sigma_s) \\ y_t = C \cdot s_t + v_t, & v_t \sim N(0, \Sigma_y) \end{cases}
As the noises are Gaussian, we can easily prove that the joint distribution is also Gaussian:

\begin{pmatrix} s_1 \\ \vdots \\ s_t \\ y_1 \\ \vdots \\ y_t \end{pmatrix} \sim N(\mu, \Sigma) \quad \text{for some } \mu, \Sigma

Then, using the marginal and conditional formulas for Gaussians (see the Factor Analysis notes), we would get

s_t \mid y_1, \dots, y_t \sim N\left( s_{t|t}, \Sigma_{t|t} \right)
However, computing the marginal distribution parameters using these
formulas would be computationally expensive! It would require manipulating
matrices of shape t × t. Recall that inverting a matrix can be done in O(t^3), and it would then have to be repeated over the time steps, yielding a cost in O(t^4)!

The Kalman filter algorithm provides a much better way of computing the mean and variance, by updating them over time in constant time in t! The Kalman filter is based on two basic steps. Assume that we know the distribution of s_t \mid y_1, \dots, y_t:

predict step compute s_{t+1} \mid y_1, \dots, y_t

update step compute s_{t+1} \mid y_1, \dots, y_{t+1}

and iterate over time steps! The combination of the predict and update
steps updates our belief states. In other words, the process looks like

(s_t \mid y_1, \dots, y_t) \xrightarrow{\text{predict}} (s_{t+1} \mid y_1, \dots, y_t) \xrightarrow{\text{update}} (s_{t+1} \mid y_1, \dots, y_{t+1}) \xrightarrow{\text{predict}} \dots

predict step Suppose that we know the distribution of

s_t \mid y_1, \dots, y_t \sim N\left( s_{t|t}, \Sigma_{t|t} \right)

Then, the distribution over the next state is also a Gaussian distribution

s_{t+1} \mid y_1, \dots, y_t \sim N\left( s_{t+1|t}, \Sigma_{t+1|t} \right)

where

\begin{cases} s_{t+1|t} = A \cdot s_{t|t} \\ \Sigma_{t+1|t} = A \cdot \Sigma_{t|t} \cdot A^\top + \Sigma_s \end{cases}

update step Given s_{t+1|t} and \Sigma_{t+1|t} such that

s_{t+1} \mid y_1, \dots, y_t \sim N\left( s_{t+1|t}, \Sigma_{t+1|t} \right)

we can prove that

s_{t+1} \mid y_1, \dots, y_{t+1} \sim N\left( s_{t+1|t+1}, \Sigma_{t+1|t+1} \right)

where

\begin{cases} s_{t+1|t+1} = s_{t+1|t} + K_t \left( y_{t+1} - C s_{t+1|t} \right) \\ \Sigma_{t+1|t+1} = \Sigma_{t+1|t} - K_t \cdot C \cdot \Sigma_{t+1|t} \end{cases}

with

K_t := \Sigma_{t+1|t} C^\top \left( C \Sigma_{t+1|t} C^\top + \Sigma_y \right)^{-1}

The matrix Kt is called the Kalman gain.
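As a minimal sketch (our own names), one combined predict/update step of the filter, with the matrices A, C, Σ_s, Σ_y above given as numpy arrays:

import numpy as np

def kalman_step(s, Sigma, y_next, A, C, Sigma_s, Sigma_y):
    """One predict + update step: from (s_{t|t}, Sigma_{t|t}) and the new
    observation y_{t+1} to (s_{t+1|t+1}, Sigma_{t+1|t+1})."""
    # predict step
    s_pred = A @ s                                   # s_{t+1|t} = A s_{t|t}
    Sigma_pred = A @ Sigma @ A.T + Sigma_s           # Sigma_{t+1|t}

    # update step
    S = C @ Sigma_pred @ C.T + Sigma_y               # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)          # Kalman gain K_t
    s_new = s_pred + K @ (y_next - C @ s_pred)
    Sigma_new = Sigma_pred - K @ C @ Sigma_pred
    return s_new, Sigma_new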

Now, if we have a closer look at the formulas, we notice that we don't need the observations prior to time step t! The update step only depends on the previous distribution. Putting it all together, the algorithm first runs a forward pass to compute the K_t, \Sigma_{t|t} and s_{t|t} (sometimes referred to as \hat{s} in the literature). Then, it runs a backward pass (the LQR updates) to compute the quantities \Phi_t, \Psi_t and L_t. Finally, we recover the optimal policy with a_t^* = L_t s_{t|t}.
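Putting the whole LQG procedure together, a minimal sketch (our own names; env.step is a hypothetical interface that applies an action and returns the next observation, and the predict step now includes the B·a_t term, which is how the action-dependent case extends the filter above):

import numpy as np

def lqg_control_loop(env, A, B, C, Sigma_s, Sigma_y, L, s0_hat, Sigma0):
    """Kalman belief updates combined with the precomputed LQR gains L[t]."""
    s_hat, Sigma = s0_hat, Sigma0              # belief s_{0|0}, Sigma_{0|0}
    actions = []
    for t in range(len(L)):
        a = L[t] @ s_hat                       # step 3: a_t = L_t s_{t|t}
        actions.append(a)
        y = env.step(a)                        # observe y_{t+1}
        # predict (now with the action term), then update, as in kalman_step above
        s_pred = A @ s_hat + B @ a
        Sigma_pred = A @ Sigma @ A.T + Sigma_s
        K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_y)
        s_hat = s_pred + K @ (y - C @ s_pred)
        Sigma = Sigma_pred - K @ C @ Sigma_pred
    return actions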
