cs229 Notes13
Part XIV
LQR, DDP and LQG
Linear Quadratic Regulation, Differential Dynamic Programming and Linear
Quadratic Gaussian
1 Finite-horizon MDPs
In the previous set of notes about Reinforcement Learning, we defined Markov
Decision Processes (MDPs) and covered Value Iteration / Policy Iteration in
a simplified setting. More specifically we introduced the optimal Bellman
equation that defines the optimal value function $V^{\pi^*}$ of the optimal policy $\pi^*$:
$$V^{\pi^*}(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V^{\pi^*}(s')$$
Recall that from the optimal value function, we were able to recover the
optimal policy π ∗ with
$$\pi^*(s) = \operatorname{argmax}_{a \in A} \sum_{s' \in S} P_{sa}(s') V^*(s')$$
In this set of lecture notes we’ll place ourselves in a more general setting:
1. We want to write equations that make sense for both the discrete and
the continuous case. We’ll therefore write
scribe: Guillaume Genthial
$$\mathbb{E}_{s' \sim P_{sa}} \left[ V^{\pi^*}(s') \right]$$
instead of
$$\sum_{s' \in S} P_{sa}(s') V^{\pi^*}(s')$$
meaning that we take the expectation of the value function at the next
state. In the finite case, we can rewrite the expectation as a sum over
states. In the continuous case, we can rewrite the expectation as an
integral. The notation s0 ∼ Psa means that the state s0 is sampled from
the distribution Psa .
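The two ways of computing this expectation can be checked against each other numerically. Below is a small illustrative sketch (the distribution and values are made up, not from the notes): in the finite case the expectation is a weighted sum, and in general it can be approximated by sampling $s' \sim P_{sa}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# finite case: the expectation is a weighted sum over next states
P_sa = np.array([0.2, 0.5, 0.3])   # P_sa(s') over 3 successor states
V = np.array([1.0, 2.0, 0.0])      # value of each successor state
exact = P_sa @ V                   # sum_{s'} P_sa(s') V(s')

# continuous case: the sum becomes an integral; either way we can
# approximate E_{s' ~ P_sa}[V(s')] by sampling s' from P_sa
samples = rng.choice(3, size=100_000, p=P_sa)
monte_carlo = V[samples].mean()
```

With enough samples the Monte Carlo estimate concentrates around the exact weighted sum.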
2. We’ll assume that the rewards depend on both states and actions. In
other words, R : S × A → R. This implies that the previous mechanism
for computing the optimal action is changed into
$$\pi^*(s) = \operatorname{argmax}_{a \in A} \; R(s, a) + \gamma \, \mathbb{E}_{s' \sim P_{sa}} \left[ V^{\pi^*}(s') \right]$$
3. Instead of an infinite horizon MDP, we'll assume a finite horizon MDP, defined as a tuple
$$(S, A, P_{sa}, T, R)$$
with $T > 0$ the time horizon (for instance $T = 100$). In this setting, our definition of payoff is going to be (slightly) different:
$$R(s_0, a_0) + R(s_1, a_1) + \dots + R(s_T, a_T)$$
instead of the infinite-horizon discounted payoff $\sum_{t=0}^{\infty} R(s_t) \gamma^t$, where the discount factor $\gamma$ was necessary to make sure that the infinite sum would be finite and well-defined. If the rewards are bounded by a constant $\bar R$, the payoff is indeed bounded by
$$\left| \sum_{t=0}^{\infty} R(s_t) \gamma^t \right| \le \bar R \sum_{t=0}^{\infty} \gamma^t = \frac{\bar R}{1 - \gamma} < \infty$$
In this new setting, things behave quite differently. First, the optimal
policy π ∗ might be non-stationary, meaning that it changes over time.
In other words, now we have
$$\pi^{(t)} : S \to A$$
where the superscript (t) denotes the policy at time step t. The dynam-
ics of the finite horizon MDP following policy π (t) proceeds as follows:
we start in some state s0 , take some action a0 := π (0) (s0 ) according to
our policy at time step 0. The MDP transitions to a successor s1 , drawn
according to Ps0 a0 . Then, we get to pick another action a1 := π (1) (s1 )
following our new policy at time step 1 and so on...
4. Instead of assuming stationary dynamics, we'll allow the transitions and rewards to change over time:
$$s_{t+1} \sim P^{(t)}_{s_t a_t}$$
meaning that the transition distribution $P^{(t)}_{s_t a_t}$ changes over time. The same can be said about $R^{(t)}$. Note that this setting is a better model for real life. In a car, the gas tank empties, traffic changes, etc. Combining the previous remarks, we'll use the following general formulation for our finite horizon MDP:
$$\left( S, A, P^{(t)}_{sa}, T, R^{(t)} \right)$$
It turns out that Bellman's equation for Value Iteration is made for Dynamic Programming. This may come as no surprise as Bellman is one of the fathers of dynamic programming and the Bellman equation is strongly related to the field. To understand how we can simplify the problem by adopting an iteration-based approach, we make the following observations:
1. Notice that at the end of the game (for time step $T$), the optimal value is obvious:
$$\forall s \in S : \quad V^*_T(s) := \max_{a \in A} R^{(T)}(s, a) \qquad (1)$$
2. For another time step $t < T$, if we know the optimal value function $V^*_{t+1}$ for the next time step, then:
$$\forall t < T, \; s \in S : \quad V^*_t(s) := \max_{a \in A} \left[ R^{(t)}(s, a) + \mathbb{E}_{s' \sim P^{(t)}_{sa}} \left[ V^*_{t+1}(s') \right] \right] \qquad (2)$$
With these observations in mind, we can come up with a clever algorithm to solve for the optimal value function:
1. compute $V^*_T$ using the terminal condition above.
2. for $t = T - 1, \dots, 0$: compute $V^*_t$ from $V^*_{t+1}$ using the dynamic programming update (2).
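The backward recursion can be sketched in code for a small discrete MDP. This is a minimal sketch (the function name, array shapes, and data layout are mine, not from the notes):

```python
import numpy as np

def finite_horizon_vi(P, R, T):
    """Backward dynamic programming for a finite-horizon MDP.

    P: array of shape (T, |S|, |A|, |S|), P[t, s, a, s'] = P^(t)_{sa}(s')
    R: array of shape (T + 1, |S|, |A|), time-dependent rewards R^(t)(s, a)
    Returns optimal values V[t, s] and a greedy policy pi[t, s].
    """
    n_states, n_actions = R.shape[1], R.shape[2]
    V = np.zeros((T + 1, n_states))
    pi = np.zeros((T + 1, n_states), dtype=int)
    # terminal step: no future, just maximize the immediate reward
    V[T] = R[T].max(axis=1)
    pi[T] = R[T].argmax(axis=1)
    for t in range(T - 1, -1, -1):
        # Q[s, a] = R^(t)(s, a) + E_{s' ~ P^(t)_{sa}}[V*_{t+1}(s')]
        Q = R[t] + P[t] @ V[t + 1]
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```

Note that a single backward sweep suffices: each $V^*_t$ is computed exactly from $V^*_{t+1}$, so there is no fixed-point iteration as in the infinite-horizon case.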
Theorem. Let $B$ denote the Bellman update and $\| f(x) \|_\infty := \sup_x |f(x)|$. If $V_t$ denotes the value function at the $t$-th step, then
$$\| V_{t+1} - V^* \|_\infty = \| B V_t - V^* \|_\infty \le \gamma \| V_t - V^* \|_\infty$$
in other words, the Bellman update is a $\gamma$-contraction, which guarantees that repeated application converges to $V^*$ in the infinite-horizon discounted setting.
2 Linear Quadratic Regulation (LQR)
In this section, we'll cover a special case of the finite-horizon setting for which the exact solution is (easily) tractable: Linear Quadratic Regulation (LQR). We assume continuous states and actions
$$S = \mathbb{R}^n, \quad A = \mathbb{R}^d$$
and we’ll assume linear transitions (with noise)
$$s_{t+1} = A_t s_t + B_t a_t + w_t$$
where $A_t \in \mathbb{R}^{n \times n}$ and $B_t \in \mathbb{R}^{n \times d}$ are matrices, and $w_t \sim \mathcal{N}(0, \Sigma_t)$ is some gaussian noise. It turns out that the noise, as long as it has zero mean, does not impact the optimal policy!
We'll also assume quadratic rewards
$$R^{(t)}(s_t, a_t) = -s_t^\top U_t s_t - a_t^\top W_t a_t$$
where $U_t \in \mathbb{R}^{n \times n}$ and $W_t \in \mathbb{R}^{d \times d}$ are positive definite matrices (meaning that the reward is always negative, and is higher when the state and action are close to the origin).
Now that we have defined the assumptions of our LQR model, let's cover the two steps of the LQR algorithm:
step 1 suppose that the parameters of our model are unknown. In that case, we can estimate them from data collected by running simulations, for instance with linear regression on the observed transitions (and estimate the noise covariance $\Sigma_t$ from the residuals).
step 2 assuming that the parameters of our model are known (given or estimated with step 1), we can derive the optimal policy using dynamic programming.
$$\begin{cases} s_{t+1} = A_t s_t + B_t a_t + w_t \\ R^{(t)}(s_t, a_t) = -s_t^\top U_t s_t - a_t^\top W_t a_t \end{cases} \qquad A_t, B_t, U_t, W_t, \Sigma_t \text{ known}$$
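Step 1 can be sketched with ordinary least squares on observed transitions. This is a sketch under simplifying assumptions (time-invariant $A$ and $B$; the function name and data layout are mine):

```python
import numpy as np

def fit_dynamics(states, actions, next_states):
    """Estimate A, B and the noise covariance in s_{t+1} = A s_t + B a_t + w_t.

    states: (m, n) visited states s_t; actions: (m, d) actions a_t;
    next_states: (m, n) successors s_{t+1}. Solves the least-squares problem
    argmin_{A,B} sum_i ||s'_i - (A s_i + B a_i)||^2.
    """
    n = states.shape[1]
    X = np.hstack([states, actions])            # (m, n + d) regressors
    Theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    A_hat, B_hat = Theta[:n].T, Theta[n:].T     # [A B] = Theta^T
    resid = next_states - X @ Theta
    Sigma_hat = resid.T @ resid / len(states)   # empirical noise covariance
    return A_hat, B_hat, Sigma_hat
```

On noiseless data generated by a true linear system, this recovers $A$ and $B$ exactly (up to floating-point error).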
1. Initialization step
For the last time step $T$,
$$V^*_T(s_T) = \max_{a_T \in A} R_T(s_T, a_T) = \max_{a_T \in A} \left( -s_T^\top U_T s_T - a_T^\top W_T a_T \right) = -s_T^\top U_T s_T \quad \text{(maximized for } a_T = 0\text{)}$$
2. Recurrence step
Let $t < T$. Suppose we know $V^*_{t+1}$.
Fact 1: It can be shown that if $V^*_{t+1}$ is a quadratic function in $s_{t+1}$, then $V^*_t$ is also a quadratic function in $s_t$. In other words, there exists some matrix $\Phi$ and some scalar $\Psi$ such that
$$\text{if } V^*_{t+1}(s_{t+1}) = s_{t+1}^\top \Phi_{t+1} s_{t+1} + \Psi_{t+1}, \quad \text{then } V^*_t(s_t) = s_t^\top \Phi_t s_t + \Psi_t$$
(For time step $T$, we had $\Phi_T = -U_T$ and $\Psi_T = 0$.) To see this, write
$$V^*_t(s_t) = s_t^\top \Phi_t s_t + \Psi_t$$
$$= \max_{a_t} \left[ R^{(t)}(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim \mathcal{N}(A_t s_t + B_t a_t, \Sigma_t)} \left[ V^*_{t+1}(s_{t+1}) \right] \right]$$
$$= \max_{a_t} \left[ -s_t^\top U_t s_t - a_t^\top W_t a_t + \mathbb{E}_{s_{t+1} \sim \mathcal{N}(A_t s_t + B_t a_t, \Sigma_t)} \left[ s_{t+1}^\top \Phi_{t+1} s_{t+1} + \Psi_{t+1} \right] \right]$$
where the second line is just the definition of the optimal value function and the third line is obtained by plugging in the dynamics of our model along with the quadratic assumption. Notice that the last expression is a quadratic function in $a_t$ and can thus be (easily) optimized¹. We get the optimal action $a^*_t$:
$$a^*_t = \left[ \left( W_t - B_t^\top \Phi_{t+1} B_t \right)^{-1} B_t^\top \Phi_{t+1} A_t \right] \cdot s_t = L_t \cdot s_t$$
where
$$L_t := \left( W_t - B_t^\top \Phi_{t+1} B_t \right)^{-1} B_t^\top \Phi_{t+1} A_t$$
which is an impressive result: the optimal policy is linear in $s_t$!
¹ Use the identity $\mathbb{E} \left[ w_t^\top \Phi_{t+1} w_t \right] = \mathrm{Tr}(\Sigma_t \Phi_{t+1})$ with $w_t \sim \mathcal{N}(0, \Sigma_t)$.
Identifying the quadratic terms in $V^*_t(s_t) = s_t^\top \Phi_t s_t + \Psi_t$ then gives the Discrete Riccati equations:
$$\Phi_t = A_t^\top \left( \Phi_{t+1} - \Phi_{t+1} B_t \left( B_t^\top \Phi_{t+1} B_t - W_t \right)^{-1} B_t^\top \Phi_{t+1} \right) A_t - U_t$$
$$\Psi_t = -\mathrm{tr} \left( \Sigma_t \Phi_{t+1} \right) + \Psi_{t+1}$$
We can be even more clever and make our algorithm run (slightly) faster! As the optimal policy does not depend on $\Psi_t$, and the update of $\Phi_t$ only depends on $\Phi_{t+1}$, it is sufficient to update only $\Phi_t$! This also explains why the noise, which enters only through $\Psi_t$, does not impact the optimal policy.
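The whole backward recursion can be sketched as follows (a sketch consistent with the updates above, assuming time-invariant $A, B, U, W$ for brevity; only $\Phi$ is tracked since the policy does not depend on $\Psi$):

```python
import numpy as np

def lqr_gains(A, B, U, W, T):
    """Backward Riccati recursion for LQR with reward -s'Us - a'Wa.

    Returns the list of gains L[0..T-1] such that a*_t = L[t] @ s_t.
    """
    Phi = -U                           # Phi_T = -U since V*_T(s) = -s'Us
    L = [None] * T
    for t in range(T - 1, -1, -1):
        M = B.T @ Phi @ B - W          # negative definite, hence invertible
        # a*_t = (W - B'Phi B)^{-1} B'Phi A s_t = -M^{-1} B'Phi A s_t
        L[t] = -np.linalg.solve(M, B.T @ Phi @ A)
        # Discrete Riccati update (Psi is not needed for the policy)
        Phi = A.T @ (Phi - Phi @ B @ np.linalg.solve(M, B.T @ Phi)) @ A - U
    return L
```

In the scalar case with $A = B = U = W = 1$ and one step to go, maximizing $-s^2 - a^2 - (s + a)^2$ gives $a^* = -s/2$, which the recursion reproduces.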
3 From non-linear dynamics to LQR
It turns out that a lot of problems can be reduced to LQR, even when their dynamics are non-linear. Consider for instance the inverted pendulum: its state $(x_t, \dot x_t, \theta_t, \dot\theta_t)$ (cart position and velocity, pole angle and angular velocity) evolves according to some non-linear update $s_{t+1} = F(s_t, a_t)$, where the function $F$ depends on the cos of the angle etc. Now, the question we may ask is: can we linearize this system?
Suppose that at time $t$, the system spends most of its time in some state $\bar s_t$ and the actions we perform are around $\bar a_t$. Then a first-order Taylor expansion of $F$ around $(\bar s_t, \bar a_t)$ gives
$$s_{t+1} \approx F(\bar s_t, \bar a_t) + \nabla_s F(\bar s_t, \bar a_t) \cdot (s_t - \bar s_t) + \nabla_a F(\bar s_t, \bar a_t) \cdot (a_t - \bar a_t) \qquad (3)$$
and now, $s_{t+1}$ is linear in $s_t$ and $a_t$, because we can rewrite equation (3) as
$$s_{t+1} \approx A s_t + B a_t + \kappa$$
where $\kappa$ is some constant and $A$, $B$ are matrices. This looks awfully similar to the LQR assumptions, except for the presence of the constant term $\kappa$! It turns out that the constant term can be absorbed into $s_t$ by artificially increasing the dimension by one. This is the same trick that we used at the beginning of the class for linear regression...
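Both the linearization and the dimension trick can be sketched numerically. This is an illustrative sketch (the dynamics `f` in the check is a made-up example, and the Jacobians are taken by finite differences rather than analytically):

```python
import numpy as np

def linearize(f, s_bar, a_bar, eps=1e-6):
    """First-order Taylor expansion of s' = f(s, a) around (s_bar, a_bar).

    Returns (A, B, kappa) with f(s, a) ≈ A s + B a + kappa near the
    expansion point; Jacobians estimated by central finite differences.
    """
    n, d = len(s_bar), len(a_bar)
    A = np.zeros((n, n))
    B = np.zeros((n, d))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        A[:, i] = (f(s_bar + e, a_bar) - f(s_bar - e, a_bar)) / (2 * eps)
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        B[:, j] = (f(s_bar, a_bar + e) - f(s_bar, a_bar - e)) / (2 * eps)
    kappa = f(s_bar, a_bar) - A @ s_bar - B @ a_bar
    return A, B, kappa

def augment(A, B, kappa):
    """Absorb the constant kappa by appending a constant 1 to the state:
    [s'; 1] = [[A, kappa], [0, 1]] @ [s; 1] + [[B]; [0]] @ a."""
    n, d = A.shape[0], B.shape[1]
    A_aug = np.block([[A, kappa[:, None]],
                      [np.zeros((1, n)), np.ones((1, 1))]])
    B_aug = np.vstack([B, np.zeros((1, d))])
    return A_aug, B_aug
```

After augmenting, the system is exactly of the form $s_{t+1} = A s_t + B a_t$ required by LQR (in the augmented state space).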
The previous idea works well when the goal is to stay around a single state. When the system instead has to follow a whole trajectory (think of a rocket), we can use Differential Dynamic Programming (DDP): discretize the trajectory into time steps, linearize around each point, and iterate. The procedure is as follows:
step 1 come up with a nominal trajectory using a naive controller that approximates the trajectory we want to follow. In other words, our controller is able to approximate the gold trajectory with
$$s^*_0, a^*_0 \to s^*_1, a^*_1 \to \dots \to s^*_T$$
step 2 linearize the dynamics around each trajectory point s∗t , in other words
st+1 ≈ F (s∗t , a∗t ) + ∇s F (s∗t , a∗t )(st − s∗t ) + ∇a F (s∗t , a∗t )(at − a∗t )
where st , at would be our current state and action. Now that we have
a linear approximation around each of these points, we can use the
previous section and rewrite
st+1 = At · st + Bt · at
Similarly, we can use a second-order Taylor expansion of the reward $R$ around $(s^*_t, a^*_t)$:
$$\begin{aligned} R(s_t, a_t) \approx{}& R(s^*_t, a^*_t) + \nabla_s R(s^*_t, a^*_t)(s_t - s^*_t) + \nabla_a R(s^*_t, a^*_t)(a_t - a^*_t) \\ &+ \frac{1}{2} (s_t - s^*_t)^\top H_{ss} (s_t - s^*_t) + (s_t - s^*_t)^\top H_{sa} (a_t - a^*_t) \\ &+ \frac{1}{2} (a_t - a^*_t)^\top H_{aa} (a_t - a^*_t) \end{aligned}$$
where Hxy refers to the entry of the Hessian of R with respect to x and
y evaluated in (s∗t , a∗t ) (omitted for readability). This expression can be
re-written as
$$R(s_t, a_t) = -s_t^\top U_t s_t - a_t^\top W_t a_t$$
for some matrices $U_t$, $W_t$, with the same trick of adding an extra dimension of ones. To convince yourself, notice that
$$\begin{pmatrix} 1 & x \end{pmatrix} \begin{pmatrix} a & b \\ b & c \end{pmatrix} \begin{pmatrix} 1 \\ x \end{pmatrix} = a + 2bx + cx^2$$
step 3 Now, you can convince yourself that our problem has been rewritten exactly in the LQR framework. We just use LQR to find the optimal policy $\pi_t$. As a result, our new controller will (hopefully) be better!
Note: Some problems might arise if the LQR trajectory deviates too
much from the linearized approximation of the trajectory, but that can
be fixed with reward-shaping...
step 4 Now that we have a new controller (our new policy $\pi_t$), we use it to produce a new trajectory
$$s^*_0, \pi_0(s^*_0) \to s^*_1, \pi_1(s^*_1) \to \dots \to s^*_T$$
note that when we generate this new trajectory, we use the real $F$ and not its linear approximation to compute transitions, meaning that
$$s^*_{t+1} = F(s^*_t, a^*_t)$$
then, go back to step 2 and repeat until some stopping criterion.
4 Linear Quadratic Gaussian (LQG)
So far, we assumed that the full state was observable. In the real world, we often only receive partial and/or noisy observations of the state. This setting is called a Partially Observable MDP (POMDP): it adds an extra observation layer, where the observation $o_t$ is sampled conditionally on the current state,
$$o_t \mid s_t \sim O(o \mid s)$$
Formally, a finite-horizon POMDP is given by a tuple
$$(S, O, A, P_{sa}, T, R)$$
Within this framework, the general strategy is to maintain a belief state
(distribution over states) based on the observation o1 , . . . , ot . Then, a policy
in a POMDP maps belief states to actions.
step 1 first, compute the distribution on the possible states (the belief state),
based on the observations we have. In other words, we want to compute
the mean st|t and the covariance Σt|t of
$$s_t \mid y_1, \dots, y_t \sim \mathcal{N} \left( s_{t|t}, \Sigma_{t|t} \right)$$
to perform the computation efficiently over time, we'll use the Kalman Filter algorithm (used on-board the Apollo Lunar Module!).
step 2 now that we have the distribution, we’ll use the mean st|t as the best
approximation for st
step 3 then set the action at := Lt st|t where Lt comes from the regular LQR
algorithm.
Intuitively, to understand why this works, notice that st|t is a noisy ap-
proximation of st (equivalent to adding more noise to LQR) but we proved
that LQR is independent of the noise!
Step 1 needs to be made explicit. We'll cover a simple case where there is no action dependence in our dynamics (but the general case follows the same idea). Suppose that
$$\begin{cases} s_{t+1} = A \cdot s_t + w_t, & w_t \sim \mathcal{N}(0, \Sigma_s) \\ y_t = C \cdot s_t + v_t, & v_t \sim \mathcal{N}(0, \Sigma_y) \end{cases}$$
As noises are Gaussians, we can easily prove that the joint distribution is
also Gaussian
$$\begin{pmatrix} s_1 \\ \vdots \\ s_t \\ y_1 \\ \vdots \\ y_t \end{pmatrix} \sim \mathcal{N}(\mu, \Sigma) \quad \text{for some } \mu, \Sigma$$
then, using the marginal formulas of gaussians (see Factor Analysis notes),
we would get
$$s_t \mid y_1, \dots, y_t \sim \mathcal{N} \left( s_{t|t}, \Sigma_{t|t} \right)$$
However, computing the marginal distribution parameters using these
formulas would be computationally expensive! It would require manipulating
matrices of shape t × t. Recall that inverting a matrix can be done in O(t3 ),
and it would then have to be repeated over the time steps, yielding a cost in
O(t4 )!
The Kalman filter algorithm provides a much better way of computing
the mean and variance, by updating them over time in constant time in
t! The Kalman filter is based on two basic steps: a predict step, which computes the distribution of $s_{t+1} \mid y_1, \dots, y_t$, and an update step, which computes the distribution of $s_{t+1} \mid y_1, \dots, y_{t+1}$. We iterate these two steps over time; their combination updates our belief state.
Predict step. Assume that we know the distribution of
$$s_t \mid y_1, \dots, y_t \sim \mathcal{N} \left( s_{t|t}, \Sigma_{t|t} \right)$$
then, the distribution over the next state is also a gaussian distribution
$$s_{t+1} \mid y_1, \dots, y_t \sim \mathcal{N} \left( s_{t+1|t}, \Sigma_{t+1|t} \right)$$
where
$$\begin{cases} s_{t+1|t} = A \cdot s_{t|t} \\ \Sigma_{t+1|t} = A \cdot \Sigma_{t|t} \cdot A^\top + \Sigma_s \end{cases}$$
Update step. Given $s_{t+1|t}$ and $\Sigma_{t+1|t}$ such that
$$s_{t+1} \mid y_1, \dots, y_t \sim \mathcal{N} \left( s_{t+1|t}, \Sigma_{t+1|t} \right)$$
we can prove that
$$s_{t+1} \mid y_1, \dots, y_{t+1} \sim \mathcal{N} \left( s_{t+1|t+1}, \Sigma_{t+1|t+1} \right)$$
where
$$\begin{cases} s_{t+1|t+1} = s_{t+1|t} + K_t \left( y_{t+1} - C s_{t+1|t} \right) \\ \Sigma_{t+1|t+1} = \Sigma_{t+1|t} - K_t \cdot C \cdot \Sigma_{t+1|t} \end{cases}$$
with
$$K_t := \Sigma_{t+1|t} C^\top \left( C \Sigma_{t+1|t} C^\top + \Sigma_y \right)^{-1}$$
The matrix $K_t$ is called the Kalman gain.
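The predict and update steps can be combined into a single belief-update routine. This is a minimal sketch of the recursion (the function name and API are mine; the matrices in the usage check are illustrative):

```python
import numpy as np

def kalman_step(s, Sigma, y_next, A, C, Sigma_s, Sigma_y):
    """One predict + update step of the Kalman filter.

    s, Sigma: mean and covariance of s_t | y_1, ..., y_t.
    y_next: the new observation y_{t+1}.
    Returns the mean and covariance of s_{t+1} | y_1, ..., y_{t+1}.
    """
    # predict step: push the belief through the dynamics
    s_pred = A @ s
    Sigma_pred = A @ Sigma @ A.T + Sigma_s
    # update step: correct the prediction with the new observation
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_y)
    s_new = s_pred + K @ (y_next - C @ s_pred)
    Sigma_new = Sigma_pred - K @ C @ Sigma_pred
    return s_new, Sigma_new
```

Note that each step costs a constant amount of work in $t$ (it only manipulates $n \times n$ matrices), in contrast with the naive marginalization over the full joint gaussian.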