cs229 Notes13
Part XIV
LQR, DDP and LQG
Linear Quadratic Regulation, Differential Dynamic Programming and Linear
Quadratic Gaussian
1 Finite-horizon MDPs
In the previous set of notes about Reinforcement Learning, we defined Markov
Decision Processes (MDPs) and covered Value Iteration / Policy Iteration in
a simplified setting. More specifically we introduced the optimal Bellman
equation that defines the optimal value function $V^{\pi^*}$ of the optimal policy $\pi^*$:
$$V^{\pi^*}(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V^{\pi^*}(s')$$
Recall that from the optimal value function, we were able to recover the
optimal policy π ∗ with
$$\pi^*(s) = \operatorname{argmax}_{a \in A} \sum_{s' \in S} P_{sa}(s') V^*(s')$$
In this set of lecture notes we’ll place ourselves in a more general setting:
1. We want to write equations that make sense for both the discrete and
the continuous case. We’ll therefore write
scribe: Guillaume Genthial
$$\mathbb{E}_{s' \sim P_{sa}} \left[ V^{\pi^*}(s') \right]$$
instead of
$$\sum_{s' \in S} P_{sa}(s') V^{\pi^*}(s')$$
meaning that we take the expectation of the value function at the next
state. In the finite case, we can rewrite the expectation as a sum over
states. In the continuous case, we can rewrite the expectation as an
integral. The notation s0 ∼ Psa means that the state s0 is sampled from
the distribution Psa .
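The two ways of computing this expectation can be checked against each other numerically. Below is a small illustrative sketch (the distribution and values are made up, not from the notes): in the finite case the expectation is a weighted sum, and in general it can be approximated by sampling $s' \sim P_{sa}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# finite case: the expectation is a weighted sum over next states
P_sa = np.array([0.2, 0.5, 0.3])   # P_sa(s') over 3 successor states
V = np.array([1.0, 2.0, 0.0])      # value of each successor state
exact = P_sa @ V                   # sum_{s'} P_sa(s') V(s')

# continuous case: the sum becomes an integral; either way we can
# approximate E_{s' ~ P_sa}[V(s')] by sampling s' from P_sa
samples = rng.choice(3, size=100_000, p=P_sa)
monte_carlo = V[samples].mean()
```

With enough samples the Monte Carlo estimate concentrates around the exact weighted sum.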
2. We’ll assume that the rewards depend on both states and actions. In
other words, R : S × A → R. This implies that the previous mechanism
for computing the optimal action is changed into
$$\pi^*(s) = \operatorname{argmax}_{a \in A} \; R(s, a) + \gamma \, \mathbb{E}_{s' \sim P_{sa}} \left[ V^{\pi^*}(s') \right]$$
3. Instead of an infinite horizon MDP, we'll assume a finite horizon MDP, defined as a tuple
$$(S, A, P_{sa}, T, R)$$
with $T > 0$ the time horizon (for instance $T = 100$). In this setting, our definition of payoff is going to be (slightly) different:
$$R(s_0, a_0) + R(s_1, a_1) + \dots + R(s_T, a_T)$$
instead of the infinite-horizon discounted payoff $\sum_{t=0}^{\infty} R(s_t) \gamma^t$, where the discount factor $\gamma$ was necessary to make sure that the infinite sum would be finite and well-defined. If the rewards are bounded by a constant $\bar R$, the payoff is indeed bounded by
$$\left| \sum_{t=0}^{\infty} R(s_t) \gamma^t \right| \le \bar R \sum_{t=0}^{\infty} \gamma^t = \frac{\bar R}{1 - \gamma} < \infty$$
In this new setting, things behave quite differently. First, the optimal
policy π ∗ might be non-stationary, meaning that it changes over time.
In other words, now we have
$$\pi^{(t)} : S \to A$$
where the superscript (t) denotes the policy at time step t. The dynam-
ics of the finite horizon MDP following policy π (t) proceeds as follows:
we start in some state s0 , take some action a0 := π (0) (s0 ) according to
our policy at time step 0. The MDP transitions to a successor s1 , drawn
according to Ps0 a0 . Then, we get to pick another action a1 := π (1) (s1 )
following our new policy at time step 1 and so on...
4. Instead of assuming stationary dynamics, we'll allow the transitions and rewards to change over time:
$$s_{t+1} \sim P^{(t)}_{s_t a_t}$$
meaning that the transition distribution $P^{(t)}_{s_t a_t}$ changes over time. The same can be said about $R^{(t)}$. Note that this setting is a better model for real life. In a car, the gas tank empties, traffic changes, etc. Combining the previous remarks, we'll use the following general formulation for our finite horizon MDP:
$$\left( S, A, P^{(t)}_{sa}, T, R^{(t)} \right)$$
It turns out that Bellman's equation for Value Iteration is made for Dynamic Programming. This may come as no surprise as Bellman is one of the fathers of dynamic programming and the Bellman equation is strongly related to the field. To understand how we can simplify the problem by adopting an iteration-based approach, we make the following observations:
1. Notice that at the end of the game (for time step $T$), the optimal value is obvious:
$$\forall s \in S : \quad V^*_T(s) := \max_{a \in A} R^{(T)}(s, a) \qquad (1)$$
2. For another time step $t < T$, if we know the optimal value function $V^*_{t+1}$ for the next time step, then:
$$\forall t < T, \; s \in S : \quad V^*_t(s) := \max_{a \in A} \left[ R^{(t)}(s, a) + \mathbb{E}_{s' \sim P^{(t)}_{sa}} \left[ V^*_{t+1}(s') \right] \right] \qquad (2)$$
With these observations in mind, we can come up with a clever algorithm to solve for the optimal value function:
1. compute $V^*_T$ using the terminal condition above.
2. for $t = T - 1, \dots, 0$: compute $V^*_t$ from $V^*_{t+1}$ using the dynamic programming update (2).
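The backward recursion can be sketched in code for a small discrete MDP. This is a minimal sketch (the function name, array shapes, and data layout are mine, not from the notes):

```python
import numpy as np

def finite_horizon_vi(P, R, T):
    """Backward dynamic programming for a finite-horizon MDP.

    P: array of shape (T, |S|, |A|, |S|), P[t, s, a, s'] = P^(t)_{sa}(s')
    R: array of shape (T + 1, |S|, |A|), time-dependent rewards R^(t)(s, a)
    Returns optimal values V[t, s] and a greedy policy pi[t, s].
    """
    n_states, n_actions = R.shape[1], R.shape[2]
    V = np.zeros((T + 1, n_states))
    pi = np.zeros((T + 1, n_states), dtype=int)
    # terminal step: no future, just maximize the immediate reward
    V[T] = R[T].max(axis=1)
    pi[T] = R[T].argmax(axis=1)
    for t in range(T - 1, -1, -1):
        # Q[s, a] = R^(t)(s, a) + E_{s' ~ P^(t)_{sa}}[V*_{t+1}(s')]
        Q = R[t] + P[t] @ V[t + 1]
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```

Note that a single backward sweep suffices: each $V^*_t$ is computed exactly from $V^*_{t+1}$, so there is no fixed-point iteration as in the infinite-horizon case.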
Theorem. Let $B$ denote the Bellman update and $\| f(x) \|_\infty := \sup_x |f(x)|$. If $V_t$ denotes the value function at the $t$-th step, then
$$\| V_{t+1} - V^* \|_\infty = \| B V_t - V^* \|_\infty \le \gamma \| V_t - V^* \|_\infty$$
in other words, the Bellman update is a $\gamma$-contraction, which guarantees that repeated application converges to $V^*$ in the infinite-horizon discounted setting.
2 Linear Quadratic Regulation (LQR)
In this section, we'll cover a special case of the finite-horizon setting for which the exact solution is (easily) tractable: Linear Quadratic Regulation (LQR). We assume continuous states and actions
$$S = \mathbb{R}^n, \quad A = \mathbb{R}^d$$
and we’ll assume linear transitions (with noise)
$$s_{t+1} = A_t s_t + B_t a_t + w_t$$
where $A_t \in \mathbb{R}^{n \times n}$ and $B_t \in \mathbb{R}^{n \times d}$ are matrices, and $w_t \sim \mathcal{N}(0, \Sigma_t)$ is some gaussian noise. It turns out that the noise, as long as it has zero mean, does not impact the optimal policy!
We'll also assume quadratic rewards
$$R^{(t)}(s_t, a_t) = -s_t^\top U_t s_t - a_t^\top W_t a_t$$
where $U_t \in \mathbb{R}^{n \times n}$ and $W_t \in \mathbb{R}^{d \times d}$ are positive definite matrices (meaning that the reward is always negative, and is higher when the state and action are close to the origin).
Now that we have defined the assumptions of our LQR model, let's cover the two steps of the LQR algorithm:
step 1 suppose that the parameters of our model are unknown. In that case, we can estimate them from data collected by running simulations, for instance with linear regression on the observed transitions (and estimate the noise covariance $\Sigma_t$ from the residuals).
step 2 assuming that the parameters of our model are known (given or estimated with step 1), we can derive the optimal policy using dynamic programming.
$$\begin{cases} s_{t+1} = A_t s_t + B_t a_t + w_t \\ R^{(t)}(s_t, a_t) = -s_t^\top U_t s_t - a_t^\top W_t a_t \end{cases} \qquad A_t, B_t, U_t, W_t, \Sigma_t \text{ known}$$
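Step 1 can be sketched with ordinary least squares on observed transitions. This is a sketch under simplifying assumptions (time-invariant $A$ and $B$; the function name and data layout are mine):

```python
import numpy as np

def fit_dynamics(states, actions, next_states):
    """Estimate A, B and the noise covariance in s_{t+1} = A s_t + B a_t + w_t.

    states: (m, n) visited states s_t; actions: (m, d) actions a_t;
    next_states: (m, n) successors s_{t+1}. Solves the least-squares problem
    argmin_{A,B} sum_i ||s'_i - (A s_i + B a_i)||^2.
    """
    n = states.shape[1]
    X = np.hstack([states, actions])            # (m, n + d) regressors
    Theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    A_hat, B_hat = Theta[:n].T, Theta[n:].T     # [A B] = Theta^T
    resid = next_states - X @ Theta
    Sigma_hat = resid.T @ resid / len(states)   # empirical noise covariance
    return A_hat, B_hat, Sigma_hat
```

On noiseless data generated by a true linear system, this recovers $A$ and $B$ exactly (up to floating-point error).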
1. Initialization step
For the last time step $T$,
$$V^*_T(s_T) = \max_{a_T \in A} R_T(s_T, a_T) = \max_{a_T \in A} \left( -s_T^\top U_T s_T - a_T^\top W_T a_T \right) = -s_T^\top U_T s_T \quad \text{(maximized for } a_T = 0\text{)}$$
2. Recurrence step
Let $t < T$. Suppose we know $V^*_{t+1}$.
Fact 1: It can be shown that if $V^*_{t+1}$ is a quadratic function in $s_{t+1}$, then $V^*_t$ is also a quadratic function in $s_t$. In other words, there exists some matrix $\Phi$ and some scalar $\Psi$ such that
$$\text{if } V^*_{t+1}(s_{t+1}) = s_{t+1}^\top \Phi_{t+1} s_{t+1} + \Psi_{t+1}, \quad \text{then } V^*_t(s_t) = s_t^\top \Phi_t s_t + \Psi_t$$
(For time step $T$, we had $\Phi_T = -U_T$ and $\Psi_T = 0$.) To see this, write
$$V^*_t(s_t) = s_t^\top \Phi_t s_t + \Psi_t$$
$$= \max_{a_t} \left[ R^{(t)}(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim \mathcal{N}(A_t s_t + B_t a_t, \Sigma_t)} \left[ V^*_{t+1}(s_{t+1}) \right] \right]$$
$$= \max_{a_t} \left[ -s_t^\top U_t s_t - a_t^\top W_t a_t + \mathbb{E}_{s_{t+1} \sim \mathcal{N}(A_t s_t + B_t a_t, \Sigma_t)} \left[ s_{t+1}^\top \Phi_{t+1} s_{t+1} + \Psi_{t+1} \right] \right]$$
where the second line is just the definition of the optimal value function and the third line is obtained by plugging in the dynamics of our model along with the quadratic assumption. Notice that the last expression is a quadratic function in $a_t$ and can thus be (easily) optimized¹. We get the optimal action $a^*_t$:
$$a^*_t = \left[ \left( W_t - B_t^\top \Phi_{t+1} B_t \right)^{-1} B_t^\top \Phi_{t+1} A_t \right] \cdot s_t = L_t \cdot s_t$$
where
$$L_t := \left( W_t - B_t^\top \Phi_{t+1} B_t \right)^{-1} B_t^\top \Phi_{t+1} A_t$$
which is an impressive result: the optimal policy is linear in $s_t$!
¹ Use the identity $\mathbb{E} \left[ w_t^\top \Phi_{t+1} w_t \right] = \mathrm{Tr}(\Sigma_t \Phi_{t+1})$ with $w_t \sim \mathcal{N}(0, \Sigma_t)$.
Identifying the quadratic terms in $V^*_t(s_t) = s_t^\top \Phi_t s_t + \Psi_t$ then gives the Discrete Riccati equations:
$$\Phi_t = A_t^\top \left( \Phi_{t+1} - \Phi_{t+1} B_t \left( B_t^\top \Phi_{t+1} B_t - W_t \right)^{-1} B_t^\top \Phi_{t+1} \right) A_t - U_t$$
$$\Psi_t = -\mathrm{tr} \left( \Sigma_t \Phi_{t+1} \right) + \Psi_{t+1}$$
We can be even more clever and make our algorithm run (slightly) faster! As the optimal policy does not depend on $\Psi_t$, and the update of $\Phi_t$ only depends on $\Phi_{t+1}$, it is sufficient to update only $\Phi_t$! This also explains why the noise, which enters only through $\Psi_t$, does not impact the optimal policy.
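The whole backward recursion can be sketched as follows (a sketch consistent with the updates above, assuming time-invariant $A, B, U, W$ for brevity; only $\Phi$ is tracked since the policy does not depend on $\Psi$):

```python
import numpy as np

def lqr_gains(A, B, U, W, T):
    """Backward Riccati recursion for LQR with reward -s'Us - a'Wa.

    Returns the list of gains L[0..T-1] such that a*_t = L[t] @ s_t.
    """
    Phi = -U                           # Phi_T = -U since V*_T(s) = -s'Us
    L = [None] * T
    for t in range(T - 1, -1, -1):
        M = B.T @ Phi @ B - W          # negative definite, hence invertible
        # a*_t = (W - B'Phi B)^{-1} B'Phi A s_t = -M^{-1} B'Phi A s_t
        L[t] = -np.linalg.solve(M, B.T @ Phi @ A)
        # Discrete Riccati update (Psi is not needed for the policy)
        Phi = A.T @ (Phi - Phi @ B @ np.linalg.solve(M, B.T @ Phi)) @ A - U
    return L
```

In the scalar case with $A = B = U = W = 1$ and one step to go, maximizing $-s^2 - a^2 - (s + a)^2$ gives $a^* = -s/2$, which the recursion reproduces.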
3 From non-linear dynamics to LQR
It turns out that a lot of problems can be reduced to LQR, even when their dynamics are non-linear. Consider for instance the inverted pendulum: its state $(x_t, \dot x_t, \theta_t, \dot\theta_t)$ (cart position and velocity, pole angle and angular velocity) evolves according to some non-linear update $s_{t+1} = F(s_t, a_t)$, where the function $F$ depends on the cos of the angle etc. Now, the question we may ask is: can we linearize this system?
Suppose that at time $t$, the system spends most of its time in some state $\bar s_t$ and the actions we perform are around $\bar a_t$. Then a first-order Taylor expansion of $F$ around $(\bar s_t, \bar a_t)$ gives
$$s_{t+1} \approx F(\bar s_t, \bar a_t) + \nabla_s F(\bar s_t, \bar a_t) \cdot (s_t - \bar s_t) + \nabla_a F(\bar s_t, \bar a_t) \cdot (a_t - \bar a_t) \qquad (3)$$
and now, $s_{t+1}$ is linear in $s_t$ and $a_t$, because we can rewrite equation (3) as
$$s_{t+1} \approx A s_t + B a_t + \kappa$$
where $\kappa$ is some constant and $A$, $B$ are matrices. This looks awfully similar to the LQR assumptions, except for the presence of the constant term $\kappa$! It turns out that the constant term can be absorbed into $s_t$ by artificially increasing the dimension by one. This is the same trick that we used at the beginning of the class for linear regression...
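Both the linearization and the dimension trick can be sketched numerically. This is an illustrative sketch (the dynamics `f` in the check is a made-up example, and the Jacobians are taken by finite differences rather than analytically):

```python
import numpy as np

def linearize(f, s_bar, a_bar, eps=1e-6):
    """First-order Taylor expansion of s' = f(s, a) around (s_bar, a_bar).

    Returns (A, B, kappa) with f(s, a) ≈ A s + B a + kappa near the
    expansion point; Jacobians estimated by central finite differences.
    """
    n, d = len(s_bar), len(a_bar)
    A = np.zeros((n, n))
    B = np.zeros((n, d))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        A[:, i] = (f(s_bar + e, a_bar) - f(s_bar - e, a_bar)) / (2 * eps)
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        B[:, j] = (f(s_bar, a_bar + e) - f(s_bar, a_bar - e)) / (2 * eps)
    kappa = f(s_bar, a_bar) - A @ s_bar - B @ a_bar
    return A, B, kappa

def augment(A, B, kappa):
    """Absorb the constant kappa by appending a constant 1 to the state:
    [s'; 1] = [[A, kappa], [0, 1]] @ [s; 1] + [[B]; [0]] @ a."""
    n, d = A.shape[0], B.shape[1]
    A_aug = np.block([[A, kappa[:, None]],
                      [np.zeros((1, n)), np.ones((1, 1))]])
    B_aug = np.vstack([B, np.zeros((1, d))])
    return A_aug, B_aug
```

After augmenting, the system is exactly of the form $s_{t+1} = A s_t + B a_t$ required by LQR (in the augmented state space).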
The previous idea works well when the goal is to stay around a single state. When the system instead has to follow a whole trajectory (think of a rocket), we can use Differential Dynamic Programming (DDP): discretize the trajectory into time steps, linearize around each point, and iterate. The procedure is as follows:
step 1 come up with a nominal trajectory using a naive controller that approximates the trajectory we want to follow. In other words, our controller is able to approximate the gold trajectory with
$$s^*_0, a^*_0 \to s^*_1, a^*_1 \to \dots \to s^*_T$$
step 2 linearize the dynamics around each trajectory point s∗t , in other words
st+1 ≈ F (s∗t , a∗t ) + ∇s F (s∗t , a∗t )(st − s∗t ) + ∇a F (s∗t , a∗t )(at − a∗t )
where st , at would be our current state and action. Now that we have
a linear approximation around each of these points, we can use the
previous section and rewrite
st+1 = At · st + Bt · at
Similarly, we can use a second-order Taylor expansion of the reward $R$ around $(s^*_t, a^*_t)$:
$$\begin{aligned} R(s_t, a_t) \approx{}& R(s^*_t, a^*_t) + \nabla_s R(s^*_t, a^*_t)(s_t - s^*_t) + \nabla_a R(s^*_t, a^*_t)(a_t - a^*_t) \\ &+ \frac{1}{2} (s_t - s^*_t)^\top H_{ss} (s_t - s^*_t) + (s_t - s^*_t)^\top H_{sa} (a_t - a^*_t) \\ &+ \frac{1}{2} (a_t - a^*_t)^\top H_{aa} (a_t - a^*_t) \end{aligned}$$
where Hxy refers to the entry of the Hessian of R with respect to x and
y evaluated in (s∗t , a∗t ) (omitted for readability). This expression can be
re-written as
$$R(s_t, a_t) = -s_t^\top U_t s_t - a_t^\top W_t a_t$$
for some matrices $U_t$, $W_t$, with the same trick of adding an extra dimension of ones. To convince yourself, notice that
$$\begin{pmatrix} 1 & x \end{pmatrix} \begin{pmatrix} a & b \\ b & c \end{pmatrix} \begin{pmatrix} 1 \\ x \end{pmatrix} = a + 2bx + cx^2$$
step 3 Now, you can convince yourself that our problem has been rewritten exactly in the LQR framework. We just use LQR to find the optimal policy $\pi_t$. As a result, our new controller will (hopefully) be better!
Note: Some problems might arise if the LQR trajectory deviates too
much from the linearized approximation of the trajectory, but that can
be fixed with reward-shaping...
step 4 Now that we have a new controller (our new policy $\pi_t$), we use it to produce a new trajectory
$$s^*_0, \pi_0(s^*_0) \to s^*_1, \pi_1(s^*_1) \to \dots \to s^*_T$$
note that when we generate this new trajectory, we use the real $F$ and not its linear approximation to compute transitions, meaning that
$$s^*_{t+1} = F(s^*_t, a^*_t)$$
then, go back to step 2 and repeat until some stopping criterion.
4 Linear Quadratic Gaussian (LQG)
So far, we assumed that the full state was observable. In the real world, we often only receive partial and/or noisy observations of the state. This setting is called a Partially Observable MDP (POMDP): it adds an extra observation layer, where the observation $o_t$ is sampled conditionally on the current state,
$$o_t \mid s_t \sim O(o \mid s)$$
Formally, a finite-horizon POMDP is given by a tuple
$$(S, O, A, P_{sa}, T, R)$$
Within this framework, the general strategy is to maintain a belief state
(distribution over states) based on the observation o1 , . . . , ot . Then, a policy
in a POMDP maps belief states to actions.
step 1 first, compute the distribution on the possible states (the belief state),
based on the observations we have. In other words, we want to compute
the mean st|t and the covariance Σt|t of
$$s_t \mid y_1, \dots, y_t \sim \mathcal{N} \left( s_{t|t}, \Sigma_{t|t} \right)$$
to perform the computation efficiently over time, we'll use the Kalman Filter algorithm (used on-board the Apollo Lunar Module!).
step 2 now that we have the distribution, we’ll use the mean st|t as the best
approximation for st
step 3 then set the action at := Lt st|t where Lt comes from the regular LQR
algorithm.
Intuitively, to understand why this works, notice that st|t is a noisy ap-
proximation of st (equivalent to adding more noise to LQR) but we proved
that LQR is independent of the noise!
Step 1 needs to be made explicit. We'll cover a simple case where there is no action dependence in our dynamics (but the general case follows the same idea). Suppose that
$$\begin{cases} s_{t+1} = A \cdot s_t + w_t, & w_t \sim \mathcal{N}(0, \Sigma_s) \\ y_t = C \cdot s_t + v_t, & v_t \sim \mathcal{N}(0, \Sigma_y) \end{cases}$$
As noises are Gaussians, we can easily prove that the joint distribution is
also Gaussian
$$\begin{pmatrix} s_1 \\ \vdots \\ s_t \\ y_1 \\ \vdots \\ y_t \end{pmatrix} \sim \mathcal{N}(\mu, \Sigma) \quad \text{for some } \mu, \Sigma$$
then, using the marginal formulas of gaussians (see Factor Analysis notes),
we would get
$$s_t \mid y_1, \dots, y_t \sim \mathcal{N} \left( s_{t|t}, \Sigma_{t|t} \right)$$
However, computing the marginal distribution parameters using these
formulas would be computationally expensive! It would require manipulating
matrices of shape t × t. Recall that inverting a matrix can be done in O(t3 ),
and it would then have to be repeated over the time steps, yielding a cost in
O(t4 )!
The Kalman filter algorithm provides a much better way of computing
the mean and variance, by updating them over time in constant time in
t! The Kalman filter is based on two basic steps: a predict step, which computes the distribution of $s_{t+1} \mid y_1, \dots, y_t$, and an update step, which computes the distribution of $s_{t+1} \mid y_1, \dots, y_{t+1}$. We iterate these two steps over time; their combination updates our belief state.
Predict step. Assume that we know the distribution of
$$s_t \mid y_1, \dots, y_t \sim \mathcal{N} \left( s_{t|t}, \Sigma_{t|t} \right)$$
then, the distribution over the next state is also a gaussian distribution
$$s_{t+1} \mid y_1, \dots, y_t \sim \mathcal{N} \left( s_{t+1|t}, \Sigma_{t+1|t} \right)$$
where
$$\begin{cases} s_{t+1|t} = A \cdot s_{t|t} \\ \Sigma_{t+1|t} = A \cdot \Sigma_{t|t} \cdot A^\top + \Sigma_s \end{cases}$$
Update step. Given $s_{t+1|t}$ and $\Sigma_{t+1|t}$ such that
$$s_{t+1} \mid y_1, \dots, y_t \sim \mathcal{N} \left( s_{t+1|t}, \Sigma_{t+1|t} \right)$$
we can prove that
$$s_{t+1} \mid y_1, \dots, y_{t+1} \sim \mathcal{N} \left( s_{t+1|t+1}, \Sigma_{t+1|t+1} \right)$$
where
$$\begin{cases} s_{t+1|t+1} = s_{t+1|t} + K_t \left( y_{t+1} - C s_{t+1|t} \right) \\ \Sigma_{t+1|t+1} = \Sigma_{t+1|t} - K_t \cdot C \cdot \Sigma_{t+1|t} \end{cases}$$
with
$$K_t := \Sigma_{t+1|t} C^\top \left( C \Sigma_{t+1|t} C^\top + \Sigma_y \right)^{-1}$$
The matrix $K_t$ is called the Kalman gain.
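The predict and update steps can be combined into a single belief-update routine. This is a minimal sketch of the recursion (the function name and API are mine; the matrices in the usage check are illustrative):

```python
import numpy as np

def kalman_step(s, Sigma, y_next, A, C, Sigma_s, Sigma_y):
    """One predict + update step of the Kalman filter.

    s, Sigma: mean and covariance of s_t | y_1, ..., y_t.
    y_next: the new observation y_{t+1}.
    Returns the mean and covariance of s_{t+1} | y_1, ..., y_{t+1}.
    """
    # predict step: push the belief through the dynamics
    s_pred = A @ s
    Sigma_pred = A @ Sigma @ A.T + Sigma_s
    # update step: correct the prediction with the new observation
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_y)
    s_new = s_pred + K @ (y_next - C @ s_pred)
    Sigma_new = Sigma_pred - K @ C @ Sigma_pred
    return s_new, Sigma_new
```

Note that each step costs a constant amount of work in $t$ (it only manipulates $n \times n$ matrices), in contrast with the naive marginalization over the full joint gaussian.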