
LECTURE SLIDES ON DYNAMIC PROGRAMMING

BASED ON LECTURES GIVEN AT THE

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

CAMBRIDGE, MASS

FALL 2009

DIMITRI P. BERTSEKAS

These lecture slides are based on the book:


“Dynamic Programming and Optimal Control: 3rd edition,” Vols. 1 and 2, Athena Scientific, 2007, by Dimitri P. Bertsekas; see

http://www.athenasc.com/dpbook.html

Last Updated: September 2009

The slides may be freely reproduced and distributed.
6.231 DYNAMIC PROGRAMMING

LECTURE 1

LECTURE OUTLINE

• Problem Formulation
• Examples
• The Basic Problem
• Significance of Feedback
DP AS AN OPTIMIZATION METHODOLOGY

• Generic optimization problem:

min_{u∈U} g(u)

where u is the optimization/decision variable, g(u)


is the cost function, and U is the constraint set
• Categories of problems:
− Discrete (U is finite) or continuous
− Linear (g is linear and U is polyhedral) or
nonlinear
− Stochastic or deterministic: In stochastic prob-
lems the cost involves a stochastic parameter
w, which is averaged, i.e., it has the form

g(u) = Ew G(u, w)

where w is a random parameter.


• DP can deal with complex stochastic problems
where information about w becomes available in
stages, and the decisions are also made in stages
and make use of this information.
BASIC STRUCTURE OF STOCHASTIC DP

• Discrete-time system

xk+1 = fk (xk , uk , wk ), k = 0, 1, . . . , N − 1

− k: Discrete time
− xk : State; summarizes past information that
is relevant for future optimization
− uk : Control; decision to be selected at time
k from a given set
− wk : Random parameter (also called distur-
bance or noise depending on the context)
− N : Horizon or number of times control is
applied
• Cost function that is additive over time

  E{ gN (xN ) + Σ_{k=0}^{N−1} gk (xk , uk , wk ) }
INVENTORY CONTROL EXAMPLE

[Figure: inventory system block diagram. Demand wk arrives at period k; the stock xk and the stock ordered uk enter the system, the next stock is xk+1 = xk + uk − wk , and the cost of period k is cuk + r(xk + uk − wk ).]

• Discrete-time system

xk+1 = fk (xk , uk , wk ) = xk + uk − wk
• Cost function that is additive over time

  E{ gN (xN ) + Σ_{k=0}^{N−1} gk (xk , uk , wk ) } = E{ Σ_{k=0}^{N−1} ( cuk + r(xk + uk − wk ) ) }

• Optimization over policies: Rules/functions uk = µk (xk ) that map states to controls
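
A minimal numerical sketch of this inventory problem in Python, assuming a discretized stock level, a small demand distribution, and illustrative cost parameters c, p, h and horizon N (none of these values come from the slides); it runs the backward DP recursion introduced later in the lecture.

import numpy as np

# Hypothetical parameters (not from the slides): order cost c, shortage
# penalty p, holding cost h, horizon N, and a small demand distribution.
c, p, h, N = 1.0, 3.0, 1.0, 3
demand_vals, demand_probs = np.array([0, 1, 2]), np.array([0.3, 0.5, 0.2])
states = np.arange(-5, 11)          # discretized stock levels x_k
orders = np.arange(0, 6)            # admissible orders u_k >= 0

def r(x):                           # shortage/holding cost r(x)
    return p * max(0.0, -x) + h * max(0.0, x)

J = {x: 0.0 for x in states}        # J_N(x_N) = 0 (no terminal cost here)
policy = []
for k in reversed(range(N)):
    Jk, muk = {}, {}
    for x in states:
        best_u, best_cost = None, np.inf
        for u in orders:
            # expected stage cost + cost-to-go, averaged over demand w_k
            cost = sum(pr * (c * u + r(x + u - w)
                             + J[np.clip(x + u - w, states[0], states[-1])])
                       for w, pr in zip(demand_vals, demand_probs))
            if cost < best_cost:
                best_u, best_cost = u, cost
        Jk[x], muk[x] = best_cost, best_u
    J, policy = Jk, [muk] + policy

print("Optimal first-period order at stock 0:", policy[0][0])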
ADDITIONAL ASSUMPTIONS

• The set of values that the control uk can take depends at most on xk and not on prior x or u
• Probability distribution of wk does not depend
on past values wk−1 , . . . , w0 , but may depend on
xk and uk
− Otherwise past values of w or x would be
useful for future optimization
• Sequence of events envisioned in period k:
− xk occurs according to

  xk = fk−1 (xk−1 , uk−1 , wk−1 )

− uk is selected with knowledge of xk , i.e.,

  uk ∈ Uk (xk )

− wk is random and generated according to a distribution

  Pwk (· | xk , uk )
DETERMINISTIC FINITE-STATE PROBLEMS

• Scheduling example: Find optimal sequence of


operations A, B, C, D
• A must precede B, and C must precede D
• Given startup cost SA and SC , and setup tran-
sition cost Cmn from operation m to operation n

[Figure: state transition graph for the scheduling example. From the initial state, startup costs SA and SC lead to states A and C; subsequent arcs carry the setup costs Cmn and lead through the partial schedules AB, AC, CA, CD to the complete schedules ABC, ACB, ACD, CAB, CAD, CDA.]
STOCHASTIC FINITE-STATE PROBLEMS

• Example: Find two-game chess match strategy


• Timid play draws with prob. pd > 0 and loses
with prob. 1 − pd . Bold play wins with prob. pw <
1/2 and loses with prob. 1 − pw

[Figure: transition diagrams for the two-game chess match. In the 1st game, timid play leads from 0-0 to 0.5-0.5 w.p. pd and to 0-1 w.p. 1 − pd ; bold play leads to 1-0 w.p. pw and to 0-1 w.p. 1 − pw . In the 2nd game, the analogous timid/bold transitions apply from each intermediate score (1-0, 0.5-0.5, 0-1) to the possible final scores 2-0, 1.5-0.5, 1-1, 0.5-1.5, 0-2.]


BASIC PROBLEM

• System xk+1 = fk (xk , uk , wk ), k = 0, . . . , N −1
• Control constraints uk ∈ Uk (xk )
• Probability distribution Pk (· | xk , uk ) of wk
• Policies π = {µ0 , . . . , µN −1 }, where µk maps
states xk into controls uk = µk (xk ) and is such
that µk (xk ) ∈ Uk (xk ) for all xk
• Expected cost of π starting at x0 is

  Jπ (x0 ) = E{ gN (xN ) + Σ_{k=0}^{N−1} gk (xk , µk (xk ), wk ) }

• Optimal cost function

  J ∗ (x0 ) = min_π Jπ (x0 )

• Optimal policy π ∗ satisfies

Jπ∗ (x0 ) = J ∗ (x0 )

When produced by DP, π ∗ is independent of x0 .


SIGNIFICANCE OF FEEDBACK

• Open-loop versus closed-loop policies

[Figure: closed-loop control diagram. The policy µk produces uk = µk (xk ) from the current state xk ; the system xk+1 = fk (xk , uk , wk ) is driven by the control uk and the disturbance wk .]
• In deterministic problems open loop is as good


as closed loop
• Chess match example; value of information

[Figure: chess match example illustrating the value of information. From 0-0, bold play in the 1st game leads to 1-0 w.p. pw or 0-1 w.p. 1 − pw ; after 1-0, timid play yields 1.5-0.5 w.p. pd or 1-1 w.p. 1 − pd ; after 0-1, bold play yields 1-1 w.p. pw or 0-2 w.p. 1 − pw .]
VARIANTS OF DP PROBLEMS

• Continuous-time problems
• Imperfect state information problems
• Infinite horizon problems
• Suboptimal control
LECTURE BREAKDOWN

• Finite Horizon Problems (Vol. 1, Ch. 1-6)


− Ch. 1: The DP algorithm (2 lectures)
− Ch. 2: Deterministic finite-state problems (2
lectures)
− Ch. 4: Stochastic DP problems (2 lectures)
− Ch. 5: Imperfect state information problems
(2 lectures)
− Ch. 6: Suboptimal control (5 lectures)
• Infinite Horizon Problems - Simple (Vol. 1, Ch.
7, 3 lectures)
• Infinite Horizon Problems - Advanced (Vol. 2)
− Ch. 1: Discounted problems - Computational
methods (2 lectures)
− Ch. 2: Stochastic shortest path problems (1
lecture)
− Ch. 6: Approximate DP (5 lectures)
A NOTE ON THESE SLIDES

• These slides are a teaching aid, not a text


• Don’t expect a rigorous mathematical develop-
ment or precise mathematical statements
• Figures are meant to convey and enhance ideas,
not to express them precisely
• Omitted proofs and a much fuller discussion
can be found in the text, which these slides follow
6.231 DYNAMIC PROGRAMMING

LECTURE 2

LECTURE OUTLINE

• The basic problem


• Principle of optimality
• DP example: Deterministic problem
• DP example: Stochastic problem
• The general DP algorithm
• State augmentation
BASIC PROBLEM

• System xk+1 = fk (xk , uk , wk ), k = 0, . . . , N −1


• Control constraints uk ∈ Uk (xk )
• Probability distribution Pk (· | xk , uk ) of wk
• Policies π = {µ0 , . . . , µN −1 }, where µk maps
states xk into controls uk = µk (xk ) and is such
that µk (xk ) ∈ Uk (xk ) for all xk
• Expected cost of π starting at x0 is

  Jπ (x0 ) = E{ gN (xN ) + Σ_{k=0}^{N−1} gk (xk , µk (xk ), wk ) }

• Optimal cost function

  J ∗ (x0 ) = min_π Jπ (x0 )

• Optimal policy π ∗ is one that satisfies

Jπ∗ (x0 ) = J ∗ (x0 )


PRINCIPLE OF OPTIMALITY

• Let π ∗ = {µ∗0 , µ∗1 , . . . , µ∗N −1 } be optimal policy


• Consider the “tail subproblem” whereby we are
at xi at time i and wish to minimize the “cost-to-
go” from time i to time N

  E{ gN (xN ) + Σ_{k=i}^{N−1} gk (xk , µk (xk ), wk ) }

and the “tail policy” {µ∗i , µ∗i+1 , . . . , µ∗N −1 }


[Figure: time line from 0 to N; the tail subproblem starts at state xi at time i.]

• Principle of optimality: The tail policy is opti-


mal for the tail subproblem (optimization of the
future does not depend on what we did in the past)
• DP first solves ALL tail subproblems of the final stage
• At the generic step, it solves ALL tail subprob-
lems of a given time length, using the solution of
the tail subproblems of shorter time length
DETERMINISTIC SCHEDULING EXAMPLE

• Find optimal sequence of operations A, B, C,


D (A must precede B and C must precede D)

[Figure: the scheduling example graph annotated with arc costs and with the optimal cost-to-go recorded at each node (initial state, partial schedules A, C, AB, AC, CA, CD, and complete schedules ABC, ACB, ACD, CAB, CAD, CDA).]

• Start from the last tail subproblem and go backwards
• At each state-time pair, we record the optimal
cost-to-go and the optimal decision
STOCHASTIC INVENTORY EXAMPLE

[Figure: inventory system block diagram. Demand wk at period k; stock xk and order uk ; next stock xk+1 = xk + uk − wk ; cost of period k is cuk + r(xk + uk − wk ).]

• Tail subproblems of length 1:

  JN−1 (xN−1 ) = min_{uN−1 ≥ 0} E_{wN−1} { cuN−1 + r(xN−1 + uN−1 − wN−1 ) }

• Tail subproblems of length N − k:

  Jk (xk ) = min_{uk ≥ 0} E_{wk} { cuk + r(xk + uk − wk ) + Jk+1 (xk + uk − wk ) }

• J0 (x0 ) is the optimal cost for initial state x0


DP ALGORITHM

• Start with

JN (xN ) = gN (xN ),

and go backwards using



  Jk (xk ) = min_{uk ∈ Uk (xk )} E_{wk} { gk (xk , uk , wk ) + Jk+1 ( fk (xk , uk , wk ) ) },  k = 0, 1, . . . , N − 1.

• Then J0 (x0 ), generated at the last step, is equal


to the optimal cost J ∗ (x0 ). Also, the policy
π ∗ = {µ∗0 , . . . , µ∗N −1 }
where µ∗k (xk ) minimizes in the right side above for
each xk and k, is optimal
• Justification: Proof by induction that Jk (xk ) is
equal to Jk∗ (xk ), defined as the optimal cost of the
tail subproblem that starts at time k at state xk
• Note:
− ALL the tail subproblems are solved (in ad-
dition to the original problem)
− Intensive computational requirements
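
A minimal sketch of this backward recursion for a generic finite-state problem, assuming the model is supplied as enumerable states, a controls function, and a transitions function returning (probability, next state, stage cost) triples; the names states, controls, transitions, g_N are placeholders, not from the slides.

def backward_dp(states, controls, transitions, g_N, N):
    """Finite-horizon DP: J_k(x) = min_u E{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) }.

    transitions(k, x, u) -> list of (prob, next_state, stage_cost).
    Returns the cost-to-go tables J[0..N] and the policy mu[0..N-1].
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    J[N] = {x: g_N(x) for x in states}          # terminal condition J_N = g_N
    for k in reversed(range(N)):
        for x in states:
            best_u, best = None, float("inf")
            for u in controls(k, x):            # u in U_k(x)
                # expectation over the disturbance w_k
                q = sum(p * (g + J[k + 1][x2]) for p, x2, g in transitions(k, x, u))
                if q < best:
                    best_u, best = u, q
            J[k][x], mu[k][x] = best, best_u
    return J, mu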
PROOF OF THE INDUCTION STEP

• Let πk = {µk , µk+1 , . . . , µN−1 } denote a tail policy from time k onward
• Assume that Jk+1 (xk+1 ) = J∗k+1 (xk+1 ). Then

  J∗k (xk ) = min_{(µk ,πk+1)} E_{wk ,...,wN−1} { gk (xk , µk (xk ), wk ) + gN (xN ) + Σ_{i=k+1}^{N−1} gi (xi , µi (xi ), wi ) }

           = min_{µk} E_{wk} { gk (xk , µk (xk ), wk )
               + min_{πk+1} E_{wk+1 ,...,wN−1} { gN (xN ) + Σ_{i=k+1}^{N−1} gi (xi , µi (xi ), wi ) } }

           = min_{µk} E_{wk} { gk (xk , µk (xk ), wk ) + J∗k+1 ( fk (xk , µk (xk ), wk ) ) }

           = min_{µk} E_{wk} { gk (xk , µk (xk ), wk ) + Jk+1 ( fk (xk , µk (xk ), wk ) ) }

           = min_{uk ∈ Uk (xk )} E_{wk} { gk (xk , uk , wk ) + Jk+1 ( fk (xk , uk , wk ) ) }

           = Jk (xk )
LINEAR-QUADRATIC ANALYTICAL EXAMPLE

[Figure: material with initial temperature x0 passes through Oven 1 (temperature u0 ), reaching temperature x1 , then through Oven 2 (temperature u1 ), reaching final temperature x2 .]

• System

xk+1 = (1 − a)xk + auk , k = 0, 1,

where a is a given scalar from the interval (0, 1)


• Cost

  r(x2 − T )^2 + u0^2 + u1^2

  where r is a given positive scalar
• DP algorithm:

  J2 (x2 ) = r(x2 − T )^2

  J1 (x1 ) = min_{u1} [ u1^2 + r( (1 − a)x1 + au1 − T )^2 ]

  J0 (x0 ) = min_{u0} [ u0^2 + J1 ( (1 − a)x0 + au0 ) ]
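
A minimal numerical sketch of these two nested minimizations, assuming illustrative values a = 0.5, r = 1, T = 100, x0 = 20 (not from the slides) and a coarse grid search instead of the closed-form solution.

import numpy as np

# Illustrative values (not from the slides): a in (0,1), r > 0, target T, x0.
a, r, T, x0 = 0.5, 1.0, 100.0, 20.0
u_grid = np.linspace(0.0, 300.0, 3001)   # candidate oven temperatures

def J2(x2):                               # terminal cost
    return r * (x2 - T) ** 2

def J1(x1):                               # J1(x1) = min_u1 [u1^2 + J2((1-a)x1 + a*u1)]
    costs = u_grid ** 2 + J2((1 - a) * x1 + a * u_grid)
    return costs.min(), u_grid[costs.argmin()]

# J0(x0) = min_u0 [u0^2 + J1((1-a)x0 + a*u0)]
costs0 = np.array([u0 ** 2 + J1((1 - a) * x0 + a * u0)[0] for u0 in u_grid])
u0_star = u_grid[costs0.argmin()]
print("J0(x0) ~", round(costs0.min(), 2), " optimal u0 ~", u0_star)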
STATE AUGMENTATION

• When assumptions of the basic problem are


violated (e.g., disturbances are correlated, cost is
nonadditive, etc) reformulate/augment the state
• Example: Time lags

xk+1 = fk (xk , xk−1 , uk , wk )

• Introduce an additional state variable yk = xk−1 . The new system takes the form

  ( xk+1 , yk+1 ) = ( fk (xk , yk , uk , wk ), xk )

  View x̃k = (xk , yk ) as the new state.
• DP algorithm for the reformulated problem:

  Jk (xk , xk−1 ) = min_{uk ∈ Uk (xk )} E_{wk} { gk (xk , uk , wk ) + Jk+1 ( fk (xk , xk−1 , uk , wk ), xk ) }
6.231 DYNAMIC PROGRAMMING

LECTURE 3

LECTURE OUTLINE

• Deterministic finite-state DP problems


• Backward shortest path algorithm
• Forward shortest path algorithm
• Shortest path examples
• Alternative shortest path algorithms
DETERMINISTIC FINITE-STATE PROBLEM

[Figure: a finite-state deterministic problem viewed as a graph. Nodes are arranged in stages 0, 1, . . . , N ; an initial node s starts stage 0, and terminal arcs, with cost equal to the terminal cost, lead from the stage-N nodes to an artificial terminal node t.]

• States <==> Nodes


• Controls <==> Arcs
• Control sequences (open-loop) <==> paths
from initial state to terminal states
• a^k_ij : Cost of transition from state i ∈ Sk to state j ∈ Sk+1 at time k (view it as the “length” of the arc)
• a^N_it : Terminal cost of state i ∈ SN
• Cost of control sequence <==> Cost of the corresponding path (view it as the “length” of the path)
BACKWARD AND FORWARD DP ALGORITHMS

• DP algorithm:

  JN (i) = a^N_it ,  i ∈ SN ,

  Jk (i) = min_{j ∈ Sk+1} [ a^k_ij + Jk+1 (j) ],  i ∈ Sk , k = 0, . . . , N −1

• The optimal cost is J0 (s) and is equal to the length of the shortest path from s to t
• Observation: An optimal path s → t is also an optimal path t → s in a “reverse” shortest path problem where the direction of each arc is reversed and its length is left unchanged
• Forward DP algorithm (= backward DP algorithm for the reverse problem):

  J̃N (j) = a^0_sj ,  j ∈ S1 ,

  J̃k (j) = min_{i ∈ SN−k} [ a^{N−k}_ij + J̃k+1 (i) ],  j ∈ SN−k+1

  The optimal cost is J̃0 (t) = min_{i ∈ SN} [ a^N_it + J̃1 (i) ]

• View J̃k (j) as the optimal cost-to-arrive to state j from initial state s
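
A minimal sketch of the backward algorithm above, assuming the staged graph is given as node lists per stage, an arc_cost function, and a terminal_cost function (placeholder names, not from the slides).

import math

def backward_shortest_path(stages, arc_cost, terminal_cost):
    """Backward DP over a staged graph.

    stages: list S_0, ..., S_N of node lists; arc_cost(k, i, j) gives a^k_ij
    (math.inf if no arc); terminal_cost(i) gives a^N_it.
    Returns J with J[0][s] = shortest s -> t length for each s in S_0.
    """
    N = len(stages) - 1
    J = [dict() for _ in range(N + 1)]
    J[N] = {i: terminal_cost(i) for i in stages[N]}
    for k in reversed(range(N)):
        for i in stages[k]:
            J[k][i] = min(arc_cost(k, i, j) + J[k + 1][j] for j in stages[k + 1])
    return J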
A NOTE ON FORWARD DP ALGORITHMS

• There is no forward DP algorithm for stochastic


problems
• Mathematically, for stochastic problems, we
cannot restrict ourselves to open-loop sequences,
so the shortest path viewpoint fails
• Conceptually, in the presence of uncertainty,
the concept of “optimal-cost-to-arrive” at a state
xk does not make sense. For example, it may be
impossible to guarantee (with prob. 1) that any
given state can be reached
• By contrast, even in stochastic problems, the
concept of “optimal cost-to-go” from any state xk
makes clear sense
GENERIC SHORTEST PATH PROBLEMS

• {1, 2, . . . , N, t}: nodes of a graph (t: the desti-


nation)
• aij : cost of moving from node i to node j
• Find a shortest (minimum cost) path from each
node i to node t
• Assumption: All cycles have nonnegative length.
Then an optimal path need not take more than N
moves
• We formulate the problem as one where we re-
quire exactly N moves but allow degenerate moves
from a node i to itself with cost aii = 0

  Jk (i) = optimal cost of getting from i to t in N − k moves

  J0 (i): cost of the optimal path from i to t

• DP algorithm:

  Jk (i) = min_{j=1,...,N} [ aij + Jk+1 (j) ],  k = 0, 1, . . . , N −2,

  with JN−1 (i) = ait ,  i = 1, 2, . . . , N


EXAMPLE

[Figure: (a) a small example graph with destination node 5 (the destination t) and arc costs; (b) the table of costs-to-go Jk (i) for each state i and stage k, computed by the recursion below.]

  JN−1 (i) = ait ,  i = 1, 2, . . . , N,

  Jk (i) = min_{j=1,...,N} [ aij + Jk+1 (j) ],  k = 0, 1, . . . , N −2.
ESTIMATION / HIDDEN MARKOV MODELS

• Markov chain with transition probabilities pij


• State transitions are hidden from view
• For each transition, we get an (independent)
observation
• r(z; i, j): Prob. the observation takes value z
when the state transition is from i to j
• Trajectory estimation problem: Given the ob-
servation sequence ZN = {z1 , z2 , . . . , zN }, what is
the “most likely” state transition sequence X̂N =
{x̂0 , x̂1 , . . . , x̂N } [one that maximizes p(XN | ZN )
over all XN = {x0 , x1 , . . . , xN }].

[Figure: trellis diagram with an origin node s, state nodes for x0 , x1 , x2 , . . . , xN−1 , xN , and an artificial terminal node t.]
VITERBI ALGORITHM

• We have

  p(XN | ZN ) = p(XN , ZN ) / p(ZN )

  where p(XN , ZN ) and p(ZN ) are the unconditional probabilities of occurrence of (XN , ZN ) and ZN
• Maximizing p(XN | ZN ) is equivalent to maximizing ln p(XN , ZN )
• We have

  p(XN , ZN ) = πx0 Π_{k=1}^{N} pxk−1 xk r(zk ; xk−1 , xk )

  so the problem is equivalent to

  minimize −ln(πx0 ) − Σ_{k=1}^{N} ln( pxk−1 xk r(zk ; xk−1 , xk ) )

  over all possible sequences {x0 , x1 , . . . , xN }.

• This is a shortest path problem.
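
A minimal Viterbi sketch, assuming arrays pi0 (initial distribution), P (transition matrix), and an observation model r(z, i, j); these names are placeholders, not from the slides. It minimizes the negative log-likelihood above by forward DP over the trellis.

import math

def viterbi(pi0, P, r, Z):
    """Most likely state sequence x_0..x_N given observations Z = [z_1..z_N].

    pi0[i]: P(x_0 = i); P[i][j]: transition prob.; r(z, i, j): obs. likelihood.
    Minimizes -ln(pi0[x0]) - sum_k ln(P[x_{k-1}][x_k] * r(z_k, x_{k-1}, x_k)).
    """
    n = len(pi0)
    cost = [-math.log(pi0[i]) if pi0[i] > 0 else math.inf for i in range(n)]
    back = []
    for z in Z:
        new_cost, parents = [], []
        for j in range(n):
            cands = [(cost[i] - math.log(P[i][j] * r(z, i, j))
                      if P[i][j] * r(z, i, j) > 0 else math.inf, i) for i in range(n)]
            c, i_best = min(cands)
            new_cost.append(c); parents.append(i_best)
        cost, back = new_cost, back + [parents]
    # backtrack from the best terminal state through the stored parents
    x = [min(range(n), key=lambda i: cost[i])]
    for parents in reversed(back):
        x.append(parents[x[-1]])
    return list(reversed(x))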


GENERAL SHORTEST PATH ALGORITHMS

• There are many non-DP shortest path algorithms. They can all be used to solve deterministic finite-state problems
• They may be preferable to DP if they avoid calculating the optimal cost-to-go of EVERY state
• This is essential for problems with HUGE state spaces. Such problems arise for example in combinatorial optimization
[Figure: shortest path formulation of a four-city traveling salesman problem. The origin node s = A branches (with the corresponding arc costs) to partial tours AB, AC, AD, then to ABC, ABD, ACB, ACD, ADB, ADC, then to the complete tours ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, each connected by a final arc to an artificial terminal node t.]
LABEL CORRECTING METHODS

• Given: Origin s, destination t, lengths aij ≥ 0.


• Idea is to progressively discover shorter paths
from the origin s to every other node i
• Notation:
− di (label of i): Length of the shortest path
found (initially ds = 0, di = ∞ for i 6= s)
− UPPER: The label dt of the destination
− OPEN list: Contains nodes that are currently active in the sense that they are candidates for further examination (initially OPEN = {s})
Label Correcting Algorithm
Step 1 (Node Removal): Remove a node i from
OPEN and for each child j of i, do step 2
Step 2 (Node Insertion Test): If di + aij <
min{dj , UPPER}, set dj = di + aij and set i to
be the parent of j. In addition, if j 6= t, place j in
OPEN if it is not already in OPEN, while if j = t,
set UPPER to the new value di + ait of dt
Step 3 (Termination Test): If OPEN is empty,
terminate; else go to step 1
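
A minimal sketch of the three steps above, assuming the graph is given as a successor map succ and an arc-length dictionary cost (placeholder names, not from the slides); the node removed in Step 1 is taken from the end of the OPEN list.

import math

def label_correcting(succ, cost, s, t):
    """Label correcting shortest path from s to t (arc lengths >= 0).

    succ[i]: list of children of i; cost[(i, j)]: length a_ij.
    Returns (UPPER, parent) where UPPER is the shortest s -> t length.
    """
    d = {s: 0.0}                      # labels d_i (missing key means +infinity)
    parent, UPPER, OPEN = {}, math.inf, [s]
    while OPEN:
        i = OPEN.pop()                # Step 1: remove a node from OPEN
        for j in succ.get(i, []):
            dj_new = d[i] + cost[(i, j)]
            # Step 2: insertion test against both d_j and UPPER
            if dj_new < min(d.get(j, math.inf), UPPER):
                d[j], parent[j] = dj_new, i
                if j == t:
                    UPPER = dj_new    # improved s -> t path found
                elif j not in OPEN:
                    OPEN.append(j)
    return UPPER, parent              # Step 3: OPEN empty, terminate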
VISUALIZATION/EXPLANATION

• Given: Origin s, destination t, lengths aij ≥ 0


• di (label of i): Length of the shortest path found
thus far (initially ds = 0, di = ∞ for i 6= s). The
label di is implicitly associated with an s → i path
• UPPER: The label dt of the destination
• OPEN list: Contains “active” nodes (initially
OPEN={s})

[Figure: flowchart of the insertion test. When node i is removed from OPEN, each child j is tested: is di + aij < dj (is the path s → i → j better than the current path s → j ?) and is di + aij < UPPER (does the path s → i → j have a chance to be part of a shorter s → t path ?). If both hold, set dj = di + aij and insert j into OPEN.]
EXAMPLE

[Figure: the four-city traveling salesman tree with the nodes numbered 1 (A), 2 (AB), 3 (ABC), 4 (ABCD), 5 (ABD), 6 (ABDC), 7 (AC), 8 (ACD), 9 (ACDB), 10 (AD), and an artificial terminal node t.]

Iter. No. | Node Exiting OPEN | OPEN after Iteration | UPPER
0  | -  | 1           | ∞
1  | 1  | 2, 7, 10    | ∞
2  | 2  | 3, 5, 7, 10 | ∞
3  | 3  | 4, 5, 7, 10 | ∞
4  | 4  | 5, 7, 10    | 43
5  | 5  | 6, 7, 10    | 43
6  | 6  | 7, 10       | 13
7  | 7  | 8, 10       | 13
8  | 8  | 9, 10       | 13
9  | 9  | 10          | 13
10 | 10 | Empty       | 13

• Note that some nodes never entered OPEN


6.231 DYNAMIC PROGRAMMING

LECTURE 4

LECTURE OUTLINE

• Label correcting methods for shortest paths


• Variants of label correcting methods
• Branch-and-bound as a shortest path algorithm
LABEL CORRECTING METHODS

• Origin s, destination t, lengths aij that are ≥ 0


• di (label of i): Length of the shortest path
found thus far (initially di = ∞ except ds = 0).
The label di is implicitly associated with an s → i
path
• UPPER: Label dt of the destination
• OPEN list: Contains “active” nodes (initially
OPEN={s})

[Figure: flowchart of the insertion test (as in Lecture 3). When i exits OPEN, each child j with di + aij < min{dj , UPPER} gets dj = di + aij and is inserted into OPEN.]
VALIDITY OF LABEL CORRECTING METHODS

Proposition: If there exists at least one path


from the origin to the destination, the label cor-
recting algorithm terminates with UPPER equal
to the shortest distance from the origin to the des-
tination
Proof: (1) Each time a node j enters OPEN, its
label is decreased and becomes equal to the length
of some path from s to j
(2) The number of possible distinct path lengths
is finite, so the number of times a node can enter
OPEN is finite, and the algorithm terminates
(3) Let (s, j1 , j2 , . . . , jk , t) be a shortest path and
let d∗ be the shortest distance. If UPPER > d∗
at termination, UPPER will also be larger than
the length of all the paths (s, j1 , . . . , jm ), m =
1, . . . , k, throughout the algorithm. Hence, node
jk will never enter the OPEN list with djk equal
to the shortest distance from s to jk . Similarly
node jk−1 will never enter the OPEN list with
djk−1 equal to the shortest distance from s to jk−1 .
Continue to j1 to get a contradiction
MAKING THE METHOD EFFICIENT

• Reduce the value of UPPER as quickly as pos-


sible
− Try to discover “good” s → t paths early in
the course of the algorithm
• Keep the number of reentries into OPEN low
− Try to remove from OPEN nodes with small
label first.
− Heuristic rationale: if di is small, then dj
when set to di +aij will be accordingly small,
so reentrance of j in the OPEN list is less
likely
• Reduce the overhead for selecting the node to
be removed from OPEN
• These objectives are often in conflict. They give
rise to a large variety of distinct implementations
• Good practical strategies try to strike a compro-
mise between low overhead and small label node
selection
NODE SELECTION METHODS

• Depth-first search: Remove from the top of


OPEN and insert at the top of OPEN.
− Has low memory storage properties (OPEN
is not too long). Reduces UPPER quickly.
[Figure: depth-first search on a tree, from the origin node s down to the destination node t; the nodes are numbered 1–14 in the order in which they are examined.]

• Best-first search (Dijkstra): Remove from OPEN a node with minimum value of label.
− Interesting property: Each node will be in-
serted in OPEN at most once.
− Nodes enter OPEN at minimum distance
− Many implementations/approximations
ADVANCED INITIALIZATION

• Instead of starting from di = ∞ for all i 6= s,


start with

di = length of some path from s to i (or di = ∞)

OPEN = {i 6= t | di < ∞}

• Motivation: Get a small starting value of UP-


PER.
• No node with shortest distance ≥ initial value
of UPPER will enter OPEN
• Good practical idea:
− Run a heuristic (or use common sense) to
get a “good” starting path P from s to t
− Use as UPPER the length of P , and as di
the path distances of all nodes i along P
• Very useful also in reoptimization, where we
solve the same problem with slightly different data
VARIANTS OF LABEL CORRECTING METHODS

• If a lower bound hj of the true shortest dis-


tance from j to t is known, use the test

di + aij + hj < UPPER

for entry into OPEN, instead of

di + aij < UPPER

The label correcting method with lower bounds as


above is often referred to as the A∗ method.

• If an upper bound mj of the true shortest


distance from j to t is known, then if dj + mj <
UPPER, reduce UPPER to dj + mj .
• Important use: Branch-and-bound algorithm
for discrete optimization can be viewed as an im-
plementation of this last variant.
BRANCH-AND-BOUND METHOD

• Problem: Minimize f (x) over a finite set of


feasible solutions X.
• Idea of branch-and-bound: Partition the fea-
sible set into smaller subsets, and then calculate
certain bounds on the attainable cost within some
of the subsets to eliminate from further consider-
ation other subsets.
Bounding Principle
Given two subsets Y1 ⊂ X and Y2 ⊂ X, suppose
that we have bounds

  f_1 ≤ min_{x ∈ Y1} f (x),    f̄_2 ≥ min_{x ∈ Y2} f (x).

Then, if f̄_2 ≤ f_1 , the solutions in Y1 may be disregarded, since their cost cannot be smaller than the cost of the best solution in Y2 .

• The B+B algorithm can be viewed as a la-


bel correcting algorithm, where lower bounds de-
fine the arc costs, and upper bounds are used to
strengthen the test for admission to OPEN.
SHORTEST PATH IMPLEMENTATION

• Acyclic graph/partition of X into subsets (typically a tree). The leaves consist of single solutions.
• Upper and lower bounds f̄_Y and f_Y for the minimum cost over each subset Y can be calculated.
• The lower bound of a leaf {x} is f (x)
• Each arc (Y, Z) has length f_Z − f_Y
• Shortest distance from X to Y = f_Y − f_X
• Distance from origin X to a leaf {x} is f (x) − f_X
• Shortest path from X to the set of leaves gives the optimal cost and optimal solution
• UPPER is the smallest f (x) − f_X over the leaf nodes {x} examined so far

[Figure: a partition tree of X = {1,2,3,4,5} into {1,2,3} and {4,5}, then {1,2}, {3}, {4}, {5}, and leaves {1}, {2}.]
BRANCH-AND-BOUND ALGORITHM

Step 1: Remove a node Y from OPEN. For each child Yj of Y , do the following:
− Entry Test: If f_Yj < UPPER, place Yj in OPEN.
− Update UPPER: If f̄_Yj < UPPER, set UPPER = f̄_Yj , and if Yj consists of a single solution, mark that as being the best solution found so far
Step 2 (Termination Test): If OPEN is empty, terminate; the best solution found so far is optimal. Else go to Step 1
• It is neither practical nor necessary to generate
a priori the acyclic graph (generate it as you go)
• Keys to branch-and-bound:
− Generate as sharp as possible upper and lower
bounds at each node
− Have a good partitioning and node selection
strategy
• Method involves a lot of art, may be prohibitively
time-consuming ... but guaranteed to find an op-
timal solution
6.231 DYNAMIC PROGRAMMING

LECTURE 5

LECTURE OUTLINE

• Examples of stochastic DP problems


• Linear-quadratic problems
• Inventory control
LINEAR-QUADRATIC PROBLEMS

• System: xk+1 = Ak xk + Bk uk + wk
• Quadratic cost
  E_{wk , k=0,...,N−1} { x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }

  where Qk ≥ 0 and Rk > 0 (in the positive (semi)definite sense)
• wk are independent and zero mean
• DP algorithm:

  JN (xN ) = x′N QN xN ,

  Jk (xk ) = min_{uk} E{ x′k Qk xk + u′k Rk uk + Jk+1 (Ak xk + Bk uk + wk ) }
• Key facts:
− Jk (xk ) is quadratic
− Optimal policy {µ∗0 , . . . , µ∗N −1 } is linear:
µ∗k (xk ) = Lk xk
− Similar treatment of a number of variants
DERIVATION

• By induction verify that

µ∗k (xk ) = Lk xk , Jk (xk ) = x′k Kk xk + constant,

where Lk are matrices given by

Lk = −(Bk′ Kk+1 Bk + Rk )−1 Bk′ Kk+1 Ak ,

and where Kk are symmetric positive semidefinite


matrices given by

KN = QN ,

Kk = A′k ( Kk+1 − Kk+1 Bk (Bk′ Kk+1 Bk + Rk )−1 Bk′ Kk+1 ) Ak + Qk .
• This is called the discrete-time Riccati equation.
• Just like DP, it starts at the terminal time N
and proceeds backwards.
• Certainty equivalence holds (optimal policy is
the same as when wk is replaced by its expected
value E{wk } = 0).
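
A minimal sketch of this backward Riccati recursion and the gains Lk , assuming a time-invariant system (A, B, Q, R) for brevity and illustrative matrices at the end; the general time-varying case is analogous.

import numpy as np

def riccati_recursion(A, B, Q, R, QN, N):
    """Discrete-time Riccati recursion K_N = Q_N backwards to K_0, with gains L_k."""
    K = QN.copy()
    K_list, L_list = [None] * (N + 1), [None] * N
    K_list[N] = K
    for k in reversed(range(N)):
        S = B.T @ K @ B + R
        L = -np.linalg.solve(S, B.T @ K @ A)          # L_k = -(B'KB+R)^{-1} B'KA
        K = A.T @ (K - K @ B @ np.linalg.solve(S, B.T @ K)) @ A + Q
        K_list[k], L_list[k] = K, L
    return K_list, L_list

# Illustrative numbers (not from the slides)
A = np.array([[1.0, 1.0], [0.0, 1.0]]); B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.array([[1.0]])
Ks, Ls = riccati_recursion(A, B, Q, R, Q, N=20)
print("L_0 =", Ls[0])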
ASYMPTOTIC BEHAVIOR OF THE RICCATI EQUATION

• Assume a time-independent system and cost per stage, and some technical assumptions: controllability of (A, B) and observability of (A, C), where Q = C ′ C
• The Riccati equation converges limk→−∞ Kk =
K, where K is pos. definite, and is the unique
(within the class of pos. semidefinite matrices) so-
lution of the algebraic Riccati equation

K = A′ ( K − KB(B ′ KB + R)−1 B ′ K ) A + Q

• The corresponding steady-state controller µ∗ (x) =


Lx, where

L = −(B ′ KB + R)−1 B ′ KA,

is stable in the sense that the matrix (A + BL) of


the closed-loop system

xk+1 = (A + BL)xk + wk

satisfies limk→∞ (A + BL)k = 0.


GRAPHICAL PROOF FOR SCALAR SYSTEMS
[Figure: plot of F (P ) against P , together with the 45° line; F has asymptote A^2 R/B^2 + Q, and the iterates Pk , Pk+1 converge to the positive fixed point P ∗ .]

• Riccati equation (with Pk = KN−k ):

  Pk+1 = A^2 ( Pk − B^2 Pk^2 / (B^2 Pk + R) ) + Q,

  or Pk+1 = F (Pk ), where

  F (P ) = A^2 R P / (B^2 P + R) + Q.

• Note the two steady-state solutions, satisfying


P = F (P ), of which only one is positive.
RANDOM SYSTEM MATRICES

• Suppose that {A0 , B0 }, . . . , {AN −1 , BN −1 } are


not known but rather are independent random
matrices that are also independent of the wk
• DP algorithm is

  JN (xN ) = x′N QN xN ,

  Jk (xk ) = min_{uk} E_{wk ,Ak ,Bk} { x′k Qk xk + u′k Rk uk + Jk+1 (Ak xk + Bk uk + wk ) }

• Optimal policy µ∗k (xk ) = Lk xk , where

  Lk = −( Rk + E{Bk′ Kk+1 Bk } )−1 E{Bk′ Kk+1 Ak },

  and where the matrices Kk are given by

  KN = QN ,

  Kk = E{A′k Kk+1 Ak } − E{A′k Kk+1 Bk }( Rk + E{Bk′ Kk+1 Bk } )−1 E{Bk′ Kk+1 Ak } + Qk
PROPERTIES

• Certainty equivalence may not hold


• Riccati equation may not converge to a steady-
state
[Figure: plot of F̃ (P ) versus P with the 45° line.]

• We have Pk+1 = F̃ (Pk ), where

  F̃ (P ) = E{A^2}R P / ( E{B^2}P + R ) + Q + T P^2 / ( E{B^2}P + R ),

  T = E{A^2}E{B^2} − ( E{A} )^2 ( E{B} )^2
INVENTORY CONTROL

• xk : stock, uk : inventory purchased, wk : de-


mand

xk+1 = xk + uk − wk , k = 0, 1, . . . , N − 1

• Minimize

  E{ Σ_{k=0}^{N−1} ( cuk + r(xk + uk − wk ) ) }

  where, for some p > 0 and h > 0,

  r(x) = p max(0, −x) + h max(0, x)

• DP algorithm:

  JN (xN ) = 0,

  Jk (xk ) = min_{uk ≥ 0} [ cuk + H(xk + uk ) + E{ Jk+1 (xk + uk − wk ) } ],

  where H(x + u) = E{r(x + u − w)}.


OPTIMAL POLICY

• The DP algorithm can be written as

  JN (xN ) = 0,

  Jk (xk ) = min_{uk ≥ 0} Gk (xk + uk ) − cxk ,

  where

  Gk (y) = cy + H(y) + E{ Jk+1 (y − w) }.

• If Gk is convex and Gk (x) → ∞ as |x| → ∞, we have

  µ∗k (xk ) = Sk − xk  if xk < Sk ,
             0        if xk ≥ Sk ,

  where Sk minimizes Gk (y).
• This is shown, assuming that c < p, by showing that Jk is convex for all k, and that Jk (x) → ∞ as |x| → ∞
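
A minimal sketch that computes the base-stock levels Sk by discretizing the stock and minimizing Gk on a grid; the cost parameters and the demand distribution are illustrative values, not from the slides.

import numpy as np

# Illustrative data (not from the slides)
c, p, h, N = 1.0, 4.0, 1.0, 4
w_vals, w_probs = np.array([0, 1, 2, 3]), np.array([0.2, 0.4, 0.3, 0.1])
y_grid = np.arange(-10, 21)                      # stock levels after ordering

def H(y):                                        # H(y) = E{ r(y - w) }
    short = np.maximum(0, -(y - w_vals))
    hold = np.maximum(0, y - w_vals)
    return np.dot(w_probs, p * short + h * hold)

J_next = np.zeros_like(y_grid, dtype=float)      # J_N = 0 on the grid
S = []
for k in reversed(range(N)):
    EJ = np.array([np.dot(w_probs,
                          np.interp(np.clip(y - w_vals, y_grid[0], y_grid[-1]),
                                    y_grid, J_next)) for y in y_grid])
    G = c * y_grid + np.array([H(y) for y in y_grid]) + EJ
    Sk = y_grid[np.argmin(G)]                    # base-stock level S_k
    S.insert(0, Sk)
    # J_k(x) = min_{y >= x} G(y) - c x  (order up to S_k when x < S_k)
    J_next = np.array([G[y_grid >= x].min() - c * x for x in y_grid])
print("Base-stock levels S_k:", S)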
JUSTIFICATION

• Graphical inductive proof that Jk is convex.

[Figure: (top) plots of H(y) and cy + H(y), with the minimizing point SN−1 and the line −cy; (bottom) the resulting convex function JN−1 (xN−1 ), with slope −c to the left of SN−1 .]
6.231 DYNAMIC PROGRAMMING

LECTURE 6

LECTURE OUTLINE

• Stopping problems
• Scheduling problems
• Other applications
PURE STOPPING PROBLEMS

• Two possible controls:


− Stop (incur a one-time stopping cost, and
move to cost-free and absorbing stop state)
− Continue [using xk+1 = fk (xk , wk ) and in-
curring the cost-per-stage]
• Each policy consists of a partition of the set of
states xk into two regions:
− Stop region, where we stop
− Continue region, where we continue

[Figure: the state space is split into a continue region and a stop region; stopping moves the system to the absorbing stop state.]
EXAMPLE: ASSET SELLING

• A person has an asset, and at k = 0, 1, . . . , N −1


receives a random offer wk
• May accept wk and invest the money at fixed
rate of interest r, or reject wk and wait for wk+1 .
Must accept the last offer wN −1
• DP algorithm (xk : current offer, T : stop state):

  JN (xN ) = xN  if xN ≠ T,   0  if xN = T,

  Jk (xk ) = max[ (1 + r)^{N−k} xk , E{ Jk+1 (wk ) } ]  if xk ≠ T,   0  if xk = T.

• Optimal policy:

  accept the offer xk if xk > αk ,
  reject the offer xk if xk < αk ,

  where

  αk = E{ Jk+1 (wk ) } / (1 + r)^{N−k} .
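
A minimal sketch computing the thresholds αk , assuming a discrete offer distribution and illustrative r and N (not from the slides); it works with the normalized functions Vk (x) = Jk (x)/(1 + r)^{N−k} used on the next slide.

import numpy as np

# Illustrative data (not from the slides): offer distribution and interest rate
offers, probs = np.array([1.0, 2.0, 3.0, 4.0]), np.array([0.2, 0.3, 0.3, 0.2])
r, N = 0.05, 10

# V_N(x) = x and V_k(x) = max(x, E{V_{k+1}(w)}/(1+r)), so alpha_k = E{V_{k+1}(w)}/(1+r)
alpha = [None] * N
EV_next = np.dot(probs, offers)          # E{ V_N(w) } = E{ w }
for k in reversed(range(N)):
    alpha[k] = EV_next / (1 + r)         # accept offer x_k iff x_k > alpha_k
    V_k = np.maximum(offers, alpha[k])   # V_k evaluated at the possible offers
    EV_next = np.dot(probs, V_k)
print("Acceptance thresholds alpha_k:", np.round(alpha, 3))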
FURTHER ANALYSIS

[Figure: the thresholds α1 ≥ α2 ≥ · · · ≥ αN−1 plotted against k; offers above the threshold curve are accepted, offers below are rejected.]

• Can show that αk ≥ αk+1 for all k


• Proof: Let Vk (xk ) = Jk (xk )/(1 + r)^{N−k} for xk ≠ T. Then the DP algorithm is VN (xN ) = xN and

  Vk (xk ) = max[ xk , (1 + r)−1 E_w { Vk+1 (w) } ].

  We have αk = E_w { Vk+1 (w) } /(1 + r), so it is enough to show that Vk (x) ≥ Vk+1 (x) for all x and k. Start with VN−1 (x) ≥ VN (x) and use the monotonicity property of DP.
• We can also show that αk → a as k → −∞.
Suggests that for an infinite horizon the optimal
policy is stationary.
GENERAL STOPPING PROBLEMS

• At time k, we may stop at cost t(xk ) or choose a control uk ∈ U (xk ) and continue:

  JN (xN ) = t(xN ),

  Jk (xk ) = min[ t(xk ), min_{uk ∈ U (xk )} E{ g(xk , uk , wk ) + Jk+1 ( f (xk , uk , wk ) ) } ]

• Optimal to stop at time k for states x in the set

  Tk = { x | t(x) ≤ min_{u ∈ U (x)} E{ g(x, u, w) + Jk+1 ( f (x, u, w) ) } }

• Since JN−1 (x) ≤ JN (x), we have Jk (x) ≤ Jk+1 (x) for all k, so

  T0 ⊂ · · · ⊂ Tk ⊂ Tk+1 ⊂ · · · ⊂ TN−1 .
• Interesting case is when all the Tk are equal (to
TN −1 , the set where it is better to stop than to go
one step and stop). Can be shown to be true if

f (x, u, w) ∈ TN −1 , for all x ∈ TN −1 , u ∈ U (x), w.


SCHEDULING PROBLEMS

• Set of tasks to perform, the ordering is subject


to optimal choice.
• Costs depend on the order
• There may be stochastic uncertainty, and prece-
dence and resource availability constraints
• Some of the hardest combinatorial problems
are of this type (e.g., traveling salesman, vehicle
routing, etc.)
• Some special problems admit a simple quasi-
analytical solution method
− Optimal policy has an “index form”, i.e.,
each task has an easily calculable “index”,
and it is optimal to select the task that has
the maximum value of index (multi-armed
bandit problems - to be discussed later)
− Some problems can be solved by an “inter-
change argument”(start with some schedule,
interchange two adjacent tasks, and see what
happens)
EXAMPLE: THE QUIZ PROBLEM

• Given a list of N questions. If question i is an-


swered correctly (given probability pi ), we receive
reward Ri ; if not the quiz terminates. Choose or-
der of questions to maximize expected reward.
• Let i and j be the kth and (k + 1)st questions
in an optimally ordered list

  L = (i0 , . . . , ik−1 , i, j, ik+2 , . . . , iN−1 )

  E{reward of L} = E{ reward of {i0 , . . . , ik−1 } }
      + pi0 · · · pik−1 ( pi Ri + pi pj Rj )
      + pi0 · · · pik−1 pi pj E{ reward of {ik+2 , . . . , iN−1 } }

Consider the list with i and j interchanged:

  L′ = (i0 , . . . , ik−1 , j, i, ik+2 , . . . , iN−1 )

Since L is optimal, E{reward of L} ≥ E{reward of L′ }, so it follows that

  pi Ri + pi pj Rj ≥ pj Rj + pj pi Ri

or

  pi Ri /(1 − pi ) ≥ pj Rj /(1 − pj ).
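
A minimal sketch of the resulting index rule, with illustrative success probabilities and rewards (not from the slides): sort the questions by pi Ri /(1 − pi ) and evaluate the expected reward of the order.

# Illustrative data (not from the slides): success probs p_i and rewards R_i
p = [0.9, 0.5, 0.7, 0.3]
R = [1.0, 4.0, 2.0, 6.0]

# Index rule: answer questions in decreasing order of p_i R_i / (1 - p_i)
order = sorted(range(len(p)), key=lambda i: p[i] * R[i] / (1 - p[i]), reverse=True)

def expected_reward(order):
    total, prob_alive = 0.0, 1.0
    for i in order:
        total += prob_alive * p[i] * R[i]   # reward collected if still in the quiz
        prob_alive *= p[i]                  # quiz continues only on a correct answer
    return total

print("Optimal order:", order, " expected reward:", round(expected_reward(order), 3))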
MINIMAX CONTROL

• Consider the basic problem, with the difference that the disturbance wk , instead of being random, is just known to belong to a given set Wk (xk , uk ).
• Find a policy π that minimizes the cost

  Jπ (x0 ) = max_{wk ∈ Wk (xk ,µk (xk )), k=0,1,...,N−1} [ gN (xN ) + Σ_{k=0}^{N−1} gk (xk , µk (xk ), wk ) ]

• The DP algorithm takes the form

  JN (xN ) = gN (xN ),

  Jk (xk ) = min_{uk ∈ U (xk )} max_{wk ∈ Wk (xk ,uk )} [ gk (xk , uk , wk ) + Jk+1 ( fk (xk , uk , wk ) ) ]
(Exercise 1.5 in the text, solution posted on the
www).
UNKNOWN-BUT-BOUNDED CONTROL

• For each k, keep the state xk of the controlled system

  xk+1 = fk ( xk , µk (xk ), wk )

  inside a given set Xk , the target set at time k.
• This is a minimax control problem, where the cost at stage k is

  gk (xk ) = 0 if xk ∈ Xk ,  1 if xk ∉ Xk .

• We must reach at time k the set

  X̄k = { xk | Jk (xk ) = 0 }

  in order to be able to maintain the state within the subsequent target sets.
• Start with X̄N = XN , and for k = 0, 1, . . . , N − 1,

  X̄k = { xk ∈ Xk | there exists uk ∈ Uk (xk ) such that fk (xk , uk , wk ) ∈ X̄k+1 for all wk ∈ Wk (xk , uk ) }
6.231 DYNAMIC PROGRAMMING

LECTURE 7

LECTURE OUTLINE

• Problems with imperfect state info


• Reduction to the perfect state info case
• Linear quadratic problems
• Separation of estimation and control
BASIC PROBLEM WITH IMPERFECT STATE INFO

• Same as basic problem of Chapter 1 with one


difference: the controller, instead of knowing xk ,
receives at each time k an observation of the form

z0 = h0 (x0 , v0 ), zk = hk (xk , uk−1 , vk ), k ≥ 1

• The observation zk belongs to some space Zk .


• The random observation disturbance vk is char-
acterized by a probability distribution

Pvk (· | xk , . . . , x0 , uk−1 , . . . , u0 , wk−1 , . . . , w0 , vk−1 , . . . , v0 )

• The initial state x0 is also random and charac-


terized by a probability distribution Px0 .
• The probability distribution Pwk (· | xk , uk ) of
wk is given, and it may depend explicitly on xk
and uk but not on w0 , . . . , wk−1 , v0 , . . . , vk−1 .
• The control uk is constrained to a given subset
Uk (this subset does not depend on xk , which is
not assumed known).
INFORMATION VECTOR AND POLICIES

• Denote by Ik the information vector , i.e., the


information available at time k:

Ik = (z0 , z1 , . . . , zk , u0 , u1 , . . . , uk−1 ), k ≥ 1,
I0 = z0
• We consider policies π = {µ0 , µ1 , . . . , µN −1 },
where each function µk maps the information vec-
tor Ik into a control uk and

µk (Ik ) ∈ Uk , for all Ik , k ≥ 0

• We want to find a policy π that minimizes

  Jπ = E_{x0 ,wk ,vk} { gN (xN ) + Σ_{k=0}^{N−1} gk (xk , µk (Ik ), wk ) }

  subject to the equations

  xk+1 = fk ( xk , µk (Ik ), wk ), k ≥ 0,

  z0 = h0 (x0 , v0 ),  zk = hk ( xk , µk−1 (Ik−1 ), vk ), k ≥ 1
REFORMULATION AS PERFECT INFO PROBLEM

• We have
Ik+1 = (Ik , zk+1 , uk ), k = 0, 1, . . . , N −2, I0 = z0

View this as a dynamic system with state Ik , con-


trol uk , and random disturbance zk+1
• We have

P (zk+1 | Ik , uk ) = P (zk+1 | Ik , uk , z0 , z1 , . . . , zk ),

since z0 , z1 , . . . , zk are part of the information vec-


tor Ik . Thus the probability distribution of zk+1
depends explicitly only on the state Ik and control
uk and not on the prior “disturbances” zk , . . . , z0
• Write

  E{ gk (xk , uk , wk ) } = E{ E_{xk ,wk} { gk (xk , uk , wk ) | Ik , uk } }

  so the cost per stage of the new system is

  g̃k (Ik , uk ) = E_{xk ,wk} { gk (xk , uk , wk ) | Ik , uk }
DP ALGORITHM

• Writing the DP algorithm for the (reformulated) perfect state info problem and doing the algebra:

  Jk (Ik ) = min_{uk ∈ Uk} E_{xk , wk , zk+1} { gk (xk , uk , wk ) + Jk+1 (Ik , zk+1 , uk ) | Ik , uk }

  for k = 0, 1, . . . , N − 2, and for k = N − 1,

  JN−1 (IN−1 ) = min_{uN−1 ∈ UN−1} E_{xN−1 , wN−1} { gN ( fN−1 (xN−1 , uN−1 , wN−1 ) ) + gN−1 (xN−1 , uN−1 , wN−1 ) | IN−1 , uN−1 }

• The optimal cost J ∗ is given by

  J ∗ = E_{z0} { J0 (z0 ) }
LINEAR-QUADRATIC PROBLEMS

• System: xk+1 = Ak xk + Bk uk + wk
• Quadratic cost
  E_{wk , k=0,...,N−1} { x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }

  where Qk ≥ 0 and Rk > 0


• Observations

zk = Ck xk + vk , k = 0, 1, . . . , N − 1

• w0 , . . . , wN −1 , v0 , . . . , vN −1 indep. zero mean


• Key fact to show:
− Optimal policy {µ∗0 , . . . , µ∗N −1 } is of the form:

µ∗k (Ik ) = Lk E{xk | Ik }

Lk : same as for the perfect state info case


− Estimation problem and control problem can
be solved separately
DP ALGORITHM I

• Last stage N − 1 (suppressing the index N − 1):

  JN−1 (IN−1 ) = min_{uN−1} E_{xN−1 ,wN−1} { x′N−1 QxN−1 + u′N−1 RuN−1
      + (AxN−1 + BuN−1 + wN−1 )′ Q (AxN−1 + BuN−1 + wN−1 ) | IN−1 , uN−1 }

• Since E{wN−1 | IN−1 } = E{wN−1 } = 0, the minimization involves

  min_{uN−1} [ u′N−1 (B ′ QB + R)uN−1 + 2E{xN−1 | IN−1 }′ A′ QBuN−1 ]

  The minimization yields the optimal µ∗N−1 :

  u∗N−1 = µ∗N−1 (IN−1 ) = LN−1 E{xN−1 | IN−1 }

  where

  LN−1 = −(B ′ QB + R)−1 B ′ QA
DP ALGORITHM II

• Substituting in the DP algorithm:

  JN−1 (IN−1 ) = E_{xN−1} { x′N−1 KN−1 xN−1 | IN−1 }
      + E_{xN−1} { ( xN−1 − E{xN−1 | IN−1 } )′ PN−1 ( xN−1 − E{xN−1 | IN−1 } ) | IN−1 }
      + E_{wN−1} { w′N−1 QN wN−1 },

  where the matrices KN−1 and PN−1 are given by

  PN−1 = A′N−1 QN BN−1 (RN−1 + B′N−1 QN BN−1 )−1 B′N−1 QN AN−1 ,

  KN−1 = A′N−1 QN AN−1 − PN−1 + QN−1

• Note the structure of JN−1 : in addition to the quadratic and constant terms, it involves a quadratic in the estimation error

  xN−1 − E{xN−1 | IN−1 }
DP ALGORITHM III

• DP equation for period N − 2:

  JN−2 (IN−2 ) = min_{uN−2} E_{xN−2 ,wN−2 ,zN−1} { x′N−2 QxN−2 + u′N−2 RuN−2 + JN−1 (IN−1 ) | IN−2 , uN−2 }

      = E{ x′N−2 QxN−2 | IN−2 }
      + min_{uN−2} [ u′N−2 RuN−2 + E{ x′N−1 KN−1 xN−1 | IN−2 , uN−2 } ]
      + E{ ( xN−1 − E{xN−1 | IN−1 } )′ PN−1 ( xN−1 − E{xN−1 | IN−1 } ) | IN−2 , uN−2 }
      + E_{wN−1} { w′N−1 QN wN−1 }

• Key point: We have excluded the next-to-last term from the minimization with respect to uN−2
• This term turns out to be independent of uN−2
QUALITY OF ESTIMATION LEMMA

• For every k, there is a function Mk such that


we have

xk −E{xk | Ik } = Mk (x0 , w0 , . . . , wk−1 , v0 , . . . , vk ),

independently of the policy being used


• Using the lemma,

xN −1 − E{xN −1 | IN −1 } = ξN −1 ,

where

ξN −1 : function of x0 , w0 , . . . , wN −2 , v0 , . . . , vN −1

• Since ξN−1 is independent of uN−2 , the conditional expectation of ξ′N−1 PN−1 ξN−1 satisfies

  E{ξ′N−1 PN−1 ξN−1 | IN−2 , uN−2 } = E{ξ′N−1 PN−1 ξN−1 | IN−2 }

  and is independent of uN−2 .
• So minimization in the DP algorithm yields

u∗N −2 = µ∗N −2 (IN −2 ) = LN −2 E{xN −2 | IN −2 }


FINAL RESULT

• Continuing similarly (using also the quality of


estimation lemma)

µ∗k (Ik ) = Lk E{xk | Ik },

where Lk is the same as for perfect state info:

Lk = −(Rk + Bk′ Kk+1 Bk )−1 Bk′ Kk+1 Ak ,

with Kk generated from KN = QN , using

Kk = A′k Kk+1 Ak − Pk + Qk ,

Pk = A′k Kk+1 Bk (Rk + Bk′ Kk+1 Bk )−1 Bk′ Kk+1 Ak

[Figure: block diagram of the optimal controller. The system xk+1 = Ak xk + Bk uk + wk produces, through zk = Ck xk + vk , the observation zk ; an estimator computes E{xk | Ik } from zk and the delayed control uk−1 , and the actuator applies uk = Lk E{xk | Ik }.]
SEPARATION INTERPRETATION

• The optimal controller can be decomposed into


(a) An estimator , which uses the data to gener-
ate the conditional expectation E{xk | Ik }.
(b) An actuator , which multiplies E{xk | Ik } by
the gain matrix Lk and applies the control
input uk = Lk E{xk | Ik }.
• Generically, the estimate x̂ of a random vector x given some information (random vector) I that minimizes the mean squared error

  E_x { ‖x − x̂‖^2 | I } = E{ ‖x‖^2 | I } − 2E{x | I}′ x̂ + ‖x̂‖^2

  is x̂ = E{x | I} (set to zero the derivative with respect to x̂ of the above quadratic form).
• The estimator portion of the optimal controller
is optimal for the problem of estimating the state
xk assuming the control is not subject to choice.
• The actuator portion is optimal for the control
problem assuming perfect state information.
STEADY STATE/IMPLEMENTATION ASPECTS

• As N → ∞, the solution of the Riccati equation


converges to a steady state and Lk → L.
• If x0 , wk , and vk are Gaussian, E{xk | Ik } is
a linear function of Ik and is generated by a nice
recursive algorithm, the Kalman filter.
• The Kalman filter involves also a Riccati equa-
tion, so for N → ∞, and a stationary system, it
also has a steady-state structure.
• Thus, for Gaussian uncertainty, the solution is
nice and possesses a steady state.
• For non-Gaussian uncertainty, computing E{xk | Ik } may be very difficult, so a suboptimal solution is typically used.
• Most common suboptimal controller: Replace
E{xk | Ik } by the estimate produced by the Kalman
filter (act as if x0 , wk , and vk are Gaussian).
• It can be shown that this controller is optimal
within the class of controllers that are linear func-
tions of Ik .
6.231 DYNAMIC PROGRAMMING

LECTURE 8

LECTURE OUTLINE

• DP for imperfect state info


• Sufficient statistics
• Conditional state distribution as a sufficient
statistic
• Finite-state systems
• Examples
REVIEW: IMPERFECT STATE INFO PROBLEM

• Instead of knowing xk , we receive observations

  z0 = h0 (x0 , v0 ),  zk = hk (xk , uk−1 , vk ), k ≥ 1

• Ik : information vector available at time k:

I0 = z0 , Ik = (z0 , z1 , . . . , zk , u0 , u1 , . . . , uk−1 ), k ≥ 1

• Optimization over policies π = {µ0 , µ1 , . . . , µN −1 },


where µk (Ik ) ∈ Uk , for all Ik and k.
• Find a policy π that minimizes
  Jπ = E_{x0 ,wk ,vk} { gN (xN ) + Σ_{k=0}^{N−1} gk (xk , µk (Ik ), wk ) }

  subject to the equations

  xk+1 = fk ( xk , µk (Ik ), wk ), k ≥ 0,

  z0 = h0 (x0 , v0 ),  zk = hk ( xk , µk−1 (Ik−1 ), vk ), k ≥ 1
DP ALGORITHM

• DP algorithm:

  Jk (Ik ) = min_{uk ∈ Uk} E_{xk , wk , zk+1} { gk (xk , uk , wk ) + Jk+1 (Ik , zk+1 , uk ) | Ik , uk }

  for k = 0, 1, . . . , N − 2, and for k = N − 1,

  JN−1 (IN−1 ) = min_{uN−1 ∈ UN−1} E_{xN−1 , wN−1} { gN ( fN−1 (xN−1 , uN−1 , wN−1 ) ) + gN−1 (xN−1 , uN−1 , wN−1 ) | IN−1 , uN−1 }

• The optimal cost J ∗ is given by

  J ∗ = E_{z0} { J0 (z0 ) }.
SUFFICIENT STATISTICS

• Suppose that we can find a function Sk (Ik ) such


that the right-hand side of the DP algorithm can
be written in terms of some function Hk as

  min_{uk ∈ Uk} Hk ( Sk (Ik ), uk ).

• Such a function Sk is called a sufficient statistic.
• An optimal policy obtained by the preceding minimization can be written as

  µ∗k (Ik ) = µ̄k ( Sk (Ik ) ),

  where µ̄k is an appropriate function.


• Example of a sufficient statistic: Sk (Ik ) = Ik
• Another important sufficient statistic

Sk (Ik ) = Pxk |Ik


DP ALGORITHM IN TERMS OF Pxk |Ik

• It turns out that Pxk |Ik is generated recursively by a dynamic system (estimator) of the form

  Pxk+1 |Ik+1 = Φk ( Pxk |Ik , uk , zk+1 )

  for a suitable function Φk
• The DP algorithm can be written as

  J̄k (Pxk |Ik ) = min_{uk ∈ Uk} E_{xk ,wk ,zk+1} { gk (xk , uk , wk ) + J̄k+1 ( Φk (Pxk |Ik , uk , zk+1 ) ) | Ik , uk }

[Figure: block diagram. The system xk+1 = fk (xk , uk , wk ) and measurement zk = hk (xk , uk−1 , vk ) feed an estimator, which produces Pxk |Ik from zk and the delayed control uk−1 ; the actuator maps Pxk |Ik to the control uk .]
EXAMPLE: A SEARCH PROBLEM

• At each period, decide to search or not search


a site that may contain a treasure.
• If we search and a treasure is present, we find
it with prob. β and remove it from the site.
• Treasure’s worth: V . Cost of search: C
• States: treasure present & treasure not present
• Each search can be viewed as an observation of
the state
• Denote

pk : prob. of treasure present at the start of time k

with p0 given.
• pk evolves at time k according to the equation

  pk+1 = pk  if not search,
         0   if search and find treasure,
         pk (1−β) / ( pk (1−β) + 1 − pk )  if search and no treasure.
SEARCH PROBLEM (CONTINUED)

• DP algorithm:

  J̄k (pk ) = max[ 0, −C + pk βV + (1 − pk β) J̄k+1 ( pk (1 − β) / ( pk (1 − β) + 1 − pk ) ) ],

  with J̄N (pN ) = 0.
• It can be shown by induction that the functions J̄k satisfy

  J̄k (pk ) = 0, for all pk ≤ C / (βV )

• Furthermore, it is optimal to search at period k if and only if

  pk βV ≥ C

  (expected reward from the next search ≥ the cost of the search)
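
A minimal sketch of this belief-space DP, assuming illustrative values of β, V , C, N (not from the slides) and a grid over the belief pk .

import numpy as np

# Illustrative parameters (not from the slides)
beta, V, C, N = 0.6, 10.0, 1.0, 5
p_grid = np.linspace(0.0, 1.0, 101)            # grid of beliefs p_k

J = np.zeros_like(p_grid)                      # J_N(p) = 0
for k in reversed(range(N)):
    # belief after an unsuccessful search
    p_next = p_grid * (1 - beta) / (p_grid * (1 - beta) + 1 - p_grid + 1e-12)
    J_next = np.interp(p_next, p_grid, J)
    search_value = -C + p_grid * beta * V + (1 - p_grid * beta) * J_next
    J = np.maximum(0.0, search_value)          # max(not search, search)

# The resulting threshold rule: search iff p_k * beta * V >= C
print("Search iff p >=", C / (beta * V))
print("J_0 at p = 0.3:", round(float(np.interp(0.3, p_grid, J)), 3))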
FINITE-STATE SYSTEMS - POMDP

• Suppose the system is a finite-state Markov


chain, with states 1, . . . , n.
• Then the conditional probability distribution
Pxk |Ik is a vector

P (xk = 1 | Ik ), . . . , P (xk = n | Ik )

• The DP algorithm can be executed over the n-


dimensional simplex (state space is not expanding
with increasing k)
• When the control and observation spaces are
also finite sets the problem is called a POMDP
(Partially Observed Markov Decision Problem).
• For POMDP it turns out that the cost-to-go
functions J k in the DP algorithm are piecewise
linear and concave (Exercise 5.7).
• This is conceptually important. It is also useful
in practice because it forms the basis for approxi-
mations.
INSTRUCTION EXAMPLE

• Teaching a student some item. Possible states are L: item learned, or L̄: item not learned.
• Possible decisions: T : terminate the instruction, or T̄ : continue the instruction for one period and then conduct a test that indicates whether the student has learned the item.
• The test has two possible outcomes: R: student gives a correct answer, or R̄: student gives an incorrect answer.
• Probabilistic structure

[Figure: transition and observation probabilities. If the item is learned (L), it stays learned w.p. 1 and the test gives R w.p. 1; if not learned (L̄), it becomes learned w.p. t and stays unlearned w.p. 1 − t, and the test then gives R w.p. r and R̄ w.p. 1 − r.]

• Cost of instruction is I per period


• Cost of terminating instruction: 0 if the student has learned the item, and C > 0 if not.
INSTRUCTION EXAMPLE II

• Let pk : prob. the student has learned the item given the test results so far

  pk = P (xk | Ik ) = P (xk = L | z0 , z1 , . . . , zk ).

• Using Bayes’ rule we can obtain

  pk+1 = Φ(pk , zk+1 )
       = ( 1 − (1−t)(1−pk ) ) / ( 1 − (1−t)(1−r)(1−pk ) )  if zk+1 = R,
         0  if zk+1 = R̄.

• DP algorithm:

  J̄k (pk ) = min[ (1 − pk )C, I + E_{zk+1} { J̄k+1 ( Φ(pk , zk+1 ) ) } ],

  starting with

  J̄N−1 (pN−1 ) = min[ (1−pN−1 )C, I + (1−t)(1−pN−1 )C ].
INSTRUCTION EXAMPLE III

• Write the DP algorithm as

  J̄k (pk ) = min[ (1 − pk )C, I + Ak (pk ) ],

  where

  Ak (pk ) = P (zk+1 = R | Ik ) J̄k+1 ( Φ(pk , R) ) + P (zk+1 = R̄ | Ik ) J̄k+1 ( Φ(pk , R̄) )

• Can show by induction that the Ak (p) are piecewise linear, concave, monotonically decreasing, with

  Ak−1 (p) ≤ Ak (p) ≤ Ak+1 (p), for all p ∈ [0, 1].

[Figure: the termination cost (1 − p)C and the curves I + AN−1 (p), I + AN−2 (p), I + AN−3 (p) plotted against p ∈ [0, 1], with the thresholds αN−1 , αN−2 , αN−3 and the point 1 − I/C marked on the p axis.]
6.231 DYNAMIC PROGRAMMING

LECTURE 9

LECTURE OUTLINE

• Suboptimal control
• Cost approximation methods: Classification
• Certainty equivalent control: An example
• Limited lookahead policies
• Performance bounds
• Problem approximation approach
• Parametric cost-to-go approximation
PRACTICAL DIFFICULTIES OF DP

• The curse of modeling


• The curse of dimensionality
− Exponential growth of the computational and
storage requirements as the number of state
variables and control variables increases
− Quick explosion of the number of states in
combinatorial problems
− Intractability of imperfect state information
problems
• There may be real-time solution constraints
− A family of problems may be addressed. The
data of the problem to be solved is given with
little advance notice
− The problem data may change as the system
is controlled – need for on-line replanning
COST-TO-GO FUNCTION APPROXIMATION

• Use a policy computed from the DP equation where the optimal cost-to-go function Jk+1 is replaced by an approximation J̃k+1 . (Sometimes E{gk } is also replaced by an approximation.)
• Apply µ̄k (xk ), which attains the minimum in

  min_{uk ∈ Uk (xk )} E{ gk (xk , uk , wk ) + J̃k+1 ( fk (xk , uk , wk ) ) }

• There are several ways to compute J˜k+1 :


− Off-line approximation: The entire func-
tion J˜k+1 is computed for every k, before the
control process begins.
− On-line approximation: Only the values
J˜k+1 (xk+1 ) at the relevant next states xk+1
are computed and used to compute uk just
after the current state xk becomes known.
− Simulation-based methods: These are off-
line and on-line methods that share the com-
mon characteristic that they are based on
Monte-Carlo simulation. Some of these meth-
ods are suitable for problems of very large
size.
CERTAINTY EQUIVALENT CONTROL (CEC)

• Idea: Replace the stochastic problem with a


deterministic problem
• At each time k, the uncertain quantities are
fixed at some “typical” values
• On-line implementation for a perfect state
info problem. At each time k:
(1) Fix the wi , i ≥ k, at some w̄i . Solve the deterministic problem:

  minimize gN (xN ) + Σ_{i=k}^{N−1} gi (xi , ui , w̄i )

  where xk is known, and

  ui ∈ Ui ,  xi+1 = fi (xi , ui , w̄i ).

(2) Use as control the first element in the optimal control sequence found.
• So we apply µ̄k (xk ) that minimizes

  gk (xk , uk , w̄k ) + J̃k+1 ( fk (xk , uk , w̄k ) )

  where J̃k+1 is the optimal cost of the corresponding deterministic problem.
ALTERNATIVE OFF-LINE IMPLEMENTATION
• Let { µ^d_0 (x0 ), . . . , µ^d_{N−1} (xN−1 ) } be an optimal controller obtained from the DP algorithm for the deterministic problem

  minimize gN (xN ) + Σ_{k=0}^{N−1} gk ( xk , µk (xk ), w̄k )

  subject to xk+1 = fk ( xk , µk (xk ), w̄k ),  µk (xk ) ∈ Uk

• The CEC applies at time k the control input µ^d_k (xk ).
• In an imperfect info version, xk is replaced by an estimate x̄k (Ik ).

[Figure: block diagram of the CEC with imperfect information. The measurement zk feeds an estimator producing x̄k (Ik ); the actuator applies uk = µ^d_k ( x̄k (Ik ) ) to the system xk+1 = fk (xk , uk , wk ).]
PARTIALLY STOCHASTIC CEC

• Instead of fixing all future disturbances to their


typical values, fix only some, and treat the rest as
stochastic.
• Important special case: Treat an imperfect
state information problem as one of perfect state
information, using an estimate xk (Ik ) of xk as if
it were exact.
• Multiaccess communication example: Con-
sider controlling the slotted Aloha system (dis-
cussed in Ch. 5) by optimally choosing the prob-
ability of transmission of waiting packets. This
is a hard problem of imperfect state info, whose
perfect state info version is easy.
• Natural partially stochastic CEC:

  µ̃k (Ik ) = min[ 1, 1 / x̄k (Ik ) ],

  where x̄k (Ik ) is an estimate of the current packet backlog based on the entire past channel history of successes, idles, and collisions (which is Ik ).
GENERAL COST-TO-GO APPROXIMATION

• One-step lookahead (1SL) policy: At each k and state xk , use the control µ̄k (xk ) that attains

  min_{uk ∈ Uk (xk )} E{ gk (xk , uk , wk ) + J̃k+1 ( fk (xk , uk , wk ) ) },

  where
  − J̃N = gN .
  − J̃k+1 : approximation to the true cost-to-go Jk+1
• Two-step lookahead policy: At each k and
xk , use the control µ̃k (xk ) attaining the minimum
above, where the function J˜k+1 is obtained using
a 1SL approximation (solve a 2-step DP problem).
• If J˜k+1 is readily available and the minimiza-
tion above is not too hard, the 1SL policy is im-
plementable on-line.
• Sometimes one also replaces Uk (xk ) above with
a subset of “most promising controls” U k (xk ).
• As the length of lookahead increases, the re-
quired computation quickly explodes.
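
A minimal sketch of the 1SL control selection, assuming the user supplies a transition model and a cost-to-go approximation J_tilde; the names controls, transitions, J_tilde are placeholders, not from the slides.

def one_step_lookahead(k, x, controls, transitions, J_tilde):
    """Pick u_k minimizing E{ g_k(x,u,w) + J_tilde_{k+1}(f_k(x,u,w)) }.

    controls(k, x): admissible controls U_k(x);
    transitions(k, x, u): list of (prob, next_state, stage_cost);
    J_tilde(k1, x1): cost-to-go approximation at time k1.
    """
    best_u, best_q = None, float("inf")
    for u in controls(k, x):
        q = sum(p * (g + J_tilde(k + 1, x2)) for p, x2, g in transitions(k, x, u))
        if q < best_q:
            best_u, best_q = u, q
    return best_u, best_q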
PERFORMANCE BOUNDS FOR 1SL

• Let J̄k (xk ) be the cost-to-go from (xk , k) of the 1SL policy, based on functions J̃k .
• Assume that for all (xk , k), we have

  Ĵk (xk ) ≤ J̃k (xk ),   (*)

  where ĴN = gN and for all k,

  Ĵk (xk ) = min_{uk ∈ Uk (xk )} E{ gk (xk , uk , wk ) + J̃k+1 ( fk (xk , uk , wk ) ) },

  [so Ĵk (xk ) is computed along with µ̄k (xk )]. Then

  J̄k (xk ) ≤ Ĵk (xk ), for all (xk , k).

• Important application: When J̃k is the cost-to-go of some heuristic policy (then the 1SL policy is called the rollout policy).
• The bound can be extended to the case where there is a δk in the RHS of (*). Then

  J̄k (xk ) ≤ J̃k (xk ) + δk + · · · + δN−1
COMPUTATIONAL ASPECTS

• Sometimes nonlinear programming can be used


to calculate the 1SL or the multistep version [par-
ticularly when Uk (xk ) is not a discrete set]. Con-
nection with stochastic programming methods.
• The choice of the approximating functions J˜k
is critical, and is calculated in a variety of ways.
• Some approaches:
(a) Problem Approximation: Approximate the
optimal cost-to-go with some cost derived
from a related but simpler problem
(b) Parametric Cost-to-Go Approximation: Ap-
proximate the optimal cost-to-go with a func-
tion of a suitable parametric form, whose pa-
rameters are tuned by some heuristic or sys-
tematic scheme (Neuro-Dynamic Program-
ming)
(c) Rollout Approach: Approximate the optimal
cost-to-go with the cost of some suboptimal
policy, which is calculated either analytically
or by simulation
PROBLEM APPROXIMATION

• Many (problem-dependent) possibilities


− Replace uncertain quantities by nominal val-
ues, or simplify the calculation of expected
values by limited simulation
− Simplify difficult constraints or dynamics
• Example of enforced decomposition: Route m
vehicles that move over a graph. Each node has a
“value.” The first vehicle that passes through the
node collects its value. Max the total collected
value, subject to initial and final time constraints
(plus time windows and other constraints).
• Usually the 1-vehicle version of the problem is
much simpler. This motivates an approximation
obtained by solving single vehicle problems.
• 1SL scheme: At time k and state xk (position
of vehicles and “collected value nodes”), consider
all possible kth moves by the vehicles, and at the
resulting states we approximate the optimal value-
to-go with the value collected by optimizing the
vehicle routes one-at-a-time
PARAMETRIC COST-TO-GO APPROXIMATION

• Use a cost-to-go approximation from a parametric class J̃(x, r), where x is the current state and r = (r1 , . . . , rm ) is a vector of “tunable” scalars (weights).
• By adjusting the weights, one can change the
“shape” of the approximation J˜ so that it is rea-
sonably close to the true optimal cost-to-go func-
tion.
• Two key issues:
− The choice of the parametric class J̃(x, r) (the approximation architecture).
− Method for tuning the weights (“training”
the architecture).
• Successful application strongly depends on how
these issues are handled, and on insight about the
problem.
• Sometimes a simulator is used, particularly
when there is no mathematical model of the sys-
tem.
APPROXIMATION ARCHITECTURES

• Divided into linear and nonlinear [i.e., linear or nonlinear dependence of J̃(x, r) on r].
• Linear architectures are easier to train, but non-
linear ones (e.g., neural networks) are richer.
• Architectures based on feature extraction
[Figure: a feature extraction mapping turns the state x into a feature vector y, which a cost approximator with parameter vector r maps to the cost approximation Ĵ(y, r).]

• Ideally, the features will encode much of the


nonlinearity that is inherent in the cost-to-go ap-
proximated, and the approximation may be quite
accurate without a complicated architecture.
• Sometimes the state space is partitioned, and
“local” features are introduced for each subset of
the partition (they are 0 outside the subset).
• With a well-chosen feature vector y(x), we can
use a linear architecture
 X
˜ r) = Jˆ y(x), r =
J(x, ri yi (x)
i
AN EXAMPLE - COMPUTER CHESS

• Programs use a feature-based position evaluator


that assigns a score to each move/position

[Figure: a position evaluator. Feature extraction produces features such as material balance, mobility, and safety; a weighting of the features produces the score.]

• Many context-dependent special features.


• Most often the weighting of features is linear
but multistep lookahead is involved.
• Most often the training is done by trial and
error.
6.231 DYNAMIC PROGRAMMING

LECTURE 10

LECTURE OUTLINE

• Rollout algorithms
• Cost improvement property
• Discrete deterministic problems
• Sequential consistency and greedy algorithms
• Sequential improvement
ROLLOUT ALGORITHMS

• One-step lookahead policy: At each k and state xk , use the control µ̄k (xk ) that attains

  min_{uk ∈ Uk (xk )} E{ gk (xk , uk , wk ) + J̃k+1 ( fk (xk , uk , wk ) ) },

  where
  − J̃N = gN .
  − J̃k+1 : approximation to the true cost-to-go Jk+1
• Rollout algorithm: When J̃k is the cost-to-go of some heuristic policy (called the base policy)
• Cost improvement property (to be shown): The
rollout algorithm achieves no worse (and usually
much better) cost than the base heuristic starting
from the same state.
• Main difficulty: Calculating J˜k (xk ) may be
computationally intensive if the cost-to-go of the
base policy cannot be analytically calculated.
− May involve Monte Carlo simulation if the
problem is stochastic.
− Things improve in the deterministic case.
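
A minimal sketch of one rollout decision for a deterministic problem, assuming a base_heuristic that completes a solution from any state and returns its cost; step, controls, base_heuristic are placeholder names, not from the slides.

def rollout_step(k, x, controls, step, base_heuristic):
    """One rollout decision for a deterministic problem.

    step(k, x, u) -> (next_state, stage_cost);
    base_heuristic(k1, x1) -> cost of completing the problem from x1 at time k1
    with the base policy, i.e. J~_{k+1} = cost-to-go of the base policy.
    """
    best_u, best_q = None, float("inf")
    for u in controls(k, x):
        x_next, g = step(k, x, u)
        q = g + base_heuristic(k + 1, x_next)   # stage cost + heuristic completion
        if q < best_q:
            best_u, best_q = u, q
    return best_u, best_q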
EXAMPLE: THE QUIZ PROBLEM

• A person is given N questions; answering cor-


rectly question i has probability pi , reward vi .
Quiz terminates at the first incorrect answer.
• Problem: Choose the ordering of questions so
as to maximize the total expected reward.
• Assuming no other constraints, it is optimal to
use the index policy: Answer questions in decreas-
ing order of pi vi /(1 − pi ).
• With minor changes in the problem, the index
policy need not be optimal. Examples:
− A limit (< N ) on the maximum number of
questions that can be answered.
− Time windows, sequence-dependent rewards,
precedence constraints.
• Rollout with the index policy as base policy:
Convenient because at a given state (subset of
questions already answered), the index policy and
its expected reward can be easily calculated.
• Very effective for solving the quiz problem and
important generalizations in scheduling (see Bert-
sekas and Castanon, J. of Heuristics, Vol. 5, 1999).
COST IMPROVEMENT PROPERTY

• Let

  J̄k (xk ): cost-to-go of the rollout policy µ̄k
  Hk (xk ): cost-to-go of the base policy µk

• We claim that J̄k (xk ) ≤ Hk (xk ) for all xk , k
• Proof by induction: We have J̄N (xN ) = HN (xN ) for all xN . Assume that

  J̄k+1 (xk+1 ) ≤ Hk+1 (xk+1 ), ∀ xk+1 .

  Then, for all xk ,

  J̄k (xk ) = E{ gk ( xk , µ̄k (xk ), wk ) + J̄k+1 ( fk ( xk , µ̄k (xk ), wk ) ) }
          ≤ E{ gk ( xk , µ̄k (xk ), wk ) + Hk+1 ( fk ( xk , µ̄k (xk ), wk ) ) }
          ≤ E{ gk ( xk , µk (xk ), wk ) + Hk+1 ( fk ( xk , µk (xk ), wk ) ) }
          = Hk (xk )

  − Induction hypothesis ==> 1st inequality
  − Min selection of µ̄k (xk ) ==> 2nd inequality
  − Definition of Hk , µk ==> last equality
EXAMPLE: THE BREAKTHROUGH PROBLEM

[Figure: binary tree with N stages rooted at the top; a free path from the root to the leaves is shown with thick lines]

• Given a binary tree with N stages.


• Each arc is either free or is blocked (crossed out
in the figure).
• Problem: Find a free path from the root to the
leaves (such as the one shown with thick lines).
• Base heuristic (greedy): Follow the right branch
if free; else follow the left branch if free.
• For large N and given prob. of free branch:
the rollout algorithm requires O(N ) times more
computation, but has O(N ) times larger prob. of
finding a free path than the greedy algorithm.
DISCRETE DETERMINISTIC PROBLEMS

• Any discrete optimization problem (with finite


number of choices/feasible solutions) can be rep-
resented as a sequential decision process by using
a tree.
• The leaves of the tree correspond to the feasible
solutions.
• The problem can be solved by DP, starting from
the leaves and going back towards the root.
• Example: Traveling salesman problem. Find a
minimum cost tour that goes exactly once through
each of N cities.
[Figure: decision tree for a traveling salesman problem with four cities A, B, C, D.
Root: A (origin node s); next level: AB, AC, AD; then ABC, ABD, ACB, ACD, ADB, ADC;
leaves (complete tours): ABCD, ABDC, ACBD, ACDB, ADBC, ADCB]


A CLASS OF GENERAL DISCRETE PROBLEMS

• Generic problem:
− Given a graph with directed arcs
− A special node s called the origin
− A set of terminal nodes, called destinations,
and a cost g(i) for each destination i.
− Find min cost path starting at the origin,
ending at one of the destination nodes.
• Base heuristic: For any nondestination node i, it
constructs a path (i, i1, . . . , im, ī) starting at i and
ending at one of the destination nodes ī. We call
ī the projection of i, and we denote H(i) = g(ī).
• Rollout algorithm: Start at the origin; choose
the successor node with least cost projection
[Figure: rollout path s, i1, . . . , im−1, im; the neighbors j1, j2, j3, j4 of im and their projections p(j1), . . . , p(j4)]
EXAMPLE: ONE-DIMENSIONAL WALK

• A person takes either a unit step to the left or


a unit step to the right. Minimize the cost g(i) of
the point i where he will end up after N steps.
[Figure: the N-step walk starting at (0, 0) and ending at a point (N, i), −N ≤ i ≤ N, together with a plot of the terminal cost g(i)]

• Base heuristic: Always go to the right. Rollout


finds the rightmost local minimum.
• Base heuristic: Compare "always go to the right"
and "always go to the left," and choose the best of
the two. Rollout finds a global minimum.
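• A small Python sketch of the walk example (illustration only): the base heuristic
is given as a list of "completion" rules, so the same routine covers the
always-go-right heuristic and the best-of-right-and-left heuristic; the cost
values below are made up.

def walk_rollout(g, N, heuristics):
    # Rollout for the N-step walk with terminal cost g(i); each heuristic maps
    # (steps_left, position) to the final position of its completion.
    pos = 0
    for k in range(N):
        steps_left = N - k - 1
        best_move, best_cost = None, float("inf")
        for move in (-1, +1):
            nxt = pos + move
            cost = min(g(h(steps_left, nxt)) for h in heuristics)
            if cost < best_cost:
                best_cost, best_move = cost, move
        pos += best_move
    return pos, g(pos)

go_right = lambda s, i: i + s
go_left = lambda s, i: i - s

g_vals = {-8: 3, -6: 2, -4: 0, -2: 4, 0: 1, 2: 6, 4: 3, 6: 2, 8: 5}
g = lambda i: g_vals[i]
print(walk_rollout(g, 8, [go_right]))            # ends at the rightmost local minimum
print(walk_rollout(g, 8, [go_right, go_left]))   # ends at a global minimum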
SEQUENTIAL CONSISTENCY

• The base heuristic is sequentially consistent if
all nodes of its path have the same projection, i.e.,
for every node i, whenever it generates the path
(i, i1, . . . , im, ī) starting at i, it also generates the
path (i1, . . . , im, ī) starting at i1.
• Prime example of a sequentially consistent heuris-
tic is a greedy algorithm. It uses an estimate F (i)
of the optimal cost starting from i.
• At the typical step, given a path (i, i1 , . . . , im ),
where im is not a destination, the algorithm adds
to the path a node im+1 such that

im+1 = arg min_{j ∈ N(im)} F(j)

• Prop.: If the base heuristic is sequentially con-


sistent, the cost of the rollout algorithm is no more
than the cost of the base heuristic. In particular,
if (s, i1 , . . . , im̄ ) is the rollout path, we have
H(s) ≥ H(i1 ) ≥ · · · ≥ H(im̄−1 ) ≥ H(im̄ )
where H(i) = cost of the heuristic starting at i.
• Proof: Rollout deviates from the greedy path
only when it discovers an improved path.
SEQUENTIAL IMPROVEMENT

• We say that the base heuristic is sequentially


improving if for every non-destination node i, we
have
H(i) ≥ min H(j)
j is neighbor of i

• If the base heuristic is sequentially improving,


the cost of the rollout algorithm is no more than
the cost of the base heuristic, starting from any
node.
• Fortified rollout algorithm:
− Simple variant of the rollout algorithm, where
we keep the best path found so far through
the application of the base heuristic.
− If the rollout path deviates from the best
path found, then follow the best path.
− Can be shown to be a rollout algorithm with
sequentially improving base heuristic for a
slightly modified variant of the original prob-
lem.
− Has the cost improvement property.
6.231 DYNAMIC PROGRAMMING

LECTURE 11

LECTURE OUTLINE

• More on rollout algorithms - Stochastic prob-


lems
• Simulation-based methods for rollout
• Approximations of rollout algorithms
• Rolling horizon approximations
• Discretization of continuous time
• Discretization of continuous space
• Other suboptimal approaches
ROLLOUT ALGORITHMS-STOCHASTIC CASE

• Rollout policy: At each k and state xk , use


the control µk(xk) that attains the minimum in

min_{uk ∈ Uk(xk)} Qk(xk, uk),

where

Qk(xk, uk) = E{ gk(xk, uk, wk) + Hk+1( fk(xk, uk, wk) ) }

and Hk+1 (xk+1 ) is the cost-to-go of the heuristic.


• Qk (xk , uk ) is called the Q-factor of (xk , uk ),
and for a stochastic problem, its computation may
involve Monte Carlo simulation.
• Potential difficulty: To minimize the Q-factor
over uk, we must form Q-factor differences
Qk(xk, u) − Qk(xk, ū). This differencing often am-
plifies the simulation error in the calculation of the
Q-factors.
• Potential remedy: Compare any two controls
u and ū by simulating the Q-factor difference
Qk(xk, u) − Qk(xk, ū) directly, rather than the two
Q-factors separately. This may effect variance re-
duction of the simulation-induced error (see the
sketch below).
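• A minimal Python sketch of the remedy (illustration only): the two controls are
compared with common random numbers, i.e., each pair of Q-factor samples uses the
same disturbance stream; simulate_cost is a problem-specific stub returning one
sampled cost (stage cost plus base-policy cost from the next state).

import numpy as np

def q_factor_difference(x, u1, u2, simulate_cost, n_sims=1000, seed=0):
    # Estimate Q(x, u1) - Q(x, u2) with common random numbers.
    diffs = []
    for s in range(n_sims):
        rng1 = np.random.default_rng(seed + s)   # identical streams for both controls
        rng2 = np.random.default_rng(seed + s)
        diffs.append(simulate_cost(x, u1, rng1) - simulate_cost(x, u2, rng2))
    diffs = np.array(diffs)
    return diffs.mean(), diffs.std(ddof=1) / np.sqrt(n_sims)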
Q-FACTOR APPROXIMATION

• Here, instead of simulating the Q-factors, we


approximate the costs-to-go Hk+1 (xk+1 ).
• Certainty equivalence approach: Given xk, fix the
future disturbances at "typical" values w̄k+1, . . . , w̄N−1
and approximate the Q-factors with

Q̃k(xk, uk) = E{ gk(xk, uk, wk) + H̃k+1( fk(xk, uk, wk) ) }

where H̃k+1( fk(xk, uk, wk) ) is the cost of the heuristic
with the future disturbances fixed at the typical values.

• This is an approximation of Hk+1 fk (xk , uk , wk )
by using a “single sample simulation.”
• Variant of the certainty equivalence
 approach:
Approximate Hk+1 fk (xk , uk , wk ) by simulation
using a small number of “representative samples”
(scenarios).
• Alternative: Calculate (exact or approximate)
values for the cost-to-go of the base policy at a
limited set of state-time pairs, and then approx-
imate Hk+1 using an approximation architecture
and a “training algorithm” or “least-squares fit.”
ROLLING HORIZON APPROACH

• This is an l-step lookahead policy where the


cost-to-go approximation is just 0.
• Alternatively, the cost-to-go approximation is
the terminal cost function gN .
• A short rolling horizon saves computation.
• “Paradox”: It is not true that a longer rolling
horizon always improves performance.
• Example: At the initial state, there are two
controls available (1 and 2). At every other state,
there is only one control.

[Figure: two trajectories from the current state, under controls 1 and 2; the pattern of high and low costs within the first l stages and beyond them is such that the l-stage lookahead can favor the wrong control]
ROLLING HORIZON WITH ROLLOUT

• We can use a rolling horizon approximation in


calculating the cost-to-go of the base heuristic.
• Because the heuristic is suboptimal, the ratio-
nale for a long rolling horizon becomes weaker.
• Example: N -stage stopping problem where the
stopping cost is 0, the continuation cost is either
−ǫ or 1, where 0 < ǫ << 1, and the first state
with continuation cost equal to 1 is state m. Then
the optimal policy is to stop at state m, and the
optimal cost is −mǫ.
[Figure: states 0, 1, 2, . . . , m, . . . , N in a line; the continuation cost is −ǫ at states before m and 1 at state m; stopping leads to the absorbing stopped state]

• Consider the heuristic that continues at every


state, and the rollout policy that is based on this
heuristic, with a rolling horizon of l ≤ m steps.
• It will continue through the first m − l + 1 stages,
thus accumulating a cost of −(m − l + 1)ǫ. The rollout
performance improves as l becomes shorter!
• Limited vision may work to our advantage!
MODEL PREDICTIVE CONTROL (MPC)

• Special case of rollout for controlling linear de-


terministic systems (extensions to nonlinear/stochastic
are similar)
− System: xk+1 = Axk + Buk
− Quadratic cost per stage: x′k Qxk + u′k Ruk
− Constraints: xk ∈ X, uk ∈ U (xk )
• Assumption: For any x0 ∈ X there is a feasible
state-control sequence that brings the system to 0
in m steps, i.e., xm = 0
• MPC at state xk solves an m-step optimal con-
trol problem with constraint xk+m = 0, i.e., finds
a sequence ūk , . . . , ūk+m−1 that minimizes
Σ_{ℓ=0}^{m−1} ( x′k+ℓ Q xk+ℓ + u′k+ℓ R uk+ℓ )

subject to xk+m = 0
• Then applies the first control ūk (and repeats
at the next state xk+1 )
• MPC is rollout with the base heuristic derived from
the corresponding (m − 1)-step optimal control problem
• Key Property of MPC: Since the heuris-
tic is stable, the rollout is also stable (by policy
improvement property).
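• A minimal Python/NumPy sketch of one MPC step (illustration only): it imposes
only the terminal constraint xk+m = 0, omits the state/control set constraints,
assumes the m-step problem is feasible from the current state, and solves the
resulting equality-constrained quadratic program via its KKT system.

import numpy as np

def mpc_control(x, A, B, Q, R, m):
    n, p = B.shape
    # Stacked predictions: xs = Phi @ x + Gam @ U, with U = [u_0; ...; u_{m-1}]
    Phi = np.vstack([np.linalg.matrix_power(A, l + 1) for l in range(m)])
    Gam = np.zeros((m * n, m * p))
    for l in range(m):
        for j in range(l + 1):
            Gam[l*n:(l+1)*n, j*p:(j+1)*p] = np.linalg.matrix_power(A, l - j) @ B
    Qbar = np.kron(np.eye(m), Q)
    Rbar = np.kron(np.eye(m), R)
    # Cost in U (dropping constants): U' H U + 2 f' U; terminal constraint: C U = d
    H = Gam.T @ Qbar @ Gam + Rbar
    f = Gam.T @ Qbar @ Phi @ x
    C = Gam[-n:, :]
    d = -Phi[-n:, :] @ x
    KKT = np.block([[H, C.T], [C, np.zeros((n, n))]])
    sol = np.linalg.solve(KKT, np.concatenate([-f, d]))
    return sol[:p]        # apply only the first control u_k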
DISCRETIZATION

• If the state space and/or control space is con-


tinuous/infinite, it must be replaced by a finite
discretization.
• Need for consistency, i.e., as the discretization
becomes finer, the cost-to-go functions of the dis-
cretized problem converge to those of the contin-
uous problem.
• Pitfall with discretizing continuous time:
The control constraint set may change a lot as we
pass to the discrete-time approximation.
• Example: Let
ẋ(t) = u(t),
with control constraint u(t) ∈ {−1, 1}. The reach-
able states after time δ are x(t + δ) = x(t) + u,
with u ∈ [−δ, δ].
• Compare it with the reachable states after we
discretize the system naively: x(t + δ) = x(t) +
δu(t), with u(t) ∈ {−1, 1}.
• “Convexification effect” of continuous time: a
discrete control constraint set in continuous-time
differential systems, is equivalent to a continuous
control constraint set when the system is looked
at discrete times.
SPACE DISCRETIZATION I

• Given a discrete-time system with state space
S, consider a finite subset S̄; for example, S̄ could
be a finite grid within a continuous state space S.

• Difficulty: f(x, u, w) ∉ S̄ for x ∈ S̄.
• We define an approximation to the original
problem, with state space S̄, as follows:
• Express each x ∈ S as a convex combination of
states in S̄, i.e.,

x = Σ_{xi ∈ S̄} γi(x) xi,  where γi(x) ≥ 0 and Σ_i γi(x) = 1

• Define a "reduced" dynamic system with state
space S̄, whereby from each xi ∈ S̄ we move to
x̄ = f(xi, u, w) according to the system equation
of the original problem, and then move to xj ∈ S̄
with probabilities γj(x̄).
• Define similarly the corresponding cost per stage
of the transitions of the reduced system.
• Note application to finite-state POMDP (Par-
tially Observed Markov Decision Problems)
SPACE DISCRETIZATION II

• Let J̄k(xi) be the optimal cost-to-go of the "re-
duced" problem from each state xi ∈ S̄ and time
k onward.
• Approximate the optimal cost-to-go of any x ∈
S for the original problem by

J̃k(x) = Σ_{xi ∈ S̄} γi(x) J̄k(xi),

and use one-step lookahead based on J̃k.
• The choice of coefficients γi(x) is in principle
arbitrary, but should aim at consistency, i.e., as
the number of states in S̄ increases, J̃k(x) should
converge to the optimal cost-to-go of the original
problem.
• Interesting observation: While the original
problem may be deterministic, the reduced prob-
lem is always stochastic.
• Generalization: The set S̄ may be any finite
set (not necessarily a subset of S) as long as the
coefficients γi(x) admit a meaningful interpretation
that quantifies the degree of association of x with
xi (a 1-D interpolation sketch is given below).
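• A 1-D Python sketch of the construction (illustration only): the weights γi(x)
come from linear interpolation between the two nearest grid points, and the
reduced transition probabilities are estimated by simulation; f and sample_w are
problem-specific stubs.

import numpy as np

def convex_weights(x, grid):
    # Weights gamma_i(x) expressing x as a convex combination of grid points.
    grid = np.asarray(grid)
    if x <= grid[0]:
        return {0: 1.0}
    if x >= grid[-1]:
        return {len(grid) - 1: 1.0}
    j = int(np.searchsorted(grid, x))            # grid[j-1] < x <= grid[j]
    t = (x - grid[j - 1]) / (grid[j] - grid[j - 1])
    return {j - 1: 1.0 - t, j: t}

def reduced_transition(i, u, grid, f, sample_w, n_sims=100):
    # Reduced problem: from grid state i under u, move to xbar = f(x_i, u, w),
    # then to grid point j with probability gamma_j(xbar).
    probs = np.zeros(len(grid))
    for _ in range(n_sims):
        xbar = f(grid[i], u, sample_w())
        for j, wgt in convex_weights(xbar, grid).items():
            probs[j] += wgt / n_sims
    return probs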
OTHER SUBOPTIMAL APPROACHES

• Minimize the DP equation error: Approxi-


mate the optimal cost-to-go functions Jk (xk ) with
J˜k (xk , rk ), where rk is a parameter vector, chosen
to minimize some form of error in the DP equa-
tions
− Can be done sequentially going backwards
in time (approximate Jk using an approxi-
mation of Jk+1 ).
• Direct approximation of control policies:
For a subset of states xi, i = 1, . . . , m, find

µ̂k(xi) = arg min_{uk ∈ Uk(xi)} E{ g(xi, uk, wk) + J̃k+1( fk(xi, uk, wk), rk+1 ) }.

Then find µ̃k(xk, sk), where sk is a vector of pa-
rameters obtained by solving the problem

min_s Σ_{i=1}^{m} ||µ̂k(xi) − µ̃k(xi, s)||².
• Approximation in policy space: Do not
bother with cost-to-go approximations. Parametrize
the policies as µ̃k (xk , sk ), and minimize the cost
function of the problem over the parameters sk .
6.231 DYNAMIC PROGRAMMING

LECTURE 12

LECTURE OUTLINE

• Infinite horizon problems


• Stochastic shortest path problems
• Bellman’s equation
• Dynamic programming – value iteration
• Discounted problems as special case of SSP
TYPES OF INFINITE HORIZON PROBLEMS

• Same as the basic problem, but:


− The number of stages is infinite.
− The system is stationary.
• Total cost problems: Minimize
Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...} { Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }

− Stochastic shortest path problems (α = 1,


finite-state system with a termination state)
− Discounted problems (α < 1, bounded cost
per stage)
− Discounted and undiscounted problems with
unbounded cost per stage
• Average cost problems
lim_{N→∞} (1/N) E_{wk, k=0,1,...} { Σ_{k=0}^{N−1} g(xk, µk(xk), wk) }

• Infinite horizon characteristics: Challenging anal-


ysis, elegance of solutions and algorithms
PREVIEW OF INFINITE HORIZON RESULTS

• Key issue: The relation between the infinite


and finite horizon optimal cost-to-go functions.
• Illustration: Let α = 1 and JN (x) denote the
optimal cost of the N -stage problem, generated
after N DP iterations, starting from J0 (x) ≡ 0
 
Jk+1(x) = min_{u ∈ U(x)} E_w{ g(x, u, w) + Jk( f(x, u, w) ) },  ∀ x

• Typical results for total cost problems:


− Convergence of DP algorithm (value itera-
tion):

J*(x) = lim_{N→∞} JN(x),  ∀ x

− Bellman's Equation holds for all x:

J*(x) = min_{u ∈ U(x)} E_w{ g(x, u, w) + J*( f(x, u, w) ) }

− If µ(x) minimizes in Bellman’s Eq., the pol-


icy {µ, µ, . . .} is optimal.
• Bellman’s Eq. holds for all types of prob-
lems. The other results are true for SSP (and
bounded/discounted; unusual exceptions for other
problems).
STOCHASTIC SHORTEST PATH PROBLEMS

• Assume finite-state system: States 1, . . . , n and


special cost-free termination state t
− Transition probabilities pij (u)
− Control constraints u ∈ U (i)
− Cost of policy π = {µ0 , µ1 , . . .} is
Jπ(i) = lim_{N→∞} E{ Σ_{k=0}^{N−1} g(xk, µk(xk)) | x0 = i }

− Optimal policy if Jπ (i) = J ∗ (i) for all i.


− Special notation: For stationary policies π =
{µ, µ, . . .}, we use Jµ (i) in place of Jπ (i).
• Assumption (Termination inevitable): There
exists an integer m such that for every policy and
initial state, there is positive probability that the
termination state will be reached after no more
than m stages; for all π, we have

ρπ = max_{i=1,...,n} P{ xm ≠ t | x0 = i, π } < 1
FINITENESS OF POLICY COST FUNCTIONS

• Let
ρ = max_π ρπ.

Note that ρπ depends only on the first m compo-
nents of the policy π, so that ρ < 1.
• For any π and any initial state i

P{x2m ≠ t | x0 = i, π} = P{x2m ≠ t | xm ≠ t, x0 = i, π}
                          × P{xm ≠ t | x0 = i, π} ≤ ρ²

and similarly

P{xkm ≠ t | x0 = i, π} ≤ ρ^k,  i = 1, . . . , n

• So

E{cost between times km and (k + 1)m − 1} ≤ m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)|

and

Jπ(i) ≤ Σ_{k=0}^{∞} m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)| = (m / (1 − ρ)) max_{i=1,...,n, u∈U(i)} |g(i, u)|
MAIN RESULT

• Given any initial conditions J0 (1), . . . , J0 (n),


the sequence Jk (i) generated by the DP iteration
 
Jk+1(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} pij(u) Jk(j) ],  ∀ i

converges to the optimal cost J ∗ (i) for each i.


• Bellman’s equation has J ∗ (i) as unique solution:
 
J*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} pij(u) J*(j) ],  ∀ i

J ∗ (t) = 0
• A stationary policy µ is optimal if and only
if for every state i, µ(i) attains the minimum in
Bellman’s equation.
• Key proof idea: The “tail” of the cost series,

Σ_{k=mK}^{∞} E{ g(xk, µk(xk)) }

vanishes as K increases to ∞.
OUTLINE OF PROOF THAT JN → J ∗

• Assume for simplicity that J0 (i) = 0 for all i,


and for any K ≥ 1, write the cost of any policy π
as
Jπ(x0) = Σ_{k=0}^{mK−1} E{ g(xk, µk(xk)) } + Σ_{k=mK}^{∞} E{ g(xk, µk(xk)) }
        ≤ Σ_{k=0}^{mK−1} E{ g(xk, µk(xk)) } + Σ_{k=K}^{∞} ρ^k m max_{i,u} |g(i, u)|

Take the minimum of both sides over π to obtain

J*(x0) ≤ JmK(x0) + (ρ^K / (1 − ρ)) m max_{i,u} |g(i, u)|.

Similarly, we have

JmK(x0) − (ρ^K / (1 − ρ)) m max_{i,u} |g(i, u)| ≤ J*(x0).

It follows that limK→∞ JmK (x0 ) = J ∗ (x0 ).


• It can be seen that JmK (x0 ) and JmK+k (x0 )
converge to the same limit for k = 1, . . . , m −
1 (since k extra steps far into the future don’t
matter), so
JN (x0 ) → J ∗ (x0 )
EXAMPLE

• Minimizing the E{Time to Termination}: Let

g(i, u) = 1, ∀ i = 1, . . . , n, u ∈ U (i)

• Under our assumptions, the costs J ∗ (i) uniquely


solve Bellman’s equation, which has the form
 
J*(i) = min_{u ∈ U(i)} [ 1 + Σ_{j=1}^{n} pij(u) J*(j) ],  i = 1, . . . , n

• In the special case where there is only one con-


trol at each state, J ∗ (i) is the mean first passage
time from i to t. These times, denoted mi , are the
unique solution of the equations
mi = 1 + Σ_{j=1}^{n} pij mj,  i = 1, . . . , n.
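• A minimal Python check of this special case (illustration only, with made-up
numbers): the mean first passage times solve the linear system (I − P)m = 1,
where P holds the transition probabilities among the nontermination states.

import numpy as np

P = np.array([[0.5, 0.3],     # from state 1: 0.5 -> 1, 0.3 -> 2, 0.2 -> t
              [0.2, 0.4]])    # from state 2: 0.2 -> 1, 0.4 -> 2, 0.4 -> t
m = np.linalg.solve(np.eye(2) - P, np.ones(2))
print(m)   # mean number of stages to reach t from states 1 and 2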
DISCOUNTED PROBLEMS

• Assume a discount factor α < 1.


• Conversion to an SSP problem.

[Figure: the original chain with transition probabilities pij(u) and the equivalent SSP with transition probabilities α pij(u) and a transition to the termination state with probability 1 − α]

• Value iteration converges to J ∗ for all initial J0 :


 
n
X
Jk+1 (i) = min g(i, u) + α pij (u)Jk (j) , ∀ i
u∈U (i)
j=1

• J ∗ is the unique solution of Bellman’s equation:

 
J*(i) = min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^{n} pij(u) J*(j) ],  ∀ i

• A stationary policy µ is optimal if and only


if for every state i, µ(i) attains the minimum in
Bellman’s equation.
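• A minimal Python sketch of value iteration for an n-state discounted problem
(illustration only): P[u] is the transition matrix and g[u] the stage-cost vector
under control u, with the same finite control set at every state.

import numpy as np

def value_iteration(P, g, alpha, tol=1e-8, max_iter=100000):
    n = P[0].shape[0]
    J = np.zeros(n)
    for _ in range(max_iter):
        J_new = np.min([g[u] + alpha * P[u] @ J for u in range(len(P))], axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    # Policy attaining the minimum in Bellman's equation.
    mu = np.argmin([g[u] + alpha * P[u] @ J for u in range(len(P))], axis=0)
    return J, mu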
6.231 DYNAMIC PROGRAMMING

LECTURE 13

LECTURE OUTLINE

• Review of stochastic shortest path problems


• Computational methods for SSP
− Value iteration
− Policy iteration
− Linear programming
• Computational methods for discounted prob-
lems
STOCHASTIC SHORTEST PATH PROBLEMS

• Assume finite-state system: States 1, . . . , n and


special cost-free termination state t
− Transition probabilities pij (u)
− Control constraints u ∈ U (i)
− Cost of policy π = {µ0 , µ1 , . . .} is
Jπ(i) = lim_{N→∞} E{ Σ_{k=0}^{N−1} g(xk, µk(xk)) | x0 = i }

− Optimal policy if Jπ (i) = J ∗ (i) for all i.


− Special notation: For stationary policies π =
{µ, µ, . . .}, we use Jµ (i) in place of Jπ (i).
• Assumption (Termination inevitable): There ex-
ists an integer m such that for every policy and initial
state, there is positive probability that the termi-
nation state will be reached after no more than m
stages; for all π, we have

ρπ = max_{i=1,...,n} P{ xm ≠ t | x0 = i, π } < 1
MAIN RESULT

• Given any initial conditions J0 (1), . . . , J0 (n),


the sequence Jk (i) generated by value iteration
 
Jk+1(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} pij(u) Jk(j) ],  ∀ i

converges to the optimal cost J ∗ (i) for each i.


• Bellman’s equation has J ∗ (i) as unique solution:
 
J*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} pij(u) J*(j) ],  ∀ i

• A stationary policy µ is optimal if and only


if for every state i, µ(i) attains the minimum in
Bellman’s equation.
• Key proof idea: The “tail” of the cost series,

Σ_{k=mK}^{∞} E{ g(xk, µk(xk)) }

vanishes as K increases to ∞.
BELLMAN’S EQUATION FOR A SINGLE POLICY

• Consider a stationary policy µ


• Jµ (i), i = 1, . . . , n, are the unique solution of
the linear system of n equations
Jµ(i) = g(i, µ(i)) + Σ_{j=1}^{n} pij(µ(i)) Jµ(j),  ∀ i = 1, . . . , n

• Proof: This is just Bellman’s equation for a


modified/restricted problem where there is only
one policy, the stationary policy µ, i.e., the control
constraint set at state i is Ũ (i) = {µ(i)}
• The equation provides a way to compute Jµ (i),
i = 1, . . . , n, but the computation is substantial
for large n [O(n3 )]
• For large n, value iteration may be preferable.
(Typical case of a large linear system of equations,
where an iterative method may be better than a
direct solution method.)
• For VERY large n, exact methods cannot be
applied, and approximations are needed. (We will
discuss these later.)
POLICY ITERATION

• It generates a sequence µ1 , µ2 , . . . of stationary


policies, starting with any stationary policy µ0 .
• At the typical iteration, given µk , we perform
a policy evaluation step, that computes the Jµk (i)
as the solution of the (linear) system of equations
J(i) = g(i, µk(i)) + Σ_{j=1}^{n} pij(µk(i)) J(j),  i = 1, . . . , n,

in the n unknowns J(1), . . . , J(n). We then per-


form a policy improvement step, which computes
a new policy µk+1 as
 
µk+1(i) = arg min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} pij(u) Jµk(j) ],  ∀ i

• The algorithm stops when Jµk (i) = Jµk+1 (i) for


all i
• Note the connection with the rollout algorithm,
which is just a single policy iteration
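• A minimal Python sketch of policy iteration (illustration only), written for the
SSP case with the same P[u], g[u] layout used in the value iteration sketch above
(termination state excluded from P); it assumes each generated policy is proper,
so that I − Pµ is invertible. For a discounted problem use αPµ in its place.

import numpy as np

def policy_iteration(P, g):
    n, m = P[0].shape[0], len(P)
    mu = np.zeros(n, dtype=int)
    while True:
        # Policy evaluation: solve (I - P_mu) J = g_mu
        P_mu = np.array([P[mu[i]][i] for i in range(n)])
        g_mu = np.array([g[mu[i]][i] for i in range(n)])
        J = np.linalg.solve(np.eye(n) - P_mu, g_mu)
        # Policy improvement
        Q = np.stack([g[u] + P[u] @ J for u in range(m)])
        mu_new = np.argmin(Q, axis=0)
        if np.array_equal(mu_new, mu):
            return mu, J
        mu = mu_new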
JUSTIFICATION OF POLICY ITERATION

• We can show that Jµk+1(i) ≤ Jµk(i) for all i, k

• Fix k and consider the sequence generated by

JN+1(i) = g(i, µk+1(i)) + Σ_{j=1}^{n} pij(µk+1(i)) JN(j)

where J0(i) = Jµk(i). We have

J0(i) = g(i, µk(i)) + Σ_{j=1}^{n} pij(µk(i)) J0(j)
      ≥ g(i, µk+1(i)) + Σ_{j=1}^{n} pij(µk+1(i)) J0(j) = J1(i)
Using the monotonicity property of DP,
J0 (i) ≥ J1 (i) ≥ · · · ≥ JN (i) ≥ JN +1 (i) ≥ · · · , ∀i
Since JN (i) → Jµk+1 (i) as N → ∞, we obtain
Jµk (i) = J0 (i) ≥ Jµk+1 (i) for all i. Also if Jµk (i) =
Jµk+1 (i) for all i, Jµk solves Bellman’s equation
and is therefore equal to J ∗
• A policy cannot be repeated, there are finitely
many stationary policies, so the algorithm termi-
nates with an optimal policy
LINEAR PROGRAMMING

• We claim that J ∗ is the “largest” J that satisfies


the constraint
J(i) ≤ g(i, u) + Σ_{j=1}^{n} pij(u) J(j),        (1)

for all i = 1, . . . , n and u ∈ U (i).


• Proof: If we use value iteration to generate  a
sequence of vectors Jk = Jk (1), . . . , Jk (n) start-
ing with a J0 such that
 
J0(i) ≤ min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} pij(u) J0(j) ],  ∀ i

Then, Jk (i) ≤ Jk+1 (i) for all k and i (mono-


tonicity property of DP) and Jk → J ∗ , so that
J0 (i) ≤ J ∗ (i) for all i.
• So J* = (J*(1), . . . , J*(n)) is the solution of the
linear program of maximizing Σ_{i=1}^{n} J(i) subject
to the constraint (1).
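• A minimal Python sketch of this linear program using scipy.optimize.linprog
(illustration only, assuming SciPy is available): linprog minimizes and bounds
variables at zero by default, so we minimize −Σi J(i) and declare the variables
free; P[u], g[u] are laid out as in the earlier sketches.

import numpy as np
from scipy.optimize import linprog

def lp_solve_ssp(P, g):
    n, m = P[0].shape[0], len(P)
    A_ub, b_ub = [], []
    for u in range(m):
        for i in range(n):
            row = -P[u][i].copy()
            row[i] += 1.0                 # J(i) - sum_j p_ij(u) J(j) <= g(i, u)
            A_ub.append(row)
            b_ub.append(g[u][i])
    res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)
    return res.x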
LINEAR PROGRAMMING (CONTINUED)

[Figure: two-state example in the (J(1), J(2)) plane; the lines J(i) = g(i, u) + pi1(u)J(1) + pi2(u)J(2), for i = 1, 2 and u = u1, u2, bound the constraint region, and J* = (J*(1), J*(2)) lies at its corner]

• Drawback: For large n the dimension of this


program is very large. Furthermore, the num-
ber of constraints is equal to the number of state-
control pairs.
DISCOUNTED PROBLEMS

• Assume a discount factor α < 1.


• Conversion to an SSP problem.

[Figure: the original chain with transition probabilities pij(u) and the equivalent SSP with transition probabilities α pij(u) and a transition to the termination state with probability 1 − α]

• Value iteration converges to J ∗ for all initial J0 :


 
Jk+1(i) = min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^{n} pij(u) Jk(j) ],  ∀ i

• J ∗ is the unique solution of Bellman’s equation:

 
J*(i) = min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^{n} pij(u) J*(j) ],  ∀ i
DISCOUNTED PROBLEMS (CONTINUED)

• Policy iteration converges finitely to an optimal


policy, and linear programming works.
• Example: Asset selling over an infinite hori-
zon. If accepted, the offer xk of period k, is in-
vested at a rate of interest r.
• By depreciating the sale amount to period 0
dollars, we view (1 + r)−k xk as the reward for
selling the asset in period k at a price xk , where
r > 0 is the rate of interest. So the discount factor
is α = 1/(1 + r).
• J ∗ is the unique solution of Bellman’s equation
"  #
E J ∗ (w)
J ∗ (x) = max x, .
1+r

• An optimal policy is to sell if and only if the cur-


rent offer xk is greater than or equal to ᾱ, where

ᾱ = E{J*(w)} / (1 + r).
6.231 DYNAMIC PROGRAMMING

LECTURE 14

LECTURE OUTLINE

• Average cost per stage problems


• Connection with stochastic shortest path prob-
lems
• Bellman’s equation
• Value iteration
• Policy iteration
AVERAGE COST PER STAGE PROBLEM

• Stationary system with finite number of states


and controls
• Minimize over policies π = {µ0 , µ1 , ...}
Jπ(x0) = lim_{N→∞} (1/N) E_{wk, k=0,1,...} { Σ_{k=0}^{N−1} g(xk, µk(xk), wk) }

• Important characteristics (not shared by other


types of infinite horizon problems)
− For any fixed K, the cost incurred up to time
K does not matter (only the state that we
are at time K matters)
− If all states “communicate” the optimal cost
is independent of the initial state [if we can
go from i to j in finite expected time, we
must have J ∗ (i) ≤ J ∗ (j)]. So J ∗ (i) ≡ λ∗ for
all i.
− Because “communication” issues are so im-
portant, the methodology relies heavily on
Markov chain theory.
CONNECTION WITH SSP

• Assumption: State n is such that for all initial


states and all policies, n will be visited infinitely
often (with probability 1).
• Divide the sequence of generated states into
cycles marked by successive visits to n.
• Each of the cycles can be viewed as a state
trajectory of a corresponding stochastic shortest
path problem with n as the termination state.
[Figure: the original chain with special state n (left) and the associated SSP (right), in which each transition into n is redirected to an artificial termination state t]

• Let the cost at i of the SSP be g(i, u) − λ∗


• We will show that

Av. Cost Probl. ≡ A Min Cost Cycle Probl. ≡ SSP Probl.


CONNECTION WITH SSP (CONTINUED)

• Consider a minimum cycle cost problem: Find


a stationary policy µ that minimizes the expected
cost per transition within a cycle
Cnn(µ) / Nnn(µ),
where for a fixed µ,

Cnn (µ) : E{cost from n up to the first return to n}

Nnn (µ) : E{time from n up to the first return to n}

• Intuitively, Cnn (µ)/Nnn (µ) = average cost of


µ, and optimal cycle cost = λ∗ , so

Cnn (µ) − Nnn (µ)λ∗ ≥ 0,

with equality if µ is optimal.


• Thus, the optimal µ must minimize over µ the
expression Cnn (µ) − Nnn (µ)λ∗ , which is the ex-
pected cost of µ starting from n in the SSP with
stage costs g(i, u) − λ∗ .
• Also: Optimal SSP Cost = 0.
BELLMAN’S EQUATION

• Let h∗ (i) the optimal cost of this SSP problem


when starting at the nontermination states i =
1, . . . , n. Then, h∗ (n) = 0, and h∗ (1), . . . , h∗ (n)
solve uniquely the corresponding Bellman’s equa-
tion
 
h*(i) = min_{u ∈ U(i)} [ g(i, u) − λ* + Σ_{j=1}^{n−1} pij(u) h*(j) ],  ∀ i

• If µ∗ is an optimal stationary policy for the SSP


problem, we have (since Optimal SSP Cost = 0)

h∗ (n) = Cnn (µ∗ ) − Nnn (µ∗ )λ∗ = 0

• Combining these equations, we have


 
λ* + h*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} pij(u) h*(j) ],  ∀ i

• If µ∗ (i) attains the min for each i, µ∗ is optimal.


MORE ON THE CONNECTION WITH SSP

• Interpretation of h∗ (i) as a relative or differen-


tial cost: It is the minimum of

E{cost to reach n from i for the first time}


− E{cost if the stage cost were λ∗ and not g(i, u)}

• We don’t know λ∗ , so we can’t solve the aver-


age cost problem as an SSP problem. But similar
value and policy iteration algorithms are possible.
• Example: A manufacturer at each time:
− Receives an order with prob. p and no order
with prob. 1 − p.
− May process all unfilled orders at cost K >
0, or process no order at all. The cost per
unfilled order at each time is c > 0.
− Maximum number of orders that can remain
unfilled is n.
− Find a processing policy that minimizes the
total expected cost per stage.
EXAMPLE (CONTINUED)

• State = number of unfilled orders. State 0 is


the special state for the SSP formulation.
• Bellman’s equation: For states i = 0, 1, . . . , n−1

λ* + h*(i) = min[ K + (1 − p)h*(0) + p h*(1),  c i + (1 − p)h*(i) + p h*(i + 1) ],

and for state n

λ∗ + h∗ (n) = K + (1 − p)h∗ (0) + ph∗ (1)

• Optimal policy: Process i unfilled orders if

K+(1−p)h∗ (0)+ph∗ (1) ≤ ci+(1−p)h∗ (i)+ph∗ (i+1).

• Intuitively, h∗ (i) is monotonically nondecreas-


ing with i (interpret h∗ (i) as optimal costs-to-go
for the associate SSP problem). So a threshold pol-
icy is optimal: process the orders if their number
exceeds some threshold integer m∗ .
VALUE ITERATION

• Natural value iteration method: Generate op-


timal k-stage costs by DP algorithm starting with
any J0 :
 
Jk+1(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} pij(u) Jk(j) ],  ∀ i

• Result: limk→∞ Jk (i)/k = λ∗ for all i.


• Proof outline: Let Jk∗ be so generated from the
initial condition J0∗ = h∗ . Then, by induction,

Jk∗ (i) = kλ∗ + h∗ (i), ∀i, ∀ k.

On the other hand,

|Jk(i) − Jk*(i)| ≤ max_{j=1,...,n} |J0(j) − h*(j)|,  ∀ i

since Jk (i) and Jk∗ (i) are optimal costs for two
k-stage problems that differ only in the terminal
cost functions, which are J0 and h∗ .
RELATIVE VALUE ITERATION

• The value iteration method just described has


two drawbacks:
− Since typically some components of Jk di-
verge to ∞ or −∞, calculating limk→∞ Jk (i)/k
is numerically cumbersome.
− The method will not compute a correspond-
ing differential cost vector h∗ .
• We can bypass both difficulties by subtracting
a constant from all components of the vector Jk ,
so that the difference, call it hk , remains bounded.
• Relative value iteration algorithm: Pick any
state s, and iterate according to
 
hk+1(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} pij(u) hk(j) ]
          − min_{u ∈ U(s)} [ g(s, u) + Σ_{j=1}^{n} psj(u) hk(j) ],  ∀ i

• Then we can show hk → h∗ (under an extra


assumption).
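• A minimal Python sketch of relative value iteration (illustration only), with
the same P[u], g[u] layout as in the earlier sketches; as noted above, convergence
requires an extra assumption.

import numpy as np

def relative_value_iteration(P, g, s=0, n_iter=1000):
    n, m = P[0].shape[0], len(P)
    h = np.zeros(n)
    for _ in range(n_iter):
        Th = np.min([g[u] + P[u] @ h for u in range(m)], axis=0)   # (T h)(i)
        lam = Th[s]                     # subtract the value at the reference state s
        h = Th - lam
    return h, lam                       # differential costs and average cost estimate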
POLICY ITERATION

• At the typical iteration, we have a stationary


µk .
• Policy evaluation: Compute λk and hk (i) of µk ,
using the n + 1 equations hk (n) = 0 and
λk + hk(i) = g(i, µk(i)) + Σ_{j=1}^{n} pij(µk(i)) hk(j),  ∀ i

• Policy improvement: (For the λk -SSP) Find for


all i
 
µk+1(i) = arg min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} pij(u) hk(j) ]

• If λk+1 = λk and hk+1 (i) = hk (i) for all i, stop;


otherwise, repeat with µk+1 replacing µk .
• Result: For each k, we either have λk+1 < λk
or we have policy improvement for the λk -SSP:

λk+1 = λk , hk+1 (i) ≤ hk (i), i = 1, . . . , n.

The algorithm terminates with an optimal policy.


6.231 DYNAMIC PROGRAMMING

LECTURE 15

LECTURE OUTLINE

• Control of continuous-time Markov chains –


Semi-Markov problems
• Problem formulation – Equivalence to discrete-
time problems
• Discounted problems
• Average cost problems
CONTINUOUS-TIME MARKOV CHAINS

• Stationary system with finite number of states


and controls
• State transitions occur at discrete times
• Control applied at these discrete times and stays
constant between transitions
• Time between transitions is random
• Cost accumulates in continuous time (may also
be incurred at the time of transition)
• Example: Admission control in a system with
restricted capacity (e.g., a communication link)
− Customer arrivals: a Poisson process
− Customers entering the system, depart after
exponentially distributed time
− Upon arrival we must decide whether to ad-
mit or to block a customer
− There is a cost for blocking a customer
− For each customer that is in the system, there
is a customer-dependent reward per unit time
− Minimize time-discounted or average cost
PROBLEM FORMULATION

• x(t) and u(t): State and control at time t


• tk : Time of kth transition (t0 = 0)
• xk = x(tk ); x(t) = xk for tk ≤ t < tk+1 .
• uk = u(tk ); u(t) = uk for tk ≤ t < tk+1 .
• No transition probabilities; instead transition
distributions (quantify the uncertainty about both
transition time and next state)

Qij (τ, u) = P {tk+1 −tk ≤ τ, xk+1 = j | xk = i, uk = u}


• Two important formulas:
(1) Transition probabilities are specified by

pij(u) = P{ xk+1 = j | xk = i, uk = u } = lim_{τ→∞} Qij(τ, u)

(2) The Cumulative Distribution Function (CDF)


of τ given i, j, u is (assuming pij (u) > 0)

P{ tk+1 − tk ≤ τ | xk = i, xk+1 = j, uk = u } = Qij(τ, u) / pij(u)

Thus, Qij (τ, u) can be viewed as a “scaled CDF”


EXPONENTIAL TRANSITION DISTRIBUTIONS

• Important example of transition distributions:



Qij(τ, u) = pij(u) ( 1 − e^{−νi(u)τ} ),

where pij (u) are transition probabilities, and νi (u)


is called the transition rate at state i.
• Interpretation: If the system is in state i and
control u is applied
− the next state will be j with probability pij (u)
− the time between the transition to state i
and the transition to the next state j is ex-
ponentially distributed with parameter νi (u)
(independently of j):
P {transition time interval > τ | i, u} = e−νi (u)τ

• The exponential distribution is memoryless.


This implies that for a given policy, the system
is a continuous-time Markov chain (the future de-
pends on the past through the current state).
• Without the memoryless property, the Markov
property holds only at the times of transition.
COST STRUCTURES

• There is cost g(i, u) per unit time, i.e.

g(i, u)dt = the cost incurred in time dt

• There may be an extra “instantaneous” cost


ĝ(i, u) at the time of a transition (let’s ignore this
for the moment)
• Total discounted cost of π = {µ0 , µ1 , . . .} start-
ing from state i (with discount factor β > 0)
lim_{N→∞} E{ Σ_{k=0}^{N−1} ∫_{tk}^{tk+1} e^{−βt} g(xk, µk(xk)) dt | x0 = i }

• Average cost per unit time

lim_{N→∞} (1 / E{tN}) E{ Σ_{k=0}^{N−1} ∫_{tk}^{tk+1} g(xk, µk(xk)) dt | x0 = i }

• We will see that both problems have equivalent


discrete-time versions.
A NOTE ON NOTATION

• The scaled CDF Qij (τ, u) can be used to model


discrete, continuous, and mixed distributions for
the transition time τ .
• Generally, expected values of functions of τ can
be written as integrals involving d Qij (τ, u). For
example, the conditional expected value of τ given
i, j, and u is written as
E{τ | i, j, u} = ∫_0^∞ τ dQij(τ, u) / pij(u)
• If Qij (τ, u) is continuous with respect to τ , its
derivative
dQij
qij (τ, u) = (τ, u)

can be viewed as a “scaled” density function. Ex-
pected values of functions of τ can then be written
in terms of qij (τ, u). For example
Z ∞
qij (τ, u)
E{τ | i, j, u} = τ dτ
0 pij (u)
• If Qij (τ, u) is discontinuous and “staircase-like,”
expected values can be written as summations.
DISCOUNTED CASE - COST CALCULATION

• For a policy π = {µ0 , µ1 , . . .}, write

Jπ (i) = E{1st transition cost}+E{e−βτ Jπ1 (j) | i, µ0 (i)}

where Jπ1 (j) is the cost-to-go of the policy π1 =


{µ1 , µ2 , . . .}
• We calculate the two costs in the RHS. The
E{1st transition cost}, if u is applied at state i, is

G(i, u) = Ej{ Eτ{1st transition cost | j} }
        = Σ_{j=1}^{n} pij(u) ∫_0^∞ ( ∫_0^τ e^{−βt} g(i, u) dt ) dQij(τ, u) / pij(u)
        = Σ_{j=1}^{n} ∫_0^∞ ((1 − e^{−βτ}) / β) g(i, u) dQij(τ, u)

• Thus the E{1st transition cost} is

G(i, µ0(i)) = g(i, µ0(i)) Σ_{j=1}^{n} ∫_0^∞ ((1 − e^{−βτ}) / β) dQij(τ, µ0(i))

(The summation term can be viewed as a “dis-


counted length of the transition interval t1 − t0 ”.)
COST CALCULATION (CONTINUED)

• Also the expected (discounted) cost from the


next state j is

E{ e^{−βτ} Jπ1(j) | i, µ0(i) }
  = Ej{ E{ e^{−βτ} | i, µ0(i), j } Jπ1(j) | i, µ0(i) }
  = Σ_{j=1}^{n} pij(u) ( ∫_0^∞ e^{−βτ} dQij(τ, u) / pij(u) ) Jπ1(j)
  = Σ_{j=1}^{n} mij(µ0(i)) Jπ1(j)

where mij(u) is given by

mij(u) = ∫_0^∞ e^{−βτ} dQij(τ, u) < ∫_0^∞ dQij(τ, u) = pij(u)

and can be viewed as the “effective discount fac-


tor” [the analog of αpij (u) in discrete-time case].
• So Jπ (i) can be written as
Jπ(i) = G(i, µ0(i)) + Σ_{j=1}^{n} mij(µ0(i)) Jπ1(j)
i.e., the (continuous-time discounted) cost of 1st
period, plus the (continuous-time discounted) cost-
to-go from the next state.
EQUIVALENCE TO AN SSP

• Similar to the discrete-time case, introduce a


stochastic shortest path problem with an artificial
termination state t
• Under control u, from state i the system moves
to state j with probability mij(u), and to the ter-
mination state t with probability 1 − Σ_{j=1}^{n} mij(u)
• Bellman's equation: For i = 1, . . . , n,

J*(i) = min_{u ∈ U(i)} [ G(i, u) + Σ_{j=1}^{n} mij(u) J*(j) ]

• Analogs of value iteration, policy iteration, and


linear programming.
• If in addition to the cost per unit time g, there
is an extra (instantaneous) one-stage cost ĝ(i, u),
Bellman’s equation becomes
 
J*(i) = min_{u ∈ U(i)} [ ĝ(i, u) + G(i, u) + Σ_{j=1}^{n} mij(u) J*(j) ]
MANUFACTURER’S EXAMPLE REVISITED

• A manufacturer receives orders with interarrival


times uniformly distributed in [0, τmax ].
• He may process all unfilled orders at cost K > 0,
or process none. The cost per unit time of an
unfilled order is c. Max number of unfilled orders
is n.
• The nonzero transition distributions are
 
Qi1(τ, Fill) = Qi(i+1)(τ, Not Fill) = min( 1, τ / τmax )

• The one-stage expected cost G is

G(i, Fill) = 0,   G(i, Not Fill) = γ c i,

where

γ = Σ_{j=1}^{n} ∫_0^∞ ((1 − e^{−βτ}) / β) dQij(τ, u) = ∫_0^{τmax} ((1 − e^{−βτ}) / (β τmax)) dτ

• There is an “instantaneous” cost

ĝ(i, Fill) = K, ĝ(i, Not Fill) = 0


MANUFACTURER’S EXAMPLE CONTINUED

• The “effective discount factors” mij (u) in Bell-


man’s Equation are

mi1 (Fill) = mi(i+1) (Not Fill) = α,

where
Z ∞ Z τmax
−βτ e−βτ 1 − e−βτmax
α= e dQij (τ, u) = dτ =
0 0
τmax βτmax

• Bellman’s equation has the form


 
J ∗ (i) ∗ ∗
= min K+αJ (1), γci+αJ (i+1) , i = 1, 2, . . .

• As in the discrete-time case, we can conclude


that there exists an optimal threshold i∗ :

fill the orders <==> their number i exceeds i∗


AVERAGE COST

• Minimize
lim_{N→∞} (1 / E{tN}) E{ ∫_0^{tN} g(x(t), u(t)) dt }

assuming there is a special state that is “recurrent


under all policies”
• Total expected cost of a transition
G(i, u) = g(i, u)τ i (u),
where τ i (u): Expected transition time.
• We now apply the SSP argument used for the
discrete-time case. Divide trajectory into cycles
marked by successive visits to n. The cost at (i, u)
is G(i, u) − λ∗ τ i (u), where λ∗ is the optimal ex-
pected cost per unit time. Each cycle is viewed as
a state trajectory of a corresponding SSP problem
with the termination state being essentially n.
• So Bellman’s Eq. for the average cost problem:
 
h*(i) = min_{u ∈ U(i)} [ G(i, u) − λ* τ̄i(u) + Σ_{j=1}^{n} pij(u) h*(j) ]
MANUFACTURER EXAMPLE/AVERAGE COST

• The expected transition times are


τ̄i(Fill) = τ̄i(Not Fill) = τmax / 2

the expected transition cost is

G(i, Fill) = 0,   G(i, Not Fill) = c i τmax / 2

and there is also the "instantaneous" cost

ĝ(i, Fill) = K,   ĝ(i, Not Fill) = 0

• Bellman's equation:

h*(i) = min[ K − λ* τmax/2 + h*(1),  c i τmax/2 − λ* τmax/2 + h*(i + 1) ]

• Again it can be shown that a threshold policy


is optimal.
6.231 DYNAMIC PROGRAMMING

LECTURE 16

LECTURE OUTLINE

• We start a nine-lecture sequence on advanced


infinite horizon DP and approximate solution meth-
ods
• We allow infinite state space, so the stochastic
shortest path framework cannot be used any more
• Results are rigorous assuming a countable dis-
turbance space
− This includes deterministic problems with
arbitrary state space, and countable state
Markov chains
− Otherwise the mathematics of measure the-
ory make analysis difficult, although the fi-
nal results are essentially the same as for
countable disturbance space
• The discounted problem is the proper starting
point for this analysis
• The central mathematical structure is that the
DP mapping is a contraction mapping (instead of
existence of a termination state)
DISCOUNTED PROBLEMS/BOUNDED COST

• Stationary system with arbitrary state space

xk+1 = f (xk , uk , wk ), k = 0, 1, . . .

• Cost of a policy π = {µ0 , µ1 , . . .}


Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...} { Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }

with α < 1, and for some M , we have |g(x, u, w)| ≤


M for all (x, u, w)
• Shorthand notation for DP mappings (operate
on functions of state to produce other functions)
 
(T J)(x) = min_{u ∈ U(x)} E_w{ g(x, u, w) + α J( f(x, u, w) ) },  ∀ x

T J is the optimal cost function for the one-stage


problem with stage cost g and terminal cost αJ.
• For any stationary policy µ
  
(Tµ J)(x) = E_w{ g(x, µ(x), w) + α J( f(x, µ(x), w) ) },  ∀ x
“SHORTHAND” THEORY – A SUMMARY

• Cost function expressions [with J0 (x) ≡ 0]

Jπ (x) = lim (Tµ0 Tµ1 · · · Tµk J0 )(x), Jµ (x) = lim (Tµk J0 )(x)
k→∞ k→∞

• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ
• Optimality condition:

µ: optimal <==> Tµ J ∗ = T J ∗

• Value iteration: For any (bounded) J and all


x,
J ∗ (x) = lim (T k J)(x)
k→∞

• Policy iteration: Given µk ,


− Policy evaluation: Find Jµk by solving

Jµk = Tµk Jµk

− Policy improvement: Find µk+1 such that

Tµk+1 Jµk = T Jµk


TWO KEY PROPERTIES

• Monotonicity property: For any functions J


and J ′ such that J(x) ≤ J ′ (x) for all x, and any
µ
(T J)(x) ≤ (T J ′ )(x), ∀ x,
(Tµ J)(x) ≤ (Tµ J ′ )(x), ∀ x.

• Additivity property: For any J, any scalar


r, and any µ

T (J + re) (x) = (T J)(x) + αr, ∀ x,

Tµ (J + re) (x) = (Tµ J)(x) + αr, ∀ x,
where e is the unit function [e(x) ≡ 1].
CONVERGENCE OF VALUE ITERATION

• If J0 ≡ 0,

J ∗ (x) = lim (T N J0 )(x), for all x


N →∞
Proof: For any initial state x0 , and policy π =
{µ0 , µ1 , . . .},


Jπ(x0) = E{ Σ_{k=0}^{∞} α^k g(xk, µk(xk), wk) }
       = E{ Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) } + E{ Σ_{k=N}^{∞} α^k g(xk, µk(xk), wk) }

The tail portion satisfies

| E{ Σ_{k=N}^{∞} α^k g(xk, µk(xk), wk) } | ≤ α^N M / (1 − α),

where M ≥ |g(x, u, w)|. Take the min over π of


both sides. Q.E.D.
BELLMAN’S EQUATION

• The optimal cost function J ∗ satisfies Bellman’s


Eq., i.e. J ∗ = T (J ∗ ).
Proof: For all x and N ,

αN M αN M
J ∗ (x) − N ∗
≤ (T J0 )(x) ≤ J (x) + ,
1−α 1−α

where J0 (x) ≡ 0 and M ≥ |g(x, u, w)|. Applying


T to this relation, and using Monotonicity and
Additivity,

α N +1 M
(T J ∗ )(x) − ≤ (T N +1 J0 )(x)
1−α
α N +1 M
≤ (T J ∗ )(x) +
1−α

Taking the limit as N → ∞ and using the fact

lim (T N +1 J0 )(x) = J ∗ (x)


N →∞

we obtain J ∗ = T J ∗ . Q.E.D.
THE CONTRACTION PROPERTY

• Contraction property: For any bounded func-


tions J and J ′ , and any µ,

max (T J)(x) − (T J ′ )(x) ≤ α max J(x) − J ′ (x) ,


x x

max (Tµ J)(x)−(Tµ J ′ )(x) ≤ α max J(x)−J ′ (x) .


x x

Proof: Denote c = maxx∈S J(x) − J ′ (x) . Then

J(x) − c ≤ J ′ (x) ≤ J(x) + c, ∀x

Apply T to both sides, and use the Monotonicity


and Additivity properties:

(T J)(x) − αc ≤ (T J ′ )(x) ≤ (T J)(x) + αc, ∀x

Hence

(T J)(x) − (T J ′ )(x) ≤ αc, ∀ x.

Q.E.D.
IMPLICATIONS OF CONTRACTION PROPERTY

• We can strengthen our earlier result:


• Bellman’s equation J = T J has a unique solu-
tion, namely J ∗ , and for any bounded J, we have

lim (T k J)(x) = J ∗ (x), ∀x


k→∞

Proof: Use

max (T k J)(x) − J ∗ (x) = max (T k J)(x) − (T k J ∗ )(x)


x x
≤ αk max J(x) − J ∗ (x)
x

• Special Case: For each stationary µ, Jµ is the


unique solution of J = Tµ J and

lim (Tµk J)(x) = Jµ (x), ∀ x,


k→∞

for any bounded J.


• Convergence rate: For all k,

max (T k J)(x) − J ∗ (x) ≤ αk max J(x) − J ∗ (x)


x x
NEC. AND SUFFICIENT OPT. CONDITION

• A stationary policy µ is optimal if and only if


µ(x) attains the minimum in Bellman’s equation
for each x; i.e.,

T J ∗ = Tµ J ∗ .

Proof: If T J ∗ = Tµ J ∗ , then using Bellman’s equa-


tion (J ∗ = T J ∗ ), we have

J ∗ = Tµ J ∗ ,

so by uniqueness of the fixed point of Tµ , we obtain


J ∗ = Jµ ; i.e., µ is optimal.
• Conversely, if the stationary policy µ is optimal,
we have J ∗ = Jµ , so

J ∗ = Tµ J ∗ .

Combining this with Bellman’s equation (J ∗ =


T J ∗ ), we obtain T J ∗ = Tµ J ∗ . Q.E.D.
COMPUTATIONAL METHODS

• Value iteration and variants


− Gauss-Seidel version
− Approximate value iteration
• Policy iteration and variants
− Combination with value iteration
− Modified policy iteration
− Asynchronous policy iteration
• Linear programming
maximize Σ_{i=1}^{n} J(i)

subject to J(i) ≤ g(i, u) + α Σ_{j=1}^{n} pij(u) J(j),  ∀ (i, u)

• Approximate linear programming: use in place
of J(i) a low-dimensional basis function representation

J̃(i, r) = Σ_{k=1}^{m} rk wk(i)

and a low-dimensional LP (with many constraints)
6.231 DYNAMIC PROGRAMMING

LECTURE 17

LECTURE OUTLINE

• One-step lookahead and rollout for discounted


problems
• Approximate policy iteration: Infinite state
space
• Contraction mappings in DP
• Discounted problems: Countable state space
with unbounded costs
ONE-STEP LOOKAHEAD POLICIES

• At state i use the control µ(i) that attains the


minimum in
 
(T J̃)(i) = min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^{n} pij(u) J̃(j) ],

where J̃ is some approximation to J*.

• Assume that

T J̃ ≤ J̃ + δe,

for some scalar δ, where e is the unit vector. Then

Jµ ≤ T J̃ + (αδ / (1 − α)) e ≤ J̃ + (δ / (1 − α)) e.

• Assume that

J* − ǫe ≤ J̃ ≤ J* + ǫe,

for some scalar ǫ. Then

Jµ ≤ J* + (2αǫ / (1 − α)) e.
APPLICATION TO ROLLOUT POLICIES

• Let µ1, . . . , µM be stationary policies, and let

J̃(i) = min{ Jµ1(i), . . . , JµM(i) },  ∀ i.

• Then, for all i, and m = 1, . . . , M, we have

(T J̃)(i) = min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^{n} pij(u) J̃(j) ]
         ≤ min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^{n} pij(u) Jµm(j) ]
         ≤ Jµm(i)

• Taking the minimum over m,

(T J̃)(i) ≤ J̃(i),  ∀ i.

• Using the preceding slide result with δ = 0,

Jµ(i) ≤ J̃(i) = min{ Jµ1(i), . . . , JµM(i) },  ∀ i,

i.e., the rollout policy µ improves over each µm.


APPROXIMATE POLICY ITERATION

• Suppose that the policy evaluation is approxi-


mate, according to,

max |Jk (x) − Jµk (x)| ≤ δ, k = 0, 1, . . .


x

and policy improvement is approximate, according


to,

max |(Tµk+1 Jk )(x)−(T Jk )(x)| ≤ ǫ, k = 0, 1, . . .


x

where δ and ǫ are some positive scalars.


• Error Bound: The sequence {µk } generated
by approximate policy iteration satisfies

lim sup_{k→∞} max_{x∈S} ( Jµk(x) − J*(x) ) ≤ (ǫ + 2αδ) / (1 − α)²

• Typical practical behavior: The method makes


steady progress up to a point and then the iterates
Jµk oscillate within a neighborhood of J ∗ .
CONTRACTION MAPPINGS

• Given a real vector space Y with a norm k · k


(i.e., kyk ≥ 0 for all y ∈ Y , kyk = 0 if and only if
y = 0, and ky + zk ≤ kyk + kzk for all y, z ∈ Y )
• A function F : Y 7→ Y is said to be a contraction
mapping if for some ρ ∈ (0, 1), we have

kF (y) − F (z)k ≤ ρky − zk, for all y, z ∈ Y.

ρ is called the modulus of contraction of F .


• For m > 1, we say that F is an m-stage con-
traction if F m is a contraction.
• Important example: Let S be a set (e.g., state
space in DP), v : S 7→ ℜ be a positive-valued
function. Let B(S) be the set of all functions J :
S 7→ ℜ such that J(s)/v(s) is bounded over s.
• We define a norm on B(S), called the weighted
sup-norm, by
|J(s)|
kJk = max .
s∈S v(s)

• Important special case: The discounted prob-


lem mappings T and Tµ [for v(s) ≡ 1, ρ = α].
CONTRACTION MAPPING FIXED-POINT TH.

• Contraction Mapping Fixed-Point Theo-


rem: If F : B(S) 7→ B(S) is a contraction with
modulus ρ ∈ (0, 1), then there exists a unique
J ∗ ∈ B(S) such that

J ∗ = F J ∗.

Furthermore, if J is any function in B(S), then


{F k J} converges to J ∗ and we have

kF k J − J ∗ k ≤ ρk kJ − J ∗ k, k = 1, 2, . . . .

• Similar result if F is an m-stage contraction


mapping.
• This is a special case of a general result for
contraction mappings F : Y 7→ Y over normed
vector spaces Y that are complete: every sequence
{yk } that is Cauchy (satisfies kym − yn k → 0 as
m, n → ∞) converges.
• The space B(S) is complete (see the text for a
proof).
A DP-LIKE CONTRACTION MAPPING I

• Let S = {1, 2, . . .}, and let F : B(S) 7→ B(S)


be a linear mapping of the form
X
(F J)(i) = b(i) + a(i, j) J(j), ∀i
j∈S

where b(i) and a(i, j) are some scalars. Then F is


a contraction with modulus ρ if
P
j∈S |a(i, j)| v(j)
≤ ρ, ∀i
v(i)

• Let F : B(S) 7→ B(S) be a mapping of the form

(F J)(i) = min (Fµ J)(i), ∀i


µ∈M

where M is parameter set, and for each µ ∈ M ,


Fµ is a contraction mapping from B(S) to B(S)
with modulus ρ. Then F is a contraction mapping
with modulus ρ.
A DP-LIKE CONTRACTION MAPPING II

• Let S = {1, 2, . . .}, let M be a parameter set,


and for each µ ∈ M , let
X
(Fµ J)(i) = b(i, µ) + a(i, j, µ) J(j), ∀i
j∈S

• We have Fµ J ∈ B(S) for all J ∈ B(S) provided


bµ ∈ B(S) and Vµ ∈ B(S), where
 
bµ = b(1, µ), b(2, µ), . . . , Vµ = V (1, µ), V (2, µ), . . . ,
X
V (i, µ) = a(i, j, µ) v(j), ∀i
j∈S

• Consider the mapping F

(F J)(i) = min (Fµ J)(i), ∀i


µ∈M

We have F J ∈ B(S) for all J ∈ B(S), provided


b ∈ B(S) and V ∈ B(S), where
 
b = b(1), b(2), . . . , V = V (1), V (2), . . . ,

with b(i) = maxµ∈M b(i, µ) and V (i) = maxµ∈M V (i, µ).


DISCOUNTED DP - UNBOUNDED COST I

• State space S = {1, 2, . . .}, transition probabil-


ities pij (u), cost g(i, u).
• Weighted sup-norm
|J(i)|
kJk = max
i∈S vi

on B(S): sequences J(i) such that kJk < ∞.
• Assumptions:

(a) G = G(1), G(2), . . . ∈ B(S), where

G(i) = max g(i, u) , ∀i


u∈U (i)


(b) V = V (1), V (2), . . . ∈ B(S), where
X
V (i) = max pij (u) vj , ∀i
u∈U (i)
j∈S

(c) There exists an integer m ≥ 1 and a scalar


ρ ∈ (0, 1) such that for every policy π,
P
j∈S P (xm = j | x0 = i, π) vj
αm ≤ ρ, ∀i
vi
DISCOUNTED DP - UNBOUNDED COST II

• Example: Let vi = i for all i = 1, 2, . . .


• Assumption (a) is satisfied if the maximum ex-
pected absolute cost per stage at state i grows no
faster than linearly with i.
• Assumption (b) states that the maximum ex-
pected next state following state i,

max E{j | i, u},


u∈U (i)

also grows no faster than linearly with i.


• Assumption (c) is satisfied if
X
αm P (xm = j | x0 = i, π) j ≤ ρ i, ∀i
j∈S

It requires that for all π, the expected value of the


state obtained m stages after reaching state i is no
more than α−m ρ i.
• If there is bounded upward expected change of
the state starting at i, there exists m sufficiently
large so that Assumption (c) is satisfied.
DISCOUNTED DP - UNBOUNDED COST III

• Consider the DP mappings Tµ and T ,


 X 
(Tµ J)(i) = g i, µ(i) +α pij µ(i) J(j), ∀ i,
j∈S

 
X
(T J)(i) = min g(i, u) + α pij (u)J(j) , ∀ i
u∈U (i)
j∈S

• Proposition: Under the earlier assumptions,


T and Tµ map B(S) into B(S), and are m-stage
contraction mappings with modulus ρ.
• The m-stage contraction properties can be used
to essentially replicate the analysis for the case of
bounded cost, and to show the standard results:
− The value iteration method Jk+1 = T Jk con-
verges to the unique solution J ∗ of Bellman’s
equation J = T J.
− The unique solution J ∗ of Bellman’s equa-
tion is the optimal cost function.
− A stationary policy µ is optimal if and only
if Tµ J ∗ = T J ∗ .
6.231 DYNAMIC PROGRAMMING

LECTURE 18

LECTURE OUTLINE

• Undiscounted problems
• Stochastic shortest path problems (SSP)
• Proper and improper policies
• Analysis and computational methods for SSP
• Pathologies of SSP
UNDISCOUNTED PROBLEMS

• System: xk+1 = f (xk , uk , wk )


• Cost of a policy π = {µ0 , µ1 , . . .}
(N −1 )
X 
Jπ (x0 ) = lim E g xk , µk (xk ), wk
N →∞ wk
k=0,1,... k=0

• Shorthand notation for DP mappings


 
(T J)(x) = min E g(x, u, w) + J f (x, u, w) , ∀x
u∈U (x) w

• For any stationary policy µ


  
(Tµ J)(x) = E g x, µ(x), w + J f (x, µ(x), w) , ∀x
w

• Neither T nor Tµ are contractions in general,


but their monotonicity is helpful.
• SSP problems provide a “soft boundary” be-
tween the easy finite-state discounted problems
and the hard undiscounted problems.
− They share features of both.
− Some of the nice theory is recovered because
of the termination state.
SSP THEORY SUMMARY I

• As earlier, we have a cost-free term. state t, a


finite number of states 1, . . . , n, and finite number
of controls, but we will make weaker assumptions.
• Mappings T and Tµ (modified to account for
termination state t):
" n
#
X
(T J)(i) = min g(i, u) + pij (u)J(j) , i = 1, . . . , n,
u∈U (i)
j=1

n
 X 
(Tµ J)(i) = g i, µ(i) + pij µ(i) J(j), i = 1, . . . , n.
j=1

• Definition: A stationary policy µ is called


proper, if under µ, from every state i, there is
a positive probability path that leads to t.
• Important fact: If µ is proper, Tµ is contrac-
tion with respect to some weighted max norm

max_i (1/vi) |(Tµ J)(i) − (Tµ J′)(i)| ≤ ρµ max_i (1/vi) |J(i) − J′(i)|
• T is similarly a contraction if all µ are proper
(the case discussed in the text, Ch. 7, Vol. I).
SSP THEORY SUMMARY II

• The theory can be pushed one step further.


Assume that:
(a) There exists at least one proper policy
(b) For each improper µ, Jµ (i) = ∞ for some i
• Then T is not necessarily a contraction, but:
− J ∗ is the unique solution of Bellman’s Equ.
− µ∗ is optimal if and only if Tµ∗ J ∗ = T J ∗
− limk→∞ (T k J)(i) = J ∗ (i) for all i
− Policy iteration terminates with an optimal
policy, if started with a proper policy
• Example: Deterministic shortest path problem
with a single destination t.
− States <=> nodes; Controls <=> arcs
− Termination state <=> the destination
− Assumption (a) <=> every node is con-
nected to the destination
− Assumption (b) <=> all cycle costs > 0
SSP ANALYSIS I

• For a proper policy µ, Jµ is the unique fixed


point of Tµ , and Tµk J → Jµ for all J (holds by the
theory of Vol. I, Section 7.2)
• A stationary µ satisfying J ≥ Tµ J for some J
must be proper - true because

k−1
X
J ≥ Tµk J = Pµk J + Pµm gµ
m=0

and some component of the term on the right


blows up if µ is improper (by our assumptions).
• Consequence: T can have at most one fixed
point.
Proof: If J and J ′ are two solutions, select µ
and µ′ such that J = T J = Tµ J and J ′ = T J ′ =
Tµ′ J ′ . By preceding assertion, µ and µ′ must be
proper, and J = Jµ and J ′ = Jµ′ . Also

J = T k J ≤ Tµk′ J → Jµ′ = J ′

Similarly, J ′ ≤ J, so J = J ′ .
SSP ANALYSIS II

• We now show that T has a fixed point, and also


that policy iteration converges.
• Generate a sequence {µk } by policy iteration
starting from a proper policy µ0 .
• µ1 is proper and Jµ0 ≥ Jµ1 since

Jµ0 = Tµ0 Jµ0 ≥ T Jµ0 = Tµ1 Jµ0 ≥ Tµk1 Jµ0 ≥ Jµ1

• Thus {Jµk } is nonincreasing, some policy µ will


be repeated, with Jµ = T Jµ . So Jµ is a fixed point
of T .
• Next show T k J → Jµ for all J, i.e., value it-
eration converges to the same limit as policy iter-
ation. (Sketch: True if J = Jµ , argue using the
properness of µ to show that the terminal cost
difference J − Jµ does not matter.)
• To show Jµ = J ∗ , for any π = {µ0 , µ1 , . . .}

Tµ0 · · · Tµk−1 J0 ≥ T k J0 ,

where J0 ≡ 0. Take lim sup as k → ∞, to obtain


Jπ ≥ Jµ , so µ is optimal and Jµ = J ∗ .
SSP ANALYSIS III

• If all policies are proper (the assumption of


Section 7.1, Vol. I), Tµ and T are contractions
with respect to a weighted sup norm.
Proof: Consider a new SSP problem where the
transition probabilities are the same as in the orig-
inal, but the transition costs are all equal to −1.
Let Ĵ be the corresponding optimal cost vector.
For all µ,
Ĵ(i) = −1 + min_{u ∈ U(i)} Σ_{j=1}^{n} pij(u) Ĵ(j) ≤ −1 + Σ_{j=1}^{n} pij(µ(i)) Ĵ(j)

For vi = −Ĵ(i), we have vi ≥ 1, and for all µ,

Σ_{j=1}^{n} pij(µ(i)) vj ≤ vi − 1 ≤ ρ vi,  i = 1, . . . , n,

where

ρ = max_{i=1,...,n} (vi − 1) / vi < 1.
This implies contraction of Tµ and T by the results
of the preceding lecture.
PATHOLOGIES I: DETERM. SHORTEST PATHS

• If there is a cycle with cost = 0, Bellman’s equa-


tion has an infinite number of solutions. Example:
[Figure: nodes 1, 2, t; arc 1 → 2 with cost 0, arc 2 → 1 with cost 0 (a zero-cost cycle), and arc 2 → t with cost 1]

• We have J ∗ (1) = J ∗ (2) = 0.


• Bellman’s equation is

J(1) = J(2),   J(2) = min[ J(1), 1 ].

• It has J ∗ as solution.
• Set of solutions of Bellman’s equation:

{ J | J(1) = J(2) ≤ 1 }.
PATHOLOGIES II: DETERM. SHORTEST PATHS

• If there is a cycle with cost < 0, Bellman’s


equation has no solution [among functions J with
−∞ < J(i) < ∞ for all i]. Example:
[Figure: nodes 1, 2, t; arc 1 → 2 with cost 0, arc 2 → 1 with cost −1 (a negative-cost cycle), and arc 2 → t with cost 1]

• We have J ∗ (1) = J ∗ (2) = −∞.


• Bellman’s equation is

J(1) = J(2),   J(2) = min[ −1 + J(1), 1 ].

• There is no solution [among functions J with


−∞ < J(i) < ∞ for all i].
• Bellman’s equation has as solution J ∗ (1) =
J ∗ (2) = −∞ [within the larger class of functions
J(·) that can take the value −∞ for some (or
all) states]. This situation can be generalized (see
Chapter 3 of Vol. 2 of the text).
PATHOLOGIES III: THE BLACKMAILER

• Two states, state 1 and the termination state t.


• At state 1, choose a control u ∈ (0, 1] (the
blackmail amount demanded) at a cost −u, and
move to t with probability u2 , or stay in 1 with
probability 1 − u2 .
• Every stationary policy is proper, but the con-
trol set in not finite.
• For any stationary µ with µ(1) = u, we have

Jµ (1) = −u + (1 − u2 )Jµ (1)

from which Jµ(1) = −1/u


• Thus J ∗ (1) = −∞, and there is no optimal
stationary policy.
• It turns out that a nonstationary policy is
optimal: demand µk (1) = γ/(k + 1) at time k,
with γ ∈ (0, 1/2).
− Blackmailer requests diminishing amounts over
time, which add to ∞.
− The probability of the victim’s refusal dimin-
ishes at a much faster rate, so the probabil-
ity that the victim stays forever compliant is
strictly positive.
6.231 DYNAMIC PROGRAMMING

LECTURE 19

LECTURE OUTLINE

• We begin a 6-lecture series on approximate DP


for large/intractable problems.
• We will follow an updated/expanded version of
Chapter 6, Vol. 2, which is posted at
http://web.mit.edu/dimitrib/www/dpchapter.html
• In this lecture we classify/overview the main
approaches:
− Rollout/Simulation-based single policy iter-
ation (we will not discuss this further)
− Approximation in value space (approximate
policy iteration, Q-Learning, Bellman error
approach, approximate LP)
− Approximation in policy space (policy para-
metrization, gradient methods)
− Problem approximation (simplification - ag-
gregation - limited lookahead) - we will not
discuss this much (see Vol. I, Ch. 6)
GENERAL ORIENTATION

• We will mainly adopt an n-state discounted


model (the easiest case - but think of HUGE n).
• Extensions to SSP and average cost are possible
(but more quirky). We will discuss them later.
• Other than manual/trial-and-error approaches
(e.g., as in computer chess), the only other ap-
proaches are simulation-based. They are collec-
tively known as “neuro-dynamic programming” or
“reinforcement learning”.
• Simulation is essential for large state spaces
because of its (potential) computational complex-
ity advantage in computing sums/expectations in-
volving a very large number of terms.
• Simulation also comes in handy when an ana-
lytical model of the system is unavailable, but a
simulation/computer model is possible.
• Simulation-based methods are of three types:
− Rollout (we will not discuss further)
− Approximation in value space
− Approximation in policy space
APPROXIMATION IN POLICY SPACE

• A brief discussion; we will return to it at the


end.
• We parameterize the set of policies by a vector
r = (r1 , . . . , rs ) and we optimize the cost over r.
• Discounted problem example:
− Each value of r defines a stationary policy,
with cost starting at state i denoted by Ji (r).
− Use a gradient (or other) method to mini-
mize over r
J̄(r) = Σ_{i=1}^n q(i) Ji(r),


where q(1), . . . , q(n) is some probability dis-
tribution over the states.
• In a special case of this approach, the param-
eterization of the policies is indirect, through an
approximate cost function.
− A cost approximation architecture parame-
terized by r, defines a policy dependent on r
via the minimization in Bellman’s equation.
APPROX. IN VALUE SPACE - APPROACHES

• Policy evaluation/Policy improvement


− Uses simulation algorithms to approximate
the cost Jµ of the current policy µ
• Approximation of the optimal cost function J ∗
− Q-Learning: Use a simulation algorithm to
approximate the optimal costs J ∗ (i) or the
Q-factors
Q∗(i, u) = g(i, u) + α Σ_{j=1}^n pij(u) J∗(j)

− Bellman error approach: Find r to


min_r Ei{ ( J̃(i, r) − (T J̃)(i, r) )^2 }

where Ei {·} is taken with respect to some


distribution
− Approximate LP (discussed earlier - supple-
mented with clever schemes to overcome the
large number of constraints issue)
POLICY EVALUATE/POLICY IMPROVE

• An example

[Block diagram: System Simulator; Least-Squares Optimization producing J̃(j, r); Decision Generator producing the decision µ̄(i) at state i; Cost-to-Go Approximator supplying values J̃(j, r)]

• The “least squares optimization” may be re-


placed by a different algorithm
POLICY EVALUATE/POLICY IMPROVE I

• Approximate the cost of the current policy by


using a simulation method.
− Direct policy evaluation - Cost samples gen-
erated by simulation, and optimization by
least squares
− Indirect policy evaluation - solving the pro-
jected equation Φr = ΠTµ (Φr) where Π is
projection w/ respect to a suitable weighted
Euclidean norm

[Figure: S = subspace spanned by the basis functions. Direct method: projection ΠJµ of the cost vector Jµ on S. Indirect method: solving the projected form Φr = ΠTµ(Φr) of Bellman's equation]

• Batch and incremental methods


• Regular and optimistic policy iteration
POLICY EVALUATE/POLICY IMPROVE II

• Projected equation methods are preferred and


have rich theory
• TD(λ): Stochastic iterative algorithm for solv-
ing Φr = ΠTµ (Φr)
• LSPE(λ): A simulation-based form of projected
value iteration

Φrk+1 = ΠTµ (Φrk ) + simulation noise

[Figure: Projected Value Iteration (PVI) vs. Least Squares Policy Evaluation (LSPE). Both project the value iterate T(Φrk) = g + αPΦrk on S to obtain Φrk+1; LSPE adds simulation error to the projection]

• LSTD(λ): Solves a simulation-based approxi-


mation Φr = Π̂T̂µ (Φr) of the projected equation,
using a linear system solver (e.g., Gaussian elimi-
nation/Matlab)
THEORETICAL BASIS

• If policies are approximately evaluated using an


approximation architecture:

max_i | J̃(i, rk) − Jµk(i) | ≤ δ,   k = 0, 1, . . .

• If policy improvement is also approximate,

max_i | (Tµk+1 J̃)(i, rk) − (T J̃)(i, rk) | ≤ ǫ,   k = 0, 1, . . .

• Error Bound: The sequence {µk} generated


by approximate policy iteration satisfies

lim sup_{k→∞} max_i ( Jµk(i) − J∗(i) ) ≤ (ǫ + 2αδ)/(1 − α)^2

• Typical practical behavior: The method makes


steady progress up to a point and then the iterates
Jµk oscillate within a neighborhood of J ∗ .
SIMULATION-BASED POLICY EVALUATION

• Suppose we can implement in a simulator the


improved policy µ, and want to calculate Jµ by
simulation.
• Generate by simulation sample costs. Then:

Jµ(i) ≈ (1/Mi) Σ_{m=1}^{Mi} c(i, m)

c(i, m) : mth (noisy) sample cost starting from state i

• Approximating well each Jµ (i) is impractical


for a large state space. Instead, a “compact rep-
resentation” J̃µ(i, r) is used, where r is a tunable
parameter vector.
• Direct approach: Calculate an optimal value r∗
of r by a least squares fit

r∗ = arg min_r Σ_{i=1}^n Σ_{m=1}^{Mi} ( c(i, m) − J̃µ(i, r) )^2

• Note that this is much easier when the archi-


tecture is linear - but this is not a requirement.
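• A minimal sketch of this direct least-squares fit for a linear architecture J̃µ(i, r) = φ(i)′r (not from the slides; the feature matrix phi and the list of cost samples are assumed to be supplied by the simulator):

import numpy as np

def direct_fit(phi, samples):
    """
    phi:      (n, s) array; row i is the feature vector phi(i)'.
    samples:  list of pairs (i, c), where c = c(i, m) is one noisy cost sample from state i.
    Returns r* minimizing the sum over all samples of (c(i, m) - phi(i)' r)^2.
    """
    A = np.array([phi[i] for i, _ in samples])        # one feature row per cost sample
    c = np.array([cost for _, cost in samples])       # the corresponding sample costs
    r_star, *_ = np.linalg.lstsq(A, c, rcond=None)    # linear least squares
    return r_star

# Hypothetical usage: two states, features phi(0) = (1, 0), phi(1) = (1, 1).
phi = np.array([[1.0, 0.0], [1.0, 1.0]])
samples = [(0, 4.9), (0, 5.1), (1, 8.0), (1, 8.2)]
print(direct_fit(phi, samples))                       # fits roughly [5.0, 3.1]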
SIMULATION-BASED DIRECT APPROACH

[Block diagram, as on the preceding slide: System Simulator, Decision Generator, Cost-to-Go Approximator, and Least-Squares Optimization producing J̃(j, r)]

• Simulator: Given a state-control pair (i, u), gen-


erates the next state j using system’s transition
probabilities under policy µ currently evaluated
• Decision generator: Generates the control µ(i)
of the evaluated policy at the current state i
• Cost-to-go approximator: J̃(j, r) used by the
decision generator and corresponding to preceding
policy (already evaluated in preceding iteration)
• Least squares optimizer: Uses cost samples c(i, m)
produced by the simulator and solves a least squares
problem to approximate J̃µ(·, r)
• There are several important issues relating to
the design of each block (to be discussed in the
next lecture).
6.231 DYNAMIC PROGRAMMING

LECTURE 20

LECTURE OUTLINE

• Discounted problems - Approximate policy eval-


uation/policy improvement
• Exploration issues
• Direct approach - Least squares
• Batch and incremental gradient methods
• Implementation using TD
• Optimistic policy iteration
POLICY EVALUATE/POLICY IMPROVE

• Focus on policy evaluation: approximate the


cost of the current policy by using a simulation
method.
− Direct policy evaluation - Cost samples gen-
erated by simulation, and optimization by
least squares
− Indirect policy evaluation - solving the pro-
jected equation Φr = ΠTµ (Φr) where Π is
projection w/ respect to a suitable weighted
Euclidean norm

[Figure: S = subspace spanned by the basis functions. Direct method: projection ΠJµ of Jµ on S. Indirect method: solving the projected form Φr = ΠTµ(Φr) of Bellman's equation]
THE ISSUE OF EXPLORATION

• To evaluate a policy µ, we need to generate cost


samples using that policy - this biases the simula-
tion by underrepresenting states that are unlikely
to occur under µ.
• As a result, the cost-to-go estimates of these
underrepresented states may be highly inaccurate.
• This seriously impacts the improved policy µ.
• This is known as inadequate exploration - a par-
ticularly acute difficulty when the randomness em-
bodied in the transition probabilities is “relatively
small” (e.g., a deterministic system).
• One possibility to guarantee adequate explo-
ration: Frequently restart the simulation and en-
sure that the initial states employed form a rich
and representative subset.
• Another possibility: Occasionally generating
transitions that use a randomly selected control
rather than the one dictated by the policy µ.
• Other methods, to be discussed later, use two
Markov chains (one is the chain of the policy and
is used to generate the transition sequence, the
other is used to generate the state sequence).
APPROXIMATING Q-FACTORS

• The approach described so far for policy eval-


uation requires calculating expected values for all
controls u ∈ U (i) (and knowledge of pij (u)).
• Model-free alternative: Approximate Q-factors
Q̃(i, u, r) ≈ Σ_{j=1}^n pij(u) ( g(i, u, j) + αJµ(j) )

and use for policy improvement the minimization


µ(i) = arg min Q̃(i, u, r)
u∈U (i)

• r is an adjustable parameter vector and Q̃(i, u, r)


is a parametric architecture, such as
Q̃(i, u, r) = Σ_{k=1}^m rk φk(i, u)

• Can use any method for constructing cost ap-


proximations, e.g., TD(λ).
• Use the Markov chain with states (i, u) - pij (µ(i))
is the transition prob. to (j, µ(i)), 0 to other (j, u′ ).
• Major concern: Acutely diminished exploration.
DIRECT APPROACH

• Focus on a batch: an N -transition portion


(i0 , . . . , iN ) of a simulated trajectory
• We view the numbers
Σ_{t=k}^{N−1} α^{t−k} g( it, µ(it), it+1 ),   k = 0, . . . , N − 1,

as cost samples, one per initial state i0 , . . . , iN −1


• Least squares problem

min_r (1/2) Σ_{k=0}^{N−1} ( J̃(ik, r) − Σ_{t=k}^{N−1} α^{t−k} g( it, µ(it), it+1 ) )^2

• If J̃(ik, r) is linear in r, this problem can be
solved by matrix inversion.
• If J̃(ik, r) is nonlinear in r, or if the number of
samples is very large, a gradient-type method may
be preferable.
BATCH GRADIENT METHOD

• Gradient iteration
r := r − γ Σ_{k=0}^{N−1} ∇J̃(ik, r) ( J̃(ik, r) − Σ_{t=k}^{N−1} α^{t−k} g( it, µ(it), it+1 ) )
• Important tradeoff:
− In order to reduce simulation error and ob-
tain cost samples for a representatively large
subset of states, we must use a large N
− To keep the work per gradient iteration small,
we must use a small N
• To address the issue of size of N , small batches
may be used and changed after one or more iter-
ations.
• Then the method requires a diminishing stepsize
for convergence.
• This slows down the convergence (which can
generally be very slow for a gradient method).
• Theoretical convergence is guaranteed (with a
diminishing stepsize) under reasonable conditions,
but in practice this is not much of a guarantee.
INCREMENTAL GRADIENT METHOD

• Again focus on an N -transition portion (i0 , . . . , iN )


of a simulated trajectory.
• The batch gradient method processes the N
transitions all at once, and updates r using the
gradient iteration.
• The incremental method updates r a total of N
times, once after each transition.
• After each transition (ik , ik+1 ) it uses only the
portion of the gradient affected by that transition:
− Evaluate the (single-term) gradient ∇J̃(ik, r)
at the current value of r (call it rk ).
− Sum all the terms that involve the transi-
tion (ik , ik+1 ), and update rk by making a
correction along their sum:

rk+1 = rk − γ ( ∇J̃(ik, rk) J̃(ik, rk) − ( Σ_{t=0}^k α^{k−t} ∇J̃(it, rt) ) g( ik, µ(ik), ik+1 ) )
INCREMENTAL GRADIENT - CONTINUED

• After N transitions, all the component gradient


terms of the batch iteration are accumulated.
• BIG difference:
− In the incremental method, r is changed while
processing the batch – the (single-term) gra-
˜ t , r) is evaluated at the most re-
dient ∇J(i
cent value of r [after the transition (it , it+1 )].
− In the batch version these gradients are eval-
uated at the value of r prevailing at the be-
ginning of the batch.
• Because r is updated at intermediate transi-
tions within a batch (rather than at the end of
the batch), the location of the end of the batch
becomes less relevant.
• Can have very long batches - can have a single
very long simulated trajectory and a single batch.
• The incremental version can be implemented
more flexibly, converges much faster in practice.
• Interesting convergence analysis (beyond our
scope - see Bertsekas and Tsitsiklis, NDP book,
also paper in SIAM J. on Optimization, 2000)
TEMPORAL DIFFERENCES - TD(1)

• A mathematically equivalent implementation of


the incremental method.
• It uses temporal difference (TD for short)

qk = g( ik, µ(ik), ik+1 ) + αJ̃(ik+1, r) − J̃(ik, r),   k ≤ N − 2

qN−1 = g( iN−1, µ(iN−1), iN ) − J̃(iN−1, r)

• Following the transition (ik , ik+1 ), set


rk+1 = rk + γk qk Σ_{t=0}^k α^{k−t} ∇J̃(it, rt)

• This algorithm is known as TD(1). In the im-


portant linear case J̃(i, r) = φ(i)′r, it becomes

rk+1 = rk + γk qk Σ_{t=0}^k α^{k−t} φ(it)

• A variant of TD(1) is TD(λ), λ ∈ [0, 1]. It sets


rk+1 = rk + γk qk Σ_{t=0}^k (αλ)^{k−t} φ(it)
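• A minimal sketch of TD(λ) with a linear architecture J̃(i, r) = φ(i)′r for the discounted case, with the inner sum accumulated as an eligibility trace z (not from the slides; the trajectory of transitions is assumed to be generated by simulating the policy µ):

import numpy as np

def td_lambda(phi, trajectory, alpha, lam, r0, step=lambda k: 1.0 / (k + 1)):
    """
    phi:        (n, s) feature matrix; row i is phi(i)'.
    trajectory: list of transitions (i_k, g_k, i_{k+1}) under the evaluated policy.
    Implements r_{k+1} = r_k + gamma_k * q_k * sum_t (alpha*lam)^{k-t} phi(i_t),
    with the sum kept incrementally in the trace z (exact here, since the gradient
    of a linear architecture is phi(i_t) regardless of r).
    """
    r = np.array(r0, dtype=float)
    z = np.zeros_like(r)
    for k, (i, g, j) in enumerate(trajectory):
        z = alpha * lam * z + phi[i]                   # eligibility trace
        q = g + alpha * phi[j] @ r - phi[i] @ r        # temporal difference q_k
        r = r + step(k) * q * z
    return r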
OPTIMISTIC POLICY ITERATION

• We have assumed so far that the least squares


optimization must be solved completely for r.
• An alternative, known as optimistic policy iter-
ation, is to solve this problem approximately and
replace policy µ with the improved policy µ̄ after only a few
simulation samples.
• An extreme possibility is to replace µ with µ̄ at the
end of each state transition: After state transition
(ik , ik+1 ), set
rk+1 = rk + γk qk Σ_{t=0}^k (αλ)^{k−t} ∇J̃(it, rt),

and simulate the next transition (ik+1, ik+2) using µ̄(ik+1),


the control of the new policy.
• For λ = 0, we obtain (the popular) optimistic
TD(0), which has the simple form

rk+1 = rk + γk qk ∇J̃(ik, rk)

• Optimistic policy iteration can exhibit fascinat-


ing and counterintuitive behavior (see the NDP
book by Bertsekas and Tsitsiklis, Section 6.4.2).
6.231 DYNAMIC PROGRAMMING

LECTURE 21

LECTURE OUTLINE

• Discounted problems - Approximate policy eval-


uation/policy improvement
• Indirect approach - The projected equation
• Contraction properties - Error bounds
• Matrix form of the projected equation
• Simulation-based implementation (LSTD)
• Iterative methods (LSPE)
POLICY EVALUATION/POLICY IMPROVEMENT

[Flowchart: Guess initial policy → Approximate policy evaluation (evaluate approximate cost J̃µ(r) = Φr using simulation) → Policy improvement (generate "improved" policy µ̄) → repeat]

• Linear cost function approximation

J̃(r) = Φr

where Φ is full rank n × s matrix with columns


the basis functions, and ith row denoted φ(i)′ .
• Policy “improvement”
µ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + αφ(j)′r )

• Indirect methods find Φr by solving a projected


equation.
WEIGHTED EUCLIDEAN PROJECTIONS

• Consider a weighted Euclidean norm


‖J‖v = ( Σ_{i=1}^n vi J(i)^2 )^{1/2},

where v is a vector of positive weights v1 , . . . , vn .


• Let Π denote the projection operation onto

S = {Φr | r ∈ ℜs }

with respect to this norm, i.e., for any J ∈ ℜn ,

ΠJ = Φr∗

where
r∗ = arg min_{r∈ℜ^s} ‖J − Φr‖v^2
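• A minimal sketch of this projection in matrix form, ΠJ = Φ(Φ′VΦ)^{-1}Φ′VJ with V = diag(v) (not from the slides):

import numpy as np

def weighted_projection(Phi, v, J):
    """Return (Pi J, r*) where r* = argmin_r ||J - Phi r||_v^2 and Pi J = Phi r*."""
    V = np.diag(v)
    r_star = np.linalg.solve(Phi.T @ V @ Phi, Phi.T @ V @ J)   # weighted normal equations
    return Phi @ r_star, r_star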
THE PROJECTED BELLMAN EQUATION

• For a fixed policy µ to be evaluated, consider


the corresponding mapping T :
(T J)(i) = Σ_{j=1}^n pij ( g(i, j) + αJ(j) ),   i = 1, . . . , n,

or more compactly,
T J = g + αP J

• The solution Jµ of Bellman’s equation J = T J


is approximated by the solution of

Φr = ΠT (Φr)

[Figure: T(Φr) is projected on S (the subspace spanned by the basis functions) to give Φr = ΠT(Φr). Indirect method: solving a projected form of Bellman's equation]
KEY QUESTIONS AND RESULTS

• Does the projected equation have a solution?


• Under what conditions is the mapping ΠT a
contraction, so ΠT has unique fixed point?
• Assuming ΠT has unique fixed point Φr∗ , how
close is Φr∗ to Jµ ?
• Assumption: P has a single recurrent class
and no transient states, i.e., it has steady-state
probabilities that are positive

ξj = lim_{N→∞} (1/N) Σ_{k=1}^N P(ik = j | i0 = i) > 0

• Proposition: ΠT is a contraction of modulus


α with respect to the weighted Euclidean norm
‖ · ‖ξ, where ξ = (ξ1, . . . , ξn) is the steady-state
probability vector. The unique fixed point Φr∗ of
ΠT satisfies

‖Jµ − Φr∗‖ξ ≤ (1/√(1 − α^2)) ‖Jµ − ΠJµ‖ξ
ANALYSIS

• Important property of the projection Π on S


with weighted Euclidean norm ‖ · ‖v. For all J ∈
ℜn, J̄ ∈ S, the Pythagorean Theorem holds:

‖J − J̄‖v^2 = ‖J − ΠJ‖v^2 + ‖ΠJ − J̄‖v^2

• Proof: Geometrically, (J − ΠJ) and (ΠJ − J̄)


are orthogonal in the scaled geometry of the norm
‖ · ‖v, where two vectors x, y ∈ ℜn are orthogonal
if Σ_{i=1}^n vi xi yi = 0. Expand the quadratic in the
RHS below:

‖J − J̄‖v^2 = ‖(J − ΠJ) + (ΠJ − J̄)‖v^2

• The Pythagorean Theorem implies that the pro-


jection is nonexpansive, i.e.,

‖ΠJ − ΠJ̄‖v ≤ ‖J − J̄‖v,   for all J, J̄ ∈ ℜn.

To see this, note that

‖Π(J − J̄)‖v^2 ≤ ‖Π(J − J̄)‖v^2 + ‖(I − Π)(J − J̄)‖v^2 = ‖J − J̄‖v^2
PROOF OF CONTRACTION PROPERTY

• Lemma: We have
‖P z‖ξ ≤ ‖z‖ξ,   z ∈ ℜn

• Proof of lemma: Let pij be the components of


P. For all z ∈ ℜn, we have

‖P z‖ξ^2 = Σ_{i=1}^n ξi ( Σ_{j=1}^n pij zj )^2 ≤ Σ_{i=1}^n ξi Σ_{j=1}^n pij zj^2

         = Σ_{j=1}^n ( Σ_{i=1}^n ξi pij ) zj^2 = Σ_{j=1}^n ξj zj^2 = ‖z‖ξ^2,

where the inequality follows from the convexity of


the quadratic function, and the next to last equal-
ity follows from the defining property Σ_{i=1}^n ξi pij =
ξj of the steady-state probabilities.
• Using the lemma, the nonexpansiveness of Π,
and the definition T J = g + αP J, we have

‖ΠT J − ΠT J̄‖ξ ≤ ‖T J − T J̄‖ξ = α‖P (J − J̄)‖ξ ≤ α‖J − J̄‖ξ

for all J, J¯ ∈ ℜn . Hence ΠT is a contraction of


modulus α.
PROOF OF ERROR BOUND

• Let Φr∗ be the fixed point of ΠT . We have

‖Jµ − Φr∗‖ξ ≤ (1/√(1 − α^2)) ‖Jµ − ΠJµ‖ξ.

Proof: We have

‖Jµ − Φr∗‖ξ^2 = ‖Jµ − ΠJµ‖ξ^2 + ‖ΠJµ − Φr∗‖ξ^2

              = ‖Jµ − ΠJµ‖ξ^2 + ‖ΠT Jµ − ΠT (Φr∗)‖ξ^2

              ≤ ‖Jµ − ΠJµ‖ξ^2 + α^2 ‖Jµ − Φr∗‖ξ^2,

where the first equality uses the Pythagorean The-


orem, the second equality holds because Jµ is the
fixed point of T and Φr∗ is the fixed point of ΠT ,
and the inequality uses the contraction property
of ΠT . From this relation, the result follows.

• Note: The factor 1/√(1 − α^2) in the RHS can
be replaced by a factor that is smaller and com-
putable. See
H. Yu and D. P. Bertsekas, “New Error Bounds
for Approximations from Projected Linear Equa-
tions,” Report LIDS-P-2797, MIT, July 2008.
MATRIX FORM OF PROJECTED EQUATION

• Its solution is the vector J = Φr∗ , where r∗


solves the problem

min_{r∈ℜ^s} ‖Φr − (g + αP Φr∗)‖ξ^2.

• Setting to 0 the gradient with respect to r of


this quadratic, we obtain

Φ′Ξ( Φr∗ − (g + αP Φr∗) ) = 0,

where Ξ is the diagonal matrix with the steady-


state probabilities ξ1 , . . . , ξn along the diagonal.
• Equivalently,
Cr∗ = d,
where

C = Φ′ Ξ(I − αP )Φ, d = Φ′ Ξg.


PROJECTED EQUATION: SOLUTION METHODS

• Matrix inversion: r∗ = C^{-1} d
• Projected Value Iteration (PVI) method:

Φrk+1 = ΠT (Φrk ) = Π(g + αP Φrk )


Converges to r∗ because ΠT is a contraction.
[Figure: the value iterate T(Φrk) = g + αPΦrk is projected on S (the subspace spanned by the basis functions) to give Φrk+1]

• PVI can be written as:

rk+1 = arg min_{r∈ℜ^s} ‖Φr − (g + αP Φrk)‖ξ^2

By setting to 0 the gradient with respect to r,



Φ′Ξ( Φrk+1 − (g + αP Φrk) ) = 0,

which yields
rk+1 = rk − (Φ′ΞΦ)^{-1}(Crk − d)
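• A minimal model-based sketch of these two solution methods (not from the slides), assuming P, g, α, Φ, and the steady-state distribution ξ are available explicitly:

import numpy as np

def solve_projected_equation(P, g, alpha, Phi, xi, pvi_iters=200):
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(len(xi)) - alpha * P) @ Phi   # C = Phi' Xi (I - alpha P) Phi
    d = Phi.T @ Xi @ g                                     # d = Phi' Xi g
    r_star = np.linalg.solve(C, d)                         # matrix inversion: r* = C^{-1} d

    D = Phi.T @ Xi @ Phi
    r = np.zeros(Phi.shape[1])
    for _ in range(pvi_iters):                             # PVI: r <- r - D^{-1}(C r - d)
        r = r - np.linalg.solve(D, C @ r - d)
    return r_star, r                                       # the two should agree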
SIMULATION-BASED IMPLEMENTATIONS

• Key idea: Calculate simulation-based approxi-


mations based on k samples

Ck ≈ C, dk ≈ d

• Matrix inversion r∗ = C^{-1} d is approximated


by
r̂k = Ck^{-1} dk
This is the LSTD (Least Squares Temporal Dif-
ferences) Method.
• PVI method rk+1 = rk − (Φ′ΞΦ)^{-1}(Crk − d) is
approximated by

rk+1 = rk − Dk^{-1}(Ck rk − dk)

where
Dk ≈ Φ′ ΞΦ
This is the LSPE (Least Squares Policy Evalua-
tion) Method.
• Key fact: Ck , dk , and Dk can be computed
with low-dimensional linear algebra (of order s;
the number of basis functions).
SIMULATION MECHANICS

• We generate an infinitely long trajectory (i0 , i1 , . . .)


of the Markov chain, so states i and transitions
(i, j) appear with long-term frequencies ξi and pij .
• After generating the transition (it , it+1 ), we
compute the row φ(it )′ of Φ and the cost com-
ponent g(it , it+1 ).
• We form
Ck = (1/(k+1)) Σ_{t=0}^k φ(it) ( φ(it) − αφ(it+1) )′ ≈ Φ′Ξ(I − αP)Φ

dk = (1/(k+1)) Σ_{t=0}^k φ(it) g(it, it+1) ≈ Φ′Ξg

Also in the case of LSPE


Dk = (1/(k+1)) Σ_{t=0}^k φ(it) φ(it)′ ≈ Φ′ΞΦ

• Convergence proof is simple: Use the law of


large numbers.
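• A minimal sketch of these simulation-based estimates computed from a single trajectory (not from the slides; states is the sequence i0, i1, . . . and costs[t] = g(it, it+1)):

import numpy as np

def lstd_lspe(phi, states, costs, alpha, lspe_iters=50):
    s = phi.shape[1]
    C, d, D = np.zeros((s, s)), np.zeros(s), np.zeros((s, s))
    k = len(states) - 1                         # number of transitions
    for t in range(k):
        f, f_next = phi[states[t]], phi[states[t + 1]]
        C += np.outer(f, f - alpha * f_next)    # accumulates phi(i_t)(phi(i_t) - alpha phi(i_{t+1}))'
        d += f * costs[t]                       # accumulates phi(i_t) g(i_t, i_{t+1})
        D += np.outer(f, f)                     # accumulates phi(i_t) phi(i_t)'
    C, d, D = C / k, d / k, D / k               # sample averages C_k, d_k, D_k

    r_lstd = np.linalg.solve(C, d)              # LSTD
    r = np.zeros(s)
    for _ in range(lspe_iters):                 # LSPE: r <- r - D_k^{-1}(C_k r - d_k)
        r = r - np.linalg.solve(D, C @ r - d)
    return r_lstd, r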
6.231 DYNAMIC PROGRAMMING

LECTURE 22

LECTURE OUTLINE

• More on projected equation approach


• Q-Learning
• Q-Learning with Function Approximation
• Aggregation
REVIEW: PROJECTED BELLMAN EQUATION

• For a fixed policy µ to be evaluated, consider


the corresponding mapping T :
(T J)(i) = Σ_{j=1}^n pij ( g(i, j) + αJ(j) ),   i = 1, . . . , n,

or more compactly,
T J = g + αP J

• The solution Jµ of Bellman’s equation J = T J


is approximated by the solution of

Φr = ΠT (Φr)

[Figure: T(Φr) is projected on S (the subspace spanned by the basis functions) to give Φr = ΠT(Φr). Indirect method: solving a projected form of Bellman's equation]
MULTISTEP METHODS

• Introduce a multistep version of Bellman’s equa-


tion J = T (λ) J, where for λ ∈ [0, 1),
T (λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}
• Note that T^t is a contraction with modulus
α^t, with respect to the weighted Euclidean norm
‖·‖ξ, where ξ is the steady-state probability vector
of the Markov chain.
• From this it follows that T (λ) is a contraction
with modulus

αλ = (1 − λ) Σ_{t=0}^∞ α^{t+1} λ^t = α(1 − λ)/(1 − αλ)
• T^t and T (λ) have the same fixed point Jµ and

‖Jµ − Φrλ∗‖ξ ≤ (1/√(1 − αλ^2)) ‖Jµ − ΠJµ‖ξ

where Φrλ∗ is the fixed point of ΠT (λ) .


• The fixed point Φrλ∗ depends on λ.
• Note that αλ ↓ 0 as λ ↑ 1, so error bound (and
the quality of approximation) improves as λ ↑ 1.
MULTISTEP PROJECTED EQUATION

• The projected Bellman equation is


Φr = ΠT (λ) (Φr)
• In matrix form: C (λ) r = d(λ) , where

C (λ) = Φ′Ξ( I − αP (λ) )Φ,   d(λ) = Φ′Ξg (λ),

with

P (λ) = (1 − λ) Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^{ℓ+1},   g (λ) = Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^ℓ g

• The LSTD(λ) method is


( Ck(λ) )^{-1} dk(λ),

where Ck(λ) and dk(λ) are simulation-based approx-
imations of C (λ) and d(λ).
• The LSPE(λ) method is

rk+1 = rk − γ Dk^{-1} ( Ck(λ) rk − dk(λ) )

where Dk is a simulation-based approx. to Φ′ ΞΦ


• TD(λ) is a simplified version of LSPE(λ) where
Dk = I, and simpler approximations to C and d
are used. It is more properly viewed as a stochas-
tic iteration for solving the projected equation.
MORE ON MULTISTEP METHODS

• The simulation process to obtain Ck(λ) and dk(λ)
is similar to the case λ = 0 (single simulation tra-
jectory i0 , i1 , . . ., but the formulas are more com-
plicated)

Ck(λ) = (1/(k+1)) Σ_{t=0}^k φ(it) Σ_{m=t}^k α^{m−t} λ^{m−t} ( φ(im) − αφ(im+1) )′

dk(λ) = (1/(k+1)) Σ_{t=0}^k φ(it) Σ_{m=t}^k α^{m−t} λ^{m−t} g_{im}

• Note the λ-tradeoff:


− As λ ↑ 1, Ck(λ) and dk(λ) contain more “sim-
ulation noise”, so more samples are needed
for a close approximation of rλ , the solution
of the projected equation
− The error bound ‖Jµ − Φrλ‖ξ becomes
smaller
• In the context of approximate policy iteration,
we can use optimistic versions (few samples be-
tween policy updates). Also exploration is always
an issue.
• Many different versions ... see on-line chapter.
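• A minimal sketch of forming Ck(λ) and dk(λ) in a single pass, using the trace zm = Σ_{t≤m} (αλ)^{m−t} φ(it) to handle the double sum (not from the slides; taking costs[m] as the cost incurred at the m-th transition is an assumption of this sketch):

import numpy as np

def lstd_lambda(phi, states, costs, alpha, lam):
    s = phi.shape[1]
    C, d = np.zeros((s, s)), np.zeros(s)
    z = np.zeros(s)                                   # z_m = sum_{t<=m} (alpha*lam)^{m-t} phi(i_t)
    k = len(states) - 1
    for m in range(k):
        z = alpha * lam * z + phi[states[m]]
        C += np.outer(z, phi[states[m]] - alpha * phi[states[m + 1]])
        d += z * costs[m]
    return np.linalg.solve(C / k, d / k)              # LSTD(lambda) estimate of r_lambda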
Q-LEARNING I

• Q-learning has two motivations:


− Dealing with multiple policies simultaneously
− Using a model-free approach [no need to know
pij (u), only be able to simulate them]
• The Q-factors are defined by
Q∗(i, u) = Σ_{j=1}^n pij(u) ( g(i, u, j) + αJ∗(j) ),   ∀ (i, u)

• Since J ∗ = T J ∗ , we have J ∗ (i) = minu∈U (i) Q∗ (i, u)


so the Q factors solve the equation
Q∗(i, u) = Σ_{j=1}^n pij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q∗(j, u′) )

• Q∗ (i, u) can be shown to be the unique solu-


tion of this equation. Reason: This is Bellman’s
equation for a system whose states are the original
states 1, . . . , n, together with all the pairs (i, u).
• Value iteration: For all (i, u)
Q(i, u) := Σ_{j=1}^n pij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q(j, u′) )
Q-LEARNING II

• Use any probabilistic mechanism to select se-


quence of pairs (ik , uk ) [all pairs (i, u) are chosen
infinitely often], and for each k, select jk according
to pik j (uk ).
• At each k, Q-learning algorithm updates Q(ik , uk )
according to

Q(ik, uk) := ( 1 − γk(ik, uk) ) Q(ik, uk)
             + γk(ik, uk) ( g(ik, uk, jk) + α min_{u′∈U(jk)} Q(jk, u′) )

• Stepsize γk (ik , uk ) must converge to 0 at proper


rate (e.g., like 1/k).
• Important mathematical point: In the Q-
factor version of Bellman’s equation the order of
expectation and minimization is reversed relative
to the ordinary cost version of Bellman’s equation:
J∗(i) = min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + αJ∗(j) )

• Q-learning can be shown to converge to true/exact


Q-factors (a sophisticated proof).
• Major drawback: The large number of pairs
(i, u) - no function approximation is used.
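• A minimal tabular Q-learning sketch (not from the slides); the simulator sample_next(i, u) -> (j, g) and the control sets U(i) (returning a list) are assumptions of this sketch:

import random

def q_learning(n, U, sample_next, alpha, num_iters, seed=0):
    rng = random.Random(seed)
    Q = {(i, u): 0.0 for i in range(n) for u in U(i)}
    visits = {}
    for _ in range(num_iters):
        i = rng.randrange(n)                          # choose pairs so that every (i, u)
        u = rng.choice(U(i))                          # is tried infinitely often
        j, g = sample_next(i, u)                      # simulate j_k according to p_{ij}(u)
        visits[(i, u)] = visits.get((i, u), 0) + 1
        gamma = 1.0 / visits[(i, u)]                  # stepsize -> 0 at a 1/k rate
        target = g + alpha * min(Q[(j, v)] for v in U(j))
        Q[(i, u)] = (1 - gamma) * Q[(i, u)] + gamma * target
    return Q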
Q-FACTOR APPROXIMATIONS

• Introduce basis function approximation for Q-


factors:
Q̃(i, u, r) = φ(i, u)′ r
• We can use approximate policy iteration and
LSPE/LSTD/TD for policy evaluation (exploration
issue is acute)
• Optimistic policy iteration methods are fre-
quently used on a heuristic basis.
• Example: Generate infinite trajectory {(ik , uk ) |
k = 0, 1, . . .}
• At iteration k, given rk and state/control (ik , uk ):
(1) Simulate next transition (ik , ik+1 ) using the
transition probabilities pik j (uk ).
(2) Generate control uk+1 from

uk+1 = arg min_{u∈U(ik+1)} Q̃(ik+1, u, rk)

(3) Update the parameter vector via

rk+1 = rk − (LSPE or TD-like correction)

• Unclear validity. There is a solid basis for an


important but very special case: optimal stopping
(see the on-line chapter)
PROBLEM APPROXIMATION - AGGREGATION

• Another major idea in ADP is to approximate


the cost-to-go function of the problem with the
cost-to-go function of a simpler problem. The sim-
plification is often ad-hoc/problem dependent.
• Aggregation is a (semi-)systematic approach for
problem approximation. Main elements:
− Introduce a few “aggregate” states, viewed
as the states of an “aggregate” system
− Define transition probabilities and costs of
the aggregate system, by associating multi-
ple states of the original system with each
aggregate state
− Solve (exactly or approximately) the “ag-
gregate” problem by any kind of value or pol-
icy iteration method (including simulation-
based methods, such as Q-learning)
− Use the optimal cost of the aggregate prob-
lem to approximate the optimal cost of the
original problem
• Example (Hard Aggregation): We are given
a partition of the state space into subsets of states,
and each subset is viewed as an aggregate state
(each state belongs to one and only one subset).
AGGREGATION/DISAGGREGATION PROBS

• The aggregate system transition probabilities


are defined via two (somewhat arbitrary) choices:
• For each aggregate state x and original system
state i, the disaggregation probability dxi
− This may be roughly interpreted as the “de-
gree to which i is representative of x.”
− In the hard aggregation example (assuming
all states that belong to aggregate state/subset
x are “equally representative”) dxi = 1/|x|
for each state i that belongs to aggregate
state/subset x, where |x| is the cardinality
(number of states) of x.
• For each original system state j and aggregate
state y, the aggregation probability φjy
− This may be roughly interpreted as the “de-
gree of membership of j in the aggregate
state y.”
− In the hard aggregation example, φjy = 1 if
state j belongs to aggregate state/subset y.
[Figure: original system states i, j (transitions according to pij(u), with cost) and aggregate states x, y, linked by disaggregation probabilities dxi and aggregation probabilities φjy; the aggregate transition probabilities are p̂xy(u) = Σ_{i=1}^n Σ_{j=1}^n dxi pij(u) φjy]
AGGREGATE TRANSITION PROBABILITIES

• The transition probability from aggregate state


x to aggregate state y under control u
p̂xy(u) = Σ_{i=1}^n Σ_{j=1}^n dxi pij(u) φjy

• The expected transition cost is


ĝ(x, u) = Σ_{i=1}^n Σ_{j=1}^n dxi pij(u) g(i, u, j)

• The optimal cost function of the aggregate prob-


lem, denoted Ĵ, is obtained as the unique solution
of Bellman’s equation

Ĵ(x) = min_{u∈U} [ ĝ(x, u) + α Σ_y p̂xy(u) Ĵ(y) ],   ∀ x

• The optimal cost function J ∗ of the original


problem is approximated by J̃ given by

J̃(j) = Σ_y φjy Ĵ(y),   ∀ j
Equivalent Form (Similar to Projected Equation):
J̃ = Φr, where r = DT (Φr), and the rows of D
and Φ are the disaggr. and aggr. probs.
• May solve aggregate problem by Q-learning or
approximate policy iteration with policy evalua-
tion step done by aggregation
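• A minimal hard-aggregation sketch (not from the slides), assuming the original model is given as dicts p[u], g[u] of (n × n) matrices and the partition is a list of subsets of states:

import numpy as np

def hard_aggregation(p, g, partition, alpha, iters=1000):
    n = sum(len(x) for x in partition)
    m = len(partition)
    D = np.zeros((m, n))                       # disaggregation probs d_xi = 1/|x|
    Phi = np.zeros((n, m))                     # aggregation probs phi_jy in {0, 1}
    for x, subset in enumerate(partition):
        for i in subset:
            D[x, i] = 1.0 / len(subset)
            Phi[i, x] = 1.0

    controls = list(p.keys())
    p_hat = {u: D @ p[u] @ Phi for u in controls}                    # aggregate transition probs
    g_hat = {u: D @ np.sum(p[u] * g[u], axis=1) for u in controls}   # aggregate expected costs

    J_hat = np.zeros(m)
    for _ in range(iters):                     # value iteration on the aggregate problem
        J_hat = np.min([g_hat[u] + alpha * p_hat[u] @ J_hat for u in controls], axis=0)
    return Phi @ J_hat                         # J_tilde(j) = sum_y phi_jy J_hat(y)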
AGGREGATION EXAMPLES

• Hard aggregation (each original system state


is associated with one aggregate state)
• Soft aggregation (each original system state
is associated with multiple aggregate states)
• Coarse grid (each aggregate state is an original
system state)
− A coarse grid is well-suited for continuous
space problems
− Important example is POMDP, where the
state space of the sufficient statistic is the
space of beliefs (a continuous space)
[Figure: original system states and aggregate states linked by disaggregation probabilities dxi and aggregation probabilities φjy, as on the preceding slides]
6.231 DYNAMIC PROGRAMMING

LECTURE 23

LECTURE OUTLINE

• More on projected equation methods/policy


evaluation
• Stochastic shortest path problems
• Average cost problems
• Generalization - Two Markov Chain methods
• LSTD-like methods - Use to enhance explo-
ration
REVIEW: PROJECTED BELLMAN EQUATION

• For fixed policy µ to be evaluated, the solution


of Bellman’s equation J = T J is approximated by
the solution of

Φr = ΠT (Φr)

whose solution is in turn obtained using a simulation-


based method such as LSPE(λ), LSTD(λ), or TD(λ).

[Figure: T(Φr) is projected on S (the subspace spanned by the basis functions) to give Φr = ΠT(Φr). Indirect method: solving a projected form of Bellman's equation]

• These ideas apply to other (linear) Bellman


equations, e.g., for SSP and average cost.
• Key Issue: Construct framework where ΠT [or
at least ΠT (λ) ] is a contraction.
STOCHASTIC SHORTEST PATHS

• Introduce approximation subspace


S = {Φr | r ∈ ℜs }
and for a given proper policy, Bellman’s equation
and its projected version
J = T J = g + P J, Φr = ΠT (Φr)

Also its λ-version



Φr = ΠT (λ)(Φr),   T (λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}
• Question: What should be the norm of pro-
jection?
• Speculation based on discounted case: It
should be a weighted Euclidean norm with weight
vector ξ = (ξ1 , . . . , ξn ), where ξi should be some
type of long-term occupancy probability of state i
(which can be generated by simulation).
• But what does “long-term occupancy probabil-
ity of a state” mean in the SSP context?
• How do we generate infinite length trajectories
given that termination occurs with prob. 1?
SIMULATION TRAJECTORIES FOR SSP

• We envision simulation of trajectories up to


termination, followed by restart at state i with
some fixed probabilities q0 (i) > 0.
• Then the “long-term occupancy probability of
a state” of i is proportional to
q(i) = Σ_{t=0}^∞ qt(i),   i = 1, . . . , n,

where
qt (i) = P (it = i), i = 1, . . . , n, t = 0, 1, . . .
• We use the projection norm
‖J‖q = ( Σ_{i=1}^n q(i) J(i)^2 )^{1/2}

[Note that 0 < q(i) < ∞, but q is not a prob.


distribution. ]
• We can show that ΠT (λ) is a contraction with
respect to ‖ · ‖q (see the next slide).
• LSTD(λ), LSPE(λ), and TD(λ) are possible.
CONTRACTION PROPERTY FOR SSP
• We have q = Σ_{t=0}^∞ qt so

q′P = Σ_{t=0}^∞ qt′ P = Σ_{t=1}^∞ qt′ = q′ − q0′

or

Σ_{i=1}^n q(i) pij = q(j) − q0(j),   ∀ j
• To verify that ΠT is a contraction, we show
that there exists β < 1 such that ‖P z‖q^2 ≤ β‖z‖q^2
for all z ∈ ℜn .
• For all z ∈ ℜn, we have

‖P z‖q^2 = Σ_{i=1}^n q(i) ( Σ_{j=1}^n pij zj )^2 ≤ Σ_{i=1}^n q(i) Σ_{j=1}^n pij zj^2

         = Σ_{j=1}^n ( Σ_{i=1}^n q(i) pij ) zj^2 = Σ_{j=1}^n ( q(j) − q0(j) ) zj^2

         = ‖z‖q^2 − ‖z‖q0^2 ≤ β‖z‖q^2

where
β = 1 − min_j q0(j)/q(j)
AVERAGE COST PROBLEMS

• Consider a single policy to be evaluated, with


single recurrent class, no transient states, and steady-
state probability vector ξ = (ξ1 , . . . , ξn ).
• The average cost, denoted by η, is independent
of the initial state
η = lim_{N→∞} (1/N) E{ Σ_{k=0}^{N−1} g( xk, xk+1 ) | x0 = i },   ∀ i
• Bellman’s equation is J = F J with

F J = g − ηe + P J

where e is the unit vector e = (1, . . . , 1).


• The projected equation and its λ-version are

Φr = ΠF (Φr), Φr = ΠF (λ) (Φr)


• A problem here is that F is not a contraction
with respect to any norm (since e = P e).
• However, ΠF (λ) turns out to be a contraction
with respect to ‖ · ‖ξ assuming that e does not be-
long to S and λ > 0 [the case λ = 0 is exceptional,
but can be handled - see the text].
• LSTD(λ), LSPE(λ), and TD(λ) are possible.
GENERALIZATION/UNIFICATION

• Consider approximate solution of x = T (x),


where

T (x) = Ax + b, A is n × n, b ∈ ℜn

by solving the projected equation y = ΠT (y),


where Π is projection on a subspace of basis func-
tions (with respect to some Euclidean norm).
• We will generalize from DP to the case where
A is arbitrary, subject only to

I − ΠA : invertible

• Benefits of generalization:
− Unification/higher perspective for TD meth-
ods in approximate DP
− An extension to a broad new area of applica-
tions, where a DP perspective may be help-
ful
• Challenge: Dealing with less structure
− Lack of contraction
− Absence of a Markov chain
GENERALIZED PROJECTED EQUATION

• Let Π be projection with respect to


‖x‖ξ = ( Σ_{i=1}^n ξi xi^2 )^{1/2}

where ξ ∈ ℜn is a probability distribution with


positive components.
• If r∗ is the solution of the projected equation,
we have Φr∗ = Π(AΦr∗ + b) or

r∗ = arg min_{r∈ℜ^s} Σ_{i=1}^n ξi ( φ(i)′r − Σ_{j=1}^n aij φ(j)′r∗ − bi )^2

where φ(i)′ denotes the ith row of the matrix Φ.


• Optimality condition/equivalent form:

Σ_{i=1}^n ξi φ(i) ( φ(i) − Σ_{j=1}^n aij φ(j) )′ r∗ = Σ_{i=1}^n ξi φ(i) bi

• The two expected values can be approximated


by simulation.
SIMULATION MECHANISM

[Figure: row sampling generates the sequence i0, i1, . . . , ik, . . . according to ξ (may use a Markov chain Q); column sampling generates the transitions (i0, j0), (i1, j1), . . . , (ik, jk), . . . according to a Markov chain P]

• Row sampling: Generate sequence {i0 , i1 , . . .}


according to ξ, i.e., relative frequency of each row
i is ξi

• Column sampling: Generate (i0 , j0 ), (i1 , j1 ), . . .
according to some transition probability matrix P
with
pij > 0 if aij ≠ 0,
i.e., for each i, the relative frequency of (i, j) is pij
• Row sampling may be done using a Markov
chain with transition matrix Q (unrelated to P )
• Row sampling may also be done without a
Markov chain - just sample rows according to some
known distribution ξ (e.g., a uniform)
ROW AND COLUMN SAMPLING

[Figure: row sampling according to ξ (may use a Markov chain Q) and column sampling according to a Markov chain P ∼ |A|, generating the sequences i0, i1, . . . and (i0, j0), (i1, j1), . . . used to form the projected equation on the subspace S]

• Row sampling ∼ State Sequence Generation in


DP. Affects:
− The projection norm.
− Whether ΠA is a contraction.
• Column sampling ∼ Transition Sequence Gen-
eration in DP.
− Can be totally unrelated to row sampling.
Affects the sampling/simulation error.
− “Matching” P with |A| is beneficial (has an
effect like in importance sampling).
• Independent row and column sampling allows
exploration at will! Resolves the exploration prob-
lem that is critical in approximate policy iteration.
LSTD-LIKE METHOD

• Optimality condition/equivalent form of pro-


jected equation

Σ_{i=1}^n ξi φ(i) ( φ(i) − Σ_{j=1}^n aij φ(j) )′ r∗ = Σ_{i=1}^n ξi φ(i) bi

• The two expected values are approximated by


row and column sampling (batch 0 → t).
• We solve the linear equation

Σ_{k=0}^t φ(ik) ( φ(ik) − (aik jk / pik jk) φ(jk) )′ rt = Σ_{k=0}^t φ(ik) bik
• We have rt → r∗ , regardless of ΠA being a con-
traction (by law of large numbers; see next slide).
• An LSPE-like method is also possible, but re-
quires that ΠA is a contraction.
• Under the assumption Σ_{j=1}^n |aij| ≤ 1 for all i,
there are conditions that guarantee contraction of
ΠA; see the paper by Bertsekas and Yu,“Projected
Equation Methods for Approximate Solution of
Large Linear Systems,” 2009, or the expanded ver-
sion of Chapter 6, Vol. 2.
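• A minimal sketch of this simulation-based solver for x = Ax + b (not from the slides), assuming A, b, Φ, the row distribution ξ, and the column-sampling matrix P are available explicitly so the samples can be drawn directly:

import numpy as np

def generalized_lstd(Phi, A, b, xi, P, num_samples, seed=0):
    rng = np.random.default_rng(seed)
    n, s = Phi.shape
    M, v = np.zeros((s, s)), np.zeros(s)
    for _ in range(num_samples):
        i = rng.choice(n, p=xi)                # row sampling according to xi
        j = rng.choice(n, p=P[i])              # column sampling according to P(i, .)
        M += np.outer(Phi[i], Phi[i] - (A[i, j] / P[i, j]) * Phi[j])
        v += Phi[i] * b[i]
    return np.linalg.solve(M, v)               # r_t -> r*, the solution of Cr = d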
JUSTIFICATION W/ LAW OF LARGE NUMBERS

• We will match terms in the exact optimality


condition and the simulation-based version.
• Let ξ̂i^t be the relative frequency of i in row
sampling up to time t.
• We have
(1/(t+1)) Σ_{k=0}^t φ(ik) φ(ik)′ = Σ_{i=1}^n ξ̂i^t φ(i) φ(i)′ ≈ Σ_{i=1}^n ξi φ(i) φ(i)′

(1/(t+1)) Σ_{k=0}^t φ(ik) bik = Σ_{i=1}^n ξ̂i^t φ(i) bi ≈ Σ_{i=1}^n ξi φ(i) bi

• Let p̂ij^t be the relative frequency of (i, j) in


column sampling up to time t.

(1/(t+1)) Σ_{k=0}^t (aik jk / pik jk) φ(ik) φ(jk)′

    = Σ_{i=1}^n Σ_{j=1}^n ξ̂i^t p̂ij^t (aij / pij) φ(i) φ(j)′

    ≈ Σ_{i=1}^n Σ_{j=1}^n ξi aij φ(i) φ(j)′
6.231 DYNAMIC PROGRAMMING

LECTURE 24

LECTURE OUTLINE

• Additional topics in ADP


• Nonlinear versions of the projected equation
• Extension of Q-learning for optimal stopping
• Basis function adaptation
• Gradient-based approximation in policy space
NONLINEAR EXTENSIONS OF PROJECTED EQ.

• If the mapping T is nonlinear (as for exam-


ple in the case of multiple policies) the projected
equation Φr = ΠT (Φr) is also nonlinear.
• Any solution r∗ satisfies

r∗ ∈ arg min_{r∈ℜ^s} ‖Φr − T (Φr∗)‖^2

or equivalently

Φ′( Φr∗ − T (Φr∗) ) = 0

This is a nonlinear equation, which may have one


or many solutions, or no solution at all.
• If ΠT is a contraction, then there is a unique
solution that can be obtained (in principle) by the
fixed point iteration
Φrk+1 = ΠT (Φrk )
• We have seen a nonlinear special case of pro-
jected value iteration/LSPE where ΠT is a con-
traction, namely optimal stopping.
• This case can be generalized.
LSPE FOR OPTIMAL STOPPING EXTENDED

• Consider a system of the form


x = T (x) = Af (x) + b,
where f : ℜn → ℜn is a mapping with scalar com-
ponents of the form f(x) = ( f1(x1), . . . , fn(xn) ).
• Assume that each fi : ℜ → ℜ is nonexpansive:

| fi(xi) − fi(x̄i) | ≤ |xi − x̄i|,   ∀ i, xi, x̄i ∈ ℜ


This guarantees that T is a contraction with re-
spect to any weighted Euclidean norm k · kξ when-
ever A is a contraction with respect to that norm.
• Algorithms similar to LSPE [approximating
Φrk+1 = ΠT (Φrk )] are then possible.
• Special case: In the optimal stopping problem
of Section 6.4, x is the Q-factor corresponding to
the continuation action, α ∈ (0, 1) is a discount
factor, fi (xi ) = min{ci , xi }, and A = αP , where
P is the transition matrix for continuing.
• If Σ_{j=1}^n pij < 1 for some state i, and 0 ≤ P ≤
Q, where Q is an irreducible transition matrix,
then Π((1−γ)I +γT ) is a contraction with respect
to ‖ · ‖ξ for all γ ∈ (0, 1), even with α = 1.
BASIS FUNCTION ADAPTATION I

• An important issue in ADP is how to select


basis functions.
• A possible approach is to introduce basis func-
tions that are parametrized by a vector θ, and
optimize over θ, i.e., solve the problem

min_{θ∈Θ} F( J̃(θ) )

where J̃(θ) is the solution of the projected equa-
tion.
• One example is

F( J̃(θ) ) = ‖ J̃(θ) − T( J̃(θ) ) ‖^2

• Another example is

F( J̃(θ) ) = Σ_{i∈I} | J(i) − J̃(θ)(i) |^2,

where I is a subset of states, and J(i), i ∈ I, are


the costs of the policy at these states calculated
directly by simulation.
BASIS FUNCTION ADAPTATION II

• Some algorithm may be used to minimize F( J̃(θ) )
over θ.
• A challenge here is that the algorithm should
use low-dimensional calculations.
• One possibility is to use a form of random search
(the cross-entropy method); see the paper by Men-
ache, Mannor, and Shimkin (Annals of Oper. Res.,
Vol. 134, 2005)
• Another possibility is to use a gradient method.
For this it is necessary to estimate the partial
derivatives of J̃(θ) with respect to the components
of θ.
• It turns out that by differentiating the pro-
jected equation, these partial derivatives can be
calculated using low-dimensional operations. See
the paper by Menache, Mannor, and Shimkin, and
a recent paper by Yu and Bertsekas (2009).
APPROXIMATION IN POLICY SPACE I

• Consider an average cost problem, where the


problem data are parametrized by a vector r, i.e.,
a cost vector g(r), transition probability matrix
P (r). Let η(r) be the (scalar) average cost per
stage, satisfying Bellman’s equation

η(r)e + h(r) = g(r) + P (r)h(r)

where h(r) is the corresponding differential cost


vector.
• Consider minimizing η(r) over r (here the data
dependence on control is encoded in the parametriza-
tion). We can try to solve the problem by nonlin-
ear programming/gradient descent methods.
• Important fact: If ∆η is the change in η due
to a small change ∆r from a given r, we have
∆η = ξ ′ (∆g + ∆P h),
where ξ is the steady-state probability distribu-
tion/vector corresponding to P (r), and all the quan-
tities above are evaluated at r:
∆η = η(r + ∆r) − η(r),

∆g = g(r+∆r)−g(r), ∆P = P (r+∆r)−P (r)


APPROXIMATION IN POLICY SPACE II

• Proof of the gradient formula: We have,


by “differentiating” Bellman’s equation,

∆η(r)·e+∆h(r) = ∆g(r)+∆P (r)h(r)+P (r)∆h(r)

By left-multiplying with ξ′,

ξ′∆η(r)·e + ξ′∆h(r) = ξ′( ∆g(r) + ∆P(r)h(r) ) + ξ′P(r)∆h(r)

Since ξ ′ ∆η(r) · e = ∆η(r) and ξ ′ = ξ ′ P (r), this


equation simplifies to

∆η = ξ ′ (∆g + ∆P h)
• Since we don’t know ξ, we cannot implement a
gradient-like method for minimizing η(r). An al-
ternative is to use “sampled gradients”, i.e., gener-
ate a simulation trajectory (i0 , i1 , . . .), and change
r once in a while, in the direction of a simulation-
based estimate of ξ ′ (∆g + ∆P h).
• There is much recent research on this subject,
see e.g., the work of Marbach and Tsitsiklis, and
Konda and Tsitsiklis, and the refs given there.
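• A small numerical check of the formula ∆η = ξ′(∆g + ∆P h) (not from the slides); the 2-state parametrization P(r), g(r) below is made up purely for illustration, and η, h are computed exactly by solving the average cost Bellman equation with the normalization h(n−1) = 0:

import numpy as np

def avg_cost_and_bias(P, g):
    """Solve eta*e + h = g + P h with the normalization h[-1] = 0; return (eta, h)."""
    n = len(g)
    A = np.zeros((n, n))                       # unknowns: [eta, h_0, ..., h_{n-2}]
    for i in range(n):
        A[i, 0] = 1.0
        for j in range(n - 1):
            A[i, 1 + j] = (1.0 if i == j else 0.0) - P[i, j]
    sol = np.linalg.solve(A, np.asarray(g, dtype=float))
    return sol[0], np.append(sol[1:], 0.0)

def stationary_dist(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a probability vector."""
    w, V = np.linalg.eig(P.T)
    xi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return xi / xi.sum()

# Hypothetical parametrization: r controls a switching probability and the stage costs.
P_of = lambda r: np.array([[1 - r, r], [0.5, 0.5]])
g_of = lambda r: np.array([r**2, 1 - r])

r, dr = 0.3, 1e-5
eta, h = avg_cost_and_bias(P_of(r), g_of(r))
xi = stationary_dist(P_of(r))
d_eta = avg_cost_and_bias(P_of(r + dr), g_of(r + dr))[0] - eta
formula = xi @ ((g_of(r + dr) - g_of(r)) + (P_of(r + dr) - P_of(r)) @ h)
print(f"direct Delta eta = {d_eta:.3e},   xi'(Delta g + Delta P h) = {formula:.3e}")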
6.231 DYNAMIC PROGRAMMING

OVERVIEW-EPILOGUE

LECTURE OUTLINE

• Finite horizon problems


− Deterministic vs Stochastic
− Perfect vs Imperfect State Info
• Infinite horizon problems
− Stochastic shortest path problems
− Discounted problems
− Average cost problems
FINITE HORIZON PROBLEMS - ANALYSIS

• Perfect state info


− A general formulation - Basic problem, DP
algorithm
− A few nice problems admit analytical solu-
tion
• Imperfect state info
− Sufficient statistics - Reduction to perfect
state info
− Very few nice problems that admit analytical
solution
− Finite-state problems admit reformulation as
perfect state info problems whose states are
prob. distributions (the belief vectors)
FINITE HORIZON PROBS - EXACT COMP. SOL.

• Deterministic finite-state problems


− Equivalent to shortest path
− A wealth of fast algorithms
− Hard combinatorial problems are a special
case (but # of states grows exponentially)
• Stochastic perfect state info problems
− The DP algorithm is the only choice
− Curse of dimensionality is big bottleneck
• Imperfect state info problems
− Forget it!
− Only trivial examples admit an exact com-
putational solution
FINITE HORIZON PROBS - APPROX. SOL.

• Many techniques (and combinations thereof) to


choose from
• Simplification approaches
− Certainty equivalence
− Problem simplification
− Rolling horizon
− Aggregation - Coarse grid discretization
• Limited lookahead combined with:
− Rollout
− MPC (an important special case)
− Feature-based cost function approximation
• Approximation in policy space
− Gradient methods
− Random search
INFINITE HORIZON PROBLEMS - ANALYSIS

• A more extensive theory


• Bellman’s equation
• Optimality conditions
• Contraction mappings
• A few nice problems admit analytical solution
• Idiosyncrasies of average cost problems
• Idiosyncrasies of problems with no underlying
contraction
• Elegant analysis
INF. HORIZON PROBS - EXACT COMP. SOL.

• Value iteration
− Variations (asynchronous, modified, etc)
• Policy iteration
− Variations (asynchronous, based on value it-
eration, etc)
• Linear programming
• Elegant algorithmic analysis
• Curse of dimensionality is major bottleneck
INFINITE HORIZON PROBS - ADP

• Approximation in value space (over a subspace


of basis functions)
• Approximate policy evaluation
− Direct methods
− Indirect methods (projected equation meth-
ods)
− Aggregation methods - Coarse grid discretiza-
tion
• Q-Learning
− Exact Q-factor computation by simulation
− Approximate Q-factor computation by sim-
ulation
− Projected equation methods for optimal stop-
ping
− Aggregation-based Q-learning
• Approximate LP
• Approximation in policy space
− Gradient methods
− Random search
