
LECTURE SLIDES - DYNAMIC PROGRAMMING

BASED ON LECTURES GIVEN AT THE

MASSACHUSETTS INST. OF TECHNOLOGY

CAMBRIDGE, MASS

FALL 2015

DIMITRI P. BERTSEKAS

These lecture slides are based on the two-volume book: “Dynamic Programming and
Optimal Control,” Athena Scientific, by D. P. Bertsekas (Vol. I, 3rd Edition, 2005;
Vol. II, 4th Edition, 2012); see http://www.athenasc.com/dpbook.html
Two related reference books:
(1) “Abstract Dynamic Programming,” by
D. P. Bertsekas, Athena Scientific, 2013
(2) “Neuro-Dynamic Programming,” Athena
Scientific, by D. P. Bertsekas and J. N.
Tsitsiklis, 1996
Athena is MIT's UNIX-based computing environment. OCW does not provide access to it.
1
6.231: DYNAMIC PROGRAMMING

LECTURE 1

LECTURE OUTLINE

• Problem Formulation
• Examples
• The Basic Problem
• Significance of Feedback

2
DP AS AN OPTIMIZATION METHODOLOGY

• Generic optimization problem:

min g(u)
u∈U

where u is the optimization/decision variable, g(u)


is the cost function, and U is the constraint set
• Categories of problems:
− Discrete (U is finite) or continuous
− Linear (g is linear and U is polyhedral) or
nonlinear
− Stochastic or deterministic: In stochastic prob-
lems the cost involves a stochastic parameter
w, which is averaged, i.e., it has the form

g(u) = E_w{ G(u, w) }

where w is a random parameter.


• DP can deal with complex stochastic problems
where information about w becomes available in
stages, and the decisions are also made in stages
and make use of this information.
3
BASIC STRUCTURE OF STOCHASTIC DP

• Discrete-time system

xk+1 = fk (xk , uk , wk ), k = 0, 1, . . . , N − 1

− k: Discrete time
− xk : State; summarizes past information that
is relevant for future optimization
− uk : Control; decision to be selected at time
k from a given set
− wk : Random parameter (also called distur-
bance or noise depending on the context)
− N : Horizon or number of times control is
applied

• Cost function that is additive over time


  E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }

• Alternative system description: P (xk+1 | xk , uk )

xk+1 = wk with P (wk | xk , uk ) = P (xk+1 | xk , uk )


4
INVENTORY CONTROL EXAMPLE

[Figure: inventory system block diagram — stock xk at period k, stock uk ordered at period k, demand wk; dynamics xk+1 = xk + uk − wk; cost of period k: r(xk) + c uk]

• Discrete-time system

xk+1 = fk (xk , uk , wk ) = xk + uk − wk
• Cost function that is additive over time

  E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) } = E{ Σ_{k=0}^{N−1} ( c uk + r(xk + uk − wk) ) }

• Optimization over policies: Rules/functions uk = µk(xk) that map states to controls
5
ADDITIONAL ASSUMPTIONS

• The set of values that the control uk can take depends at most on xk and not on prior x or u
• Probability distribution of wk does not depend
on past values wk−1 , . . . , w0 , but may depend on
xk and uk
− Otherwise past values of w or x would be
useful for future optimization
• Sequence of events envisioned in period k:
− xk occurs according to

xk = fk−1(xk−1, uk−1, wk−1)

− uk is selected with knowledge of xk , i.e.,

uk ∈ Uk (xk )

− wk is random and generated according to a distribution Pwk(xk, uk)

6
DETERMINISTIC FINITE-STATE PROBLEMS

• Scheduling example: Find optimal sequence of


operations A, B, C, D
• A must precede B, and C must precede D
• Given startup costs SA and SC, and setup transition cost Cmn from operation m to operation n

[Figure: transition graph for the scheduling example — from the initial state, arcs with startup costs SA and SC lead to A and C; further arcs with setup costs Cmn lead through the partial schedules AB, AC, CA, CD to the complete schedules ABC, ACB, ACD, CAB, CAD, CDA]
7
STOCHASTIC FINITE-STATE PROBLEMS

• Example: Find two-game chess match strategy


• Timid play draws with prob. pd > 0 and loses
with prob. 1 − pd . Bold play wins with prob. pw <
1/2 and loses with prob. 1 − pw

[Figure: transition probability graphs for the 1st and 2nd games under timid and bold play — timid play leads to a draw w.p. pd and a loss w.p. 1 − pd; bold play leads to a win w.p. pw and a loss w.p. 1 − pw; the nodes are the possible match scores (2-0, 1.5-0.5, 1-1, 0.5-1.5, 0-2, etc.)]

8
BASIC PROBLEM

• System xk+1 = fk (xk , uk , wk ), k = 0, . . . , N −1


• Control constraints uk ∈ Uk (xk )
• Probability distribution Pk (· | xk , uk ) of wk
• Policies π = {µ0 , . . . , µN −1 }, where µk maps
states xk into controls uk = µk (xk ) and is such
that µk (xk ) ∈ Uk (xk ) for all xk
• Expected cost of π starting at x0 is

  Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }

• Optimal cost function

  J*(x0) = min_π Jπ(x0)

• Optimal policy π ∗ satisfies

Jπ∗ (x0 ) = J ∗ (x0 )

When produced by DP, π ∗ is independent of x0 .

9
SIGNIFICANCE OF FEEDBACK

• Open-loop versus closed-loop policies


[Figure: closed-loop system — the controller µk produces uk = µk(xk), which together with the disturbance wk drives the system xk+1 = fk(xk, uk, wk)]

• In deterministic problems open loop is as good


as closed loop
• Value of information; chess match example
• Example of open-loop policy: Play always bold
• Consider the closed-loop policy: Play timid if
and only if you are ahead
[Figure: transition graph of this closed-loop policy — game 1 is played bold (0-0 → 1-0 w.p. pw, 0-0 → 0-1 w.p. 1 − pw); if ahead (1-0), play timid (→ 1.5-0.5 w.p. pd, → 1-1 w.p. 1 − pd); if behind (0-1), play bold (→ 1-1 w.p. pw, → 0-2 w.p. 1 − pw)]
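To make the comparison concrete, here is a small Python sketch (not part of the original slides) that evaluates the open-loop “always bold” policy against the closed-loop “play timid iff ahead” policy for the two-game match. It assumes, as a tie-break convention, that a tied final score is worth pw (a single sudden-death game played boldly); the values of pd and pw are arbitrary illustrative numbers.

```python
# Sketch: match win probability of the two-game chess match under
# (a) the open-loop "always bold" policy and (b) the closed-loop
# "timid iff ahead" policy.  Assumption: a tied final score is worth pw.

def match_win_prob(pd, pw, closed_loop):
    def terminal_value(d):            # d = my points minus opponent's points
        if d > 0:
            return 1.0
        if d == 0:
            return pw                 # assumed sudden-death bold game
        return 0.0

    def game2_value(d):               # value of score difference d before game 2
        timid = pd * terminal_value(d) + (1 - pd) * terminal_value(d - 1)
        bold = pw * terminal_value(d + 1) + (1 - pw) * terminal_value(d - 1)
        if closed_loop:
            return timid if d > 0 else bold   # timid if and only if ahead
        return bold                           # open loop: always bold

    # game 1 is played bold under both policies (the score is 0-0, not "ahead")
    return pw * game2_value(+1) + (1 - pw) * game2_value(-1)

pd, pw = 0.9, 0.45
print("always bold    :", match_win_prob(pd, pw, closed_loop=False))
print("timid iff ahead:", match_win_prob(pd, pw, closed_loop=True))
```

Even though pw < 1/2, the closed-loop policy can give a substantially larger match win probability — the value of feedback/information discussed above.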

10
VARIANTS OF DP PROBLEMS

• Continuous-time problems
• Imperfect state information problems
• Infinite horizon problems
• Suboptimal control

11
LECTURE BREAKDOWN

• Finite Horizon Problems (Vol. 1, Ch. 1-6)


− Ch. 1: The DP algorithm (2 lectures)
− Ch. 2: Deterministic finite-state problems (1
lecture)
− Ch. 4: Stochastic DP problems (2 lectures)
− Ch. 5: Imperfect state information problems
(2 lectures)
− Ch. 6: Suboptimal control (2 lectures)
• Infinite Horizon Problems - Simple (Vol. 1, Ch.
7, 3 lectures)

********************************************
• Infinite Horizon Problems - Advanced (Vol. 2)
− Chs. 1, 2: Discounted problems - Computa-
tional methods (3 lectures)
− Ch. 3: Stochastic shortest path problems (2
lectures)
− Chs. 6, 7: Approximate DP (6 lectures)

12
COURSE ADMINISTRATION

• Homework ... once a week or two weeks (30% of grade)
• In class midterm, near end of October ... will
cover finite horizon and simple infinite horizon ma-
terial (30% of grade)
• Project (40% of grade)
• Collaboration in homework allowed but indi-
vidual solutions are expected
• Prerequisites: Introductory probability, good grasp of advanced calculus (including convergence concepts)
• Textbook: Vol. I of text is required. Vol. II
is strongly recommended, but you may be able to
get by without it using OCW material (including
videos)

13
A NOTE ON THESE SLIDES

• These slides are a teaching aid, not a text


• Don’t expect a rigorous mathematical develop-
ment or precise mathematical statements
• Figures are meant to convey and enhance ideas,
not to express them precisely
• Omitted proofs and a much fuller discussion
can be found in the textbook, which these slides
follow

14
6.231 DYNAMIC PROGRAMMING

LECTURE 2

LECTURE OUTLINE

• The basic problem


• Principle of optimality
• DP example: Deterministic problem
• DP example: Stochastic problem
• The general DP algorithm
• State augmentation

15
BASIC PROBLEM

• System xk+1 = fk (xk , uk , wk ), k = 0, . . . , N −1


• Control constraints uk ∈ Uk (xk )
• Probability distribution Pk (· | xk , uk ) of wk
• Policies π = {µ0 , . . . , µN −1 }, where µk maps
states xk into controls uk = µk (xk ) and is such
that µk (xk ) ∈ Uk (xk ) for all xk
• Expected cost of π starting at x0 is

  Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }

• Optimal cost function

  J*(x0) = min_π Jπ(x0)

• Optimal policy π ∗ is one that satisfies

Jπ∗ (x0 ) = J ∗ (x0 )

16
PRINCIPLE OF OPTIMALITY

• Let π ∗ = {µ∗0 , µ∗1 , . . . , µ∗N −1 } be optimal policy


• Consider the “tail subproblem” whereby we are
at xi at time i and wish to minimize the “cost-to-
go” from time i to time N

  E{ gN(xN) + Σ_{k=i}^{N−1} gk(xk, µk(xk), wk) }

and the “tail policy” {µ∗i , µ∗i+1 , . . . , µ∗N −1 }


[Figure: time line 0, . . . , i, . . . , N; the tail subproblem starts at state xi at time i]

• Principle of optimality: The tail policy is opti-


mal for the tail subproblem (optimization of the
future does not depend on what we did in the past)
• DP first solves ALL tail subproblems of final stage
• At the generic step, it solves ALL tail subprob-
lems of a given time length, using the solution of
the tail subproblems of shorter time length
17
DETERMINISTIC SCHEDULING EXAMPLE

• Find optimal sequence of operations A, B, C,


D (A must precede B and C must precede D)

[Figure: the scheduling graph with numerical arc costs; at each node the optimal cost-to-go computed by backward DP is recorded next to the partial schedules A, C, AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA]

• Start from the last tail subproblem and go back-


wards
• At each state-time pair, we record the optimal
cost-to-go and the optimal decision

18
STOCHASTIC INVENTORY EXAMPLE

[Figure: inventory system block diagram — stock xk at period k, stock uk ordered at period k, demand wk; dynamics xk+1 = xk + uk − wk; cost of period k: c uk + r(xk + uk − wk)]

• Tail Subproblems of Length 1:

  JN−1(xN−1) = min_{uN−1 ≥ 0} E_{wN−1}{ c uN−1 + r(xN−1 + uN−1 − wN−1) }

• Tail Subproblems of Length N − k:

  Jk(xk) = min_{uk ≥ 0} E_{wk}{ c uk + r(xk + uk − wk) + Jk+1(xk + uk − wk) }

• J0 (x0 ) is opt. cost of initial state x0


19
DP ALGORITHM

• Start with

JN (xN ) = gN (xN ),

and go backwards using



  Jk(xk) = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) },   k = 0, 1, . . . , N − 1.

• Then J0 (x0 ), generated at the last step, is equal


to the optimal cost J ∗ (x0 ). Also, the policy
π ∗ = {µ∗0 , . . . , µ∗N −1 }
where µ∗k (xk ) minimizes in the right side above for
each xk and k, is optimal
• Justification: Proof by induction that Jk (xk ) is
equal to Jk∗ (xk ), defined as the optimal cost of the
tail subproblem that starts at time k at state xk
• Note:
− ALL the tail subproblems are solved (in ad-
dition to the original problem)
− Intensive computational requirements

20
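As an illustration of the DP algorithm above, the following Python sketch (not from the slides) runs the backward recursion for a generic finite state/control/disturbance model. The arguments states, controls, P_w, f, g, gN are hypothetical problem data to be supplied by the user.

```python
# Minimal sketch of the finite-horizon DP algorithm for a finite model.
def dp(N, states, controls, P_w, f, g, gN):
    """Return cost-to-go tables J[k][x] and an optimal policy mu[k][x].

    controls(x, k): iterable of admissible controls U_k(x)
    P_w(k, x, u):   iterable of (w, prob) pairs
    f(k, x, u, w):  next state; g(k, x, u, w): stage cost; gN(x): terminal cost
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = gN(x)                       # J_N(x_N) = g_N(x_N)
    for k in range(N - 1, -1, -1):            # go backwards
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls(x, k):
                # E_w{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) }
                cost = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                           for w, p in P_w(k, x, u))
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[k][x], mu[k][x] = best_cost, best_u
    return J, mu
```

Note how the nested loops over k, x, u, w make the “intensive computational requirements” explicit: the work grows with the product of the horizon, state, control, and disturbance space sizes.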
PROOF OF THE INDUCTION STEP

• Let πk = {µk, µk+1, . . . , µN−1} denote a tail policy from time k onward
• Assume that Jk+1(xk+1) = J*k+1(xk+1). Then

  J*k(xk) = min_{(µk, πk+1)} E_{wk,...,wN−1}{ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }

          = min_{µk} E_{wk}{ gk(xk, µk(xk), wk)
              + min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }

          = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + J*k+1( fk(xk, µk(xk), wk) ) }

          = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1( fk(xk, µk(xk), wk) ) }

          = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) }

          = Jk(xk)

21
LINEAR-QUADRATIC ANALYTICAL EXAMPLE

[Figure: two ovens in series — material with initial temperature x0 enters Oven 1 (temperature u0), exits at temperature x1, enters Oven 2 (temperature u1), and exits at final temperature x2]

• System

xk+1 = (1 − a)xk + auk , k = 0, 1,

where a is given scalar from the interval (0, 1)


• Cost
  r(x2 − T)^2 + u0^2 + u1^2
where r is a given positive scalar
• DP Algorithm:

  J2(x2) = r(x2 − T)^2
  J1(x1) = min_{u1} [ u1^2 + r( (1 − a)x1 + a u1 − T )^2 ]
  J0(x0) = min_{u0} [ u0^2 + J1( (1 − a)x0 + a u0 ) ]

22
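A numerical sketch of this two-oven example (not from the slides; the parameter values a, r, T, x0 are made up). The inner minimization is quadratic in u, so it is carried out in closed form; the outer minimization over u0 is done by a simple grid search for illustration.

```python
# Sketch of the two-oven DP worked numerically (illustrative parameters).
a, r, T = 0.7, 2.0, 100.0

def u_star(x):
    # argmin_u [ u**2 + r*((1 - a)*x + a*u - T)**2 ], from setting the derivative to 0
    return r * a * (T - (1 - a) * x) / (1 + r * a ** 2)

def J1(x1):
    u1 = u_star(x1)
    return u1 ** 2 + r * ((1 - a) * x1 + a * u1 - T) ** 2

def J0(x0):
    # minimize u0^2 + J1((1-a)x0 + a u0) over a coarse grid of u0 (illustration only)
    candidates = [u / 10.0 for u in range(0, 2001)]
    return min(u0 ** 2 + J1((1 - a) * x0 + a * u0) for u0 in candidates)

print(J0(20.0))
```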
STATE AUGMENTATION

• When assumptions of the basic problem are


violated (e.g., disturbances are correlated, cost is
nonadditive, etc) reformulate/augment the state
• DP algorithm still applies, but the problem gets
BIGGER
• Example: Time lags

xk+1 = fk (xk , xk−1 , uk , wk )

• Introduce additional state variable yk = xk−1 .


New system takes the form
   
  xk+1 = fk(xk, yk, uk, wk),   yk+1 = xk

View x̃k = (xk, yk) as the new state.


• DP algorithm for the reformulated problem:
  Jk(xk, xk−1) = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, xk−1, uk, wk), xk ) }
23
6.231 DYNAMIC PROGRAMMING

LECTURE 3

LECTURE OUTLINE

• Deterministic finite-state DP problems


• Backward shortest path algorithm
• Forward shortest path algorithm
• Shortest path examples
• Alternative shortest path algorithms

24
DETERMINISTIC FINITE-STATE PROBLEM

[Figure: stage-by-stage transition graph — one node per state at stages 0, 1, 2, . . . , N, an initial node s, and an artificial terminal node t reached from every stage-N node through a terminal arc whose cost equals the terminal cost]

• States <==> Nodes


• Controls <==> Arcs
• Control sequences (open-loop) <==> paths
from initial state to terminal states
• a^k_ij : Cost of transition from state i ∈ Sk to state j ∈ Sk+1 at time k (view it as “length” of the arc)
• a^N_it : Terminal cost of state i ∈ SN

• Cost of control sequence <==> Cost of the cor-


responding path (view it as “length” of the path)

25
BACKWARD AND FORWARD DP ALGORITHMS

• DP algorithm:

  JN(i) = a^N_it,   i ∈ SN,
  Jk(i) = min_{j ∈ Sk+1} [ a^k_ij + Jk+1(j) ],   i ∈ Sk,  k = 0, . . . , N − 1

The optimal cost is J0(s) and is equal to the length of the shortest path from s to t
• Observation: An optimal path s → t is also an
optimal path t → s in a “reverse” shortest path
problem where the direction of each arc is reversed
and its length is left unchanged
• Forward DP algorithm (= backward DP algorithm for the reverse problem):

  J̃N(j) = a^0_sj,   j ∈ S1,
  J̃k(j) = min_{i ∈ SN−k} [ a^{N−k}_ij + J̃k+1(i) ],   j ∈ SN−k+1

The optimal cost is J̃0(t) = min_{i ∈ SN} [ a^N_it + J̃1(i) ]
• View J̃k(j) as the optimal cost-to-arrive at state j from the initial state s

26
A NOTE ON FORWARD DP ALGORITHMS

• There is no forward DP algorithm for stochastic


problems
• Mathematically, for stochastic problems, we
cannot restrict ourselves to open-loop sequences,
so the shortest path viewpoint fails
• Conceptually, in the presence of uncertainty,
the concept of “optimal-cost-to-arrive” at a state
xk does not make sense. For example, it may be
impossible to guarantee (with prob. 1) that any
given state can be reached
• By contrast, even in stochastic problems, the
concept of “optimal cost-to-go” from any state xk
makes clear sense

27
GENERIC SHORTEST PATH PROBLEMS

• {1, 2, . . . , N, t}: nodes of a graph (t: the desti-


nation)
• aij : cost of moving from node i to node j
• Find a shortest (minimum cost) path from each
node i to node t
• Assumption: All cycles have nonnegative length.
Then an optimal path need not take more than N
moves
• We formulate the problem as one where we re-
quire exactly N moves but allow degenerate moves
from a node i to itself with cost aii = 0

Jk (i) = opt. cost of getting from i to t in N −k moves

J0 (i): Cost of the optimal path from i to t.

• DP algorithm:

  Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ],   k = 0, 1, . . . , N − 2,

with JN−1(i) = ait,   i = 1, 2, . . . , N

28
EXAMPLE

[Figure: (a) an example graph with nodes 1, . . . , 5 plus the destination, with arc lengths shown; (b) the cost-to-go values Jk(i) produced by the DP recursion at stages k = 0, 1, 2, 3, 4]

  JN−1(i) = ait,   i = 1, 2, . . . , N,
  Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ],   k = 0, 1, . . . , N − 2.

29
ESTIMATION / HIDDEN MARKOV MODELS

• Markov chain with transition probabilities pij


• State transitions are hidden from view
• For each transition, we get an (independent)
observation
• r(z; i, j): Prob. the observation takes value z
when the state transition is from i to j
• Trajectory estimation problem: Given the ob-
servation sequence ZN = {z1 , z2 , . . . , zN }, what is
the “most likely” state transition sequence X̂N =
{x̂0 , x̂1 , . . . , x̂N } [one that maximizes p(XN | ZN )
over all XN = {x0 , x1 , . . . , xN }].

[Figure: trellis of the shortest path formulation — an artificial start node s, the state nodes x0, x1, x2, . . . , xN−1, xN arranged stage by stage, and a terminal node t]

30
VITERBI ALGORITHM

• We have

  p(XN | ZN) = p(XN, ZN) / p(ZN)
where p(XN , ZN ) and p(ZN ) are the unconditional
probabilities of occurrence of (XN , ZN ) and ZN
• Maximizing p(XN | ZN) is equivalent to maximizing ln(p(XN, ZN))
• We have (using the “multiplication rule” for
cond. probs)

  p(XN, ZN) = πx0 Π_{k=1}^{N} p_{xk−1 xk} r(zk; xk−1, xk)

so the problem is equivalent to

  minimize  − ln(πx0) − Σ_{k=1}^{N} ln( p_{xk−1 xk} r(zk; xk−1, xk) )

over all possible sequences {x0 , x1 , . . . , xN }.

• This is a shortest path problem.


31
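A minimal Viterbi sketch (not part of the slides), written as the shortest path computation above with arc lengths −ln(p_{ij} r(z; i, j)). The inputs pi0, P, r, Z are hypothetical example data structures.

```python
# Sketch of the Viterbi algorithm as a shortest path / DP recursion.
import math

def viterbi(pi0, P, r, Z):
    """pi0[i]: initial probs, P[i][j]: transition probs,
    r[(i, j)][z]: observation probs, Z: observation sequence."""
    n = len(pi0)
    D = [-math.log(pi0[i]) if pi0[i] > 0 else math.inf for i in range(n)]
    parents = []
    for z in Z:
        newD, back = [math.inf] * n, [None] * n
        for j in range(n):
            for i in range(n):
                pij = P[i][j] * r[(i, j)].get(z, 0.0)
                if pij > 0 and D[i] - math.log(pij) < newD[j]:
                    newD[j], back[j] = D[i] - math.log(pij), i
        D, parents = newD, parents + [back]
    # backtrack from the best final state
    x = min(range(n), key=lambda j: D[j])
    path = [x]
    for back in reversed(parents):
        x = back[x]
        path.append(x)
    return list(reversed(path))
```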
GENERAL SHORTEST PATH ALGORITHMS

• There are many nonDP shortest path algo-


rithms. They can all be used to solve deterministic
finite-state problems
• They may be preferable to DP if they avoid calculating the optimal cost-to-go of EVERY state
• Essential for problems with HUGE state spaces.
• Combinatorial optimization is prime example
(e.g., scheduling/traveling salesman)
[Figure: shortest path formulation of a four-city traveling salesman problem — a tree rooted at origin node s = A branches through the partial tours AB, AC, AD down to the complete tours ABCD, . . . , ADCB; each leaf is connected by an arc to an artificial terminal node t, and the arcs carry the intercity travel costs]
32
LABEL CORRECTING METHODS

• Given: Origin s, destination t, lengths aij ≥ 0.


• Idea is to progressively discover shorter paths
from the origin s to every other node i
• Notation:
− di (label of i): Length of the shortest path
found (initially ds = 0, di = ∞ for i ≠ s)
− UPPER: The label dt of the destination
− OPEN list: Contains nodes that are cur-
rently active in the sense that they are candi-
dates for further examination (initially OPEN={s})
Label Correcting Algorithm
Step 1 (Node Removal): Remove a node i from
OPEN and for each child j of i, do step 2
Step 2 (Node Insertion Test): If di + aij <
min{dj , UPPER}, set dj = di + aij and set i to
be the parent of j. In addition, if j ≠ t, place j in
OPEN if it is not already in OPEN, while if j = t,
set UPPER to the new value di + ait of dt
Step 3 (Termination Test): If OPEN is empty,
terminate; else go to step 1
33
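The label correcting algorithm of this slide translates almost line by line into code. Below is a Python sketch (not from the slides); graph maps each node to a list of (child, arc length) pairs, and the particular OPEN discipline (here a stack) is left unspecified by the algorithm.

```python
# Direct sketch of the label correcting algorithm (Steps 1-3 of this slide).
def label_correcting(graph, s, t):
    d = {i: float("inf") for i in graph}          # labels d_i
    parent = {i: None for i in graph}
    d[s], UPPER = 0.0, float("inf")
    OPEN = [s]
    while OPEN:                                   # Step 3: stop when OPEN is empty
        i = OPEN.pop()                            # Step 1: remove a node from OPEN
        for j, a_ij in graph[i]:
            if d[i] + a_ij < min(d.get(j, float("inf")), UPPER):   # Step 2 test
                d[j], parent[j] = d[i] + a_ij, i
                if j != t and j not in OPEN:
                    OPEN.append(j)
                elif j == t:
                    UPPER = d[j]                  # new best s -> t path length
    return UPPER, parent
```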
VISUALIZATION/EXPLANATION

• Given: Origin s, destination t, lengths aij ≥ 0


• di (label of i): Length of the shortest path found
thus far (initially ds = 0, di = ∞ for i ≠ s). The
label di is implicitly associated with an s → i path
• UPPER: The label dt of the destination
• OPEN list: Contains “active” nodes (initially
OPEN={s})

[Figure: the two tests applied when node i is removed from OPEN — for each child j, “Is di + aij < dj?” (is the path s → i → j better than the current path s → j?) and “Is di + aij < UPPER?” (does the path s → i → j have a chance to be part of a shorter s → t path?); if both hold, set dj = di + aij and insert j into OPEN]

34
EXAMPLE

[Figure: the traveling salesman tree of the previous slide with its nodes numbered in the order generated — 1: A, 2: AB, 3: ABC, 4: ABCD, 5: ABD, 6: ABDC, 7: AC, 8: ACD, 9: ACDB, 10: AD — and an artificial terminal node t]

Iter. No. Node Exiting OPEN OPEN after Iteration UPPER

0 - 1 ∞
1 1 2, 7,10 ∞
2 2 3, 5, 7, 10 ∞
3 3 4, 5, 7, 10 ∞
4 4 5, 7, 10 43
5 5 6, 7, 10 43
6 6 7, 10 13
7 7 8, 10 13
8 8 9, 10 13
9 9 10 13
10 10 Empty 13

• Note that some nodes never entered OPEN


35
VALIDITY OF LABEL CORRECTING METHODS

Proposition: If there exists at least one path from


the origin to the destination, the label correcting
algorithm terminates with UPPER equal to the
shortest distance from the origin to the destina-
tion
Proof: (1) Each time a node j enters OPEN, its
label is decreased and becomes equal to the length
of some path from s to j
(2) The number of possible distinct path lengths
is finite, so the number of times a node can enter
OPEN is finite, and the algorithm terminates
(3) Let (s, j1 , j2 , . . . , jk , t) be a shortest path and
let d∗ be the shortest distance. If UPPER > d∗
at termination, UPPER will also be larger than
the length of all the paths (s, j1 , . . . , jm ), m =
1, . . . , k, throughout the algorithm. Hence, node
jk will never enter the OPEN list with djk equal
to the shortest distance from s to jk . Similarly
node jk−1 will never enter the OPEN list with
djk−1 equal to the shortest distance from s to jk−1 .
Continue to j1 to get a contradiction

36
6.231 DYNAMIC PROGRAMMING

LECTURE 4

LECTURE OUTLINE

• Examples of stochastic DP problems


• Linear-quadratic problems
• Inventory control

37
LINEAR-QUADRATIC PROBLEMS

• System: xk+1 = Ak xk + Bk uk + wk
• Quadratic cost

  E_{wk, k=0,1,...,N−1}{ x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }

where Qk ≥ 0 and Rk > 0 [in the positive (semi)definite sense].
• wk are independent and zero mean
• DP algorithm:

  JN(xN) = x′N QN xN,
  Jk(xk) = min_{uk} E{ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) }
• Key facts:
− Jk (xk ) is quadratic
− Optimal policy {µ*0, . . . , µ*N−1} is linear:

  µ*k(xk) = Lk xk
− Similar treatment of a number of variants
38
DERIVATION

• By induction verify that

  µ*k(xk) = Lk xk,   Jk(xk) = x′k Kk xk + constant,

where Lk are matrices given by

  Lk = −(B′k Kk+1 Bk + Rk)^{−1} B′k Kk+1 Ak,

and where Kk are symmetric positive semidefinite matrices given by

  KN = QN,
  Kk = A′k ( Kk+1 − Kk+1 Bk (B′k Kk+1 Bk + Rk)^{−1} B′k Kk+1 ) Ak + Qk

• This is called the discrete-time Riccati equation


• Just like DP, it starts at the terminal time N
and proceeds backwards.
• Certainty equivalence holds (optimal policy is
the same as when wk is replaced by its expected
value E{wk } = 0).
39
ASYMPTOTIC BEHAVIOR OF RICCATI EQ.

• Assume stationary system and cost per stage,


and technical assumptions: controllability of (A, B) and observability of (A, C), where Q = C′C
• The Riccati equation converges limk→−∞ Kk =
K, where K is pos. definite, and is the unique
(within the class of pos. semidefinite matrices) so-
lution of the algebraic Riccati equation

  K = A′( K − KB(B′KB + R)^{−1} B′K )A + Q

• The optimal steady-state controller µ*(x) = Lx, where

  L = −(B′KB + R)^{−1} B′KA,

is stable in the sense that the matrix (A + BL) of


the closed-loop system

xk+1 = (A + BL)xk + wk

satisfies limk→∞ (A + BL)k = 0.

40
GRAPHICAL PROOF FOR SCALAR SYSTEMS
[Figure: the scalar Riccati iteration — the function F(P), the 45° line, and the iterates Pk, Pk+1 converging to the fixed point P*; F has a vertical asymptote at P = −R/B² and horizontal asymptote A²R/B² + Q]

• Riccati equation (with Pk = KN−k):

  Pk+1 = A^2 ( Pk − B^2 Pk^2 / (B^2 Pk + R) ) + Q,

or Pk+1 = F(Pk), where

  F(P) = A^2 ( P − B^2 P^2 / (B^2 P + R) ) + Q = A^2 R P / (B^2 P + R) + Q

• Note the two steady-state solutions, satisfying


P = F (P ), of which only one is positive.
41
RANDOM SYSTEM MATRICES

• Suppose that {A0 , B0 }, . . . , {AN −1 , BN −1 } are


not known but rather are independent random
matrices that are also independent of the wk
• DP algorithm is

  JN(xN) = x′N QN xN,
  Jk(xk) = min_{uk} E_{wk, Ak, Bk}{ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) }

• Optimal policy µ*k(xk) = Lk xk, where

  Lk = −( Rk + E{B′k Kk+1 Bk} )^{−1} E{B′k Kk+1 Ak},

and where the matrices Kk are given by

  KN = QN,
  Kk = E{A′k Kk+1 Ak} − E{A′k Kk+1 Bk}( Rk + E{B′k Kk+1 Bk} )^{−1} E{B′k Kk+1 Ak} + Qk

42
PROPERTIES

• Certainty equivalence may not hold


• Riccati equation may not converge to a steady-
state
[Figure: the function F̃(P) and the 45° line; the vertical asymptote is at P = −R/E{B²}]

• We have Pk+1 = F̃(Pk), where

  F̃(P) = E{A^2} R P / ( E{B^2} P + R ) + Q + T P^2 / ( E{B^2} P + R ),

  T = E{A^2} E{B^2} − ( E{A} )^2 ( E{B} )^2

43
INVENTORY CONTROL

• xk : stock, uk : stock purchased, wk : demand

xk+1 = xk + uk − wk , k = 0, 1, . . . , N − 1

• Minimize
  E{ Σ_{k=0}^{N−1} ( c uk + H(xk + uk) ) }

where

H(x + u) = E{r(x + u − w)}

is the expected shortage/holding cost, with r defined, e.g., for some p > 0 and h > 0, as

r(x) = p max(0, −x) + h max(0, x)

• DP algorithm:

  JN(xN) = 0,
  Jk(xk) = min_{uk ≥ 0} [ c uk + H(xk + uk) + E{ Jk+1(xk + uk − wk) } ]
44
OPTIMAL POLICY

• DP algorithm can be written as JN(xN) = 0,

  Jk(xk) = min_{uk ≥ 0} [ c uk + H(xk + uk) + E{ Jk+1(xk + uk − wk) } ]
         = min_{uk ≥ 0} Gk(xk + uk) − c xk = min_{y ≥ xk} Gk(y) − c xk,

where

  Gk(y) = c y + H(y) + E{ Jk+1(y − w) }

• If Gk is convex and lim_{|x|→∞} Gk(x) → ∞, we have

  µ*k(xk) = Sk − xk   if xk < Sk,
            0         if xk ≥ Sk,

where Sk minimizes Gk(y).
• This is shown, assuming that H is convex and
c < p, by showing that Jk is convex for all k, and

  lim_{|x|→∞} Jk(x) → ∞

45
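The base-stock structure can be seen numerically with a small sketch (not from the slides) that runs the inventory DP on a discretized stock grid; all numbers are made-up example data, and states that fall outside the grid are truncated, which is only a boundary approximation.

```python
# Sketch: inventory DP on a stock grid, illustrating the base-stock policy.
c, p, h, N = 1.0, 3.0, 1.0, 5
demand = [(0, 0.25), (1, 0.5), (2, 0.25)]          # (w, prob)
stocks = range(-10, 11)                             # discretized x_k grid
orders = range(0, 11)

def r(x):                                           # shortage/holding cost
    return p * max(0, -x) + h * max(0, x)

J = {x: 0.0 for x in stocks}                        # J_N = 0
for k in range(N - 1, -1, -1):
    newJ, policy = {}, {}
    for x in stocks:
        best = min(
            (sum(prob * (c * u + r(x + u - w) + J.get(x + u - w, 0.0))
                 for w, prob in demand), u)         # out-of-grid states treated as 0
            for u in orders)
        newJ[x], policy[x] = best
    J = newJ
print(policy)   # away from the grid boundary: u = max(0, S0 - x) for some level S0
```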
JUSTIFICATION

• Graphical inductive proof that Jk is convex.

[Figure: the functions H(y) and cy + H(y), minimized at y = SN−1, and the resulting convex cost-to-go JN−1(xN−1)]
46
6.231 DYNAMIC PROGRAMMING

LECTURE 5

LECTURE OUTLINE

• Stopping problems
• Scheduling problems
• Minimax Control

47
PURE STOPPING PROBLEMS

• Two possible controls:


− Stop (incur a one-time stopping cost, and
move to cost-free and absorbing stop state)
− Continue [using xk+1 = fk (xk , wk ) and incur-
ring the cost-per-stage]
• Each policy consists of a partition of the set of
states xk into two regions:
− Stop region, where we stop
− Continue region, where we continue

[Figure: partition of the state space into a continue region and a stop region; stopping leads to an absorbing stop state]

48
EXAMPLE: ASSET SELLING

• A person has an asset, and at k = 0, 1, . . . , N − 1


receives a random offer wk
• May accept wk and invest the money at fixed
rate of interest r, or reject wk and wait for wk+1 .
Must accept the last offer wN −1
• DP algorithm (xk: current offer, T: stop state):

  JN(xN) = xN   if xN ≠ T,
           0    if xN = T,

  Jk(xk) = max[ (1 + r)^{N−k} xk , E{ Jk+1(wk) } ]   if xk ≠ T,
           0                                          if xk = T.
• Optimal policy:

  accept the offer xk if xk > αk,
  reject the offer xk if xk < αk,

where

  αk = E{ Jk+1(wk) } / (1 + r)^{N−k}.

49
FURTHER ANALYSIS

[Figure: the thresholds α1, α2, . . . , αN−1 plotted against k — accept offers above the threshold, reject offers below; the thresholds decrease with k]

• Can show that αk ≥ αk+1 for all k


• Proof: Let Vk(xk) = Jk(xk)/(1 + r)^{N−k} for xk ≠ T. Then the DP algorithm is

  VN(xN) = xN,   Vk(xk) = max[ xk, (1 + r)^{−1} E_w{ Vk+1(w) } ]

We have αk = E_w{ Vk+1(w) }/(1 + r), so it is enough to show that Vk(x) ≥ Vk+1(x) for all x and k. Start with VN−1(x) ≥ VN(x) and use the monotonicity property of DP. Q.E.D.

• We can also show that if w is bounded, αk → a


as k → −∞. Suggests that for an infinite horizon
the optimal policy is stationary.

50
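A small sketch (not from the slides) that computes the thresholds αk backwards using Vk(x) = max[x, (1 + r)^{−1} E{Vk+1(w)}] from the analysis above, for a made-up discrete offer distribution.

```python
# Sketch: acceptance thresholds alpha_k for the asset selling example.
r, N = 0.05, 10
offers = [(10.0, 0.3), (20.0, 0.4), (30.0, 0.3)]    # (offer value, probability)

alphas = []
EV_next = sum(w * q for w, q in offers)             # E{V_N(w)} = E{w}
for k in range(N - 1, 0, -1):
    alpha_k = EV_next / (1 + r)                     # accept x_k iff x_k > alpha_k
    EV_next = sum(q * max(w, alpha_k) for w, q in offers)   # E{V_k(w)}
    alphas.append((k, alpha_k))
print(list(reversed(alphas)))                       # thresholds decrease with k
```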
GENERAL STOPPING PROBLEMS

• At time k, we may stop at cost t(xk) or choose a control uk ∈ U(xk) and continue

  JN(xN) = t(xN),
  Jk(xk) = min[ t(xk),  min_{uk ∈ U(xk)} E{ g(xk, uk, wk) + Jk+1( f(xk, uk, wk) ) } ]

• Optimal to stop at time k for x in the set

  Tk = { x | t(x) ≤ min_{u ∈ U(x)} E{ g(x, u, w) + Jk+1( f(x, u, w) ) } }

• Since JN−1(x) ≤ JN(x), we have Jk(x) ≤ Jk+1(x) for all k, so

  T0 ⊂ · · · ⊂ Tk ⊂ Tk+1 ⊂ · · · ⊂ TN−1.
• Interesting case is when all the Tk are equal (to
TN −1 , the set where it is better to stop than to go
one step and stop). Can be shown to be true if

f (x, u, w) ∈ TN −1 , for all x ∈ TN −1 , u ∈ U (x), w.

51
SCHEDULING PROBLEMS

• We have a set of tasks to perform, the ordering


is subject to optimal choice.
• Costs depend on the order
• There may be stochastic uncertainty, and prece-
dence and resource availability constraints
• Some of the hardest combinatorial problems
are of this type (e.g., traveling salesman, vehicle
routing, etc.)
• Some special problems admit a simple quasi-
analytical solution method
− Optimal policy has an “index form”, i.e.,
each task has an easily calculable “cost in-
dex”, and it is optimal to select the task
that has the minimum value of index (multi-
armed bandit problems - to be discussed later)
− Some problems can be solved by an “inter-
change argument”(start with some schedule,
interchange two adjacent tasks, and see what
happens). They require existence of an op-
timal policy which is open-loop.

52
EXAMPLE: THE QUIZ PROBLEM

• Given a list of N questions. If question i is an-


swered correctly (given probability pi ), we receive
reward Ri ; if not the quiz terminates. Choose or-
der of questions to maximize expected reward.
• Let i and j be the k th and (k + 1)st questions
in an optimally ordered list

L = (i0 , . . . , ik−1 , i, j, ik+2 , . . . , iN −1 )



  E{reward of L} = E{ reward of {i0, . . . , ik−1} }
                   + pi0 · · · pik−1 ( pi Ri + pi pj Rj )
                   + pi0 · · · pik−1 pi pj E{ reward of {ik+2, . . . , iN−1} }

Consider the list with i and j interchanged

L′ = (i0 , . . . , ik−1 , j, i, ik+2 , . . . , iN −1 )

Since L is optimal, E{reward of L} ≥ E{reward of L′ },


so it follows that pi Ri + pi pj Rj ≥ pj Rj + pj pi Ri or

pi Ri /(1 − pi ) ≥ pj Rj /(1 − pj ).

53
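A quick sketch (not from the slides) that applies the index rule and checks it against brute-force enumeration on a tiny made-up instance.

```python
# Sketch: index rule p_i R_i / (1 - p_i) for the quiz problem, with a brute-force check.
from itertools import permutations

questions = [(0.9, 1.0), (0.5, 4.0), (0.8, 2.0)]    # (p_i, R_i), made-up data

def expected_reward(order):
    total, stay = 0.0, 1.0
    for p, R in order:
        total += stay * p * R       # reward collected if still in the quiz
        stay *= p                   # survive to the next question
    return total

index_order = sorted(questions, key=lambda q: q[0] * q[1] / (1 - q[0]),
                     reverse=True)
best = max(permutations(questions), key=expected_reward)
print(expected_reward(index_order), expected_reward(best))   # should coincide
```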
MINIMAX CONTROL

• Consider basic problem with the difference that


the disturbance wk, instead of being random, is just known to belong to a given set Wk(xk, uk).
• Find policy π that minimizes the cost

  Jπ(x0) = max_{wk ∈ Wk(xk, µk(xk)), k=0,1,...,N−1} [ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) ]

• The DP algorithm takes the form

  JN(xN) = gN(xN),
  Jk(xk) = min_{uk ∈ U(xk)} max_{wk ∈ Wk(xk, uk)} [ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) ]

(Section 1.6 in the text).

54
DERIVATION OF MINIMAX DP ALGORITHM

• Similar to the DP algorithm for stochastic problems. The optimal cost J*(x0) is

  J*(x0) = min_{µ0} · · · min_{µN−1} max_{w0 ∈ W[x0, µ0(x0)]} · · · max_{wN−1 ∈ W[xN−1, µN−1(xN−1)]}
             [ Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) + gN(xN) ]

          = min_{µ0} · · · min_{µN−2} [ min_{µN−1} max_{w0 ∈ W[x0, µ0(x0)]} · · · max_{wN−2 ∈ W[xN−2, µN−2(xN−2)]}
             [ Σ_{k=0}^{N−2} gk(xk, µk(xk), wk)
               + max_{wN−1 ∈ W[xN−1, µN−1(xN−1)]} [ gN−1(xN−1, µN−1(xN−1), wN−1) + JN(xN) ] ] ]

• Interchange the min over µN −1 and the max over


w0 , . . . , wN −2 , and similarly continue backwards,
with N − 1 in place of N , etc. After N steps we
obtain J ∗ (x0 ) = J0 (x0 ).
• Construct optimal policy by minimizing in the
RHS of the DP algorithm.

55
UNKNOWN-BUT-BOUNDED CONTROL

• For each k, keep the xk of the controlled system

  xk+1 = fk(xk, µk(xk), wk)

inside a given set Xk, the target set at time k.

• This is a minimax control problem, where the cost at stage k is

  gk(xk) = 0   if xk ∈ Xk,
           1   if xk ∉ Xk.
• We must reach at time k the set

  X̄k = { xk | Jk(xk) = 0 }

in order to be able to maintain the state within the subsequent target sets.
• Start with X̄N = XN, and for k = 0, 1, . . . , N − 1,

  X̄k = { xk ∈ Xk | there exists uk ∈ Uk(xk) such that fk(xk, uk, wk) ∈ X̄k+1, for all wk ∈ Wk(xk, uk) }

56
6.231 DYNAMIC PROGRAMMING

LECTURE 6

LECTURE OUTLINE

• Problems with imperfect state info


• Reduction to the perfect state info case
• Linear quadratic problems
• Separation of estimation and control

57
BASIC PROBL. W/ IMPERFECT STATE INFO

• Same as basic problem of Chapter 1 with one


difference: the controller, instead of knowing xk ,
receives at each time k an observation of the form

z0 = h0 (x0 , v0 ), zk = hk (xk , uk−1 , vk ), k ≥ 1

• The observation zk belongs to some space Zk .


• The random observation disturbance vk is char-
acterized by a probability distribution

Pvk (· | xk , . . . , x0 , uk−1 , . . . , u0 , wk−1 , . . . , w0 , vk−1 , . . . , v0 )

• The initial state x0 is also random and charac-


terized by a probability distribution Px0 .
• The probability distribution Pwk (· | xk , uk ) of wk
is given, and it may depend explicitly on xk and
uk but not on w0 , . . . , wk−1 , v0 , . . . , vk−1 .
• The control uk is constrained to a given subset
Uk (this subset does not depend on xk , which is
not assumed known).

58
INFORMATION VECTOR AND POLICIES

• Denote by Ik the information vector, i.e., the


information available at time k:

Ik = (z0 , z1 , . . . , zk , u0 , u1 , . . . , uk−1 ), k ≥ 1,
I0 = z 0

• We consider policies π = {µ0 , µ1 , . . . , µN −1 }, where


each µk maps Ik into a uk and

µk (Ik ) ∈ Uk , for all Ik , k ≥ 0

• We want to find a policy π that minimizes


  Jπ = E_{x0, wk, vk, k=0,...,N−1}{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(Ik), wk) }

subject to the equations



  xk+1 = fk(xk, µk(Ik), wk),   k ≥ 0,
  z0 = h0(x0, v0),   zk = hk(xk, µk−1(Ik−1), vk),   k ≥ 1

59
REFORMULATION AS PERFECT INFO PROBL.

• System: We have
Ik+1 = (Ik , zk+1 , uk ), k = 0, 1, . . . , N − 2, I0 = z 0

View this as a dynamic system with state Ik , con-


trol uk , and random disturbance zk+1
• Disturbance: We have

P (zk+1 | Ik , uk ) = P (zk+1 | Ik , uk , z0 , z1 , . . . , zk ),

since z0 , z1 , . . . , zk are part of the information vec-


tor Ik . Thus the probability distribution of zk+1
depends explicitly only on the state Ik and control
uk and not on the prior “disturbances” zk , . . . , z0
• Cost Function: Write
 
 
  E{ gk(xk, uk, wk) } = E{ E_{xk, wk}{ gk(xk, uk, wk) | Ik, uk } }

so the cost per stage of the new system is

  g̃k(Ik, uk) = E_{xk, wk}{ gk(xk, uk, wk) | Ik, uk }

60
DP ALGORITHM

• Writing the DP algorithm for the (reformulated)


perfect state info problem:
  Jk(Ik) = min_{uk ∈ Uk} E_{xk, wk, zk+1}[ gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk ]

for k = 0, 1, . . . , N − 2, and for k = N − 1,

  JN−1(IN−1) = min_{uN−1 ∈ UN−1} E_{xN−1, wN−1}[ gN−1(xN−1, uN−1, wN−1)
                 + gN( fN−1(xN−1, uN−1, wN−1) ) | IN−1, uN−1 ]

• The optimal cost J* is given by

  J* = E_{z0}{ J0(z0) }

61
LINEAR-QUADRATIC PROBLEMS

• System: xk+1 = Ak xk + Bk uk + wk
• Quadratic cost

  E_{wk, k=0,1,...,N−1}{ x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }

where Qk ≥ 0 and Rk > 0


• Observations

zk = Ck xk + vk , k = 0, 1, . . . , N − 1

• w0 , . . . , wN −1 , v0 , . . . , vN −1 indep. zero mean


• Key fact to show:
− Optimal policy {µ∗0 , . . . , µ∗N −1 } is of the form:

µ∗k (Ik ) = Lk E{xk | Ik }

Lk : same as for the perfect state info case


− Estimation problem and control problem can
be solved separately

62
DP ALGORITHM I

• Last stage N − 1 (suppressing index N − 1):

  JN−1(IN−1) = min_{uN−1} E_{xN−1, wN−1}[ x′N−1 Q xN−1 + u′N−1 R uN−1
                 + (A xN−1 + B uN−1 + wN−1)′ Q (A xN−1 + B uN−1 + wN−1) | IN−1, uN−1 ]

• Since E{wN−1 | IN−1, uN−1} = E{wN−1} = 0, the minimization involves

  min_{uN−1} [ u′N−1 (B′QB + R) uN−1 + 2 E{xN−1 | IN−1}′ A′QB uN−1 ]

The minimization yields the optimal µ*N−1:

  u*N−1 = µ*N−1(IN−1) = LN−1 E{xN−1 | IN−1}

where

  LN−1 = −(B′QB + R)^{−1} B′QA

63
DP ALGORITHM II

• Substituting in the DP algorithm

  JN−1(IN−1) = E_{xN−1}{ x′N−1 KN−1 xN−1 | IN−1 }
               + E_{xN−1}{ ( xN−1 − E{xN−1 | IN−1} )′ PN−1 ( xN−1 − E{xN−1 | IN−1} ) | IN−1 }
               + E_{wN−1}{ w′N−1 QN wN−1 },

where the matrices KN−1 and PN−1 are given by

  PN−1 = A′N−1 QN BN−1 (RN−1 + B′N−1 QN BN−1)^{−1} B′N−1 QN AN−1,
  KN−1 = A′N−1 QN AN−1 − PN−1 + QN−1

• Note the structure of JN−1: in addition to the quadratic and constant terms, it involves a (≥ 0) quadratic in the estimation error

  xN−1 − E{xN−1 | IN−1}

64
DP ALGORITHM III

• DP equation for period N − 2:

  JN−2(IN−2) = min_{uN−2} E_{xN−2, wN−2, zN−1}[ x′N−2 Q xN−2 + u′N−2 R uN−2 + JN−1(IN−1) | IN−2, uN−2 ]

             = E{ x′N−2 Q xN−2 | IN−2 }
               + min_{uN−2} [ u′N−2 R uN−2 + E{ x′N−1 KN−1 xN−1 | IN−2, uN−2 } ]
               + E{ ( xN−1 − E{xN−1 | IN−1} )′ PN−1 ( xN−1 − E{xN−1 | IN−1} ) | IN−2, uN−2 }
               + E_{wN−1}{ w′N−1 QN wN−1 }

• Key point: We have excluded the estimation


error term from the minimization over uN −2
• This term turns out to be independent of uN −2

65
QUALITY OF ESTIMATION LEMMA

• Current estimation error is unaffected by past


controls: For every k, there is a function Mk s.t.

xk − E{xk | Ik } = Mk (x0 , w0 , . . . , wk−1 , v0 , . . . , vk ),

independently of the policy being used


• Consequence: Using the lemma,
xN −1 − E{xN −1 | IN −1 } = ξN −1 ,
where
ξN −1 : function of x0 , w0 , . . . , wN −2 , v0 , . . . , vN −1

• Since ξN−1 is independent of uN−2, the conditional expectation of ξ′N−1 PN−1 ξN−1 satisfies

  E{ ξ′N−1 PN−1 ξN−1 | IN−2, uN−2 } = E{ ξ′N−1 PN−1 ξN−1 | IN−2 }

and is independent of uN−2.
• So minimization in the DP algorithm yields

  u*N−2 = µ*N−2(IN−2) = LN−2 E{xN−2 | IN−2}

66
FINAL RESULT

• Continuing similarly (using also the quality of


estimation lemma)

µ∗k (Ik ) = Lk E{xk | Ik },

where Lk is the same as for perfect state info:

Lk = −(Rk + Bk′ Kk+1 Bk )−1 Bk′ Kk+1 Ak ,

with Kk generated using the Riccati equation:

KN = QN , Kk = A′k Kk+1 Ak − Pk + Qk ,

Pk = A′k Kk+1 Bk (Rk + Bk′ Kk+1 Bk )−1 Bk′ Kk+1 Ak

[Figure: structure of the optimal controller — the system xk+1 = Ak xk + Bk uk + wk produces the measurement zk = Ck xk + vk; an estimator computes E{xk | Ik} from zk and the delayed control uk−1, and the gain Lk multiplies the estimate to produce uk]

67
SEPARATION INTERPRETATION

• The optimal controller can be decomposed into


(a) An estimator, which uses the data to gener-
ate the conditional expectation E{xk | Ik }.
(b) An actuator, which multiplies E{xk | Ik } by
the gain matrix Lk and applies the control
input uk = Lk E{xk | Ik }.
• Generically the estimate x̂ of a random vector x
given some information (random vector) I , which
minimizes the mean squared error

  E_x{ ‖x − x̂‖² | I } = E{ ‖x‖² | I } − 2 E{x | I}′ x̂ + ‖x̂‖²

is E{x | I} (set to zero the derivative with respect


to x̂ of the above quadratic form).
• The estimator portion of the optimal controller
is optimal for the problem of estimating the state
xk assuming the control is not subject to choice.
• The actuator portion is optimal for the control
problem assuming perfect state information.

68
STEADY STATE/IMPLEMENTATION ASPECTS

• As N → ∞, the solution of the Riccati equation


converges to a steady state and Lk → L.
• If x0 , wk , and vk are Gaussian, E{xk | Ik } is
a linear function of Ik and is generated by a nice
recursive algorithm, the Kalman filter.
• The Kalman filter involves also a Riccati equa-
tion, so for N → ∞, and a stationary system, it
also has a steady-state structure.
• Thus, for Gaussian uncertainty, the solution is
nice and possesses a steady state.
• For non-Gaussian uncertainty, computing E{xk | Ik} may be very difficult, so a suboptimal solution is typically used.
• Most common suboptimal controller: Replace
E{xk | Ik } by the estimate produced by the Kalman
filter (act as if x0 , wk , and vk are Gaussian).
• It can be shown that this controller is optimal
within the class of controllers that are linear func-
tions of Ik .

69
6.231 DYNAMIC PROGRAMMING

LECTURE 7

LECTURE OUTLINE

• DP for imperfect state info


• Sufficient statistics
• Conditional state distribution as a sufficient
statistic
• Finite-state systems
• Examples

70
REVIEW: IMPERFECT STATE INFO PROBLEM

• Instead of knowing xk , we receive observations

  z0 = h0(x0, v0),   zk = hk(xk, uk−1, vk),   k ≥ 1

• Ik : information vector available at time k :

I0 = z0 , Ik = (z0 , z1 , . . . , zk , u0 , u1 , . . . , uk−1 ), k ≥ 1

• Optimization over policies π = {µ0 , µ1 , . . . , µN −1 },


where µk (Ik ) ∈ Uk , for all Ik and k.
• Find a policy π that minimizes
  Jπ = E_{x0, wk, vk, k=0,...,N−1}{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(Ik), wk) }

subject to the equations



  xk+1 = fk(xk, µk(Ik), wk),   k ≥ 0,
  z0 = h0(x0, v0),   zk = hk(xk, µk−1(Ik−1), vk),   k ≥ 1

71
DP ALGORITHM

• DP algorithm:
  Jk(Ik) = min_{uk ∈ Uk} E_{xk, wk, zk+1}[ gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk ]

for k = 0, 1, . . . , N − 2, and for k = N − 1,

  JN−1(IN−1) = min_{uN−1 ∈ UN−1} E_{xN−1, wN−1}[ gN−1(xN−1, uN−1, wN−1)
                 + gN( fN−1(xN−1, uN−1, wN−1) ) | IN−1, uN−1 ]

• The optimal cost J* is given by

  J* = E_{z0}{ J0(z0) }.

72
SUFFICIENT STATISTICS

• Suppose there is a function Sk (Ik ) such that the


min in the right-hand side of the DP algorithm can
be written in terms of some function Hk as

  min_{uk ∈ Uk} Hk( Sk(Ik), uk )

• Such a function Sk is called a sufficient statistic.


• An optimal policy obtained by the preceding
minimization can be written as

  µ*k(Ik) = µk( Sk(Ik) ),

where µk is an appropriate function.


• Example of a sufficient statistic: Sk (Ik ) = Ik
• Another important sufficient statistic

Sk (Ik ) = Pxk |Ik ,

assuming that vk is characterized by a probability


distribution Pvk (· | xk−1 , uk−1 , wk−1 )

73
DP ALGORITHM IN TERMS OF Pxk|Ik

• Filtering Equation: Pxk |Ik is generated recur-


sively by a dynamic system (estimator) of the form

  Pxk+1|Ik+1 = Φk( Pxk|Ik, uk, zk+1 )

for a suitable function Φk


• DP algorithm can be written as

  J̄k(Pxk|Ik) = min_{uk ∈ Uk} E_{xk, wk, zk+1}[ gk(xk, uk, wk) + J̄k+1( Φk(Pxk|Ik, uk, zk+1) ) | Ik, uk ]

• It is the DP algorithm for a new problem whose


state is Pxk |Ik (also called belief state)

[Figure: belief-state reformulation — the system xk+1 = fk(xk, uk, wk) and measurement zk = hk(xk, uk−1, vk) feed an estimator φk−1 that produces Pxk|Ik; the actuator µk maps Pxk|Ik to the control uk]

74
EXAMPLE: A SEARCH PROBLEM

• At each period, decide to search or not search


a site that may contain a treasure.
• If we search and a treasure is present, we find
it with prob. β and remove it from the site.
• Treasure’s worth: V . Cost of search: C
• States: treasure present & treasure not present
• Each search can be viewed as an observation of
the state
• Denote

pk : prob. of treasure present at the start of time k

with p0 given.
• pk evolves at time k according to the equation

  pk+1 = pk                                      if not search,
         0                                       if search and find treasure,
         pk(1 − β) / ( pk(1 − β) + 1 − pk )      if search and no treasure.

This is the filtering equation.

75
SEARCH PROBLEM (CONTINUED)

• DP algorithm

  J̄k(pk) = max[ 0, −C + pk βV + (1 − pk β) J̄k+1( pk(1 − β) / ( pk(1 − β) + 1 − pk ) ) ],

with J̄N(pN) = 0.
• Can be shown by induction that the functions J̄k satisfy

  J̄k(pk) = 0   if pk ≤ C/(βV),
          > 0   if pk > C/(βV).

• Furthermore, it is optimal to search at period


k if and only if
pk βV ≥ C

(expected reward from the next search ≥ the cost


of the search - a myopic rule)

76
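A sketch (not from the slides) of this DP on a discretized belief grid; the parameters β, V, C are illustrative, and the computed threshold can be compared with the analytic value C/(βV).

```python
# Sketch: search-problem DP over a belief grid (reward-to-go formulation).
beta, V, C, N = 0.5, 10.0, 1.0, 20
grid = [i / 1000.0 for i in range(1001)]            # belief p in [0, 1]

def next_p(p):                                      # filtering eq., treasure not found
    return p * (1 - beta) / (p * (1 - beta) + 1 - p)

def interp(J, p):                                   # linear interpolation on the grid
    i = min(int(p * 1000), 999)
    frac = p * 1000 - i
    return (1 - frac) * J[i] + frac * J[i + 1]

J = [0.0] * len(grid)                               # J_N = 0
for k in range(N):
    J = [max(0.0, -C + p * beta * V + (1 - p * beta) * interp(J, next_p(p)))
         for p in grid]
threshold = next(p for p, j in zip(grid, J) if j > 0)
print(threshold, C / (beta * V))                    # numeric vs analytic threshold
```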
FINITE-STATE SYSTEMS - POMDP

• Suppose the system is a finite-state Markov


chain, with states 1, . . . , n.
• Then the conditional probability distribution
Pxk |Ik is an n-vector

P (xk = 1 | Ik ), . . . , P (xk = n | Ik )

• The DP algorithm can be executed over the n-


dimensional simplex (state space is not expanding
with increasing k)
• When the control and observation spaces are
also finite sets the problem is called a POMDP
(Partially Observed Markov Decision Problem).
• For POMDP it turns out that the cost-to-go
functions J k in the DP algorithm are piecewise
linear and concave (Exercise 5.7)
• Useful in practice both for exact and approxi-
mate computation.

77
INSTRUCTION EXAMPLE I

• Teaching a student some item. Possible states are L: Item learned, or L̄: Item not learned.
• Possible decisions: T: Terminate the instruction, or T̄: Continue the instruction for one period and then conduct a test that indicates whether the student has learned the item.
• Possible test outcomes: R: Student gives a correct answer, or R̄: Student gives an incorrect answer.
• Probabilistic structure

[Figure: transition and observation probabilities — a learned item (L) stays learned with probability 1 and yields a correct answer R with probability 1; an unlearned item (L̄) becomes learned with probability t, and if still unlearned yields a correct answer with probability r]

• Cost of instruction: I per period


• Cost of terminating instruction: 0 if student
has learned the item, and C > 0 if not.

78
INSTRUCTION EXAMPLE II

• Let pk : prob. student has learned the item given


the test results so far

pk = P (xk = L | z0 , z1 , . . . , zk ).

• Filtering equation: Using Bayes’ rule

  pk+1 = Φ(pk, zk+1)
       = ( 1 − (1 − t)(1 − pk) ) / ( 1 − (1 − t)(1 − r)(1 − pk) )   if zk+1 = R,
         0                                                           if zk+1 = R̄.

• DP algorithm:

  J̄k(pk) = min[ (1 − pk)C,  I + E_{zk+1}{ J̄k+1( Φ(pk, zk+1) ) } ]

starting with

  J̄N−1(pN−1) = min[ (1 − pN−1)C,  I + (1 − t)(1 − pN−1)C ].

79
INSTRUCTION EXAMPLE III

• Write the DP algorithm as

  J̄k(pk) = min[ (1 − pk)C,  I + Ak(pk) ],

where

  Ak(pk) = P(zk+1 = R | Ik) J̄k+1( Φ(pk, R) ) + P(zk+1 = R̄ | Ik) J̄k+1( Φ(pk, R̄) )

• Can show by induction that Ak (p) are piecewise


linear, concave, monotonically decreasing, with

Ak−1 (p) ≤ Ak (p) ≤ Ak+1 (p), for all p ∈ [0, 1].

(The cost-to-go at knowledge prob. p increases as


we come closer to the end of horizon.)

[Figure: the termination cost (1 − p)C and the continuation costs I + AN−1(p), I + AN−2(p), I + AN−3(p) plotted against p; their intersections with (1 − p)C define the thresholds αN−1, αN−2, αN−3, which lie to the left of the point 1 − I/C]
80
6.231 DYNAMIC PROGRAMMING

LECTURE 8

LECTURE OUTLINE

• Suboptimal control
• Cost approximation methods: Classification
• Certainty equivalent control: An example
• Limited lookahead policies
• Performance bounds
• Problem approximation approach
• Parametric cost-to-go approximation

81
PRACTICAL DIFFICULTIES OF DP

• The curse of dimensionality


− Exponential growth of the computational and
storage requirements as the number of state
variables and control variables increases
− Quick explosion of the number of states in
combinatorial problems
− Intractability of imperfect state information
problems
• The curse of modeling
− Mathematical models
− Computer/simulation models
• There may be real-time solution constraints
− A family of problems may be addressed. The
data of the problem to be solved is given with
little advance notice
− The problem data may change as the system
is controlled – need for on-line replanning

82
COST-TO-GO FUNCTION APPROXIMATION

• Use a policy computed from the DP equation where the optimal cost-to-go function Jk+1 is replaced by an approximation J̃k+1. (Sometimes E{gk} is also replaced by an approximation.)
• Apply µk (xk ), which attains the minimum in
  min_{uk ∈ Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) }

• There are several ways to compute J˜k+1 :


− Off-line approximation: The entire function
J˜k+1 is computed for every k , before the con-
trol process begins.
− On-line approximation: Only the values J˜k+1 (xk+1 )
at the relevant next states xk+1 are com-
puted and used to compute uk just after the
current state xk becomes known.
− Simulation-based methods: These are off-
line and on-line methods that share the com-
mon characteristic that they are based on
Monte-Carlo simulation. Some of these methods are suitable for very large problems.

83
CERTAINTY EQUIVALENT CONTROL (CEC)

• Idea: Replace the stochastic problem with a


deterministic problem
• At each time k , the future uncertain quantities
are fixed at some “typical” values
• On-line implementation for a perfect state info
problem. At each time k:
(1) Fix the wi, i ≥ k, at some nominal values w̄i. Solve the deterministic problem:

      minimize gN(xN) + Σ_{i=k}^{N−1} gi(xi, ui, w̄i)

    where xk is known, and

      ui ∈ Ui,   xi+1 = fi(xi, ui, w̄i).
(2) Use the first control in the optimal control
sequence found.
• Equivalently, we apply µ̄k(xk) that minimizes

  gk(xk, uk, w̄k) + J̃k+1( fk(xk, uk, w̄k) )

where J̃k+1 is the optimal cost of the corresponding deterministic problem.

84
EQUIVALENT OFF-LINE IMPLEMENTATION

• Let { µd0(x0), . . . , µdN−1(xN−1) } be an optimal controller obtained from the DP algorithm for the deterministic problem

  minimize gN(xN) + Σ_{k=0}^{N−1} gk( xk, µk(xk), w̄k )

  subject to xk+1 = fk( xk, µk(xk), w̄k ),   µk(xk) ∈ Uk

• The CEC applies at time k the control input


µdk (xk ).
• In an imperfect info version, xk is replaced by
an estimate xk (Ik ).

85
PARTIALLY STOCHASTIC CEC

• Instead of fixing all future disturbances to their


typical values, fix only some, and treat the rest as
stochastic.
• Important special case: Treat an imperfect state
information problem as one of perfect state infor-
mation, using an estimate xk (Ik ) of xk as if it were
exact.
• Multiaccess communication example: Consider
controlling the slotted Aloha system (Example 5.1.1
in the text) by optimally choosing the probabil-
ity of transmission of waiting packets. This is a
hard problem of imperfect state info, whose per-
fect state info version is easy.
• Natural partially stochastic CEC:
 
  µ̃k(Ik) = min[ 1, 1 / xk(Ik) ],

where xk (Ik ) is an estimate of the current packet


backlog based on the entire past channel history
of successes, idles, and collisions (which is Ik ).

86
GENERAL COST-TO-GO APPROXIMATION

• One-step lookahead (1SL) policy: At each k


and state xk, use the control µ̄k(xk) that attains the minimum in

  min_{uk ∈ Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },

where
− J˜N = gN .
− J˜k+1 : approximation to true cost-to-go Jk+1
• Two-step lookahead policy: At each k and
xk , use the control µ̃k (xk ) attaining the minimum
above, where the function J˜k+1 is obtained using a
1SL approximation (solve a 2-step DP problem).
• If J˜k+1 is readily available and the minimiza-
tion above is not too hard, the 1SL policy is im-
plementable on-line.
• Sometimes one also replaces Uk (xk ) above with
a subset of “most promising controls” U k (xk ).
• As the length of lookahead increases, the re-
quired computation quickly explodes.

87
PERFORMANCE BOUNDS FOR 1SL

• Let J̄k(xk) be the cost-to-go from (xk, k) of the 1SL policy, based on functions J̃k.
• Assume that for all (xk, k), we have

  Ĵk(xk) ≤ J̃k(xk),        (*)

where ĴN = gN and for all k,

  Ĵk(xk) = min_{uk ∈ Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },

[so Ĵk(xk) is computed along with µ̄k(xk)]. Then

  J̄k(xk) ≤ Ĵk(xk),   for all (xk, k).


• Important application: When J˜k is the cost-to-
go of some heuristic policy (then the 1SL policy is
called the rollout policy).
• The bound can be extended to the case where
there is a δk in the RHS of (*). Then
  J̄k(xk) ≤ J̃k(xk) + δk + · · · + δN−1

88
COMPUTATIONAL ASPECTS

• Sometimes nonlinear programming can be used


to calculate the 1SL or the multistep version [par-
ticularly when Uk (xk ) is not a discrete set]. Con-
nection with stochastic programming (2-stage DP)
methods (see text).
• The choice of the approximating functions J˜k
is critical, and is calculated in a variety of ways.
• Some approaches:
(a) Problem Approximation: Approximate the
optimal cost-to-go with some cost derived
from a related but simpler problem
(b) Parametric Cost-to-Go Approximation: Ap-
proximate the optimal cost-to-go with a func-
tion of a suitable parametric form, whose pa-
rameters are tuned by some heuristic or sys-
tematic scheme (Neuro-Dynamic Program-
ming)
(c) Rollout Approach: Approximate the opti-
mal cost-to-go with the cost of some subop-
timal policy, which is calculated either ana-
lytically or by simulation

89
PROBLEM APPROXIMATION

• Many (problem-dependent) possibilities


− Replace uncertain quantities by nominal val-
ues, or simplify the calculation of expected
values by limited simulation
− Simplify difficult constraints or dynamics
• Enforced decomposition example: Route m ve-
hicles that move over a graph. Each node has a
“value.” First vehicle that passes through the node
collects its value. Want to max the total collected
value, subject to initial and final time constraints
(plus time windows and other constraints).
• Usually the 1-vehicle version of the problem is
much simpler. This motivates an approximation
obtained by solving single vehicle problems.
• 1SL scheme: At time k and state xk (position
of vehicles and “collected value nodes”), consider
all possible kth moves by the vehicles, and at the
resulting states we approximate the optimal value-
to-go with the value collected by optimizing the
vehicle routes one-at-a-time

90
PARAMETRIC COST-TO-GO APPROXIMATION

• Use a cost-to-go approximation from a para-


metric class J˜(x, r) where x is the current state
and r = (r1 , . . . , rm ) is a vector of “tunable” scalars
(weights).
• By adjusting the weights, one can change the
“shape” of the approximation J˜ so that it is rea-
sonably close to the true optimal cost-to-go func-
tion.
• Two key issues:
− The choice of parametric class J˜(x, r) (the
approximation architecture).
− Method for tuning the weights (“training”
the architecture).
• Successful application strongly depends on how
these issues are handled, and on insight about the
problem.
• Sometimes a simulation-based algorithm is used,
particularly when there is no mathematical model
of the system.
• We will look in detail at these issues after a few
lectures.

91
APPROXIMATION ARCHITECTURES

• Divided in linear and nonlinear [i.e., linear or


nonlinear dependence of J˜(x, r) on r]
• Linear architectures are easier to train, but non-
linear ones (e.g., neural networks) are richer
• Linear feature-based architecture: φ = (φ1 , . . . , φm )

  J̃(x, r) = φ(x)′ r = Σ_{j=1}^{m} φj(x) rj

[Figure: state x → feature extraction mapping → feature vector φ(x) → linear cost approximator φ(x)′r]

• Ideally, the features will encode much of the


nonlinearity that is inherent in the cost-to-go ap-
proximated, and the approximation may be quite
accurate without a complicated architecture
• Anything sensible can be used as features. Some-
times the state space is partitioned, and “local”
features are introduced for each subset of the par-
tition (they are 0 outside the subset)

92
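A minimal sketch (not from the slides) of a linear feature-based architecture J̃(x, r) = φ(x)′r, with the weights r tuned here by least squares on sampled cost values; the feature map and the sampled costs are made-up stand-ins for problem-specific choices.

```python
# Sketch: linear feature-based cost approximation with least-squares training.
import numpy as np

def phi(x):                                   # hypothetical feature map
    return np.array([1.0, x, x * x])

# sampled states and (noisy) cost values, e.g. obtained by simulation
xs = np.linspace(-2, 2, 41)
costs = xs ** 2 + 0.5 * np.sin(3 * xs)        # stand-in for observed costs

Phi = np.array([phi(x) for x in xs])          # rows are phi(x)'
r, *_ = np.linalg.lstsq(Phi, costs, rcond=None)   # tune the weights

def J_tilde(x):
    return phi(x) @ r

print(J_tilde(0.5))
```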
AN EXAMPLE - COMPUTER CHESS

• Chess programs use a feature-based position


evaluator that assigns a score to each move/position

[Figure: position evaluator — feature extraction (material balance, mobility, safety, etc.) followed by a weighting of features that produces a score]

• Many context-dependent special features.


• Most often the weighting of features is linear
but multistep lookahead is involved.
• Most often the training is done “manually,” by
trial and error.

93
ANOTHER EXAMPLE - AGGREGATION

• Main elements (in a finite-state context):


− Introduce “aggregate” states S1 , . . . , Sm , viewed
as the states of an “aggregate” system
− Define transition probabilities and costs of
the aggregate system, by relating original
system states with aggregate states (using so
called “aggregation and disaggregation prob-
abilities”)
− Solve (exactly or approximately) the “ag-
gregate” problem by any kind of method (in-
cluding simulation-based) ... more on this
later.
− Use the optimal cost of the aggregate prob-
lem to approximate the optimal cost of each
original problem state as a linear combina-
tion of the optimal aggregate state costs
• This is a linear feature-based architecture (the
optimal aggregate state costs are the features)
• Hard aggregation example: Aggregate states
Sj are a partition of original system states (each
original state belongs to one and only one Sj ).

94
AN EXAMPLE: REPRESENTATIVE SUBSETS

• The aggregate states Sj are disjoint “represen-


tative” subsets of original system states
[Figure: the original state space with disjoint “representative” subsets S1, . . . , S8 serving as aggregate states, and aggregation probabilities φx1, φx2, . . . attaching an original state x to them]
• Common case: Each Sj is a group of states with
“similar characteristics”
• Compute a “cost” rj for each aggregate state
Sj (using some method)
• Approximate the optimal cost of each original system state x with Σ_{j=1}^{m} φxj rj

• For each x, the φxj , j = 1, . . . , m, are the “ag-


gregation probabilities” ... roughly the degrees of
membership of state x in the aggregate states Sj
• Each φxj is prespecified and can be viewed as
the j th feature of state x

95
6.231 DYNAMIC PROGRAMMING

LECTURE 9

LECTURE OUTLINE

• Rollout algorithms
• Policy improvement property
• Discrete deterministic problems
• Approximations of rollout algorithms
• Model Predictive Control (MPC)
• Discretization of continuous time
• Discretization of continuous space
• Other suboptimal approaches

96
ROLLOUT ALGORITHMS

• One-step lookahead policy: At each k and state


xk, use the control µ̄k(xk) that attains the minimum in

  min_{uk ∈ Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },
uk ∈Uk (xk )

where
− J˜N = gN .
− J˜k+1 : approximation to true cost-to-go Jk+1
• Rollout algorithm: When J˜k is the cost-to-go of
some heuristic policy (called the base policy)
• Policy improvement property (to be shown):
The rollout algorithm achieves no worse (and usu-
ally much better) cost than the base heuristic start-
ing from the same state.
• Main difficulty: Calculating J˜k (xk ) may be com-
putationally intensive if the cost-to-go of the base
policy cannot be analytically calculated.
− May involve Monte Carlo simulation if the
problem is stochastic.
− Things improve in the deterministic case.

97
EXAMPLE: THE QUIZ PROBLEM

• A person is given N questions; answering cor-


rectly question i has probability pi , reward vi .
Quiz terminates at the first incorrect answer.
• Problem: Choose the ordering of questions so
as to maximize the total expected reward.
• Assuming no other constraints, it is optimal to
use the index policy: Answer questions in decreas-
ing order of pi vi /(1 − pi ).
• With minor changes in the problem, the index
policy need not be optimal. Examples:
− A limit (< N ) on the maximum number of
questions that can be answered.
− Time windows, sequence-dependent rewards,
precedence constraints.
• Rollout with the index policy as base policy:
Convenient because at a given state (subset of
questions already answered), the index policy and
its expected reward can be easily calculated.
• Very effective for solving the quiz problem and
important generalizations in scheduling (see Bert-
sekas and Castanon, J. of Heuristics, Vol. 5, 1999).

98
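A sketch (not from the slides) of one rollout step for a quiz problem with a limit M on the number of questions, using the index policy as base heuristic; all data are made up, and the full rollout policy would repeat this computation at each successive state.

```python
# Sketch: one rollout step for a quiz problem with at most M questions,
# with the index policy as the base heuristic.
questions = {0: (0.9, 1.0), 1: (0.5, 4.0), 2: (0.8, 2.0), 3: (0.6, 1.5)}
M = 2

def heuristic_order(remaining):                  # base policy: index rule
    return sorted(remaining,
                  key=lambda i: questions[i][0] * questions[i][1]
                  / (1 - questions[i][0]), reverse=True)

def expected_reward(order, slots):               # reward of a fixed ordering
    total, stay = 0.0, 1.0
    for i in order[:slots]:
        p, R = questions[i]
        total += stay * p * R
        stay *= p
    return total

def rollout_step(remaining, slots):
    """Pick the next question by 1-step lookahead + heuristic completion."""
    best_i, best_val = None, -1.0
    for i in remaining:
        p, R = questions[i]
        rest = heuristic_order(remaining - {i})
        val = p * (R + expected_reward(rest, slots - 1))
        if val > best_val:
            best_i, best_val = i, val
    return best_i, best_val

print(rollout_step(set(questions), M))
print(expected_reward(heuristic_order(set(questions)), M))   # base policy value
```

By the cost (here, reward) improvement property below, the rollout value is never worse than the base heuristic value printed on the last line.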
COST IMPROVEMENT PROPERTY

• Let

  J̄k(xk): Cost-to-go of the rollout policy
  Hk(xk): Cost-to-go of the base policy

• We claim that J̄k(xk) ≤ Hk(xk) for all xk, k
• Proof by induction: We have J̄N(xN) = HN(xN) for all xN. Assume that

  J̄k+1(xk+1) ≤ Hk+1(xk+1),   ∀ xk+1.

Let µ̄k(xk) and µk(xk) be the controls applied by rollout and heuristic at xk. Then, for all xk

  J̄k(xk) = E{ gk(xk, µ̄k(xk), wk) + J̄k+1( fk(xk, µ̄k(xk), wk) ) }
          ≤ E{ gk(xk, µ̄k(xk), wk) + Hk+1( fk(xk, µ̄k(xk), wk) ) }
          ≤ E{ gk(xk, µk(xk), wk) + Hk+1( fk(xk, µk(xk), wk) ) }
          = Hk(xk)

− Induction hypothesis ==> 1st inequality
− Min selection of µ̄k(xk) ==> 2nd inequality
− Definition of Hk, µk ==> last equality

99
DISCRETE DETERMINISTIC PROBLEMS

• Any discrete optimization problem can be repre-


sented sequentially by breaking down the decision
process into stages.
• A tree/shortest path representation. The leaves
of the tree correspond to the feasible solutions.
• Example: Traveling salesman problem. Find a
minimum cost tour through N cities.
(Figure: traveling salesman problem with four cities A, B, C, D, viewed as a tree rooted at the origin node A; successive levels hold the partial tours AB, AC, AD, then ABC, . . . , ADC, and the leaves hold the complete tours ABCD, . . . , ADCB.)

• Complete partial solutions, one stage at a time


• May apply rollout with any heuristic that can
complete a partial solution
• No costly stochastic simulation needed

100
EXAMPLE: THE BREAKTHROUGH PROBLEM


• Given a binary tree with N stages.


• Each arc is free or is blocked (crossed out)
• Problem: Find a free path from the root to the
leaves (such as the one shown with thick lines).
• Base heuristic (greedy): Follow the right branch
if free; else follow the left branch if free.
• This is a rare rollout instance that admits a
detailed analysis.
• For large N and given prob. of free branch:
the rollout algorithm requires O(N ) times more
computation, but has O(N ) times larger prob. of
finding a free path than the greedy algorithm.
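• A sketch of the comparison suggested above, assuming each arc is free independently with some probability q (the values of N and q below are arbitrary): the greedy base heuristic and its rollout version are run on independently sampled trees, and their empirical success probabilities are printed.

```python
import random

N, q = 12, 0.65          # tree depth and probability that an arc is free

def arc_free(cache, node, child):
    """Sample (and memoize) whether the arc node -> node+child is free."""
    key = (node, child)
    if key not in cache:
        cache[key] = random.random() < q
    return cache[key]

def greedy_reaches_leaf(cache, node, depth):
    """Base heuristic: take the right arc if free, else the left arc if free."""
    for _ in range(depth, N):
        if arc_free(cache, node, 'R'):
            node += 'R'
        elif arc_free(cache, node, 'L'):
            node += 'L'
        else:
            return False
    return True

def rollout_reaches_leaf(cache):
    """At each node, prefer a free arc from which the greedy heuristic
    still reaches a leaf; otherwise fall back to any free arc."""
    node = ''
    for d in range(N):
        moved = False
        for child in ('R', 'L'):
            if arc_free(cache, node, child) and \
               greedy_reaches_leaf(cache, node + child, d + 1):
                node, moved = node + child, True
                break
        if not moved:
            for child in ('R', 'L'):
                if arc_free(cache, node, child):
                    node, moved = node + child, True
                    break
        if not moved:
            return False
    return True

random.seed(1)
trials = 2000
greedy_rate = sum(greedy_reaches_leaf({}, '', 0) for _ in range(trials)) / trials
rollout_rate = sum(rollout_reaches_leaf({}) for _ in range(trials)) / trials
print("greedy success prob :", greedy_rate)
print("rollout success prob:", rollout_rate)
```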

101
DET. EXAMPLE: ONE-DIMENSIONAL WALK

• A person takes either a unit step to the left or


a unit step to the right. Minimize the cost g(i) of
the point i where he will end up after N steps.
(Figure: the walk as a path from (0, 0) down to a terminal point (N, i), i = −N, . . . , N, together with a plot of the terminal cost g(i) over i.)

• Base heuristic: Always go to the right. Rollout


finds the rightmost local minimum.
• Alternative base heuristic: Compare always go
to the right and always go to the left. Choose the
best of the two. Rollout finds a global minimum.

102
A ROLLOUT ISSUE FOR DISCRETE PROBLEMS

• The base heuristic need not constitute a policy


in the DP sense.
• Reason: Depending on its starting point, the
base heuristic may not apply the same control at
the same state.
• As a result the cost improvement property may
be lost (except if the base heuristic has a property
called sequential consistency; see the text for a
formal definition).
• The cost improvement property is restored in
two ways:
− The base heuristic has a property called se-
quential improvement which guarantees cost
reduction at each step (see the text for a for-
mal definition).
− A variant of the rollout algorithm, called for-
tified rollout, is used, which enforces cost
improvement. Roughly speaking the “best”
solution found so far is maintained, and it
is followed whenever at any time the stan-
dard version of the algorithm tries to follow
a “worse” solution (see the text).

103
ROLLING HORIZON WITH ROLLOUT

• We can use a rolling horizon approximation in


calculating the cost-to-go of the base heuristic.
• Because the heuristic is suboptimal, the ratio-
nale for a long rolling horizon becomes weaker.
• Example: N -stage stopping problem where the
stopping cost is 0, the continuation cost is either
−ǫ or 1, where 0 < ǫ << 1, and the first state
with continuation cost equal to 1 is state m. Then
the optimal policy is to stop at state m, and the
optimal cost is −mǫ.

• Consider the heuristic that continues at every


state, and the rollout policy that is based on this
heuristic, with a rolling horizon of ℓ ≤ m steps.
• It will continue up to the first m − ℓ + 1 stages,
thus compiling a cost of −(m − ℓ + 1)ǫ. The rollout
performance improves as ℓ becomes shorter!
• Limited vision may work to our advantage!

104
MODEL PREDICTIVE CONTROL (MPC)

• Special case of rollout for linear deterministic


systems (similar extensions to nonlinear/stochastic)
− System: xk+1 = Axk + Buk
− Quadratic cost per stage: x′k Qxk + u′k Ruk
− Constraints: xk ∈ X , uk ∈ U (xk )
• Assumption: For any x0 ∈ X there is a feasible
state-control sequence that brings the system to 0
in m steps, i.e., xm = 0
• MPC at state xk solves an m-step optimal con-
trol problem with constraint xk+m = 0, i.e., finds
a sequence ūk , . . . , ūk+m−1 that minimizes
$$\sum_{\ell=0}^{m-1} \big( x_{k+\ell}'\, Q\, x_{k+\ell} + u_{k+\ell}'\, R\, u_{k+\ell} \big)$$

subject to xk+m = 0
• Then applies the first control ūk (and repeats
at the next state xk+1 )
• MPC is rollout with heuristic derived from the
corresponding (m − 1)-step optimal control problem
• Key Property of MPC: Since the heuristic is sta-
ble, the rollout is also stable (suggested by policy
improvement property; see the text).
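• A minimal MPC sketch under the assumptions of this slide (the linear system, weights, and horizon below are made-up illustrations): at each step the m-step quadratic problem with terminal constraint xk+m = 0 is solved as an equality-constrained QP via its KKT system, and only the first control is applied. A Riccati-based solution of the same m-step problem would serve equally well.

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # double integrator (illustrative)
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[0.1]])
m = 6                                     # MPC horizon
n, p = B.shape

def mpc_control(x):
    # Stack dynamics: X = G x + Phi U, with X = (x_1, ..., x_m), U = (u_0, ..., u_{m-1})
    G = np.vstack([np.linalg.matrix_power(A, i + 1) for i in range(m)])
    Phi = np.zeros((m * n, m * p))
    for i in range(m):
        for j in range(i + 1):
            Phi[i*n:(i+1)*n, j*p:(j+1)*p] = np.linalg.matrix_power(A, i - j) @ B
    Qbar = np.kron(np.eye(m), Q)          # x_0 term is constant; x_m is forced to 0
    Rbar = np.kron(np.eye(m), R)
    # Cost (up to a constant): U' H U + 2 f' U
    H = Phi.T @ Qbar @ Phi + Rbar
    f = Phi.T @ Qbar @ G @ x
    # Terminal constraint x_m = 0:  C U = d
    C = Phi[-n:, :]
    d = -(G @ x)[-n:]
    # KKT system for  min U'HU + 2f'U  s.t.  C U = d
    KKT = np.block([[2 * H, C.T], [C, np.zeros((n, n))]])
    rhs = np.concatenate([-2 * f, d])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:p]                        # apply only the first control

x = np.array([5.0, 0.0])
for k in range(15):
    u = mpc_control(x)
    x = A @ x + B @ u
    print(k, x, u)
```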
105
DISCRETIZATION

• If the time, and/or state space, and/or control


space are continuous, they must be discretized.
• Consistency issue, i.e., as the discretization be-
comes finer, the cost-to-go functions of the dis-
cretized problem should converge to those of the
original problem.
• Pitfall with discretizing continuous time: The
control constraint set may change a lot as we pass
to the discrete-time approximation.
• Example: Consider the system ẋ(t) = u(t), with
control constraint u(t) ∈ {−1, 1}. The reachable
states after time δ are x(t + δ) = x(t) + γ, where γ
can be any value in [−δ, δ] (obtained by switching
between u = −1 and u = 1 within the interval).
• Compare it with the reachable states after we
discretize the system naively: x(t+δ) = x(t)+δu(t),
with u(t) ∈ {−1, 1}.
• “Convexification effect” of continuous time: a
discrete control constraint set in continuous-time
differential systems, is equivalent to a continuous
control constraint set when the system is looked
at discrete times.

106
SPACE DISCRETIZATION

• Given a discrete-time system with state space


S, consider a finite subset S̄; for example S̄ could
be a finite grid within a continuous state space S.
• Difficulty: f(x, u, w) ∉ S̄ for x ∈ S̄.
• We define an approximation to the original
problem, with state space S̄, as follows:
• Express each x ∈ S as a convex combination of
states in S̄, i.e.,

$$x = \sum_{x_i \in \bar S} \phi_i(x)\, x_i \qquad \text{where } \phi_i(x) \ge 0,\ \ \sum_i \phi_i(x) = 1$$

• Define a “reduced” dynamic system with state
space S̄, whereby from each xi ∈ S̄ we move to
x = f(xi, u, w) according to the system equation
of the original problem, and then move to xj ∈ S̄
with probabilities φj(x).
• Define similarly the corresponding cost per stage
of the transitions of the reduced system.
• Note application to finite-state POMDP (dis-
cretization of the simplex of the belief states).

107
SPACE DISCRETIZATION/AGGREGATION

• Let J̄k(xi) be the optimal cost-to-go of the “reduced” problem from each state xi ∈ S̄ and time k onward.
• Approximate the optimal cost-to-go of any x ∈ S for the original problem by

$$\tilde J_k(x) = \sum_{x_i \in \bar S} \phi_i(x)\, \bar J_k(x_i),$$

and use one-step-lookahead based on J˜k .


• The coefficients φi (x) can be viewed as features
in an aggregation scheme.
• Important question: Consistency, i.e., as the
number of states in S increases, J˜k (x) should con-
verge to the optimal cost-to-go of the original prob.
• Interesting observation: While the original prob-
lem may be deterministic, the reduced problem is
always stochastic.
• Generalization: The set S̄ may be any finite set
(not necessarily a subset of S) as long as the coefficients φi(x)
admit a meaningful interpretation that quantifies
the degree of association of x with xi (a form of
aggregation).
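• A one-dimensional sketch of the interpolation/aggregation coefficients φi(x) (the grid and the cost values below are arbitrary): each x is expressed as a convex combination of its two neighboring grid points, and J̃(x) is the corresponding weighted average.

```python
import numpy as np

grid = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # the finite subset S-bar
Jbar = np.array([3.0, 2.1, 1.4, 1.0, 0.9])     # costs computed on the grid

def phi(x):
    """Convex-combination weights of x with respect to its two neighbors."""
    w = np.zeros(len(grid))
    i = np.searchsorted(grid, x)
    if i == 0:
        w[0] = 1.0
    elif i == len(grid):
        w[-1] = 1.0
    else:
        t = (x - grid[i-1]) / (grid[i] - grid[i-1])
        w[i-1], w[i] = 1.0 - t, t
    return w                                    # w >= 0 and sum(w) = 1

def Jtilde(x):
    return phi(x) @ Jbar

print(phi(0.3))          # weights on the grid points 0.25 and 0.5
print(Jtilde(0.3))
```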

108
OTHER SUBOPTIMAL APPROACHES

• Minimize the DP equation error (Fitted Value


Iteration): Approximate Jk (xk ) with J˜k (xk , rk ), where
rk is a parameter vector, chosen to minimize some
form of error in the DP equations
− Can be done sequentially going backwards
in time (approximate Jk using an approxi-
mation of Jk+1 , starting with J˜N = gN ).
• Direct approximation of control policies: For a
subset of states xi , i = 1, . . . , m, find
$$\hat\mu_k(x^i) = \arg\min_{u_k\in U_k(x^i)} E\Big\{ g_k(x^i,u_k,w_k) + \tilde J_{k+1}\big(f_k(x^i,u_k,w_k),\, r_{k+1}\big)\Big\}$$

Then find µ̃k(xk, sk), where sk is a vector of parameters obtained by solving the problem

$$\min_s \sum_{i=1}^{m} \big\| \hat\mu_k(x^i) - \tilde\mu_k(x^i, s) \big\|^2$$

• Approximation in policy space: Do not bother


with cost-to-go approximations. Parametrize the
policies as µ̃k (xk , sk ), and minimize the cost func-
tion of the problem over the parameters sk (ran-
dom search is a possibility). 109
6.231 DYNAMIC PROGRAMMING

LECTURE 10

LECTURE OUTLINE

• Infinite horizon problems


• Stochastic shortest path (SSP) problems
• Bellman’s equation
• Dynamic programming – value iteration
• Discounted problems as special case of SSP

110
TYPES OF INFINITE HORIZON PROBLEMS

• Same as the basic problem, but:


− The number of stages is infinite.
− Stationary system and cost (except for dis-
counting).
• Total cost problems: Minimize
$$J_\pi(x_0) = \lim_{N\to\infty}\ \mathop{E}_{\substack{w_k \\ k=0,1,\ldots}} \left\{ \sum_{k=0}^{N-1} \alpha^k g\big(x_k,\mu_k(x_k),w_k\big) \right\}$$

(if the lim exists - otherwise lim sup).


− Stochastic shortest path (SSP) problems (α =
1, and a termination state)
− Discounted problems (α < 1, bounded g )
− Undiscounted, and discounted problems with
unbounded g
• Average cost problems
$$\lim_{N\to\infty} \frac{1}{N}\ \mathop{E}_{\substack{w_k \\ k=0,1,\ldots}} \left\{ \sum_{k=0}^{N-1} g\big(x_k,\mu_k(x_k),w_k\big) \right\}$$

• Infinite horizon characteristics: Challenging anal-


ysis, elegance of solutions and algorithms (station-
ary optimal policies are likely)
111
PREVIEW OF INFINITE HORIZON RESULTS

• Key issue: The relation between the infinite and


finite horizon optimal cost-to-go functions.
• For example, let α = 1 and JN (x) denote the
optimal cost of the N -stage problem, generated
after N DP iterations, starting from some J0
 
Jk+1 (x) = min E g(x, u, w) + Jk f (x, u, w) , ∀x
u∈U (x) w

• Typical results for total cost problems:


− Convergence of value iteration to J ∗ :

J ∗ (x) = min Jπ (x) = lim JN (x), ∀x


π N →∞

− Bellman’s equation holds for all x:

∗ ∗
 
J (x) = min E g(x, u, w) + J f (x, u, w)
u∈U (x) w

− Optimality condition: If µ(x) minimizes in


Bellman’s Eq., {µ, µ, . . .} is optimal.
• Bellman’s Eq. holds for all deterministic prob-
lems and “almost all” stochastic problems.
• Other results: True for SSP and discounted;
exceptions for other problems.
112
“EASY” AND “DIFFICULT” PROBLEMS

• Easy problems (Chapter 7, Vol. I of text)


− All of them are finite-state, finite-control
− Bellman’s equation has unique solution
− Optimal policies obtained from Bellman Eq.
− Value and policy iteration algorithms apply
• Somewhat complicated problems
− Infinite state, discounted, bounded g (con-
tractive structure)
− Finite-state SSP with “nearly” contractive
structure
− Bellman’s equation has unique solution, value
and policy iteration work
• Difficult problems (w/ additional structure)
− Infinite state, g ≥ 0 or g ≤ 0 (for all x, u, w)
− Infinite state deterministic problems
− SSP without contractive structure
• Hugely large and/or model-free problems
− Big state space and/or simulation model
− Approximate DP methods
• Measure theoretic formulations (not in this course)

113
STOCHASTIC SHORTEST PATH PROBLEMS

• Assume finite-state system: States 1, . . . , n and


special cost-free termination state t
− Transition probabilities pij (u)
− Control constraints u ∈ U (i) (finite set)
− Cost of policy π = {µ0 , µ1 , . . .} is
(N −1 )
X 
Jπ (i) = lim E g xk , µk (xk ) x0 = i
N →∞
k=0

− Optimal policy if Jπ (i) = J ∗ (i) for all i.


− Special notation: For stationary policies π =
{µ, µ, . . .}, we use Jµ (i) in place of Jπ (i).
• Assumption (termination inevitable): There ex-
ists integer m such that for all policies π :

$$\rho_\pi = \max_{i=1,\ldots,n} P\{x_m \neq t \mid x_0 = i,\ \pi\} < 1$$

• Note: We have ρ = maxπ ρπ < 1, since ρπ de-


pends only on the first m components of π .
• Shortest path examples: Acyclic (assumption is
satisfied); nonacyclic (assumption is not satisfied)

114
FINITENESS OF POLICY COST FUNCTIONS

• View
ρ = max ρπ < 1
π

as an upper bound on the non-termination prob.


during 1st m steps, regardless of policy used
• For any π and any initial state i

$$P\{x_{2m} \neq t \mid x_0 = i, \pi\} = P\{x_{2m} \neq t \mid x_m \neq t,\, x_0 = i, \pi\}\, P\{x_m \neq t \mid x_0 = i, \pi\} \le \rho^2$$

and similarly

$$P\{x_{km} \neq t \mid x_0 = i, \pi\} \le \rho^k, \qquad i = 1,\ldots,n$$

• So

$$E\{\text{cost between times } km \text{ and } (k+1)m - 1\} \le m \rho^k \max_{\substack{i=1,\ldots,n\\ u\in U(i)}} \big| g(i,u) \big|$$

and

$$J_\pi(i) \le \sum_{k=0}^{\infty} m \rho^k \max_{\substack{i=1,\ldots,n\\ u\in U(i)}} \big| g(i,u) \big| = \frac{m}{1-\rho} \max_{\substack{i=1,\ldots,n\\ u\in U(i)}} \big| g(i,u) \big|$$

115
MAIN RESULT

• Given any initial conditions J0 (1), . . . , J0 (n), the


sequence Jk (i) generated by value iteration,
" n
#
X
Jk+1 (i) = min g(i, u) + pij (u)Jk (j) , ∀ i
u∈U (i)
j=1

converges to the optimal cost J ∗ (i) for each i.


• Bellman’s equation has J ∗ (i) as unique solution:
" n
#

X
J (i) = min g(i, u) + pij (u)J ∗ (j) , ∀ i
u∈U (i)
j=1

J ∗ (t) = 0
• A stationary policy µ is optimal if and only
if for every state i, µ(i) attains the minimum in
Bellman’s equation.
• Key proof idea: The “tail” of the cost series,

$$\sum_{k=mK}^{\infty} E\big\{ g\big(x_k, \mu_k(x_k)\big)\big\}$$

vanishes as K increases to ∞.
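• A sketch of value iteration for a small SSP satisfying the termination assumption (the transition probabilities and costs below are made up; the probability mass not assigned to states 1, 2 goes to the termination state t).

```python
import numpy as np

# p[u][i][j]: probability of i -> j under control u, for i, j in {0, 1};
# the remaining probability mass goes to the termination state t.
p = {0: np.array([[0.6, 0.2], [0.3, 0.4]]),
     1: np.array([[0.1, 0.5], [0.2, 0.1]])}
g = {0: np.array([2.0, 1.0]),      # g(i, u)
     1: np.array([3.0, 0.5])}

def T(J):
    return np.min([g[u] + p[u] @ J for u in (0, 1)], axis=0)

J = np.zeros(2)
for _ in range(200):
    J = T(J)
print("J* ≈", J)                   # fixed point of Bellman's equation
print("greedy policy:", np.argmin([g[u] + p[u] @ J for u in (0, 1)], axis=0))
```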
116
OUTLINE OF PROOF THAT JN → J ∗

• Assume for simplicity that J0 (i) = 0 for all i.


For any K ≥ 1, write the cost of any policy π as

$$J_\pi(x_0) = \sum_{k=0}^{mK-1} E\big\{ g\big(x_k,\mu_k(x_k)\big)\big\} + \sum_{k=mK}^{\infty} E\big\{ g\big(x_k,\mu_k(x_k)\big)\big\} \le \sum_{k=0}^{mK-1} E\big\{ g\big(x_k,\mu_k(x_k)\big)\big\} + \sum_{k=K}^{\infty} \rho^k\, m \max_{i,u} \big|g(i,u)\big|$$

Take the minimum of both sides over π to obtain

$$J^*(x_0) \le J_{mK}(x_0) + \frac{\rho^K}{1-\rho}\, m \max_{i,u} |g(i,u)|.$$

Similarly, we have

$$J_{mK}(x_0) - \frac{\rho^K}{1-\rho}\, m \max_{i,u} |g(i,u)| \le J^*(x_0).$$

It follows that limK→∞ JmK (x0 ) = J ∗ (x0 ).


• JmK (x0 ) and JmK+k (x0 ) converge to the same
limit for k < m (since k extra steps far into the
future don’t matter), so JN (x0 ) → J ∗ (x0 ).
• Similarly, J0 ≠ 0 does not matter.

117
EXAMPLE

• Minimizing the E{Time to Termination}: Let

g(i, u) = 1, ∀ i = 1, . . . , n, u ∈ U (i)

• Under our assumptions, the costs J ∗ (i) uniquely


solve Bellman’s equation, which has the form
" n
#

X
J (i) = min 1+ pij (u)J ∗ (j) , i = 1, . . . , n
u∈U (i)
j=1

• In the special case where there is only one con-


trol at each state, J ∗ (i) is the mean first passage
time from i to t. These times, denoted mi , are the
unique solution of the classical equations
n
X
mi = 1 + pij mj , i = 1, . . . , n,
j=1

which are seen to be a form of Bellman’s equation

118
6.231 DYNAMIC PROGRAMMING

LECTURE 11

LECTURE OUTLINE

• Review of stochastic shortest path problems


• Computational methods for SSP
− Value iteration
− Policy iteration
− Linear programming
• Computational methods for discounted problems

119
STOCHASTIC SHORTEST PATH PROBLEMS

• Assume finite-state system: States 1, . . . , n and


special cost-free termination state t
− Transition probabilities pij (u)
− Control constraints u ∈ U (i) (finite set)
− Cost of policy π = {µ0 , µ1 , . . .} is

N −1
( )
Jπ (i) = lim E g xk , µk (xk ) x0 = i
N →∞
k=0

− Optimal policy if Jπ (i) = J ∗ (i) for all i.


− Special notation: For stationary policies π =
{µ, µ, . . .}, we use Jµ (i) in place of Jπ (i).
• Assumption (Termination inevitable): There exists an integer m such that for every policy and initial state, there is positive probability that the termination state will be reached after no more than m stages; for all π, we have

$$\rho_\pi = \max_{i=1,\ldots,n} P\{x_m \neq t \mid x_0 = i,\ \pi\} < 1$$

120
MAIN RESULT

• Given any initial conditions J0 (1), . . . , J0 (n), the


sequence Jk (i) generated by value iteration
" n
#
X
Jk+1 (i) = min g (i, u) + pij (u)Jk (j) , ∀ i
u∈U (i)
j=1

converges to the optimal cost J ∗ (i) for each i.


• Bellman’s equation has J ∗ (i) as unique solution:
" n
#

X
J (i) = min g(i, u) + pij (u)J ∗ (j) , ∀ i
u∈U (i)
j=1

• For a stationary policy µ, Jµ (i), i = 1, . . . , n,


are the unique solution of the linear system of n
equations
n
 X 
Jµ (i) = g i, µ(i) + pij µ(i) Jµ (j), ∀ i = 1, . . . , n
j=1

• A stationary policy µ is optimal if and only


if for every state i, µ(i) attains the minimum in
Bellman’s equation.

121
BELLMAN’S EQ. FOR A SINGLE POLICY

• Consider a stationary policy µ


• Jµ (i), i = 1, . . . , n, are the unique solution of the
linear system of n equations
$$J_\mu(i) = g\big(i,\mu(i)\big) + \sum_{j=1}^{n} p_{ij}\big(\mu(i)\big)\, J_\mu(j), \qquad \forall\, i = 1,\ldots,n$$

• The equation provides a way to compute Jµ (i),


i = 1, . . . , n, but the computation is substantial for
large n [O(n3 )]
• For large n, value iteration may be preferable.
(Typical case of a large linear system of equations,
where an iterative method may be better than a
direct solution method.)
• For VERY large n, exact methods cannot be
applied, and approximations are needed. (We will
discuss these later.)

122
POLICY ITERATION

• It generates a sequence µ1 , µ2 , . . . of stationary


policies, starting with any stationary policy µ0 .
• At the typical iteration, given µk , we perform
a policy evaluation step, that computes the Jµk (i)
as the solution of the (linear) system of equations
$$J(i) = g\big(i,\mu^k(i)\big) + \sum_{j=1}^{n} p_{ij}\big(\mu^k(i)\big)\, J(j), \qquad i = 1,\ldots,n,$$
in the n unknowns J(1), . . . , J (n). We then per-
form a policy improvement step,
" n
#
k+1
X
µ (i) = arg min g(i, u) + pij (u)Jµk (j) , ∀ i
u∈U (i)
j=1

• Terminate when Jµk (i) = Jµk+1 (i) ∀ i. Then


Jµk+1 = J ∗ and µk+1 is optimal, since
$$J_{\mu^{k+1}}(i) = g\big(i,\mu^{k+1}(i)\big) + \sum_{j=1}^{n} p_{ij}\big(\mu^{k+1}(i)\big) J_{\mu^{k+1}}(j) = \min_{u\in U(i)} \left[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, J_{\mu^{k+1}}(j) \right]$$
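• A sketch of exact policy iteration for a small SSP of the same form (arbitrary problem data): policy evaluation solves the linear system of this slide, and policy improvement takes the minimizing control.

```python
import numpy as np

n = 3
p = {0: np.array([[0.5, 0.2, 0.1], [0.1, 0.6, 0.1], [0.2, 0.2, 0.3]]),
     1: np.array([[0.1, 0.1, 0.2], [0.3, 0.1, 0.2], [0.1, 0.1, 0.1]])}
g = {0: np.array([1.0, 2.0, 4.0]), 1: np.array([3.0, 1.5, 1.0])}

def evaluate(mu):
    """Solve J = g_mu + P_mu J (the linear policy-evaluation system)."""
    P = np.array([p[mu[i]][i] for i in range(n)])
    G = np.array([g[mu[i]][i] for i in range(n)])
    return np.linalg.solve(np.eye(n) - P, G)

def improve(J):
    return np.argmin([g[u] + p[u] @ J for u in (0, 1)], axis=0)

mu = np.zeros(n, dtype=int)               # start from an arbitrary policy
while True:
    J = evaluate(mu)
    new_mu = improve(J)
    if np.array_equal(new_mu, mu):
        break
    mu = new_mu
print("optimal policy:", mu, "  J*:", J)
```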

123
JUSTIFICATION OF POLICY ITERATION

• We can show that Jµk (i) ≥ Jµk+1 (i) for all i, k


• Fix k and consider the sequence generated by
$$J_{N+1}(i) = g\big(i,\mu^{k+1}(i)\big) + \sum_{j=1}^{n} p_{ij}\big(\mu^{k+1}(i)\big) J_N(j)$$

where J0(i) = Jµk(i). We have

$$J_0(i) = g\big(i,\mu^{k}(i)\big) + \sum_{j=1}^{n} p_{ij}\big(\mu^{k}(i)\big) J_0(j) \ge g\big(i,\mu^{k+1}(i)\big) + \sum_{j=1}^{n} p_{ij}\big(\mu^{k+1}(i)\big) J_0(j) = J_1(i)$$

• Using the monotonicity property of DP,


J0 (i) ≥ J1 (i) ≥ · · · ≥ JN (i) ≥ JN +1 (i) ≥ · · · , ∀i
Since JN (i) → Jµk+1 (i) as N → ∞, we obtain pol-
icy improvement, i.e.

Jµk (i) = J0 (i) ≥ Jµk+1 (i) ∀ i, k

• A policy cannot be repeated (there are finitely


many stationary policies), so the algorithm termi-
nates with an optimal policy

124
LINEAR PROGRAMMING

• We claim that J ∗ is the “largest” J that satisfies


the constraint
n
X
J(i) ≤ g(i, u) + pij (u)J(j), (1)
j=1

for all i = 1, . . . , n and u ∈ U (i).


• Proof: If we use value iteration to generate  a
sequence of vectors Jk = Jk (1), . . . , Jk (n) starting
with a J0 that satisfies the constraint, i.e.,
" n
#
X
J0 (i) ≤ min g(i, u) + pij (u)J0 (j) , ∀ i
u∈U (i)
j=1

then, Jk (i) ≤ Jk+1 (i) for all k and i (monotonicity


property of DP) and Jk → J ∗ , so that J0 (i) ≤ J ∗ (i)
for all i.
∗ ∗ ∗

• So J = J (1), . . . , J (n) is the solution of the
Pn
linear program of maximizing i=1 J(i) subject to
the constraint (1).
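• A sketch of this linear program using scipy.optimize.linprog, on the same arbitrary two-state SSP used in the value iteration sketch earlier (linprog minimizes, so the objective is negated, and each constraint is rearranged as J(i) − Σj pij(u)J(j) ≤ g(i, u)). The LP solution should agree with the value iteration fixed point.

```python
import numpy as np
from scipy.optimize import linprog

p = {0: np.array([[0.6, 0.2], [0.3, 0.4]]),
     1: np.array([[0.1, 0.5], [0.2, 0.1]])}
g = {0: np.array([2.0, 1.0]), 1: np.array([3.0, 0.5])}
n = 2

A_ub, b_ub = [], []
for u in (0, 1):
    for i in range(n):
        row = -p[u][i].copy()
        row[i] += 1.0                  # J(i) - sum_j p_ij(u) J(j) <= g(i, u)
        A_ub.append(row)
        b_ub.append(g[u][i])

res = linprog(c=-np.ones(n),           # maximize sum_i J(i)
              A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)
print("J* from LP:", res.x)
```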

125
LINEAR PROGRAMMING (CONTINUED)


Pn
• Obtain J by Max i=1
J(i) subject to

n
X
J(i) ≤ g(i, u)+ pij (u)J(j), i = 1, . . . , n, u ∈ U (i)
j=1

(Figure: two-state illustration. The constraints J(i) ≤ g(i, u) + p_{i1}(u) J(1) + p_{i2}(u) J(2), for i = 1, 2 and u = u1, u2, bound a region of the (J(1), J(2)) plane; its “largest” element is J∗ = (J∗(1), J∗(2)).)

• Drawback: For large n the dimension of this pro­


gram is very large. Furthermore, the number of
constraints is equal to the number of state-control
pairs.

126
DISCOUNTED PROBLEMS

• Assume a discount factor α < 1.


• Conversion to an SSP problem.

• k th stage cost is the same for both problems


• Value iteration converges to J ∗ for all initial J0 :
" n
#
X
Jk+1 (i) = min g(i, u) + α pij (u)Jk (j) , ∀ i
u∈U (i)
j=1

• J ∗ is the unique solution of Bellman’s equation:


" n
#

X
J (i) = min g(i, u) + α pij (u)J ∗ (j) , ∀ i
u∈U (i)
j=1

• Policy iteration terminates with an optimal pol-


icy, and linear programming works.

127
DISCOUNTED PROBLEM EXAMPLE

• A manufacturer at each time:


− Receives an order with prob. p and no order
with prob. 1 − p.
− May process all unfilled orders at cost K >
0, or process no order at all. The cost per
unfilled order at each time is c > 0.
− Maximum number of orders that can remain
unfilled is n.
− Find a processing policy that minimizes the
α-discounted cost per stage.
− State: Number of unfilled orders at the start
of a period (i = 0, 1, . . . , n).
• Bellman’s Eq.:

$$J^*(i) = \min\big[\, K + \alpha(1-p)J^*(0) + \alpha p\, J^*(1),\ \ ci + \alpha(1-p)J^*(i) + \alpha p\, J^*(i+1) \,\big],$$
for the states i = 0, 1, . . . , n − 1, and

J ∗ (n) = K + α(1 − p)J ∗ (0) + αpJ ∗ (1)


for state n.
• Analysis: Argue that J ∗ (i) is mon. increasing in
i, to show that the optimal policy is a threshold
policy. 128
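• A sketch of value iteration on this Bellman equation, with illustrative values of K, c, p, α, n, showing the threshold structure of the optimal policy.

```python
import numpy as np

K, c, p, alpha, n = 5.0, 1.0, 0.6, 0.95, 10

J = np.zeros(n + 1)                      # states i = 0, 1, ..., n
for _ in range(2000):
    fill = K + alpha * ((1 - p) * J[0] + p * J[1])
    Jnew = np.empty_like(J)
    for i in range(n):
        wait = c * i + alpha * ((1 - p) * J[i] + p * J[i + 1])
        Jnew[i] = min(fill, wait)
    Jnew[n] = fill                       # must process when the buffer is full
    J = Jnew

policy = ["fill" if K + alpha*((1-p)*J[0] + p*J[1])
                    <= c*i + alpha*((1-p)*J[i] + p*J[i+1]) else "wait"
          for i in range(n)] + ["fill"]
print("J*:", np.round(J, 2))
print("policy:", policy)                 # a threshold in i should emerge
```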
6.231 DYNAMIC PROGRAMMING

LECTURE 12

LECTURE OUTLINE

• Average cost per stage problems


• Connection with stochastic shortest path prob-
lems
• Bellman’s equation
• Value iteration
• Policy iteration

129
AVERAGE COST PER STAGE PROBLEM

• Assume a stationary system with finite number


of states and controls.
• Minimize over policies π = {µ0 , µ1 , ...}
(N −1 )
1 X 
Jπ (x0 ) = lim E g xk , µk (xk ), wk
N →∞ N wk
k=0,1,... k=0

• Important characteristics (not shared by other


types of infinite horizon problems).
− For any fixed T , the cost incurred up to time
T does not matter (only the state that we are
at time T matters)
− If all states “communicate” the optimal cost
is independent of initial state [if we can go
from i to j in finite expected time, we must
have J ∗ (i) ≤ J ∗ (j)]. So J ∗ (i) ≡ λ∗ for all i.
− Because “communication” issues are so im-
portant, the methodology relies heavily on
Markov chain theory.
− The theory depends a lot on whether the
chains corresponding to policies have a single
or multiple recurrent classes. We will focus
on the simplest version, using SSP theory.
130
CONNECTION WITH SSP

• Assumption: State n is special, in that for all


initial states and all policies, n will be visited in-
finitely often (with probability 1).
• Then we expect that J ∗ (i) ≡ some λ∗
• Divide the sequence of generated states into
cycles marked by successive visits to n.
• Let’s focus on a single cycle: It can be viewed
as a state trajectory of an SSP problem with n as
the termination state.

• Let the cost at i of the SSP be g(i, u) − λ∗


• We will argue (informally) that

Av. Cost Probl. ≡ A Min Cost Cycle Probl. ≡ SSP Probl.

131
CONNECTION WITH SSP (CONTINUED)

• Consider a minimum cycle cost problem: Find


a stationary policy µ that minimizes the expected
cost per transition within a cycle
$$\frac{C_{nn}(\mu)}{N_{nn}(\mu)},$$
where for a fixed µ,

Cnn (µ) : E{cost from n up to the first return to n}

Nnn (µ) : E{time from n up to the first return to n}

• Intuitively, Cnn (µ)/Nnn (µ) = average cost of


µ, and optimal cycle cost = λ∗ , so

Cnn (µ) − Nnn (µ)λ∗ ≥ 0,

with equality if µ is optimal.


• Consider SSP with stage costs g(i, u) − λ∗ . The
cost of µ starting from n is Cnn (µ) − Nnn (µ)λ∗ ,
so the optimal/min cycle µ is also optimal for the
SSP.
• Also: Optimal SSP cost starting from n = 0.
132
BELLMAN’S EQUATION

• Let h∗ (i) the optimal cost of this SSP problem


when starting at the nontermination states i =
1, . . . , n. Then h∗ (1), . . . , h∗ (n) solve uniquely the
corresponding Bellman’s equation
 
$$h^*(i) = \min_{u\in U(i)}\left[ g(i,u) - \lambda^* + \sum_{j=1}^{n-1} p_{ij}(u)\, h^*(j)\right], \qquad \forall\, i$$

• If µ∗ is an optimal stationary policy for the SSP


problem, we have

h∗ (n) = Cnn (µ∗ ) − Nnn (µ∗ )λ∗ = 0

• Combining these equations, we have


 
$$\lambda^* + h^*(i) = \min_{u\in U(i)}\left[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, h^*(j)\right], \qquad \forall\, i$$

h∗ (n) = 0
• If µ∗ (i) attains the min for each i, µ∗ is optimal.
• There is also Bellman Eq. for a single policy µ.
133
MORE ON THE CONNECTION WITH SSP

• Interpretation of h∗ (i) as a relative or differen-


tial cost: It is the minimum of

E{cost to reach n from i for the first time}


− E{cost if the stage cost were λ∗ and not g(i, u)}

• Algorithms: We don’t know λ∗ , so we can’t


solve the average cost problem as an SSP problem.
But similar value and policy iteration algorithms
are possible, and will be given shortly.

• Example: A manufacturer at each time


− Receives an order with prob. p and no order
with prob. 1 − p.
− May process all unfilled orders at cost K >
0, or process no order at all. The cost per
unfilled order at each time is c > 0.
− Maximum number of orders that can remain
unfilled is n.
− Find a processing policy that minimizes the
total expected cost per stage.

134
EXAMPLE (CONTINUED)

• State = number of unfilled orders. State 0 is


the special state for the SSP formulation.
• Bellman’s equation: For states i = 0, 1, . . . , n−1

$$\lambda^* + h^*(i) = \min\big[\, K + (1-p)h^*(0) + p\, h^*(1),\ \ ci + (1-p)h^*(i) + p\, h^*(i+1) \,\big],$$

and for state n

λ∗ + h∗ (n) = K + (1 − p)h∗ (0) + ph∗ (1)

Also h∗ (0) = 0.
• Optimal policy: Process i unfilled orders if

K+(1−p)h∗ (0)+ph∗ (1) ≤ ci+(1−p)h∗ (i)+ph∗ (i+1)

• Intuitively, h∗ (i) is monotonically nondecreas-


ing with i (interpret h∗ (i) as optimal costs-to-go
for the associate SSP problem). So a threshold
policy is optimal: process the orders if their num-
ber exceeds some threshold integer m∗ .

135
VALUE ITERATION

• Natural VI method: Generate optimal k-stage


costs by DP algorithm starting with any J0 :
 
n
X
Jk+1 (i) = min g(i, u) + pij (u)Jk (j) , ∀ i
u∈U (i)
j=1

• Convergence: limk→∞ Jk (i)/k = λ∗ for all i.


• Proof outline: Let Jk∗ be so generated start-
ing from the opt. differential cost, i.e., the initial
condition J0∗ = h∗ . Then, by induction,

Jk∗ (i) = kλ∗ + h∗ (i), ∀i, ∀ k.

On the other hand,




$$\big| J_k(i) - J_k^*(i) \big| \le \max_{j=1,\ldots,n} \big| J_0(j) - h^*(j) \big|, \qquad \forall\, i$$

since Jk (i) and Jk∗ (i) are optimal costs for two
k-stage problems that differ only in the terminal
cost functions, which are J0 and h∗ .

136
RELATIVE VALUE ITERATION

• The VI method just described has two draw-


backs:
− Since typically some components of Jk di-
verge to ∞ or −∞, calculating limk→∞ Jk (i)/k
is numerically cumbersome.
− The method will not compute a correspond-
ing differential cost vector h∗ .
• We can bypass both difficulties by subtracting
a constant from all components of the vector Jk ,
so that the difference, call it hk , remains bounded.
• Relative VI algorithm: Pick any state s, and
iterate according to
 
$$h_{k+1}(i) = \min_{u\in U(i)}\left[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, h_k(j)\right] - \min_{u\in U(s)}\left[ g(s,u) + \sum_{j=1}^{n} p_{sj}(u)\, h_k(j)\right], \qquad \forall\, i$$

• Convergence: We can show hk → h∗ (under an


extra assumption; see Vol. II).
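• A sketch of relative value iteration on a small average-cost example (the transition and cost data are arbitrary but aperiodic, so the iteration settles down; state s = 0 is the reference state whose value is subtracted at every iteration).

```python
import numpy as np

p = {0: np.array([[0.7, 0.3], [0.4, 0.6]]),
     1: np.array([[0.2, 0.8], [0.9, 0.1]])}
g = {0: np.array([1.0, 3.0]), 1: np.array([2.0, 0.5])}

def TJ(h):
    return np.min([g[u] + p[u] @ h for u in (0, 1)], axis=0)

h = np.zeros(2)
for _ in range(500):
    Th = TJ(h)
    lam = Th[0]              # subtract the value at the reference state s = 0
    h = Th - lam
print("average cost lambda* ≈", lam)
print("differential costs h ≈", h)
```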
137
POLICY ITERATION

• At iteration k, we have a stationary µk .


• Policy evaluation: Compute λk and hk (i) of µk ,
using the n + 1 equations hk (n) = 0 and
$$\lambda^k + h^k(i) = g\big(i,\mu^k(i)\big) + \sum_{j=1}^{n} p_{ij}\big(\mu^k(i)\big)\, h^k(j), \qquad \forall\, i$$

• Policy improvement: (For the λk-SSP) Find

$$\mu^{k+1}(i) = \arg\min_{u\in U(i)}\left[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, h^k(j)\right], \qquad \forall\, i$$

• If λk+1 = λk and hk+1 (i) = hk (i) for all i, stop;


otherwise, repeat with µk+1 replacing µk .
• Result: For each k, we either have λk+1 < λk
or we have policy improvement for the λk -SSP:

λk+1 = λk , hk+1 (i) ≤ hk (i), i = 1, . . . , n.

The algorithm terminates with an optimal policy.

138
6.231 DYNAMIC PROGRAMMING

LECTURE 13

LECTURE OUTLINE

• Control of continuous-time Markov chains –


Semi-Markov problems
• Problem formulation – Equivalence to discrete-
time problems
• Discounted problems
• Average cost problems

139
CONTINUOUS-TIME MARKOV CHAINS

• Stationary system with finite number of states


and controls
• State transitions occur at discrete times
• Control applied at these discrete times and stays
constant between transitions
• Time between transitions is random
• Cost accumulates in continuous time (may also
be incurred at the time of transition)
• Example: Admission control in a system with
restricted capacity (e.g., a communication link)
− Customer arrivals: a Poisson process
− Customers entering the system, depart after
exponentially distributed time
− Upon arrival we must decide whether to ad-
mit or to block a customer
− There is a cost for blocking a customer
− For each customer that is in the system, there
is a customer-dependent reward per unit time
− Minimize time-discounted or average cost

140
PROBLEM FORMULATION

• x(t) and u(t): State and control at time t


• tk : Time of kth transition (t0 = 0)
• xk = x(tk ); x(t) = xk for tk ≤ t < tk+1 .
• uk = u(tk ); u(t) = uk for tk ≤ t < tk+1 .
• No transition probabilities; instead transition
distributions (quantify the uncertainty about both
transition time and next state)

Qij (τ, u) = P {tk+1 −tk ≤ τ, xk+1 = j | xk = i, uk = u}


• Two important formulas:
(1) Transition probabilities are specified by

pij (u) = P {xk+1 = j | xk = i, uk = u} = lim Qij (τ, u)


τ →∞

(2) The Cumulative Distribution Function (CDF)


of τ given i, j, u is (assuming pij (u) > 0)

Qij (τ, u)
P {tk+1 −tk ≤ τ | xk = i, xk+1 = j, uk = u} =
pij (u)
Thus, Qij (τ, u) can be viewed as a “scaled CDF”
141
EXPONENTIAL TRANSITION DISTRIBUTIONS

• Important example of transition distributions:



Qij (τ, u) = pij (u) 1 − e−νi (u)τ ,

where pij (u) are transition probabilities, and νi (u)


is called the transition rate at state i.
• Interpretation: If the system is in state i and
control u is applied
− the next state will be j with probability pij (u)
− the time between the transition to state i
and the transition to the next state j is ex-
ponentially distributed with parameter νi (u)
(independently of j):
P {transition time interval > τ | i, u} = e−νi (u)τ
• The exponential distribution is memoryless.
This implies that for a given policy, the system
is a continuous-time Markov chain (the future de-
pends on the past through the current state).
• Without the memoryless property, the Markov
property holds only at the times of transition.

142
COST STRUCTURES

• There is cost g(i, u) per unit time, i.e.

g(i, u)dt = the cost incurred in time dt

• There may be an extra “instantaneous” cost


ĝ(i, u) at the time of a transition (let’s ignore this
for the moment)
• Total discounted cost of π = {µ0 , µ1 , . . .} start-
ing from state i (with discount factor β > 0)
(N −1 Z )
X tk+1
e−βt g xk , µk (xk ) dt x0 = i

lim E

N →∞
k=0 tk

• Average cost per unit time


(N −1 Z )
tk+1
1 X 
lim E g xk , µk (xk ) dt x0 = i
N →∞ E{tN } tk
k=0

• We will see that both problems have equivalent


discrete-time versions.

143
DISCOUNTED CASE - COST CALCULATION

• For a policy π = {µ0 , µ1 , . . .}, write

Jπ (i) = E{1st transition cost}+E{e−βτ Jπ1 (j) | i, µ0 (i)}


R τ
where E{1st transition cost} = E 0 0 (i))dt e−βt g(i, µ
and Jπ1 (j) is the cost-to-go of π1 = {µ1 , µ2 , . . .}
• We calculate the two costs in the RHS. The
E{1st transition cost}, if u is applied at state i, is

G(i, u) = Ej Eτ {1st transition cost | j}
n Z ∞ Z τ 
X dQij (τ, u)
= pij (u) e−βt g(i, u)dt
0 0
pij (u)
j=1
n Z ∞
X 1 − e−βτ
= g(i, u) dQij (τ, u)
0
β
j=1

• Thus the E{1st transition cost} is


n ∞
1 − e−βτ
Z
 X 
G i, µ0 (i) = g i, µ0 (i) dQij τ, µ0 (i)
0
β
j=1

(The summation term can be viewed as a “dis-


counted length of the transition interval t1 − t0 ”.)
144
COST CALCULATION (CONTINUED)

• Also the expected (discounted) cost from the


next state j is

E e −βτ Jπ1 (j) | i, µ0 (i)

= Ej E{e −βτ | i, µ0 (i), j}Jπ1 (j) | i, µ0 (i)
n Z ∞ 
X
−βτ
dQij (τ, µ0 (i))
= pij (µ0 (i)) e Jπ1 (j)
j=1 0 pij (µ0 (i))
n
X 
= mij µ0 (i) Jπ1 (j)
j=1

where mij (u) is given by


Z ∞  Z ∞ 
mij (u) = e−βτ dQij (τ, u) < dQij (τ, u) = pij (u)
0 0
and can be viewed as the “effective discount fac-
tor” [the analog of αpij (u) in discrete-time case].
• So Jπ (i) can be written as
n
 X 
Jπ (i) = G i, µ0 (i) + mij µ0 (i) Jπ1 (j)
j=1

i.e., the (continuous-time discounted) cost of 1st


period, plus the (continuous-time discounted) cost-
to-go from the next state.
145
EQUIVALENCE TO AN SSP

• Similar to the discrete-time case, introduce an


“equivalent” stochastic shortest path problem with
an artificial termination state t
• Under control u, from state i the system moves
to state j with probability mij(u), and to the termination
state t with probability $1 - \sum_{j=1}^{n} m_{ij}(u)$
• Bellman’s equation: For i = 1, . . . , n,
 
n
X
J ∗ (i) = min G(i, u) + mij (u)J ∗ (j)
u∈U (i)
j=1

• Analogs of value iteration, policy iteration, and


linear programming.
• If in addition to the cost per unit time g, there
is an extra (instantaneous) one-stage cost ĝ(i, u),
Bellman’s equation becomes
 
n
X
J ∗ (i) = min ĝ(i, u) + G(i, u) + mij (u)J ∗ (j)
u∈U (i)
j=1

147
MANUFACTURER’S EXAMPLE REVISITED

• A manufacturer receives orders with interarrival


times uniformly distributed in [0, τmax ].
• He may process all unfilled orders at cost K > 0,
or process none. The cost per unit time of an
unfilled order is c. Max number of unfilled orders
is n.
• The nonzero transition distributions are
 
$$Q_{i1}(\tau,\text{Fill}) = Q_{i(i+1)}(\tau,\text{Not Fill}) = \min\left(1,\ \frac{\tau}{\tau_{\max}}\right)$$

• The one-stage expected cost G is

$$G(i,\text{Fill}) = 0, \qquad G(i,\text{Not Fill}) = \gamma\, c\, i,$$

where

$$\gamma = \sum_{j=1}^{n} \int_0^{\infty} \frac{1-e^{-\beta\tau}}{\beta}\, dQ_{ij}(\tau,u) = \int_0^{\tau_{\max}} \frac{1-e^{-\beta\tau}}{\beta\, \tau_{\max}}\, d\tau$$

• There is an “instantaneous” cost

ĝ(i, Fill) = K, ĝ(i, Not Fill) = 0


148
MANUFACTURER’S EXAMPLE CONTINUED

• The “effective discount factors” mij (u) in Bell-


man’s Equation are

mi1 (Fill) = mi(i+1) (Not Fill) = α,

where
$$\alpha = \int_0^{\infty} e^{-\beta\tau}\, dQ_{ij}(\tau,u) = \int_0^{\tau_{\max}} \frac{e^{-\beta\tau}}{\tau_{\max}}\, d\tau = \frac{1-e^{-\beta\tau_{\max}}}{\beta\,\tau_{\max}}$$

• Bellman’s equation has the form


 
J ∗ (i) = min K+αJ ∗ (1), γci+αJ ∗ (i+1) , i = 1, 2, . . .

• As in the discrete-time case, we can conclude


that there exists an optimal threshold i∗ :

fill the orders <==> their number i exceeds i∗
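• A numerical sketch of this example: the effective discount factor α and the factor γ are computed for uniformly distributed interarrival times, and the Bellman equation above is iterated to exhibit the threshold (β, τmax, K, c, and the cap n on unfilled orders are illustrative choices).

```python
import numpy as np

beta, tau_max, K, c, n = 0.5, 2.0, 4.0, 1.0, 10

alpha = (1 - np.exp(-beta * tau_max)) / (beta * tau_max)     # E{e^{-beta tau}}
taus = np.linspace(0, tau_max, 10001)
gamma = np.trapz((1 - np.exp(-beta * taus)) / (beta * tau_max), taus)

J = np.zeros(n + 1)                     # states 1, ..., n (index 0 unused)
for _ in range(2000):
    Jnew = J.copy()
    for i in range(1, n + 1):
        fill = K + alpha * J[1]
        wait = gamma * c * i + alpha * J[min(i + 1, n)]      # buffer capped at n
        Jnew[i] = fill if i == n else min(fill, wait)
    J = Jnew

threshold = next(i for i in range(1, n + 1)
                 if K + alpha * J[1] <= gamma * c * i + alpha * J[min(i + 1, n)])
print("alpha =", round(alpha, 4), " gamma =", round(gamma, 4))
print("fill when the number of unfilled orders reaches i* ≈", threshold)
```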

149
AVERAGE COST
• Minimize $\displaystyle \lim_{N\to\infty} \frac{1}{E\{t_N\}}\, E\left\{ \int_0^{t_N} g\big(x(t),u(t)\big)\, dt \right\}$
assuming there is a special state that is “recurrent
under all policies”
• Total expected cost of a transition
G(i, u) = g(i, u)τ i (u),
where τ i (u): Expected transition time.
• We apply the SSP argument used for the discrete-
time case.
− Divide trajectory into cycles marked by suc-
cessive visits to n.
− The cost at (i, u) is G(i, u) − λ∗ τ i (u), where
λ∗ is the optimal expected cost per unit time.
− Each cycle is viewed as a state trajectory of
a corresponding SSP problem with the ter-
mination state being essentially n.
• So Bellman’s Eq. for the average cost problem:
 
$$h^*(i) = \min_{u\in U(i)}\left[ G(i,u) - \lambda^*\, \bar\tau_i(u) + \sum_{j=1}^{n} p_{ij}(u)\, h^*(j)\right]$$

150
MANUFACTURER EXAMPLE/AVERAGE COST

• The expected transition times are


τmax
τ i (Fill) = τ i (Not Fill) =
2
the expected transition cost is

c i τmax
G(i, Fill) = 0, G(i, Not Fill) =
2
and there is also the “instantaneous” cost

ĝ(i, Fill) = K, ĝ(i, Not Fill) = 0

• Bellman’s equation:
$$h^*(i) = \min\left[\, K - \lambda^* \frac{\tau_{\max}}{2} + h^*(1),\ \ ci\,\frac{\tau_{\max}}{2} - \lambda^* \frac{\tau_{\max}}{2} + h^*(i+1) \,\right]$$

• Again it can be shown that a threshold policy


is optimal.

151
6.231 DYNAMIC PROGRAMMING

LECTURE 14

LECTURE OUTLINE

• We start a ten-lecture sequence on advanced


infinite horizon DP and approximation methods
• We allow infinite state space, so the stochastic
shortest path framework cannot be used any more
• Results are rigorous assuming a finite or count-
able disturbance space
− This includes deterministic problems with
arbitrary state space, and countable state
Markov chains
− Otherwise the mathematics of measure the-
ory make analysis difficult, although the fi-
nal results are essentially the same as for fi-
nite disturbance space
• We use Vol. II of the textbook, starting with
discounted problems (Ch. 1)
• The central mathematical structure is that the
DP mapping is a contraction mapping (instead of
existence of a termination state)
152
DISCOUNTED PROBLEMS/BOUNDED COST

• Stationary system with arbitrary state space

xk+1 = f (xk , uk , wk ), k = 0, 1, . . .

• Cost of a policy π = {µ0 , µ1 , . . .}


(N −1 )
X 
Jπ (x0 ) = lim E αk g xk , µk (xk ), wk
N →∞ wk
k=0,1,... k=0

with α < 1, and for some M , we have

|g(x, u, w)| ≤ M, ∀ (x, u, w)

• We have

$$J_\pi(x_0) \le M + \alpha M + \alpha^2 M + \cdots = \frac{M}{1-\alpha}, \qquad \forall\, x_0$$

• The “tail” of the cost Jπ (x0 ) diminishes to 0


• The limit defining Jπ (x0 ) exists

153
WE ADOPT “SHORTHAND” NOTATION

• Compact pointwise notation for functions:


− If for two functions J and J ′ we have J(x) =
J ′ (x) for all x, we write J = J ′
− If for two functions J and J ′ we have J(x) ≤
J ′ (x) for all x, we write J ≤ J ′
− For a sequence {Jk } with Jk (x) → J(x) for
all x, we write Jk → J; also J ∗ = minπ Jπ
• Shorthand notation for DP mappings (operate
on functions of state to produce other functions)
 
(T J)(x) = min E g(x, u, w) + αJ f (x, u, w) , ∀x
u∈U (x) w

T J is the optimal cost function for the one-stage


problem with stage cost g and terminal cost αJ.
• For any stationary policy µ
  
(Tµ J)(x) = E g x, µ(x), w + αJ f (x, µ(x), w) , ∀x
w

• For finite-state problems:

Tµ J = gµ + αPµ J, T J = min Tµ J
µ

154
“SHORTHAND” COMPOSITION NOTATION

• Composition notation: T 2 J is defined by (T 2 J)(x) =


(T (T J))(x) for all x (similar for T k J)
• For any policy π = {µ0 , µ1 , . . .} and function J:
− Tµ0 J is the cost function of π for the one-
stage problem with terminal cost function
αJ
− Tµ0 Tµ1 J (i.e., Tµ0 applied to Tµ1 J) is the
cost function of π for the two-stage problem
with terminal cost α2 J
− Tµ0 Tµ1 · · · TµN −1 J is the cost function of π
for the N -stage problem with terminal cost
αN J
• For any function J:
− T J is the optimal cost function of the one-
stage problem with terminal cost function
αJ
− T 2 J (i.e., T applied to T J) is the optimal
cost function of the two-stage problem with
terminal cost α2 J
− T N J is the optimal cost function of the N -
stage problem with terminal cost αN J
155
“SHORTHAND” THEORY – A SUMMARY

• Cost function expressions [with J0 (x) ≡ 0]

Jπ (x) = lim (Tµ0 Tµ1 · · · Tµk J0 )(x), Jµ (x) = lim (Tµk J0 )(x)
k→∞ k→∞

• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ
• Optimality condition:

µ: optimal <==> Tµ J ∗ = T J ∗

• Value iteration: For any (bounded) J and all


x,
J ∗ (x) = lim (T k J)(x)
k→∞

• Policy iteration: Given µk :


− Policy evaluation: Find Jµk by solving

Jµk = Tµk Jµk

− Policy improvement: Find µk+1 such that

Tµk+1 Jµk = T Jµk

156
SOME KEY PROPERTIES

• Monotonicity property: For any functions J and


J ′ such that J(x) ≤ J ′ (x) for all x, and any µ

(T J)(x) ≤ (T J ′ )(x), ∀ x,

(Tµ J)(x) ≤ (Tµ J ′ )(x), ∀ x.


Also

J ≤ TJ ⇒ T k J ≤ T k+1 J, ∀k

• Constant Shift property: For any J, any scalar


r, and any µ

T (J + re) (x) = (T J)(x) + αr, ∀ x,

Tµ (J + re) (x) = (Tµ J)(x) + αr, ∀ x,
where e is the unit function [e(x) ≡ 1] (holds for
most DP models).
• A third important property that holds for some
(but not all) DP models is that T and Tµ are con-
traction mappings (more on this later).
157
CONVERGENCE OF VALUE ITERATION

• If J0 ≡ 0,

J ∗ (x) = lim (T N J0 )(x), for all x


N →∞

Proof: For any initial state x0 , and policy π =


{µ0 , µ1 , . . .},


( )
X 
Jπ (x0 ) = E αk g xk , µk (xk ), wk
k=0
(N −1 )
X 
=E αk g xk , µk (xk ), wk
k=0

( )
X 
+E αk g xk , µk (xk ), wk
k=N

from which

αN M αN M
Jπ (x0 )− ≤ (Tµ0 · · · TµN −1 J0 )(x0 ) ≤ Jπ (x0 )+ ,
1−α 1−α

where M ≥ |g(x, u, w)|. Take the min over π of


both sides. Q.E.D.
158
BELLMAN’S EQUATION

• The optimal cost function J ∗ satisfies Bellman’s


Eq., i.e. J ∗ = T J ∗ .
Proof: For all x and N ,

$$J^*(x) - \frac{\alpha^N M}{1-\alpha} \le (T^N J_0)(x) \le J^*(x) + \frac{\alpha^N M}{1-\alpha},$$

where J0 (x) ≡ 0 and M ≥ |g(x, u, w)|.


• Apply T to this relation and use Monotonicity
and Constant Shift,

$$(TJ^*)(x) - \frac{\alpha^{N+1} M}{1-\alpha} \le (T^{N+1} J_0)(x) \le (TJ^*)(x) + \frac{\alpha^{N+1} M}{1-\alpha}$$

• Take limit as N → ∞ and use the fact

lim (T N +1 J0 )(x) = J ∗ (x)


N →∞

to obtain J ∗ = T J ∗ . Q.E.D.

159
THE CONTRACTION PROPERTY

• Contraction property: For any bounded func-


tions J and J ′ , and any µ,

′ ′
max (T J)(x) − (T J )(x) ≤ α max J(x) − J (x) ,
x x


max (Tµ J)(x) −(Tµ J ′ )(x) ≤ α max J(x) − J ′ (x) .
x x

Proof: Denote c = maxx∈S J(x) − J ′ (x) . Then

J(x) − c ≤ J ′ (x) ≤ J(x) + c, ∀x

Apply T to both sides, and use the Monotonicity


and Constant Shift properties:

(T J)(x) − αc ≤ (T J ′ )(x) ≤ (T J)(x) + αc, ∀x

Hence

$$\big| (TJ)(x) - (TJ')(x) \big| \le \alpha c, \qquad \forall\, x.$$

Similar for Tµ . Q.E.D.

160
IMPLICATIONS OF CONTRACTION PROPERTY

• We can strengthen our earlier result:


• Bellman’s equation J = T J has a unique solu-
tion, namely J ∗ , and for any bounded J, we have

lim (T k J)(x) = J ∗ (x), ∀x


k→∞

Proof: Use

$$\max_x \big| (T^k J)(x) - J^*(x) \big| = \max_x \big| (T^k J)(x) - (T^k J^*)(x) \big| \le \alpha^k \max_x \big| J(x) - J^*(x) \big|$$

• Special Case: For each stationary µ, Jµ is the


unique solution of J = Tµ J and

lim (Tµk J)(x) = Jµ (x), ∀ x,


k→∞

for any bounded J.


• Convergence rate: For all k,

$$\max_x \big| (T^k J)(x) - J^*(x) \big| \le \alpha^k \max_x \big| J(x) - J^*(x) \big|$$

161
NEC. AND SUFFICIENT OPT. CONDITION

• A stationary policy µ is optimal if and only if


µ(x) attains the minimum in Bellman’s equation
for each x; i.e.,

T J ∗ = Tµ J ∗ .

Proof: If T J ∗ = Tµ J ∗ , then using Bellman’s equa-


tion (J ∗ = T J ∗ ), we have

J ∗ = Tµ J ∗ ,

so by uniqueness of the fixed point of Tµ , we obtain


J ∗ = Jµ ; i.e., µ is optimal.
• Conversely, if the stationary policy µ is optimal,
we have J ∗ = Jµ , so

J ∗ = Tµ J ∗ .

Combining this with Bellman’s equation (J ∗ =


T J ∗ ), we obtain T J ∗ = Tµ J ∗ . Q.E.D.

162
COMPUTATIONAL METHODS - AN OVERVIEW

• Typically must work with a finite-state system.


Possibly an approximation of the original system.
• Value iteration and variants
− Gauss-Seidel and asynchronous versions
• Policy iteration and variants
− Combination with (possibly asynchronous)
value iteration
− “Optimistic” policy iteration
• Linear programming
$$\text{maximize } \sum_{i=1}^{n} J(i) \quad \text{subject to } J(i) \le g(i,u) + \alpha \sum_{j=1}^{n} p_{ij}(u)\, J(j), \ \ \forall\, (i,u)$$

• Versions with subspace approximation: Use in


place of J(i) a low-dim. basis function representa-
tion, with state features φm (i), m = 1, . . . , s
$$\tilde J(i,r) = \sum_{m=1}^{s} r_m \phi_m(i)$$
and modify the basic methods appropriately.
163
USING Q-FACTORS I

• Let the states be i = 1, . . . , n. We can write


Bellman’s equation as

J ∗ (i) = min Q∗ (i, u) i = 1, . . . , n,


u∈U (i)

where
n
X 
Q∗ (i, u) = pij (u) g(i, u, j) + αJ ∗ (j)
j=1

for all (i, u)


• Q∗ (i, u) is called the optimal Q-factor of (i, u)
• Q-factors have optimal cost interpretation in
an “augmented” problem whose states are i and
(i, u), u ∈ U (i) - the optimal cost vector is (J ∗ , Q∗ )
• The Bellman Eq. is J ∗ = T J ∗ , Q∗ = F Q∗ where
n
X  
(F Q∗ )(i, u) = pij (u) g(i, u, j) + α min Q∗ (j, v)
v∈U (j)
j=1

• It has a unique solution.


164
USING Q-FACTORS II

• We can equivalently write the VI method as

Jk+1 (i) = min Qk+1 (i, u), i = 1, . . . , n,


u∈U (i)

where Qk+1 is generated for all i and u ∈ U (i) by


n
X  
Qk+1 (i, u) = pij (u) g(i, u, j) + α min Qk (j, v)
v∈U (j)
j=1

or Jk+1 = T Jk , Qk+1 = F Qk .
• Equal amount of computation ... just more
storage.
• Having optimal Q-factors is convenient when
implementing an optimal policy on-line by

$$\mu^*(i) = \arg\min_{u\in U(i)} Q^*(i,u)$$

• Once Q∗ (i, u) are known, the model [g and


pij (u)] is not needed. Model-free operation.
• Stochastic/sampling methods can be used to
calculate (approximations of) Q∗ (i, u) [not J ∗ (i)]
with a simulator of the system.
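• A sketch of value iteration carried out directly on Q-factors for a small discounted example (arbitrary data; for brevity the per-stage cost is taken as g(i, u), i.e., already averaged over the next state j).

```python
import numpy as np

alpha = 0.9
p = {0: np.array([[0.7, 0.3], [0.4, 0.6]]),
     1: np.array([[0.2, 0.8], [0.9, 0.1]])}
g = {0: np.array([1.0, 3.0]), 1: np.array([2.0, 0.5])}   # g(i, u)

Q = np.zeros((2, 2))                        # Q[i, u]
for _ in range(500):
    J = Q.min(axis=1)                       # J(j) = min_v Q(j, v)
    Q = np.array([[g[u][i] + alpha * p[u][i] @ J for u in (0, 1)]
                  for i in range(2)])
print("Q* =", Q)
print("J* =", Q.min(axis=1), "  mu* =", Q.argmin(axis=1))
```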
165
6.231 DYNAMIC PROGRAMMING

LECTURE 15

LECTURE OUTLINE

• Review of basic theory of discounted problems


• Monotonicity and contraction properties
• Contraction mappings in DP
• Discounted problems: Countable state space
with unbounded costs
• Generalized discounted DP
• An introduction to abstract DP

166
DISCOUNTED PROBLEMS/BOUNDED COST

• Stationary system with arbitrary state space

xk+1 = f (xk , uk , wk ), k = 0, 1, . . .

• Cost of a policy π = {µ0 , µ1 , . . .}


(N −1 )
X 
Jπ (x0 ) = lim E αk g xk , µk (xk ), wk
N →∞ wk
k=0,1,... k=0

with α < 1, and for some M , we have |g(x, u, w)| ≤


M for all (x, u, w)
• Shorthand notation for DP mappings (operate
on functions of state to produce other functions)
 
(T J)(x) = min E g(x, u, w) + αJ f (x, u, w) , ∀x
u∈U (x) w

T J is the optimal cost function for the one-stage


problem with stage cost g and terminal cost αJ.
• For any stationary policy µ
  
(Tµ J)(x) = E g x, µ(x), w + αJ f (x, µ(x), w) , ∀x
w

167
“SHORTHAND” THEORY – A SUMMARY

• Cost function expressions [with J0 (x) ≡ 0]

Jπ (x) = lim (Tµ0 Tµ1 · · · Tµk J0 )(x), Jµ (x) = lim (Tµk J0 )(x)
k→∞ k→∞

• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ
• Optimality condition:

µ: optimal <==> Tµ J ∗ = T J ∗

• Value iteration: For any (bounded) J and all


x:
J ∗ (x) = lim (T k J)(x)
k→∞

• Policy iteration: Given µk ,


− Policy evaluation: Find Jµk by solving

Jµk = Tµk Jµk

− Policy improvement: Find µk+1 such that

Tµk+1 Jµk = T Jµk

168
MAJOR PROPERTIES

• Monotonicity property: For any functions J and


J ′ on the state space X such that J(x) ≤ J ′ (x)
for all x ∈ X, and any µ

(T J)(x) ≤ (T J ′ )(x), ∀ x ∈ X,

(Tµ J)(x) ≤ (Tµ J ′ )(x), ∀ x ∈ X.


• Contraction property: For any bounded func-
tions J and J ′ , and any µ,

′ ′
max (T J)(x) − (T J )(x) ≤ α max J(x) − J (x) ,
x x

′ ′
max (Tµ J)(x)−(Tµ J )(x) ≤ α max J(x) − J (x) .
x x

• Shorthand writing of the contraction property

kT J−T J ′ k ≤ αkJ−J ′ k, kTµ J−Tµ J ′ k ≤ αkJ−J ′ k,

where for any bounded function J, we denote by


kJk the sup-norm

kJk = max J(x) .

x∈X
169
CONTRACTION MAPPINGS

• Given a real vector space Y with a norm k · k


(see text for definitions).
• A function F : Y 7→ Y is said to be a contraction
mapping if for some ρ ∈ (0, 1), we have

kF y − F zk ≤ ρky − zk, for all y, z ∈ Y.


ρ is called the modulus of contraction of F .
• Linear case, Y = ℜn : F y = Ay + b is a con-
traction (for some norm k · k) if and only if all
eigenvalues of A are strictly within the unit circle.
• For m > 1, we say that F is an m-stage con-
traction if F m is a contraction.
• Important example: Let X be a set (e.g., state
space in DP), v : X 7→ ℜ be a positive-valued
function. Let B(X) be the set of all functions
J : X 7→ ℜ such that J(s)/v(s) is bounded over s.
• The weighted sup-norm on B(X):
$$\|J\| = \max_{s\in X} \frac{|J(s)|}{v(s)}.$$

• Important special case: The discounted prob-


lem mappings T and Tµ [for v(s) ≡ 1, ρ = α].

170
A DP-LIKE CONTRACTION MAPPING

• Let X = {1, 2, . . .}, and let F : B(X) 7→ B(X)


be a linear mapping of the form
X
(F J)(i) = b(i) + a(i, j) J(j), ∀i
j ∈X

where b(i) and a(i, j) are some scalars. Then F is


a contraction with modulus ρ if
$$\frac{\sum_{j\in X} |a(i,j)|\, v(j)}{v(i)} \le \rho, \qquad \forall\, i$$
[Think of the special case where a(i, j) are the
transition probs. of a policy].
• Let F : B(X) 7→ B(X) be the mapping

(F J)(i) = min(Fµ J)(i), ∀i


µ∈M

where M is parameter set, and for each µ ∈ M , Fµ


is a contraction from B(X) to B(X) with modulus
ρ. Then F is a contraction with modulus ρ.

171
CONTRACTION MAPPING FIXED-POINT TH.

• Contraction Mapping Fixed-Point Theorem: If


F : B(X) 7→ B(X) is a contraction with modulus
ρ ∈ (0, 1), then there exists a unique J ∗ ∈ B(X)
such that
J ∗ = F J ∗.
Furthermore, if J is any function in B(X), then
{F k J} converges to J ∗ and we have

kF k J − J ∗ k ≤ ρk kJ − J ∗ k, k = 1, 2, . . . .

• Similar result if F is an m-stage contraction


mapping.
• This is a special case of a general result for
contraction mappings F : Y 7→ Y over normed
vector spaces Y that are complete: every sequence
{yk } that is Cauchy (satisfies kym − yn k → 0 as
m, n → ∞) converges.
• The space B(X) is complete [see the text (Sec-
tion 1.5) for a proof].

172
GENERAL FORMS OF DISCOUNTED DP

• Monotonicity assumption: If J, J ′ ∈ R(X) and


J ≤ J ′ , then

H(x, u, J) ≤ H(x, u, J ′ ), ∀ x ∈ X, u ∈ U (x)

• Contraction assumption:
− For every J ∈ B(X), the functions Tµ J and
T J belong to B(X).
− For some α ∈ (0, 1) and all J, J ′ ∈ B(X), H
satisfies

H(x, u, J)−H(x, u, J ′ ) ≤ α max J(y)−J ′ (y )
y ∈X

for all x ∈ X and u ∈ U (x).


• We can show all the standard analytical and
computational results of discounted DP based on
these two assumptions (with identical proofs!)
• With just the monotonicity assumption (as in
shortest path problem) we can still show various
forms of the basic results under appropriate as-
sumptions (like in the SSP problem)

173
EXAMPLES

• Discounted problems
 
H(x, u, J) = E g(x, u, w) + αJ f (x, u, w)

• Discounted Semi-Markov Problems


n
X
H(x, u, J) = G(x, u) + mxy (u)J(y)
y=1

where mxy are “discounted” transition probabili-


ties, defined by the transition distributions
• Deterministic Shortest Path Problems

$$H(x,u,J) = \begin{cases} a_{xu} + J(u) & \text{if } u \neq t, \\ a_{xt} & \text{if } u = t \end{cases}$$

where t is the destination


• Minimax Problems
 
H(x, u, J) = max g(x, u, w)+αJ f (x, u, w)
w∈W (x,u)

174
RESULTS USING CONTRACTION

• The mappings Tµ and T are sup-norm contrac-


tion mappings with modulus α over B(X), and
have unique fixed points in B(X), denoted Jµ and
J ∗ , respectively (cf. Bellman’s equation). Proof :
From contraction assumption and fixed point Th.
• For any J ∈ B(X) and µ ∈ M,

lim Tµk J = Jµ , lim T k J = J ∗


k→∞ k→∞

(cf. convergence of value iteration). Proof : From


contraction property of Tµ and T .
• We have Tµ J ∗ = T J ∗ if and only if Jµ = J ∗
(cf. optimality condition). Proof : Tµ J ∗ = T J ∗ ,
then Tµ J ∗ = J ∗ , implying J ∗ = Jµ . Conversely,
if Jµ = J ∗ , then Tµ J ∗ = Tµ Jµ = Jµ = J ∗ = T J ∗ .
• Useful bound for Jµ : For all J ∈ B(X), µ ∈ M
$$\|J_\mu - J\| \le \frac{\|T_\mu J - J\|}{1-\alpha}$$

Proof: Take the limit as k → ∞ in the relation

$$\|T_\mu^k J - J\| \le \sum_{\ell=1}^{k} \|T_\mu^\ell J - T_\mu^{\ell-1} J\| \le \|T_\mu J - J\| \sum_{\ell=1}^{k} \alpha^{\ell-1}$$

175
RESULTS USING MON. AND CONTRACTION I

• Existence of a nearly optimal policy: For every


ǫ > 0, there exists µǫ ∈ M such that
J ∗ (x) ≤ Jµǫ (x) ≤ J ∗ (x) + ǫv(x), ∀x∈X
Proof: For all µ ∈ M, we have J ∗ = T J ∗ ≤ Tµ J ∗ .
By monotonicity, J ∗ ≤ Tµk+1 J ∗ ≤ Tµk J ∗ for all k.
Taking limit as k → ∞, we obtain J ∗ ≤ Jµ .
Also, choose µǫ ∈ M such that for all x ∈ X,

$$\|T_{\mu_\epsilon} J^* - J^*\| = \max_{x\in X} \frac{(T_{\mu_\epsilon} J^*)(x) - (T J^*)(x)}{v(x)} \le \epsilon(1-\alpha)$$

From the earlier error bound, we have

$$\|J_\mu - J^*\| \le \frac{\|T_\mu J^* - J^*\|}{1-\alpha}, \qquad \forall\, \mu\in M$$

Combining the preceding two relations,

$$\frac{J_{\mu_\epsilon}(x) - J^*(x)}{v(x)} \le \frac{\epsilon(1-\alpha)}{1-\alpha} = \epsilon, \qquad \forall\, x\in X$$
• Optimality of J ∗ over stationary policies:

J ∗ (x) = min Jµ (x), ∀x∈X


µ∈M

Proof: Take ǫ ↓ 0 in the preceding result.


176
RESULTS USING MON. AND CONTRACTION II

• Nonstationary policies: Consider the set Π of


all sequences π = {µ0 , µ1 , . . .} with µk ∈ M for
all k, and define for any J ∈ B(X)

Jπ (x) = lim sup(Tµ0 Tµ1 · · · Tµk J)(x), ∀ x ∈ X,


k→∞

(the choice of J does not matter because of the


contraction property).
• Optimality of J ∗ over nonstationary policies:

J ∗ (x) = min Jπ (x), ∀x∈X


π∈Π

Proof: Use our earlier existence result to show


that for any ǫ > 0, there is µǫ such that kJµǫ −
J ∗ k ≤ ǫ(1 − α). We have

J ∗ (x) = min Jµ (x) ≥ min Jπ (x)


µ∈M π∈Π

Also
$$T^k J \le T_{\mu_0}\cdots T_{\mu_{k-1}} J$$

Take the limit as k → ∞ to obtain J∗ ≤ Jπ for all π ∈ Π.

177
6.231 DYNAMIC PROGRAMMING

LECTURE 16

LECTURE OUTLINE

• Review of computational theory of discounted


problems
• Value iteration (VI), policy iteration (PI)
• Optimistic PI
• Computational methods for generalized dis-
counted DP
• Asynchronous algorithms

178
DISCOUNTED PROBLEMS

• Stationary system with arbitrary state space

xk+1 = f (xk , uk , wk ), k = 0, 1, . . .

• Bounded g. Cost of a policy π = {µ0 , µ1 , . . .}


(N −1 )
X 
Jπ (x0 ) = lim E αk g xk , µk (xk ), wk
N →∞ wk
k=0,1,... k=0

• Shorthand notation for DP mappings (n-state


Markov chain case)
 
(T J)(x) = min E g(x, u, w)+αJ f (x, u, w) , ∀ x
u∈U (x)

T J is the optimal cost function for the one-stage


problem with stage cost g and terminal cost αJ.
• For any stationary policy µ
 
(Tµ J)(x) = E g(x, µ(x), w)+αJ f (x, µ(x), w) , ∀ x

Note: Tµ is linear [in short Tµ J = Pµ (gµ + αJ )].

179
“SHORTHAND” THEORY – A SUMMARY

• Cost function expressions (with J0 ≡ 0)

Jπ = lim Tµ0 Tµ1 · · · Tµk J0 , Jµ = lim Tµk J0


k→∞ k→∞

• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ
• Optimality condition:

µ: optimal <==> Tµ J ∗ = T J ∗

• Contraction: kT J1 − T J2 k ≤ αkJ1 − J2 k
• Value iteration: For any (bounded) J

J ∗ = lim T k J
k→∞

• Policy iteration: Given µk ,


− Policy evaluation: Find Jµk by solving

Jµk = Tµk Jµk

− Policy improvement: Find µk+1 such that

Tµk+1 Jµk = T Jµk

180
INTERPRETATION OF VI AND PI

(Figures: one-dimensional geometric interpretation. Value iteration generates J0, T J0, T 2 J0, . . . , converging to the fixed point J∗ = T J∗ where the graph of T meets the 45-degree line. Policy iteration alternates policy improvement with exact policy evaluation, moving from J0 to Jµ1 = Tµ1 Jµ1 and on toward J∗.)
181
VI AND PI METHODS FOR Q-LEARNING

• We can write Bellman’s equation as

J ∗ (i) = min Q∗ (i, u) i = 1, . . . , n,


u∈U (i)

where Q∗ is the vector of optimal Q-factors


n
X 
Q∗ (i, u) = pij (u) g(i, u, j) + αJ ∗ (j)
j=1

• VI and PI for Q-factors are mathematically


equivalent to VI and PI for costs.
• They require equal amount of computation ...
they just need more storage.
• For example, we can write the VI method as

Jk+1 (i) = min Qk+1 (i, u), i = 1, . . . , n,


u∈U (i)

where Qk+1 is generated for all i and u ∈ U (i) by


n
X  
Qk+1 (i, u) = pij (u) g(i, u, j) + α min Qk (j, v)
v∈U (j)
j=1

182
APPROXIMATE PI

• Suppose that the policy evaluation is approxi-


mate, according to,

max |Jk (x) − Jµk (x)| ≤ δ, k = 0, 1, . . .


x

and policy improvement is approximate, according


to,

max |(Tµk+1 Jk )(x)−(T Jk )(x)| ≤ ǫ, k = 0, 1, . . .


x

where δ and ǫ are some positive scalars.


• Error Bound: The sequence {µk } generated by
approximate policy iteration satisfies

$$\limsup_{k\to\infty}\ \max_{x\in S} \big( J_{\mu^k}(x) - J^*(x) \big) \le \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}$$

• Typical practical behavior: The method makes


steady progress up to a point and then the iterates
Jµk oscillate within a neighborhood of J ∗ .

183
OPTIMISTIC PI

• This is PI, where policy evaluation is carried


out by a finite number of VI
• Shorthand definition: For some integers mk

$$T_{\mu^k} J_k = T J_k, \qquad J_{k+1} = T_{\mu^k}^{m_k} J_k, \qquad k = 0,1,\ldots$$

− If mk ≡ 1 it becomes VI
− If mk = ∞ it becomes PI
− For intermediate values of mk , it is generally
more efficient than either VI or PI
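• A sketch of optimistic PI on a small discounted example (arbitrary data, with mk = 5 evaluation sweeps per policy): each iteration performs one policy improvement followed by mk value-iteration sweeps for the improved policy.

```python
import numpy as np

alpha, mk = 0.9, 5
p = {0: np.array([[0.7, 0.3], [0.4, 0.6]]),
     1: np.array([[0.2, 0.8], [0.9, 0.1]])}
g = {0: np.array([1.0, 3.0]), 1: np.array([2.0, 0.5])}

def T(J):
    return np.min([g[u] + alpha * p[u] @ J for u in (0, 1)], axis=0)

def greedy(J):
    return np.argmin([g[u] + alpha * p[u] @ J for u in (0, 1)], axis=0)

def Tmu(J, mu):
    return np.array([g[mu[i]][i] + alpha * p[mu[i]][i] @ J for i in range(2)])

J = np.zeros(2)
for k in range(50):
    mu = greedy(J)            # T_mu J = T J   (policy improvement)
    for _ in range(mk):       # J <- T_mu^{m_k} J   (partial policy evaluation)
        J = Tmu(J, mu)
print("J ≈", J, "  (T J =", T(J), ")")
```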

(Figure: geometric interpretation of optimistic PI. Starting at J0, the policy µ0 satisfies Tµ0 J0 = T J0 = minµ Tµ J0; the approximate evaluation J1 = Tµ0^2 J0 replaces the exact evaluation Jµ0 = Tµ0 Jµ0 of ordinary PI, and the process repeats, converging to J∗ = T J∗.)
184
EXTENSIONS TO GENERALIZED DISC. DP

• All the preceding VI and PI methods extend to


generalized/abstract discounted DP.
• Summary: For a mapping H : X ×U ×R(X) 7→
ℜ, consider

(T J)(x) = min H(x, u, J), ∀ x ∈ X.


u∈U (x)


(Tµ J)(x) = H x, µ(x), J , ∀ x ∈ X.

• We want to find J ∗ such that

J ∗ (x) = min H(x, u, J ∗ ), ∀x∈X


u∈U (x)

and a µ∗ such that Tµ∗ J ∗ = T J ∗ .


• Discounted, Discounted Semi-Markov, Minimax
 
H(x, u, J) = E g(x, u, w) + αJ f (x, u, w)
n
X
H(x, u, J) = G(x, u) + mxy (u)J(y)
y=1
 
H(x, u, J) = max g(x, u, w)+αJ f (x, u, w)
w∈W (x,u)

185
ASSUMPTIONS AND RESULTS

• Monotonicity assumption: If J, J ′ ∈ R(X) and


J ≤ J ′ , then
H(x, u, J) ≤ H(x, u, J ′ ), ∀ x ∈ X, u ∈ U (x)
• Contraction assumption:
− For every J ∈ B(X), the functions Tµ J and
T J belong to B(X).
− For some α ∈ (0, 1) and all J, J ′ ∈ B(X), H
satisfies

H(x, u, J)−H(x, u, J ′ ) ≤ α max J(y)−J ′ (y)
y∈X

for all x ∈ X and u ∈ U (x).
• Standard algorithmic results extend:
− Generalized VI converges to J ∗ , the unique
fixed point of T
− Generalized PI and optimistic PI generate
{µk } such that

lim kJµk − J ∗ k = 0, lim kJk −J ∗ k = 0


k→∞ k→∞
• Analytical Approach: Start with a problem,
match it with an H, invoke the general results.
186
ASYNCHRONOUS ALGORITHMS

• Motivation for asynchronous algorithms


− Faster convergence
− Parallel and distributed computation
− Simulation-based implementations
• General framework: Partition X into disjoint
nonempty subsets X1 , . . . , Xm , and use separate
processor ℓ updating J(x) for x ∈ Xℓ .
• Let J be partitioned as J = (J1 , . . . , Jm ), where
Jℓ is the restriction of J on the set Xℓ .
• Synchronous algorithm: Processor ℓ updates J
for the states x ∈ Xℓ at all times t,

$$J_\ell^{t+1}(x) = T\big(J_1^t,\ldots,J_m^t\big)(x), \qquad x\in X_\ell,\ \ \ell = 1,\ldots,m$$

• Asynchronous algorithm: Processor ℓ updates J for the states x ∈ Xℓ only at a subset of times Rℓ,

$$J_\ell^{t+1}(x) = \begin{cases} T\big(J_1^{\tau_{\ell 1}(t)},\ldots,J_m^{\tau_{\ell m}(t)}\big)(x) & \text{if } t\in \mathcal{R}_\ell, \\ J_\ell^t(x) & \text{if } t\notin \mathcal{R}_\ell \end{cases}$$

where t − τℓj (t) are communication “delays” 187


ONE-STATE-AT-A-TIME ITERATIONS

• Important special case: Assume n “states”, a


separate processor for each state, and no delays
• Generate a sequence of states {x0 , x1 , . . .}, gen-
erated in some way, possibly by simulation (each
state is generated infinitely often)
• Asynchronous VI: Change any one component
of J t at time t, the one that corresponds to xt :
 
$$J^{t+1}(\ell) = \begin{cases} T\big(J^t(1),\ldots,J^t(n)\big)(\ell) & \text{if } \ell = x^t, \\ J^t(\ell) & \text{if } \ell \neq x^t \end{cases}$$
• The special case where

{x0 , x1 , . . .} = {1, . . . , n, 1, . . . , n, 1, . . .}

is the Gauss-Seidel method


• More generally, the components used at time t
are delayed by t − τℓj (t)
• Flexible in terms of timing and “location” of
the iterations
• We can show that J t → J ∗ under assumptions
typically satisfied in DP
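• A sketch of asynchronous (one-state-at-a-time) VI on a small discounted example: at each time a single randomly chosen state is updated, each state is visited infinitely often, and the iterates still converge to J∗ (arbitrary data; no communication delays).

```python
import numpy as np, random

alpha = 0.9
p = {0: np.array([[0.7, 0.3], [0.4, 0.6]]),
     1: np.array([[0.2, 0.8], [0.9, 0.1]])}
g = {0: np.array([1.0, 3.0]), 1: np.array([2.0, 0.5])}

def backup(J, i):
    return min(g[u][i] + alpha * p[u][i] @ J for u in (0, 1))

random.seed(0)
J = np.zeros(2)
for t in range(2000):
    i = random.randrange(2)            # pick one state, update only J(i)
    J[i] = backup(J, i)
print("asynchronous VI limit ≈", J)
```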
188
ASYNCHRONOUS CONV. THEOREM I

• Assume that for all ℓ, j = 1, . . . , m, the set of


times Rℓ is infinite and limt→∞ τℓj (t) = ∞
• Proposition: Let T have a unique fixed point J ∗ ,
and assume
 that
there is a sequence of nonempty
subsets S(k) ⊂ R(X) with S(k + 1) ⊂ S(k) for
all k, and with the following properties:
(1) Synchronous Convergence Condition: Ev-
ery sequence {J k } with J k ∈ S(k) for each
k, converges pointwise to J ∗ . Moreover, we
have

T J ∈ S(k+1), ∀ J ∈ S(k), k = 0, 1, . . . .

(2) Box Condition: For all k, S(k) is a Cartesian


product of the form

S(k) = S1 (k) × · · · × Sm (k),

where Sℓ (k) is a set of real-valued functions


on Xℓ , ℓ = 1, . . . , m.
Then for every J ∈ S(0), the sequence {J t } gen-
erated by the asynchronous algorithm converges
pointwise to J ∗ .
189
ASYNCHRONOUS CONV. THEOREM II

• Interpretation of assumptions:

[Figure: nested box sets S(0) ⊃ S(k) ⊃ S(k + 1) ∋ J ∗ in the space of J = (J1 , J2 ); T J maps S(k) into S(k + 1).]

A synchronous iteration from any J in S(k) moves


into S(k + 1) (component-by-component)

• Convergence mechanism:

[Figure: asynchronous J1 -iterations and J2 -iterations within the nested sets S(0) ⊃ S(k) ⊃ S(k + 1) ∋ J ∗ .]
Key: “Independent” component-wise improvement.
An asynchronous component iteration from any J
in S(k) moves into the corresponding component
portion of S(k + 1) permanently!

190
PRINCIPAL DP APPLICATIONS

• The assumptions of the asynchronous conver-


gence theorem are satisfied in two principal cases:
− When T is a (weighted) sup-norm contrac-
tion.
− When T is monotone and the Bellman equa-
tion J = T J has a unique solution.
• The theorem can be applied also to convergence
of asynchronous optimistic PI for:
− Discounted problems (Section 2.6.2 of the
text).
− SSP problems (Section 3.5 of the text).
• There are variants of the theorem that can be
applied in the presence of special structure.
• Asynchronous convergence ideas also underlie
stochastic VI algorithms like Q-learning.

191
6.231 DYNAMIC PROGRAMMING

LECTURE 17

LECTURE OUTLINE

• Undiscounted problems
• Stochastic shortest path problems (SSP)
• Proper and improper policies
• Analysis and computational methods for SSP
• Pathologies of SSP
• SSP under weak conditions

192
UNDISCOUNTED PROBLEMS

• System: xk+1 = f (xk , uk , wk )


• Cost of a policy π = {µ0 , µ1 , . . .}
Jπ (x0 ) = limsup_{N→∞} E_{wk, k=0,1,...} { Σ_{k=0}^{N−1} g(xk , µk (xk ), wk ) }

Note that Jπ (x0 ) and J ∗ (x0 ) can be +∞ or −∞


• Shorthand notation for DP mappings
 
(T J)(x) = min_{u∈U(x)} E_w { g(x, u, w) + J(f (x, u, w)) },   ∀ x

(Tµ J)(x) = E_w { g(x, µ(x), w) + J(f (x, µ(x), w)) },   ∀ x

• T and Tµ need not be contractions in general,


but their monotonicity is helpful (see Ch. 4, Vol.
II of text for an analysis).
• SSP problems provide a “soft boundary” be-
tween the easy finite-state discounted problems
and the hard undiscounted problems.
− They share features of both.
− Some nice theory is recovered thanks to the
termination state, and special conditions.
193
SSP THEORY SUMMARY I

• As before, we have a cost-free termination state t, a


finite number of states 1, . . . , n, and finite number
of controls.
• Mappings T and Tµ (modified to account for
termination state t). For all i = 1, . . . , n:
(Tµ J)(i) = g(i, µ(i)) + Σ_{j=1}^n pij(µ(i)) J(j),

(T J)(i) = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n pij(u) J(j) ],

or Tµ J = gµ + Pµ J and T J = minµ [gµ + Pµ J].


• Definition: A stationary policy µ is called proper,
if under µ, from every state i, there is a positive
probability path that leads to t.
• Important fact: (To be shown) If µ is proper,
Tµ is a contraction w. r. t. some weighted sup-norm:

max_i (1/vi) |(Tµ J)(i) − (Tµ J ′ )(i)| ≤ ρµ max_i (1/vi) |J(i) − J ′ (i)|
• T is similarly a contraction if all µ are proper
(the case discussed in the text, Ch. 7, Vol. I).
194
SSP THEORY SUMMARY II

• The theory can be pushed one step further.


Instead of all policies being proper, assume that:
(a) There exists at least one proper policy
(b) For each improper µ, Jµ (i) = ∞ for some i
• Example: Deterministic shortest path problem
with a single destination t.
− States <=> nodes; Controls <=> arcs
− Termination state <=> the destination
− Assumption (a) <=> every node is con-
nected to the destination
− Assumption (b) <=> all cycle costs > 0
• Note that T is not necessarily a contraction.
• The theory in summary is as follows:
− J ∗ is the unique solution of Bellman’s Eq.
− µ∗ is optimal if and only if Tµ∗ J ∗ = T J ∗
− VI converges: T k J → J ∗ for all J ∈ ℜn
− PI terminates with an optimal policy, if started
with a proper policy
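• A tiny numerical sketch of the VI result (illustrative data, not from the text): an SSP in which every policy is proper by construction, with a greedy policy extracted at the end.

# SSP VI sketch: states 0..n-1 plus termination state t; every control has
# positive probability of reaching t, so all policies are proper.
import numpy as np

n, m = 3, 2
rng = np.random.default_rng(2)
p = rng.random((m, n, n + 1)) + 0.1        # p[u, i, n] is the prob. of reaching t
p /= p.sum(axis=2, keepdims=True)
g = 1.0 + rng.random((m, n))               # positive costs per stage

J = np.zeros(n)
for _ in range(2000):                      # T^k J -> J*
    J = np.array([min(g[u, i] + p[u, i, :n] @ J for u in range(m))
                  for i in range(n)])
mu = [int(np.argmin([g[u, i] + p[u, i, :n] @ J for u in range(m)]))
      for i in range(n)]                   # greedy policy: T_mu J* = T J*
print(J, mu)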

195
SSP ANALYSIS I

• For a proper policy µ, Jµ is the unique fixed


point of Tµ , and Tµk J → Jµ for all J (holds by the
theory of Vol. I, Section 7.2)
• Key Fact: A µ satisfying J ≥ Tµ J for some
J ∈ ℜn must be proper - true because
J ≥ Tµ^k J = Pµ^k J + Σ_{m=0}^{k−1} Pµ^m gµ

since Jµ = Σ_{m=0}^∞ Pµ^m gµ and some component of
the term on the right blows up as k → ∞ if µ is
improper (by our assumptions).
• Consequence: T can have at most one fixed
point within ℜn .
Proof: If J and J ′ are two fixed points, select µ
and µ′ such that J = T J = Tµ J and J ′ = T J ′ =
Tµ′ J ′ . By preceding assertion, µ and µ′ must be
proper, and J = Jµ and J ′ = Jµ′ . Also

J = T^k J ≤ (Tµ′)^k J → Jµ′ = J ′

Similarly, J ′ ≤ J, so J = J ′ .

196
SSP ANALYSIS II

• We first show that T has a fixed point, and also


that PI converges to it.
• Use PI. Generate a sequence of proper policies
{µk } starting from a proper policy µ0 .
• µ1 is proper and Jµ0 ≥ Jµ1 since

Jµ0 = Tµ0 Jµ0 ≥ T Jµ0 = Tµ1 Jµ0 ≥ (Tµ1)^k Jµ0 ≥ Jµ1

• Thus {Jµk } is nonincreasing, some policy µ̄ is


repeated and Jµ̄ = T Jµ̄ . So Jµ̄ is fixed point of T .
• Next show that T k J → Jµ̄ for all J, i.e., VI
converges to the same limit as PI. (Sketch: True
if J = Jµ̄ , argue using the properness of µ̄ to show
that the terminal cost difference J − Jµ̄ does not
matter.)
• To show Jµ̄ = J ∗ , for any π = {µ0 , µ1 , . . .}

Tµ0 · · · Tµk−1 J0 ≥ T k J0 ,

where J0 ≡ 0. Take lim sup as k → ∞, to obtain


Jπ ≥ Jµ̄ , so µ̄ is optimal and Jµ̄ = J ∗ .
197
SSP ANALYSIS III

• Contraction Property: If all policies are proper


(cf. Section 7.1, Vol. I), Tµ and T are contractions
with respect to a weighted sup norm.
Proof: Consider a new SSP problem where the
transition probabilities are the same as in the orig-
inal, but the transition costs are all equal to −1.
Let Jˆ be the corresponding optimal cost vector.
For all µ,
Ĵ(i) = −1 + min_{u∈U(i)} Σ_{j=1}^n pij(u) Ĵ(j) ≤ −1 + Σ_{j=1}^n pij(µ(i)) Ĵ(j)

For vi = −Ĵ(i), we have vi ≥ 1, and for all µ,

Σ_{j=1}^n pij(µ(i)) vj ≤ vi − 1 ≤ ρ vi ,   i = 1, . . . , n,

where

ρ = max_{i=1,...,n} (vi − 1)/vi < 1.

This implies Tµ and T are contractions of modu-
lus ρ for the norm ||J|| = max_{i=1,...,n} |J(i)|/vi (by the
results of earlier lectures).
198
SSP ALGORITHMS

• All the basic algorithms have counterparts un-


der our assumptions; see the text (Ch. 3, Vol. II)
• “Easy” case: All policies proper, in which case
the mappings T and Tµ are contractions
• Even with improper (infinite cost) policies all
basic algorithms have satisfactory counterparts
− VI and PI
− Optimistic PI
− Asynchronous VI
− Asynchronous PI
− Q-learning analogs
• ** THE BOUNDARY OF NICE THEORY **
• Serious complications arise under any one of the
following:
− There is no proper policy
− There is an improper policy with finite cost for all i
− The state space is infinite and/or the control
space is infinite [infinite but compact U (i)
can be dealt with]

199
PATHOLOGIES I: DETERM. SHORTEST PATHS

[Figure: a single node 1 and the destination t; control u′ is a zero-cost self-transition at node 1, and control u moves from 1 to t at cost b.]

• Two policies, one proper (apply u), one im-


proper (apply u′ )
• Bellman’s equation is

J(1) = min[ J(1), b ]

Set of solutions is (−∞, b].


• Case b > 0, J ∗ = 0: VI does not converge to
J ∗ except if started from J ∗ . PI may get stuck
starting from the inferior proper policy
• Case b < 0, J ∗ = b: VI converges to J ∗ if
started above J ∗ , but not if started below J ∗ . PI
can oscillate (if started with u′ it generates u, and
if started with u it can generate u′ )

200
PATHOLOGIES II: BLACKMAILER’S DILEMMA

• Two states, state 1 and the termination state t.


• At state 1, choose u ∈ (0, 1] (the blackmail
amount demanded) at a cost −u, and move to t
with prob. u2 , or stay in 1 with prob. 1 − u2 .
• Every stationary policy is proper, but the con-
trol set is not finite (also not compact).
• For any stationary µ with µ(1) = u, we have

Jµ (1) = −u + (1 − u2 )Jµ (1)

from which Jµ (1) = −1/u


• Thus J ∗ (1) = −∞, and there is no optimal
stationary policy.
• A nonstationary policy is optimal: demand
µk (1) = γ/(k + 1) at time k, with γ ∈ (0, 1/2).
− Blackmailer requests diminishing amounts over
time, which add to ∞.
− The probability of the victim’s refusal dimin-
ishes at a much faster rate, so the probabil-
ity that the victim stays forever compliant is
strictly positive.
201
SSP UNDER WEAK CONDITIONS I

• Assume there exists a proper policy, and J ∗ is


real-valued. Let

Ĵ(i) = min_{µ: proper} Jµ (i),   i = 1, . . . , n

Note that we may have Ĵ ≠ J ∗ [i.e., Ĵ(i) ≠ J ∗ (i)
for some i].
• It can be shown that Ĵ is the unique solution
of Bellman’s equation within the set {J | J ≥ Ĵ}
• Also VI converges to Ĵ starting from any J ≥ Ĵ
• The analysis is based on the δ-perturbed prob-
lem: adding a small δ > 0 to g. Then:
− All improper policies have infinite cost for
some states in the δ-perturbed problem
− All proper policies have an additional O(δ)
cost for all states
− The optimal cost Jδ∗ of the δ-perturbed prob-
lem converges to Jˆ as δ ↓ 0
• There is also a PI method that generates a
sequence {µk } with Jµk → Ĵ. It uses a sequence
δk ↓ 0, and policy evaluation based on the δk -
perturbed problems.
202
SSP UNDER WEAK CONDITIONS II

• J ∗ need not be a solution of Bellman’s equation!
Neither need Jµ for an improper policy µ.
[Figure: a multi-state example with destination t and an improper policy µ; from state 1 the transition goes to one successor with probability p and to another with probability 1 − p, and the subsequent transitions have stage costs in {−2, −1, 0, 1, 2}.]
• For p = 1/2, we have

Jµ (1) = 0,  Jµ (2) = Jµ (5) = 1,  Jµ (3) = Jµ (7) = 0,  Jµ (4) = Jµ (6) = 2,

so the Bellman Eq. at state 1, Jµ (1) = (1/2)( Jµ (2) + Jµ (5) ),
is violated.
• References: Bertsekas, D. P., and Yu, H., 2015.
“Stochastic Shortest Path Problems Under Weak
Conditions,” Report LIDS-2909; Math. of OR, to
appear. Also the on-line updated Ch. 4 of the
text.
203
6.231 DYNAMIC PROGRAMMING

LECTURE 18

LECTURE OUTLINE

• Undiscounted total cost problems


• Positive and negative cost problems
• Deterministic optimal cost problems
• Adaptive (linear quadratic) DP
• Affine monotonic and risk sensitive problems

Reference:
Updated Chapter 4 of Vol. II of the text:
Noncontractive Total Cost Problems
On-line at:
http://web.mit.edu/dimitrib/www/dpchapter.html
Check for most recent version

204
CONTRACTIVE/SEMICONTRACTIVE PROBLEMS

• Infinite horizon total cost DP theory divides in


− “Easy” problems where the results one ex-
pects hold (uniqueness of solution of Bell-
man Eq., convergence of PI and VI, etc)
− “Difficult” problems where one or more of
these results do not hold
• “Easy” problems are characterized by the pres-
ence of strong contraction properties in the asso-
ciated algorithmic maps T and Tµ
• A typical example of an “easy” problem is dis-
counted problems with bounded cost per stage
(Chs. 1 and 2 of Vol. II) and some with unbounded
cost per stage (Section 1.5 of Vol. II)
• Another is semicontractive problems, where Tµ
is a contraction for some µ but is not for other
µ, and assumptions are imposed that exclude the
“ill-behaved” µ from optimality
• A typical example is SSP where the improper
policies are assumed to have infinite cost for some
initial states (Chapter 3 of Vol. II)
• In this lecture we go into “difficult” problems

205
UNDISCOUNTED TOTAL COST PROBLEMS

• Beyond problems with strong contraction prop-


erties. One or more of the following hold:
− No termination state assumed
− Infinite state and control spaces
− Either no discounting, or discounting and
unbounded cost per stage
− Risk-sensitivity/exotic cost functions (e.g.,
SSP problems with exponentiated cost)
• Important classes of problems
− SSP under weak conditions (e.g., the previ-
ous lecture)
− Positive cost problems (control/regulation,
robotics, inventory control)
− Negative cost problems (maximization of pos-
itive rewards - investment, gambling, finance)
− Deterministic positive cost problems - Adap-
tive DP
− A variety of infinite-state problems in queue-
ing, optimal stopping, etc
− Affine monotonic and risk-sensitive problems
(a generalization of SSP)
206
POS. AND NEG. COST - FORMULATION

• System xk+1 = f (xk , uk , wk ) and cost


Jπ (x0 ) = lim_{N→∞} E_{wk, k=0,1,...} { Σ_{k=0}^{N−1} α^k g(xk , µk (xk ), wk ) }

Discount factor α ∈ (0, 1], but g may be unbounded


• Case P: g(x, u, w) ≥ 0 for all (x, u, w)
• Case N: g(x, u, w) ≤ 0 for all (x, u, w)
• Summary of analytical results:
− Many of the strong results for discounted
and SSP problems fail
− Analysis more complex; need to allow for Jπ
and J * to take values +∞ (under P) or −∞
(under N)
− However, J * is a solution of Bellman’s Eq.
(typically nonunique)
− Opt. conditions: µ is optimal if and only if
Tµ J * = T J * (P) or if Tµ Jµ = T Jµ (N)

207
SUMMARY OF ALGORITHMIC RESULTS

• Neither VI nor PI are guaranteed to work


• Behavior of VI
− P: T k J → J * for all J with 0 ≤ J ≤ J * , if
U (x) is finite (or compact plus more condi-
tions - see the text)
− N: T k J → J * for all J with J * ≤ J ≤ 0
• Behavior of PI
− P: Jµk is monotonically nonincreasing but
may get stuck at a nonoptimal policy
− N: Jµk may oscillate (but an optimistic form
of PI converges to J * - see the text)
• These anomalies may be mitigated to a greater
or lesser extent by exploiting special structure, e.g.
− Presence of a termination state
− Proper/improper policy structure in SSP
• Finite-state problems under P can be trans-
formed to equivalent SSP problems by merging
(with a simple algorithm) all states x with J * (x) =
0 into a termination state. They can then be
solved using the powerful SSP methodology (see
updated Ch. 4, Section 4.1.4)
208
EXAMPLE FROM THE PREVIOUS LECTURE

• This is essentially a shortest path example with


termination state t
[Figure: node 1 with the zero-cost self-transition u′ and the cost-b arc u to the cost-free state t, together with the Bellman equation solution sets for Case P and Case N. In Case P, Jµ′ = J * = (0, 0) and Jµ = (b, 0); VI can fail unless started from J ∗ , and PI stops at µ. In Case N, Jµ′ = (0, 0) and Jµ = J * = (b, 0); VI fails starting from J(1) < J ∗ (1), J(t) = 0, and PI oscillates between µ and µ′ .]

• Bellman Equation:

J(1) = min[ J(1), b + J(t) ],   J(t) = J(t)

209
DETERM. OPT. CONTROL - FORMULATION

• System: xk+1 = f (xk , uk ), arbitrary state and


control spaces X and U
• Cost positivity: 0 ≤ g(x, u), ∀ x ∈ X, u ∈ U (x)
• No discounting:
Jπ (x0 ) = lim_{N→∞} Σ_{k=0}^{N−1} g(xk , µk (xk ))

• “Goal set of states” X0


− All x ∈ X0 are cost-free and absorbing
• A shortest path-type problem, but with possibly
infinite number of states
• A common formulation of control/regulation
and planning/robotics problems
• Example: Linear system, quadratic cost (possi-
bly with state and control constraints), X0 = {0}
or X0 is a small set around 0
• Strong analytical and computational results

210
DETERM. OPT. CONTROL - ANALYSIS

• Bellman’s Eq. holds (for not only this problem,


but also all deterministic total cost problems)
 
J *(x) = min_{u∈U(x)} [ g(x, u) + J *(f (x, u)) ],   ∀ x ∈ X

• Definition: A policy π terminates starting from


x if the state sequence {xk } generated starting
from x0 = x and using π reaches X0 in finite time,
i.e., satisfies xk̄ ∈ X0 for some index k¯
• Assumptions: The cost structure is such that
− J *(x) > 0, ∀ x ∉ X0 (termination incentive)
− For every x with J * (x) < ∞ and every ǫ > 0,
there exists a policy π that terminates start-
ing from x and satisfies Jπ (x) ≤ J * (x) + ǫ.
• Uniqueness of solution of Bellman’s Eq.: J * is
the unique solution within the set

J = { J | 0 ≤ J(x) ≤ ∞, ∀ x ∈ X, J(x) = 0, ∀ x ∈ X0 }

• Counterexamples: Earlier SP problem. Also


linear quadratic problems where the Riccati equa-
tion has two solutions (observability not satisfied).
211
DET. OPT. CONTROL - VI/PI CONVERGENCE

• The sequence {T k J} generated by VI starting


from a J ∈ J with J ≥ J * converges to J *
• If in addition U (x) is finite (or compact plus
more conditions - see the text), the sequence {T k J}
generated by VI starting from any function J ∈ J
converges to J *
• A sequence {Jµk } generated by PI satisfies
Jµk (x) ↓ J * (x) for all x ∈ X
• PI counterexample: The earlier SP example
• Optimistic PI algorithm: Generates pairs {Jk , µk }
as follows: Given Jk , we generate µk according to

µk (x) = arg min_{u∈U(x)} [ g(x, u) + Jk (f (x, u)) ],   x ∈ X

and obtain Jk+1 with mk ≥ 1 VIs using µk :

Jk+1 (x0 ) = Jk (xmk ) + Σ_{t=0}^{mk−1} g(xt , µk (xt )),   x0 ∈ X

If J0 ∈ J and J0 ≥ T J0 , we have Jk ↓ J * .
• Rollout with terminating heuristic (e.g., MPC).
212
LINEAR-QUADRATIC ADAPTIVE CONTROL

• System: xk+1 = Axk +Buk , xk ∈ ℜn , uk ∈ ℜm


• Cost: Σ_{k=0}^∞ (xk′ Q xk + uk′ R uk ),   Q ≥ 0, R > 0

• Optimal policy is linear: µ∗ (x) = Lx


• The Q-factor of each linear policy µ is quadratic:
 
Qµ (x, u) = ( x′ u′ ) Kµ ( x′ u′ )′   (∗)
• We will consider A and B unknown
• We use as basis functions all the quadratic func-
tions involving state and control components
xi xj , ui uj , xi uj , ∀ i, j
These form the “rows” φ(x, u)′ of a matrix Φ
• The Q-factor Qµ of a linear policy µ can be
exactly represented within the subspace spanned
by the basis functions:
Qµ (x, u) = φ(x, u)′ rµ
where rµ consists of the components of Kµ in (*)
• Key point: Compute rµ by simulation of µ (Q-
factor evaluation by simulation, in a PI scheme)
213
PI FOR LINEAR-QUADRATIC PROBLEM

• Policy evaluation: rµ is found (exactly) by the least
squares minimization

min_r Σ_{(xk ,uk)} ( φ(xk , uk)′ r − ( xk′ Q xk + uk′ R uk + φ(xk+1 , µ(xk+1))′ r ) )^2

where (xk , uk , xk+1 ) are “enough” samples gener-


ated by the system or a simulator of the system.
• Policy improvement:

µ(x) ∈ arg min_u φ(x, u)′ rµ

• Knowledge of A and B is not required


• If the policy evaluation is done exactly, this
becomes exact PI, and convergence to an optimal
policy can be shown
• The basic idea of this example has been gener-
alized and forms the starting point of the field of
adaptive DP
• This field deals with adaptive control of continuous-
space (possibly nonlinear) dynamic systems, in
both discrete and continuous time
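• A minimal sketch of this policy evaluation step for a scalar linear-quadratic problem (all numerical values are illustrative assumptions; A and B are used only inside the simulator, not by the algorithm):

# Simulation-based Q-factor evaluation of a linear policy u = L x by least
# squares over the quadratic basis [x^2, u^2, x*u].
import numpy as np

a, b_, Q, R, L = 0.8, 0.5, 1.0, 0.1, -0.6      # closed loop a + b_*L = 0.5 (stable)
phi = lambda x, u: np.array([x * x, u * u, x * u])

rng = np.random.default_rng(3)
rows, targets = [], []
x = 1.0
for _ in range(200):
    u = L * x + 0.5 * rng.standard_normal()    # policy plus exploration noise
    x_next = a * x + b_ * u                    # "simulator" step
    rows.append(phi(x, u) - phi(x_next, L * x_next))
    targets.append(Q * x * x + R * u * u)
    x = x_next
r_mu, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)

# Closed-form check: J_mu(x) = K x^2 with K = (Q + L^2 R)/(1 - (a + b_ L)^2)
K = (Q + L * L * R) / (1.0 - (a + b_ * L) ** 2)
print(r_mu, [Q + K * a * a, R + K * b_ * b_, 2 * K * a * b_])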

214
FINITE-STATE AFFINE MONOTONIC PROBLEMS

• Generalization of positive cost finite-state stochas-


tic total cost problems where:
− In place of a transition prob. matrix Pµ , we
have a general matrix Aµ ≥ 0
− In place of 0 terminal cost function, we have
a more general terminal cost function J¯ ≥ 0
• Mappings
Tµ J = bµ + Aµ J,   (T J)(i) = min_{µ∈M} (Tµ J)(i)

• Cost function of π = {µ0 , µ1 , . . .}

Jπ (i) = limsup_{N→∞} (Tµ0 · · · TµN−1 J̄)(i),   i = 1, . . . , n

• Special case: An SSP with an exponential risk-


sensitive cost, where for all i and u ∈ U (i)
Aij (u) = pij (u) e^{g(i,u,j)} ,   b(i, u) = pit (u) e^{g(i,u,t)}

• Interpretation:

Jπ (i) = E { e^{(length of path of π starting from i)} }

215
AFFINE MONOTONIC PROBLEMS: ANALYSIS

• The analysis follows the lines of analysis of SSP


• Key notion (generalizes the notion of a proper
policy in SSP): A policy µ is stable if Akµ → 0; else
it is called unstable
• We have

Tµ^N J = Aµ^N J + Σ_{k=0}^{N−1} Aµ^k bµ ,   ∀ J ∈ ℜn , N = 1, 2, . . . ,

• For a stable policy µ, we have for all J ∈ ℜn

Jµ = limsup_{N→∞} Tµ^N J = limsup_{N→∞} Σ_{k=0}^{N−1} Aµ^k bµ = (I − Aµ )−1 bµ
• Consider the following assumptions:
(1) There exists at least one stable policy
(2) For every unstable policy µ, at least one com-
ponent of Σ_{k=0}^∞ Aµ^k bµ is equal to ∞
• Under (1) and (2) the strong SSP analytical
and algorithmic theory generalizes
• Under just (1) the weak SSP theory generalizes.
216
6.231 DYNAMIC PROGRAMMING

LECTURE 19

LECTURE OUTLINE

• We begin a lecture series on approximate DP.


• Reading: Chapters 6 and 7, Vol. 2 of the text.
• Today we discuss some general issues about
approximation and simulation
• We classify/overview the main approaches:
− Approximation in policy space (policy para-
metrization, gradient methods, random search)
− Approximation in value space (approximate
PI, approximate VI, Q-Learning, Bellman
error approach, approximate LP)
− Rollout/Simulation-based single policy iter-
ation (will not discuss this further)
− Approximation in value space using problem
approximation (simplification - forms of ag-
gregation - limited lookahead) - will not dis-
cuss much
217
GENERAL ORIENTATION TO ADP

• ADP (late 80s - present) is a breakthrough


methodology that allows the application of DP to
problems with many or infinite number of states.
• Other names for ADP are:
− “reinforcement learning” (RL)
− “neuro-dynamic programming” (NDP)
• We will mainly adopt an n-state discounted
model (the easiest case - but think of HUGE n).
• Extensions to other DP models (continuous
space, continuous-time, not discounted) are possi-
ble (but more quirky). We will set aside for later.
• There are many approaches:
− Problem approximation and 1-step lookahead
− Simulation-based approaches (we will focus
on these)
• Simulation-based methods are of three types:
− Rollout (we will not discuss further)
− Approximation in policy space
− Approximation in value space
218
WHY DO WE USE SIMULATION?

• One reason: Computational complexity advan-


tage in computing expected values and sums/inner
products involving a very large number of terms
− Speeds up linear algebra: Any sum Σ_{i=1}^n ai
can be written as an expected value

Σ_{i=1}^n ai = Σ_{i=1}^n ξi (ai /ξi ) = Eξ { ai /ξi },

where ξ is any prob. distribution over {1, . . . , n}

− It is approximated by generating many sam-
ples {i1 , . . . , ik } from {1, . . . , n}, according
to ξ, and Monte Carlo averaging (see the
sketch at the end of this slide):

Σ_{i=1}^n ai = Eξ { ai /ξi } ≈ (1/k) Σ_{t=1}^k a_{it} /ξ_{it}

− Choice of ξ makes a difference. Importance


sampling methodology.
• Simulation is also convenient when an analytical
model of the system is unavailable, but a simula-
tion/computer model is possible.
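• A minimal sketch of this Monte Carlo averaging (illustrative data; the proportional choice of ξ plays the role of importance sampling):

# Estimate sum_i a_i by sampling indices from a distribution xi and
# averaging a_{i_t} / xi_{i_t}.
import numpy as np

rng = np.random.default_rng(4)
n = 10**5
a = rng.random(n)                     # the terms a_1, ..., a_n

xi_unif = np.full(n, 1.0 / n)         # uniform sampling distribution
xi_prop = a / a.sum()                 # roughly "importance sampling" choice

k = 5000
for xi in (xi_unif, xi_prop):
    idx = rng.choice(n, size=k, p=xi)
    est = np.mean(a[idx] / xi[idx])   # (1/k) sum_t a_{i_t} / xi_{i_t}
    print(est, a.sum())               # proportional sampling has (near) zero variance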
219
APPROXIMATION IN POLICY SPACE

• A brief discussion; we will return to it later.


• Use parametrization µ(i; r) of policies with a
vector r = (r1 , . . . , rs ). Examples:
− Polynomial, e.g., µ(i; r) = r1 + r2 · i + r3 · i2
− Multi-warehouse inventory system: µ(i; r) is
threshold policy with thresholds r = (r1 , . . . , rs )
• Optimize the cost over r. For example:
− Each value of r defines a stationary policy,
with cost starting at state i denoted by J̃(i; r).
− Let (p1 , . . . , pn ) be some probability distri-
bution over the states, and minimize over r
Σ_{i=1}^n pi J̃(i; r)

− Use a random search, gradient, or other method


• A special case: The parameterization of the
policies is indirect, through a cost approximation
architecture Ĵ, i.e.,

µ(i; r) ∈ arg min_{u∈U(i)} Σ_{j=1}^n pij (u) ( g(i, u, j) + αĴ(j; r) )

220
APPROXIMATION IN VALUE SPACE

• Approximate J ∗ or Jµ from a parametric class
J̃(i; r), where i is the current state and r = (r1 , . . . , rm )
is a vector of “tunable” scalar weights
• Use J˜ in place of J ∗ or Jµ in various algorithms
and computations (VI, PI, LP)
• Role of r: By adjusting r we can change the
“shape” of J˜ so that it is “close” to J ∗ or Jµ
• Two key issues:
− The choice of parametric class J̃(i; r) (the
approximation architecture)
− Method for tuning the weights (“training”
the architecture)
• Success depends strongly on how these issues
are handled ... also on insight about the problem
• A simulator may be used, particularly when
there is no mathematical model of the system
• We will focus on simulation, but this is not the
only possibility
• We may also use parametric approximation for
Q-factors
221
APPROXIMATION ARCHITECTURES

• Divided into linear and nonlinear [i.e., linear or
nonlinear dependence of J̃(i; r) on r]
• Linear architectures are easier to train, but non-
linear ones (e.g., neural networks) are richer
• Computer chess example:
− Think of board position as state and move
as control
− Uses a feature-based position evaluator that
assigns a score (or approximate Q-factor) to
each position/move

[Figure: position evaluator block diagram; extraction of features (material balance, mobility, safety, etc) followed by a weighting of the features produces the score.]

• Relatively few special features and weights, and


multistep lookahead
222
LINEAR APPROXIMATION ARCHITECTURES

• Often, the features encode much of the nonlin-


earity inherent in the cost function approximated
• Then the approximation may be quite accurate
without a complicated architecture. (Extreme ex-
ample: The ideal feature is the true cost function)
• With well-chosen features, we can use a linear
architecture:

J̃(i; r) = φ(i)′ r, ∀ i,   or   J̃(r) = Φr = Σ_{j=1}^s Φj rj

Φ: the matrix whose rows are φ(i)′ , i = 1, . . . , n,
Φj is the jth column of Φ

[Figure: state i → feature extraction mapping → feature vector φ(i) → linear cost approximator φ(i)′ r.]

• This is approximation on the subspace


S = {Φr | r ∈ ℜs }
spanned by the columns of Φ (basis functions)
• Many examples of feature types: Polynomial
approximation, radial basis functions, domain spe-
cific, etc
223
ILLUSTRATIONS: POLYNOMIAL TYPE

• Polynomial Approximation, e.g., a quadratic


approximating function. Let the state be i =
(i1 , . . . , iq ) (i.e., have q “dimensions”) and define

φ0 (i) = 1, φk (i) = ik , φkm (i) = ik im , k, m = 1, . . . , q

Linear approximation architecture:


J̃(i; r) = r0 + Σ_{k=1}^q rk ik + Σ_{k=1}^q Σ_{m=k}^q rkm ik im ,

where r has components r0 , rk , and rkm .


• Interpolation: A subset I of special/representative
states is selected, and the parameter vector r has
one component ri per state i ∈ I. The approxi-
mating function is

J̃(i; r) = ri ,   i ∈ I,

J̃(i; r) = interpolation using the values at i ∈ I,   i ∉ I
For example, piecewise constant, piecewise linear,
more general polynomial interpolations.
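• A minimal sketch of the quadratic (polynomial) linear architecture (illustrative; the weights r would be tuned by the training methods discussed later):

# Feature vector with constant, linear, and quadratic terms of a q-dimensional
# state i = (i_1, ..., i_q), and the linear approximator J~(i; r) = phi(i)' r.
import numpy as np
from itertools import combinations_with_replacement

def phi(i):
    i = np.asarray(i, dtype=float)
    quad = [i[k] * i[m] for k, m in combinations_with_replacement(range(len(i)), 2)]
    return np.concatenate(([1.0], i, quad))

def J_tilde(i, r):
    return phi(i) @ r                      # J~(i; r) = phi(i)' r

q = 3
r = np.zeros(len(phi(np.zeros(q))))        # one weight per basis function
print(len(r), J_tilde([1.0, 2.0, 0.5], r))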
224
A DOMAIN SPECIFIC EXAMPLE

• Tetris game (used as testbed in competitions)

[Figure: a Tetris board; play continues until termination.]

© source unknown. All rights reserved. This content is excluded from our Creative
Commons license. For more information, see http://ocw.mit.edu/fairuse.

• J ∗ (i): optimal score starting from position i


• Number of states > 2^200 (for a 10 × 20 board)
• Success with just 22 features, readily recognized
by tetris players as capturing important aspects of
the board position (heights of columns, etc)
225
APPROX. PI - OPTION TO APPROX. Jµ OR Qµ

• Use simulation to approximate the cost Jµ of


the current policy µ
• Generate “improved” policy µ by minimizing in
(approx.) Bellman equation

Guess Initial Policy

Evaluate Approximate Cost


Approximate Policy
J˜µ (r) = Φr Using Simulation Evaluation

Generate “Improved” Policy µ Policy Improvement

• Alternatively, approximate the Q-factors of µ


• A survey reference: D. P. Bertsekas, “Approx-
imate Policy Iteration: A Survey and Some New
Methods,” J. of Control Theory and Appl., Vol.
9, 2011, pp. 310-335.

226
DIRECTLY APPROXIMATING J ∗ OR Q∗

• Approximation of the optimal cost function J ∗


directly (without PI)
− Q-Learning: Use a simulation algorithm to
approximate the Q-factors
Q∗ (i, u) = g(i, u) + α Σ_{j=1}^n pij (u) J ∗ (j);

and the optimal costs

J ∗ (i) = min_{u∈U(i)} Q∗ (i, u)
− Bellman Error approach: Find r to

min_r Ei { ( J̃(i; r) − (T J̃)(i; r) )^2 }

where Ei {·} is taken with respect to some
distribution over the states
− Approximate Linear Programming (we will
not discuss here)
• Q-learning can also be used with approxima-
tions
• Q-learning and Bellman error approach can also
be used for policy evaluation
227
DIRECT POLICY EVALUATION

• Can be combined with regular and optimistic


policy iteration
• Find r that minimizes ||Jµ − J̃(·, r)||ξ^2 , i.e.,

Σ_{i=1}^n ξi ( Jµ (i) − J̃(i, r) )^2 ,   ξi : some pos. weights
• Nonlinear architectures may be used
• The linear architecture case: Amounts to pro-
jection of Jµ onto the approximation subspace

[Figure: direct method; the cost vector Jµ is projected onto the subspace S = {Φr | r ∈ ℜs }, yielding ΠJµ .]

• Solution by linear least squares methods


228
POLICY EVALUATION BY SIMULATION

• Projection by Monte Carlo Simulation: Com-


pute the projection ΠJµ of Jµ on subspace S =
{Φr | r ∈ ℜs }, with respect to a weighted Eu-
clidean norm k · kξ
• Equivalently, find Φr∗ , where

r∗ = arg min_{r∈ℜs} ||Φr − Jµ ||ξ^2 = arg min_{r∈ℜs} Σ_{i=1}^n ξi ( Jµ (i) − φ(i)′ r )^2

• Setting to 0 the gradient at r∗ ,

r∗ = ( Σ_{i=1}^n ξi φ(i)φ(i)′ )^{−1} Σ_{i=1}^n ξi φ(i) Jµ (i)

• Generate samples (i1 , Jµ (i1 )), . . . , (ik , Jµ (ik ))
using distribution ξ
• Approximate by Monte Carlo the two “expected
values” with low-dimensional calculations

r̂k = ( Σ_{t=1}^k φ(it )φ(it )′ )^{−1} Σ_{t=1}^k φ(it ) Jµ (it )

• Equivalent least squares alternative calculation:

r̂k = arg min_{r∈ℜs} Σ_{t=1}^k ( φ(it )′ r − Jµ (it ) )^2

229
INDIRECT POLICY EVALUATION

• An example: Solve the projected equation Φr =


ΠTµ (Φr) where Π is projection w/ respect to a
suitable weighted Euclidean norm (Galerkin ap-
prox.)

[Figure: left, the direct method projects the cost vector Jµ onto the subspace S = {Φr | r ∈ ℜs }, giving ΠJµ ; right, the indirect method solves a projected form of Bellman’s equation, Φr = ΠTµ (Φr).]
• Solution methods that use simulation (to man-


age the calculation of Π)
− TD(λ): Stochastic iterative algorithm for solv-
ing Φr = ΠTµ (Φr)
− LSTD(λ): Solves a simulation-based approx-
imation w/ a standard solver
− LSPE(λ): A simulation-based form of pro-
jected value iteration; essentially
Φrk+1 = ΠTµ (Φrk ) + simulation noise
230
BELLMAN EQUATION ERROR METHODS

• Another example of indirect approximate policy


evaluation:
min_r ||Φr − Tµ (Φr)||ξ^2   (∗)
where k · kξ is Euclidean norm, weighted with re-
spect to some distribution ξ
• It is closely related to the projected equation ap-
proach (with a special choice of projection norm)
• Several ways to implement projected equation
and Bellman error methods by simulation. They
involve:
− Generating many random samples of states
ik using the distribution ξ
− Generating many samples of transitions (ik , jk )
using the policy µ
− Form a simulation-based approximation of
the optimality condition for projection prob-
lem or problem (*) (use sample averages in
place of inner products)
− Solve the Monte-Carlo approximation of the
optimality condition
• Issues for indirect methods: How to generate
the samples? How to calculate r∗ efficiently?
231
ANOTHER INDIRECT METHOD: AGGREGATION

• An example: Group similar states together into


“aggregate states” x1 , . . . , xs ; assign a common
cost ri to each group xi . A linear architecture
called hard aggregation.

[Figure: a 3 × 3 grid of original states 1, . . . , 9 grouped into aggregate states x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}.]

    | 1 0 0 0 |
    | 1 0 0 0 |
    | 0 1 0 0 |
    | 1 0 0 0 |
Φ = | 1 0 0 0 |
    | 0 1 0 0 |
    | 0 0 1 0 |
    | 0 0 1 0 |
    | 0 0 0 1 |

• Solve an “aggregate” DP problem to obtain


r = (r1 , . . . , rs ).
• More general/mathematical view: Solve
Φr = ΦDTµ (Φr)
where the rows of D and Φ are prob. distributions
(e.g., D and Φ “aggregate” rows and columns of
the linear system J = Tµ J)
• Compare with projected equation Φr = ΠTµ (Φr).
Note: ΦD is a projection in some interesting cases

232
AGGREGATION AS PROBLEM APPROXIMATION
[Figure: aggregation framework; aggregate state x is linked to original system states i via disaggregation probabilities dxi , the original system moves from i to j according to pij (u) with cost g(i, u, j), and original state j is linked to aggregate state y via aggregation probabilities φjy .]

p̂xy (u) = Σ_{i=1}^n dxi Σ_{j=1}^n pij (u) φjy ,

ĝ(x, u) = Σ_{i=1}^n dxi Σ_{j=1}^n pij (u) g(i, u, j)
i=1 j=1

• Aggregation can be viewed as a systematic ap-


proach for problem approx. Main elements:
− Solve (exactly or approximately) the “ag-
gregate” problem by any kind of VI or PI
method (including simulation-based methods)
− Use the optimal cost of the aggregate prob-
lem to approximate the optimal cost of the
original problem
• Because an exact PI algorithm is used to solve
the approximate/aggregate problem the method
behaves more regularly than the projected equa-
tion approach
233
THEORETICAL BASIS OF APPROXIMATE PI

• If policies are approximately evaluated using an


approximation architecture such that

max_i |J̃(i, rk ) − Jµk (i)| ≤ δ,   k = 0, 1, . . .

• If policy improvement is also approximate,

max_i |(Tµk+1 J̃)(i, rk ) − (T J̃)(i, rk )| ≤ ǫ,   k = 0, 1, . . .

• Error bound: The sequence {µk } generated by


approximate policy iteration satisfies

limsup_{k→∞} max_i ( Jµk (i) − J ∗ (i) ) ≤ (ǫ + 2αδ)/(1 − α)^2

• Typical practical behavior: The method makes


steady progress up to a point and then the iterates
Jµk oscillate within a neighborhood of J ∗ .
• Oscillations are quite unpredictable.
− Bad examples of oscillations are known.
− In practice oscillations between policies is
probably not the major concern.
− In aggregation case, there are no oscillations

234
THE ISSUE OF EXPLORATION

• To evaluate a policy µ, we need to generate cost


samples using that policy - this biases the simula-
tion by underrepresenting states that are unlikely
to occur under µ
• Cost-to-go estimates of underrepresented states
may be highly inaccurate
• This seriously impacts the improved policy µ
• This is known as inadequate exploration - a
particularly acute difficulty when the randomness
embodied in the transition probabilities is “rela-
tively small” (e.g., a deterministic system)
• Some remedies:
− Frequently restart the simulation and ensure
that the initial states employed form a rich
and representative subset
− Occasionally generate transitions that use a
randomly selected control rather than the
one dictated by the policy µ
− Other methods: Use two Markov chains (one
is the chain of the policy and is used to gen-
erate the transition sequence, the other is
used to generate the state sequence).
235
APPROXIMATING Q-FACTORS

• Given J̃(i; r), policy improvement requires a
model [knowledge of pij (u) for all u ∈ U (i)]
• Model-free alternative: Approximate Q-factors

Q̃(i, u; r) ≈ Σ_{j=1}^n pij (u) ( g(i, u, j) + αJµ (j) )

and use for policy improvement the minimization

µ(i) ∈ arg min_{u∈U(i)} Q̃(i, u; r)

• r is an adjustable parameter vector and Q̃(i, u; r)
is a parametric architecture, such as

Q̃(i, u; r) = Σ_{m=1}^s rm φm (i, u)

• We can adapt any of the cost approximation


approaches, e.g., projected equations, aggregation
• Use the Markov chain with states (i, u), so
pij (µ(i)) is the transition prob. to (j, µ(i)), 0 to
other (j, u′ )
• Major concern: Acutely diminished exploration
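• A minimal sketch of the linear Q-factor architecture and the model-free policy improvement step (illustrative features and weights; in practice r would be obtained by one of the evaluation methods above):

# Model-free policy improvement: mu(i) = argmin_u Q~(i, u; r) = argmin_u phi(i,u)' r.
import numpy as np

n, m, s = 4, 3, 5
rng = np.random.default_rng(7)
phi = rng.random((n, m, s))                  # phi(i, u): one s-vector per (state, control)
r = rng.random(s)                            # weights, assumed already "trained"

def Q_tilde(i, u, r):
    return phi[i, u] @ r                     # Q~(i, u; r) = sum_m r_m phi_m(i, u)

mu = [int(np.argmin([Q_tilde(i, u, r) for u in range(m)])) for i in range(n)]
print(mu)                                    # improved policy, no model p_ij(u) needed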

236
STOCHASTIC ALGORITHMS: GENERALITIES

• Consider solution of a linear equation x = b +


Ax by using m simulation samples b + wk and
A + Wk , k = 1, . . . , m, where wk , Wk are random,
e.g., “simulation noise”
• Think of x = b + Ax as approximate policy
evaluation (projected or aggregation equations)
• Stoch. approx. (SA) approach: For k = 1, . . . , m

xk+1 = (1 − γk )xk + γk ( (b + wk ) + (A + Wk )xk )

• Monte Carlo estimation (MCE) approach: Form


Monte Carlo estimates of b and A
bm = (1/m) Σ_{k=1}^m (b + wk ),   Am = (1/m) Σ_{k=1}^m (A + Wk )

Then solve x = bm + Am x by matrix inversion

xm = (I − Am )−1 bm

or iteratively
• TD(λ) and Q-learning are SA methods
• LSTD(λ) and LSPE(λ) are MCE methods
237
6.231 DYNAMIC PROGRAMMING

LECTURE 20

LECTURE OUTLINE

• Discounted problems - Approximation on sub-


space {Φr | r ∈ ℜs }
• Approximate (fitted) VI
• Approximate PI
• The projected equation
• Contraction properties - Error bounds
• Matrix form of the projected equation
• Simulation-based implementation
• LSTD and LSPE methods

238
REVIEW: APPROXIMATION IN VALUE SPACE

• Finite-spaces discounted problems: Defined by


mappings Tµ and T (T J = minµ Tµ J).
• Exact methods:
− VI: Jk+1 = T Jk
− PI: Jµk = Tµk Jµk , Tµk+1 Jµk = T Jµk
− LP: minJ c′ J subject to J ≤ T J
• Approximate versions: Plug-in subspace ap-
proximation with Φr in place of J
− VI: Φrk+1 ≈ T Φrk
− PI: Φrk ≈ Tµk Φrk , Tµk+1 Φrk = T Φrk
− LP: minr c′ Φr subject to Φr ≤ T Φr
• Approx. onto subspace S = {Φr | r ∈ ℜs }
is often done by projection with respect to some
(weighted) Euclidean norm.
• Another possibility is aggregation. Here:
− The rows of Φ are probability distributions
− Φr ≈ Jµ or Φr ≈ J * , with r the solution of
an “aggregate Bellman equation” r = DTµ (Φr)
or r = DT (Φr), where the rows of D are
probability distributions
239
APPROXIMATE (FITTED) VI

• Approximates sequentially Jk (i) = (T k J0 )(i),


k = 1, 2, . . ., with J˜k (i; rk )
• The starting function J0 is given (e.g., J0 ≡ 0)
• Approximate (Fitted) Value Iteration: A se-
quential “fit” to produce J˜k+1 from J˜k , i.e., J˜k+1 ≈
T J˜k or (for a single policy µ) J˜k+1 ≈ Tµ J˜k

• After a large enough number N of steps, J˜N (i; rN )


is used as approximation to J ∗ (i)
• Possibly use (approximate) projection Π with
respect to some projection norm,
J˜k+1 ≈ ΠT J˜k
240
WEIGHTED EUCLIDEAN PROJECTIONS

• Consider a weighted Euclidean norm


||J||ξ = ( Σ_{i=1}^n ξi J(i)^2 )^{1/2} ,

where ξ = (ξ1 , . . . , ξn ) is a positive distribution


(ξi > 0 for all i).
• Let Π denote the projection operation onto
S = {Φr | r ∈ ℜs }
with respect to this norm, i.e., for any J ∈ ℜn ,
ΠJ = Φr∗
where
r∗ = arg min_{r∈ℜs} ||Φr − J||ξ^2

• Recall that weighted Euclidean projection can


be implemented by simulation and least squares,
i.e., sampling J(i) according to ξ and solving
min_{r∈ℜs} Σ_{t=1}^k ( φ(it )′ r − J(it ) )^2
241
FITTED VI - NAIVE IMPLEMENTATION

• Select/sample a “small” subset Ik of represen-


tative states
• For each i ∈ Ik , given J˜k , compute
(T J̃k )(i) = min_{u∈U(i)} Σ_{j=1}^n pij (u) ( g(i, u, j) + αJ̃k (j; r) )

• “Fit” the function J˜k+1 (i; rk+1 ) to the “small”


set of values (T J˜k )(i), i ∈ Ik (for example use
some form of approximate projection)
• “Model-free” implementation by simulation
• Error Bound: If the fit is uniformly accurate
within δ > 0, i.e.,

max_i |J̃k+1 (i) − (T J̃k )(i)| ≤ δ,

then

limsup_{k→∞} max_{i=1,...,n} ( J̃k (i, rk ) − J ∗ (i) ) ≤ δ/(1 − α)

• But there is a potential serious problem!


242
AN EXAMPLE OF FAILURE

• Consider two-state discounted MDP with states


1 and 2, and a single policy.
− Deterministic transitions: 1 → 2 and 2 → 2
− Transition costs ≡ 0, so J ∗ (1) = J ∗ (2) = 0.
• Consider an (exact) fitted VI scheme that approx-
imates cost functions within S = { (r, 2r) | r ∈ ℜ }
with a weighted least squares fit; here Φ = ( 1, 2 )′
• Given J˜k = (rk , 2rk ), we find J˜k+1 = (rk+1 , 2rk+1 ),
where J˜k+1 = Πξ (T J˜k ), with weights ξ = (ξ1 , ξ2 ):
rk+1 = arg min_r [ ξ1 ( r − (T J̃k )(1) )^2 + ξ2 ( 2r − (T J̃k )(2) )^2 ]

• With straightforward calculation

rk+1 = αβrk , where β = 2(ξ1 +2ξ2 )/(ξ1 +4ξ2 ) > 1

• So if α > 1/β (e.g., ξ1 = ξ2 = 1), the sequence


{rk } diverges and so does {J˜k }.
• Difficulty is that T is a contraction, but Πξ T
(= least squares fit composed with T ) is not.
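• The divergence can be checked numerically; a minimal sketch of the two-state example above, assuming ξ1 = ξ2 = 1 and α = 0.9 (illustrative values):

# Exact weighted least squares fit of T J~_k onto the span of Phi = (1, 2)'.
import numpy as np

alpha, xi = 0.9, np.array([1.0, 1.0])
Phi = np.array([[1.0], [2.0]])
P = np.array([[0.0, 1.0],                  # deterministic transitions 1 -> 2, 2 -> 2
              [0.0, 1.0]])
g = np.zeros(2)                            # all transition costs are 0, so J* = 0

r = 1.0
W = np.diag(xi)
for k in range(30):
    TJ = g + alpha * P @ (Phi[:, 0] * r)   # (T J~_k) with J~_k = Phi r_k
    r = float(np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ TJ))  # weighted LS fit
print(r)          # grows without bound: r_{k+1} = alpha * beta * r_k = 1.08 r_k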

243
NORM MISMATCH PROBLEM

• For fitted VI to converge, we need Πξ T to be a


contraction; T being a contraction is not enough

• We need a ξ such that T is a contraction w. r.


to the weighted Euclidean norm k · kξ
• Then Πξ T is a contraction w. r. to k · kξ
• We will come back to this issue, and show how
to choose ξ so that Πξ Tµ is a contraction for a
given µ

244
APPROXIMATE PI

[Figure: approximate PI loop; guess an initial policy, evaluate the approximate cost J̃µ (r) = Φr using simulation (approximate policy evaluation), generate an “improved” policy µ (policy improvement), and repeat.]

• Evaluation of typical µ: Linear cost function


approximation J˜µ (r) = Φr, where Φ is full rank
n×s matrix with columns the basis functions, and
ith row denoted φ(i)′ .
• Policy “improvement” to generate µ:
Xn

µ(i) = arg min ′
pij (u) g(i, u, j) + αφ(j) r
u∈U (i)
j=1

• Error Bound (same as approximate VI): If

max_i |J̃µk (i, rk ) − Jµk (i)| ≤ δ,   k = 0, 1, . . .

the sequence {µk } satisfies

limsup_{k→∞} max_i ( Jµk (i) − J ∗ (i) ) ≤ 2αδ/(1 − α)^2

245
APPROXIMATE POLICY EVALUATION

• Consider approximate evaluation of Jµ , the cost


of the current policy µ by using simulation.
− Direct policy evaluation - generate cost sam-
ples by simulation, and optimization by least
squares
− Indirect policy evaluation - solving the pro-
jected equation Φr = ΠTµ (Φr) where Π is
projection w/ respect to a suitable weighted
Euclidean norm

[Figure: left, the direct method projects Jµ onto S = {Φr | r ∈ ℜs }, giving ΠJµ ; right, the indirect method solves the projected equation Φr = ΠTµ (Φr).]

• Recall that projection can be implemented by


simulation and least squares

246
PI WITH INDIRECT POLICY EVALUATION

[Figure: approximate PI loop; guess an initial policy, evaluate the approximate cost J̃µ (r) = Φr using simulation (approximate policy evaluation), generate an “improved” policy µ (policy improvement), and repeat.]

• Given the current policy µ:


− We solve the projected Bellman’s equation

Φr = ΠTµ (Φr)

− We approximate the solution Jµ of Bellman’s


equation
J = Tµ J
with the projected equation solution J˜µ (r)

247
KEY QUESTIONS AND RESULTS

• Does the projected equation have a solution?


• Under what conditions is the mapping ΠTµ a
contraction, so ΠTµ has unique fixed point?
• Assumption: The Markov chain corresponding
to µ has a single recurrent class and no transient
states, with steady-state prob. vector ξ, so that
ξj = lim_{N→∞} (1/N) Σ_{k=1}^N P (ik = j | i0 = i) > 0

Note that ξj is the long-term frequency of state j.


• Proposition: (Norm Matching Property) As-
sume that the projection Π is with respect to k·kξ ,
where ξ = (ξ1 , . . . , ξn ) is the steady-state proba-
bility vector. Then:
(a) ΠTµ is contraction of modulus α with re-
spect to k · kξ .
(b) The unique fixed point Φr∗ of ΠTµ satisfies

||Jµ − Φr∗ ||ξ ≤ (1/√(1 − α^2)) ||Jµ − ΠJµ ||ξ

248
PRELIMINARIES: PROJECTION PROPERTIES

• Important property of the projection Π on S


with weighted Euclidean norm k · kξ . For all J ∈
ℜn , Φr ∈ S, the Pythagorean Theorem holds:

||J − Φr||ξ^2 = ||J − ΠJ||ξ^2 + ||ΠJ − Φr||ξ^2

• The Pythagorean Theorem implies that the pro-


jection is nonexpansive, i.e.,

||ΠJ − ΠJ̄||ξ ≤ ||J − J̄||ξ ,   for all J, J̄ ∈ ℜn .

To see this, note that

||Π(J − J̄)||ξ^2 ≤ ||Π(J − J̄)||ξ^2 + ||(I − Π)(J − J̄)||ξ^2 = ||J − J̄||ξ^2
249
PROOF OF CONTRACTION PROPERTY

• Lemma: If P is the transition matrix of µ,


||P z||ξ ≤ ||z||ξ ,   z ∈ ℜn ,
where ξ is the steady-state prob. vector.
Proof: For all z ∈ ℜn

||P z||ξ^2 = Σ_{i=1}^n ξi ( Σ_{j=1}^n pij zj )^2 ≤ Σ_{i=1}^n ξi Σ_{j=1}^n pij zj^2
          = Σ_{j=1}^n ( Σ_{i=1}^n ξi pij ) zj^2 = Σ_{j=1}^n ξj zj^2 = ||z||ξ^2 .

The inequality follows from the convexity of the
quadratic function, and the next to last equality
follows from the defining property Σ_{i=1}^n ξi pij = ξj
• Using the lemma, the nonexpansiveness of Π,
and the definition Tµ J = g + αP J , we have

||ΠTµ J − ΠTµ J̄||ξ ≤ ||Tµ J − Tµ J̄||ξ = α||P (J − J̄)||ξ ≤ α||J − J̄||ξ

for all J, J̄ ∈ ℜn . Hence ΠTµ is a contraction of


modulus α.
250
PROOF OF ERROR BOUND

• Let Φr∗ be the fixed point of ΠT . We have

||Jµ − Φr∗ ||ξ ≤ (1/√(1 − α^2)) ||Jµ − ΠJµ ||ξ .

Proof: We have

||Jµ − Φr∗ ||ξ^2 = ||Jµ − ΠJµ ||ξ^2 + ||ΠJµ − Φr∗ ||ξ^2
              = ||Jµ − ΠJµ ||ξ^2 + ||ΠT Jµ − ΠT (Φr∗ )||ξ^2
              ≤ ||Jµ − ΠJµ ||ξ^2 + α^2 ||Jµ − Φr∗ ||ξ^2 ,

where
− The first equality uses the Pythagorean The-
orem
− The second equality holds because Jµ is the
fixed point of T and Φr∗ is the fixed point
of ΠT
− The inequality uses the contraction property
of ΠT .
Q.E.D.

251
MATRIX FORM OF PROJECTED EQUATION

• The solution Φr∗ satisfies the orthogonality con-


dition: The error
Φr∗ − (g + αP Φr∗ )
is “orthogonal” to the subspace spanned by the
columns of Φ.
• This is written as

Φ′ Ξ ( Φr∗ − (g + αP Φr∗ ) ) = 0,
where Ξ is the diagonal matrix with the steady-
state probabilities ξ1 , . . . , ξn along the diagonal.
• Equivalently, Cr∗ = d, where

C = Φ′ Ξ(I − αP )Φ, d = Φ′ Ξg
but computing C and d is HARD (high-dimensional
inner products).
252
SOLUTION OF PROJECTED EQUATION

• Solve Cr∗ = d by matrix inversion: r∗ = C −1 d


• Alternative: Projected Value Iteration (PVI)
Φrk+1 = ΠT (Φrk ) = Π(g + αP Φrk )
Converges to r∗ because ΠT is a contraction.
[Figure: PVI; the value iterate T (Φrk ) = g + αP Φrk is projected on S (the subspace spanned by the basis functions) to give Φrk+1 .]

• PVI can be written as:

rk+1 = arg min_{r∈ℜs} ||Φr − (g + αP Φrk )||ξ^2

By setting to 0 the gradient with respect to r,

Φ′ Ξ ( Φrk+1 − (g + αP Φrk ) ) = 0,

which yields
rk+1 = rk − (Φ′ ΞΦ)−1 (Crk − d)
253
SIMULATION-BASED IMPLEMENTATIONS

• Key idea: Calculate simulation-based approxi-


mations based on k samples
Ck ≈ C, dk ≈ d
• Approximate matrix inversion r∗ = C −1 d by

r̂k = Ck−1 dk

This is the LSTD (Least Squares Temporal Dif-


ferences) method.
• PVI method rk+1 = rk − (Φ′ ΞΦ)−1 (Crk − d) is
approximated by
rk+1 = rk − Gk (Ck rk − dk )
where
Gk ≈ (Φ′ ΞΦ)−1
This is the LSPE (Least Squares Policy Evalua-
tion) method.
• Key fact: Ck , dk , and Gk can be computed
with low-dimensional linear algebra (of order s;
the number of basis functions).

254
SIMULATION MECHANICS

• We generate an infinitely long trajectory (i0 , i1 , . . .)


of the Markov chain, so states i and transitions
(i, j) appear with long-term frequencies ξi and pij .
• After generating each transition (it , it+1 ), we
compute the row φ(it )′ of Φ and the cost compo-
nent g(it , it+1 ).
• We form

dk = (1/(k + 1)) Σ_{t=0}^k φ(it ) g(it , it+1 ) ≈ Σ_{i,j} ξi pij φ(i) g(i, j) = Φ′ Ξg = d

Ck = (1/(k + 1)) Σ_{t=0}^k φ(it ) ( φ(it ) − αφ(it+1 ) )′ ≈ Φ′ Ξ(I − αP )Φ = C

Also in the case of LSPE

Gk = (1/(k + 1)) Σ_{t=0}^k φ(it )φ(it )′ ≈ Φ′ ΞΦ
• Convergence based on law of large numbers.
• Ck , dk , and Gk can be formed incrementally.
Also can be written using the formalism of tem-
poral differences (this is just a matter of style)
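• A minimal sketch of these mechanics (illustrative chain, costs, and basis matrix; LSTD by matrix inversion and LSPE as an iteration):

# Form C_k, d_k, G_k from one simulated trajectory, then LSTD and LSPE.
import numpy as np

n, s, alpha = 5, 2, 0.9
rng = np.random.default_rng(5)
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # chain of policy mu
g = rng.random((n, n))                                      # g(i, j)
Phi = rng.random((n, s))                                    # rows phi(i)'

K = 100_000
C = np.zeros((s, s)); d = np.zeros(s); G = np.zeros((s, s))
i = 0
for t in range(K):
    j = rng.choice(n, p=P[i])                               # transition (i_t, i_{t+1})
    C += np.outer(Phi[i], Phi[i] - alpha * Phi[j])
    d += Phi[i] * g[i, j]
    G += np.outer(Phi[i], Phi[i])
    i = j
C /= K; d /= K; G /= K

r_lstd = np.linalg.solve(C, d)                              # LSTD
r = np.zeros(s)                                             # LSPE iteration
for _ in range(200):
    r = r - np.linalg.solve(G, C @ r - d)
print(r_lstd, r)                                            # both approximate r*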
255
OPTIMISTIC VERSIONS

• Instead of calculating nearly exact approxima-


tions Ck ≈ C and dk ≈ d, we do a less accurate
approximation, based on few simulation samples
• Evaluate (coarsely) current policy µ, then do a
policy improvement
• This often leads to faster computation (as op-
timistic methods often do)
• Very complex behavior (see the subsequent dis-
cussion on oscillations)
• The matrix inversion/LSTD method has serious
problems due to large simulation noise (because of
limited sampling) - particularly if the C matrix is
ill-conditioned
• LSPE tends to cope better because of its itera-
tive nature (this is true of other iterative methods
as well)
• A stepsize γ ∈ (0, 1] in LSPE may be useful to
damp the effect of simulation noise

rk+1 = rk − γGk (Ck rk − dk )

256
6.231 DYNAMIC PROGRAMMING

LECTURE 21

LECTURE OUTLINE

• Review of approximate policy iteration


• Projected equation methods for policy evalua-
tion
• Issues related to simulation-based implementa-
tion
• Multistep projected equation methods
• Bias-variance tradeoff
• Exploration-enhanced implementations
• Oscillations

257
REVIEW: PROJECTED BELLMAN EQUATION

• For a fixed policy µ to be evaluated, consider


the corresponding mapping T :
(T J)(i) = Σ_{j=1}^n pij ( g(i, j) + αJ(j) ),   i = 1, . . . , n,

or more compactly, T J = g + αP J
• Approximate Bellman’s equation J = T J by
Φr = ΠT (Φr) or the matrix form/orthogonality
condition Cr∗ = d, where

C = Φ′ Ξ(I − αP )Φ, d = Φ′ Ξg.

[Figure: indirect method; T (Φr) is projected on S (the subspace spanned by the basis functions), and the solution satisfies Φr = ΠT (Φr), a projected form of Bellman’s equation.]

258
PROJECTED EQUATION METHODS

• Matrix inversion: r∗ = C −1 d
• Iterative Projected Value Iteration (PVI) method:

Φrk+1 = ΠT (Φrk ) = Π(g + αP Φrk )


Converges to r∗ if ΠT is a contraction. True if Π is
projection w.r.t. steady-state distribution norm.
• Simulation-Based Implementations: Generate
k+1 simulated transitions sequence {i0 , i1 , . . . , ik }
and approximations Ck ≈ C and dk ≈ d:

Ck = (1/(k + 1)) Σ_{t=0}^k φ(it ) ( φ(it ) − αφ(it+1 ) )′ ≈ Φ′ Ξ(I − αP )Φ

dk = (1/(k + 1)) Σ_{t=0}^k φ(it ) g(it , it+1 ) ≈ Φ′ Ξg

• LSTD: r̂k = Ck−1 dk


• LSPE: rk+1 = rk − Gk (Ck rk − dk ) where

Gk ≈ G = (Φ′ ΞΦ)−1

Converges to r∗ if ΠT is contraction.
259
ISSUES FOR PROJECTED EQUATIONS

• Implementation of simulation-based solution of


projected equation Φr ≈ Jµ , where Ck r = dk and
Ck ≈ Φ′ Ξ(I − αP )Φ, dk ≈ Φ′ Ξg
• Low-dimensional linear algebra needed for the
simulation-based approximations Ck and dk (of
order s; the number of basis functions).
• Very large number of samples needed to solve
reliably nearly singular projected equations.
• Special methods for nearly singular equations
by simulation exist; see Section 7.3 of the text.
• Optimistic (few sample) methods are more vul-
nerable to simulation error
• Norm mismatch/sampling distribution issue
• The problem of bias: Projected equation solu-
tion ≠ ΠJµ , the “closest” approximation of Jµ
• Everything said so far relates to policy evalua-
tion. How about the effect of approximations on
policy improvement?
• We will next address some of these issues
260
MULTISTEP METHODS

• Introduce a multistep version of Bellman’s equa-


tion J = T (λ) J, where for λ ∈ [0, 1),
T^(λ) = (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ T^{ℓ+1}
Geometrically weighted sum of powers of T .
• T ℓ is a contraction with mod. αℓ , w. r. to
weighted Euclidean norm k · kξ , where ξ is the
steady-state probability vector of the Markov chain.
• Hence T (λ) is a contraction with modulus

αλ = (1 − λ) Σ_{ℓ=0}^∞ α^{ℓ+1} λ^ℓ = α(1 − λ)/(1 − αλ)

Note αλ → 0 as λ → 1 - affects norm mismatch


• T ℓ and T (λ) have the same fixed point Jµ and

||Jµ − Φrλ∗ ||ξ ≤ (1/√(1 − αλ^2)) ||Jµ − ΠJµ ||ξ

where Φrλ∗ is the fixed point of ΠT (λ) .


• Φrλ∗ depends on λ.
261
BIAS-VARIANCE TRADEOFF

[Figure: bias-variance tradeoff; the solutions Φr = ΠT^(λ) (Φr) of the projected equation trace a curve in the subspace S = {Φr | r ∈ ℜs } from λ = 0 to λ = 1, with the bias (distance from ΠJµ ) shrinking and the simulation error growing as λ increases.]

• From the error bound ||Jµ − Φrλ,µ ||ξ ≤ (1/√(1 − αλ^2)) ||Jµ − ΠJµ ||ξ :
• As λ ↑ 1, we have αλ ↓ 0, so error bound (and
quality of approximation) improves:
lim Φrλ,µ = ΠJµ
λ↑1

• But the simulation noise in approximating

T^(λ) = (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ T^{ℓ+1}

increases
• Choice of λ is usually based on trial and error
262
MULTISTEP PROJECTED EQ. METHODS

• The multistep projected Bellman equation is


Φr = ΠT (λ) (Φr)
• In matrix form: C (λ) r = d(λ) , where

C^(λ) = Φ′ Ξ ( I − αP^(λ) ) Φ,   d^(λ) = Φ′ Ξ g^(λ) ,

with

P^(λ) = (1 − λ) Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^{ℓ+1} ,   g^(λ) = Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^ℓ g

• The LSTD(λ) method computes r̂k = ( Ck^(λ) )^{−1} dk^(λ) , where
Ck^(λ) and dk^(λ) are simulation-based approximations
of C^(λ) and d^(λ) .
• The LSPE(λ) method is

rk+1 = rk − γGk ( Ck^(λ) rk − dk^(λ) )

where Gk is a simulation-based approx. to (Φ′ ΞΦ)−1


• TD(λ): An important simpler/slower iteration
[similar to LSPE(λ) with Gk = I - see the text].
263
MORE ON MULTISTEP METHODS

• The simulation process to obtain Ck^(λ) and dk^(λ)
is similar to the case λ = 0 (single simulation tra-
jectory i0 , i1 , . . ., more complex formulas)

Ck^(λ) = (1/(k + 1)) Σ_{t=0}^k φ(it ) Σ_{m=t}^k α^{m−t} λ^{m−t} ( φ(im ) − αφ(im+1 ) )′

dk^(λ) = (1/(k + 1)) Σ_{t=0}^k φ(it ) Σ_{m=t}^k α^{m−t} λ^{m−t} g_{im}
• In the context of approximate policy iteration,
we can use optimistic versions (few samples be-
tween policy updates).
• Many different versions (see the text).
• Note the λ-tradeoffs:
− As λ ↑ 1, Ck^(λ) and dk^(λ) contain more “sim-
ulation noise”, so more samples are needed
for a close approximation of rλ,µ
− The error bound kJµ −Φrλ,µ kξ becomes smaller
− As λ ↑ 1, ΠT (λ) becomes a contraction for
arbitrary projection norm

264
APPROXIMATE PI ISSUES - EXPLORATION

• 1st major issue: exploration. Common remedy


is the off-policy approach: Replace P of the current
policy with
P̄ = (I − B)P + BQ,
where B is a diagonal matrix with βi ∈ [0, 1] on
the diagonal, and Q is another transition matrix.
• Then the LSTD and LSPE formulas must be modi-
fied ... otherwise the policy associated with P̄ (not
P ) is evaluated (see the textbook, Section 6.4).
• Alternatives: Geometric and free-form sampling
• Both of these use multiple short simulated tra-
jectories, with random restart state, chosen to en-
hance exploration (see the text)
• Geometric sampling uses trajectories with geo-
metrically distributed number of transitions with
parameter λ ∈ [0, 1). It implements LSTD(λ) and
LSPE(λ) with exploration.
• Free-form sampling uses trajectories with more
generally distributed number of transitions. It im-
plements method for approximation of the solu-
tion of a generalized multistep Bellman equation.
265
APPROXIMATE PI ISSUES - OSCILLATIONS

• Define for each policy µ



Rµ = { r | Tµ (Φr) = T (Φr) }

• These sets form the greedy partition of the pa-
rameter r-space

[Figure: the r-space partitioned into regions Rµ , Rµ′ , Rµ′′ , Rµ′′′ , . . . ; for a policy µ, Rµ is the set of all r such that policy improvement based on Φr produces µ.]

• Oscillations of nonoptimistic approx.: rµ is gen-
erated by an evaluation method so that Φrµ ≈ Jµ

[Figure: an oscillation cycle in which rµk lands in Rµk+1 , rµk+1 lands in Rµk+2 , rµk+2 lands in Rµk+3 , and rµk+3 lands back in Rµk .]
266
MORE ON OSCILLATIONS/CHATTERING

• For optimistic PI a different picture holds

[Figure: for optimistic PI, the iterates rµ1 , rµ2 , rµ3 move among the regions Rµ1 , Rµ2 , Rµ3 .]

• Oscillations are less violent, but the “limit”


point is meaningless!
• Fundamentally, oscillations are due to the lack
of monotonicity of the projection operator, i.e.,
J ≤ J ′ does not imply ΠJ ≤ ΠJ ′ .
• If approximate PI uses policy evaluation

Φr = (W Tµ )(Φr)
with W a monotone operator, the generated poli-
cies converge (to an approximately optimal limit).
• The operator W used in the aggregation ap-
proach has this monotonicity property.
267
6.231 DYNAMIC PROGRAMMING

LECTURE 22

LECTURE OUTLINE

• Aggregation as an approximation methodology


• Aggregate problem
• Examples of aggregation
• Simulation-based aggregation
• Q-Learning

268
PROBLEM APPROXIMATION - AGGREGATION

• Another major idea in ADP is to approximate


the cost-to-go function of the problem with the
cost-to-go function of a simpler problem. The sim-
plification is often ad-hoc/problem dependent.
• Aggregation is a systematic approach for prob-
lem approximation. Main elements:
− Introduce a few “aggregate” states, viewed
as the states of an “aggregate” system
− Define transition probabilities and costs of
the aggregate system, by relating original
system states with aggregate states
− Solve (exactly or approximately) the “ag-
gregate” problem by any kind of value or pol-
icy iteration method (including simulation-
based methods)
− Use the optimal cost of the aggregate prob-
lem to approximate the optimal cost of the
original problem
• Hard aggregation example: Aggregate states
are subsets of original system states, treated as if
they all have the same cost.
269
AGGREGATION/DISAGGREGATION PROBS

• The aggregate system transition probabilities


are defined via two (somewhat arbitrary) choices
[Figure: aggregation framework; aggregate state x is linked to original system states i via disaggregation probabilities dxi , the original system moves from i to j according to pij (u) with cost g(i, u, j), and original state j is linked to aggregate state y via aggregation probabilities φjy .]

p̂xy (u) = Σ_{i=1}^n dxi Σ_{j=1}^n pij (u) φjy ,

ĝ(x, u) = Σ_{i=1}^n dxi Σ_{j=1}^n pij (u) g(i, u, j)

• For each original system state j and aggregate


state y, the aggregation probability φjy
− The “degree of membership of j in the ag-
gregate state y.”
− In hard aggregation, φjy = 1 if state j be-
longs to aggregate state/subset y.
• For each aggregate state x and original system
state i, the disaggregation probability dxi
− The “degree of i being representative of x.”
− In hard aggregation, one possibility is all
states i that belongs to aggregate state/subset
x have equal dxi .
270
AGGREGATE PROBLEM

• The transition probability from aggregate state


x to aggregate state y under control u
p̂xy (u) = Σ_{i=1}^n dxi Σ_{j=1}^n pij (u) φjy ,   or   P̂ (u) = DP (u)Φ
where the rows of D and Φ are the disaggr. and
aggr. probs.
• The aggregate expected transition cost is
ĝ(x, u) = Σ_{i=1}^n dxi Σ_{j=1}^n pij (u) g(i, u, j),   or   ĝ = DP g

• The optimal cost function of the aggregate prob-
lem, denoted R̂, is

R̂(x) = min_{u∈U} [ ĝ(x, u) + α Σ_y p̂xy (u) R̂(y) ],   ∀ x

or R̂ = min_u [ ĝ + αP̂ R̂ ] - Bellman’s equation for
the aggregate problem.
• The optimal cost J ∗ of the original problem is
approximated using interpolation, J ∗ ≈ J˜ = ΦR:
ˆ
˜ = ˆ
X
J(j) φjy R(y), ∀j
271
y
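To make the construction concrete, the following is a minimal sketch (illustrative only, not from the text; all names and dimensions are assumptions) that forms P̂(u) = DP(u)Φ and the aggregate costs ĝ, solves the aggregate Bellman equation by value iteration, and interpolates with Φ.

# Sketch: solve the aggregate problem and interpolate back to the original states.
# P[u] is the n x n transition matrix of the original system under control u,
# G[u][i, j] = g(i, u, j), D is the m x n disaggregation matrix, Phi the n x m
# aggregation matrix (rows of D and Phi are probability distributions).
import numpy as np

def solve_aggregate_problem(P, G, D, Phi, alpha, iters=1000):
    m = D.shape[0]
    controls = range(len(P))
    P_hat = [D @ P[u] @ Phi for u in controls]                      # m x m, = D P(u) Phi
    g_hat = [D @ np.sum(P[u] * G[u], axis=1) for u in controls]     # aggregate expected costs
    R = np.zeros(m)
    for _ in range(iters):                                          # value iteration on R_hat
        R = np.min([g_hat[u] + alpha * P_hat[u] @ R for u in controls], axis=0)
    return R, Phi @ R                                               # R_hat and J_tilde = Phi R_hat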
EXAMPLE I: HARD AGGREGATION

• Group the original system states into subsets,


and view each subset as an aggregate state
• Aggregation probs: φjy = 1 if j belongs to
aggregate state y.

[Figure: the states 1, . . . , 9 of a 3 × 3 grid, grouped into aggregate states x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}]

        | 1 0 0 0 |
        | 1 0 0 0 |
        | 0 1 0 0 |
        | 1 0 0 0 |
    Φ = | 1 0 0 0 |
        | 0 1 0 0 |
        | 0 0 1 0 |
        | 0 0 1 0 |
        | 0 0 0 1 |

• Disaggregation probs: There are many possi-


bilities, e.g., all states i within aggregate state x
have equal prob. dxi .
• If optimal cost vector J ∗ is piecewise constant
over the aggregate states/subsets, hard aggrega-
tion is exact. Suggests grouping states with “roughly
equal” cost into aggregates.
• Soft aggregation (provides “soft boundaries” between aggregate states).
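As a concrete sketch (illustrative only, not from the text), the matrices Φ and D for hard aggregation can be built from a state-to-subset assignment, with uniform disaggregation probabilities within each subset.

# Sketch: Phi and D for hard aggregation from a membership list
# (assignment[i] = index of the aggregate state/subset containing original state i).
import numpy as np

def hard_aggregation_matrices(assignment, num_aggregate):
    n = len(assignment)
    Phi = np.zeros((n, num_aggregate))
    Phi[np.arange(n), assignment] = 1.0              # phi_jy = 1 if j belongs to y
    D = Phi.T / Phi.sum(axis=0, keepdims=True).T     # uniform d_xi = 1/|S_x| for i in S_x
    return Phi, D

# Following the grid grouping shown above: x1 = {1,2,4,5}, x2 = {3,6}, x3 = {7,8}, x4 = {9}
Phi, D = hard_aggregation_matrices([0, 0, 1, 0, 0, 1, 2, 2, 3], 4)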
EXAMPLE II: FEATURE-BASED AGGREGATION

• If we know good features, it makes sense to


group together states that have “similar features”
• Essentially discretize the features and assign a
weight to each discretization point

[Figure: states pass through a feature extraction mapping to a feature vector; special states/aggregate states are associated with the features]

• A general approach for passing from a feature-


based state representation to an aggregation-based
architecture
• Hard aggregation architecture based on features
is more powerful (nonlinear/piecewise constant in
the features, rather than linear)
• ... but may require many more aggregate states
to reach the same level of performance as the cor-
responding linear feature-based architecture

EXAMPLE III: REP. STATES/COARSE GRID

• Choose a collection of “representative” original


system states, and associate each one of them with
an aggregate state. Then “interpolate”
[Figure: original state space with representative/aggregate states y1, y2, y3; non-representative states j1, j2, j3 are associated with them by interpolation]

• Disaggregation probs. are dxi = 1 if i is equal


to representative state x.
• Aggregation probs. associate original system states with convex combinations of rep. states:

    j ∼ Σ_{y∈A} φjy y

• Well-suited for Euclidean space discretization


• Extends nicely to continuous state space, in-
cluding belief space of POMDP
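For a one-dimensional Euclidean discretization, a natural choice (a sketch, illustrative only, not from the text) is to take the aggregation probabilities φjy as linear interpolation weights between the two representative states that bracket j.

# Sketch: aggregation probabilities phi_jy by linear interpolation between
# the two representative grid points that bracket state j (1-D state space).
import numpy as np

def interpolation_weights(states, rep_points):
    # states: original state values; rep_points: sorted representative state values
    Phi = np.zeros((len(states), len(rep_points)))
    for row, s in enumerate(states):
        k = np.searchsorted(rep_points, s)
        if k == 0:
            Phi[row, 0] = 1.0
        elif k == len(rep_points):
            Phi[row, -1] = 1.0
        else:
            lo, hi = rep_points[k - 1], rep_points[k]
            w = (s - lo) / (hi - lo)
            Phi[row, k - 1] = 1.0 - w
            Phi[row, k] = w
    return Phi   # each row is a probability distribution over the representative states

Phi = interpolation_weights(np.linspace(0.0, 1.0, 11), np.array([0.0, 0.5, 1.0]))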
EXAMPLE IV: REPRESENTATIVE FEATURES

• Choose a collection of “representative” subsets


of original system states, and associate each one
of them with an aggregate state
[Figure: original state space with representative subsets Sx1, Sx2, Sx3 of states (the aggregate states/subsets); an original state j is related to them through the aggregation probabilities φjx1, φjx2, φjx3, while transitions pij occur in the original system]

• Common case: Sx is a group of states with


“similar features”
• Hard aggregation is special case: ∪x Sx = {1, . . . , n}
• Aggregation with representative states is special
case: Sx consists of just one state
• With rep. features, aggregation approach is a
special case of projected equation approach with
“seminorm” projection. So the TD methods and
multistage Bellman Eq. methodology apply
APPROXIMATE PI BY AGGREGATION
[Figure: aggregate states x, y related to the original system states i, j through the disaggregation probabilities dxi and the aggregation probabilities φjy; the original system states make transitions according to pij(u), with cost g(i, u, j)]

    p̂xy(u) = Σ_{i=1}^n Σ_{j=1}^n dxi pij(u) φjy,    ĝ(x, u) = Σ_{i=1}^n Σ_{j=1}^n dxi pij(u) g(i, u, j)

• Consider approximate PI for the original prob-


lem, with evaluation done using the aggregate prob-
lem (other possibilities exist - see the text)
• Evaluation of policy µ: J˜ = ΦR, where R =
DTµ (ΦR) (R is the vector of costs of aggregate
states corresponding to µ). May use simulation.
• Similar form to the projected equation ΦR =
ΠTµ (ΦR) (ΦD in place of Π).
• Advantages: It has no problem with exploration
or with oscillations.
• Disadvantage: The rows of D and Φ must be probability distributions.
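A minimal model-based sketch (illustrative only, not from the text; names are assumptions) of the evaluation equation R = DTµ(ΦR), solved here by fixed-point iteration rather than by simulation.

# Sketch: aggregation-based evaluation of a fixed policy mu.
# P_mu is the n x n transition matrix under mu, g_mu the n-vector of expected
# one-stage costs under mu, D (m x n) and Phi (n x m) as before.
import numpy as np

def evaluate_policy_by_aggregation(P_mu, g_mu, D, Phi, alpha, iters=1000):
    m = D.shape[0]
    R = np.zeros(m)
    for _ in range(iters):
        # T_mu(Phi R) = g_mu + alpha * P_mu (Phi R), then map back with D
        R = D @ (g_mu + alpha * P_mu @ (Phi @ R))
    return R, Phi @ R      # aggregate costs R and the approximation J_tilde = Phi R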
Q-LEARNING I

• Q-learning has two motivations:


− Dealing with multiple policies simultaneously
− Using a model-free approach [no need to know
pij (u), only be able to simulate them]
• The Q-factors are defined by

    Q∗(i, u) = Σ_{j=1}^n pij(u) ( g(i, u, j) + αJ∗(j) ),   ∀ (i, u)

• Since J∗ = TJ∗, we have J∗(i) = min_{u∈U(i)} Q∗(i, u), so the Q-factors solve the equation

    Q∗(i, u) = Σ_{j=1}^n pij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q∗(j, u′) )

• Q∗(i, u) can be shown to be the unique solution of this equation. Reason: This is Bellman’s equation for a system whose states are the original states 1, . . . , n, together with all the pairs (i, u).

• Value iteration: For all (i, u)

    Q(i, u) := Σ_{j=1}^n pij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q(j, u′) )

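A minimal model-based sketch (illustrative only, not from the text; it assumes the same control set at every state) of the Q-factor value iteration above.

# Sketch: synchronous value iteration on Q-factors (model-based).
# P[u]: n x n transition matrix, G[u][i, j] = g(i, u, j) for control u.
import numpy as np

def q_value_iteration(P, G, alpha, iters=1000):
    num_u, n = len(P), P[0].shape[0]
    Q = np.zeros((n, num_u))
    for _ in range(iters):
        J = Q.min(axis=1)                              # J(j) = min_u' Q(j, u')
        Q = np.stack([(P[u] * (G[u] + alpha * J[None, :])).sum(axis=1)
                      for u in range(num_u)], axis=1)  # Q(i,u) = sum_j p_ij(u)(g(i,u,j) + alpha J(j))
    return Q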
Q-LEARNING II

• Use some randomization to generate sequence


of pairs (ik , uk ) [all pairs (i, u) are chosen infinitely
often]. For each k, select jk according to pik j (uk ).
• Q-learning algorithm: updates Q(ik, uk) by

    Q(ik, uk) := ( 1 − γk(ik, uk) ) Q(ik, uk) + γk(ik, uk) ( g(ik, uk, jk) + α min_{u′∈U(jk)} Q(jk, u′) )

• Stepsize γk (ik , uk ) must converge to 0 at proper


rate (e.g., like 1/k).
• Important mathematical point: In the Q-factor
version of Bellman’s equation the order of expec-
tation and minimization is reversed relative to the
ordinary cost version of Bellman’s equation:
    J∗(i) = min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + αJ∗(j) )

• Q-learning can be shown to converge to true/exact


Q-factors (sophisticated stoch. approximation proof).
• Major drawback: Large number of pairs (i, u) -
no function approximation is used.
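A minimal tabular sketch (illustrative only, not from the text; `sample_next` is an assumed simulator, and the same control set is assumed at every state) of the Q-learning update with a 1/k stepsize per pair.

# Sketch: tabular Q-learning; each step a pair (i, u) is drawn at random so that
# all pairs are chosen infinitely often, and j is sampled from the simulator.
import numpy as np

def q_learning(sample_next, n, num_u, alpha, num_steps, rng=np.random.default_rng(0)):
    # sample_next(i, u) -> (j, cost) draws a transition according to p_ij(u)
    Q = np.zeros((n, num_u))
    visits = np.zeros((n, num_u))
    for _ in range(num_steps):
        i, u = rng.integers(n), rng.integers(num_u)
        j, cost = sample_next(i, u)
        visits[i, u] += 1
        gamma = 1.0 / visits[i, u]                     # stepsize -> 0 at rate 1/k
        Q[i, u] = (1 - gamma) * Q[i, u] + gamma * (cost + alpha * Q[j].min())
    return Q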
Q-FACTOR APPROXIMATIONS

• Basis function approximation for Q-factors:

    Q̃(i, u, r) = φ(i, u)′r
• We can use approximate policy iteration and
LSPE/LSTD/TD for policy evaluation (exploration
issue is acute).
• Optimistic policy iteration methods are fre-
quently used on a heuristic basis.
• Example (very optimistic). At iteration k, given
rk and state/control (ik , uk ):
(1) Simulate next transition (ik , ik+1 ) using the
transition probabilities pik j (uk ).
(2) Generate control uk+1 from

    uk+1 = arg min_{u∈U(ik+1)} Q̃(ik+1, u, rk)

(3) Update the parameter vector via

rk+1 = rk − (LSPE or TD-like correction)

• Unclear validity. Solid basis for aggregation case, and for case of optimal stopping (see text).
6.231 DYNAMIC PROGRAMMING

LECTURE 23

LECTURE OUTLINE

• Additional topics in ADP


• Stochastic shortest path problems
• Average cost problems
• Generalizations
• Basis function adaptation
• Gradient-based approximation in policy space
• An overview

REVIEW: PROJECTED BELLMAN EQUATION

• Policy Evaluation: Bellman’s equation J = TJ is approximated by the projected equation

    Φr = ΠT(Φr)

which can be solved by simulation-based methods, e.g., LSPE(λ), LSTD(λ), or TD(λ). Aggregation is another approach - simpler in some ways.
[Figure: T(Φr) is projected on S, the subspace spanned by the basis functions, giving the fixed point Φr = ΠT(Φr). Indirect method: solving a projected form of Bellman’s equation]

• These ideas apply to other (linear) Bellman


equations, e.g., for SSP and average cost.
• Important Issue: Construct simulation frame-
work where ΠT [or ΠT (λ) ] is a contraction.
STOCHASTIC SHORTEST PATHS

• Introduce approximation subspace

    S = {Φr | r ∈ ℜ^s}

and for a given proper policy, Bellman’s equation and its projected version

    J = TJ = g + PJ,    Φr = ΠT(Φr)

Also its λ-version

    Φr = ΠT(λ)(Φr),    T(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}
• Question: What should be the norm of projec-
tion? How to implement it by simulation?
• Speculation based on discounted case: It should
be a weighted Euclidean norm with weight vector
ξ = (ξ1 , . . . , ξn ), where ξi should be some type of
long-term occupancy probability of state i (which
can be generated by simulation).
• But what does “long-term occupancy probabil-
ity of a state” mean in the SSP context?
• How do we generate infinite length trajectories given that termination occurs with prob. 1?
SIMULATION FOR SSP

• We envision simulation of trajectories up to


termination, followed by restart at state i with
some fixed probabilities q0 (i) > 0.
• Then the “long-term occupancy probability” of a state i is proportional to

    q(i) = Σ_{t=0}^∞ qt(i),    i = 1, . . . , n,

where

    qt(i) = P(it = i),    i = 1, . . . , n,  t = 0, 1, . . .

• We use the projection norm

    ‖J‖q = ( Σ_{i=1}^n q(i) J(i)^2 )^{1/2}

[Note that 0 < q(i) < ∞, but q is not a prob. distribution.]
• We can show that ΠT(λ) is a contraction with respect to ‖ · ‖q (see the next slide).
• LSTD(λ), LSPE(λ), and TD(λ) are possible.
CONTRACTION PROPERTY FOR SSP
• We have q = Σ_{t=0}^∞ qt, so

    q′P = Σ_{t=0}^∞ qt′P = Σ_{t=1}^∞ qt′ = q′ − q0′

or

    Σ_{i=1}^n q(i) pij = q(j) − q0(j),    ∀ j

• To verify that ΠT is a contraction, we show that there exists β < 1 such that ‖Pz‖q^2 ≤ β‖z‖q^2 for all z ∈ ℜ^n.

• For all z ∈ ℜ^n, we have

    ‖Pz‖q^2 = Σ_{i=1}^n q(i) ( Σ_{j=1}^n pij zj )^2 ≤ Σ_{i=1}^n q(i) Σ_{j=1}^n pij zj^2

            = Σ_{j=1}^n zj^2 Σ_{i=1}^n q(i) pij = Σ_{j=1}^n ( q(j) − q0(j) ) zj^2

            = ‖z‖q^2 − ‖z‖q0^2 ≤ β‖z‖q^2

where

    β = 1 − min_j ( q0(j) / q(j) )
AVERAGE COST PROBLEMS

• Consider a single policy to be evaluated, with a single recurrent class, no transient states, and steady-state probability vector ξ = (ξ1, . . . , ξn).

• The average cost, denoted by η, is

    η = lim_{N→∞} (1/N) E{ Σ_{k=0}^{N−1} g(xk, xk+1) | x0 = i },    ∀ i
• Bellman’s equation is J = F J with

F J = g − ηe + P J

where e is the unit vector e = (1, . . . , 1).


• The projected equation and its λ-version are

Φr = ΠF (Φr), Φr = ΠF (λ) (Φr)


• A problem here is that F is not a contraction
with respect to any norm (since e = P e).
• ΠF(λ) is a contraction w. r. to ‖ · ‖ξ assuming
that e does not belong to S and λ > 0 (the case
λ = 0 is exceptional, but can be handled); see the
text. LSTD(λ), LSPE(λ), and TD(λ) are possible.
GENERALIZATION/UNIFICATION

• Consider approx. solution of x = T (x), where

T (x) = Ax + b, A is n × n, b ∈ ℜn

by solving the projected equation y = ΠT (y),


where Π is projection on a subspace of basis func-
tions (with respect to some Euclidean norm).
• We can generalize from DP to the case where
A is arbitrary, subject only to
I − ΠA : invertible
Also can deal with case where I − ΠA is (nearly)
singular (iterative methods, see the text).
• Benefits of generalization:
− Unification/higher perspective for projected
equation (and aggregation) methods in ap-
proximate DP
− An extension to a broad new area of appli-
cations, based on an approx. DP perspective
• Challenge: Dealing with less structure
− Lack of contraction
− Absence of a Markov chain
GENERALIZED PROJECTED EQUATION

• Let Π be projection with respect to

    ‖x‖ξ = ( Σ_{i=1}^n ξi xi^2 )^{1/2}

where ξ ∈ ℜn is a probability distribution with


positive components.
• If r∗ is the solution of the projected equation, we have Φr∗ = Π(AΦr∗ + b) or

    r∗ = arg min_{r∈ℜ^s} Σ_{i=1}^n ξi ( φ(i)′r − Σ_{j=1}^n aij φ(j)′r∗ − bi )^2

where φ(i)′ denotes the ith row of the matrix Φ.


• Optimality condition/equivalent form:

    Σ_{i=1}^n ξi φ(i) ( φ(i) − Σ_{j=1}^n aij φ(j) )′ r∗ = Σ_{i=1}^n ξi φ(i) bi

• The two expected values can be approximated by simulation
SIMULATION MECHANISM

[Figure: row sampling generates an index sequence i0, i1, . . . , ik, . . . according to ξ (may use a Markov chain); column sampling generates the transitions (i0, j0), (i1, j1), . . . , (ik, jk), . . . according to a Markov chain P]
• Row sampling: Generate sequence {i0 , i1 , . . .}


according to ξ, i.e., relative frequency of each row
i is ξi

• Column sampling: Generate (i0 , j0 ), (i1 , j1 ), . . .
according to some transition probability matrix P
with
pij > 0 if aij 6= 0,
i.e., for each i, the relative frequency of (i, j) is pij
(connection to importance sampling)
• Row sampling may be done using a Markov
chain with transition matrix Q (unrelated to P )
• Row sampling may also be done without a
Markov chain - just sample rows according to some
known distribution ξ (e.g., a uniform)
ROW AND COLUMN SAMPLING

[Figure: row sampling according to ξ (may use Markov chain Q); column sampling according to a Markov chain P ∼ |A|; projection on the subspace S and the projected equation Φr = ΠT(Φr)]

• Row sampling ∼ State Sequence Generation in


DP. Affects:
− The projection norm.
− Whether ΠA is a contraction.
• Column sampling ∼ Transition Sequence Gen-
eration in DP.
− Can be totally unrelated to row sampling.
Affects the sampling/simulation error.
− “Matching” P with |A| is beneficial (has an
effect like in importance sampling).
• Independent row and column sampling allows
exploration at will! Resolves the exploration prob-
lem that is critical in approximate policy iteration.
LSTD-LIKE METHOD

• Optimality condition/equivalent form of projected equation

    Σ_{i=1}^n ξi φ(i) ( φ(i) − Σ_{j=1}^n aij φ(j) )′ r∗ = Σ_{i=1}^n ξi φ(i) bi

• The two expected values are approximated by row and column sampling (batch 0 → t).

• We solve the linear equation

    Σ_{k=0}^t φ(ik) ( φ(ik) − (aik jk / pik jk) φ(jk) )′ rt = Σ_{k=0}^t φ(ik) bik
• We have rt → r∗ , regardless of ΠA being a con-
traction (by law of large numbers; see next slide).
• Issues of singularity or near-singularity of I−ΠA
may be important; see the text.
• An LSPE-like method is also possible, but re-
quires that ΠA is a contraction.
• Under the assumption Σ_{j=1}^n |aij| ≤ 1 for all i, there are conditions that guarantee contraction of ΠA; see the text.
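A minimal sketch (illustrative only, not from the text; all names are assumptions) of the simulation-based construction: sample rows according to ξ, sample columns according to P, accumulate the two sums above, and solve the resulting linear equation for rt.

# Sketch: LSTD-like solution of the projected equation for x = A x + b.
# Row indices i_k are sampled i.i.d. according to xi; given i_k, the column
# index j_k is sampled according to the row P[i_k, :] of a chosen matrix P
# with P[i, j] > 0 whenever A[i, j] != 0.
import numpy as np

def lstd_like(Phi, A, b, xi, P, num_samples, rng=np.random.default_rng(0)):
    n, s = Phi.shape
    C = np.zeros((s, s))
    d = np.zeros(s)
    for _ in range(num_samples):
        i = rng.choice(n, p=xi)                 # row sampling
        j = rng.choice(n, p=P[i])               # column sampling
        correction = (A[i, j] / P[i, j]) * Phi[j]
        C += np.outer(Phi[i], Phi[i] - correction)
        d += Phi[i] * b[i]
    return np.linalg.solve(C, d)                # r_t -> r* as num_samples grows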
JUSTIFICATION W/ LAW OF LARGE NUMBERS

• We will match terms in the exact optimality


condition and the simulation-based version.
• Let ξ̂it be the relative frequency of i in row sampling up to time t.

• We have

    (1/(t+1)) Σ_{k=0}^t φ(ik)φ(ik)′ = Σ_{i=1}^n ξ̂it φ(i)φ(i)′ ≈ Σ_{i=1}^n ξi φ(i)φ(i)′

    (1/(t+1)) Σ_{k=0}^t φ(ik) bik = Σ_{i=1}^n ξ̂it φ(i) bi ≈ Σ_{i=1}^n ξi φ(i) bi

• Let p̂ijt be the relative frequency of (i, j) in column sampling up to time t. Then

    (1/(t+1)) Σ_{k=0}^t (aik jk / pik jk) φ(ik)φ(jk)′ = Σ_{i=1}^n Σ_{j=1}^n ξ̂it p̂ijt (aij / pij) φ(i)φ(j)′ ≈ Σ_{i=1}^n Σ_{j=1}^n ξi aij φ(i)φ(j)′

BASIS FUNCTION ADAPTATION I

• An important issue in ADP is how to select


basis functions.
• A possible approach is to introduce basis functions parametrized by a vector θ, and optimize over θ, i.e., solve a problem of the form

    min_{θ∈Θ} F( J̃(θ) )

where J̃(θ) approximates a cost vector J on the subspace spanned by the basis functions.

• One example is

    F( J̃(θ) ) = Σ_{i∈I} | J(i) − J̃(θ)(i) |^2,

where I is a subset of states, and J(i), i ∈ I, are the costs of the policy at these states calculated directly by simulation.

• Another example is

    F( J̃(θ) ) = ‖ J̃(θ) − T J̃(θ) ‖^2,

where J̃(θ) is the solution of a projected equation.

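A minimal sketch (illustrative only, not from the text; `build_phi` and `fit_weights` are assumed to be supplied by the user) of basis function adaptation by crude random search, using the first criterion above with simulated costs J(i) at a sample set of states I.

# Sketch: random search over the basis parameter theta, scoring each candidate
# by F(theta) = sum_{i in I} |J(i) - J_tilde(theta)(i)|^2.
# build_phi(theta) -> n x s basis matrix; fit_weights(theta) -> weight vector r
# produced by whatever policy evaluation method is used (both assumed given).
import numpy as np

def adapt_basis(build_phi, fit_weights, J_samples, sample_states, theta0,
                num_candidates=200, scale=0.1, rng=np.random.default_rng(0)):
    def score(theta):
        J_tilde = build_phi(theta) @ fit_weights(theta)   # J_tilde(theta) = Phi(theta) r(theta)
        return np.sum((J_samples - J_tilde[sample_states]) ** 2)

    best = np.asarray(theta0, dtype=float)
    best_score = score(best)
    for _ in range(num_candidates):
        candidate = best + scale * rng.standard_normal(best.shape)
        s = score(candidate)
        if s < best_score:
            best, best_score = candidate, s
    return best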
BASIS FUNCTION ADAPTATION II

• Some optimization algorithm may be used to minimize F( J̃(θ) ) over θ.
• A challenge here is that the algorithm should
use low-dimensional calculations.
• One possibility is to use a form of random search
(the cross-entropy method); see the paper by Men-
ache, Mannor, and Shimkin (Annals of Oper. Res.,
Vol. 134, 2005)
• Another possibility is to use a gradient method.
For this it is necessary to estimate the partial
˜ with respect to the components
derivatives of J(θ)
of θ.
• It turns out that by differentiating the pro-
jected equation, these partial derivatives can be
calculated using low-dimensional operations. See
the references in the text.

APPROXIMATION IN POLICY SPACE I

• Consider an average cost problem, where the


problem data are parametrized by a vector r, i.e.,
a cost vector g(r), transition probability matrix
P (r). Let η(r) be the (scalar) average cost per
stage, satisfying Bellman’s equation

η(r)e + h(r) = g(r) + P (r)h(r)

where h(r) is the differential cost vector.


• Consider minimizing η(r) over r. Other than
random search, we can try to solve the problem
by a policy gradient method:
rk+1 = rk − γk ∇η(rk )
• Approximate calculation of ∇η(rk ): If ∆η, ∆g,
∆P are the changes in η, g, P due to a small change
∆r from a given r, we have
∆η = ξ ′ (∆g + ∆P h),
where ξ is the steady-state probability distribu-
tion/vector corresponding to P (r), and all the quan-
tities above are evaluated at r.

APPROXIMATION IN POLICY SPACE II

• Proof of the gradient formula: We have, by “dif-


ferentiating” Bellman’s equation,

∆η(r)·e+∆h(r) = ∆g(r)+∆P (r)h(r)+P (r)∆h(r)

By left-multiplying with ξ ′ ,

    ξ′∆η(r)·e + ξ′∆h(r) = ξ′( ∆g(r) + ∆P(r)h(r) ) + ξ′P(r)∆h(r)

Since ξ ′ ∆η(r) · e = ∆η(r) and ξ ′ = ξ ′ P (r), this


equation simplifies to

∆η = ξ ′ (∆g + ∆P h)
• Since we don’t know ξ, we cannot implement a
gradient-like method for minimizing η(r). An al-
ternative is to use “sampled gradients”, i.e., gener-
ate a simulation trajectory (i0 , i1 , . . .), and change
r once in a while, in the direction of a simulation-
based estimate of ξ ′ (∆g + ∆P h).
• Important Fact: ∆η can be viewed as an ex-
pected value!
• Much research on this subject, see the text.
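A small numerical sketch (illustrative only, not from the text; the chain and perturbations are randomly generated, and a per-state expected cost vector g is used) that checks ∆η ≈ ξ′(∆g + ∆Ph) against the directly computed change of the average cost.

# Sketch: numerical check of Delta_eta ~ xi' (Delta_g + Delta_P h) on a small chain.
import numpy as np

def stationary(P):
    # Left eigenvector of P for eigenvalue 1, normalized to a probability distribution
    w, V = np.linalg.eig(P.T)
    xi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return xi / xi.sum()

def average_cost_and_bias(P, g):
    xi = stationary(P)
    eta = xi @ g
    n = len(g)
    # h solves (I - P + e xi') h = g - eta e, which enforces the normalization xi' h = 0
    h = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), xi), g - eta)
    return eta, h, xi

rng = np.random.default_rng(0)
n = 4
P = rng.random((n, n)) + 0.5; P /= P.sum(axis=1, keepdims=True)                  # irreducible chain
g = rng.random(n)
dP = 1e-4 * rng.standard_normal((n, n)); dP -= dP.mean(axis=1, keepdims=True)    # keep row sums = 1
dg = 1e-4 * rng.standard_normal(n)

eta, h, xi = average_cost_and_bias(P, g)
eta_new, _, _ = average_cost_and_bias(P + dP, g + dg)
print(eta_new - eta, xi @ (dg + dP @ h))    # the two numbers agree to first order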
6.231 DYNAMIC PROGRAMMING

OVERVIEW-EPILOGUE

• Finite horizon problems


− Deterministic vs Stochastic
− Perfect vs Imperfect State Info
• Infinite horizon problems
− Stochastic shortest path problems
− Discounted problems
− Average cost problems

FINITE HORIZON PROBLEMS - ANALYSIS

• Perfect state info


− A general formulation - Basic problem, DP
algorithm
− A few nice problems admit analytical solu-
tion
• Imperfect state info
− Reduction to perfect state info - Sufficient
statistics
− Very few nice problems admit analytical so-
lution
− Finite-state problems admit reformulation as
perfect state info problems whose states are
prob. distributions (the belief vectors)

FINITE HORIZON PROBS - EXACT COMP. SOL.

• Deterministic finite-state problems


− Equivalent to shortest path
− A wealth of fast algorithms
− Hard combinatorial problems are a special
case (but # of states grows exponentially)
• Stochastic perfect state info problems
− The DP algorithm is the only choice
− Curse of dimensionality is big bottleneck
• Imperfect state info problems
− Forget it!
− Only small examples admit an exact compu-
tational solution

FINITE HORIZON PROBS - APPROX. SOL.

• Many techniques (and combinations thereof) to


choose from
• Simplification approaches
− Certainty equivalence
− Problem simplification
− Rolling horizon
− Aggregation - Coarse grid discretization
• Limited lookahead combined with:
− Rollout
− MPC (an important special case)
− Feature-based cost function approximation
• Approximation in policy space
− Gradient methods
− Random search

INFINITE HORIZON PROBLEMS - ANALYSIS

• A more extensive theory


• Bellman’s equation
• Optimality conditions
• Contraction mappings
• A few nice problems admit analytical solution
• Idiosyncrasies of problems with no underlying contraction
• Idiosyncrasies of average cost problems
• Elegant analysis

INF. HORIZON PROBS - EXACT COMP. SOL.

• Value iteration
− Variations (Gauss-Seidel, asynchronous, etc)
• Policy iteration
− Variations (asynchronous, based on value it-
eration, optimistic, etc)
• Linear programming
• Elegant algorithmic analysis
• Curse of dimensionality is major bottleneck

INFINITE HORIZON PROBS - ADP

• Approximation in value space (over a subspace


of basis functions)
• Approximate policy evaluation
− Direct methods (fitted VI)
− Indirect methods (projected equation meth-
ods, complex implementation issues)
− Aggregation methods (simpler implementa-
tion/many basis functions tradeoff)
• Q-Learning (model-free, simulation-based)
− Exact Q-factor computation
− Approximate Q-factor computation (fitted VI)
− Aggregation-based Q-learning
− Projected equation methods for opt. stop-
ping
• Approximate LP
• Rollout
• Approximation in policy space
− Gradient methods
− Random search
MIT OpenCourseWare
http://ocw.mit.edu

6.231 Dynamic Programming and Stochastic Control


Fall 2015

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
