DP Slides
CAMBRIDGE, MASS
FALL 2009
DIMITRI P. BERTSEKAS
http://www.athenasc.com/dpbook.html
LECTURE 1
LECTURE OUTLINE
• Problem Formulation
• Examples
• The Basic Problem
• Significance of Feedback
DP AS AN OPTIMIZATION METHODOLOGY
min_{u∈U} g(u)
• Discrete-time system
xk+1 = fk (xk , uk , wk ), k = 0, 1, . . . , N − 1
− k: Discrete time
− xk: State; summarizes past information that is relevant for future optimization
− uk: Control; decision to be selected at time k from a given set
− wk: Random parameter (also called disturbance or noise, depending on the context)
− N: Horizon or number of times control is applied
• Cost function that is additive over time
E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }
INVENTORY CONTROL EXAMPLE
(Figure: inventory system — stock xk, stock uk ordered at period k, demand wk at period k; cost of period k: cuk + r(xk + uk − wk).)
• Discrete-time system
xk+1 = fk (xk , uk , wk ) = xk + uk − wk
• Cost function that is additive over time
E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) } = E{ Σ_{k=0}^{N−1} ( cuk + r(xk + uk − wk) ) }
• Control constraint: uk ∈ Uk(xk)
• Disturbance distribution: Pwk(· | xk, uk)
DETERMINISTIC FINITE-STATE PROBLEMS
(Figure: state transition graph for a four-operation deterministic scheduling problem. Nodes are the partial schedules A, C, AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA; arcs from the initial state carry the startup costs SA, SC, and the remaining arcs carry the setup costs CAB, CAC, CBC, CBD, CCA, CCB, CCD, CDA, CDB.)
STOCHASTIC FINITE-STATE PROBLEMS
(Figure: transition graphs for the two-game chess match example. States are score pairs 0-0, 1-0, 0-1, 2-0, 1.5-0.5, 1-1, 0.5-0.5, 0.5-1.5, 0-2; timid play yields a draw with probability pd and a loss with probability 1 − pd, while bold play yields a win with probability pw and a loss with probability 1 − pw.)
Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk( xk, µk(xk), wk ) }
(Figure: closed-loop system — the controller µk applies uk = µk(xk) to the system xk+1 = fk(xk, uk, wk), which is driven by the disturbance wk.)
(Figure: chess match policy illustration — timid play from 1-0 leads to 1.5-0.5 w.p. pd and to 1-1 w.p. 1 − pd; bold play from 0-0 leads to 1-0 w.p. pw and to 0-1 w.p. 1 − pw; bold play from 0-1 leads to 1-1 w.p. pw and to 0-2 w.p. 1 − pw.)
VARIANTS OF DP PROBLEMS
• Continuous-time problems
• Imperfect state information problems
• Infinite horizon problems
• Suboptimal control
LECTURE BREAKDOWN
LECTURE 2
LECTURE OUTLINE
Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk( xk, µk(xk), wk ) }
E{ gN(xN) + Σ_{k=i}^{N−1} gk( xk, µk(xk), wk ) }
(the tail cost, accumulated from an intermediate time i to the horizon N)
(Figure: the scheduling example solved by DP — the same transition graph from the initial state, now with numerical arc costs and the cost-to-go displayed next to each node.)
(Figure: inventory system — stock xk, stock uk ordered at period k, demand wk at period k; cost of period k: cuk + r(xk + uk − wk).)
JN−1(xN−1) = min_{uN−1} E{ cuN−1 + r(xN−1 + uN−1 − wN−1) },
Jk(xk) = min_{uk} E{ cuk + r(xk + uk − wk) + Jk+1(xk + uk − wk) }
• Start with
JN (xN ) = gN (xN ),
Jk*(xk) = min_{(µk, πk+1)} E_{wk,...,wN−1}{ gk( xk, µk(xk), wk ) + gN(xN) + Σ_{i=k+1}^{N−1} gi( xi, µi(xi), wi ) }
= min_{µk} E_{wk}{ gk( xk, µk(xk), wk ) + min_{πk+1} E_{wk+1,...,wN−1}[ gN(xN) + Σ_{i=k+1}^{N−1} gi( xi, µi(xi), wi ) ] }
= min_{µk} E_{wk}{ gk( xk, µk(xk), wk ) + Jk+1*( fk( xk, µk(xk), wk ) ) }
= min_{µk} E_{wk}{ gk( xk, µk(xk), wk ) + Jk+1( fk( xk, µk(xk), wk ) ) }
= min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) }
= Jk(xk)
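As a concrete companion to this derivation (an illustration, not part of the original slides), here is a minimal Python sketch of the backward DP recursion for a problem with finitely many states, controls, and disturbance values. All problem-data names (states, controls, w_dist, f, g, g_N) are hypothetical placeholders, and the disturbance distribution is assumed independent of (x, u) for simplicity.

def dp(states, controls, w_dist, f, g, g_N, N):
    """Backward DP: returns cost-to-go tables J[k][x] and a policy mu[k][x]."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_N(x)                       # terminal cost
    for k in range(N - 1, -1, -1):             # backward in time
        for x in states:
            best_u, best_cost = None, float('inf')
            for u in controls(x):
                # expectation over the disturbance w
                cost = sum(prob * (g(x, u, w) + J[k + 1][f(x, u, w)])
                           for w, prob in w_dist.items())
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[k][x], mu[k][x] = best_cost, best_u
    return J, mu

For the inventory example above one would take f(x, u, w) = x + u − w, g(x, u, w) = cu + r(x + u − w), and g_N ≡ 0, with states and controls restricted so that x + u − w stays in the state space.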
LINEAR-QUADRATIC ANALYTICAL EXAMPLE
(Figure: two ovens in series — material with initial temperature x0 enters oven 1 (heat u0), exits at temperature x1, enters oven 2 (heat u1), and exits at final temperature x2.)
• System: xk+1 = (1 − a)xk + auk, k = 0, 1, where a is a given scalar from the interval (0, 1)
J2(x2) = r(x2 − T)²
J1(x1) = min_{u1} [ u1² + r( (1 − a)x1 + au1 − T )² ]
J0(x0) = min_{u0} [ u0² + J1( (1 − a)x0 + au0 ) ]
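To make the one-stage minimization concrete (an illustration, not from the slides): setting the derivative of u1² + r((1 − a)x1 + au1 − T)² with respect to u1 to zero gives u1* = ar(T − (1 − a)x1)/(1 + a²r). The sketch below checks this closed form against a brute-force grid search, with hypothetical values for a, r, T, x1.

import numpy as np

a, r, T, x1 = 0.5, 1.0, 100.0, 40.0   # assumed parameter values

# Analytic minimizer of u^2 + r*((1-a)*x1 + a*u - T)^2:
# 2u + 2*r*a*((1-a)*x1 + a*u - T) = 0  =>
u_star = a * r * (T - (1 - a) * x1) / (1 + r * a ** 2)

u = np.linspace(-200.0, 400.0, 600001)
cost = u ** 2 + r * ((1 - a) * x1 + a * u - T) ** 2
assert abs(u[np.argmin(cost)] - u_star) < 1e-2
print(u_star)   # 32.0 for these values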
STATE AUGMENTATION
LECTURE 3
LECTURE OUTLINE
(Figure: shortest-path reformulation — initial state s, stage-by-stage transition arcs, and terminal arcs into an artificial terminal node t, with cost equal to the terminal cost.)
• DP algorithm:
JN(i) = a^N_{it}, i ∈ SN,
Jk(i) = min_{j∈Sk+1} [ a^k_{ij} + Jk+1(j) ], i ∈ Sk, k = 0, . . . , N − 1
The optimal cost is J̃0(t) = min_{i∈SN} [ a^N_{it} + J̃1(i) ]
• View J̃k(j) as the optimal cost-to-arrive to state j from the initial state s
A NOTE ON FORWARD DP ALGORITHMS
• DP algorithm:
Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ], k = 0, 1, . . . , N − 2,
(Figure: (a) shortest path example with arc costs between states and the destination; (b) the corresponding cost-to-go values Jk(i) for each state i and stage k = 0, 1, 2, 3, 4.)
JN−1(i) = ait, i = 1, 2, . . . , N,
Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ], k = 0, 1, . . . , N − 2.
ESTIMATION / HIDDEN MARKOV MODELS
(Figure: trellis diagram — state sequence x0, x1, x2, . . . , xN−1, xN between an origin node s and a terminal node t.)
VITERBI ALGORITHM
• We have
p(XN | ZN) = p(XN, ZN) / p(ZN)
where p(XN, ZN) and p(ZN) are the unconditional probabilities of occurrence of (XN, ZN) and ZN
• Maximizing p(XN | ZN) is equivalent to maximizing ln(p(XN, ZN))
• We have
p(XN, ZN) = πx0 Π_{k=1}^{N} p_{xk−1 xk} r(zk; xk−1, xk)
so it is equivalent to
minimize − ln(πx0) − Σ_{k=1}^{N} ln( p_{xk−1 xk} r(zk; xk−1, xk) )
over all possible sequences {x0, x1, . . . , xN}.
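A minimal sketch of the resulting shortest-path (Viterbi) recursion over this criterion, assuming the model data above: a prior pi0, transition probabilities p[i][j], and an observation model r(z, i, j). The function and variable names are placeholders, not from the slides.

import math

def viterbi(pi0, p, r, Z):
    """Most likely state sequence x0, ..., xN given observations Z = [z1, ..., zN]."""
    n = len(pi0)
    # d[i] = min over partial sequences ending at i of -ln p(X_k, Z_k)
    d = [-math.log(pi0[i]) if pi0[i] > 0 else math.inf for i in range(n)]
    preds = []
    for z in Z:
        d_new, pred = [math.inf] * n, [0] * n
        for j in range(n):
            for i in range(n):
                w = p[i][j] * r(z, i, j)
                if w > 0 and d[i] - math.log(w) < d_new[j]:
                    d_new[j], pred[j] = d[i] - math.log(w), i
        d, preds = d_new, preds + [pred]
    x = min(range(n), key=lambda i: d[i])     # best final state
    seq = [x]
    for pred in reversed(preds):              # backtrack along stored predecessors
        x = pred[x]
        seq.append(x)
    return list(reversed(seq))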
(Figure: example tree with root node and successor nodes AB, AC, AD; arc costs drawn from 1, 3, 4, 5, 15, 20.)
LABEL CORRECTING METHODS
(Figure: label correcting step — node i is removed from OPEN, and for each successor j we test: is di + aij < dj? That is, is the path s → i → j better than the current path s → j?)
EXAMPLE
(Figure: example graph — origin node s = 1 (= A), with successor nodes 2 (= AB), 7 (= AC), 10 (= AD); arc costs drawn from 1, 3, 4, 5, 15, 20.)
Iter. No. | Node exiting OPEN | OPEN after iteration | UPPER
0 | — | 1 | ∞
1 | 1 | 2, 7, 10 | ∞
2 | 2 | 3, 5, 7, 10 | ∞
3 | 3 | 4, 5, 7, 10 | ∞
4 | 4 | 5, 7, 10 | 43
5 | 5 | 6, 7, 10 | 43
6 | 6 | 7, 10 | 13
7 | 7 | 8, 10 | 13
8 | 8 | 9, 10 | 13
9 | 9 | 10 | 13
10 | 10 | Empty | 13
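A minimal Python sketch of the generic label correcting method (the graph representation succ[i] = list of (j, a_ij) pairs is an assumption for illustration). The test di + aij < min(dj, UPPER) combines the label test above with bounding by the best path to t found so far; here OPEN is managed as a stack (a depth-first variant).

import math

def label_correcting(succ, s, t):
    d = {s: 0.0}                      # labels: best path lengths found so far
    OPEN = [s]
    upper = math.inf                  # UPPER = current best cost of reaching t
    while OPEN:
        i = OPEN.pop()                # REMOVE a node from OPEN
        for j, a_ij in succ.get(i, []):
            # Is the path s -> i -> j better than the current path s -> j,
            # and can it still improve on UPPER?
            if d[i] + a_ij < min(d.get(j, math.inf), upper):
                d[j] = d[i] + a_ij
                if j == t:
                    upper = d[j]
                else:
                    OPEN.append(j)
    return upper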
LECTURE 4
LECTURE OUTLINE
(Figure: label correcting step — node i is removed from OPEN, and for each successor j we test: is di + aij < dj? That is, is the path s → i → j better than the current path s → j?)
VALIDITY OF LABEL CORRECTING METHODS
(Figure: tree of paths — nodes 2–14 arranged by level below the origin, with destination node t.)
OPEN = { i ≠ t | di < ∞ }
BRANCH-AND-BOUND ALGORITHM
LECTURE 5
LECTURE OUTLINE
• System: xk+1 = Ak xk + Bk uk + wk
• Quadratic cost
E_{wk, k=0,1,...,N−1}{ x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }
• DP algorithm: Jk(xk) = min_{uk} E{ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) }
• Key facts:
− Jk (xk ) is quadratic
− Optimal policy {µ∗0 , . . . , µ∗N −1 } is linear:
µ∗k (xk ) = Lk xk
− Similar treatment of a number of variants
DERIVATION
KN = QN ,
xk+1 = (A + BL)xk + wk
(Figure: graphical analysis of the Riccati iteration — the curve F(P) against the 45° line, with iterates Pk, Pk+1 converging to the fixed point P*, and a pole at P = −R/B².)
Pk+1 = A² ( Pk − B²Pk² / (B²Pk + R) ) + Q,
i.e., Pk+1 = F(Pk), where
F(P) = A²RP / (B²P + R) + Q.
JN(xN) = x′N QN xN,
Jk(xk) = min_{uk} E_{wk, Ak, Bk}{ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) }
The optimal gain becomes
Lk = −( Rk + E{B′k Kk+1 Bk} )^{−1} E{B′k Kk+1 Ak},
(Figure: the corresponding graphical analysis, now with a pole at P = −R/E{B²}.)
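As an illustration with hypothetical scalars (not from the slides), the convergence Pk → P* can be observed by simply iterating Pk+1 = F(Pk):

A, B, Q, R = 1.0, 1.0, 1.0, 1.0    # assumed scalar problem data
P = 0.0                            # start the Riccati iteration at P0 = 0
for _ in range(100):
    P = A ** 2 * R * P / (B ** 2 * P + R) + Q   # P_{k+1} = F(P_k)
print(P)   # converges to the fixed point P* = F(P*); here (1 + sqrt(5))/2 ≈ 1.618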
xk+1 = xk + uk − wk , k = 0, 1, . . . , N − 1
• Minimize
E{ Σ_{k=0}^{N−1} ( cuk + r(xk + uk − wk) ) }
• DP algorithm:
JN(xN) = 0,
Jk(xk) = min_{uk≥0} [ cuk + H(xk + uk) + E{ Jk+1(xk + uk − wk) } ] = −cxk + min_{y≥xk} Gk(y),
where
Gk(y) = cy + H(y) + E{ Jk+1(y − w) }.
lim_{|x|→∞} Jk(x) = ∞
JUSTIFICATION
(Figure: graphical justification — plots of H(y) and cy + H(y), whose minimum over y is attained at SN−1, and of the resulting JN−1(xN−1), which has slope −c for xN−1 ≤ SN−1.)
6.231 DYNAMIC PROGRAMMING
LECTURE 6
LECTURE OUTLINE
• Stopping problems
• Scheduling problems
• Other applications
PURE STOPPING PROBLEMS
(Figure: state space partitioned into a continue region and a stop region, with an absorbing stop state.)
EXAMPLE: ASSET SELLING
• Optimal policy: accept the current offer xk if and only if it exceeds the threshold αk
(Figure: thresholds α1, α2, . . . , αN−1 plotted against k = 0, 1, 2, . . . , N; the ACCEPT region lies above the threshold curve and the REJECT region below.)
We have αk = E_w{ Vk+1(w) } / (1 + r), so it is enough to show that Vk(x) ≥ Vk+1(x) for all x and k.
Start with VN −1 (x) ≥ VN (x) and use the mono-
tonicity property of DP.
• We can also show that αk → a as k → −∞.
Suggests that for an infinite horizon the optimal
policy is stationary.
GENERAL STOPPING PROBLEMS
T0 ⊂ · · · ⊂ Tk ⊂ Tk+1 ⊂ · · · ⊂ TN −1 .
• Interesting case is when all the Tk are equal (to
TN −1 , the set where it is better to stop than to go
one step and stop). Can be shown to be true if
fk(x, u, w) ∈ TN−1, for all x ∈ TN−1, u ∈ Uk(x), w
(Exercise 1.5 in the text, solution posted on the www).
• Minimax version of the DP algorithm:
JN(xN) = gN(xN),
Jk(xk) = min_{uk∈U(xk)} max_{wk∈Wk(xk,uk)} [ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) ]
UNKNOWN-BUT-BOUNDED CONTROL
LECTURE 7
LECTURE OUTLINE
Ik = (z0 , z1 , . . . , zk , u0 , u1 , . . . , uk−1 ), k ≥ 1,
I0 = z0
• We consider policies π = {µ0, µ1, . . . , µN−1}, where each function µk maps the information vector Ik into a control uk = µk(Ik)
• We have
Ik+1 = (Ik , zk+1 , uk ), k = 0, 1, . . . , N −2, I0 = z0
P (zk+1 | Ik , uk ) = P (zk+1 | Ik , uk , z0 , z1 , . . . , zk ),
JN−1(IN−1) = min_{uN−1∈UN−1} E_{xN−1, wN−1}[ gN( fN−1(xN−1, uN−1, wN−1) ) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 ]
• System: xk+1 = Ak xk + Bk uk + wk
• Quadratic cost
E_{wk, k=0,1,...,N−1}{ x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }
zk = Ck xk + vk , k = 0, 1, . . . , N − 1
+ 2E{xN−1 | IN−1}′ A′QB uN−1
The minimization yields µ*N−1(IN−1) = LN−1 E{xN−1 | IN−1}, where
LN−1 = −(B′QB + R)^{−1} B′QA
DP ALGORITHM II
PN−1 = A′N−1 QN BN−1 ( RN−1 + B′N−1 QN BN−1 )^{−1} B′N−1 QN AN−1,
KN−1 = A′N−1 QN AN−1 − PN−1 + QN−1
plus a term involving the estimation error
xN−1 − E{xN−1 | IN−1}
DP ALGORITHM III
xN −1 − E{xN −1 | IN −1 } = ξN −1 ,
where
ξN −1 : function of x0 , w0 , . . . , wN −2 , v0 , . . . , vN −1
E{ ξ′N−1 PN−1 ξN−1 | IN−2, uN−2 } = E{ ξ′N−1 PN−1 ξN−1 | IN−2 }
and is independent of uN−2.
• So the minimization in the DP algorithm yields µk*(Ik) = Lk E{xk | Ik}, where the gains Lk are determined from the matrices Kk generated backward by
Kk = A′k Kk+1 Ak − Pk + Qk
(Figure: separation structure — system xk+1 = Ak xk + Bk uk + wk with measurement zk = Ck xk + vk; an estimator computes E{xk | Ik} from zk and the delayed control uk−1, and the actuator applies uk = Lk E{xk | Ik}.)
SEPARATION INTERPRETATION
LECTURE 8
LECTURE OUTLINE
I0 = z0 , Ik = (z0 , z1 , . . . , zk , u0 , u1 , . . . , uk−1 ), k ≥ 1
• DP algorithm:
Jk(Ik) = min_{uk∈Uk} E_{xk, wk, zk+1}[ gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk ]
JN−1(IN−1) = min_{uN−1∈UN−1} E_{xN−1, wN−1}[ gN( fN−1(xN−1, uN−1, wN−1) ) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 ]
µk*(Ik) = µ̄k( Sk(Ik) ),
(Figure: sufficient statistic structure — system xk+1 = fk(xk, uk, wk), measurement zk = hk(xk, uk−1, vk); the estimator computes the conditional distribution P_{xk|Ik}, and the actuator applies uk = µ̄k(P_{xk|Ik}).)
EXAMPLE: A SEARCH PROBLEM
with p0 given.
• pk evolves at time k according to the equation
pk+1 = pk if the site is not searched, and pk+1 = pk(1 − β) / ( pk(1 − β) + 1 − pk ) if the site is searched and the treasure is not found
• DP algorithm
J̄k(pk) = max[ 0, −C + pk βV + (1 − pk β) J̄k+1( pk(1 − β) / ( pk(1 − β) + 1 − pk ) ) ],
with J̄N(pN) = 0.
• Can be shown by induction that the functions
J̄k satisfy
J̄k(pk) = 0, for all pk ≤ C/(βV)
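A quick numerical check (an illustration with hypothetical constants C, V, β, not from the slides): running the DP recursion on a grid of belief values p, with linear interpolation between grid points, confirms that J̄k(p) = 0 precisely for p ≤ C/(βV).

import numpy as np

C, V, beta, N = 1.0, 10.0, 0.5, 20          # assumed values
p = np.linspace(0.0, 1.0, 1001)
J = np.zeros_like(p)                         # J_bar_N = 0
for _ in range(N):
    p_next = p * (1 - beta) / (p * (1 - beta) + 1 - p + 1e-12)
    cont = -C + p * beta * V + (1 - p * beta) * np.interp(p_next, p, J)
    J = np.maximum(0.0, cont)                # J_bar_k
print(p[J > 0][0], C / (beta * V))           # search threshold ≈ C/(βV) = 0.2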
(Figure: instruction example — transition diagram with states L and R and transition probabilities t, 1 − t, r, 1 − r.)
• DP algorithm:
J̄k(pk) = min[ (1 − pk)C, I + E_{zk+1}{ J̄k+1( Φ(pk, zk+1) ) } ],
starting with
J̄N−1(pN−1) = min[ (1 − pN−1)C, I + (1 − t)(1 − pN−1)C ].
INSTRUCTION EXAMPLE III
(Figure: instruction example — the functions I + AN−1(p), I + AN−2(p), I + AN−3(p) plotted against the terminating cost line C(1 − p).)
LECTURE 9
LECTURE OUTLINE
• Suboptimal control
• Cost approximation methods: Classification
• Certainty equivalent control: An example
• Limited lookahead policies
• Performance bounds
• Problem approximation approach
• Parametric cost-to-go approximation
PRACTICAL DIFFICULTIES OF DP
minimize gN(xN) + Σ_{i=k}^{N−1} gi( xi, ui, wi )
gk( xk, uk, wk ) + J̃k+1( fk(xk, uk, wk) )
minimize gN(xN) + Σ_{k=0}^{N−1} gk( xk, µk(xk), wk )
subject to xk+1 = fk( xk, µk(xk), wk ), µk(xk) ∈ Uk
(Figure: suboptimal control structure — system xk+1 = fk(xk, uk, wk) with measurement zk = hk(xk, uk−1, vk); an estimator produces a state estimate x̄k(Ik) from zk and the delayed control uk−1, and the actuator applies uk = µdk( x̄k(Ik) ).)
PARTIALLY STOCHASTIC CEC
min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },
where
− J˜N = gN .
− J˜k+1 : approximation to true cost-to-go Jk+1
• Two-step lookahead policy: At each k and
xk , use the control µ̃k (xk ) attaining the minimum
above, where the function J˜k+1 is obtained using
a 1SL approximation (solve a 2-step DP problem).
• If J˜k+1 is readily available and the minimiza-
tion above is not too hard, the 1SL policy is im-
plementable on-line.
• Sometimes one also replaces Uk (xk ) above with
a subset of “most promising controls” U k (xk ).
• As the length of lookahead increases, the re-
quired computation quickly explodes.
PERFORMANCE BOUNDS FOR 1SL
Ĵk(xk) = min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) }
[so Ĵk(xk) is computed along with µ̄k(xk)]. Then, assuming Ĵk(xk) ≤ J̃k(xk) for all xk and k, the cost-to-go of the 1SL policy satisfies J̄k(xk) ≤ Ĵk(xk).
(Figure: position evaluator — extraction of features (material balance, mobility, safety, etc.) followed by feature weighting to produce a score.)
LECTURE 10
LECTURE OUTLINE
• Rollout algorithms
• Cost improvement property
• Discrete deterministic problems
• Sequential consistency and greedy algorithms
• Sequential improvement
ROLLOUT ALGORITHMS
min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },
where
− J˜N = gN .
− J˜k+1 : approximation to true cost-to-go Jk+1
• Rollout algorithm: When J˜k is the cost-to-go
of some heuristic policy (called the base policy)
• Cost improvement property (to be shown): The
rollout algorithm achieves no worse (and usually
much better) cost than the base heuristic starting
from the same state.
• Main difficulty: Calculating J˜k (xk ) may be
computationally intensive if the cost-to-go of the
base policy cannot be analytically calculated.
− May involve Monte Carlo simulation if the
problem is stochastic.
− Things improve in the deterministic case.
EXAMPLE: THE QUIZ PROBLEM
• Let J̃k(xk) = Hk(xk), the cost-to-go of the base heuristic starting from xk
(Figure: search tree with root node and children AB, AC, AD.)
• Generic problem:
− Given a graph with directed arcs
− A special node s called the origin
− A set of terminal nodes, called destinations,
and a cost g(i) for each destination i.
− Find min cost path starting at the origin,
ending at one of the destination nodes.
• Base heuristic: For any nondestination node i,
constructs a path (i, i1 , . . . , im , i) starting at i and
ending at one of the destination nodes i. We call
i the projection of i, and we denote H(i) = g(i).
• Rollout algorithm: Start at the origin; choose
the successor node with least cost projection
(Figure: rollout step — at the current node im of the path s, i1, . . . , im−1, im, the neighbors j1, j2, j3, j4 of im have projections with costs p(j1), p(j2), p(j3), p(j4).)
EXAMPLE: ONE-DIMENSIONAL WALK
(Figure: one-dimensional walk — final states (N, −N) through (N, N) with terminal cost g(i); the optimal terminal position is ī.)
LECTURE 11
LECTURE OUTLINE
min_{uk∈Uk(xk)} Qk(xk, uk),
where
Qk(xk, uk) = E{ gk(xk, uk, wk) + Hk+1( fk(xk, uk, wk) ) }
(Figure: MPC — from the current state xk, an m-stage optimal trajectory ending at the stopped state is computed, i.e., the minimization is carried out
subject to xk+m = 0
• Then applies the first control ūk (and repeats
at the next state xk+1 )
• MPC is rollout with the heuristic derived from the corresponding (m − 1)-step optimal control problem
• Key Property of MPC: Since the heuris-
tic is stable, the rollout is also stable (by policy
improvement property).
DISCRETIZATION
LECTURE 12
LECTURE OUTLINE
• Let ρ = max_π ρπ. Then
P{ x2m ≠ t | x0 = i, π } = P{ x2m ≠ t | xm ≠ t, x0 = i, π } × P{ xm ≠ t | x0 = i, π } ≤ ρ²
and similarly
P{ xkm ≠ t | x0 = i, π } ≤ ρ^k, i = 1, . . . , n
J ∗ (t) = 0
• A stationary policy µ is optimal if and only
if for every state i, µ(i) attains the minimum in
Bellman’s equation.
• Key proof idea: The “tail” of the cost series,
Σ_{k=mK}^{∞} E{ g( xk, µk(xk) ) }
vanishes as K increases to ∞.
OUTLINE OF PROOF THAT JN → J ∗
J*(x0) ≤ JmK(x0) + ( ρ^K/(1 − ρ) ) m max_{i,u} |g(i, u)|.
Similarly, we have
JmK(x0) − ( ρ^K/(1 − ρ) ) m max_{i,u} |g(i, u)| ≤ J*(x0).
g(i, u) = 1, ∀ i = 1, . . . , n, u ∈ U (i)
(Figure: conversion of an α-discounted problem to an SSP — the transition probabilities pij(u), pji(u) are scaled to αpij(u), αpji(u), and each state moves to the artificial termination state with probability 1 − α.)
J*(i) = min_{u∈U(i)} [ g(i, u) + α Σ_{j=1}^{n} pij(u) J*(j) ], ∀ i
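As an illustration (not part of the slides), a minimal sketch of value iteration for this equation; the stage costs g[i][u] and transition probabilities p[i][u][j] are hypothetical problem-data arrays.

def value_iteration(g, p, alpha, iters=1000):
    n = len(g)
    J = [0.0] * n
    for _ in range(iters):                    # repeated application of T
        J = [min(g[i][u] + alpha * sum(p[i][u][j] * J[j] for j in range(n))
                 for u in range(len(g[i])))
             for i in range(n)]
    # a policy attaining the minimum in Bellman's equation
    mu = [min(range(len(g[i])),
              key=lambda u, i=i: g[i][u] + alpha * sum(p[i][u][j] * J[j]
                                                       for j in range(n)))
          for i in range(n)]
    return J, mu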
LECTURE 13
LECTURE OUTLINE
• Key proof idea (review): the “tail” of the cost series, Σ_{k=mK}^{∞} E{ g( xk, µk(xk) ) }, vanishes as K increases to ∞.
BELLMAN’S EQUATION FOR A SINGLE POLICY
(Figure: two-state example in the (J(1), J(2)) plane — the line J(2) = g(2, u²) + p21(u²)J(1) + p22(u²)J(2) and the fixed point J* = (J*(1), J*(2)).)
(Figure: conversion of an α-discounted problem to an SSP — the transition probabilities pij(u), pji(u) are scaled to αpij(u), αpji(u), and each state moves to the artificial termination state with probability 1 − α.)
J*(i) = min_{u∈U(i)} [ g(i, u) + α Σ_{j=1}^{n} pij(u) J*(j) ], ∀ i
DISCOUNTED PROBLEMS (CONTINUED)
LECTURE 14
LECTURE OUTLINE
(Figure: average cost problem viewed as an SSP — transitions pni(u), pnj(u), pin(u), pjn(u), pnn(u) through a special state n, with an artificial termination state t.)
since Jk (i) and Jk∗ (i) are optimal costs for two
k-stage problems that differ only in the terminal
cost functions, which are J0 and h∗ .
RELATIVE VALUE ITERATION
LECTURE 15
LECTURE OUTLINE
P{ tk+1 − tk ≤ τ | xk = i, xk+1 = j, uk = u } = Qij(τ, u) / pij(u)
where
γ = Σ_{j=1}^{n} ∫_0^∞ (1 − e^{−βτ})/β dQij(τ, u) = ∫_0^{τmax} (1 − e^{−βτ})/(βτmax) dτ
and
α = ∫_0^∞ e^{−βτ} dQij(τ, u) = ∫_0^{τmax} ( e^{−βτ}/τmax ) dτ = (1 − e^{−βτmax})/(βτmax)
• Minimize
lim_{N→∞} ( 1 / E{tN} ) E{ ∫_0^{tN} g( x(t), u(t) ) dt }
G(i, Fill) = 0, G(i, Not Fill) = (c i τmax)/2
and there is also the “instantaneous” cost ĝ(i, Fill) = K, ĝ(i, Not Fill) = 0
• Bellman’s equation:
h*(i) = min[ K − λ*(τmax/2) + h*(1), c i (τmax/2) − λ*(τmax/2) + h*(i + 1) ]
LECTURE 16
LECTURE OUTLINE
xk+1 = f (xk , uk , wk ), k = 0, 1, . . .
Jπ(x) = lim_{k→∞} (Tµ0 Tµ1 · · · Tµk J0)(x), Jµ(x) = lim_{k→∞} (Tµ^k J0)(x)
• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ
• Optimality condition:
µ: optimal <==> Tµ J ∗ = T J ∗
• If J0 ≡ 0,
Jπ(x0) = E{ Σ_{k=0}^{∞} α^k g( xk, µk(xk), wk ) }
= E{ Σ_{k=0}^{N−1} α^k g( xk, µk(xk), wk ) } + E{ Σ_{k=N}^{∞} α^k g( xk, µk(xk), wk ) }
J*(x) − α^N M/(1 − α) ≤ (T^N J0)(x) ≤ J*(x) + α^N M/(1 − α),
(T J*)(x) − α^{N+1} M/(1 − α) ≤ (T^{N+1} J0)(x) ≤ (T J*)(x) + α^{N+1} M/(1 − α).
Taking the limit as N → ∞, we obtain J* = T J*. Q.E.D.
THE CONTRACTION PROPERTY
IMPLICATIONS OF CONTRACTION PROPERTY
Proof: Use Bellman's equation J* = T J*: if Tµ J* = T J*, then J* = Tµ J*, so by uniqueness of the fixed point of Tµ, J* = Jµ and µ is optimal; conversely, if µ is optimal, then J* = Jµ = Tµ Jµ = Tµ J*.
LECTURE 17
LECTURE OUTLINE
Jµ ≤ T J̃ + ( αδ/(1 − α) ) e ≤ J̃ + ( δ/(1 − α) ) e.
• Assume that
J ∗ − ǫe ≤ J˜ ≤ J ∗ + ǫe,
J̃(i) = min{ Jµ1(i), . . . , JµM(i) }, ∀ i.
We have (T J̃)(i) ≤ Jµm(i) for every m and i, so
(T J̃)(i) ≤ J̃(i), ∀ i,
and hence
Jµ(i) ≤ J̃(i) = min{ Jµ1(i), . . . , JµM(i) }, ∀ i,
lim sup_{k→∞} max_{x∈S} ( Jµk(x) − J*(x) ) ≤ (ǫ + 2αδ) / (1 − α)²
J* = F J*.
|| F^k J − J* || ≤ ρ^k || J − J* ||, k = 1, 2, . . . .
(b) V = ( V(1), V(2), . . . ) ∈ B(S), where
V(i) = max_{u∈U(i)} Σ_{j∈S} pij(u) vj, ∀ i
(T J)(i) = min_{u∈U(i)} [ g(i, u) + α Σ_{j∈S} pij(u) J(j) ], ∀ i
LECTURE 18
LECTURE OUTLINE
• Undiscounted problems
• Stochastic shortest path problems (SSP)
• Proper and improper policies
• Analysis and computational methods for SSP
• Pathologies of SSP
UNDISCOUNTED PROBLEMS
(Tµ J)(i) = g( i, µ(i) ) + Σ_{j=1}^{n} pij( µ(i) ) J(j), i = 1, . . . , n.
max_i (1/vi) |(Tµ J)(i) − (Tµ J′)(i)| ≤ ρµ max_i (1/vi) |J(i) − J′(i)|
• T is similarly a contraction if all µ are proper
(the case discussed in the text, Ch. 7, Vol. I).
SSP THEORY SUMMARY II
J ≥ Tµ^k J = Pµ^k J + Σ_{m=0}^{k−1} Pµ^m gµ
J = T^k J ≤ Tµ′^k J → Jµ′ = J′
Similarly, J′ ≤ J, so J = J′.
SSP ANALYSIS II
Tµ0 · · · Tµk−1 J0 ≥ T^k J0,
For vi = −Ĵ(i), we have vi ≥ 1, and for all µ,
Σ_{j=1}^{n} pij( µ(i) ) vj ≤ vi − 1 ≤ ρ vi, i = 1, . . . , n,
where
ρ = max_{i=1,...,n} (vi − 1)/vi < 1.
This implies contraction of Tµ and T by the results
of the preceding lecture.
PATHOLOGIES I: DETERM. SHORTEST PATHS
(Figure: deterministic shortest path example — nodes 1 and 2 and destination t, with arc cost 1 into t and a zero-cost cycle between 1 and 2.)
• It has J ∗ as solution.
• Set of solutions of Bellman’s equation:
{ J | J(1) = J(2) ≤ 1 }.
PATHOLOGIES II: DETERM. SHORTEST PATHS
(Figure: deterministic shortest path example — nodes 1 and 2 and destination t, now with an arc of cost −1 creating a negative-cost cycle.)
LECTURE 19
LECTURE OUTLINE
where q(1), . . . , q(n) is some probability dis-
tribution over the states.
• In a special case of this approach, the param-
eterization of the policies is indirect, through an
approximate cost function.
− A cost approximation architecture parame-
terized by r, defines a policy dependent on r
via the minimization in Bellman’s equation.
APPROX. IN VALUE SPACE - APPROACHES
• An example:
(Figure: simulation-based architecture — a system simulator feeds the state i to a decision generator, which produces the decision µ̄(i) using cost-to-go values J̃(j, r) supplied by a cost-to-go approximator; r is tuned by least-squares optimization.)
(Figure: two approximation approaches on the subspace S spanned by the basis functions — direct approximation, the projection ΠJµ of Jµ onto S, versus indirect approximation, solving the projected equation Φr = ΠTµ(Φr). A second figure shows the projected iterates Φrk, Φrk+1, with and without simulation error.)
max_i | J̃(i, rk) − Jµk(i) | ≤ δ, k = 0, 1, . . .
max_i | (Tµk+1 J̃)(i, rk) − (T J̃)(i, rk) | ≤ ǫ, k = 0, 1, . . .
lim sup_{k→∞} max_i ( Jµk(i) − J*(i) ) ≤ (ǫ + 2αδ) / (1 − α)²
Jµ(i) ≈ (1/Mi) Σ_{m=1}^{Mi} c(i, m)
r* = arg min_r Σ_{i=1}^{n} Σ_{m=1}^{Mi} ( c(i, m) − J̃µ(i, r) )²
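A minimal sketch of this direct fit for a linear architecture J̃µ(i, r) = φ(i)′r (the linear form is an assumption for illustration; all names are hypothetical), solving the least-squares problem via numpy:

import numpy as np

def fit_costs(phi, samples):
    """phi: dict state -> feature vector; samples: list of (i, c(i, m)) pairs."""
    Phi = np.array([phi[i] for i, _ in samples])
    c = np.array([cost for _, cost in samples])
    # r* = argmin_r sum_m ( c_m - phi(i_m)' r )^2
    r, *_ = np.linalg.lstsq(Phi, c, rcond=None)
    return r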
(Figure: simulation-based architecture — a system simulator feeds the state i to a decision generator, which produces the decision µ̄(i) using cost-to-go values J̃(j, r) supplied by a cost-to-go approximator; r is tuned by least-squares optimization.)
LECTURE 20
LECTURE OUTLINE
(Figure: direct approximation, the projection ΠJµ, versus indirect approximation, the projected equation Φr = ΠTµ(Φr), on the subspace S spanned by the basis functions.)
min_r (1/2) Σ_{k=0}^{N−1} ( J̃(ik, r) − Σ_{t=k}^{N−1} α^{t−k} g( it, µ(it), it+1 ) )²
• Gradient iteration
r := r − γ Σ_{k=0}^{N−1} ∇J̃(ik, r) ( J̃(ik, r) − Σ_{t=k}^{N−1} α^{t−k} g( it, µ(it), it+1 ) )
• Important tradeoff:
− In order to reduce simulation error and ob-
tain cost samples for a representatively large
subset of states, we must use a large N
− To keep the work per gradient iteration small,
we must use a small N
• To address the issue of size of N , small batches
may be used and changed after one or more iter-
ations.
• Then the method requires a diminishing stepsize
for convergence.
• This slows down the convergence (which can
generally be very slow for a gradient method).
• Theoretical convergence is guaranteed (with a
diminishing stepsize) under reasonable conditions,
but in practice this is not much of a guarantee.
INCREMENTAL GRADIENT METHOD
rk+1 = rk − γ ( ∇J̃(ik, rk) J̃(ik, rk) − ( Σ_{t=0}^{k} α^{k−t} ∇J̃(it, rt) ) g( ik, µ(ik), ik+1 ) )
INCREMENTAL GRADIENT - CONTINUED
qk = g( ik, µ(ik), ik+1 ) + αJ̃(ik+1, r) − J̃(ik, r), k ≤ N − 2
qN−1 = g( iN−1, µ(iN−1), iN ) − J̃(iN−1, r)
rk+1 = rk + γk qk ∇J̃(ik, rk)
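For a linear architecture J̃(i, r) = φ(i)′r we have ∇J̃(ik, r) = φ(ik), and the update becomes the classical TD iteration. A minimal sketch under that assumption (the trajectory inputs are hypothetical, and a constant stepsize is used for simplicity, whereas the text calls for a diminishing γk):

import numpy as np

def td_updates(phis, costs, alpha, r, gamma=0.01):
    """phis: feature vectors phi(i_0), ..., phi(i_N); costs: g along the trajectory."""
    for k in range(len(costs)):
        # temporal difference q_k; no alpha*J~ term on the last transition
        nxt = phis[k + 1] @ r if k < len(costs) - 1 else 0.0
        q_k = costs[k] + alpha * nxt - phis[k] @ r
        r = r + gamma * q_k * phis[k]
    return r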
LECTURE 21
LECTURE OUTLINE
J̃(r) = Φr
S = { Φr | r ∈ ℜ^s }
ΠJ = Φr*
where
r* = arg min_{r∈ℜ^s} || J − Φr ||²_v
THE PROJECTED BELLMAN EQUATION
or more compactly, T J = g + αP J
Φr = ΠT(Φr)
(Figure: T(Φr) is projected back onto the subspace S spanned by the basis functions; the fixed point satisfies Φr = ΠT(Φr).)
ξj = lim_{N→∞} (1/N) Σ_{k=1}^{N} P( ik = j | i0 = i ) > 0
|| ΠJ − ΠJ̄ ||_v ≤ || J − J̄ ||_v, for all J, J̄ ∈ ℜ^n.
• Lemma: We have
|| P z ||_ξ ≤ || z ||_ξ, z ∈ ℜ^n
|| Jµ − Φr* ||_ξ ≤ ( 1/√(1 − α²) ) || Jµ − ΠJµ ||_ξ.
Proof: We have
|| Jµ − Φr* ||²_ξ = || Jµ − ΠJµ ||²_ξ + || ΠJµ − Φr* ||²_ξ
= || Jµ − ΠJµ ||²_ξ + || ΠT Jµ − ΠT(Φr*) ||²_ξ
• Matrix inversion: r* = C^{−1}d
• Projected Value Iteration (PVI) method: Φrk+1 = ΠT(Φrk)
(Figure: T(Φrk) is projected back onto the subspace S spanned by the basis functions to obtain Φrk+1.)
which yields
rk+1 = rk − (Φ′ΞΦ)^{−1} (Crk − d)
SIMULATION-BASED IMPLEMENTATIONS
Ck ≈ C, dk ≈ d, and Dk ≈ Φ′ΞΦ
This is the LSPE (Least Squares Policy Evalua-
tion) Method.
• Key fact: Ck, dk, and Dk can be computed with low-dimensional linear algebra (of order s, the number of basis functions).
SIMULATION MECHANICS
dk = ( 1/(k+1) ) Σ_{t=0}^{k} φ(it) g(it, it+1) ≈ Φ′Ξg
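A minimal LSTD sketch under the same linear-architecture assumption (hypothetical trajectory inputs): accumulate the sample sums corresponding to Ck and dk and solve Cr = d — the common 1/(k+1) factor cancels in the solve.

import numpy as np

def lstd(phis, costs, alpha):
    """phis: rows phi(i_0), ..., phi(i_{k+1}); costs: g(i_t, i_{t+1}) along the way."""
    s = phis.shape[1]
    C, d = np.zeros((s, s)), np.zeros(s)
    for t in range(len(costs)):
        C += np.outer(phis[t], phis[t] - alpha * phis[t + 1])   # ~ Phi' Xi (I - alpha P) Phi
        d += phis[t] * costs[t]                                 # ~ Phi' Xi g
    return np.linalg.solve(C, d)   # r* = C^{-1} d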
LECTURE 22
LECTURE OUTLINE
or more compactly, T J = g + αP J
Φr = ΠT(Φr)
(Figure: T(Φr) is projected back onto the subspace S spanned by the basis functions; the fixed point satisfies Φr = ΠT(Φr).)
|| Jµ − Φrλ* ||_ξ ≤ ( 1/√(1 − αλ²) ) || Jµ − ΠJµ ||_ξ
with
P^(λ) = (1 − λ) Σ_{ℓ=0}^{∞} α^ℓ λ^ℓ P^{ℓ+1}, g^(λ) = Σ_{ℓ=0}^{∞} α^ℓ λ^ℓ P^ℓ g
• The simulation process to obtain Ck^(λ) and dk^(λ) is similar to the case λ = 0 (a single simulation trajectory i0, i1, . . ., but the formulas are more complicated):
Ck^(λ) = ( 1/(k+1) ) Σ_{t=0}^{k} φ(it) Σ_{m=t}^{k} α^{m−t} λ^{m−t} ( φ(im) − αφ(im+1) )′
dk^(λ) = ( 1/(k+1) ) Σ_{t=0}^{k} φ(it) Σ_{m=t}^{k} α^{m−t} λ^{m−t} g_{im}
LECTURE 23
LECTURE OUTLINE
Φr = ΠT(Φr)
(Figure: T(Φr) projected onto the subspace S spanned by the basis functions, with fixed point Φr = ΠT(Φr).)
where
qt (i) = P (it = i), i = 1, . . . , n, t = 0, 1, . . .
• We use the projection norm
|| J ||_q = √( Σ_{i=1}^{n} q(i) J(i)² )
where
β = 1 − min_j ( q0(j) / q(j) )
AVERAGE COST PROBLEMS
F J = g − ηe + P J
T (x) = Ax + b, A is n × n, b ∈ ℜn
I − ΠA : invertible
• Benefits of generalization:
− Unification/higher perspective for TD meth-
ods in approximate DP
− An extension to a broad new area of applica-
tions, where a DP perspective may be help-
ful
• Challenge: Dealing with less structure
− Lack of contraction
− Absence of a Markov chain
GENERALIZED PROJECTED EQUATION
Σ_{k=0}^{t} φ(ik) ( φ(ik) − ( a_{ik jk} / p_{ik jk} ) φ(jk) )′ rt = Σ_{k=0}^{t} φ(ik) b_{ik}
• We have rt → r∗ , regardless of ΠA being a con-
traction (by law of large numbers; see next slide).
• An LSPE-like method is also possible, but re-
quires that ΠA is a contraction.
• Under the assumption Σ_{j=1}^{n} |aij| ≤ 1 for all i,
there are conditions that guarantee contraction of
ΠA; see the paper by Bertsekas and Yu,“Projected
Equation Methods for Approximate Solution of
Large Linear Systems,” 2009, or the expanded ver-
sion of Chapter 6, Vol. 2.
JUSTIFICATION W/ LAW OF LARGE NUMBERS
( 1/(t+1) ) Σ_{k=0}^{t} φ(ik) b_{ik} = Σ_{i=1}^{n} ξ̂i^t φ(i) bi ≈ Σ_{i=1}^{n} ξi φ(i) bi
LECTURE 24
LECTURE OUTLINE
or equivalently
Φ′( Φr* − T(Φr*) ) = 0
min_{θ∈Θ} F( J̃(θ) )
where J̃(θ) is the solution of the projected equation.
• One example is
F( J̃(θ) ) = || J̃(θ) − T( J̃(θ) ) ||²
• Another example is
F( J̃(θ) ) = Σ_{i∈I} | J(i) − J̃(θ)(i) |²,
• Some algorithm may be used to minimize F( J̃(θ) ) over θ.
• A challenge here is that the algorithm should
use low-dimensional calculations.
• One possibility is to use a form of random search
(the cross-entropy method); see the paper by Men-
ache, Mannor, and Shimkin (Annals of Oper. Res.,
Vol. 134, 2005)
• Another possibility is to use a gradient method. For this it is necessary to estimate the partial derivatives of J̃(θ) with respect to the components of θ.
• It turns out that by differentiating the pro-
jected equation, these partial derivatives can be
calculated using low-dimensional operations. See
the paper by Menache, Mannor, and Shimkin, and
a recent paper by Yu and Bertsekas (2009).
APPROXIMATION IN POLICY SPACE I
By left-multiplying with ξ′,
ξ′∆η(r)·e + ξ′∆h(r) = ξ′( ∆g(r) + ∆P(r)h(r) ) + ξ′P(r)∆h(r)
Since ξ′P(r) = ξ′ and ξ′e = 1, the ∆h terms cancel, leaving
∆η = ξ′( ∆g + ∆P h )
• Since we don’t know ξ, we cannot implement a
gradient-like method for minimizing η(r). An al-
ternative is to use “sampled gradients”, i.e., gener-
ate a simulation trajectory (i0 , i1 , . . .), and change
r once in a while, in the direction of a simulation-
based estimate of ξ ′ (∆g + ∆P h).
• There is much recent research on this subject,
see e.g., the work of Marbach and Tsitsiklis, and
Konda and Tsitsiklis, and the refs given there.
6.231 DYNAMIC PROGRAMMING
OVERVIEW-EPILOGUE
LECTURE OUTLINE
• Value iteration
− Variations (asynchronous, modified, etc)
• Policy iteration
− Variations (asynchronous, based on value it-
eration, etc)
• Linear programming
• Elegant algorithmic analysis
• Curse of dimensionality is major bottleneck
INFINITE HORIZON PROBS - ADP