6.231 DYNAMIC PROGRAMMING
CAMBRIDGE, MASS
FALL 2015
DIMITRI P. BERTSEKAS
LECTURE 1
LECTURE OUTLINE
• Problem Formulation
• Examples
• The Basic Problem
• Significance of Feedback
2
DP AS AN OPTIMIZATION METHODOLOGY
min_{u ∈ U} g(u)
• Discrete-time system
xk+1 = fk (xk , uk , wk ), k = 0, 1, . . . , N − 1
− k: discrete time
− xk: state; summarizes past information that is relevant for future optimization
− uk: control; decision to be selected at time k from a given set
− wk: random parameter (also called disturbance or noise, depending on the context)
− N: horizon, or number of times control is applied
[Figure: inventory system: demand wk at period k, stock uk ordered at period k; cost of period k is r(xk) + c uk.]
• Discrete-time system
xk+1 = fk (xk , uk , wk ) = xk + uk − wk
• Cost function that is additive over time:
E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) } = E{ Σ_{k=0}^{N−1} ( c uk + r(xk + uk − wk) ) }
• Control constraint uk ∈ Uk(xk); the demand wk has a given probability distribution Pwk(· | xk, uk).
6
DETERMINISTIC FINITE-STATE PROBLEMS
[Figure: deterministic finite-state example (scheduling of four operations A, B, C, D): nodes are the sets of operations performed so far (initial state, A, C, AB, AC, CA, CD, CAB, CAD, CDA, ACB, ACD, ...), and arcs are labeled with the corresponding startup and switch-over costs (SA, SC, CAB, CCD, CBD, CDB, etc.).]
7
STOCHASTIC FINITE-STATE PROBLEMS
[Figure: stochastic finite-state example (a two-game chess match): states are the current scores (0-0, 1-0, 0-1, 1.5-0.5, 1-1, 0.5-1.5, 2-0, 0-2); transitions occur with probabilities pd and 1 − pd under timid play and pw and 1 − pw under bold play.]
8
BASIC PROBLEM
Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk( xk, µk(xk), wk ) }
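To make the additive-cost structure concrete, the following is a minimal Monte Carlo sketch that estimates Jπ(x0) for a fixed feedback policy in the inventory example above. The horizon, cost parameters, base-stock policy, and demand distribution are illustrative assumptions, not data from the notes.

import random

N, c = 5, 1.0                         # horizon and unit ordering cost (assumed)
def r(x):                             # holding/shortage cost (assumed piecewise linear)
    return 0.5 * x if x >= 0 else 3.0 * (-x)
def mu(k, x):                         # a simple base-stock policy: order up to 4
    return max(0, 4 - x)
def sample_w():                       # demand w_k (assumed uniform on {0,...,3})
    return random.choice([0, 1, 2, 3])

def estimate_cost(x0, trials=10000):
    total = 0.0
    for _ in range(trials):
        x, cost = x0, 0.0
        for k in range(N):
            u = mu(k, x)
            w = sample_w()
            cost += c * u + r(x + u - w)   # g_k(x_k, u_k, w_k)
            x = x + u - w                  # x_{k+1} = x_k + u_k - w_k
        total += cost                      # terminal cost g_N = 0 here
    return total / trials

print("estimated J_pi(2) =", estimate_cost(2))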
9
SIGNIFICANCE OF FEEDBACK
[Figure: closed-loop (feedback) control: the controller applies uk = µk(xk), and the system xk+1 = fk(xk, uk, wk) feeds the state xk back to the controller µk.]
[Figure: chess match example illustrating the value of feedback: the choice between timid play (draw probability pd) and bold play (win probability pw) in the second game depends on the first-game outcome (scores 0-0, 1-0, 0-1, 1-1, 0-2).]
10
VARIANTS OF DP PROBLEMS
• Continuous-time problems
• Imperfect state information problems
• Infinite horizon problems
• Suboptimal control
11
LECTURE BREAKDOWN
• Infinite Horizon Problems − Advanced (Vol. 2)
− Chs. 1, 2: Discounted problems − Computational methods (3 lectures)
− Ch. 3: Stochastic shortest path problems (2 lectures)
− Chs. 6, 7: Approximate DP (6 lectures)
12
COURSE ADMINISTRATION
13
A NOTE ON THESE SLIDES
14
6.231 DYNAMIC PROGRAMMING
LECTURE 2
LECTURE OUTLINE
15
BASIC PROBLEM
Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk( xk, µk(xk), wk ) }
16
PRINCIPLE OF OPTIMALITY
E{ gN(xN) + Σ_{k=i}^{N−1} gk( xk, µk(xk), wk ) }
[Figure: the scheduling example with numerical arc costs; the numbers next to the nodes are the optimal costs-to-go computed by the backward DP recursion starting from the terminal states.]
18
STOCHASTIC INVENTORY EXAMPLE
[Figure: inventory system: stock xk, order uk, and demand wk at period k; cost of period k is c uk + r(xk + uk − wk).]
• Start with
JN (xN ) = gN (xN ),
Jk*(xk) = min_{(µk, πk+1)} E_{wk,...,wN−1} { gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }
= min_{µk} E_{wk} { gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1} [ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) ] }
= min_{µk} E_{wk} { gk(xk, µk(xk), wk) + J*k+1( fk(xk, µk(xk), wk) ) }
= min_{µk} E_{wk} { gk(xk, µk(xk), wk) + Jk+1( fk(xk, µk(xk), wk) ) }
= min_{uk∈Uk(xk)} E_{wk} { gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) }
= Jk(xk)
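The recursion above translates directly into code. Below is a minimal backward-DP sketch for the inventory example; the finite state/control grids, cost parameters, and demand distribution are illustrative assumptions.

N = 3                                    # horizon
states = list(range(-5, 11))             # admissible stock levels (assumed bounded)
controls = range(0, 6)                   # admissible order quantities
demand = {0: 0.3, 1: 0.5, 2: 0.2}        # distribution of w_k
c = 1.0
def r(x):                                # holding/shortage cost (assumed)
    return 0.5 * x if x >= 0 else 3.0 * (-x)
def clip(x):                             # keep the next state on the grid
    return max(min(x, states[-1]), states[0])

J = {x: 0.0 for x in states}             # J_N(x_N) = g_N(x_N) = 0
policy = []
for k in reversed(range(N)):
    J_new, mu = {}, {}
    for x in states:
        best_u, best = None, float("inf")
        for u in controls:
            # E_{w_k} { c u_k + r(x_k + u_k - w_k) + J_{k+1}(x_k + u_k - w_k) }
            q = sum(p * (c * u + r(x + u - w) + J[clip(x + u - w)])
                    for w, p in demand.items())
            if q < best:
                best_u, best = u, q
        J_new[x], mu[x] = best, best_u
    J, policy = J_new, [mu] + policy

print("J_0(0) =", J[0], "  optimal order at x_0 = 0:", policy[0][0])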
21
LINEAR-QUADRATIC ANALYTICAL EXAMPLE
[Figure: two ovens in series: initial temperature x0, oven 1 raises it to x1 using heat u0, oven 2 raises it to the final temperature x2 using heat u1.]
• System: xk+1 = (1 − a)xk + a uk, k = 0, 1 (a a given scalar in (0, 1))
• DP algorithm:
J2(x2) = r(x2 − T)²
J1(x1) = min_{u1} [ u1² + r( (1 − a)x1 + a u1 − T )² ]
J0(x0) = min_{u0} [ u0² + J1( (1 − a)x0 + a u0 ) ]
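As a check on the last two equations (a worked step not on the original slide), minimizing u1² + r((1 − a)x1 + a u1 − T)² over u1 by setting the derivative to zero gives

u1* = r a (T − (1 − a)x1) / (1 + r a²),

and substituting back,

J1(x1) = r ( (1 − a)x1 − T )² / (1 + r a²),

which is again quadratic, so the same step can be repeated to obtain J0(x0) and the optimal u0 in closed form.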
22
STATE AUGMENTATION
LECTURE 3
LECTURE OUTLINE
24
DETERMINISTIC FINITE-STATE PROBLEM
[Figure: conversion to a shortest path problem: an artificial terminal node t is added, connected to each final-stage state by a terminal arc whose cost equals the terminal cost; s is the initial state node.]
25
BACKWARD AND FORWARD DP ALGORITHMS
• DP algorithm:
JN(i) = a^N_{it},   i ∈ SN,
Jk(i) = min_{j∈Sk+1} [ a^k_{ij} + Jk+1(j) ],   i ∈ Sk,  k = 0, . . . , N − 1
The optimal cost is J̃0(t) = min_{i∈SN} [ a^N_{it} + J̃1(i) ]
• View J̃k(j) as the optimal cost-to-arrive to state j from the initial state s
26
A NOTE ON FORWARD DP ALGORITHMS
27
GENERIC SHORTEST PATH PROBLEMS
• DP algorithm:
Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ],   k = 0, 1, . . . , N − 2,
28
EXAMPLE
[Figure: shortest path example over stages k = 0, 1, 2, 3, 4 with states i and a destination node; the numbers next to the nodes are the costs-to-go Jk(i) computed by the DP recursion, and the arcs carry the indicated costs.]
JN−1(i) = a_{it},   i = 1, 2, . . . , N,
Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ],   k = 0, 1, . . . , N − 2.
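A minimal sketch of this recursion on a small graph; the node set, arc costs a_{ij} (with a_{ii} = 0 allowing degenerate "stay" moves), and the costs a_{it} to the destination are illustrative assumptions.

import math

INF = math.inf
a = {1: {1: 0.0, 2: 2.0, 3: 1.0},        # arc costs a_{ij} among nodes 1, ..., N
     2: {1: INF, 2: 0.0, 3: 5.0},
     3: {1: INF, 2: 0.5, 3: 0.0}}
a_to_t = {1: 7.0, 2: 3.0, 3: 4.0}        # arc costs a_{it} to the destination t
nodes = [1, 2, 3]
N = len(nodes)

J = dict(a_to_t)                         # J_{N-1}(i) = a_{it}
for k in range(N - 2, -1, -1):           # k = N-2, ..., 0
    J = {i: min(a[i][j] + J[j] for j in nodes) for i in nodes}

print(J)                                 # J_0(i): shortest distance from i to t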
29
ESTIMATION / HIDDEN MARKOV MODELS
[Figure: trellis of the hidden Markov chain: state sequences {x0, x1, . . . , xN} correspond to paths from an initial node s to a terminal node t.]
VITERBI ALGORITHM
• We have
p(XN | ZN) = p(XN, ZN) / p(ZN)
where p(XN, ZN) and p(ZN) are the unconditional probabilities of occurrence of (XN, ZN) and ZN
• Maximizing p(XN | ZN) is equivalent to maximizing ln( p(XN, ZN) )
• We have (using the "multiplication rule" for conditional probabilities)
p(XN, ZN) = πx0 Π_{k=1}^N p_{xk−1 xk} r(zk; xk−1, xk)
so the problem is to
minimize −ln(πx0) − Σ_{k=1}^N ln( p_{xk−1 xk} r(zk; xk−1, xk) )
over all possible sequences {x0, x1, . . . , xN}.
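A minimal Viterbi sketch for this shortest-path formulation, i.e., minimizing −ln(πx0) − Σk ln(p_{xk−1 xk} r(zk; xk−1, xk)) by dynamic programming over the trellis. The two-state chain, prior, transition probabilities, and observation model are illustrative assumptions.

import math

states = ["A", "B"]
pi0 = {"A": 0.6, "B": 0.4}                        # prior pi_{x0}
p = {"A": {"A": 0.7, "B": 0.3},                   # transition probs p_{x_{k-1} x_k}
     "B": {"A": 0.4, "B": 0.6}}
def r(z, xprev, x):                               # observation probs r(z_k; x_{k-1}, x_k)
    return 0.8 if z == x else 0.2

Z = ["A", "A", "B"]                               # observed sequence z_1, ..., z_N

D = {x: -math.log(pi0[x]) for x in states}        # shortest length of paths ending at x
parents = []
for z in Z:
    D_new, par = {}, {}
    for x in states:
        cand = {xp: D[xp] - math.log(p[xp][x] * r(z, xp, x)) for xp in states}
        xbest = min(cand, key=cand.get)
        D_new[x], par[x] = cand[xbest], xbest
    D, parents = D_new, parents + [par]

x = min(D, key=D.get)                             # most likely terminal state x_N
seq = [x]
for par in reversed(parents):                     # backtrack x_{N-1}, ..., x_0
    x = par[x]
    seq.append(x)
print(list(reversed(seq)))                        # most likely sequence x_0, ..., x_N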
[Figure: a shortest path / tree example with origin node A, successor nodes AB, AC, AD, and the indicated arc costs (5, 1, 15, 20, 4, 3, . . .).]
32
LABEL CORRECTING METHODS
Is di + aij < dj ? (Is the path s → i → j better than the current path s → j ?)
[Figure: node i is removed from OPEN and the test is applied to each of its neighbors j.]
34
EXAMPLE
[Figure: label correcting example: origin node s = A (node 1), successor nodes AB (node 2), AC (node 7), AD (node 10), and the indicated arc costs (5, 1, 15, 20, 4, 3, . . .).]
Iter.   Node exiting OPEN   OPEN after iteration   UPPER
0       -                   1                      ∞
1       1                   2, 7, 10               ∞
2       2                   3, 5, 7, 10            ∞
3       3                   4, 5, 7, 10            ∞
4       4                   5, 7, 10               43
5       5                   6, 7, 10               43
6       6                   7, 10                  13
7       7                   8, 10                  13
8       8                   9, 10                  13
9       9                   10                     13
10      10                  Empty                  13
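A minimal sketch of the generic label correcting iteration (the test di + aij < dj above, together with the UPPER bound refinement); the graph, arc costs, origin s, and destination t are illustrative assumptions, and OPEN is managed as a simple stack (a depth-first variant).

import math

arcs = {"s": {"a": 1.0, "b": 4.0}, "a": {"b": 1.5, "t": 5.0}, "b": {"t": 1.0}}

def label_correcting(s, t):
    d = {s: 0.0}                     # labels d_j: shortest length found so far
    UPPER = math.inf                 # length of the best path to t found so far
    OPEN = [s]
    while OPEN:
        i = OPEN.pop()               # remove a node from OPEN
        for j, aij in arcs.get(i, {}).items():
            # is the path s -> i -> j better than the current path s -> j,
            # and can it still lead to a path shorter than UPPER?
            if d[i] + aij < min(d.get(j, math.inf), UPPER):
                d[j] = d[i] + aij
                if j == t:
                    UPPER = d[j]
                else:
                    OPEN.append(j)
    return UPPER

print(label_correcting("s", "t"))    # shortest distance from s to t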
36
6.231 DYNAMIC PROGRAMMING
LECTURE 4
LECTURE OUTLINE
37
LINEAR-QUADRATIC PROBLEMS
• System: xk+1 = Ak xk + Bk uk + wk
• Quadratic cost
E_{wk, k=0,1,...,N−1} { x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }
• Optimal policy: µ*k(xk) = Lk xk
− Similar treatment of a number of variants
38
DERIVATION
KN = QN ,
xk+1 = (A + BL)xk + wk
40
GRAPHICAL PROOF FOR SCALAR SYSTEMS
[Figure: graphical convergence of the Riccati iteration: the function F(P) is plotted against the 45-degree line; the iterates Pk, Pk+1 converge to the fixed point P*.]
Pk+1 = A²Pk − A²B²Pk² / (B²Pk + R) + Q,
where
F(P) = A²P − A²B²P² / (B²P + R) + Q = A²RP / (B²P + R) + Q
• DP algorithm with random system matrices Ak, Bk:
JN(xN) = x′N QN xN,
Jk(xk) = min_{uk} E_{wk, Ak, Bk} { x′k Qk xk + u′k Rk uk + Jk+1( Ak xk + Bk uk + wk ) }
where
Lk = −( Rk + E{B′k Kk+1 Bk} )^{−1} E{B′k Kk+1 Ak},
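A minimal sketch of the Riccati recursion behind µ*k(xk) = Lk xk, written for a scalar, time-invariant system xk+1 = A xk + B uk + wk; the numerical values are illustrative assumptions.

A, B, Q, R, QN, N = 1.0, 0.5, 1.0, 0.25, 1.0, 20   # assumed problem data

K = QN                                             # K_N = Q_N
gains = []
for k in range(N - 1, -1, -1):
    L = -(B * K * A) / (R + B * K * B)             # L_k = -(R + B'K_{k+1}B)^{-1} B'K_{k+1}A
    K = A * K * A - (A * K * B) ** 2 / (R + B * K * B) + Q   # Riccati update: K_k from K_{k+1}
    gains.append(L)

print("K_0 =", K, "  L_0 =", gains[-1])            # K_k converges as the horizon grows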
42
PROPERTIES
450
R
- 2 0 P
E{B }
43
INVENTORY CONTROL
xk+1 = xk + uk − wk , k = 0, 1, . . . , N − 1
• Minimize
E{ Σ_{k=0}^{N−1} ( c uk + H(xk + uk) ) }
where H(xk + uk) denotes the expected holding/shortage cost of period k.
• DP algorithm:
JN(xN) = 0,
Jk(xk) = min_{uk≥0} [ c uk + H(xk + uk) + E{ Jk+1(xk + uk − wk) } ]
44
OPTIMAL POLICY
• An order-up-to (base-stock) policy is optimal: order Sk − xk if xk < Sk, and order nothing otherwise, where Sk minimizes
Gk(y) = cy + H(y) + E{ Jk+1(y − w) };
the argument uses the convexity of Jk and the fact that lim_{|x|→∞} Jk(x) = ∞.
45
JUSTIFICATION
[Figure: the functions H(y) and cy + H(y), the latter minimized at SN−1, and the resulting cost-to-go JN−1(xN−1), which has slope −c for xN−1 below SN−1.]
46
6.231 DYNAMIC PROGRAMMING
LECTURE 5
LECTURE OUTLINE
• Stopping problems
• Scheduling problems
• Minimax Control
47
PURE STOPPING PROBLEMS
[Figure: the state space is divided into a continue region and a stop region, with an absorbing stop state.]
48
EXAMPLE: ASSET SELLING
JN(xN) = xN if xN ≠ T,  0 if xN = T,
Jk(xk) = max[ (1 + r)^{N−k} xk, E{ Jk+1(wk) } ] if xk ≠ T,  0 if xk = T.
• Optimal policy:
49
FURTHER ANALYSIS
[Figure: the acceptance thresholds a1, a2, . . . , aN−1 as a function of k: accept the offer xk if it lies above the threshold (ACCEPT region), reject it otherwise (REJECT region).]
We have αk = Ew Vk+1 (w) /(1 + r), so it is enough
to show that Vk (x) ≥ Vk+1 (x) for all x and k. Start
with VN −1 (x) ≥ VN (x) and use the monotonicity
property of DP. Q.E.D.
50
GENERAL STOPPING PROBLEMS
T0 ⊂ · · · ⊂ Tk ⊂ Tk+1 ⊂ · · · ⊂ TN −1 .
• Interesting case is when all the Tk are equal (to
TN −1 , the set where it is better to stop than to go
one step and stop). Can be shown to be true if
51
SCHEDULING PROBLEMS
52
EXAMPLE: THE QUIZ PROBLEM
• Answer questions in order of decreasing index pi Ri/(1 − pi): question i should be answered before question j if pi Ri/(1 − pi) ≥ pj Rj/(1 − pj).
53
MINIMAX CONTROL
JN(xN) = gN(xN),
Jk(xk) = min_{uk∈U(xk)} max_{wk∈Wk(xk,uk)} [ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) ]
54
DERIVATION OF MINIMAX DP ALGORITHM
55
UNKNOWN-BUT-BOUNDED CONTROL
ḡk(xk) = 0 if xk ∈ Xk,  1 if xk ∉ Xk.
• We must reach at time k the set
X̄k = { xk | J̄k(xk) = 0 }
56
6.231 DYNAMIC PROGRAMMING
LECTURE 6
LECTURE OUTLINE
57
BASIC PROBL. W/ IMPERFECT STATE INFO
58
INFORMATION VECTOR AND POLICIES
Ik = (z0 , z1 , . . . , zk , u0 , u1 , . . . , uk−1 ), k ≥ 1,
I0 = z 0
59
REFORMULATION AS PERFECT INFO PROBL.
• System: We have
Ik+1 = (Ik , zk+1 , uk ), k = 0, 1, . . . , N − 2, I0 = z 0
P (zk+1 | Ik , uk ) = P (zk+1 | Ik , uk , z0 , z1 , . . . , zk ),
60
DP ALGORITHM
∗
J = E J0 (z0 )
z0
61
LINEAR-QUADRATIC PROBLEMS
• System: xk+1 = Ak xk + Bk uk + wk
• Quadratic cost
( N −1
)
X
E x′N QN xN + (xk′ Qk xk + uk′ Rk uk )
w k
k=0,1,...,N −1 k=0
zk = Ck xk + vk , k = 0, 1, . . . , N − 1
62
DP ALGORITHM I
• The last-stage minimization involves
E{ u′N−1 R uN−1 + (A xN−1 + B uN−1 + wN−1)′ Q (A xN−1 + B uN−1 + wN−1) | IN−1, uN−1 }
The only term coupling uN−1 with the unobserved xN−1 is the cross term 2 E{xN−1 | IN−1}′ A′QB uN−1, so the minimizing control is µ*N−1(IN−1) = LN−1 E{xN−1 | IN−1},
where
LN−1 = −(B′QB + R)^{−1} B′QA
63
DP ALGORITHM II
JN−1(IN−1) = E_{xN−1}{ x′N−1 KN−1 xN−1 | IN−1 }
+ E_{xN−1}{ ( xN−1 − E{xN−1 | IN−1} )′ PN−1 ( xN−1 − E{xN−1 | IN−1} ) | IN−1 }
+ E_{wN−1}{ w′N−1 QN wN−1 },
where
PN−1 = A′N−1 QN BN−1 ( RN−1 + B′N−1 QN BN−1 )^{−1} B′N−1 QN AN−1,
KN−1 = A′N−1 QN AN−1 − PN−1 + QN−1
• Note the middle term, which involves the estimation error xN−1 − E{xN−1 | IN−1}.
64
DP ALGORITHM III
65
QUALITY OF ESTIMATION LEMMA
E{ ξ′N−1 PN−1 ξN−1 | IN−2, uN−2 } = E{ ξ′N−1 PN−1 ξN−1 | IN−2 }
and is independent of uN−2.
• So the minimization in the DP algorithm yields
u*N−2 = µ*N−2(IN−2) = LN−2 E{xN−2 | IN−2}
66
FINAL RESULT
KN = QN , Kk = A′k Kk+1 Ak − Pk + Qk ,
[Figure: structure of the optimal controller: an estimator computes E{xk | Ik} from the measurements zk = Ck xk + vk and the delayed control uk−1; the control uk = Lk E{xk | Ik} is applied to the system xk+1 = Ak xk + Bk uk + wk.]
67
SEPARATION INTERPRETATION
68
STEADY STATE/IMPLEMENTATION ASPECTS
69
6.231 DYNAMIC PROGRAMMING
LECTURE 7
LECTURE OUTLINE
70
REVIEW: IMPERFECT STATE INFO PROBLEM
I0 = z0 , Ik = (z0 , z1 , . . . , zk , u0 , u1 , . . . , uk−1 ), k ≥ 1
71
DP ALGORITHM
• DP algorithm:
Jk(Ik) = min_{uk∈Uk} E_{xk, wk, zk+1} [ gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk ]
J* = E_{z0}{ J0(z0) }.
72
SUFFICIENT STATISTICS
µ∗k (Ik )
= µk Sk (Ik ) ,
73
DP ALGORITHM IN TERMS OF P_{xk | Ik}
[Figure: the estimator φk−1 produces the conditional distribution P_{xk | Ik} from the measurement zk = hk(xk, uk−1, vk) and the previous control uk−1; the actuator µk maps P_{xk | Ik} to the control uk applied to the system xk+1 = fk(xk, uk, wk).]
74
EXAMPLE: A SEARCH PROBLEM
with p0 given.
• pk evolves at time k according to the equation
pk+1 = pk if not search,
       0 if search and find treasure,
       pk(1 − β) / ( pk(1 − β) + 1 − pk ) if search and no treasure.
75
SEARCH PROBLEM (CONTINUED)
• DP algorithm
J̄k(pk) = max[ 0, −C + pk βV + (1 − pk β) J̄k+1( pk(1 − β) / ( pk(1 − β) + 1 − pk ) ) ],
with J N (pN ) = 0.
• It can be shown by induction that the functions J̄k satisfy
J̄k(pk) = 0 if pk ≤ C/(βV),   J̄k(pk) > 0 if pk > C/(βV).
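A minimal sketch of this recursion on the belief p; the search cost C, treasure value V, detection probability β, and horizon are illustrative assumptions. The printed values illustrate the threshold C/(βV) derived above.

C, V, beta, N = 1.0, 10.0, 0.6, 10                 # assumed problem data

def next_p(p):                                     # belief after an unsuccessful search
    return p * (1 - beta) / (p * (1 - beta) + 1 - p)

memo = {}
def J(k, p):
    if k == N:
        return 0.0                                 # J_N(p_N) = 0
    key = (k, round(p, 9))
    if key not in memo:
        search = -C + p * beta * V + (1 - p * beta) * J(k + 1, next_p(p))
        memo[key] = max(0.0, search)               # J_k(p) = max[0, -C + p*beta*V + (1 - p*beta) J_{k+1}(...)]
    return memo[key]

for p in (0.05, 0.1, 0.2, 0.5):                    # threshold C/(beta*V) = 1/6
    print(p, J(0, p))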
76
FINITE-STATE SYSTEMS - POMDP
77
INSTRUCTION EXAMPLE I
1 1
L L R
t r
L L R
1-t 1-r
78
INSTRUCTION EXAMPLE II
pk = P (xk = L | z0 , z1 , . . . , zk ).
• DP algorithm:
J̄k(pk) = min[ (1 − pk)C, I + E_{zk+1}{ J̄k+1( Φ(pk, zk+1) ) } ]
starting with
J̄N−1(pN−1) = min[ (1 − pN−1)C, I + (1 − t)(1 − pN−1)C ].
79
INSTRUCTION EXAMPLE III
I + A N - 1(p)
C
I + A N - 2(p)
I + A N - 3(p)
LECTURE 8
LECTURE OUTLINE
• Suboptimal control
• Cost approximation methods: Classification
• Certainty equivalent control: An example
• Limited lookahead policies
• Performance bounds
• Problem approximation approach
• Parametric cost-to-go approximation
81
PRACTICAL DIFFICULTIES OF DP
82
COST-TO-GO FUNCTION APPROXIMATION
83
CERTAINTY EQUIVALENT CONTROL (CEC)
84
EQUIVALENT OFF-LINE IMPLEMENTATION
• Let µd0 (x0 ), . . . , µdN −1 (xN −1 )
be an optimal con-
troller obtained from the DP algorithm for the de-
terministic problem
N −1
X
minimize gN (xN ) + gk xk , µk (xk ), wk
k=0
subject to xk+1 = fk xk , µk (xk ), wk , µk (xk ) ∈ Uk
85
PARTIALLY STOCHASTIC CEC
86
GENERAL COST-TO-GO APPROXIMATION
where
− J˜N = gN .
− J˜k+1 : approximation to true cost-to-go Jk+1
• Two-step lookahead policy: At each k and
xk , use the control µ̃k (xk ) attaining the minimum
above, where the function J˜k+1 is obtained using a
1SL approximation (solve a 2-step DP problem).
• If J˜k+1 is readily available and the minimiza-
tion above is not too hard, the 1SL policy is im-
plementable on-line.
• Sometimes one also replaces Uk (xk ) above with
a subset of “most promising controls” U k (xk ).
• As the length of lookahead increases, the re-
quired computation quickly explodes.
87
PERFORMANCE BOUNDS FOR 1SL
Ĵk(xk) = min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },
88
COMPUTATIONAL ASPECTS
89
PROBLEM APPROXIMATION
90
PARAMETRIC COST-TO-GO APPROXIMATION
91
APPROXIMATION ARCHITECTURES
J̃(x, r) = φ(x)′r = Σ_{j=1}^m φj(x) rj
[Figure: linear feature-based architecture: state x → feature extraction mapping → feature vector φ(x) → linear cost approximator φ(x)′r.]
92
AN EXAMPLE - COMPUTER CHESS
[Figure: chess position evaluator: feature extraction (material balance, mobility, safety, etc.) followed by a weighting of the features produces a score.]
93
ANOTHER EXAMPLE - AGGREGATION
94
AN EXAMPLE: REPRESENTATIVE SUBSETS
[Figure: the state space partitioned into aggregate states/subsets S1, . . . , S8; each original state x is assigned aggregation weights φx1, φx2, . . . with respect to these subsets.]
• Common case: Each Sj is a group of states with
“similar characteristics”
• Compute a “cost” rj for each aggregate state
Sj (using some method)
• Approximate the optimal cost of each original system state x with Σ_{j=1}^m φxj rj
95
6.231 DYNAMIC PROGRAMMING
LECTURE 9
LECTURE OUTLINE
• Rollout algorithms
• Policy improvement property
• Discrete deterministic problems
• Approximations of rollout algorithms
• Model Predictive Control (MPC)
• Discretization of continuous time
• Discretization of continuous space
• Other suboptimal approaches
96
ROLLOUT ALGORITHMS
where
− J˜N = gN .
− J˜k+1 : approximation to true cost-to-go Jk+1
• Rollout algorithm: When J˜k is the cost-to-go of
some heuristic policy (called the base policy)
• Policy improvement property (to be shown):
The rollout algorithm achieves no worse (and usu-
ally much better) cost than the base heuristic start-
ing from the same state.
• Main difficulty: Calculating J˜k (xk ) may be com-
putationally intensive if the cost-to-go of the base
policy cannot be analytically calculated.
− May involve Monte Carlo simulation if the
problem is stochastic.
− Things improve in the deterministic case.
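A minimal sketch of one rollout step: a one-step lookahead in which the cost-to-go of the base heuristic is estimated by Monte Carlo simulation. The problem data (f, g, U, base_policy, horizon N) are placeholders to be supplied by the user, and the disturbance model inside is an illustrative assumption.

import random

def rollout_control(k, x, f, g, U, base_policy, N, n_sims=100):
    """Return the rollout control at state x and stage k."""
    def sample_w():
        return random.choice([0, 1, 2])            # assumed disturbance model

    def heuristic_cost(j, xj):                     # Monte Carlo estimate of H_j(x_j),
        total = 0.0                                # the base policy's cost-to-go
        for _ in range(n_sims):
            xs, cost = xj, 0.0
            for i in range(j, N):
                u = base_policy(i, xs)
                w = sample_w()
                cost += g(i, xs, u, w)
                xs = f(i, xs, u, w)
            total += cost
        return total / n_sims

    best_u, best = None, float("inf")
    for u in U(k, x):                              # one-step lookahead minimization
        val = 0.0
        for _ in range(n_sims):
            w = sample_w()
            val += g(k, x, u, w) + heuristic_cost(k + 1, f(k, x, u, w))
        val /= n_sims
        if val < best:
            best_u, best = u, val
    return best_u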
97
EXAMPLE: THE QUIZ PROBLEM
98
COST IMPROVEMENT PROPERTY
• Let
J k (xk ): Cost-to-go of the rollout policy
= Hk (xk )
99
DISCRETE DETERMINISTIC PROBLEMS
AB AC AD
100
EXAMPLE: THE BREAKTHROUGH PROBLEM
root
101
DET. EXAMPLE: ONE-DIMENSIONAL WALK
_
(N,-N) (N,0) i (N,N)
g(i)
_
-N 0 i N-2 N i
102
A ROLLOUT ISSUE FOR DISCRETE PROBLEMS
103
ROLLING HORIZON WITH ROLLOUT
104
MODEL PREDICTIVE CONTROL (MPC)
106
SPACE DISCRETIZATION
107
SPACE DISCRETIZATION/AGGREGATION
108
OTHER SUBOPTIMAL APPROACHES
LECTURE 10
LECTURE OUTLINE
110
TYPES OF INFINITE HORIZON PROBLEMS
J*(x) = min_{u∈U(x)} E_w{ g(x, u, w) + J*( f(x, u, w) ) }
113
STOCHASTIC SHORTEST PATH PROBLEMS
114
FINITENESS OF POLICY COST FUNCTIONS
• View ρ = max_π ρπ < 1. Then
P{ x2m ≠ t | x0 = i, π } = P{ x2m ≠ t | xm ≠ t, x0 = i, π } × P{ xm ≠ t | x0 = i, π } ≤ ρ²
and similarly
P{ xkm ≠ t | x0 = i, π } ≤ ρ^k,   i = 1, . . . , n
So the expected cost incurred between times km and (k + 1)m − 1 is at most
m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)|
and
Jπ(i) ≤ Σ_{k=0}^∞ m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)| = ( m / (1 − ρ) ) max_{i=1,...,n, u∈U(i)} |g(i, u)|
115
MAIN RESULT
J ∗ (t) = 0
• A stationary policy µ is optimal if and only
if for every state i, µ(i) attains the minimum in
Bellman’s equation.
• Key proof idea: The “tail” of the cost series,
E{ Σ_{k=mK}^∞ g( xk, µk(xk) ) }
vanishes as K increases to ∞.
116
OUTLINE OF PROOF THAT JN → J ∗
Jπ(x0) = E{ Σ_{k=0}^{mK−1} g( xk, µk(xk) ) } + E{ Σ_{k=mK}^∞ g( xk, µk(xk) ) }
≤ E{ Σ_{k=0}^{mK−1} g( xk, µk(xk) ) } + Σ_{k=K}^∞ ρ^k m max_{i,u} |g(i, u)|
Taking the minimum over π,
J*(x0) ≤ JmK(x0) + ( ρ^K / (1 − ρ) ) m max_{i,u} |g(i, u)|.
Similarly, we have
JmK(x0) − ( ρ^K / (1 − ρ) ) m max_{i,u} |g(i, u)| ≤ J*(x0).
117
EXAMPLE
g(i, u) = 1, ∀ i = 1, . . . , n, u ∈ U (i)
118
6.231 DYNAMIC PROGRAMMING
LECTURE 11
LECTURE OUTLINE
119
STOCHASTIC SHORTEST PATH PROBLEMS
Jπ(i) = lim_{N→∞} E{ Σ_{k=0}^{N−1} g( xk, µk(xk) ) | x0 = i }
120
MAIN RESULT
121
BELLMAN’S EQ. FOR A SINGLE POLICY
122
POLICY ITERATION
123
JUSTIFICATION OF POLICY ITERATION
124
LINEAR PROGRAMMING
125
LINEAR PROGRAMMING (CONTINUED)
• Obtain J* by maximizing Σ_{i=1}^n J(i) subject to
J(i) ≤ g(i, u) + Σ_{j=1}^n pij(u) J(j),   i = 1, . . . , n,  u ∈ U(i)
126
DISCOUNTED PROBLEMS
127
DISCOUNTED PROBLEM EXAMPLE
LECTURE 12
LECTURE OUTLINE
129
AVERAGE COST PER STAGE PROBLEM
131
CONNECTION WITH SSP (CONTINUED)
h∗ (n) = 0
• If µ∗ (i) attains the min for each i, µ∗ is optimal.
• There is also Bellman Eq. for a single policy µ.
133
MORE ON THE CONNECTION WITH SSP
134
EXAMPLE (CONTINUED)
Also h∗ (0) = 0.
• Optimal policy: Process i unfilled orders if
135
VALUE ITERATION
since Jk (i) and Jk∗ (i) are optimal costs for two
k-stage problems that differ only in the terminal
cost functions, which are J0 and h∗ .
136
RELATIVE VALUE ITERATION
j=1
138
6.231 DYNAMIC PROGRAMMING
LECTURE 13
LECTURE OUTLINE
139
CONTINUOUS-TIME MARKOV CHAINS
140
PROBLEM FORMULATION
P{ tk+1 − tk ≤ τ | xk = i, xk+1 = j, uk = u } = Qij(τ, u) / pij(u)
Thus Qij(τ, u) can be viewed as a "scaled CDF"
141
EXPONENTIAL TRANSITION DISTRIBUTIONS
142
COST STRUCTURES
143
DISCOUNTED CASE - COST CALCULATION
147
MANUFACTURER’S EXAMPLE REVISITED
where
γ = Σ_{j=1}^n ∫_0^∞ ( (1 − e^{−βτ}) / β ) dQij(τ, u) = ∫_0^{τmax} (1 − e^{−βτ}) / (β τmax) dτ
where
α = ∫_0^∞ e^{−βτ} dQij(τ, u) = ∫_0^{τmax} ( e^{−βτ} / τmax ) dτ = (1 − e^{−β τmax}) / (β τmax)
149
AVERAGE COST
• Minimize
lim_{N→∞} E{ ∫_0^{tN} g( x(t), u(t) ) dt } / E{ tN }
assuming there is a special state that is "recurrent under all policies"
• Total expected cost of a transition
G(i, u) = g(i, u)τ i (u),
where τ i (u): Expected transition time.
• We apply the SSP argument used for the discrete-
time case.
− Divide trajectory into cycles marked by suc-
cessive visits to n.
− The cost at (i, u) is G(i, u) − λ∗ τ i (u), where
λ∗ is the optimal expected cost per unit time.
− Each cycle is viewed as a state trajectory of
a corresponding SSP problem with the ter-
mination state being essentially n.
• So Bellman’s Eq. for the average cost problem:
h*(i) = min_{u∈U(i)} [ G(i, u) − λ* τ̄i(u) + Σ_{j=1}^n pij(u) h*(j) ]
150
MANUFACTURER EXAMPLE/AVERAGE COST
G(i, Fill) = 0,   G(i, Not Fill) = c i τmax / 2
and there is also the "instantaneous" cost
• Bellman's equation:
h*(i) = min[ K − λ* τmax/2 + h*(1),  c i τmax/2 − λ* τmax/2 + h*(i + 1) ]
151
6.231 DYNAMIC PROGRAMMING
LECTURE 14
LECTURE OUTLINE
xk+1 = f (xk , uk , wk ), k = 0, 1, . . .
• We have
Jπ(x0) ≤ M + αM + α²M + ··· = M / (1 − α),   ∀ x0
153
WE ADOPT “SHORTHAND” NOTATION
Tµ J = gµ + αPµ J, T J = min Tµ J
µ
154
“SHORTHAND” COMPOSITION NOTATION
Jπ (x) = lim (Tµ0 Tµ1 · · · Tµk J0 )(x), Jµ (x) = lim (Tµk J0 )(x)
k→∞ k→∞
• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ
• Optimality condition:
µ: optimal <==> Tµ J ∗ = T J ∗
156
SOME KEY PROPERTIES
• Monotonicity: if J ≤ J′, then (T J)(x) ≤ (T J′)(x) for all x; in particular,
J ≤ T J  ⇒  T^k J ≤ T^{k+1} J,  ∀ k
• If J0 ≡ 0,
Jπ(x0) = E{ Σ_{k=0}^∞ α^k g( xk, µk(xk), wk ) }
= E{ Σ_{k=0}^{N−1} α^k g( xk, µk(xk), wk ) } + E{ Σ_{k=N}^∞ α^k g( xk, µk(xk), wk ) }
from which
Jπ(x0) − α^N M/(1 − α) ≤ ( Tµ0 ··· TµN−1 J0 )(x0) ≤ Jπ(x0) + α^N M/(1 − α),
J*(x) − α^N M/(1 − α) ≤ ( T^N J0 )(x) ≤ J*(x) + α^N M/(1 − α),
( T J* )(x) − α^{N+1} M/(1 − α) ≤ ( T^{N+1} J0 )(x) ≤ ( T J* )(x) + α^{N+1} M/(1 − α)
Take the limit as N → ∞ to obtain J* = T J*. Q.E.D.
159
THE CONTRACTION PROPERTY
max_x | (Tµ J)(x) − (Tµ J′)(x) | ≤ α max_x | J(x) − J′(x) |.
Proof: Denote c = max_{x∈S} | J(x) − J′(x) |. Then
J(x) − c ≤ J′(x) ≤ J(x) + c,   ∀ x,
so, applying T and using the monotonicity and constant-shift properties,
(T J)(x) − αc ≤ (T J′)(x) ≤ (T J)(x) + αc,   ∀ x.
Hence
| (T J)(x) − (T J′)(x) | ≤ αc,   ∀ x.
160
IMPLICATIONS OF CONTRACTION PROPERTY
Proof: Use
Proof: Use
max_x | (T^k J)(x) − J*(x) | = max_x | (T^k J)(x) − (T^k J*)(x) | ≤ α^k max_x | J(x) − J*(x) |
161
NEC. AND SUFFICIENT OPT. CONDITION
T J ∗ = Tµ J ∗ .
J ∗ = Tµ J ∗ ,
J ∗ = Tµ J ∗ .
162
COMPUTATIONAL METHODS - AN OVERVIEW
where
where
Q*(i, u) = Σ_{j=1}^n pij(u) ( g(i, u, j) + α J*(j) )
or Jk+1 = T Jk, Qk+1 = F Qk.
• Equal amount of computation ... just more
storage.
• Having optimal Q-factors is convenient when implementing an optimal policy on-line by µ*(i) ∈ arg min_{u∈U(i)} Q*(i, u)
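A minimal sketch of value iteration Jk+1 = T Jk for a small discounted finite-state problem, followed by the computation of Q-factors and of a policy that attains the minimum on-line; the transition probabilities and costs are illustrative assumptions.

alpha = 0.9
n, controls = 2, [0, 1]
p = {0: {0: [0.8, 0.2], 1: [0.3, 0.7]},            # p_{ij}(u), assumed
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
g = {0: {0: [1.0, 2.0], 1: [0.5, 3.0]},            # g(i, u, j), assumed
     1: {0: [2.0, 1.0], 1: [4.0, 0.5]}}

J = [0.0] * n
for _ in range(500):                               # J_{k+1} = T J_k
    J = [min(sum(p[i][u][j] * (g[i][u][j] + alpha * J[j]) for j in range(n))
             for u in controls)
         for i in range(n)]

Q = {(i, u): sum(p[i][u][j] * (g[i][u][j] + alpha * J[j]) for j in range(n))
     for i in range(n) for u in controls}          # Q*(i, u) from the converged J
policy = [min(controls, key=lambda u: Q[(i, u)]) for i in range(n)]
print(J, policy)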
LECTURE 15
LECTURE OUTLINE
166
DISCOUNTED PROBLEMS/BOUNDED COST
xk+1 = f (xk , uk , wk ), k = 0, 1, . . .
167
“SHORTHAND” THEORY – A SUMMARY
Jπ (x) = lim (Tµ0 Tµ1 · · · Tµk J0 )(x), Jµ (x) = lim (Tµk J0 )(x)
k→∞ k→∞
• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ
• Optimality condition:
µ: optimal <==> Tµ J ∗ = T J ∗
168
MAJOR PROPERTIES
• Monotonicity: if J ≤ J′, then (T J)(x) ≤ (T J′)(x), ∀ x ∈ X,
170
A DP-LIKE CONTRACTION MAPPING
171
CONTRACTION MAPPING FIXED-POINT TH.
‖F^k J − J*‖ ≤ ρ^k ‖J − J*‖,   k = 1, 2, . . .
172
GENERAL FORMS OF DISCOUNTED DP
• Contraction assumption:
− For every J ∈ B(X), the functions Tµ J and
T J belong to B(X).
− For some α ∈ (0, 1) and all J, J ′ ∈ B(X), H
satisfies
| H(x, u, J) − H(x, u, J′) | ≤ α max_{y∈X} | J(y) − J′(y) |
173
EXAMPLES
• Discounted problems
H(x, u, J) = E g(x, u, w) + αJ f (x, u, w)
174
RESULTS USING CONTRACTION
175
RESULTS USING MON. AND CONTRACTION I
Also
T k J ≤ Tµ0 · · · Tµk−1 J
Take limit as k → ∞ to obtain J ≤ Jπ for all
π ∈ Π.
177
6.231 DYNAMIC PROGRAMMING
LECTURE 16
LECTURE OUTLINE
178
DISCOUNTED PROBLEMS
xk+1 = f (xk , uk , wk ), k = 0, 1, . . .
179
“SHORTHAND” THEORY – A SUMMARY
• Bellman’s equation: J ∗ = T J ∗ , Jµ = Tµ Jµ
• Optimality condition:
µ: optimal <==> Tµ J ∗ = T J ∗
• Contraction: kT J1 − T J2 k ≤ αkJ1 − J2 k
• Value iteration: For any (bounded) J
J ∗ = lim T k J
k→∞
180
INTERPRETATION OF VI AND PI
[Figures: geometric interpretation of value iteration (the iterates J0, T J0, T²J0, . . . converge to J* = T J* along the 45-degree line) and of policy iteration (alternating exact policy evaluation and policy improvement).]
APPROXIMATE PI
lim sup_{k→∞} max_{x∈S} | Jµk(x) − J*(x) | ≤ (ǫ + 2αδ) / (1 − α)²
183
OPTIMISTIC PI
− If mk ≡ 1 it becomes VI
− If mk = ∞ it becomes PI
− For intermediate values of mk , it is generally
more efficient than either VI or PI
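A minimal sketch of optimistic policy iteration that makes the role of mk explicit: each cycle performs one policy improvement and then m value-iteration sweeps with the current policy (m = 1 reduces to VI, large m approaches PI). It reuses the illustrative MDP data layout of the value-iteration sketch earlier (p[i][u][j], g[i][u][j]).

def optimistic_pi(p, g, controls, alpha=0.9, m=5, cycles=50):
    n = len(p)
    J = [0.0] * n
    for _ in range(cycles):
        # policy improvement: mu(i) attains the minimum in Bellman's equation at J
        mu = [min(controls,
                  key=lambda u: sum(p[i][u][j] * (g[i][u][j] + alpha * J[j])
                                    for j in range(n)))
              for i in range(n)]
        # partial policy evaluation: m applications of T_mu starting from the current J
        for _ in range(m):
            J = [sum(p[i][mu[i]][j] * (g[i][mu[i]][j] + alpha * J[j])
                     for j in range(n))
                 for i in range(n)]
    return J, mu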
[Figure: geometric interpretation of optimistic PI: partial policy evaluation steps with Tµ are interleaved with policy improvement steps based on T J = minµ Tµ J.]
(Tµ J)(x) = H x, µ(x), J , ∀ x ∈ X.
185
ASSUMPTIONS AND RESULTS
{x0 , x1 , . . .} = {1, . . . , n, 1, . . . , n, 1, . . .}
T J ∈ S(k+1), ∀ J ∈ S(k), k = 0, 1, . . . .
• Interpretation of assumptions:
∗ J = (J1 , J2 )
(0) S2 (0) ) S(k + 1) + 1) J ∗
∗
TJ
(0) (0) S(k)
S(0) ) + 1)
(0)
S1 (0)
• Convergence mechanism:
J1 Iterations
∗
J = (J1 , J2 )
S(k + 1) J∗ ∗
) + 1)
S(k)
S(0)(0) ) + 1)
(0) J2 Iteration
Iterations
Iterations
Key: “Independent” component-wise improvement.
An asynchronous component iteration from any J
in S(k) moves into the corresponding component
portion of S(k + 1) permanently!
190
PRINCIPAL DP APPLICATIONS
191
6.231 DYNAMIC PROGRAMMING
LECTURE 17
LECTURE OUTLINE
• Undiscounted problems
• Stochastic shortest path problems (SSP)
• Proper and improper policies
• Analysis and computational methods for SSP
• Pathologies of SSP
• SSP under weak conditions
192
UNDISCOUNTED PROBLEMS
195
SSP ANALYSIS I
J = T k J ≤ Tµk′ J → Jµ′ = J ′
Similarly, J ′ ≤ J, so J = J ′ .
196
SSP ANALYSIS II
Tµ0 · · · Tµk−1 J0 ≥ T k J0 ,
ˆ
For vi = −J(i), we have vi ≥ 1, and for all µ,
Σ_{j=1}^n pij( µ(i) ) vj ≤ vi − 1 ≤ ρ vi,   i = 1, . . . , n,
where
ρ = max_{i=1,...,n} (vi − 1) / vi < 1.
This implies Tµ and T are contractions of modu-
lus ρ for norm kJk = maxi=1,...,n |J(i)|/vi (by the
results of earlier lectures). 198
SSP ALGORITHMS
199
PATHOLOGIES I: DETERM. SHORTEST PATHS
[Figure: a deterministic shortest path example with node 1 and destination t: control u gives the arc 1 → t with cost b, while control u′ gives a self-transition at node 1 with cost 0.]
200
PATHOLOGIES II: BLACKMAILER’S DILEMMA
ˆ =
J(i) min Jµ (i), i = 1, . . . , n
µ: proper
[Figure: SSP examples with states 1, . . . , 5 and a destination t, transition probabilities p and 1 − p, and the indicated arc costs, illustrating the pathologies discussed below.]
)
• For p = 1/2, we have
1
Bellman Eq. at state 1, Jµ (1) = 2 Jµ (2)+Jµ (5) ,
is violated.
• References: Bertsekas, D. P., and Yu, H., 2015.
“Stochastic Shortest Path Problems Under Weak
Conditions,” Report LIDS-2909; Math. of OR, to
appear. Also the on-line updated Ch. 4 of the
text.
203
6.231 DYNAMIC PROGRAMMING
LECTURE 18
LECTURE OUTLINE
Reference:
Updated Chapter 4 of Vol. II of the text:
Noncontractive Total Cost Problems
On-line at:
http://web.mit.edu/dimitrib/www/dpchapter.html
Check for most recent version
204
CONTRACTIVE/SEMICONTRACTIVE PROBLEMS
205
UNDISCOUNTED TOTAL COST PROBLEMS
207
SUMMARY OF ALGORITHMIC RESULTS
J(t) J(t)
(1) Case P Case N (1) Case P Case N
Bellman Eq. Bellman Eq.
Solutions Solutions Solutions Solutions
Bellman Eq. Bellman Eq.
• Bellman Equation:
J(1) = min J(1), b + J(t)], J(t) = J(t)
209
DETERM. OPT. CONTROL - FORMULATION
210
DETERM. OPT. CONTROL - ANALYSIS
If J0 ∈ J and J0 ≥ T J0 , we have Jk ↓ J * .
• Rollout with terminating heuristic (e.g., MPC).
212
LINEAR-QUADRATIC ADAPTIVE CONTROL
214
FINITE-STATE AFFINE MONOTONIC PROBLEMS
Jπ(i) = lim sup_{N→∞} ( Tµ0 ··· TµN−1 J̄ )(i),   i = 1, . . . , n
• Interpretation:
215
AFFINE MONOTONIC PROBLEMS: ANALYSIS
LECTURE 19
LECTURE OUTLINE
220
APPROXIMATION IN VALUE SPACE
[Figures: (i) the chess position evaluator: feature extraction (material balance, mobility, safety, etc.) followed by a weighting of features produces a score; (ii) a linear feature-based architecture: state i → feature extraction mapping → feature vector φ(i) → linear cost approximator φ(i)′r.]
J̃(i; r) = ri,   i ∈ I
© source unknown. All rights reserved. This content is excluded from our Creative
Commons license. For more information, see http://ocw.mit.edu/fairuse.
226
DIRECTLY APPROXIMATING J ∗ OR Q∗
[Figure: the subspace S = {Φr | r ∈ ℜ^s} and the projection ΠJµ of Jµ onto S.]
229
INDIRECT POLICY EVALUATION
[Figure: direct method: projection ΠJµ of the cost vector Jµ onto S; indirect method: solving a projected form of Bellman's equation, Φr = ΠTµ(Φr), on S.]
[Figure: hard aggregation of a 9-state example into 4 aggregate states x1, . . . , x4; Φ is the 9 × 4 matrix of 0-1 membership indicators.]
AGGREGATION AS PROBLEM APPROXIMATION
[Figure: aggregation as problem approximation: original system states i, j (with transition probabilities pij(u) and costs g(i, u, j)) are linked to aggregate states through disaggregation probabilities dxi and aggregation probabilities φjy.]
max_i | J̃(i, rk) − Jµk(i) | ≤ δ,   k = 0, 1, . . .
max_i | (Tµk+1 J̃)(i, rk) − (T J̃)(i, rk) | ≤ ǫ,   k = 0, 1, . . .
lim sup_{k→∞} max_i | Jµk(i) − J*(i) | ≤ (ǫ + 2αδ) / (1 − α)²
234
THE ISSUE OF EXPLORATION
• r is an adjustable parameter vector and Q̃(i, u; r) is a parametric architecture, such as
Q̃(i, u; r) = Σ_{m=1}^s rm φm(i, u)
236
STOCHASTIC ALGORITHMS: GENERALITIES
xm = (1 − Am )−1 bm
or iteratively
• TD(λ) and Q-learning are SA methods
• LSTD(λ) and LSPE(λ) are MCE methods
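A minimal sketch of tabular Q-learning, which is a stochastic approximation (SA) method as noted above: each transition produces a noisy sample of the Q-factor equation, and a diminishing stepsize averages out the noise. The two-state simulator, costs, stepsizes, and epsilon-greedy exploration are illustrative assumptions.

import random

alpha, eps = 0.9, 0.1                               # discount factor, exploration probability
states, controls = [0, 1], [0, 1]
def step(i, u):                                     # assumed simulator: returns (cost, next state)
    j = random.choices(states, weights=[0.7, 0.3] if u == 0 else [0.2, 0.8])[0]
    return (1.0 if i == 0 else 0.2) + 0.1 * u, j

Q = {(i, u): 0.0 for i in states for u in controls}
i = 0
for t in range(1, 200001):
    if random.random() < eps:                       # epsilon-greedy control choice
        u = random.choice(controls)
    else:
        u = min(controls, key=lambda a: Q[(i, a)])
    cost, j = step(i, u)
    target = cost + alpha * min(Q[(j, a)] for a in controls)
    gamma = 1.0 / (1 + t / 1000.0)                  # diminishing stepsize
    Q[(i, u)] += gamma * (target - Q[(i, u)])       # SA update toward the sample target
    i = j
print(Q)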
237
6.231 DYNAMIC PROGRAMMING
LECTURE 20
LECTURE OUTLINE
238
REVIEW: APPROXIMATION IN VALUE SPACE
then
˜ ∗
δ
lim sup max Jk (i, rk ) − J (i) ≤
k→∞ i=1,...,n 1−α
243
NORM MISMATCH PROBLEM
244
APPROXIMATE PI
245
APPROXIMATE POLICY EVALUATION
Direct Method: Projection of cost vector Indirect Method: Solving a projected form of Bell
( )cost (vector
ojection of )Indirect
(Jµ) Method: Solving a projected form
Projection on
of Bellman’s equation
246
PI WITH INDIRECT POLICY EVALUATION
Φr = ΠTµ (Φr)
247
KEY QUESTIONS AND RESULTS
‖Jµ − Φr*‖ξ ≤ ( 1 / √(1 − α²) ) ‖Jµ − ΠJµ‖ξ
248
PRELIMINARIES: PROJECTION PROPERTIES
¯ ξ ≤ kJ − Jk
kΠJ − ΠJk ¯ ξ, for all J, J¯ ∈ ℜn .
= kJ − Jk2ξ
249
PROOF OF CONTRACTION PROPERTY
‖Jµ − Φr*‖ξ ≤ ( 1 / √(1 − α²) ) ‖Jµ − ΠJµ‖ξ .
Proof: We have
‖Jµ − Φr*‖²ξ = ‖Jµ − ΠJµ‖²ξ + ‖ΠJµ − Φr*‖²ξ
= ‖Jµ − ΠJµ‖²ξ + ‖ΠT Jµ − ΠT(Φr*)‖²ξ
≤ ‖Jµ − ΠJµ‖²ξ + α² ‖Jµ − Φr*‖²ξ ,
where
− The first equality uses the Pythagorean The-
orem
− The second equality holds because Jµ is the
fixed point of T and Φr∗ is the fixed point
of ΠT
− The inequality uses the contraction property
of ΠT .
Q.E.D.
251
MATRIX FORM OF PROJECTED EQUATION
C = Φ′ Ξ(I − αP )Φ, d = Φ′ Ξg
but computing C and d is HARD (high-dimensional
inner products). 252
SOLUTION OF PROJECTED EQUATION
[Figure: projected value iteration on the subspace S spanned by the basis functions: T(Φrk) is projected onto S to obtain Φrk+1.]
which yields
rk+1 = rk − (Φ′ ΞΦ)−1 (Crk − d)
253
SIMULATION-BASED IMPLEMENTATIONS
r̂k = Ck−1 dk
254
SIMULATION MECHANICS
Ck = (1/(k+1)) Σ_{t=0}^k φ(it) ( φ(it) − α φ(it+1) )′ ≈ Φ′Ξ(I − αP)Φ = C
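A minimal sketch of the simulation mechanics above for LSTD(0): accumulate Ck and dk along a single trajectory i0, i1, . . . and solve Ck r = dk (the 1/(k+1) normalization cancels when solving). The two-state chain, costs, features, and discount factor are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])             # transition matrix of the policy (assumed)
g = np.array([[1.0, 2.0], [0.5, 0.0]])             # transition costs g(i, j) (assumed)
phi = np.array([[1.0, 0.0], [1.0, 1.0]])           # rows are feature vectors phi(i)'
alpha = 0.95

C = np.zeros((2, 2))
d = np.zeros(2)
i = 0
for t in range(200000):
    j = rng.choice(2, p=P[i])
    C += np.outer(phi[i], phi[i] - alpha * phi[j])  # sample of phi(i_t)(phi(i_t) - alpha phi(i_{t+1}))'
    d += phi[i] * g[i, j]                           # sample of phi(i_t) g(i_t, i_{t+1})
    i = j

r_hat = np.linalg.solve(C, d)                       # r_hat_k = C_k^{-1} d_k
print("approximate cost vector Phi r:", phi @ r_hat)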
256
6.231 DYNAMIC PROGRAMMING
LECTURE 21
LECTURE OUTLINE
257
REVIEW: PROJECTED BELLMAN EQUATION
or more compactly, T J = g + αP J
• Approximate Bellman’s equation J = T J by
Φr = ΠT (Φr) or the matrix form/orthogonality
condition Cr∗ = d, where
[Figure: the projected Bellman equation: T(Φr) is projected onto the subspace S spanned by the basis functions, and Φr = ΠT(Φr) at the fixed point.]
258
PROJECTED EQUATION METHODS
• Matrix inversion: r* = C^{−1} d
• Iterative Projected Value Iteration (PVI) method; its simulation-based counterparts use the estimates
Ck = (1/(k+1)) Σ_{t=0}^k φ(it) ( φ(it) − α φ(it+1) )′ ≈ Φ′Ξ(I − αP)Φ
dk = (1/(k+1)) Σ_{t=0}^k φ(it) g(it, it+1) ≈ Φ′Ξ g
Gk ≈ G = (Φ′ΞΦ)^{−1}
Converges to r* if ΠT is a contraction.
259
ISSUES FOR PROJECTED EQUATIONS
1
kJµ − Φrλ∗ kξ ≤p kJµ − ΠJµ kξ
1− αλ2
Slope Jµ
)λ=0
1
• From kJµ − Φrλ,µ kξ ≤ p kJµ − ΠJµ kξ
1−α2
λ
error bound
• As λ ↑ 1, we have αλ ↓ 0, so error bound (and
quality of approximation) improves:
lim Φrλ,µ = ΠJµ
λ↑1
with
P^(λ) = (1 − λ) Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^{ℓ+1},   g^(λ) = Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^ℓ g
(λ) (λ)
• The simulation process to obtain Ck and dk
is similar to the case λ = 0 (single simulation tra-
jectory i0 , i1 , . . ., more complex formulas)
k k
(λ) 1 X X
m−t m−t
′
Ck = φ(it ) α λ φ(im )−αφ(im+1 )
k + 1 t=0 m=t
k k
(λ) 1 X X
dk = φ(it ) αm−t λm−t gim
k + 1 t=0 m=t
• In the context of approximate policy iteration,
we can use optimistic versions (few samples be-
tween policy updates).
• Many different versions (see the text).
• Note the λ-tradeoffs:
− As λ ↑ 1, C^(λ)_k and d^(λ)_k contain more "simulation noise", so more samples are needed for a close approximation of r_{λ,µ}
− The error bound ‖Jµ − Φr_{λ,µ}‖ξ becomes smaller
− As λ ↑ 1, ΠT^(λ) becomes a contraction for an arbitrary projection norm
264
APPROXIMATE PI ISSUES - EXPLORATION
+1 rµk+2
+2 Rµk+3 k rµk+1
Rµk+2
266
MORE ON OSCILLATIONS/CHATTERING
2 Rµ3
1 rµ2
rµ1
Rµ2 2 rµ3
Rµ1
Φr = (W Tµ )(Φr)
with W a monotone operator, the generated poli-
cies converge (to an approximately optimal limit).
• The operator W used in the aggregation ap-
proach has this monotonicity property.
267
6.231 DYNAMIC PROGRAMMING
LECTURE 22
LECTURE OUTLINE
268
PROBLEM APPROXIMATION - AGGREGATION
[Figure: hard aggregation of a 9-state example into 4 aggregate states x1, . . . , x4; Φ is the 9 × 4 matrix of 0-1 membership indicators. Variants: special states, aggregate states, features.]
273
EXAMPLE III: REP. STATES/COARSE GRID
j3 y1 1 y2
x j1 2 y3
xj
x j1 j2
j2 j3
Representative/Aggregate States
Aggregate States/Subsets
0 1 2 49
277
Q-LEARNING II
LECTURE 23
LECTURE OUTLINE
280
REVIEW: PROJECTED BELLMAN EQUATION
Φr = ΠT(Φr)
[Figure: the projected Bellman equation on the subspace S spanned by the basis functions: Φr = ΠT(Φr) at the fixed point.]
where
qt(i) = P(it = i),   i = 1, . . . , n,  t = 0, 1, . . .
• We use the projection norm
‖J‖q = ( Σ_{i=1}^n q(i) J(i)² )^{1/2}
where
q0 (j )
β = 1 − min
j q(j)
284
AVERAGE COST PROBLEMS
F J = g − ηe + P J
T (x) = Ax + b, A is n × n, b ∈ ℜn
t ′ t
X aik jk X
φ(ik ) φ(ik ) − φ(jk ) rt = φ(ik )bik
pik jk
k=0 k=0
• We have rt → r∗ , regardless of ΠA being a con-
traction (by law of large numbers; see next slide).
• Issues of singularity or near-singularity of I−ΠA
may be important; see the text.
• An LSPE-like method is also possible, but re-
quires that ΠA is a contraction.
Pn
• Under the assumption j=1 |aij | ≤ 1 for all i,
there are conditions that guarantee contraction of
ΠA; see the text. 290
JUSTIFICATION W/ LAW OF LARGE NUMBERS
t n n
1 X
ξˆit φ(i)bi ≈
X X
φ(ik )bik = ξi φ(i)bi
t+1 i=1 i=1
k=0
291
BASIS FUNCTION ADAPTATION I
˜ ˜
X
F J(θ) = |J(i) − J(θ)(i)| 2,
i∈I
where I is a subset of states, and J(i), i ∈ I, are
the costs of the policy at these states calculated
directly by simulation.
• Another example is
2
˜ ) =
J(θ)
˜ − T J(θ)
˜
F J(θ
,
292
BASIS FUNCTION ADAPTATION II
293
APPROXIMATION IN POLICY SPACE I
294
APPROXIMATION IN POLICY SPACE II
By left-multiplying with ξ ′ ,
ξ ∆η (r)·e+ξ ∆h(r) = ξ ∆g (r)+∆P (r)h(r) +ξ ′ P (r)∆h(r)
′ ′ ′
∆η = ξ ′ (∆g + ∆P h)
• Since we don’t know ξ, we cannot implement a
gradient-like method for minimizing η(r). An al-
ternative is to use “sampled gradients”, i.e., gener-
ate a simulation trajectory (i0 , i1 , . . .), and change
r once in a while, in the direction of a simulation-
based estimate of ξ ′ (∆g + ∆P h).
• Important Fact: ∆η can be viewed as an ex-
pected value!
• Much research on this subject, see the text.
295
6.231 DYNAMIC PROGRAMMING
OVERVIEW-EPILOGUE
296
FINITE HORIZON PROBLEMS - ANALYSIS
297
FINITE HORIZON PROBS - EXACT COMP. SOL.
298
FINITE HORIZON PROBS - APPROX. SOL.
299
INFINITE HORIZON PROBLEMS - ANALYSIS
300
INF. HORIZON PROBS - EXACT COMP. SOL.
• Value iteration
− Variations (Gauss-Seidel, asynchronous, etc)
• Policy iteration
− Variations (asynchronous, based on value it-
eration, optimistic, etc)
• Linear programming
• Elegant algorithmic analysis
• Curse of dimensionality is major bottleneck
301
INFINITE HORIZON PROBS - ADP
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.