Dynamic Programming Online Teaching FOR PRINT
Dynamic Programming
Miao Song
1 Introduction
\[
g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k)
\]
over the controls u0 , u1 , ..., uN−1 , where the expectation is with respect to
the joint distribution of the random variables involved.
\[
x_{k+1} = x_k + u_k - w_k,
\]
We want to minimize this cost by proper choice of the orders u0 , ..., uN−1 ,
subject to the natural constraint uk ≥ 0 for all k.
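As a minimal sketch of the system equation x_{k+1} = x_k + u_k − w_k, the following simulates a stock trajectory for a given demand sequence. The base-stock rule (order up to a target level), the demand numbers, and the names `simulate_inventory` and `base_stock_policy` are assumptions introduced only for illustration; the slides do not prescribe a particular policy.

```python
from typing import Callable, List, Tuple


def simulate_inventory(
    x0: int,
    policy: Callable[[int, int], int],
    demands: List[int],
) -> Tuple[List[int], List[int]]:
    """Roll the system equation x_{k+1} = x_k + u_k - w_k forward.

    `policy(k, x)` returns the order u_k for stock x at time k.
    Negative stock is allowed here and stands for backlogged demand.
    """
    states = [x0]
    orders = []
    for k, w in enumerate(demands):
        u = policy(k, states[-1])
        if u < 0:
            raise ValueError("orders must satisfy u_k >= 0")
        orders.append(u)
        states.append(states[-1] + u - w)
    return states, orders


def base_stock_policy(level: int) -> Callable[[int, int], int]:
    """Hypothetical rule: order up to `level`, never order a negative amount."""
    def mu(k: int, x: int) -> int:
        return max(0, level - x)
    return mu


# Illustrative demand realisation (made-up numbers).
states, orders = simulate_inventory(1, base_stock_policy(3), [2, 0, 4, 1])
print(states, orders)
```

The policy is passed as a function of the current stock, which mirrors the closed-loop form µ_k(x_k) used in the rest of the slides.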
Given an initial state x0 and an admissible policy π = {µ0 , ..., µN−1 }, the
states xk are random variables defined through the system equation
A Travel Analogy
The fastest route from Beijing to Hong Kong is
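\[
\text{Beijing} \xrightarrow{\mu_0^*} \text{Shanghai} \xrightarrow{\mu_1^*} \text{Guangzhou} \xrightarrow{\mu_2^*} \text{Shenzhen} \xrightarrow{\mu_3^*} \text{Hong Kong}.
\]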
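The analogy can be sketched in code. The travel times below are hypothetical numbers, chosen only so that the route above comes out fastest; the point of the sketch is that the tail of the fastest route from Beijing is itself the fastest route from the intermediate city.

```python
from functools import lru_cache
from typing import Dict, Tuple

# Hypothetical travel times in hours (made up for illustration).
TRAVEL_HOURS: Dict[str, Dict[str, float]] = {
    "Beijing": {"Shanghai": 5.0, "Guangzhou": 10.0},
    "Shanghai": {"Guangzhou": 3.0, "Shenzhen": 7.0},
    "Guangzhou": {"Shenzhen": 1.0, "Hong Kong": 3.0},
    "Shenzhen": {"Hong Kong": 1.0},
}


@lru_cache(maxsize=None)
def fastest(city: str) -> Tuple[float, Tuple[str, ...]]:
    """Return (time, route) of the fastest route from `city` to Hong Kong."""
    if city == "Hong Kong":
        return 0.0, ("Hong Kong",)
    best = None
    for nxt, hours in TRAVEL_HOURS[city].items():
        tail_time, tail_route = fastest(nxt)  # optimal tail from the next city
        candidate = (hours + tail_time, (city,) + tail_route)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


time_from_beijing, route_from_beijing = fastest("Beijing")
time_from_shanghai, route_from_shanghai = fastest("Shanghai")
# Tail of the optimal Beijing route vs. the optimal route from Shanghai.
print(route_from_beijing[1:] == route_from_shanghai)
```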
\[
J_N(x_N) = g_N(x_N),
\]
\[
J_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k}\Big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}, \tag{1}
\]
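A minimal sketch of recursion (1) for finite state, control, and disturbance sets is given below. The function name `backward_dp` and the small inventory instance used to exercise it (capacity 2, lost sales via `max(0, ...)`, quadratic holding/shortage cost, zero terminal cost) are assumptions made for illustration, not data from these slides.

```python
def backward_dp(N, states, controls, demand, f, g, g_N):
    """Run J_N = g_N, then J_k = min_u E_w{ g_k + J_{k+1}(f_k) } for k = N-1, ..., 0.

    `demand` is a sequence of (w, probability) pairs, assumed identical and
    independent across periods. Returns the tables J[k][x] and mu[k][x].
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_N(x)
    for k in range(N - 1, -1, -1):
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls(k, x):
                # Expectation over w_k of stage cost plus cost-to-go.
                expected = sum(
                    p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                    for w, p in demand
                )
                if expected < best_cost:
                    best_u, best_cost = u, expected
            J[k][x], mu[k][x] = best_cost, best_u
    return J, mu


# Hypothetical instance: stock in {0, 1, 2}, order so that x + u <= 2.
J, mu = backward_dp(
    N=3,
    states=(0, 1, 2),
    controls=lambda k, x: range(0, 2 - x + 1),
    demand=((0, 0.1), (1, 0.7), (2, 0.2)),
    f=lambda k, x, u, w: max(0, x + u - w),   # unmet demand is lost (assumption)
    g=lambda k, x, u, w: u + (x + u - w) ** 2,
    g_N=lambda x: 0.0,
)
print(J[0], mu[0])
```

The tables are filled from period N backwards, so every J[k + 1] entry needed on the right side of (1) is already available when J[k] is computed.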
For k = 0, 1, ..., N − 1, let Jk∗ (xk ) be the optimal cost for the (N − k)-stage
problem that starts at state xk and time k, and ends at time N, i.e.,
\[
J_k^*(x_k) = \min_{\pi^k} E_{w_k, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\}.
\]
We want to show J ∗ (x0 ) = J0 (x0 ). As long as Jk∗ (xk ) = Jk (xk ) for all k
and xk , we obtain J ∗ (x0 ) = J0∗ (x0 ) = J0 (x0 ).
Miao Song (PolyU, HK) Dynamic Programming January 18, 2024 17 / 44
Proof of Theorem 1 (Contd)
Period N. By definition, $J_N^*(x_N) = g_N(x_N) = J_N(x_N)$.
Period k, k = 0, 1, ..., N − 1.
Induction Hypothesis
\[
J_{k+1}^*(x_{k+1}) = J_{k+1}(x_{k+1}) \quad \text{for all } x_{k+1}.
\]
Since π k = {µk , µk+1 , ..., µN−1 } and π k+1 = {µk+1 , ..., µN−1 },
π k = (µk , π k+1 ).
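\begin{align*}
J_k^*(x_k) &= \min_{\pi^k} E_{w_k, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\} \\
&= \min_{(\mu_k, \pi^{k+1})} E_{w_k, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\} \\
&= \min_{(\mu_k, \pi^{k+1})} E_{w_k, \dots, w_{N-1}} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\}
\end{align*}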
\begin{align*}
& E_{w_k, \dots, w_{N-1}} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\} \\
&= E_{w_k} \Big\{ E_{w_k, w_{k+1}, \dots, w_{N-1}} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \,\Big|\, w_k \Big\} \Big\} && \text{(tower rule)} \\
&= E_{w_k} \Big\{ E_{w_{k+1}, \dots, w_{N-1}} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \,\Big|\, w_k \Big\} \Big\} \\
&= E_{w_k} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + E_{w_{k+1}, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \,\Big|\, w_k \Big\} \Big\} \\
&= E_{w_k} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + E_{w_{k+1}, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\} \Big\}
\end{align*}
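\begin{align*}
J_k^*(x_k) &= \min_{(\mu_k, \pi^{k+1})} E_{w_k, \dots, w_{N-1}} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\} \\
&= \min_{(\mu_k, \pi^{k+1})} E_{w_k} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + E_{w_{k+1}, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\} \Big\} \\
&= \min_{\mu_k} E_{w_k} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + \min_{\pi^{k+1}} E_{w_{k+1}, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\} \Big\}
\end{align*}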
The first equality is obtained since $x_{k+1}$ is known when $w_k$ is given. The
second equality follows from the tower rule.
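As a small numerical check of the tower rule, the sketch below compares a direct expectation over a joint distribution of (w_0, w_1) with the nested form E_{w_0}{ E_{w_1}{ Z | w_0 } }. The joint probabilities and the function `Z` are made-up illustrative values; exact fractions are used so the two sides can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical joint distribution of (w0, w1); w0 and w1 are deliberately dependent.
joint = {
    (0, 0): Fraction(1, 10),
    (0, 1): Fraction(3, 10),
    (1, 0): Fraction(2, 5),
    (1, 1): Fraction(1, 5),
}


def Z(w0: int, w1: int) -> int:
    """Any function of the two disturbances (stand-in for a total cost)."""
    return (w0 + 1) * (w1 + 2)


# Left side: expectation with respect to the joint distribution.
direct = sum(p * Z(w0, w1) for (w0, w1), p in joint.items())

# Marginal distribution of w0.
marginal = {}
for (w0, _), p in joint.items():
    marginal[w0] = marginal.get(w0, 0) + p


def inner(w0: int) -> Fraction:
    """Conditional expectation E{ Z | w0 }."""
    return sum(
        (p / marginal[w0]) * Z(v0, w1)
        for (v0, w1), p in joint.items()
        if v0 == w0
    )


# Right side: outer expectation over w0 of the conditional expectation.
nested = sum(marginal[w0] * inner(w0) for w0 in marginal)
print(direct == nested)
```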
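\begin{align*}
\pi^{k+1} &= \{\mu_{k+1}, \mu_{k+2}, \dots, \mu_{N-1}\} \\
&= \{\mu_{k+1}(x_{k+1})\ \forall x_{k+1},\ \mu_{k+2}(x_{k+2})\ \forall x_{k+2},\ \dots,\ \mu_{N-1}(x_{N-1})\ \forall x_{N-1}\}
\end{align*}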
As xi+1 = fi (xi , µi (xi ), wi ) for all i ∈ {k + 1, ..., N − 1}, xk+1 , xk+2 , ...,
xN−1 all depend on xk+1 . With a little abuse of notation,
This simply represents that the actions we will take from period k + 1 to
period N − 1 depend on the state xk+1 in period k + 1, which is in
accordance with the closed-loop optimization.
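\begin{align*}
J_k^*(x_k) &= \min_{\mu_k} E_{w_k} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + \min_{\pi^{k+1}} E_{w_{k+1}, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\} \Big\} \\
&= \min_{\mu_k} E_{w_k} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}^*\big(f_k(x_k, \mu_k(x_k), w_k)\big) \Big\},
\end{align*}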
where the second equality follows from $x_{k+1} = f_k(x_k, \mu_k(x_k), w_k)$ and the
definition of $J_{k+1}^*(x_{k+1})$.
\[
J_k^*(x_k) = \min_{\pi^k} E_{w_k, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\}
\]
where M is the set of all functions µ(x) such that µ(x) ∈ U(x) for all
x.
Proof of Theorem 1 (Contd)
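\begin{align*}
J_k^*(x_k) &= \min_{\mu_k} E_{w_k} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}^*\big(f_k(x_k, \mu_k(x_k), w_k)\big) \Big\} \\
&= \min_{\mu_k} E_{w_k} \Big\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \Big\} \\
&= \min_{u_k \in U_k(x_k)} E_{w_k} \Big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\} \\
&= J_k(x_k)
\end{align*}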
This completes the induction proof that shows Jk∗ (xk ) = Jk (xk ) for all k
and xk . Recall that J ∗ (x0 ) = J0∗ (x0 ). We obtain
J ∗ (x0 ) = J0∗ (x0 ) = J0 (x0 ).
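The identity J*(x0) = J0(x0) can be checked numerically on a tiny problem by enumerating every admissible policy. The sketch below does this for a hypothetical two-period inventory instance (capacity 2, lost sales, made-up demand probabilities); all names and numbers are assumptions for illustration, and brute force is only feasible because there are 36 policies here.

```python
from itertools import product

N = 2
STATES = (0, 1, 2)
DEMAND = ((0, 0.1), (1, 0.7), (2, 0.2))  # (w, probability), i.i.d. by assumption


def controls(x):
    return range(0, 2 - x + 1)           # order so that x + u <= 2


def f(x, u, w):
    return max(0, x + u - w)             # unmet demand is lost (assumption)


def g(x, u, w):
    return u + (x + u - w) ** 2


def g_N(x):
    return 0.0


def dp_cost_to_go():
    """J_0 from the backward recursion J_k = min_u E{ g + J_{k+1}(f) }."""
    J = {x: g_N(x) for x in STATES}
    for _ in range(N):
        J = {
            x: min(
                sum(p * (g(x, u, w) + J[f(x, u, w)]) for w, p in DEMAND)
                for u in controls(x)
            )
            for x in STATES
        }
    return J


def policy_cost(policy, k, x):
    """Expected cost of a fixed policy {mu_k, ..., mu_{N-1}} from state x at time k."""
    if k == N:
        return g_N(x)
    u = policy[k][x]
    return sum(
        p * (g(x, u, w) + policy_cost(policy, k + 1, f(x, u, w)))
        for w, p in DEMAND
    )


def all_policies():
    """Every pi = {mu_0, ..., mu_{N-1}} with mu_k(x) in U(x) for all x."""
    stage_rules = [
        dict(zip(STATES, choice))
        for choice in product(*(controls(x) for x in STATES))
    ]
    return product(stage_rules, repeat=N)


def brute_force(x0):
    """J*(x0): minimum expected cost over all admissible policies."""
    return min(policy_cost(pi, 0, x0) for pi in all_policies())


J0 = dp_cost_to_go()
for x0 in STATES:
    print(x0, J0[x0], brute_force(x0))
```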
To show the optimality of π ∗ = {µ∗0 , ..., µ∗N−1 }, where uk∗ = µ∗k (xk )
minimizes the right side of
\[
J_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k} \Big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}
\]
for each xk and k, we can make use of the principle of optimality, i.e., “the
tail portion of an optimal policy is optimal for the tail subproblem.”
An optimal policy for the problem JN−1 is {µ∗N−1 }.
An optimal policy for the problem JN−2 is µ∗N−2 plus an optimal
policy for the problem JN−1 , i.e., {µ∗N−2 , µ∗N−1 }.
Continuing in this manner, an optimal policy for the problem J0 , i.e.,
J ∗ , is π ∗ = {µ∗0 , ..., µ∗N−1 }.
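\begin{align*}
J_k(x_k) &= E_{w_k} \Big\{ g_k(x_k, \mu_k^*(x_k), w_k) + J_{k+1}(x_{k+1}) \Big\} \\
&= E_{w_k} \Big\{ g_k(x_k, \mu_k^*(x_k), w_k) + E_{w_{k+1}, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i^*(x_i), w_i) \Big\} \Big\},
\end{align*}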
where xk+1 = fk (xk , µ∗k (xk ), wk ) and xi+1 = fi (xi , µ∗i (xi ), wi ) for
i = k + 1, ..., N − 1.
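\begin{align*}
J_k(x_k) &= E_{w_k} \Big\{ g_k(x_k, \mu_k^*(x_k), w_k) + E_{w_{k+1}, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i^*(x_i), w_i) \Big\} \Big\} \\
&= E_{w_k, \dots, w_{N-1}} \Big\{ g_N(x_N) + g_k(x_k, \mu_k^*(x_k), w_k) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i^*(x_i), w_i) \Big\}
\end{align*}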
where xk+1 = fk (xk , µ∗k (xk ), wk ) and xi+1 = fi (xi , µ∗i (xi ), wi ) for
i = k + 1, ..., N − 1.
\[
J_k(x_k) = E_{w_k, \dots, w_{N-1}} \Big\{ g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, \mu_i^*(x_i), w_i) \Big\},
\]
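This last identity can be checked numerically: build µ* by taking the minimizing control in the DP recursion, then evaluate the expected cost of µ* directly as a sum over all disturbance sequences (w_k, ..., w_{N-1}). The sketch below uses a hypothetical three-period inventory instance and assumes the disturbances are independent across periods, so that a sequence's probability is the product of its per-period probabilities; all names and numbers are illustrative.

```python
from itertools import product

N = 3
STATES = (0, 1, 2)
DEMAND = ((0, 0.1), (1, 0.7), (2, 0.2))  # (w, probability), independent by assumption


def controls(x):
    return range(0, 2 - x + 1)


def f(x, u, w):
    return max(0, x + u - w)


def g(x, u, w):
    return u + (x + u - w) ** 2


def g_N(x):
    return 0.0


def q(x, u, J_next):
    """E_w{ g(x, u, w) + J_next(f(x, u, w)) } for a fixed control u."""
    return sum(p * (g(x, u, w) + J_next[f(x, u, w)]) for w, p in DEMAND)


def solve():
    """Backward recursion; mu[k][x] is a minimizer of the right side of the DP equation."""
    J = [None] * (N + 1)
    mu = [None] * N
    J[N] = {x: g_N(x) for x in STATES}
    for k in reversed(range(N)):
        mu[k] = {x: min(controls(x), key=lambda u: q(x, u, J[k + 1])) for x in STATES}
        J[k] = {x: q(x, mu[k][x], J[k + 1]) for x in STATES}
    return J, mu


def expected_cost(mu, k, x_k):
    """E_{w_k..w_{N-1}}{ g_N(x_N) + sum_{i=k}^{N-1} g_i(x_i, mu_i(x_i), w_i) } by enumeration."""
    total = 0.0
    for seq in product(DEMAND, repeat=N - k):
        x, prob, cost = x_k, 1.0, 0.0
        for i, (w, p) in enumerate(seq, start=k):
            u = mu[i][x]
            cost += g(x, u, w)
            x = f(x, u, w)
            prob *= p
        total += prob * (cost + g_N(x))
    return total


J, mu_star = solve()
for k in range(N):
    for x in STATES:
        print(k, x, J[k][x], expected_cost(mu_star, k, x))
```

The evaluation deliberately avoids the recursion and sums over whole disturbance sequences, so agreement with J[k][x] is a check of the identity rather than a restatement of the algorithm.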
n o
Jk (xk ) = min Ewk gk (xk , uk , wk )+Jk+1 (fk (xk , uk , wk )) Jk (
| {z } uk ∈Uk (xk ) | {z } | {z } | {
cost-to-go at time k cost at time k cost-to-go at time k + 1 cost-to-go