
LGT6204 Inventory and Supply Chain Management

Dynamic Programming

Miao Song

Department of Logistics and Maritime Studies


The Hong Kong Polytechnic University

[email protected]

January 18, 2024

Miao Song (PolyU, HK) Dynamic Programming January 18, 2024 1 / 44


Overview

1 Introduction

2 The Dynamic Programming Algorithm



Principal Features

An underlying discrete-time dynamic system over a finite number of stages (a finite horizon)
A cost function that is additive over time



Discrete-Time Dynamic System

xk+1 = fk (xk , uk , wk ), k = 0, 1, ..., N − 1

k indexes discrete time and N is the horizon or number of times control is applied.
xk ∈ Sk is the state of the system. It summarizes past information
that is relevant for future optimization.
uk is the control or decision variable to be selected at time k.
uk ∈ Uk (xk ) for all xk ∈ Sk and k.
wk is a random parameter (also called disturbance or noise depending
on the context). Its distribution P(·|xk , uk ) may depend explicitly on
xk and uk but not on values of prior disturbances wk−1 , ..., w0 .
▶ The system is deterministic if each wk can take only one value.
fk is a function that describes the system and in particular the
mechanism by which the state is updated.
Any of xk , uk , wk can be either a scalar or vector.
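As a concrete sketch, the state recursion xk+1 = fk(xk, uk, wk) can be rolled forward in a few lines of Python (the helper names `simulate` and `sample_w` are ours, not from the slides):

```python
def simulate(f, policy, x0, N, sample_w):
    """Roll the system x_{k+1} = f_k(x_k, u_k, w_k) forward for N stages,
    choosing u_k = policy(k, x_k) and drawing w_k from sample_w."""
    x = x0
    trajectory = [x0]
    for k in range(N):
        u = policy(k, x)
        w = sample_w(k, x, u)
        x = f(k, x, u, w)
        trajectory.append(x)
    return trajectory

# Deterministic special case: w_k takes only one value (here 0),
# so the whole trajectory is predictable from x0 and the policy.
traj = simulate(
    f=lambda k, x, u, w: x + u - w,   # linear dynamics, as in the inventory example
    policy=lambda k, x: 1,            # always apply control 1
    x0=0,
    N=3,
    sample_w=lambda k, x, u: 0,       # deterministic disturbance
)
print(traj)  # [0, 1, 2, 3]
```

Here xk, uk, wk are scalars, but the same loop works unchanged if they are vectors.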
Additive Cost Function

gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk)

gN (xN ) is a terminal cost incurred at the end of the process.


gk (xk , uk , wk ) is the cost incurred at time k.
The problem is formulated as an optimization of the expected cost

E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }

over the controls u0, u1, ..., uN−1, where the expectation is with respect to the joint distribution of the random variables involved.



Example 1 (Inventory Control)
Consider a problem of ordering a quantity of a certain item at each of N
periods so as to minimize the incurred expected cost. Let us denote
xk stock available at the beginning of the kth period,
uk stock ordered (and immediately delivered) at the beginning of the
kth period,
wk demand during the kth period with given probability distribution.
We assume that w0 , w1 , ..., wN−1 are independent random variables, and
that excess demand is backlogged and filled as soon as additional
inventory becomes available. Then stock evolves according to the
discrete-time equation

xk+1 = xk + uk − wk ,

where negative stock corresponds to backlogged demand.



Example 1 (Contd)
The cost incurred in period k consists of two components.
A cost h(xk+1 ) representing a penalty for either positive stock xk+1
(holding cost for excess inventory) or negative stock xk+1 (shortage
cost for unfilled demand).
The purchasing cost c(uk ).
There is also a terminal cost gN (xN ) for being left with inventory xN at
the end of N periods. Thus, the total cost over N periods is
E{ gN(xN) + Σ_{k=0}^{N−1} [ h(xk+1) + c(uk) ] }.

We want to minimize this cost by proper choice of the orders u0 , ..., uN−1 ,
subject to the natural constraint uk ≥ 0 for all k.

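The expected cost above has no closed form for a general demand distribution, but it can be estimated by simulation. A minimal Python sketch; the base-stock ordering rule, the uniform demand, and all cost numbers below are our own illustrative choices, not part of Example 1:

```python
import random

def inventory_cost(x0, N, order, demand, h, c, gN, rng):
    """One sample path of the N-period inventory cost:
    x_{k+1} = x_k + u_k - w_k, cost h(x_{k+1}) + c(u_k) per period,
    plus the terminal cost gN(x_N)."""
    x, total = x0, 0.0
    for k in range(N):
        u = order(k, x)          # u_k >= 0: stock ordered at the start of period k
        w = demand(rng)          # w_k: random demand in period k
        x = x + u - w            # negative x = backlogged demand
        total += h(x) + c(u)
    return total + gN(x)

rng = random.Random(0)
est = sum(
    inventory_cost(
        x0=0, N=12,
        order=lambda k, x: max(0, 5 - x),                 # hypothetical base-stock rule: order up to 5
        demand=lambda r: r.randint(0, 8),                 # uniform demand on {0,...,8}
        h=lambda x: 1.0 * max(x, 0) + 4.0 * max(-x, 0),   # holding vs. shortage penalty
        c=lambda u: 2.0 * u,
        gN=lambda x: 0.0,
        rng=rng,
    )
    for _ in range(2000)
) / 2000
print(round(est, 1))  # Monte Carlo estimate of the expected 12-period cost
```

This only evaluates one fixed policy; finding the cost-minimizing policy is exactly what the DP algorithm below does.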


Open-Loop and Closed-Loop Optimization
Open-loop optimization
▶ Select all decisions u0 , ..., uN−1 at once at time 0, without waiting to
see the subsequent disturbances wk .
▶ Find optimal numerical values of uk .
Closed-loop optimization: dynamic programming (DP)
▶ Postpone the decision uk until the last possible moment (time k) when
the current state xk will be known.
▶ Find an optimal rule for selecting at each period k a decision uk for
each possible value of state xk that can conceivably occur.
▶ Mathematically, it is to find a sequence of functions µk ,
k = 0, ..., N − 1, mapping state xk into decision uk so as to minimize
the expected cost.
⋆ For each k and each possible value of xk , µk (xk ) represents the action
to be taken at time k if the state is xk .
⋆ The sequence π = {µ0 , ..., µN−1 } is referred to as a policy or control
law.
⋆ A policy such that µk (xk ) ∈ Uk (xk ) for all xk ∈ Sk is called admissible.



Open-Loop and Closed-Loop Optimization: Same for
Deterministic Problems
Suppose that wk is deterministic for all k.
Admissible policy {µ0 , ..., µN−1 } vs. control vector {u0 , ..., uN−1 }
For period 0, we can observe the initial state x0 .
▶ As x0 is the given initial state, we can just consider u0 = µ0 (x0 ).
▶ Given x0 and µ0 (equivalently, u0 ), as w0 is deterministic,
x1 = f0 (x0 , µ0 (x0 ), w0 ) = f0 (x0 , u0 , w0 )
is perfectly predictable.
For any period k, suppose that xk is perfectly predictable.
▶ As xk is perfectly predictable, uk = µk (xk ) is also a perfectly
predictable variable, instead of a function.
▶ Given xk and µk (equivalently, uk ), as wk is deterministic,
xk+1 = fk (xk , µk (xk ), wk ) = fk (xk , uk , wk )
is perfectly predictable.
The cost achieved by an admissible policy {µ0 , ..., µN−1 } is also
achieved by the control sequence {u0 , ..., uN−1 }.
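The argument above can be checked numerically: on a deterministic instance, rolling out a feedback policy and rolling out the control sequence it induces yield the same cost. A small Python sketch (the dynamics and costs are illustrative choices, not from the slides):

```python
def rollout_policy(f, g, gN, mus, x0, ws):
    """Cost of an admissible policy {mu_0, ..., mu_{N-1}} on a deterministic
    disturbance sequence ws = (w_0, ..., w_{N-1}); also returns the induced
    control sequence u_k = mu_k(x_k)."""
    x, cost, us = x0, 0.0, []
    for k, (mu, w) in enumerate(zip(mus, ws)):
        u = mu(x)
        us.append(u)
        cost += g(k, x, u, w)
        x = f(k, x, u, w)
    return cost + gN(x), us

def rollout_controls(f, g, gN, us, x0, ws):
    """Cost of the fixed (open-loop) control sequence {u_0, ..., u_{N-1}}."""
    x, cost = x0, 0.0
    for k, (u, w) in enumerate(zip(us, ws)):
        cost += g(k, x, u, w)
        x = f(k, x, u, w)
    return cost + gN(x)

f = lambda k, x, u, w: x + u - w
g = lambda k, x, u, w: abs(x + u - w) + 2 * u
gN = lambda x: 0.0
ws = (3, 1, 4)                           # deterministic disturbances
mus = [lambda x: max(0, 2 - x)] * 3      # a feedback (closed-loop) rule
closed, us = rollout_policy(f, g, gN, mus, 0, ws)
open_ = rollout_controls(f, g, gN, us, 0, ws)
assert closed == open_   # closed loop gains nothing when each w_k is deterministic
```

With a random wk the two would generally differ: the feedback rule can react to the realized state, while the fixed sequence cannot.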
Dynamic Programming (DP)

Given an initial state x0 and an admissible policy π = {µ0 , ..., µN−1 }, the
states xk are random variables defined through the system equation

xk+1 = fk (xk , µk (xk ), wk ), k = 0, 1, ..., N − 1.

Thus, for given functions gk, k = 0, 1, ..., N, the expected cost of π starting at x0 is

Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) },

where the expectation is taken over the random variables wk and xk .



Dynamic Programming (Contd)
The optimal cost depends on x0 and is denoted by J∗(x0), i.e.,

J∗(x0) = min_{π∈Π} Jπ(x0),

where Π is the set of all admissible policies.

J∗ can be viewed as a function that assigns to each initial state x0 the
optimal cost J∗(x0); it is called the optimal cost function or optimal
value function.
An optimal policy π∗ is one that minimizes this cost, i.e.,

Jπ∗(x0) = J∗(x0) = min_{π∈Π} Jπ(x0).

▶ By this definition, the optimal policy π∗ is associated with a fixed
initial state x0. Nevertheless, we are typically interested in a policy π∗
that is simultaneously optimal for all initial states, i.e.,

Jπ∗(x0) = J∗(x0) = min_{π∈Π} Jπ(x0) ∀x0 ∈ S0.



Dynamic Programming (Contd)
To formulate a dynamic program, we need to determine
state xk ,
disturbance wk ,
control uk and feasible set Uk (xk ),
state transition function fk ,
additive cost including the one-period cost function gk and the
terminal cost gN .
A policy π = {µ0 , ..., µN−1 }, where µk is a function of the state xk , is
admissible if µk (xk ) ∈ Uk (xk ) for all k, xk .
Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) },

where xk+1 = fk(xk, µk(xk), wk) for all k.

J∗(x0) = Jπ∗(x0) = min_{π∈Π} Jπ(x0),

where Π is the set of all admissible policies.


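For a finite state and disturbance space, Jπ(x0) for a given admissible policy can be computed exactly by propagating the state distribution forward through xk+1 = fk(xk, µk(xk), wk). A minimal sketch; all instance data below are illustrative:

```python
def policy_cost(mus, p_w, f, g, gN, x0, N):
    """Exact J_pi(x0) on a finite problem: push the state distribution
    forward through x_{k+1} = f_k(x_k, mu_k(x_k), w_k) and accumulate
    the expected stage costs g_k plus the terminal cost g_N."""
    dist = {x0: 1.0}              # P(x_k = x) under the policy pi
    cost = 0.0
    for k in range(N):
        nxt = {}
        for x, px in dist.items():
            u = mus[k](x)         # u_k = mu_k(x_k)
            for w, pw in p_w.items():
                cost += px * pw * g(k, x, u, w)
                y = f(k, x, u, w)
                nxt[y] = nxt.get(y, 0.0) + px * pw
        dist = nxt
    return cost + sum(px * gN(x) for x, px in dist.items())

# Sanity check on a deterministic instance (w_k = 1 w.p. 1): ordering
# u_k = 1 keeps the state at 0, so the cost is g = u in each of 2 periods.
J = policy_cost(
    mus=[lambda x: 1] * 2,
    p_w={1: 1.0},
    f=lambda k, x, u, w: x + u - w,
    g=lambda k, x, u, w: float(u),
    gN=lambda x: abs(x),
    x0=0, N=2,
)
assert J == 2.0
```

Evaluating one policy this way is cheap; the DP algorithm of the next section finds the minimizing policy without enumerating all of Π.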
Overview

1 Introduction

2 The Dynamic Programming Algorithm



Principle of Optimality
If a policy {µ∗0 , µ∗1 , ..., µ∗N−1 } is optimal for the problem from time 0 to
time N, then the truncated policy {µ∗k , µ∗k+1 , ..., µ∗N−1 } is optimal for the
subproblem minimizing the cost from time k to time N.
The tail portion of an optimal policy is optimal for the tail
subproblem.

A Travel Analogy
The fastest route from Beijing to Hong Kong is

Beijing →(µ∗0) Shanghai →(µ∗1) Guangzhou →(µ∗2) Shenzhen →(µ∗3) Hong Kong.

The fastest route from Shanghai to Hong Kong is

Shanghai →(µ∗1) Guangzhou →(µ∗2) Shenzhen →(µ∗3) Hong Kong.

The fastest route from Guangzhou to Hong Kong is

Guangzhou →(µ∗2) Shenzhen →(µ∗3) Hong Kong.


Principle of Optimality (Contd)

An optimal policy can be constructed in piecemeal fashion.


Construct an optimal policy for the “tail subproblem” involving the
last stage
Extend the optimal policy for the last two stages
Continue in this manner until an optimal policy for the entire problem
is constructed



Theorem 1
For every initial state x0 , the optimal cost J ∗ (x0 ) of the basic problem is
equal to J0 (x0 ), given by the last step of the following algorithm, which
proceeds backward in time from period N − 1 to period 0:

JN(xN) = gN(xN),
Jk(xk) = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) },   (1)

for any k = 0, 1, ..., N − 1, where the expectation is taken with respect to the probability distribution of wk, which may depend on xk and uk. Furthermore, if uk∗ = µ∗k(xk) minimizes the right side of (1) for each xk and k, the policy π∗ = {µ∗0, ..., µ∗N−1} is optimal.

The theorem holds as long as wk is independent of w0, ..., wk−1 conditional on xk, uk. For simplicity, we suppose that w0, ..., wN−1 are all independent.

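On a finite problem, the algorithm of Theorem 1 is a direct backward loop over k. A Python sketch, applied to a toy version of Example 1; the state range, demand distribution, and cost numbers are our own illustrative choices:

```python
def backward_dp(states, controls, p_w, f, g, gN, N):
    """The backward recursion of Theorem 1 on a finite problem:
    J_N(x) = g_N(x) and
    J_k(x) = min_{u in U_k(x)} E_w[ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) ].
    Returns cost-to-go tables J[k][x] and decision rules mu[k][x]."""
    J = [None] * (N + 1)
    mu = [dict() for _ in range(N)]
    J[N] = {x: gN(x) for x in states}
    for k in range(N - 1, -1, -1):
        J[k] = {}
        for x in states:
            best_u, best = None, float("inf")
            for u in controls(x):
                # Expected one-period cost plus cost-to-go at time k+1
                val = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                          for w, p in p_w.items())
                if val < best:
                    best_u, best = u, val
            J[k][x], mu[k][x] = best, best_u
    return J, mu

# Toy instance of Example 1: stock in {-2,...,4}, demand 0 or 1 with equal
# probability, order cost 2u, holding cost 1, shortage cost 3, gN = 0.
states = range(-2, 5)
J, mu = backward_dp(
    states=states,
    controls=lambda x: [u for u in (0, 1, 2) if -1 <= x + u <= 4],
    p_w={0: 0.5, 1: 0.5},
    f=lambda k, x, u, w: x + u - w,
    g=lambda k, x, u, w: 2 * u + max(x + u - w, 0) + 3 * max(w - x - u, 0),
    gN=lambda x: 0.0,
    N=3,
)
```

J[0][x0] is the optimal expected 3-period cost from each initial stock level, and mu[k][x] is the optimal order quantity, i.e., the policy π∗ of the theorem.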


Proof of Theorem 1
For any admissible policy π = {µ0 , µ1 , ..., µN−1 } and each
k = 0, 1, ..., N − 1, denote

π k = {µk , µk+1 , ..., µN−1 }.

For k = 0, 1, ..., N − 1, let Jk∗(xk) be the optimal cost for the (N − k)-stage problem that starts at state xk and time k, and ends at time N, i.e.,

Jk∗(xk) = min_{πk} E_{wk,...,wN−1}{ gN(xN) + Σ_{i=k}^{N−1} gi(xi, µi(xi), wi) }.

For k = N, we define JN∗(xN) = gN(xN).


By definition, we know that

J0∗ (x0 ) = J ∗ (x0 ).

We want to show J ∗ (x0 ) = J0 (x0 ). As long as Jk∗ (xk ) = Jk (xk ) for all k
and xk , we obtain J ∗ (x0 ) = J0∗ (x0 ) = J0 (x0 ).
Proof of Theorem 1 (Contd)

We will show Jk∗ (xk ) = Jk (xk ) for all k and xk by induction.

Period N. By definition,

JN∗ (xN ) = gN (xN ) and JN (xN ) = gN (xN ).

Thus, JN∗ (xN ) = JN (xN ) for any xN .

Period k, k = 0, 1, ..., N − 1.
Induction Hypothesis
Jk+1∗(xk+1) = Jk+1(xk+1) for all xk+1.



Proof of Theorem 1 (Contd)

Since π k = {µk , µk+1 , ..., µN−1 } and π k+1 = {µk+1 , ..., µN−1 },

π k = (µk , π k+1 ).

Jk∗(xk) = min_{πk} E_{wk,...,wN−1}{ gN(xN) + Σ_{i=k}^{N−1} gi(xi, µi(xi), wi) }
  = min_{(µk,πk+1)} E_{wk,...,wN−1}{ gN(xN) + Σ_{i=k}^{N−1} gi(xi, µi(xi), wi) }
  = min_{(µk,πk+1)} E_{wk,...,wN−1}{ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }



Proof of Theorem 1 (Contd)

Next, we would like to show


E_{wk,...,wN−1}{ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }
  = E_{wk}{ gk(xk, µk(xk), wk) + E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }.



Proof of Theorem 1 (Contd)

E_{wk,...,wN−1}{ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }
  = E_{wk}{ E_{wk,wk+1,...,wN−1}[ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) | wk ] }   (tower rule)
  = E_{wk}{ E_{wk+1,...,wN−1}[ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) | wk ] }
  = E_{wk}{ gk(xk, µk(xk), wk) + E_{wk+1,...,wN−1}[ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) | wk ] }



Proof of Theorem 1 (Contd)

E_{wk,...,wN−1}{ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }
  = E_{wk}{ gk(xk, µk(xk), wk) + E_{wk+1,...,wN−1}[ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) | wk ] }
  = E_{wk}{ gk(xk, µk(xk), wk) + E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } },

since wk, wk+1, ..., wN−1 are independent.



Proof of Theorem 1 (Contd)

 
Jk∗(xk) = min_{(µk,πk+1)} E_{wk,...,wN−1}{ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }
  = min_{(µk,πk+1)} E_{wk}{ gk(xk, µk(xk), wk) + E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }
  = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }



Move the minimization over πk+1 inside the expectation:

min_{(µk,πk+1)} E_{wk}{ gk(xk, µk(xk), wk) + E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }
  = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }



Let π(k+1)∗ = (µ∗k+1, µ∗k+2, ..., µ∗N−1) be an optimal policy to the tail subproblem:

min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }
  = E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µ∗i(xi), wi) }

Principle of optimality: the tail portion of an optimal policy is optimal for the tail subproblem.

min_{(µk,πk+1)} E_{wk}{ gk(xk, µk(xk), wk) + E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }
  = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µ∗i(xi), wi) } }
  = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }



  

 


min_{(µk,πk+1)} E_{wk}{ gk(xk, µk(xk), wk) + E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }
  = min_{(µk,πk+1)} { E_{wk}[gk(xk, µk(xk), wk)] + E_{wk}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ] }
  = min_{µk} { E_{wk}[gk(xk, µk(xk), wk)] + min_{πk+1} E_{wk}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ] }

since πk+1 does not appear in gk(xk, µk(xk), wk).

This holds for both closed-loop optimization and open-loop optimization (i.e., replace µk with uk and πk+1 with (uk+1, ..., uN−1)).
Next, we would like to show that

min_{πk+1} E_{wk}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]
  = E_{wk}[ min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]

for any given xk and µk. It then follows that

min_{(µk,πk+1)} E_{wk}{ gk(xk, µk(xk), wk) + E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }
  = min_{µk} { E_{wk}[gk(xk, µk(xk), wk)] + min_{πk+1} E_{wk}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ] }
  = min_{µk} { E_{wk}[gk(xk, µk(xk), wk)] + E_{wk}[ min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ] }
  = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }



Given xk and µk,

E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }

depends on wk only because xk+1 = fk(xk, µk(xk), wk) depends on wk.

E_{wk}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]
  = E_{wk}[ E_{xk+1}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } | wk ] ]
  = E_{xk+1}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]

The first equality is obtained since xk+1 is known when wk is given. The second equality follows from the tower rule.
πk+1 = {µk+1, µk+2, ..., µN−1}
     = {µk+1(xk+1) ∀xk+1, µk+2(xk+2) ∀xk+2, ..., µN−1(xN−1) ∀xN−1}

As xi+1 = fi(xi, µi(xi), wi) for all i ∈ {k + 1, ..., N − 1}, the states xk+2, ..., xN−1 all depend on xk+1. With a little abuse of notation,

πk+1 = {πk+1(xk+1) ∀xk+1}.

This simply represents that the actions we will take from period k + 1 to period N − 1 depend on the state xk+1 in period k + 1, which is in accordance with closed-loop optimization.



Given xk and µk,

min_{πk+1} E_{wk}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]
  = min_{πk+1} E_{xk+1}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]
  = min_{πk+1(xk+1)∀xk+1} E_{xk+1}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]
  = E_{xk+1}[ min_{πk+1(xk+1)∀xk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]

The last equality is analogous to min_{f(Z)} E[g(Z, f(Z))] = E[min_{f(Z)} g(Z, f(Z))] for some random variable Z, since xk+1 ↔ Z, πk+1 ↔ f, and E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } is a function of xk+1 and πk+1.
To see why min_{f(Z)} E[g(Z, f(Z))] = E[min_{f(Z)} g(Z, f(Z))], suppose that P(Z = zi) = pi for all i = 1, ..., n and Σ_i pi = 1.

min_{f(Z)} E[g(Z, f(Z))] = min_{f(z1),...,f(zn)} Σ_{i=1}^{n} pi g(zi, f(zi))
  = min_{y1,...,yn} Σ_{i=1}^{n} pi g(zi, yi)
  = Σ_{i=1}^{n} pi min_{yi} g(zi, yi)
  = Σ_{i=1}^{n} pi min_{f(zi)} g(zi, f(zi))
  = E[ min_{f(Z)} g(Z, f(Z)) ]

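The finite-support identity above is easy to confirm by brute force: minimizing jointly over all maps f and minimizing separately for each realization give the same value. A small Python check (the distribution, action set, and cost g below are arbitrary illustrative choices):

```python
import itertools

# Z takes values z_i with probabilities p_i; f ranges over all maps
# from {z_1,...,z_n} to the finite action set Y.
zs = [0, 1, 2]
ps = [0.2, 0.5, 0.3]
Y = [-1, 0, 1, 2]
g = lambda z, y: (z - y) ** 2 + 0.5 * abs(y)

# Left side: minimize the expectation jointly over all |Y|^n functions f.
lhs = min(sum(p * g(z, f[i]) for i, (z, p) in enumerate(zip(zs, ps)))
          for f in itertools.product(Y, repeat=len(zs)))

# Right side: minimize separately for each realization z_i, then average.
rhs = sum(p * min(g(z, y) for y in Y) for z, p in zip(zs, ps))

assert abs(lhs - rhs) < 1e-12
```

The left side enumerates |Y|^n candidate functions while the right side solves n independent scalar minimizations, which is exactly the computational saving the DP recursion exploits.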


Given xk and µk,

min_{πk+1} E_{wk}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]
  = E_{xk+1}[ min_{πk+1(xk+1)∀xk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]
  = E_{xk+1}[ min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]
  = E_{wk}[ E_{xk+1}[ min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } | wk ] ]
  = E_{wk}[ min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ]



  

 


min_{(µk,πk+1)} E_{wk}{ gk(xk, µk(xk), wk) + E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }
  = min_{µk} { E_{wk}[gk(xk, µk(xk), wk)] + min_{πk+1} E_{wk}[ E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ] }
  = min_{µk} { E_{wk}[gk(xk, µk(xk), wk)] + E_{wk}[ min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } ] }
  = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }



Proof of Theorem 1 (Contd)

Jk∗(xk)
  = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }
  = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1∗(fk(xk, µk(xk), wk)) },

where the second equality follows from xk+1 = fk(xk, µk(xk), wk) and the definition of Jk+1∗(xk+1), i.e.,

Jk+1∗(xk+1) = min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }.



Proof of Theorem 1 (Contd)
Jk∗(xk) = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1∗(fk(xk, µk(xk), wk)) }
  = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1(fk(xk, µk(xk), wk)) }
  = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }

The second equality follows from the induction hypothesis, i.e.,
Jk+1∗(xk+1) = Jk+1(xk+1) for all xk+1.
The third equality follows from the fact that for any function F of x and u, we have

min_{µ∈M} F(x, µ(x)) = min_{u∈U(x)} F(x, u),

where M is the set of all functions µ(x) such that µ(x) ∈ U(x) for all x.
Proof of Theorem 1 (Contd)
Jk∗(xk) = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1∗(fk(xk, µk(xk), wk)) }
  = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1(fk(xk, µk(xk), wk)) }
  = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }
  = Jk(xk)

The last equality follows from the definition of Jk(xk), i.e.,

Jk(xk) = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }.

This completes the induction proof that shows Jk∗(xk) = Jk(xk) for all k and xk. Recall that J∗(x0) = J0∗(x0). We obtain J∗(x0) = J0∗(x0) = J0(x0).



Proof of Theorem 1 (Contd)

To show the optimality of π ∗ = {µ∗0 , ..., µ∗N−1 }, where uk∗ = µ∗k (xk )
minimizes the right side of
Jk(xk) = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }

for each xk and k, we can make use of the principle of optimality, i.e., “the
tail portion of an optimal policy is optimal for the tail subproblem.”
An optimal policy for the problem JN−1 is {µ∗N−1 }.
An optimal policy for the problem JN−2 is µ∗N−2 plus an optimal
policy for the problem JN−1 , i.e., {µ∗N−2 , µ∗N−1 }.
Continuing in this manner, an optimal policy for the problem J0 , i.e.,
J ∗ , is π ∗ = {µ∗0 , ..., µ∗N−1 }.



Proof of Theorem 1 (Contd)

Period N − 1. For any given xN−1,

JN−1(xN−1) = min_{uN−1∈UN−1(xN−1)} E_{wN−1}{ gN−1(xN−1, uN−1, wN−1) + JN(fN−1(xN−1, uN−1, wN−1)) }
  = E_{wN−1}{ gN−1(xN−1, µ∗N−1(xN−1), wN−1) + JN(fN−1(xN−1, µ∗N−1(xN−1), wN−1)) },

since uN−1 = µ∗N−1(xN−1) is the optimal solution to the optimization problem on the right-hand side of the first equality. Let xN = fN−1(xN−1, µ∗N−1(xN−1), wN−1). We have

JN−1(xN−1) = E_{wN−1}{ gN(xN) + gN−1(xN−1, µ∗N−1(xN−1), wN−1) }.



Proof of Theorem 1 (Contd)
Period k, for any k = 0, 1, ..., N − 2.
Induction Hypothesis

Jk+1(xk+1) = E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µ∗i(xi), wi) },

where xi+1 = fi(xi, µ∗i(xi), wi) for i = k + 1, ..., N − 1.

For any given xk,

Jk(xk) = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }
  = E_{wk}{ gk(xk, µ∗k(xk), wk) + Jk+1(fk(xk, µ∗k(xk), wk)) },

since uk∗ = µ∗k(xk) is the optimal solution to the optimization problem on the right-hand side of the first equality. Let xk+1 = fk(xk, µ∗k(xk), wk). We have

Jk(xk) = E_{wk}{ gk(xk, µ∗k(xk), wk) + Jk+1(xk+1) }.
Proof of Theorem 1 (Contd)
Period k, for any k = 0, 1, ..., N − 2.
Induction Hypothesis

Jk+1(xk+1) = E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µ∗i(xi), wi) },

where xi+1 = fi(xi, µ∗i(xi), wi) for i = k + 1, ..., N − 1.

Jk(xk) = E_{wk}{ gk(xk, µ∗k(xk), wk) + Jk+1(xk+1) }
  = E_{wk}{ gk(xk, µ∗k(xk), wk) + E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µ∗i(xi), wi) } },

where xk+1 = fk(xk, µ∗k(xk), wk) and xi+1 = fi(xi, µ∗i(xi), wi) for i = k + 1, ..., N − 1.



Proof of Theorem 1 (Contd)

Jk(xk) = E_{wk}{ gk(xk, µ∗k(xk), wk) + E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µ∗i(xi), wi) } }
  = E_{wk,...,wN−1}{ gN(xN) + gk(xk, µ∗k(xk), wk) + Σ_{i=k+1}^{N−1} gi(xi, µ∗i(xi), wi) },

where xk+1 = fk(xk, µ∗k(xk), wk) and xi+1 = fi(xi, µ∗i(xi), wi) for i = k + 1, ..., N − 1. That is,

Jk(xk) = E_{wk,...,wN−1}{ gN(xN) + Σ_{i=k}^{N−1} gi(xi, µ∗i(xi), wi) },

where xi+1 = fi(xi, µ∗i(xi), wi) for i = k, ..., N − 1.



Proof of Theorem 1 (Contd)

Thus, for any x0 ,


J0(x0) = E_{w0,...,wN−1}{ gN(xN) + Σ_{i=0}^{N−1} gi(xi, µ∗i(xi), wi) },

where xk+1 = fk (xk , µ∗k (xk ), wk ) for k = 0, ..., N − 1, i.e.,

J ∗ (x0 ) = J0 (x0 ) = Jπ∗ (x0 ),

where π ∗ = {µ∗0 , ..., µ∗N−1 }. This implies that π ∗ is an optimal policy.



The Bellman Equation

Jk(xk) = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }

Here Jk(xk) is the cost-to-go at time k, gk(xk, uk, wk) is the cost at time k, and Jk+1(fk(xk, uk, wk)) is the cost-to-go at time k + 1.

Jk (xk ) is the optimal cost for an (N − k)-stage problem starting at


state xk and time k, and ending at time N.
▶ Jk (xk ) is called the cost-to-go at state xk and time k.
▶ Jk is called the cost-to-go function at time k.



Numerical Execution of the DP Algorithm

The minimization in the DP recursion (1), i.e.,

Jk(xk) = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) },

must be carried out for each value of xk.


Curse of dimensionality
▶ Suppose that any state in a period always leads to two possible states
in the next period. Starting from x0 , the number of possible states in
period k, i.e., the number of the possible values of xk , can be 2k .

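The worst case in the bullet above can be reproduced directly: with dynamics chosen so that every state has two distinct successors, the number of reachable states doubles each period. A Python sketch (the dynamics 2x + s are an illustrative worst case, not from the slides):

```python
def reachable(x0, f, supports, N):
    """Number of distinct states reachable in each period when, from any
    state, the pair (control, disturbance) can produce len(supports)
    candidate successor states f(x, s)."""
    layer, counts = {x0}, [1]
    for k in range(N):
        layer = {f(x, s) for x in layer for s in supports}
        counts.append(len(layer))
    return counts

# With f(x, s) = 2*x + s and s in {0, 1}, every state splits into two new
# ones, so period k has 2**k reachable states: exponential growth in k.
counts = reachable(0, lambda x, s: 2 * x + s, (0, 1), 6)
print(counts)  # [1, 2, 4, 8, 16, 32, 64]
```

By contrast, dynamics like xk + uk − wk on a bounded integer range keep merging trajectories, which is why the inventory example stays tractable.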
