DP AS AN OPTIMIZATION METHODOLOGY
• Generic optimization problem:
min g(u)
u∈U
where u is the optimization/decision variable,
g(u) is the cost function, and U is the constraint
set
• Categories of problems:
− Discrete (U is finite) or continuous
− Linear (g is linear and U is polyhedral) or
nonlinear
− Stochastic or deterministic: In stochastic problems the cost involves a random parameter w, which is averaged out, i.e., it has the form
      g(u) = E_{w}{ G(u, w) }
• DP can deal with complex stochastic problems
where information about w becomes available in
stages, and the decisions are also made in stages
and make use of this information.
BASIC STRUCTURE OF STOCHASTIC DP
• Discrete-time system
xk+1 = fk (xk , uk , wk ), k = 0, 1, . . . , N − 1
− k: Discrete time
− xk : State; summarizes past information that
is relevant for future optimization
− uk : Control; decision to be selected at time
k from a given set
− wk : Random parameter (also called disturbance or noise depending on the context)
− N : Horizon or number of times control is
applied
• Cost function that is additive over time
      E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }
• Alternative system description: P (xk+1 | xk , uk )
xk+1 = wk with P (wk | xk , uk ) = P (xk+1 | xk , uk )
INVENTORY CONTROL EXAMPLE
[Figure: inventory system block diagram — stock xk and order uk enter the inventory system, demand wk is subtracted, giving xk+1 = xk + uk − wk; the cost of period k is c uk + r(xk + uk − wk)]
• Discrete-time system
xk+1 = fk (xk , uk , wk ) = xk + uk − wk
• Cost function that is additive over time
      E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }
         = E{ Σ_{k=0}^{N−1} ( c uk + r(xk + uk − wk) ) }
(terminal cost gN ≡ 0 here)
• Optimization over policies: Rules/functions uk =
µk (xk ) that map states to controls
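The backward recursion for this example can be sketched numerically. Everything below (horizon, ordering cost c, holding/shortage cost r(·) = |·|, the uniform demand, and the truncated stock range) is a hypothetical instance chosen only to make the sketch runnable:

```python
# Backward DP for the inventory problem x_{k+1} = x_k + u_k - w_k.
# Hypothetical instance: N = 3 periods, ordering cost c = 1 per unit,
# holding/shortage cost r(x) = |x|, demand uniform on {0, 1, 2},
# stock truncated to [-4, 4] to keep the state space finite.
N, c = 3, 1.0
demands = [0, 1, 2]
states = list(range(-4, 5))
orders = [0, 1, 2]                    # admissible orders U_k(x_k)

def r(x):
    return abs(x)                     # assumed holding/shortage cost

J = {x: 0.0 for x in states}          # terminal cost g_N = 0
policy = []
for k in reversed(range(N)):
    Jk, mu = {}, {}
    for x in states:
        best_cost, best_u = float("inf"), None
        for u in orders:
            # expected stage cost plus cost-to-go, averaged over demand
            exp_cost = 0.0
            for w in demands:
                nxt = max(-4, min(4, x + u - w))   # truncate for the sketch
                exp_cost += (c * u + r(x + u - w) + J[nxt]) / len(demands)
            if exp_cost < best_cost:
                best_cost, best_u = exp_cost, u
        Jk[x], mu[x] = best_cost, best_u
    J, policy = Jk, [mu] + policy

print(J[0], policy[0][0])             # optimal cost and first order at x0 = 0
```

Note that the output µ0(·) is a rule mapping each stock level to an order, not a single number — this is exactly the "optimization over policies" above.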
ADDITIONAL ASSUMPTIONS
• The set of values that the control uk can take depends at most on xk and not on prior x or u
• Probability distribution of wk does not depend
on past values wk−1 , . . . , w0 , but may depend on
xk and uk
− Otherwise past values of w or x would be
useful for future optimization
• Sequence of events envisioned in period k:
− xk occurs according to
      xk = fk−1(xk−1, uk−1, wk−1)
− uk is selected with knowledge of xk , i.e.,
uk ∈ Uk (xk )
− wk is random and generated according to a
distribution
Pwk (xk , uk )
DETERMINISTIC FINITE-STATE PROBLEMS
• Scheduling example: Find optimal sequence of
operations A, B, C, D
• A must precede B, and C must precede D
• Given startup costs SA and SC , and setup transition cost Cmn from operation m to operation n
[Figure: state-transition graph for the scheduling problem — from the initial state, arcs with startup costs SA (to A) and SC (to C), then arcs with transition costs Cmn through the states AB, AC, CA, CD and on to ABC, ACB, ACD, CAB, CAD, CDA]
STOCHASTIC FINITE-STATE PROBLEMS
• Example: Find two-game chess match strategy
• Timid play draws with prob. pd > 0 and loses
with prob. 1 − pd . Bold play wins with prob. pw <
1/2 and loses with prob. 1 − pw
[Figure: transition diagrams for the 1st and 2nd game under timid and bold play — under timid play each game is drawn with probability pd and lost with probability 1 − pd; under bold play it is won with probability pw and lost with probability 1 − pw; scores range from 2-0 down to 0-2]
BASIC PROBLEM
• System xk+1 = fk (xk , uk , wk ), k = 0, . . . , N −1
• Control constraints uk ∈ Uk (xk )
• Probability distribution Pk (· | xk , uk ) of wk
• Policies π = {µ0 , . . . , µN −1 }, where µk maps
states xk into controls uk = µk (xk ) and is such
that µk (xk ) ∈ Uk (xk ) for all xk
• Expected cost of π starting at x0 is
      Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }
• Optimal cost function
J ∗ (x0 ) = min Jπ (x0 )
π
• Optimal policy π ∗ satisfies
Jπ∗ (x0 ) = J ∗ (x0 )
When produced by DP, π ∗ is independent of x0 .
SIGNIFICANCE OF FEEDBACK
• Open-loop versus closed-loop policies
[Figure: closed-loop system block diagram — the controller µk produces uk = µk(xk), which drives the system xk+1 = fk(xk, uk, wk) together with the disturbance wk; the state xk is fed back to the controller]
• In deterministic problems open loop is as good
as closed loop
• Value of information; chess match example
• Example of open-loop policy: Play always bold
• Consider the closed-loop policy: Play timid if
and only if you are ahead
[Figure: transition diagram of the closed-loop policy — at 0-0 play bold (win pw → 1-0, lose 1 − pw → 0-1); when ahead at 1-0 play timid (draw pd → 1.5-0.5, lose 1 − pd → 1-1); when behind at 0-1 play bold (win pw → 1-1, lose 1 − pw)]
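The value of feedback in this example can be quantified. A minimal sketch, assuming (this is not stated above, but is the standard convention for this example) that a match tied at 1-1 after two games is decided by a single sudden-death game played bold:

```python
# Probability of winning the two-game match, for pw < 1/2 and pd > pw.
pw, pd = 0.45, 0.9   # illustrative values

# Open-loop: play bold in both games; a 1-1 tie is resolved by one more
# bold game (assumption), won with probability pw.
open_loop = pw**2 + 2*pw*(1 - pw)*pw

# Closed-loop: play timid iff ahead in the score.
# Game 1 bold: win (pw) -> timid at 1-0: draw (pd) wins 1.5-0.5,
#   loss (1 - pd) -> 1-1 -> sudden death won with pw;
# Game 1 loss (1 - pw) -> bold at 0-1: win (pw) -> 1-1 -> sudden death (pw).
closed_loop = pw*(pd + (1 - pd)*pw) + (1 - pw)*pw*pw

print(open_loop, closed_loop)
```

For these numbers the closed-loop policy wins the match with probability above 1/2 even though pw < 1/2 — information about the current score is worth real probability mass.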
VARIANTS OF DP PROBLEMS
• Continuous-time problems
• Imperfect state information problems
• Infinite horizon problems
• Suboptimal control
PRINCIPLE OF OPTIMALITY
• Let π ∗ = {µ∗0 , µ∗1 , . . . , µ∗N −1 } be an optimal policy
• Consider the “tail subproblem” whereby we are at xi at time i and wish to minimize the “cost-to-go” from time i to time N
      E{ gN(xN) + Σ_{k=i}^{N−1} gk(xk, µk(xk), wk) }
and the “tail policy” {µ∗i , µ∗i+1 , . . . , µ∗N −1 }
[Figure: time line 0 … i … N, with the tail subproblem starting at state xi at time i]
• Principle of optimality: The tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past)
• DP first solves ALL tail subproblems of the final stage
• At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length
DETERMINISTIC SCHEDULING EXAMPLE
• Find optimal sequence of operations A, B, C,
D (A must precede B and C must precede D)
[Figure: the scheduling problem as a graph — nodes are the partial schedules (A, C, AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA) reached from the initial state, and arcs carry the startup and transition costs]
• Start from the last tail subproblem and go backwards
• At each state-time pair, we record the optimal
cost-to-go and the optimal decision
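The backward recursion over (completed operations, last operation) can be sketched as follows. The startup and transition costs below are hypothetical, chosen only to make the sketch runnable; they are not the numbers from the figure:

```python
from functools import lru_cache

# Hypothetical costs: startup S[m] for the first operation (A or C only),
# transition C[(m, n)] from operation m to operation n.
S = {"A": 5, "C": 3}
C = {("A","B"): 2, ("A","C"): 3, ("A","D"): 4, ("B","C"): 1, ("B","D"): 6,
     ("C","A"): 3, ("C","B"): 4, ("C","D"): 6, ("D","A"): 2, ("D","B"): 3}

def allowed(op, done):
    # precedence constraints: A before B, C before D
    if op in done:
        return False
    if op == "B" and "A" not in done:
        return False
    if op == "D" and "C" not in done:
        return False
    return True

@lru_cache(maxsize=None)
def J(done, last):
    # optimal cost-to-go from state (set of completed ops, last op)
    if len(done) == 4:
        return 0.0
    return min(C[(last, op)] + J(done | frozenset([op]), op)
               for op in "ABCD" if allowed(op, done))

best = min(S[f] + J(frozenset([f]), f) for f in "AC")
print(best)
```

Each memoized call records the optimal cost-to-go at one state, exactly as in the backward pass described above.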
STOCHASTIC INVENTORY EXAMPLE
[Figure: inventory system block diagram, as before — xk+1 = xk + uk − wk, with cost of period k equal to c uk + r(xk + uk − wk)]
• Tail Subproblems of Length 1:
      JN−1(xN−1) = min_{uN−1 ≥ 0} E_{wN−1}{ c uN−1 + r(xN−1 + uN−1 − wN−1) }
• Tail Subproblems of Length N − k:
      Jk(xk) = min_{uk ≥ 0} E_{wk}{ c uk + r(xk + uk − wk) + Jk+1(xk + uk − wk) }
• J0 (x0 ) is opt. cost of initial state x0
DP ALGORITHM
• Start with
JN (xN ) = gN (xN ),
and go backwards using
      Jk(xk) = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) },   k = 0, 1, . . . , N − 1.
• Then J0 (x0 ), generated at the last step, is equal
to the optimal cost J ∗ (x0 ). Also, the policy
π ∗ = {µ∗0 , . . . , µ∗N −1 }
where µ∗k (xk ) minimizes in the right side above for
each xk and k, is optimal
• Justification: Proof by induction that Jk (xk ) is
equal to Jk∗ (xk ), defined as the optimal cost of the
tail subproblem that starts at time k at state xk
• Note:
− ALL the tail subproblems are solved (in addition to the original problem)
− Intensive computational requirements
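As a sanity check on the algorithm, here is a tiny finite-state instance where J0(x0) from the backward recursion is compared against exhaustive enumeration of every closed-loop policy. All the numbers below are arbitrary:

```python
import itertools

# Two states, two controls, horizon N = 2; arbitrary costs and transitions.
states, controls, N = [0, 1], [0, 1], 2
P = {(0,0): [0.7, 0.3], (0,1): [0.2, 0.8],     # P[(x,u)][y] = Prob(next = y)
     (1,0): [0.5, 0.5], (1,1): [0.9, 0.1]}
g = {(0,0): 1.0, (0,1): 0.4, (1,0): 0.3, (1,1): 2.0}   # stage cost g(x,u)
gN = {0: 0.0, 1: 1.5}                                  # terminal cost

# Backward DP: J_N = gN, then J_k(x) = min_u [ g(x,u) + E{ J_{k+1}(next) } ]
J = dict(gN)
for k in range(N):
    J = {x: min(g[(x, u)] + sum(P[(x, u)][y] * J[y] for y in states)
                for u in controls)
         for x in states}

# Exhaustive check: evaluate every policy mu_k(x), k = 0, ..., N-1
def evaluate(pol, x, k):
    if k == N:
        return gN[x]
    u = pol[k][x]
    return g[(x, u)] + sum(P[(x, u)][y] * evaluate(pol, y, k + 1) for y in states)

best = min(evaluate(pol, 0, 0)
           for pol in itertools.product(itertools.product(controls, repeat=2),
                                        repeat=N))
print(J[0], best)    # the two agree
```

The enumeration grows exponentially in the horizon and state count, which is precisely the "intensive computational requirements" that the single backward pass avoids.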
PROOF OF THE INDUCTION STEP
• Let πk = {µk , µk+1 , . . . , µN −1 } denote a tail policy from time k onward
• Assume that Jk+1(xk+1) = Jk+1∗(xk+1). Then

Jk∗(xk) = min_{(µk, πk+1)} E_{wk,...,wN−1}{ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }

        = min_{µk} E_{wk}{ gk(xk, µk(xk), wk)
              + min_{πk+1} E_{wk+1,...,wN−1}[ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) ] }

        = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1∗( fk(xk, µk(xk), wk) ) }

        = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1( fk(xk, µk(xk), wk) ) }

        = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) }

        = Jk(xk)
LINEAR-QUADRATIC ANALYTICAL EXAMPLE
[Figure: two ovens in series — material at initial temperature x0 enters Oven 1 (temperature u0), exits at temperature x1, enters Oven 2 (temperature u1), and exits at final temperature x2]
• System
xk+1 = (1 − a)xk + auk , k = 0, 1,
where a is a given scalar from the interval (0, 1)
• Cost
r(x2 − T )2 + u20 + u21
where r is a given positive scalar
• DP Algorithm:
      J2(x2) = r(x2 − T)^2
      J1(x1) = min_{u1} [ u1^2 + r( (1 − a)x1 + a u1 − T )^2 ]
      J0(x0) = min_{u0} [ u0^2 + J1( (1 − a)x0 + a u0 ) ]
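Carrying out the minimization in J1 in closed form (set the derivative of u1^2 + r((1 − a)x1 + a u1 − T)^2 with respect to u1 to zero) gives u1* = r a (T − (1 − a)x1) / (1 + r a^2). A quick numeric check, with arbitrary values for a, r, T, x1:

```python
# Verify the closed-form minimizer of the J1 minimization by grid search.
a, r, T, x1 = 0.7, 2.0, 100.0, 60.0    # arbitrary instance

def stage(u1):
    # the quantity minimized in the J1 step of the DP algorithm
    return u1**2 + r*((1 - a)*x1 + a*u1 - T)**2

u_star = r*a*(T - (1 - a)*x1) / (1 + r*a**2)
grid = [u_star + d/100.0 for d in range(-500, 501)]
u_best = min(grid, key=stage)
print(u_star, u_best)    # grid minimizer coincides with the analytic one
```

Because the objective is a strictly convex quadratic in u1, the stationary point is the unique minimizer; substituting it back makes J1 quadratic in x1, which is what keeps the J0 step tractable as well.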
STATE AUGMENTATION
• When assumptions of the basic problem are violated (e.g., disturbances are correlated, cost is nonadditive, etc.), reformulate/augment the state
• DP algorithm still applies, but the problem gets
BIGGER
• Example: Time lags
xk+1 = fk (xk , xk−1 , uk , wk )
• Introduce additional state variable yk = xk−1 .
New system takes the form
      xk+1 = fk(xk, yk, uk, wk),    yk+1 = xk
View x̃k = (xk , yk ) as the new state.
• DP algorithm for the reformulated problem:
      Jk(xk, xk−1) = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, xk−1, uk, wk), xk ) }
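A minimal sketch of the augmentation, using a hypothetical linear lagged system f(x, y, u) = 0.5x + 0.3y + u: the pair (xk, yk) with yk = xk−1 evolves as an ordinary first-order system.

```python
def f(x, y, u):
    # hypothetical dynamics with a one-step lag: x_{k+1} = f(x_k, x_{k-1}, u_k)
    return 0.5*x + 0.3*y + u

def f_tilde(state, u):
    # augmented first-order dynamics on the pair (x_k, y_k), y_k = x_{k-1}
    x, y = state
    return (f(x, y, u), x)

state = (1.0, 0.0)             # (x_0, x_{-1})
for u in (0.1, 0.2, 0.3):
    state = f_tilde(state, u)
print(state)                   # (x_3, x_2)
```

The augmented state doubles in size here; in general this is the price of restoring the Markov structure that the DP algorithm requires.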