Chapter 1 Introduction and Basic Dynamic Programming Algorithm

The document discusses Dynamic Programming (DP) as an optimization methodology, outlining its structure for solving both deterministic and stochastic problems. It covers the formulation of optimization problems, the significance of feedback in decision-making, and various examples including inventory control and scheduling. Additionally, it explains the principle of optimality and the DP algorithm for finding optimal policies and costs in complex systems.


DP AS AN OPTIMIZATION METHODOLOGY

• Generic optimization problem:

min_{u ∈ U} g(u)

where u is the optimization/decision variable, g(u) is the cost function, and U is the constraint set
• Categories of problems:
− Discrete (U is finite) or continuous
− Linear (g is linear and U is polyhedral) or nonlinear
− Stochastic or deterministic: In stochastic problems the cost involves a stochastic parameter w, which is averaged, i.e., it has the form

g(u) = E_w{ G(u, w) }

where w is a random parameter.


• DP can deal with complex stochastic problems
where information about w becomes available in
stages, and the decisions are also made in stages
and make use of this information.
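• To make the averaged cost concrete, here is a minimal Python sketch (not from the original slides): the quadratic G and the uniform distribution of w are invented for illustration, and g(u) = E_w{G(u, w)} is estimated by sampling.

    import random

    def G(u, w):
        # Hypothetical cost: quadratic in the decision u, shifted by the noise w.
        return (u - w) ** 2

    def g(u, num_samples=100_000):
        # Estimate g(u) = E_w[G(u, w)] by Monte Carlo, with w ~ Uniform(0, 1).
        return sum(G(u, random.random()) for _ in range(num_samples)) / num_samples

    # Minimize the estimated averaged cost over a discrete constraint set U.
    U = [i / 10 for i in range(11)]
    u_star = min(U, key=g)   # close to 0.5, the mean of w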
BASIC STRUCTURE OF STOCHASTIC DP

• Discrete-time system

xk+1 = fk (xk , uk , wk ), k = 0, 1, . . . , N − 1

− k: Discrete time
− xk : State; summarizes past information that
is relevant for future optimization
− uk : Control; decision to be selected at time
k from a given set
− wk : Random parameter (also called distur-
bance or noise depending on the context)
− N : Horizon or number of times control is
applied

• Cost function that is additive over time


E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }

• Alternative system description: P (xk+1 | xk , uk )

xk+1 = wk with P (wk | xk , uk ) = P (xk+1 | xk , uk )
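• A minimal simulation sketch of this staged structure follows; the dynamics, stage costs, demand-like noise distribution, and the fixed feedback policy are placeholder assumptions, not data from the text.

    import random

    N = 5   # horizon

    def f(k, x, u, w):
        return x + u - w              # hypothetical dynamics f_k

    def g(k, x, u, w):
        return u + abs(x + u - w)     # hypothetical stage cost g_k

    def g_N(x):
        return abs(x)                 # hypothetical terminal cost g_N

    def mu(k, x):
        return max(0, 2 - x)          # a simple feedback policy u_k = mu_k(x_k)

    def sample_cost():
        # One sample of g_N(x_N) + sum over k of g_k(x_k, mu_k(x_k), w_k).
        x, cost = 0, 0.0
        for k in range(N):
            u = mu(k, x)
            w = random.randint(0, 2)  # hypothetical disturbance distribution
            cost += g(k, x, u, w)
            x = f(k, x, u, w)
        return cost + g_N(x)

    # The additive cost is an expectation; estimate it by averaging samples.
    estimate = sum(sample_cost() for _ in range(10_000)) / 10_000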


INVENTORY CONTROL EXAMPLE

[Figure: Inventory control block diagram. Demand wk arrives at period k; the stock xk evolves as xk+1 = xk + uk − wk, where uk is the stock ordered at period k. Cost of period k: cuk + r(xk + uk − wk).]

• Discrete-time system

xk+1 = fk (xk , uk , wk ) = xk + uk − wk
• Cost function that is additive over time
E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) } = E{ Σ_{k=0}^{N−1} ( cuk + r(xk + uk − wk) ) }

• Optimization over policies: Rules/functions uk = µk(xk) that map states to controls
ADDITIONAL ASSUMPTIONS

• The set of values that the control uk can take depends at most on xk and not on prior x or u
• Probability distribution of wk does not depend
on past values wk−1 , . . . , w0 , but may depend on
xk and uk
− Otherwise past values of w or x would be
useful for future optimization
• Sequence of events envisioned in period k:
− xk occurs according to

xk = fk−1(xk−1, uk−1, wk−1)

− uk is selected with knowledge of xk , i.e.,

uk ∈ Uk (xk )

− wk is random and generated according to a distribution Pwk(· | xk, uk)
DETERMINISTIC FINITE-STATE PROBLEMS

• Scheduling example: Find an optimal sequence for operations A, B, C, D
• A must precede B, and C must precede D
• Given startup costs SA and SC, and setup transition costs Cmn from operation m to operation n

[Figure: State transition graph for the scheduling example. Startup arcs with costs SA and SC lead from the initial state to A and C; setup arcs with costs Cmn lead through the partial schedules AB, AC, CA, CD and then ABC, ACB, ACD, CAB, CAD, CDA, after which the one remaining operation completes the schedule.]
STOCHASTIC FINITE-STATE PROBLEMS

• Example: Find a two-game chess match strategy
• Timid play draws with prob. pd > 0 and loses with prob. 1 − pd. Bold play wins with prob. pw < 1/2 and loses with prob. 1 − pw

[Figure: Transition diagrams. 1st game: timid play takes 0-0 to 0.5-0.5 w.p. pd and to 0-1 w.p. 1 − pd; bold play takes 0-0 to 1-0 w.p. pw and to 0-1 w.p. 1 − pw. 2nd game: from each first-game score, timid play draws w.p. pd (e.g., 1-0 to 1.5-0.5) and loses w.p. 1 − pd, while bold play wins w.p. pw (e.g., 1-0 to 2-0) and loses w.p. 1 − pw.]


BASIC PROBLEM

• System xk+1 = fk (xk , uk , wk ), k = 0, . . . , N −1


• Control constraints uk ∈ Uk (xk )
• Probability distribution Pk (· | xk , uk ) of wk
• Policies π = {µ0 , . . . , µN −1 }, where µk maps
states xk into controls uk = µk (xk ) and is such
that µk (xk ) ∈ Uk (xk ) for all xk
• Expected cost of π starting at x0 is

Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }

• Optimal cost function

J*(x0) = min_{π} Jπ(x0)

• Optimal policy π* satisfies

Jπ*(x0) = J*(x0)

When produced by DP, π* is independent of x0.


SIGNIFICANCE OF FEEDBACK

• Open-loop versus closed-loop policies

[Figure: Closed-loop system. A controller µk observes the state xk and applies uk = µk(xk) to the system xk+1 = fk(xk, uk, wk), which is driven by the disturbance wk.]

• In deterministic problems open loop is as good as closed loop
• Value of information; chess match example
• Example of an open-loop policy: Always play bold
• Consider the closed-loop policy: Play timid if
and only if you are ahead

[Figure: Transitions under the closed-loop policy. From 0-0, bold play leads to 1-0 w.p. pw and to 0-1 w.p. 1 − pw. At 1-0, timid play wins the match 1.5-0.5 w.p. pd and gives 1-1 w.p. 1 − pd. At 0-1, bold play gives 1-1 w.p. pw and loses the match w.p. 1 − pw.]
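• A small computation makes the value of feedback explicit. The sketch below assumes, as in the standard version of this example, that a 1-1 tie sends the match into sudden death, where the player plays bold and wins with probability pw; the numbers pw = 0.45, pd = 0.9 are illustrative only.

    def p_win_always_bold(pw):
        # Open-loop policy: bold in both games; a 1-1 tie goes to sudden death.
        return pw * pw + 2 * pw * (1 - pw) * pw

    def p_win_timid_iff_ahead(pw, pd):
        # Closed-loop policy: bold at 0-0 and when behind, timid when ahead.
        # At 1-0, timid: a draw (prob pd) wins 1.5-0.5; a loss gives 1-1.
        # At 0-1, bold: a win (prob pw) gives 1-1; a loss ends the match.
        p_from_1_0 = pd + (1 - pd) * pw
        p_from_0_1 = pw * pw
        return pw * p_from_1_0 + (1 - pw) * p_from_0_1

    pw, pd = 0.45, 0.9
    print(p_win_always_bold(pw))          # 0.42525: below 1/2, since pw < 1/2
    print(p_win_timid_iff_ahead(pw, pd))  # 0.536625: feedback lifts it above 1/2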
VARIANTS OF DP PROBLEMS

• Continuous-time problems
• Imperfect state information problems
• Infinite horizon problems
• Suboptimal control
PRINCIPLE OF OPTIMALITY

• Let π* = {µ*0, µ*1, . . . , µ*N−1} be an optimal policy
• Consider the “tail subproblem” whereby we are at xi at time i and wish to minimize the “cost-to-go” from time i to time N

E{ gN(xN) + Σ_{k=i}^{N−1} gk(xk, µk(xk), wk) }

and the “tail policy” {µ*i, µ*i+1, . . . , µ*N−1}


[Figure: Time line from 0 to N; the tail subproblem starts at state xi at time i.]

• Principle of optimality: The tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past)
• DP first solves ALL tail subproblems of the final stage
• At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length
DETERMINISTIC SCHEDULING EXAMPLE

• Find an optimal sequence for operations A, B, C, D (A must precede B and C must precede D)

[Figure: The scheduling graph of the earlier example, now annotated with numeric arc costs and, at each state, the optimal cost-to-go and optimal decision obtained by the backward recursion.]

• Start from the last tail subproblem and go backwards
• At each state-time pair, we record the optimal
cost-to-go and the optimal decision
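• A sketch of this backward computation on the scheduling graph follows; the graph structure matches the example (A before B, C before D), but the numeric arc costs are hypothetical, since the figure's values do not survive extraction.

    # Arc costs: startup costs out of "start", setup costs C_mn elsewhere
    # (all numeric values are made up for illustration).
    arcs = {
        "start": {"A": 5, "C": 3},
        "A": {"AB": 2, "AC": 4}, "C": {"CA": 4, "CD": 6},
        "AB": {"ABC": 6}, "AC": {"ACB": 1, "ACD": 3},
        "CA": {"CAB": 1, "CAD": 3}, "CD": {"CDA": 2},
        "ABC": {"ABCD": 1}, "ACB": {"ACBD": 2}, "ACD": {"ACDB": 3},
        "CAB": {"CABD": 1}, "CAD": {"CADB": 2}, "CDA": {"CDAB": 3},
    }

    J, best = {}, {}   # optimal cost-to-go and optimal decision per state

    def cost_to_go(state):
        if state not in arcs:      # complete schedules incur no further cost
            return 0
        if state not in J:         # solve each tail subproblem exactly once
            succ = min(arcs[state], key=lambda s: arcs[state][s] + cost_to_go(s))
            best[state] = succ
            J[state] = arcs[state][succ] + cost_to_go(succ)
        return J[state]

    print(cost_to_go("start"))   # optimal cost; follow best[...] for the schedule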
STOCHASTIC INVENTORY EXAMPLE

[Figure: Inventory control block diagram, as before: demand wk, order uk, dynamics xk+1 = xk + uk − wk, and period-k cost cuk + r(xk + uk − wk).]

• Tail Subproblems of Length 1:



JN−1(xN−1) = min_{uN−1 ≥ 0} E_{wN−1}{ cuN−1 + r(xN−1 + uN−1 − wN−1) }

• Tail Subproblems of Length N − k:



Jk(xk) = min_{uk ≥ 0} E_{wk}{ cuk + r(xk + uk − wk) + Jk+1(xk + uk − wk) }

• J0(x0) is the optimal cost for initial state x0
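• The recursion can be run numerically once concrete data are chosen. In the sketch below the horizon, cost coefficients, demand distribution, and the truncation of stock and order levels to small grids are all assumptions made for illustration.

    N = 3
    c = 1.0                               # per-unit ordering cost (assumed)
    demand = {0: 0.1, 1: 0.7, 2: 0.2}     # assumed distribution of w_k
    states = range(0, 11)                 # stock levels x_k, truncated grid
    orders = range(0, 6)                  # order sizes u_k >= 0, truncated

    def r(x):
        # Assumed holding/shortage cost: holding for x >= 0, penalty otherwise.
        return 0.5 * x if x >= 0 else -3.0 * x

    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = 0.0                     # no terminal cost, as in the example

    for k in range(N - 1, -1, -1):        # backward recursion
        for x in states:
            best_u, best_q = None, float("inf")
            for u in orders:
                # E_w{ c*u + r(x + u - w) + J_{k+1}(x + u - w) },
                # with the next state clipped onto the truncated grid.
                q = sum(p * (c * u + r(x + u - w)
                             + J[k + 1][min(max(x + u - w, 0), 10)])
                        for w, p in demand.items())
                if q < best_q:
                    best_u, best_q = u, q
            J[k][x], mu[k][x] = best_q, best_u

    print(J[0][0], mu[0][0])   # optimal cost and first order from x_0 = 0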


DP ALGORITHM

• Start with

JN (xN ) = gN (xN ),

and go backwards using



Jk(xk) = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }, k = 0, 1, . . . , N − 1.

• Then J0(x0), generated at the last step, is equal to the optimal cost J*(x0). Also, the policy

π* = {µ*0, . . . , µ*N−1},

where µ*k(xk) attains the minimum on the right side above for each xk and k, is optimal
• Justification: Proof by induction that Jk(xk) is equal to J*k(xk), defined as the optimal cost of the tail subproblem that starts at time k at state xk
• Note:
− ALL the tail subproblems are solved (in ad-
dition to the original problem)
− Intensive computational requirements
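• The algorithm transcribes almost line for line into code. Below is a minimal sketch for finite state, control, and disturbance spaces; passing the problem data in as callables (f, g, gN, U, P) is an interface chosen here for illustration. Every pair (k, xk) is visited once, which is exactly the observation above that ALL tail subproblems get solved.

    def dp(N, states, f, g, gN, U, P):
        """Backward DP: returns cost-to-go tables J[k][x] and policies mu[k][x].

        f(k, x, u, w): next state     g(k, x, u, w): stage cost
        gN(x): terminal cost          U(k, x): iterable of feasible controls
        P(k, x, u): dict mapping each disturbance w to its probability
        """
        J = [dict() for _ in range(N + 1)]
        mu = [dict() for _ in range(N)]
        for x in states:
            J[N][x] = gN(x)                       # start: J_N(x_N) = g_N(x_N)
        for k in range(N - 1, -1, -1):            # go backwards
            for x in states:
                # J_k(x) = min over u in U_k(x) of E_w{ g_k + J_{k+1}(f_k) }
                def q(u):
                    return sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                               for w, p in P(k, x, u).items())
                u_star = min(U(k, x), key=q)
                J[k][x], mu[k][x] = q(u_star), u_star
        return J, mu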
PROOF OF THE INDUCTION STEP

• Let πk = {µk, µk+1, . . . , µN−1} denote a tail policy from time k onward
• Assume that Jk+1(xk+1) = J*k+1(xk+1). Then

J*k(xk) = min_{(µk, πk+1)} E_{wk,...,wN−1}{ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }

= min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }

= min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + J*k+1(fk(xk, µk(xk), wk)) }   (definition of J*k+1)

= min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1(fk(xk, µk(xk), wk)) }   (induction hypothesis)

= min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }

= Jk(xk)
LINEAR-QUADRATIC ANALYTICAL EXAMPLE

[Figure: Material at initial temperature x0 passes through Oven 1 (temperature u0), exits at temperature x1, then passes through Oven 2 (temperature u1) and exits at final temperature x2.]

• System

xk+1 = (1 − a)xk + auk , k = 0, 1,

where a is a given scalar from the interval (0, 1)


• Cost

r(x2 − T)² + u0² + u1²

where r is a given positive scalar and T is a given target temperature
• DP Algorithm:

J2(x2) = r(x2 − T)²

J1(x1) = min_{u1} [ u1² + r((1 − a)x1 + au1 − T)² ]

J0(x0) = min_{u0} [ u0² + J1((1 − a)x0 + au0) ]
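• Carrying out the last-stage minimization in closed form shows the linear structure of the optimal policy: setting the derivative of u1² + r((1 − a)x1 + au1 − T)² with respect to u1 to zero gives 2u1 + 2ra((1 − a)x1 + au1 − T) = 0, so

µ*1(x1) = ra(T − (1 − a)x1) / (1 + ra²)

and substituting back gives J1(x1) as an explicit quadratic in x1, which can then be carried into the minimization defining J0(x0).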
STATE AUGMENTATION

• When assumptions of the basic problem are violated (e.g., disturbances are correlated, cost is nonadditive, etc.), reformulate/augment the state
• The DP algorithm still applies, but the problem gets BIGGER
• Example: Time lags

xk+1 = fk (xk , xk−1 , uk , wk )

• Introduce the additional state variable yk = xk−1. The new system takes the form

(xk+1, yk+1) = ( fk(xk, yk, uk, wk), xk )

View x̃k = (xk, yk) as the new state.


• DP algorithm for the reformulated problem:
Jk(xk, xk−1) = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, xk−1, uk, wk), xk) }
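• The reformulation is mechanical enough to write generically; a minimal sketch, assuming the lagged dynamics are supplied as a function f(k, x, x_prev, u, w):

    def augment(f):
        # Wrap time-lag dynamics into the standard form. The new state is the
        # pair (x_k, y_k) with y_k = x_{k-1}, and the new dynamics return
        # (x_{k+1}, y_{k+1}) = (f_k(x_k, y_k, u_k, w_k), x_k).
        def f_aug(k, state, u, w):
            x, y = state
            return (f(k, x, y, u, w), x)
        return f_aug

The augmented dynamics can be fed to a generic DP routine such as the one sketched earlier, at the price of a larger (paired) state space.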
