0% found this document useful (0 votes)

52 views197 pages

UE2MDP2019

The document discusses using dynamic programming to solve optimization problems with sequential structure. It provides examples of dynamic programming applications, including shortest path problems, single machine scheduling (knapsack problem), and inventory control. The key concepts of dynamic programming - principle of optimality, recursive formulation, decomposition into steps - are explained. Dynamic programming is also described from a control theory perspective using value functions and Bellman's optimality equation.

Uploaded by

kevin Aroca

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

52 views197 pages

UE2MDP2019

Uploaded by

kevin Aroca

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 197

Plan

 Dynamic programming

 Introduction to Markov decision processes

 Markov decision processes formulation

 Discounted markov decision processes

 Average cost markov decision processes

 Continuous-time Markov decision processes

Xiaolan Xie
Dynamic programming
 Basic principe of dynamic programming

 Some applications

 Stochastic dynamic programming

Xiaolan Xie
Dynamic programming
 Basic principe of dynamic programming

 Some applications

 Stochastic dynamic programming

Xiaolan Xie
Introduction
 Dynamic programming (DP) is a general optimization
technique based on implicit enumeration of the solution
space.
 The problems should have a particular sequential structure,
such that the set of unknowns can be made sequentially.
 It is based on the "principle of optimality"
 A wide range of problems can be put in seqential form and
solved by dynamic programming

Xiaolan Xie
Introduction

Applications :
• Optimal control
• Most problems in graph theory
• Investment
• Deterministic and stochastic inventory control
• Project scheduling
• Production scheduling

We limit ourselves to discrete optimization

Xiaolan Xie
Illustration of DP by shortest path problem
Problem : We are planning the construction of a
highway from city A to city K. Different construction
alternatives and their costs are given in the following
graph. The problem consists in determine the highway
with the minimum total cost.
14 D 3
B I 10
8 10 G 9

E 5 K
A 10 9
10 8
C 7 8 H J
F
15

Xiaolan Xie
BELLMAN's principle of optimality
General form:
if C belongs to an optimal path from A to B, then the sub-path A
to C and C to B are also optimal
or
all sub-path of an optimal path is optimal

A
B
optimal C optimal

Corollary :
SP(xo, y) = min {SP(xo, z) + l(z, y) | z : predecessor of y}

Xiaolan Xie
Solving a problem by DP

1. Extension
Extend the problem to a family of problems of the same nature
2. Recursive Formulation (application of the principle of optimality)
Link optimal solutions of these problems by a recursive relation
3. Decomposition into steps or phases
Define the order of the resolution of the problems in such a way
that, when solving a problem P, optimal solutions of all other
problems needed for computation of P are already known.
4. Computation by steps

Xiaolan Xie
Solving a problem by DP

Difficulties in using dynamic programming :

• Identification of the family of problems
• transformation of the problem into a sequential form.

Xiaolan Xie
Shortest Path in an acyclic graph
• Problem setting : find a shortest path from x0 (root of the graph) to a given
node y0
• Extension : Find a shortest path from x0 to any node y, denoted SP(x0, y)
• Recursive formulation
SP(y) = min { SP(z) + l(z, y) : z predecessorr of y}
• Decomposition into steps : At each step k, consider only nodes y with
unknown SP(y) but for which the SP of all precedecssors are known.
• Compute SP(y) step by step
Remarks :
• It is a backward dynamic programming
• It is also possible to solve this problem by forward dynamic programming

Xiaolan Xie
DP from a control point of view
Consider the control of
(i) a discrete-time dynamic system, with
(ii) costs generated over time depending on the states and the
control actions

action action

State t State t+1

Cost Cost

present decision epoch next decision epoch

Xiaolan Xie
DP from a control point of view
System dynamics :
x t+1 = ft(xt, ut), t = 0, 1, ..., N-1
where
t : temps index
xt : state of the system
action action
ut = control action to decide at t

State t State t+1

Cost Cost

present decision epoch next decision epoch

Xiaolan Xie
DP from a control point of view
Criterion to optimize
N 1
Minimize g N  xN    gt  xt , ut 
t 0

action action

State t State t+1

Cost
gt  xt , ut  Cost

present decision epoch next decision epoch

Xiaolan Xie
DP from a control point of view
Value function or cost-to-go function:
N 1
J n  x  = Minimize g N  xN    gt  xt , ut  xn  x
t n

action action

State t State t+1

Cost Cost

present decision epoch next decision epoch

Xiaolan Xie
DP from a control point of view
Optimality equation or Bellman equation


J n  x  = MIN g n  x, un   J n+1 f n  x, un 
un


action action

State t State t+1

Cost Cost

present decision epoch next decision epoch

Xiaolan Xie
Applications

Single machine scheduling (Knapsac)

Inventory control
Traveling salesman problem

Xiaolan Xie
Applications
Single machine scheduling (Knapsac)

Problem :
Consider a set of N production requests, each needing a
production time ti on a bottleneck machine and generating
a profit pi. The capacity of the bottleneck machine is C.
Question: determine the production requests to confirm in
order to maximize the total profit.
Formulation:

max  pi Xi
subject to:
 ti Xi  C

Xiaolan Xie
Knapsack Problem
Knapsack Problème :
• Mr Radin can take 7 KG without paying over-weight fee on
his return flight. He decides to take advantage of it and look
for some local products that he can sale at home for extra gain.
• He selects n most interesting objects, weighs each of them,
and bargains the price.
• Which objects should he buy in order to maximize his gain?

Object (i) 1 2 3 4 5 6
Weight (wi) 2 1 1 3 2 1
Expected gain (ri) 8 5 5 6 3 2

Xiaolan Xie
Knapsack Problem
Generic formulation:
• Time = 1, …, 7
• State st = remaining capacity for objects t, t+1, …
• State space = {0, 1, 2, …, 7}
• Action at time t = selection or not object t
• Action space At(s) = {1=YES, 0=NO} if s ≥ wt and = {0} if s < wt
• Immediate gain at time t
gt(st, ut) = rt if YES
=0 if NO
• State transition or system dynamics:
st+1 = st – wt if YES
= st if NO

Xiaolan Xie
Knapsack Problem
Value function:
Jn(s) = Maximal gain from objects n, n+1, …, 6 with a
remaing capacity of s KG.
Optimality equation:

Jn  s  max rn an  J n 1  s  wn an 
an s  wn an  ¨0

J7  s   0

Xiaolan Xie
Knapsack Problem
time
7 6 5 4 3 2 1
wi ri wi ri wi ri wi ri wi ri wi ri
1 2 2 3 3 6 1 5 1 5 2 8
state Jn(s) action Jn(s) action Jn(s) action Jn(s) action Jn(s) action Jn(s) action Jn(s) action
0 0 0 N 0 N 0 N 0 N 0 N 0 N
1 0 2 Y 2 N 2 N 5 Y 5 N 5 N
2 0 2 Y 3 Y 3 N 7 Y 10 Y 10 N
3 0 2 Y 5 Y 6 Y 8 Y 12 Y 13 Y
4 0 2 Y 5 Y 8 Y 11 Y 13 Y 18 Y
5 0 2 Y 5 Y 9 Y 13 Y 16 Y 20 Y
6 0 2 Y 5 Y 11 Y 14 Y 18 Y 21 Y
7 0 2 Y 5 Y 11 Y 16 Y 19 Y 24 Y
YES NO YES NO YES NO YES NO YES NO YES NO
0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1
1 2 0 -1 2 -1 2 5 2 5 5 -1 5 =
2 2 0 3 2 -1 3 7 3 10 7 8 10
3 2 0 5 2 6 5 8 6 12 8 13 12
Infeasible
4 2 0 5 2 8 5 11 8 13 11 18 13 action
5 2 0 5 2 9 5 13 9 16 13 20 16
6 2 0 5 2 11 5 14 11 18 14 21 18
7 2 0 5 2 11 5 16 11 19 16 24 19

Xiaolan Xie
Knapsack Problem

Control map or control policy

stage
1 2 3 4 5 6
0 N N N N N N
1 N N Y N N Y
2 N Y Y N Y Y
state 3 Y Y Y Y Y Y
4 Y Y Y Y Y Y
5 Y Y Y Y Y Y
6 Y Y Y Y Y Y
7 Y Y Y Y Y Y

Xiaolan Xie
Applications
Inventory control

Problem: determining the purchasing quantity at the beginning

of each period in order to minimize the total expense
• Unit price and the demand
Period i 1 2 3 4 5 6
Demand di (in 000) 3 2 1 2 3 4
Unit price p (in 00 $) 11 18 13 17 20 10

• Storage capacity 5 (in 000), Initial stock = 0

• Fixed order cost K = 20 (00$)
• Unit inventory holding cost h = 1 (00$)

Xiaolan Xie
Applications
Inventory control

Generic formulation:
• Time = 1, …, 7
• State st = Inventory at the beginning of period t
• State space = {0, 1, 2, …, 5}
• Action at time t = purchasing quantity ut of period t
• Action space A(st) = {max(0, dt – st), …, 5 + dt - st}
• Immediate cost at time t
gt(st, ut) = K + ptut + ht(st + ut - dt) if u > 0
= ht(st + ut - dt) if NO
• State transition or system dynamics:
st+1 = st + ut - dt

Xiaolan Xie
Applications
Inventory control

Value function:
Jn(s) = minimal total cost over periods n, n+1, …, 6 by
starting with an inventory s at the beginning of period n.
Optimality equation:
Jn  s  min g n  s, u   J n 1  s  u  d n 
u ¨0 s  u  d n  5

J7  s  0

Xiaolan Xie
Applications
Traveling salesman problem

Problem :
Data: a graph with N nodes and a distance matrix
[dij] beteen any two nodes i and j.
Question: determine a circuit of minimum total
distance passing each node once.
Extensions:
C(y, S): shortest path from y to x0 passing once
each node in S.
Application: Machine scheduling with setups.

2007 Xiaolan Xie

Applications
Total tardiness minimization on a single machine

Si  starting time of job i

1, if job i precedes job j
X ij  
0, otherwise
Ti  tardiness
n
min  wiTi Job 1 2 3
i 1 Due date di 5 6 5
Ti  Si  pi  di Processing time pi 3 2 4
weight wi 3 1 2
S j  Si  pi  M  X ij  1
Si , Ti  0
X ij   0,1
where M is a large constant.

Xiaolan Xie
Stochastic dynamic programming
Model
Consider the control of
(i) a discrete-time stochastic dynamic system, with
(ii) costs generated over time

perturbation perturbation

action action

State t State t+1

stage cost cost

present decision epoch next decision epoch

Xiaolan Xie
Stochastic dynamic programming
Model
System dynamics :
x t+1 = ft(xt, ut, wt), t = 0, 1, ..., N-1
where
t : time index
xt : state of the system
ut = decision at time t perturbation
action action

wt : random perturbations
State t State t+1

cost cost

present decision epoch next decision epoch

Xiaolan Xie
Stochastic dynamic programming
Model
Criterion
 N 1 
Minimize E  g N  xN    gt  xt , ut , wt  
 t 0 

perturbation
action action

State t State t+1

cost cost

present decision epoch next decision epoch

Xiaolan Xie
Stochastic dynamic programming
Example
Consider a problem of ordering a quantity of a certain item at
each of N periods so as to meet a stochastic demand, while
minimizing the incurred expected cost.

xt : stock available at the beginning of period t

ut : quantity ordered at the beginning of period t
wt : random demand during period t with given probability.

xt+1 = xt + ut - wt

Xiaolan Xie
Stochastic dynamic programming
Example
Cost :
purchaing cost cut
inventory cost : r(xt + ut - wt)
wt
stock at period t stock at period t
Total cost: Inventory
xt
system
 N 1 

Ew   cut  r  xt  ut  wt    xt+1 = xt + ut - wt
 t 0 
order quantity

Xiaolan Xie
Stochastic dynamic programming
Model
Open-loop control:

Order quantities u1, u2, ..., uN-1 are determined once at time 0

Closed-loop control:

Order quantity ut at each period is determined dynamically with

the knowledge of state xt

Xiaolan Xie
Stochastic dynamic programming
Control policy
The rule for selecting at each period t a control action ut
for each possible state xt.

Examples of inventory control policies:

1. Order a constant quantity ut = E[wt]

2. Order up to policy :
ut = St – xt, if xt  St
ut = 0, if xt > St
where St is a constant order up to level.

2007 Xiaolan Xie

Stochastic dynamic programming
Optimal control policy
Mathematically, in closed-loop control, we want to

find a sequence of functions mt, t = 0, ..., N-1, mapping state

xt into control ut

so as to minimize the total expected cost.

The sequence p = {m0, ..., mN-1} is called a policy.

2007 Xiaolan Xie

Stochastic dynamic programming
Optimal control
Cost of a given policy p = {m0, ..., mN-1},

 N 1 
J  x0   
 E   ct  xt   r  xt  ut  wt  
 t 0 

Optimal control:
minimize Jp(x0) over all possible polciy p

2007 Xiaolan Xie

Stochastic dynamic programming
State transition probabilities
State transition probabilty:

pij(u, t) = P{xt+1 = j | xt = i, ut = u}

depending on the control policy.

2007 Xiaolan Xie

Stochastic dynamic programming
Basic problem
A discrete-time dynamic system :
x t+1 = ft(xt, ut, wt), t = 0, 1, ..., N-1

Finite state space st  St

Finite control space ut  Ct

Control policy p = {m0, ..., mN-1} with ut = mt(xt)

State-transition probability: pij(u)

stage cost : gt(xt, mt(xt), wt)

2007 Xiaolan Xie

Stochastic dynamic programming
Basic problem
Expected cost of a policy
 N 1 
 
J  x0   E  g N  x N    g t xt , t  xt  , wt 
 t 0 
Optimal control policy p* is the policy with minimal cost:

J *  x0   MIN J  x0 
 

where P is the set of all admissible policies.

J*(x) : optimal cost function or optimal value function.

Xiaolan Xie
Stochastic dynamic programming
Principle of optimality
Let p* = {m*0, ..., m*N-1} be an optimal policy for the basic
problem for the N time periods.

Then the truncated policy {m*i, ..., m*N-1} is optimal for the
following subproblem
•minimization of the following total cost (called cost-to-go
function) from time i to time N by starting with state xi at
time i
 N 1 
 
J i  xi   MIN E  g N  x N    g t xt , t  xt  , wt 
 t i 

Xiaolan Xie
Stochastic dynamic programming
DP algorithm
Theorem: For every initial state x0, the optimal cost J*(x0) of
the basic problem is equal to J0(x0), given by the last step of the
following algorithm, which proceeds backward in time from
period N-1 to period 0

J N  xN   g N  xN  , ( A)

J t  xt   MIN
ut U t  xt    
Ewt  g t  xt , ut , wt   J t 1 f t  xt , ut , wt   , ( B )


Furthermore, if ut = mt(xt) minimizes the right side of Eq (B)

for each xt and t, the policy p* = {m*0, ..., m*N-1} is optimal.

Xiaolan Xie
Stochastic dynamic programming
Example
Consider the inventory control problem with the following:

• Excess demand is lost, i.e. xt+1 = max{0, xt + ut – wt}

• The inventory capacity is 2, i.e. xt + ut  2

• The inventory holding/shortage cost is : (xt + ut – wt)2

• Unit ordering cost is a, i.e. gt(xt,ut,wt) = aut + (xt + ut – wt)2.

• N = 3 and the terminal cost, gN(XN) = 0

• Demand : P(wt = 0) = 0.1, P(wt = 1) = 0.2, P(wt = 2) = 0.7.

Xiaolan Xie
Stochastic dynamic programming
Example
Generic formulation:
• Time = {1, 2, 3, 4=end}
• State xt = inventory level at the beginning of a period
• State space = {0, 1, 2}
• Action ut = order quantity of period t
• Action space = {0, 1, …, 2 – xt}
• Perturbation dt = demand of period t
• Immediate cost = aut + (xt + ut – dt)2
• System dynamics xt+1 = max{0, xt + ut – dt}

Xiaolan Xie
Stochastic dynamic programming
Example

Value function:
Jn(s) = minimal total cost over periods n, n+1, …, 3 by
starting with an inventory s at the beginning of period n.
Optimality equation:

J n  s   min E  g n  s, u , d n   J n 1  max  0, s  u  d n   
u ¨ s u  2

J4  s  0

Xiaolan Xie
Stochastic dynamic programming
Example – Immediate cost

a 0,25 dt = w 0 1 2
Pw 0,1 0,2 0,7 mean
0 0 0 1 4 3 mean stage
0 1 1,25 0,25 1,25 1,05 cost
(s,u) 0 2 4,5 1,5 0,5 1,1 =
1 0 1 0 1 0,8 0.1g(s,u,0)+
0.2g(s,u,1)+
1 1 4,25 1,25 0,25 0,85
0.7g(s,u,2)
2 0 4 1 0 0,6
g(s,u,w) = 0.25u+(s+u– w)2

Myopic policy: (s=0, u=1), (s=1, u=0), (s=2, u=0)

Xiaolan Xie
Stochastic dynamic programming
Example – Period 3-problem
Period n = 3
a 0,25 dt = w 0 1 2 Mean
Pw 0,1 0,2 0,7 total
0 0 0 1 4 3
period-4
opt 0 1 1,25 0,25 1,25 1,05
s' J4(s') (s,u) 0 2 4,5 1,5 0,5 1,1
0 0 1 0 1 0 1 0,8
1 0 1 1 4,25 1,25 0,25 0,85
2 0 2 0 4 1 0 0,6

Stage cost 0.25u+(s+u–

w)2
+
Remaining cost J4((s+u-w) +
)
Xiaolan Xie
Stochastic dynamic programming
Example – Periods 2+3-problem

Period n = 2
a 0,25 dt = w 0 1 2 Mean
Pw 0,1 0,2 0,7 total
0 0 1,05 2,05 5,05 4,05
period-3
opt 0 1 2,05 1,3 2,3 2,075
s' J3(s') (s,u) 0 2 5,1 2,3 1,55 2,055
0 1,05 1 0 1,8 1,05 2,05 1,825
1 0,8 1 1 4,85 2,05 1,3 1,805
2 0,6 2 0 4,6 1,8 1,05 1,555

Stage cost 0.25u+(s+u–

w)2
+
Remaining cost J3((s+u-w) +
)
Xiaolan Xie
Stochastic dynamic programming
Example – Periods 1+2+3-problem

Period n = 1
a 0,25 dt = w 0 1 2 Mean
Pw 0,1 0,2 0,7 total
0 0 2,055 3,055 6,055 5,055
period-2
opt 0 1 3,055 2,305 3,305 3,08
s' J2(s') (s,u) 0 2 6,055 3,305 2,555 3,055
0 2,055 1 0 2,805 2,055 3,055 2,83
1 1,805 1 1 5,805 3,055 2,305 2,805
2 1,555 2 0 5,555 2,805 2,055 2,555

Xiaolan Xie
Stochastic dynamic programming
Example – value function & control
Optimal policy a =0,25
Stock 3-period policy 2-period policy 1-period policy

Stage 1 Stage 2 Stage 3

Cos-to-go Cos-to-go Cos-to-go
(order quantity) (order quantity) (order quantity)

0 3.055 (2) 2.055 (2) 1.05 (1)

1 2.805 (1) 1.805 (1) 0.8 (0)

2 2.555 (0) 1.555 (0) 0.6 (0)

Xiaolan Xie
Stochastic dynamic programming
Example – Control map or policy

Stock Period-1 Period-2 Period-3

0 2 2 1
Long-term to short-term
1 1 1 0
2 0 0 0

From
Long-term policy: (s=0, u=2), (s=1, u=1), (s=2, u=0)
To
Myopic policy: (s=0, u=1), (s=1, u=0), (s=2, u=0)

Xiaolan Xie
Stochastic dynamic programming
Example – Sample paths
Stock Period-1 Period-2 Period-3
Control Map 0 2 2 1
1 1 1 0
2 0 0 0

Period-1 Period-2 Period-3 Period-4

Sample path Initial stock 0 0 1 0
Sample Control 2 2 0
Demand path 1 Demand 2 1 2
scenarios Initial stock 0 1 0 0
Sample Control 2 1 1
(2,1,2) path 2 Demand 1 2 1
Initial stock 0 2 2 1
(1,2,1)
Sample Control 2 0 0
(0,0,1) path 3 Demand 0 0 1
Initial stock 2 0 1 0
Sample Control 0 2 0
path 4 Demand 2 1 2
Xiaolan Xie
Sequential decision model
Key ingredients: Policy: Issues:
• A set of decision epochs
a sequence of Existence of opt.
• A set of system states decision rules in policy
• A set of available actions order to mini. the Form of the opt. policy
• A set of state/action cost function Computation of opt.
dependent immediate costs policy
• A set of state/action
dependent transition action action
probabilities
Present Next
state state

costs costs

Xiaolan Xie
Applications
Inventory management
Bus engine replacement
Highway pavement maintenance
Bed allocation in hospitals
Personal staffing in fire department
Traffic control in communication networks
…

Xiaolan Xie
Example
• Consider a with one machine producing one product. The
processing time of a part is exponentially distributed with rate
p. The demand arrive according to a Poisson process of rate d.
• state Xt = stock level, Action : at = make or rest

1 T hX , if X  0
Minimize lim    
g X t dt with g   
X 
T  T
t 0 bX , if X  0

(make, p) (make, p) (make, p)

(make, p)

0 1 2 3
d d d
d

Xiaolan Xie
Example
• Zero stock policy (M/M/1) p0 = 1-r, p-n = rn p 0, r = d/p

p p p average cost =br/(1-r)

-2 -1 0
d d d

• Hedging point policy with

hedging point 1
p1 = 1-r, p-n = rn+1 p 1
p p p p
average cost =h(1-r) + r.br/(1-r)
-2 -1 0 1
d d d Better iff h/b < r/(1-r)
d

Xiaolan Xie
MDP = Markov Decision Process
MDP Model formulation

Xiaolan Xie
Decision epochs
Times at which decisions are made.

The set T of decisions epochs can be either a discrete set or a

continuum.

The set T can be finite (finite horizon problem) or infinite

(infinite horizon).

Xiaolan Xie
State and action sets
At each decision epoch, the system occupies a state.

S : the set of all possible system states.

As : the set of allowable actions in state s.

A = sSAs: the set of all possible actions.

S and As can be:

finite sets
countable infinite sets
compact sets

Xiaolan Xie
Costs and Transition probabilities
As a result of choosing action a  As in state s at decision epoch t,
• the decision maker receives a cost Ct(s, a) and
• the system state at the next decision epoch is determined by the
probability distribution pt(. |s, a).

If the cost depends on the state at next decision epoch, then

Ct(s, a) =  jS Ct(s, a, j) pt(j|s, a).

where Ct(s, a, j) is the cost if the next state is j.

An Markov decision process is characterized by {T, S, As, pt(. |s, a), Ct(s, a)}

Xiaolan Xie
Exemple of inventory management
Consider the inventory control problem with the following:

• Excess demand is lost, i.e. xt+1 = max{0, xt + ut – wt}

• The inventory capacity is 2, i.e. xt + ut  2

• The inventory holding/shortage cost is : (xt + ut – wt)2

• Unit ordering cost is 1, i.e. gt(xt, ut, wt) = ut + (xt + ut – wt)2.

• N = 3 and the terminal cost, gN+1(XN+1) = 0

• Demand : P(wt = 0) = 0.1, P(wt = 1) = 0.7, P(wt = 2) = 0.2.

Xiaolan Xie
Exemple of inventory management
Decision Epochs T = {0, 1, 2, …, N}
Set of states : S = {0, 1, 2} indicating the initial stock Xt
Action set : As indicating the possible order quantity Ut
A0 = {0, 1, 2}, A1 = {0, 1}, A2 = {0}

(s,a) C(s,a) (s,a) P(0|s,a) P(1|s,a) P(2|s,a)

(0, 0) 3 (0, 0) 1
Transition (0, 1) 0,9 0,1
Cost (0, 1) 1,05
probability
function (0, 2) 1,1 (0, 2) 0,2 0,7 0,1
(1, 0) 0,8 (1, 0) 0,9 0,1
(1, 1) 0,85 (1, 1) 0,2 0,7 0,1
(2, 0) 0,6 (2, 0) 0,2 0,7 0,1

Xiaolan Xie
Decision Rules
A decision rule prescribes a procedure for action selection in each
state at a specified decision epoch.

A decision rule can be either

Markovian (memoryless) if the selection of action at is based

only on the current state st;

History dependent if the action selection depends on the past

history, i.e. the sequence of state/actions ht = (s1, a1, …, st-1, at-1, st)

Xiaolan Xie
Decision Rules
A decision rule can also be either

Deterministic if the decision rule selects one action with certainty

Randomized if the decision rule only specifies a probability

distribution on the set of actions.

Xiaolan Xie
Decision Rules
As a result, the decision rules can be:

HR : history dependent and randomized

HD : history dependent and deterministic
MR : Markovian and randomized
MD : Markovian and deterministic

Xiaolan Xie
Policies
A policy specifies the decision rule to be used at all decision epoch.

A policy p is a sequence of decision rules, i.e. p = {d1, d2, …, dN-1}

A policy is stationary if dt = d for all t.

Stationary deterministic or stationary randomized policies are

important for infinite horizon markov decision processes.

Xiaolan Xie
Example
Decision epochs: T = {1, 2, …, N}
State : S = {s1, s2}
Actions: As1 = {a11, a12}, As2 = {a21}
Costs: Ct(s1, a11) =5, Ct(s1, a12) =10, Ct(s2, a21) = -1, CN(s1) = rN(s2) 0
Transition probabilities: pt(s1 |s1, a11) = 0.5, pt(s2|s1, a11) = 0.5, pt(s1 |s1,
a12) = 0, pt(s2|s1, a12) = 1, pt(s1 |s2, a21) = 0, pt(s2 |s2, a21) = 1

a11 a11
a21
{5, .5}
{5, .5}
S1 S2 {-1, 1}
a12
{10, 1}
Xiaolan Xie
Example
A deterministic Markov policy
Decision epoch 1:
d1(s1) = a11, d1(s2) = a21 One state one action
Decision epoch 2: (also called control map)
d2(s1) = a12, d2(s2) = a21

a11 a11
a21
{5, .5}
{5, .5}
S1 S2 {-1, 1}
a12
{10, 1}
Xiaolan Xie
Example
A randomized Markov policy
Decision epoch 1:
P1, s1(a11) = 0.7, P1, s1(a12) = 0.3
One state
P1, s2(a21) = 1
one proba distribution of actions
Decision epoch 2:
P2, s1(a11) = 0.4, P2, s1(a12) = 0.6
P2, s2(a21) = 1
a11 a11
a21
{5, .5}
{5, .5}
S1 S2 {-1, 1}
a12
{10, 1}
Xiaolan Xie
Example
A deterministic history-dependent policy
Decision epoch 1: Decision epoch 2:
d1(s1) = a11 history h
d1(s2) = a21 d2(h)
(s1, a11, s1)
One history one action a13
(s1, a12, s1)
a13 infeasible
{0, 1}
a11 a11 (s1, a13, s1)
{5, .5} a11
{5, .5} (s2, a21, s1)
{-1, 1}
S1 S2 infeasible
a21
a12
(*, *, s2)
{10, 1} a21
Xiaolan Xie
Example
A randomized history-dependent policy
Decision epoch 1: Decision epoch 2:
P1, s1(a11) = 0.6 history h P(a = a11) P(a = a12) P(a = a13)
P1, s1(a12) = 0.3 (s1, a11, s1) 0.4 0.3 0.3

P1, s1(a12) = 0.1 (s1, a12, s1) infeasible infeasible infeasible

(s1, a13, s1) 0.8 0.1 0.1
P1, s2(a21) = 1
(s2, a21, s1) infeasible infeasible infeasible
a13
{0, 1}
a11 a11 at s = s2, select a21
{5, .5} a21
{5, .5} {-1, 1}
S1 S2
a12
{10, 1}
Xiaolan Xie
Stochastic inventory example revisited
Decision Epochs T = {0, 1, 2, …, N}
Set of states : S = {0, 1, 2} indicating the initial stock Xt
Action set : As indicating the possible order quantity Ut
A0 = {0, 1, 2}, A1 = {0, 1}, A2 = {0}

(s,a) C(s,a) (s,a) P(0|s,a) P(1|s,a) P(2|s,a)

(0, 0) 3 (0, 0) 1
Transition (0, 1) 0,9 0,1
Cost (0, 1) 1,05
probability
function (0, 2) 1,1 (0, 2) 0,7 0,2 0,1
(1, 0) 0,8 (1, 0) 0,9 0,1
(1, 1) 0,85 (1, 1) 0,7 0,2 0,1
(2, 0) 0,6 (2, 0) 0,7 0,2 0,1

Xiaolan Xie
Stochastic inventory control policies
State s = inventory at the beginning of a period
Action a = order quantity such that s+a  2
MD : Markovian and deterministic
Stationary: {s=0: a = 2, s=1: a=1, s=2: a = 0}
Nonstationary:
{(s,a)=(0,2), (1,1), (2,0)} for period 1 to 5
{(s,a)=(0,1), (1,0), (2,0)} for period 6 on
MR : Markovian and randomized
Stationary: {s=0: a = 2 w.p. 0.5 a=0 w.p. 0.5, s=1: a=1, s=2: a = 0}
Nonstationary:
{(s,a)=(0,2), (1,1), (2,0)} for period 1 to 5
{(s,a)=(0,2 w.p. 0.5 & 0 w.p. 0.5), (1,0), (2,0)} for period 6 on
where w.p. = with probability

Xiaolan Xie
Stochastic inventory control policies
s Action a
HD
history dependent 0 2
1
if lost sales (s+a-d < 0) for last two periods
if demand for the last period
and deterministic 0 if no demand for the last period

1 1
0
if lost sale for the last period
if no demand for the last period

2 0

s Action a
HR 0 2 if lost sales for last two periods
2 w.p. 0.5 & 0 w.p. 0.5 if demandfor the last period
history dependent 1 w.p. 0.3 & 0 w.p. 0.7 if no demand for the last period
and randomized
1 1 w.p. 0.5 & 0 w.p. 0.5
0
if lost sale for the last period
if no demand for the last period

2 0
Xiaolan Xie
Remarks
Each Markov policy leads to a discrete time Markov Chain
and the policy can be evaluated by solving the related
Markov chain.

Xiaolan Xie
Remarks
MD : Markovian and MR : Markovian and
deterministic
randomized
s=0: a = 2,
s=0: a = 2 w.p. 0.5 a=0 w.p. 0.5,
s=1: a=1,
s=1: a=1,
s=2: a = 0
s=2: a = 0

Transition matrix Transition matrix

s 0 1 2 s 0 1 2
0 0,7 0,2 0,1 0 0,85 0,1 0,05
1 0,7 0,2 0,1 1 0,7 0,2 0,1
2 0,7 0,2 0,1 2 0,7 0,2 0,1

Stationary Markov chain (to draw)

Xiaolan Xie
Remarks
Nonstationary MD : Markovian and deterministic
{(s,a)=(0,2), (1,1), (2,0)} for period 1 to 2
{(s,a)=(0,1), (1,0), (2,0)} for period 3 on

period 1 period 2 period 3 period 4

s 0 1 2 0 1 2 0 1 2 0 1 2
0 0,7 0,2 0,1 0,7 0,2 0,1 0,9 0,1 0,9 0,1
1 0,7 0,2 0,1 0,7 0,2 0,1 0,9 0,1 0,9 0,1
2 0,7 0,2 0,1 0,7 0,2 0,1 0,7 0,2 0,1 0,7 0,2 0,1

Non Stationary MR : Markovian and randomized

{(s,a)=(0,2), (1,1), (2,0)} for period 1 to 2
{(s,a)=(0,2 w.p. 0.5 & 0 w.p. 0.5), (1,0), (2,0)} for period 3 on
Xiaolan Xie
Finite Horizon Markov Decision
Processes

Xiaolan Xie
Assumptions
Assumption 1: The decision epochs T = {1, 2, …, N}

Assumption 2: The state space S is finite or countable

Assumption 3: The action space As is finite for each s

Criterion:

 N 1 
infHR E   Ct  X t , at   CN  X N  X1  s 
   t 1 
where PHR is the set of all possible policies.

Xiaolan Xie
Optimality of Markov deterministic policy
Theorem :
Assume S is finite or countable, and that As is finite for each
s  S.

Then there exists a Markovian deterministic policy

which is optimal.

Xiaolan Xie
Optimality equations
Theorem : The following value functions
 N 1 
Vn  s   MIN HR
E   Ct  X t , at   CN  X N  X n  s 
   t n 
satisfy the following optimality equation:

 
Vt  s   MIN Ct  s, a    pt  j s, a  Vt 1  j  
a As  
 jS

VN  s   rN  s 

and the action a that minimizes the above term defines the
optimal policy.

Xiaolan Xie
Optimality equations
The optimality equation can also be expressed as:

Vt  s   MIN  Qt  s, a  
aAs

Qt  s, a   Ct  s, a    pt  j s, a  Vt 1  j 
jS

where Q(s,a) is a Q-function used to evaluate the

consequence of an action from a state s.

Xiaolan Xie
Backward induction algorithm
•Set t = N and
VN  sN   rN  sN  for all s N  S
•Substitute t-1 for t and compute the following for each st S
 
Vt  s   MIN Ct  s, a    pt  j s, a  Vt 1  j  
a As  
 jS

 
dt  s   arg min Ct  s, a    pt  j s, a  Vt 1  j  
a As   jS 
3. Repeat 2 till t = 1.

Xiaolan Xie
Infinite Horizon discounted
Markov decision processes

Xiaolan Xie
Assumptions
Assumption 1: The decision epochs T = {1, 2, …}
Assumption 2: The state space S is finite or countable
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary costs and transition probabilities;
C(s, a) and p(j |s, a), do not vary from decision epoch to
decision epoch
Assumption 5: Bounded costs: | Ct(s, a) |  M for all a  As
and all s  S (to be relaxed)

Xiaolan Xie
Assumptions
Criterion:

N 
infHR lim E   Ct  X t , at   X1  s 
t 1
  N 
 t 1 

where
0 < l< 1 is the discounting factor
PHR is the set of all possible policies.

Xiaolan Xie
Discounting factor
Large discounting factor l  1 : long-term optimum

N  N 
E   Ct  X t , at   t 1
X 1  s   E   Ct  X t , at  X 1  s 
 t 1   t 1 

Small discounting factor l  0 : short-term optimum or myopic

N 
E   Ct  X t , at   X 1  s   E C1  X 1 , a1  X 1  s 
t 1

 t 1 

Xiaolan Xie
Optimality equations
Theorem: Under assumptions 1-5, the following optimal cost
function V*(s) exists:
N 
V *  s   infHR lim E   Ct  X t , at   X 1  s 
t
  N 
 t 1 
and satisfies the following optimality equation:

 
V *  s   MIN C  s, a     p  j s , a  V *  j  
a As  
 jS
Further, V*(.) is the unique solution of the optimality equation.
Moreover, a statonary policy p is optimal iff (if and only if) it
gives the minimum value in the optimality equation.

Xiaolan Xie
Computation of optimal policy
Value Iteration
Value iteration algorithm:
1.Select any bounded value function V0, let n =0
2. For each s S, compute

 
V  s   MIN C  s, a     p  j s, a  V  j  
n 1 n
a As  
 jS
3.Repeat 2 until convergence.
Meaning of Vn
4. For each s S, compute

 
d  s   arg min C  s, a     p  j s, a  V  j  
n 1
a As  jS 

Xiaolan Xie
Computation of optimal policy
Value Iteration
Theorem: Under assumptions 1-5,
a.Vn converges to V*
b. The stationary policy defined in the value iteration
algorithm converges to an optimal policy.

Xiaolan Xie
Computation of optimal policy
Policy Iteration
Policy iteration algorithm:
1.Select arbitrary stationary policy p0, let n =0
2. (Policy evaluation) Obtain the value function Vn of policy
pn.
3.(Policy improvement) Choose pn+1 = {dn+1, dn+1,…} such
 
that d n 1  s   arg min C  s, a     p  j s, a  V  j  
n
a As  jS 

4.Repeat 2-3 till pn+1 = pn.

Xiaolan Xie
Computation of optimal policy
Policy Iteration
Policy evaluation:
For any stationary deterministic policy p = {d, d, …}, its
value function
 
V  s   E   rt  X t , at   X1  s 
 t

 t 1 
is the unique solution of the following equation:

 
V   s   C  s, d  s      p j s , d  s  V   j 
jS

Xiaolan Xie
Computation of optimal policy
Policy Iteration
Theorem:
The value functions Vn generated by the policy iteration
algorithm is such that Vn+1 <= Vn.
Further, if Vn+1 = Vn, Vn = V*.

Xiaolan Xie
Computation of optimal policy
Linear programming
Recall the optimality equation

 
V  s   MIN C  s, a     p  j s , a  V  j  
a As  
 jS

The optimal value function can be determine by the

following Linear programme with a > 0 and Sa(s) = 1:

Maximize    s V  s
sS
subject to
V  s   C  s, a     p  j s, a  V  j  , s, a
jS

Xiaolan Xie
Computation of optimal policy
Linear programming
Dual linear program
Maximize   r  s, a  x  s, a 
sS a As

subject to
 x  j, a      p  j s, a  x  s, a     j  , j  S
aA j sS a As

x  s, a   0
1/ Optimal basic solution x* gives a deterministic optimal policy.
2/ x(s, a) = total discounted joint proba under initiate-state distribution a that
the system occupies state s and choose action a
3/ Dual linear program extends to constrained model with upper limit C of
total discounted cost, i.e.
  c  s, a  x  s, a   C
sS aAs

Xiaolan Xie
Extensition to Unbounded Costs
Theorem 1. Under the condition C(s, a) ≥ 0 (or C(s, a) ≤0) for all
states i and control actions a, the optimal cost function V*(s) among
all stationary determinitic policies satisfies the optimality equation

 

V *  s   MIN C  s, a     p  j s, a  V *  j  
a As  
 jS 

Theorem 2. Assume that the set of control actions is finite. Then, under
the condition C(s, a) ≥ 0 for all states i and control actions a, we have
lim V N  s   V *  s 
N 
where V (s) is the solution of the value iteration algorithm with V0(s) = 0.
N

Implication of Theorem 2 : The optimal cost can be obtained as the limit

of value iteration and the optimal stationary policy can also be obtained in
the limit.

Xiaolan Xie
Example
• Consider a computer system consisting of M different processors.
• Using processor i for a job incurs a finite cost Ci with C1 < C2 < ... < CM.
• When we submit a job to this system, processor i is assigned to our job with
probability pi.
• At this point we can (a) decide to go with this processor or (b) choose to hold the
job until a lower-cost processor is assigned.
• The system periodically return to our job and assign a processor in the same
way.
• Waiting until the next processor assignment incurs a fixed finite cost c.
Question:
How do we decide to go with the processor currently assigned to our job versus
waiting for the next assignment?
Suggestions:
• The state definition should include all information useful for decision
• The problem belongs to the so-called stochastic shortest path problem.
Xiaolan Xie
Why does it work: Preliminary
• Policy p value function (cost minimization)
 
V  s   E   C  X t ,   X t   X1  s 


 t 1 

V *  s   min V   s 


• Without loss of generality, 0  C(s, a)  M

Transformation by C’(s, a) = C(s, a) + M if | C(s, a) |  M
   M
V '  s   E  C '  X t ,   X t   X1  s   V   s  


 t 1  1 

Xiaolan Xie
Why does it work: DP & optimality equation
• DP (Dynamic Programming)

 
 s   min C  s, a     p  j s, a  V  j  
n 1 n
V
a As
 jS 
V 0  s  0
• Optimality equation

 
V  s   min C  s, a     p  j s, a  V  j  
* *
a As
 jS 
 
V   s   C  s,   s      p j s,   s  V   j 
jS

Xiaolan Xie
Why does it work: DP & optimality equation
 
T  f  s    min C  s, a     p  j s, a  f  j  
• DP operator T
a As  
 jS

 
T   f  s    C  s,   s      p j s,   s  f  j 
jS

T k  f  s    T T k 1  f  s   
 
T  k  f  s    T T  k 1  f  s   
 
• Contraction of the DP operator
T k  f  i      T k  f  i     k 
T  k  f  i      T  k  f  i     k 

Xiaolan Xie
Why does it work : DP convergence
Lemma 1: If 0  C(s,a)  M, then VN(s) is monotone converging
and limN VN(s) = V*(s)
Property guarantees the existence of V*(s).
Proof. Part one due to VN(s)  VN+1(s) and VN(s)  M/(1-l) .
 N t 1    t 1 
V  s   E    C  X t , at    E    C  X t , at  


 t 1  t  N 1 
Due to C(s,a)  M, Due to C(s,a) ≥ 0,
 N t 1 
 N t 1  N
V  s   E    C  X t , at  

V  s   E    C  X t , at    M


 t 1  1   t 1 

Taking min on both side of the inequalities,

N
V  s  V  s  M
* N
V *  s  V N  s
1 

Xiaolan Xie
Why does it work : convergence of value iteration
Lemma 2: If 0  C(s,a)  M, for any bounded function f, then
limN TN(f(s)) = V*(s) and limN TpN(f(s)) = Vp (s)

  0,   f  s   
V 0  s    f  s  V 0  s  
T N V 0  s      T N  f  s    T N V 0  s    
   
T N V 0  s     N   T N  f  s    T N V 0  s     N  (by contraction of T)
   
V N  s    N   T N  f  s    V N  s    N 
V *  s   lim T N  f  s    V *  s  (by Lemma 1)
N 

Similary, limN TpN(f(s)) = Vp(s)

Xiaolan Xie
Why does it work : optimality equation
Theorem 1: If 0  C(s,a)  M, V*(s) is the unique bounded
function of the optimality equation. Moreover, any stationary
policy is optimal iff p(s) is any minimizer of the right hand term.
N
From DP, V  s   V  s   V  s   M
N * N
1 
 N 1
Applying T, V N 1
 s   T V  s    V
* N 1
 s  M
1 
 
Taking limit, V  s   T V  s    min C  s, a     p  j s, a  V  j  
* * *
  aAs
 jS 

Uniqueness: assume f(s) be any solution of the optimality equation.

f  s   T  f  s    T N  f  s    f  s 
By Lemma 2, T N  f  s    V *  s   f  s   V *  s 

Similary V is the unique solution of f  s   T   f  s   .

Xiaolan Xie
Why does it work : optimality equation
Theorem A: If 0  C(s,a)  M, V*(s) is the unique bounded
function of the optimality equation. Moreover, any stationary
policy is optimal iff p(s) is any minimizer of the right hand term.
For any minimizer   s  of the optimality equation,
V *  s   T  V *  s  
 
By uniqueness of f  T   f  , V *  s   V   s   optimality of 
For any stationary optimal policy   s  ,
V *  s   V   s  , V   s   T  V   s    V *  s   T  V *  s  
   
Since V *  s   T V *  s   ,T V *  s    T  V *  s   .
     

Xiaolan Xie
Why does it work : convergence of policy iteration
Theorem B: The value functions Vn generated by the policy
iteration algorithm is such that Vn+1  Vn.

Matrix form of policy   s  value equation V   s   T  V   s  

 
V   C    P V  where V  , C  are column vectors.

 I   P  V  C where I is a square matrix with 1 on diagonal and 0 elsewhere.

  

V = I  P  C
1
  


 I  P    P   0
1 i
 

i 0
Policy improvement,

 V  
1
n 1 n 1 n 1 n 1 n 1
C  P n
V V  C n
 I  P n n
 V  I  P C n 1  V n 1

Xiaolan Xie
Why does it work : convergence of policy iteration
Theorem B: The value functions Vn generated by the policy
iteration algorithm is such that Vn+1  Vn.
Alternative proof.
By definition,
T  n1 V  n  s    T  n V  n  s    V  n  s 
   
Repeating the application of T  n1 to both side of the inequality leads to
T  n1 , N V  n  s    V  n  s 
 
Leting N  ,
V  n  s   lim T  n1 , N V  n  s    V  n 1  s 
N   

Xiaolan Xie
Infinite Horizon average cost
Markov decision processes

Xiaolan Xie
Assumptions
Assumption 1: The decision epochs T = {1, 2, …}
Assumption 2: The state space S is finite
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary costs and transition probabilities;
C(s, a) and p(j |s, a) do not vary from decision epoch to
decision epoch
Assumption 5: Bounded costs: | Ct(s, a) |  M for all a  As
and all s  S
Assumption 6: The markov chain correponding to any
stationary deterministic policy contains a single recurrent
class. (Unichain)

Xiaolan Xie
Assumptions
Criterion:

1 N 
infHR lim E   Ct  X t , at  X1  s 
  N 
N t 1 

where
PHR is the set of all possible policies.

Xiaolan Xie
Optimal policy
Main Theorem: Under Assumptions 1-6,
• There exists an optimal stationary deterministic policy.
• There exists a real g and a value function h(s) that satisfy the
following optimality equation:
 
h  s   g  MAX r  s, a    p  j s, a  h  j  
a As
 jS 
• For any solutions (g, h) and (g’, h’) of the optimality equation:
(a) g = g’ is the optimal average reward;
(b) h(s) = h’(s) + k (translation closure);
• Any maximizer of the optimality equation is an optimal policy.

Xiaolan Xie
Relation between discounted and average cost MDP

• It can be shown that (why? online)

g  lim  1    V  s 
 1

h  s   lim V  s   V  x0   differential

 1 cost
for any given state x0.

Xiaolan Xie
Relation between discounted and average cost MDP

 V  s   V  x0    1    V  x0  where x0 = any given reference state

 
 max  r  s, a     p  j s, a   V  j   V  x0   
a As  
 j 
 
 h  s   g  max  r  s, a     p  j s, a   h  j   
a As  
 j 
where g    1    V  x0  , h  s   V  s   V  x0 
 
 h  s   g  max  r  s, a    p  j s, a   h  j    (by taking limit   1)
a As  
 j 
where
g  lim  1    V  x0 
 1 h(s) : differential reward/cost
h  s   V1  s   V1  x0  (starting from s vs x0)

Xiaolan Xie
Relation between discounted and average cost MDP

• Whyg  lim g   1    V  s 
 1
N
1

g  lim
N  N
 r  sn ,   sn  
n 1
N N
 lim lim  r  sn ,   sn    n 1
  n 1
N   1 n 1 n 1
N N
 lim lim  r  sn ,   sn    n 1   n 1 if limits interchangeable
 1 N  n 1 n 1

 lim  1    V  s 
 1

If the discounted policy converges to average cost policy,

lim  1    V  s   g *
g Blackwell optimality
 1

Xiaolan Xie
Computation of the optimal policy by LP
Recall the optimality equation:
 
h  s   g  MIN C  s, a    p  j s, a  h  j  
a As  
 jS

This leads to the following LP for optimal policy computation

Maximize g
subject to
h  s   g  C  s, a    p  j s, a  h  j  , s, a
jS

h( x0 )  0
Remarks: Value iteration and policy iteration can also be
extended to the average cost case.

Xiaolan Xie
Computation of optimal policy
Value Iteration
1.Select any bounded value function h0 with h0(s0) = 0, let n =0
2. For each s S, compute
 
 s  h  s   g  MIN  r  s, a    p  j s, a  h  j  
n 1 n 1 n n
U
a As  
 jS

h n 1  s   U n 1  s   U n 1  s0 
g n  U n 1  s0 
3.Repeat 2 until convergence.
4. For each s S, compute
 
d  s   arg min C  s, a    p  j s, a  h  j  
n 1
a As  jS 
Xiaolan Xie
Computation of optimal policy : Policy Iteration

1. Select any policy p0, let n =0

2. Policy evaluation: determine gp = stationary expected reward
and solve
 
h n  s   g  n  r  s,  n  s     p j s,  n  s  h n  j 
jS

h n  s0   0

3. Policy improvement:
 
 n 1  s   arg max r  s, a    p  j s, a  h  j  
n
a As  jS 
Set n := n+1 and repeat 2-3 till convergence.

Xiaolan Xie
Extensions to unbounded cost
Theorem. Assume that the set of control actions is finite. Suppose
that there exists a finite constant L and some state x0 such that
|Vl(x) - Vl(x0)| ≤ L
for all states x and for all l (0,1). Then, for some sequence {ln}
converging to 1, the following limit exist and satisfy the optimality
equation.
g  lim  1    V  s 
 1

h  s   lim V  s   V  x0  
 1

Easy extension to policy iteration.

More conditions: Sennott, L.I. (1999) Stochastic Dynamic Programming and
the Control of Queueing Systems, New York: Wiley.

Xiaolan Xie
Why does it work : convergence of policy iteration
Theorem: If all policies generated by policy iteration are unichains,
then gn+1 ≥ gn.
By definition,
T  n1  h n  s    T  n  h n  s    h n  s   g  n
   
Repeating the application of T  n1 to both side of the inequality leads to
T  n1 , N  h n  s    h n  s   Ng  n
 
1  n1 , N   n 1 n
T h  s   h  s   gn

N   N
Leting N  ,
1  n1 , N   n
g  n1
 lim T h  s    gn
N  N  

Xiaolan Xie
Continuous time Markov decision
processes

Xiaolan Xie
Assumptions

Assumption 1: The decision epochs T = R+

Assumption 2: The state space S is finite
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary cost rates and transition rates;
C(s, a) and m(j |s, a) do not vary from decision epoch to
decision epoch

Xiaolan Xie
Assumptions
Criterion:

 
infHR E   C  X  t  , a  t   e dt 
t
   t 0 

1 T 
infHR lim E   C  X  t  , a  t   dt 
  T   T 
 t 0

Xiaolan Xie
Example
• Consider a system with one machine producing one product. The
processing time of a part is exponentially distributed with rate p. The
demand arrive according to a Poisson process of rate d.
• state Xt = stock level, Action : at = make or rest

 hX , if X  0
 g  X  t  e dt with g  X   
t
Minimize
t 0  bX , if X  0

(make, p) (make, p) (make, p)

(make, p)

0 1 2 3
d d d
d

Xiaolan Xie
Uniformization
Any continuous-time Markov chain can be converted to a
discrete-time chain through a process called
« uniformization ».

Each Continuous Time Markov Chain is characterized by

the transition rates mij of all possible transitions.
The sojourn time Ti in each state i is exponentially
distributed with rate m(i) = Sj≠i mij, i.e. E[Ti] = 1/m(i)
Transitions different states are unpaced and
asynchronuous depending on m(i).

Xiaolan Xie
Uniformization
In order to synchronize (uniformize) the transitions at the same
pace, we choose a uniformization rate
g  MAX{m(i)}
« Uniformized » Markov chain with
•transitions occur only at instants generated by a common a
Poisson process of rate g (also called standard clock)
•state-transition probabilities
pij = mij / g
pii = 1 - m(i)/ g
where the self-loop transitions correspond to fictitious events.

Xiaolan Xie
Uniformization
CTMC
a Step1: Determine rate of the states
m(S1) = a, m(S2) = b
S1 S2
b
Step 2: Select an uniformization
Uniformized CTMC rate
g-a a
g-b g ≥ max{m(i)}
S1 S2
b Step 3: Add self-loop transitions to
states of CTMC.
DTMC by uniformization
1-a/g a/g 1-b/g
Step 4: Derive the corresponding
S1 S2 uniformized DTMC
b/g
Xiaolan Xie
Uniformization

Rates associated to states

m(0,0) = l1+l2
m(1,0) = m1+l2
m(0,1) = l1+m2
m(1,1) = m1
Xiaolan Xie
Uniformization

For Markov decision process, the uniformization rate

shoudl be such that

g  m(s, a) = SjS m(j|s, a)

for all states s and for all possible control actions a.

The state-transition probabilities of a uniformized Markov

decision process becomes:

p(j|s, a) = m(j|s, a)/ g

p(s|s, a) = 1- SjS m(j|s, a)/ g

Xiaolan Xie
Uniformization

(make, p) (make, p) (make, p)

(make, p)

0 1 2 3
d d d
d

Uniformized Markov decision process

at rate g = p+d
(make, p/g) (make, p/g) (make, p/g) (make, p/g) (make, p/g)

0 1 2 3
d/g d/g d/g d/g d/g

(not make, p/g) (not make, p/g) (not make, p/g) (not make, p/g)

Xiaolan Xie
Uniformization
Under the uniformization,
• a sequence of discrete decision epochs T1, T2, … is generated
where Tk+1 – Tk = EXP(g).
• The discrete-time markov chain describes the state of the system at
these decision epochs.
• All criteria can be easily converted.
continuous cost C(s,a)
per unit time fixed cost
fixed cost k(s,a, j)
K(s,a)
(s,a) j
EXP(g) EXP(g) EXP(g)

T0 T1 T2 T3 Poisson process at rate

g
Xiaolan Xie
Cost function convertion
for uniformized Markov chain
Discounted cost of a stationary policy p (only with continuous cost):
    Tk 1 
E   C  X  t  , a  t   e dt   E    C  X  t  , a  t   e   t dt 
t

 t  0   k  0 t Tk 
 Tk 1  State change & action taken only at T k
 E   C  X k , ak   e   t dt 
 k  0 t Tk 
  Tk 1 
Mutual independence of (Xk, ak) and
  E C  X k , ak  E  e dt 
 t
event-clocks (Tk, Tk+1)
k 0 t Tk 
 k
1   
  E C  X k , ak     Tk is a Poisson process at rate g
k 0     
    C  X k , ak  
k
 E    
 k  0       


Average cost of a stationary policy p (only with continuous cost):

1 T   N C  X k , ak   1 N 
lim E   C  X  t  , a  t   dt   lim E   E  C  X k , ak  
T   T t  0  N   N k 0   N k 0 

Xiaolan Xie
Cost function convertion
for uniformized Markov chain

Tk is a Poisson process at rate g, i.e. Tk = t1 + … + tk, ti = EXP(g)

 Tk 1     T  k 1   t   k 1   t 

E   e dt   E e k  e dt   E e k  E   e dt 
 t  T
(independence of Tk and  k 1 )
t Tk       
t 0  t 0

E e   Tk   E e  1
   ... k  
 E e  1 ...e   k 
     
k
  E e   i  (independence of  i )
 
i 1
k
k   x k
   
=  e  e  x dx   
i 1      
0
i 1

 k 1   t  1
 
E   e dt   E  1  e   k 1   
 t 0    0 
 1
 
1  e   x  e  x dx  1
 

Xiaolan Xie
Optimality equation: discounted cost case
Equivalent discrete time discounted MDP
• a discrete-time Markov chain with uniform transition rate g
• a discount factor l = g/(g+b)
• a stage cost given by the sum of
─ continuous cost C(s, a)/(b+g),
─ K(s, a) for fixed cost incurred at T0
─ lk(s,a,j)p(j|s,a) for fixed cost incurred at T1
Optimality equation
 C  s, a     
V  s   MIN   K  s, a      p  j s, a   k  s, a, j   V  j   
a A s          jS 

Xiaolan Xie
Optimality equation: average cost case
Equivalent discrete time average-cost MDP
• a discrete-time Markov chain with uniform transition rate g
• a stage cost given by C(s, a)/g whenever a state s is entered
and an action a is chosen.
Optimality equation for average cost per uniformized period:
 C  s, a  
h  s   g  MIN    p  j s, a  h  j  
a As 
  jS 
where
• g = average cost/uniformized period,
• gg =average cost/time unit,
• h(s) = differential cost with respect to reference state s0 and h(s0) = 0

Xiaolan Xie
Optimality equation: average cost case
Multiply both side of the optimality equation by g leads to:

Alternative optimality equation 1:

 
H  s   G  MIN C  s, a    p  j s, a  H  j  
a As  
 jS
where
• G = gg optimal average cost per time unit
• H(s) = modified differential cost with H(s) = g(V(s) – V(s0))

Xiaolan Xie
Optimality equation: average cost case
Alternative optimality equation 2: Hamilton-Jacobi-Bellman equation
 
0  MIN C  s, a   G     j s, a  h  j  
a As  
 jS

where

• h(s) = differential cost with respect to a reference state s0

• m(j|s,a) = transition rate from (s,a) to j with   s ¨s, a      j ¨s, a 

js

i.e. m(j|s,a) = gp(j|s,a) and m(s|s,a) = gp(s|s,a) - g

Xiaolan Xie
Example (continue)
Uniformize the Markov decision process with rate g = p+d

The optimality equation:

 g  s  p d  
  V  s  1  V  s  1  : producing 
   pd pd  
V  s   MIN  
  
g s  p d  
     V  s   V  s  1  : not producing 
  p  d p  d  

Xiaolan Xie
Example (continue)
From the optimality equation:

g  s  p d  p
V  s   V  s  V  s  1    MIN  V  s  1  V  s  , 0
   pd pd  pd

If V(s) is convex, then there exists a K such that :

V(s+1) –V(s) > 0 and the decision is not producing, for all s >= K and
V(s+1) –V(s) <= 0 and the decision is producing, for all s < K
 g  s  p d  
  V  s  1  V  s  1  : producing 
   pd pd  
V  s   MIN  
  
g s  p d  
     V  s   V  s  1  : not producing 
  pd pd  
Xiaolan Xie
Example (continue)
Convexity proved by value iteration
g  s  p 
V n 1  s  
 

 pd
 
MIN V n  s  1 ,V n  s  
d
pd
V n  s  1 

V 0  s  0

Proof by induction.
V0 is convex.
 
MIN V n  s  1 , V n  s  is convex
If Vn is convex with minimum
at s = K, then Vn+1 is convex.

s
K-1 K

Xiaolan Xie
Example (continue)
Convexity proved by value iteration
•Assume Vn is convex with minimum at s = K.
•Vn+1 is convex ifU n  s   MIN  V n  s  1 ,V n  s   is
i.e. DU(s)  D U(s+1) where DU(s) = U(s+1) – U(s)
•True for s +1 < K-1 and s > K-1 by induction.
•Proof established if
DU(K-2)  D U(K-1)  DVn(K-1)  0

MIN V n  s  1 , V n  s  
DU(K-1)  D U(K)  0  D Vn(K)
V(s+1) V(s)

s
K-1 K

Xiaolan Xie
Condition for optimality of
monotone policy
(first order properties)

Xiaolan Xie
Monotone policy
Monotone policy
p(s) nondecreasing or nonincreasing in s

Question: When there exists an optimal monotone policy?

Answers: monotonicity (addressed here) and

convexity (addressed in the previous example)

Only finite-horizon case is considered but can be extented to

discounted and average cost.

Xiaolan Xie
Submodularity and Supermodularity
A function g(x, y) is said supermodular if for x+ ≥ x- and y+ ≥ y-,
g  x , y    g  x  , y    g  x  , y    g  x  , y  
g  x , y   g  x , y  
It is said submodular if
g  x , y    g  x  , y    g  x  , y    g  x  , y  

g  x , y   g  x , y  

Supermodularity  Increasing difference, i.e.

g  x , y    g  x , y    g  x , y    g  x , y  
g  x , y    g  x , y    g  x , y    g  x , y  

Submodularity  Decreasing difference

Xiaolan Xie
Submodularity and Supermodularity
Supermodular functions:
g  x, y    h  x, y  , h is submodular
g  x, y   h  x   e  y 
g  x, y   h  x  y  , h is convex increasing and x, y  R
g  x, y   xy, x, y  R

Property 1: If g(x,y) is supermodular (submodular), then

f(x) = min or max selection of the set argmaxy g(x,y) of maximizers

is monotone nondecreasing (nonincreasing) in x.

Xiaolan Xie
Dynamic Programming Operator
• DP operator T

 
Vt 1  s   T Vt  s    max r  s, a    p  j s , a  Vt  j  
a As  
 jS

VN  s   0

equivalently

 
T Vt  s    max r  s, a   E Vt s next s , a 
a As   

Xiaolan Xie
DP Operator: monotonicity preservation

Property 2 T[Vt(s)] is nondecreasing (nonincreasing) if

1. r(s,a) is nondecreasing (nonincreasing) in s for all a
2. (snext|s,a) is nondecreasing in s for all a, k

Criterion for (2):  p  j s, a  is nondecreasing in s, k , a

jk

Xiaolan Xie
DP Operator: control monotonicity
Theorem 1. Action  t  s  is nondecreasing (nonincreasing) in s if
1. r  s, a  is nondecreasing in s for all a

 
2. s next s, a is nondecreasing in s for all a, k
3. r  s, a  is supermodular (submodular)

 
4. E u s next s, a  is supermodular (submodular) for all nondecreasing u
 

 
Proof. By (4) + Property 2: E Vt s next s, a  supermodular
 
 
 (2) : r  s, a   E Vt s next s, a  supermodular
 
+Property 1: control monotonicity.

Criterion for (4):  p  j s, a  is supermodular (submodular) k

j k Xiaolan Xie
DP Operator: control monotonicity

Theorem 2.
The optimal action  t  s  is nondecreasing (nonincreasing) in s if
1. r  s, a  is nonincreasing in s for all a

 
2. s next s, a is nondecreasing in s for all a, k
3. r  s, a  is supermodular (submodular)

 
4. E u s next s, a  is supermodular (submodular) for all nonincreasing u
 

Xiaolan Xie
Batch delivery model
• Customer demand Dt for a product arrives over time.
• State set S = {0, 1, …}: quantity of pending demand
• Action set A = {0=no delivery, 1=deliver all pending demand}
• Cost C(s,a) = hs(1-a) + aK
where
Submodularity
h = unit holding cost,
a(s) nondecreasing
K= fixed delivery cost
• Transition snext = s(1-a) + D where P(D=i) = pi, i=0, 1, …
GOAL: minimize the total cost

Xiaolan Xie
Batch delivery model

1. C  s, a  =hs  1  a  +aK nondecreasing in s

 
2. s next s, a =s  1  a   D nondecreasing in s
3. C  s, a  submodular

    
C s  ,1  C s  , 0  K  hs   K  hs   C s  , 0  C s  ,1  
 
4. U  s, a   E u s next s, a  submodular for all nondecreasing u
 
    
U s  ,1  U s  , 0  E u  D    E u s   D 
  
    
 E u  D    E u s   D   U s  , 0  U s  ,1
  
Min submodular  Max supermodular

Xiaolan Xie
A machine replacement model
• Machine deteriorates by a random number I of states per period
• State set S = {0, 1, …} from best to worse condition
• Action set A = {1=replace,0=not replace}
• Reward r(s,a) = R – h(s(1-a)) – aK
Supermodularity
R = fixed income per period,
a(s) nondecreasing
h(s) = nondecreasing operation cost
K= replacement cost
• Transition snext = s(1-a) + I where P(I=i) = pi, i=0, 1, …
GOAL: maximize the total reward

Xiaolan Xie
A machine replacement model
1. r  s, a   R - h  s  1  a    aK nonincreasing in s

 
2. s next s, a  s  1  a   I nondecreasing in s
3. r  s, a  supermodular

   
r s  ,1  r s  , 0  2 R  K  h s   
     
 2 R  K  h s   r s  , 0  r s  ,1

4. U  s, a   E u  s s, a   supermodular for all nonincreasing u

next
 
U  s ,1  U  s , 0   E u  I    E u  s  I  
  
 
 E u  I    E u  s  I    U  s , 0   U  s ,1
  
 

Xiaolan Xie
A general framework for
value function property analysis

Based on G. Koole, “Structural results for the control of queueing systems

using event-based dynamic programming,” Queueing Systems 30:323-339,
1998

Xiaolan Xie
Introduction: event operators

Definition.

For any function f : N 0   0,1, 2,...  R,

TAC f  x   min  f  x  1 , c  f  x  

TD f  x   f   x  1 

Tcosts f  x   C  x   f  x 
Tunif  f1 , f 2   x   pf1  x    1  p  f 2  x 

Xiaolan Xie
Introduction: a single server queue

• exponential server
• Poisson arrivals of which the admission can be controlled
• l: arrival rate
• m: service rate , l+m= 1
• c: unit rejection cost
• C(x): holding cost of x customers

Standard DP
Vn 1  x   C  x    min  Vn  x  1 , c  Vn  x    Vn   x  1 

event-based DP
Vn 1  Tcosts  Tunif  TACVn , TDVn  

Xiaolan Xie
Introduction: discrete-time queue

1: customer arrival rate, i.e. one customer per period

p: geometric service rate
x: queue length before admission decision and service completion
Standard DP
 pVn  x    1  p  Vn  x  1 , 
 
Vn 1  x   C  x   min 
 
 
c  pVn  x  1   1  p  Vn  x  

Event-based DP

 
Vn 1  x   Tcosts TAC Tunif  Vn , TD  Vn      x
Xiaolan Xie
One-dimension models : operators

x   x1 , x2 ,... : state
xi : integer such that xi  bi
bi  

Xiaolan Xie
One-dimension models: operators

Tcosts f  x   C  x    f  x  ,   0 or Tcosts  C , f 
direct cost & discounting
Tunif  f1 ,..., f l   x    f j  x
l
p
j 1 j

convex combination for uniformization

Xiaolan Xie
One-dimension models: operators
TA i  f  x   f  x  ei 
arrival at queue i
 B  xi  bi  f  x  ei    xi  bi  f  x  , if xi  bi  B
TFS  i  f  x   
 Bf  x  , otherwise
finite source arrival
TD  i  f  x   f  max  x  ei , bi  
departure from queue i
 xi  bi  f  x  ei    S  xi  bi  f  x  , if xi  bi  S
TMD i  f  x   
 Sf  x  ei  , otherwise
departure from a S-server queue

Xiaolan Xie
One-dimension models : operators
TAC  i  f  x   min   c   f  x  ei    1    f  x  
(generalized) admission control at queue i
 
 min  c   f  x  ei    1    f  x  , if xi  bi
TCD  i  f  x   
c0  f  x  , otherwise
(generalized) controlled departures.
,    0,1 with 0  , M : action set for admission and departure

 1 2 
Examples of action set:  0,1 ,  0,1 , 0, , ,...,1
 m m 

Bang-bang control if c  c   0 or linear

Xiaolan Xie
One-dimension models : properties

Inc  i  : f  x   f  x  ei 
Conv  i  : 2 f  x  ei   f  x   f  x  2ei 

Xiaolan Xie
One-dimension models : property propagation

Notation:

T : X1 ,..., X k  X j

denotes

if f1 ,..., f l all have properties X 1 ,..., X k ,

then T  f1 ,..., fl  has property X j

Xiaolan Xie
One-dimension models : property propagation
Lemma 1.
Tcosts : Inc  1  Inc  1 ; Conv  1  Conv  1
Tunif : Inc  1  Inc  1 ; Conv  1  Conv  1
TA 1 : Inc  1  Inc  1 ; Conv  1  Conv  1
TFS  1 : Inc  1  Inc  1 ; Conv  1  Conv  1
TAC  1 : Inc  1  Inc  1 ; Conv  1  Conv  1
TD  1 : Inc  1  Inc  1 ; Inc  1 , Conv  1  Conv  1 ;
Conv  1  Conv  1 if b1  
TMD 1 : Inc  1  Inc  1 ; Inc  1 , Conv  1  Conv  1 ;
Conv  1  Conv  1 if b1  
TCD 1 : Inc  1  Inc  1 if c0  min M c ; Conv  1  Conv  1

Xiaolan Xie
One-dimension models : property propagation
Proof of Lemma 1.
Tcosts and Tunif : results follow directly as increasingness and convexity are closed under convex combinations.
TA(1) : results follow directly, by replacing x by x + e1 in the inequalities.
TFS(1) : certain terms cancel out.
TD(1) : Increasingness follows as for TA(1), except if x1 = b1. In this case TD(1)f(x) = TD(1)f(x + e1). Also for the
convexity the only non-trivial case is x1 = b1. This reduces to f(x)  f(x + e1).
TMD(1) : Roughly the same arguments are used.
Rewrite TAC  1  x   min c  F  x,   with F  x,    E  f  x  I    , I   1 U   e1 , U  UNIF  0,1
TAC(1) : 

2TAC  1  x  e1   c  F  x  e1 , 0   c  F  x  e1 , 2  with 0    x  , 2    x  2e1 

0 2

  
F  x  e1 , 0   F  x  e1 , 2   E  f x  e1  I 0  f x  e1  I 2  .
  
Assume 0  2 and proof for the case 0  2 is similar.
       
U  0 : f x  e1  I 0  f x  e1  I 2  2 f  x  2e1   f x  I 0  f x  2e1  I 2

U   : f  x  e  I   f  x  e  I   2 f  x  e   f  x  I   f  x  2e  I 
2 1 0 1 2 1 0 1 2

  U   : f  x  e  I   f  x  e  I   f  x  e   f  x  2e   f  x  I   f  x  2e  I 
0 2 1 0 1 2 1 1 0 1 2

where the last inequality is from f  x1  3e1   f  x1  2e1   f  x1  e1   f  x1  .

As a result,F  x  e1 , 0   F  x  e1 , 2   F  x, 0   F  x  2e1 , 2 
Leading to 2TAC  1  x  e1   TAC  1  x   TAC  1  x  2e1 

TCD(1) : similar proof as for TAC(1).

Xiaolan Xie
One-dimension models : property propagation
Environment component
Tenv i   f1 ,..., fl   x    yIN   xi , y   j 1 q j  xi , y  f j  x *
l
0

where x * is equal to x with the i-th component replaced by y

Lemma 2.
Tenv 0  : Inc  1  Inc  1 ; Conv  i   Conv  i 

Xiaolan Xie
One-dimension models : property propagation
Theorem 3.
For an event-based DP value function Vn ,
(i) Vn  F  Inc  1 , Conv  1  if constructed with operators in
Tenv  0  , Tcost , Tunif , TA(1) , TFS  1 , TAC  1 , TD (1) , TMD (1) , TCD (1) with c0  min  c 

and C  F  Inc  1 , Conv  1  ,V0  F  Inc  1 , Conv  1 

(ii) Vn  F  Conv  1  if constructed with operators in
Tenv  0  , Tcost , Tunif , TA(1) , TFS  1 , TAC  1 , TCD (1) ,
TD (1) with b1 =-, TMD (1) with b1 =-
and C  F  Conv  1  ,V0  F  Conv  1 

where F  P   set of functions having all properties in P

Xiaolan Xie
a single server queue
• l: arrival rate is l, m: service rate , l+m= 1
• c: unit rejection cost
• C(x): holding cost of x customers

Vn 1  x   C  x    min  Vn  x  1 , c  Vn  x    Vn   x  1 

 C  x   Vn  x    min  Vn  x  1  Vn  x  , c  Vn   x  1 

Vn 1  Tcosts  Tunif  TACVn , TDVn  
Vn 1 is increasing convex if C  x  and V0 are.
Threshold control: accept if Vn  x  1  Vn  x   c

Xiaolan Xie
discrete-time queue

1: customer arrival rate, p: geometric service rate

x: queue length before admission decision and service completion

 pVn  x    1  p  Vn  x  1 , 
 
Vn 1  x   C  x   min 
 
 
c  pVn  x  1   1  p  Vn  x  

 
Vn 1  x   Tcosts TAC Tunif  Vn , TD V      x
n

Vn 1 is increasing convex if C  x  and V0 are.

Threshold control: accept if
Tunif  Vn , TD  Vn    x  1  Tunif  Vn , TD  Vn    x   c

Xiaolan Xie
Production-inventory system

g  s  p 
V n 1
 s 
 
 
 pd
 
MIN V n  s  1 , V n  s  
d
pd
V n  s  1 

g  s

 
 p
 
 pd
n
 
V  s   MIN V  s  1  V  s  , 0 
n n d
pd
n
 
V  s  1 


 
V n 1  s   Tcosts Tunif TACV n , TDV n 
V n 1 is convex if g  s  and V 0 are (Theorem 3 with b1 =-).
Threshold control: produce if V n  s  1  V n  s   0

Xiaolan Xie
Multi-machine production-inventory with preemption

Same as previous example but with m identical machines & preemption

g  s  ap  m  a p n d 
V  s 
n 1
  min  V  s  1 
n
V  s  V  s  1 
n
   
a   0,1,..., m  mp  d mp  d mp  d 
g  s  

 
   min
 a 0,1,..., m mp  d

ap

V n  s  1  V n  s  
mp
mp  d
V n  s 
d
mp  d
V n  s  1 


 
V n 1  s   TcostsTAC Tunif TAV n ,V n , TDV n 
V n 1 is convex if g  s  and V 0 are.
Threshold control:
all m-machine produce if V n  s  1  V n  s   0
all stop otherwise.

Xiaolan Xie
Examples of Tenv(i)
Tenv (0) : generalizes Tunif


Vn 1  Tcosts  Tunif

T

T
 1
V
 env (0) AC  1 n
,..., T
 l
V
AC  1 n 
, T V 
D n

 j
TAC  1 : operators of l different arrival streams

  xi , y  in Tenv (0) = normalized transition rate to state y of a Markov process

(Markov Arrival Process = MAP)
q j  proba of an arrival in class j

   1  l
Vn 1  Tcosts Tenv (0) TAC  1 Vn ,..., TAC  1 Vn , TDVn 
environmental-dependent departure

Control policy keeps its structure but depends on the environment

Xiaolan Xie
Examples
Admission control of GI/M/1 queue


Vn 1  Tcosts C , TAC  1 Tunif Vn , TD 1 Vn , TD2 1 Vn ,...
  
TDj 1 : j-fold convolution of TD  1 ,
p j : probability of j departures during an interarrival time

F  Inc  1 , Conv  1  Batch arrival and individual admission:

convex combination of convolutions of TAC  1

Finite buffer approximation


Add in the stage cost C  x    x  B  K where K is a big number

F  Conv  1  C  x   1 x  0 r  cx where x  holding cost, r = reward

Xiaolan Xie
Two-dimension models : operators
   
TACF  I  f  x   min  c   f  x   ei    1    f  x  
  iI  
admission control of a fork to all queues in I
TR  I  f  x   min iI f  x  ei  :
routing to one of the queues in I
TMS  I  f  x   min iI f  max  x  ei , bi   :
a movable server for queues in I
 min

TCJ  i , j  f  x   
   
 c   f x  ei  e j   1    f  x  , if xi  bi

c0  f  x  , otherwise
controlled jockeying, i  j

Xiaolan Xie
Two-dimension models : properties
supermodularity Super  i, j  :

  
f  x  ei   f x  e j  f  x   f x  ei  e j 
submodularity Sub  i, j  :

  
f  x   f x  ei  e j  f  x  ei   f x  e j 
superconvexity SuperC  i, j  :

   
f  x  ei   f x  ei  e j  f x  e j  f  x  2ei 

f  x  e j   f  x  ei  e j   f  x  ei   f  x  2e j 
subconvexity SubC  i, j  :

  
f  x  ei   f x  ei  e j  f  x   f x  e j  2ei 
f  x  e j   f  x  ei  e j   f  x   f  x  ei  2e j 

Xiaolan Xie
Two-dimension models : properties
Conv Super SuperC
R R

+ LL
L R R L + L L

R L R R
R LL R R L

Inc Sub SubC

+ R
R L L R + L L

L R L R
L R L R

Super(i,j) + SuperC(i,j)  Conv(i)+Conv(j)

Super(i,j) + Conv(i) + Conv(j)  SubC(i,j)
Sub(i,j) + SubC(i,j)  Conv(i)+Conv(j)
Sub(i,j) + Conv(i) + Conv(j)  SuperC(i,j) Xiaolan Xie
2-dimension models : property propagation
Lemma 4.
Tcosts , Tenv i  , i  j , k , TA i  :
Inc  j   Inc  j  ;
Super  j , k   Super  j , k  ; Sub  j , k   Sub  j , k 
SuperC  j , k   SuperC  j , k  ; SubC  j , k   SubC  j , k 
TFS  i  :
Inc  j   Inc  j  ;
Super  j , k   Super  j , k  ; Sub  j , k   Sub  j , k 

Xiaolan Xie
2-dimension models : property propagation

Lemma 4.
TD i  :
Inc  j   Inc  j  ;
Super  j , k   Super  j , k  ; Sub  j , k   Sub  j , k 
Inc  j  , Inc  k  , Super  j , k  , SuperC  j , k   SuperC  j , k   i  j , k 
Inc  j  , Inc  k  , Sub  j , k  , SubC  j , k   SubC  j , k   i  j , k 
SuperC  j , k   SuperC  j , k  , SubC  j , k   SubC  j , k  , if bi  
TMD j  :
Inc  j   Inc  j  ;
Super  j , k   Super  j , k  ; Sub  j , k   Sub  j , k 
SuperC  j , k   SuperC  j , k  , SubC  j , k   SubC  j , k  , if bi  

Xiaolan Xie
2-dimension models : property propagation
Lemma 4.
TAC  i  : Inc  j   Inc  j  ;
TCD i  : Inc  j   Inc  j   i  j  ;
Inc  i   Inc  i  if c0  min M c
TAC  i  , TCD i  :
Super  j , k   Super  j , k   i  j , k  ;
Sub  j , k   Sub  j , k   i  j , k 
Super  j , k  , SuperC  j , k   SuperC  j , k   i  j , k  ;
Sub  j , k  , SubC  j , k   SubC  j , k   i  j , k 

Xiaolan Xie
2-dimension models : property propagation
Lemma 4.
TACF  I  : Inc  j   Inc  j  ;

Sub  j , k  , SubC  j , k   Sub  j , k   I   j , k   ;

Sub  j , k  , SubC  j , k   SubC  j , k   I   j , k   ;
TR  I  : Inc  j   Inc  j  ;

Super  j , k  , SuperC  j , k   Super  j , k   I   j , k   ;

Super  j , k  , SuperC  j , k   SuperC  j , k   I   j , k   ;
TCJ  i , j  : Inc  k   Inc  k   k  i  ; Inc  i   Inc  i  if bi  ;
Inc  i  , Inc  j   Inc  i  if bi  , c0  min M c
Super  i, j  , SuperC  i, j   Super  i , j  ;
SuperC  i, j   SuperC  i, j  ;

Xiaolan Xie
2-dimension models : property propagation
Lemma 4.
TMS  I  : Inc  j   Inc  j  ;

Inc  j  , Inc  k  , Super  j , k  , SuperC  j , k   Super  j , k   I   j , k   ;

Inc  j  , Inc  k  , Super  j , k  , SuperC  j , k   SuperC  j , k   I   j , k   ;
Super  j , k  , SuperC  j , k   Super  j , k   I   j , k   if bi  ;
Super  j , k  , SuperC  j , k   SuperC  j , k   I   j , k   if bi  ;

Xiaolan Xie
2-dimension models : property propagation

Theorem 5.
For an event-based DP value function Vn ,
Vn  F  Inc  1 , Inc  2  , Super  1, 2  , SuperC  1, 2  
if Vn constructed with operators in
Tenv 0  , Tcost , Tunif , TA(1) , TA(2) , TAC  1 , TAC  2  , TD (1) , TD (2) , TR (1,2) , TMS (1,2) ,
TCD (1) ,TCD (2) with c0  min M c ,
TCJ (1,2) , TCJ (1,2) with c0  min M c

and C , V0  F  Inc  1 , Inc  2  , Super  1, 2  , SuperC  1, 2  

Xiaolan Xie
2-dimension models : property propagation
Theorem 5'.
For an event-based DP value function Vn ,
Vn  F  Super  1, 2  , SuperC  1, 2  
if Vn constructed with operators in
Tenv 0  , Tcost , Tunif , TA(1) , TA(2) , TAC  1 , TAC  2  , TR (1,2) ,
TD (1) with b1  , TD (2) with b2  , TMS (1,2) with b1  b2  ,
TCD (1) ,TCD (2) with c0  min M c ,
TCJ (1,2) , TCJ (1,2) with c0  min M c

and C , V0  F  Super  1, 2  , SuperC  1, 2  

Xiaolan Xie
2-dimension models
Control structure under Super(1, 2) + SuperC(1, 2)

Super(1, 2) + SuperC(1, 2)  Conv(1) + Conv(2)

reject
Conv(1) TAC(1) : threshold admission in x1
Conv(2) TAC(2) : threshold admission in x2
Super(1, 2) TAC(1) : of threshold form in x2 admission
Super(1, 2) TAC(2) : of threshold form in x1

SuperC(1, 2) For TAC(1) : rejection in x+e2  rejection in x+e1

SuperC(1, 2) For TAC(2): rejection in x+e1  rejection in x + e2.

TAC(1) & TAC(2) : decreasing switching curve below which customers are admitted.
TCD(1) and TCD(2) can be seen as dual to TAC(1) and TAC(2), with corresponding
results.

Xiaolan Xie
2-dimension models
Control structure under
Super(1, 2) + SuperC(1, 2)
queue1
SuperC(1, 2):
TR: an increasing switching curve above
(below) which customers are assigned to
queue 1 (2) queue2
SuperC(1, 2):
TCJ(1,2): the optimal control is increasing in
x1 and decreasing in x2, i.e. an increasing
switching curve, below which jockeying
occurs.

TCJ(2,1): an increasing switching curve, above

which jockeying occurs.

Xiaolan Xie
2-dimension models : property propagation

Theorem 6.
For an event-based DP value function Vn ,
Vn  F  Super  1, 2  
if Vn constructed with operators in
Tenv 0  , Tcost , Tunif , TA(1) , TA(2) , TAC  1 , TAC  2 ,
TFS  1 , TFS  2  , TMD 1 , TMD 2  , TD (1) , TD (2) , TCD (1) ,TCD (2) ,

and C , V0  F  Super  1, 2  

Xiaolan Xie
2-dimension models
Control structure under Super(1, 2)

Admission control for class 1 is decreassing in class 2 and vice vers.

Xiaolan Xie
2-dimension models : property propagation

Theorem 7.
For an event-based DP value function Vn ,
Vn  F  Inc  1 , Inc  2  , Sub  1, 2  , SubC  1, 2  
if Vn constructed with operators in
Tenv 0  , Tcost , Tunif , TA(1) , TA(2) , TACF , TD (1) , TD (2) , TCD (1) ,TCD (2)

and C , V0  F  Inc  1 , Inc  2  , Sub  1, 2  , SubC  1, 2  

Xiaolan Xie
2-dimension models
Control structure under Sub(1, 2) + SubC(1, 2)
Sub(1, 2) + SubC(1, 2)  Conv(1) + Conv(2)

Conv(i) threshold admission rule for TAC(i) in xi

Sub(1, 2) TAC(1) is of threshold form in x2
Sub(1, 2) TAC(2) is of threshold form in x1

SubC(1, 2)
TAC(1) (TAC(2)) : increasing switching curve above (below) which customers are
admitted.

Also the effects of TCD(i) amount to balancing in some sense the two queues.

The two queues “attract” each other.

TACF(1,2) has a decreasing switching curve below which customers are admitted

Xiaolan Xie
2-dimension models
Control structure under Sub(1, 2) + SubC(1, 2)

TAC(1) TAC(2)

Admission No
queue1
Admission
No
queue2

Xiaolan Xie
Examples: a queue served by two servers
• A common queue served by two servers (1= fast, 2=slow)
• Poisson arrivals to the queue
• Exponential servers but with different mean service times
• Goal: minimizes the mean sojourn time

Vn 1  Tcosts C  C , Tunif TA(1)TCJ  1,2  Vn , TD (1)TCJ  1,2  Vn , TD (2)TCJ  1,2Vn 
  

c  0 in TCJ  1,2 , C  x1  x2 , C   x2  1 K , K = big number

C  C  F  Inc  1 , Inc  2  , Super  1, 2  , SuperC  1, 2  

Send a customer to server 2 is suboptimal if x2  0

Theshold policy to send a customer to server 2 if x2  0

Xiaolan Xie
Examples: a queue served by two servers

TCJ(1,2)

To slow

Xiaolan Xie
Examples: production line with Poisson demand

• M1 feeds buffers 1, M2 transfers to buffer 2

• Poisson demand filled from queue 2
• Production rate control of both machines
1
Vn 1  x   C  x   min  0,1  Vn  x  e1    1    Vn  x   

2 d
min  0,1  Vn  x  e1  e2    1    Vn  x    Vn  x  e2 
 
   
C  h1  x1   K   x1   h2  x2   b2   x2 


Vn 1  Tcosts C , Tunif TAC (1)Vn , TCJ  1,2  Vn , TD 2  Vn 
  
C  F  Super  1, 2  , SuperC  1, 2  

Xiaolan Xie
Examples: tandem queues with Poisson demand

M1 produce M2 produce
x1

Xiaolan Xie
Examples: admission control of tandem queues
• Two tandem queues: queue 1 feeds queue 2
• Convex holding cost hi(xi)
• Service rate control of both queues
• Admission control of arrival to queue 1


Vn 1  Tcosts C , Tunif TAC (1)Vn , TCJ  1,2  Vn , TCD  2  Vn 
  
C  F  Inc  1 , Inc  2  , Super  1, 2  , SuperC  1, 2  

Xiaolan Xie
Examples: cyclic tandem queues
• Two cyclic queues: queue 1 feed queue 2, vice versa
• Convex holding cost hi(xi)
• Service rate control of both queues


Vn 1  Tcosts C , Tunif TA(1)Vn , TA(2)Vn , TCJ  1,2  Vn , TCJ  2,1 Vn 
  
C  F  Inc  1 , Inc  2  , Super  1, 2  , SuperC  1, 2  

Xiaolan Xie
Multi-machine production-inventory with non preemption

x1  buffer level
x2  queue length of machine 2


V n 1  s   TcostsTunif TAC  1 TAC  2  V n , TeD  2  TAC  2  V n , TD  1 TAC  2  V n 
p  2,1  1 in TeD 2  ,
  TAC  2  V n : SuperC  1, 2 
C  h1  x1   b1   x1 

 f  x  e2  e1   K  x2  1 , if x2  1
TeD  2  f  x    AC(1)
 f  x   Kx2 , if x2  0
TeD  2  : Super  1, 2   Super  1, 2  ; SuperC  1, 2   SuperC  1, 2 

V n : SuperC  1, 2 
Vn  F  Super  1, 2  , SuperC  1, 2  
0  x2  1
AC(2)

Xiaolan Xie
Examples: stochastic knapsack
• Packing a knapsack of integer volume B with objects
from 2 different classes to maximize profit
• Poisson arrivals


Vn 1  Tcosts C , Tunif TAC (1)Vn , TAC (2)Vn  

C   b1 x1  b2 x2  B  K
C  F  Super  1, 2  

Admission control in class 1 decreases in x2 and vice versa

Xiaolan Xie
Examples

C  max  x1 , x2 

C   x1  B  K
C  C  F  Inc  1 , Inc  2  , Sub  1, 2  , SubC  1, 2  

Xiaolan Xie

Practical List of Computer Graphics
No ratings yet
Practical List of Computer Graphics
26 pages
Constraint Interview Ques Solutions
No ratings yet
Constraint Interview Ques Solutions
5 pages
Ci - Adaline & Madaline Network
No ratings yet
Ci - Adaline & Madaline Network
35 pages
Ec 1252 Signals and System
100% (2)
Ec 1252 Signals and System
21 pages
Dynamic Programming - General Method
No ratings yet
Dynamic Programming - General Method
13 pages
Big Coding Y11
No ratings yet
Big Coding Y11
250 pages
AE4-Activity 3A Maximization of Profit - Simplex Method (Answer)
No ratings yet
AE4-Activity 3A Maximization of Profit - Simplex Method (Answer)
3 pages
24MTCS003HY KMP Algorithm Presentation
No ratings yet
24MTCS003HY KMP Algorithm Presentation
12 pages
Sample Exam 1 EE 210
No ratings yet
Sample Exam 1 EE 210
6 pages
BOS 16 Week Teaching Plan Programming For AI (1) 1
No ratings yet
BOS 16 Week Teaching Plan Programming For AI (1) 1
3 pages
Samarth Adatia MLSP Exp2
No ratings yet
Samarth Adatia MLSP Exp2
14 pages
Wipro Engineering Online Test Curriculum: Every Eligible Candidate Must Go Through Below Online Assessment (110 Minutes)
No ratings yet
Wipro Engineering Online Test Curriculum: Every Eligible Candidate Must Go Through Below Online Assessment (110 Minutes)
2 pages
Na Practicals
No ratings yet
Na Practicals
63 pages
16 SparkAlgorithms
No ratings yet
16 SparkAlgorithms
64 pages
Standard Greedy Algorithms
No ratings yet
Standard Greedy Algorithms
5 pages
Method of Quadratic Sequence
No ratings yet
Method of Quadratic Sequence
4 pages
Module 2 Significant Figure and Scientific Notation GEN
No ratings yet
Module 2 Significant Figure and Scientific Notation GEN
5 pages
Dependency Parsing 2: CMSC 723 / LING 723 / INST 725
No ratings yet
Dependency Parsing 2: CMSC 723 / LING 723 / INST 725
52 pages
TI4101 Perancangan Tata Letak Pabrik: Basic Algorithms For The Layout Problem
No ratings yet
TI4101 Perancangan Tata Letak Pabrik: Basic Algorithms For The Layout Problem
52 pages
Optical Flow: Rizwan Manzoor
No ratings yet
Optical Flow: Rizwan Manzoor
31 pages
v25 Algebra 1 5.03.odt
No ratings yet
v25 Algebra 1 5.03.odt
5 pages
Assignment 2 Chbyshev Filter (21-MS-TE-05)
No ratings yet
Assignment 2 Chbyshev Filter (21-MS-TE-05)
6 pages
Adaptive Vector Median Filtering
No ratings yet
Adaptive Vector Median Filtering
11 pages
Geofence Boundary Violation Detection in 3D Using Triangle Weight Characterization With Adjacency
No ratings yet
Geofence Boundary Violation Detection in 3D Using Triangle Weight Characterization With Adjacency
12 pages
A Subspace Approach To Blind Space-Time Signal Processing For Wireless Communication Systems
No ratings yet
A Subspace Approach To Blind Space-Time Signal Processing For Wireless Communication Systems
18 pages
Ge 203 Signal and Transform
No ratings yet
Ge 203 Signal and Transform
3 pages
Stockwell Transform
No ratings yet
Stockwell Transform
5 pages
DSP 1st Exam Solutions
No ratings yet
DSP 1st Exam Solutions
5 pages
Practice Problem
No ratings yet
Practice Problem
12 pages
Homework 03
No ratings yet
Homework 03
3 pages
The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life
From Everand
The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life
Mark Manson
4/5 (6458)
Principles: Life and Work
From Everand
Principles: Life and Work
Ray Dalio
4/5 (648)
Never Split the Difference: Negotiating As If Your Life Depended On It
From Everand
Never Split the Difference: Negotiating As If Your Life Depended On It
Chris Voss
4.5/5 (1005)
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
From Everand
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
Brené Brown
4/5 (1175)
The Glass Castle: A Memoir
From Everand
The Glass Castle: A Memoir
Jeannette Walls
4.5/5 (1856)
Grit: The Power of Passion and Perseverance
From Everand
Grit: The Power of Passion and Perseverance
Angela Duckworth
4/5 (650)
The Little Book of Hygge: Danish Secrets to Happy Living
From Everand
The Little Book of Hygge: Danish Secrets to Happy Living
Meik Wiking
3.5/5 (464)
Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future
From Everand
Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future
Ashlee Vance
4.5/5 (582)
A Man Called Ove: A Novel
From Everand
A Man Called Ove: A Novel
Fredrik Backman
4.5/5 (5181)
The Emperor of All Maladies: A Biography of Cancer
From Everand
The Emperor of All Maladies: A Biography of Cancer
Siddhartha Mukherjee
4.5/5 (298)
The Sympathizer: A Novel (Pulitzer Prize for Fiction)
From Everand
The Sympathizer: A Novel (Pulitzer Prize for Fiction)
Viet Thanh Nguyen
4.5/5 (141)
The Perks of Being a Wallflower
From Everand
The Perks of Being a Wallflower
Stephen Chbosky
4.5/5 (4103)
The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers
From Everand
The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers
Ben Horowitz
4.5/5 (361)
Yes Please
From Everand
Yes Please
Amy Poehler
4/5 (2016)
Steve Jobs
From Everand
Steve Jobs
Walter Isaacson
4.5/5 (1139)
Angela's Ashes: A Memoir
From Everand
Angela's Ashes: A Memoir
Frank McCourt
4.5/5 (943)
The Woman in Cabin 10
From Everand
The Woman in Cabin 10
Ruth Ware
3.5/5 (2814)
Shoe Dog: A Memoir by the Creator of Nike
From Everand
Shoe Dog: A Memoir by the Creator of Nike
Phil Knight
4.5/5 (629)
The Outsider: A Novel
From Everand
The Outsider: A Novel
Stephen King
4/5 (2886)
Rise of ISIS: A Threat We Can't Ignore
From Everand
Rise of ISIS: A Threat We Can't Ignore
Jay Sekulow
3.5/5 (144)
The Light Between Oceans: A Novel
From Everand
The Light Between Oceans: A Novel
M.L. Stedman
4.5/5 (815)
Sing, Unburied, Sing: A Novel
From Everand
Sing, Unburied, Sing: A Novel
Jesmyn Ward
4/5 (1267)
Hidden Figures: The American Dream and the Untold Story of the Black Women Mathematicians Who Helped Win the Space Race
From Everand
Hidden Figures: The American Dream and the Untold Story of the Black Women Mathematicians Who Helped Win the Space Race
Margot Lee Shetterly
4/5 (1022)
The Constant Gardener: A Novel
From Everand
The Constant Gardener: A Novel
John le Carré
4/5 (278)
Team of Rivals: The Political Genius of Abraham Lincoln
From Everand
Team of Rivals: The Political Genius of Abraham Lincoln
Doris Kearns Goodwin
4.5/5 (244)
Her Body and Other Parties: Stories
From Everand
Her Body and Other Parties: Stories
Carmen Maria Machado
4/5 (903)
Fear: Trump in the White House
From Everand
Fear: Trump in the White House
Bob Woodward
3.5/5 (836)
The Art of Racing in the Rain: A Novel
From Everand
The Art of Racing in the Rain: A Novel
Garth Stein
4/5 (4372)
Bad Feminist: Essays
From Everand
Bad Feminist: Essays
Roxane Gay
4/5 (1090)
The Unwinding: An Inner History of the New America
From Everand
The Unwinding: An Inner History of the New America
George Packer
4/5 (45)
The World Is Flat 3.0: A Brief History of the Twenty-first Century
From Everand
The World Is Flat 3.0: A Brief History of the Twenty-first Century
Thomas L. Friedman
3.5/5 (2289)
Little Women
From Everand
Little Women
Louisa May Alcott
4.5/5 (2369)
The Yellow House: A Memoir (2019 National Book Award Winner)
From Everand
The Yellow House: A Memoir (2019 National Book Award Winner)
Sarah M. Broom
4/5 (100)
Wolf Hall: A Novel
From Everand
Wolf Hall: A Novel
Hilary Mantel
4/5 (4135)
A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
From Everand
A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
Dave Eggers
3.5/5 (233)
John Adams
From Everand
John Adams
David McCullough
4.5/5 (2546)
A Tree Grows in Brooklyn
From Everand
A Tree Grows in Brooklyn
Betty Smith
4.5/5 (2033)
Devil in the Grove: Thurgood Marshall, the Groveland Boys, and the Dawn of a New America
From Everand
Devil in the Grove: Thurgood Marshall, the Groveland Boys, and the Dawn of a New America
Gilbert King
4.5/5 (280)
Brooklyn: A Novel
From Everand
Brooklyn: A Novel
Colm Toibin
3.5/5 (2141)
On Fire: The (Burning) Case for a Green New Deal
From Everand
On Fire: The (Burning) Case for a Green New Deal
Naomi Klein
4/5 (78)
Manhattan Beach: A Novel
From Everand
Manhattan Beach: A Novel
Jennifer Egan
3.5/5 (919)

UE2MDP2019

Uploaded by

UE2MDP2019

Uploaded by

Plan

 Introduction to Markov decision processes

 Markov decision processes formulation

 Discounted markov decision processes

 Average cost markov decision processes

 Continuous-time Markov decision processes

 Stochastic dynamic programming

 Stochastic dynamic programming

We limit ourselves to discrete optimization

Difficulties in using dynamic programming :

State t State t+1

present decision epoch next decision epoch

State t State t+1

present decision epoch next decision epoch

State t State t+1

present decision epoch next decision epoch

State t State t+1

present decision epoch next decision epoch

State t State t+1

present decision epoch next decision epoch

Single machine scheduling (Knapsac)

Control map or control policy

Problem: determining the purchasing quantity at the beginning

• Storage capacity 5 (in 000), Initial stock = 0

2007 Xiaolan Xie

Si  starting time of job i

State t State t+1

stage cost cost

present decision epoch next decision epoch

present decision epoch next decision epoch

State t State t+1

present decision epoch next decision epoch

xt : stock available at the beginning of period t

Order quantity ut at each period is determined dynamically with

Examples of inventory control policies:

1. Order a constant quantity ut = E[wt]

2007 Xiaolan Xie

find a sequence of functions mt, t = 0, ..., N-1, mapping state

so as to minimize the total expected cost.

The sequence p = {m0, ..., mN-1} is called a policy.

2007 Xiaolan Xie

2007 Xiaolan Xie

depending on the control policy.

2007 Xiaolan Xie

Finite state space st  St

Finite control space ut  Ct

Control policy p = {m0, ..., mN-1} with ut = mt(xt)

State-transition probability: pij(u)

2007 Xiaolan Xie

where P is the set of all admissible policies.

J*(x) : optimal cost function or optimal value function.

Furthermore, if u*t = m*t(xt) minimizes the right side of Eq (B)

• Excess demand is lost, i.e. xt+1 = max{0, xt + ut – wt}

• The inventory capacity is 2, i.e. xt + ut  2

• The inventory holding/shortage cost is : (xt + ut – wt)2

• Unit ordering cost is a, i.e. gt(xt,ut,wt) = aut + (xt + ut – wt)2.

• N = 3 and the terminal cost, gN(XN) = 0

Myopic policy: (s=0, u=1), (s=1, u=0), (s=2, u=0)

Stage cost 0.25u+(s+u–

Stage cost 0.25u+(s+u–

Stage 1 Stage 2 Stage 3

0 3.055 (2) 2.055 (2) 1.05 (1)

1 2.805 (1) 1.805 (1) 0.8 (0)

2 2.555 (0) 1.555 (0) 0.6 (0)

Stock Period-1 Period-2 Period-3

Period-1 Period-2 Period-3 Period-4

(make, p) (make, p) (make, p)

p p p average cost =br/(1-r)

• Hedging point policy with

The set T of decisions epochs can be either a discrete set or a

The set T can be finite (finite horizon problem) or infinite

S : the set of all possible system states.

A = sSAs: the set of all possible actions.

S and As can be:

If the cost depends on the state at next decision epoch, then

Ct(s, a) =  jS Ct(s, a, j) pt(j|s, a).

Furthermore, if ut = mt(xt) minimizes the right side of Eq (B)