Stochastic Optimization: Anton J. Kleywegt and Alexander Shapiro
= D.
However, if D is not known at the time the decision should be made, then the problem becomes more
difficult. There are several approaches to decision making in the usual case where the demand is not known.
Sometimes a manager may want to hedge against the worst possible outcome. Suppose the manager thinks that the demand $D$ will turn out to be some number in the interval $[a, b] \subset \mathbb{R}_+$, with $a < b$, i.e., the lower and upper bounds for the demand are known to the manager. In that case, in order to hedge against the worst possible scenario, the manager chooses the value of $x$ that gives the best profit under the worst possible outcome. For any decision $x$, the worst possible outcome is given by
$$g_1(x) \triangleq \min_{D \in [a,b]} G(x, D) = G(x, a) = \begin{cases} (r-s)a - (c-s)x & \text{if } a \le x \\ (r-c)x & \text{if } a > x \end{cases}$$
Because the manager wants to hedge against the worst possible outcome, the manager chooses the value of $x$ that maximizes $g_1(x)$, which is $x_1 = a$. Clearly, in many cases this will be an overly conservative decision.
Sometimes a manager may want to make the decision that under the worst possible outcome will still appear as good as possible compared with what would have been the best decision with hindsight, that is, after the outcome becomes known. For any outcome of the demand $D$, let
$$g^*(D) \triangleq \max_{x \in \mathbb{R}_+} G(x, D) = (r-c)D$$
denote the optimal profit with hindsight, also called the optimal value with perfect information. The optimal decision with perfect information is $\bar{x}(D) = D$. The difference
$$A(x, D) \triangleq g^*(D) - G(x, D) = \begin{cases} (c-s)(x-D) & \text{if } D \le x \\ (r-c)(D-x) & \text{if } D > x \end{cases}$$
is often called the absolute regret. The manager may want to choose the value of $x$ that minimizes the absolute regret under the worst possible outcome. For any decision $x$, the worst possible outcome is given by
$$g_2(x) \triangleq \max_{D \in [a,b]} A(x, D) = \max\{(c-s)(x-a),\, (r-c)(b-x)\} = \max\{A(x, a), A(x, b)\} = \begin{cases} (r-c)(b-x) & \text{if } x \le \dfrac{(c-s)a + (r-c)b}{r-s} \\[1ex] (c-s)(x-a) & \text{if } x > \dfrac{(c-s)a + (r-c)b}{r-s} \end{cases}$$
Because the manager wants to choose the value of $x$ that minimizes the absolute regret under the worst possible outcome, the manager chooses the value of $x$ that minimizes $g_2(x)$, which is $x_2 = [(c-s)a + (r-c)b]/(r-s)$. Note that $x_2$ is a convex combination of $a$ and $b$, and thus $a < x_2 < b$. The larger the salvage loss per unit $c-s$, the closer $x_2$ is to $a$, and the larger the profit per unit $r-c$, the closer $x_2$ is to $b$. That seems to be a more reasonable decision than $x_1 = a$, but it will be shown that in many cases one can easily obtain a better solution than $x_2$.
A similar approach is to choose the value of $x$ that minimizes the relative regret $R(x, D)$ under the worst possible outcome, where
$$R(x, D) \triangleq \frac{g^*(D) - G(x, D)}{g^*(D)} = \begin{cases} \dfrac{(c-s)(x-D)}{(r-c)D} = \dfrac{c-s}{r-c}\left(\dfrac{x}{D} - 1\right) & \text{if } D \le x \\[1.5ex] \dfrac{(r-c)(D-x)}{(r-c)D} = 1 - \dfrac{x}{D} & \text{if } D > x \end{cases}$$
For any decision $x$, the worst possible outcome is given by
$$g_3(x) \triangleq \max_{D \in [a,b]} R(x, D) = \max\left\{\frac{c-s}{r-c}\left(\frac{x}{a} - 1\right),\; 1 - \frac{x}{b}\right\} = \max\{R(x, a), R(x, b)\} = \begin{cases} 1 - \dfrac{x}{b} & \text{if } x \le \dfrac{ab}{[(r-c)a + (c-s)b]/(r-s)} \\[1.5ex] \dfrac{c-s}{r-c}\left(\dfrac{x}{a} - 1\right) & \text{if } x > \dfrac{ab}{[(r-c)a + (c-s)b]/(r-s)} \end{cases}$$
The manager then chooses the value of $x$ that minimizes $g_3(x)$, which is $x_3 = \dfrac{ab}{[(r-c)a + (c-s)b]/(r-s)}$. Note that $[(r-c)a + (c-s)b]/(r-s)$ in the denominator of the expression for $x_3$ is a convex combination of $a$ and $b$, and thus $a < x_3 < b$. Similar to $x_2$, the larger the salvage loss per unit $c-s$, the closer $x_3$ is to $a$, and the larger the profit per unit $r-c$, the closer $x_3$ is to $b$.
A related approach is to choose the value of $x$ that maximizes the competitive ratio $\rho(x, D)$ under the worst possible outcome, where
$$\rho(x, D) \triangleq \frac{G(x, D)}{g^*(D)}$$
Because $\rho(x, D) = 1 - R(x, D)$, maximizing the competitive ratio $\rho(x, D)$ is equivalent to minimizing the relative regret $R(x, D)$, so that this approach leads to the same solution $x_3$ as the previous approach.
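As a concrete illustration, the three worst-case decisions can be computed directly. The prices and demand bounds below are illustrative assumptions, not values from the text.

```python
# Worst-case newsvendor decisions: hedging (x1), minimax absolute regret (x2),
# and minimax relative regret (x3), for assumed values with s < c < r, 0 <= a < b.

def profit(x, d, r, c, s):
    """G(x, D): order x at cost c, sell min(x, D) at r, salvage the rest at s."""
    return r * min(x, d) + s * max(x - d, 0.0) - c * x

def worst_case_decisions(r, c, s, a, b):
    x1 = a                                                # maximin profit
    x2 = ((c - s) * a + (r - c) * b) / (r - s)            # minimax absolute regret
    x3 = a * b / (((r - c) * a + (c - s) * b) / (r - s))  # minimax relative regret
    return x1, x2, x3

r, c, s, a, b = 10.0, 6.0, 2.0, 20.0, 100.0
x1, x2, x3 = worst_case_decisions(r, c, s, a, b)
# x2 equalizes the absolute regrets at the endpoints: (c-s)(x2-a) = (r-c)(b-x2)
```

Note that $x_2$ is the point where the two endpoint regrets balance, which is how the piecewise formula for $g_2$ above was obtained.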
It was assumed in all the variants of the worst-case approach discussed above that no a priori information
about the demand D was available to the manager except the lower and upper bounds for the demand. In
some situations this may be a reasonable assumption and the worst-case approach could make sense if the
range of the demand is known and is not too large. However, in many applications the range of the
unknown quantities is not known with useful precision, and other information, such as information about
the probability distributions or sample data of the unknown quantities, may be available.
Another approach to decision making under uncertainty, different from the worst-case approaches described above, is the stochastic optimization approach, which is the approach on which most of this article is focused. Suppose that the demand $D$ can be viewed as a random variable with a known, or at least well estimated, probability distribution. The corresponding cumulative distribution function (cdf) $F$ can be estimated from historical data or by using a priori information available to the manager. Then one can try to optimize the objective function on average, i.e., to maximize the expected profit
$$g(x) \triangleq \mathbb{E}[G(x, D)] = \int_0^x [rw + s(x - w)]\, dF(w) + \int_x^\infty rx\, dF(w) - cx \qquad (2.2)$$
This optimization problem is easy to solve. For any $D \ge 0$, the function $G(x, D)$ is concave in $x$. Therefore, the expected value function $g$ is also concave. First, suppose the demand $D$ has a probability density function (pdf). Then $g$ is differentiable, and its maximum is attained where $g'(x) = 0$, that is, at $x^*$, where $x^*$ satisfies
$$F(x^*) = \frac{r-c}{r-s}$$
Because $s < c < r$, it follows that $0 < (r-c)/(r-s) < 1$, so that a value of $x^*$ with $F(x^*) = (r-c)/(r-s)$ can always be found. If the demand $D$ does not have a pdf, a similar result still holds. In general
$$x^* = F^{-1}\left(\frac{r-c}{r-s}\right)$$
where
$$F^{-1}(p) \triangleq \min\{x : F(x) \ge p\}$$
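For a discrete demand distribution the generalized inverse can be computed directly. The prices and distribution below are illustrative assumptions.

```python
# Critical-fractile solution x* = F^{-1}((r-c)/(r-s)), using the generalized
# inverse F^{-1}(p) = min{x : F(x) >= p}. Distribution values are assumed.
from bisect import bisect_left

def quantile(values, probs, p):
    """Generalized inverse cdf of a discrete distribution on sorted `values`."""
    cum, F = 0.0, []
    for q in probs:
        cum += q
        F.append(cum)
    return values[bisect_left(F, p - 1e-12)]   # first x with F(x) >= p

r, c, s = 10.0, 6.0, 2.0                       # assumed prices, s < c < r
p = (r - c) / (r - s)                          # critical fractile = 0.5
demand_values = [10, 20, 30, 40]
demand_probs  = [0.2, 0.3, 0.3, 0.2]           # cdf steps: 0.2, 0.5, 0.8, 1.0
x_star = quantile(demand_values, demand_probs, p)
```

Here `x_star` is the smallest demand value at which the cdf reaches the critical fractile.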
Another point worth mentioning is that by solving (2.2), the manager tries to optimize the profit on average. However, the realized profit $G(x^*, D)$ can be very different from the expected profit $g(x^*)$, depending on the particular realization of the demand $D$. This may happen if $G(x^*, D)$, considered as a random variable, has a large variability, which could be measured by its variance $\mathrm{Var}[G(x^*, D)]$. Therefore, if the manager wants to hedge against such variability he may consider optimizing a linear combination of the expected value and the variance of the profit,
$$\max_{x \ge 0}\; g(x) - \lambda\,\mathrm{Var}[G(x, D)] \qquad (2.4)$$
for some weight $\lambda \ge 0$.
Note also that the quantile solution $x^*$ and the mean $\bar{x} = \mathbb{E}[D]$ can be very different. Also, it is well known that the quantiles are much more stable to variations of the cdf $F$ than the corresponding mean value.

The approaches discussed above can be stated formally as follows. Here $X$ denotes the feasible set of decisions $x$, $\Omega$ the set of possible outcomes $\omega$, and $G(x, \omega)$ the objective value of decision $x$ under outcome $\omega$.

Hedging against the Worst Possible Outcome  The chosen decision $x_1$ is obtained by solving the following optimization problem.
$$\sup_{x \in X} \inf_{\omega \in \Omega}\; G(x, \omega)$$
Minimizing the Absolute Regret  The chosen decision $x_2$ is obtained by solving the following optimization problem.
$$\inf_{x \in X} \sup_{\omega \in \Omega}\; g^*(\omega) - G(x, \omega)$$
Minimizing the Relative Regret  The chosen decision $x_3$ is obtained by solving the following optimization problem.
$$\inf_{x \in X} \sup_{\omega \in \Omega}\; \frac{g^*(\omega) - G(x, \omega)}{g^*(\omega)}$$
assuming $g^*(\omega) > 0$ for all $\omega \in \Omega$.

Maximizing the Competitive Ratio  The equivalent chosen decision $x_3$ is obtained by solving the following optimization problem.
$$\sup_{x \in X} \inf_{\omega \in \Omega}\; \frac{G(x, \omega)}{g^*(\omega)}$$
2.2.2 The Stochastic Optimization Approach
The chosen decision $x^*$ is obtained by solving the following optimization problem.
$$\sup_{x \in X}\; \mathbb{E}[G(x, \omega)]$$
For each outcome $\omega$, let
$$g^*(\omega) \triangleq \sup_{x \in X} G(x, \omega)$$
Thus the expected value with perfect information is given by $\mathbb{E}[g^*(\omega)]$. Note that
$$g(x^*) \triangleq \sup_{x \in X} \mathbb{E}[G(x, \omega)] \le \mathbb{E}\left[\sup_{x \in X} G(x, \omega)\right] = \mathbb{E}[g^*(\omega)]$$
The difference, $\mathbb{E}[g^*(\omega)] - g(x^*) = \mathbb{E}[A(x^*, \omega)] \ge 0$, is called the expected value of perfect information. Let $\bar{x}$ denote an optimal solution of the deterministic problem obtained by replacing the random variables with their means. Then
$$g(x^*) \triangleq \sup_{x \in X} \mathbb{E}[G(x, \omega)] \ge \mathbb{E}[G(\bar{x}, \omega)] \triangleq g(\bar{x})$$
The difference, $g(x^*) - g(\bar{x}) \ge 0$, is called the value of the stochastic solution.

Suppose that $k$ independent and identically distributed observations $W_1, W_2, \ldots, W_k$ of a random variable $W$ with cdf $F$ are available, and let $W_{1:k} \le W_{2:k} \le \cdots \le W_{k:k}$ denote the corresponding order statistics. The cdf $F_{i:k}$ of $W_{i:k}$ is given by
$$F_{i:k}(w) = \sum_{j=i}^{k} \binom{k}{j} F(w)^j\, [1 - F(w)]^{k-j}$$
Further, if $W$ has a probability density function (pdf) $f$, then it follows that the pdf $f_{i:k}$ of $W_{i:k}$ is given by
$$f_{i:k}(w) = i \binom{k}{i} f(w)\, F(w)^{i-1}\, [1 - F(w)]^{k-i}$$
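The formula for $F_{i:k}$ can be sanity-checked by simulation. The sketch below uses uniform(0,1) observations, so that $F(w) = w$; the sample sizes are arbitrary choices.

```python
# Monte Carlo check of the order-statistic cdf
# F_{i:k}(w) = sum_{j=i}^{k} C(k,j) F(w)^j (1-F(w))^{k-j}
# for uniform(0,1) observations, where F(w) = w.
import random
from math import comb

random.seed(0)
i, k, w = 3, 5, 0.6
formula = sum(comb(k, j) * w**j * (1 - w)**(k - j) for j in range(i, k + 1))

n = 200_000
hits = 0
for _ in range(n):
    sample = sorted(random.random() for _ in range(k))
    if sample[i - 1] <= w:            # event {W_{i:k} <= w}
        hits += 1
estimate = hits / n                   # should be close to `formula`
```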
Use of the empirical distribution, and its robustness, are illustrated in an example in Section 2.5.
If there is reason to believe that the random variable $W$ follows a particular type of probability distribution, for example a normal distribution, with one or more unknown parameters, for example the mean $\mu$ and variance $\sigma^2$ of the normal distribution, then standard statistical techniques, such as maximum likelihood estimation, can be used to estimate the unknown parameters of the distribution from data. Also, in such a situation, a Bayesian approach can be used to estimate the unknown parameters of the distribution from data, and to optimize an objective function that is related to that of the stochastic optimization problem. More details can be found in Berger (1985).
2.5 Example (continued)
In this section the use of the empirical distribution, and its robustness, are illustrated with the newsvendor
example of Section 2.1.
Suppose the manager does not know the probability distribution of the demand, but a data set $D_1, D_2, \ldots, D_k$ of $k$ independent and identically distributed observations of the demand $D$ is available. As before, let $D_{1:k}, D_{2:k}, \ldots, D_{k:k}$ denote the order statistics of the $k$ observations of $D$. Using the empirical distribution $F_k$, the resulting decision rule is simple. If $(r-c)/(r-s) \in ((i-1)/k, i/k]$ for some $i \in \{1, 2, \ldots, k\}$, then
$$\hat{x} = F_k^{-1}\left(\frac{r-c}{r-s}\right) = D_{i:k}$$
That is, the chosen order quantity $\hat{x}$ is the $i$th smallest observation $D_{i:k}$ of the demand.
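This decision rule is simple to implement. The demand data and prices below are illustrative assumptions.

```python
# Empirical-distribution decision rule: if (r-c)/(r-s) lies in ((i-1)/k, i/k],
# order the i-th smallest observed demand D_{i:k}. Data values are assumed.
import math

def empirical_newsvendor(demands, r, c, s):
    k = len(demands)
    alpha = (r - c) / (r - s)
    i = math.ceil(alpha * k)              # smallest i with alpha <= i/k
    return sorted(demands)[i - 1]         # D_{i:k}

r, c, s = 10.0, 6.0, 2.0                  # critical fractile 0.5
data = [42, 17, 55, 23, 31, 60, 12, 38]   # k = 8 observed demands (assumed)
x_hat = empirical_newsvendor(data, r, c, s)
```

With these numbers the fractile is $0.5 \in (3/8, 4/8]$, so $i = 4$ and the fourth smallest observation is chosen.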
To illustrate the robustness of the solution $\hat{x}$ obtained with the empirical distribution, suppose that, unknown to the manager, the demand $D$ is exponentially distributed with rate $\lambda$, that is, the mean demand is $\mathbb{E}[D] = 1/\lambda$. The objective function is given by
$$g(x) \triangleq \mathbb{E}[G(x, D)] = \frac{r-s}{\lambda}\left(1 - e^{-\lambda x}\right) - (c-s)x$$
The pdf $f_{i:k}$ of $D_{i:k}$ is given by
$$f_{i:k}(w) = i \binom{k}{i} \lambda\, e^{-(k-i+1)\lambda w}\left(1 - e^{-\lambda w}\right)^{i-1} = i \binom{k}{i} \lambda \sum_{j=0}^{i-1} \binom{i-1}{j} (-1)^j\, e^{-(k-i+j+1)\lambda w}$$
The expected objective value of the chosen order quantity $\hat{x} = D_{i:k}$ is given by (assuming that $D_1, D_2, \ldots, D_k$ and $D$ are i.i.d. $\exp(\lambda)$)
$$\mathbb{E}[G(D_{i:k}, D)] = \mathbb{E}\left[\frac{r-s}{\lambda}\left(1 - e^{-\lambda D_{i:k}}\right) - (c-s)D_{i:k}\right]$$
$$= \int_0^\infty \left[\frac{r-s}{\lambda}\left(1 - e^{-\lambda w}\right) - (c-s)w\right] i \binom{k}{i} \lambda \sum_{j=0}^{i-1} \binom{i-1}{j}(-1)^j e^{-(k-i+j+1)\lambda w}\, dw$$
$$= \left[(r-s)\sum_{j=0}^{i}\binom{i}{j}(-1)^j \frac{1}{k-i+j+1} - (c-s)\sum_{j=0}^{i-1}\binom{i-1}{j}(-1)^j \frac{1}{(k-i+j+1)^2}\right] i \binom{k}{i}\; \mathbb{E}[D]$$
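The closed form can be sanity-checked by simulation, using the fact that $D$ is independent of the data, so $\mathbb{E}[G(D_{i:k}, D)] = \mathbb{E}[g(D_{i:k})]$. The prices, rate, and sample sizes below are assumptions.

```python
# Monte Carlo check of the closed-form E[G(D_{i:k}, D)] for exponential demand.
import random
from math import comb, exp

random.seed(1)
r, c, s, lam = 10.0, 6.0, 2.0, 1.0        # s < c < r, E[D] = 1/lam = 1
i, k = 4, 8

closed = (
    (r - s) * sum(comb(i, j) * (-1) ** j / (k - i + j + 1) for j in range(i + 1))
    - (c - s) * sum(comb(i - 1, j) * (-1) ** j / (k - i + j + 1) ** 2
                    for j in range(i))
) * i * comb(k, i) / lam

def g(x):
    # g(x) = (r - s)/lam * (1 - exp(-lam*x)) - (c - s)*x
    return (r - s) / lam * (1 - exp(-lam * x)) - (c - s) * x

n = 100_000
total = 0.0
for _ in range(n):
    sample = sorted(random.expovariate(lam) for _ in range(k))
    total += g(sample[i - 1])             # E[G(D_{i:k}, D)] = E[g(D_{i:k})]
mc = total / n                             # should be close to `closed`
```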
Next we compare the objective values of several solutions, including the optimal value with perfect information, $\mathbb{E}[G(D_{i:k}, D)]$, $g(x^*)$, and $g(\bar{x})$. Recall that the optimal value with perfect information is given by
$$g^*(D) \triangleq \max_{x \in \mathbb{R}_+} G(x, D) = (r-c)D$$
Thus the expected value with perfect information is given by
$$\mathbb{E}[g^*(D)] = (r-c)\,\mathbb{E}[D]$$
Also, the optimal solution $x^*$ is given by
$$x^* = F^{-1}\left(\frac{r-c}{r-s}\right) = -\ln\left(\frac{c-s}{r-s}\right)\mathbb{E}[D]$$
and the optimal objective function value is given by
$$g(x^*) = \left[(r-c) + (c-s)\ln\left(\frac{c-s}{r-s}\right)\right]\mathbb{E}[D]$$
Thus the value of perfect information is
$$\mathbb{E}[g^*(D)] - g(x^*) = -(c-s)\ln\left(\frac{c-s}{r-s}\right)\mathbb{E}[D] = -\frac{c-s}{r-s}\ln\left(\frac{c-s}{r-s}\right)(r-s)\,\mathbb{E}[D]$$
It is easy to obtain bounds on the value of perfect information. Consider the function $h(y) \triangleq y\ln(y)$ for $y > 0$. Then $h'(y) = \ln(y) + 1$, so $h$ attains its minimum of $-1/e$ at $y = 1/e$, while $h(y) \to 0$ as $y \downarrow 0$ and $h(1) = 0$. With $y = (c-s)/(r-s)$, the value of perfect information equals $-h(y)(r-s)\mathbb{E}[D]$. Hence the value of perfect information attains a minimum of zero when $c = s$ and when $c = r$, that is, when the optimal decisions (in the latter case $x^* = 0$) do not depend on information about the demand. Also, the value of perfect information attains a maximum of $(r-s)\mathbb{E}[D]/e$ when $(c-s)/(r-s) = 1/e$, i.e., when the ratio of profit per unit to the salvage loss per unit $(r-c)/(c-s) = e - 1$.
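These expressions are easy to check numerically; the parameter values below are assumptions.

```python
# Value of perfect information for exponential demand:
# EVPI = E[g*(D)] - g(x*) = -(c - s) ln((c - s)/(r - s)) E[D],
# bounded above by (r - s) E[D] / e. Parameter values are assumed.
from math import log, e

r, c, s, mean_d = 10.0, 6.0, 2.0, 1.0
y = (c - s) / (r - s)                                 # here y = 0.5
g_opt = ((r - c) + (c - s) * log(y)) * mean_d         # g(x*)
evpi = (r - c) * mean_d - g_opt                       # E[g*(D)] - g(x*)
bound = (r - s) * mean_d / e                          # maximum, at y = 1/e
```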
The optimal solution $\bar{x}$ of the deterministic optimization problem (2.6) is $\bar{x} = \mathbb{E}[D]$. The expected value of this solution is given by
$$g(\bar{x}) \triangleq \mathbb{E}[G(\bar{x}, D)] = \left[(r-c) - \frac{r-s}{e}\right]\mathbb{E}[D]$$
Hence the value of the stochastic solution is given by
$$g(x^*) - \mathbb{E}[G(\bar{x}, D)] = \left[\frac{c-s}{r-s}\ln\left(\frac{c-s}{r-s}\right) + \frac{1}{e}\right](r-s)\,\mathbb{E}[D]$$
It follows from the properties of $h(y) \triangleq y\ln(y)$ that the value of the stochastic solution attains a minimum of zero when the value of perfect information attains a maximum, i.e., when $(c-s)/(r-s) = 1/e$. Also, the value of the stochastic solution attains a maximum of $(r-s)\mathbb{E}[D]/e$ when the value of perfect information attains a minimum, i.e., when $c = s$ and when $c = r$, that is, when using the expected demand $\mathbb{E}[D]$ gives the poorest results.
Next we evaluate the optimality gaps of several solutions. Let $\alpha \triangleq (r-c)/(r-s) \in ((i-1)/k, i/k]$ for some $i \in \{1, 2, \ldots, k\}$. Then the optimality gap of the solution based on the empirical distribution is given by
$$\delta^e_k(\alpha) \triangleq \frac{g(x^*) - \mathbb{E}[G(D_{i:k}, D)]}{(r-s)\,\mathbb{E}[D]} = \alpha + (1-\alpha)\ln(1-\alpha) - \left[\sum_{j=0}^{i}\binom{i}{j}(-1)^j\frac{1}{k-i+j+1} - (1-\alpha)\sum_{j=0}^{i-1}\binom{i-1}{j}(-1)^j\frac{1}{(k-i+j+1)^2}\right] i\binom{k}{i}$$
Note that the division by $\mathbb{E}[D]$ can be interpreted as rescaling the product units so that $\mathbb{E}[D] = 1$, and the division by $r-s$ can be interpreted as rescaling the money units so that $r-s = 1$. The optimality gap of the optimal solution $\bar{x}$ of the deterministic optimization problem is given by
$$\delta^d(\alpha) \triangleq \frac{g(x^*) - g(\bar{x})}{(r-s)\,\mathbb{E}[D]} = (1-\alpha)\ln(1-\alpha) + \frac{1}{e}$$
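Both gap formulas can be evaluated directly; the sketch below works in the rescaled units just described.

```python
# Optimality gaps delta^e_k(alpha) and delta^d(alpha), rescaled so that
# r - s = 1 and E[D] = 1, following the formulas above.
from math import comb, log, ceil, e

def gap_empirical(alpha, k):
    i = ceil(alpha * k)                       # alpha in ((i-1)/k, i/k]
    val = (sum(comb(i, j) * (-1) ** j / (k - i + j + 1) for j in range(i + 1))
           - (1 - alpha) * sum(comb(i - 1, j) * (-1) ** j / (k - i + j + 1) ** 2
                               for j in range(i)))
    return alpha + (1 - alpha) * log(1 - alpha) - val * i * comb(k, i)

def gap_deterministic(alpha):
    return (1 - alpha) * log(1 - alpha) + 1 / e

g_e = gap_empirical(0.5, 8)                   # small even for k = 8
g_d = gap_deterministic(0.5)
```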
To evaluate the worst-case solutions $x_1$, $x_2$, and $x_3$, suppose that the interval $[a, b]$ is taken as $[0, \beta\,\mathbb{E}[D]]$ for some $\beta > 0$. Then $x_1 = a = 0$, and thus $g(x_1) = 0$, and the optimality gap of the worst-case solution $x_1$ is given by
$$\delta_1(\alpha) \triangleq \frac{g(x^*) - g(x_1)}{(r-s)\,\mathbb{E}[D]} = \alpha + (1-\alpha)\ln(1-\alpha)$$
Also, $x_2 = [(c-s)a + (r-c)b]/(r-s) = (r-c)\beta\,\mathbb{E}[D]/(r-s) = \alpha\beta\,\mathbb{E}[D]$, and thus
$$g(x_2) = \left[\left(1 - e^{-\alpha\beta}\right) - (1-\alpha)\alpha\beta\right](r-s)\,\mathbb{E}[D]$$
and the optimality gap of the absolute regret solution $x_2$ is given by
$$\delta_2(\alpha) \triangleq \frac{g(x^*) - g(x_2)}{(r-s)\,\mathbb{E}[D]} = \alpha + (1-\alpha)\ln(1-\alpha) - \left(1 - e^{-\alpha\beta}\right) + (1-\alpha)\alpha\beta$$
Also, $x_3 = \dfrac{ab}{[(r-c)a + (c-s)b]/(r-s)} = 0$, and thus $g(x_3) = 0$, and the optimality gap of the relative regret solution $x_3$ is $\delta_3(\alpha) = \delta_1(\alpha)$.
[Figure 1 appears here, plotting optimality gap against the profit ratio $(r-c)/(r-s)$.]

Figure 1: Optimality gaps $\delta^e_k(\alpha)$ of the empirical approach for $k = 2, 4, 8, 16$, as well as the optimality gap $\delta^d(\alpha)$ of the deterministic approach, as a function of $(r-c)/(r-s)$.
Figure 1 shows the optimality gaps $\delta^e_k(\alpha)$ for $k = 2, 4, 8, 16$, as well as the optimality gap $\delta^d(\alpha)$, as a function of $\alpha$. It can be seen that the empirical solutions $\hat{x}$ tend to be more robust, in terms of the optimality gap, than the expected value solution $\bar{x}$, even if the empirical distribution is based on a very small sample size $k$. Only in the region where $\alpha \approx 1 - 1/e$, i.e., where the value of the stochastic solution is small, does $\bar{x}$ give a good solution. It should also be kept in mind that the solution $\bar{x}$ is based on knowledge of the expected demand, whereas the empirical solutions do not require such knowledge, but the empirical solutions in turn require a data set of demand observations. Figure 2 shows the optimality gaps $\delta_1(\alpha)$, $\delta_2(\alpha)$, and $\delta_3(\alpha)$ for $\beta = 1, 2, 3, 4, 5$, as a function of $\alpha$. Solutions $x_1$ and $x_3$ do not appear to be very robust. Also, only when $\beta$ is chosen to be close to 2 does the absolute regret solution $x_2$ appear to have robustness that compares well with the robustness of the empirical solutions. The performance of the absolute regret solution $x_2$ appears to be quite sensitive to the choice of $\beta$. Furthermore, a decision maker is not likely to know what the best choice of $\beta$ is.
[Figure 2 appears here, plotting optimality gap against the profit ratio $(r-c)/(r-s)$.]

Figure 2: Optimality gaps $\delta_1(\alpha)$, $\delta_2(\alpha)$, and $\delta_3(\alpha)$ of the worst-case approaches, for $\beta = 1, 2, 3, 4, 5$, as a function of $(r-c)/(r-s)$.
3 Stochastic Programming

The discussion of the above example motivates us to introduce the following model optimization problem, referred to as a stochastic programming problem,
$$\inf_{x \in X}\; g(x) \triangleq \mathbb{E}[G(x, \omega)] \qquad (3.1)$$
(We consider a minimization rather than a maximization problem for the sake of notational convenience.) Here $X \subset \mathbb{R}^n$ is a set of permissible values of the vector $x$ of decision variables, and is referred to as the feasible set of problem (3.1). Often $X$ is defined by a (finite) number of smooth (or even linear) constraints. In some other situations the set $X$ is finite. In that case problem (3.1) is called a discrete stochastic optimization problem (this should not be confused with the case of discrete probability distributions). The variable $\omega$ represents random (or stochastic) aspects of the problem. Often $\omega$ can be modeled as a finite dimensional random vector, or in more involved cases as a random process. In the abstract framework we can view $\omega$ as an element of the probability space $(\Omega, \mathcal{F}, P)$ with the known probability measure (distribution) $P$.
It is also possible to consider the following extensions of the basic problem (3.1).

- One may need to optimize a function of the expected value function $g(x)$. This happened, for example, in problem (2.4), where the manager wanted to optimize a linear combination of the expected value and the variance of the profit. Although important from a modeling point of view, such an extension usually does not introduce additional technical difficulties into the problem.

- The feasible set can also be defined by constraints given in the form of expected value functions. For example, suppose that we want to optimize an objective function subject to the constraint that the event $\{h(x, W) \ge 0\}$, where $W$ is a random vector with a known probability distribution and $h(\cdot, \cdot)$ is a given function, should happen with a probability not bigger than a given number $p \in (0, 1)$. The probability of this event can be represented as the expected value $\mathbb{E}[\Delta(x, W)]$, where
$$\Delta(x, w) \triangleq \begin{cases} 1 & \text{if } h(x, w) \ge 0 \\ 0 & \text{if } h(x, w) < 0 \end{cases}$$
Therefore, this constraint can be written in the form $\mathbb{E}[\Delta(x, W)] \le p$. Problems with such probabilistic constraints are called chance constrained problems. Note that even if the function $h(\cdot, \cdot)$ is continuous, the corresponding indicator function $\Delta(\cdot, \cdot)$ is discontinuous unless it is identically equal to zero or one. Because of that, it may be technically difficult to handle such a problem.

- In some cases the involved probability distribution $P_\theta$ itself depends on a vector $\theta$ of parameters, so that the expected value function takes the form
$$g(x, \theta) \triangleq \mathbb{E}_\theta[G(x, \omega)] = \int_\Omega G(x, \omega)\, dP_\theta(\omega) \qquad (3.2)$$
By using a transformation it is sometimes possible to represent the above function $g(\cdot)$ as the expected value of a function, depending on $x$ and $\theta$, with respect to a probability distribution that is independent of $\theta$. We shall discuss such likelihood ratio transformations in Section 3.4.
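The chance-constraint indicator above lends itself directly to Monte Carlo estimation. The sketch below uses an assumed constraint function $h$ and an assumed distribution for $W$.

```python
# Monte Carlo estimate of E[Delta(x, W)] = P{h(x, W) >= 0} for a chance
# constraint E[Delta(x, W)] <= p. The function h and the law of W are
# illustrative assumptions.
import random

random.seed(2)

def h(x, w):
    return w - x                     # event: demand w exceeds capacity x

def violation_probability(x, n=100_000):
    hits = sum(1 for _ in range(n) if h(x, random.expovariate(1.0)) >= 0)
    return hits / n                  # sample average of the indicator

est = violation_probability(3.0)     # true value exp(-3), about 0.0498
```

Note that the sample average is a step function of $x$, illustrating the discontinuity issue mentioned above.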
The above formulation of stochastic programs is somewhat too general and abstract. In order to proceed
with a useful analysis we need to identify particular classes of such problems that on one hand are interesting
from the point of view of applications and on the other hand are computationally tractable. In the following
sections we introduce several classes of such problems and discuss various techniques for their solution.
3.1 Stochastic Programming with Recourse
Consider again problem (2.2) of the newsvendor example. We may view that problem as a two-stage problem.
At the rst stage a decision should be made about the quantity x to order. At this stage the demand D is
not known. At the second stage a realization of the demand D becomes known and, given the rst stage
decision x, the manager makes a decision about the quantities y and z to sell at prices r and s, respectively.
Clearly the manager would like to choose y and z in such a way as to maximize the prot. It is possible to
formulate the second stage problem as the simple linear program
max
y,z
ry +sz subject to y D, y +z x, y 0, z 0 (3.3)
The optimal solution of the above problem (3.3) is y
= minx, D, z
= y
().
The next question is how one can solve the above two-stage problem numerically. Suppose that the random data have a discrete distribution with a finite number $K$ of possible realizations $\xi_k = (q_k, h_k, T_k)$, $k = 1, \ldots, K$ (sometimes called scenarios), with the corresponding probabilities $p_k$. In that case $\mathbb{E}[Q(x, \xi)] = \sum_{k=1}^{K} p_k\, Q(x, \xi_k)$, where
$$Q(x, \xi_k) = \min\left\{ q_k^\top y_k : T_k x + W y_k = h_k,\; y_k \ge 0 \right\}$$
Therefore, the above two-stage problem can be formulated as one large linear program:
$$\begin{array}{ll} \min & c^\top x + \displaystyle\sum_{k=1}^{K} p_k\, q_k^\top y_k \\ \text{s.t.} & Ax = b \\ & T_k x + W y_k = h_k \\ & x \ge 0,\; y_k \ge 0,\quad k = 1, \ldots, K \end{array} \qquad (3.6)$$
The linear program (3.6) has a certain block structure that makes it amenable to various decomposition
methods. One such decomposition method is the popular L-shaped method developed by Van Slyke and
Wets (1969). We refer the interested reader to the recent books by Kall and Wallace (1994) and Birge and
Louveaux (1997) for a thorough discussion of stochastic programming with recourse.
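When the scenario count $K$ is small, the expectation $\sum_k p_k Q(x, \xi_k)$ can simply be enumerated. The sketch below does this for the newsvendor second stage (3.3), whose optimal value is available in closed form, and picks the best first-stage order by brute force over the scenario values; all data are assumptions, and this is not the L-shaped method.

```python
# Scenario expansion for the two-stage newsvendor: the second-stage value
# follows from (3.3), with y* = min(x, D) and z* = x - y*.
# Scenario data and prices are illustrative assumptions.

def Q(x, d, r, s):
    y = min(x, d)                     # units sold at price r
    return r * y + s * (x - y)        # remainder salvaged at price s

def expected_profit(x, scenarios, probs, r, c, s):
    # first-stage cost -c*x plus expected second-stage revenue
    return -c * x + sum(p * Q(x, d, r, s) for d, p in zip(scenarios, probs))

r, c, s = 10.0, 6.0, 2.0
scenarios = [10, 20, 30, 40]          # K = 4 demand scenarios
probs = [0.2, 0.3, 0.3, 0.2]

# the expected profit is piecewise linear and concave in x, so an optimal
# order quantity lies at one of the scenario values
best_x = max(scenarios,
             key=lambda x: expected_profit(x, scenarios, probs, r, c, s))
```

With these numbers the critical fractile is $0.5$, and the search returns an order quantity consistent with the quantile rule of Section 2.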
The above numerical approach works reasonably well if the number $K$ of scenarios is not too large. Suppose, however, that the random vector has $m$ independently distributed components, each having just 3 possible realizations. Then the total number of different scenarios is $K = 3^m$. That is, the number of scenarios grows exponentially fast in the number $m$ of random variables. In that case, even for a moderate number of random variables, say $m = 100$, the number of scenarios becomes so large that even modern computers cannot cope with the required calculations. It seems that the only way to deal with such exponential growth of the number of scenarios is to use sampling. Such approaches are discussed in Section 3.2.
It may also happen that some of the decision variables at the first or second stage are integers, such as binary variables representing yes or no decisions. Such integer (or discrete) stochastic programs are especially difficult to solve, and only very moderate progress has been reported so far. A discussion of two-stage stochastic integer programs with recourse can be found in Birge and Louveaux (1997). A branch and bound approach for solving stochastic discrete optimization problems was suggested by Norkin, Pflug and Ruszczyński (1998). Schultz, Stougie and Van der Vlerk (1998) suggested an algebraic approach for solving stochastic programs with integer recourse by using a framework of Gröbner basis reductions. For a recent survey of mainly theoretical results on stochastic integer programming see Klein Haneveld and Van der Vlerk (1999).
Conceptually the idea of two-stage programming with recourse can be readily extended to multistage programming with recourse. Such an approach tries to model the situation where decisions are made periodically (in stages) based on currently known realizations of some of the random variables. An $H$-stage stochastic linear program with fixed recourse can be written in the form
$$\begin{array}{ll} \min & c_1^\top x_1 + \mathbb{E}\left[\min c_2(\omega)^\top x_2(\omega) + \cdots + \mathbb{E}\left[\min c_H(\omega)^\top x_H(\omega)\right]\right] \\ \text{s.t.} & W_1 x_1 = h_1 \\ & T_1(\omega)x_1 + W_2 x_2(\omega) = h_2(\omega) \\ & \qquad\qquad\vdots \\ & T_{H-1}(\omega)x_{H-1}(\omega) + W_H x_H(\omega) = h_H(\omega) \\ & x_1 \ge 0,\; x_2(\omega) \ge 0,\; \ldots,\; x_H(\omega) \ge 0 \end{array} \qquad (3.7)$$
The decision variables $x_2(\omega), \ldots, x_H(\omega)$ are allowed to depend on the random data $\omega$. However, the decision $x_t(\omega)$ at time $t$ can only depend on the part of the random data that is known at time $t$ (these restrictions are often called nonanticipativity constraints). The expectations are taken with respect to the distribution of the random variables whose realizations are not yet known.
Again, if the distribution of the random data is discrete with a finite number of possible realizations, then problem (3.7) can be written as one large linear program. However, it is clear that even for a small number of stages and a moderate number of random variables the total number of possible scenarios will be astronomical. Therefore, a current approach to such problems is to generate a reasonable number of scenarios and to solve the corresponding (deterministic) linear program, hoping to catch at least the flavor of the stochastic aspect of the problem. The argument is that the solution obtained in this way is more robust than the solution obtained by replacing the random variables with their means.

Often the same practical problem can be modeled in different ways. For instance, one can model a problem as a two-stage stochastic program with recourse, putting all random variables whose realizations are not yet known at the second stage of the problem. Then as realizations of some of the random variables become known, the solutions are periodically updated in a two-stage rolling horizon fashion, every time by solving an updated two-stage problem. Such an approach is different from a multistage program with recourse, where every time a decision is to be made, the modeler tries to take into account that decisions will be made at several stages in the future.
3.2 Sampling Methods

In this section we discuss a different approach that uses Monte Carlo sampling techniques to solve stochastic optimization problems.

Example 3.1  Consider a stochastic process $I_t$, $t = 1, 2, \ldots$, governed by the recursive equation
$$I_t = [I_{t-1} + x_t - D_t]^+ \qquad (3.8)$$
with initial value $I_0$. Here $D_t$ are random variables and $x_t$ represent decision variables. (Note that $[a]^+ \triangleq \max\{a, 0\}$.) The variable $I_t$ can be interpreted as the inventory level of a product at the end of period $t$, with order quantity $x_t$ and demand $D_t$ in period $t$. Consider the profit function
$$G(x, W) \triangleq \sum_{t=1}^{T}\left\{\gamma_t \min[I_{t-1} + x_t,\, D_t] - h_t I_t\right\} = \sum_{t=1}^{T} \gamma_t x_t + \sum_{t=1}^{T-1}(\gamma_{t+1} - \gamma_t - h_t) I_t + \gamma_1 I_0 - (\gamma_T + h_T) I_T \qquad (3.9)$$
Here $x = (x_1, \ldots, x_T)$ is a vector of decision variables, $W = (D_1, \ldots, D_T)$ is a random vector of the demands at periods $t = 1, \ldots, T$, and $\gamma_t$ and $h_t$ are nonnegative parameters representing the marginal profit and the holding cost, respectively, of the product at period $t$.
If the initial value $I_0$ is sufficiently large, then with probability close to one, the variables $I_1, \ldots, I_T$ stay above zero. In that case $I_1, \ldots, I_T$ become linear functions of the random data vector $W$, and hence the components of the random vector $W$ can be replaced by their means. However, in many practical situations the process $I_t$ hits zero with high probability over the considered horizon $T$. In such cases the corresponding expected value function $g(x) \triangleq \mathbb{E}[G(x, W)]$ cannot be written in a closed form. One can use a Monte Carlo simulation procedure to evaluate $g(x)$. Note that for any given realization of $D_t$, the corresponding values of $I_t$, and hence the value of $G(x, W)$, can be easily calculated using the iterative formula (3.8).
That is, let $W^i = (D_1^i, \ldots, D_T^i)$, $i = 1, \ldots, N$, be a random (or pseudorandom) sample of $N$ independent realizations of the random vector $W$ generated by computer, i.e., there are $N$ generated realizations of the demand process $D_t$, $t = 1, 2, \ldots, T$, over the horizon $T$. Then for any given $x$ the corresponding expected value $g(x)$ can be approximated (estimated) by the sample average
$$\hat{g}_N(x) \triangleq \frac{1}{N}\sum_{i=1}^{N} G(x, W^i) \qquad (3.10)$$
We have that $\mathbb{E}[\hat{g}_N(x)] = g(x)$, and by the Law of Large Numbers, $\hat{g}_N(x)$ converges to $g(x)$ with probability one (w.p.1) as $N \to \infty$. That is, $\hat{g}_N(x)$ is an unbiased and consistent estimator of $g(x)$. Any reasonably efficient method for optimizing the expected value function $g(x)$, say by using its sample average approximations, is based on estimation of its first (and maybe second) order derivatives. This has an independent interest and is called sensitivity or perturbation analysis. We will discuss that in Section 3.3. Recall that $\nabla g(x) \triangleq (\partial g(x)/\partial x_1, \ldots, \partial g(x)/\partial x_T)$ is called the gradient vector of $g(\cdot)$ at $x$.
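The estimator (3.10) for this inventory example can be sketched as follows; the horizon, prices, and demand law are assumptions chosen for illustration.

```python
# Sample-average estimate (3.10) of g(x) = E[G(x, W)] for the inventory
# process (3.8)-(3.9). Horizon, parameters, and demand law are assumed.
import random

random.seed(3)
T = 12
gamma = [5.0] * T                     # marginal profit per period (assumed)
hold = [1.0] * T                      # holding cost per period (assumed)

def G(x, demands, I0=0.0):
    """Profit (3.9), computed directly from the recursion (3.8)."""
    I, total = I0, 0.0
    for t in range(T):
        sales = min(I + x[t], demands[t])
        I = max(I + x[t] - demands[t], 0.0)   # recursion (3.8)
        total += gamma[t] * sales - hold[t] * I
    return total

def g_hat(x, N=20_000):
    acc = 0.0
    for _ in range(N):
        W = [random.expovariate(1.0) for _ in range(T)]   # mean demand 1
        acc += G(x, W)
    return acc / N

x = [1.0] * T                         # a constant ordering policy
estimate = g_hat(x)
```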
It is possible to consider a stationary distribution of the process $I_t$ (if it exists), and to optimize the expected value of an objective function with respect to the stationary distribution. Typically, such a stationary distribution cannot be written in a closed form, and is difficult to compute accurately. This introduces additional technical difficulties into the problem. Also, in some situations the probability distribution of the random variables $D_t$ is given in a parametric form whose parameters are decision variables. We will discuss dealing with such cases later.
3.3 Perturbation Analysis

Consider the expected value function $g(x) \triangleq \mathbb{E}[G(x, \omega)]$. An important question is under which conditions the first order derivatives of $g(x)$ can be taken inside the expected value, that is, under which conditions the equation
$$\nabla g(x) \triangleq \nabla\,\mathbb{E}[G(x, \omega)] = \mathbb{E}[\nabla_x G(x, \omega)] \qquad (3.11)$$
is correct. One reason why this question is important is the following. Let $\omega^1, \ldots, \omega^N$ denote a random sample of $N$ independent realizations of the random variable $\omega$ with common probability distribution $P$, and let
$$\hat{g}_N(x) \triangleq \frac{1}{N}\sum_{i=1}^{N} G(x, \omega^i) \qquad (3.12)$$
be the corresponding sample average function. If the interchangeability equation (3.11) holds, then
$$\mathbb{E}[\nabla \hat{g}_N(x)] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[\nabla_x G(x, \omega^i)] = \frac{1}{N}\sum_{i=1}^{N}\nabla\,\mathbb{E}[G(x, \omega^i)] = \nabla g(x) \qquad (3.13)$$
and hence $\nabla \hat{g}_N(x)$ is an unbiased and consistent estimator of $\nabla g(x)$.
Let us observe that in both examples 2.1 and 3.1 the function $G(\cdot, \omega)$ is piecewise linear for any realization of $\omega$, and hence is not everywhere differentiable. The same holds for the optimal value function $Q(\cdot, \omega)$ of the second stage problem (3.5). If the distribution of the corresponding random variables is discrete, then the resulting expected value function is also piecewise linear, and hence is not everywhere differentiable. On the other hand, expectation with respect to a continuous distribution typically smoothes the corresponding function, and in such cases equation (3.11) often is applicable. It is possible to show that if the following two conditions hold at a point $x$, then $g(\cdot)$ is differentiable at $x$ and equation (3.11) holds:

(i) The function $G(\cdot, \omega)$ is differentiable at $x$ w.p.1.

(ii) There exists a positive valued random variable $K(\omega)$ such that $\mathbb{E}[K(\omega)]$ is finite and the inequality
$$|G(x_1, \omega) - G(x_2, \omega)| \le K(\omega)\,\|x_1 - x_2\| \qquad (3.14)$$
holds w.p.1 for all $x_1, x_2$ in a neighborhood of $x$.
If the function $G(\cdot, \omega)$ is not differentiable at $x$ w.p.1 (i.e., for $P$-almost every $\omega$), then the right hand side of equation (3.11) does not make sense. Therefore, clearly the above condition (i) is necessary for (3.11) to hold. Note that condition (i) requires $G(\cdot, \omega)$ to be differentiable w.p.1 at the given (fixed) point $x$ and does not require differentiability of $G(\cdot, \omega)$ everywhere. The second condition (ii) requires $G(\cdot, \omega)$ to be continuous (in fact Lipschitz continuous) w.p.1 in a neighborhood of $x$.

Consider, for instance, the function $G(x, D)$ of example 2.1 defined in (2.1). For any given $D$ the function $G(\cdot, D)$ is piecewise linear and differentiable at every point $x$ except at $x = D$. If the cdf $F(\cdot)$ of $D$ is continuous at $x$, then the probability of the event $\{D = x\}$ is zero, and hence the interchangeability equation (3.11) holds. Then $\partial G(x, D)/\partial x$ is equal to $s - c$ if $x > D$, and is equal to $r - c$ if $x < D$. Therefore, if $F(\cdot)$ is continuous at $x$, then $G(\cdot, D)$ is differentiable at $x$ w.p.1 and
$$g'(x) = \mathbb{E}\left[\frac{\partial G(x, D)}{\partial x}\right] = (s-c)F(x) + (r-c)[1 - F(x)]$$
Note that setting $g'(x^*) = 0$ recovers the condition $F(x^*) = (r-c)/(r-s)$ of Section 2.

Consider again the process $I_t$ of example 3.1, and suppose that the events $\{I_{\tau-1} + x_\tau - D_\tau = 0\}$, $\tau = 1, \ldots, T$, occur with probability zero. Let $\tau_0 \triangleq 0$ and let $\tau_1 < \tau_2 < \cdots$ denote the successive periods at which the process $I_t$ hits zero. Then, for almost every $W$, the gradient of $I_s$ with respect to the components of the vector $x_t$ can be written as follows
$$\frac{\partial I_s}{\partial x_t} = \begin{cases} 1 & \text{if } \tau_{i-1} < t \le s < \tau_i \\ 0 & \text{otherwise} \end{cases} \qquad (3.15)$$
Thus, by using equations (3.9) and (3.15), one can calculate the gradient of the sample average function $\hat{g}_N(\cdot)$ of example 3.1, and hence one can consistently estimate the gradient of the expected value function $g(\cdot)$.
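The same interchange, applied to the newsvendor function $G(x, D)$ of example 2.1, yields the infinitesimal-perturbation-analysis estimator sketched below; the exponential demand and parameter values are assumed for illustration.

```python
# IPA gradient estimate for the newsvendor: dG(x, D)/dx equals r - c if
# x < D and s - c if x > D, so the sample average of these slopes
# estimates g'(x) = (s-c)F(x) + (r-c)(1-F(x)). Parameters are assumed.
import random
from math import exp

random.seed(4)
r, c, s, lam = 10.0, 6.0, 2.0, 1.0

def ipa_gradient(x, N=200_000):
    total = 0.0
    for _ in range(N):
        d = random.expovariate(lam)
        total += (r - c) if x < d else (s - c)
    return total / N

x = 0.5
est = ipa_gradient(x)
exact = (s - c) * (1 - exp(-lam * x)) + (r - c) * exp(-lam * x)
# est should be close to exact = g'(x)
```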
Consider the process $I_t$ defined by the recursive equation (3.8) again. Suppose now that the variables $x_t$ do not depend on $t$, and let $x$ denote their common value. Suppose further that $D_t$, $t = 1, \ldots$, are independently and identically distributed with mean $\mu > 0$. Then for $x < \mu$ the process $I_t$ is stable and has a stationary (steady state) distribution. Let $g(x)$ be the steady state mean (the expected value with respect to the stationary distribution) of the process $I_t = I_t(x)$. By the theory of regenerative processes it follows that for every $x \in (0, \mu)$ and any realization (called sample path) of the process $D_t$, $t = 1, \ldots$, the long run average $g_T(x) \triangleq \sum_{t=1}^{T} I_t(x)/T$ converges w.p.1 to $g(x)$ as $T \to \infty$. It is possible to show that $g_T'(x)$ also converges w.p.1 to $g'(x)$ as $T \to \infty$. That is, by differentiating the long run average of a sample path of the process $I_t$ we obtain a consistent estimate of the corresponding derivative of the steady state mean $g(x)$. Note that $I_t'(x) = t - \tau_{i-1}$ for $\tau_{i-1} \le t < \tau_i$, and hence the derivative of the long run average of a sample path of the process $I_t$ can be easily calculated.
The idea of differentiation of a sample path of a process in order to estimate the corresponding derivative of the steady state mean function by a single simulation run is at the heart of the so-called infinitesimal perturbation analysis. We refer the interested reader to Glasserman (1991) and Ho and Cao (1991) for a thorough discussion of that topic.
3.4 Likelihood Ratio Method

The Monte Carlo sampling approach to derivative estimation introduced in Section 3.3 does not work if the function $G(\cdot, \omega)$ is discontinuous or if the corresponding probability distribution also depends on decision variables. In this section we discuss an alternative approach to derivative estimation, known as the likelihood ratio (or score function) method.
Suppose that the expected value function is given in the form
$$g(\theta) \triangleq \mathbb{E}_\theta[G(W)] = \int G(w)\, f(\theta, w)\, dw$$
where $f(\theta, \cdot)$ is the pdf of the random vector $W$, depending on a parameter vector $\theta$. Let $\psi(\cdot)$ be another pdf such that $\psi(w) \ne 0$ whenever $f(\theta, w) \ne 0$. Then $g(\theta)$ can be written as
$$g(\theta) = \int G(w)\,\frac{f(\theta, w)}{\psi(w)}\,\psi(w)\, dw = \mathbb{E}_\psi[G(Z)\,L(\theta, Z)] \qquad (3.16)$$
where $Z$ is a random vector with pdf $\psi(\cdot)$ and $L(\theta, z) \triangleq f(\theta, z)/\psi(z)$ is the likelihood ratio function, and hence
$$\nabla g(\theta) = \mathbb{E}_\psi[G(Z)\,\nabla L(\theta, Z)]$$
provided the gradient can be taken inside the expected value in (3.16). In particular, one can take $\psi(\cdot) \triangleq f(\theta_0, \cdot)$ for a fixed value $\theta_0$ of the parameter vector, in which case $L(\theta, z) = f(\theta, z)/f(\theta_0, z)$, and hence $\nabla L(\theta_0, z) = \nabla \ln[f(\theta_0, z)]$. The function $\nabla \ln[f(\theta, z)]$ is called the score function. Given a sample $Z^1, \ldots, Z^N$ from the pdf $\psi(\cdot)$, the function $g(\theta)$ and its gradient can then be estimated by the sample averages
$$\tilde{g}_N(\theta) \triangleq \frac{1}{N}\sum_{i=1}^{N} G(Z^i)\,L(\theta, Z^i) \qquad (3.18)$$
$$\nabla \tilde{g}_N(\theta) \triangleq \frac{1}{N}\sum_{i=1}^{N} G(Z^i)\,\nabla L(\theta, Z^i) \qquad (3.19)$$
This can be readily extended to situations where the function G(x, W) also depends on decision variables. Typically, the density functions used in applications depend on the decision variables in a smooth and even analytic way. Therefore, usually there is no problem in taking derivatives inside the expected value on the right hand side of (3.16). When applicable, the likelihood ratio method often also allows estimation of second and higher order derivatives. However, note that the likelihood ratio method is notoriously unstable, and a bad choice of the pdf ψ may result in huge variances of the corresponding estimators. This should not be surprising, since the likelihood ratio function may involve division by very small numbers, which of course is a very unstable procedure. We refer to Glynn (1990) and Rubinstein and Shapiro (1993) for a further discussion of the likelihood ratio method.
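As a concrete illustration of the estimator (3.19), the following sketch uses assumed choices that are not from this article: W ~ N(θ, 1) and G(w) = w², for which g(θ) = θ² + 1 and g′(θ) = 2θ. For the unit-variance normal density, ∇ln f(θ_0, z) = z − θ_0, so the score-function estimate of the derivative at θ_0 is the sample average of G(Z_i)(Z_i − θ_0):

```python
import random

def lr_gradient_estimate(theta0=1.0, n=200_000, seed=0):
    """Score-function (likelihood ratio) estimate of d/dtheta IE_theta[G(W)]
    at theta0, for the assumed model W ~ N(theta, 1) and G(w) = w**2,
    so the true derivative is 2*theta0.  For this density the score is
    d/dtheta ln f(theta0, z) = z - theta0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(theta0, 1.0)        # sample from the pdf f(theta0, .)
        total += (z * z) * (z - theta0)   # G(z) times the score
    return total / n
```

At θ_0 = 1 the estimate is close to the true value 2. The variance of a single term here is about 30, illustrating how the method trades an easy-to-compute estimator for potentially large variance.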
As an example, consider the optimal value function of the second stage problem (3.5). Suppose that only the right hand side vector h of the second stage problem is random. Then Q(x, h) = G(h − Tx), where

G(χ) ≡ min { q^T y : W y = χ, y ≥ 0 }

Suppose that the random vector h has a pdf f(·). By using the transformation z = h − Tx we obtain

IE_f[Q(x, h)] = ∫ G(χ − Tx) f(χ) dχ = ∫ G(z) f(z + Tx) dz = IE_{f_x}[G(Z)]   (3.21)
where f_x(·) ≡ f(· + Tx) and Z is a random vector with pdf f_x.

Under suitable conditions, the generated iterates converge w.p.1 to a minimizer of g(x) over A. Typically, in order to guarantee this convergence, the following two conditions are imposed on the step sizes: (i) Σ_{ν=1}^∞ α_ν = ∞, and (ii) Σ_{ν=1}^∞ α_ν² < ∞.
Kushner and Clark (1978) and Benveniste, Metivier and Priouret (1990) contain expositions of the theory of stochastic approximation. Applications of the stochastic approximation method, combined with the infinitesimal perturbation analysis technique for gradient estimation, to the optimization of the steady state means of single server queues were studied by Chong and Ramadge (1992) and L'Ecuyer and Glynn (1994).
An attractive feature of the stochastic approximation method is its simplicity and ease of implementation in those cases in which the projection Π_X(·) can be easily computed. However, it also has severe shortcomings. The crucial question in implementations is the choice of the step sizes α_ν.
p[s′|s, a, t] ≡ IP[S(t + 1) = s′ | H(t), S(t) = s, A(t) = a]. For problems with uncountable state spaces, the transition probabilities are denoted by p[B|s, a, t] ≡ IP[S(t + 1) ∈ B | H(t), S(t) = s, A(t) = a], where B is a subset of states. Another representation is by a transition function f, such that given H(t), S(t) = s, and A(t) = a, the state at time t + 1 is S(t + 1) = f(s, a, t, W), where W is a random variable with a known probability distribution. The two representations are equivalent, and in this article we use mostly transition probabilities. When the transition probabilities do not depend on the time t, besides depending on the state s and decision a at time t, they are denoted by p[s′|s, a].
In the revenue management example, suppose the demand has probability mass function p(r, d) ≡ IP[D = d | price = r], with d ∈ ℤ_+. Also suppose that a quantity q that is ordered at time t is received before time t + 1, and that unsatisfied demand is backordered. Then S = ℤ, and the transition probabilities are as follows:

p[s′|s, (q, r)] = p(r, s + q − s′) if s′ ≤ s + q
p[s′|s, (q, r)] = 0 if s′ > s + q

If a quantity q that is ordered at time t is received after the demand at time t, and unsatisfied demand is lost, then S = ℤ_+, and the transition probabilities are as follows:

p[s′|s, (q, r)] = p(r, s + q − s′) if q < s′ ≤ s + q
p[s′|s, (q, r)] = Σ_{d=s}^∞ p(r, d) if s′ = q
p[s′|s, (q, r)] = 0 if s′ < q or s′ > s + q
4.2.5 Rewards and Costs
Dynamic decision problems often have as objective to maximize the sum of the rewards obtained in each
time period, or equivalently, to minimize the sum of the costs incurred in each time period. Other types
of objectives sometimes encountered are to maximize or minimize the product of a sequence of numbers
resulting from a sequence of decisions, or to maximize or minimize the maximum or minimum of a sequence
of resulting numbers.
In this article we focus mainly on the objective of maximizing the expected sum of the rewards obtained in each time period. At any point in time t, the state s at time t, the decision a ∈ A(s, t) at time t, and the time t should be sufficient to determine the expected reward r(s, a, t) at time t. (Again, the definition of the state should be chosen so that this holds for the decision problem under consideration.) When the rewards do not depend on the time t, besides depending on the state s and decision a at time t, they are denoted by r(s, a).
Note that, even if in the application the reward r(s, a, t, s′) depends on the state s′ at time t + 1, in addition to the state s and decision a at time t, and the time t, the expected reward at time t can still be found as a function of only s, a, and t, because

r(s, a, t) = IE[r(s, a, t, S(t + 1))] = Σ_{s′∈S} r(s, a, t, s′) p[s′|s, a, t] if S is discrete
r(s, a, t) = IE[r(s, a, t, S(t + 1))] = ∫_S r(s, a, t, s′) p[ds′|s, a, t] if S is uncountable
In the revenue management example, suppose unsatisfied demand is backordered, and that an inventory cost/shortage penalty of h(s) is incurred when the inventory level is s at the beginning of the time period. Then r(s, (q, r′), s′) = r′(s + q − s′) − h(s), with s′ ≤ s + q. Thus

r(s, (q, r′)) = Σ_{d=0}^∞ p(r′, d) r′ d − h(s)

If unsatisfied demand is lost, then r(s, (q, r′), s′) = r′(s + q − s′) − h(s), with q ≤ s′ ≤ s + q. Thus

r(s, (q, r′)) = Σ_{d=0}^{s−1} p(r′, d) r′ d + Σ_{d=s}^∞ p(r′, d) r′ s − h(s)
In finite horizon problems, there may be a salvage value v(s) if the process terminates in state s at the end of the time horizon T. Such a feature can be incorporated in the previous notation, by letting A(s, T) = {0}, and r(s, 0, T) = v(s), for all s ∈ S.
Often the rewards are discounted with a discount factor α ∈ [0, 1], so that the discounted expected value of the reward at time t is α^t r(s, a, t). Such a feature can again be incorporated in the previous notation, by letting r(s, a, t) = α^t r̄(s, a, t) for all s, a, and t, where r̄ denotes the undiscounted reward function. When the undiscounted reward does not depend on time, it is convenient to explicitly denote the discounted reward by α^t r̄(s, a).
4.2.6 Policies
A policy, sometimes called a strategy, prescribes the way a decision is to be made at each point in time, given the information available to the decision maker at that point in time. Therefore, a policy is a solution for a dynamic program.
There are different classes of policies of interest, depending on which of the available information the decisions are based on. A policy can base decisions on all the information in the history of the process up to the time the decision is to be made. Such policies are called history dependent policies. Given the memoryless nature of the transition probabilities, as well as the fact that the sets of feasible decisions and the expected rewards depend on the history of the process only through the current state, it seems intuitive that it should be sufficient to consider policies that base decisions only on the current state and time, and not on any additional information in the history of the process. Such policies are called memoryless, or Markovian, policies. If the transition probabilities, sets of feasible decisions, and rewards do not depend on the current time, then it also seems intuitive that it should be sufficient to consider policies that base decisions only on the current state, and not on any additional information in the history of the process or on the current time. (However, this intuition may be wrong, as shown by counterexamples in Section 4.2.7.) Under such policies decisions are made in the same way each time the process is in the same state. Such policies are called stationary policies.
The decision maker may also choose to use some irrelevant information to make a decision. For example,
the decision maker may roll a die, or draw a card from a deck of cards, and then base the decision on the
outcome of the die roll or the drawn card. In other words, the decision maker may randomly select a decision.
Policies that allow such randomized decisions are called randomized policies, and policies that do not allow
randomized decisions are called nonrandomized or deterministic policies.
Combining the above types of information that policies can base decisions on, the following types of policies are obtained: the class Π^{HR} of history dependent randomized policies, the class Π^{HD} of history dependent deterministic policies, the class Π^{MR} of memoryless randomized policies, the class Π^{MD} of memoryless deterministic policies, the class Π^{SR} of stationary randomized policies, and the class Π^{SD} of stationary deterministic policies. The classes of policies are related as follows: Π^{SD} ⊂ Π^{MD} ⊂ Π^{HD} ⊂ Π^{HR}, Π^{SD} ⊂ Π^{MD} ⊂ Π^{MR} ⊂ Π^{HR}, and Π^{SD} ⊂ Π^{SR} ⊂ Π^{MR} ⊂ Π^{HR}.
For the revenue management problem, an example of a stationary deterministic policy is to order quantity q = s_2 − s if the inventory level s < s_1, for chosen constants s_1 ≤ s_2, and to set the price at level r = r(s) for a chosen function r(s) of the current state s. An example of a stationary randomized policy is to set the price at level r = r_1(s) with probability p_1(s) and at level r = r_2(s) with probability 1 − p_1(s), for chosen functions r_1(s), r_2(s), and p_1(s) of the current state s. An example of a memoryless deterministic policy is to order quantity q = s_2(t) − s if the inventory level s < s_1(t), for chosen functions s_1(t) ≤ s_2(t) of the current time t, and to set the price at level r = r(s, t) for a chosen function r(s, t) of the current state s and time t.
Policies are functions, defined as follows. Let H(t) ≡ (S(0), A(0), S(1), A(1), . . . , S(t)) denote the history up to time t, and let H ≡ ∪_{t=0}^∞ H(t) denote the set of all histories. Let A ≡ ∪_{s∈S} ∪_{t=0}^∞ A(s, t) denote the set of all feasible decisions. Let P(s, t) denote the set of probability distributions on A(s, t) (satisfying regularity conditions), and let P ≡ ∪_{s∈S} ∪_{t=0}^∞ P(s, t) denote the set of all such probability distributions. Then Π^{HR} is the set of functions π : H → P such that, for any t and any history H(t), π(H(t)) ∈ P(S(t), t) (again, regularity conditions may be required). Π^{HD} is the set of functions π : H → A such that, for any t and any history H(t), π(H(t)) ∈ A(S(t), t). Π^{MR} is the set of functions π : S × ℤ_+ → P such that, for any state s ∈ S and any time t ∈ ℤ_+, π(s, t) ∈ P(s, t). Π^{MD} is the set of functions π : S × ℤ_+ → A such that, for any state s ∈ S and any time t ∈ ℤ_+, π(s, t) ∈ A(s, t). Π^{SR} is the set of functions π : S → P such that, for any state s ∈ S, π(s) ∈ P(s). Π^{SD} is the set of functions π : S → A such that, for any state s ∈ S, π(s) ∈ A(s).
4.2.7 Examples
In this section a number of examples are presented that illustrate why it is sometimes desirable to consider
more general classes of policies, such as memoryless and/or randomized policies instead of stationary deter-
ministic policies, even if the sets of feasible solutions, transition probabilities, and rewards are stationary.
The examples may also be found in Ross (1970), Ross (1983), Puterman (1994), and Sennott (1999).
The examples are for dynamic programs with stationary input data and the objective to minimize the long-run average cost per unit time, limsup_{T→∞} IE[Σ_{t=0}^{T−1} r(S(t), A(t)) | S(0)]/T. For any policy π, let

V_π(s) ≡ limsup_{T→∞} (1/T) IE_π[ Σ_{t=0}^{T−1} r(S(t), A(t)) | S(0) = s ]

denote the long-run average cost per unit time under policy π if the process starts in state s, where IE_π[·] denotes the expected value if policy π is followed. A policy π* is called optimal if V_{π*}(s) = inf_{π∈Π^{HR}} V_π(s) for all s ∈ S, and a policy π_ε is called ε-optimal if V_{π_ε}(s) < inf_{π∈Π^{HR}} V_π(s) + ε for all s ∈ S.

Example 4.2 The first example shows that an optimal policy may not exist. Consider a dynamic program with state space S = {1, 1′, 2, 2′, 3, 3′, . . .}, and decision sets A(i) = {a, b} and A(i′) = {a} for each i ∈ {1, 2, 3, . . .}. Transitions are deterministic; from a state i we either go up to state i + 1 if we make decision a, or we go across to state i′ if we make decision b; from a state i′, we remain in state i′. That is, the transition function is f(i, a) = i + 1, f(i, b) = i′, and f(i′, a) = i′. Each time period spent in a state i a cost of 1 is incurred, and each time period spent in a state i′ a cost of 1/i is incurred. That is, the costs are r(i, a) = r(i, b) = 1, and r(i′, a) = 1/i.
Suppose the process starts in state 1. The idea is simple: we would like to go to a high state i before moving over to state i′. However, a policy π that chooses decision a for each state i has long-run average cost per unit time of V_π(1) = 1, which is as bad as can be. The only other possibility is that there exists a state j such that policy π chooses decision b with positive probability p_j when the process reaches state j. In that case V_π(1) ≥ p_j/j > 0. Thus V_π(1) > 0 for every policy π, while inf_{π∈Π^{HR}} V_π(1) = 0, so that an optimal policy does not exist.
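The tension in this example can be seen numerically. The following sketch (the function name and horizon are illustrative) simulates the stationary policy that chooses decision a in states i < j and decision b in state j: its long-run average cost approaches 1/j, which is positive for every fixed j but can be made arbitrarily small by increasing j.

```python
def avg_cost_threshold(j, T=100_000):
    """Average cost over T periods of the stationary policy that moves up
    (decision a, cost 1) in states i < j, crosses to j' (decision b, cost 1)
    at state j, and then stays in j' forever at cost 1/j per period."""
    total, state, primed = 0.0, 1, False
    for _ in range(T):
        if primed:
            total += 1.0 / state        # r(i', a) = 1/i
        else:
            total += 1.0                # r(i, a) = r(i, b) = 1
            if state == j:
                primed = True
            else:
                state += 1
    return total / T
```

For j = 10 the average is roughly 0.1, and for j = 100 roughly 0.01: the infimum 0 is approached along the sequence of threshold policies, but is attained by none of them.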
Example 4.3 The second example shows that it is not always the case that for any policy π there exists a stationary deterministic policy that is at least as good. The sequence of states under the policy π* considered is 1, 1′, 2, 2′, 2′, 3, 3′, 3′, 3′, . . . , and the corresponding sequence of costs under policy π* is 1, 1, 1/2, 1/2, 1, 1/3, 1/3, 1/3, 1, . . . . Note that the total cost incurred while the process is in states i and i′ is 2 for each i, so that the total cost incurred from the start of the process until the process leaves state i′ is 2i. The total time until the process leaves state i′ is 2 + 3 + · · · + (i + 1) = i(i + 3)/2. Thus the average cost per unit time if the process is currently in state i + 1 is less than 2(i + 1)/(i(i + 3)/2), which becomes arbitrarily small as i becomes large. Thus V_{π*}(1) = 0.

The next example involves a dynamic program with state space S = {0, 1, 1′, 2, 2′, 3, 3′, . . .} and the following transition probabilities:

p[i|0, a] = (3/2)(1/4)^i
p[0|i, a] = p[i + 1|i, a] = 1/2
p[0|i, b] = 1 − p[i′|i, b] = (1/2)^i
p[0|i′, a] = 1 − p[i′|i′, a] = (1/2)^i
Again, the idea is simple: we would like to visit state 0 as infrequently as possible, and thus we would like the process to move to a high state i or i′ between visits to state 0. Let M^π_{i0} denote the mean time for the process to move from state i to state 0 under stationary policy π. Thus V_π(0) = 1/M^π_{00}.
First consider the stationary deterministic policy π_j that chooses decision a for states i = 1, 2, . . . , j − 1, and chooses decision b for states i = j, j + 1, . . . . Then, for i = 1, 2, . . . , j − 1, M^{π_j}_{i0} = 2 + 2^i (1/2)^{j−i−1}. From this it follows that M^{π_j}_{00} < 5, and thus V_{π_j}(0) > 1/5 for all j.
Next consider any stationary randomized policy π, and let π(i, a) denote the probability that decision a is made in state i. Then, given that the process is in state i, the probability is π(i, a)π(i + 1, a) · · · π(j − 1, a)π(j, b) that the process under policy π behaves the same, until state 0 is reached, as under policy π_j. Thus

M^π_{i0} = Σ_{j=i}^∞ [ π(j, b) Π_{k=i}^{j−1} π(k, a) ] M^{π_j}_{i0} + 2 Π_{k=i}^∞ π(k, a)
        < (2 + 2^i) [ Σ_{j=i}^∞ π(j, b) Π_{k=i}^{j−1} π(k, a) + Π_{k=i}^∞ π(k, a) ]
        = 2 + 2^i

From this it follows that M^π_{00} < 5, and thus V_π(0) > 1/5 for every stationary randomized policy π.
The next example shows that there need not exist an ε-optimal stationary deterministic policy. Consider a dynamic program with state space S = {0, 1, 1′, 2, 2′, 3, 3′, . . .}, in which the costs are r(i, a) = r(i, b) = 2, r(0, a) = 2, and r(i′, a) = 0. The transition probabilities are as follows:

p[0|0, a] = 1
p[i + 1|i, a] = 1
p[i′|i, b] = 1 − p[0|i, b] = p_i
p[(i − 1)′|i′, a] = 1 for all i ≥ 2
p[1|1′, a] = 1

The values p_i can be chosen to satisfy p_i < 1 for all i, and Π_{i=1}^∞ p_i = 3/4.
Suppose the process starts in state 1. Again, the idea is simple: we would like to go down the chain i′, (i − 1)′, . . . , 1′ from as high a state i as possible. A stationary deterministic policy that chooses decision a in every state i has long-run average cost per unit time V_π(1) = 2, which is as bad as can be. The only other possibility for a stationary deterministic policy is to choose decision b for the first time in state j. In that case, each time state j is visited, there is a positive probability 1 − p_j > 0 of making a transition to state 0. It follows that the mean time until a transition to state 0 is made is less than 2j/(1 − p_j) < ∞, and the long-run average cost per unit time is V_π(1) = 2. Thus V_π(1) = 2 for every stationary deterministic policy π.
Next consider a policy that, on its jth visit to state 1, chooses decision a until state j is reached and then chooses decision b. Under this policy, with probability Π_{i=1}^∞ p_i = 3/4 the process never makes a transition to state 0, and the long-run average cost per unit time is 1. Otherwise, with probability 1 − Π_{i=1}^∞ p_i = 1/4, the process makes a transition to state 0, and the long-run average cost per unit time is 2. Hence, the expected long-run average cost per unit time is V(1) = (3/4)·1 + (1/4)·2 = 5/4. Thus, there is no ε-optimal stationary deterministic policy for ε ∈ (0, 3/4). In fact, by considering memoryless deterministic policies π_k that on their jth visit to state 1 choose decision a, j + k times, and then choose decision b, one obtains policies with expected long-run average cost per unit time V_{π_k}(1) arbitrarily close to 1 for sufficiently large values
of k. It is clear that V*(1) = 1, and that no policy attains this value.

4.3 Finite Horizon Dynamic Programs
In finite horizon problems, the objective is to maximize the expected total reward

IE[ Σ_{t=0}^T r(S(t), A(t), t) ]   (4.1)

where T < ∞ is the known finite horizon length, and decisions A(t), t = 0, 1, . . . , T, have to be feasible and may depend only on the information available to the decision maker at each time t, that is, the history H(t) of the process up to time t, and possibly some randomization. For the presentation we assume that S is countable and r is bounded. Similar results hold in more general cases, subject to regularity conditions.
4.3.1 Optimality Results
For any policy π ∈ Π^{HR} and any history h(t) ∈ H(t), let

U_π(h(t)) ≡ IE_π[ Σ_{τ=t}^T r(S(τ), A(τ), τ) | H(t) = h(t) ]   (4.2)

denote the expected value under policy π from time t onwards, given the history h(t) of the process up to time t; U_π is called the value function under policy π. The optimal value function U* is given by

U*(h(t)) ≡ sup_{π∈Π^{HR}} U_π(h(t))   (4.3)

It follows from r being bounded that U_π and U* are bounded. A policy π* ∈ Π^{HR} is called optimal if U_{π*}(h(t)) = U*(h(t)) for all h(t) ∈ H(t) and all t, and a policy π_ε ∈ Π^{HR} is called ε-optimal if U_{π_ε}(h(t)) + ε > U*(h(t)) for all h(t) ∈ H(t) and all t. For any history h(t) = (h(t − 1), a(t − 1), s), the value function U_π satisfies

U_π(h(t)) = IE_π[ r(s, π(h(t)), t) + U_π(H(t + 1)) | H(t) = h(t) ]   (4.4)

and hence U_π can be computed inductively; this is called the finite horizon policy evaluation algorithm. This result is also used to establish the result that U* satisfies

U*(h(t)) = sup_{a∈A(s,t)} { r(s, a, t) + IE[ U*((h(t), a, S(t + 1))) | S(t) = s, A(t) = a ] }   (4.5)

This suggests that U*(h(t)) should depend on h(t) = (h(t − 1), a(t − 1), s) only through the state s at time t and the time
t, and that it should be sufficient to consider memoryless policies. To establish these results, inductively define the memoryless function V*:

V*(s, T + 1) ≡ 0
V*(s, t) ≡ sup_{a∈A(s,t)} { r(s, a, t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a ] }   (4.6)

Then it can be shown that U*(h(t)) = V*(s, t).
For any memoryless policy π ∈ Π^{MR}, inductively define the function V_π:

V_π(s, T + 1) ≡ 0
V_π(s, t) ≡ IE_π[ r(s, π(s, t), t) + V_π(S(t + 1), t + 1) | S(t) = s ]   (4.7)

Then U_π(h(t)) = V_π(s, t). If the supremum in (4.6) is attained for each s and t, that is, there is a decision

a*(s, t) ∈ arg max_{a∈A(s,t)} { r(s, a, t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a ] }   (4.8)

then the memoryless deterministic policy π* with π*(s, t) = a*(s, t) is optimal, because U_{π*}(h(t)) = V_{π*}(s, t) = V*(s, t) = U*(h(t)). If the supremum in (4.6) is not attained, then there does not exist an optimal memoryless deterministic policy π* with V_{π*}(s, t) = V*(s, t), and then there also does not exist an optimal history dependent randomized policy. In such a case it still holds that, for any ε > 0, there exists an ε-optimal memoryless deterministic policy π_ε, obtained by choosing, for each s and t, a decision π_ε(s, t) such that

r(s, π_ε(s, t), t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = π_ε(s, t) ] + ε/(T + 1) > sup_{a∈A(s,t)} { r(s, a, t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a ] }   (4.9)

An optimal policy π* ∈ Π^{MD} is then obtained using (4.8), or an ε-optimal policy π_ε ∈ Π^{MD} is obtained using (4.9).
Finite Horizon Backward Induction Algorithm
0. Set V*(s, T + 1) = 0 for all s ∈ S, and set t = T.
1. For each s ∈ S, compute

V*(s, t) = sup_{a∈A(s,t)} { r(s, a, t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a ] }

and choose π*(s, t) according to (4.8) if the supremum is attained; otherwise, for a chosen ε > 0, choose a decision π_ε(s, t) such that

r(s, π_ε(s, t), t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = π_ε(s, t) ] + ε/(T + 1) > sup_{a∈A(s,t)} { r(s, a, t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a ] }

2. If t = 0, stop; otherwise, decrease t by 1 and repeat step 1.
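For a problem with finitely many states and decisions, backward induction can be sketched as follows. The dictionary encoding of the problem data, with stationary transition probabilities and rewards for simplicity, is an illustrative assumption:

```python
def backward_induction(S, A, p, r, T):
    """Finite horizon backward induction.
    S: list of states; A[s]: feasible decisions in state s;
    p[(s, a)]: dict {s_next: probability}; r[(s, a)]: expected one-period reward
    (stationary data for simplicity).  Returns V*(., 0) and pi*(s, t)."""
    V = {s: 0.0 for s in S}                     # V*(s, T + 1) = 0
    policy = {}
    for t in range(T, -1, -1):                  # t = T, T - 1, ..., 0
        V_new = {}
        for s in S:
            best_a, best_v = None, float("-inf")
            for a in A[s]:
                v = r[(s, a)] + sum(pr * V[s2] for s2, pr in p[(s, a)].items())
                if v > best_v:
                    best_a, best_v = a, v
            V_new[s], policy[(s, t)] = best_v, best_a
        V = V_new
    return V, policy
```

For instance, with two states where decision a earns 1 and stays put while decision b earns 5 but moves to an absorbing zero-reward state, and T = 2, the optimal policy is time dependent: choose a at times 0 and 1 and b at the final time, for a total value of 7.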
In such revenue management applications it can be shown that the optimal value function V*(s, t) is increasing in s and decreasing in t. The following threshold policy, with reward threshold function x*(q, s, t) ≡ V*(s, t + 1) − V*(s − q, t + 1), is optimal: accept a request for Q units of resource if the offered reward is at least x*(Q, S(t), t), and reject the request otherwise. If each request is for the same amount of resource (say 1 unit of resource), and the salvage reward is concave in the remaining amount of resource, then the optimal value function V*(s, t) is concave in s, and hence the threshold x*(1, s, t) = V*(s, t + 1) − V*(s − 1, t + 1) is decreasing in s.
4.5 Infinite Horizon Discounted Dynamic Programs
In infinite horizon discounted problems, the objective is to maximize the expected total discounted reward

IE[ Σ_{t=0}^∞ α^t r(S(t), A(t)) ]   (4.11)

where α ∈ (0, 1) is a known discount factor. Again, decisions A(t), t = 0, 1, . . . , have to be feasible and may depend only on the information available to the decision maker at each time t, that is, the history H(t) of the process up to time t, and possibly some randomization.
4.5.1 Optimality Results
The establishment of optimality results for infinite horizon discounted dynamic programs is quite similar to that for finite horizon dynamic programs. An important difference though is that backward induction cannot be used in the infinite horizon case.
We again start by defining the value function U_π for a policy π ∈ Π^{HR}:

U_π(h(t)) ≡ IE_π[ Σ_{τ=t}^∞ α^{τ−t} r(S(τ), A(τ)) | H(t) = h(t) ]   (4.12)

The optimal value function U* is given by U*(h(t)) ≡ sup_{π∈Π^{HR}} U_π(h(t)). It follows from r being bounded that U_π and U* are bounded. A policy π* ∈ Π^{HR} is called optimal if U_{π*}(h(t)) = U*(h(t)) for all h(t) ∈ H(t) and all t ∈ {0, 1, . . .}. A policy π_ε ∈ Π^{HR} is called ε-optimal if U_{π_ε}(h(t)) + ε > U*(h(t)) for all h(t) ∈ H(t) and all t ∈ {0, 1, . . .}.
The value function U_π satisfies an inductive equation similar to (4.4) for the finite horizon case: for any π ∈ Π^{HR} and any history h(t) = (h(t − 1), a(t − 1), s),

U_π(h(t)) = IE_π[ r(s, π(h(t))) + α U_π(H(t + 1)) | H(t) = h(t) ]   (4.13)

This suggests that U*(h(t)) should depend on h(t) = (h(t − 1), a(t − 1), s) only through the most recent state s, and that it should be sufficient to consider stationary policies. However, it is convenient to show, as an intermediate step, that it is sufficient to consider memoryless policies. For any π ∈ Π^{HR} and any history h(t), one can define a memoryless randomized policy π̃ ∈ Π^{MR} that, at each state and each time t + τ, makes decisions with the same probabilities as π(h(t + τ)) for any history h(t + τ) that starts with h(t). It follows that U*(h(t)) ≡ sup_{π∈Π^{HR}} U_π(h(t)) = sup_{π∈Π^{MR}} U_π(h(t)). As in the finite horizon case, it can also be shown that U*(h(t)) depends on h(t) only through the most recent state s and the time t. Instead of exploring this result in more detail as for the finite horizon case, we use another property of memoryless randomized policies. Using the stationary properties of the problem parameters, it follows that, for any memoryless randomized policy π and any time t, π behaves in the same way from time t onwards as another memoryless randomized policy π̄ behaves from time 0 onwards, where π̄ is obtained from π by shifting π backwards through t, as follows. Define the shift function φ : Π^{MR} → Π^{MR} by φ(π)(s, t) ≡ π(s, t + 1) for all s ∈ S and all t ∈ {0, 1, . . .}. That is, policy φ(π) ∈ Π^{MR} makes the same decisions at time t as policy π ∈ Π^{MR} makes at time t + 1. Also, inductively define the convolution φ^{t+1}(π) ≡ φ(φ^t(π)). Thus the shifted policy π̄ described above is given by π̄ = φ^t(π). Also note that, for a stationary policy π, φ(π) = π.
Now it is useful to focus on the value function V_π for a policy π ∈ Π^{HR} from time 0 onwards,

V_π(s) ≡ IE_π[ Σ_{t=0}^∞ α^t r(S(t), A(t)) | S(0) = s ]   (4.14)

That is, V_π(s) = U_π(h(0)), where h(0) = (s). Then it follows that, for any memoryless randomized policy π and any history h(t) = (h(t − 1), a(t − 1), s),

U_π(h(t)) = V_{φ^t(π)}(s)   (4.15)

Thus we obtain the further simplification that U*(h(t)) ≡ sup_{π∈Π^{HR}} U_π(h(t)) = sup_{π∈Π^{MR}} U_π(h(t)) = sup_{π∈Π^{MR}} V_{φ^t(π)}(s) = V*(s), where the optimal value function V* is defined by

V*(s) ≡ sup_{π∈Π^{MR}} V_π(s)   (4.16)

Thus, for any history h(t) = (h(t − 1), a(t − 1), s),

U*(h(t)) = V*(s)   (4.17)

and hence U* depends on the history h(t) only through the most recent state s. For any memoryless randomized policy π,

V_π(s) = U_π(h(0)) = IE_π[ r(s, π(s, 0)) + α V_{φ(π)}(S(1)) | S(0) = s ]   (4.18)

As a special case, for a stationary policy π,

V_π(s) = IE_π[ r(s, π(s)) + α V_π(S(1)) | S(0) = s ]   (4.19)

Similar to the finite horizon case, one expects the optimal value function V* to satisfy the following optimality equation:

V*(s) = sup_{a∈A(s)} { r(s, a) + α IE[ V*(S(1)) | S(0) = s, A(0) = a ] }   (4.20)
Define the operator L : V → V, where V denotes the space of bounded functions on S, by

L(V)(s) ≡ sup_{a∈A(s)} { r(s, a) + α IE[ V(S(1)) | S(0) = s, A(0) = a ] }

Let ‖V‖ ≡ sup_{s∈S} |V(s)|. For any V_1, V_2 ∈ V, L satisfies ‖L(V_1) − L(V_2)‖ ≤ α ‖V_1 − V_2‖. Hence L is a contraction mapping on V, and it follows from the Banach fixed point theorem that L has a unique fixed point in V, that is, there exists a unique function v* ∈ V that satisfies v* = L(v*). It can be shown that this fixed point is equal to V*. Similarly, for any stationary policy π, define the operator L_π : V → V by

L_π(V)(s) ≡ IE_π[ r(s, π(s)) + α V(S(1)) | S(0) = s ]

Then L_π is also a contraction mapping on V, and V_π is the fixed point of L_π.
Consider any V ∈ V such that V ≥ L(V). Then it can be shown that V ≥ V_π for every π ∈ Π^{MR}, and thus V ≥ sup_{π∈Π^{MR}} V_π = V*. Similarly, consider any V ∈ V such that V ≤ L(V). Then for any ε > 0 there is a policy π such that V ≤ V_π + ε ≤ V* + ε, and thus V ≤ V*. Combining these results, it follows that for any V ∈ V such that V = L(V), it holds that V = V*, and thus v* = V*, that is, V*
is the unique solution of the optimality equation (4.20).
If the supremum in (4.20) is attained for each s ∈ S, let the stationary deterministic policy π* be given by

π*(s) = a*(s) ∈ arg max_{a∈A(s)} { r(s, a) + α IE[ V*(S(1)) | S(0) = s, A(0) = a ] }   (4.21)

Then IE_{π*}[ r(s, π*(s)) + α V*(S(1)) | S(0) = s ] = sup_{a∈A(s)} { r(s, a) + α IE[ V*(S(1)) | S(0) = s, A(0) = a ] }, so that L_{π*}(V*) = L(V*) = V*, that is, V* is a fixed point of L_{π*}, and thus V_{π*} = V*. Therefore U_{π*}(h(t)) = V_{π*}(s) = V*(s) = U*(h(t)), that is, the stationary deterministic policy π* is optimal. If the supremum in (4.20) is not attained, then, for any chosen ε > 0, an ε-optimal stationary deterministic policy π_ε is obtained by choosing, for each s ∈ S, a decision π_ε(s) such that

r(s, π_ε(s)) + α IE[ V*(S(1)) | S(0) = s, A(0) = π_ε(s) ] + (1 − α)ε > sup_{a∈A(s)} { r(s, a) + α IE[ V*(S(1)) | S(0) = s, A(0) = a ] }   (4.22)

An optimal policy π* ∈ Π^{SD} is then obtained using (4.21), or an ε-optimal policy π_ε ∈ Π^{SD} is obtained using (4.22).
Unlike the finite horizon case, V* typically cannot be computed exactly in a finite number of steps; instead, algorithms for the infinite horizon case produce a sequence of approximating functions V_i that converge to V* as i → ∞.
Approximating functions provide good policies, as shown by the following result. Suppose V* is approximated by V̂ such that ‖V̂ − V*‖ ≤ ε, and suppose the stationary deterministic policy π satisfies

r(s, π(s)) + α Σ_{s′∈S} p[s′|s, π(s)] V̂(s′) + δ ≥ sup_{a∈A(s)} { r(s, a) + α Σ_{s′∈S} p[s′|s, a] V̂(s′) }

for all s ∈ S, that is, decision π(s) is within δ of the optimal decision using approximating function V̂ on the right hand side of the optimality equation (4.20). Then

V_π(s) ≥ V*(s) − (2ε + δ)/(1 − α)   (4.23)

for all s ∈ S, that is, policy π has value function within (2ε + δ)/(1 − α) of the optimal value function.
Value Iteration One algorithm based on a sequence of approximating functions V_i is called value iteration, or successive approximation. The iterates V_i of value iteration correspond to the value functions V*(s, T + 1 − i) of the finite horizon dynamic program with the same problem parameters. Specifically, starting with initial approximation V_0(s) = 0 = V*(s, T + 1), the iterate V_i(s) is equal to the value function V*(s, T + 1 − i) of the corresponding finite horizon dynamic program, that is, the value function for time T + 1 − i that is obtained after i steps of the backward induction algorithm.
Value Iteration Algorithm
0. Choose initial approximation V_0 ∈ V and stopping tolerance ε. Set i = 0.
1. For each s ∈ S, compute

V_{i+1}(s) = sup_{a∈A(s)} { r(s, a) + α Σ_{s′∈S} p[s′|s, a] V_i(s′) }   (4.24)

2. If ‖V_{i+1} − V_i‖ < ε(1 − α)/(2α), go to step 3; otherwise, increase i by 1 and repeat step 1.
3. For each s ∈ S, choose a decision

π(s) ∈ arg max_{a∈A(s)} { r(s, a) + α Σ_{s′∈S} p[s′|s, a] V_{i+1}(s′) }

if the maximum on the right hand side is attained. Otherwise, for any chosen δ > 0, choose a decision π_δ(s) such that

r(s, π_δ(s)) + α Σ_{s′∈S} p[s′|s, π_δ(s)] V_{i+1}(s′) + (1 − α)δ > sup_{a∈A(s)} { r(s, a) + α Σ_{s′∈S} p[s′|s, a] V_{i+1}(s′) }
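For finitely many states and decisions, the algorithm above can be implemented directly. The dictionary encoding of S, A, p, and r below is an illustrative assumption; the stopping test is the one from step 2:

```python
def value_iteration(S, A, p, r, alpha, eps=1e-4):
    """Value iteration for a discounted dynamic program.
    S: states; A[s]: decisions; p[(s, a)]: dict {s_next: prob};
    r[(s, a)]: expected one-period reward; alpha: discount factor."""
    def backup(V, s, a):
        return r[(s, a)] + alpha * sum(pr * V[s2] for s2, pr in p[(s, a)].items())

    V = {s: 0.0 for s in S}                          # initial approximation V_0
    while True:
        V_new = {s: max(backup(V, s, a) for a in A[s]) for s in S}
        # Step 2 stopping test: ||V_{i+1} - V_i|| < eps (1 - alpha) / (2 alpha)
        if max(abs(V_new[s] - V[s]) for s in S) < eps * (1 - alpha) / (2 * alpha):
            break
        V = V_new
    # Step 3: greedy policy with respect to the final approximation
    policy = {s: max(A[s], key=lambda a: backup(V_new, s, a)) for s in S}
    return V_new, policy
```

On a single-state problem with reward 1 per period and α = 0.9, the returned value is close to the true value 1/(1 − α) = 10, consistent with the guarantee that the final approximation is within ε/2 of V*.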
It can be shown, using the contraction property of L, that V_i converges to V* geometrically, with ‖V_i − V*‖ ≤ α^i ‖V_0 − V*‖. That implies that the convergence rate is faster if the discount factor α is smaller.
When the value iteration algorithm stops, the final approximation V_{i+1} satisfies ‖V_{i+1} − V*‖ < ε/2. Furthermore, the chosen policy π is ε-optimal if the maximum in step 3 is attained, and the chosen policy π_δ is an (ε + δ)-optimal policy.
There are several versions of the value iteration algorithm. One example is Gauss-Seidel value iteration, which, when computing V_{i+1}(s), uses the most up-to-date approximation V_{i+1}(s′) for those states s′ whose values have already been updated in the current iteration.

Policy Iteration Another algorithm for infinite horizon discounted dynamic programs is policy iteration.
Policy Iteration Algorithm
0. Choose an initial stationary deterministic policy π_0 and stopping tolerance ε. Set i = 0.
1. Compute the value function V_i of policy π_i by solving

V_i(s) = r(s, π_i(s)) + α Σ_{s′∈S} p[s′|s, π_i(s)] V_i(s′)   (4.25)
for each s ∈ S.
2. For each s ∈ S, choose a decision
π_{i+1}(s) ∈ arg max_{a∈A(s)} { r(s, a) + α Σ_{s′∈S} p[s′|s, a] V_i(s′) }

if the maximum on the right hand side is attained. Otherwise, for any chosen δ > 0, choose a decision π_{i+1}(s) such that

r(s, π_{i+1}(s)) + α Σ_{s′∈S} p[s′|s, π_{i+1}(s)] V_i(s′) + (1 − α)δ > sup_{a∈A(s)} { r(s, a) + α Σ_{s′∈S} p[s′|s, a] V_i(s′) }
3. If π_{i+1} = π_i, or i > 0 and ‖V_i − V_{i−1}‖ < ε(1 − α)/2, stop; otherwise, increase i by 1 and repeat step 1.

When the policy iteration algorithm stops, the following holds. If π_{i+1} attains the maximum on the right hand side, then ‖V_{π_{i+1}} − V*‖ < ε and the chosen policy π_{i+1} is ε-optimal. If π_{i+1} chooses a decision within (1 − α)δ of the maximum on the right hand side, then ‖V_{π_{i+1}} − V*‖ < ε + δ and the chosen policy π_{i+1} is (ε + δ)-optimal.
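For small problems, step 1 can be carried out to high accuracy. The sketch below, using the same illustrative dictionary encoding as before, evaluates the policy by iterating the contraction L_π to a tight tolerance (a direct linear solve of the system (4.25) would also work), and stops when the improved policy repeats:

```python
def policy_iteration(S, A, p, r, alpha, eval_tol=1e-10):
    """Policy iteration for a discounted dynamic program.
    Step 1 approximates the solution of (4.25) by iterating L_pi."""
    def backup(V, s, a):
        return r[(s, a)] + alpha * sum(pr * V[s2] for s2, pr in p[(s, a)].items())

    pol = {s: A[s][0] for s in S}            # arbitrary initial policy pi_0
    while True:
        # Step 1: policy evaluation, V converges to V_pi
        V = {s: 0.0 for s in S}
        while True:
            V_new = {s: backup(V, s, pol[s]) for s in S}
            done = max(abs(V_new[s] - V[s]) for s in S) < eval_tol
            V = V_new
            if done:
                break
        # Step 2: policy improvement (greedy with respect to V)
        pol_new = {s: max(A[s], key=lambda a: backup(V, s, a)) for s in S}
        # Step 3: stop when the policy no longer changes
        if pol_new == pol:
            return V, pol
        pol = pol_new
```

For a two-state problem with α = 0.5, where state 1 is absorbing with reward 2 and state 0 can either stay for reward 0 or move to state 1 for reward 1, the algorithm finds V*(1) = 4, V*(0) = 1 + 0.5·4 = 3.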
Modified Policy Iteration It was mentioned that one of the drawbacks of the policy iteration algorithm is the computational effort required to solve (4.25) for V_i. An iterative algorithm, such as the Gauss-Seidel method, can be used to solve (4.25) approximately. For any stationary policy π, the iterates

V_j(s) = r(s, π(s)) + α Σ_{s′∈S} p[s′|s, π(s)] V_{j−1}(s′)

computed for each s ∈ S, converge to V_π as j → ∞. Modified policy iteration performs only a moderate number N_i of such evaluation steps between policy updates.

Modified Policy Iteration Algorithm
0. Choose initial approximation V_{1,0} ∈ V, stopping tolerance ε, and numbers of evaluation steps N_i. Set i = 1.
1. For each s ∈ S, choose a decision

π_i(s) ∈ arg max_{a∈A(s)} { r(s, a) + α Σ_{s′∈S} p[s′|s, a] V_{i,0}(s′) }

if the maximum on the right hand side is attained. Otherwise, for any chosen δ > 0, choose a decision π_i(s)
such that

r(s, π_i(s)) + α Σ_{s′∈S} p[s′|s, π_i(s)] V_{i,0}(s′) + (1 − α)δ > sup_{a∈A(s)} { r(s, a) + α Σ_{s′∈S} p[s′|s, a] V_{i,0}(s′) }

2. For j = 1, 2, . . . , N_i, compute

V_{i,j}(s) = r(s, π_i(s)) + α Σ_{s′∈S} p[s′|s, π_i(s)] V_{i,j−1}(s′)

for each s ∈ S.
3. If ‖V_{i,1} − V_{i,0}‖ < ε(1 − α)/(2α), stop. Otherwise, set V_{i+1,0} = V_{i,N_i}, increase i by 1, and repeat step 1.

The numbers N_i of evaluation steps in step 2 can be chosen in advance, or can be determined by stopping step 2 when ‖V_{i,j} − V_{i,j−1}‖ < ε_i for some chosen sequence (typically decreasing) ε_i
. The idea is to choose N_i to obtain the best trade-off between the computational requirements of step 1, in which an optimization problem is solved to obtain a new policy, and those of step 2, in which a more accurate approximation of the value function of the current policy is computed. If the optimization problem in step 1 requires a lot of computational effort, then it is better to obtain more accurate approximations of the value functions between successive executions of step 1, that is, it is better to choose N_i larger, and vice versa. Also, if the policy does not change much from one major iteration to the next, that is, if policies π_{i−1} and π_i are very similar, then it is also better to obtain a more accurate approximation of the value function V_i by choosing N_i larger. It is typical that the policies do not change much later in the algorithm, and hence it is typical to choose N_i to be increasing in i.
When the modified policy iteration algorithm stops, the approximation V_{i,1} satisfies ‖V_{i,1} − V*‖ < ε/2. If the chosen policy π_i attains the maximum on the right hand side, then π_i is ε-optimal. If π_i chooses a decision within (1 − α)δ of the maximum on the right hand side, then π_i is (ε + δ)-optimal. Furthermore, if the initial approximation V_{1,0} satisfies L(V_{1,0}) ≥ V_{1,0}, such as if V_{1,0} = V_{π_0} for some initial policy π_0, and if each π_i attains the maximum on the right hand side, then the sequence of policies π_i is monotonically improving, and V_{i,j−1} ≤ V_{i,j} for each i and j, from which it also follows that V_{i−1,0} ≤ V_{i,0} for each i.
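The algorithm can be sketched compactly; in the sketch below the number of evaluation sweeps N_i is kept constant across major iterations, and the dictionary encoding of the problem data is illustrative:

```python
def modified_policy_iteration(S, A, p, r, alpha, n_eval=20, eps=1e-4):
    """Modified policy iteration: a greedy improvement step (step 1)
    followed by n_eval fixed-policy evaluation sweeps (step 2)."""
    def backup(V, s, a):
        return r[(s, a)] + alpha * sum(pr * V[s2] for s2, pr in p[(s, a)].items())

    V = {s: 0.0 for s in S}                  # V_{1,0}
    while True:
        # Step 1: greedy policy with respect to the current approximation
        pol = {s: max(A[s], key=lambda a: backup(V, s, a)) for s in S}
        # First evaluation sweep V_{i,1}, used for the stopping test in step 3
        V1 = {s: backup(V, s, pol[s]) for s in S}
        if max(abs(V1[s] - V[s]) for s in S) < eps * (1 - alpha) / (2 * alpha):
            return V1, pol                   # step 3: stop
        # Step 2: remaining partial evaluation sweeps of the fixed policy
        V = V1
        for _ in range(n_eval - 1):
            V = {s: backup(V, s, pol[s]) for s in S}
```

With n_eval = 1 this reduces to value iteration, and as n_eval grows it approaches policy iteration, which is exactly the trade-off discussed above.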
4.6 Approximation Methods
For many interesting applications the state space S is too big for any of the algorithms discussed so far to be used. This is usually due to the curse of dimensionality, the phenomenon that the number of states grows exponentially in the number of dimensions of the state space. When the state space is too large, not only is the computational effort required by these algorithms excessive, but storing the value function and policy values for each state is impossible with current technology.
Recall that solving a dynamic program usually involves using (4.6) in the finite horizon case, or (4.20) in the infinite horizon case, to compute the optimal value function V* and a corresponding policy. To accomplish this, the following major computational tasks are performed.
1. Estimation of the optimal value function V*.