
Stochastic Optimization

Anton J. Kleywegt and Alexander Shapiro


August 24, 2003
School of Industrial and Systems Engineering,
Georgia Institute of Technology,
Atlanta, Georgia 30332-0205, USA
Contents

1 Introduction
2 Optimization under Uncertainty
  2.1 Example
  2.2 Summary of Approaches for Decision Making Under Uncertainty
    2.2.1 Worst-case Approaches
    2.2.2 The Stochastic Optimization Approach
    2.2.3 The Deterministic Optimization Approach
  2.3 Evaluation Criteria for the Stochastic Optimization Approach
  2.4 Estimation of Probability Distributions
  2.5 Example (continued)
3 Stochastic Programming
  3.1 Stochastic Programming with Recourse
  3.2 Sampling Methods
  3.3 Perturbation Analysis
  3.4 Likelihood Ratio Method
  3.5 Simulation Based Optimization Methods
4 Dynamic Programming
  4.1 Revenue Management Example
  4.2 Basic Concepts in Dynamic Programming
    4.2.1 Decision Times
    4.2.2 States
    4.2.3 Decisions
    4.2.4 Transition Probabilities
    4.2.5 Rewards and Costs
    4.2.6 Policies
    4.2.7 Examples
  4.3 Finite Horizon Dynamic Programs
    4.3.1 Optimality Results
    4.3.2 Structural Properties
  4.4 Infinite Horizon Dynamic Programs
  4.5 Infinite Horizon Discounted Dynamic Programs
    4.5.1 Optimality Results
    4.5.2 Algorithms
  4.6 Approximation Methods
1 Introduction
Decision makers often have to make decisions in the presence of uncertainty. Decision problems are often formulated as optimization problems, and thus in many situations decision makers wish to solve optimization problems that depend on parameters which are unknown. Typically it is quite difficult to formulate and solve such problems, both conceptually and numerically. The difficulty already starts at the conceptual stage of modeling. Often there are a variety of ways in which the uncertainty can be formalized. In the formulation of optimization problems, one usually attempts to find a good trade-off between the realism of the optimization model, which affects the usefulness and quality of the obtained decisions, and the tractability of the problem, so that it can be solved analytically or numerically. As a result of these considerations there are a large number of different approaches for formulating and solving optimization problems under uncertainty. It is impossible to give a complete survey of all such methods in one article. Therefore this article aims only to give a flavor of prominent approaches to optimization under uncertainty.
2 Optimization under Uncertainty
To describe some issues involved in optimization under uncertainty, we start with a static optimization problem. Suppose we want to maximize an objective function G(x, ω), where x denotes the decision to be made, X denotes the set of all feasible decisions, ω denotes an outcome that is unknown at the time the decision has to be made, and Ω denotes the set of all possible outcomes.
There are several approaches for dealing with optimization under uncertainty. In Section 2.1, some of
these approaches are illustrated in the context of an example.
2.1 Example
Example 2.1 (Newsvendor Problem) Many companies sell seasonal products, such as fashion articles,
airline seats, Christmas decorations, magazines and newspapers. These products are characterized by a
relatively short selling season, after which the value of the products decreases substantially. Often, a decision
has to be made how much of such a product to manufacture or purchase before the selling season starts.
Once the selling season has started, there is not enough time remaining in the season to change this decision
and implement the change, so that at this stage the quantity of the product is given. During the season the
decision maker may be able to make other types of decisions to pursue desirable results, such as to change
the price of the product as the season progresses and sales of the product take place. Such behavior is
familiar in many industries. Another characteristic of such a situation is that the decisions have to be made
before the eventual outcomes become known to the decision maker. For example, the decision maker has
to decide how much of the product to manufacture or purchase before the demand for the product becomes
known. Thus decisions have to be made without knowing which outcome will take place.
Suppose that a manager has to decide how much of a seasonal product to order. Thus the decision
variable x ∈ ℝ_+ is the order quantity. The cost of the product to the company is c per unit of the product.
During the selling season the product can be sold at a price (revenue) of r per unit of the product. After
the selling season any remaining product can be disposed of at a salvage value of s per unit of the product,
where typically s < r. The demand D for the product is unknown at the time the order decision x has to
be made. If the demand D turns out to be greater than the order quantity x, then the whole quantity x of
the product is sold during the season, and no product remains at the end of the season, so that the total
revenue turns out to be rx. If the demand D turns out to be less than the order quantity x, then quantity
D of the product is sold during the season, and the remaining amount of product at the end of the season
is x − D, so that the total revenue turns out to be rD + s(x − D). Thus the profit is given by

G(x, D) = \begin{cases} rD + s(x - D) - cx & \text{if } D \le x \\ rx - cx & \text{if } D > x \end{cases} \qquad (2.1)
The manager would like to choose x to maximize the profit G(x, D), but the dilemma is that D is unknown at the time the decision should be made. This problem is often called the newsvendor problem.
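As a concrete illustration, the profit function (2.1) is easy to code directly; the following Python sketch uses illustrative parameter values that are not taken from the text.

```python
def newsvendor_profit(x, d, r, c, s):
    """Profit G(x, D) of equation (2.1): order x units at unit cost c,
    sell min(x, d) at price r, salvage any leftover at s per unit."""
    if d <= x:
        return r * d + s * (x - d) - c * x
    return r * x - c * x

# Example with r = 10, c = 6, s = 2: ordering 80 units against a realized demand of 50
print(newsvendor_profit(80, 50, r=10, c=6, s=2))   # 500 + 60 - 480 = 80
```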
Note that if r ≤ c, then the company can make no profit from buying and selling the product, so that the optimal order quantity is x* = 0, irrespective of what the demand D turns out to be. Also, if s ≥ c, then any unsold product at the end of the season can be disposed of at a value at least equal to the cost of the product, so that it is optimal to order as much as possible, irrespective of what the demand D turns out to be. For the remainder of this example, we assume that s < c < r.
Under this assumption, for any given D ≥ 0, the function G(·, D) is a piecewise linear function with positive slope r − c for x < D and negative slope s − c for x > D. Therefore, if the demand D is known at the time the order decision has to be made, then the best decision is to choose order quantity x* = D.
However, if D is not known at the time the decision should be made, then the problem becomes more difficult. There are several approaches to decision making in the usual case where the demand is not known.
Sometimes a manager may want to hedge against the worst possible outcome. Suppose the manager thinks that the demand D will turn out to be some number in the interval [a, b] ⊂ ℝ_+, with a < b, i.e., the lower and upper bounds for the demand are known to the manager. In that case, in order to hedge against the worst possible scenario, the manager chooses the value of x that gives the best profit under the worst possible outcome. For any decision x, the worst possible outcome is given by
g_1(x) \equiv \min_{D \in [a,b]} G(x, D) = G(x, a) = \begin{cases} (r-s)a - (c-s)x & \text{if } a \le x \\ (r-c)x & \text{if } a > x \end{cases}

Because the manager wants to hedge against the worst possible outcome, the manager chooses the value of x that gives the best profit under the worst possible outcome, that is, the manager chooses the value of x that maximizes g_1(x), which is x_1 = a. Clearly, in many cases this will be an overly conservative decision.
Sometimes a manager may want to make the decision that under the worst possible outcome will still
appear as good as possible compared with what would have been the best decision with hindsight, that is
after the outcome becomes known. For any outcome of the demand D, let
g^*(D) \equiv \max_{x \in \mathbb{R}_+} G(x, D) = (r-c)D

denote the optimal profit with hindsight, also called the optimal value with perfect information. The optimal decision with perfect information, x* = D, is sometimes called the wait-and-see solution. Suppose the manager chose to order quantity x, so that the actual profit turned out to be G(x, D). The amount of profit that the company missed out on because of a suboptimal decision is given by g*(D) − G(x, D). This quantity,

A(x, D) \equiv g^*(D) - G(x, D) = \begin{cases} (c-s)(x-D) & \text{if } D \le x \\ (r-c)(D-x) & \text{if } D > x \end{cases}
is often called the absolute regret. The manager may want to choose the value of x that minimizes the
absolute regret under the worst possible outcome. For any decision x, the worst possible outcome is given
by
g_2(x) \equiv \max_{D \in [a,b]} A(x, D) = \max\{(c-s)(x-a),\ (r-c)(b-x)\} = \max\{A(x,a), A(x,b)\} = \begin{cases} (r-c)(b-x) & \text{if } x \le \frac{(c-s)a + (r-c)b}{r-s} \\ (c-s)(x-a) & \text{if } x > \frac{(c-s)a + (r-c)b}{r-s} \end{cases}

Because the manager wants to choose the value of x that minimizes the absolute regret under the worst possible outcome, the manager chooses the value of x that minimizes g_2(x), which is x_2 = [(c − s)a + (r − c)b]/(r − s). Note that x_2 is a convex combination of a and b, and thus a < x_2 < b. The larger the salvage loss per unit c − s, the closer x_2 is to a, and the larger the profit per unit r − c, the closer x_2 is to b. That seems to be a more reasonable decision than x_1 = a, but it will be shown that in many cases one can easily obtain a better solution than x_2.
A similar approach is to choose the value of x that minimizes the relative regret R(x, D) under the worst
possible outcome, where
R(x, D) \equiv \frac{g^*(D) - G(x, D)}{g^*(D)} = \begin{cases} \dfrac{(c-s)(x-D)}{(r-c)D} = \dfrac{c-s}{r-c}\left(\dfrac{x}{D} - 1\right) & \text{if } D \le x \\[2mm] \dfrac{(r-c)(D-x)}{(r-c)D} = 1 - \dfrac{x}{D} & \text{if } D > x \end{cases}
For any decision x, the worst possible outcome is given by
g_3(x) \equiv \max_{D \in [a,b]} R(x, D) = \max\left\{ \frac{c-s}{r-c}\left(\frac{x}{a} - 1\right),\ 1 - \frac{x}{b} \right\} = \max\{R(x,a), R(x,b)\} = \begin{cases} 1 - \dfrac{x}{b} & \text{if } x \le \dfrac{ab}{\frac{r-c}{r-s}a + \frac{c-s}{r-s}b} \\[3mm] \dfrac{c-s}{r-c}\left(\dfrac{x}{a} - 1\right) & \text{if } x > \dfrac{ab}{\frac{r-c}{r-s}a + \frac{c-s}{r-s}b} \end{cases}
The manager then chooses the value of x that minimizes g_3(x), which is x_3 = ab(r − s)/[(r − c)a + (c − s)b]. Note that [(r − c)a + (c − s)b]/(r − s) in the denominator of the expression for x_3 is a convex combination of a and b, and thus a < x_3 < b. Similar to x_2, the larger the salvage loss per unit c − s, the closer x_3 is to a, and the larger the profit per unit r − c, the closer x_3 is to b.
A related approach is to choose the value of x that maximizes the competitive ratio ρ(x, D) under the worst possible outcome, where

\rho(x, D) \equiv \frac{G(x, D)}{g^*(D)}

Because ρ(x, D) = 1 − R(x, D), maximizing the competitive ratio ρ(x, D) is equivalent to minimizing the relative regret R(x, D), so that this approach leads to the same solution x_3 as the previous approach.
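All three worst-case decisions derived above have closed forms, so they are straightforward to compute; the helper below is a small sketch (the function name and the sample numbers are illustrative, not from the text).

```python
def worst_case_solutions(a, b, r, c, s):
    """Closed-form worst-case decisions from Section 2.1 (assumes s < c < r and 0 <= a < b):
    x1 hedges against the worst outcome, x2 minimizes the worst absolute regret,
    x3 minimizes the worst relative regret (equivalently, maximizes the competitive ratio)."""
    x1 = a
    x2 = ((c - s) * a + (r - c) * b) / (r - s)
    x3 = a * b * (r - s) / ((r - c) * a + (c - s) * b)
    return x1, x2, x3

# Example: demand known to lie in [20, 100], with r = 10, c = 6, s = 2
print(worst_case_solutions(a=20, b=100, r=10, c=6, s=2))   # (20, 60.0, 33.33...)
```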
It was assumed in all the variants of the worst-case approach discussed above that no a priori information
about the demand D was available to the manager except the lower and upper bounds for the demand. In
some situations this may be a reasonable assumption and the worst-case approach could make sense if the
range of the demand is known and is not too large. However, in many applications the range of the
unknown quantities is not known with useful precision, and other information, such as information about
the probability distributions or sample data of the unknown quantities, may be available.
Another approach to decision making under uncertainty, different from the worst-case approaches de-
scribed above, is the stochastic optimization approach, which is the approach that most of this article is
focused on. Suppose that the demand D can be viewed as a random variable with a known, or at least
well estimated, probability distribution. The corresponding cumulative distribution function (cdf) F can be
estimated from historical data or by using a priori information available to the manager. Then one can try
to optimize the objective function on average, i.e. to maximize the expected profit

g(x) \equiv IE[G(x, D)] = \int_0^x [rw + s(x-w)]\, dF(w) + \int_x^\infty rx\, dF(w) - cx \qquad (2.2)
This optimization problem is easy to solve. For any D ≥ 0, the function G(x, D) is concave in x. Therefore, the expected value function g is also concave. First, suppose the demand D has a probability density function (pdf). Then

g'(x) = sF(x) + r(1 - F(x)) - c \qquad (2.3)

Recalling that g is concave, it follows that the expected profit g(x) is maximized where g'(x) = 0, that is at x*, where x* satisfies

F(x^*) = \frac{r-c}{r-s}

Because s < c < r, it follows that 0 < (r − c)/(r − s) < 1, so that a value of x* that satisfies F(x*) = (r − c)/(r − s) can always be found. If the demand D does not have a pdf, a similar result still holds. In general

x^* = F^{-1}\left(\frac{r-c}{r-s}\right)

where

F^{-1}(p) \equiv \min\{x : F(x) \ge p\}
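In other words, the stochastic optimization solution is the critical-fractile quantile of the demand distribution. A minimal Python sketch (the use of scipy.stats and the exponential demand with mean 100 are assumptions made for illustration):

```python
from scipy import stats

def critical_fractile_order(r, c, s, demand_dist):
    """Order quantity x* = F^{-1}((r-c)/(r-s)) for a demand distribution
    given as a frozen scipy.stats distribution."""
    beta = (r - c) / (r - s)          # critical ratio, in (0, 1) when s < c < r
    return demand_dist.ppf(beta)      # ppf is the quantile function F^{-1}

# Exponential demand with mean 100: x* = -ln((c-s)/(r-s)) * 100 = 100 ln 2, about 69.3
print(critical_fractile_order(r=10, c=6, s=2, demand_dist=stats.expon(scale=100)))
```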
Another point worth mentioning is that by solving (2.2), the manager tries to optimize the profit on average. However, the realized profit G(x*, D) could be very different from the corresponding expected value g(x*), depending on the particular realization of the demand D. This may happen if G(x*, D), considered as a random variable, has a large variability which could be measured by its variance Var[G(x*, D)]. Therefore, if the manager wants to hedge against such variability he may consider the following optimization problem

\max_{x \ge 0}\; g_\lambda(x) \equiv IE[G(x, D)] - \lambda\, Var[G(x, D)] \qquad (2.4)

The coefficient λ ≥ 0 represents the weight given to the conservative part of the decision. If λ is large, then the above optimization problem tries to find a solution with minimal profit variance, while if λ = 0, then problem (2.4) coincides with problem (2.2). Note that since the variance Var[G(x, D)] ≡ IE[(G(x, D) − IE[G(x, D)])^2] is itself an expected value, from a mathematical point of view problem (2.4) is similar to the expected value problem (2.2). Thus, the problem of optimizing the expected value of an objective function G(x, D) is very general: it could include the means, variances, quantiles, and almost any other aspects of random variables of interest.
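Since both terms of (2.4) are expectations, the objective can be estimated from demand samples; the sketch below is illustrative only (the weight lam and the demand model are assumptions, not values from the text).

```python
import numpy as np

def mean_variance_objective(x, demand_samples, r, c, s, lam):
    """Sample estimate of IE[G(x, D)] - lam * Var[G(x, D)] from (2.4),
    using the newsvendor profit G of (2.1)."""
    d = np.asarray(demand_samples, dtype=float)
    profit = np.where(d <= x, r * d + s * (x - d) - c * x, (r - c) * x)
    return profit.mean() - lam * profit.var()

rng = np.random.default_rng(0)
demand = rng.exponential(scale=100.0, size=10_000)   # illustrative demand model
print(mean_variance_objective(70.0, demand, r=10, c=6, s=2, lam=0.01))
```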
The following deterministic optimization approach is often used for decision making under uncertainty. The random variable D is replaced by its mean μ ≡ IE[D], and then the following deterministic optimization problem is solved.

\max_{x \in \mathbb{R}_+} G(x, \mu)

A resulting optimal solution x̄ is sometimes called an expected value solution. Of course, this approach requires that the mean of the random variable D be known to the decision maker. In the present example, the optimal solution of this deterministic optimization problem is x̄ = μ. Note that the two solutions, the (r − c)/(r − s)-quantile x* and the mean x̄, can be very different. Also, it is well known that the quantiles are much more stable to variations of the cdf F than the corresponding mean value. It typically happens that an optimal solution x* of the stochastic optimization problem is more robust with respect to variations of the probability distributions than an optimal solution x̄ of the corresponding deterministic optimization problem. Also note that, for any x, G(x, D) is concave in D. As a result it follows from Jensen's inequality that G(x, IE[D]) ≥ IE[G(x, D)], and thus the objective function of the deterministic optimization problem is biased upward relative to the objective function of the stochastic optimization problem, and the optimal value of the deterministic optimization problem is biased upward relative to the optimal value of the stochastic optimization problem, because \max_{x \in \mathbb{R}_+} G(x, IE[D]) \ge \max_{x \in \mathbb{R}_+} IE[G(x, D)].
One can also try to solve the optimization problem

\max_{x \in \mathbb{R}_+} G(x, D)

for particular realizations of D, and then take the expected value of the obtained solutions as the final solution. In the present example, for any realization D, the optimal solution of this problem is x = D, and hence the expected value of these solutions, and final solution, is μ = x̄. Note that in many optimization problems it may not make sense to take the expected value of the obtained solutions. This is usually the case in optimization problems with discrete solutions; for example, when a solution is a path in a network, there does not seem to be a useful way to take the average of several different paths.
2.2 Summary of Approaches for Decision Making Under Uncertainty
In this section we summarize several approaches often used for decision making under uncertainty, as intro-
duced in the example of Section 2.1.
2.2.1 Worst-case Approaches
Hedging Against the Worst-case Outcome The chosen decision x_1 is obtained by solving the following optimization problem.

\sup_{x \in X} \inf_{\omega \in \Omega} G(x, \omega)
Minimizing the Absolute Regret The chosen decision x_2 is obtained by solving the following optimization problem.

\inf_{x \in X} \sup_{\omega \in \Omega} \left[ g^*(\omega) - G(x, \omega) \right]
Minimizing the Relative Regret The chosen decision x_3 is obtained by solving the following optimization problem.

\inf_{x \in X} \sup_{\omega \in \Omega} \frac{g^*(\omega) - G(x, \omega)}{g^*(\omega)}

assuming g*(ω) > 0 for all ω ∈ Ω. An equivalent approach is to choose the solution x_3 that maximizes the competitive ratio, as given by the following optimization problem.

\sup_{x \in X} \inf_{\omega \in \Omega} \frac{G(x, \omega)}{g^*(\omega)}
2.2.2 The Stochastic Optimization Approach
The chosen decision x* is obtained by solving the following optimization problem.

\sup_{x \in X}\; g(x) \equiv IE[G(x, \omega)] \qquad (2.5)
2.2.3 The Deterministic Optimization Approach
The chosen decision x̄ is obtained by solving the following optimization problem.

\sup_{x \in X} G(x, IE[\omega]) \qquad (2.6)
2.3 Evaluation Criteria for the Stochastic Optimization Approach
Next we introduce some criteria that are useful for evaluating the stochastic optimization approach to decision
making under uncertainty. The optimal value with perfect information is given by
g^*(\omega) \equiv \sup_{x \in X} G(x, \omega)

Thus the expected value with perfect information is given by IE[g*(ω)]. Also, the expected value of an optimal solution x* of the stochastic optimization problem (2.5) is given by

g(x^*) = \sup_{x \in X} IE[G(x, \omega)]

Note that

g(x^*) = \sup_{x \in X} IE[G(x, \omega)] \le IE\left[\sup_{x \in X} G(x, \omega)\right] = IE[g^*(\omega)]

The difference, IE[g*(ω)] − g(x*) = IE[A(x*, ω)], is often called the value of perfect information.
It is also interesting to compare g(x*) with the value obtained from the deterministic optimization problem (2.6). The expected value of an optimal solution x̄ of the deterministic optimization problem is given by g(x̄) ≡ IE[G(x̄, ω)]. Note that

g(x^*) = \sup_{x \in X} IE[G(x, \omega)] \ge IE[G(\bar{x}, \omega)] = g(\bar{x})

The difference, g(x*) − g(x̄), is sometimes called the value of the stochastic solution.
2.4 Estimation of Probability Distributions
The stochastic optimization approach usually involves the assumption that the probability distribution of
the unknown outcome is known. However, in practice, the probability distribution is usually not known. One
way to deal with this situation is to estimate a distribution from data, assuming that the data is relevant
for the decision problem, and then to use the estimated distribution in the stochastic optimization problem.
There are several approaches to estimate probability distributions from data.
A simple and versatile estimate of a probability distribution is the empirical distribution. Suppose we want to estimate the cumulative distribution function (cdf) F of a random variable W, and we have a data set W_1, W_2, ..., W_k of k observations of W. Let N(w) denote the number of observations that have value less than or equal to w. Then the empirical cumulative distribution function is given by F̂_k(w) ≡ N(w)/k.
Let W_{1:k}, W_{2:k}, ..., W_{k:k} denote the order statistics of the k observations of W, that is, W_{1:k} is the smallest among W_1, W_2, ..., W_k; W_{2:k} is the second smallest among W_1, W_2, ..., W_k; ...; W_{k:k} is the largest among W_1, W_2, ..., W_k. Then, for any i ∈ {1, 2, ..., k}, and any p ∈ ((i − 1)/k, i/k], F̂_k^{-1}(p) = W_{i:k}. Also, assuming that W_1, W_2, ..., W_k are independent and identically distributed with cdf F, it follows that the cdf F_{i:k} of W_{i:k} is given by

F_{i:k}(w) = \sum_{j=i}^{k} \binom{k}{j} F(w)^j [1 - F(w)]^{k-j}

Further, if W has a probability density function (pdf) f, then it follows that the pdf f_{i:k} of W_{i:k} is given by

f_{i:k}(w) = i \binom{k}{i} f(w) F(w)^{i-1} [1 - F(w)]^{k-i}
Use of the empirical distribution, and its robustness, are illustrated in an example in Section 2.5.
If there is reason to believe that the random variable W follows a particular type of probability distribution, for example a normal distribution, with one or more unknown parameters, for example the mean μ and variance σ^2 of the normal distribution, then standard statistical techniques, such as maximum likelihood estimation, can be used to estimate the unknown parameters of the distribution from data. Also, in such a situation, a Bayesian approach can be used to estimate the unknown parameters of the distribution from data, to optimize an objective function that is related to that of the stochastic optimization problem. More details can be found in Berger (1985).
2.5 Example (continued)
In this section the use of the empirical distribution, and its robustness, are illustrated with the newsvendor
example of Section 2.1.
Suppose the manager does not know the probability distribution of the demand, but a data set D_1, D_2, ..., D_k of k independent and identically distributed observations of the demand D is available. As before, let D_{1:k}, D_{2:k}, ..., D_{k:k} denote the order statistics of the k observations of D. Using the empirical distribution F̂_k, the resulting decision rule is simple. If (r − c)/(r − s) ∈ ((i − 1)/k, i/k] for some i ∈ {1, 2, ..., k}, then

\hat{x} = \hat{F}_k^{-1}\left(\frac{r-c}{r-s}\right) = D_{i:k}

That is, the chosen order quantity x̂ is the ith smallest observation D_{i:k} of the demand.
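This decision rule amounts to taking a sample quantile of the observed demands. A short sketch (the data set below is made up for illustration):

```python
import numpy as np

def empirical_order_quantity(demand_data, r, c, s):
    """Order rule of Section 2.5: choose the i-th smallest observation D_{i:k},
    where (r-c)/(r-s) lies in ((i-1)/k, i/k]."""
    d = np.sort(np.asarray(demand_data, dtype=float))
    k = d.size
    beta = (r - c) / (r - s)
    i = int(np.ceil(beta * k))        # smallest i with beta <= i/k
    return d[i - 1]                   # D_{i:k} (1-based order statistic)

print(empirical_order_quantity([40, 95, 60, 120, 75, 55, 88, 102], r=10, c=6, s=2))   # 75.0
```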
To illustrate the robustness of the solution x̂ obtained with the empirical distribution, suppose that, unknown to the manager, the demand D is exponentially distributed with rate λ, that is, the mean demand is IE[D] = 1/λ. The objective function is given by

g(x) \equiv IE[G(x, D)] = \frac{r-s}{\lambda}\left(1 - e^{-\lambda x}\right) - (c-s)x
The pdf f_{i:k} of D_{i:k} is given by

f_{i:k}(w) = i \binom{k}{i} \lambda e^{-\lambda(k-i+1)w} \left(1 - e^{-\lambda w}\right)^{i-1} = i \binom{k}{i} \lambda \sum_{j=0}^{i-1} \binom{i-1}{j} (-1)^j e^{-\lambda(k-i+j+1)w}
The expected objective value of the chosen order quantity x̂ = D_{i:k} is given by (assuming that D_1, D_2, ..., D_k and D are i.i.d. exp(λ))

\begin{aligned}
IE[G(D_{i:k}, D)] &= IE\left[ \frac{r-s}{\lambda}\left(1 - e^{-\lambda D_{i:k}}\right) - (c-s)D_{i:k} \right] \\
&= \int_0^\infty \left[ \frac{r-s}{\lambda}\left(1 - e^{-\lambda w}\right) - (c-s)w \right] i \binom{k}{i} \lambda \sum_{j=0}^{i-1} \binom{i-1}{j} (-1)^j e^{-\lambda(k-i+j+1)w}\, dw \\
&= \left[ (r-s) \sum_{j=0}^{i} \binom{i}{j} (-1)^j \frac{1}{k-i+j+1} - (c-s) \sum_{j=0}^{i-1} \binom{i-1}{j} (-1)^j \frac{1}{(k-i+j+1)^2} \right] i \binom{k}{i}\, IE[D]
\end{aligned}
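The closed-form expression above can be checked numerically; the sketch below evaluates it and compares it with a brute-force Monte Carlo estimate. All parameter values are illustrative, and the two printed numbers should roughly agree up to sampling error.

```python
import numpy as np
from math import comb

def expected_profit_order_stat(i, k, r, c, s, mean_demand):
    """Closed-form IE[G(D_{i:k}, D)] for exponential demand, as derived above."""
    term1 = sum(comb(i, j) * (-1) ** j / (k - i + j + 1) for j in range(i + 1))
    term2 = sum(comb(i - 1, j) * (-1) ** j / (k - i + j + 1) ** 2 for j in range(i))
    return ((r - s) * term1 - (c - s) * term2) * i * comb(k, i) * mean_demand

def simulated_profit_order_stat(i, k, r, c, s, mean_demand, n=200_000, seed=0):
    """Monte Carlo check: draw k demands, order D_{i:k}, evaluate G against a fresh demand."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.exponential(mean_demand, size=(n, k)), axis=1)[:, i - 1]
    d = rng.exponential(mean_demand, size=n)
    profit = np.where(d <= x, r * d + s * (x - d) - c * x, (r - c) * x)
    return profit.mean()

print(expected_profit_order_stat(4, 8, r=10, c=6, s=2, mean_demand=100.0))
print(simulated_profit_order_stat(4, 8, r=10, c=6, s=2, mean_demand=100.0))
```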
Next we compare the objective values of several solutions, including the optimal value with perfect information, IE[G(D_{i:k}, D)], g(x*), and g(x̄). Recall that the optimal value with perfect information is given by

g^*(D) \equiv \max_{x \in \mathbb{R}_+} G(x, D) = (r-c)D

Thus the expected value with perfect information is given by

IE[g^*(D)] = (r-c)\, IE[D]
Also, the optimal solution x* of the stochastic optimization problem is given by

x^* = F^{-1}\left(\frac{r-c}{r-s}\right) = -\ln\left(\frac{c-s}{r-s}\right) IE[D]

and the optimal objective function value is given by

g(x^*) = \left[ (r-c) + (c-s) \ln\left(\frac{c-s}{r-s}\right) \right] IE[D]
Thus the value of perfect information is

IE[g^*(D)] - g(x^*) = -(c-s) \ln\left(\frac{c-s}{r-s}\right) IE[D] = -\frac{c-s}{r-s} \ln\left(\frac{c-s}{r-s}\right) (r-s)\, IE[D]
It is easy to obtain bounds on the value of perfect information. Consider the function h(y) ≡ y ln(y) for y > 0. Then h'(y) = ln(y) + 1 and h''(y) = 1/y > 0, because y > 0. Thus h is convex on (0, ∞), and h(y) attains a minimum of −1/e when y = 1/e. Also, lim_{y→0} h(y) = 0, and h(1) = 0. Hence the value of perfect information attains a minimum of zero when c = s and when c = r. This makes sense from previous results, since the optimal decisions when c ≤ s (x* as large as possible) or when c ≥ r (x* = 0) do not depend on information about the demand. Also, the value of perfect information attains a maximum of (r − s)IE[D]/e when (c − s)/(r − s) = 1/e, i.e., when the ratio of the profit per unit to the salvage loss per unit is (r − c)/(c − s) = e − 1.
The optimal solution x̄ of the deterministic optimization problem (2.6) is x̄ = IE[D]. The expected value of this solution is given by

g(\bar{x}) \equiv IE[G(\bar{x}, D)] = \left[ (r-c) - \frac{r-s}{e} \right] IE[D]

Hence the value of the stochastic solution is given by

g(x^*) - IE[G(\bar{x}, D)] = \left[ \frac{c-s}{r-s} \ln\left(\frac{c-s}{r-s}\right) + \frac{1}{e} \right] (r-s)\, IE[D]

It follows from the properties of h(y) ≡ y ln(y) that the value of the stochastic solution attains a minimum of zero when the value of perfect information attains a maximum, i.e., when (c − s)/(r − s) = 1/e. Also, the value of the stochastic solution attains a maximum of (r − s)IE[D]/e when the value of perfect information attains a minimum, i.e., when c = s and when c = r, that is, when using the expected demand IE[D] gives the poorest results.
Next we evaluate the optimality gaps of several solutions. Let β ≡ (r − c)/(r − s) ∈ ((i − 1)/k, i/k] for some i ∈ {1, 2, ..., k}. Then the optimality gap of the solution based on the empirical distribution is given by

\delta^e_k(\beta) \equiv \frac{g(x^*) - IE[G(D_{i:k}, D)]}{(r-s)\, IE[D]} = \beta + (1-\beta)\ln(1-\beta) - \left[ \sum_{j=0}^{i} \binom{i}{j} (-1)^j \frac{1}{k-i+j+1} - (1-\beta) \sum_{j=0}^{i-1} \binom{i-1}{j} (-1)^j \frac{1}{(k-i+j+1)^2} \right] i \binom{k}{i}
Note that the division by IE[D] can be interpreted as rescaling the product units so that IE[D] = 1, and the division by r − s can be interpreted as rescaling the money units so that r − s = 1. The optimality gap of the optimal solution x̄ of the deterministic optimization problem is given by

\delta^d(\beta) \equiv \frac{g(x^*) - g(\bar{x})}{(r-s)\, IE[D]} = (1-\beta)\ln(1-\beta) + \frac{1}{e}
To evaluate the worst-case solutions x_1, x_2, and x_3, suppose that the interval [a, b] is taken as [0, γ IE[D]] for some γ > 0. Then x_1 = a = 0, and thus g(x_1) = 0, and the optimality gap of the worst-case solution x_1 is given by

\delta_1(\beta) \equiv \frac{g(x^*) - g(x_1)}{(r-s)\, IE[D]} = \beta + (1-\beta)\ln(1-\beta)

Also, x_2 = [(c − s)a + (r − c)b]/(r − s) = (r − c)γ IE[D]/(r − s) = βγ IE[D], and thus

g(x_2) = \left[ \left(1 - e^{-\beta\gamma}\right) - (1-\beta)\beta\gamma \right] (r-s)\, IE[D]

and the optimality gap of the absolute regret solution x_2 is given by

\delta_2(\beta) \equiv \frac{g(x^*) - g(x_2)}{(r-s)\, IE[D]} = \beta + (1-\beta)\ln(1-\beta) - \left[ \left(1 - e^{-\beta\gamma}\right) - (1-\beta)\beta\gamma \right]

Also, x_3 = ab(r − s)/[(r − c)a + (c − s)b] = 0, and thus g(x_3) = 0, and the optimality gap of the relative regret solution x_3 is δ_3(β) = δ_1(β).
Figure 1: Optimality gaps δ^e_k(β) of the empirical approach for k = 2, 4, 8, 16, as well as the optimality gap δ^d(β) of the deterministic approach, as a function of the profit ratio (r − c)/(r − s). (Plot not reproduced; horizontal axis: profit ratio (r − c)/(r − s); vertical axis: optimality gap.)
Figure 1 shows the optimality gaps δ^e_k(β) for k = 2, 4, 8, 16, as well as the optimality gap δ^d(β), as a function of β. It can be seen that the empirical solutions x̂ tend to be more robust, in terms of the optimality gap, than the expected value solution x̄, even if the empirical distribution is based on a very small sample size k. Only in the region where β ≈ 1 − 1/e, i.e., where the value of the stochastic solution is small, does x̄ give a good solution. It should also be kept in mind that the solution x̄ is based on knowledge of the expected demand, whereas the empirical solutions do not require such knowledge, but the empirical solutions in turn require a data set of demand observations. Figure 2 shows the optimality gaps δ_1(β), δ_2(β), and δ_3(β) for γ = 1, 2, 3, 4, 5, as a function of β. Solutions x_1 and x_3 do not appear to be very robust. Also, only when γ is chosen to be close to 2 does the absolute regret solution x_2 appear to have robustness that compares well with the robustness of the empirical solutions. The performance of the absolute regret solution x_2 appears to be quite sensitive to the choice of γ. Furthermore, a decision maker is not likely to know what is the best choice of γ.
Figure 2: Optimality gaps δ_1(β), δ_2(β), and δ_3(β) of the worst-case approaches, for γ = 1, 2, 3, 4, 5, as a function of the profit ratio (r − c)/(r − s). (Plot not reproduced; horizontal axis: profit ratio (r − c)/(r − s); vertical axis: optimality gap.)
3 Stochastic Programming
The discussion of the above example motivates us to introduce the following model optimization problem, referred to as a stochastic programming problem,

\inf_{x \in X}\; g(x) \equiv IE[G(x, \omega)] \qquad (3.1)

(We consider a minimization rather than a maximization problem for the sake of notational convenience.) Here X ⊂ ℝ^n is a set of permissible values of the vector x of decision variables, and is referred to as the feasible set of problem (3.1). Often X is defined by a (finite) number of smooth (or even linear) constraints. In some other situations the set X is finite. In that case problem (3.1) is called a discrete stochastic optimization problem (this should not be confused with the case of discrete probability distributions). Variable ω represents random (or stochastic) aspects of the problem. Often ω can be modeled as a finite dimensional random vector, or in more involved cases as a random process. In the abstract framework we can view ω as an element of the probability space (Ω, F, P) with the known probability measure (distribution) P.
It is also possible to consider the following extensions of the basic problem (3.1).
One may need to optimize a function of the expected value function g(x). This happened, for example, in problem (2.4), where the manager wanted to optimize a linear combination of the expected value and the variance of the profit. Although important from a modeling point of view, such an extension usually does not introduce additional technical difficulties into the problem.
The feasible set can also be defined by constraints given in a form of expected value functions. For example, suppose that we want to optimize an objective function subject to the constraint that the event {h(x, W) ≥ 0}, where W is a random vector with a known probability distribution and h(·, ·) is a given function, should happen with a probability not bigger than a given number p ∈ (0, 1). The probability of this event can be represented as the expected value IE[Δ(x, W)], where

\Delta(x, w) \equiv \begin{cases} 1 & \text{if } h(x, w) \ge 0 \\ 0 & \text{if } h(x, w) < 0 \end{cases}

Therefore, this constraint can be written in the form IE[Δ(x, W)] ≤ p. Problems with such probabilistic constraints are called chance constrained problems. Note that even if the function h(·, ·) is continuous, the corresponding indicator function Δ(·, ·) is discontinuous unless it is identically equal to zero or one. Because of that, it may be technically difficult to handle such a problem.
In some cases the involved probability distribution P_θ depends on a parameter vector θ, whose components also represent decision variables. That is, the expected value objective function is given in the form

g(x, \theta) \equiv IE_\theta[G(x, \omega)] = \int_\Omega G(x, \omega)\, dP_\theta(\omega) \qquad (3.2)

By using a transformation it is sometimes possible to represent the above function g(·) as the expected value of a function, depending on x and θ, with respect to a probability distribution that is independent of θ. We shall discuss such likelihood ratio transformations in Section 3.4.
The above formulation of stochastic programs is somewhat too general and abstract. In order to proceed
with a useful analysis we need to identify particular classes of such problems that on one hand are interesting
from the point of view of applications and on the other hand are computationally tractable. In the following
sections we introduce several classes of such problems and discuss various techniques for their solution.
3.1 Stochastic Programming with Recourse
Consider again problem (2.2) of the newsvendor example. We may view that problem as a two-stage problem. At the first stage a decision should be made about the quantity x to order. At this stage the demand D is not known. At the second stage a realization of the demand D becomes known and, given the first stage decision x, the manager makes a decision about the quantities y and z to sell at prices r and s, respectively. Clearly the manager would like to choose y and z in such a way as to maximize the profit. It is possible to formulate the second stage problem as the simple linear program

\max_{y,z}\; ry + sz \quad \text{subject to} \quad y \le D,\; y + z \le x,\; y \ge 0,\; z \ge 0 \qquad (3.3)
The optimal solution of the above problem (3.3) is y* = min{x, D}, z* = max{x − D, 0}, and its optimal value is the profit G(x, D) defined in (2.1). Now at the first stage, before a realization of the demand D becomes known, the manager chooses a value for the first stage decision variable x by maximizing the expected value of the second stage optimal profit G(x, D).
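As a sanity check, the second-stage program (3.3) can be solved with an off-the-shelf LP solver and its optimal value compared with the closed form (2.1); the following sketch uses scipy.optimize.linprog with illustrative numbers.

```python
from scipy.optimize import linprog

def second_stage_profit(x, d, r, c, s):
    """Solve the second-stage LP (3.3) and return G(x, D) = (optimal revenue) - c*x."""
    obj = [-r, -s]                     # linprog minimizes, so negate the revenue r*y + s*z
    res = linprog(obj,
                  A_ub=[[1.0, 0.0],    # y <= d
                        [1.0, 1.0]],   # y + z <= x
                  b_ub=[d, x],
                  bounds=[(0, None), (0, None)])
    return -res.fun - c * x

# Should match (2.1): with r = 10, c = 6, s = 2, x = 80, D = 50 the profit is 80
print(second_stage_profit(80, 50, r=10, c=6, s=2))
```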
This is the basic idea of a two-stage stochastic program with recourse. At the first stage, before a realization ω of the random variables becomes known, one chooses the first stage decision variables x to optimize the expected value g(x) ≡ IE[G(x, ω)] of an objective function G(x, ω) that depends on the optimal second stage objective function Q(x, ξ(ω)).
A two-stage stochastic linear program with fixed recourse is a two-stage stochastic program with the form

\begin{aligned}
\min_x\;& c^T x + IE[Q(x, \xi)] \\
\text{s.t.}\;& Ax = b,\; x \ge 0
\end{aligned} \qquad (3.4)

where Q(x, ξ) is the optimal value of the second stage problem

\begin{aligned}
\min_y\;& q(\omega)^T y \\
\text{s.t.}\;& T(\omega)x + Wy = h(\omega),\; y \ge 0
\end{aligned} \qquad (3.5)
The second stage problem depends on the data ξ(ω) ≡ (q(ω), h(ω), T(ω)), elements of which can be random, while the matrix W is assumed to be known beforehand. The matrices T(ω) and W are called the technology and recourse matrices, respectively. The expectation IE[Q(x, ξ)] is taken with respect to the random vector ξ = ξ(ω), whose probability distribution is assumed to be known. The above formulation originated in the works of Dantzig (1955) and Beale (1955).
Note that the optimal solution y* = y*(ω) of the second stage problem (3.5) depends on the random data ξ = ξ(ω), and therefore is random. One can write Q(x, ξ(ω)) = q(ω)^T y*(ω).
The next question is how one can solve the above two-stage problem numerically. Suppose that the random data have a discrete distribution with a finite number K of possible realizations ξ_k = (q_k, h_k, T_k), k = 1, ..., K, (sometimes called scenarios), with the corresponding probabilities p_k. In that case IE[Q(x, \xi)] = \sum_{k=1}^K p_k Q(x, \xi_k), where

Q(x, \xi_k) = \min\left\{ q_k^T y_k : T_k x + W y_k = h_k,\; y_k \ge 0 \right\}
Therefore, the above two-stage problem can be formulated as one large linear program:

\begin{aligned}
\min\;& c^T x + \sum_{k=1}^K p_k q_k^T y_k \\
\text{s.t.}\;& Ax = b \\
& T_k x + W y_k = h_k \\
& x \ge 0,\; y_k \ge 0,\; k = 1, \ldots, K
\end{aligned} \qquad (3.6)
The linear program (3.6) has a certain block structure that makes it amenable to various decomposition
methods. One such decomposition method is the popular L-shaped method developed by Van Slyke and
Wets (1969). We refer the interested reader to the recent books by Kall and Wallace (1994) and Birge and
Louveaux (1997) for a thorough discussion of stochastic programming with recourse.
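To make the scenario formulation concrete, the sketch below assembles and solves a deterministic equivalent in the spirit of (3.6) for the newsvendor with a handful of demand scenarios. It uses scipy.optimize.linprog, states the constraints as inequalities for brevity, and all numbers are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def newsvendor_extensive_form(r, c, s, demands, probs):
    """Deterministic equivalent: variables (x, y_1, z_1, ..., y_K, z_K);
    scenario k sells y_k at r and salvages z_k at s, with demand demands[k]
    occurring with probability probs[k]."""
    K = len(demands)
    n = 1 + 2 * K
    obj = np.zeros(n)
    obj[0] = c                                    # first-stage ordering cost
    for k in range(K):
        obj[1 + 2 * k] = -probs[k] * r            # -p_k r y_k
        obj[2 + 2 * k] = -probs[k] * s            # -p_k s z_k
    A_ub, b_ub = [], []
    for k in range(K):
        row = np.zeros(n); row[1 + 2 * k] = 1.0   # y_k <= D_k
        A_ub.append(row); b_ub.append(demands[k])
        row = np.zeros(n)                         # y_k + z_k - x <= 0
        row[0], row[1 + 2 * k], row[2 + 2 * k] = -1.0, 1.0, 1.0
        A_ub.append(row); b_ub.append(0.0)
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * n)
    return res.x[0], -res.fun                     # optimal order quantity, expected profit

print(newsvendor_extensive_form(10, 6, 2, demands=[40, 80, 120], probs=[0.3, 0.4, 0.3]))
```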
The above numerical approach works reasonably well if the number K of scenarios is not too large. Suppose, however, that the random vector ξ has m independently distributed components, each having just 3 possible realizations. Then the total number of different scenarios is K = 3^m. That is, the number of scenarios grows exponentially fast in the number m of random variables. In that case, even for a moderate number of random variables, say m = 100, the number of scenarios becomes so large that even modern computers cannot cope with the required calculations. It seems that the only way to deal with such exponential growth of the number of scenarios is to use sampling. Such approaches are discussed in Section 3.2.
It may also happen that some of the decision variables at the first or second stage are integers, such as binary variables representing "yes" or "no" decisions. Such integer (or discrete) stochastic programs are especially difficult to solve, and only very moderate progress has been reported so far. A discussion of two-stage stochastic integer programs with recourse can be found in Birge and Louveaux (1997). A branch and bound approach for solving stochastic discrete optimization problems was suggested by Norkin, Pflug and Ruszczyński (1998). Schultz, Stougie and Van der Vlerk (1998) suggested an algebraic approach for solving stochastic programs with integer recourse by using a framework of Gröbner basis reductions. For a recent survey of mainly theoretical results on stochastic integer programming see Klein Haneveld and Van der Vlerk (1999).
Conceptually the idea of two-stage programming with recourse can be readily extended to multistage programming with recourse. Such an approach tries to model the situation where decisions are made periodically (in stages) based on currently known realizations of some of the random variables. An H-stage stochastic linear program with fixed recourse can be written in the form

\begin{aligned}
\min\;& c_1 x_1 + IE\Big[ \min c_2(\omega) x_2(\omega) + \cdots + IE[\min c_H(\omega) x_H(\omega)] \Big] \\
\text{s.t.}\;& W_1 x_1 = h_1 \\
& T_1(\omega) x_1 + W_2 x_2(\omega) = h_2(\omega) \\
& \qquad \vdots \\
& T_{H-1}(\omega) x_{H-1}(\omega) + W_H x_H(\omega) = h_H(\omega) \\
& x_1 \ge 0,\; x_2(\omega) \ge 0,\; \ldots,\; x_H(\omega) \ge 0
\end{aligned} \qquad (3.7)
The decision variables x_2(ω), ..., x_H(ω) are allowed to depend on the random data ω. However, the decision x_t(ω) at time t can only depend on the part of the random data that is known at time t (these restrictions are often called nonanticipativity constraints). The expectations are taken with respect to the distribution of the random variables whose realizations are not yet known.
Again, if the distribution of the random data is discrete with a finite number of possible realizations, then problem (3.7) can be written as one large linear program. However, it is clear that even for a small number of stages and a moderate number of random variables the total number of possible scenarios will be astronomical. Therefore, a current approach to such problems is to generate a reasonable number of scenarios and to solve the corresponding (deterministic) linear program, hoping to catch at least the flavor of the stochastic aspect of the problem. The argument is that the solution obtained in this way is more robust than the solution obtained by replacing the random variables with their means.
Often the same practical problem can be modeled in different ways. For instance, one can model a problem as a two-stage stochastic program with recourse, putting all random variables whose realizations are not yet known at the second stage of the problem. Then as realizations of some of the random variables become known, the solutions are periodically updated in a two-stage rolling horizon fashion, every time by solving an updated two-stage problem. Such an approach is different from a multistage program with
recourse, where every time a decision is to be made, the modeler tries to take into account that decisions
will be made at several stages in the future.
3.2 Sampling Methods
In this section we discuss a different approach that uses Monte Carlo sampling techniques to solve stochastic
optimization problems.
Example 3.1 Consider a stochastic process I_t, t = 1, 2, ..., governed by the recursive equation

I_t = [I_{t-1} + x_t - D_t]^+ \qquad (3.8)
with initial value I_0. Here D_t are random variables and x_t represent decision variables. (Note that [a]^+ ≡ max{a, 0}.) The above process I_t can describe the waiting time of the tth customer in a G/G/1 queue, where D_t is the interarrival time between the (t − 1)th and tth customers and x_t is the service time of the (t − 1)th customer. Alternatively, I_t may represent the inventory of a certain product at time t, with D_t and x_t representing the demand and production (or ordering) quantities, respectively, of the product at time t.
Suppose that the process is considered over a finite horizon with time periods t = 1, ..., T. Our goal is to minimize (or maximize) the expected value of an objective function involving I_1, ..., I_T. For instance, one may be interested in maximizing the expected value of a profit given by (Albritton, Shapiro and Spearman 1999)

G(x, W) \equiv \sum_{t=1}^{T} \left[ \pi_t \min[I_{t-1} + x_t, D_t] - h_t I_t \right] = \sum_{t=1}^{T} \pi_t x_t + \sum_{t=1}^{T-1} (\pi_{t+1} - \pi_t - h_t) I_t + \pi_1 I_0 - (\pi_T + h_T) I_T \qquad (3.9)
Here x = (x_1, ..., x_T) is a vector of decision variables, W = (D_1, ..., D_T) is a random vector of the demands at periods t = 1, ..., T, and π_t and h_t are nonnegative parameters representing the marginal profit and the holding cost, respectively, of the product at period t.
If the initial value I_0 is sufficiently large, then with probability close to one, the variables I_1, ..., I_T stay above zero. In that case I_1, ..., I_T become linear functions of the random data vector W, and hence components of the random vector W can be replaced by their means. However, in many practical situations the process I_t hits zero with high probability over the considered horizon T. In such cases the corresponding expected value function g(x) ≡ IE[G(x, W)] cannot be written in a closed form. One can use a Monte Carlo simulation procedure to evaluate g(x). Note that for any given realization of D_t, the corresponding values of I_t, and hence the value of G(x, W), can be easily calculated using the iterative formula (3.8).
That is, let W^i = (D^i_1, ..., D^i_T), i = 1, ..., N, be a random (or pseudorandom) sample of N independent realizations of the random vector W generated by computer, i.e., there are N generated realizations of the demand process D_t, t = 1, 2, ..., T, over the horizon T. Then for any given x the corresponding expected value g(x) can be approximated (estimated) by the sample average

\hat{g}_N(x) \equiv \frac{1}{N} \sum_{i=1}^{N} G(x, W^i) \qquad (3.10)

We have that IE[ĝ_N(x)] = g(x), and by the Law of Large Numbers, that ĝ_N(x) converges to g(x) with probability one (w.p.1) as N → ∞. That is, ĝ_N(x) is an unbiased and consistent estimator of g(x).
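The sample average (3.10) is straightforward to compute by simulating the recursion (3.8); the sketch below is illustrative (the horizon, the demand model, and the parameter values pi, h, and I0 are assumptions, not data from the text).

```python
import numpy as np

def profit_sample_path(x, d, pi, h, I0):
    """Evaluate G(x, W) of (3.9) for one demand realization d, using recursion (3.8)."""
    total, I = 0.0, I0
    for t in range(len(x)):
        sales = min(I + x[t], d[t])
        I = max(I + x[t] - d[t], 0.0)            # recursion (3.8)
        total += pi[t] * sales - h[t] * I
    return total

def sample_average(x, demand_sampler, N, pi, h, I0):
    """Sample-average estimate (3.10) of g(x) = IE[G(x, W)]."""
    return np.mean([profit_sample_path(x, demand_sampler(), pi, h, I0) for _ in range(N)])

# Illustration: T = 4 periods, exponential demand with mean 10
rng = np.random.default_rng(1)
T = 4
x, pi, h = [10.0] * T, [5.0] * T, [1.0] * T
print(sample_average(x, lambda: rng.exponential(10.0, size=T), N=10_000, pi=pi, h=h, I0=0.0))
```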
Any reasonably efficient method for optimizing the expected value function g(x), say by using its sample average approximations, is based on estimation of its first (and maybe second) order derivatives. This has an independent interest and is called sensitivity or perturbation analysis. We will discuss that in Section 3.3. Recall that ∇g(x) ≡ (∂g(x)/∂x_1, ..., ∂g(x)/∂x_T) is called the gradient vector of g(·) at x.
It is possible to consider a stationary distribution of the process I_t (if it exists), and to optimize the expected value of an objective function with respect to the stationary distribution. Typically, such a stationary distribution cannot be written in a closed form, and is difficult to compute accurately. This introduces additional technical difficulties into the problem. Also, in some situations the probability distribution of the random variables D_t is given in a parametric form whose parameters are decision variables. We will discuss dealing with such cases later.
3.3 Perturbation Analysis
Consider the expected value function g(x) ≡ IE[G(x, ω)]. An important question is under which conditions the first order derivatives of g(x) can be taken inside the expected value, that is, under which conditions the equation

\nabla g(x) \equiv \nabla IE[G(x, \omega)] = IE[\nabla_x G(x, \omega)] \qquad (3.11)

is correct. One reason why this question is important is the following. Let ω^1, ..., ω^N denote a random sample of N independent realizations of the random variable ω with common probability distribution P, and let

\hat{g}_N(x) \equiv \frac{1}{N} \sum_{i=1}^{N} G(x, \omega^i) \qquad (3.12)

be the corresponding sample average function. If the interchangeability equation (3.11) holds, then

IE[\nabla \hat{g}_N(x)] = \frac{1}{N} \sum_{i=1}^{N} IE[\nabla_x G(x, \omega^i)] = \frac{1}{N} \sum_{i=1}^{N} \nabla IE[G(x, \omega^i)] = \nabla g(x) \qquad (3.13)

and hence ∇ĝ_N(x) is an unbiased and consistent estimator of ∇g(x).
Let us observe that in both Examples 2.1 and 3.1 the function G(·, ω) is piecewise linear for any realization of ω, and hence is not everywhere differentiable. The same holds for the optimal value function Q(·, ω) of the second stage problem (3.5). If the distribution of the corresponding random variables is discrete, then the resulting expected value function is also piecewise linear, and hence is not everywhere differentiable. On the other hand expectation with respect to a continuous distribution typically smoothes the corresponding function and in such cases equation (3.11) often is applicable. It is possible to show that if the following two conditions hold at a point x, then g(·) is differentiable at x and equation (3.11) holds:
(i) The function G(·, ω) is differentiable at x w.p.1.
(ii) There exists a positive valued random variable K(ω) such that IE[K(ω)] is finite and the inequality

|G(x_1, \omega) - G(x_2, \omega)| \le K(\omega) \|x_1 - x_2\| \qquad (3.14)

holds w.p.1 for all x_1, x_2 in a neighborhood of x.
If the function G(·, ω) is not differentiable at x w.p.1 (i.e., for P-almost every ω), then the right hand side of equation (3.11) does not make sense. Therefore, clearly the above condition (i) is necessary for (3.11) to hold. Note that condition (i) requires G(·, ω) to be differentiable w.p.1 at the given (fixed) point x and does not require differentiability of G(·, ω) everywhere. The second condition (ii) requires G(·, ω) to be continuous (in fact Lipschitz continuous) w.p.1 in a neighborhood of x.
Consider, for instance, the function G(x, D) of Example 2.1 defined in (2.1). For any given D the function G(·, D) is piecewise linear and differentiable at every point x except at x = D. If the cdf F(·) of D is continuous at x, then the probability of the event {D = x} is zero, and hence the interchangeability equation (3.11) holds. Then ∂G(x, D)/∂x is equal to s − c if x > D, and is equal to r − c if x < D. Therefore, if F(·) is continuous at x, then g(·) is differentiable at x and

g'(x) = (s - c)\, IP(D < x) + (r - c)\, IP(D > x)

which gives the same equation as (2.3). Note that the function ∂G(·, D)/∂x is discontinuous at x = D. Therefore, the second order derivative of IE[G(·, D)] cannot be taken inside the expected value. Indeed, the second order derivative of G(·, D) is zero whenever it exists. Such behavior is typical in many interesting applications.
Let us calculate the derivatives of the process I_t, defined by the recursive equation (3.8), for a particular realization of the random variables D_t. Let τ_1 denote the first time that the process I_t hits zero, i.e., τ_1 is the first time that I_{τ_1 - 1} + x_{τ_1} − D_{τ_1} becomes less than or equal to zero, and hence I_{τ_1} = 0. Let τ_2 > τ_1 be the second time that I_t hits zero, etc. Note that if I_{τ_1 + 1} = 0, then τ_2 = τ_1 + 1, etc. Let 1 ≤ τ_1 < ··· < τ_n ≤ T be the sequence of hitting times. (In queueing terminology, τ_i represents the starting time of a new busy cycle of the corresponding queue.) For a given time t ∈ {1, ..., T}, let τ_{i−1} ≤ t < τ_i. Suppose that the events {I_{τ−1} + x_τ − D_τ = 0}, τ = 1, ..., T, occur with probability zero. Then, for almost every W, the gradient of I_s with respect to the components x_t of the vector x can be written as follows

\frac{\partial I_s}{\partial x_t} = \begin{cases} 1 & \text{if } t \le s < \tau_i \text{ and } t \ne \tau_{i-1} \\ 0 & \text{otherwise} \end{cases} \qquad (3.15)

Thus, by using equations (3.9) and (3.15), one can calculate the gradient of the sample average function ĝ_N(·) of Example 3.1, and hence one can consistently estimate the gradient of the expected value function g(·).
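A single-run implementation of this perturbation-analysis estimate might look as follows. The indexing is 0-based, the condition "t ≤ s < τ_i and t ≠ τ_{i−1}" is applied in the equivalent form "the inventory stays positive on [t, s]", and all numerical values are illustrative assumptions.

```python
import numpy as np

def ipa_gradient(x, d, pi, h, I0):
    """Single-path IPA estimate of the gradient of G(x, W) in (3.9) with respect to x,
    using (3.15): dI_s/dx_t = 1 if t <= s and the inventory stays positive on [t, s]."""
    T = len(x)
    I, level = np.empty(T), I0
    for t in range(T):
        level = max(level + x[t] - d[t], 0.0)     # recursion (3.8)
        I[t] = level
    grad = np.array(pi, dtype=float)              # from the sum of pi_t * x_t terms in (3.9)
    for t in range(T):
        for s in range(t, T):
            if I[s] == 0.0:                       # the path hit zero: later I_s do not depend on x_t
                break
            coef = (pi[s + 1] - pi[s] - h[s]) if s < T - 1 else -(pi[T - 1] + h[T - 1])
            grad[t] += coef                       # contribution of dI_s/dx_t = 1
    return grad

# Averaging ipa_gradient over many simulated demand paths gives a consistent
# estimate of the gradient of g(x) = IE[G(x, W)].
rng = np.random.default_rng(2)
T = 4
x, pi, h = [10.0] * T, [5.0] * T, [1.0] * T
grads = [ipa_gradient(x, rng.exponential(10.0, size=T), pi, h, I0=0.0) for _ in range(5000)]
print(np.mean(grads, axis=0))
```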
Consider the process I_t defined by the recursive equation (3.8) again. Suppose now that the variables x_t do not depend on t, and let x denote their common value. Suppose further that D_t, t = 1, ..., are independently and identically distributed with mean μ > 0. Then for x < μ the process I_t is stable and has a stationary (steady state) distribution. Let g(x) be the steady state mean (the expected value with respect to the stationary distribution) of the process I_t = I_t(x). By the theory of regenerative processes it follows that for every x ∈ (0, μ) and any realization (called sample path) of the process D_t, t = 1, ..., the long run average ḡ_T(x) ≡ \sum_{t=1}^T I_t(x)/T converges w.p.1 to g(x) as T → ∞. It is possible to show that ḡ'_T(x) also converges w.p.1 to g'(x) as T → ∞. That is, by differentiating the long run average of a sample path of the process I_t we obtain a consistent estimate of the corresponding derivative of the steady state mean g(x). Note that ∂I_t(x)/∂x = t − τ_{i−1} for τ_{i−1} ≤ t < τ_i, and hence the derivative of the long run average of a sample path of the process I_t can be easily calculated.
The idea of differentiation of a sample path of a process in order to estimate the corresponding derivative of the steady state mean function by a single simulation run is at the heart of the so-called infinitesimal perturbation analysis. We refer the interested reader to Glasserman (1991) and Ho and Cao (1991) for a thorough discussion of that topic.
3.4 Likelihood Ratio Method
The Monte Carlo sampling approach to derivative estimation introduced in Section 3.3 does not work if the function G(·, ω) is discontinuous or if the corresponding probability distribution also depends on decision variables. In this section we discuss an alternative approach to derivative estimation known as the likelihood ratio (or score function) method.
Suppose that the expected value function is given in the form g(θ) ≡ IE_θ[G(W)], where W is a random vector whose distribution depends on the parameter vector θ. Suppose further that the distribution of W has a probability density function (pdf) f(θ, w). Then for a chosen pdf ψ(w) we can write

IE_\theta[G(W)] = \int G(w) f(\theta, w)\, dw = \int G(w) \frac{f(\theta, w)}{\psi(w)} \psi(w)\, dw

and hence

g(\theta) = IE_\psi[G(Z) L(\theta, Z)] \qquad (3.16)

where L(θ, z) ≡ f(θ, z)/ψ(z) is the so-called likelihood ratio function, Z ~ ψ(·), and IE_ψ[·] means that the expectation is taken with respect to the pdf ψ. We assume in the definition of the likelihood ratio function that 0/0 = 0 and that the pdf ψ is such that if ψ(w) is zero for some w, then f(θ, w) is also zero, i.e., we do not divide a positive number by zero.
The expected value in the right hand side of (3.16) is taken with respect to the distribution ψ, which does not depend on the vector θ. Therefore, under appropriate conditions ensuring interchangeability of the differentiation and integration operators, we can write

\nabla g(\theta) = IE_\psi[G(Z) \nabla_\theta L(\theta, Z)] \qquad (3.17)

In particular, if for a given θ_0 we choose ψ(·) ≡ f(θ_0, ·), then ∇_θ L(θ, z) = ∇_θ f(θ, z)/f(θ_0, z), and hence ∇_θ L(θ_0, z) = ∇_θ ln[f(θ_0, z)]. The function ∇_θ ln[f(θ, z)] is called the score function, which motivates the name of this technique.
Now by generating a random sample Z^1, ..., Z^N from the pdf ψ(·), one can estimate g(θ) and ∇g(θ) by the respective sample averages

\hat{g}_N(\theta) \equiv \frac{1}{N} \sum_{i=1}^{N} G(Z^i) L(\theta, Z^i) \qquad (3.18)

\nabla \hat{g}_N(\theta) \equiv \frac{1}{N} \sum_{i=1}^{N} G(Z^i) \nabla_\theta L(\theta, Z^i) \qquad (3.19)
This can be readily extended to situations where the function G(x, W) also depends on decision variables x. Typically, the density functions used in applications depend on the decision variables in a smooth and even analytic way. Therefore, usually there is no problem in taking derivatives inside the expected value in the right hand side of (3.16). When applicable, the likelihood ratio method often also allows estimation of second and higher order derivatives. However, note that the likelihood ratio method is notoriously unstable and a bad choice of the pdf ψ may result in huge variances of the corresponding estimators. This should not be surprising since the likelihood ratio function may involve divisions by very small numbers, which of course is a very unstable procedure. We refer to Glynn (1990) and Rubinstein and Shapiro (1993) for a further discussion of the likelihood ratio method.
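A minimal one-dimensional sketch of the score-function estimator (3.19): take W exponentially distributed with rate θ, so that ∇_θ ln f(θ, w) = 1/θ − w, and choose ψ = f(θ_0, ·). The test function G and the parameter value below are illustrative assumptions.

```python
import numpy as np

def score_function_gradient(G, theta0, N=100_000, seed=3):
    """Estimate d/dtheta IE_theta[G(W)] at theta0 for W ~ Exponential(rate=theta),
    using (3.19) with psi = f(theta0, .): average G(Z) * (1/theta0 - Z)."""
    rng = np.random.default_rng(seed)
    Z = rng.exponential(1.0 / theta0, size=N)   # numpy parametrizes the exponential by its mean
    return np.mean(G(Z) * (1.0 / theta0 - Z))

# Check on G(w) = w: IE_theta[W] = 1/theta, so the true derivative is -1/theta^2
theta0 = 2.0
print(score_function_gradient(lambda w: w, theta0), -1.0 / theta0 ** 2)
```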
As an example consider the optimal value function of the second stage problem (3.5). Suppose that only the right hand side vector h = h(ω) of the second stage problem is random. Then Q(x, h) = G(h − Tx), where G(χ) ≡ min{ q^T y : Wy = χ, y ≥ 0 }. Suppose that the random vector h has a pdf f(·). By using the transformation z = h − Tx we obtain

IE_f[Q(x, h)] = \int G(\chi - Tx) f(\chi)\, d\chi = \int G(z) f(z + Tx)\, dz = IE_\psi[G(Z) L(x, Z)] \qquad (3.20)

Here ψ is a chosen pdf, Z is a random vector having pdf ψ, and L(x, z) ≡ f(z + Tx)/ψ(z) is the corresponding likelihood ratio function. It can be shown by duality arguments of linear programming that G(·) is a piecewise linear convex function. Therefore, ∇_x Q(x, h) is piecewise constant and discontinuous, and hence second order derivatives of IE_f[Q(x, h)] cannot be taken inside the expected value. On the other hand, the likelihood ratio function is as smooth as the pdf f(·). Therefore, if f(·) is twice differentiable, then the second order derivatives can be taken inside the expected value in the right hand side of (3.20), and consequently the second order derivatives of IE_f[Q(x, h)] can be consistently estimated by a sample average.
3.5 Simulation Based Optimization Methods
There are basically two approaches to the numerical solution of stochastic optimization problems by using
Monte Carlo sampling techniques. One approach is known as the stochastic approximation method and
originated in Robbins and Monro (1951). The other method was discovered and rediscovered by different
researchers and is known under various names.
Suppose that the feasible set X is convex and that at any point x ∈ X an estimate γ(x) of the gradient ∇g(x) can be computed, say by a Monte Carlo simulation method. The stochastic approximation method generates the iterates by the recursive equation

x_{\nu+1} = \Pi_X\big(x_\nu - \alpha_\nu\, \gamma(x_\nu)\big) \qquad (3.21)

where α_ν > 0 are chosen step sizes and Π_X denotes the projection onto the set X, i.e., Π_X(x) is the point in X closest to x. Under certain regularity conditions the iterates x_ν converge to a locally optimal solution of the corresponding stochastic optimization problem, i.e., to a local minimizer x* of g(x) over X. Typically, in order to guarantee this convergence the following two conditions are imposed on the step sizes: (i) \sum_{\nu=1}^{\infty} \alpha_\nu = \infty, and (ii) \sum_{\nu=1}^{\infty} \alpha_\nu^2 < \infty. For example, one can take α_ν ≡ c/ν for some c > 0.
If the exact value

g(x

) of the gradient is known, then

gives the direction of steepest descent


at the point x

. This guarantees that if

,= 0, then moving along the direction

the value of the objective


function decreases, i.e., g(x

) < g(x

) for > 0 small enough. The iterative procedure (3.21) tries


to mimic that idea by using the estimates (x

) of the corresponding true gradients. The projection


X
is needed in order to enforce feasibility of the generated iterates. If the problem is unconstrained, i.e., the
feasible set A coincides with the whole space, then this projection is the identity mapping and can be omitted
from (3.21). Note that (x

) does not need to be an accurate estimator of g(x

).
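The following Python sketch illustrates the recursion (3.21) on a small synthetic problem; the objective g(x) = IE[ ||x − W||^2 ] with W standard normal, the single-sample gradient estimate 2(x − W), the box feasible set, and the step size constant c are all hypothetical choices, made only so that the true minimizer over the box is known to be the origin.

    import numpy as np

    # Sketch of the stochastic approximation recursion (3.21) with step sizes
    # alpha_nu = c/nu and projection onto a box, for the hypothetical objective
    # g(x) = E[ ||x - W||^2 ] with W ~ N(0, I), whose minimizer over the box is 0.
    rng = np.random.default_rng(1)
    d = 3
    lo, hi = -1.0, 2.0

    def project(x):
        # Projection onto the box: the closest point of the feasible set.
        return np.clip(x, lo, hi)

    def gradient_estimate(x):
        # Unbiased single-sample estimate of grad g(x) = 2*x.
        W = rng.standard_normal(d)
        return 2.0 * (x - W)

    x = np.full(d, 2.0)
    c = 0.5
    for nu in range(1, 10_001):
        alpha = c / nu
        x = project(x - alpha * gradient_estimate(x))
    print(x)   # close to the true minimizer (0, 0, 0)

The step sizes c/ν satisfy conditions (i) and (ii), and the sensitivity to the choice of the constant c discussed below is easy to observe by rerunning the sketch with much larger or smaller values of c.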
Kushner and Clark (1978) and Benveniste, Metivier and Priouret (1990) contain expositions of the theory of stochastic approximation. Applications of the stochastic approximation method, combined with the infinitesimal perturbation analysis technique for gradient estimation, to the optimization of the steady state means of single server queues were studied by Chong and Ramadge (1992) and L'Ecuyer and Glynn (1994).
An attractive feature of the stochastic approximation method is its simplicity and ease of implementation in those cases in which the projection Π_𝒳(·) can be easily computed. However, it also has severe shortcomings. The crucial question in implementations is the choice of the step sizes α_ν. Small step sizes result in very slow progress towards the optimum while large step sizes make the iterates zigzag. Also, a few wrong steps in the beginning of the procedure may require many iterations to correct. For instance, the algorithm is extremely sensitive to the choice of the constant c in the step size rule α_ν = c/ν. Therefore, various step size rules were suggested in which the step sizes are chosen adaptively (see Ruppert (1991) for a discussion of that topic).

Another drawback of the stochastic approximation method is that it lacks good stopping criteria and often has difficulties with handling even relatively simple linear constraints.
Another simulation based approach to stochastic optimization is based on the following idea. Let ĝ_N(x) be the sample average function defined in (3.12), based on a sample of size N. Consider the optimization problem

    min_{x∈𝒳} ĝ_N(x)                                                                   (3.22)

We can view the above problem as the sample average approximation of the true (or expected value) problem (3.1). The function ĝ_N(x) is random in the sense that it depends on the corresponding sample. However, note that once the sample is generated, ĝ_N(x) becomes a deterministic function whose values and derivatives can be computed for a given value of the argument x. Consequently, problem (3.22) becomes a deterministic optimization problem and one can solve it with an appropriate deterministic optimization algorithm.

Let v̂_N and x̂_N denote the optimal objective value and an optimal solution of the sample average problem (3.22), respectively. By the Law of Large Numbers we have that ĝ_N(x) converges to g(x) w.p.1 as N → ∞. It is possible to show that under mild additional conditions, v̂_N and x̂_N converge w.p.1 to the optimal objective value and an optimal solution of the true problem (3.1), respectively. That is, v̂_N and x̂_N are consistent estimators of their true counterparts.
This approach to the numerical solution of stochastic optimization problems is a natural outgrowth of the Monte Carlo method of estimation of the expected value of a random function. The method is known by various names and it is difficult to point out who was the first to suggest this approach. In the recent literature a variant of this method, based on the likelihood ratio estimator ĝ_N(x), was suggested in Rubinstein and Shapiro (1990) under the name stochastic counterpart method (also see Rubinstein and Shapiro (1993) for a thorough discussion of such a likelihood ratio-sample approximation approach). In Robinson (1996) such an approach is called the sample path method. This idea can also be applied to cases in which the set 𝒳 is finite, i.e., to stochastic discrete optimization problems (Kleywegt and Shapiro 1999).
Of course, in a practical implementation of such a method one has to choose a specific algorithm for solving the sample average approximation problem (3.22). For example, in the unconstrained case one can use the steepest descent method. That is, iterates are computed by the procedure

    x_{ν+1} = x_ν − α_ν ∇ĝ_N(x_ν)                                                      (3.23)

where the step size α_ν is obtained by a line search, e.g., α_ν ∈ arg min_{α≥0} ĝ_N(x_ν − α ∇ĝ_N(x_ν)). Note that this procedure is different from the stochastic approximation method (3.21) in two respects. Typically a reasonably large sample size N is used in this procedure, and, more importantly, the step sizes are calculated by a line search instead of being defined a priori. In many interesting cases ĝ_N(x) is a piecewise smooth (and even piecewise linear) function and the feasible set is defined by linear constraints. In such cases bundle type optimization algorithms are quite efficient (see Hiriart-Urruty and Lemarechal (1993) for a discussion of the bundle method).
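The following Python sketch, with a hypothetical smooth objective G(x, W) = ||x − W||^2 and W standard normal (so that the SAA minimizer is simply the sample mean), illustrates the sample average approximation (3.22) solved by the steepest descent iteration (3.23), with a backtracking line search standing in for the exact minimization over α.

    import numpy as np

    # Sketch of the sample average approximation (3.22) solved by steepest descent
    # with a line search (3.23).  The objective G(x, W) = ||x - W||^2 with
    # W ~ N(0, I) is a hypothetical choice; the SAA minimizer is the sample mean.
    rng = np.random.default_rng(2)
    d, N = 3, 1000
    W = rng.standard_normal((N, d))          # generate the sample once and fix it

    def g_N(x):
        return np.mean(np.sum((x - W) ** 2, axis=1))

    def grad_g_N(x):
        return 2.0 * (x - W.mean(axis=0))

    def line_search(x, direction, alpha=1.0, shrink=0.5, c1=1e-4):
        # Backtracking (Armijo) approximation of argmin over alpha >= 0 of
        # g_N(x + alpha * direction).
        slope = np.dot(grad_g_N(x), direction)
        while g_N(x + alpha * direction) > g_N(x) + c1 * alpha * slope and alpha > 1e-12:
            alpha *= shrink
        return alpha

    x = np.full(d, 5.0)
    for _ in range(50):
        direction = -grad_g_N(x)
        if np.linalg.norm(direction) < 1e-8:
            break
        x = x + line_search(x, direction) * direction
    print(x, W.mean(axis=0))   # the SAA solution coincides with the sample mean here

Once the sample is fixed, everything in this loop is deterministic, which is exactly the point of the approach: any suitable deterministic optimization algorithm could replace the steepest descent iteration.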
A well developed statistical inference of the estimators v̂_N and x̂_N exists (Rubinstein and Shapiro 1993). That inference aids in the construction of stopping rules, validation analysis and error bounds for obtained solutions, and, furthermore, suggests variance reduction methods that may substantially enhance the rate of convergence of the numerical procedure. For a discussion of this topic and an application to two-stage stochastic programming with recourse we refer to Shapiro and Homem-de-Mello (1998).

If the function g(x) is twice differentiable, then the above sample path method produces estimators that converge to an optimal solution of the true problem at the same asymptotic rate as the stochastic approximation method, provided that the stochastic approximation method is applied with the asymptotically optimal step sizes (Shapiro 1996). On the other hand, if the underlying probability distribution is discrete and g(x) is piecewise linear and convex, then w.p.1 the sample path method provides an exact optimal solution of the true problem for N large enough, and moreover the probability of that event approaches one exponentially fast as N → ∞ (Shapiro and Homem-de-Mello 1999).
4 Dynamic Programming
Dynamic programming (DP) is an approach for modeling dynamic and stochastic decision problems, for analyzing the structural properties of these problems, and for solving them. Dynamic programs are also referred to as Markov decision processes (MDP). Slight distinctions are sometimes made between DP and MDP; for example, for some deterministic problems the term dynamic programming is used rather than Markov decision processes. The term stochastic optimal control is also often used for these types of problems. We shall use these terms synonymously.
Dynamic programs and multistage stochastic programs deal with essentially the same types of problems,
namely dynamic and stochastic decision problems. The major distinction between dynamic programming
and stochastic programming is in the structures that are used to formulate the models. For example, in DP,
the so-called state of the process, as well as the value function, which depends on the state, are two structures
that play a central role, but these concepts are usually not used in stochastic programs. Section 4.2 provides
an introduction to concepts that are important in dynamic programming.
Much has been written about dynamic programming. Some books in this area are Bellman (1957),
Bellman (1961), Bellman and Dreyfus (1962), Nemhauser (1966), Hinderer (1970), Bertsekas and Shreve
(1978), Denardo (1982), Ross (1983), Puterman (1994), Bertsekas (1995), and Sennott (1999).
The dynamic programming modeling concepts presented in this article are illustrated with an example,
which is both a multiperiod extension of the single period newsvendor example of Sections 2.1 and 2.5 and an example of a dynamic pricing problem. The example is called a revenue management problem,
and is described in Section 4.1.
4.1 Revenue Management Example
Example 4.1 Managers often have to make decisions repeatedly over time regarding how much inventory
to obtain for future sales, as well as how to determine the selling prices. This may involve inventory of
one or more products, and the inventory may be located at one or more locations, such as warehouses
and retail stores. The inventory may be obtained from a production operation that is part of the same
company as the decision maker, and such a production operation may be a manufacturing operation or a
service operation, such as an airline, hotel, or car rental company, or the inventory may be purchased from
independent suppliers. The decision maker may also have the option to move inventory between locations,
such as from warehouses to retail stores. Often the prices of the products can be varied over time to attempt
to find the most favorable balance between the supply of the products and the dynamically evolving demand for the products. Such a decision maker can have several objectives, such as to maximize the expected profit over the long run. The profit involves both revenue, which is affected by the pricing decisions, and cost, which is affected by the inventory replenishment decisions.
In Section 4.2 examples are given of the formulation of such a revenue management problem with a single
product at a single location as a dynamic program.
4.2 Basic Concepts in Dynamic Programming
In this section the basic concepts used in dynamic programming models are introduced.
4.2.1 Decision Times
Decisions can be made at different points in time, and a dynamic programming model should distinguish between the decisions made at different points in time. The major reason why this distinction is important is that the information available to the decision maker differs across points in time; typically more information is available at later points in time (in fact, many people hold this to be the definition of time).

A second reason why distinguishing decision points is useful is that for many types of DP models it facilitates the computation of solutions. This seems to be the major reason why dynamic programming is used for deterministic decision problems. In this context, the time parameter in the model does not need to correspond to the notion of time in the application. The important feature is that a solution is decomposed into a sequence of distinct decisions. This facilitates computation of the solution if it is easier to compute the individual decisions and then put them together to form a solution, than it is to compute a solution in a more direct way.
The following are examples of ways in which the decision points can be determined in a DP model.
- Decisions can be made at predetermined discrete points in time. In the revenue management example, the decision maker may make a decision once per day regarding what prices to set during the day, as well as how much to order on that day.
- Decisions can be made continuously in time. In the revenue management example, the decision maker may change prices continuously in time (which is likely to require a sophisticated way of communicating the continuously changing prices).
- Decisions can be made at random points in time when specific events take place. In the revenue management example, the decision maker may decide on prices at the random points in time when customer requests are received, and may decide whether to order and how much to order at the random points in time when the inventory changes.
A well-formulated DP model specifies the way in which the decision points in time are determined. Most of the results presented in this article are for DP models where decisions are made at predetermined discrete points in time, denoted by t = 0, 1, . . . , T, where T denotes the length of the time horizon. DP models with infinite time horizons are also considered. DP models such as these are often called discrete time DP models.
4.2.2 States
A fundamental concept in DP is that of a state, denoted by s. The set 𝒮 of all possible states is called the state space. The decision problem is often described as a controlled stochastic process that occupies a state S(t) at each point in time t.

Describing the stochastic process for a given decision problem is an exercise in modeling. The modeler has to determine an appropriate choice of state description for the problem. The basic idea is that the state should be a sufficient, and efficient, summary of the available information that affects the future of the stochastic process. For example, for the revenue management problem, choosing the state to be the amount of the product in inventory may be an appropriate choice. If there is a cost involved in changing the price, then the previous price should also form part of the state. Also, if competitors' prices affect the demand for the product, then additional information about competitors' prices and behavior should be included in the state.

Several considerations should be taken into account when choosing the state description, some of which are described in more detail in later sections. A brief overview is as follows. The state should be a sufficient summary of the available information that affects the future of the stochastic process in the following sense. The state at a point in time should not contain information that is not available to the decision maker at that time, because the decision is based on the state at that point in time. (There are also problems, called partially observed Markov decision processes, in which what is also called the state contains information that is not available to the decision maker. These problems are often handled by converting them to Markov decision processes with observable states. This topic is discussed in Bertsekas (1995).) The set of feasible decisions at a point in time should depend only on the state at that point in time, and maybe on the time itself, and not on any additional information. Also, the costs and transition probabilities at a point in time should depend only on the state at that point in time, the decision made at that point in time, and maybe on the time itself, and not on any additional information. Another consideration is that one often would like the number of states to be as small as possible, since the computational effort of many algorithms increases with the size of the state space. However, the number of states is not the only factor that affects the computational effort. Sometimes it may be more efficient to choose a state description that leads to a larger state space. In this sense the state should be an efficient summary of the available information.

The state space 𝒮 can be a finite, countably infinite, or uncountable set. This article addresses mostly dynamic programs with finite or countably infinite, also called discrete, state spaces 𝒮.
4.2.3 Decisions
At each decision point in time, the decision maker has to choose a decision, also called an action or control. At any point in time t, the state s at time t, and the time t, should be sufficient to determine the set 𝒜(s, t) of feasible decisions, that is, no additional information is needed to determine the admissible decisions. (Note that the definition of the state of the process should be chosen in such a way that this holds for the decision problem under consideration.) Sometimes the set of feasible decisions depends only on the current state s, in which case the set of feasible decisions is denoted by 𝒜(s). Although most examples have finite sets 𝒜(s, t) or 𝒜(s), these sets may also be countably or uncountably infinite.

In the revenue management example, the decisions involve how much of the product to order, as well as how to set the price. Thus decision a = (q, r) denotes that quantity q is ordered, and that the price is set at r. Suppose the supplier requires that an integer amount between a and b be ordered at a time. Also suppose that the state s denotes the current inventory, and that the inventory may not exceed capacity Q at any time. Then the order quantity may be no more than Q − s. Also suppose that the price can be set to be any real number between r_1 and r_2. Then the set of feasible decisions is 𝒜(s) = {a, a + 1, a + 2, . . . , min{Q − s, b}} × [r_1, r_2].

The decision maker may randomly select a decision. For example, the decision maker may roll a die and base the decision on the outcome of the die roll. This type of decision is called a randomized decision, as opposed to a nonrandomized, or deterministic, decision. A randomized decision for state s at time t can be represented by a probability distribution on 𝒜(s, t) or 𝒜(s). The decision at time t is denoted by A(t).
4.2.4 Transition Probabilities
The dynamic process changes from state to state over time. The transitions between states may be deterministic or random. The presentation here is for a dynamic program with discrete time parameter t = 0, 1, . . . , and with random transitions.

The transitions have a memoryless, or Markovian, property, in the following sense. Given the history H(t) ≡ (S(0), A(0), S(1), A(1), . . . , S(t)) of the process up to time t, as well as the decision A(t) ∈ 𝒜(S(t), t) at time t, the probability distribution of the state that the process is in at time t + 1 depends only on S(t), A(t), and t, that is, the additional information in the history H(t) of the process up to time t provides no additional information for the probability distribution of the state at time t + 1. (Note that the definition of the state of the process should be chosen in such a way that the probability distribution has this memoryless property.)

Such memoryless random transitions can be represented in several ways. One representation is by transition probabilities. For problems with discrete state spaces, the transition probabilities are denoted by p[s′ | s, a, t] ≡ IP[S(t + 1) = s′ | H(t), S(t) = s, A(t) = a]. For problems with uncountable state spaces, the transition probabilities are denoted by p[B | s, a, t] ≡ IP[S(t + 1) ∈ B | H(t), S(t) = s, A(t) = a], where B is a subset of states. Another representation is by a transition function f, such that given H(t), S(t) = s, and A(t) = a, the state at time t + 1 is S(t + 1) = f(s, a, t, ω), where ω is a random variable with a known probability distribution. The two representations are equivalent, and in this article we use mostly transition probabilities. When the transition probabilities do not depend on the time t besides depending on the state s and decision a at time t, they are denoted by p[s′ | s, a].

In the revenue management example, suppose the demand has probability mass function p(r, d) ≡ IP[D = d | price = r] with d ∈ ZZ_+. Also suppose that a quantity q that is ordered at time t is received before time t + 1, and that unsatisfied demand is backordered. Then 𝒮 = ZZ, and the transition probabilities are as follows:

    p[s′ | s, (q, r)] = p(r, s + q − s′)    if s′ ≤ s + q
                      = 0                   if s′ > s + q

If a quantity q that is ordered at time t is received after the demand at time t, and unsatisfied demand is lost, then 𝒮 = ZZ_+, and the transition probabilities are as follows:

    p[s′ | s, (q, r)] = p(r, s + q − s′)     if q < s′ ≤ s + q
                      = Σ_{d=s}^∞ p(r, d)    if s′ = q
                      = 0                    if s′ < q or s′ > s + q
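The backorder case above can be made concrete with a small Python sketch; the demand distribution is assumed, purely for illustration, to be Poisson with mean 10/r, so that a higher price means lower expected demand.

    from math import exp, factorial

    # Sketch of the backorder-case transition probabilities p[s' | s, (q, r)] above,
    # with the hypothetical demand pmf p(r, d) = Poisson(10/r).
    def demand_pmf(r, d):
        mean = 10.0 / r
        return exp(-mean) * mean ** d / factorial(d)

    def transition_prob(s_next, s, q, r):
        # Next state is s' = s + q - D, so s' <= s + q and D = s + q - s'.
        if s_next > s + q:
            return 0.0
        return demand_pmf(r, s + q - s_next)

    # Example: inventory 2, order 3, price 2.5; probability of ending the period at 1.
    print(transition_prob(1, s=2, q=3, r=2.5))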
4.2.5 Rewards and Costs
Dynamic decision problems often have as objective to maximize the sum of the rewards obtained in each time period, or equivalently, to minimize the sum of the costs incurred in each time period. Other types of objectives sometimes encountered are to maximize or minimize the product of a sequence of numbers resulting from a sequence of decisions, or to maximize or minimize the maximum or minimum of a sequence of resulting numbers.

In this article we focus mainly on the objective of maximizing the expected sum of the rewards obtained in each time period. At any point in time t, the state s at time t, the decision a ∈ 𝒜(s, t) at time t, and the time t, should be sufficient to determine the expected reward r(s, a, t) at time t. (Again, the definition of the state should be chosen so that this holds for the decision problem under consideration.) When the rewards do not depend on the time t besides depending on the state s and decision a at time t, they are denoted by r(s, a).

Note that, even if in the application the reward r̃(s, a, t, s′) at time t depends on the state s′ at time t + 1, in addition to the state s and decision a at time t, and the time t, the expected reward at time t can still be found as a function of only s, a, and t, because

    r(s, a, t) = IE[ r̃(s, a, t, s′) ] = Σ_{s′∈𝒮} r̃(s, a, t, s′) p[s′ | s, a, t]    if 𝒮 is discrete
                                      = ∫_𝒮 r̃(s, a, t, s′) p[ds′ | s, a, t]        if 𝒮 is uncountable

In the revenue management example, suppose unsatisfied demand is backordered, and that an inventory cost/shortage penalty of h(s) is incurred when the inventory level is s at the beginning of the time period. Then r̃(s, (q, r′), s′) = r′(s + q − s′) − h(s) with s′ ≤ s + q. Thus

    r(s, (q, r′)) = Σ_{d=0}^∞ p(r′, d) r′ d − h(s)

If unsatisfied demand is lost, then r̃(s, (q, r′), s′) = r′(s + q − s′) − h(s) with q ≤ s′ ≤ s + q. Thus

    r(s, (q, r′)) = Σ_{d=0}^{s−1} p(r′, d) r′ d + Σ_{d=s}^∞ p(r′, d) r′ s − h(s)
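The lost sales expression above can be evaluated numerically as in the following sketch; the Poisson demand pmf and the quadratic inventory cost h(s) = 0.1 s^2 are hypothetical choices, and the infinite sum is truncated at a large demand value.

    from math import exp, factorial

    # Sketch of the expected single stage reward r(s, (q, r')) for the lost sales
    # case, with hypothetical demand pmf (Poisson with mean 10/price) and
    # inventory cost h(s) = 0.1*s**2; the tail of the sum is truncated at d_max.
    def demand_pmf(price, d):
        mean = 10.0 / price
        return exp(-mean) * mean ** d / factorial(d)

    def h(s):
        return 0.1 * s ** 2

    def expected_reward_lost_sales(s, price, d_max=60):
        # min(d, s) covers both sums: revenue r'*d when d < s and r'*s when d >= s.
        revenue = sum(demand_pmf(price, d) * price * min(d, s) for d in range(d_max + 1))
        return revenue - h(s)

    print(expected_reward_lost_sales(s=5, price=2.5))

Note that, just as in the formula, the order quantity q does not appear: in the lost sales case the order arrives after the demand, so it affects the transition but not the current period's revenue.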
In finite horizon problems, there may be a salvage value v(s) if the process terminates in state s at the end of the time horizon T. Such a feature can be incorporated in the previous notation, by letting 𝒜(s, T) = {0}, and r(s, 0, T) = v(s) for all s ∈ 𝒮.

Often the rewards are discounted with a discount factor α ∈ [0, 1], so that the discounted expected value of the reward at time t is α^t r(s, a, t). Such a feature can again be incorporated in the previous notation, by letting r(s, a, t) = α^t r̃(s, a, t) for all s, a, and t, where r̃ denotes the undiscounted reward function. When the undiscounted reward does not depend on time, it is convenient to explicitly denote the discounted reward by α^t r(s, a).
4.2.6 Policies
A policy, sometimes called a strategy, prescribes the way a decision is to be made at each point in time, given the information available to the decision maker at that point in time. Therefore, a policy is a solution for a dynamic program.

There are different classes of policies of interest, depending on which of the available information the decisions are based on. A policy can base decisions on all the information in the history of the process up to the time the decision is to be made. Such policies are called history dependent policies. Given the memoryless nature of the transition probabilities, as well as the fact that the sets of feasible decisions and the expected rewards depend on the history of the process only through the current state, it seems intuitive that it should be sufficient to consider policies that base decisions only on the current state and time, and not on any additional information in the history of the process. Such policies are called memoryless, or Markovian, policies. If the transition probabilities, sets of feasible decisions, and rewards do not depend on the current time, then it also seems intuitive that it should be sufficient to consider policies that base decisions only on the current state, and not on any additional information in the history of the process or on the current time. (However, this intuition may be wrong, as shown by counterexamples in Section 4.2.7.) Under such policies decisions are made in the same way each time the process is in the same state. Such policies are called stationary policies.
The decision maker may also choose to use some irrelevant information to make a decision. For example,
the decision maker may roll a die, or draw a card from a deck of cards, and then base the decision on the
outcome of the die roll or the drawn card. In other words, the decision maker may randomly select a decision.
Policies that allow such randomized decisions are called randomized policies, and policies that do not allow
randomized decisions are called nonrandomized or deterministic policies.
Combining the above types of information that policies can base decisions on, the following types of policies are obtained: the class Π^HR of history dependent randomized policies, the class Π^HD of history dependent deterministic policies, the class Π^MR of memoryless randomized policies, the class Π^MD of memoryless deterministic policies, the class Π^SR of stationary randomized policies, and the class Π^SD of stationary deterministic policies. The classes of policies are related as follows: Π^SD ⊂ Π^MD ⊂ Π^HD ⊂ Π^HR, Π^SD ⊂ Π^MD ⊂ Π^MR ⊂ Π^HR, and Π^SD ⊂ Π^SR ⊂ Π^MR ⊂ Π^HR.
For the revenue management problem, an example of a stationary deterministic policy is to order quantity q = s_2 − s if the inventory level s < s_1, for chosen constants s_1 ≤ s_2, and to set the price at level r = r(s) for a chosen function r(s) of the current state s. An example of a stationary randomized policy is to set the price at level r = r_1(s) with probability p_1(s) and at level r = r_2(s) with probability 1 − p_1(s) for chosen functions r_1(s), r_2(s), and p_1(s) of the current state s. An example of a memoryless deterministic policy is to order quantity q = s_2(t) − s if the inventory level s < s_1(t), for chosen functions s_1(t) ≤ s_2(t) of the current time t, and to set the price at level r = r(s, t) for a chosen function r(s, t) of the current state s and time t.
Policies are functions, defined as follows. Let ℋ(t) denote the set of all histories (S(0), A(0), S(1), A(1), . . . , S(t)) up to time t, and let ℋ ≡ ∪_{t=0}^∞ ℋ(t) denote the set of all histories. Let 𝒜 ≡ ∪_{s∈𝒮} ∪_{t=0}^∞ 𝒜(s, t) denote the set of all feasible decisions. Let 𝒫(s, t) denote the set of probability distributions on 𝒜(s, t) (satisfying regularity conditions), and let 𝒫 ≡ ∪_{s∈𝒮} ∪_{t=0}^∞ 𝒫(s, t) denote the set of all such probability distributions. Then Π^HR is the set of functions π : ℋ → 𝒫, such that for any t, and any history H(t), π(H(t)) ∈ 𝒫(S(t), t) (again regularity conditions may be required). Π^HD is the set of functions π : ℋ → 𝒜, such that for any t, and any history H(t), π(H(t)) ∈ 𝒜(S(t), t). Π^MR is the set of functions π : 𝒮 × ZZ_+ → 𝒫, such that for any state s ∈ 𝒮, and any time t ∈ ZZ_+, π(s, t) ∈ 𝒫(s, t). Π^MD is the set of functions π : 𝒮 × ZZ_+ → 𝒜, such that for any state s ∈ 𝒮, and any time t ∈ ZZ_+, π(s, t) ∈ 𝒜(s, t). Π^SR is the set of functions π : 𝒮 → 𝒫, such that for any state s ∈ 𝒮, π(s) ∈ 𝒫(s). Π^SD is the set of functions π : 𝒮 → 𝒜, such that for any state s ∈ 𝒮, π(s) ∈ 𝒜(s).
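As a small illustration of these classes, the following sketch represents one policy from each of Π^SD, Π^SR, and Π^MD for the revenue management example as Python functions; the thresholds s1, s2 and the price rules are hypothetical and are not prescriptions from the text.

    import random

    s1, s2 = 3, 10          # hypothetical reorder point and order-up-to level

    def stationary_deterministic(s):
        # A policy in Pi^SD: maps the current state to a single decision (q, r).
        q = s2 - s if s < s1 else 0
        return (q, 2.5)

    def stationary_randomized(s):
        # A policy in Pi^SR: maps the current state to a distribution over
        # decisions; here a decision is sampled from that distribution.
        q = s2 - s if s < s1 else 0
        r = 2.0 if random.random() < 0.5 else 3.0
        return (q, r)

    def memoryless_deterministic(s, t):
        # A policy in Pi^MD: may depend on the current time as well as the state.
        q = s2 - s if s < s1 + (t % 2) else 0
        return (q, 2.5 + 0.1 * t)

    print(stationary_deterministic(1), memoryless_deterministic(1, 4))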
4.2.7 Examples
In this section a number of examples are presented that illustrate why it is sometimes desirable to consider
more general classes of policies, such as memoryless and/or randomized policies instead of stationary deterministic policies, even if the sets of feasible decisions, transition probabilities, and rewards are stationary.
The examples may also be found in Ross (1970), Ross (1983), Puterman (1994), and Sennott (1999).
The examples are for dynamic programs with stationary input data and objective to minimize the long-run average cost per unit time, limsup_{T→∞} IE[ Σ_{t=0}^{T−1} r(S(t), A(t)) | S(0) ]/T. For any policy π, let

    V_π(s) ≡ limsup_{T→∞} (1/T) IE_π[ Σ_{t=0}^{T−1} r(S(t), A(t)) | S(0) = s ]

denote the long-run average cost per unit time under policy π if the process starts in state s, where IE_π[·] denotes the expected value if policy π is followed.

A policy π* is called optimal if V_{π*}(s) = inf_{π∈Π^HR} V_π(s) for all states s.
Example 4.2 It is clear that if some feasible sets 𝒜(s) are infinite, for example 𝒜(s) = (0, 1), then an optimal policy may not exist. The first example shows that an optimal policy may not exist, even if 𝒜(s) is finite for all states s.

The state space 𝒮 = {1, 1′, 2, 2′, 3, 3′, . . .}. Feasible decision sets are 𝒜(i) = {a, b}, and 𝒜(i′) = {a}, for each i ∈ {1, 2, 3, . . .}. Transitions are deterministic; from a state i we either go up to state i + 1 if we make decision a, or we go across to state i′ if we make decision b. Once in a state i′, we remain in state i′. That is, the transition function is f(i, a) = i + 1, f(i, b) = i′, and f(i′, a) = i′. In a state i a cost of 1 is incurred, and in a state i′ a cost of 1/i is incurred. That is, the costs are r(i, a) = r(i, b) = 1, and r(i′, a) = 1/i.

Suppose the process starts in state 1. The idea is simple: we would like to go to a high state i, before moving over to state i′. However, a policy π that chooses decision a for each state i, has long-run average cost per unit time of V_π(1) = 1, which is as bad as can be. The only other possibility is that there exists a state j such that policy π chooses decision b with positive probability p_j when the process reaches state j. In that case V_π(1) ≥ p_j/j > 0. Thus V_π(1) > 0 for all policies π.

The stationary deterministic policy π_j that chooses decision a for states i = 1, 2, . . . , j − 1, and chooses decision b for state j, has long-run average cost per unit time of V_{π_j}(1) = 1/j. By choosing j arbitrarily large, V_{π_j}(1) can be made arbitrarily close to zero, but no policy has long-run average cost per unit time V_π(1) less than or equal to zero. Thus an optimal policy π*, with V_{π*}(1) = 0, does not exist. However, for any policy π, there exists a stationary deterministic policy π_j such that V_{π_j}(1) < V_π(1).
Example 4.3 The second example shows that it is not always the case that for any policy π, there exists a stationary deterministic policy that is at least as good as π.

The state space 𝒮 = {1, 2, 3, . . .}. Feasible decision sets are 𝒜(i) = {a, b} for each i ∈ 𝒮. Transitions are deterministic; from a state i we either remain in state i if we make decision a, or we go up to state i + 1 if we make decision b. That is, the transition function is f(i, a) = i, and f(i, b) = i + 1. When decision a is made in a state i, a cost of 1/i is incurred, and when decision b is made, a cost of 1 is incurred. That is, the costs are r(i, a) = 1/i, and r(i, b) = 1.

Suppose the process starts in state 1. Again, the idea is simple: we would like to go to a high state i, and then make decision a. However, a stationary deterministic policy π that chooses decision b for each state i, has long-run average cost per unit time of V_π(1) = 1, which is as bad as can be. The only other possibility for a stationary deterministic policy is to choose decision a for the first time in state j. In that case V_π(1) = 1/j > 0. Thus V_π(1) > 0 for all stationary deterministic policies π.

Consider the memoryless deterministic policy π* that chooses decision a the first i times that the process is in state i, and then chooses decision b. Thus the sequence of states under policy π* is 1, 1, 2, 2, 2, 3, 3, 3, 3, . . . . The sequence of decisions under policy π* is a, b, a, a, b, a, a, a, b, . . . . The sequence of costs under policy π* is 1, 1, 1/2, 1/2, 1, 1/3, 1/3, 1/3, 1, . . . . Note that the total cost incurred while the process is in state i is 2 for each i, so that the total cost incurred from the start of the process until the process leaves state i is 2i. The total time until the process leaves state i is 2 + 3 + · · · + (i + 1) = i(i + 3)/2. Thus the average cost per unit time if the process is currently in state i + 1 is less than 2(i + 1)/(i(i + 3)/2), which becomes arbitrarily small as i becomes large. Thus V_{π*}(1) = 0, and the memoryless deterministic policy π* is better than any stationary deterministic policy.

However, there exists a stationary randomized policy π′ with the same expected long-run average cost per unit time as policy π*. When in state i, π′ chooses decision a with probability i/(i + 1) and decision b with probability 1/(i + 1). The expected amount of time that the process under π′ spends in state i is i + 1, and the expected cost incurred while the process under π′ is in state i is 2. Thus the cost incurred under π′ is similar to the cost incurred under π*, and it can be shown that V_{π′}(1) = 0.
Example 4.4 The third example shows that it is not always the case that for any policy π, there exists a stationary randomized policy that is at least as good as π.

The state space 𝒮 = {0, 1, 1′, 2, 2′, 3, 3′, . . .}. Feasible decision sets are 𝒜(0) = {a}, 𝒜(i) = {a, b}, and 𝒜(i′) = {a} for each i ∈ {1, 2, 3, . . .}. When in state 0, a cost of 1 is incurred, otherwise there is no cost. That is, the costs are r(0, a) = 1, and r(i, a) = r(i, b) = r(i′, a) = 0. In this example transitions are random, with transition probabilities

    p[i | 0, a] = p[i′ | 0, a] = (3/2)(1/4)^i
    p[0 | i, a] = p[i + 1 | i, a] = 1/2
    p[0 | i, b] = 1 − p[i′ | i, b] = (1/2)^i
    p[0 | i′, a] = 1 − p[i′ | i′, a] = (1/2)^i

Again, the idea is simple: we would like to visit state 0 as infrequently as possible. Thus we would like the process to move to a high state i or i′, where the probability of making a transition to state 0 can be made small. However, to move to a high state requires decision a to be made, which involves a high risk of moving to state 0. The policy that always makes decision a is as bad as possible.

Let M^π_{i0} denote the mean time for the process to move from state i to state 0 under stationary policy π. Thus V_π(0) = 1/M^π_{00}.

First consider the stationary deterministic policy π_j that chooses decision a for states i = 1, 2, . . . , j − 1, and chooses decision b for states i = j, j + 1, . . . . Then for i = 1, 2, . . . , j − 1, M^{π_j}_{i0} = 2 + 2^i − (1/2)^{j−i−1}. From this it follows that M^{π_j}_{00} < 5 and thus V_{π_j}(0) > 1/5 for all j.

Next consider any stationary randomized policy π, and let π(i, a) denote the probability that decision a is made in state i. Then, given that the process is in state i, the probability is π(i, a)π(i + 1, a) · · · π(j − 1, a)π(j, b) that the process under policy π behaves the same until state 0 is reached as under policy π_j. Thus

    M^π_{i0} = Σ_{j=i}^∞ ( π(j, b) Π_{k=i}^{j−1} π(k, a) ) M^{π_j}_{i0} + 2 Π_{k=i}^∞ π(k, a)
             < (2 + 2^i) ( Σ_{j=i}^∞ π(j, b) Π_{k=i}^{j−1} π(k, a) + Π_{k=i}^∞ π(k, a) )
             = 2 + 2^i

From this it follows that M^π_{00} < 5 and thus V_π(0) > 1/5 for any stationary randomized policy π.

Consider the memoryless deterministic policy π* that uses the decisions of π_1 for t = 1, 2, . . . , T_1, π_2 for t = T_1 + 1, T_1 + 2, . . . , T_2, . . . , π_j for t = T_{j−1} + 1, T_{j−1} + 2, . . . , T_j, . . . . For an appropriate choice of the T_j's it follows that V_{π*}(0) = 1/5, and thus the memoryless deterministic policy π* is better than any stationary randomized policy.
Example 4.5 In all the examples presented so far, it is the case that for any policy π, and any ε > 0, there exists a stationary deterministic policy π′ that has value function V_{π′} within ε of the value function V_π. The fourth example shows that this does not always hold.

The state space 𝒮 = {0, 1, 1′, 2, 2′, 3, 3′, . . .}. Feasible decision sets are 𝒜(0) = {a}, 𝒜(i) = {a, b}, and 𝒜(i′) = {a} for each i ∈ {1, 2, 3, . . .}. When in state i ∈ {0, 1, 2, . . .}, a cost of 2 is incurred, otherwise there is no cost. That is, the costs are r(0, a) = 2, r(i, a) = r(i, b) = 2, and r(i′, a) = 0. The transition probabilities are as follows:

    p[0 | 0, a] = 1
    p[i + 1 | i, a] = 1
    p[i′ | i, b] = 1 − p[0 | i, b] = p_i
    p[(i − 1)′ | i′, a] = 1 for all i ≥ 2
    p[1 | 1′, a] = 1

The values p_i can be chosen to satisfy p_i < 1 for all i and Π_{i=1}^∞ p_i = 3/4.

Suppose the process starts in state 1. Again, the idea is simple: we would like to go down the chain i′, (i − 1)′, . . . , 1′ as much as possible. To do that, we also need to go up the chain 1, 2, . . . , i, and then go from state i to state i′ by making decision b. When we make decision b in state i, there is a risk 1 − p_i > 0 of making a transition to state 0, which is very bad.

A stationary deterministic policy π that chooses decision a for each state i, has long-run average cost per unit time of V_π(1) = 2, which is as bad as can be. The only other possibility for a stationary deterministic policy is to choose decision b for the first time in state j. In that case, each time state j is visited, there is a positive probability 1 − p_j > 0 of making a transition to state 0. It follows that the mean time until a transition to state 0 is made is less than 2j/(1 − p_j) < ∞, and the long-run average cost per unit time is V_π(1) = 2. Thus V_π(1) = 2 for all stationary deterministic policies π.

Consider the memoryless deterministic policy π̂ that, on its jth visit to state 1, chooses decision a, j − 1 times and then chooses decision b. With probability Π_{i=1}^∞ p_i = 3/4 the process never makes a transition to state 0, and the long-run average cost per unit time is 1. Otherwise, with probability 1 − Π_{i=1}^∞ p_i = 1/4, the process makes a transition to state 0, and the long-run average cost per unit time is 2. Hence, the expected long-run average cost per unit time is V_{π̂}(1) = 3/4 · 1 + 1/4 · 2 = 5/4. Thus, there is no ε-optimal stationary deterministic policy for ε ∈ (0, 3/4). In fact, by considering memoryless deterministic policies π_k that on their jth visit to state 1, choose decision a, j + k times and then choose decision b, one obtains policies with expected long-run average cost per unit time V_{π_k}(1) arbitrarily close to 1 for sufficiently large values of k. It is clear that V_π(1) ≥ 1 for all policies π, and thus V*(1) = 1, and there is no ε-optimal stationary deterministic policy for ε ∈ (0, 1).
4.3 Finite Horizon Dynamic Programs
In this section we investigate dynamic programming models for optimization problems with the form

    max_{(A(0),A(1),. . . ,A(T))} IE[ Σ_{t=0}^T r(S(t), A(t), t) ]                          (4.1)

where T < ∞ is the known finite horizon length, and decisions A(t), t = 0, 1, . . . , T, have to be feasible and may depend only on the information available to the decision maker at each time t, that is the history H(t) of the process up to time t, and possibly some randomization. For the presentation we assume that 𝒮 is countable and r is bounded. Similar results hold in more general cases, subject to regularity conditions.
4.3.1 Optimality Results
For any policy π ∈ Π^HR, and any history h(t) ∈ ℋ(t), let

    U_π(h(t)) ≡ IE_π[ Σ_{τ=t}^T r(S(τ), A(τ), τ) | H(t) = h(t) ]                            (4.2)

denote the expected value under policy π from time t onwards, given the history h(t) of the process up to time t; U_π is called the value function under policy π. The optimal value function U* is given by

    U*(h(t)) ≡ sup_{π∈Π^HR} U_π(h(t))                                                       (4.3)

It follows from r being bounded that U_π and U* are bounded. A policy π* ∈ Π^HR is called optimal if U_{π*}(h(t)) = U*(h(t)) for all h(t) ∈ ℋ(t) and all t ∈ {0, 1, . . . , T}. Also, a policy π_ε ∈ Π^HR is called ε-optimal if U_{π_ε}(h(t)) + ε > U*(h(t)) for all h(t) ∈ ℋ(t) and all t ∈ {0, 1, . . . , T}.

It is easy to see that the value function U_π satisfies the following inductive equation for any π ∈ Π^HR and any history h(t) = (h(t − 1), a(t − 1), s):

    U_π(h(t)) = IE_π[ r(s, π(h(t)), t) + U_π(H(t + 1)) | H(t) = h(t) ]                      (4.4)

Using (4.4), U_π can be computed inductively; this is called the finite horizon policy evaluation algorithm. This result is also used to establish the result that U* satisfies the following optimality equation for all histories h(t) = (h(t − 1), a(t − 1), s):

    U*(h(t)) = sup_{a∈𝒜(s,t)} { r(s, a, t) + IE[ U*(H(t + 1)) | H(t) = h(t), A(t) = a ] }   (4.5)

From the memoryless properties of the feasible sets, transition probabilities, and rewards, it is intuitive that U*(h(t)) should depend on h(t) = (h(t − 1), a(t − 1), s) only through the state s at time t and the time t, and that it should be sufficient to consider memoryless policies. To establish these results, inductively define the memoryless function V* along the lines of the optimality equation (4.5) for U*:

    V*(s, T + 1) ≡ 0
    V*(s, t) ≡ sup_{a∈𝒜(s,t)} { r(s, a, t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a ] },   t = T, T − 1, . . . , 1, 0      (4.6)

Then it is easy to show, again by induction, that for any history h(t) = (h(t − 1), a(t − 1), s), U*(h(t)) = V*(s, t).
For any memoryless policy π ∈ Π^MR, inductively define the function

    V_π(s, T + 1) ≡ 0
    V_π(s, t) ≡ IE_π[ r(s, π(s, t), t) + V_π(S(t + 1), t + 1) | S(t) = s ],   t = T, T − 1, . . . , 1, 0      (4.7)

Then, for any history h(t) = (h(t − 1), a(t − 1), s), U_π(h(t)) = V_π(s, t), that is, V_π is the (simpler) value function of policy π ∈ Π^MR.
In a similar way it can be shown that it is sufficient to consider only memoryless deterministic policies, in the following sense. First suppose that for each s ∈ 𝒮 and each t ∈ {0, 1, . . . , T}, there exists a decision a*(s, t) such that

    r(s, a*(s, t), t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a*(s, t) ]
        = sup_{a∈𝒜(s,t)} { r(s, a, t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a ] }       (4.8)

Then the memoryless deterministic policy π* with π*(s, t) = a*(s, t) is optimal, that is, for any history h(t) = (h(t − 1), a(t − 1), s), U_{π*}(h(t)) = V_{π*}(s, t) = V*(s, t) = U*(h(t)). If, for some s and t, there does not exist such an optimal decision a*(s, t), then there also does not exist an optimal history dependent randomized policy. In such a case it still holds that for any ε > 0, there exists an ε-optimal memoryless deterministic policy π_ε, obtained by choosing decisions π_ε(s, t) such that

    r(s, π_ε(s, t), t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = π_ε(s, t) ] + ε/(T + 1)
        > sup_{a∈𝒜(s,t)} { r(s, a, t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a ] }       (4.9)

Solving a finite horizon dynamic program usually involves computing V* with a backward induction algorithm using (4.6). An optimal policy π* ∈ Π^MD is then obtained using (4.8), or an ε-optimal policy π_ε ∈ Π^MD is obtained using (4.9).
Finite Horizon Backward Induction Algorithm
0. Set V*(s, T + 1) = 0 for all s ∈ 𝒮.
1. For t = T, . . . , 1, 0, repeat steps 2 and 3.
2. For each s ∈ 𝒮, compute

    V*(s, t) = sup_{a∈𝒜(s,t)} { r(s, a, t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a ] }      (4.10)

3. For each s ∈ 𝒮, choose a decision

    π*(s, t) ∈ arg max_{a∈𝒜(s,t)} { r(s, a, t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a ] }

if the maximum on the right hand side is attained. Otherwise, for any chosen ε > 0, choose a decision π_ε(s, t) such that

    r(s, π_ε(s, t), t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = π_ε(s, t) ] + ε/(T + 1)
        > sup_{a∈𝒜(s,t)} { r(s, a, t) + IE[ V*(S(t + 1), t + 1) | S(t) = s, A(t) = a ] }
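The algorithm above translates directly into code when the state space and decision sets are finite; the following Python sketch takes the model as plain functions, and the tiny two-state instance at the bottom is hypothetical and serves only to exercise it.

    # Sketch of the finite horizon backward induction algorithm, for a model given
    # by a list of states, feasible decision sets A(s, t), rewards r(s, a, t), and
    # transition probabilities p(s2, s, a, t) = p[s2 | s, a, t].
    def backward_induction(states, A, r, p, T):
        V = {(s, T + 1): 0.0 for s in states}           # step 0
        policy = {}
        for t in range(T, -1, -1):                      # step 1: t = T, ..., 1, 0
            for s in states:                            # steps 2 and 3
                best_a, best_val = None, float("-inf")
                for a in A(s, t):
                    val = r(s, a, t) + sum(p(s2, s, a, t) * V[(s2, t + 1)] for s2 in states)
                    if val > best_val:
                        best_a, best_val = a, val
                V[(s, t)] = best_val
                policy[(s, t)] = best_a
        return V, policy

    # Hypothetical two-state instance: in state 0 one may wait (a=0) or stop (a=1);
    # stopping earns a reward of 1 and moves the process to the absorbing state 1.
    states = [0, 1]
    A = lambda s, t: [0, 1] if s == 0 else [0]
    r = lambda s, a, t: 1.0 if (s == 0 and a == 1) else 0.0
    p = lambda s2, s, a, t: 1.0 if s2 == (1 if a == 1 else s) else 0.0
    V, policy = backward_induction(states, A, r, p, T=3)
    print(V[(0, 0)], policy[(0, 0)])

Because the maximum over a finite decision set is always attained, the ε-branch of step 3 is not needed in this sketch.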
4.3.2 Structural Properties
Dynamic programming is useful not only for the computation of optimal policies and optimal expected values,
but also for determining insightful structural characteristics of optimal policies. In fact, for many interesting
applications the state space is too big to compute optimal policies and optimal expected values exactly, but
dynamic programming can still be used to establish qualitative characteristics of optimal quantities. Some
such structural properties are illustrated with examples.
Example 4.6 The Secretary Problem. Suppose a decision maker has to choose one out of N candidates.
The decision maker observes the candidates one at a time, and after a candidate has been observed, the
decision maker either has to choose that candidate, and the process terminates, or the decision maker has to
reject the candidate and observe the next candidate. Rejected candidates cannot be recalled. The number
N of candidates is known, but the decision maker knows nothing else about the candidates beforehand. The
decision maker can rank any candidates that have been observed. That is, for any two candidates i and
j that have been observed, either i is preferred to j, denoted by j i, or j is preferred to i, denoted by
i j. The preferences are transitive, that is, if i j and j k, then i k. The candidates are observed in
random sequence, that is, the N! permutations of candidates are equally likely. The decision maker wants
to maximize the probability of selecting the best candidate. This problem can be formulated as a dynamic
program. The discrete time parameter corresponds to the number of candidates that have been observed so
far, and the current state is an indicator whether the current candidate is the best candidate observed so
far or not. If the current candidate is selected, then the expected reward is the probability that the current
candidate is the best candidate overall. If the current candidate is rejected, then the current reward is zero,
and the process makes a transition to the next stage. Dynamic programming can be used to show that the
following policy is optimal. Let
(N) max
_
n 1, . . . , N :
1
n
+
1
n + 1
+ +
1
N 1
> 1
_
The optimal policy is then to observe the rst (N) candidates without selecting any candidate, and then
to select the rst candidate thereafter that is preferred to all the previously observed candidates. It can be
shown that (N)/N converges to 1/e quite rapidly. Thus for a reasonably large number N of candidates
(say N > 15), a good policy is to observe the rst N/e candidates without selecting any candidate, and then
to select the rst candidate thereafter that is preferred to all the previous candidates. It is also interesting
that the optimal probability of selecting the best candidate decreases in N, but it never decreases below
1/e 37%, no matter how large the number of candidates.
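The threshold τ(N) can be computed directly from its definition, as in the following sketch, which also shows τ(N)/N approaching 1/e.

    from math import e

    # Sketch: compute tau(N) = max{ n : 1/n + 1/(n+1) + ... + 1/(N-1) > 1 }.
    def tau(N):
        best = 1
        for n in range(1, N + 1):
            if sum(1.0 / k for k in range(n, N)) > 1.0:
                best = n
        return best

    for N in (10, 50, 100, 1000):
        print(N, tau(N), round(tau(N) / N, 3), round(1 / e, 3))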
Example 4.7 Inventory Replenishment. A business purchases and sells a particular product. A decision
maker has to decide regularly, say once every day, how much of the product to buy. The business does not
have to wait to receive the purchased product. Unlike the newsvendor problem, here product that is not
sold on a particular day can be kept in inventory for the future. The business pays a fixed cost K plus a variable cost c per unit of product each time product is purchased. Thus, if a units of product is purchased, then the purchasing cost is K + ca if a > 0, and it is 0 if a = 0. In addition, if the inventory level at the beginning of the day is s, and a units of product is purchased, then an inventory cost of h(s + a) is incurred, where h is a convex function. The demands for the product on different days are independent and identically distributed. If the demand D is greater than the available inventory s + a, then the excess demand is backlogged until additional inventory is obtained, at which time the backlogged demand is filled immediately. Inventory remaining at the end of the time horizon has no value. The objective is to minimize the expected total cost over the time horizon. This problem can be formulated as a discrete time dynamic program. The state S(t) is the inventory at the beginning of day t. The decision A(t) is the quantity purchased on day t, and the single stage cost r(s, a) = (K + ca)I_{a>0} + h(s + a). The transitions are given by S(t + 1) = S(t) + A(t) − D(t). Dynamic programming can be used to show that the following policy is optimal. If the inventory level S(t) < σ(t), where σ(t) is called the optimal reorder point at time t, then it is optimal to purchase Σ(t) − S(t) units of product at time t, where Σ(t) is called the optimal order-up-to point at time t. If the inventory level S(t) ≥ σ(t), then it is optimal not to purchase any product. Such a policy is often called an (s, S)-policy, or a (σ, Σ)-policy. Similar results hold in the infinite horizon case, except that σ and Σ do not depend on the time t anymore.
Example 4.8 Resource Allocation. A decision maker has an amount of resource that can be allocated
over some time horizon. At each discrete point in time, a request for some amount of resource is received. If
the request is for more resource than the decision maker has available, then the request has to be rejected.
Otherwise, the request can be accepted or rejected. A request must be accepted or rejected as a whole; the decision maker cannot allocate a fraction of the amount of resource requested. Rejected requests cannot be recalled later. If the request is accepted, the amount of resource available to the decision maker is reduced by the amount of resource requested, and the decision maker receives an associated reward in return. The amounts of resource and the rewards of future requests are unknown to the decision maker, but the decision maker knows the probability distribution of these. At the end of the time horizon, the decision maker receives a salvage reward for the remaining amount of resource. The objective is to maximize the expected total reward over the time horizon. Problems of this type are encountered in revenue management and the selling of assets such as real estate and vehicles. This resource allocation problem can be formulated as a dynamic program. The state S(t) is the amount of resource available to the decision maker at the beginning of time period t. The decision A(t) is the rule that will be used for accepting or rejecting requests during time period t. If a request for amount Q of resource with an associated reward R is accepted in time period t, then the single stage reward is R and the next state is S(t + 1) = S(t) − Q. If the request is rejected, then the next state is S(t + 1) = S(t). It is easy to see that the optimal value function V*(s, t) is increasing in s and decreasing in t. The following threshold policy, with reward threshold function x*(q, s, t) = V*(s, t + 1) − V*(s − q, t + 1), is optimal. Accept a request for amount Q of resource with an associated reward R if Q ≤ S(t) and R ≥ x*(Q, S(t), t), and reject the request otherwise. If each request is for the same amount of resource (say 1 unit of resource), and the salvage reward is concave in the remaining amount of resource, then the optimal value function V*(s, t) is concave in s and t, and the optimal reward threshold x*(1, s, t) = V*(s, t + 1) − V*(s − 1, t + 1) is decreasing in s and t. These intuitive properties do not hold in general if the requests are for random amounts of resource.
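The threshold rule above is easy to state in code once an (approximate) optimal value function is available; in the sketch below the value function is a hypothetical stand-in, increasing and concave in s and decreasing in t, used only to make the rule executable.

    # Sketch of the threshold acceptance rule: accept a request (Q, R) in state s at
    # time t when Q <= s and R >= x*(Q, s, t) = V*(s, t+1) - V*(s - Q, t+1).
    def V_star(s, t, T=10):
        # Hypothetical stand-in for the optimal value function.
        return 0.1 * (T - t) * (s ** 0.5)

    def accept(Q, R, s, t):
        if Q > s:
            return False
        threshold = V_star(s, t + 1) - V_star(s - Q, t + 1)
        return R >= threshold

    print(accept(Q=2, R=1.5, s=5, t=0), accept(Q=2, R=0.2, s=5, t=0))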
Structural properties of the optimal value functions and optimal policies of dynamic programs have been
investigated for many different applications. Some general structural results are given in Serfozo (1976),
Topkis (1978), and Heyman and Sobel (1984).
4.4 Innite Horizon Dynamic Programs
In this section we present dynamic programming models with an infinite time horizon. Although an infinite time horizon is a figment of the imagination, these models often are useful for decision problems with many decision points. Many infinite horizon models also have the desirable feature that there exist stationary deterministic optimal policies. Thus optimal decisions depend only on the current state of the process, and not on the sometimes artificial notion of time, as in finite horizon problems. This characteristic makes optimal policies easier to understand, compute, and implement, which is desirable in applications.

We again assume that 𝒮 is countable and r is bounded. Similar results hold in more general cases, subject to regularity conditions. We also assume that the sets 𝒜(s) of feasible decisions depend only on the states s, the transition probabilities p[s′ | s, a] depend only on the states s, s′, and decisions a, and the rewards r(s, a) depend only on the states s and decisions a, and not on time, unlike the finite horizon case.
In this article we focus on dynamic programs with total discounted reward objectives. As illustrated
in the examples of Section 4.2.7, infinite horizon dynamic programs with other types of objectives, such
as long-run average reward objectives, may exhibit undesirable behavior. A proper treatment of dynamic
programs with these types of objectives requires more space than we have available here, and therefore we
refer the interested reader to the references. Besides, in most practical applications, rewards and costs in the
near future are valued more than rewards and costs in the more distant future, and hence total discounted
reward objectives are preferred for applications.
4.5 Innite Horizon Discounted Dynamic Programs
In this section we investigate dynamic programming models for optimization problems with the form

    max_{(A(0),A(1),. . . )} IE[ Σ_{t=0}^∞ α^t r(S(t), A(t)) ]                              (4.11)

where α ∈ (0, 1) is a known discount factor. Again, decisions A(t), t = 0, 1, . . . , have to be feasible and may depend only on the information available to the decision maker at each time t, that is the history H(t) of the process up to time t, and possibly some randomization.
4.5.1 Optimality Results
The establishment of optimality results for infinite horizon discounted dynamic programs is quite similar to that for finite horizon dynamic programs. An important difference though is that backward induction cannot be used in the infinite horizon case.

We again start by defining the value function U_π for a policy π ∈ Π^HR,

    U_π(h(t)) ≡ IE_π[ Σ_{τ=t}^∞ α^{τ−t} r(S(τ), A(τ)) | H(t) = h(t) ]                       (4.12)

The optimal value function U* is then defined as in (4.3). It follows from r being bounded and α ∈ (0, 1) that U_π and U* are bounded. Again, a policy π* ∈ Π^HR is called optimal if U_{π*}(h(t)) = U*(h(t)) for all h(t) ∈ ℋ(t) and all t ∈ {0, 1, . . .}, and a policy π_ε ∈ Π^HR is called ε-optimal if U_{π_ε}(h(t)) + ε > U*(h(t)) for all h(t) ∈ ℋ(t) and all t ∈ {0, 1, . . .}.
The value function U_π satisfies an inductive equation similar to (4.4) for the finite horizon case, for any π ∈ Π^HR and any history h(t) = (h(t − 1), a(t − 1), s):

    U_π(h(t)) = IE_π[ r(s, π(h(t))) + α U_π(H(t + 1)) | H(t) = h(t) ]                       (4.13)

However, unlike the finite horizon case, U_π cannot in general be computed inductively using (4.13). We also do not proceed in the infinite horizon case by establishing an optimality equation similar to (4.5). However, we do proceed by considering an optimality equation similar to (4.6).
From the stationary properties of the feasible sets, transition probabilities, and rewards, it is intuitive that U*(h(t)) should depend on h(t) = (h(t − 1), a(t − 1), s) only through the most recent state s, and that it should be sufficient to consider stationary policies. However, it is convenient to show, as an intermediate step, that it is sufficient to consider memoryless policies. For any π ∈ Π^HR and any history h(t), define the memoryless randomized policy π′ ∈ Π^MR as follows:

    π′(s, t + τ)(A) ≡ IP_π[ A(t + τ) ∈ A | S(t + τ) = s, H(t) = h(t) ]

for any s ∈ 𝒮, any τ ∈ {0, 1, 2, . . .}, and any A ⊂ 𝒜(s). (Recall that π(s, t)(A) denotes the probability, given state s at time t, that a decision in A ⊂ 𝒜(s) is chosen under policy π.) Then it is easy to show that for any s ∈ 𝒮, any τ ∈ {0, 1, 2, . . .}, and any A ⊂ 𝒜(s), IP_π[S(t + τ) = s, A(t + τ) ∈ A | H(t) = h(t)] = IP_{π′}[S(t + τ) = s, A(t + τ) ∈ A | H(t) = h(t)]. Thus, for any π ∈ Π^HR and any history h(t), there exists a memoryless randomized policy π′ that behaves exactly like π from time t onwards, and hence U_π(h(t + τ)) = U_{π′}(h(t + τ)) for any history h(t + τ) that starts with h(t). It follows that U*(h(t)) ≡ sup_{π∈Π^HR} U_π(h(t)) = sup_{π∈Π^MR} U_π(h(t)), that is, it is sufficient to consider memoryless randomized policies.

For any memoryless randomized policy π and any history h(t) = (h(t − 1), a(t − 1), s), U_π(h(t)) depends on h(t) only through the most recent state s and the time t. Instead of exploring this result in more detail as for the finite horizon case, we use another property of memoryless randomized policies. Using the stationary properties of the problem parameters, it follows that, for any memoryless randomized policy π and any time t, π behaves in the same way from time t onwards as another memoryless randomized policy π̃ behaves from time 0 onwards, where π̃ is obtained from π by shifting backwards through t, as follows. Define the shift function σ : Π^MR → Π^MR by σ(π)(s, t) ≡ π(s, t + 1) for all s ∈ 𝒮 and all t ∈ {0, 1, . . .}. That is, policy σ(π) ∈ Π^MR makes the same decisions at time t as policy π ∈ Π^MR makes at time t + 1. Also, inductively define the convolution σ^{t+1}(π) ≡ σ(σ^t(π)). Thus the shifted policy π̃ described above is given by π̃ = σ^t(π). Also note that for a stationary policy π, σ(π) = π.
Now it is useful to focus on the value function V_π for a policy π ∈ Π^HR from time 0 onwards,

    V_π(s) ≡ IE_π[ Σ_{t=0}^∞ α^t r(S(t), A(t)) | S(0) = s ]                                 (4.14)

That is, V_π(s) = U_π(h(0)), where h(0) = (s). Then it follows that, for any memoryless randomized policy π and any history h(t) = (h(t − 1), a(t − 1), s),

    U_π(h(t)) = V_{σ^t(π)}(s)                                                               (4.15)

Thus we obtain the further simplification that U*(h(t)) ≡ sup_{π∈Π^HR} U_π(h(t)) = sup_{π∈Π^MR} U_π(h(t)) = sup_{π∈Π^MR} V_π(s). Define the optimal value function V* by

    V*(s) ≡ sup_{π∈Π^MR} V_π(s)                                                             (4.16)

Thus, for any history h(t) = (h(t − 1), a(t − 1), s),

    U*(h(t)) = V*(s)                                                                        (4.17)

and hence U*(h(t)) depends only on the most recent state s, as expected.
It also follows from (4.13) and (4.15) that for any π ∈ Π^MR,

    V_π(s) = U_π(h(0)) = IE_π[ r(s, π(s, 0)) + α U_π(H(1)) | H(0) = h(0) = (s) ]
           = IE_π[ r(s, π(s, 0)) + α V_{σ(π)}(S(1)) | S(0) = s ]                            (4.18)

As a special case, for a stationary policy π,

    V_π(s) = IE_π[ r(s, π(s)) + α V_π(S(1)) | S(0) = s ]                                    (4.19)

Motivated by the finite horizon optimality equation (4.6), as well as by (4.18), we expect V* to satisfy the following optimality equation:

    V*(s) = sup_{a∈𝒜(s)} { r(s, a) + α IE[ V*(S(1)) | S(0) = s, A(0) = a ] }                (4.20)
(4.20)
Unlike the finite horizon case, we cannot use induction to establish the validity of (4.20). Instead we use the
following approach. Let 𝒱 denote the set of bounded functions V : S → IR. Define the function L^* : 𝒱 → 𝒱 by
    L^*(V)(s) ≡ sup_{a ∈ A(s)} { r(s, a) + α IE[ V(S(1)) | S(0) = s, A(0) = a ] }
Let ‖·‖_∞ denote the supremum-norm on 𝒱, that is, for any V ∈ 𝒱, ‖V‖_∞ ≡ sup_{s ∈ S} |V(s)|. For any
V_1, V_2 ∈ 𝒱, L^* satisfies ‖L^*(V_1) − L^*(V_2)‖_∞ ≤ α ‖V_1 − V_2‖_∞. Then, because α ∈ [0, 1), L^* is a contraction
mapping on 𝒱, and it follows from the Banach fixed point theorem that L^* has a unique fixed point v^* ∈ 𝒱,
that is, there exists a unique function v^* ∈ 𝒱 that satisfies the equation V = L^*(V). Thus optimality equation
(4.20) has a unique solution v^*, and it remains to be shown that v^* is equal to V^* as defined in (4.16).
Similarly, for any stationary policy π, define the function L^π : 𝒱 → 𝒱 by
    L^π(V)(s) ≡ IE^π[ r(s, π(s)) + α V(S(1)) | S(0) = s ]
It follows in the same way as for L^* that L^π has a unique fixed point, and it follows from (4.19) that V^π is
the fixed point of L^π.
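To make the contraction property concrete, the following minimal sketch (in Python with NumPy; the two-state, two-action rewards, transition probabilities, and discount factor are made-up placeholders, not data from the text) builds the operator L^* for a small finite MDP and checks numerically that it shrinks the supremum-norm distance between two arbitrary bounded functions by at least the factor α.

    import numpy as np

    # Hypothetical finite MDP with 2 states and 2 actions (illustrative data only).
    alpha = 0.9                                    # discount factor in [0, 1)
    r = np.array([[1.0, 0.5],                      # r[s, a]: one-period reward
                  [0.0, 2.0]])
    p = np.array([[[0.8, 0.2], [0.1, 0.9]],        # p[s, a, s']: transition probabilities
                  [[0.5, 0.5], [0.3, 0.7]]])

    def bellman_operator(V):
        """Apply L*(V)(s) = max_a { r(s, a) + alpha * sum_s' p[s'|s, a] V(s') }."""
        return np.max(r + alpha * p @ V, axis=1)

    # Contraction check on two arbitrary bounded functions V1, V2.
    rng = np.random.default_rng(0)
    V1, V2 = rng.normal(size=2), rng.normal(size=2)
    lhs = np.max(np.abs(bellman_operator(V1) - bellman_operator(V2)))
    rhs = alpha * np.max(np.abs(V1 - V2))
    print(lhs <= rhs + 1e-12)   # True: ||L*(V1) - L*(V2)||_inf <= alpha ||V1 - V2||_inf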
Consider any V ∈ 𝒱 such that V ≥ L^*(V). Then for any π ∈ Π^MR, it follows by induction that V ≥ V^π,
and thus V ≥ sup_{π ∈ Π^MR} V^π. Similarly, consider any V ∈ 𝒱 such that V ≤ L^*(V). Then for any
ε > 0 there exists a stationary deterministic policy π_ε such that V ≤ V^{π_ε} + ε ≤ V^* + ε, and thus V ≤ V^*.
Combining these results, it follows that for any V ∈ 𝒱 such that V = L^*(V), it holds that V = V^*, and thus
v^* = V^*, that is, V^* is the unique fixed point of L^*, and the validity of the optimality equation (4.20) has
been established.
It can now be shown that it is sufficient to consider only stationary deterministic policies, in the following
sense. First suppose that for each s ∈ S, there exists a decision a^*(s) such that
    r(s, a^*(s)) + α IE[ V^*(S(1)) | S(0) = s, A(0) = a^*(s) ]
        = sup_{a ∈ A(s)} { r(s, a) + α IE[ V^*(S(1)) | S(0) = s, A(0) = a ] }            (4.21)
Let the stationary deterministic policy π^* be given by π^*(s) = a^*(s). Then (4.21) implies that L^{π^*}(V^*) =
L^*(V^*) = V^*, that is, V^* is a fixed point of L^{π^*}, and thus V^{π^*} = V^*. That is, for any history h(t) =
(h(t−1), a(t−1), s), U^{π^*}(h(t)) = V^{π^*}(s) = V^*(s) = U^*(h(t)), and thus π^* is an optimal policy. If, for
some s, there does not exist such an optimal decision a^*(s), then there also does not exist an optimal history
dependent randomized policy. In such a case it still holds that for any ε > 0, there exists an ε-optimal
stationary deterministic policy π_ε, obtained by choosing decisions π_ε(s) such that
    r(s, π_ε(s)) + α IE[ V^*(S(1)) | S(0) = s, A(0) = π_ε(s) ] + ε(1 − α)
        > sup_{a ∈ A(s)} { r(s, a) + α IE[ V^*(S(1)) | S(0) = s, A(0) = a ] }            (4.22)
4.5.2 Algorithms
Solving an infinite horizon discounted dynamic program usually involves computing V^*. An optimal policy
π^* ∈ Π^SD is then obtained using (4.21), or an ε-optimal policy π_ε ∈ Π^SD is obtained using (4.22).
Unlike the finite horizon case, V^* is not computed directly using backward induction. An approach that
is often used is to compute a sequence of approximating functions V_i, i = 0, 1, 2, . . . , such that V_i → V^* as
i → ∞.
Approximating functions provide good policies, as shown by the following result. Suppose V^* is approximated
by V̂ such that ‖V^* − V̂‖_∞ ≤ ε. Consider any policy π ∈ Π^SD such that
    r(s, π(s)) + α Σ_{s′ ∈ S} p[s′ | s, π(s)] V̂(s′) + δ ≥ sup_{a ∈ A(s)} { r(s, a) + α Σ_{s′ ∈ S} p[s′ | s, a] V̂(s′) }
for all s ∈ S, that is, decision π(s) is within δ of the optimal decision using approximating function V̂ on
the right hand side of the optimality equation (4.20). Then
    V^π(s) ≥ V^*(s) − (2αε + δ)/(1 − α)                                                  (4.23)
for all s ∈ S, that is, policy π has value function within (2αε + δ)/(1 − α) of the optimal value function.
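The constant in (4.23) is stated without proof; a brief sketch of the standard argument (using that L^π and L^* are monotone, are α-contractions, and shift constant functions by the factor α) runs as follows. Since ‖V^* − V̂‖_∞ ≤ ε and π is within δ of the maximum,
    L^π(V^*) ≥ L^π(V̂) − αε ≥ L^*(V̂) − δ − αε ≥ L^*(V^*) − 2αε − δ = V^* − (2αε + δ).
Because V^π = L^π(V^π) and L^π is an α-contraction,
    V^* − V^π = (V^* − L^π(V^*)) + (L^π(V^*) − L^π(V^π)) ≤ (2αε + δ) + α ‖V^* − V^π‖_∞,
and taking the supremum over s and rearranging gives ‖V^* − V^π‖_∞ ≤ (2αε + δ)/(1 − α), which yields (4.23).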
Value Iteration   One algorithm based on a sequence of approximating functions V_i is called value iteration,
or successive approximation. The iterates V_i of value iteration correspond to the value function V^*(s, T+1−i)
of the finite horizon dynamic program with the same problem parameters. Specifically, starting with initial
approximation V_0(s) = 0 = V^*(s, T+1) for all s, the ith approximating function V_i(s) is the same as
the value function V^*(s, T+1−i) of the corresponding finite horizon dynamic program, that is, the value
function for time T+1−i that is obtained after i steps of the backward induction algorithm.
Value Iteration Algorithm
0. Choose initial approximation V_0 ∈ 𝒱 and stopping tolerance ε. Set i ← 0.
1. For each s ∈ S, compute
    V_{i+1}(s) = sup_{a ∈ A(s)} { r(s, a) + α Σ_{s′ ∈ S} p[s′ | s, a] V_i(s′) }          (4.24)
2. If ‖V_{i+1} − V_i‖_∞ < ε(1 − α)/2α, then go to step 3. Otherwise, set i ← i + 1 and go to step 1.
3. For each s ∈ S, choose a decision
    π_ε(s) ∈ arg max_{a ∈ A(s)} { r(s, a) + α Σ_{s′ ∈ S} p[s′ | s, a] V_{i+1}(s′) }
if the maximum on the right hand side is attained. Otherwise, for any chosen δ > 0, choose a decision π_{ε,δ}(s)
such that
    r(s, π_{ε,δ}(s)) + α Σ_{s′ ∈ S} p[s′ | s, π_{ε,δ}(s)] V_{i+1}(s′) + δ(1 − α)
        > sup_{a ∈ A(s)} { r(s, a) + α Σ_{s′ ∈ S} p[s′ | s, a] V_{i+1}(s′) }
It can be shown, using the contraction property of L^*, that V_i → V^* as i → ∞ for any initial
approximation V_0 ∈ 𝒱. Also, the convergence is geometric with rate α. Specifically, for any V_0 ∈ 𝒱,
‖V_i − V^*‖_∞ ≤ α^i ‖V_0 − V^*‖_∞. That implies that the convergence rate is faster if the discount factor α
is smaller.
When the value iteration algorithm stops, the final approximation V_{i+1} satisfies ‖V_{i+1} − V^*‖_∞ < ε/2.
Furthermore, the chosen policy π_ε is an ε-optimal policy, and the chosen policy π_{ε,δ} is an (ε + δ)-optimal
policy.
There are several versions of the value iteration algorithm. One example is Gauss-Seidel value iteration,
which uses the most up-to-date approximation V_{i+1}(s′) on the right hand side of (4.24) as soon as it becomes
available, instead of using the previous approximation V_i(s′) as shown in (4.24). Gauss-Seidel value iteration
has the same convergence properties and performance guarantees given above, but in practice it usually
converges faster.
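As a concrete illustration (not part of the original text), a minimal Python/NumPy sketch of the value iteration algorithm for a finite MDP is given below; the reward array r[s, a], the transition array p[s, a, s'], the discount factor alpha, and the tolerance eps are placeholders to be supplied by the user, and the maximum over a ∈ A(s) is assumed to be attained, so step 3 reduces to taking an arg max.

    import numpy as np

    def value_iteration(r, p, alpha, eps):
        """Value iteration for a finite MDP.

        r[s, a]     : one-period reward
        p[s, a, s'] : transition probabilities
        alpha       : discount factor in [0, 1)
        eps         : desired optimality tolerance
        Returns an approximation of V* and an eps-optimal stationary policy.
        """
        n_states = r.shape[0]
        V = np.zeros(n_states)                     # step 0: V_0 = 0
        while True:
            Q = r + alpha * p @ V                  # Q[s, a] = r(s, a) + alpha * sum_s' p[s'|s, a] V(s')
            V_new = Q.max(axis=1)                  # step 1: V_{i+1}(s)
            if np.max(np.abs(V_new - V)) < eps * (1 - alpha) / (2 * alpha):
                V = V_new                          # step 2: stopping criterion met
                break
            V = V_new
        Q = r + alpha * p @ V                      # step 3: extract a greedy (eps-optimal) policy
        policy = Q.argmax(axis=1)
        return V, policy

A Gauss-Seidel variant would update V(s) in place, state by state, within each sweep so that states later in the sweep already use the updated values.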
Policy Iteration   Policy iteration is an algorithm based on a sequence of policies π_i, and their value
functions V^{π_i}.
Policy Iteration Algorithm
0. Choose initial policy π_0 ∈ Π^SD and stopping tolerance ε. Set i ← 0.
1. Compute the value function V^{π_i} of policy π_i by solving the system of linear equations
    V^{π_i}(s) = r(s, π_i(s)) + α Σ_{s′ ∈ S} p[s′ | s, π_i(s)] V^{π_i}(s′)               (4.25)
for each s ∈ S.
2. For each s ∈ S, choose a decision
    π_{i+1}(s) ∈ arg max_{a ∈ A(s)} { r(s, a) + α Σ_{s′ ∈ S} p[s′ | s, a] V^{π_i}(s′) }
if the maximum on the right hand side is attained. Otherwise, for any chosen δ > 0, choose a decision
π_{i+1}(s) such that
    r(s, π_{i+1}(s)) + α Σ_{s′ ∈ S} p[s′ | s, π_{i+1}(s)] V^{π_i}(s′) + δ(1 − α)
        > sup_{a ∈ A(s)} { r(s, a) + α Σ_{s′ ∈ S} p[s′ | s, a] V^{π_i}(s′) }
3. If π_{i+1} = π_i, or i > 0 and ‖V^{π_i} − V^{π_{i−1}}‖_∞ < ε(1 − α)/2α, then stop with chosen policy π_{i+1}. Otherwise,
set i ← i + 1 and go to step 1.
It can be shown that policy iteration converges at least as fast as value iteration as i → ∞. However,
the amount of work involved in each iteration of the policy iteration algorithm is usually more than the
amount of work involved in each iteration of the value iteration algorithm, because of the computational
effort required to solve (4.25) for V^{π_i}. The total computational effort to satisfy the stopping criterion with
policy iteration is usually more than the total computational effort with value iteration.
A desirable property of the iterates V^{π_i} is that if each π_i attains the maximum on the right hand side,
then they are monotonically improving, that is, V^{π_0} ≤ V^{π_1} ≤ · · · ≤ V^*. Thus, each iteration produces a
better policy than before. If one starts with a reasonably good heuristic policy π_0, then even if one performs
only one iteration of the policy iteration algorithm, one obtains the benefit of an even better policy.
Suppose the policy iteration algorithm stops with π_{i+1} = π_i. If π_{i+1} attains the maximum on the right
hand side, then V^{π_i} = V^* and the chosen policy π_{i+1} is optimal. If π_{i+1} chooses a decision within δ(1 − α)
of the maximum on the right hand side, then ‖V^{π_i} − V^*‖_∞ < δ and the chosen policy π_{i+1} is δ-optimal.
Otherwise, suppose the policy iteration algorithm stops with ‖V^{π_i} − V^{π_{i−1}}‖_∞ < ε(1 − α)/2α. If π_{i+1} attains
the maximum on the right hand side, then ‖V^{π_{i+1}} − V^*‖_∞ < ε and the chosen policy π_{i+1} is ε-optimal. If
π_{i+1} chooses a decision within δ(1 − α) of the maximum on the right hand side, then ‖V^{π_{i+1}} − V^*‖_∞ < ε + δ
and the chosen policy π_{i+1} is (ε + δ)-optimal.
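In the same spirit, a minimal Python/NumPy sketch of policy iteration with exact policy evaluation follows; again r, p, alpha, and the initial policy are placeholders, the maximum is assumed to be attained, and this sketch stops as soon as the policy repeats (the first stopping condition in step 3).

    import numpy as np

    def policy_iteration(r, p, alpha, initial_policy):
        """Policy iteration for a finite MDP with exact policy evaluation."""
        n_states = r.shape[0]
        policy = np.array(initial_policy)
        while True:
            # Step 1: evaluate the current policy by solving (I - alpha * P_pi) V = r_pi, i.e. (4.25).
            P_pi = p[np.arange(n_states), policy]      # P_pi[s, s'] = p[s' | s, policy(s)]
            r_pi = r[np.arange(n_states), policy]      # r_pi[s] = r(s, policy(s))
            V = np.linalg.solve(np.eye(n_states) - alpha * P_pi, r_pi)
            # Step 2: improve the policy greedily with respect to V.
            Q = r + alpha * p @ V
            new_policy = Q.argmax(axis=1)
            # Step 3: stop when the policy no longer changes.
            if np.array_equal(new_policy, policy):
                return V, policy
            policy = new_policy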
Modified Policy Iteration   It was mentioned that one of the drawbacks of the policy iteration algorithm
is the computational effort required to solve (4.25) for V^{π_i}. An iterative algorithm called the Gauss-Seidel
method can be used to solve (4.25). For any stationary policy π, L^π is a contraction mapping. It follows
that for any V_0 ∈ 𝒱, the sequence of functions V_j, j = 0, 1, 2, . . . , inductively computed by
    V_{j+1}(s) = r(s, π(s)) + α Σ_{s′ ∈ S} p[s′ | s, π(s)] V_j(s′)
for each s ∈ S, converges to V^π as j → ∞. Modified policy iteration uses this Gauss-Seidel method to
compute V^π, but it performs only a few iterations to compute an approximation to V^π, and then moves on
to the next policy, instead of letting j → ∞ to compute V^π exactly. This compensates for the drawbacks of
policy iteration. Modified policy iteration is usually more efficient than value iteration and policy iteration.
Modified Policy Iteration Algorithm
0. Choose initial approximation V_{1,0} ∈ 𝒱, a method to generate a sequence N_i, i = 1, 2, . . . , of positive
integers, and stopping tolerance ε. Set i ← 1.
1. For each s ∈ S, choose a decision
    π_i(s) ∈ arg max_{a ∈ A(s)} { r(s, a) + α Σ_{s′ ∈ S} p[s′ | s, a] V_{i,0}(s′) }
if the maximum on the right hand side is attained. Otherwise, for any chosen δ > 0, choose a decision π_i(s)
such that
    r(s, π_i(s)) + α Σ_{s′ ∈ S} p[s′ | s, π_i(s)] V_{i,0}(s′) + δ(1 − α)
        > sup_{a ∈ A(s)} { r(s, a) + α Σ_{s′ ∈ S} p[s′ | s, a] V_{i,0}(s′) }
2. For j = 1, 2, . . . , N_i, compute
    V_{i,j}(s) = r(s, π_i(s)) + α Σ_{s′ ∈ S} p[s′ | s, π_i(s)] V_{i,j−1}(s′)
for each s ∈ S.
3. If ‖V_{i,1} − V_{i,0}‖_∞ < ε(1 − α)/2α, then stop with chosen policy π_i. Otherwise, set V_{i+1,0} = V_{i,N_i},
set i ← i + 1 and go to step 1.
It can be shown that modified policy iteration converges at least as fast as value iteration as i → ∞. The
special case of modified policy iteration with N_i = 1 for all i is the same as value iteration (as long as V_{i,1} is
set equal to the maximum of the right hand side). When N_i → ∞ for all i, modified policy iteration is the
same as policy iteration.
The sequence N_i, i = 1, 2, . . . , can be chosen in many ways. Some alternatives are to choose N_i = N for
some chosen fixed N for all i, or to choose N_i to be the first minor iteration j such that ‖V_{i,j} − V_{i,j−1}‖_∞ < γ_i
for some chosen sequence (typically decreasing) γ_i. The idea is to choose N_i to obtain the best trade-off
between the computational requirements of step 1, in which an optimization problem is solved to obtain a
new policy, and that of step 2, in which a more accurate approximation of the value function of the current
policy is computed. If the optimization problem in step 1 requires a lot of computational effort, then it is
better to obtain more accurate approximations of the value functions between successive executions of step
1, that is, it is better to choose N_i larger, and vice versa. Also, if the policy does not change much from one
major iteration to the next, that is, if policies π_{i−1} and π_i are very similar, then it is also better to obtain a
more accurate approximation of the value function V^{π_i} by choosing N_i larger. It is typical that the policies
do not change much later in the algorithm, and hence it is typical to choose N_i to be increasing in i.
When the modified policy iteration algorithm stops, the approximation V_{i,1} satisfies ‖V_{i,1} − V^*‖_∞ < ε/2.
If the chosen policy π_i attains the maximum on the right hand side, then π_i is ε-optimal. If π_i chooses a
decision within δ(1 − α) of the maximum on the right hand side, then π_i is (ε + δ)-optimal. Furthermore, if
the initial approximation V_{1,0} satisfies L^*(V_{1,0}) ≥ V_{1,0}, such as if V_{1,0} = V^{π_0} for some initial policy π_0, and
if each π_i attains the maximum on the right hand side, then the sequence of policies π_i is monotonically
improving, and V_{i,j−1} ≤ V_{i,j} for each i and j, from which it also follows that V_{i−1,0} ≤ V_{i,0} for each i.
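A corresponding sketch of modified policy iteration, here with a fixed number N of evaluation sweeps per major iteration (N, r, p, alpha, and eps are placeholders, and the maximum in step 1 is assumed to be attained):

    import numpy as np

    def modified_policy_iteration(r, p, alpha, eps, N=10):
        """Modified policy iteration: partial policy evaluation with N sweeps per major iteration."""
        n_states = r.shape[0]
        V = np.zeros(n_states)                         # V_{1,0}
        while True:
            # Step 1: choose a greedy policy with respect to the current approximation.
            Q = r + alpha * p @ V
            policy = Q.argmax(axis=1)
            P_pi = p[np.arange(n_states), policy]
            r_pi = r[np.arange(n_states), policy]
            # Step 2 (first sweep): V_{i,1}.
            V1 = r_pi + alpha * P_pi @ V
            # Step 3: stopping criterion based on the first sweep.
            if np.max(np.abs(V1 - V)) < eps * (1 - alpha) / (2 * alpha):
                return V1, policy
            V_next = V1
            for _ in range(N - 1):                     # remaining sweeps V_{i,2}, ..., V_{i,N}
                V_next = r_pi + alpha * P_pi @ V_next
            V = V_next                                 # V_{i+1,0} = V_{i,N}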
4.6 Approximation Methods
For many interesting applications the state space S is too big for any of the algorithms discussed so far to
be used. This is usually due to the curse of dimensionality: the phenomenon that the number of states
grows exponentially in the number of dimensions of the state space. When the state space is too large, not
only is the computational effort required by these algorithms excessive, but storing the value function and
policy values for each state is impossible with current technology.
Recall that solving a dynamic program usually involves using (4.6) in the finite horizon case or (4.20) in
the infinite horizon case to compute the optimal value function V^*, and an optimal policy π^*. To accomplish
this, the following major computational tasks are performed.
1. Estimation of the optimal value function V^* on the right hand side of (4.6) or (4.20).
2. Estimation of the expected value on the right hand side of (4.6) or (4.20). For many applications, this
is a high dimensional integral that requires a lot of computational effort to compute accurately.
3. The maximization problem on the right hand side of (4.6) or (4.20) has to be solved to determine
the optimal decision for each state. This maximization problem may be easy or hard, depending on
the application. The first part of this article discusses several methods for solving such stochastic
optimization problems.
Approximation methods usually involve approaches to perform one or more of these computational tasks
efficiently, sometimes by sacrificing optimality.
For many applications the state space is uncountable and the transition and cost functions are too
complex for closed form solutions to be obtained. To compute solutions for such problems, the state space
is often discretized. Discretization methods and convergence results are discussed in Wong (1970a), Fox
(1973), Bertsekas (1975), Kushner (1990), Chow and Tsitsiklis (1991), and Kushner and Dupuis (1992).
For many other applications, such as queueing systems, the state space is countably infinite. Computing
solutions for such problems usually involves solving smaller dynamic programs with finite state spaces, often
obtained by truncating the state space of the original DP, and then using the solutions of the smaller DPs
to obtain good solutions for the original DP. Such approaches and their convergence are discussed in Fox
(1971), White (1980a), White (1980b), White (1982), Thomas and Stengos (1985), Cavazos-Cadena (1986),
Van Dijk (1991b), Van Dijk (1991a), Sennott (1997a), and Sennott (1997b).
Even if the state space is not infinite, the number of states may be very large. A natural approach is to
aggregate states, usually by collecting similar states into subsets, and then to solve a related DP with the
aggregated state space. Aggregation and aggregation/disaggregation methods are discussed in Simon and
Ando (1961), Mendelssohn (1982), Stewart (1983), Chatelin (1984), Schweitzer (1984), Schweitzer, Puterman
and Kindle (1985), Schweitzer (1986), Schweitzer and Kindle (1986), Bean, Birge and Smith (1987), Feinberg
and Chiu (1987), and Bertsekas and Castanon (1989).
Another natural approach for dealing with a large-scale DP is to decompose the DP into smaller related
DPs, which are easier to solve, and then to use the solutions of the smaller DPs to obtain a good solution
for the original DP. Decomposition methods are discussed in Wong (1970b), Collins and Lew (1970), Collins
(1970), Collins and Angel (1971), Courtois (1977), Courtois and Semal (1984), Stewart (1984), and Kleywegt,
Nori and Savelsbergh (1999).
Some general state space reduction methods that include many of the methods mentioned above are
analyzed in Whitt (1978), Whitt (1979b), Whitt (1979a), Hinderer (1976), Hinderer and Hübner (1977),
Hinderer (1978), and Haurie and L'Ecuyer (1986). Surveys are given in Morin (1978), and Rogers et al.
(1991).
Another natural and quite different approach for dealing with DPs with large state spaces is to approximate
the optimal value function V^* with an approximating function V̂. It was shown in Section 4.5.2 that
good approximations V̂ to the optimal value function V^* lead to good policies π. Polynomial approximations,
often using orthogonal polynomials such as Legendre and Chebychev polynomials, have been suggested
by Bellman and Dreyfus (1959), Chang (1966), Bellman, Kalaba and Kotkin (1963), and Schweitzer and
Seidman (1985). Approximations using splines have been suggested by Daniel (1976), and approximations using
regression splines by Chen, Ruppert and Shoemaker (1999). Estimation of the parameters of approximating
functions for infinite horizon discounted DPs has been studied in Tsitsiklis and Van Roy (1996), Van Roy
and Tsitsiklis (1996), and Bertsekas and Tsitsiklis (1996). Some of this work was motivated by approaches
proposed for reinforcement learning; see Sutton and Barto (1998) for an overview.
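As a small, purely illustrative sketch of this idea (the one-dimensional state space, the sampled data, and the cubic degree are made up, and the fitting method is ordinary least squares rather than any particular method from the references), one can fit a polynomial approximation V̂ to noisy value estimates at sampled states and then use V̂ in place of V^* on the right hand side of (4.20):

    import numpy as np

    # Hypothetical data: sampled states and noisy estimates of their values
    # (for example, obtained by simulation under a reasonable policy).
    rng = np.random.default_rng(1)
    states = rng.uniform(0.0, 10.0, size=200)
    value_estimates = np.sqrt(states) + rng.normal(scale=0.1, size=200)

    # Fit a cubic polynomial approximation V_hat by least squares.
    coeffs = np.polyfit(states, value_estimates, deg=3)
    V_hat = np.poly1d(coeffs)

    # V_hat can now stand in for V* when extracting a decision for any state s.
    print(V_hat(2.5))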
References
Albritton, M., Shapiro, A. and Spearman, M. L. 1999. Finite Capacity Production Planning with Random Demand and Limited Information. Preprint.
Beale, E. M. L. 1955. On Minimizing a Convex Function Subject to Linear Inequalities. Journal of the Royal Statistical Society, Series B, 17, 173–184.
Bean, J. C., Birge, J. R. and Smith, R. L. 1987. Aggregation in Dynamic Programming. Operations Research, 35, 215–220.
Bellman, R. and Dreyfus, S. 1959. Functional Approximations and Dynamic Programming. Mathematical Tables and Other Aids to Computation, 13, 247–251.
Bellman, R. E. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ.
Bellman, R. E. 1961. Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ.
Bellman, R. E. and Dreyfus, S. 1962. Applied Dynamic Programming. Princeton University Press, Princeton, NJ.
Bellman, R. E., Kalaba, R. and Kotkin, B. 1963. Polynomial Approximation – A New Computational Technique in Dynamic Programming: Allocation Processes. Mathematics of Computation, 17, 155–161.
Benveniste, A., Métivier, M. and Priouret, P. 1990. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin, Germany.
Berger, J. O. 1985. Statistical Decision Theory and Bayesian Analysis. 2nd edn, Springer-Verlag, New York, NY.
Bertsekas, D. P. 1975. Convergence of Discretization Procedures in Dynamic Programming. IEEE Transactions on Automatic Control, AC-20, 415–419.
Bertsekas, D. P. 1995. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA.
Bertsekas, D. P. and Castanon, D. A. 1989. Adaptive Aggregation Methods for Infinite Horizon Dynamic Programming. IEEE Transactions on Automatic Control, AC-34, 589–598.
Bertsekas, D. P. and Shreve, S. E. 1978. Stochastic Optimal Control: The Discrete Time Case. Academic Press, New York, NY.
Bertsekas, D. P. and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Athena Scientific, New York, NY.
Birge, J. R. and Louveaux, F. 1997. Introduction to Stochastic Programming. Springer Series in Operations Research, Springer-Verlag, New York, NY.
Cavazos-Cadena, R. 1986. Finite-State Approximations for Denumerable State Discounted Markov Decision Processes. Applied Mathematics and Optimization, 14, 1–26.
Chang, C. S. 1966. Discrete-Sample Curve Fitting Using Chebyshev Polynomials and the Approximate Determination of Optimal Trajectories via Dynamic Programming. IEEE Transactions on Automatic Control, AC-11, 116–118.
Chatelin, F. 1984. Iterative Aggregation/Disaggregation Methods. In Mathematical Computer Performance and Reliability. G. Iazeolla, P. J. Courtois and A. Hordijk (editors). Elsevier Science Publishers B.V., Amsterdam, Netherlands, chapter 2.1, 199–207.
Chen, V. C. P., Ruppert, D. and Shoemaker, C. A. 1999. Applying Experimental Design and Regression Splines to High-Dimensional Continuous-State Stochastic Dynamic Programming. Operations Research, 47, 38–53.
Chong, E. K. P. and Ramadge, P. J. 1992. Convergence of Recursive Optimization Algorithms Using Infinitesimal Perturbation Analysis Estimates. Discrete Event Dynamic Systems: Theory and Applications, 1, 339–372.
Chow, C. S. and Tsitsiklis, J. N. 1991. An Optimal One-Way Multigrid Algorithm for Discrete-Time Stochastic Control. IEEE Transactions on Automatic Control, AC-36, 898–914.
Collins, D. C. 1970. Reduction of Dimensionality in Dynamic Programming via the Method of Diagonal Decomposition. Journal of Mathematical Analysis and Applications, 31, 223–234.
Collins, D. C. and Angel, E. S. 1971. The Diagonal Decomposition Technique Applied to the Dynamic Programming Solution of Elliptic Partial Differential Equations. Journal of Mathematical Analysis and Applications, 33, 467–481.
Collins, D. C. and Lew, A. 1970. A Dimensional Approximation in Dynamic Programming by Structural Decomposition. Journal of Mathematical Analysis and Applications, 30, 375–384.
Courtois, P. J. 1977. Decomposability: Queueing and Computer System Applications. Academic Press, New York, NY.
Courtois, P. J. and Semal, P. 1984. Error Bounds for the Analysis by Decomposition of Non-Negative Matrices. In Mathematical Computer Performance and Reliability. G. Iazeolla, P. J. Courtois and A. Hordijk (editors). Elsevier Science Publishers B.V., Amsterdam, Netherlands, chapter 2.2, 209–224.
Daniel, J. W. 1976. Splines and Efficiency in Dynamic Programming. Journal of Mathematical Analysis and Applications, 54, 402–407.
Dantzig, G. B. 1955. Linear Programming under Uncertainty. Management Science, 1, 197–206.
Denardo, E. V. 1982. Dynamic Programming Models and Applications. Prentice-Hall, Englewood Cliffs, NJ.
Feinberg, B. N. and Chiu, S. S. 1987. A Method to Calculate Steady-State Distributions of Large Markov Chains by Aggregating States. Operations Research, 35, 282–290.
Fox, B. L. 1971. Finite-State Approximations to Denumerable-State Dynamic Programs. Journal of Mathematical Analysis and Applications, 34, 665–670.
Fox, B. L. 1973. Discretizing Dynamic Programs. Journal of Optimization Theory and Applications, 11, 228–234.
Glasserman, P. 1991. Gradient Estimation via Perturbation Analysis. Kluwer Academic Publishers, Norwell, MA.
Glynn, P. W. 1990. Likelihood Ratio Gradient Estimation for Stochastic Systems. Communications of the ACM, 33, 75–84.
Haurie, A. and L'Ecuyer, P. 1986. Approximation and Bounds in Discrete Event Dynamic Programming. IEEE Transactions on Automatic Control, AC-31, 227–235.
Heyman, D. P. and Sobel, M. J. 1984. Stochastic Models in Operations Research. Vol. II, McGraw-Hill, New York, NY.
Hinderer, K. 1970. Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter. Springer-Verlag, Berlin.
Hinderer, K. 1976. Estimates for Finite-Stage Dynamic Programs. Journal of Mathematical Analysis and Applications, 55, 207–238.
Hinderer, K. 1978. On Approximate Solutions of Finite-Stage Dynamic Programs. In Dynamic Programming and its Applications. M. L. Puterman (editor). Academic Press, New York, NY, 289–317.
Hinderer, K. and Hübner, G. 1977. On Exact and Approximate Solutions of Unstructured Finite-Stage Dynamic Programs. In Markov Decision Theory: Proceedings of the Advanced Seminar on Markov Decision Theory held at Amsterdam, The Netherlands, September 13–17, 1976. H. C. Tijms and J. Wessels (editors). Mathematisch Centrum, Amsterdam, The Netherlands, 57–76.
Hiriart-Urruty, J. B. and Lemarechal, C. 1993. Convex Analysis and Minimization Algorithms. Springer-Verlag, Berlin, Germany.
Ho, Y. C. and Cao, X. R. 1991. Perturbation Analysis of Discrete Event Dynamic Systems. Kluwer Academic Publishers, Norwell, MA.
Kall, P. and Wallace, S. W. 1994. Stochastic Programming. John Wiley & Sons, Chichester, England.
Klein Haneveld, W. K. and Van der Vlerk, M. H. 1999. Stochastic Integer Programming: General Models and Algorithms. Annals of Operations Research, 85, 39–57.
Kleywegt, A. J., Nori, V. S. and Savelsbergh, M. W. P. 1999. The Stochastic Inventory Routing Problem with Direct Deliveries, Technical Report TLI99-01, The Logistics Institute, School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0205.
Kleywegt, A. J. and Shapiro, A. 1999. The Sample Average Approximation Method for Stochastic Discrete Optimization. Preprint, available at: Stochastic Programming E-Print Series, http://dochost.rz.hu-berlin.de/speps/.
Kushner, H. J. 1990. Numerical Methods for Continuous Control Problems in Continuous Time. SIAM Journal on Control and Optimization, 28, 999–1048.
Kushner, H. J. and Clark, D. S. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, Berlin, Germany.
Kushner, H. J. and Dupuis, P. 1992. Numerical Methods for Stochastic Control Problems in Continuous Time. Springer-Verlag, New York, NY.
L'Ecuyer, P. and Glynn, P. W. 1994. Stochastic Optimization by Simulation: Convergence Proofs for the GI/G/1 Queue in Steady-State. Management Science, 11, 1562–1578.
Mendelssohn, R. 1982. An Iterative Aggregation Procedure for Markov Decision Processes. Operations Research, 30, 62–73.
Morin, T. 1978. Computational Advances in Dynamic Programming. In Dynamic Programming and its Applications. M. L. Puterman (editor). Academic Press, New York, NY, 53–90.
Nemhauser, G. L. 1966. Introduction to Dynamic Programming. Wiley, New York, NY.
Norkin, V. I., Pflug, G. C. and Ruszczyński, A. 1998. A Branch and Bound Method for Stochastic Global Optimization. Mathematical Programming, 83, 425–450.
Puterman, M. L. 1994. Markov Decision Processes. John Wiley & Sons, Inc., New York, NY.
Robbins, H. and Monro, S. 1951. On a Stochastic Approximation Method. Annals of Mathematical Statistics, 22, 400–407.
Robinson, S. M. 1996. Analysis of Sample-Path Optimization. Mathematics of Operations Research, 21, 513–528.
Rogers, D. F., Plante, R. D., Wong, R. T. and Evans, J. R. 1991. Aggregation and Disaggregation Techniques and Methodology in Optimization. Operations Research, 39, 553–582.
Ross, S. M. 1970. Applied Probability Models with Optimization Applications. Dover, New York, NY.
Ross, S. M. 1983. Introduction to Stochastic Dynamic Programming. Academic Press, New York, NY.
Rubinstein, R. Y. and Shapiro, A. 1990. Optimization of Simulation Models by the Score Function Method. Mathematics and Computers in Simulation, 32, 373–392.
Rubinstein, R. Y. and Shapiro, A. 1993. Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method. John Wiley & Sons, Chichester, England.
Ruppert, D. 1991. Stochastic Approximation. In Handbook of Sequential Analysis. B. K. Ghosh and P. K. Sen (editors). Marcel Dekker, New York, NY, 503–529.
Schultz, R., Stougie, L. and Van der Vlerk, M. H. 1998. Solving Stochastic Programs with Integer Recourse by Enumeration: a Framework Using Gröbner Basis Reductions. Mathematical Programming, 83, 229–252.
Schweitzer, P. J. 1984. Aggregation Methods for Large Markov Chains. In Mathematical Computer Performance and Reliability. G. Iazeolla, P. J. Courtois and A. Hordijk (editors). Elsevier Science Publishers, Amsterdam, Netherlands, 275–286.
Schweitzer, P. J. 1986. An Iterative Aggregation-Disaggregation Algorithm for Solving Undiscounted Semi-Markovian Reward Processes. Stochastic Models, 2, 1–41.
Schweitzer, P. J. and Kindle, K. W. 1986. Iterative Aggregation for Solving Undiscounted Semi-Markovian Reward Processes. Communications in Statistics. Stochastic Models, 2, 1–41.
Schweitzer, P. J., Puterman, M. L. and Kindle, K. W. 1985. Iterative Aggregation-Disaggregation Procedures for Discounted Semi-Markov Reward Processes. Operations Research, 33, 589–605.
Schweitzer, P. J. and Seidman, A. 1985. Generalized Polynomial Approximations in Markovian Decision Processes. Journal of Mathematical Analysis and Applications, 110, 568–582.
Sennott, L. I. 1997a. The Computation of Average Optimal Policies in Denumerable State Markov Decision Chains. Advances in Applied Probability, 29, 114–137.
Sennott, L. I. 1997b. On Computing Average Cost Optimal Policies with Application to Routing to Parallel Queues. Zeitschrift für Operations Research, 45, 45–62.
Sennott, L. I. 1999. Stochastic Dynamic Programming and the Control of Queueing Systems. John Wiley & Sons, New York, NY.
Serfozo, R. F. 1976. Monotone Optimal Policies for Markov Decision Processes. Mathematical Programming Study, 6, 202–215.
Shapiro, A. 1996. Simulation-based Optimization: Convergence Analysis and Statistical Inference. Stochastic Models, 12, 425–454.
Shapiro, A. and Homem-de-Mello, T. 1998. A Simulation-Based Approach to Two-Stage Stochastic Programming with Recourse. Mathematical Programming, 81, 301–325.
Shapiro, A. and Homem-de-Mello, T. 1999. On Rate of Convergence of Monte Carlo Approximations of Stochastic Programs. Preprint, available at: Stochastic Programming E-Print Series, http://dochost.rz.hu-berlin.de/speps/.
Simon, H. A. and Ando, A. 1961. Aggregation of Variables in Dynamic Systems. Econometrica, 29, 111–138.
Stewart, G. W. 1983. Computable Error Bounds for Aggregated Markov Chains. Journal of the Association for Computing Machinery, 30, 271–285.
Stewart, G. W. 1984. On the Structure of Nearly Uncoupled Markov Chains. In Mathematical Computer Performance and Reliability. G. Iazeolla, P. J. Courtois and A. Hordijk (editors). Elsevier Science Publishers B.V., Amsterdam, Netherlands, chapter 2.7, 287–302.
Sutton, R. S. and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
Thomas, L. C. and Stengos, D. 1985. Finite State Approximation Algorithms for Average Cost Denumerable State Markov Decision Processes. OR Spektrum, 7, 27–37.
Topkis, D. M. 1978. Minimizing a Submodular Function on a Lattice. Operations Research, 26, 305–321.
Tsitsiklis, J. N. and Van Roy, B. 1996. Feature-Based Methods for Large-Scale Dynamic Programming. Machine Learning, 22, 59–94.
Van Dijk, N. 1991a. On Truncations and Perturbations of Markov Decision Problems with an Application to Queueing Network Overflow Control. Annals of Operations Research, 29, 515–536.
Van Dijk, N. 1991b. Truncation of Markov Chains with Applications to Queueing. Operations Research, 39, 1018–1026.
Van Roy, B. and Tsitsiklis, J. N. 1996. Stable Linear Approximations to Dynamic Programming for Stochastic Control Problems with Local Transitions. Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA, 1045–1051.
Van Slyke, R. and Wets, R. J. B. 1969. L-Shaped Linear Programs with Application to Optimal Control and Stochastic Programming. SIAM Journal on Applied Mathematics, 17, 638–663.
White, D. J. 1980a. Finite-State Approximations for Denumerable-State Infinite-Horizon Discounted Markov Decision Processes: The Method of Successive Approximations. In Recent Developments in Markov Decision Processes. R. Hartley, L. C. Thomas and D. J. White (editors). Academic Press, New York, NY, 57–72.
White, D. J. 1980b. Finite-State Approximations for Denumerable-State Infinite-Horizon Discounted Markov Decision Processes. Journal of Mathematical Analysis and Applications, 74, 292–295.
White, D. J. 1982. Finite-State Approximations for Denumerable-State Infinite Horizon Discounted Markov Decision Processes with Unbounded Rewards. Journal of Mathematical Analysis and Applications, 86, 292–306.
Whitt, W. 1978. Approximations of Dynamic Programs, I. Mathematics of Operations Research, 3, 231–243.
Whitt, W. 1979a. A-Priori Bounds for Approximations of Markov Programs. Journal of Mathematical Analysis and Applications, 71, 297–302.
Whitt, W. 1979b. Approximations of Dynamic Programs, II. Mathematics of Operations Research, 4, 179–185.
Wong, P. J. 1970a. An Approach to Reducing the Computing Time for Dynamic Programming. Operations Research, 18, 181–185.
Wong, P. J. 1970b. A New Decomposition Procedure for Dynamic Programming. Operations Research, 18, 119–131.