\[
\theta_{n+1} = \theta_n - a_n D_n. \tag{1}
\]
Algorithm (1) above moves along the direction of negative gradient so that the objective function is reduced in each step.
In the discrete counterpart of (1) proposed in Lim and Glynn (2006), $D_n$ mimics the derivative of the objective function by computing the difference of the objective values evaluated at the nearest integer to the right of $\theta_n$ and the nearest integer to the left of $\theta_n$.
When applying a gradient-based algorithm such as (1) in practice, we encounter the problem of choosing the right sequence of gains $a_n$. To ensure the convergence of (1), $a_n$ needs to satisfy $\sum_n a_n = \infty$ and $\sum_n a_n^2 < \infty$. A typical choice of $a_n$ is $c/n$
for some positive constant $c$. In most practical situations, where the computational budget and hence the number of simulation runs is predetermined, the choice of the sequence of gains $a_n$ influences the empirical performance of the algorithm.
When the gains are too large, the estimates for the optimal solution will bounce around the parameter space whereas when
the gains are too small, the estimates for the optimal solution will seem to get stuck at some point. Furthermore, one often
has no a priori information on the choice of the sequence of gains.
To illustrate the significance of the choice of $a_n$, consider the problem of minimizing a cost function $f$ which depends on a parameter $\theta$ in $\mathbb{Z}$: $f(\theta) = \theta^2$ for all integers $\theta$. For the purpose of simplicity, assume that the measurement of the objective function $f(\theta)$ is exact, i.e., the problem is deterministic. We assume that the initial point $\theta_1$ is 100.10. Since one has no a priori information on $a_n$, one needs to guess the values of $a_1, a_2, \ldots$. Suppose that one decides to try the sequences of gains $a_n = 5/n$, $a_n = 1/n$, and $a_n = 1/(5n)$; then the trajectories of $[\theta_n]$, the nearest integer to $\theta_n$, generated from
\[
\theta_{n+1} = \theta_n - a_n\bigl(f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr), \tag{2}
\]
for each choice of the sequence are as follows:
\[
\begin{array}{l|rrrrrr}
a_n = 5/n & 100 & -905 & 3618 & -8441 & 12661 & -12660 \\
a_n = 1/n & 100 & -101 & -1 & 0 & 0 & 0 \\
a_n = 1/(5n) & 100 & 60 & 48 & 42 & 38 & 35
\end{array}
\]
In (2), $\lceil x \rceil$ is the smallest integer greater than or equal to $x$ and $\lfloor x \rfloor$ is the largest integer less than or equal to $x$. When $a_n = 5/n$, the estimated optimal points $[\theta_n]$ tend to bounce around too much, whereas when $a_n = 1/(5n)$, $[\theta_n]$ tends to converge slowly.
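For concreteness, a minimal Python sketch of the deterministic recursion (2) for this toy example is given below. The gain sequences and the initial point are those used above; the printed values may differ slightly from the table depending on the rounding convention used for $[\theta_n]$.

```python
import math

def f(theta):
    # Toy objective from the example: f(theta) = theta^2 on the integers.
    return theta ** 2

def run_recursion(gain, theta1=100.10, num_steps=6):
    """Iterate (2): theta_{n+1} = theta_n - a_n (f(ceil(theta_n)) - f(floor(theta_n)))."""
    theta = theta1
    trajectory = [round(theta)]          # [theta_n], the nearest integer to theta_n
    for n in range(1, num_steps):
        a_n = gain(n)
        theta = theta - a_n * (f(math.ceil(theta)) - f(math.floor(theta)))
        trajectory.append(round(theta))
    return trajectory

if __name__ == "__main__":
    for label, gain in [("5/n", lambda n: 5.0 / n),
                        ("1/n", lambda n: 1.0 / n),
                        ("1/(5n)", lambda n: 1.0 / (5 * n))]:
        print(f"a_n = {label}: {run_recursion(gain)}")
```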
To suggest the choice of $a_n$, we consider a Newton-Raphson type approach. Recall that the Newton-Raphson method to find the optimal point of $f$ is based on the following recursion:
\[
\theta_{n+1} = \theta_n - \frac{f'(\theta_n)}{f''(\theta_n)}, \tag{3}
\]
where $f'$ and $f''$ are the first and second derivatives of $f$, respectively. Considering that stochastic approximation (1) is asymptotically optimal when $a_n = 1/(f''(\theta^*)\,n)$, where $\theta^*$ is the optimal solution, we propose the following algorithm:
\[
\theta_{n+1} = \theta_n - \frac{c}{n}\,\frac{D_n}{H_n}, \tag{4}
\]
where $D_n$ is an estimate of $f'(\theta_n)$ and $H_n$ is an estimate of $f''(\theta^*)$.
The algorithm (4) resembles the Newton-Raphson method and makes use of both the gradient and the Hessian information of the objective function. It suggests a multiple of the reciprocal of the Hessian as the sequence of gains. In addition to fast convergence in the first few iterations, the proposed algorithm enjoys a nice asymptotic property: it converges to a local optimizer with probability one at rate $1/n$, where $n$ is the number of iterations. The convergence rate $1/n$ is induced by the fact that the objective function is defined on a discrete set, and hence $E(D_n/H_n)$ in (4) stays at a constant value when $\theta_n$ gets close to $\theta^*$. Therefore, the convergence rate of (4) differs from the conventional convergence rates obtained in continuous cases.
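For the quadratic toy example considered earlier, this gain prescription can be made concrete:
\[
f(\theta) = \theta^2 \;\Rightarrow\; f''(\theta^*) = 2 \;\Rightarrow\; \frac{1}{f''(\theta^*)\, n} = \frac{1}{2n},
\]
which lies between the gains $1/n$ and $1/(5n)$ tried in that example and is consistent with $1/n$ performing best among the three choices there.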
This paper is organized as follows. In Section 2, we present the proposed algorithm and our main results. In Section 3,
we provide proofs for the main results. In Section 4, we illustrate the empirical behavior of the proposed algorithm.
2 PROBLEM FORMULATION AND MAIN RESULTS
Consider the following problem of optimization:
\[
\min_{\theta \in \mathbb{Z}} f(\theta) = E[X(\theta)], \tag{5}
\]
where $\mathbb{Z}$ is the set of integers. Suppose that $f(\theta)$ cannot be evaluated exactly; it must be estimated through simulation. Our goal is to generate a sequence of random variables $\theta_1, \theta_2, \ldots$ that converges to the optimal solution $\theta^*$ of $f$.
The proposed algorithm proceeds as follows. Choose $\theta_1$ randomly from $\mathbb{Z}$. Given $\theta_1, \ldots, \theta_n$, one generates $X_n^+$, $X_n^-$, $X_n'$, $X_n''$, and $X_n'''$, where
\[
\begin{aligned}
X_n^+ &= f(\lceil\theta_n\rceil) + \varepsilon_n^+ \\
X_n^- &= f(\lfloor\theta_n\rfloor) + \varepsilon_n^- \\
X_n' &= f([\theta_n] - 1) + \varepsilon_n' \\
X_n'' &= f([\theta_n]) + \varepsilon_n'' \\
X_n''' &= f([\theta_n] + 1) + \varepsilon_n'''
\end{aligned}
\]
and $(\varepsilon_n^+, \varepsilon_n^-, \varepsilon_n', \varepsilon_n'', \varepsilon_n'''; n = 1, 2, \ldots)$ are independent and identically distributed random variables with mean zero. Let $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$ be an increasing sequence of $\sigma$-fields such that $\varepsilon_n^+$, $\varepsilon_n^-$, $\varepsilon_n'$, $\varepsilon_n''$, and $\varepsilon_n'''$ are $\mathcal{F}_n$-measurable and independent of $\mathcal{F}_{n-1}$ for all $n \geq 2$.
Then $\theta_{n+1}$ is computed from the recursion
\[
\theta_{n+1} = \theta_n - \frac{c}{n}\,\frac{X_n^+ - X_n^-}{H_n}, \tag{6}
\]
where $H_n$ is a truncated average of $(X_i' - 2X_i'' + X_i'''; i = 1, \ldots, n)$. That is,
\[
H_n =
\begin{cases}
a, & \text{if } G_n < a \\
G_n, & \text{otherwise} \\
b, & \text{if } G_n > b,
\end{cases}
\]
where
\[
G_n = \frac{1}{n} \sum_{i=1}^{n} \bigl(X_i' - 2X_i'' + X_i'''\bigr).
\]
At iteration $n$, the estimate of the optimal solution is the nearest integer to $\theta_n$.
Below is the detailed description of the proposed algorithm.
Algorithm

1. Set $n = 1$ and choose $\theta_n$ randomly from $\mathbb{Z}$.
2. Set
\[
\theta_{n+1} = \theta_n - \frac{c}{n}\,\frac{X_n^+ - X_n^-}{H_n}.
\]
3. Set $n = n+1$ and go to Step 2.
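A minimal Python sketch of this algorithm is given below. The noisy-cost oracle `simulate_cost`, the gain constant `c`, the truncation bounds `a` and `b`, the initial range for $\theta_1$, and the fixed number of iterations used as a stopping rule are illustrative assumptions, not part of the original specification.

```python
import math
import random

def run_algorithm(simulate_cost, c=1.0, a=0.5, b=50.0, num_iterations=100, theta1=None):
    """Sketch of recursion (6): theta_{n+1} = theta_n - (c/n) * (X_n^+ - X_n^-) / H_n,
    where H_n is the truncated running average of the second differences
    X_i' - 2 X_i'' + X_i''' evaluated around the nearest integer [theta_i]."""
    theta = random.randint(-100, 100) if theta1 is None else theta1  # arbitrary initial range
    sum_second_diff = 0.0
    for n in range(1, num_iterations + 1):
        rounded = round(theta)
        # Noisy observations at ceil/floor of theta_n and at [theta_n]-1, [theta_n], [theta_n]+1.
        x_plus = simulate_cost(math.ceil(theta))
        x_minus = simulate_cost(math.floor(theta))
        x1 = simulate_cost(rounded - 1)
        x2 = simulate_cost(rounded)
        x3 = simulate_cost(rounded + 1)
        # G_n: running average of the second differences; H_n: G_n truncated to [a, b].
        sum_second_diff += x1 - 2.0 * x2 + x3
        g_n = sum_second_diff / n
        h_n = min(max(g_n, a), b)
        theta = theta - (c / n) * (x_plus - x_minus) / h_n
    return round(theta)   # estimate of the optimal solution: nearest integer to theta_n
```

For example, passing a Monte Carlo estimator of the newsvendor cost of Section 4 as `simulate_cost` returns an integer estimate of the optimal order level.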
The following assumptions will be needed:
A1. $f$ has only one local minimum at $\theta^*$.

A2. $a < f(\theta^*+1) - 2f(\theta^*) + f(\theta^*-1) < b < \infty$, where $a$ and $b$ are the truncation constants in the definition of $H_n$.
A3. $\max\bigl\{\mathrm{Var}(\varepsilon_1^+),\, \mathrm{Var}(\varepsilon_1^-),\, \mathrm{Var}(\varepsilon_1'),\, \mathrm{Var}(\varepsilon_1''),\, \mathrm{Var}(\varepsilon_1''')\bigr\} \leq \sigma^2 < \infty$.

A4. There exists a constant $\delta > 0$ such that $|f(\theta+1) - f(\theta)| \geq \delta$ for all $\theta \in \mathbb{Z}$.

A5. There exists a constant $C$ such that
\[
|f(\theta+1) - f(\theta)| \leq C\bigl(1 + |\theta - \theta^*|\bigr)
\]
for all $\theta \in \mathbb{Z}$.
A6. $c$ is large enough so that $c\gamma > f(\theta^*+1) - 2f(\theta^*) + f(\theta^*-1)$, where
\[
\gamma = \inf_{\theta \in \mathbb{R},\, \theta \neq \theta^*} \frac{\bigl|f(\lceil\theta\rceil) - f(\lfloor\theta\rfloor)\bigr|}{1 + |\theta - \theta^*|} > 0.
\]
A7. There exists $K < \infty$ such that
\[
\max\bigl(|\varepsilon_1^+|,\, |\varepsilon_1^-|,\, |\varepsilon_1'|,\, |\varepsilon_1''|,\, |\varepsilon_1'''|\bigr) < K
\]
and
\[
|\theta_1| \leq K
\]
with probability one.
We now state our main results as follows:
Theorem 1  Under A1-A7,
\[
|\theta_n - \theta^*| = O_p\!\left(\frac{1}{n}\right) \quad \text{as } n \to \infty.
\]
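Explicitly, the $O_p(1/n)$ statement means that for every $\epsilon > 0$ there exists a constant $C_\epsilon < \infty$ such that
\[
P\bigl(n\,|\theta_n - \theta^*| > C_\epsilon\bigr) \leq \epsilon \quad \text{for all } n,
\]
i.e., $n(\theta_n - \theta^*)$ is bounded in probability.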
3 PROOFS
Proof of Theorem 1  For simplicity of writing, we assume $\theta^* = 0$. Let $A_n = (X_n^+ - X_n^-)/H_n$. First, we prove
\[
\inf_{|\theta_n| \geq 1/\epsilon} \theta_n\, E(A_n \mid \mathcal{F}_n) \geq \delta b^{-1} > 0 \quad \text{for every } \epsilon > 0 \tag{7}
\]
and
\[
E(A_n^2 \mid \mathcal{F}_n) \leq C_1\bigl(1 + \theta_n^2\bigr) \quad \text{for some constant } C_1. \tag{8}
\]
For any $\theta_n$ with $|\theta_n| \geq 1/\epsilon$,
\[
\begin{aligned}
\theta_n\, E(A_n \mid \mathcal{F}_n)
&= \theta_n\, E\!\left[\frac{X_n^+ - X_n^-}{H_n}\,\Big|\,\mathcal{F}_n\right] \\
&= \theta_n \bigl(f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr)\, E\left[1/H_n\right].
\end{aligned}
\]
Note that A1 implies $f(\lceil\theta\rceil) - f(\lfloor\theta\rfloor) \geq 0$ if $\theta > 0$ and $f(\lceil\theta\rceil) - f(\lfloor\theta\rfloor) \leq 0$ if $\theta \leq 0$. So $\theta\bigl(f(\lceil\theta\rceil) - f(\lfloor\theta\rfloor)\bigr) = |\theta|\,\bigl|f(\lceil\theta\rceil) - f(\lfloor\theta\rfloor)\bigr|$ for any $\theta \in \mathbb{R}$. So,
\[
\theta_n \bigl(f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr)\, E\left[1/H_n\right]
= |\theta_n|\,\bigl|f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr|\, E\left[1/H_n\right]
\geq \delta b^{-1}
\]
by A4.
Hence (7) is proven. Note that
\[
\begin{aligned}
E(A_n^2 \mid \mathcal{F}_n)
&= E\!\left[\frac{(X_n^+ - X_n^-)^2}{H_n^2}\,\Big|\,\mathcal{F}_n\right] \\
&\leq \Bigl(\bigl(f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr)^2 + 2\sigma^2\Bigr)\, E\!\left[\frac{1}{H_n^2}\,\Big|\,\mathcal{F}_n\right] \\
&\leq a^{-2}\Bigl(\bigl(f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr)^2 + 2\sigma^2\Bigr) \quad \text{by A2} \\
&\leq a^{-2}\Bigl(C^2\bigl(1 + |\lfloor\theta_n\rfloor|\bigr)^2 + 2\sigma^2\Bigr) \quad \text{by A5} \\
&\leq a^{-2}\Bigl(C^2\bigl(2 + |\theta_n|\bigr)^2 + 2\sigma^2\Bigr) \\
&\leq C_1\bigl(1 + |\theta_n|^2\bigr),
\end{aligned}
\]
for some constant $C_1$. Hence, (8) is proven.
To prove $\theta_n \to 0$, note that
\[
\begin{aligned}
E(\theta_{n+1}^2 \mid \mathcal{F}_n)
&= E\!\left[\left(\theta_n - \frac{c}{n} A_n\right)^{2} \Big|\, \mathcal{F}_n\right] \\
&= \theta_n^2 + \frac{c^2}{n^2}\, E(A_n^2 \mid \mathcal{F}_n) - \frac{2c}{n}\, \theta_n\, E(A_n \mid \mathcal{F}_n) \\
&\leq \theta_n^2 + \frac{c^2 C_1}{n^2}\bigl(1 + \theta_n^2\bigr) - \frac{2c}{n}\, \theta_n\, E(A_n \mid \mathcal{F}_n) \quad \text{by (8)} \\
&= \left(1 + \frac{c^2 C_1}{n^2}\right)\theta_n^2 + \frac{c^2 C_1}{n^2} - \frac{2c}{n}\, \theta_n\, E(A_n \mid \mathcal{F}_n).
\end{aligned}
\]
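For reference, the almost-supermartingale theorem of Robbins and Siegmund (1971) used in the next step can be paraphrased as follows: if $V_n, u_n, v_n, w_n \geq 0$ are $\mathcal{F}_n$-measurable and
\[
E(V_{n+1} \mid \mathcal{F}_n) \leq (1 + u_n) V_n + v_n - w_n, \qquad \sum_n u_n < \infty, \quad \sum_n v_n < \infty
\]
with probability one, then $V_n$ converges to a finite limit and $\sum_n w_n < \infty$ with probability one. Here it is applied with $V_n = \theta_n^2$, $u_n = v_n = c^2 C_1/n^2$, and $w_n = (2c/n)\,\theta_n\, E(A_n \mid \mathcal{F}_n)$.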
By the theorem for almost supermartingales in Robbins and Siegmund (1971), $\theta_n^2$ converges as $n \to \infty$ and
\[
\sum_{n=1}^{\infty} \frac{1}{n}\, \theta_n\, E(A_n \mid \mathcal{F}_n) < \infty \tag{9}
\]
with probability one.
We now need to show that $\theta_n \to 0$ with probability one. Suppose that $\theta_n(\omega)$ does not converge to 0. Then there exist $\epsilon > 0$ and $N$ such that $|\theta_n(\omega)| \geq 1/\epsilon$ for all $n > N$. Since $\theta_n E(A_n \mid \mathcal{F}_n) \geq \delta b^{-1} > 0$ for all $\theta_n$ with $|\theta_n| \geq 1/\epsilon$ from (7), we have
\[
\sum_{n=1}^{\infty} \frac{1}{n}\, \theta_n\, E(A_n \mid \mathcal{F}_n) = \infty,
\]
which contradicts (9). Hence $\theta_n \to 0$ with probability one. To complete the proof, it suffices to show that $\sup_n E(n|\theta_n|) < \infty$, since for any constant $C$,
\[
P\bigl(n|\theta_n| > C\bigr) \leq \frac{E(n|\theta_n|)}{C}.
\]
For simplicity of writing, assume $\theta^* = 0$. First assume $\theta_1 = \theta^*$, and let $Y_n = n\theta_n$. We will show that
\[
E(Y_{n+1} - Y_n \mid \mathcal{F}_n) \leq -\delta' \quad \text{on the event } \{Y_n > 0\} \tag{10}
\]
and
\[
E(Y_{n+1} - Y_n \mid \mathcal{F}_n) \geq \delta'' \quad \text{on the event } \{Y_n < 0\} \tag{11}
\]
for some constants $\delta', \delta'' > 0$. From
\[
\theta_{n+1} = \theta_n - \frac{c}{n}\,\frac{X_n^+ - X_n^-}{H_n},
\]
we get
\[
(n+1)\theta_{n+1} = (n+1)\theta_n - c\left(1+\frac{1}{n}\right)\frac{X_n^+ - X_n^-}{H_n},
\]
or equivalently,
\[
Y_{n+1} = Y_n + \theta_n - c\left(1+\frac{1}{n}\right)\frac{X_n^+ - X_n^-}{H_n}.
\]
When $Y_n > 0$,
\[
\begin{aligned}
E(Y_{n+1} - Y_n \mid \mathcal{F}_n)
&= \theta_n - c\left(1+\frac{1}{n}\right) E\!\left[\frac{X_n^+ - X_n^-}{H_n}\,\Big|\,\mathcal{F}_n\right] \\
&= \theta_n - c\left(1+\frac{1}{n}\right)\bigl(f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr)\, E\!\left[\frac{1}{H_n}\right].
\end{aligned}
\]
Under A1-A4,
\[
\begin{aligned}
&\theta_n - c\left(1+\frac{1}{n}\right)\bigl(f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr)\, E\!\left[\frac{1}{H_n}\right] \\
&\quad\leq \theta_n - c\left(1+\frac{1}{n}\right)\bigl(f(1) - 2f(0) + f(-1)\bigr)^{-1}\bigl(f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr) \\
&\quad\leq \lceil\theta_n\rceil - c\left(1+\frac{1}{n}\right)\bigl(f(1) - 2 f(0) + f(-1)\bigr)^{-1}\bigl(f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr) \quad \text{because } \theta_n > 0 \\
&\quad\leq \lceil\theta_n\rceil - c\bigl(f(1) - 2f(0) + f(-1)\bigr)^{-1}\bigl(f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr) \\
&\quad\leq \lceil\theta_n\rceil - c\bigl(f(1) - 2 f(0) + f(-1)\bigr)^{-1}\gamma\,\lceil\theta_n\rceil \quad \text{by A6} \\
&\quad= \lceil\theta_n\rceil\Bigl(1 - c\gamma\bigl(f(1) - 2f(0) + f(-1)\bigr)^{-1}\Bigr) \\
&\quad\leq -\delta'\,\lceil\theta_n\rceil \quad \text{for some } \delta' > 0 \text{ by A6} \\
&\quad\leq -\delta' \quad \text{because } \lceil\theta_n\rceil \geq 1 \text{ for all } \theta_n > 0.
\end{aligned}
\]
Hence (10) holds. (11) follows in a similar way.
Let $\beta_i = E\bigl((i+1)\theta_{i+1} - i\theta_i \mid \mathcal{F}_i\bigr)$ for $i = 1, 2, \ldots$. Note that A7 implies
\[
|\beta_i| \leq b(K, N), \tag{13}
\]
for all $i = 1, \ldots, N$ with probability one, where $b(K, N)$ is a constant depending on $K$ and $N$, because
\[
\begin{aligned}
|\beta_i| &= \bigl|E\bigl((i+1)\theta_{i+1} - i\theta_i \mid \mathcal{F}_i\bigr)\bigr| \\
&\leq E\bigl(\bigl|(i+1)\theta_{i+1} - i\theta_i\bigr| \,\big|\, \mathcal{F}_i\bigr) \\
&= E\!\left(\left|\theta_i - c\left(1+\frac{1}{i}\right)\frac{X_i^+ - X_i^-}{H_i}\right| \,\Big|\, \mathcal{F}_i\right) \\
&\leq |\theta_i| + c\left(1+\frac{1}{i}\right) E\!\left(\frac{|X_i^+ - X_i^-|}{|H_i|}\,\Big|\, \mathcal{F}_i\right) \\
&\leq |\theta_i| + c\left(1+\frac{1}{i}\right) a^{-1}\bigl(f(\lceil\theta_i\rceil) - f(\lfloor\theta_i\rfloor) + 2K\bigr) \\
&\leq |\theta_i| + c\left(1+\frac{1}{i}\right) a^{-1}\bigl(C(1 + |\lfloor\theta_i\rfloor|) + 2K\bigr) \\
&\leq C_2\bigl(1 + |\theta_i|\bigr)
\end{aligned}
\]
for some constant $C_2$, and
\[
\begin{aligned}
|\theta_i| &= \left|\theta_{i-1} - \frac{c}{i-1}\,\frac{X_{i-1}^+ - X_{i-1}^-}{H_{i-1}}\right| \\
&\leq |\theta_{i-1}| + \frac{C_3}{i-1}\bigl(1 + |\theta_{i-1}|\bigr) \\
&\leq \left(1 + \frac{C_3}{i-1}\right)|\theta_{i-1}| + \frac{C_3}{i-1} \\
&\;\;\vdots \\
&\leq \left(1 + \frac{C_3}{i-1}\right)\left(1 + \frac{C_3}{i-2}\right)\cdots\left(1 + \frac{C_3}{1}\right)|\theta_1|
 + \sum_{k=1}^{i-1} \prod_{j=1}^{k-1}\left(1 + \frac{C_3}{i-j}\right)\frac{C_3}{i-k},
\end{aligned}
\]
for some constant $C_3$.
We will compute an upper bound for $E|Y_n|$ that does not depend on $n$. Let $U = \max\{k \leq n : Y_k \leq 0\}$ denote the last time up to $n$ at which $(Y_k)_{k=1,\ldots,n}$ takes a nonpositive value. For $t > N\, b(K, N)$, we have
\[
\begin{aligned}
P(Y_n \geq t)
&= \sum_{k=1}^{n-1} P(Y_n \geq t,\, U = k) \\
&\leq \sum_{k=1}^{n-1} P\bigl(Y_n - Y_k \geq t,\, Y_k \leq 0,\, Y_i > 0 \text{ for } k < i < n\bigr) \\
&\leq \sum_{k=1}^{N-1} P\bigl(Y_n - Y_{n-1} - \beta_{n-1} + \cdots + Y_{k+1} - Y_k - \beta_k \geq t + (n-N)\delta' - b(K, N)(N-k)\bigr) \\
&\qquad + \sum_{k=N}^{n-1} P\bigl(Y_n - Y_{n-1} - \beta_{n-1} + \cdots + Y_{k+1} - Y_k - \beta_k \geq t + (n-k)\delta'\bigr) \\
&\leq \sum_{k=1}^{N-1} \frac{E\bigl|Y_n - Y_{n-1} - \beta_{n-1} + \cdots + Y_{k+1} - Y_k - \beta_k\bigr|^p}{\bigl|t + (n-N)\delta' - b(K, N)(N-k)\bigr|^p}
 + \sum_{k=N}^{n-1} \frac{E\bigl|Y_n - Y_{n-1} - \beta_{n-1} + \cdots + Y_{k+1} - Y_k - \beta_k\bigr|^p}{\bigl|t + (n-k)\delta'\bigr|^p}
\end{aligned}
\]
for any $p > 6$.
Note that $(Y_{n+1} - Y_n - \beta_n : n = 1, \ldots)$ is a sequence of martingale differences with $\sup_n E|Y_{n+1} - Y_n - \beta_n|^p < \infty$, because
\[
\begin{aligned}
E|Y_{n+1} - Y_n - \beta_n|^p
&= E\left| c\left(1+\frac{1}{n}\right)\frac{X_n^+ - X_n^-}{H_n} - c\left(1+\frac{1}{n}\right) E\!\left[\frac{X_n^+ - X_n^-}{H_n}\,\Big|\,\mathcal{F}_n\right] \right|^p \\
&= c^p\left(1+\frac{1}{n}\right)^{p} E\left| \frac{X_n^+ - X_n^-}{H_n} - E\!\left[\frac{X_n^+ - X_n^-}{H_n}\,\Big|\,\mathcal{F}_n\right] \right|^p \\
&\leq c^p 2^p\, E\left| \frac{X_n^+ - X_n^- - \bigl(f(\lceil\theta_n\rceil) - f(\lfloor\theta_n\rfloor)\bigr) E(1/H_n)\, H_n}{H_n} \right|^p \\
&\leq c^p 2^p a^{-p}\, E\bigl[C_4\,(1 + |\theta_n|)^p\bigr] \\
&\leq c^p 2^p a^{-p}\, C_4\bigl(1 + E|\theta_n|^p\bigr)
\end{aligned}
\]
for some constant $C_4$, and
\[
\begin{aligned}
E|\theta_{n+1}|^p
&= E\left| \theta_n - \frac{c}{n}\,\frac{X_n^+ - X_n^-}{H_n} \right|^p \\
&\leq E|\theta_n|^p + \frac{c^p}{n^p}\, a^{-p} C_5\bigl(1 + E|\theta_n|^p\bigr) \\
&\leq \left(1 + \frac{C_6}{n^p}\right) E|\theta_n|^p + \frac{C_6}{n^p}
\end{aligned}
\]
for some constants $C_5$ and $C_6$. Hence, by Lemma 1 of Venter (1966), $E|\theta_n|^p$ is bounded.
Note that
\[
\sum_{k=N}^{n-1} \frac{E\bigl|Y_n - Y_{n-1} - \beta_{n-1} + \cdots + Y_{k+1} - Y_k - \beta_k\bigr|^p}{\bigl|t + (n-k)\delta'\bigr|^p}
\leq \sum_{k=N}^{n-1} \frac{C_p\,(n-k)^{p/2}}{\bigl(t + (n-k)\delta'\bigr)^p}
\leq \sum_{k=1}^{\infty} \frac{C_p\, k^{p/2}}{\bigl(t + k\delta'\bigr)^p}
\]
and that
\[
\sum_{k=1}^{N-1} \frac{E\bigl|Y_n - Y_{n-1} - \beta_{n-1} + \cdots + Y_{k+1} - Y_k - \beta_k\bigr|^p}{\bigl|t + (n-N)\delta' - b(K, N)N\bigr|^p}
\leq \sum_{k=1}^{N-1} \frac{C_p\,(n-k)^{p/2}}{\bigl|t + (n-N)\delta' - b(K, N)N\bigr|^p}
\]
for some constant $C_p$ by Lemma 2.1 of Li (2003).
So,
\[
\begin{aligned}
E\max(Y_n, 0) &= \int_0^\infty P(Y_n > t)\, dt \\
&\leq N\, b(K, N) + \int_{N b(K,N)}^{\infty} \sum_{k=1}^{\infty} \frac{C_p\, k^{p/2}}{(t + k\delta')^{p}}\, dt
 + \int_{N b(K,N)}^{\infty} \sum_{k=1}^{N-1} \frac{C_p\,(n-k)^{p/2}}{\bigl|t + (n-N)\delta' - b(K, N)N\bigr|^{p}}\, dt \\
&= N\, b(K, N) + \sum_{k=1}^{\infty} \int_{N b(K,N)}^{\infty} \frac{C_p\, k^{p/2}}{(t + k\delta')^{p}}\, dt
 + \sum_{k=1}^{N-1} \int_{N b(K,N)}^{\infty} \frac{C_p\,(n-k)^{p/2}}{\bigl|t + (n-N)\delta' - b(K, N)N\bigr|^{p}}\, dt \\
&= N\, b(K, N) + \sum_{k=1}^{\infty} \frac{C_p\, k^{p/2}}{(p-1)\bigl(N b(K, N) + k\delta'\bigr)^{p-1}}
 + \sum_{k=1}^{N-1} \frac{C_p\,(n-k)^{p/2}}{(p-1)\bigl((n-N)\delta'\bigr)^{p-1}} \\
&\leq K'
\end{aligned}
\]
for some constant $K' < \infty$ which does not depend on $n$. A bound on $E\max(-Y_n, 0)$ that does not depend on $n$ follows in a similar way from (11). Hence the desired result is proven when $\theta_1 = \theta^*$.

For the general case, let $Y_n = n(\theta_n - \theta^*) - (\theta_1 - \theta^*)$. The same argument shows that $E\bigl(n(\theta_n - \theta^*) - (\theta_1 - \theta^*)\bigr)^+$ and $E\bigl(n(\theta_n - \theta^*) - (\theta_1 - \theta^*)\bigr)^-$ are bounded uniformly in $n$. So
\[
E\bigl|n(\theta_n - \theta^*)\bigr| \leq E\bigl|n(\theta_n - \theta^*) - (\theta_1 - \theta^*)\bigr| + E|\theta_1 - \theta^*| < \infty
\]
by A7. Hence, the desired result is proven. $\Box$
4 A NUMERICAL EXAMPLE
Consider the single-period newsvendor problem where $\theta$ units are ordered and stocked at the beginning of the period. The goal is to find an ordering level $\theta$ that minimizes the cost function $f(\theta) = E[c\theta + h\max(0, \theta - D) + p\max(0, D - \theta)]$, where $c$ is the unit cost for producing each unit, $h$ is the holding cost per unit remaining at the end of the period, $p$ is the shortage cost per unit of unsatisfied demand, and the expectation is taken with respect to the random demand $D$. The optimal solution for this problem is given by
\[
\theta^* = F^{-1}\!\left(\frac{p - c}{p + h}\right),
\]
where $F$ is the cumulative distribution function of $D$.
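For instance, a short Python check of this formula for the parameter values used below (c = 3, h = 5, p = 9, Poisson demand with mean 100) can be written as follows; the use of `scipy.stats.poisson.ppf` as the inverse CDF is an implementation choice, not part of the paper.

```python
from scipy.stats import poisson

c, h, p = 3.0, 5.0, 9.0          # unit cost, holding cost, shortage cost
demand_mean = 100.0              # Poisson demand parameter used in this section

# Critical fractile: theta* = F^{-1}((p - c) / (p + h)),
# i.e., the smallest integer theta with F(theta) >= (p - c) / (p + h).
fractile = (p - c) / (p + h)
theta_star = int(poisson.ppf(fractile, demand_mean))
print(theta_star)                # expected to print 98
```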
We compare the proposed algorithm
\[
\theta_{n+1} = \theta_n - \frac{1}{n}\,\frac{X_n^+ - X_n^-}{H_n} \tag{14}
\]
to the following algorithm:
\[
\theta_{n+1} = \theta_n - \frac{1}{n}\,\bigl(X_n^+ - X_n^-\bigr), \tag{15}
\]
which was proposed in Lim and Glynn (2006) and which does not make use of the Hessian information.
Table 1 compares the performance of (14) and (15) with $c = 3$, $h = 5$, and $p = 9$. Demand follows a Poisson distribution with parameter 100, resulting in $\theta^* = 98$. In (14) and (15), the gain $1/n$ is adjusted according to the total number of iterations as follows:
\[
a_n = \frac{1}{\bigl\lceil 10n / N \bigr\rceil}, \quad \text{where } N \text{ is the total number of iterations,}
\]
so that $a_n$ takes the values $1/1, 1/2, 1/3, \ldots, 1/10$ and does not become too small at the end of the iterations.
At each $\theta_n$, $X_n^+$ is the average of 500 replications of $c\lceil\theta_n\rceil + h(\lceil\theta_n\rceil - D)^+ + p(D - \lceil\theta_n\rceil)^+$. $X_n^-$, $X_n'$, $X_n''$, and $X_n'''$ are
computed in a similar way. Table 1 reports the absolute difference between the sample mean of $\theta_n$ and $\theta^*$, the sample variance of $\theta_n$, and the resulting mean squared error, based on 200 independent replications with $\theta_1 = 5.231$. Here $n_1$ and $n_2$ are the total numbers of iterations for (15) and (14), respectively. The ratio of $n_1$ to $n_2$ is set to 5 to 2, reflecting the fact that each iteration of (14) requires generating $X(\theta)$ at 5 different values of $\theta$, whereas each iteration of (15) requires only 2. Hence the same amount of computational budget is allocated to (14) and (15).
Table 1: Performance of Algorithms (14) and (15)
                     n_1 = 15, n_2 = 6             n_1 = 25, n_2 = 10            n_1 = 35, n_2 = 14
theta* = 98     |mean - theta*|  Variance  MSE  |mean - theta*|  Variance  MSE  |mean - theta*|  Variance  MSE
Algorithm (14)        51.14      1003.00  3618.29     24.00       967.21  1543.21     18.14       907.21  1236.27
Algorithm (15)        69.31        73.49  4877.32     50.94       106.92  2701.80     33.69       153.76  1288.78
Table 1 shows that the proposed algorithm, Algorithm (14), approaches the optimal solution faster than Algorithm (15), but shows more variability. However, the overall efficiency summarized by the mean squared error indicates that the proposed algorithm outperforms Algorithm (15), since the variance increase is more than offset by the bias reduction.
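A minimal Python sketch of the experiment described in this section is given below. The cost estimator and the replication counts follow the description above; the truncation bounds `a` and `b` are illustrative assumptions (they are not reported here), the gain schedule uses the adjustment reconstructed above, and the pairing of iteration counts (6 for (14), 15 for (15)) follows the equal-budget convention just described.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
c, h, p, demand_mean = 3.0, 5.0, 9.0, 100.0

def noisy_cost(theta, reps=500):
    """Average of `reps` replications of the newsvendor cost at order level theta."""
    d = rng.poisson(demand_mean, size=reps)
    return np.mean(c * theta + h * np.maximum(0, theta - d) + p * np.maximum(0, d - theta))

def gain(n, total):
    # Gain schedule a_n = 1 / ceil(10 n / N), as reconstructed above.
    return 1.0 / math.ceil(10.0 * n / total)

def algorithm_14(theta1, total, a=0.5, b=50.0):
    """Proposed algorithm (14): uses the truncated Hessian estimate H_n."""
    theta, sum_sd = theta1, 0.0
    for n in range(1, total + 1):
        x_plus, x_minus = noisy_cost(math.ceil(theta)), noisy_cost(math.floor(theta))
        r = round(theta)
        sum_sd += noisy_cost(r - 1) - 2.0 * noisy_cost(r) + noisy_cost(r + 1)
        h_n = min(max(sum_sd / n, a), b)
        theta -= gain(n, total) * (x_plus - x_minus) / h_n
    return theta

def algorithm_15(theta1, total):
    """Algorithm (15): plain finite difference, no Hessian information."""
    theta = theta1
    for n in range(1, total + 1):
        theta -= gain(n, total) * (noisy_cost(math.ceil(theta)) - noisy_cost(math.floor(theta)))
    return theta

if __name__ == "__main__":
    est14 = [algorithm_14(5.231, 6) for _ in range(200)]
    est15 = [algorithm_15(5.231, 15) for _ in range(200)]
    for name, est in [("(14)", est14), ("(15)", est15)]:
        bias = abs(np.mean(est) - 98.0)
        print(name, round(bias, 2), round(np.var(est), 2), round(bias ** 2 + np.var(est), 2))
```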
ACKNOWLEDGMENTS
The author would like to thank all anonymous referees for their valuable comments and suggestions.
REFERENCES
Andradóttir, S. 1995. A method for discrete stochastic optimization. Management Science 41:1946-1961.
Chung, K. L. 1954. On a stochastic approximation method. Annals of Mathematical Statistics 25:463-483.
Dupač, V., and U. Herkenrath. 1982. Stochastic approximation on a discrete set and the multi-armed bandit problem. Communications in Statistics - Sequential Analysis 1:1-25.
Gelfand, S., and S. Mitter. 1989. Simulated annealing with noisy or imprecise energy measurements. Journal of Optimization Theory and Applications 62:49-62.
Hong, L. J., and B. L. Nelson. 2006. Discrete optimization via simulation using COMPASS. Operations Research 54:115-129.
Kleywegt, A., A. Shapiro, and T. Homem-de-Mello. 2001. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization 12:479-502.
Li, Y. 2003. A martingale inequality and large deviations. Statistics and Probability Letters 62:317-321.
Lim, E., and P. W. Glynn. 2006. Discrete optimization via simulation in the presence of regularity. INFORMS National Meeting, Pittsburgh.
Lin, X., and L. H. Lee. 2006. A new approach to discrete stochastic optimization problems. European Journal of Operational Research 172:761-782.
Robbins, H., and S. Monro. 1951. A stochastic approximation method. Annals of Mathematical Statistics 22:400-407.
Robbins, H., and D. Siegmund. 1971. A convergence theorem for nonnegative almost supermartingales and some applications. In Optimizing Methods in Statistics, 233-257. New York: Academic Press.
Shi, L., and S. Olafsson. 2000. Nested partitions method for stochastic optimization. Methodology and Computing in Applied Probability 2:271-291.
Venter, J. H. 1966. On Dvoretzky stochastic approximation theorems. Annals of Mathematical Statistics 37:1534-1544.
AUTHOR BIOGRAPHY
EUNJI LIM is an Assistant Professor in the Industrial Engineering Department at the University of Miami. She received
her Ph.D. in Management Science and Engineering from Stanford University. Her research interests include stochastic optimization, statistical inference under shape restrictions, and simulation.