Gradient Extrapolated Stochastic Kriging
We introduce an approach for enhancing stochastic kriging in the setting where additional direct gradient
information is available (e.g., provided by techniques such as perturbation analysis or the likelihood ratio
method). The new approach, called gradient extrapolated stochastic kriging (GESK), incorporates direct
gradient estimates by extrapolating additional responses. For two simplified settings, we show that GESK
reduces mean squared error (MSE) compared to stochastic kriging under certain conditions on step sizes.
Since extrapolation step sizes are crucial to the performance of the GESK model, we propose two different
approaches to determine the step sizes: maximizing penalized likelihood and minimizing integrated mean
squared error. Numerical experiments are conducted to illustrate the performance of the GESK model and
to compare it with alternative approaches.
Categories and Subject Descriptors: I.6.1 [Computing Methodologies]: Simulation and Modeling
General Terms: Algorithms, Experimentation, Theory
Additional Key Words and Phrases: Stochastic kriging, simulation, stochastic gradients, response surface
ACM Reference Format:
Huashuai Qu and Michael C. Fu. 2014. Gradient extrapolated stochastic kriging. ACM Trans. Model. Com-
put. Simul. 24, 4, Article 23 (November 2014), 25 pages.
DOI: http://dx.doi.org/10.1145/2658995
1. INTRODUCTION
Simulation models are commonly used to provide analysis and prediction of the behav-
ior of complex stochastic systems. When the ability to collect substantial data is limited
due to high cost, models of the simulation model, called metamodels (also known as
surrogate models), are fitted by building a mathematical relationship between input
and output. Metamodels can be used to provide approximations of the underlying re-
sponse surfaces. However, constructing an accurate metamodel requires careful choice
of the modeling approach and selection of design points.
There has been a long history of research focusing on metamodels; see Barton and
Meckesheimer [2006] and Barton [2009] for an overview. Different types of metamodels
have been proposed, starting from classic linear regression models. Due to the lack of
flexibility in linear metamodels, nonlinear metamodels have been suggested to pro-
vide better global approximation and capture complicated trends in response surfaces.
This work is supported in part by the National Science Foundation (NSF) under Grants CMMI-0856256,
EECS-0901543, by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-10-10340, and
by the National Natural Science Foundation of China under Grant 71071040.
Some preliminary results of this article were previously published in Qu and Fu [2012].
Authors’ addresses: H. Qu, Department of Mathematics, University of Maryland, College Park, MD 20742;
email: [email protected]; M. C. Fu, The Robert H. Smith School of Business and Institute for Systems
Research, University of Maryland, College Park, MD 20742; email: [email protected].
One such nonlinear approach is kriging, which has been studied extensively in the
deterministic simulation community (see, e.g., Santner et al. [2003] and Kleijnen and
van Beers [2005]). Stochastic kriging was introduced by Ankenman et al. [2010] as
an extension of kriging in the stochastic simulation setting, the setting we consider
in this article. Stochastic kriging provides flexible metamodels of simulation output
performance measurements while taking simulation noise into consideration. In the
stochastic simulation setting, direct derivative information may be available, that is,
the simulation output may include not only the performance measurements, but also
estimates of the gradients of performance measurement with respect to the param-
eters. Techniques for estimating gradients, including perturbation analysis (PA) and
the likelihood ratio/score function (LR/SF) method, are discussed in Glasserman [1991],
Rubinstein and Shapiro [1993], and Fu [2008]; see also references therein.
The availability of additional gradient information suggests the potential for improv-
ing the quality of metamodels. Combining gradient information has been investigated
for building metamodels under deterministic computer simulation settings; see Liu
[2003] and Santner et al. [2003] for approaches to approximating response surfaces with
artificial neural networks and kriging. In stochastic simulation settings, researchers
have also made attempts to incorporate gradient estimates into metamodeling ap-
proaches. Ho et al. [1992] proposed a gradient surface method (GSM) that uses the gra-
dient estimates only to iteratively fit lower-order polynomial models. Fu and Qu [2014]
investigated the direct gradient augmented regression (DiGAR) approach, which is a
modification of the standard linear regression model to incorporate gradient estimates.
Chen et al. [2013] introduced stochastic kriging with gradient estimators (SKG) to ex-
ploit gradient estimates in stochastic kriging, showing that the new approach provides
better prediction with smaller mean squared error (MSE). This approach is similar to
cokriging proposed in deterministic simulations [Alonso and Chung 2002] and requires
differentiability of the correlation functions because derivatives of random processes
or random fields are used to model gradient estimates.
In this article, we take a different approach to incorporate gradient estimates into
stochastic kriging and investigate the potential improvements. A new approach called
gradient extrapolated stochastic kriging (GESK) is proposed, which extrapolates addi-
tional responses in the neighborhood of each design point using the original responses
and gradient estimates. These additional responses, which might be biased, lead to
better predictions than stochastic kriging if step sizes for extrapolations are chosen
carefully. The main idea is to further explore the response surface with simulation re-
sponses and gradient estimates so that a metamodel with better overall accuracy can
be constructed. This suggests that GESK models are superior when there is a limited
number of design points or a response surface with multiple extreme values.
To investigate the performance of GESK, we analyze the possible reduction in MSE of
the GESK model over the standard stochastic kriging model, under two simplified and
tractable settings. Conditions that guarantee reduction in MSE are provided, as well.
We also conduct numerical experiments to illustrate the effectiveness of the GESK
model. Numerical results show comparable performance for GESK and SKG, while
both approaches consistently outperform stochastic kriging. However, the number of
problems where SKG outperforms GESK is comparable to the number of problems
where GESK outperforms SKG. Because it is difficult to establish in general what
problem characteristics determine whether GESK outperforms SKG, practitioners who
want to get the most value out of these methods should try out both and see which one
works best in their problem setting.
An important part of implementing the GESK model is the choice of step size. Large
step sizes usually lead to large approximation errors and deteriorate prediction accu-
racy; small step sizes gain little information from extrapolations and might lead to
numerical issues. The main contributions of this article can be summarized as follows:
—We investigate the idea of incorporating gradient estimates into stochastic kriging
by extrapolating additional responses using the original responses and gradient
estimates. This approach is not restricted to stochastic kriging but can be applied to
other metamodeling approaches as well.
—We analyze the proposed GESK model theoretically under simplified settings and
show that it provides predictions with smaller MSE than stochastic kriging. We also
conduct numerical experiments and illustrate the performance of the GESK model.
—We present two different strategies, namely PMLE and IMSE, to determine extrap-
olation step sizes used in GESK. The effectiveness of these two strategies is compared
using numerical examples.
The remainder of the article is organized as follows. In Section 2, we review the previous
stochastic kriging models and introduce the GESK approach. In Section 3, we provide
a theoretical analysis of the MSE of GESK using two simplified and tractable problems
and analyze the effects of step sizes on the MSE. Section 4 proposes two strategies to
determine step sizes and discusses choices of gradient estimators in GESK. Numerical
experiments are conducted in Section 5. Section 6 concludes and provides topics for
future research.
2. MODELS
In this section, we review stochastic kriging introduced in Ankenman et al. [2010] and
stochastic kriging with gradient estimators (SKG) introduced in Chen et al. [2013] and
then present the GESK approach.
$Y_j(x_i)$ is modeled the same as in stochastic kriging, and each gradient estimator $G_j^r(x_i)$, $r = 1, \ldots, d$, is modeled as
\[
G_j^r(x_i) = \frac{\partial f(x_i)^\top}{\partial x_i^r}\,\beta + \frac{\partial M(x_i)}{\partial x_i^r} + \delta_j^r(x_i). \tag{6}
\]
Define the sample averages of the gradient estimators and of their simulation noise as
\[
\bar{G}^r(x_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} G_j^r(x_i), \qquad \bar{\delta}^r(x_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \delta_j^r(x_i).
\]
The SKG framework models the averaged simulation responses and gradient estimates
as follows:
\[
\bar{Y}(x_i) = f(x_i)^\top \beta + M(x_i) + \bar{\epsilon}(x_i),
\]
\[
\bar{G}^r(x_i) = \frac{\partial f(x_i)^\top}{\partial x_i^r}\,\beta + \frac{\partial M(x_i)}{\partial x_i^r} + \bar{\delta}^r(x_i).
\]
To satisfy the conditions required for Equation (6) to hold, a common choice for the
correlation function is the Gaussian correlation function. Let $\Sigma_M^+$ be the variance-covariance matrix including spatial covariances induced by $M$, spatial covariances induced by derivatives of $M$, and those between $M$ and its partial derivatives. Let $\Sigma_M^+(x_0, \cdot)$ be the vector analogous to $\Sigma_M(x_0, \cdot)$ in stochastic kriging. We assume replications across design points are independent. In addition, the simulation noises $\epsilon_j$ and $\delta_j$ are assumed to be independent of $M$. The covariance matrix $\Sigma_\epsilon^+$ induced by the simulation noise can be estimated by sample covariances in practice.
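For concreteness, the entries of $\Sigma_M^+$ that involve derivatives of $M$ can be written explicitly in the one-dimensional Gaussian case. The following display is a sketch of the standard identities obtained by differentiating the covariance function; it is our own illustration, not reproduced from the original text:
\[
R_M(x, y) = e^{-\theta(x - y)^2}, \qquad \mathrm{Cov}\!\left[M(x), \frac{\partial M(y)}{\partial y}\right] = \tau^2\,\frac{\partial R_M(x, y)}{\partial y} = 2\theta\tau^2 (x - y)\, e^{-\theta(x - y)^2},
\]
\[
\mathrm{Cov}\!\left[\frac{\partial M(x)}{\partial x}, \frac{\partial M(y)}{\partial y}\right] = \tau^2\,\frac{\partial^2 R_M(x, y)}{\partial x\,\partial y} = \big(2\theta - 4\theta^2 (x - y)^2\big)\tau^2\, e^{-\theta(x - y)^2}.
\]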
Let $\bar{Y}^+$ be the vector containing sample averages of response estimates and gradient
estimates at all design points; the SKG predictor and its MSE then take the same form as in stochastic kriging, with $\Sigma_M^+$, $\Sigma_\epsilon^+$, and $\bar{Y}^+$ in place of their stochastic kriging counterparts.
For GESK, denote the sample average of the gradient estimates at $x_i$ by
\[
\bar{G}(x_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} G_j(x_i).
\]
Notice that the response estimate $Y_j(x_i)$ and the gradient estimate $G_j(x_i)$ within the same replication $j$ are generally correlated.
To incorporate gradient estimates into stochastic kriging, we extrapolate in the neighborhood of the original design points $x_i$, $i = 1, 2, \ldots, k$. Specifically, linear extrapolation is used to obtain additional responses as follows:
\[
Y_j(x_i^+) = Y_j(x_i) + \Delta x_i^\top G_j(x_i), \qquad x_i^+ = x_i + \Delta x_i, \tag{7}
\]
where $\Delta x_i = (\Delta x_i^1, \Delta x_i^2, \ldots, \Delta x_i^d)^\top$, and the step size $\Delta x_i$ needs to be small relative to
the spacing of the $x_i$. For simplicity, we assume that only one additional point is added in
the neighborhood of $x_i$ and that the same step size is used for all design points (i.e.,
$\Delta x_i = \Delta x$, $i = 1, 2, \ldots, k$). Extensions include using more sophisticated extrapolation
techniques and extrapolating multiple additional responses in the neighborhood of $x_i$.
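As a minimal sketch of the extrapolation step in Equation (7), the following Python fragment builds the extrapolated responses from replication-level outputs; the array layout and names (`responses`, `grads`, `delta_x`) are our own illustration, not code from the paper:

```python
import numpy as np

def extrapolate_responses(responses, grads, delta_x):
    """Linear extrapolation of Equation (7), one added point per design point.

    responses: (k, n) array of Y_j(x_i), k design points, n replications each.
    grads:     (k, n, d) array of gradient estimates G_j(x_i).
    delta_x:   (d,) common step size Delta x.
    Returns the (k, n) array of extrapolated responses Y_j(x_i^+).
    """
    # Y_j(x_i^+) = Y_j(x_i) + Delta x' G_j(x_i)
    return responses + grads @ delta_x

# the extrapolated design points themselves are x_i^+ = x_i + delta_x
```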
Let $\bar{Y}(x_i^+)$ be the sample average of these extrapolated response outputs, defined
similarly to $\bar{Y}(x_i)$ in Equation (2). For ease of notation, let $\bar{Y}_i = \bar{Y}(x_i)$ and
$\bar{Y}_i^+ = \bar{Y}(x_i^+)$. Let $\bar{Y}^+$ be the $2k \times 1$ vector containing both the original responses and the
additional responses:
\[
\bar{Y}^+ = \big(\bar{Y}_1, \bar{Y}_2, \ldots, \bar{Y}_k, \bar{Y}_1^+, \bar{Y}_2^+, \ldots, \bar{Y}_k^+\big)^\top.
\]
Similarly, $x^+$ is defined as
\[
x^+ = \big(x_1, x_2, \ldots, x_k, x_1^+, x_2^+, \ldots, x_k^+\big)^\top.
\]
The sample average of the additional responses $\bar{Y}_i^+$ is modeled similarly to the original
responses $\bar{Y}_i$, that is,
\[
\bar{Y}(x_i^+) = f(x_i^+)^\top \beta + M(x_i^+) + \bar{\epsilon}(x_i^+).
\]
Chen et al. [2012] find that using CRN in stochastic kriging generally inflates the
MSE. The assumptions of independence across replications and of independence between $M$ and the
simulation noise are inherited from the stochastic kriging literature. The last assumption
says that an original simulation response is correlated with its corresponding
extrapolated response, but not with other extrapolated responses.
Let $\Sigma_M^+$ be the $2k \times 2k$ variance-covariance matrix implied by the extrinsic spatial
correlation model with $2k$ design points, including extrapolated design points:
\[
\Sigma_M^+ =
\begin{pmatrix}
\mathrm{Cov}[M(x_1), M(x_1)] & \cdots & \mathrm{Cov}[M(x_1), M(x_k)] & \mathrm{Cov}[M(x_1), M(x_1^+)] & \cdots & \mathrm{Cov}[M(x_1), M(x_k^+)] \\
\vdots & & \vdots & \vdots & & \vdots \\
\mathrm{Cov}[M(x_k), M(x_1)] & \cdots & \mathrm{Cov}[M(x_k), M(x_k)] & \mathrm{Cov}[M(x_k), M(x_1^+)] & \cdots & \mathrm{Cov}[M(x_k), M(x_k^+)] \\
\mathrm{Cov}[M(x_1^+), M(x_1)] & \cdots & \mathrm{Cov}[M(x_1^+), M(x_k)] & \mathrm{Cov}[M(x_1^+), M(x_1^+)] & \cdots & \mathrm{Cov}[M(x_1^+), M(x_k^+)] \\
\vdots & & \vdots & \vdots & & \vdots \\
\mathrm{Cov}[M(x_k^+), M(x_1)] & \cdots & \mathrm{Cov}[M(x_k^+), M(x_k)] & \mathrm{Cov}[M(x_k^+), M(x_1^+)] & \cdots & \mathrm{Cov}[M(x_k^+), M(x_k^+)]
\end{pmatrix},
\]
where each entry in $\Sigma_M^+$ can be computed by Equation (5) with a given correlation function $R_M$
and spatial variance $\tau^2$.
Let $\bar{\epsilon}^+ \in \mathbb{R}^{2k}$ be the augmented vector of mean simulation noise:
\[
\bar{\epsilon}^+ = \big(\bar{\epsilon}(x_1), \ldots, \bar{\epsilon}(x_k), \bar{\epsilon}(x_1^+), \ldots, \bar{\epsilon}(x_k^+)\big)^\top.
\]
Given a prediction point $x_0$, let $\Sigma_M^+(x_0, \cdot)$ be the $2k \times 1$ vector of spatial covariances between $x_0$ and the design points, including the
extrapolated design points. The augmented design matrix $F^+$ can be expressed as
\[
F^+ = \big(f(x_1), \ldots, f(x_k), f(x_1^+), \ldots, f(x_k^+)\big)^\top.
\]
When $\beta$, $\Sigma_M^+$, and $\Sigma_\epsilon^+$ are known, the MSE-optimal predictor from the GESK model and
its corresponding MSE can be constructed by substituting $\bar{Y}^+$, $F^+$, $\Sigma_M^+(x_0, \cdot)$, $\Sigma_M^+$, and $\Sigma_\epsilon^+$ for
$\bar{Y}$, $F$, $\Sigma_M(x_0, \cdot)$, $\Sigma_M$, and $\Sigma_\epsilon$ in Equations (3) and (4), respectively.
In practice, $\beta$, $\Sigma_M^+$, and $\Sigma_\epsilon^+$ are unknown and need to be estimated. The augmented
matrix $\Sigma_M^+$ is characterized by the spatial variance $\tau^2$ and the correlation function
with parameters $\theta$. We assume that the simulation noise vectors $\epsilon_j^+ = (\epsilon_j(x_1), \ldots,
\epsilon_j(x_k), \epsilon_j(x_1^+), \ldots, \epsilon_j(x_k^+))^\top$ are multivariate normally distributed with mean zero and
covariance matrix $\Sigma_\epsilon^+$. Given this assumption, we first estimate $\Sigma_\epsilon^+$. Our approach to
estimating $\mathrm{Var}[\bar{\epsilon}(x_i)]$, $\mathrm{Var}[\bar{\epsilon}(x_i^+)]$, and $\mathrm{Cov}[\bar{\epsilon}(x_i), \bar{\epsilon}(x_i^+)]$, $i = 1, 2, \ldots, k$, is as follows.
The variance $\mathrm{Var}[\bar{\epsilon}(x_i)]$ is estimated by the sample variance
\[
\widehat{\mathrm{Var}}[\bar{\epsilon}(x_i)] = \frac{1}{n_i}\left[\frac{1}{n_i - 1}\sum_{j=1}^{n_i}\big(Y_j(x_i) - \bar{Y}(x_i)\big)^2\right].
\]
Estimation of $\mathrm{Var}[\bar{\epsilon}(x_i^+)]$ can be done in a similar fashion by replacing $Y_j(x_i)$ with $Y_j(x_i^+)$.
The covariance $\mathrm{Cov}[\bar{\epsilon}(x_i), \bar{\epsilon}(x_i^+)]$ is estimated by the sample covariance
\[
\widehat{\mathrm{Cov}}[\bar{\epsilon}(x_i), \bar{\epsilon}(x_i^+)] = \frac{1}{n_i}\left[\frac{1}{n_i - 1}\sum_{j=1}^{n_i}\big(Y_j(x_i) - \bar{Y}(x_i)\big)\big(Y_j(x_i^+) - \bar{Y}(x_i^+)\big)\right].
\]
This provides us an estimate $\widehat{\Sigma}_\epsilon^+$ for $\Sigma_\epsilon^+$. Combining this with the normality assumptions,
we can estimate the set of parameters $(\beta, \tau^2, \theta)$ together using maximum likelihood
estimators (MLEs) as described in Ankenman et al. [2010].
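To make the noise-estimation step concrete, here is a minimal sketch (our own illustration, assuming one extrapolated point per design point and independence across design points, so that only the per-point 2 × 2 blocks are nonzero) that assembles $\widehat{\Sigma}_\epsilon^+$ from the sample moments above:

```python
import numpy as np

def estimate_noise_cov(responses, responses_plus):
    """Sample-moment estimate of Sigma_eps^+ for the averaged responses.

    responses, responses_plus: (k, n) arrays of Y_j(x_i) and Y_j(x_i^+).
    Returns a (2k, 2k) matrix, nonzero only in the (i, i), (i, k+i),
    (k+i, i), and (k+i, k+i) positions, per the independence assumptions.
    """
    k, n = responses.shape
    Sigma = np.zeros((2 * k, 2 * k))
    for i in range(k):
        # 2x2 sample covariance of (Y_j(x_i), Y_j(x_i^+)); divide by n
        # because the model is stated for the averaged noise eps_bar
        S = np.cov(responses[i], responses_plus[i], ddof=1) / n
        Sigma[i, i] = S[0, 0]                        # Var[eps_bar(x_i)]
        Sigma[k + i, k + i] = S[1, 1]                # Var[eps_bar(x_i^+)]
        Sigma[i, k + i] = Sigma[k + i, i] = S[0, 1]  # covariance term
    return Sigma
```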
3. ANALYSIS OF THE GESK MODEL AND CHOICES OF STEP SIZES
Key to implementing the GESK model is the choice of step sizes for the extrapolated
points, which depends on analyzing the potential improvements in performance from
the GESK model, as well as the approximation errors introduced by extrapolation. A
good GESK model should take this bias-variance type tradeoff into consideration. We
consider two tractable models: a two-point problem and a k-point problem with known
model parameters. Under these two settings, we analyze the potential improvement
in MSE by the GESK model over the stochastic kriging model and provide conditions
under which such improvement can be guaranteed.
3.1. A Two-Point Problem with Single Extrapolated Point
Consider a one-dimensional problem (d = 1) of two design points x1 and x2 with num-
bers of replications n1 and n2 , respectively. Without loss of generality, let x1 < x2 and the
prediction point be $x_0 \in [x_1, x_2]$. The simulation outputs include responses $\{Y_j(x_i)\}_{j=1}^{n_i}$
for $i = 1, 2$ at both design points and gradient estimators $\{G_j(x_1)\}_{j=1}^{n_1}$ at $x_1$ only. A
constant trend is used to represent the overall surface mean (i.e., $f(x_i)^\top \beta = \beta_0$). All
parameters $(\beta_0, \tau^2, \theta)$ are assumed to be known.
Let the spatial variance $\tau^2 > 0$, and let $r_{il}$ be the correlation between $M(x_i)$ and $M(x_l)$,
$i, l = 0, 1, \ldots, k$. The correlation $r_{il}$ can be calculated from the correlation function
$R_M(x_i, x_l; \theta)$, but no specific correlation function is specified in this discussion. Let the
variance of the simulation noise at $x_i$ from replication $j$ be $\mathrm{Var}[\epsilon_j(x_i)] = \sigma_i^2$.
Let $\bar{Y} = (\bar{Y}_1, \bar{Y}_2)^\top$ be the vector containing the sample means of responses at $x_1$ and
$x_2$. The stochastic kriging predictor at $x_0$ is given as
\[
\hat{Y}(x_0) = \beta_0 + \tau^2\,\frac{\big(r_{01}(\tau^2 + \sigma_2^2/n_2) - r_{02}\,\tau^2 r_{12}\big)(\bar{Y}_1 - \beta_0) + \big(r_{02}(\tau^2 + \sigma_1^2/n_1) - r_{01}\,\tau^2 r_{12}\big)(\bar{Y}_2 - \beta_0)}{(\tau^2 + \sigma_1^2/n_1)(\tau^2 + \sigma_2^2/n_2) - \tau^4 r_{12}^2}. \tag{8}
\]
With a prespecified step size $\Delta x$, a new design point $x_1^+ = x_1 + \Delta x$ in the interval
$[x_1, x_2]$ is added, and GESK extrapolates its response as $Y_j(x_1^+) = Y_j(x_1) + \Delta x\, G_j(x_1)$.
The covariance matrix of the augmented vector $(\bar{Y}_1, \bar{Y}_2, \bar{Y}_1^+)^\top$ can be written in block form as
\[
\Sigma^+ = \begin{pmatrix} \Sigma & b \\ b^\top & c \end{pmatrix},
\]
where $\Sigma$ is the $2 \times 2$ covariance matrix of the vector $(\bar{Y}_1, \bar{Y}_2)^\top$, $b$ is a $2 \times 1$ vector of covariances between $(\bar{Y}_1, \bar{Y}_2)^\top$ and $\bar{Y}_1^+$, and $c =
\tau^2 + \sigma_{1^+}^2/n_1$. The vector containing covariances between $M(x_0)$ and $(M(x_1), M(x_2), M(x_1^+))^\top$
is
\[
\Sigma_M^+(x_0, \cdot) = \tau^2 \begin{pmatrix} r_{01} \\ r_{02} \\ r_{01^+} \end{pmatrix} = \begin{pmatrix} \Sigma_M(x_0, \cdot) \\ \tau^2 r_{01^+} \end{pmatrix}.
\]
The new predictor at $x_0$ from the GESK model is
\[
\hat{Y}^+(x_0) = \hat{Y}(x_0) + \frac{1}{v}\Big[b^\top \Sigma^{-1}\big(\bar{Y} - \beta_0 1_2\big) - \big(\bar{Y}_1^+ - \beta_0\big)\Big]\Big[\Sigma_M(x_0, \cdot)^\top \Sigma^{-1} b - \tau^2 r_{01^+}\Big], \tag{10}
\]
where $\hat{Y}(x_0)$ is defined in Equation (8) and $v = c - b^\top \Sigma^{-1} b$.
The following theorem provides an expression for $\mathrm{MSE}(\hat{Y}^+(x_0))$ and conditions under
which the GESK predictor in Equation (10) has smaller MSE than that in Equation (8).
THEOREM 3.1. The MSE of the predictor in Equation (10) can be expressed as
\[
\mathrm{MSE}(\hat{Y}^+(x_0)) = \mathrm{MSE}(\hat{Y}(x_0)) + \left(\frac{\zeta_1^2}{v^2} - \frac{1}{v}\right)\Big[\Sigma_M(x_0, \cdot)^\top \Sigma^{-1} b - \tau^2 r_{01^+}\Big]^2, \tag{11}
\]
where $\zeta_1$ denotes the bias of the extrapolated response $\bar{Y}_1^+$; the GESK predictor thus has smaller MSE than the stochastic kriging predictor whenever $\zeta_1^2 < v$.
Given a prediction point $x_0$, let $\Sigma_M^+(x_0, \cdot)$ be a $2k \times 1$ vector that consists of the spatial
covariances between $x_0$ and all design points,
\[
\Sigma_M^+(x_0, \cdot) = \big(\Sigma_M(x_0, x_1), \ldots, \Sigma_M(x_0, x_k), \Sigma_M(x_0, x_1^+), \ldots, \Sigma_M(x_0, x_k^+)\big)^\top
= \big(\Sigma_M(x_0, \cdot)^\top \ \ \Sigma_{M^+}(x_0, \cdot)^\top\big)^\top,
\]
where both $\Sigma_M(x_0, \cdot)$ and $\Sigma_{M^+}(x_0, \cdot)$ are $k \times 1$ vectors.
As in the analysis of the two-point problem, an important issue to address is the
approximation error introduced by extrapolation. Let the noise terms $\epsilon(x_i^+)$ at $x_i^+$ follow
normal distributions with means $\zeta_i = \zeta(x_i)$, which implies that the additional response
outputs $Y_j(x_i^+)$ are biased if $\zeta_i \neq 0$. We will analyze the effects of incorporating them in
the following. Let the vector $\zeta \in \mathbb{R}^{2k}$ be
\[
\zeta = (0, 0, \ldots, 0, \zeta_1, \zeta_2, \ldots, \zeta_k)^\top = \big(0_k^\top \ \ \zeta_k^\top\big)^\top,
\]
which represents the expectation of the $2k \times 1$ noise vector $\bar{\epsilon}^+$. The covariance matrix of the augmented response vector $\bar{Y}^+$ can be written in block form as
\[
\Sigma_M^+ + \Sigma_\epsilon^+ = \begin{pmatrix} \Sigma & B \\ B^\top & C \end{pmatrix}, \tag{12}
\]
where $\Sigma = \Sigma_M + \Sigma_\epsilon$ is the $k \times k$ covariance matrix of the original averaged responses, and $B$ and $C$ contain the covariances involving the extrapolated averaged responses.
Let $\hat{Y}^+(x_0)$ be the GESK predictor at $x_0$. The MSE of the GESK predictor for this
$k$-point problem is
\[
\begin{aligned}
\mathrm{MSE}(\hat{Y}^+(x_0)) &= \Sigma_M^+(x_0, x_0) - \Sigma_M^+(x_0, \cdot)^\top \big(\Sigma_M^+ + \Sigma_\epsilon^+\big)^{-1} \Sigma_M^+(x_0, \cdot)
+ \Big[\Sigma_M^+(x_0, \cdot)^\top \big(\Sigma_M^+ + \Sigma_\epsilon^+\big)^{-1} \zeta\Big]^2 \\
&= \mathrm{MSE}(\hat{Y}_{2k}(x_0)) + \Big[\Sigma_M^+(x_0, \cdot)^\top \big(\Sigma_M^+ + \Sigma_\epsilon^+\big)^{-1} \zeta\Big]^2. \tag{13}
\end{aligned}
\]
The first term, $\mathrm{MSE}(\hat{Y}_{2k}(x_0))$, is the MSE of the prediction one would obtain if unbiased
responses were collected at $2k$ design points (i.e., running simulations at the $x_i^+$ to collect
response estimates rather than extrapolating additional response estimates). The second
term is the inflation of MSE caused by the approximation errors $\zeta$ in the additional
extrapolated responses.
Let $\hat{Y}(x_0)$ be the stochastic kriging predictor with $k$ design points. Our interest is to
compare the MSE of the GESK predictor $\hat{Y}^+(x_0)$ with that of $\hat{Y}(x_0)$. To achieve this, we
begin by looking into the MSE of $\hat{Y}_{2k}(x_0)$.
Using the Woodbury matrix identity and the block inverse formula from linear algebra, the
MSE of $\hat{Y}_{2k}(x_0)$ can be expressed as
\[
\mathrm{MSE}(\hat{Y}_{2k}(x_0)) = \Sigma_M^+(x_0, x_0) - \Sigma_M^+(x_0, \cdot)^\top \big(\Sigma_M^+ + \Sigma_\epsilon^+\big)^{-1} \Sigma_M^+(x_0, \cdot)
= \mathrm{MSE}(\hat{Y}(x_0)) - \omega^\top V \omega, \tag{14}
\]
where $\omega = B^\top \Sigma^{-1} \Sigma_M(x_0, \cdot) - \Sigma_{M^+}(x_0, \cdot)$ and $V = (C - B^\top \Sigma^{-1} B)^{-1}$. Because the augmented covariance matrix in Equation (12) is positive definite, so is its Schur complement: the quadratic form $v^\top (C - B^\top \Sigma^{-1} B) v$
has to be positive for any nonzero $v \in \mathbb{R}^k$. Therefore, the matrix $C - B^\top \Sigma^{-1} B$ is positive
definite, and its inverse $V = (C - B^\top \Sigma^{-1} B)^{-1}$ is also positive definite.
Because the matrix $V$ is positive definite, it follows immediately that
\[
\mathrm{MSE}(\hat{Y}_{2k}(x_0)) = \mathrm{MSE}(\hat{Y}(x_0)) - \omega^\top V \omega \le \mathrm{MSE}(\hat{Y}(x_0)),
\]
with equality if and only if $\omega = 0$. Thus, not surprisingly, the MSE is
reduced if the $k$ additional response outputs are unbiased.
Next, we investigate the effect of the extrapolation bias on the overall MSE. Combining Equations (13) and (14) gives
\[
\begin{aligned}
\mathrm{MSE}(\hat{Y}^+(x_0)) &= \mathrm{MSE}(\hat{Y}_{2k}(x_0)) + \Big[\Sigma_M^+(x_0, \cdot)^\top \big(\Sigma_M^+ + \Sigma_\epsilon^+\big)^{-1} \zeta\Big]^2 \\
&= \mathrm{MSE}(\hat{Y}(x_0)) - \omega^\top V \omega + \Big[\Sigma_M^+(x_0, \cdot)^\top \big(\Sigma_M^+ + \Sigma_\epsilon^+\big)^{-1} \zeta\Big]^2 \\
&= \mathrm{MSE}(\hat{Y}(x_0)) - \omega^\top V \omega + \Big[\big(\Sigma_{M^+}(x_0, \cdot)^\top V - \Sigma_M(x_0, \cdot)^\top \Sigma^{-1} B V\big)\zeta_k\Big]^2 \\
&= \mathrm{MSE}(\hat{Y}(x_0)) - \omega^\top V \omega + \big(\omega^\top V \zeta_k\big)^2 \tag{15} \\
&= \mathrm{MSE}(\hat{Y}(x_0)) + \omega^\top V \zeta_k\, \zeta_k^\top V \omega - \omega^\top V \omega \\
&= \mathrm{MSE}(\hat{Y}(x_0)) + \omega^\top \big(V \zeta_k \zeta_k^\top V - V\big)\omega.
\end{aligned}
\]
3.3. Effects of Step Sizes on MSE
The step size $\Delta x$ determines the MSE of the GESK predictor through several factors:
the biases $\zeta_i$ in the extrapolated responses; the correlations $\rho_i$ between the simulation
noise of the original responses and that of the extrapolated responses; and the variances $\sigma_{i^+}^2$ of the simulation
noise in the extrapolated responses. Because linear extrapolation is employed
in the GESK model, the bias $\zeta_i$ in the extrapolated response $\bar{Y}_i^+$ is bounded by $K\|\Delta x\|^2$
for some $K > 0$. The correlation $\rho_i$ depends on both the step size and the covariance
between the simulation noise of the responses and that of the gradient estimators: a
larger step size or a smaller covariance leads to a smaller correlation factor $\rho_i$, whereas
$\sigma_{i^+}^2$ changes as the step size changes but also depends on the sign of the correlation $\rho_i$.
Effects of these factors are discussed in detail in the following, using the two-point
and $k$-point problems of Sections 3.1 and 3.2.
3.3.1. The Two-Point Problem. Theorem 3.1 provided the change in MSE at a prediction
point $x_0$:
\[
\Delta \mathrm{MSE} = \left(\frac{\zeta_1^2}{v^2} - \frac{1}{v}\right)\Big[\Sigma_M(x_0, \cdot)^\top \Sigma^{-1} b - \tau^2 r_{01^+}\Big]^2.
\]
We summarize our findings in this setting as follows:
(1) The bias $\zeta_1$ must satisfy $\zeta_1^2 < v$, as shown in Theorem 3.1, to guarantee reduction
in MSE. Because $\zeta_1$ is proportional to $(\Delta x)^2$, intuitively the step size should be
relatively small.
(2) The greater the correlation $\rho_1$, the greater the reduction in MSE. The parameter $\rho_1$
also depends on the correlation between $\epsilon_j(x_1)$ and $\delta_j(x_1)$ (i.e., the simulation noise
of the output responses and gradient estimators): $\rho_1$ increases as the
correlation between $\epsilon_j(x_1)$ and $\delta_j(x_1)$ increases.
(3) The parameter $\sigma_{1^+}^2$ represents the noise in an extrapolated response $Y_j(x_1^+)$. The
reduction in MSE is greater if $\sigma_{1^+}^2$ is smaller.
All conditions seem to favor smaller step sizes. However, other difficulties
arise if the step sizes are too small: first, because the quantity $v$ becomes smaller as
$\Delta x$ becomes smaller, the condition $\zeta_1^2 < v$ may not hold; second, as $\Delta x$ approaches zero,
the correlation $\rho_1$ approaches 1, which may make the matrix $\Sigma^+$ ill conditioned and
lead to numerical issues.
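To see this tradeoff concretely, the following sketch evaluates the change in MSE from Equation (11) across a range of step sizes for the two-point problem. All numerical constants, and the assumed forms $\zeta_1 = K(\Delta x)^2$, $\sigma_{1^+}^2 = \sigma_1^2 + 2\Delta x\,\mathrm{Cov}[\epsilon_j, \delta_j] + (\Delta x)^2\sigma_\delta^2$, and $\mathrm{Cov}[\bar{\epsilon}(x_1), \bar{\epsilon}(x_1^+)] = (\sigma_1^2 + \Delta x\,\mathrm{Cov}[\epsilon_j, \delta_j])/n_1$, are hypothetical choices of ours, consistent with the linear extrapolation model but not taken from the paper's experiments:

```python
import numpy as np

# hypothetical constants for illustration only
tau2, theta, n1, n2 = 1.0, 1.0, 50, 50
x1, x2, x0 = 0.0, 1.0, 0.4
sig1, sig2 = 0.5, 0.5          # response noise standard deviations
sig_d, cov_ed = 2.0, 0.3       # gradient noise std, Cov[eps_j, delta_j]
K = 1.0                        # bias constant: zeta_1 = K * dx**2

r = lambda a, b: np.exp(-theta * (a - b) ** 2)  # Gaussian correlation

def delta_mse(dx):
    x1p = x1 + dx
    var1p = sig1**2 + 2 * dx * cov_ed + dx**2 * sig_d**2
    cov_e = (sig1**2 + dx * cov_ed) / n1        # Cov[eps_bar(x1), eps_bar(x1+)]
    Sig = np.array([[tau2 + sig1**2 / n1, tau2 * r(x1, x2)],
                    [tau2 * r(x1, x2), tau2 + sig2**2 / n2]])
    b = np.array([tau2 * r(x1, x1p) + cov_e, tau2 * r(x2, x1p)])
    c = tau2 + var1p / n1
    sM0 = tau2 * np.array([r(x0, x1), r(x0, x2)])
    v = c - b @ np.linalg.solve(Sig, b)
    w = sM0 @ np.linalg.solve(Sig, b) - tau2 * r(x0, x1p)
    zeta1 = K * dx**2
    return (zeta1**2 / v**2 - 1.0 / v) * w**2   # Equation (11)

for dx in (0.01, 0.05, 0.1, 0.2, 0.4):
    print(f"dx={dx}: delta MSE = {delta_mse(dx):+.5f}")  # negative => GESK improves
```

With these particular constants, moderate step sizes give the largest reduction, while very large steps shrink the benefit, in line with the discussion above.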
3.3.2. The k-Point Problem. In addition to the assumptions in Section 3.2, we also assume that the $k$ design points are widely spread, such that the spatial correlation
between design points is approximately 0. A similar assumption is used in Chen et al.
[2013] to isolate the impact of incorporating gradient estimators from the spatial covariances. This implies that the matrix $\Sigma$ in Equation (12) is a diagonal matrix. As
the step size $\Delta x$ is usually small, we assume the same property also holds for $B$ and $C$ in
Equation (12). The change in MSE of the GESK predictor is
\[
\Delta \mathrm{MSE} = \omega^\top \big(V \zeta_k \zeta_k^\top V - V\big)\omega,
\]
where $\omega = B^\top \Sigma^{-1} \Sigma_M(x_0, \cdot) - \Sigma_{M^+}(x_0, \cdot)$ and $V = (C - B^\top \Sigma^{-1} B)^{-1}$. The effects of $\Delta x$, $\rho_i$,
and $\sigma_{i^+}^2$ are summarized as follows (a small numerical sketch of this computation follows the list):
(1) Theorem 3.3 suggests that the quantity $\zeta_k^\top \zeta_k$ needs to be small enough to guarantee
that GESK can reduce MSE. This condition requires the step size $\Delta x^j$ in each
dimension to be small.
(2) Regarding the correlation $\rho_i$, the preferred sign of the correlation actually depends
on the location of the prediction point $x_0$. A condition between $\Delta x$ and $x_i - x_0$
determines the preferred sign of $\rho_i$. A specific analytical form for the condition
depends on the type of correlation function; larger $|\rho_i|$ is better in each favorable
case.
(3) If the correlation $\rho_i \approx 0$, smaller $\sigma_{i^+}^2$ is preferable, since it suggests that there
is less noise in the extrapolated responses. The same conclusion holds when the
correlation $\rho_i$ is positive. However, if the correlation $\rho_i < 0$, smaller $\sigma_{i^+}^2$ is not
necessarily better, as there exists an optimal $\sigma_{i^+}^2$ that reduces MSE the most.
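A direct implementation of the $\Delta\mathrm{MSE}$ expression is straightforward given the blocks of Equation (12); the sketch below is our own, with variable names chosen to mirror the notation:

```python
import numpy as np

def delta_mse_kpoint(Sig, B, C, sM0, sMp0, zeta_k):
    """Change in MSE of the GESK predictor, Equation (15).

    Sig, B, C: (k, k) blocks of [[Sig, B], [B.T, C]], the covariance of
               the stacked vector (Ybar, Ybar_plus) in Equation (12).
    sM0:  (k,) spatial covariances Sigma_M(x0, .) with the original points.
    sMp0: (k,) spatial covariances Sigma_M+(x0, .) with extrapolated points.
    zeta_k: (k,) extrapolation biases.
    A negative return value means GESK improves on stochastic kriging.
    """
    V = np.linalg.inv(C - B.T @ np.linalg.solve(Sig, B))  # Schur complement inverse
    omega = B.T @ np.linalg.solve(Sig, sM0) - sMp0
    return omega @ (V @ np.outer(zeta_k, zeta_k) @ V - V) @ omega
```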
Analyzing the effects of step size on MSE in a general setting is more difficult, especially
in multidimensional problems. For example, the step size used in a multidimensional
problem may be different along different directions. Choosing good step sizes is crucial
for building the GESK models. In the next section, we propose two different approaches
for determining the optimal step size.
4. IMPLEMENTATIONS OF GESK
In this section, we focus on two important questions in the implementation of the GESK
model: choosing step sizes and choosing gradient estimators. We provide two different
techniques for determining step sizes and discuss their pros and cons. We also make
recommendations between infinitesimal perturbation analysis (IPA) and the likelihood
ratio/score function (LR/SF) method for gradient estimation.
(ii) IPA gradient estimators usually have smaller variances than LR/SF estimators; and (iii) IPA gradient estimators have better
performance when applied in stochastic kriging with gradient estimators (SKG).
Discussions in Section 3.3 suggest that the GESK model prefers gradient estima-
tors that are highly correlated with response estimates and have smaller variances.
Therefore, under most settings, IPA gradient estimators are preferred. IPA gradient
estimators are employed in the M/M/1 example conducted in Section 5.
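To illustrate the kind of IPA estimator referred to here, the following sketch estimates the expected waiting time in an M/M/1 queue together with its IPA derivative with respect to the mean service time $\theta$, using the standard Lindley-recursion argument (see Glasserman [1991]); this is our own illustration, not the experimental code used in Section 5:

```python
import numpy as np

def mm1_waiting_time_ipa(theta, lam, n_cust, rng):
    """Average waiting time and its IPA derivative w.r.t. mean service time.

    Lindley recursion: W_{n+1} = max(W_n + S_n - A_{n+1}, 0), with
    S_n = theta * X_n, X_n ~ Exp(1), so dS_n/dtheta = X_n and
    dW_{n+1}/dtheta = (dW_n/dtheta + X_n) * 1{W_{n+1} > 0}.
    """
    W = dW = 0.0
    W_sum = dW_sum = 0.0
    for _ in range(n_cust):
        X = rng.exponential(1.0)         # unit-mean service requirement
        A = rng.exponential(1.0 / lam)   # interarrival time
        Wn = W + theta * X - A
        if Wn > 0.0:
            W, dW = Wn, dW + X           # derivative propagates through busy period
        else:
            W, dW = 0.0, 0.0             # idle period resets the accumulator
        W_sum += W
        dW_sum += dW
    return W_sum / n_cust, dW_sum / n_cust

rng = np.random.default_rng(0)
y_bar, g_bar = mm1_waiting_time_ipa(theta=0.8, lam=1.0, n_cust=100_000, rng=rng)
```

Note that the response and gradient estimates come from the same sample path, which is the source of the correlation between $\epsilon_j$ and $\delta_j$ that GESK can exploit.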
5. NUMERICAL EXAMPLES
In this section, several numerical experiments are conducted to illustrate the pro-
posed GESK model. Our goal in this section is three-fold: to demonstrate the effects
of different step sizes on the performance of the GESK model, to empirically compare
the effectiveness of the PMLE and IMSE approaches in determining step sizes, and
to examine the performance of the GESK model in different settings and compare it
with stochastic kriging [Ankenman et al. 2010] and stochastic kriging with gradient
estimators (SKG) [Chen et al. 2013]. Implementations of SKG and GESK are built upon
software for stochastic kriging downloaded from http://www.stochastickriging.net.
Across all experiments, we assume little information is known about the response
surface and choose constant trends for all models (i.e., $f(x)^\top \beta = \beta_0$). A Gaussian correlation function $R_M(x, x') = \exp\{-\theta \|x - x'\|^2\}$ is used for all the experiments, since
it satisfies the conditions required by SKG. For the $J$-fold CV implemented in this
section, we choose $J = \min(k, 10)$, where $k$ is the number of design points.
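A minimal sketch of this correlation function (our own helper, not from the downloaded software):

```python
import numpy as np

def gaussian_corr(X, Y, theta):
    """R_M(x, x') = exp(-theta * ||x - x'||^2) for rows of X (m, d) and Y (n, d)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    return np.exp(-theta * d2)

# e.g., the spatial covariance matrix of a design Xd is tau2 * gaussian_corr(Xd, Xd, theta)
```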
We implemented both PMLE and IMSE approaches discussed in Section 4.1 to de-
termine step sizes, together with the CV method to choose regularization parameters.
The corresponding GESK models are named GESK-PMLE and GESK-IMSE. The measure
of performance we chose is the empirical IMSE (EIMSE), as used in van Beers and
Kleijnen [2008] and other kriging literature:
\[
\mathrm{EIMSE} = \frac{1}{N} \sum_{i=1}^{N} \big(\hat{Y}(x_i) - Y(x_i)\big)^2, \tag{20}
\]
where $\hat{Y}(x_i)$ denotes the metamodel prediction and $Y(x_i)$ the true response at check point $x_i$.
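Equation (20) is a plain average of squared prediction errors over $N$ check points; a one-line sketch (ours):

```python
import numpy as np

def eimse(y_pred, y_true):
    """Empirical IMSE of Equation (20) over N check points."""
    return float(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
```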
Table I. Averaged EIMSE from 100 Macroreplications for GESK Models under Six Designs (# of Design Points,
# of Reps) to Predict Expected Waiting Time in the M/M/1 Queue Example. Three fixed step sizes are compared
with those determined by PMLE and IMSE. Standard errors are shown in parentheses.

Design      GESK-1          GESK-2          GESK-3          GESK-PMLE       GESK-IMSE
(6, 50)     0.085 (0.0036)  0.167 (0.0035)  0.618 (0.0149)  0.042 (0.0020)  0.027 (0.0017)
(6, 200)    0.094 (0.0023)  0.181 (0.0019)  0.731 (0.0064)  0.034 (0.0011)  0.024 (0.0010)
(6, 1000)   0.092 (0.0011)  0.180 (0.0010)  0.747 (0.0037)  0.038 (0.0006)  0.021 (0.0004)
(8, 200)    0.006 (0.0004)  0.016 (0.0006)  0.194 (0.0023)  0.005 (0.0003)  0.002 (0.0002)
(10, 200)   0.006 (0.0005)  0.007 (0.0011)  0.017 (0.0012)  0.007 (0.0007)  0.004 (0.0003)
(20, 200)   0.002 (0.0008)  0.006 (0.0003)  0.048 (0.0020)  0.003 (0.0003)  0.001 (0.0001)
Fig. 1. Boxplots of EIMSE from 100 macroreplications for the GESK models under six designs (# of design
points, # of reps) to predict expected steady-state waiting time in the M/M/1 queue example.
—Predetermined step sizes versus optimal step sizes. The step sizes determined
by PMLE and IMSE perform better than the predetermined step sizes, especially when
the number of design points is small. This is expected, as the choice of step sizes
should adapt to the experiment design and the simulation output.
—PMLE versus IMSE. The performance of IMSE is better than that of PMLE under
most experiment designs, in terms of smaller averaged EIMSE, smaller variance
of EIMSE, and fewer outliers. Figure 2 shows boxplots for step
sizes determined by PMLE and IMSE under all six designs.
Fig. 2. Boxplots for step sizes determined by PMLE and IMSE based on 100 macroreplications under six
designs (# of design points, # of reps) to predict the expected steady-state waiting time in the M/M/1 queue
example.
Table II. Averaged EIMSE from 100 Macroreplications for SK, SKG, and GESK with Six Different Designs on
Estimating the Expected Steady-State Waiting Time in an M/M/1 Queue Problem. The design (6, 50) means six
design points with 50 replications at each design point. Standard errors are shown in parentheses.

Design      SK              SKG             GESK-PMLE       GESK-IMSE
(6, 50)     0.313 (0.0134)  0.031 (0.0036)  0.042 (0.0020)  0.027 (0.0017)
(6, 200)    0.324 (0.0062)  0.016 (0.0007)  0.034 (0.0011)  0.024 (0.0010)
(6, 1000)   0.328 (0.0027)  0.016 (0.0003)  0.038 (0.0006)  0.021 (0.0004)
(8, 200)    0.054 (0.0019)  0.002 (0.0003)  0.005 (0.0003)  0.002 (0.0002)
(10, 200)   0.009 (0.0004)  0.004 (0.0014)  0.007 (0.0007)  0.004 (0.0003)
(20, 200)   0.004 (0.0002)  0.004 (0.0004)  0.003 (0.0003)  0.001 (0.0001)
—Effect of number of design points. When the number of design points is small, for
example k = 6, improvements in EIMSE are more significant. However, when there
are already enough design points, improvements are hardly noticeable. In addition,
for both PMLE and IMSE, the relative step size (ratio to the size of the subinterval)
generally increases as the number of design points increases.
—Effect of number of replications. As the number of replications increases, the
variances of EIMSE become smaller, as shown in Table I and Figure 1. However,
changes in the averaged EIMSE are not significant. Variances of the chosen step sizes
seem to decrease as well, as shown in Figure 2.
Fig. 3. Boxplots of EIMSE from 100 macroreplications for SK, SKG, and GESK with six different designs
on estimating the expected steady-state waiting time in an M/M/1 queue problem, corresponding to results
in Table II.
Table III. Averaged EIMSE from 100 Macroreplications for SK, SKG, and GESK with Six Different Designs on
y(x) = exp(−1.4x) cos(7πx/2) + ε. Standard errors are shown in parentheses.

Design      SK               SKG              GESK-PMLE       GESK-IMSE
(6, 50)     39.616 (0.0374)  2.044 (0.0106)   1.909 (0.0181)  1.828 (0.0111)
(6, 200)    39.586 (0.0192)  2.023 (0.0049)   1.830 (0.0091)  1.757 (0.0050)
(6, 1000)   39.581 (0.0084)  7.652 (0.8823)   1.829 (0.0033)  1.758 (0.0023)
(8, 200)    2.793 (0.0039)   0.069 (0.0009)   0.063 (0.0026)  0.068 (0.0025)
(10, 200)   0.949 (0.0026)   1.243 (0.4204)   0.178 (0.0008)  0.012 (0.0007)
(20, 200)   0.008 (0.0002)   0.001 (0.0001)   0.046 (0.0004)  0.004 (0.0004)
Step sizes for the two GESK models are determined by PMLE and IMSE exactly as
in Section 5.1, so numbers for the two GESK models in Table II and the corresponding
boxplots in Figure 3 are the same as those in Section 5.1.
Our findings are summarized as follows:
—SK versus SKG versus GESK. Not surprisingly, SKG and GESK perform better
than SK, as incorporating gradient estimators provides more information about the
Fig. 4. Boxplots of EIMSE from 100 macroreplications for SK, SKG, and GESK with six different designs
on y(x) = exp(−1.4x) cos(7πx/2) + ε, corresponding to results in Table III.
response surface. SKG performs better than GESK on most of the designs, especially
as compared with GESK-PMLE. Only under the design (20, 200) do both GESK-
PMLE and GESK-IMSE outperform SKG.
—Number of design points. Incorporating gradient estimators improves perfor-
mance considerably when the design points are sparse. For example, both SKG and
GESK have more significant improvement over stochastic kriging when k = 6. As
the number of design points increases, performance of most models improves.
—Number of replications. As the number of replications increases with a fixed
number of design points, the variance of EIMSE decreases for all three methods, as
shown in Figure 3(a). However, the averaged EIMSE does not improve significantly.
5.2.2. A Stylized Example with Added Noise. We consider a one-dimensional example from
Santner et al. [2003], where the true response surface is Y(x) = exp(−1.4x) cos(7π x/2)
with x ∈ [−2, 0]. The presence of multiple local extreme values on the response surface
makes building a good metamodel difficult. The simulation response output at x from
replication $j$ is $Y_j(x) = \exp(-1.4x)\cos(7\pi x/2) + \epsilon_j(x)$, with $\epsilon_j(x) \sim N(0, 1)$. Direct
gradient estimates at $x$ from simulation replication $j$ are assumed to have the form
$G_j(x) = Y'(x) + \delta_j(x)$, with $\delta_j(x) \sim N(0, 25)$. We let $\delta_j(x)$ have a larger
variance in order to empirically investigate the performances of SKG and GESK when
gradient estimates are noisier.
Fig. 5. Boxplots for step sizes determined by PMLE and IMSE based on 100 macroreplications in the stylized
example with added noise.
We ran the experiments for 100 macroreplications. Within each macroreplication, we
chose N = 1,000 to estimate the EIMSE in Equation (20). Six different experiment
designs, (6, 50), (6, 200), (6, 1000), (8, 200), (10, 200) and (20, 200) were adopted, with
results shown in Table III and Figure 4. Notice that ln(EIMSE) values are shown in
Figure 4, as EIMSE results from the three models differ substantially.
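A sketch of the simulation outputs just described, with the analytic derivative $Y'(x)$ written out (our own code, for illustration):

```python
import numpy as np

def stylized_outputs(x, n_rep, rng):
    """Noisy responses and direct gradient estimates for the stylized example.

    Y_j(x) = exp(-1.4 x) cos(7 pi x / 2) + eps_j,  eps_j ~ N(0, 1)
    G_j(x) = Y'(x) + delta_j,                      delta_j ~ N(0, 25)
    """
    y = np.exp(-1.4 * x) * np.cos(3.5 * np.pi * x)
    # analytic derivative of the true surface
    dy = np.exp(-1.4 * x) * (-1.4 * np.cos(3.5 * np.pi * x)
                             - 3.5 * np.pi * np.sin(3.5 * np.pi * x))
    Y = y + rng.normal(0.0, 1.0, size=n_rep)    # responses
    G = dy + rng.normal(0.0, 5.0, size=n_rep)   # gradients; std 5 => variance 25
    return Y, G
```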
—SK versus SKG versus GESK. As shown in Figure 4, both the SKG and the GESK
models are better than SK when there is a limited number of design points. The
GESK models perform better than SKG when k = 6. The explanation is that the
response surface has several fluctuations and extrapolation allows GESK models to
explore and approximate the response surface better than the others. SKG performs
better when there are enough design points, for example, k = 20. SKG experiences
numerical issues under designs (6,1000) and (10,200).
—Number of replications. When the number of replications increases, the variance
of EIMSE decreases in general, as shown in Table III and Figure 4(a), except for
SKG in design (6, 1000). However, the averaged EIMSE does not change much as the
number of replications increases, similar to the M/M/1 queue example.
—Number of design points. We fixed the number of replications at 200 and increased
the number of design points up to 20. Boxplots are shown in Figure 4(b). EIMSE
results for all models improve as the number of design points increases, with the
exception of SKG and GESK-PMLE with design (10, 200).
—Step sizes. Step sizes determined by the PMLE and IMSE approaches are shown in
Figure 5. The plots suggest relationships between experiment designs and step sizes:
(i) relative step size (ratio to the size of the subinterval) increases generally when
the number of design points increases, and (ii) the variability of step sizes decreases
as the number of replications increases.
—Remark. The performance of SKG with design (10, 200) shown in Table III does not
seem to match that shown in Figure 4(b). The reason is that several outliers
outside of the range shown in Figure 4(b) are omitted.
These two numerical experiments show comparable performance for GESK and SKG.
However, it is hard to establish in general what problem characteristics determine
whether GESK outperforms SKG. For example, GESK performs better when there are
Fig. 6. Boxplots of EIMSE from 100 macroreplications for SK and GESK with two different Latin-hypercube
designs on a four-dimensional function with added noise.
20 design points in Table II, whereas in Table III GESK performs better when there
are 6 design points.
Our findings for this four-dimensional example are summarized as follows:
—SKG and both GESK models perform better than SK. As the number of design points
increases, the performances of all models improve. Under design (20, 500), GESK-
IMSE seems to be the best; under design (40, 500), SKG is preferred because of its
low average and low variance in EIMSE.
—Between the two GESK models, PMLE scales better than IMSE for high-dimensional
problems. The IMSE approach requires multidimensional integrations to determine
step sizes, which is expensive and depends on the accuracy of integration approxi-
mation in high-dimensional problems.
—Step sizes determined by PMLE are generally much larger than those determined by
IMSE. Along each dimension, step sizes determined by PMLE and IMSE have similar
behavior. If the step size chosen by PMLE is relatively smaller on one dimension,
so is the step size chosen by IMSE. Step sizes chosen for a dimension with higher
gradient values are not necessarily smaller than others.
REFERENCES
J. J. Alonso and H. S. Chung. 2002. Using gradients to construct cokriging approximation models for high-
dimensional design optimization problems. In 40th AIAA Aerospace Sciences Meeting and Exhibit, AIAA.
2002–0317.
B. E. Ankenman, B. L. Nelson, and J. Staum. 2010. Stochastic kriging for simulation metamodeling. Operations Research 58, 2 (March 2010), 371–382.
R. R. Barton. 2009. Simulation optimization using metamodels. In Proceedings of the 2009 Winter Simulation
Conference, M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls (Eds.). Institute of
Electrical and Electronics Engineers, Inc, Piscataway, New Jersey, 230–238.
R. R. Barton and M. Meckesheimer. 2006. Metamodel-based simulation optimization. In Handbooks in
Operations Research and Management Science: Simulation, S. G. Henderson and B. L. Nelson (Eds.).
Elsevier, 535–574.
X. Chen, B. E. Ankenman, and B. L. Nelson. 2012. The effects of common random numbers on stochastic
kriging metamodels. ACM Transactions on Modeling and Computer Simulation 22, 2 (March 2012),
Article 7, 20 pages.
X. Chen, B. E. Ankenman, and B. L. Nelson. 2013. Enhancing stochastic kriging metamodels with gradient
estimators. Operations Research 61, 2 (2013), 512–528.
T. Chu, J. Zhu, and H. Wang. 2011. Penalized maximum likelihood estimation and variable selection in
geostatistics. Annals of Statistics 39, 5 (2011), 2607–2625.
M. C. Fu. 2008. What you should know about simulation and derivatives. Naval Research Logistics 55, 8
(2008), 723–736.
M. C. Fu and H. Qu. 2014. Regression models augmented with direct stochastic gradient estimators. INFORMS Journal on Computing 26, 3 (2014), 484–499.
P. Glasserman. 1991. Gradient Estimation via Perturbation Analysis. Kluwer Academic, Boston, MA.
Y. C. Ho, L. Shi, L. Dai, and W. Gong. 1992. Optimizing discrete event dynamic systems via the gradient
surface method. Discrete Event Dynamic Systems: Theory and Applications 2 (Jan. 1992), 99–120.
J. P. C. Kleijnen and W. C. M. van Beers. 2005. Robustness of kriging when interpolating in random simulation
with heterogeneous variances: Some experiments. European Journal of Operational Research 165, 3
(2005), 826–834.
P. L’Ecuyer. 1990. A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science
36 (1990), 1364–1383.
R. Li and A. Sudjianto. 2005. Analysis of computer experiments using penalized likelihood in Gaussian
kriging models. Technometrics 47, 2 (May 2005), 111–121.
W. Liu. 2003. Development of Gradient-enhanced Kriging Approximations for Multidisciplinary Design Op-
timization. Ph.D. Dissertation. University of Notre Dame.
H. Qu and M. C. Fu. 2012. On direct gradient enhanced simulation metamodels. In Proceedings of the 2012
Winter Simulation Conference. Institute of Electrical and Electronics Engineers, Article 43, 12 pages.
R. Y. Rubinstein and A. Shapiro. 1993. Discrete Event Systems: Sensitivity Analysis and Stochastic Opti-
mization by the Score Function Method. John Wiley & Sons.
T. J. Santner, B. Williams, and W. Notz. 2003. The Design and Analysis of Computer Experiments. Springer-
Verlag.
J. Staum. 2009. Better simulation metamodeling: The why, what, and how of stochastic kriging. In Proceed-
ings of the 2009 Winter Simulation Conference, M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and
R. G. Ingalls (Eds.). Institute of Electrical and Electronics Engineers, Piscataway, NJ, 119–133.
W. C. M. van Beers and J. P. C. Kleijnen. 2008. Customized sequential designs for random simulation
experiments: Kriging metamodeling and bootstrapping. European Journal of Operational Research 186,
3 (2008), 1099–1113.
W. Xie, B. L. Nelson, and J. Staum. 2010. The influence of correlation functions on stochastic kriging meta-
models. In Proceedings of the 2010 Winter Simulation Conference. Institute of Electrical and Electronics
Engineers, Piscataway, NJ, 1067–1078.
F. Zhang and Q. Zhang. 2006. Eigenvalue inequalities for matrix product. IEEE Transactions on Automatic
Control 51, 9 (Sept. 2006), 1506–1509.