
Gradient Extrapolated Stochastic Kriging

HUASHUAI QU and MICHAEL C. FU, University of Maryland, College Park

We introduce an approach for enhancing stochastic kriging in the setting where additional direct gradient
information is available (e.g., provided by techniques such as perturbation analysis or the likelihood ratio
method). The new approach, called gradient extrapolated stochastic kriging (GESK), incorporates direct
gradient estimates by extrapolating additional responses. For two simplified settings, we show that GESK
reduces mean squared error (MSE) compared to stochastic kriging under certain conditions on step sizes.
Since extrapolation step sizes are crucial to the performance of the GESK model, we propose two different
approaches to determine the step sizes: maximizing penalized likelihood and minimizing integrated mean
squared error. Numerical experiments are conducted to illustrate the performance of the GESK model and
to compare it with alternative approaches.
Categories and Subject Descriptors: I.6.1 [Computing Methodologies]: Simulation and Modeling
General Terms: Algorithms, Experimentation, Theory
Additional Key Words and Phrases: Stochastic kriging, simulation, stochastic gradients, response surface
ACM Reference Format:
Huashuai Qu and Michael C. Fu. 2014. Gradient extrapolated stochastic kriging. ACM Trans. Model. Comput. Simul. 24, 4, Article 23 (November 2014), 25 pages.
DOI: http://dx.doi.org/10.1145/2658995

1. INTRODUCTION
Simulation models are commonly used to provide analysis and prediction of the behav-
ior of complex stochastic systems. When the ability to collect substantial data is limited
due to high cost, models of the simulation model, called metamodels (also known as
surrogate models), are fitted by building a mathematical relationship between input
and output. Metamodels can be used to provide approximations of the underlying re-
sponse surfaces. However, constructing an accurate metamodel requires careful choice
of the modeling approach and selection of design points.
There has been a long history of research focusing on metamodels; see Barton and
Meckesheimer [2006] and Barton [2009] for an overview. Different types of metamodels
have been proposed, starting from classic linear regression models. Due to the lack of
flexibility in linear metamodels, nonlinear metamodels have been suggested to pro-
vide better global approximation and capture complicated trends in response surfaces.

This work is supported in part by the National Science Foundation (NSF) under Grants CMMI-0856256,
EECS-0901543, by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-10-10340, and
by the National Natural Science Foundation of China under Grant 71071040.
Some preliminary results of this article were previously published in Qu and Fu [2012].
Authors’ addresses: H. Qu, Department of Mathematics, University of Maryland, College Park, MD 20742;
email: [email protected]; M. C. Fu, The Robert H. Smith School of Business and Institute for Systems
Research, University of Maryland, College Park, MD 20742; email: [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior specific permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or [email protected].
© 2014 ACM 1049-3301/2014/11-ART23 $15.00
DOI: http://dx.doi.org/10.1145/2658995


One such nonlinear approach is kriging, which has been studied extensively in the
deterministic simulation community (see, e.g., Santner et al. [2003] and Kleijnen and
van Beers [2005]). Stochastic kriging was introduced by Ankenman et al. [2010] as
an extension of kriging in the stochastic simulation setting, the setting we consider
in this article. Stochastic kriging provides flexible metamodels of simulation output
performance measurements while taking simulation noise into consideration. In the
stochastic simulation setting, direct derivative information may be available, that is,
the simulation output may include not only the performance measurements, but also
estimates of the gradients of performance measurement with respect to the param-
eters. Techniques for estimating gradients, including perturbation analysis (PA) and
the likelihood ratio/score function (LR/SF) method, are discussed in Glasserman [1991],
Rubinstein and Shapiro [1993], and Fu [2008]; see also references therein.
The availability of additional gradient information suggests the potential for improv-
ing the quality of metamodels. Combining gradient information has been investigated
for building metamodels under deterministic computer simulation settings; see Liu
[2003] and Santner et al. [2003] for approaches to approximate response surfaces with
artificial neural networks and kriging. In stochastic simulation settings, researchers
have also made attempts to incorporate gradient estimates into metamodeling ap-
proaches. Ho et al. [1992] proposed a gradient surface method (GSM) that uses the gra-
dient estimates only to iteratively fit lower-order polynomial models. Fu and Qu [2014]
investigated the direct gradient augmented regression (DiGAR) approach, which is a
modification of the standard linear regression model to incorporate gradient estimates.
Chen et al. [2013] introduced stochastic kriging with gradient estimators (SKG) to ex-
ploit gradient estimates in stochastic kriging, showing that the new approach provides
better prediction with smaller mean squared error (MSE). This approach is similar to
cokriging proposed in deterministic simulations [Alonso and Chung 2002] and requires
differentiability of the correlation functions because derivatives of random processes
or random fields are used to model gradient estimates.
In this article, we take a different approach to incorporate gradient estimates into
stochastic kriging and investigate the potential improvements. A new approach called
gradient extrapolated stochastic kriging (GESK) is proposed, which extrapolates addi-
tional responses in the neighborhood of each design point using the original responses
and gradient estimates. These additional responses, which might be biased, lead to
better predictions than stochastic kriging if step sizes for extrapolations are chosen
carefully. The main idea is to further explore the response surface with simulation re-
sponses and gradient estimates so that a metamodel with better overall accuracy can
be constructed. This suggests that GESK models are particularly advantageous when there is a
limited number of design points or when the response surface has multiple extreme values.
To investigate the performance of GESK, we analyze the possible reduction in MSE of
the GESK model over the standard stochastic kriging model, under two simplified and
tractable settings. Conditions that guarantee reduction in MSE are provided, as well.
We also conduct numerical experiments to illustrate the effectiveness of the GESK
model. Numerical results show comparable performance for GESK and SKG, while both
approaches consistently outperform stochastic kriging; neither GESK nor SKG dominates the
other across the test problems. Because it is difficult to establish in general which problem
characteristics determine whether GESK outperforms SKG, practitioners should try both methods
and see which one works best in their problem setting.
An important part of implementing the GESK model is the choice of step size. Large
step sizes usually lead to large approximation errors and deteriorate prediction accu-
racy; small step sizes gain little information from extrapolations and might lead to
numerical stability issues. We formalize two different strategies, penalized maximum
likelihood estimation (PMLE) and minimizing integrated mean squared error (IMSE), to
determine optimal step sizes. A cross-validation method is presented to determine the
regularization parameters required by each of the PMLE and IMSE approaches. We discuss the
pros and cons of each approach and compare them empirically with numerical examples.
This article makes the following contributions:

—We investigate the idea of incorporating gradient estimates into stochastic kriging
by extrapolating additional responses using the original responses and gradient
estimates. This approach is not restricted to stochastic kriging but can be applied to
other metamodeling approaches as well.
—We analyze the proposed GESK model theoretically under simplified settings and
show that it provides predictions with smaller MSE than stochastic kriging. We also
conduct numerical experiments and illustrate the performance of the GESK model.
—We present two different strategies, namely PMLE and IMSE, to determine extrap-
olation step sizes used in GESK. The effectiveness of these two strategies is compared
using numerical examples.

The remainder of the article is organized as follows. In Section 2, we review the previous
stochastic kriging models and introduce the GESK approach. In Section 3, we provide
a theoretical analysis of the MSE of GESK using two simplified and tractable problems
and analyze the effects of step sizes on the MSE. Section 4 proposes two strategies to
determine step sizes and discusses choices of gradient estimators in GESK. Numerical
experiments are conducted in Section 5. Section 6 concludes and provides topics for
future research.

2. MODELS
In this section, we review stochastic kriging introduced in Ankenman et al. [2010] and
stochastic kriging with gradient estimators (SKG) introduced in Chen et al. [2013] and
then present the GESK approach.

2.1. Stochastic Kriging


Stochastic kriging was introduced in Ankenman et al. [2010], focusing on modeling
unknown response surfaces in stochastic simulation settings. Given an experiment
design $\{(x_i, n_i)\}$, $i = 1, 2, \ldots, k$, $n_i$ simulation replications are run at each
design point $x_i$. Let $Y_j(x_i)$ be the simulation output from replication $j$ at design
point $x_i$, $j = 1, \ldots, n_i$, and $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^\top \in \mathbb{R}^d$.
Stochastic kriging models the output as
$$Y_j(x_i) = f(x_i)^\top \beta + M(x_i) + \varepsilon_j(x_i), \tag{1}$$
where $f(x_i) \in \mathbb{R}^p$ is a vector of known functions of $x_i$ and $\beta \in \mathbb{R}^p$
is a vector of unknown parameters to be estimated. The components of $f(x_i)$ can be viewed as
basis functions, and a polynomial basis is usually adopted in the literature. The term
$f(x_i)^\top \beta$ represents the trend of the overall response surface. It is assumed that $M$
is a realization of a zero-mean, second-order stationary random process (or random field). This
assumption is inherited from the deterministic kriging literature, where the stochastic nature
of $M$ is imposed on the problem so that statistical inference can be applied. For this reason,
$M$ is sometimes referred to as extrinsic uncertainty. This is contrasted with the term
$\varepsilon_j(x_i)$, which is the simulation noise for replication $j$ taken at $x_i$. The
uncertainty in $\varepsilon_j(x_i)$ comes from the nature of stochastic simulation, and it is
sometimes referred to as intrinsic uncertainty.


Given the simulation responses $\{Y_j(x_i)\}_{j=1}^{n_i}$, $i = 1, 2, \ldots, k$, the sample
means of the response output and simulation noise at $x_i$ are denoted by
$$\bar{Y}(x_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_j(x_i), \qquad \bar{\varepsilon}(x_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \varepsilon_j(x_i). \tag{2}$$
The averaged response $\bar{Y}(x_i)$ at $x_i$ is modeled as
$$\bar{Y}(x_i) = f(x_i)^\top \beta + M(x_i) + \bar{\varepsilon}(x_i).$$
Suppose that we would like to predict the response $Y(x_0)$ at any point $x_0$. Let
$\bar{Y} = (\bar{Y}(x_1), \bar{Y}(x_2), \ldots, \bar{Y}(x_k))^\top$. Let $\Sigma_M$ be the
$k \times k$ covariance matrix implied by the random field $M$ and $\Sigma_\varepsilon$ be the
$k \times k$ covariance matrix implied by the simulation noise across all design points
$\{x_1, x_2, \ldots, x_k\}$. Let $\Sigma_M(x_0, \cdot)$ be the $k \times 1$ vector
$(\mathrm{Cov}(M(x_0), M(x_1)), \ldots, \mathrm{Cov}(M(x_0), M(x_k)))^\top$, which represents
the spatial covariances between a prediction point $x_0$ and all design points. Also, define
the $k \times p$ design matrix $F = (f(x_1), f(x_2), \ldots, f(x_k))^\top$. Suppose that
$\Sigma_M$, $\Sigma_\varepsilon$ and $\beta$ are known. Then, the MSE-optimal predictor at
$x_0$ is of the form
$$\hat{Y}(x_0) = f(x_0)^\top \beta + \Sigma_M(x_0, \cdot)^\top [\Sigma_M + \Sigma_\varepsilon]^{-1} (\bar{Y} - F\beta), \tag{3}$$
with corresponding MSE
$$\mathrm{MSE}(\hat{Y}(x_0)) = \Sigma_M(x_0, x_0) - \Sigma_M(x_0, \cdot)^\top [\Sigma_M + \Sigma_\varepsilon]^{-1} \Sigma_M(x_0, \cdot), \tag{4}$$
where $\Sigma_M(x_0, x_0)$ is the spatial variance of the random field at $x_0$. Building a
stochastic kriging metamodel in practice requires imposing some structure on the spatial
covariance matrix $\Sigma_M(\cdot, \cdot)$. It is usually assumed that the spatial covariance
between $M(x_i)$ and $M(x_j)$ is
$$\Sigma_M(x_i, x_j) = \mathrm{Cov}[M(x_i), M(x_j)] = \tau^2 R_M(x_i, x_j; \theta), \tag{5}$$
where $\tau^2$ is the spatial variance of the random field and $R_M$ is a correlation function
with parameter $\theta$. The assumption that $M$ is second-order stationary allows us to write
$R_M(x_i, x_j; \theta) = R_M(|x_i - x_j|; \theta)$ (i.e., the correlation depends only on the
distance between $x_i$ and $x_j$). Common candidates for the correlation function include the
triangular, Gaussian, and Matérn correlation functions. See Xie et al. [2010] for a detailed
discussion of the effects of using different correlation functions in stochastic kriging.
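As an illustration of Equations (3)-(5), the following sketch computes the stochastic kriging predictor and its MSE for a constant trend $f(x)^\top \beta = \beta_0$ with known parameters. It is a minimal sketch, not the authors' implementation; the Gaussian correlation function and the array conventions are our assumptions.

```python
import numpy as np

def gaussian_corr(xa, xb, theta):
    """Gaussian correlation R_M(x, x') = exp(-theta * ||x - x'||^2)."""
    return np.exp(-theta * np.sum((xa - xb) ** 2))

def sk_predict(x0, X, Ybar, beta0, tau2, theta, noise_var):
    """MSE-optimal stochastic kriging predictor, Equations (3)-(4),
    with constant trend beta0 and known (tau2, theta).
    X: (k, d) design points; Ybar: (k,) sample-mean responses;
    noise_var: (k,) entries Var[eps_bar(x_i)] = sigma_i^2 / n_i."""
    k = X.shape[0]
    # Spatial covariance matrix Sigma_M built entrywise via Eq. (5).
    Sigma_M = tau2 * np.array([[gaussian_corr(X[i], X[l], theta)
                                for l in range(k)] for i in range(k)])
    Sigma_eps = np.diag(noise_var)   # independent noise across points (no CRN)
    Sigma_M0 = tau2 * np.array([gaussian_corr(x0, X[i], theta) for i in range(k)])
    w = np.linalg.solve(Sigma_M + Sigma_eps, Sigma_M0)
    y_hat = beta0 + w @ (Ybar - beta0)   # Eq. (3)
    mse = tau2 - Sigma_M0 @ w            # Eq. (4); Sigma_M(x0, x0) = tau2
    return y_hat, mse
```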
2.2. Stochastic Kriging With Gradient Estimators
We review the framework of stochastic kriging with gradient estimators (SKG)
introduced by Chen et al. [2013]. SKG builds stochastic kriging models for gradi-
ent estimators upon the stochastic kriging model for simulation responses. These
two types of models are estimated together and applied to approximate response
surfaces.
Suppose that we observe not only the simulation responses $Y_j(x_i)$ but also unbiased
gradient estimates $G_j(x_i) \in \mathbb{R}^d$ for the $j$th simulation replication at design
point $x_i$. Given an experimental design $\{(x_i, n_i)\}_{i=1}^k$, let the gradient estimate
from replication $j$ at design point $x_i$ be
$G_j(x_i) = (G_j^1(x_i), \ldots, G_j^d(x_i))^\top$. In the SKG framework, each response
$Y_j(x_i)$ is modeled the same as in stochastic kriging, and each gradient estimator
$G_j^r(x_i)$, $r = 1, \ldots, d$, is modeled as
$$G_j^r(x_i) = \left(\frac{\partial f(x_i)}{\partial x_{ir}}\right)^{\!\top}\! \beta + \frac{\partial M(x_i)}{\partial x_{ir}} + \delta_j^r(x_i). \tag{6}$$


This is valid under the following conditions:
—The function $f(x_i)$ is differentiable with respect to $x_i$.
—The second-order mixed derivative of the correlation function $R_M$ in Equation (5) exists
and is continuous.
Let $\bar{G}^r(x_i)$ and $\bar{\delta}^r(x_i)$, $r = 1, \ldots, d$, be the sample averages of
the gradient estimates and simulation noise, respectively, associated with $x_i$:
$$\bar{G}^r(x_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} G_j^r(x_i), \qquad \bar{\delta}^r(x_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \delta_j^r(x_i).$$
The SKG framework models the averaged simulation responses and gradient estimates as follows:
$$\bar{Y}(x_i) = f(x_i)^\top \beta + M(x_i) + \bar{\varepsilon}(x_i),$$
$$\bar{G}^r(x_i) = \left(\frac{\partial f(x_i)}{\partial x_{ir}}\right)^{\!\top}\! \beta + \frac{\partial M(x_i)}{\partial x_{ir}} + \bar{\delta}^r(x_i).$$
To satisfy the conditions required for Equation (6) to hold, a common choice for the
correlation function is the Gaussian correlation function. Let $\Sigma_{M^+}$ be the
variance-covariance matrix including the spatial covariances induced by $M$, the spatial
covariances induced by derivatives of $M$, and those between $M$ and its partial derivatives.
Let $\Sigma_{M^+}(x_0, \cdot)$ be the vector analogous to $\Sigma_M(x_0, \cdot)$ in stochastic
kriging. We assume replications across design points are independent. In addition, the
simulation noise terms $\varepsilon_j$ and $\delta_j$ are assumed to be independent of $M$.
The covariance matrix $\Sigma_{\varepsilon^+}$ induced by the simulation noise can be estimated
by the sample covariances in practice.
Let $\bar{Y}^+$ be the vector containing the sample averages of the response estimates and
gradient estimates at all design points:
$$\bar{Y}^+ = (\bar{Y}(x_1), \ldots, \bar{Y}(x_k), \bar{G}^1(x_1), \ldots, \bar{G}^1(x_k), \ldots, \bar{G}^d(x_1), \ldots, \bar{G}^d(x_k))^\top.$$
The design matrix $F$ of Section 2.1 now becomes $F^+$, which can be written as
$$F^+ = \left(f(x_1), \ldots, f(x_k), \frac{\partial f(x_1)}{\partial x_{11}}, \ldots, \frac{\partial f(x_k)}{\partial x_{k1}}, \ldots, \frac{\partial f(x_1)}{\partial x_{1d}}, \ldots, \frac{\partial f(x_k)}{\partial x_{kd}}\right)^{\!\top}.$$
When $\beta$ is known, the SKG predictor and the corresponding MSE can be obtained by
substituting $\bar{Y}^+$, $F^+$, $\Sigma_{M^+}$, $\Sigma_{M^+}(x_0, \cdot)$ and
$\Sigma_{\varepsilon^+}$ for $\bar{Y}$, $F$, $\Sigma_M$, $\Sigma_M(x_0, \cdot)$ and
$\Sigma_\varepsilon$ in Equations (3) and (4), respectively. For some simplified settings,
Chen et al. [2013] show that SKG can reduce the MSE by incorporating gradient estimates.
Numerical results also demonstrate the advantage of SKG over stochastic kriging in improving
prediction performance.

2.3. Gradient Extrapolated Stochastic Kriging


We propose a different approach for incorporating the gradient estimates, called gradient
extrapolated stochastic kriging (GESK). Again, let $G_j(x_i) \in \mathbb{R}^d$ be the gradient
estimator at $x_i$ from replication $j$. Instead of modeling the gradient estimates $G_j(x_i)$
as partial derivatives of the response surface, the gradient estimates are simply viewed as
noisy measurements of the true gradient $G(x_i) \in \mathbb{R}^d$; that is,
$G_j(x_i) = G(x_i) + \delta_j(x_i)$, where $\{\delta_j(x_i)\}_{j=1}^{n_i}$ represent zero-mean,
independent and identically distributed noise across replications at the design point $x_i$.
Denote the sample mean of the gradient
estimates at $x_i$ by
$$\bar{G}(x_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} G_j(x_i).$$

Notice that the response estimate $Y_j(x_i)$ and the gradient estimate $G_j(x_i)$ within the
same replication $j$ are generally correlated.
To incorporate gradient estimates into stochastic kriging, we extrapolate in the neighborhood
of the original design points $x_i$, $i = 1, 2, \ldots, k$. Specifically, linear extrapolation
is used to obtain additional responses as follows:
$$x_i^+ = x_i + \Delta x_i, \qquad Y_j(x_i^+) = Y_j(x_i) + G_j(x_i)^\top \Delta x_i, \tag{7}$$
where $\Delta x_i = (\Delta x_{i1}, \Delta x_{i2}, \ldots, \Delta x_{id})^\top$, and the step
size $\Delta x_i$ needs to be small relative to the spacing of the $x_i$. For simplicity, we
assume that only one additional point is added in the neighborhood of $x_i$ and that the same
step size is used for all design points (i.e., $\Delta x_i = \Delta x$,
$i = 1, 2, \ldots, k$). Extensions include using more sophisticated extrapolation techniques
and extrapolating multiple additional responses in the neighborhood of $x_i$.
Let $\bar{Y}(x_i^+)$ be the sample average of these extrapolated response outputs, defined
similarly to $\bar{Y}(x_i)$ in Equation (2). For ease of notation, let
$\bar{Y}_i = \bar{Y}(x_i)$ and $\bar{Y}_i^+ = \bar{Y}(x_i^+)$. Let $\bar{Y}^+$ be the
$2k \times 1$ vector containing both the original responses and the additional responses:
$$\bar{Y}^+ = (\bar{Y}_1, \bar{Y}_2, \ldots, \bar{Y}_k, \bar{Y}_1^+, \bar{Y}_2^+, \ldots, \bar{Y}_k^+)^\top.$$
Similarly, $x^+$ is defined as
$$x^+ = (x_1, x_2, \ldots, x_k, x_1^+, x_2^+, \ldots, x_k^+)^\top.$$
The sample averages of the additional responses $\bar{Y}_i^+$ are modeled similarly to the
original responses $\bar{Y}_i$; that is,
$$\bar{Y}_i^+ = \bar{Y}(x_i^+) = f(x_i^+)^\top \beta + M(x_i^+) + \bar{\varepsilon}(x_i^+).$$
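The extrapolation step of Equation (7) is straightforward to implement. The sketch below is our own illustration (the array shapes are assumptions), producing the extrapolated points and the augmented response vector $\bar{Y}^+$:

```python
import numpy as np

def gesk_augment(X, Y, G, dx):
    """Linear extrapolation of Eq. (7): one additional point per design point.
    X: (k, d) design points; Y: (k, n) responses Y_j(x_i);
    G: (k, n, d) gradient estimates G_j(x_i); dx: (d,) common step size."""
    X_plus = X + dx       # x_i^+ = x_i + Delta x
    Y_plus = Y + G @ dx   # Y_j(x_i^+) = Y_j(x_i) + G_j(x_i)' Delta x
    # Augmented vector of sample means (Ybar_1, ..., Ybar_k, Ybar_1^+, ..., Ybar_k^+).
    Ybar_plus = np.concatenate([Y.mean(axis=1), Y_plus.mean(axis=1)])
    return X_plus, Y_plus, Ybar_plus
```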

It is worth mentioning that this approach of incorporating gradient information is not
restricted to stochastic kriging; it is a general approach that can be applied to other
metamodeling approaches. The following assumptions are made:
ASSUMPTION 2.1.
(1) Simulations across design points are conducted independently; that is, the use of common
random numbers (CRN) is not considered.
(2) For any design point $x_i$, the noise terms $\varepsilon_j(x_i)$ are independent across
replications.
(3) The random field $M$ is independent of all noise terms $\varepsilon_j(x_i)$ and
$\varepsilon_j(x_i^+)$, for each design point $x_i$ and replication $j$.
(4) The simulation noise terms $\bar{\varepsilon}(x_l)$ are independent of
$\bar{\varepsilon}(x_h^+)$ for $h \neq l$.

Chen et al. [2012] find that using CRN in stochastic kriging generally inflates the
MSE. Assuming independence across replications and independence between M and the
simulation noise is inherited from the stochastic kriging literature. The last assump-
tion says that the original simulation response is correlated with its corresponding
extrapolated response, but not with extrapolated responses at other design points.


Let $\Sigma_M^+$ be the $2k \times 2k$ variance-covariance matrix implied by the extrinsic
spatial correlation model with $2k$ design points, including the extrapolated design points:
$$\Sigma_M^+ = \begin{pmatrix}
\mathrm{Cov}[M(x_1), M(x_1)] & \cdots & \mathrm{Cov}[M(x_1), M(x_k)] & \mathrm{Cov}[M(x_1), M(x_1^+)] & \cdots & \mathrm{Cov}[M(x_1), M(x_k^+)] \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}[M(x_k), M(x_1)] & \cdots & \mathrm{Cov}[M(x_k), M(x_k)] & \mathrm{Cov}[M(x_k), M(x_1^+)] & \cdots & \mathrm{Cov}[M(x_k), M(x_k^+)] \\
\mathrm{Cov}[M(x_1^+), M(x_1)] & \cdots & \mathrm{Cov}[M(x_1^+), M(x_k)] & \mathrm{Cov}[M(x_1^+), M(x_1^+)] & \cdots & \mathrm{Cov}[M(x_1^+), M(x_k^+)] \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}[M(x_k^+), M(x_1)] & \cdots & \mathrm{Cov}[M(x_k^+), M(x_k)] & \mathrm{Cov}[M(x_k^+), M(x_1^+)] & \cdots & \mathrm{Cov}[M(x_k^+), M(x_k^+)]
\end{pmatrix},$$
where each entry of $\Sigma_M^+$ can be computed by Equation (5) with a given correlation
function $R_M$ and spatial variance $\tau^2$.
Let $\bar{\varepsilon}^+ \in \mathbb{R}^{2k}$ be the augmented vector of mean simulation noise:
$$\bar{\varepsilon}^+ = (\bar{\varepsilon}(x_1), \ldots, \bar{\varepsilon}(x_k), \bar{\varepsilon}(x_1^+), \ldots, \bar{\varepsilon}(x_k^+))^\top.$$

Under Assumption 2.1, let $\Sigma_\varepsilon^+$ be the $2k \times 2k$ variance-covariance
matrix induced by $\bar{\varepsilon}^+$, which can be expressed as
$$\Sigma_\varepsilon^+ = \begin{pmatrix}
\mathrm{Var}[\bar{\varepsilon}(x_1)] & & & \mathrm{Cov}[\bar{\varepsilon}(x_1), \bar{\varepsilon}(x_1^+)] & & \\
& \ddots & & & \ddots & \\
& & \mathrm{Var}[\bar{\varepsilon}(x_k)] & & & \mathrm{Cov}[\bar{\varepsilon}(x_k), \bar{\varepsilon}(x_k^+)] \\
\mathrm{Cov}[\bar{\varepsilon}(x_1^+), \bar{\varepsilon}(x_1)] & & & \mathrm{Var}[\bar{\varepsilon}(x_1^+)] & & \\
& \ddots & & & \ddots & \\
& & \mathrm{Cov}[\bar{\varepsilon}(x_k^+), \bar{\varepsilon}(x_k)] & & & \mathrm{Var}[\bar{\varepsilon}(x_k^+)]
\end{pmatrix},$$
where all entries not shown are zero.

Let $x_0$ be a prediction point and $\Sigma_M^+(x_0, \cdot)$ be the $2k \times 1$ vector
$$\Sigma_M^+(x_0, \cdot) = (\mathrm{Cov}[M(x_0), M(x_1)], \ldots, \mathrm{Cov}[M(x_0), M(x_k^+)])^\top,$$
which represents the spatial covariances between $x_0$ and the design points, including the
extrapolated design points. The augmented design matrix $F^+$ can be expressed as
$$F^+ = (f(x_1), \ldots, f(x_k), f(x_1^+), \ldots, f(x_k^+))^\top.$$
When $\beta$, $\Sigma_M^+$ and $\Sigma_\varepsilon^+$ are known, the MSE-optimal predictor from
the GESK model and its corresponding MSE can be constructed by substituting $\bar{Y}^+$, $F^+$,
$\Sigma_M^+(x_0, \cdot)$, $\Sigma_M^+$ and $\Sigma_\varepsilon^+$ for $\bar{Y}$, $F$,
$\Sigma_M(x_0, \cdot)$, $\Sigma_M$ and $\Sigma_\varepsilon$ in Equations (3) and (4),
respectively.
In practice, $\beta$, $\Sigma_M^+$ and $\Sigma_\varepsilon^+$ are unknown and need to be
estimated. The augmented matrix $\Sigma_M^+$ is characterized by the spatial variance $\tau^2$
and the correlation function with parameters $\theta$. We assume that the simulation noise
vectors
$\varepsilon_j^+ = (\varepsilon_j(x_1), \ldots, \varepsilon_j(x_k), \varepsilon_j(x_1^+), \ldots, \varepsilon_j(x_k^+))^\top$
are multivariate normally distributed with mean zero and covariance matrix
$\Sigma_\varepsilon^+$. Given this assumption, we first estimate $\Sigma_\varepsilon^+$. Our
approach to estimating $\mathrm{Var}[\bar{\varepsilon}(x_i)]$,
$\mathrm{Var}[\bar{\varepsilon}(x_i^+)]$ and
$\mathrm{Cov}[\bar{\varepsilon}(x_i), \bar{\varepsilon}(x_i^+)]$, $i = 1, 2, \ldots, k$, is
described in the following.
To estimate $\mathrm{Var}[\bar{\varepsilon}(x_i)]$, we use
$$\widehat{\mathrm{Var}}[\bar{\varepsilon}(x_i)] = \frac{1}{n_i}\left[\frac{1}{n_i - 1}\sum_{j=1}^{n_i} \big(Y_j(x_i) - \bar{Y}(x_i)\big)^2\right].$$
Estimation of $\mathrm{Var}[\bar{\varepsilon}(x_i^+)]$ can be done in a similar fashion by
replacing $Y_j(x_i)$ with $Y_j(x_i^+)$. The covariance
$\mathrm{Cov}[\bar{\varepsilon}(x_i), \bar{\varepsilon}(x_i^+)]$ is estimated by the sample
covariance as
$$\widehat{\mathrm{Cov}}[\bar{\varepsilon}(x_i), \bar{\varepsilon}(x_i^+)] = \frac{1}{n_i}\left[\frac{1}{n_i - 1}\sum_{j=1}^{n_i} \big(Y_j(x_i) - \bar{Y}(x_i)\big)\big(Y_j(x_i^+) - \bar{Y}(x_i^+)\big)\right].$$

This provides us with an estimate $\widehat{\Sigma}_\varepsilon^+$ for $\Sigma_\varepsilon^+$.
Combining this with the normality assumption, we can then estimate the set of parameters
$(\beta, \tau^2, \theta)$ together using maximum likelihood estimation, as described in
Ankenman et al. [2010].
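For concreteness, the intrinsic variance and covariance estimators above might be coded as follows (a sketch using the paired replications at one design point; not taken from the authors' software):

```python
import numpy as np

def intrinsic_cov_entries(Yi, Yi_plus):
    """Estimate Var[eps_bar(x_i)], Var[eps_bar(x_i^+)] and
    Cov[eps_bar(x_i), eps_bar(x_i^+)] from the n_i paired replications."""
    n = len(Yi)
    var_i = np.var(Yi, ddof=1) / n            # Var-hat[eps_bar(x_i)]
    var_ip = np.var(Yi_plus, ddof=1) / n      # Var-hat[eps_bar(x_i^+)]
    cov_iip = np.cov(Yi, Yi_plus)[0, 1] / n   # sample covariance / n_i
    return var_i, var_ip, cov_iip
```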
3. ANALYSIS OF THE GESK MODEL AND CHOICES OF STEP SIZES
Key to implementing the GESK model is the choice of step sizes for the extrapolated
points, which depends on analyzing the potential improvements in performance from
the GESK model, as well as the approximation errors introduced by extrapolation. A
good GESK model should take this bias-variance type tradeoff into consideration. We
consider two tractable models: a two-point problem and a k-point problem with known
model parameters. Under these two settings, we analyze the potential improvement
in MSE by the GESK model over the stochastic kriging model and provide conditions
under which such improvement can be guaranteed.
3.1. A Two-Point Problem with Single Extrapolated Point
Consider a one-dimensional problem (d = 1) of two design points x1 and x2 with num-
bers of replications n1 and n2 , respectively. Without loss of generality, let x1 < x2 and the
prediction point be x0 ∈ [x1 , x2 ]. The simulation outputs include responses {Y j (xi )}nj=1 i

n1
for i = 1, 2 at both design points and gradient estimators {G j (x1 )} j=1 at x1 only. A
constant trend is used to represent the overall surface mean (i.e., f(xi ) β = β0 ). All
parameters (β0 , τ 2 , θ ) are assumed to be known.
Let the spatial variance τ 2 > 0 and ril be the correlation between M(xi ) and M(xl ),
i, l = 0, 1, . . . , k. The correlation ril can be calculated from the correlation function
RM (xi , xl ; θ ), but no specific correlation function is specified in this discussion. Let the
variance of the simulation noise at xi from replication j be Var[ j (xi )] = σi2 .
Let $\bar{Y} = (\bar{Y}_1, \bar{Y}_2)^\top$ be the vector containing the sample means of the
responses at $x_1$ and $x_2$. The stochastic kriging predictor at $x_0$ is given by
$$\hat{Y}(x_0) = \beta_0 + \tau^2\,\frac{\big(r_{01}\big(\tau^2 + \frac{\sigma_2^2}{n_2}\big) - r_{02}\tau^2 r_{12}\big)(\bar{Y}_1 - \beta_0) + \big(r_{02}\big(\tau^2 + \frac{\sigma_1^2}{n_1}\big) - r_{01}\tau^2 r_{12}\big)(\bar{Y}_2 - \beta_0)}{\big(\tau^2 + \frac{\sigma_1^2}{n_1}\big)\big(\tau^2 + \frac{\sigma_2^2}{n_2}\big) - \tau^4 r_{12}^2}, \tag{8}$$
with corresponding MSE
$$\mathrm{MSE}(\hat{Y}(x_0)) = \tau^2\left(1 - \tau^2\,\frac{(r_{01}^2 + r_{02}^2)\tau^2 + \frac{r_{01}^2\sigma_2^2}{n_2} + \frac{r_{02}^2\sigma_1^2}{n_1} - 2 r_{01} r_{02} r_{12}\tau^2}{\big(\tau^2 + \frac{\sigma_1^2}{n_1}\big)\big(\tau^2 + \frac{\sigma_2^2}{n_2}\big) - \tau^4 r_{12}^2}\right). \tag{9}$$
With a prespecified step size $\Delta x$, a new design point $x_1^+ = x_1 + \Delta x$ in the
interval $[x_1, x_2]$ is added, and GESK extrapolates its response as
$Y_j(x_1^+) = Y_j(x_1) + \Delta x\, G_j(x_1)$.


This additional response output is modeled as
$Y_j(x_1^+) = \beta_0 + M(x_1^+) + \varepsilon_j(x_1^+)$. To address the approximation error
introduced by extrapolation, we assume that $\varepsilon_j(x_1^+)$ is normally distributed with
mean $\zeta_1 = \zeta(x_1)$ and variance $\sigma_{1^+}^2$; thus, the extrapolated responses
$Y_j(x_1^+)$ at $x_1^+$ are biased unless $\zeta_1 = 0$.
Let $\bar{Y}_1^+$ be the sample mean of the responses at $x_1^+$ and let
$\bar{Y}^+ = (\bar{Y}_1, \bar{Y}_2, \bar{Y}_1^+)^\top$. Let $\rho_1$ be the correlation between
$\bar{\varepsilon}(x_1)$ and $\bar{\varepsilon}(x_1^+)$, and let $r_{i1^+}$ be the correlation
between $M(x_i)$ and $M(x_1^+)$ for $i = 0, 1, 2$. The variance-covariance matrix
$\Sigma^+ = \Sigma_M^+ + \Sigma_\varepsilon^+$ takes the form
$$\Sigma^+ = \tau^2 \begin{pmatrix} 1 & r_{12} & r_{11^+} \\ r_{12} & 1 & r_{21^+} \\ r_{11^+} & r_{21^+} & 1 \end{pmatrix} + \begin{pmatrix} \frac{\sigma_1^2}{n_1} & 0 & \rho_1\frac{\sigma_1\sigma_{1^+}}{n_1} \\ 0 & \frac{\sigma_2^2}{n_2} & 0 \\ \rho_1\frac{\sigma_1\sigma_{1^+}}{n_1} & 0 & \frac{\sigma_{1^+}^2}{n_1} \end{pmatrix} = \begin{pmatrix} \Sigma & b \\ b^\top & c \end{pmatrix},$$
where $\Sigma$ is the $2 \times 2$ covariance matrix of the vector
$(\bar{Y}_1, \bar{Y}_2)^\top$, $b$ is a $2 \times 1$ vector, and
$c = \tau^2 + \sigma_{1^+}^2/n_1$. The vector containing the covariances between $M(x_0)$ and
$(M(x_1), M(x_2), M(x_1^+))^\top$ is
$$\Sigma_M^+(x_0, \cdot) = \tau^2 \begin{pmatrix} r_{01} \\ r_{02} \\ r_{01^+} \end{pmatrix} = \begin{pmatrix} \Sigma_M(x_0, \cdot) \\ \tau^2 r_{01^+} \end{pmatrix}.$$
The new predictor at $x_0$ from the GESK model is
$$\hat{Y}^+(x_0) = \hat{Y}(x_0) + \frac{1}{v}\left[b^\top \Sigma^{-1}(\bar{Y} - \beta_0 \mathbf{1}_2) - (\bar{Y}_1^+ - \beta_0)\right]\left[\Sigma_M(x_0, \cdot)^\top \Sigma^{-1} b - \tau^2 r_{01^+}\right], \tag{10}$$
where $\hat{Y}(x_0)$ is defined in Equation (8) and $v = c - b^\top \Sigma^{-1} b$.
The following theorem provides an expression for $\mathrm{MSE}(\hat{Y}^+(x_0))$ and a condition
under which the GESK predictor in Equation (10) has smaller MSE than that in Equation (8).

THEOREM 3.1. The MSE of the predictor in Equation (10) can be expressed as
$$\mathrm{MSE}(\hat{Y}^+(x_0)) = \mathrm{MSE}(\hat{Y}(x_0)) + \left(\frac{\zeta_1^2}{v^2} - \frac{1}{v}\right)\left[\Sigma_M(x_0, \cdot)^\top \Sigma^{-1} b - \tau^2 r_{01^+}\right]^2, \tag{11}$$
and the GESK predictor has a smaller MSE if $\zeta_1^2 < v$.

PROOF. The MSE of the GESK predictor, $\mathrm{MSE}(\hat{Y}^+(x_0))$, follows from
straightforward calculation. Both $\Sigma^+$ and $\Sigma$ are variance-covariance matrices, and
$$\det(\Sigma^+) = \det(\Sigma)\det(c - b^\top \Sigma^{-1} b) = v \det(\Sigma),$$
so it follows that $v > 0$ because both $\det(\Sigma^+)$ and $\det(\Sigma)$ are positive.
Because $v > 0$, the condition $\zeta_1^2 < v$ is well defined. Under this condition, the GESK
predictor has a smaller MSE than the stochastic kriging predictor.
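Theorem 3.1 is easy to verify numerically. The sketch below instantiates the two-point problem with a Gaussian correlation function; all parameter values, including the correlation $\rho_1$ and the bias $\zeta_1$, are illustrative assumptions of ours, not values from the paper. Whenever $\zeta_1^2 < v$, the computed change in MSE from Equation (11) is negative:

```python
import numpy as np

tau2, theta = 1.0, 1.0                # assumed spatial parameters
x1, x2, x0 = 0.0, 1.0, 0.5            # design and prediction points
dx = 0.2; x1p = x1 + dx               # extrapolated point x_1^+
s1, s2, s1p = 0.5, 0.5, 0.5           # noise std devs sigma_1, sigma_2, sigma_1+
n1, n2 = 100, 100
rho1, zeta1 = 0.8, 0.01               # noise correlation and extrapolation bias

r = lambda a, b: np.exp(-theta * (a - b) ** 2)   # Gaussian correlation

Sigma = tau2 * np.array([[1.0, r(x1, x2)], [r(x1, x2), 1.0]]) \
        + np.diag([s1 ** 2 / n1, s2 ** 2 / n2])
b = np.array([tau2 * r(x1, x1p) + rho1 * s1 * s1p / n1, tau2 * r(x2, x1p)])
c = tau2 + s1p ** 2 / n1
v = c - b @ np.linalg.solve(Sigma, b)            # v = c - b' Sigma^{-1} b > 0

SigM0 = tau2 * np.array([r(x0, x1), r(x0, x2)])
corr_term = SigM0 @ np.linalg.solve(Sigma, b) - tau2 * r(x0, x1p)
delta_mse = (zeta1 ** 2 / v ** 2 - 1.0 / v) * corr_term ** 2   # Eq. (11)
print(v, zeta1 ** 2 < v, delta_mse)   # bias condition holds -> delta_mse < 0
```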

3.2. A k-Point Problem


In this section, we consider a tractable problem with $k$ design points, where
$x_i \in \mathbb{R}^d$, under the following assumptions:
(1) Along with the response outputs $Y_j(x_i)$, gradient estimators $G_j(x_i)$ are also
collected from simulations at the design points $\{x_i\}_{i=1}^k$.
(2) Only one additional response is extrapolated in the neighborhood of each design point.
(3) The trend $f(x_i)^\top \beta = \beta_0$, and all parameters $(\beta_0, \tau^2, \theta)$
are known.


Within the region of interest, an additional response Y + j (xi ) is extrapolated using


each pair of observations (Y j (xi ), G j (xi )). All extrapolated design points should be in
the interior of the design region; therefore, extrapolations from design points at the
boundary should be done with care. Let Ȳ + be the 2k × 1 vector that consists of sample
means of all response outputs
Ȳ + = (Ȳ1 , Ȳ2 , . . . , Ȳk, Ȳ1+ , Ȳ2+ , . . . , Ȳk+ ) ,
where Ȳi = Ȳ(xi ) and Ȳi+ = Ȳ(xi+ ). The sample mean of original responses at xi are
modeled as in Section 2.3. The sample mean of extrapolated responses are modeled
similarly, that is, Ȳ(xi+ ) = β0 + M(xi+ ) + ¯ (xi+ ).
Let $\rho_i$ denote the correlation between $\bar{\varepsilon}(x_i)$ and
$\bar{\varepsilon}(x_i^+)$. The spatial correlations between original design points and
extrapolated design points are denoted by $r_{il} = \mathrm{Corr}[M(x_i), M(x_l)]$,
$r_{il^+} = \mathrm{Corr}[M(x_i), M(x_l^+)]$, and
$r_{i^+l^+} = \mathrm{Corr}[M(x_i^+), M(x_l^+)]$ for $i, l = 1, 2, \ldots, k$. The
$2k \times 2k$ variance-covariance matrix $\Sigma^+ = \Sigma_M^+ + \Sigma_\varepsilon^+$ can be
expressed in block form as
$$\Sigma^+ = \begin{pmatrix} \Sigma & B \\ B^\top & C \end{pmatrix}, \tag{12}$$
where
$$\Sigma = \begin{pmatrix}
\tau^2 + \sigma_1^2/n_1 & \tau^2 r_{12} & \cdots & \tau^2 r_{1k} \\
\tau^2 r_{12} & \tau^2 + \sigma_2^2/n_2 & \cdots & \tau^2 r_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
\tau^2 r_{1k} & \tau^2 r_{2k} & \cdots & \tau^2 + \sigma_k^2/n_k
\end{pmatrix},$$
$$B = \begin{pmatrix}
\tau^2 r_{11^+} + \rho_1 \frac{\sigma_1 \sigma_{1^+}}{n_1} & \tau^2 r_{12^+} & \cdots & \tau^2 r_{1k^+} \\
\tau^2 r_{12^+} & \tau^2 r_{22^+} + \rho_2 \frac{\sigma_2 \sigma_{2^+}}{n_2} & \cdots & \tau^2 r_{2k^+} \\
\vdots & \vdots & \ddots & \vdots \\
\tau^2 r_{1k^+} & \tau^2 r_{2k^+} & \cdots & \tau^2 r_{kk^+} + \rho_k \frac{\sigma_k \sigma_{k^+}}{n_k}
\end{pmatrix},$$
$$C = \begin{pmatrix}
\tau^2 + \sigma_{1^+}^2/n_1 & \tau^2 r_{1^+2^+} & \cdots & \tau^2 r_{1^+k^+} \\
\tau^2 r_{1^+2^+} & \tau^2 + \sigma_{2^+}^2/n_2 & \cdots & \tau^2 r_{2^+k^+} \\
\vdots & \vdots & \ddots & \vdots \\
\tau^2 r_{1^+k^+} & \tau^2 r_{2^+k^+} & \cdots & \tau^2 + \sigma_{k^+}^2/n_k
\end{pmatrix}.$$

Given a prediction point $x_0$, let $\Sigma_M^+(x_0, \cdot)$ be the $2k \times 1$ vector that
consists of the spatial covariances between $x_0$ and all design points,
$$\Sigma_M^+(x_0, \cdot) = \big(\Sigma_M(x_0, x_1), \ldots, \Sigma_M(x_0, x_k), \Sigma_M(x_0, x_1^+), \ldots, \Sigma_M(x_0, x_k^+)\big)^\top = \big(\Sigma_M(x_0, \cdot)^\top\ \ \Sigma_{M^+}(x_0, \cdot)^\top\big)^\top,$$
where both $\Sigma_M(x_0, \cdot)$ and $\Sigma_{M^+}(x_0, \cdot)$ are $k \times 1$ vectors.
As in the analysis of the two-point problem, an important issue to address is the approximation
error introduced by extrapolation. Let the noise terms $\varepsilon(x_i^+)$ at $x_i^+$ follow
normal distributions with means $\zeta_i = \zeta(x_i)$, which implies that the additional
response outputs $Y_j(x_i^+)$ are biased if $\zeta_i \neq 0$. We analyze the effects of
incorporating them in the following. Let the vector $\zeta \in \mathbb{R}^{2k}$ be
$$\zeta = (0, 0, \ldots, 0, \zeta_1, \zeta_2, \ldots, \zeta_k)^\top = \big(\mathbf{0}_k^\top\ \ \zeta_k^\top\big)^\top,$$
which represents the expectation of the $2k \times 1$ noise vector $\bar{\varepsilon}^+$.


Let $\hat{Y}^+(x_0)$ be the GESK predictor at $x_0$. The MSE of the GESK predictor for this
k-point problem is
$$\begin{aligned}
\mathrm{MSE}(\hat{Y}^+(x_0)) &= \Sigma_M^+(x_0, x_0) - \Sigma_M^+(x_0, \cdot)^\top \big[\Sigma_M^+ + \Sigma_\varepsilon^+\big]^{-1} \Sigma_M^+(x_0, \cdot) + \Big(\Sigma_M^+(x_0, \cdot)^\top \big[\Sigma_M^+ + \Sigma_\varepsilon^+\big]^{-1} \zeta\Big)^2 \\
&= \mathrm{MSE}(\hat{Y}_{2k}(x_0)) + \Big(\Sigma_M^+(x_0, \cdot)^\top \big[\Sigma_M^+ + \Sigma_\varepsilon^+\big]^{-1} \zeta\Big)^2. \tag{13}
\end{aligned}$$
The first term, $\mathrm{MSE}(\hat{Y}_{2k}(x_0))$, is the MSE of the prediction that one would
obtain if unbiased responses were collected at $2k$ design points (i.e., running simulations at
$x_i^+$ to collect response estimates rather than extrapolating additional response estimates).
The second term is the inflation of the MSE caused by the approximation errors $\zeta$ in the
additional extrapolated responses.
Let $\hat{Y}(x_0)$ be the stochastic kriging predictor with $k$ design points. Our interest is
to compare the MSE of the GESK predictor $\hat{Y}^+(x_0)$ with that of $\hat{Y}(x_0)$. To
achieve this, we begin by looking into the MSE of $\hat{Y}_{2k}(x_0)$.
Using the Woodbury matrix identity and the block inverse formula from linear algebra, the MSE
of $\hat{Y}_{2k}(x_0)$ can be expressed as
$$\mathrm{MSE}(\hat{Y}_{2k}(x_0)) = \Sigma_M^+(x_0, x_0) - \Sigma_M^+(x_0, \cdot)^\top (\Sigma^+)^{-1} \Sigma_M^+(x_0, \cdot) = \mathrm{MSE}(\hat{Y}(x_0)) - \omega^\top V \omega, \tag{14}$$
where $\omega = B^\top \Sigma^{-1} \Sigma_M(x_0, \cdot) - \Sigma_{M^+}(x_0, \cdot)$ and
$V = (C - B^\top \Sigma^{-1} B)^{-1}$.


LEMMA 3.2. The matrix $V = (C - B^\top \Sigma^{-1} B)^{-1}$ is positive definite.
PROOF. Consider the $2k \times 2k$ covariance matrix
$$\Sigma^+ = \begin{pmatrix} \Sigma & B \\ B^\top & C \end{pmatrix}.$$
First, it is easy to see that $\Sigma^+$ is positive definite, so
$$\begin{pmatrix} u^\top & v^\top \end{pmatrix} \begin{pmatrix} \Sigma & B \\ B^\top & C \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} > 0$$
for any nonzero $k \times 1$ vectors $u, v \in \mathbb{R}^k$. This leads to
$$u^\top \Sigma u + 2 v^\top B^\top u + v^\top C v > 0.$$
For a fixed vector $v$, consider
$f(u) = u^\top \Sigma u + 2 v^\top B^\top u + v^\top C v$ as a function of $u$. The first-order
condition shows that the minimum of $f(u)$ is
$$\min_u f(u) = v^\top (C - B^\top \Sigma^{-1} B) v,$$
which has to be positive for any $v \neq 0$. Therefore, the matrix $C - B^\top \Sigma^{-1} B$
is positive definite, and its inverse $V = (C - B^\top \Sigma^{-1} B)^{-1}$ is also positive
definite.
Because the matrix $V$ is positive definite, it follows immediately that
$$\mathrm{MSE}(\hat{Y}_{2k}(x_0)) = \mathrm{MSE}(\hat{Y}(x_0)) - \omega^\top V \omega \leq \mathrm{MSE}(\hat{Y}(x_0)),$$
where equality holds if and only if $\omega = 0$. Thus, not surprisingly, the MSE is reduced
if the $k$ additional response outputs are unbiased.


Next, we investigate the effect of the extrapolation bias on the overall MSE. Combining
Equations (13) and (14) gives
$$\begin{aligned}
\mathrm{MSE}(\hat{Y}^+(x_0)) &= \mathrm{MSE}(\hat{Y}_{2k}(x_0)) + \Big(\Sigma_M^+(x_0, \cdot)^\top \big[\Sigma_M^+ + \Sigma_\varepsilon^+\big]^{-1} \zeta\Big)^2 \\
&= \mathrm{MSE}(\hat{Y}(x_0)) - \omega^\top V \omega + \Big(\Sigma_M^+(x_0, \cdot)^\top \big[\Sigma_M^+ + \Sigma_\varepsilon^+\big]^{-1} \zeta\Big)^2 \\
&= \mathrm{MSE}(\hat{Y}(x_0)) - \omega^\top V \omega + \Big(\big[\Sigma_{M^+}(x_0, \cdot)^\top V - \Sigma_M(x_0, \cdot)^\top \Sigma^{-1} B V\big] \zeta_k\Big)^2 \\
&= \mathrm{MSE}(\hat{Y}(x_0)) - \omega^\top V \omega + (\omega^\top V \zeta_k)^2 \tag{15} \\
&= \mathrm{MSE}(\hat{Y}(x_0)) + \omega^\top V \zeta_k \zeta_k^\top V \omega - \omega^\top V \omega \\
&= \mathrm{MSE}(\hat{Y}(x_0)) + \omega^\top \big(V \zeta_k \zeta_k^\top V - V\big) \omega.
\end{aligned}$$

The next theorem provides a sufficient condition under which $\mathrm{MSE}(\hat{Y}^+(x_0))$ is
smaller than $\mathrm{MSE}(\hat{Y}(x_0))$.
THEOREM 3.3. Let $\lambda_i(A)$ denote the $i$th largest eigenvalue of a matrix $A$,
$i = 1, 2, \ldots, k$. The symmetric matrix $W = V \zeta_k \zeta_k^\top V - V$ is negative
definite if
$$\zeta_k^\top \zeta_k \leq \frac{\lambda_k(V)}{[\lambda_1(V)]^2}. \tag{16}$$
PROOF. Using Weyl's inequality from matrix theory and Corollary 11 in Zhang and Zhang [2006],
the largest eigenvalue $\lambda_1(W)$ of $W$ satisfies
$$\begin{aligned}
\lambda_1(W) &= \lambda_1\big(V \zeta_k \zeta_k^\top V - V\big) \\
&\leq \lambda_1\big(V \zeta_k \zeta_k^\top V\big) + \lambda_1(-V) \\
&= \lambda_1\big(V \zeta_k \zeta_k^\top V\big) - \lambda_k(V) \\
&\leq [\lambda_1(V)]^2 \lambda_1\big(\zeta_k \zeta_k^\top\big) - \lambda_k(V).
\end{aligned}$$
The $k \times k$ matrix $\zeta_k \zeta_k^\top$ is known as a dyad, which has one positive
eigenvalue $\zeta_k^\top \zeta_k$ and $k - 1$ zero eigenvalues, provided that
$\zeta_k \neq 0$. It follows that the largest eigenvalue of $\zeta_k \zeta_k^\top$ is
$\lambda_1(\zeta_k \zeta_k^\top) = \zeta_k^\top \zeta_k$. Applying Condition (16), we have
$\lambda_1(W) < 0$; that is, the largest eigenvalue of $W$ is negative, so all eigenvalues of
$W$ are negative and, therefore, the matrix $W$ is negative definite.
When the matrix $W$ is negative definite, the quantity $\omega^\top W \omega$ is always
negative unless $\omega = 0$, so the GESK model reduces the MSE for the k-point problem if
Condition (16) holds.
Remark 3.4. Condition (16) is well defined, as the matrix $V$ was shown to be positive
definite in Lemma 3.2. Theorem 3.3 shows that when the biases are relatively small, the
reduction in MSE from including the additional extrapolated points still exceeds the inflation
in MSE introduced by the bias of the extrapolated points.
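Condition (16) can also be checked numerically for a given $V$ and bias vector $\zeta_k$. The following sketch (with a randomly generated positive definite $V$ as an assumption) scales $\zeta_k$ to satisfy the bound and confirms that $W$ is negative definite:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
A = rng.standard_normal((k, k))
V = A @ A.T + k * np.eye(k)                   # an assumed positive definite V
lam = np.linalg.eigvalsh(V)                   # eigenvalues in ascending order
bound = lam[0] / lam[-1] ** 2                 # lambda_k(V) / [lambda_1(V)]^2

zeta = rng.standard_normal(k)
zeta *= np.sqrt(0.5 * bound / (zeta @ zeta))  # scale so zeta' zeta <= bound
W = np.outer(V @ zeta, V @ zeta) - V          # W = V zeta zeta' V - V
print(np.all(np.linalg.eigvalsh(W) < 0))      # True: W is negative definite
```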

3.3. Effects of Step Size on MSE


In this section, we analyze the effects of step sizes on MSE following the discussions
of the two-point problem and the k-point problem in Sections 3.1 and 3.2, respectively.
Understanding the effects of step sizes will provide insight for determining step sizes,
which will be discussed later in detail in Section 4. In the following discussion, we
continue to assume that the same step size is used for extrapolation at each design
point, and only one additional response is extrapolated in the neighborhood of each
original design point.


The step size $\Delta x$ determines the MSE of the GESK predictor through several factors: the
biases $\zeta_i$ in the extrapolated responses; the correlations $\rho_i$ between the
simulation noise of the original responses and that of the extrapolated responses; and the
variances $\sigma_{i^+}^2$ of the simulation noise in the extrapolated responses. Because
linear extrapolation is employed in the GESK model, the bias $\zeta_i$ in the extrapolated
response $\bar{Y}_i^+$ is bounded by $K\|\Delta x\|^2$ for some $K > 0$. The correlation
$\rho_i$ depends on both the step size and the covariance between the simulation noise of the
responses and that of the gradient estimators. A larger step size or smaller covariance leads
to a smaller correlation factor $\rho_i$, whereas $\sigma_{i^+}^2$ changes as the step size
changes but also depends on the sign of the correlation $\rho_i$. The effects of these factors
are discussed in detail in the following, using the two-point problem and k-point problem of
Sections 3.1 and 3.2.
3.3.1. The Two-Point Problem. Theorem 3.1 provided the change in MSE at a prediction point
$x_0$:
$$\Delta\mathrm{MSE} = \left(\frac{\zeta_1^2}{v^2} - \frac{1}{v}\right)\left[\Sigma_M(x_0, \cdot)^\top \Sigma^{-1} b - \tau^2 r_{01^+}\right]^2.$$
We summarize our findings in this setting as follows:
(1) The bias $\zeta_1$ must satisfy $\zeta_1^2 < v$, as shown in Theorem 3.1, to guarantee a
reduction in MSE. Because $\zeta_1$ is proportional to $(\Delta x)^2$, intuitively the step
size should be relatively small.
(2) The greater the correlation $\rho_1$, the greater the reduction in MSE. The parameter
$\rho_1$ also depends on the correlation between $\varepsilon_j(x_1)$ and $\delta_j(x_1)$
(i.e., the simulation noise of the output responses and gradient estimators); $\rho_1$
increases as the correlation between $\varepsilon_j(x_1)$ and $\delta_j(x_1)$ increases.
(3) The parameter $\sigma_{1^+}^2$ represents the noise in an extrapolated response
$Y_j(x_1^+)$. The reduction in MSE is greater if $\sigma_{1^+}^2$ is smaller.
All conditions seem to favor smaller step sizes. However, other difficulties arise if the step
sizes are too small: first, because the quantity $v$ becomes smaller as $\Delta x$ becomes
smaller, the condition $\zeta_1^2 < v$ may not hold; second, as $\Delta x$ approaches zero,
the correlation $\rho_1$ approaches 1, which may make the matrix $\Sigma^+$ ill conditioned,
leading to numerical issues.
3.3.2. The k-Point Problem. In addition to the assumptions in Section 3.2, we also assume
that the $k$ design points are widely spread, such that the spatial correlation between
design points is approximately 0. A similar assumption is used in Chen et al. [2013] to
isolate the impact of incorporating gradient estimators from the spatial covariances. This
implies that the matrix $\Sigma$ in Equation (12) is a diagonal matrix. As the step size
$\Delta x$ is usually small, we assume the same property also holds for $B$ and $C$ in
Equation (12). The change in MSE of the GESK predictor is
$$\Delta\mathrm{MSE} = \omega^\top \big(V \zeta_k \zeta_k^\top V - V\big) \omega,$$
where $\omega = B^\top \Sigma^{-1} \Sigma_M(x_0, \cdot) - \Sigma_{M^+}(x_0, \cdot)$ and
$V = (C - B^\top \Sigma^{-1} B)^{-1}$. The effects of $\Delta x$, $\rho_i$ and
$\sigma_{i^+}^2$ are summarized as follows:
(1) Theorem 3.3 suggests that the quantity $\zeta_k^\top \zeta_k$ needs to be small enough to
guarantee that GESK can reduce the MSE. This condition requires the step size $\Delta x_j$ in
each dimension to be small.
(2) Regarding the correlation $\rho_i$, the preferred sign of the correlation actually depends
on the location of the prediction point $x_0$. A condition between $\Delta x$ and $x_i - x_0$
determines the preferred sign of $\rho_i$. A specific analytical form for the condition
depends on the type of correlation function; a larger $|\rho_i|$ is better in each favorable
case.
(3) If the correlation $\rho_i \approx 0$, a smaller $\sigma_{i^+}^2$ is preferable, since it
means there is less noise in the extrapolated responses. The same conclusion holds when the
correlation $\rho_i$ is positive. However, if the correlation $\rho_i < 0$, a smaller
$\sigma_{i^+}^2$ is not necessarily better, as there exists an optimal $\sigma_{i^+}^2$ that
reduces the MSE the most.

Analyzing the effects of step size on MSE in a general setting is more difficult, especially
in multidimensional problems. For example, the step size used in a multidimensional
problem may be different along different directions. Choosing good step sizes is crucial
for building the GESK models. In the next section, we propose two different approaches
for determining the optimal step size.

4. IMPLEMENTATIONS OF GESK
In this section, we focus on two important questions in the implementation of the GESK
model: choosing step sizes and choosing gradient estimators. We provide two different
techniques for determining step sizes and discuss their pros and cons. We also make
recommendations between infinitesimal perturbation analysis (IPA) and the likelihood
ratio/score function (LR/SF) method for gradient estimation.

4.1. Choosing Step Size


As discussed in the previous section, central to building a GESK model is the determi-
nation of appropriate step sizes. A good choice of step size is crucial to the performance
of the GESK model. Different step sizes, even with the same data set, may lead to
dramatic performance differences. The linear extrapolation used in GESK is only ap-
propriate in a small neighborhood of the design points, so the step size cannot be too
large. If the step size is too small, the additional points obtained from linear extrapo-
lations provide little information and may also lead to numerical instability.
Two natural choices for determining step sizes are maximum likelihood estimation
(MLE) and minimizing integrated mean squared error (IMSE). However, the standard
MLE approach is not suitable for determining step sizes in this context, because it
leads to step sizes as small as possible, which results in numerical stability issues
when building the GESK model. One unique characteristic of the GESK model is
that biases are introduced during extrapolations. Although the biases are unknown,
they should be taken into consideration during parameter estimations. To accomplish
this, penalty terms are introduced in MLE and IMSE. We formalize two approaches
for determining step sizes: penalized maximum likelihood estimation and minimizing
integrated mean squared error.
4.1.1. Penalized Maximum Likelihood Estimation. One natural choice for determining the step
size $\Delta x$ is to treat it as a new parameter in addition to the other parameters
$(\beta, \tau^2, \theta)$. Under Assumption 1 in Ankenman et al. [2010], we can write down the
likelihood function. However, as mentioned earlier, naive MLE is not suitable for choosing
step sizes in this case. Assuming the correlation function in Equation (5) is used, we propose
a penalized maximum likelihood method in which the penalized log-likelihood function takes the
following general form:
$$Q(\beta, \tau^2, \theta, \Delta x) = -\ln[(2\pi)^k] - \frac{1}{2}\ln\big|\Sigma_M^+ + \Sigma_\varepsilon^+\big| - \frac{1}{2}(\bar{Y}^+ - F^+\beta)^\top \big[\Sigma_M^+ + \Sigma_\varepsilon^+\big]^{-1}(\bar{Y}^+ - F^+\beta) - p_\lambda(\Delta x),$$


where $p_\lambda(\cdot)$ is a given nonnegative penalty function with a regularization
parameter $\lambda$. Common choices of penalty functions include the $L_1$ penalty, the $L_2$
penalty, and the smoothly clipped absolute deviation (SCAD).
In this article, the proposed penalty function is
$$p_\lambda(\Delta x) = \lambda\|\Delta x\|_{-2},$$
where $\Delta x = (\Delta x_1, \Delta x_2, \ldots, \Delta x_d)^\top$ and
$\|\Delta x\|_{-2} := \sum_{i=1}^d (\Delta x_i)^{-2}$. Therefore, the proposed penalized
log-likelihood function is
$$Q(\beta, \tau^2, \theta, \Delta x) = -\ln[(2\pi)^k] - \frac{1}{2}\ln\big|\Sigma_M^+ + \Sigma_\varepsilon^+\big| - \frac{1}{2}(\bar{Y}^+ - F^+\beta)^\top \big[\Sigma_M^+ + \Sigma_\varepsilon^+\big]^{-1}(\bar{Y}^+ - F^+\beta) - \lambda\|\Delta x\|_{-2}. \tag{17}$$
Penalized maximum likelihood estimation (PMLE) has been used in kriging for variable selection
[Chu et al. 2011] and to overcome flat likelihood function issues [Li and Sudjianto 2005]. One
key difference is that previous PMLE approaches try to obtain better estimates of
$(\beta, \tau^2, \theta)$, whereas we propose to use PMLE for choosing step sizes.
4.1.2. Minimizing Integrated MSE. Another view is to connect the problem of finding step sizes
with the design of experiments (DOE). Choosing step sizes is similar to adding new design
points in DOE. In the deterministic and stochastic kriging literature, many criteria have been
proposed to find the "best" experiment design, most of which are based on MSE. Using the
integrated mean squared error (IMSE) as the objective function, the problem can be formulated
as
$$\min_{\Delta x}\ \mathrm{IMSE} = \int_{x_0 \in \Omega} \mathrm{MSE}^+\big(\hat{Y}(x_0; \Delta x)\big)\, dx_0, \tag{18}$$
where $\Omega$ is the region of interest.


A lower IMSE suggests smaller deviation of the approximation over the region of interest. In
practice, a penalty term involving the step sizes is added to $\mathrm{MSE}^+$:
$$\mathrm{MSE}^+\big(\hat{Y}(x_0; \Delta x)\big) = \Sigma_M^+(x_0, x_0) - \Sigma_M^+(x_0, \cdot)^\top \big[\Sigma_M^+ + \Sigma_\varepsilon^+\big]^{-1} \Sigma_M^+(x_0, \cdot) + \lambda\|\Delta x\|^2, \tag{19}$$
where the Euclidean norm of $\Delta x$ is used. Adding a penalty term proportional to
$\|\Delta x\|^2$ in $\mathrm{MSE}^+$ follows the discussion in Section 3.3.
||x||2 in MSE+ follows the discussion in Section 3.3.
The PMLE approach estimates all the parameters (β, τ 2 , θ ) with x simultaneously.
However, the IMSE approach requires τ 2 and θ to be known in advance. In practice, a
two-stage strategy is proposed to address this issue:
(1) In Stage 1, use the original dataset {xi , Ȳ(xi )}i=1
k
to obtain MLEs for (β̂, τ̂ 2 , θ̂ ).
 0 ; x)) in Equation (18) with the estimated (β̂, τ̂ 2 , θ̂ ) and a
(2) Calculate MSE+ (Y(x
predetermined penalty constant λ.
(3) In Stage 2, minimize the IMSE in Equation (19) to find the optimal step size.
In our implementation, we use the optimization routine fmincon in Matlab.
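The two-stage procedure might be sketched as follows. This is an illustration only: `mse_plus`, which evaluates Equation (19) at a prediction point under the Stage-1 estimates, is a hypothetical helper, and scipy's minimize stands in for Matlab's fmincon:

```python
import numpy as np
from scipy.optimize import minimize

def imse_objective(dx, grid, mse_plus, lam):
    """Approximate the integral in Eq. (18) by averaging the penalized
    MSE of Eq. (19) over a grid of prediction points in the region Omega."""
    return np.mean([mse_plus(x0, dx, lam) for x0 in grid])

def choose_step_size(dx0, bounds, grid, mse_plus, lam):
    """Stage 2: minimize the (approximate) IMSE over the step size."""
    res = minimize(imse_objective, dx0, args=(grid, mse_plus, lam), bounds=bounds)
    return res.x
```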
4.1.3. Choosing the Regularization Parameter. The selection of the regularization parameter
$\lambda$ in both approaches remains to be addressed. We propose to use cross-validation (CV),
which is widely used in the statistics and machine learning communities, to choose the
regularization parameters. CV allows us to assess the performance of different regularization
parameters without running additional simulations. When a J-fold CV is applied for a given
regularization parameter $\lambda$, a corresponding score $\mathrm{CV}(\lambda)$ can be
calculated. Given the design points $\{x_i\}_{i=1}^k$ with the averaged simulation output
$\{\bar{Y}(x_i), \bar{G}(x_i)\}_{i=1}^k$, the CV score is calculated as follows:
(1) Split the dataset $D = \{x_i, \bar{Y}(x_i), \bar{G}(x_i)\}_{i=1}^k$ into $J$ subsets
$D_1, D_2, \ldots, D_J$. All design points located on the boundary are handled separately so
that extrapolated design points remain in the space of interest.
(2) For $j = 1, 2, \ldots, J$, choose all the design points in $D_j$ as prediction points and
build a GESK model using $D \setminus D_j$. Predict the response $\hat{Y}(x)$ for every
$x \in D_j$.
(3) Compute the CV score for a given parameter $\lambda$ as the sum of squared errors between
the prediction $\hat{Y}(x_i)$ and the averaged output $\bar{Y}(x_i)$ over the $D_j$,
$$\mathrm{CV}(\lambda) = \sum_{j=1}^{J} \sum_{(x_i, \bar{Y}_i) \in D_j} \big(\bar{Y}_i - \hat{Y}(x_i)\big)^2.$$

To start the CV, we need to choose a set of candidate regularization parameters
$\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_L\}$. We compute the CV score for each
$\lambda_l \in \Lambda$ and choose the best regularization parameter $\lambda^*$ as
$$\lambda^* = \arg\min_{\lambda_l \in \Lambda} \mathrm{CV}(\lambda_l).$$

If the candidate set of parameters $\Lambda$ contains a sufficiently large number of points
and computational time is not an issue, CV will choose the best regularization parameter from
the list of $\lambda$ values in $\Lambda$. In practice, one way is to choose the candidate
parameters linearly on a logarithmic scale, for example, $10^{-1}, 10^0, \ldots, 10^3$. In the
numerical examples in Section 5, grid search is adopted to find $\lambda^*$. If the candidate
set $\Lambda$ is large and computing the CV score for each $\lambda \in \Lambda$ is
computationally prohibitive, a random search method can be applied.
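In code, the grid search over regularization parameters could look like the sketch below, where `cv_score`, implementing the J-fold procedure above, is a hypothetical helper:

```python
import numpy as np

def select_lambda(cv_score, data, n_folds=10):
    """Grid search for lambda*: candidates spaced linearly on a log scale."""
    candidates = 10.0 ** np.arange(-1, 4)        # 1e-1, 1e0, ..., 1e3
    scores = [cv_score(lam, data, n_folds) for lam in candidates]
    return candidates[int(np.argmin(scores))]    # lambda* = argmin CV(lambda)
```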
The performance of the two proposed approaches, PMLE and IMSE, will be investi-
gated via numerical examples in Section 5 using different test problems. We summarize
their main features and differences as follows:
—Both methods need a regularization parameter to be calibrated, as unknown biases
are taken into consideration in both methods. We propose using CV methods to
determine regularization parameters.
—PMLE uses a penalty function to compensate for the small step size in the stan-
dard MLE approach. The PMLE approach takes biases into consideration but does
not guarantee good performance in MSE or IMSE. However, in high-dimensional
problems, maximizing a penalized likelihood is usually computationally faster than
integrating MSE.
—IMSE minimizes the IMSE over the design region, but in high-dimensional problems,
numerical integrations become computationally expensive, requiring Monte Carlo
methods with long computation time.
4.2. Choosing Gradient Estimators
In this article, we only consider direct gradient estimators. Specifically, we focus on the
infinitesimal perturbation analysis (IPA) and likelihood ratio/score function (LR/SF)
methods. Under mild conditions, both techniques are able to provide unbiased gradient
estimators, but which technique is preferable in building a GESK model, provided that
both methods are applicable?
Observations made by L’Ecuyer [1990] and Chen et al. [2013] suggest the following:
(i) at a given point, correlations between the responses and the corresponding gradient
estimates are higher when IPA is applied; (ii) IPA gradient estimators usually have smaller
variances than LR/SF estimators; and (iii) IPA gradient estimators have better
performance when applied in stochastic kriging with gradient estimators (SKG).
Discussions in Section 3.3 suggest that the GESK model prefers gradient estima-
tors that are highly correlated with response estimates and have smaller variances.
Therefore, under most settings, IPA gradient estimators are preferred. IPA gradient
estimators are employed in the M/M/1 example conducted in Section 5.

5. NUMERICAL EXAMPLES
In this section, several numerical experiments are conducted to illustrate the pro-
posed GESK model. Our goal in this section is three-fold: to demonstrate the effects
of different step sizes on the performance of the GESK model, to empirically compare
the effectiveness of the PMLE and IMSE approaches in determining step sizes, and
to examine the performance of the GESK model in different settings and compare it
with stochastic kriging [Ankenman et al. 2010] and stochastic kriging with gradient
estimators (SKG) [Chen et al. 2013]. Implementation of SKG and GESK are built upon
software for stochastic kriging downloaded from http://www.stochastickriging.net.
Across all experiments, we assume little information is known about the response surface and
choose constant trends for all models (i.e., $f(x)^\top \beta = \beta_0$). A Gaussian
correlation function $R_M(x, x') = \exp\{-\theta\|x - x'\|^2\}$ is used for all the
experiments, since it satisfies the conditions required by SKG. For the J-fold CV implemented
in this section, we choose $J = \min(k, 10)$, where $k$ is the number of design points.
We implemented both PMLE and IMSE approaches discussed in Section 4.1 to de-
termine step sizes, together with the CV method to choose regularization parameters.
The corresponding GESK models are named GESK-PMLE and GESK-IMSE. The measure
of performance we chose is the empirical IMSE (EIMSE), as used in van Beers and
Kleijnen [2008] and other kriging literature:

$$\mathrm{EIMSE} = \frac{1}{N}\sum_{i=1}^{N} \big(\hat{Y}(x_i) - Y(x_i)\big)^2, \tag{20}$$
where $N$ is the number of prediction points, $\hat{Y}(x_i)$ is the predicted response at
$x_i$, and $Y(x_i)$ is the true value at $x_i$.

5.1. Experiment on Step Sizes in GESK


We investigate the effects of using different step sizes in the GESK models using an
M/M/1 queue example [Staum 2009]. The M/M/1 queue has arrival rate 1 and service
rate x ∈ [1.1, 2]. We are interested in the steady-state expected waiting time y(x), which
has an analytical solution y(x) = 1/(x(x − 1)). In our simulation, each sample path was
initialized in steady state and simulated for 5,000 customers. The outputs collected
were the average waiting time and its derivative with respect to the service rate x.
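For reference, such output can be generated with a Lindley recursion and a standard IPA estimator. The sketch below is our own illustration, not the authors' code; unlike the experiments in this section, it starts from an empty queue rather than in steady state:

```python
import numpy as np

def mm1_ipa(x, n_customers=5000, rng=None):
    """Average waiting time in queue for an M/M/1 queue with arrival rate 1
    and service rate x (true value 1/(x(x-1))), plus its IPA derivative.
    Lindley recursion: D_{j+1} = max(D_j + S_j - A_{j+1}, 0), where
    S_j ~ Exp(rate x), so dS_j/dx = -S_j / x."""
    rng = rng or np.random.default_rng()
    D, dD = 0.0, 0.0          # waiting time of current customer, IPA derivative
    sum_D, sum_dD = 0.0, 0.0
    for _ in range(n_customers):
        sum_D += D
        sum_dD += dD
        S = rng.exponential(1.0 / x)     # service time of current customer
        A = rng.exponential(1.0)         # interarrival time of next customer
        nxt = D + S - A
        dD = dD - S / x if nxt > 0 else 0.0   # IPA propagation through max(., 0)
        D = max(nxt, 0.0)
    return sum_D / n_customers, sum_dD / n_customers
```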
Six different experiment designs, (6, 50), (6, 200), (6, 1000), (8, 200), (10, 200),
(20, 200), were used in the experiment, where the two elements in each pair represent
the number of design points and the number of replications at each design point,
respectively.
With equally spaced design points, three predetermined step sizes were chosen for
each design, which correspond to 1/10, 1/5 and 1/2 of the length of the subinterval.
GESK models built with these step sizes are labelled as GESK-1, GESK-2 and GESK-3, re-
spectively. We ran the experiments for 100 macroreplications. Within each macrorepli-
cation, we chose N = 1,000 to estimate the EIMSE in Equation (20). Table I shows the
sample mean and standard errors of EIMSE, and Figure 1 contains boxplots for the
EIMSE.


Table I. Averaged EIMSE from 100 Macroreplications for GESK Models under Six Designs (# of Design Points,
# of Reps) to Predict Expected Waiting Time in the M/M/1 Queue Example
Three fixed step sizes are compared with those determined by PMLE and IMSE. Standard errors
are shown in parentheses.
Design GESK-1 GESK-2 GESK-3 GESK-PMLE GESK-IMSE
(6, 50) 0.085 (0.0036) 0.167 (0.0035) 0.618 (0.0149) 0.042 (0.0020) 0.027 (0.0017)
(6, 200) 0.094 (0.0023) 0.181 (0.0019) 0.731 (0.0064) 0.034 (0.0011) 0.024 (0.0010)
(6, 1000) 0.092 (0.0011) 0.180 (0.0010) 0.747 (0.0037) 0.038 (0.0006) 0.021 (0.0004)
(8, 200) 0.006 (0.0004) 0.016 (0.0006) 0.194 (0.0023) 0.005 (0.0003) 0.002 (0.0002)
(10, 200) 0.006 (0.0005) 0.007 (0.0011) 0.017 (0.0012) 0.007 (0.0007) 0.004 (0.0003)
(20, 200) 0.002 (0.0008) 0.006 (0.0003) 0.048 (0.0020) 0.003 (0.0003) 0.001 (0.0001)

Fig. 1. Boxplots of EIMSE from 100 macroreplications for the GESK models under six designs (# of design
points, # of reps) to predict expected steady-state waiting time in the M/M/1 queue example.

Our findings are summarized as follows:

—Predetermined step sizes versus optimized step sizes. The two optimized step sizes
perform better than the predetermined step sizes, especially when the number of design points
is small. This is expected, as the choice of step sizes
should adapt to the experiment design and simulation output.
—PMLE versus IMSE. The performance of IMSE is better than that of PMLE under
most experiment designs, in terms of having smaller averaged EIMSE, smaller vari-
ances of EIMSE, and fewer outliers. Figure 2 shows boxplots for step
sizes determined by PMLE and IMSE under all six designs.


Fig. 2. Boxplots for step sizes determined by PMLE and IMSE based on 100 macroreplications under six
designs (# of design points, # of reps) to predict the expected steady-state waiting time in the M/M/1 queue
example.

Table II. Averaged EIMSE from 100 Macroreplications for SK, SKG, and GESK with Six Different Designs on
Estimating the Expected Steady-State Waiting Time in an M/M/1 Queue Problem
The design (6, 50) means six design points with 50 replications at each design point. Standard errors are shown
in parentheses.
Design SK SKG GESK-PMLE GESK-IMSE
(6, 50) 0.313 (0.0134) 0.031 (0.0036) 0.042 (0.0020) 0.027 (0.0017)
(6, 200) 0.324 (0.0062) 0.016 (0.0007) 0.034 (0.0011) 0.024 (0.0010)
(6, 1000) 0.328 (0.0027) 0.016 (0.0003) 0.038 (0.0006) 0.021 (0.0004)
(8, 200) 0.054 (0.0019) 0.002 (0.0003) 0.005 (0.0003) 0.002 (0.0002)
(10, 200) 0.009 (0.0004) 0.004 (0.0014) 0.007 (0.0007) 0.004 (0.0003)
(20, 200) 0.004 (0.0002) 0.004 (0.0004) 0.003 (0.0003) 0.001 (0.0001)

—Effect of number of design points. When the number of design points is small, for
example, k = 6, the improvements in EIMSE from the optimized step sizes are more
pronounced. When there are already enough design points, however, the improvements
are hardly noticeable. In addition, for both PMLE and IMSE, the relative step size
(ratio to the length of the subinterval) generally increases as the number of design
points increases.
—Effect of number of replications. As the number of replications increases, the
variances of EIMSE become smaller, as shown in Table I and Figure 1. However,
changes in the averaged EIMSE are not significant. Variances of the chosen step sizes
appear to decrease as well, as shown in Figure 2.

5.2. Comparisons among SK, SKG, and GESK


In this section, we compare the performance of three metamodels in three experiments:
stochastic kriging (SK), stochastic kriging with gradient estimators (SKG), and
gradient extrapolated stochastic kriging (GESK). The theoretical analysis of GESK in
Section 3 assumes that all parameters are known, whereas the comparison here is
empirical and all parameters must be estimated.
5.2.1. A Stochastic Simulation Example. We used the same M/M/1 queue example as
in Section 5.1, with the same six experiment designs. We ran the experiments for 100
macroreplications. Within each macroreplication, we chose N = 1,000 to estimate the
EIMSE in Equation (20). Results are shown in Table II and Figure 3. These 100
macroreplications used the same random numbers as those in Section 5.1, so the
numbers for the two GESK models in Table II and the corresponding boxplots in
Figure 3 are the same as those in Section 5.1.


Fig. 3. Boxplots of EIMSE from 100 macroreplications for SK, SKG, and GESK with six different designs
on estimating the expected steady-state waiting time in an M/M/1 queue problem, corresponding to results
in Table II.

Table III. Averaged EIMSE from 100 Macroreplications for SK, SKG, and GESK with Six Different Designs on
Y(x) = exp(−1.4x) cos(7πx/2) + ε
Standard errors are shown in parentheses.
Design SK SKG GESK-PMLE GESK-IMSE
(6, 50) 39.616 (0.0374) 2.044 (0.0106) 1.909 (0.0181) 1.828 (0.0111)
(6, 200) 39.586 (0.0192) 2.023 (0.0049) 1.830 (0.0091) 1.757 (0.0050)
(6, 1000) 39.581 (0.0084) 7.652 (0.8823) 1.829 (0.0033) 1.758 (0.0023)
(8, 200) 2.793 (0.0039) 0.069 (0.0009) 0.063 (0.0026) 0.068 (0.0025)
(10, 200) 0.949 (0.0026) 1.243 (0.4204) 0.178 (0.0008) 0.012 (0.0007)
(20, 200) 0.008 (0.0002) 0.001 (0.0001) 0.046 (0.0004) 0.004 (0.0004)

Our findings are summarized as follows:
—SK versus SKG versus GESK. Not surprisingly, SKG and GESK perform better
than SK, as incorporating gradient estimators provides more information about the
response surface. SKG performs better than GESK on most of the designs, especially
as compared with GESK-PMLE. Only under the design (20, 200) do both GESK-PMLE
and GESK-IMSE outperform SKG.


Fig. 4. Boxplots of EIMSE from 100 macroreplications for SK, SKG, and GESK with six different designs
on Y(x) = exp(−1.4x) cos(7πx/2) + ε, corresponding to results in Table III.

—Number of design points. Incorporating gradient estimators improves performance
considerably when design points are sparse. For example, both SKG and GESK show
greater improvement over SK when k = 6. As the number of design points increases,
the performance of most models improves.
—Number of replications. As the number of replications increases with a fixed
number of design points, the variance of EIMSE decreases for all three methods, as
shown in Figure 3(a). However, the averaged EIMSE does not improve significantly.
5.2.2. A Stylized Example with Added Noise. We consider a one-dimensional example from
Santner et al. [2003], where the true response surface is Y(x) = exp(−1.4x) cos(7πx/2)
with x ∈ [−2, 0]. The presence of multiple local extreme values on the response surface
makes building a good metamodel difficult. The simulation response output at x from
replication j is Y_j(x) = exp(−1.4x) cos(7πx/2) + ε_j(x), with ε_j(x) ∼ N(0, 1). The direct
gradient estimate at x from replication j is assumed to be of the form G_j(x) = Y′(x) + δ_j(x),
with δ_j(x) ∼ N(0, 25). We let δ_j(x) have a larger variance in order to empirically
investigate the performance of SKG and GESK when gradient estimates are noisier.


Fig. 5. Boxplots for step sizes determined by PMLE and IMSE based on 100 macroreplications in the
stylized example with added noise.

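A minimal sketch of this data-generating process, with the analytical derivative Y′(x)
written out via the product rule, is given below; the function name is an illustrative
choice.

import numpy as np

def stylized_data(x, n_reps, rng=None):
    # Noisy responses and direct gradient estimates for
    # Y(x) = exp(-1.4 x) cos(7*pi*x/2) on [-2, 0].
    rng = np.random.default_rng() if rng is None else rng
    c = 3.5 * np.pi
    y = np.exp(-1.4 * x) * np.cos(c * x)
    dy = np.exp(-1.4 * x) * (-1.4 * np.cos(c * x) - c * np.sin(c * x))
    responses = y + rng.normal(0.0, 1.0, size=n_reps)   # eps_j(x) ~ N(0, 1)
    gradients = dy + rng.normal(0.0, 5.0, size=n_reps)  # delta_j(x) ~ N(0, 25)
    return responses, gradients
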
We ran the experiments for 100 macroreplications. Within each macroreplication, we
chose N = 1,000 to estimate the EIMSE in Equation (20). Six different experiment
designs, (6, 50), (6, 200), (6, 1000), (8, 200), (10, 200), and (20, 200), were adopted, with
results shown in Table III and Figure 4. Notice that ln(EIMSE) values are shown in
Figure 4, as the EIMSE results from the three models differ substantially.
—SK versus SKG versus GESK. As shown in Figure 4, both the SKG and GESK
models are better than SK when the number of design points is limited. The GESK
models perform better than SKG when k = 6. A likely explanation is that the response
surface fluctuates considerably, and extrapolation allows the GESK models to explore
and approximate the response surface better than the other methods. SKG performs
better when there are enough design points, for example, k = 20. SKG experiences
numerical issues under designs (6, 1000) and (10, 200).
—Number of replications. When the number of replications increases, the variance
of EIMSE decreases in general, as shown in Table III and Figure 4(a), except for
SKG in design (6, 1000). However, the averaged EIMSE does not change much as the
number of replications increases, similar to the M/M/1 queue example.
—Number of design points. We fixed the number of replications at 200 and increased
the number of design points up to 20. Boxplots are shown in Figure 4(b). EIMSE
results for all models improve as the number of design points increases, with the
exception of SKG and GESK-PMLE under design (10, 200).
—Step sizes. Step sizes determined by the PMLE and IMSE approaches are shown in
Figure 5. The plots suggest relationships between experiment designs and step sizes:
(i) relative step size (ratio to the size of the subinterval) increases generally when
the number of design points increases, and (ii) the variability of step sizes decreases
as the number of replications increases.
—Remark. The performance of SKG under design (10, 200) in Table III does not
appear to match Figure 4(b). The reason is that several outliers lying outside the
range shown in Figure 4(b) are omitted from the plot.
These two numerical experiments show comparable performance for GESK and SKG.
However, it is hard to establish in general what problem characteristics determine
whether GESK outperforms SKG. For example, GESK performs better when there are
20 design points in Table II, whereas in Table III GESK performs better when there
are 6 design points.


Fig. 6. Boxplots of EIMSE from 100 macroreplications for SK, SKG, and GESK with two different
Latin-hypercube designs on a four-dimensional function with added noise.


5.2.3. A Multidimensional Example. Finally, we consider a stylized multidimensional
example to test the performance of GESK models, especially when gradient estimates
have much larger variances. We consider the sphere function defined by
Y(x) = x_1^2 + x_2^2 + 10 x_3^2 + 10 x_4^2, with experimental design space [−1, 1]^4.
The simulated response at x = (x_1, x_2, x_3, x_4) from replication j is
Y_j(x) = Y(x) + ε_j(x) with noise ε_j(x) ∼ N(0, 1). The gradient estimate with respect
to x_r at x from replication j is given by G^r_j(x) = ∂Y(x)/∂x_r + δ^r_j(x) with
δ^r_j(x) ∼ N(0, 25). The added noise terms are mutually independent.
We chose two different experiment designs, (20, 500) and (40, 500), which correspond
to 20-point and 40-point Latin-hypercube designs with 500 independent replications at
each design point, respectively. We collected simulation responses Y_j(x) and gradient
estimates G^r_j(x) for r = 1, 2, 3, 4 and j = 1, 2, . . . , 500 to build the metamodels.
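A sketch of this setup is given below; it assumes SciPy's qmc module for the
Latin-hypercube design, and the function name is illustrative.

import numpy as np
from scipy.stats import qmc

def sphere_design_data(n_design, n_reps, seed=None):
    # Latin-hypercube design on [-1, 1]^4 with noisy responses and noisy
    # gradient estimates for Y(x) = x1^2 + x2^2 + 10 x3^2 + 10 x4^2.
    rng = np.random.default_rng(seed)
    unit = qmc.LatinHypercube(d=4, seed=seed).random(n_design)
    X = qmc.scale(unit, [-1.0] * 4, [1.0] * 4)
    w = np.array([1.0, 1.0, 10.0, 10.0])
    y = (X ** 2) @ w                 # true responses, shape (n_design,)
    g = 2.0 * w * X                  # true gradients, shape (n_design, 4)
    Y = y[:, None] + rng.normal(0.0, 1.0, (n_design, n_reps))
    G = g[:, None, :] + rng.normal(0.0, 5.0, (n_design, n_reps, 4))
    return X, Y, G
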
We ran the experiments for 100 macroreplications. Within each macroreplication, we
chose N = 1,000 to estimate the EIMSE in Equation (20). Figure 6 contains boxplots for the
EIMSE calculated from the 100 macroreplications. Our findings are summarized as
follows:

—SKG and both GESK models perform better than SK. As the number of design points
increases, the performances of all models improve. Under design (20, 500), GESK-
IMSE seems to be the best; under design (40, 500), SKG is preferred because of its
low average and low variance in EIMSE.
—Between the two GESK models, PMLE scales better than IMSE for high-dimensional
problems. The IMSE approach requires multidimensional integration to determine
step sizes, which is expensive and depends on the accuracy of the integration
approximation in high dimensions.
—Step sizes determined by PMLE are generally much larger than those determined by
IMSE. Along each dimension, the step sizes determined by PMLE and IMSE behave
similarly: if the step size chosen by PMLE is relatively small in one dimension, so is
the step size chosen by IMSE. Step sizes chosen for a dimension with larger gradient
values are not necessarily smaller than the others.


6. CONCLUSIONS AND FUTURE RESEARCH


In this article, we investigated gradient extrapolated stochastic kriging (GESK), which
exploits the availability of direct gradient estimates in stochastic simulation settings.
The performance of GESK was analyzed theoretically and numerically, with a focus
on analyzing the approximation errors introduced by extrapolation. Because step sizes
are crucial to GESK models, two methods for determining step sizes were proposed and
tested in numerical examples, which indicated substantial gains in performance over
SK in all of the experiments. Between the proposed PMLE and IMSE approaches for
determining step sizes, IMSE demonstrated better performance in numerical experi-
ments, but it becomes computationally expensive for high-dimensional problems. The
numerical experiments showed comparable performance for GESK and SKG, except
when the number of design points is very small, where GESK shows some advantage.
From our analysis and experiments, we offer the following overall conclusions:
—GESK can be especially effective when the number of design points is relatively small
(e.g., in the setting where simulation is expensive).
—GESK offers additional flexibility in choosing the correlation function; in particular,
differentiability is not a constraint.
—For high-dimensional problems, GESK using the PMLE is recommended because its
computation does not increase with dimension, whereas the computational burden
increases exponentially for IMSE minimization and at least quadratically for SKG.
Our work points to several other directions for future research. The first is to focus
on the extrapolation strategy in GESK. In this article, we used linear extrapolation
with a common step size and assumed that only one additional point is extrapolated
from each design point; a sketch of this step follows this paragraph. More sophisticated
techniques could use local response surface information and adaptively determine the
extrapolation strategy. This is especially important in higher-dimensional problems
with multiple extreme values. Moreover, extending GESK to allow common random
numbers (CRN) is a worthy avenue for future research, as CRN can be beneficial for
some metamodels.
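To make the current strategy concrete, the following one-dimensional sketch shows
the extrapolation step only (the subsequent stochastic kriging fit is unchanged); the
function name is illustrative.

import numpy as np

def gesk_augment(X, y_bar, g_bar, h):
    # Linear extrapolation step of GESK (1-D sketch): each design point
    # x_i spawns one additional point x_i + h, whose response is
    # extrapolated as y_i + h * g_i from the averaged simulation
    # response y_i and averaged direct gradient estimate g_i.
    X_aug = np.concatenate([X, X + h])
    y_aug = np.concatenate([y_bar, y_bar + h * g_bar])
    return X_aug, y_aug  # augmented data for an ordinary SK fit
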
Further comparison of the SKG and GESK models is also warranted. In particular, it
would be valuable to characterize when one model is likely to be more effective. A
theoretical analysis comparing the properties of the two models could lead to useful
guidelines for practitioners.

REFERENCES
J. J. Alonso and H. S. Chung. 2002. Using gradients to construct cokriging approximation models for
high-dimensional design optimization problems. In 40th AIAA Aerospace Sciences Meeting and Exhibit.
AIAA 2002-0317.
B. E. Ankenman, B. L. Nelson, and J. Staum. 2010. Stochastic kriging for simulation metamodeling.
Operations Research 58, 2 (March 2010), 371–382.
R. R. Barton. 2009. Simulation optimization using metamodels. In Proceedings of the 2009 Winter Simulation
Conference, M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls (Eds.). Institute of
Electrical and Electronics Engineers, Piscataway, NJ, 230–238.
R. R. Barton and M. Meckesheimer. 2006. Metamodel-based simulation optimization. In Handbooks in
Operations Research and Management Science: Simulation, S. G. Henderson and B. L. Nelson (Eds.).
Elsevier, 535–574.
X. Chen, B. E. Ankenman, and B. L. Nelson. 2012. The effects of common random numbers on stochastic
kriging metamodels. ACM Transactions on Modeling and Computer Simulation 22, 2 (March 2012),
Article 7, 20 pages.
X. Chen, B. E. Ankenman, and B. L. Nelson. 2013. Enhancing stochastic kriging metamodels with gradient
estimators. Operations Research 61, 2 (2013), 512–528.
T. Chu, J. Zhu, and H. Wang. 2011. Penalized maximum likelihood estimation and variable selection in
geostatistics. Annals of Statistics 39, 5 (2011), 2607–2625.


M. C. Fu. 2008. What you should know about simulation and derivatives. Naval Research Logistics 55, 8
(2008), 723–736.
M. C. Fu and H. Qu. 2014. Regression models augmented with direct stochastic gradient estimators.
INFORMS Journal on Computing 26, 3 (2014), 484–499.
P. Glasserman. 1991. Gradient Estimation via Perturbation Analysis. Kluwer Academic, Boston, MA.
Y. C. Ho, L. Shi, L. Dai, and W. Gong. 1992. Optimizing discrete event dynamic systems via the gradient
surface method. Discrete Event Dynamic Systems: Theory and Applications 2 (Jan. 1992), 99–120.
J. P. C. Kleijnen and W. C. M. van Beers. 2005. Robustness of kriging when interpolating in random simulation
with heterogeneous variances: Some experiments. European Journal of Operational Research 165, 3
(2005), 826–834.
P. L’Ecuyer. 1990. A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science
36 (1990), 1364–1383.
R. Li and A. Sudjianto. 2005. Analysis of computer experiments using penalized likelihood in Gaussian
kriging models. Technometrics 47, 2 (May 2005), 111–121.
W. Liu. 2003. Development of Gradient-Enhanced Kriging Approximations for Multidisciplinary Design
Optimization. Ph.D. Dissertation. University of Notre Dame.
H. Qu and M. C. Fu. 2012. On direct gradient enhanced simulation metamodels. In Proceedings of the 2012
Winter Simulation Conference. Institute of Electrical and Electronics Engineers, Article 43, 12 pages.
R. Y. Rubinstein and A. Shapiro. 1993. Discrete Event Systems: Sensitivity Analysis and Stochastic
Optimization by the Score Function Method. John Wiley & Sons.
T. J. Santner, B. Williams, and W. Notz. 2003. The Design and Analysis of Computer Experiments. Springer-
Verlag.
J. Staum. 2009. Better simulation metamodeling: The why, what, and how of stochastic kriging. In
Proceedings of the 2009 Winter Simulation Conference, M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin,
and R. G. Ingalls (Eds.). Institute of Electrical and Electronics Engineers, Piscataway, NJ, 119–133.
W. C. M. van Beers and J. P. C. Kleijnen. 2008. Customized sequential designs for random simulation
experiments: Kriging metamodeling and bootstrapping. European Journal of Operational Research 186,
3 (2008), 1099–1113.
W. Xie, B. L. Nelson, and J. Staum. 2010. The influence of correlation functions on stochastic kriging
metamodels. In Proceedings of the 2010 Winter Simulation Conference. Institute of Electrical and
Electronics Engineers, Piscataway, NJ, 1067–1078.
F. Zhang and Q. Zhang. 2006. Eigenvalue inequalities for matrix product. IEEE Transactions on Automatic
Control 51, 9 (Sept. 2006), 1506–1509.

Received February 2013; revised July 2014; accepted July 2014
