Computer Experiments
© 1996 Elsevier Science B.V. All rights reserved.
J. R. Koehler and A. B. Owen
1. Introduction
Some important quantities describing a computer model are the number of inputs
p, the number of outputs q and the speed with which f can be computed. These vary
There are many different but related goals that arise in computer experiments. The
problem described in the previous section is that of finding a good value for X accord-
ing to some criterion on Y. Here are some other goals in computer experimentation:
finding a simple approximation f̂ that is accurate enough over a region A of X values,
estimating the size of the error f̂(X_0) − f(X_0) for some X_0 ∈ A, estimating ∫_A f dX,
sensitivity analysis of Y with respect to changes in X, finding which X^j are most
important for each response Y^k, finding which competing goals for Y conflict the
most, visualizing the function f and uncovering bugs in the implementation of f.
2.1. Optimization
Many engineering design problems take the form of optimizing Y^1 over allowable
values of X. The problem may be to find the fastest chip, or the least expensive soda
can. There is often, perhaps usually, some additional constraint on another response Y^2.
The chip should be stable enough, and the can should be able to withstand a specified
internal pressure.
Standard optimization methods, such as quasi-Newton or conjugate gradients (see
for example Gill et al., 1981) can be unsatisfactory for computer experiments. These
methods usually require first and possibly second derivatives of f, and these may be
difficult to obtain or expensive to compute. The standard methods also depend strongly on
having good starting values. Computer experimentation as described below is useful
in the early stages of optimization where one is searching for a suitable starting value.
It is also useful when searching for several widely separated regions of the predictor
space that might all have good Y values. Given a good starting value, the standard
methods will be superior if one needs to locate the optimum precisely.
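The two-stage strategy just described can be sketched in a few lines. The objective below is a hypothetical stand-in for an expensive simulator, and the sample size and the use of scipy's L-BFGS-B optimizer are arbitrary choices made here, not the chapter's method.

```python
# A minimal sketch, under stated assumptions: a crude space-filling search supplies
# starting values, after which a standard quasi-Newton optimizer locates the optimum.
import numpy as np
from scipy.optimize import minimize

def f(x):
    # hypothetical cheap test function on [0, 1]^p standing in for a simulator
    return np.sum((x - 0.3) ** 2) + 0.1 * np.cos(10 * x).sum()

p = 5
rng = np.random.default_rng(0)

# Stage 1: evaluate f over a crude random design and keep a few promising,
# possibly widely separated, starting values.
X = rng.random((200, p))
y = np.array([f(x) for x in X])
starts = X[np.argsort(y)[:3]]

# Stage 2: refine each start with a standard quasi-Newton method.
best = min((minimize(f, x0, method="L-BFGS-B", bounds=[(0, 1)] * p) for x0 in starts),
           key=lambda r: r.fun)
print(best.x, best.fun)
```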
2.2. Visualization
As Diaconis (1988) points out, being able to compute a function f at any given value
X does not necessarily imply that one "understands" the function. One might not
know whether the function is continuous or bounded or unimodal, where its optimum
is or whether it has asymptotes.
Computer experimentation can serve as a primitive way to visualize functions. One
evaluates f at a well chosen set of points x_1, ..., x_n obtaining responses y_1, ..., y_n.
Then data visualization methods may be applied to the p + q dimensional points
(x_i, y_i), i = 1, ..., n. Plotting the responses versus the input variables (there are
pq such plots) identifies strong dependencies, and plotting residuals from a fit can
show weaker dependencies. Selecting the points with desirable values of Y and then
producing histograms and plots of the corresponding X values can be used to identify
the most promising subregion of X values. Sharifzadeh et al. (1989) took this approach
to find that increasing a certain implant dose helped to make two different threshold
voltages near their common targets and nearly equal (as they should have been).
Similar exploration can identify which input combinations are likely to crash the
simulator.
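A minimal sketch of this "function into data" workflow follows, assuming a hypothetical three-input simulator and a plain random design; the plotting choices are illustrative only.

```python
# A sketch, not the chapter's software: evaluate f at a set of points, plot each
# response against each input, and examine the X values attached to the best runs.
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return x[0] ** 2 + np.sin(3 * x[1]) + 0.01 * x[2]   # hypothetical simulator, q = 1

p, n = 3, 100
rng = np.random.default_rng(1)
X = rng.random((n, p))
y = np.array([f(x) for x in X])

# p * q scatter plots of the response against each input variable.
fig, axes = plt.subplots(1, p, figsize=(9, 3))
for j, ax in enumerate(axes):
    ax.scatter(X[:, j], y, s=10)
    ax.set_xlabel(f"X{j + 1}")
axes[0].set_ylabel("Y")

# Histograms of the inputs for the 10% of runs with the smallest responses,
# to suggest the most promising subregion of X values.
best = X[np.argsort(y)[: n // 10]]
fig2, axes2 = plt.subplots(1, p, figsize=(9, 3))
for j, ax in enumerate(axes2):
    ax.hist(best[:, j], bins=10, range=(0, 1))
    ax.set_xlabel(f"X{j + 1}")
plt.show()
```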
Roosen (1995) has used computer experiment designs for the purpose of visualizing
functions fit to data.
2.3. Approximation
The original program f may be exceedingly expensive to evaluate. It may however be
possible to approximate f by some very simple function f̂, the approximation holding
adequately in a region of interest, though not necessarily over the whole domain of f.
If the function f̂ is fast to evaluate, as for instance a polynomial, neural network or
a MARS model (see Friedman, 1991), then it may be feasible to make millions of f̂
evaluations. This makes possible brute force approximations for the other problems.
For example, optimization could be approached by finding the best value of f̂(x) over
a million random runs x.
Approximation by computer experiments involves choosing where to gather
(xi, f(xi)) pairs, how to construct an approximation based on them and how to assess
the accuracy of this approximation.
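A rough sketch of this idea follows, assuming a hypothetical two-input simulator and a quadratic response surface for f̂; neither choice is prescribed by the chapter.

```python
# A sketch under stated assumptions: fit a cheap quadratic surrogate f_hat to a
# modest number of expensive runs, then evaluate it at a million random inputs.
import numpy as np

def f(x):                                   # hypothetical expensive simulator
    return np.sin(3 * x[0]) + (x[1] - 0.4) ** 2

rng = np.random.default_rng(2)
p, n = 2, 50
X = rng.random((n, p))
y = np.array([f(x) for x in X])

def features(X):
    # intercept, linear, and quadratic/cross terms
    cols = [np.ones(len(X))] + [X[:, j] for j in range(p)]
    cols += [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
    return np.column_stack(cols)

beta, *_ = np.linalg.lstsq(features(X), y, rcond=None)
f_hat = lambda X: features(X) @ beta        # fast to evaluate in bulk

# Brute-force optimization of the surrogate over a million random runs.
Xbig = rng.random((1_000_000, p))
i = np.argmin(f_hat(Xbig))
print("approximate minimizer:", Xbig[i])
```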
2.4. Integration
Suppose that X* is the target value of the input vector, but in the system being
modeled the actual value of X will be random with a distribution dF that hopefully
is concentrated near X*. Then one is naturally interested in ∫ f(X) dF, the average
value of Y over this distribution. Similarly the variance of Y and the probability
that Y exceeds some threshold can be expressed in terms of integrals. This sort of
calculation is of interest to researchers studying nuclear safety. McKay (1995) surveys
this literature.
Integration and optimization goals can appear together in the same problem. In
robust design problems (Phadke, 1988), one might seek the value X0 that minimizes
the variance of Y as X varies randomly in a neighborhood of X0.
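A short Monte Carlo sketch of these integrals is given below; the simulator, the target X*, the spread of F, and the threshold are all invented for illustration.

```python
# A sketch, not from the chapter: when the actual input X is random around a target
# X*, the mean of Y, its variance, and exceedance probabilities are integrals against
# dF that can be estimated by sampling from F.
import numpy as np

def f(x):                                   # hypothetical simulator
    return 10 * x[0] + x[1] ** 2

rng = np.random.default_rng(3)
x_star = np.array([0.5, 0.5])
n = 10_000
X = rng.normal(loc=x_star, scale=0.05, size=(n, 2))   # F concentrated near X*
Y = np.array([f(x) for x in X])

print("E[Y]  ~", Y.mean())
print("Var Y ~", Y.var(ddof=1))
print("P(Y > 5.3) ~", (Y > 5.3).mean())
```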
There are two main statistical approaches to computer experiments, one based on
Bayesian statistics and a frequentist one based on sampling techniques. It seems to
be essential to introduce randomness in one or other of these ways, especially for the
problem of gauging how much an estimate f̂(X_0) might differ from the true value
f(X_0).
In the Bayesian framework, surveyed below in Sections 4 and 5, f is a realization
of a random process. One sets a prior distribution on the space of all functions from
[0, 1]^p to R^q. Given the values Y_i = f(x_i), i = 1, ..., n, one forms a posterior
distribution on f or at least on certain aspects of it such as f(x_0). This approach is
extremely elegant. The prior distribution is usually taken to be Gaussian so that any
finite list of function values has a multivariate normal distribution. Then the posterior
distribution, given observed function values is also multivariate normal. The posterior
mean interpolates the observed values and the posterior variance may be used to
give 95% posterior probability intervals. The method extends naturally to incorporate
measurement and prediction of derivatives, partial derivatives and definite integrals
of f .
The Bayesian framework is well developed as evidenced by all the work cited
below in Sections 4 and 5. But, as is common with Bayesian methods there may
be difficulty in finding an appropriate prior distribution. The simulator output might
not have as many derivatives as the underlying physical reality, and assuming too
much smoothness for the function can lead to Gibbs-effect overshoots. A numerical
difficulty also arises: the Bayesian approach requires solving n linear equations in
n unknowns when there are n data points. The effort involved grows as n³ while
the effort in computing f(x_1), ..., f(x_n) grows proportionally to n. Inevitably this
limits the size of problems that can be addressed. For example, suppose that one
spends an hour computing f(x_1), ..., f(x_n) and then one minute solving the linear
equations. If one then finds it necessary to run 24 times as many function evaluations,
the time to compute the f(xi) grows from an hour to a day, while the time to solve
the linear equations grows from one minute to over nine and a half days.
These difficulties with the Bayesian approach motivate a search for an alternative.
The frequentist approach, surveyed in Sections 6 and 7, introduces randomness by taking
function values at points x_1, ..., x_n that are partially determined by pseudo-random number
generators. Then this randomness in the x_i is propagated through to randomness in
f̂(x_0). This approach allows one to consider f to be deterministic, and in particular
to avoid having to specify a distribution for f . The material given there expands on
a proposal of Owen (1992a). There is still much more to be done.
Y(x) = Σ_{j=1}^k β_j h_j(x) + Z(x),   (2)

where the h_j's are known fixed functions, the β_j's are unknown coefficients to be
estimated and Z(x) is a stationary Gaussian random function with E[Z(x)] = 0 and
covariance

Cov[Z(x_1), Z(x_2)] = σ² R(x_1, x_2).   (3)
For any point x ∈ S, the simulator output Y(x) at that point has a Gaussian
distribution with mean Σ_j β_j h_j(x) and variance σ². The linear component models the
drift in the response, while the systematic lack-of-fit (or bias) is modeled by the second
component. The smoothness and other properties of Z(.) are controlled by R(.).
Let the design D = {x_i, i = 1, ..., n} ⊂ S yield responses y_D' = (y(x_1), ..., y(x_n))
and consider a linear predictor

Ŷ(x_0) = λ'(x_0) y_D

of an unobserved point x_0. The Kriging approach of Matheron (1963) treats Ŷ(x_0) as
a random variable by substituting Y_D for y_D, where

Y_D' = (Y(x_1), ..., Y(x_n)).

The best linear unbiased predictor (BLUP) finds the λ(x_0) that minimizes the mean
square error of λ'Y_D subject to the unbiasedness constraint E[λ'Y_D] = E[Y(x_0)]. The
resulting predictor is

Ŷ(x_0) = h'(x_0)β̂ + v'_{x_0} V_D^{-1} (Y_D − Hβ̂),   (4)

where h(x_0) = (h_1(x_0), ..., h_k(x_0))', H is the n × k matrix with rows h'(x_i), V_D is
the covariance matrix of Y_D, v_{x_0} is the vector of covariances between Y(x_0) and Y_D,
and

β̂ = [H'V_D^{-1}H]^{-1} H'V_D^{-1} Y_D

is the generalized least squares estimate of β. The mean square error of Ŷ(x_0) is

MSE[Ŷ(x_0)] = σ² − (h'(x_0), v'_{x_0}) ( 0    H'  )^{-1} ( h(x_0)  )
                                        ( H    V_D )      ( v_{x_0} ).

The first component of equation (4) is the generalized least squares prediction
at point x_0 given the design covariance matrix V_D, while the second component
Fig. 1. A prediction example with n = 3.
"pulls" the generalized least squares response surface through the observed data points.
The elasticity of the response surface "pull" is solely determined by the correlation
function R(.). The predictions at the design points are exactly the corresponding
observations, and the mean square error equals zero. As a prediction point x0 moves
away from all of the design points, the second component of equation (4) goes to
zero, yielding the generalized least squares prediction, while the mean square error at
that point goes to σ² + h'(x_0)[H'V_D^{-1}H]^{-1}h(x_0). In fact, these results are true in
the wide sense if the Gaussian assumption is removed.
As an example, consider an experiment where n = 3, p = 1, σ² = .05, R(d) =
exp(−20d²) and D = {.3, .5, .8}. The response of the unknown function at the
design is y_D' = (.7, .3, .5). The dashed line of Figure 1 is the generalized least
squares prediction surface for h(·) ≡ 1, where β̂ = .524. The effect of the second
component of equation (4) is to pull the dashed line through the observed design
points as shown by the solid line. The shape of the surface or the amount of elasticity
of the "pull" is determined by the vector v'_x V_D^{-1} as a function of x and therefore
is completely determined by R(·). The dotted lines are ±2√MSE[Ŷ(x)], or 95%
pointwise confidence envelopes around the prediction surface. The interpretation of
these pointwise confidence envelopes is that for any point x0, if the unknown function
is truly generated by a random function with constant mean and correlation function
R(d) = exp(−20d²), then approximately 95% of the sample paths that go through
the observed design points would be between these dotted lines at x0. The predictions
and confidence intervals can be very different for different σ² and R(·). The effect of
different correlation functions is discussed in Section 4.3. Clearly, the true function is
not "generated" stochastically. The above model is used for prediction and to quantify
the uncertainty of the prediction. This naturally leads to a Bayesian interpretation of
this methodology.
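The example above can be reproduced with a few lines of linear algebra. The sketch below is not the chapter's software; it simply evaluates the BLUP of equation (4) and its mean square error for the stated n = 3 design.

```python
# A minimal sketch of the prediction example: n = 3, p = 1, sigma^2 = .05,
# R(d) = exp(-20 d^2), D = {.3, .5, .8}, y_D = (.7, .3, .5), and h(.) = 1.
import numpy as np

x_design = np.array([0.3, 0.5, 0.8])
y = np.array([0.7, 0.3, 0.5])
sigma2 = 0.05
R = lambda d: np.exp(-20.0 * d ** 2)

V = sigma2 * R(x_design[:, None] - x_design[None, :])        # V_D
H = np.ones((3, 1))                                          # h(x) = 1
Vinv = np.linalg.inv(V)
beta = float(np.linalg.solve(H.T @ Vinv @ H, H.T @ Vinv @ y))  # GLS estimate, about .524

def predict(x0):
    v = sigma2 * R(x0 - x_design)                 # covariances with the design
    yhat = beta + v @ Vinv @ (y - H[:, 0] * beta)  # equation (4)
    A = np.block([[np.zeros((1, 1)), H.T], [H, V]])
    b = np.concatenate(([1.0], v))
    mse = sigma2 - b @ np.linalg.solve(A, b)       # mean square error of the BLUP
    return yhat, mse

for x0 in (0.3, 0.55, 0.95):
    yhat, mse = predict(x0)
    print(f"x0={x0:.2f}  prediction={yhat:.3f}  MSE={mse:.4f}")
```

At the design points the prediction reproduces the observed response and the mean square error is zero, exactly as described in the text.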
The mean and variance of the posterior distribution at each input point are then used as the
predictor and measure of error, respectively, at that point. In general, the Kriging and Bayesian
approaches will lead to different estimators. However, if the prior distribution of Z(·)
is Gaussian and if the prior distribution of the β_j's is diffuse, then the two approaches
yield identical estimators.
As an example, consider the case where the prior distribution of the vector of β's
is

β ~ N_k(b, τ²Σ)

and the prior distribution of Z(·) is a stationary Gaussian distribution with expected
value zero and covariance function given by equation (3). After the simulator function
has been evaluated at the experimental design, the posterior distribution of β is

β | Y_D ~ N_k(β̃, Σ̃),

where

β̃ = Σ̃ [ H'V_D^{-1}Y_D + τ^{-2}Σ^{-1}b ]

and

Σ̃ = [ H'V_D^{-1}H + τ^{-2}Σ^{-1} ]^{-1},

and the posterior distribution of Y(x_0) is Gaussian with mean

h'(x_0)β̃ + v'_{x_0}V_D^{-1}(Y_D − Hβ̃)

and variance

σ² − v'_{x_0}V_D^{-1}v_{x_0} + c'_{x_0} Σ̃ c_{x_0},

where

c'_{x_0} = h'(x_0) − v'_{x_0}V_D^{-1}H.
Hence the posterior distribution is still Gaussian but it is no longer stationary. Now if
τ² → ∞ then

β̃ → β̂,

Σ̃ → [H'V_D^{-1}H]^{-1}

and the posterior variance of Y(x_0) converges to

σ² − v'_{x_0}V_D^{-1}v_{x_0} + h'[H'V_D^{-1}H]^{-1}h
    − 2h'[H'V_D^{-1}H]^{-1}H'V_D^{-1}v_{x_0}
    + v'_{x_0}V_D^{-1}H[H'V_D^{-1}H]^{-1}H'V_D^{-1}v_{x_0}

= σ² − (h'(x_0), v'_{x_0}) ( 0    H'  )^{-1} ( h(x_0)  )
                           ( H    V_D )      ( v_{x_0} ),
which is the same variance as the BLUP in the Kriging approach. Therefore, if Z(.)
has a Gaussian prior distribution and if the/3's have a diffuse prior, the Bayesian and
the Kriging approaches yield identical estimators.
Currin et al. (1991) provide a more in depth discussion of the Bayesian approach
for the model with a fixed mean (h ≡ 1). O'Hagan (1989) discusses Bayes Linear
Estimators (BLE) and their connection to equations (2) and (4). The Bayesian ap-
proach, which uses random functions as a method of quantifying the uncertainty of
the unknown simulator function Y(·), is more subjective than the Kriging or frequen-
tist approach. While both approaches require prior knowledge or an objective method
of estimating the covariance function, the Bayesian approach additionally requires
knowledge of parameters of the prior distribution of β (b and Σ). For this reason,
the Kriging results and Bayesian approach with diffuse prior distributions and the
Gaussian assumption are widely used in computer experiments.
R(x_1, x_2) = R(x_1 − x_2)
so that the process Z(.) is stationary. Some types of nonstationary behavior in the mean
function of Y(.) can be modeled by the linear term in equation (2). A further restriction
makes the correlation function depend only on the magnitude of the distance. A product form

R(d) = ∏_{j=1}^p R_j(d_j)

is often used for mathematical convenience. That is, R(·) is a product of univariate
correlation functions and, hence, only univariate correlation functions are of interest.
The product correlation function has been used for prediction in spatial settings
(Ylvisaker, 1975; Currin et al., 1991; Sacks et al., 1989a, b; Welch et al., 1990, 1992).
Several choices for the factors in the product correlation function are outlined below.
Fig. 3. Realizations for the cubic correlation function (ρ, γ) = (a) (.15, .03), (b) (.45, .20), (c) (.70, .50),
and (d) (.95, .90).
4.3.2. Cubic
The (univariate) cubic correlation family is parameterized by ρ ∈ [0, 1] and γ ∈ [0, 1]
and is given for d ∈ [0, 1] by

R(d) = 1 − (3(1 − ρ)/(2 + γ)) d² + ((1 − ρ)(1 − γ)/(2 + γ)) |d|³,

with

ρ > (5γ² + 8γ − 1)/(γ² + 4γ + 7)

to ensure that the function is positive definite (see Mitchell et al., 1990). Here
ρ = corr(Y(0), Y(1)) is the correlation between endpoint observations and γ =
corr(Y'(0), Y'(1)) is the correlation between endpoints of the derivative process. The
cubic correlation function implies that the derivative process has a linear correlation
function with parameter γ.
A prediction model in one dimension for this family is a cubic spline interpolator.
In two dimensions, when the correlation is a product of univariate cubic correlation
functions the predictions are piece-wise cubic in each variable.
Processes generated with the cubic correlation function are once mean square dif-
ferentiable. Figure 3 shows several realizations of processes with the cubic correlation
function and parameter pairs (.15, .03), (.45, .20), (.70, .50), (.95, .9). Notice that
the realizations are quite smooth and almost linear for parameter pair (.95, .90).
4.3.3. Exponential
The (univariate) exponential correlation family is parameterized by θ ∈ (0, ∞) and is
given by

R(d) = exp(−θ|d|)

for d ∈ [0, 1]. Processes with the exponential correlation function are Ornstein-
Uhlenbeck processes (Parzen, 1962) and are not mean square differentiable.
Figure 4 presents several realizations of one dimensional processes with the expo-
nential correlation function and θ = 0.5, 2.0, 5.0, 20. Figure 4(a) is for θ = 0.5
and these realizations have very small global trends but much local variation. Figure
4(d) is for θ = 20, and is very jumpy. Mitchell et al. (1990) also found necessary and
sufficient conditions on the correlation function so that the derivative process has an
exponential correlation function. These are called smoothed exponential correlation
functions.
4.3.4. Gaussian
Sacks et al. (1989b) generalized the exponential correlation function by using

R(d) = exp(−θ|d|^q)
Fig. 4. Realizations for the exponential correlation function with θ = (a) 0.5, (b) 2.0, (c) 5.0, and (d) 20.0.
where 0 < q ≤ 2 and θ ∈ (0, ∞). Taking q = 1 recovers the exponential correlation
function. As q increases, this correlation function produces smoother realizations.
However, as long as q < 2, these processes are not mean square differentiable.
The Gaussian correlation function is the case q = 2 and the associated processes are
infinitely mean square differentiable. In the Bayesian interpretation, this correlation
function puts all of the prior mass on analytic functions (Currin et al., 1991). This
correlation function is appropriate when the simulator output is known to be ana-
lytic. Figure 5 displays several realizations for various 0 for the Gaussian correlation
function. These realizations are very smooth, even when 0 = 50.
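Realizations such as those in Figures 2-6 can be simulated by factoring the correlation matrix on a fine grid. The sketch below assumes the Gaussian correlation function and adds a small jitter term for numerical stability; both choices are implementation details, not part of the chapter.

```python
# A minimal sketch: build the correlation matrix on a grid and multiply its
# Cholesky factor by standard normal draws to obtain realizations of Z(.).
import numpy as np

def realizations(theta, n_paths=3, n_grid=200, seed=0):
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_grid)
    d = x[:, None] - x[None, :]
    Rm = np.exp(-theta * d ** 2)                           # Gaussian correlation
    L = np.linalg.cholesky(Rm + 1e-8 * np.eye(n_grid))     # jitter for stability
    return x, L @ rng.standard_normal((n_grid, n_paths))

x, Z = realizations(theta=50.0)
print(Z.shape)          # each column is one realization of Z(.) on the grid
```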
4.3.5. Matérn
All of the univariate correlation functions described above are either zero, once or
infinitely many times mean square differentiable. Stein (1989) recommends a more
flexible family of correlation functions (Matérn, 1947; Yaglom, 1987). The Matérn
correlation function is parameterized by θ ∈ (0, ∞) and ν ∈ (−1, ∞) and is given by
Fig. 5. Realizations for the Gaussian correlation function with θ = (a) 0.5, (b) 2.0, (c) 10.0, and (d) 50.0.
where K_ν(·) is a modified Bessel function of order ν. The associated process will be
m times differentiable if and only if ν > m. Hence, the amount of differentiability
can be controlled by ν while θ controls the range of the correlations. This correlation
family is more flexible than the other correlation families described above due to the
control of the differentiability of the predictive surface.
Figure 6 displays several realizations of processes with the Matérn correlation func-
tion with ν = 2.5 and various values of θ. For small values of θ, the realizations are
very smooth and flat while the realizations are erratic for large values of θ.
4.3.6. Summary
The correlation functions described above have been applied in computer experiments.
Software for predicting with them is described in Koehler (1990). The cubic corre-
lation function yields predictions that are cubic splines. The exponential predictions
are non-differentiable while the Gaussian predictions are infinitely differentiable. The
Matérn correlation function is the most flexible since the degree of differentiability
and the smoothness of the predictions can be controlled. In general, enough prior
information to fix the parameters of a particular correlation family and σ² will not be
available. A pure Bayesian approach would place a prior distribution on the parame-
ters of a family and use the posterior distribution of the parameter in the estimation
Fig. 6. Realizations for the Matérn correlation function with ν = 2.5 and θ = (a) 2.0, (b) 4.0, (c) 10.0, and
(d) 25.0.
process. Alternatively, an empirical Bayes approach which uses the data to estimate
the parameters of a correlation family and σ² is often used. The maximum likelihood
estimation procedure will be presented and discussed in the next section.
The previous subsections presented the Kriging model and families of
correlation functions. The families of correlations are all parameterized by one or two
parameters which control the range of correlation and the smoothness of the corre-
sponding processes. This model assumes that σ², the family and parameters of R(·)
are known. In general, these values are not completely known a priori. The appro-
priate correlation family might be known from the simulator designer's experience
regarding the smoothness of the function. Also, ranges for σ² and the parameters
of R(·) might be known if a similar computer experiment has been performed. A
pure Bayesian approach is to quantify this knowledge into a prior distribution on σ²
and R(·). How to distribute a non-informative prior across the different correlation
families and within each family is unclear. Furthermore, the calculation of the posterior
distribution is generally intractable.
An alternative and more objective method of estimating these parameters is an
empirical Bayes approach which finds the parameters that are most consistent with
the observed data. This section presents the maximum likelihood method for estimating
β, σ² and the parameters of a fixed correlation family when the underlying distribution
of Z(·) is Gaussian. The best parameter set from each correlation family can be
evaluated to find the overall "best" σ² and R(·).
Consider the case where the distribution of Z(·) is Gaussian. Then the distribution
of the response at the n design points, Y_D, is multivariate normal and the likelihood
is given by

L(β, σ², θ) = (2πσ²)^{-n/2} |R_D|^{-1/2} exp( −(Y_D − Hβ)'R_D^{-1}(Y_D − Hβ) / (2σ²) ).   (5)

Hence

∂ ln L / ∂β = H'R_D^{-1}(Y_D − Hβ) / σ²,

which when set to zero yields the maximum likelihood estimate of β; it is the same
as the generalized least squares estimate,

β̂ = [H'R_D^{-1}H]^{-1} H'R_D^{-1} Y_D.   (6)

Similarly,

∂ ln L / ∂σ² = −n/(2σ²) + (Y_D − Hβ)'R_D^{-1}(Y_D − Hβ)/(2σ⁴),

which when set to zero yields the maximum likelihood estimate of σ²,

σ̂²_ml = (Y_D − Hβ̂)' R_D^{-1} (Y_D − Hβ̂) / n.   (7)
Therefore, if R_D is known, the maximum likelihood estimates of β and σ² are easily
calculated. However, if R(·) is parameterized by θ = (θ_1, ..., θ_s), the derivative

∂ ln L / ∂θ_i = −(1/2) trace( R_D^{-1} ∂R_D/∂θ_i )
              + (Y_D − Hβ)' R_D^{-1} (∂R_D/∂θ_i) R_D^{-1} (Y_D − Hβ) / (2σ²)   (8)

does not generally yield an analytic solution for θ when set to zero for i = 1, ..., s.
(Commonly s = p or 2p, but this need not be assumed.)
An alternative method to estimate θ is to use a nonlinear optimization routine with
equation (5) as the function to be optimized. For a given value of θ, estimates of β
and σ² are calculated using equations (6) and (7), respectively. Next, equation (8) is
used in calculating the partial derivatives of the objective function. See Mardia and
Marshall (1984) for an overview of the maximum likelihood procedure.
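A compact sketch of this empirical Bayes computation is given below: for each trial θ, equations (6) and (7) give β̂ and σ̂², and the profiled likelihood from equation (5) is maximized over θ by a generic one-dimensional optimizer. The toy data and the Gaussian correlation family are assumptions made for illustration, and this is not the chapter's PACE/NPSOL implementation.

```python
# A sketch of profile maximum likelihood estimation of theta for a 1-d design.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.05, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95])
y = np.sin(6 * x) + 0.3 * x                     # hypothetical simulator output
H = np.ones((len(x), 1))                        # constant mean, h(.) = 1

def neg_profile_loglik(log_theta):
    theta = np.exp(log_theta)
    R = np.exp(-theta * (x[:, None] - x[None, :]) ** 2)
    R += 1e-10 * np.eye(len(x))                 # numerical jitter
    Rinv = np.linalg.inv(R)
    beta = np.linalg.solve(H.T @ Rinv @ H, H.T @ Rinv @ y)      # equation (6)
    r = y - H @ beta
    sigma2 = float(r @ Rinv @ r) / len(x)                       # equation (7)
    _, logdet = np.linalg.slogdet(R)
    return 0.5 * (len(x) * np.log(sigma2) + logdet)             # up to constants

res = minimize_scalar(neg_profile_loglik, bounds=(-2.0, 8.0), method="bounded")
print("theta_hat =", np.exp(res.x))
```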
Fig. 7. (a) An example of a response (Y) and three predictors. (b) An example of a derivative (Y') and
three predictors of it.
For more general p and for higher derivatives, following Morris et al. (1993), let

Y^{(a_1, ..., a_p)}(t) = ∂^a Y(t) / (∂t_1^{a_1} ··· ∂t_p^{a_p}),

where a = Σ_{j=1}^p a_j and t_j is the jth component of t. Then E[Y^{(a_1, ..., a_p)}] = 0 and

Cov[ Y^{(a_1, ..., a_p)}(t_1), Y^{(b_1, ..., b_p)}(t_2) ] = (−1)^a σ² ∏_{j=1}^p R_j^{(a_j + b_j)}(t_{2j} − t_{1j})

for R(d) = ∏_{j=1}^p R_j(d_j).
Furthermore, for directional derivatives, let Y'_u(t) be the directional derivative of
Y(t) in the direction u = (u_1, ..., u_p)', Σ_{j=1}^p u_j² = 1,

Y'_u(t) = Σ_{j=1}^p (∂Y(t)/∂t_j) u_j = (∇Y(t), u).
The covariance between the process and a directional derivative is

Cov[ Y(t_1), Y'_u(t_2) ] = σ² (∇R(d), u),   (9)

and the covariances among directional derivatives (equations (10) and (11)) are obtained
from the matrix of second derivatives of R,

(R̈(d))_{k,l} = ∂²R(d) / ∂d_k ∂d_l,

where u_{il} is the direction of the lth directional derivative at x_i. Also let

μ* = (μ, ..., μ, 0, 0, ..., 0)'

with n μ's and mn 0's, and let V* be the combined covariance matrix for the design
responses and derivatives, with entries as prescribed above (equations (9), (10),
and (11)). Then

Ŷ(x_0) = μ̂ + v*'_{x_0} V*^{-1} (Y*_D − μ*)

and

Ŷ'_u(x_0) = v*'_{x_0,u} V*^{-1} (Y*_D − μ*).
It is possible (see Ritter et al., 1993) to approximate Z(x) with an L² error that decays
as O(n^{−(r+1/2)/p}). This error is a root mean square average over randomly generated
functions Z.
When the covariance has a tensor product form, like those considered here, one can
do even better. Ritter et al. (1995) show that the error rate for approximation in this case
is n^{−r−1/2}(log n)^{(p−1)(r+1)} for products of covariances satisfying Sacks-Ylvisaker
conditions of order r ≥ 0. When Z is a p dimensional Wiener sheet process, for which
r = 0, the result is n^{−1/2}(log n)^{p−1}, which was first established by Wozniakowski
(1991).
In the general case, the rate for integration is n^{−1/2} times the rate for approxi-
mation. A theorem of Wasilkowski (1994) shows that a rate n^{−d} for approximation
can usually be turned into a rate n^{−d−1/2} for integration by the simple device of
fitting an approximation with n/2 function evaluations, integrating the approxima-
tion, and then adjusting the result by the average approximation error on n/2 more
Monte Carlo function evaluations. For tensor product kernels the rate for integration is
n^{−r−1}(log n)^{(p−1)/2} (see Paskov, 1993), which has a more favorable power of log n
than would arise via Wasilkowski's theorem.
The fact that much better rates are possible under tensor product models than for
general covariances suggests that the tensor product assumption may be a very strong
one. The tensor product assumption is at least strong enough that under it, there is no
average case curse of dimensionality for approximation.
5. Bayesian designs
Fig. 8(a). Maximum entropy designs for p = 2, n = 1-16, and the Gaussian correlation function with
θ = (0.5, 0.5).
the nth design point is determined after the first n − 1 points have been evaluated, will
not be presented due to their tendencies to replicate (Sacks et al., 1989b). However,
sequential block strategies could be used where the above designs could be used as
starting blocks. Depending upon the ultimate goal of the computer experiment, the
first design block might be utilized to refine the design and reduce the design space.
Lindley (1956) introduced a measure, based upon Shannon's entropy (Shannon, 1948),
of the amount of information provided by an experiment. This Bayesian measure
uses the expected reduction in entropy as a design criterion. This criterion has been
used in Box and Hill (1967) and Borth (1975) for model discrimination. Shewry and
Wynn (1987) showed that, if the design space is discrete (i.e., a lattice in [0, 1]^p),
then minimizing the expected posterior entropy is equivalent to maximizing the prior
entropy.
Fig. 8(b). Maximum entropy designs for p = 2, n = 1-16, and the Gaussian correlation function with
θ = (2, 2).
The maximum entropy design D_E therefore satisfies

E_Y[ −ln P(Y_{D_E}) ] = max_D E_Y[ −ln P(Y_D) ].

In the Gaussian case, this is equivalent to finding a design that maximizes the
determinant of the variance of Y_D. In the Gaussian prior case, where β ~ N_k(b, τ²Σ),
the determinant of the unconditioned covariance matrix is
Fig. 8(c). Maximum entropy designs for p = 2, n = 1-16, and the Gaussian correlation function with
θ = (10, 10).
|Var(Y_D)| = |V_D + τ²HΣH'|
           = |V_D| |I + τ²ΣH'V_D^{-1}H|
           = |V_D| |H'V_D^{-1}H + τ^{-2}Σ^{-1}| |τ²Σ|.
Since τ²Σ is fixed, the maximum entropy criterion is equivalent to finding the design
D_E that maximizes

|V_D| |H'V_D^{-1}H + τ^{-2}Σ^{-1}|.

If the prior distribution is diffuse, τ² → ∞, the maximum entropy criterion is equiv-
alent to maximizing

|V_D| |H'V_D^{-1}H|.
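As a rough illustration of searching for such a design, the sketch below assumes a fixed known mean so that the criterion reduces to maximizing |V_D| (equivalently |R_D|), and uses a simple greedy exchange over a candidate grid rather than the optimizers discussed in the chapter; the Gaussian correlation with θ = (2, 2) and the grid resolution are arbitrary choices.

```python
# A sketch of a maximum entropy design search by greedy exchange over a grid.
import itertools
import numpy as np

theta = np.array([2.0, 2.0])
grid = np.array(list(itertools.product(np.linspace(0, 1, 11), repeat=2)))  # candidates

def logdet_R(D):
    diff = D[:, None, :] - D[None, :, :]
    R = np.exp(-np.sum(theta * diff ** 2, axis=-1))
    return np.linalg.slogdet(R + 1e-10 * np.eye(len(D)))[1]

rng = np.random.default_rng(0)
n = 8
idx = list(rng.choice(len(grid), size=n, replace=False))

improved = True
while improved:                       # exchange one point at a time while it helps
    improved = False
    for pos in range(n):
        best_j, best_val = idx[pos], logdet_R(grid[idx])
        for j in range(len(grid)):
            if j in idx:
                continue
            trial = idx.copy(); trial[pos] = j
            val = logdet_R(grid[trial])
            if val > best_val:
                best_j, best_val, improved = j, val, True
        idx[pos] = best_j

print(grid[idx])                      # an (approximately) maximum entropy design
```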
J(D) is dependent on R(·) through Ŷ(x). For any design, J(D) can be expressed as

J(D) = σ² − trace{ ( 0    H'  )^{-1} ∫ ( h(x)h'(x)    h(x)v'_x ) dx }
                   ( H    V_D )        ( v_x h'(x)    v_x v'_x )
and, as pointed out by Sacks et al. (1989a), if the elements of h(x) and v_x are products
of functions of a single input variable, the multidimensional integral simplifies to prod-
ucts of one-dimensional integrals. As in the entropy design criterion, the minimization
of J(D) is an optimization in n × p dimensions and is also dependent on R(·).
Sacks and Schiller (1988) describe the use of a simulated annealing method for con-
structing IMSE designs for bounded and discrete design spaces. Sacks et al. (1989b)
use a quasi-Newton optimizer on a Cray X-MP48. They found that optimizing an
n = 16, p = 6 design with θ_1 = ··· = θ_6 = 2 took 11 minutes. The PACE program
(Koehler, 1990) uses the optimization program NPSOL (Gill et al., 1986) to solve the
IMSE optimization for a continuous design space. For n = 16, p = 6, this optimiza-
tion requires 13 minutes on a DEC3100, a much less powerful machine than the Cray.
Generally, these algorithms can find only local minima and therefore many random
Fig. 9(a). Minimum integrated mean square error designs for p = 2, n = 1-9, and the Gaussian correlation
function with θ = (.5, .5).
Fig. 9(b). Minimum integrated mean square error designs for p = 2, n = 1-16, and the Gaussian correlation
function with θ = (2, 2).
Fig. 9(c). Minimum integrated mean square error designs for p = 2, n = 1-16, and the Gaussian correlation
function with θ = (10, 10).
lower dimension marginals of the input space. Better projection properties are needed
when the true function is only dependent on a subset of the input variables.
d(x_1, x_2) = d(x_2, x_1),
d(x_1, x_2) ≥ 0,
d(x_1, x_2) = 0 ⟹ x_1 = x_2,
d(x_1, x_2) ≤ d(x_1, x_3) + d(x_3, x_2).
Fig. 10. (a) Minimax and (b) Maximin designs for n = 6 and p = 2 with Euclidean distance.
Minimax distance designs ensure that all points in [0, 1]^p are not too far from a
design point. Let d(., .) be Euclidean distance and consider placing a p-dimensional
sphere with radius r around each design point. The idea of a minimax design is to
place the n points so that the design space is covered by the spheres with minimal r.
As an illustration, consider the owner of a petroleum corporation who wants to open
some franchise gas stations. The gas company would like to locate the stations in the
most convenient sites for the customers. A minimax strategy of placing gas stations
would ensure that no customer is too far from one of the company's stations.
Figure 10(a) shows a minimax design for p = 2 and n = 6 with d(·, ·) being
Euclidean distance. The maximum distance to a design point is .318. For small n,
minimax designs will generally lie in the interior of the design space.
Again, let d(., .) be Euclidean distance. Maximin designs pack the n design points,
with their associated spheres, into the design space, S, with maximum radius. Parts
of the sphere may be out of S but the design points must be in S. Analogous to the
minimax illustration above is the position of the owners of the gas station franchises.
They wish to minimize the competition from each other by locating the stations as
far apart as possible. A maximin strategy for placing the franchises would ensure that
no two stations are too close to each other.
Figure 10(b) shows a maximin design for p = 2, n = 6 and d(., .) Euclidean
distance. For small n, maximin designs will generally lie on the exterior of S and fill
in the interior as n becomes large.
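A crude way to obtain an approximate maximin design is random search over candidate designs, as sketched below; the candidate count is arbitrary and this is not the construction used for Figure 10.

```python
# A sketch: among many random candidate designs, keep the one whose smallest
# pairwise Euclidean distance is largest.
import numpy as np
from scipy.spatial.distance import pdist

def maximin_random_search(n=6, p=2, n_candidates=20_000, seed=0):
    rng = np.random.default_rng(seed)
    best, best_dist = None, -np.inf
    for _ in range(n_candidates):
        D = rng.random((n, p))
        d = pdist(D).min()            # smallest inter-point distance
        if d > best_dist:
            best, best_dist = D, d
    return best, best_dist

design, dmin = maximin_random_search()
print(dmin)
print(design)
```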
So if one can integrate over the domain of X then one can fit regression approximations
there.
The quality of the approximation may be assessed globally by the integrated mean
squared error

∫ (Y − Z(X)β)² dF.

For simplicity we take the distribution F to be uniform on [0, 1]^p. Also for simplicity
the integration schemes to be considered usually estimate ∫ g(X) dF by

(1/n) Σ_{i=1}^n g(x_i)

for well chosen points x_1, ..., x_n. Then β_LS may be estimated by linear regression,

β̂ = ( (1/n) Σ_{i=1}^n Z(x_i)'Z(x_i) )^{-1} (1/n) Σ_{i=1}^n Z(x_i)' f(x_i),

or, when the integrals of squares and cross products of the Z's are known, by

β̃ = ( ∫ Z(X)'Z(X) dF )^{-1} (1/n) Σ_{i=1}^n Z(x_i)' f(x_i).   (13)

When the regression functions are orthonormal under F, so that ∫ Z(X)'Z(X) dF = I,
this becomes

β̃ = (1/n) Σ_{i=1}^n Z(x_i)' f(x_i)   (14)
and one can avoid the cost of matrix inversion. The computation required by equation
(14) grows proportionally to nr not n³, where r = r(n) is the number of regression
variables in Z. If r = O(n) then the computations grow as n². Then, in the example
from Section 3, an hour of function evaluation followed by a minute of algebra would
scale into a day of function evaluation followed by 9.6 hours of algebra, instead of
the 9.6 days that an n³ algorithm would require. If the Z(x_i) exhibit some sparsity
then it may be possible to reduce the algebra to order n or order n log n.
Thus the idea of turning the function into data and making exploratory plots can
be extended to turning the function into data and applying regression techniques. The
theoretically simplest technique is to take X_i iid U[0, 1]^p. Then (X_i, Y_i) are iid pairs
with the complication that Y has zero variance given X. The variance matrix of β̃ is
then

(1/n) ( ∫ Z'Z dF )^{-1} Var( Z(X)'Y(X) ) ( ∫ Z'Z dF )^{-1},   (15)

which may be estimated by substituting the sample variance

(1/(n − r − 1)) Σ_{i=1}^n ( Z(x_i)'Y(x_i) − β̃ )( Z(x_i)'Y(x_i) − β̃ )'

for Var(Z(X)'Y(X)), when the row vector Z comprises an intercept and r additional
regression variables.
This approach to computer experimentation should improve if more accurate inte-
gration techniques are substituted for the iid sampling. Owen (1992a) investigates the
case of Latin hypercube sampling for which a central limit theorem also holds.
Clearly more work is needed to make this method practical. For instance a scheme
for deciding how many predictors should be in Z, or otherwise for regularizing β, is
required.
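A small sketch of the approach, under the assumption of an orthonormal basis Z so that equation (14) applies, is given below; the simulator and the basis are invented for illustration.

```python
# A sketch: estimate beta_tilde by the sample average of Z(x_i)' f(x_i) as in
# equation (14), and its variance by the sample variance of those vectors over n,
# in the spirit of equation (15).
import numpy as np

def f(x):                                        # hypothetical simulator, p = 2
    return np.exp(x[0]) * np.sin(4 * x[1])

def Z(x):
    # Orthonormal basis on [0,1]^2: constant, shifted Legendre-type terms, product.
    l1 = lambda t: np.sqrt(3.0) * (2 * t - 1)    # orthonormal to 1 under U[0,1]
    return np.array([1.0, l1(x[0]), l1(x[1]), l1(x[0]) * l1(x[1])])

rng = np.random.default_rng(4)
n = 2000
X = rng.random((n, 2))                           # iid U[0,1]^2 design
ZY = np.array([Z(x) * f(x) for x in X])          # rows are Z(x_i)' f(x_i)

beta = ZY.mean(axis=0)                           # equation (14)
var_beta = np.cov(ZY, rowvar=False) / n          # plug-in version of equation (15)
print("beta_tilde:", beta)
print("standard errors:", np.sqrt(np.diag(var_beta)))
```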
The frequentist approach proposed in the previous section requires a set of points
x l , . . . , xn that are good for numerical integration and also allow one to estimate the
sampling variance of the corresponding integrals. These two goals are somewhat at
odds. Using an iid sample makes variance estimation easier while more complicated
schemes described below improve accuracy but make variance estimation harder.
The more basic goal of getting points x_i into "interesting corners" of the input
space, so that important features are likely to be found, is usually well served by point
sets that are good for numerical integration.
We assume that the region of interest is the unit cube [0, 1]^p, and that the integrals
of interest are with respect to the uniform distribution over this cube. Other regions of
interest can usually be reduced to the unit cube and other distributions can be changed
to the uniform by a change of variable that can be subsumed into f .
Throughout this section we consider an example with p = 5, and plot the design
points xi.
7.1. Grids
Since varying one coordinate at a time can cause one to miss important aspects of f ,
it is natural to consider instead sampling f on a regular grid. One chooses k different
values for each of X^1 through X^p and then runs all k^p combinations. This works
well for small values of p, perhaps 2 or 3, but for larger p it becomes completely
impractical because the number of runs required grows explosively.
Figure 11 shows a projection of 5^5 = 625 points from a uniform grid in [0, 1]^5 onto
two of the input variables. Notice that with 625 runs, only 25 distinct values appear
in the plane, each representing 25 input settings in the other three variables. Only 5
distinct values appear for each input variable taken singly. In situations where one
of the responses Y^k depends very strongly on only one or two of the inputs X^j the
grid design leads to much wasteful duplication.
The grid design does not lend itself to variance estimation since averages over
the grid are not random. The accuracy of a grid based integral is typically that of a
univariate integral based on k = n^{1/p} evaluations. (See Davis and Rabinowitz, 1984.)
For large p this is a severe disadvantage.
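The explosive growth of the grid design is easy to see directly; the sketch below builds the k^p runs for a few values of p.

```python
# A sketch: a full grid design and the growth of its run count k^p with p.
import itertools
import numpy as np

def grid_design(k, p):
    levels = (np.arange(k) + 0.5) / k            # k centered levels per axis
    return np.array(list(itertools.product(levels, repeat=p)))

for p in (2, 3, 5):
    print(p, len(grid_design(5, p)))             # 25, 125, 3125 runs
print("p = 8 would already require", 5 ** 8, "runs")
```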
X_i^j = { (h_j(i − 1) + 0.5) / n },

where {z} is z modulo 1, that is, z minus the greatest integer less than or equal to z,
and the h_j are integers with h_1 = 1. The points v_i with v_i^j = i h_j / n for integer i form
a lattice in R^p. The points x_i are versions of these lattice points confined to the unit
cube, and the term "good" refers to a careful choice of n and h_j usually based on
number theory.
Figure 12 shows the Fibonacci lattice for p = 2 and n = 34. For more details
see Sloan and Joe (1994). Here h_1 = 1 and h_2 = 21. The Fibonacci lattice is only
available in 2 dimensions. Appendix A of Fang and Wang (1994) lists several other
choices for good lattice points, but the smallest value of n there for p = 5 is 1069.
Hickernell (1996) discusses greedy algorithms for finding good lattice points with
smaller n.
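The construction above is easy to implement; the sketch below reproduces the Fibonacci lattice of Figure 12 (n = 34, h = (1, 21)), with the centering taken from the displayed formula. Other n and h come from number-theoretic tables such as Fang and Wang (1994).

```python
# A sketch of generating good lattice points.
import numpy as np

def good_lattice_points(n, h):
    h = np.asarray(h)
    i = np.arange(1, n + 1)[:, None]             # i = 1, ..., n
    return np.mod((h * (i - 1) + 0.5) / n, 1.0)  # {.} is the fractional part

X = good_lattice_points(34, (1, 21))             # the Fibonacci lattice of Figure 12
print(X[:5])
```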
The recent text (Sloan and Joe, 1994) discusses lattice rules for integration, which
generalize the method of good lattice points. Cranley and Patterson (1976) consider
randomly perturbing the good lattice points by adding, modulo 1, a random vector
uniform over [0, 1]^p to all the x_i. Taking r such random offsets for each of the n data
points gives nr observations with r - 1 degrees of freedom for estimating variance.
Lattice integration rules can be extraordinarily accurate on smooth periodic inte-
grands and thus an approach to computer experiments based on Cranley and Patter-
son's method might be expected to work well when both f ( x ) and Z ( x ) are smooth
and periodic. Bates et al. (1996) have explored the use of lattice rules as designs for
computer experiments.
X_i^j = ( π^j(i) − U_i^j ) / n,   (17)

where the π^j are independent uniform random permutations of the integers 1 through
n, and the U_i^j are independent U[0, 1] random variables independent of the π^j.
Latin hypercube sampling was introduced by McKay et al. (1979) in what is widely
considered to be the first paper on computer experiments. The sample points are
stratified on each of p input axes. A common variant of Latin hypercube sampling
has centered points

X_i^j = ( π^j(i) − 0.5 ) / n.   (18)

Point sets of this type were studied by Patterson (1954) who called them lattice
samples.
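Equations (17) and (18) translate directly into code; the sketch below draws one independent permutation per axis and either a uniform or a centered offset.

```python
# A sketch of Latin hypercube sampling as in equations (17) and (18).
import numpy as np

def latin_hypercube(n, p, centered=False, seed=None):
    rng = np.random.default_rng(seed)
    X = np.empty((n, p))
    for j in range(p):
        pi = rng.permutation(n) + 1              # pi^j(i) in {1, ..., n}
        u = 0.5 if centered else rng.random(n)   # U_i^j, or the centered offset
        X[:, j] = (pi - u) / n
    return X

X = latin_hypercube(25, 5, centered=True, seed=0)
print(np.sort(X[:, 0]))   # each axis is stratified into 25 equal bins
```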
Figure 13 shows a projection of 25 points from a (centered) Latin hypercube sample
over 5 variables onto two of the coordinate axes. Each input variable gets explored
in each of 25 equally spaced bins.
The stratification in Latin hypercube sampling usually reduces the variance of es-
timated integrals. Stein (1987) finds an expression for the variance of a sample mean
under Latin hypercube sampling. Assuming that ∫ f(X)² dF < ∞, write

f(X) = μ + Σ_{j=1}^p α_j(X^j) + e(X).   (19)
Fig. 13. 25 points of a Latin hypercube sample. The range of each input variable may be partitioned into
25 bins of equal width, drawn here with horizontal and vertical dotted lines, and each such bin contains
one of the points.
Under Latin hypercube sampling

Var( (1/n) Σ_{i=1}^n f(X_i) ) = (1/n) ∫ e(X)² dF + o(1/n),   (20)

whereas under iid sampling

Var( (1/n) Σ_{i=1}^n f(X_i) ) = (1/n) ( ∫ e(X)² dF + Σ_{j=1}^p ∫ α_j(X^j)² dF ).   (21)
By balancing the univariate margins, Latin hypercube sampling has removed the main
effects of the function f from the error variance.
Owen (1992a) proves a central limit theorem for Latin hypercube sampling of
bounded functions and Loh (1993) proves a central limit theorem under weaker con-
ditions. For variance estimation in Latin hypercube sampling see (Stein, 1987; Owen,
1992a).
X_i^j = ( π^j(A_i^j) + U_i^j ) / b   (22)

and

X_i^j = ( π^j(A_i^j) + 0.5 ) / b,   (23)

just as Latin hypercube sampling has two versions. Indeed Latin hypercube sampling
corresponds to strength t = 1, with λ = 1. Here the π^j are independent uniform
Fig. 14. 25 points of a randomly centered randomized orthogonal array. For whichever two (of five) variables
that are plotted, there is one point in each reference square.
Fig. 15. 25 points of an orthogonal array based Latin hypercube sample. For whichever two (of five)
variables that are plotted, there is one point in each reference square bounded by solid lines. Each variable
is sampled once within each of 25 horizontal or vertical bins.
Tang (1993) introduced orthogonal array based Latin hypercube samples. The points
of these designs are Latin hypercube samples X_i^j such that ⌊bX_i^j⌋ is an orthogonal
array. Here b is an integer and ⌊z⌋ is the greatest integer less than or equal to z. Tang
(1993) shows that for a strength 2 array the main effects and two variable interactions
do not contribute to the integration variance.
Figure 15 shows a projection of 25 points from an orthogonal array based Latin
hypercube sample over 5 variables onto two of the coordinate axes. Each variable
individually gets explored in each of 25 equal bins and each pair of variables gets
explored in each of 25 squares.
Here E denotes an elementary interval in base b,

E = ∏_{j=1}^s [ a_j b^{−k_j}, (a_j + 1) b^{−k_j} ),

with integers k_j ≥ 0 and 0 ≤ a_j < b^{k_j}.
Fig. 17. The 125 points of a scrambled (0, 3, 5)-net in base 5. For whichever two (of five) variables that
are plotted, the result is a 5 by 5 grid of 5 point Latin hypercube samples. Each variable is sampled once
within each of 125 equal bins. Each triple of variables can be partitioned into 125 congruent cubes, each
of which has one point.
Fig. 18. The 625 points of a scrambled (0, 4, 5)-net in base 5. For whichever two (of five) variables that
are plotted, the square can be divided into 625 squares of side 1/25, or into 625 rectangles of side 1/5 by
1/125, or into 625 rectangles of side 1/125 by 1/5, and each such rectangle has one of the points. Each
variable is sampled once within each of 625 equal bins. Each triple of variables can be partitioned into 625
hyperrectangles in three different ways and each such hyperrectangle has one of the points. Each quadruple
of variables can be partitioned into 625 congruent hypercubes of side 1/5, each of which has one point.
For t ≥ 0, an infinite sequence (X_i)_{i≥1} of points from [0, 1)^s is a (t, s)-sequence
in base b if for all k ≥ 0 and m ≥ t the finite sequence (X_i)_{i=kb^m+1}^{(k+1)b^m} is a (t, m, s)-net
in base b.
The advantage of a (t, s)-sequence is that if one finds that the first b^m points are not
sufficient for an integration problem, one can find another b^m points that also form a
(t, m, s)-net and tend to fill in places not occupied by the first set. If one continues
to the point of having b such (t, m, s)-nets, then the complete set of points comprises
a (t, m + 1, s)-net.
The theory of (t, m, s)-nets and (t, s)-sequences is given in Niederreiter (1992). A
famous result of the theory is that integration over a (t, m, s)-net can attain an accuracy
of order O(log(n)^{s−1}/n) while restricting to (t, s)-sequences raises this slightly to
O(log(n)^s/n). These results require that the integrand be of bounded variation in the
sense of Hardy and Krause. For large s, it takes unrealistically large n for these rates
to be clearly better than n^{−1/2} but in examples they seem to outperform simple Monte
Carlo.
The construction of (t, m, s)-nets and (t, s)-sequences is also described in Nieder-
reiter (1992). Here we remark that for prime numbers s a construction by Faure (1982)
gives (0, s)-sequences in base s, and Niederreiter extended the method to prime powers s.
(See Niederreiter, 1992.) Thus one can choose b to be the smallest prime power greater
than or equal to s and use the first s variables of the corresponding (0, b)-sequence
in base b.
Owen (1995) describes a scheme to randomize (t, m, s)-nets and (t, s)-sequences.
The points are written in a base b expansion and certain random permutations are
applied to the coefficients in the expansion. The result is to make each permuted X_i
uniformly distributed over [0, 1)^s while preserving the (t, m, s)-net or (t, s)-sequence
structure of the ensemble of X_i. Thus the sample estimate n^{−1} Σ_{i=1}^n f(X_i) is unbi-
ased for ∫ f(X) dF and the variance of it may be estimated by replication. On some
test integrands in (Owen, 1995) the randomized nets outperformed their unrandom-
ized counterparts. It appears that the unscrambled nets have considerable structure,
stemming from the algebra underlying them, and that this structure is a liability in
integration.
Figure 16 shows the 25 points of a scrambled (0, 2, 5)-net in base 5 projected onto
two of the five input coordinates. These points are the initial 25 points of a (0, 5)-
sequence in base 5. This design has the equidistribution properties of an orthogonal
array based Latin hypercube sample. Moreover every consecutive 25 points in the
sequence X_{25a+1}, X_{25a+2}, ..., X_{25(a+1)} has these equidistribution properties. The first
125 points, shown in Figure 17 have still more equidistribution properties: any triple
of the input variables can be split into 125 subcubes each with one of the Xi, in any
pair of variables the points appear as a 5 by 5 grid of 5 point Latin hypercube samples
and each individual input variable can be split into 125 cells each having one point.
The first 625 points are shown in Figure 18.
Owen (1996a) finds a variance formula for means over randomized (t, m, s)-nets
and (t, s)-sequences. The formula involves a wavelet-like anova combining nested
terms on each coordinate, all crossed against each other. It turns out that for any
square integrable integrand, the resulting variance is o(n^{−1}) and it therefore beats any
of the usual variance reduction techniques, which typically only reduce the asymptotic
coefficient of n^{−1}.
For smooth integrands with s = 1, the variance is in fact O(n^{−3}) and in the general
case Owen (1996b) shows that the variance is O(n^{−3}(log n)^{s−1}).
8. Selected applications
One of the largest fields using and developing deterministic simulators is in the de-
signing and manufacturing of VLSI circuits. Alvarez et al. (1988) describe the use of
SUPREM-III (Ho et al., 1984) and SEDAN-II (Yu et al., 1982) in designing BIMOS
devices for manufacturability. Aoki et al. (1987) use CADDETH, a two dimensional
device simulator, for optimizing devices and for accurate prediction of device sensitiv-
ities. Sharifzadeh et al. (1989) use SUPREM-III and PISCES-II (Pinto et al., 1984)
References
Alvarez, A. R., B. L. Abdi, D. L. Young, H. D. Weed, J. Teplik and E. Herald (1988). Application of
statistical design and response surface methods to computer-aided VLSI device design. IEEE Trans.
Comput. Aided Design 7(2), 271-288.
Aoki, Y., H. Masuda, S. Shimada and S. Sato (1987). A new design-centering methodology for VLSI
device development. IEEE Trans. Comput. Aided Design 6(3), 452-461.
Bartell, S. M., R. H. Gardner, R. V. O'Neill and J. M. Giddings (1983). Error analysis of predicted fate of
anthracene in a simulated pond. Environ. Toxicol. Chem. 2, 19-28.
Bartell, S. M., J. P. Landrum, J. P. Giesy and G. J. Leversee (1981). Simulated transport of polycyclic
aromatic hydrocarbons in artificial streams. In: W. J. Mitch, R. W. Bosserman and J. M. Klopatek, eds.,
Energy and Ecological Modelling. Elsevier, New York, 133-143.
Bates, R. A., R. J. Buck, E. Riccomagno and H. P. Wynn (1996). Experimental design and observation for
large systems (with discussion). J. Roy. Statist. Soc. Ser. B 58(1), 77-94.
Borth, D. M. (1975). A total entropy criterion for the dual problem of model discrimination and parameter
estimation. J. Roy. Statist. Soc. Ser. B 37, 77-87.
Box, G. E. P. and N. R. Draper (1959). A basis for the selection of a response surface design. J. Amer.
Statist. Assoc. 54, 622-654.
Box, G. E. P. and N. R. Draper (1963). The choice of a second order rotatable design. Biometrika 50,
335-352.
Box, G. E. P. and W. J. Hill (1967). Discrimination among mechanistic models. Technometrics 9, 57-70.
Church, A., T. Mitchell and D. Fleming (1988). Computer experiments to optimize a compression mold
filling process. Talk given at the Workshop on Design for Computer Experiments in Oak Ridge, TN,
November.
Cranley, R. and T. N. L. Patterson (1976). Randomization of number theoretic methods for multiple
integration. SIAM J. Numer. Anal. 23, 904-914.
Cressie, N. A. C. (1986). Kriging nonstationary data. J. Amer. Statist. Assoc. 81, 625-634.
Cressie, N. A. C. (1993). Statistics for Spatial Data (Revised edition). Wiley, New York.
Currin, C., T. Mitchell, M. Morris and D. Ylvisaker (1991). Bayesian prediction of deterministic functions,
with applications to the design and analysis of computer experiments. J. Amer. Statist. Assoc. 86, 953-963.
Dandekar, R. (1993). Performance improvement of restricted pairing algorithm for Latin hypercube sampling.
Draft Report, Energy Information Administration, U.S.D.O.E.
Davis, P. J. and P. Rabinowitz (1984). Methods of Numerical Integration, 2nd. edn. Academic Press, San
Diego.
Diaconis, P. (1988). Bayesian numerical analysis. In: S. S. Gupta and J. O. Berger, eds., Statistical Decision
Theory and Related Topics IV, Vol. 1. Springer, New York, 163-176.
Efron, B. and C. Stein (1981). The jackknife estimate of variance. Ann. Statist. 9, 586-596.
Fang, K. T. and Y. Wang (1994). Number-theoretic Methods in Statistics. Chapman and Hall, London.
Faure, H. (1982). Discrépance de suites associées à un système de numération (en dimension s). Acta
Arithmetica 41, 337-351.
Friedman, J. H. (1991). Multivariate adaptive regression splines (with Discussion). Ann. Statist. 19, 1-67.
Gill, P. E., W. Murray, M. A. Saunders and M. H. Wright (1986). User's guide for npsol (version 4.0):
A Fortran package for nonlinear programming. SOL 86-2, Stanford Optimization Laboratory, Dept. of
Operations Research, Stanford University, California, 94305, January.
Gill, P. E., W. Murray and M. H. Wright (1981). Practical Optimization. Academic Press, London.
Gordon, W. J. (1971). Blending function methods of bivariate and multivariate interpolation and approxi-
mation. SIAM J. Numer. Anal. 8, 158-177.
Gu, C. and G. Wahba (1993). Smoothing spline ANOVA with component-wise Bayesian "confidence
intervals". J. Comp. Graph. Statist. 2, 97-117.
Hickernell, F. J. (1996). Quadrature error bounds with applications to lattice rules. SIAM J. Numer. Anal.
33 (in press).
Ho, S. P., S. E. Hansen and P. M. Fahey (1984). Suprem III - a program for integrated circuit process
modeling and simulation. TR-SEL84 1, Stanford Electronics Laboratories.
Iman, R. L. and W. J. Conover (1982). A distribution-free approach to inducing rank correlation among
input variables. Comm. Statist. B11(3), 311-334.
Johnson, M. E., L. M. Moore and D. Ylvisaker (1990). Minimax and maximin distance designs. J. Statist.
Plann. Inference 26, 131-148.
Journel, A. G. and C. J. Huijbregts (1978). Mining Geostatistics. Academic Press, London.
Koehler, J. R. (1990). Design and estimation issues in computer experiments. Dissertation, Dept. of
Statistics, Stanford University.
Lindley, D. V. (1956). On a measure of the information provided by an experiment. Ann. Math. Statist.
27, 986-1005.
Loh, W.-L. (1993). On Latin hypercube sampling. Tech. Report No. 93-52, Dept. of Statistics, Purdue
University.
Loh, W.-L. (1994). A combinatorial central limit theorem for randomized orthogonal array sampling designs.
Tech. Report No. 94-4, Dept. of Statistics, Purdue University.
Mardia, K. V. and R. J. Marshall (1984). Maximum likelihood estimation of models for residual covariance
in spatial regression. Biometrika 71(1), 135-146.
Matérn, B. (1947). Method of estimating the accuracy of line and sample plot surveys. Medd. Skogsforskn
Inst. 36(1).
Matheron, G. (1963). Principles of geostatistics. Econom. Geol. 58, 1246-1266.
McKay, M. (1995). Evaluating prediction uncertainty. Report NUREG/CR-6311, Los Alamos National
Laboratory.
McKay, M., R. Beckman and W. Conover (1979). A comparison of three methods for selecting values of
input variables in the analysis of output from a computer code. Technometrics 21(2), 239-245.
Miller, D. and M. Frenklach (1983). Sensitivity analysis and parameter estimation in dynamic modeling of
chemical kinetics. Internat. J. Chem. Kinetics 15, 677-696.
Mitchell, T. J. (1974). An algorithm for the construction of 'D-optimal' experimental designs. Technometrics
16, 203-210.
Mitchell, T., M. Morris and D. Ylvisaker (1990). Existence of smoothed stationary processes on an interval.
Stochastic Process. Appl. 35, 109-119.
Mitchell, T., M. Morris and D. Ylvisaker (1995). Two-level fractional factorials and Bayesian prediction.
Statist. Sinica 5, 559-573.
Mitchell, T. J. and D. S. Scott (1987). A computer program for the design of group testing experiments.
Comm. Statist. Theory Methods 16, 2943-2955.
Morris, M. D. and T. J. Mitchell (1995). Exploratory designs for computational experiments. J. Statist.
Plann. Inference 43, 381-402.
Morris, M. D., T. J. Mitchell and D. Ylvisaker (1993). Bayesian design and analysis of computer experi-
ments: Use of derivative in surface prediction. Technometrics 35(3), 243-255.
Nassif, S. R., A. J. Strojwas and S. W. Director (1984). FABRICS II: A statistically based IC fabrication
process simulator. IEEE Trans. Comput. Aided Design 3, 40-46.
Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods. SIAM, Philadelphia,
PA.
O'Hagan, A. (1989). Comment: Design and analysis of computer experiments. Statist. Sci. 4(4), 430-432.
Owen, A. B. (1992a). A central limit theorem for Latin hypercube sampling. J. Roy. Statist. Soc. Ser. B
54, 541-551.
Owen, A. B. (1992b). Orthogonal arrays for computer experiments, integration and visualization. Statist.
Sinica 2, 439-452.
Owen, A. B. (1994a). Lattice sampling revisited: Monte Carlo variance of means over randomized orthog-
onal arrays. Ann. Statist. 22, 930-945.
Owen, A. B. (1994b). Controlling correlations in latin hypercube samples. J. Amer. Statist. Assoc. 89,
1517-1522.
Owen, A. B. (1995). Randomly permuted (t, m, s)-nets and (t, s)-sequences. In: H. Niederreiter and
P. J.-S. Shiue, eds., Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing. Springer,
New York, 299-317.
Owen, A. B. (1996a). Monte Carlo variance of scrambled net quadrature. SIAM J. Numer. Anal., to appear.
Owen, A. B. (1996b). Scrambled net variance for integrals of smooth functions. Tech. Report Number
493, Department of Statistics, Stanford University.
Paskov, S. H. (1993). Average case complexity of multivariate integration for smooth functions. J. Com-
plexity 9, 291-312.
Park, J.-S. (1994). Optimal Latin-hypercube designs for computer experiments. J. Statist. Plann. Inference
39, 95-111.
Parzen, E. (1962). Stochastic Processes. Holden-Day, San Francisco, CA.
Patterson, H. D. (1954). The errors of lattice sampling. J. Roy. Statist. Soc. Ser. B 16, 140-149.
Phadke, M. (1988). Quality Engineering Using Robust Design. Prentice-Hall, Englewood Cliffs, NJ.
Pinto, M. R., C. S. Rafferty and R. W. Dutton (1984). PISCES-II - Poisson and continuity equation solver.
DAGG-29-83-k 0125, Stanford Electron. Lab.
Ripley, B. (1981). Spatial Statistics. Wiley, New York.
Ritter, K. (1995). Average case analysis of numerical problems. Dissertation, University of Erlangen.
Ritter, K., G. Wasilkowski and H. Wozniakowski (1993). On multivariate integration for stochastic pro-
cesses. In: H. Brass and G. Hammerlin, eds., Numerical Integration, Birkhauser, Basel, 331-347.
Ritter, K., G. Wasilkowski and H. Wozniakowski (1995). Multivariate integration and approximation for
random fields satisfying Sacks-Ylvisaker conditions. Ann. Appl. Prob. 5, 518-540.
Roosen, C. B. (1995). Visualization and exploration of high-dimensional functions using the functional
ANOVA decomposition. Dissertation, Dept. of Statistics, Stanford University.
Sacks, J. and S. Schiller (1988). Spatial designs. In: S. S. Gupta and J. O. Berger, eds., Statistical Decision
Theory and Related Topics IV, Vol. 2. Springer, New York, 385-399.
Sacks, J., S. B. Schiller and W. J. Welch (1989). Designs for computer experiments. Technometrics 31(1),
41-47.
Sacks, J., W. J. Welch, T. J. Mitchell and H. P. Wynn (1989). Design and analysis of computer experiments.
Statist. Sci. 4(4), 409-423.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. J. 27, 379-423, 623-656.
Sharifzadeh, S., J. R. Koehler, A. B. Owen and J. D. Shott (1989). Using simulators to model transmitted
variability in IC manufacturing. IEEE Trans. Semicond. Manufact. 2(3), 82-93.
Shewry, M. C. and H. P. Wynn (1987). Maximum entropy sampling. J. Appl. Statist. 14, 165-170.
Shewry, M. C. and H. P. Wynn (1988). Maximum entropy sampling and simulation codes. In: Proc. 12th
World Congress on Scientific Computation, Vol. 2, IMAC88, 517-519.
Sloan, I. H. and S. Joe (1994). Lattice Methods for Multiple Integration. Oxford Science Publications,
Oxford.
Smolyak, S. A. (1963). Quadrature and interpolation formulas for tensor products of certain classes of
functions. Soviet Math. Dokl. 4, 240-243.
Stein, M. L. (1987). Large sample properties of simulations using Latin hypercube sampling. Technometrics
29(2), 143-151.
Stein, M. L. (1989). Comment: Design and analysis of computer experiments. Statist. Sci. 4(4), 432-433.
Steinberg, D. M. (1985). Model robust response surface designs: Scaling two-level factorials. Biometrika
72, 513-526.
Tang, B. (1992). Latin hypercubes and supersaturated designs. Dissertation, Dept. of Statistics and Actuarial
Science, University of Waterloo.
Tang, B. (1993). Orthogonal array-based Latin hypercubes. J. Amer. Statist. Assoc. 88, 1392-1397.
Wahba, G. (1978). Interpolating surfaces: High order convergence rates and their associated designs,
with applications to X-ray image reconstruction. Tech. Report 523, Statistics Department, University of
Wisconsin, Madison.
Wahba, G. (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in
Applied Mathematics, Vol. 59. SIAM, Philadelphia, PA.
Wasilkowski, G. (1993). Integration and approximation of multivariate functions: Average case complexity
with Wiener measure. Bull. Amer. Math. Soc. (N. S.) 28, 308-314. Full version: J. Approx. Theory 77,
212-227.
Wozniakowski, H. (1991). Average case complexity of multivariate integration. Bull. Amer. Math. Soc.
(N. S.) 24, 185-194.
Welch, W. J. (1983). A mean squared error criterion for the design of experiments. Biometrika 70(1),
201-213.
Welch, W. J., T. K. Yu, S. M. Kang and J. Sacks (1990). Computer experiments for quality control by parameter design.
J. Quality Technol. 22, 15-22.
Welch, W. J., R. J. Buck, J. Sacks, H. P. Wynn, T. J. Mitchell and M. D. Morris (1992). Screening, prediction, and
computer experiments. Technometrics 34(1), 15-25.
Yaglom, A. M. (1987). Correlation Theory of Stationary and Related Random Functions, Vol. 1. Springer,
New York.
Ylvisaker, D. (1975). Designs on random fields. In: J. N. Srivastava, ed., A Survey of Statistical Design
and Linear Models. North-Holland, Amsterdam, 593-607.
Young, A. S. (1977). A Bayesian approach to prediction using polynomials. Biometrika 64, 309-317.
Yu, Z., G. G. Y. Chang and R. W. Dutton (1982). Supplementary report on SEDAN II. TR-G201 12, Stanford
Electronics Laboratories.