A First Course in Experimental Design

Art B. Owen
Stanford University
Autumn 2020
Contents

1 Introduction
  1.1 History of design
  1.2 Confounding and related issues
  1.3 Neyman-Rubin Causal Model
  1.4 Random assignment and ATE
  1.5 Random science tables
  1.6 External validity
  1.7 More about causality
2 A/B testing
  2.1 Why is this hard?
  2.2 Selected points from the Hippo book
  2.3 Questions raised in class
  2.4 Winner's curse
  2.5 Sequential testing
  2.6 Near impossibility of measuring returns
3 Bandit methods
  3.1 Exploration and exploitation
  3.2 Regret
  3.3 Upper confidence limit
  3.4 Thompson sampling
  3.5 Theoretical findings on Thompson sampling
  3.6 More about bandits
5 Analysis of variance
  5.1 Potatoes and sugar
  5.2 One at a time experiments
  5.3 Interactions
  5.4 Multiway ANOVA
  5.5 Replicates
  5.6 High order ANOVA tables
  5.7 Distributions of sums of squares
  5.8 Fixed and random effects
7 Fractional factorials
  7.1 Half replicates
  7.2 Catapult example
  7.3 Quarter fractions
  7.4 Resolution
  7.5 Overwriting notation
  7.6 Saturated designs
  7.7 Followup fractions
  7.8 More data analysis
16 Wrap-up
  16.1 What statistics is about
  16.2 Principles from experimental design
Bibliography
Introduction
To gain understanding from data, we must contend with noise, bias, correlation
and interaction among other issues. Often there is nothing we can do about
some of those things, because we are just handed data with flaws embedded.
Choosing or designing the data gives us much better possibilities. By care-
fully designing an experiment we can gain information more efficiently, meaning
lower variance for a given expenditure in time or money or subjects. More im-
portantly, experimentation provides the most convincing empirical evidence of
causality. That is, it is not just about more efficient estimation of regression co-
efficients and similar parameters. It is about gaining causal insight. If we think
of efficiency as better handling of noise, we can think of the causal estimation
as better handling of correlations among predictors as well as interactions and
bias.
We all know that “correlation does not imply causation”. Without a causal
understanding, all we can do is predict outcomes, not confidently influence them.
There are settings where prediction alone is very useful. Predicting the path
of a hurricane is enough to help people get out of the way and prepare for the
aftermath. Predicting stock prices is useful for an investor whose decisions are
too small to move the market. However, much greater benefits are available
from causal understanding. For instance, a physician who could only predict
patient outcomes in the absence of treatment but not influence them would not
be as effective as one who can choose a treatment that brings a causal benefit.
In manufacturing, causal understanding is needed to design better products. In
social science, causal understanding is needed to understand policy choices.
Our main tool will be randomizing treatment assignments. Injecting ran-
domness into the independent variables provides the most convincing way to
establish causality, though we will see that it is not perfect.
Yi = Wi Yi1 + (1 − Wi )Yi0 , i = 1, . . . , n.
Row i shows data for subject i. There is a column each for treatment and control.
If we knew Y then we would know every subject’s own personal treatment effect
∆i = Yi1 − Yi0 .
There is an important implicit assumption involved in writing the science
table this way. We are assuming that the response for subject i is Yi = Yi(Wi)
and does not depend on Wi′ for any i′ ≠ i. Imagine instead the opposite where
the value Yi depends on the whole vector W = (W1, . . . , Wn) ∈ {0, 1}^n. Then
our science table would have n rows and 2^n columns, one for each possible W.
We can use the small science table in (1.1) under the Stable Unit Treatment
Value Assumption (SUTVA). Under SUTVA, Yi does not depend on Wi′ for
any i′ ≠ i. It might not even depend on Wi, but if it depends on W at all it
can only be through Wi . In applications we have to consider whether SUTVA
is realistic. If patients in a clinical trial swap drugs with each other, SUTVA is
violated. If we are experimenting on subjects in a network we might find that
the response for one subject depends on the treatment of their neighbors. That
would violate SUTVA.
and estimate τ by
τ̂ = Ȳ1 − Ȳ0 .
By choosing W randomly we can get E(τ̂ ) = τ where this E(·) refers to ran-
domness in W .
Suppose we have a simple random sample where all $\binom{n}{n_1}$ ways of picking
the treatment group are equally likely. Then
$$E(\bar Y_j) = \mu_j, \quad j = 0, 1, \qquad\text{and so}\qquad E(\hat\tau) = \tau.$$
If instead we assign treatments by independent coin tosses, then $n_1$ is random
and we have to treat $\bar Y_j$ as a ratio of two random variables and work out
approximate mean and variance via the delta-method. For the gory details see
Rosenman et al. (2018).
Suppose that we want var(Ȳj ) under simple random sampling. It’s kind
of a pain in the neck to work that out using the theory of finite population
sampling (survey sampling) from Cochran (1977) or (Rice, 2007, Chapter 7).
It is even worse if we toss coins because we hit the n1 ∈ {0, n} problem with
probability just enough bigger than zero to be a theoretical nuisance. Both of
those get worse if we want var(Ȳ1 − Ȳ0 ). Let’s just avoid it! We will see later an
argument by Box et al. (1978) that will let us use plain old regression methods
to get inferences. That is much simpler, and there is no reason to pick the
cumbersome way to do things. There are also permutation test methods to get
randomization based confidence intervals by Monte Carlo sampling. Those can
work well and be straightforward to use. There is also a book in the works by
Tirthankar Dasgupta and Donald Rubin on using the actual randomization in
more complicated settings. I'm looking forward to seeing how that goes.
After writing the above, I saw Imbens (2019) describing a conservative esti-
mate of var(τ̂ ) due to Neyman. It is
$$\frac{1}{n_1(n_1-1)}\sum_i W_i (Y_i - \bar Y_1)^2 + \frac{1}{n_0(n_0-1)}\sum_i (1 - W_i)(Y_i - \bar Y_0)^2.$$
This is just the sum of the two sample variance estimators that we might have
used in regular modeling (like Box et al. advise). Let’s still avoid digging into
why that is conservative.
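In more familiar notation this is just s1^2/n1 + s0^2/n0. Here is a minimal Python sketch of the difference in means estimate together with that conservative variance estimate; the function name and interface are mine, not from these notes.

import numpy as np

def ate_and_neyman_var(Y, W):
    # Difference in means ATE estimate with Neyman's conservative variance,
    # which reduces to s1^2/n1 + s0^2/n0.
    Y = np.asarray(Y, dtype=float)
    W = np.asarray(W)
    y1, y0 = Y[W == 1], Y[W == 0]
    tau_hat = y1.mean() - y0.mean()
    var_hat = y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size
    return tau_hat, var_hat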
This is the average ATE over the distribution of random science tables.
We could argue whether τ (Y) or E(τ (Y)) is the more important thing to
estimate but in practice they may well be very close. For instance, if $Y_{ij} = \mu_{ij} + \varepsilon_{ij}$
with noise $\varepsilon_{ij} \stackrel{\mathrm{iid}}{\sim} N(0, \sigma^2)$ then
$$\tau(Y) - E(\tau(Y)) = \frac{1}{n}\sum_{i=1}^n (\varepsilon_{i1} - \varepsilon_{i0}) \sim N\Bigl(0, \frac{2\sigma^2}{n}\Bigr).$$
In a large experiment the two quantities are close. If the quantities are close then
we may choose to study whichever one gives the most clarity to the analysis.
In a theoretical study we can model reasonable science tables by specifying a
distribution for Y that fits the applied context.
$$
Y_{\mathrm{now}} =
\begin{array}{c|cc}
 & W_i = 0 & W_i = 1\\ \hline
i = 1 & Y_{10} & Y_{11}\\
2 & Y_{20} & Y_{21}\\
3 & Y_{30} & Y_{31}\\
\vdots & \vdots & \vdots\\
n & Y_{n0} & Y_{n1}
\end{array}
\qquad
Y_{\mathrm{later}} =
\begin{array}{c|cc}
 & W_i = 0 & W_i = 1\\ \hline
i = n+1 & Y_{n+1,0} & Y_{n+1,1}\\
n+2 & Y_{n+2,0} & Y_{n+2,1}\\
n+3 & Y_{n+3,0} & Y_{n+3,1}\\
\vdots & \vdots & \vdots\\
n+m & Y_{n+m,0} & Y_{n+m,1}
\end{array}
\tag{1.2}
$$
These have ATEs $\tau_{\mathrm{now}}$ and $\tau_{\mathrm{later}}$ and our experiment will give us the estimate
$\hat\tau_{\mathrm{now}}$. If we think of $\hat\tau_{\mathrm{now}}$ as the future ATE we incur an error
$$\hat\tau_{\mathrm{now}} - \tau_{\mathrm{later}} = (\hat\tau_{\mathrm{now}} - \tau_{\mathrm{now}}) + (\tau_{\mathrm{now}} - \tau_{\mathrm{later}}).$$
The first term can be studied using statistical inference on our data. We shied
away from doing that once we saw the sampling theory issues it raises, but
will look into it later. The second term is about whether the ATE might have
changed. It is about external validity.
External validity is some sort of extrapolation. It can be extremely rea-
sonable and well supported empirically. E.g., gravity works the same here as
elsewhere. It can be unreasonable. E.g., experimental findings on undergradu-
ates who are required to participate for a grade might not generalize to people
of all ages, everywhere.
Findings on present customers might not generalize perfectly to others.
Findings for mice might not generalize well to humans. Findings in the US
might not generalize to the EU. Findings in a clinical trial with given enrol-
ment conditions might not generalize to future patients who are sicker (or are
healthier) than those in the study.
You may have heard the expression ‘your mileage may vary’. This referred
originally to ratings of fuel efficiency for cars. If you're considering two cars,
the advertised mileages µ0 and µ1 might not apply to you because you drive dif-
ferently or live near different kinds of roads or differ in some other way from the
test conditions. It commonly holds that the difference µ1 − µ0 might still apply
well to you. The things that make your driving different from test conditions
could affect both cars nearly equally. In that case, the ATE has reasonable ex-
ternal validity. This is a common occurrence and gives reason for more optimism
about external validity.
External validity can be judged but not on the basis of observing a subset
of Ynow . The exact same Y could appear in one problem with external validity
and another without it. External validity can be based on past experiences of
similar quantities generalizing where tested. That is a form of meta-analysis, a
study of studies. External validity can also be based on scientific understanding
that may ultimately have been gleaned by looking at what generalizes and what
does not, organized around an underlying theory that has successfully predicted
many generalizations.
A/B testing
the modern problems are only 1/1024 times as hard as the old ones. Yet,
they involve large teams of data scientists and engineers developing specialized
experimental platforms. Why?
Online experimentation is a new use case for randomized trials and it brings
with it new costs, constraints and opportunities. Here are some complicating
factors about online experimentation:
• there can be thousands of experiments per year,
• many going on at the same time,
• there are numerous threats to SUTVA,
• tiny effects can have enormous value,
• effects can drift over time,
• the users may be quite different from the data scientists and engineers,
• there are adversarial elements, e.g., bots and spammers, and
• short term metrics might not lead to long term value.
There are also some key statistical advantages in the online setting. It is
possible to experiment directly on the web page or other product. This is quite
different from pharmaceutical or aerospace industries. People buying tylenol
are not getting a random tylenol-A versus tylenol-B. When your plane pulls
up to the gate it is not randomly 787-A versus 787-B. Any change to a high
impact product with strong safety issues requires careful testing and perhaps
independent certification about the results of those tests.
A second advantage is speed. Agriculture experiments often have to be
designed to produce data in one growing season per year. If the experiment fails,
a whole year is lost. Clinical trials may take years to obtain enough patients.
Software or web pages can be changed much more quickly than those industries’
products can. Online settings involve strong day of week effects (weekend versus
work days) and hour of day effects (work hours and time zones) and with fast
data arrival a clear answer might be available in one week. Or if something has
gone badly wrong an answer could be available much faster.
Now, most experimental changes do almost nothing. This could be because
the underlying system is near some kind of local optimum. If the changes are
doing nearly nothing then it is reasonable to suppose that they don’t interfere
much with each other either. In the language of Chapter 1, the science table
for one experiment does not have to take account of the setting levels for other
concurrent experiments. So in addition to rapid feedback, this setting also
allows great parallelization of experimentation. The industrial settings that Box
studied often have significant interactions among the k variables of interest.
Kohavi et al. (2009, Section 4) describe reasons for using single experiments
instead of varying k factors at a time. Some combinations of factors might be
undesirable or even impossible to use. Also combining factors can introduce
undesirable couplings between investigations. For instance, it could be necessary
to wait until all k teams are ready to start, thus delaying k − 1 of the teams.
to get the same Wi that they got earlier. This is done by using a deterministic
hash function that turns the user id into a number b ∈ {0, 1, 2, · · · , B − 1} where
B is a number of buckets. For instance B = 1000 is common. You can think
of the hash function as a random number generator and the user id as a seed.
Then we could give W = 0 whenever b < 500 and W = 1 when b ≥ 500. When
the user returns with the same user id they get the same treatment.
We don’t want a user to be in the treatment arm (or control arm) of every
A/B test that they are in. We would want independent draws instead. So we
should pick the bucket b based on hash(userid + experimentid) or similar.
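Here is a minimal sketch of that kind of bucketing. The function names are placeholders and MD5 is used only as one convenient deterministic hash; real platforms have their own hashing schemes.

import hashlib

def bucket(user_id: str, experiment_id: str, B: int = 1000) -> int:
    # Deterministically hash (user id, experiment id) into a bucket in {0, ..., B-1}.
    key = f"{user_id}:{experiment_id}".encode("utf8")
    return int(hashlib.md5(key).hexdigest(), 16) % B

def treatment(user_id: str, experiment_id: str, treated_buckets: int = 500) -> int:
    # W = 1 if the user's bucket is among the first treated_buckets of 1000.
    return 1 if bucket(user_id, experiment_id) < treated_buckets else 0

The same user id always lands in the same bucket for a given experiment, while different experiments get effectively independent splits.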
There is an important difference between a person and a user id. A person’s
user id might change if they delete a cookie, or use different hardware (e.g., their
phone and a laptop) for the same service. It is also possible that multiple people
share an account. When that user id returns it might be a family member. The
link between people and user ids may be almost, but not quite, one to one.
They discuss other experimental units that might come up in practice. Per-
haps i denotes a web page that might get changed. Or a browser session. Vaver
and Koehler (2011) make the unit a geographical region. Somebody could in-
crease their advertising in some regions, leaving others at the nominal level (or
decreasing to keep spending constant) and then look for a way to measure total
sales of their product in those regions.
For a cloud based document tool with shared users it makes sense to ex-
periment on a whole cluster of users that work together. It might not even be
possible to have people in both A and B arms of the trial share a document. Also
there is a SUTVA issue if people in the same group have different treatments.
2.2.5 Ramping up
Because most experiments involve changes that are unhelpful or even harmful,
it is not wise to start them out at 50:50 on the experimental units. It is better
to start small, perhaps with only 1% getting the new treatment. You can do
that by allocating buckets 0 through 9 of 0 through 999 to the new treatment.
Also, if something is going badly wrong it can be detected quickly on a small
sample.
to Dunnett’s test. Checking just now, Dunnett’s test is indeed there in Chapter
4.3.5 of Berger et al. (2018).
Q: Do people usually use A/A test to get the significance level, does it differ
a lot in case you have a lot of data. I assume with a huge amount of data the
test statistic might be more or less t-distributed (generally).
A: They might. I more usually hear about it being used to spot check some-
thing that seems odd. If n is large then the t-test might be ok statistically
but A/A tests can catch other things. For example if you have k highly cor-
related tests then the A/A sampling may capture the way false discoveries are
dependent.
A t-test is unrobust to different variances. You’re ok if you’ve done a nearly
50:50 split. But if you’ve done an imbalanced split then the t-test can be wrong.
If the A/A test is a permutation it will be wrong there too because the usual
t-test is asymptotically equivalent to a permutation test. Ouch. Maybe a clever
bootstrap would do. Or Welch’s t test. Or a permutation strategy based on
Welch’s.
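As a small illustration of the robustness point, the sketch below (with made up numbers) compares the pooled and Welch versions of the two sample t-test on an imbalanced split with unequal variances and a true null; over repetitions the pooled version rejects far too often while Welch's stays near the nominal level.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps, alpha = 2000, 0.05
pooled = welch = 0
for _ in range(reps):
    a = rng.normal(0.0, 1.0, size=1000)   # big arm, small variance
    b = rng.normal(0.0, 3.0, size=20)     # small arm, large variance, same mean
    pooled += stats.ttest_ind(a, b, equal_var=True).pvalue < alpha
    welch  += stats.ttest_ind(a, b, equal_var=False).pvalue < alpha
print(pooled / reps, welch / reps)        # pooled rejects too often; Welch is near 0.05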
If you have heavy-tailed responses, so heavy that they look like they’ve
come from a setting with infinite variance then the t-test will not work well for
you. However maybe nothing else will be very good in that case. These super
heavy-tailed responses come up often when the response is dollar valued. Think
of online games where a small number of players, sometimes called ‘whales’,
spend completely unreasonable (to us at least) amounts of money. (You can use
medians or take logs but those don’t answer a question about expectations.)
To put A/A testing into an experimental platform you would have to find
a way to let the user specify what data are the right ones to run A/A tests
on for each experiment. Then you have to get that data into the system from
whatever other system it was in. That would be more complicated than just
using χ2 tables or similar.
Q: Is A/A testing a form of bootstrapping?
A: It is a Monte Carlo resampling method very much like bootstrapping. It
might be more accurately described as permutation testing. There’s nothing
to stop somebody doing a bootstrap instead. However the A/A test has very
strong intuitive rationale. It describes a treatment method that we are confident
cannot find any real discovery because we shift treatment labels completely at
random.
One explanation is the winner’s curse. It is well known that the stock or
mutual fund that did best last year is not likely to repeat this year. Also athletes
that had a super year are not necessarily going to dominate the next year.
These can be understood as regression to the mean https://en.wikipedia.
org/wiki/Regression_toward_the_mean.
This section is based on Lee and Shen (2018) who cite some prior work in
the area. Suppose that the B version in experiment j has true effect τj for
j = 1, . . . , J. We get $\hat\tau_j \sim N(\tau_j, \sigma_j^2)$. The central limit theorem may make the
normal distribution an accurate approximation here. Note that this $\sigma_j^2$ is really
what we would normally call $\sigma^2/n$, so it could be very small. Suppose that we
adopt the B version if $\hat\tau_j > Z^{1-\alpha_j}\sigma_j$. For instance $Z^{1-\alpha_j} = 1.96$ corresponds
to a one sided p-value below 0.025. Let $S_j = 1\{\hat\tau_j > Z^{1-\alpha_j}\sigma_j\}$ be the indicator
of the event that the experiment was accepted. Ignoring multiplicative effects
and just summing, the true gain from accepted trials is $\sum_{j=1}^J \tau_j S_j$ while the
estimated gain is $\sum_{j=1}^J \hat\tau_j S_j$. The estimated gain is over-optimistic on average by
$$E\Bigl(\sum_{j=1}^J (\hat\tau_j - \tau_j) S_j\Bigr) = \sum_{j=1}^J \int_{Z^{1-\alpha_j}\sigma_j}^{\infty} \frac{\hat\tau_j - \tau_j}{\sigma_j}\,\varphi\Bigl(\frac{\hat\tau_j - \tau_j}{\sigma_j}\Bigr)\,\mathrm{d}\hat\tau_j > 0,$$
where $\varphi(\cdot)$ is the $N(0, 1)$ probability density function. We know that the bias is
positive because the unconditional expectation is $E(\hat\tau_j - \tau_j) = 0$. Lee and Shen
(2018) have a clever way to show this. If $Z^{1-\alpha_j}\sigma_j - \tau_j > 0$ then the integrand
over the retained range is everywhere positive. If not, then the left out integrand
over $-\infty$ to $Z^{1-\alpha_j}\sigma_j$ is everywhere non-positive, so the part left out has to have
a negative integral, giving the retained part a positive one.
They go on to plot the bias as a function of τ for different critical p-values.
Smaller p-values bring less bias (and also fewer acceptances). The bias for τj > 0 is
roughly proportional to τj while for τj < 0 the bias is nearly zero. They go on to
estimate the size of the bias from given data and present bootstrap confidence
intervals on it to get a range of sizes.
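A short simulation, with illustrative values of τj and σj chosen by me, confirms that the selection bias E[(τ̂j − τj)Sj] is positive whether the true effect is negative, zero or positive.

import numpy as np

rng = np.random.default_rng(0)

def selection_bias(tau, sigma, z=1.96, nsim=10**6):
    # Monte Carlo estimate of E[(tau_hat - tau) S] with S = 1{tau_hat > z * sigma}.
    tau_hat = rng.normal(tau, sigma, size=nsim)
    S = tau_hat > z * sigma
    return np.mean((tau_hat - tau) * S)

for tau in (-0.02, 0.0, 0.02):
    print(tau, selection_bias(tau, sigma=0.01))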
While we are thinking of regression to the mean, we should note that it is a
possible source of the placebo effect. If you do a randomized trial giving people
either nothing at all, or a pill with no active ingredient (placebo) you might
find that the people getting the placebo do better. That could be established
causally via randomization. One explanation is that they somehow expected
to get better and this made them better or led them to report being better.
A real pill ought to be better than a placebo so an experiment for it could
test real versus placebo to show that the resulting benefit goes beyond possible
psychological effects.
If you don’t randomize then the placebo effect could be from regression to
the mean. Suppose people’s symptoms naturally fluctuate between better and
worse. If they take treatment when their symptoms are worse than average,
then by regression to the mean, the expected value of symptoms some time
later will be better than when they were treated. In that case, no psychological
effect is needed to explain the apparent improvement.
Bandit methods
When you say you’re going to do an A/B test somebody usually suggests using
bandit methods instead. And vice versa.
In the bandit framework you try to optimize as you go and ideally spend
next to no time on the suboptimal choice between A and B, or other options.
It can be as good as having only O(log(n)) tries of any sub-optimal treatment
in n trials.
We review some theory of bandit methods. The main point is to learn the
goals and methods and tradeoffs with bandits. We also go more deeply into
Thompson sampling proposed originally in Thompson (1933).
your understanding. So, get confused and then get out of it. Spotting and
resolving puzzlers is also a way to find research ideas.
Here is a quote from Paul Halmos about reading mathematics:
Don’t just read it; fight it! Ask your own questions, look for
your own examples, discover your own proofs. Is the hypothesis
necessary? Is the converse true? What happens in the classical
special case? What about the degenerate cases? Where does the
proof use the hypothesis?
It is good to poke at statistical ideas in much the same way, with a view to
which problems they suit.
3.2 Regret
These definitions are based on Bubeck and Cesa-Bianchi (2012). Suppose that
at time i = 1, 2, 3, . . . we have arms j = 1, 2, . . . , K to pick from. If at time i
we pick arm j then we would get Yi,j ∼ νj . Notice that the distribution νj here
is assumed to not depend on i. We let µj = E(Yi,j ) be the mean of νj and
$$\mu^* = \max_{1 \le j \le K} \mu_j \equiv \mu_{j^*}.$$
So µ∗ is the optimal expected payout and j∗ is the optimal arm (or one that is
tied for optimal).
If we knew the µj we would choose arm j∗ every time and get expected
payoff nµ∗ in n tries. Instead we randomize our choice of arm, searching for
the optimal one. At time i we choose a random arm Ji ∈ {1, 2, . . . , K} and get
payoff Yi,Ji . Because we choose just one arm, we do not get to see what would
have happened for the other K − 1 arms. That is, we never see Yi,j′ for any
j′ ≠ Ji, so we cannot learn from those values.
There are various ways to quantify how much worse off we are than optimal
play would be. The regret at time n is
$$R_n = \max_j \sum_{i=1}^n Y_{i,j} - \sum_{i=1}^n Y_{i,J_i}.$$
This is how much worse off we are compared to whatever arm would have been
the best one to use continually for the first n tries. Be sure that you understand
why Pr(Rn < 0) > 0 with this definition. A harsher definition is
$$\sum_{i=1}^n \max_j Y_{i,j} - \sum_{i=1}^n Y_{i,J_i}.$$
This is how much worse off we would be compared to a psychic who knew the
future data. It is not a reasonable comparison so it is not the focus of our study.
The expected regret is
$$E(R_n) = E\Bigl(\max_j \sum_{i=1}^n Y_{i,j} - \sum_{i=1}^n Y_{i,J_i}\Bigr),$$
and the pseudo-regret is
$$\bar R_n = \max_j E\Bigl(\sum_{i=1}^n Y_{i,j} - \sum_{i=1}^n Y_{i,J_i}\Bigr) = n\mu^* - \sum_{i=1}^n E(\mu_{J_i}).$$
Each time we move $\max_j$ outside of a sum or expectation, things get easier.
What is random in the $E(\cdot)$ of $\bar R_n$ is the sequence $J_1, \ldots, J_n$ of chosen arms.
Other authors call $\bar R_n$ the expected regret.
Now let $\Delta_j = \mu^* - \mu_j \ge 0$ be the suboptimality of arm j and define
$T_j(s) = \sum_{i=1}^s 1\{J_i = j\}$. This is the number of times that arm j was chosen in the first
s tries. Then
$$\bar R_n = n\mu^* - \sum_{i=1}^n E(\mu_{J_i}) = \sum_{j=1}^K E(T_j(n))\mu^* - \sum_{j=1}^K E(T_j(n))\mu_j = \sum_{j=1}^K E(T_j(n))\Delta_j.$$
Our pseudo-regret comes from the expected number of each kind of suboptimal
pull times its suboptimality. To derive this notice that
$$n = \sum_{j=1}^K T_j(n) = \sum_{j=1}^K E(T_j(n)).$$
Figure 3.1: Hypothetical confidence intervals for E(Y | A), E(Y | B) and E(Y |
C), drawn as one vertical interval per treatment on a scale from 0 to 1. (The
plot itself, titled ‘UCB method’, is not reproduced here.)
with B. Then its confidence interval would tend to get narrower with further
samples. Its center could also shift up or down as sampling goes on, tending
towards the true mean which we anticipate to be somewhere inside the current
confidence interval, though that won’t always hold. What could happen is that
the confidence interval converges on a value above the center for A but below
the upper limit for A. Then if A were really as good as its upper limit, we would
never sample it and find out. The same argument holds for sampling from C.
Now suppose that we sample from A. Its confidence interval will narrow and
the center could move up or down. If A is really bad then sampling will move
the mean down and narrow the confidence interval and it will no longer keep
the top upper confidence limit. We would then stop playing it, at least for a
while. If instead, A was really good, we would find that out.
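One concrete version of this idea is the classical UCB1 rule of Auer, Cesa-Bianchi and Fischer, which replaces an explicit confidence level with a sqrt(2 log(i)/Tj(i)) bonus. The sketch below is that standard rule, not something specific to these notes, and the pull() interface is a placeholder.

import numpy as np

def ucb1(pull, K, n):
    # UCB1: play the arm with the largest sample mean plus exploration bonus.
    # pull(j) should return a reward in [0, 1] for arm j.
    counts = np.zeros(K)
    sums = np.zeros(K)
    for j in range(K):                       # try every arm once to initialize
        sums[j] += pull(j)
        counts[j] += 1
    for i in range(K, n):
        bonus = np.sqrt(2.0 * np.log(i) / counts)
        j = int(np.argmax(sums / counts + bonus))
        sums[j] += pull(j)
        counts[j] += 1
    return counts                            # number of plays of each arm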
Given that we want to use the upper limit of a confidence interval, what
confidence level should we choose? The definitive sources on that point are
Gittins (1979), Lai and Robbins (1985) and Lai (1987). We can begin with a
finite horizon n, for instance the next n patients or visitors to a web page. At
step i we could use the 100(1 − αi )% upper confidence limit.
It is easy to choose the treatment for the n’th subject. We just take the
arm that we think has the highest mean. There is no n + 1’st subject to benefit
from what we learn from subject n. So we could take αn = 1/2. That would
be the center of our confidence interval (if it is symmetric). If we are picking
a fixed sequence α1 , . . . , αn then it makes sense to have αi increasing towards
0.5 because as time goes on, there is less opportunity to take advantage of any
learning. The αi should start small, especially if n is large.
A finite horizon might not be realistic. We might not know how many
subjects will be in the study. Another approach is to define the discounted
regret
$$\sum_{i=1}^\infty \bigl(\mu^* - E(\mu_{J_i})\bigr)\theta^{i-1}, \qquad 0 < \theta < 1.$$
This regret is the immediate regret plus θ times a similar future quantity
$$\mu^* - E(\mu_{J_1}) + \theta \sum_{i=1}^\infty \bigl(\mu^* - E(\mu_{J_{i+1}})\bigr)\theta^{i-1}.$$
Each factor $\mu_{J_i}^{Y_{i,J_i}}(1 - \mu_{J_i})^{1 - Y_{i,J_i}}$ is a likelihood contribution for $(\mu_1, \ldots, \mu_K)$
based on the conditional distribution of $Y_i = Y_{i,J_i}$ given $J_i$. There's a brief
discussion about using a conditional likelihood below.
This means that the posterior distribution has $\mu_j \stackrel{\mathrm{ind}}{\sim} \mathrm{Beta}(a_j + S_j, b_j + F_j)$.
This expression also shows that aj and bj can be viewed as numbers of prior
pseudo-counts. We are operating as if we had already seen aj successes and bj
failures from arm j before starting.
Figure 3.2 has pseudo-code for running the Thompson sampler for Bernoulli
data and beta priors. In this problem it is easy to pick arm j with probability
equal to the probability that µj is largest. We sample µ1 , . . . , µK one time each
and let J be the index of the biggest one we get.
Thompson sampling is convenient for web applications where we might not
be able to update Sj and Fj as fast as the data come in. Maybe the logs can
only be scanned hourly or daily to get the most recent (Ji , Yi,Ji ) pairs. Then
we just keep sampling with the fixed posterior distribution between updates. If
instead we were using UCB then we might have to sample the arm with the
highest upper confidence limit for a whole day between updates. That could be
very suboptimal if that arm turns out to be a poor one.
Puzzler: the UCB analysis is pretty convincing that we win by betting
on optimism. How does optimism enter the Thompson sampler? We get just
one draw from the posterior for arm j. That draw could be better or worse
than the mean. Just taking the mean would not bake in any optimism and
Initialize:
    Sj ← aj, Fj ← bj, j = 1, . . . , K        # aj = bj = 1 starts µj ∼ U[0, 1]
Run:
    for i ≥ 1:
        for j = 1, . . . , K:
            θj ∼ Beta(Sj, Fj)                 # make sure min(Sj, Fj) > 0
        J ← arg maxj θj                       # call it Ji if you plan to save them
        SJ ← SJ + Xi,J
        FJ ← FJ + 1 − Xi,J

Figure 3.2: Pseudo-code for the Thompson sampler with Bernoulli responses
and beta priors. As written it runs forever.
would fail to explore. We could bake in more optimism by letting each arm
take m > 1 draws and report its best result. I have not seen this proposal
analyzed (though it might well be in the literature somewhere). It would play
more towards optimism but that does not mean it will work better; optimism
was just one factor. Intuitively, taking m > 1 should favor the arms with less
data, other things being equal. Without some theory, we cannot be sure that
m > 1 doesn’t actually slow down exploration. Maybe it would get us stuck
in a bad arm forever (I doubt that on intuitive grounds only). If we wanted,
we could take some high quantile of the beta distributions but deciding what
quantile to use would involve the complexity that we avoided by moving from
UCB to Thompson. For Bernoulli responses with a low success rate, the beta
distributions will initially have a positive skewness. That is a sort of optimism.
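Here is Figure 3.2 translated into runnable Python. The function name, the pull() interface and the explicit horizon n are my additions; the figure's version runs forever.

import numpy as np

def thompson_bernoulli(pull, K, n, a=1.0, b=1.0, seed=0):
    # Thompson sampling for K Bernoulli arms; pull(j) returns 0 or 1.
    rng = np.random.default_rng(seed)
    S = np.full(K, a)              # prior successes (pseudo-counts)
    F = np.full(K, b)              # prior failures
    for i in range(n):
        theta = rng.beta(S, F)     # one posterior draw per arm
        j = int(np.argmax(theta))  # play the arm with the largest draw
        y = pull(j)
        S[j] += y
        F[j] += 1 - y
    return S, F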
Puzzler/rabbit hole: are we leaving out information about µj from the
distribution of Ji ? I think not, because the distribution of Ji is based on the past
Yi which already contribute to the conditional likelihood terms. A bit of web
searching did not turn up the answer. It is clear that if you were given J1 , . . . , Jn
it would be possible at the least to figure out which µj was µ∗ . But that doesn’t
mean they carry extra information. The random variables are observed in this
order:
J1 → Y1 → J2 → Y2 → · · · → Ji → Yi → · · · → Jn → Yn .
Each arrow points to new information about µ. The distribution of J1 does not
depend on µ = (µ1, . . . , µK). The likelihood is
$$p(J_1;\mu)\,p(y_1\mid J_1;\mu)\,p(J_2\mid J_1,y_1;\mu)\,p(y_2\mid J_2,J_1,y_1;\mu)\,p(J_3\mid y_2,J_2,J_1,y_1;\mu)\cdots p(y_n\mid J_n,\ldots,J_1,y_1;\mu).$$
Now in our Bernoulli Thompson sampler our algorithm for choosing J3 was
just based on a random number generator that was making our beta random
variables. That convinces me that p(J3 | y2 , J2 , J1 , y1 ; µ) has nothing to do with
µ. So the conditional likelihood is ok. At least for the Bernoulli bandit. Phew!
We can be confident that we are taking the best arm but we cannot get a good
estimate of the amount of improvement. For some purposes we might want to
know how much better the best arm is.
Maybe Wi = (Wi1, . . . , Wi,10) ∈ {0, 1}^10 because we have 10 decisions to
make for subject i. We could run a bandit with K = 2^10 arms but that is
awkward. An alternative is to come up with a model, such as Pr(Y = 1 | W) =
Φ(W^T β) for unknown β. Or maybe a logistic regression would be better. We
can place a prior on β and update it as data come in. Then we need a way
to sample a W with probability proportional to it being the best one. Some
details for this example are given in Scott (2010). These can be hard problems
but the way forward via Thompson sampling appears easier than generalizing
UCB. This setting has an interesting feature. Things we learn from one of the
1024 arms provide information on β and thereby update our prior on some of
the other arms.
For contextual bandits, we have a feature vector Xi that tells us something
about subject i before we pick a treatment. Now our model might be Pr(Y =
1 | X, W ; β) for parameters β. See Agrawal and Goyal (2013) for Thompson
sampling and contextual bandits.
In restless bandits, the distributions νj can be drifting over time. See Whittle
(1988). Clearly we have to explore more often in this case because some other
arm might have suddenly become much more favorable than the one we usually
choose. It also means that the very distant past observations might not be
relevant, and so the upper confidence limits or parameter distributions should
be based on recent data with past data downweighted or omitted.
Paired and blocked data, randomization inference
and now
$$t_{\mathrm{obs}} = \frac{\hat\beta_1 - \beta_1}{s\sqrt{\bigl((X^\mathsf{T}X)^{-1}\bigr)_{22}}}.$$
In order to get these pivotal inferences we need to make 4 assumptions:
1) εi are normally distributed,
2) var(εi ) does not depend on Wi ,
3) εi are independent, and
4) there are no missing predictors.
For the last one, we need to know that E(Yi) is not really β0 + β1 Wi + β2 Ui for
some other variable Ui.
Assumption 1 is hard to believe, but the central limit theorem reduces the
damage it causes. Assumption 2 can be serious but does little damage if $n_1 \doteq n_2$.
We can also just avoid pooling the variances and use $\sqrt{s_1^2/n_1 + s_2^2/n_2}$ in place
of $s\sqrt{1/n_1 + 1/n_2}$.
Assumption 3 is critical and violations can be hard to detect. Assumption 4
is even more critical and harder to detect. We almost don’t even notice we are
making an assumption about Ui because Ui is missing from equation (4.1).
Putting A’s first gave the worst bias. The alternating plan improved a lot, but
could have done very badly with some high frequency bias. The random plan
came out best. The bias will be $O_p(1/\sqrt{N})$ under randomization, whether the
Ui constitute a trend or an oscillation or something else.
Next, let’s consider what happens if there are correlations in the εi . We will
consider local correlations
$$\mathrm{corr}(Y_i, Y_{i'}) = \begin{cases} 1, & i = i' \\ \rho, & |i - i'| = 1 \\ 0, & \text{else.} \end{cases}$$
Now
$$\mathrm{var}(\hat\Delta) = \frac{1}{25}\, v^\mathsf{T}\mathrm{cov}(\varepsilon)\, v$$
where $v_i = 1$ for $W_i = 1$ and $v_i = -1$ for $W_i = 0$. Using $\sigma^2$ for $\mathrm{var}(\varepsilon_i)$, we get
$$\mathrm{var}(\hat\Delta) = \frac{2}{5}\sigma^2 + \frac{\sigma^2}{25} \times \begin{cases} 14\rho, & \text{A's first} \\ -18\rho, & \text{alternate} \\ 2\rho, & \text{random.} \end{cases}$$
The data analyst will ordinarily proceed as if ρ = 0, especially in small data sets
where we cannot estimate ρ very well. For the plants ρ could well be positive
or negative, making $\mathrm{var}(\hat\Delta)$ quite different from $2\sigma^2/5$.
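A quick numerical check of the display above, assuming N = 10 units with n1 = n0 = 5 (which is what the factor 1/25 suggests) and an illustrative ρ = 0.1:

import numpy as np

def var_delta_hat(v, rho, sigma=1.0):
    # var(Delta_hat) = v' cov(eps) v / 25 under the local correlation model.
    N = len(v)
    C = sigma**2 * (np.eye(N) + rho * (np.eye(N, k=1) + np.eye(N, k=-1)))
    return v @ C @ v / 25.0

rho = 0.1
a_first   = np.array([-1]*5 + [1]*5)      # v_i = +1 when W_i = 1, else -1
alternate = np.array([-1, 1]*5)
print(var_delta_hat(a_first, rho))        # 2/5 + 14*rho/25 = 0.456
print(var_delta_hat(alternate, rho))      # 2/5 - 18*rho/25 = 0.328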
Box et al. (1978) take the view that randomization makes it reasonably safe
to use our usual statistical models. A forthcoming book by Tirthankar Dasgupta
and Donald Rubin will, I expect, advocate for using the actual randomization
that was done to drive the inferences.
A B B A B A B A A B
A B B A A B A B B A
BHH contemplate very big differences between the kids. Suppose that some
are in the chess club while others prefer skateboarding. Figure 4.1 shows an
exaggerated simulated example of how this might come out. The left panel
shows that tread wear varies greatly over the 30 subjects there but just barely
between the treatments. The right panel shows a consistent tendency for tread
B to show more wear than tread A, though with a few exceptions.
The way to handle it is via a paired t-test. Let Di = Y1i − Y2i for
i = 1, . . . , n (so there are N = 2n measurements). Then do a one-sample t-test
for whether E(D) = ∆ where ∆ is ordinarily 0.
The output from a paired t-test on this data is
t = -2.7569, df = 29, p-value = 0.009989
95 percent confidence interval: -0.59845150 -0.08868513
with of course more digits than we actually want. The difference is barely
significant at the 1% level. An unpaired t-test on this data yields:
t = -0.2766, df = 57.943, p-value = 0.7831
95 percent confidence interval: -2.829992 2.142856
Figure 4.1: Hypothetical shoe wear numbers for 30 subjects and soles A versus B.
(The plot is not reproduced here; its left panel, ‘A solid, B open’, shows wear
against subject number, and its right panel, ‘B vs A’, plots each kid's wear under
B against their wear under A.)
and the difference is not statistically significant, with a much wider confidence
interval.
In this setting the paired analysis is correct or at least less wrong and that
is not because of the smaller p-value. It is because the unpaired analysis ignores
correlations between measurements for the left and right shoe of a given kid.
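The same phenomenon is easy to reproduce with simulated numbers (these are not the data behind Figure 4.1): large kid-to-kid variation hides a small treatment difference from the unpaired test but not from the paired one.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
kid = rng.normal(30.0, 5.0, size=n)                 # big kid-to-kid differences
wear_A = kid + rng.normal(0.0, 0.5, size=n)
wear_B = kid + 0.5 + rng.normal(0.0, 0.5, size=n)   # sole B wears a bit more
print(stats.ttest_rel(wear_B, wear_A))              # paired: the 0.5 difference stands out
print(stats.ttest_ind(wear_B, wear_A))              # unpaired: swamped by kid variation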
In class somebody asked what would be missing from the science table for
this example. We get both the A and B numbers. What we don’t get is what
would have happened if a kid who got A B had gotten B A instead. The
science table would have had a row like LA LB RA RB for each kid and
we would only see two of those four numbers. We would never get LA LB
for any of the kids. It is certainly possible that there are trends where left shoes
get a different wear pattern than right shows. Randomization protects against
that possibility.
If we model the (Y1j , Y2j ) pairs as random with a correlation of ρ and equal
variance σ 2 then our model gives
var(Dj ) = var(Y1j − Y2j ) = 2σ 2 (1 − ρ)
and we see that the higher the correlation, the more variance reduction we get.
Experimental design offers possibilities to reduce the variance of your data and
this is perhaps the simplest such example.
The regression model for this paired data is
Yij = µ + bj + ∆Wij + εij
where bj is a common effect from the j’th pair, ∆ is the treatment effect and
Wij ∈ {0, 1} is the treatment variable. This model forces the treatment differ-
ence to be the same in every pair. Then
Dj = Y1j − Y2j = (µ + bj + ∆W1j + ε1j ) − (µ + bj + ∆W2j + ε2j ) = ∆ + ε1j − ε2j .
4.4 Blocking
Pairs are blocks of size 2. We can use blocks of any size k > 2. They are very
suitable when there are k > 2 treatments to compare. Perhaps the oven can
hold k = 3 cakes at a time. Or the car has k = 4 wheels on it at a time.
If we have k = 3 treatments and block size of 3 we can arrange the treatments
as follows:
    B A C     C B A     B A C    · · ·    A B C
    block 1   block 2   block 3           block B
This is called the one-way ANOVA because it has only one treatment factor.
We will later consider multiple treatment factors. This model is not identified,
because we could replace µ by µ − η and αi by αi + η for any η ∈ R without
changing Yij . One way to handle that problem is to impose the constraint
$\sum_{i=1}^k n_i \alpha_i = 0$. Many regression packages would force α1 = 0. This model can
be written
$$Y_{ij} = \mu_i + \varepsilon_{ij}, \qquad \mu_i = \mu + \alpha_i,$$
which is known as the cell mean model. We can think of a grid of boxes or
cells µ1 µ2 · · · µk and we want to learn the mean response in each of
them.
The null hypothesis is that the treatments all have the same mean. That
can be written as
H0 : µ1 = µ2 = · · · = µk
or as
H0 : α1 = α2 = · · · = αk = 0.
The ‘big null’ is that L(Yi1 ) = L(Yi2 ) = · · · = L(Yil ) and that is what permuta-
tions test.
We can test H0 by standard regression methods. Under H0 the linear model
is just
$$Y_{ij} = \mu + \varepsilon_{ij}. \tag{4.4}$$
We could reject H0 by a likelihood ratio test if the ‘full model’ (4.3) has a
much higher likelihood than the ‘sub model’ (4.4). When the likelihoods in-
volve Gaussian models, log likelihoods become sums of squares and the results
simplify.
Here are the results in the balanced setting where ni = n is the same for all
i = 1, . . . , k. The full model has MLE
$$\hat\mu_i = \bar Y_{i\bullet} = \frac{1}{n}\sum_{j=1}^n Y_{ij}.$$
Source df SS MS F
Treatments k−1 SSB MSB = SSB/(k − 1) MSB/MSW
Error N −k SSW MSW = SSW/(N − k)
Total N −1 SST
Table 4.1: This is the ANOVA table for a one way analysis of variance.
The total sum of squares is equal to the sum of squares between treatment
groups plus the sum of squares within treatment groups. This can be seen
algebraically by expanding $\sum_{i=1}^k\sum_{j=1}^n (Y_{ij} - \bar Y_{i\bullet} + \bar Y_{i\bullet} - \bar Y_{\bullet\bullet})^2$. It is also just
Pythagoras (orthogonality of the space of fits and residuals) from a first course
in regression.
The F -test statistic based on the extra sum of squares principle is
$$F = \frac{\frac{1}{k-1}\bigl(\mathrm{SSE}_{\mathrm{null}} - \mathrm{SSE}_{\mathrm{full}}\bigr)}{\frac{1}{N-k}\,\mathrm{SSE}_{\mathrm{full}}} = \frac{\frac{1}{k-1}\bigl(\mathrm{SST} - \mathrm{SSW}\bigr)}{\frac{1}{N-k}\,\mathrm{SSW}} = \frac{\frac{1}{k-1}\,\mathrm{SSB}}{\frac{1}{N-k}\,\mathrm{SSW}} \equiv \frac{\mathrm{MSB}}{\mathrm{MSW}}.$$
Here, $N = \sum_i n_i = nk$ is the total sample size. When we divide a sum of squares
by its degrees of freedom the ratio is called a mean square. We should reject
the null hypothesis if MSB is large. The question ‘how large?’ is answered
by requiring it to be a large enough multiple of MSW. We reject H0 if p =
Pr(Fk−1,N −k > F ; H0 ) is small.
These notes assume familiarity with the simple ANOVA tables for regression
and the one way analysis of variance. Table 4.1 contains the ANOVA table for
this design. There are two sources of variation in this data: treatment groups
and error. Because there are k treatments there are k − 1 degrees of freedom.
There are ni − 1 degrees of freedom for error in each of the k treatment groups
for a total of $\sum_i (n_i - 1) = N - k$. There is often another column for the p-value.
The mean square column provides information on statistical significance.
The sum of squares column is about practical significance. For instance R2 =
SSB/SST is the fraction of variation explained by the model terms.
To see why we care about mean squares consider ε ∼ N (0, σ 2 IN ). This is
a vector of noise that can be projected onto a one dimensional space parallel
to (1, 1, . . . , 1) where it affects Ȳ•• = µ̂, a k − 1 dimensional space spanned
by between treatment differences Ȳi• − Ȳ•• where it affects SSB and an N − k
dimensional space of within treatment differences Yij − Ȳi•. If Yij were just
noise εij then we would have $N\hat\mu^2 \sim \sigma^2\chi^2_{(1)}$, $\mathrm{SSB} \sim \sigma^2\chi^2_{(k-1)}$ and $\mathrm{SSW} \sim \sigma^2\chi^2_{(N-k)}$,
all independent. A χ² random variable's mean equals its degrees of freedom and so we normalize
sums of squares into mean squares.
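For reference, here is a minimal sketch of the one way ANOVA F test computed directly from these sums of squares; the built-in scipy.stats.f_oneway gives the same answer.

import numpy as np
from scipy import stats

def one_way_anova(groups):
    # F test of equal treatment means from a list of response vectors.
    groups = [np.asarray(g, dtype=float) for g in groups]
    y = np.concatenate(groups)
    N, k = y.size, len(groups)
    grand = y.mean()
    ssb = sum(g.size * (g.mean() - grand) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    msb, msw = ssb / (k - 1), ssw / (N - k)
    F = msb / msw
    p = stats.f.sf(F, k - 1, N - k)      # Pr(F_{k-1, N-k} > F)
    return F, p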
Yij = µ + αi + bj + εij,    i = 1, . . . , k,    j = 1, . . . , B.
Note that this model does not include an interaction. The treatment differences
αi − αi0 are the same in every block j. All values in block j are adjusted up
or down by the same constant bj . We denote it by bj instead of βj because
we may not be very interested in block j per se. A block might be a litter of
animals or one specific run through of our laboratory equipment. In a surfing
competition it might be about one wave with three athletes on it. That wave is
never coming back so we are only interested in αi , and maybe how that wave
helps us compare αi for different i, but not bj .
The parameter estimates here are µ̂ = Ȳ•• , α̂i = Ȳi• − Ȳ•• , b̂j = Ȳ•j − Ȳ•• ,
and
ε̂ij = Yij − µ̂ − α̂i − b̂j = Yij − Ȳi• − Ȳ•j + Ȳ•• = (Yij − Ȳi• ) − (Ȳ•j − Ȳ•• ).
We should get used to seeing these alternating sign and difference of differences
patterns.
The ANOVA decomposition is
$$\mathrm{SST} = \mathrm{SSA} + \mathrm{SSB} + \mathrm{SSE},$$
where
$$\mathrm{SST} = \sum_{i=1}^k\sum_{j=1}^B (Y_{ij} - \bar Y_{\bullet\bullet})^2, \qquad \mathrm{SSA} = \sum_{i=1}^k\sum_{j=1}^B (\bar Y_{i\bullet} - \bar Y_{\bullet\bullet})^2,$$
$$\mathrm{SSB} = \sum_{i=1}^k\sum_{j=1}^B (\bar Y_{\bullet j} - \bar Y_{\bullet\bullet})^2, \qquad\text{and}\qquad \mathrm{SSE} = \sum_{i=1}^k\sum_{j=1}^B (Y_{ij} - \bar Y_{i\bullet} - \bar Y_{\bullet j} + \bar Y_{\bullet\bullet})^2.$$
The ANOVA table for it is in Table 4.2. You could write SSA as $\sum_{i=1}^k B(\bar Y_{i\bullet} - \bar Y_{\bullet\bullet})^2$
and that is definitely what you would do in a hand calculation. The way it is
written is more intuitive. All the sums of squares are sums over all data points.
We test for treatment effects via
$$p = \Pr\Bigl(F_{k-1,(k-1)(B-1)} > \frac{\mathrm{MSA}}{\mathrm{MSE}}\Bigr).$$
It is sometimes argued that one ought not to test for block effects. I don’t
quite understand that. If it turns out that blocking is not effective, then we
Source        df                SS     MS                            F
Treatments    k − 1             SSA    MSA = SSA/(k − 1)             MSA/MSE
Blocks        B − 1             SSB    MSB = SSB/(B − 1)             (∗)
Error         (k − 1)(B − 1)    SSE    MSE = SSE/((k − 1)(B − 1))
Total         N − 1             SST

Table 4.2: The ANOVA table for k treatments in B complete blocks.
could just not do it in the next experiment which might then be simpler to
run and have more degrees of freedom for errror. A test can be based on
MSB/MSE ∼ FB−1,(k−1)(B−1) .
The very old text books going back to 1930s place a lot of emphasis on
getting sufficiently many degrees of freedom for error. That concern is very
relevant when the error degrees of freedom are small, say under 10. The reason
can be seen by looking at quantiles of $F_{\mathrm{num,den}}$ such as $F^{0.995}_{\mathrm{num,den}}$ and $F^{0.005}_{\mathrm{num,den}}$
when the denominator degrees of freedom den is small. Check out qf in R, or
its counterpart in python or matlab. It is not a concern in A/B testing with
thousands or millions of observations.
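The python counterpart of qf is scipy.stats.f.ppf. For example, the upper 99.5% point of F with 3 numerator degrees of freedom falls sharply as the error degrees of freedom grow:

from scipy import stats

for den in (2, 5, 10, 30, 1000):
    # arguments are (probability, numerator df, denominator df)
    print(den, round(stats.f.ppf(0.995, 3, den), 2))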
You could have driver 1 (column) test car 1 with treatment A. Then driver
2 takes car 2 with B and so on through all 16 cases ending up with driver 4
taking car 4 with treatment C. Now if there are car to car differences they are
balanced out with respect to treatments. Driver to driver differences are also
balanced out. This design only lets one car and one driver be on the track at
once.
The model for this design is
$$Y_{ijk} = \mu + \underbrace{a_i}_{\text{row}} + \underbrace{b_j}_{\text{col}} + \underbrace{\tau_k}_{\text{trt}} + \underbrace{\varepsilon_{ijk}}_{\text{err}}.$$
k: 1 2 3 4 5 6 7
#: 1 1 1 4 56 9,408 16,942,080
Table 4.3: This is integer sequence number A000315 in the online encyclopedia
of integer sequences by Neil J. A. Sloane: https://oeis.org/A000315.
It does not allow for any interactions between cars and drivers, cars and batteries
or drivers and batteries. Later when we take a closer study of interactions we
will see that an interaction between cars and drivers could look like an effect of
batteries. If there are no significant interactions like this then a Latin square
can be an extremely efficient way to gather information. Otherwise it is risky.
Sometimes a risky strategy pays off better than a cautious one. Other times
not.
To use a Latin square we start with a basic Latin square, perhaps like this
one
A B C D
B C D A
C D A B
D A B C
and then randomly permute the rows and columns. We might as well also
permute the symbols in it. Even if that is not necessary, it is easy to do, and
simpler to just do it than think about whether you should. The above Latin
square is called a cyclic Latin square because the rows after the first are simply
their predecessor shifted left one space with wraparound.
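A sketch of that recipe in Python, starting from the cyclic square and permuting rows, columns and symbols (the function name is mine):

import numpy as np

def random_latin_square(k, seed=0):
    # Randomly permute the rows, columns and symbols of the cyclic k x k square.
    rng = np.random.default_rng(seed)
    cyclic = (np.arange(k)[:, None] + np.arange(k)[None, :]) % k
    square = cyclic[rng.permutation(k)][:, rng.permutation(k)]
    return rng.permutation(k)[square]       # relabel the k symbols

print(random_latin_square(4))               # entries 0..3, e.g. 0 = A, ..., 3 = D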
The number of distinct k × k Latin squares to start with is given in Ta-
ble 4.3. Two Latin squares are distinct if you cannot change one into the other
by permuting the rows and columns and symbols. The number grows quickly
with k. Be sure to permute the Latin square, especially if your starting pattern
is cyclic. The cyclic pattern will be very bad if there is a diagonal trend in the
layout. In many of the original uses the Latin square was made up of k 2 plots
of land for agriculture.
Not only are Latin squares prone to trouble with interactions, they also have
only a few degrees of freedom. With k^2 data points there are k^2 − 1 degrees of
freedom about the mean. We use up k − 1 of them for each of rows, columns
and treatments. That leaves k^2 − 1 − 3(k − 1) = (k − 1)(k − 2) degrees of freedom
for error.
Box et al. (1978, Chapter 8) provide a good description of how to analyze
Latin squares. I changed their car and driver example to have electric cars.
They give ANOVA tables for Latin squares and describe how to replicate them
in order to get more degrees of freedom for error. In a short course like this one,
we will not have time to go into those analyses.
the Latin letters (A, B, C, D) form a Latin square. So do the Greek letters
(α, β, γ, δ). These two Latin squares are mutually orthogonal meaning that
every combination of one Latin letter with one Greek letter appears the same
number of times (actually once). From two mutually orthogonal Latin squares
MOLS we get a Graeco-Latin square like the one shown.
We could use a Graeco-Latin square with treatments A, B, C and D blocked
out against three factors: one for rows, one for columns and one for Greek
letters. We are now in the setting of combinatoric existence and non-existence
results. For instance, no Graeco-Latin square exists for k = 6. Euler thought
there would be none for k = 10 but that was proved wrong in the 1950s.
The operational difficulties of arranging a real-world Graeco-Latin square
experiment are daunting. It is easy to do in software on the computer. You can
even do hyper-Graeco-Latin square experiments with three or more MOLS. For
instance if k is a prime number you can have k − 1 MOLS and then block out
k factors at k levels in addition to a treatment factor at k levels. Or you can
embed k^2 points into [0, 1]^(k+1) and have every pairwise scatterplot be a k × k
grid. We will see this later for computer experiments and space-filling designs.
Be sure to randomize!
Sometimes the number of levels in a block is less than the number of treat-
ments we have in mind. For instance, consider a club of people tasting 12
different wines where we don't want anybody to taste more than 6 of them. Then
we would like to arrange our tastings so that each person tastes 6 wines. Those
people then represent incomplete blocks.
In an ideal world, each pair of wines would be tasted together by the same
number of tasters. That would give us balanced incomplete blocks. This
makes sense because the best comparisons between wines A and B will come
from people who tasted both A and B. That is, from within block comparisons.
There will also be between block comparisons. For instance if many people
found A better than B and many found B better than C that provides evidence
(through a regression model) that A is better than C. But the within block
evidence from having A and C compared by the same people is more informative
if the block effects (people) are large.
In sporting leagues we have k teams and we compare them in games that are
(ordinarily) blocks of size B = 2. A tournament in which each pair of teams
played together the same number of times would be a balanced incomplete block
design.
There are also partially balanced incomplete block designs where the
number of blocks where two treatments are together is either λ or λ + 1. So,
while not equal, they are close to equal.
We will not consider how to analyze incomplete block designs. If you use
one in your project, the other topics from this course will prepare you to read
about them and adopt them.
There are even design strategies where one blocking factor has k levels and
another has fewer than k levels. So the design is incomplete in that second factor.
If you find yourself facing a situation like this, look for Youden squares.
Analysis of variance
This chapter goes deeper into the Analysis of Variance (ANOVA). We consider
multiple factors and we also introduce the notions of fixed and random effects.
Most of these notes assume familiarity with statistics and data analysis in order
to study how to make data more than how to analyze it. This chapter is a bit of
an exception. We will also extend the theory to cover what might be gaps in the
usual way regression courses are taught. ANOVA is a bit more complicated than
just running regressions with the categories coded up as indicator variables, and
so we will need to add some extra theory. Where possible, the additional theory
will be anchored in things that one would remember from an earlier regression
course.
Suppose that we have two categorical variables, say A and B. Each has two
levels. That makes four treatment combinations. We could just study it as one
categorical variable with four levels. However we benefit from working with the
underlying 2 × 2 structure. We have two choices to make (which A) and (which
B) and a third thing about interaction, which we will see includes considering
whether the best A depends on B and vice versa. Note that here A and B
are two different treatment options. In A/B testing A and B are two different
versions of one treatment option. Experimental design is complicated enough
that no single notational convention can carry us through. We have to use local
notation or else we would not be able to read the literature.
additivity and even consider something that looks like one term of a singular
value decomposition.
To illustrate how a two factor experiment works, consider the following hy-
pothetical yields for 3 fertilizers and 4 varieties of potatoes:
Yield (kg)    V1       V2       V3       V4
F1            109.0    110.9     94.2    125.9
F2            104.9    113.4    110.1    138.0
F3            151.8    160.9    111.9    145.0
Based on these values, we can wonder which fertilizer is best, which variety is
best, and the extent to which one decision depends on the other.
Taking the yield data to be a 3 × 4 matrix of Yij values, the overall average
yield is Ȳ•• = 123. If we think fertilizer i raised or lowered the yield it must be
about yields higher or lower than 123. So we can subtract 123 from all the Yij
PJ
and then take (1/J) j=1 (Yij − Ȳ•• ) = Ȳi• − Ȳ•• as the incremental effect of
fertilizer i. We get:
F1 F2 F3
−13.0 −6.4 19.4.
By this measure, fertilizer 1 lowers the yield by 13 while fertilizer 3 raises it by
19.4. The same idea applied to varieties yields Ȳ•j − Ȳ•• :
V1 V2 V3 V4
−1.1 5.4 −17.6 13.3.
Variety 3 underperforms quite a bit while variety 4 comes out best.
We have just computed the grand mean Ȳ•• = 123 and the main effects
for fertilizer and variety. Note that each of the main effects average to zero,
because of the way that they are constructed. We can decompose the table into
grand mean, main effects and a residual term, as follows:
$$
\begin{pmatrix} 109.0 & 110.9 & 94.2 & 125.9\\ 104.9 & 113.4 & 110.1 & 138.0\\ 151.8 & 160.9 & 111.9 & 145.0 \end{pmatrix}
=
\begin{pmatrix} 123 & 123 & 123 & 123\\ 123 & 123 & 123 & 123\\ 123 & 123 & 123 & 123 \end{pmatrix}
+
\begin{pmatrix} -13.0 & -13.0 & -13.0 & -13.0\\ -6.4 & -6.4 & -6.4 & -6.4\\ 19.4 & 19.4 & 19.4 & 19.4 \end{pmatrix}
$$
$$
+
\begin{pmatrix} -1.1 & 5.4 & -17.6 & 13.3\\ -1.1 & 5.4 & -17.6 & 13.3\\ -1.1 & 5.4 & -17.6 & 13.3 \end{pmatrix}
+
\begin{pmatrix} 0.1 & -4.5 & 1.8 & 2.6\\ -10.6 & -8.6 & 11.1 & 8.1\\ 10.5 & 13.1 & -12.9 & -10.7 \end{pmatrix}.
$$
The last term captures the extent to which the yields are not additive. It
is called the interaction. The grand mean, main effects and interactions we
want are defined in terms of E(Yij ). The ones we get are noisy versions of
those defined through Yij . Much of ANOVA is about coping with noise that
makes a sample table of Yij differ from a hypothetical population table of E(Yij ).
Perhaps more is about coping with interactions that complicate estimation and
interpretation of data.
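The decomposition above is easy to reproduce numerically. Here is a short numpy sketch using the potato yields.

import numpy as np

Y = np.array([[109.0, 110.9,  94.2, 125.9],
              [104.9, 113.4, 110.1, 138.0],
              [151.8, 160.9, 111.9, 145.0]])

grand = Y.mean()                                  # 123.0
fert = Y.mean(axis=1, keepdims=True) - grand      # fertilizer main effects
variety = Y.mean(axis=0, keepdims=True) - grand   # variety main effects
interaction = Y - grand - fert - variety          # the non-additive residual
print(grand, fert.ravel(), variety.ravel(), interaction, sep="\n")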
Here is another example motivated by agriculture. My notes say that I got
it from Cochran and Cox, which has gone through many editions, but I cannot
now find it in that book. It is about the yield of sugar in 100s of pounds
per acre. There’s an old unit called the ‘hundredweight’ which is about 45.4
kilograms. They considered two treatments. Treatment N involved either no
nitrogen, or application of 300 lbs of nitrogen per acre. Treatment D involved
either ploughing to the usual depth of 7” or going further to 11”. The results,
which might or might not be hypothetical are depicted in Figure 5.1.
When both variables are at their ‘low’ level the yield is 40.9. We see that
going to the high level of N raises the yield by either 7.8 if D is at the high level
(meaning greater depth) or 6.9 if D is at the low level. The overall estimate
of the treatment effect is then their average, roughly 7.4. Similarly, we see two
different effects for D, one at each of the high and low levels of N, that average
to about 2.
There seems to be a positive interaction of about 0.9 meaning that applying
both treatments gives more than we would expect from them individually. This
is a kind of synergy.
Experiment B Take n/2 observations at each of (0, 0), (0, 1), (1, 0) and (1, 1).
Experiment B costs only 2n. It has $\hat N = [(\bar Y_{10} - \bar Y_{00}) + (\bar Y_{11} - \bar Y_{01})]/2$ with
$$\mathrm{var}(\hat N) = \frac{1}{4}\Bigl(\frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2}\Bigr) = \frac{2\sigma^2}{n} = \mathrm{var}(\hat D).$$
The factorial experiment B delivers the same accuracy as the OAAT one at
half the cost. By that measure, it is twice as good. It is actually better than
twice as good. The factorial experiment can be used to investigate whether the
factors interact while OAAT cannot.
Experiment A is not the best OAAT that we could do. It misses an op-
portunity to reuse the data at (0, 0). We could also consider experiment C
below.
Experiment C Take n observations at (0, 0), and n more at (0, 1) and n more
at (1, 0).
Experiment C costs 1.5 times as much as experiment B and has the same
variance. It also makes corr(N̂, D̂) ≠ 0. So, while OAAT via experiment A is
only half as good as a factorial experiment OAAT via experiment C is 2/3 as
good. Since the observation at (0, 0) is used twice, we might want to give it
extra samples. Of course a better idea is not to do OAAT.
OAAT could leave us with the following information
    GOOD        ?
      ↑
    FAIR   →   GOOD
where two changes from our starting point are both beneficial but we don’t
know what happens if we make both changes. We would probably try that. We
might get a result like
    GOOD   →   BAD
      ↑          ↑
    FAIR   →   GOOD
hopefully by trying it out first before committing to it. In a case like this we
learn that making both changes is a bad idea, but we would have learned sooner
with a factorial experiment.
BAD ?
x
FAIR −→ BAD
where both changes are adverse. We don’t know what would happen if both
changes were made. Based on the sketch above, many people would not even
try making both changes. It is possible that the underlying truth is like this
BAD −→ BEST EVER
x x
FAIR −→ BAD
where making both changes would be extremely valuable. In a factorial experi-
ment we would learn this, while n OAAT it could very well go undiscovered.
5.3 Interactions
One severe problem with OAAT is that if there are important interactions, then
we don’t learn about them and might be forever stuck in a suboptimal setting.
Interactions cause severe difficulties. We can think of a failure of external va-
lidity as being an interaction between treatment choices (e.g., aspirin vs tylenol)
and another variable describing the past versus the future. Or that second vari-
able could be data in our study versus data we want to generalize to. It is bad
enough that ‘your mileage may vary’ and worse still that mileage differences
may vary. That could mean that the optimal choice changes between the data
we learned from and the setting where we will make future decisions.
Interactions underly lots of accidents and disasters. An accident might only
have happened because of a wet road, inattentive driver, bad street lights and
poor brakes. If all of those things were needed to create the accident then it
is a sort of interaction. It’s good that the accident is not in the grand mean
or main effects, because then we could get more of them, but having it be an
interaction makes it harder to prevent.
Andrew Gelman has written that interactions take 16 times as much data
to estimate as main effects do: https://statmodeling.stat.columbia.edu/
2018/03/15/need-16-times-sample-size-estimate-interaction-estimate-main-effect/
It is informative to see how that 16 arises. In a 2 × 2 experiment the main
effect estimate would have
σ2 σ2 4σ 2
var(Ȳ1• − Ȳ2• ) = + =
n/2 n/2 n
while the interaction would have variance
σ2 16σ 2
var(Ȳ11 − Ȳ01 − Ȳ10 + Ȳ11 ) ×4= .
n/4 n
If an interaction would have the same size as a main effect, then it would be 4
times as hard to estimate, meaning that we would need to raise n by a factor
of 4 to do as well.
Now suppose that the main effect is θmain and the interaction is θinter =
λθmain . We use sample size n for the main effect and get a relative error of
√
|θ | n|θmain |
p main = .
2
4σ /n 2σ
Yijk` = µ + αi + βj + γk + δ`
+ (αβ)ij + (αγ)ik + (αδ)i` + (βγ)jk + (βδ)j` + (γδ)k`
+ (αβγ)ijk + (αβδ)ik` + (αγδ)ik` + (βγδ)jk`
+ εijk` .
To make the model identifiable we make each main effect sum to zero and
each
PJ interaction sum to zero over all values of any index in it. For instance
j=1 (αβγ)ijk = 0 for all i and k.
It is easy to work out what the estimates are when εijk` ∼ N (0, σ 2 ). We get
µ̂ = Ȳ••••
1 XXX
α̂i = (Yijk` − µ̂) = Ȳi••• − Ȳ••••
JKL j
k `
[ = 1
XX
(αβ)ij (Yijk` − µ̂ − α̂i − β̂j )
KL
k `
and the others are similar. In each case we subtract sub-effect estimates and av-
erage over variables not in the interaction of interest. The results are differences
of certain data averages.
For d factors there are 2d − 1 non-empty subsets of them that all explain some
amount of variance P that sums to the total variance among all N P
values. We
ordinarily ignore ijk` µ̂2 which when added to the above gives us ijk` Yijk` 2
.
The reason that µ and µ̂ are of less interest is that the grand mean has no
impact on our choices of i or j or k or `.
Later we will look at the functional ANOVA. This was used by Hoeffding
(1948) to study U -statistics, by Sobol’ (1969) to study numerical integration
and by Efron and Stein (1981) to study the jackknife. We will use a more
abstract notation for it that simplifies some expressions. Suppose that x =
(x1 , x2 , . . . , xd ) where xj are independent random inputs. Now let Y = f (x).
If E(Y 2 ) < ∞ then var(f (x)) exists and there is a way to do an ANOVA on it.
We proceed by analogy. The grand mean is µ = E(f (x)). Then the main effect
for variable j is
We keep subtracting sub-effects and averaging over variables not in the inter-
action of interest. The xj do not have to be categorical random variables like
the levels of the factors we use above. They could be U[0, 1] random variables
or vectors, sound clips, images or genomes. All that matters is that they are
independent and that f (x) is real-valued with finite variance.
We will look at https://statweb.stanford.edu/~owen/mc/A-anova.pdf
later when we get to computer experiments.
5.5 Replicates
Suppose that we were interested in a four-factor interaction. We could take
R > 2 independent measurements at each ABCD combination. Then our model
would be
Yijk`r = µ + αi + βj + γk + δ`
+ (αβ)ij + (αγ)ik + (αδ)i` + (βγ)jk + (βδ)j` + (γδ)k`
(5.1)
+ (αβγ)ijk + (αβδ)ik` + (αγδ)ik` + (βγδ)jk`
+ (αβγδ)ijk` + εijk`r .
where 1 6 r 6 R.
Suppose that we are making soup. We could go to store i, buy vegetables
j, use recipe k, find person ` and have them taste the resulting pot of soup R
times for r = 1, . . . , R. Or, on R separate days we could go to all I stores, buy
all J vegetables at each store, try all K recipes on each set of vegetables from
each store, and ask all L people to try all IJK of those soups once.
These are clearly quite different things. Not all kinds of replicate are equal.
Equation (5.1) is a plausible model for the setting where the tasters taste each
pot of soup R times in a row. It is an entirely unsuitable model for the setting
where the whole experiment is completely redone R times. For that we would
at a minimum want to include a main effect ηr for replicate r. There could even
be a case for making the day of the experiment be its own fifth factor E with R
levels. What is going on here is that the meaning of r = 1 is different in the two
cases. In the first case it is just the first time one person tastes a given soup.
In the second case it is one entire full replication of the experiment.
Just looking at the file of N = IJKLR numbers we might not be able to
tell which way the experiment was done. We will think more about replicates
when they come up in specific examples.
where we suppose that there are known values in all the cells marked X. There
are (I − 1)(J − 1) of them. For any set of choices we make, the table can be
completed by first making row sums zero and them making column sums zero.
Had we omitted one of those (I − 1)(J − 1) values, there would not be a unique
way to complete the table.
Table 5.1 shows a portion of the ANOVA table for a four way factorial
experiment where each cell has R independent values. We see IJKL(R − 1)
degrees of freedom for error. We could get this by subtracting all of the other
degrees of freedom numbers from N − 1 = IJKLR − 1. Or we can view it as
gathering R − 1 degrees of freedom for error from each of I × J × K × L cells.
Source DF
A or α I −1
B or β J −1
.. ..
. .
AB or αβ (I − 1)(J − 1)
.. ..
. .
ABC or αβγ (I − 1)(J − 1)(K − 1)
.. ..
. .
ABCD or αβγδ (I − 1)(J − 1)(K − 1)(L − 1)
Error IJKL(R − 1)
Total IJKLR − 1
Table 5.1: Selected rows of the ANOVA table for a four way table where each
cell has R independent repeats.
It is clear from the above that these full factorial experiments are big and
bulky and hence probably expensive. If I = J = K = L = 11 then we need
N = 114 R = 14,641R observations. They will give us 10 df in each main effect,
100 per two factor interaction, 1000 df for each three factor interaction and
10,000 df in the four factor interaction which could be the least useful of them
all.
Next we will look at what happens when all factors are at 2 levels. We
still need N = 2d data points for d factors or R2d if we have replicated the
experiment.
A common practice is to design an experiment that learns about the main
effects and low order interactions partially or completely ignoring the high order
interactions. To do this seems a bit hypocritical. We earlier argued that OAAT
is bad because it can miss the two factor interactions and now we are getting
ready to possibly ignore some other higher order interactions.
This common practice is a gamble. In statistics, we are usually against
gambling and favor a cautious approach that covers all possibilities. However the
cautious approach is really expensive and could be suboptimal. Experimental
design has room for both bold and cautious choices. With a bold choice we do a
small experiment that could be inexpensive or fast. If it goes well, then we learn
more quickly. If it goes badly, then the experiment might not be informative at
all and we have to do another one.
The bold strategies we will look at are motivated by a principle of factor
sparsity. This holds that the important quantities are mostly lower order and
perhaps even many of the main effects are unimportant. Or at least relatively
unimportant. If there are 210 − 1 effects and interactions they cannot all be
relatively important! There is a related bet on sparsity principle (Friedman
et al., 2001) motivating the use of the lasso. If things are sparse, you win. If
ind
Also, if Qj ∼ χ2(nj ) then
Q1 /n1
F = ∼ Fn1 ,n2
Q2 /n2
and this is the very definition of the F distribution. The F distribution has
numerator and denominator degrees of freedom and we write Fnum,den for the
general case.
Now we consider noncentral distributions. These appear less often in intro-
ductory courses. The main place we need them is in power calculations, such
as choosing a sample size. Choosing a sample size is maybe the simplest design
problem. It does not however come up if we are just looking at pre-existing
data sets.
ind
If Zi ∼ N (µi , 1) then
n
Zi2 ∼ χ0,2
X
(n) (λ)
i=1
Pn
where λ = i=1 µ2i . This is the noncentral χ2 distribution on n degrees of free-
dom with noncentrality parameter λ. There are alternative parameterizations
out there so we always have to check books, articles and software documentation
to see which one was used. Note particularlyPthat the n means µi only affect
the distribution through their sum of squares i µ2i . That extremely convenient
fact depends on the Gaussian distribution chosen for Zi .
If Q1 ∼ χ0,2 2
(n1 ) (λ) and Q2 ∼ χ(n2 ) then
Q1 /n1
F0 = ∼ Fn0 1 ,n2 (λ).
Q2 /n2
This is the noncentral F distribution. Here is how we use it. Our F statistics
will have central numerators under H0 but noncentral ones under HA . They
will usually have central denominators under both because most of our denom-
inators will involve sums of squares of differences of noise. If our noise model
is wrong then the denominator might be noncentral. That would leave us with
a doubly noncentral F distribution. That second noncentrality would make
the denominator bigger and hence the F statistic smaller and that implies lower
statistical power to reject H0 .
Next we look at the distributions of sums of squares for simple one way
layout with Yij = µ + αi + εij for i = 1, . . . , I and j = 1, . . . , n. Recall from
iid
introductory regression that when Xi ∼ N (µ, σ 2 ) then
n
X
(n − 1)s2 ≡ (Xi − X̄)2 ∼ σ 2 χ2(n−1) .
i=1
ind
Also if Qi ∼ χ2(ni ) then
X X
Qi ∼ χ2(N ) N= ni .
i i
Now
n
I X n
I X
X X 2
SSE = (Yij − Ȳi• )2 = (µ + αi + εij ) − (µ + αi + ε̄i• )
i=1 j=1 i=1 j=1
n
I X I
iid
X X
2
= (εij − ε̄i• ) = Qi for Qi ∼ σ 2 χ2(n−1)
i=1 j=1 i=1
2 2
∼ σ χ (I(n − 1)).
Therefore
I I
X ε̄i• 2 n X 2
√ ∼ χ0,2 (λ) for λ = α .
i=1
σ/ n σ 2 i=1 i
Notice that the degrees of freedom drop by one for centered ε̄i• . I have not
found a nice way to get this using just things one might remember from an
introductory regression class plus the noncentral distribution definitions.
Now let’s look at our F statistic, under H0 :
1
MSA I−1 SSA
F = = 1
MSE I(n−1) SSE
1 2 2
I−1 σ χ(I−1)
∼ 1 2 2
I(n−1) σ χ(I(n−1))
1 2
I−1 χ(I−1)
= 1 2
I(n−1) χ(I(n−1))
= FI−1,I(n−1) .
Under HA ,
I
0 n X 2
F ∼ FI−1,I(n−1) (λ) λ= α .
σ 2 i=1 i
Under HA , λ > 0 and the larger λ is the larger E(MSA) is. Thus HA tends to
increase F and so we should reject H0 for unusually large values. We reject H0
at level α when the observed value Fobs satisfies
1−α
Fobs > FI−1,I(n−1)
We set a p-value of
p = Pr(FI−1,I(n−1) > Fobs )
Yij = µ + ai + εij
2 2
where ai ∼ N (0, σA ) and εij ∼ N (0, σE ) are all independent. We might be
2 2 2
interested in learning σA /σE and the usual null hypothesis is H0 : σA = 0.
The ANOVA is exactly the same as before. That is SST = SSB + SSW or
SST = SSA + SSE are the two different notations we have used for the one
factor ANOVA. This identity is just algebraic so it holds for any numbers we
would put into it. Using methods like those in Section 5.7 we find that
2 2 2 2
SSW ∼ σE χI(n−1) and SSB ∼ n(σA + σE /n)χ2(I−1) .
Table 5.2: Expected values of the mean squares in a mixed effects model.
2 2 2
χ2I−1
MSA ∼ σE + nσAB + nJσA
I −1
2
2 2 2
χ J−1
MSB ∼ σE + nσAB + nIσB
J −1
2 2
χ2(I−1)(J−1)
MSAB ∼ σE + nσAB ) , and
(I − 1)(J − 1)
2
χ2IJ(n−1)
MSE ∼ σE .
IJ(n − 1)
If we use FA = MSA/MSE, then we have a problem. It won’t have the F
2 2
distribution if σA = 0 but σAB > 0. What we must do instead is test it via
2
FA = MSA/MSAB. Similarly we test σB = 0 using FB = MSB/MSAB and we
2
test σAB = 0 using FAB = MSAB/MSE.
Now suppose that A is a fixed effect while B is a random effect. We are
in for a bit of a surprise. This is called a mixed effects model. Table 5.2
show the expected values of the mean squares for this case. What we see is that
we should test the fixed effect A via the ratio MSA/MSAB and the random
effect B via the ratio MSB/MSE. If the other effect is fixed use MSE in the
denominator while if the other effect is random use MSAB. It is the ‘other
effect rule’ for external validity.
To understand this result intuitively let’s consider an extreme example. Sup-
pose that the fixed effect A is about 3 headache medicines. We have 12 subjects
in a random effect B. Each subject tests each medicine n times.
Let’s exaggerate and suppose that n = 106 . It naively looks like we have 36
million observations. But we only have 12 people in the data set. If we did let
n → ∞ then we would have a 3×12 table of exact means. Our sample size would
actually be 36. With a small sample n our data have to be less useful than those
36 values would have been. For purposes of learning the effect of 3 medicines
on a population we have something less informative than 36 observations and
nothing like 36,000,000 observations worth of information.
Exercise: work out the limit as n → ∞ of the test for A.
Suppose that we have k factors to explore. If we go with 2 levels each that yields
2k different treatment combinations which is the smallest possible product of k
integers larger than 1. A variable could originally be binary, such as a choice
between two polymers, or the choice between doing or not doing a step in a
process. Or the variable could have more level or even take on a continuum
of levels such as an oven temperature. For such a variable we could select two
levels such as 1020◦ C and 1050◦ C.
When we choose two levels for a continuous variable, some judgment must
be exercised. If the effect of that variable on the response is monotone or even
linear then we will have the greatest statistical power using widely separated
values. The flip side is that we should avoid closely spaced values because then
there will be low power and we might not get good relative accuracy on the
effect. Then if we suspect a variable is not important we might use more widely
separated values just to be able to check on that. If we know a good operating
range then spanning that range makes sense for ‘external validity’ reasons. If
we space things too far apart then our experimental analysis might give that
variable too much importance compared to the others. The possibilities that we
want to study might not be perfectly orthogonal. For instance if the experiment
has an oven at two temperatures for two different time periods, then either the
(LO,LO) combination or the (HI,HI) combination might be unsuitable. In that
case one can use sliding levels such as 30 vs 50 minutes when the temperature
is high and 45 vs 75 minutes when it is low. Wu and Hamada (2011) discuss
these tradeoffs.
If we suspect curvature then two values will not be enough. We will look at
‘response surface’ designs later that let us estimate curved responses.
If we see each treatment combinationn times then we need to gather n2k
71
72 6. Two level factorials
Source df
Source df
TRTs 2k − 1
TRTs 2k − 1
DAYS n−1
ERR 2k (n − 1)
ERR (n − 1)(2k − 1)
TOT n2k − 1
TOT n2k − 1
Table 6.1: Left: Randomized blocks ANOVA table. Right: completely ran-
domized design ANOVA table.
data points.
ABCD Y
− − −− (1)
+ − −− a
− + −− b
+ + −− ab
− − +− c
+ − +− ac
− + +− bc
+ + +− abc
− − −+ d
+ − −+ ad
− + −+ bd
+ + −+ abd
− − ++ cd
+ − ++ acd
− + ++ bcd
+ + ++ abcd
Y A B AB
(1) b − − − + + −
a ab + + − + − +
Table 6.3: The first square diagrams Y for a 22 experiment. The next ones
show which observations are at high and low levels of A, B and AB.
of C. When everything is at the low level we use the symbol ‘(1)’ instead of a
blank or null string. We will be using some multiplicative formulas in which (1)
makes a natural multiplicative identity.
The treatment effect for A is defined to be the average of E(Y ) over 2k−1
observations at the high level of A minus the average of E(Y ) over 2k−1 obser-
vations at the low level of A. It is called αA and its estimate α̂A is the average
of observed Y at the high level of A minus the average of observed Y at the low
level of A. For instance with k = 3 and no replicates
1 1
αA = E(Ya ) + E(Yab ) + E(Yac ) + E(Yabc ) − E(Y(1) ) + E(Yb ) + E(Yc ) + E(Ybc ) and
4 4
1 1
α̂A = Ya + Yab + Yac + Yabc − Y(1) + Yb + Yc + Ybc .
4 4
If there are replicates, we could call the obvservations Ya,j and Yb,j for j =
1, . . . , n et cetera.
This is different from what we had before. With k = 1 we used to have
E(Y1j ) = µ + α1 and E(Y1j ) = µ + α2 with α1 + α2 = 0. With that notation
E(Y2j ) − E(Y1j ) = α2 − α1 = 2α2 . Now we have E(Ya,j ) − E(Y(1),j ) = αA . That
is αa = 2α2 . In a 21 experiment we write
1 1
Ya,j = µ + αa + εa,j and Y(1),j = µ − αa + ε(1),j .
2 2
Because interactions have one degree of freedom, we can give them a sign
too. The AB interaction is the effect of A at the high level of B minus the effect
of A at the low level of B. We can write it as
Just like other main effects and interactions, half the data are at the high level
of ABC and half are at the low level.
Because each effect has one degree of freedom, we can use a t test for it
instead of an F test. The t statistic for H0 : αa = 0 is
α̂
√ a ,
s/ 2k−2 n
where s is the standard error. The degrees of freedom are (n − 1)(2k − 1) if the
experiment was run in n blocks of 2k runs (and the model should then include
a block effect) and the degrees of freedom are n2k − 1 if it is a completely
randomized allocation.
When this factor sparsity is in play, then it would be wasteful to take n > 1
replicate instead of inspecting additional factors. If you replicate and leave out
the most important factor, that seemingly safe choice can be very costly.
It is common to see that the important interactions in a QQ plot involve the
important main effects. We might see α̂A , α̂B , α̂C and α̂BC as the outliers. In
class we saw such a QQ plot in a the fractional factorial class.
Daniels’ article closes with an interesting comment about why we just ana-
lyze one response at a time:
The third set of questions plaguing the conscience of one ap-
plied statistician concerns the repeated use of univariate statistical
methods instead of a single multivariate system. I know of no indus-
trial product that has but one important property. And yet, mainly
because of convenience, partly because of ignorance, and perhaps
partly because of lack of full development of methods applicable to
industrial research problems, I find myself using one-at-a-time meth-
ods on responses, even at the same time that I derogate a one-at-a-
time approach to the factors influencing a response. To what extent
does this simplification invalidate the judgments I make concerning
the effects of all factors on all responses?
Factor sparsity supports a concept called design projection. Suppose that
in a 23 experiment that factor A is completely null: αa = αab = αac = αabc = 0.
Then our experiment is just like a 22 experiment in B and C with two replicates,
depicted as follows:
A B C B C B C
−1 −1 −1 −1 −1 −1 −1
−1 −1 +1
−1 +1
−1 −1
−1 +1 −1
+1 −1
−1 +1
−1 +1 +1
−→
+1 +1
=
−1 +1
.
+1 −1 −1
−1 −1
+1 −1
+1 −1 +1
−1 +1
+1 −1
+1 +1 −1 +1 −1 +1 +1
+1 +1 +1 +1 +1 +1 +1
In the rightmost array the runs are reordered to show the replication. The
same thing happens if either B or C is null. In the multi-response setting that
Daniels’ was concerned with, there might be a different null factor for each of
the responses being measured.
In practice it is unlikely that A is perfectly null but it might be relatively null
with some other |α|’s being orders of magnitude larger than the ones involving A.
It could be tricky to decide whether A is null and would involve the issues raised
in the next section on factorial experiments with no replicates.
y M A B C AB AC BC ABC
(1) + − − − + + + −
a + + − − − − + +
b + − + − − + − +
ab + + + − + − − −
c + − − + + − − +
ac + + − + − + − −
bc + − + + − − + −
abc + + + + + + + +
Y = Xβ + ε ∈ RN
b ab (1) a
c ac bc abc
d ad cd acd
bcd abcd bd abd
Table 6.5: The 8 observations on the left are at the high level of BCD. The 8
observations on the right are at the low level of BCD.
write it as
1
BCD = (a + 1)(b − 1)(c − 1)(d − 1)
2k−1
1
= k−1 (ab + b − 1 − b)(cd − c − d + 1)
2
= ···
1
= k−1 (b + c + d + bcd + ab + ac + ad + abcd)
2
1
− k−1 ((1) + bc + cd + bd + a + abc + acd + abd).
2
The high and low levels of BCD are depicted in Table 6.5. What we see are
that bcd is at the high level. Anything that is missing an odd number of those
letters is at the low level. Anything missing an even number is at the high level
(just like zero which is even). The presence of another factor like A does not
change it.
6.6 Blocking
We might want to (or have to) run our 2k experiment in blocks. One common
choice has blocks of size 2k−1 . So our experiment has 2 blocks of size 2k−1 .
Blocking could be done for greater accuracy (balancing out some source of noise)
or we might be forced into it: the equipment holds exactly 8 pieces, no more.
The best plan for 2 blocks is run one block at the high level of the k-factor
interaction and one at the low level. For k = 3 it would be
a b c abc vs 1 ab ac bc
with run order randomized within the blocks. With this choice the ABC inter-
action is confounded with the blocks. The effects for A, B, C, AB, BC and AC
are orthogonal to blocks (because they’re orthogonal to ABC).
If we only have these 3 factors we might replicate the block structure as
follows:
abc ab abc ab abc ab
a ac a ac a ac
b bc b bc b bc
c 1 c 1 c 1
Source df
Replicates 2
Blocks = ABC 1
Blocks × replicates 2
A, B, C, AB, AC, BC 1 each
Error 12
Total 23
Table 6.6: This is the ANOVA table for a blocked 23 experiment with n = 3
replicates each having 2 blocks.
Fractional factorials
81
82 7. Fractional factorials
Interaction order: 0 1 2 3 4 5 6 7
7 7 7 7 7 7 7 7
Degrees of freedom: 0 1 2 3 4 5 6 7
Degrees of freedom: 1 7 21 35 35 21 7 1
Cost $1k $7k $21k $35k $35k $21k $7k $1k
Table 7.1: For a hypothetical 27 experiment where each run costs $1000, the
table shows how much of the budget goes to interactions of each size.
on the grand mean and main effects along with $59,000 on interactions of order
three and higher.
Let’s do 2k−1 runs, i.e., half of the full 2k experiment. We will get confounding.
To make the confounding minimally damaging, we will confound “done” versus
“not done” with the k-fold interaction.
For k = 3 we could do the 4 runs at the high level of ABC:
I A B AB C AC BC ABC
a + + − − − − + +
b + − + − − + − +
c + − − + + − − +
abc + + + + + + + +
We see right away that the intercept column I is identical to the one for ABC.
We say that the grand mean is aliased with ABC. It is also confounded with
ABC. We then get
E(µ̂) = µ + αABC .
E(α̂A ) = αA + αBC .
I A B AB C AC BC ABC
a + + − − − − + +
b
+ − + − − + − +
c
+ − − + + − − +
abc
+ + + + + + + +
bc
+ − + − + − + −
ac
+ + − − + + − −
ab + + + + − − − −
(1) + − − + − + + −
We could do either the top half, with ABC = I or the bottom half with ABC =
−I. That is ABC = ±I give a 23−1 experiment. If we use the bottom half we
get
E(α̂A ) = αA − αBC .
If any of the k factors in our 2k−1 factorial experiment are “null” then we
get a full 2k factorial experiment in the remaining ones. Practially, null means
“relatively null” in comparison to some larger effects.
Geometrically the sampled points of a 23−1 experiment lie on 4 corners of
the cube {−1, 1}3 (or if you prefer [0, 1]3 ). If we draw in the edges on each face
of the cube, no pair of sampled points share an edge.
A 2k−1 design looks like one block of a blocked 2k where the blocks are
defined by high and low levels of the k-factor interaction. If we have just done
the ABC · · · Z = I block we could analyze it as a 2k−1 design and perhaps
decide that we have learned what we need and don’t have to do the other block.
the main effect and n/2 for the others). For k = 2 we get
and for k = 230 it would only take 30 of these operations to compute 230 effects
of interest (most likely in a computer experiment). In a fractional factorial we
need to account for the aliasing.
Here it is (or at least part of it) for k = 3:
The digression on Yates’ algorithm demystifies the axis label in Figure 7.1.
There we see that main effects for ‘back stop’, ‘fixed arm’, ‘moving arm’ and
‘bucket’ all have positive values clearly separated from the noise level. The
effect for ‘front stop’ appears to be negative but is not clearly separated from
the others. The reference line is based on a least squares fit to the 11 smallest
effects (F and the interactions). Effects are named after their lowest alias.
In this instance there was a clear break between large effects and most small
effects and some reasonable doubt as to whether F belongs with the large or
the small ones. In many examples in the literature some main effects end up
in the bulk of small effects and a handful of two factor interactions show large
effects. Usually one or both of the interacting factors appears also as a large
main effect.
One reason to be interested in statistical insignificance is that when it hap-
pens we clearly do not know the sign of the effect, even if we’re certain that an
exact zero cannot be true. Statistical significance makes it more reasonable that
you can confidently state the sign of the effect. There can however be doubt
about the sign if the confidence interval for the effect has an edge too close to
zero. This could happen in a low power setting. See Owen (2017).
Here is a naive regression analysis of the data.
Coefficients:
Value Std. Error t value
(Intercept) 180.96 7.441 24.318
Front -13.94 7.441 -1.874
Back 31.29 7.441 4.205
Fixed 36.59 7.441 4.918
F = Front T
B = Back
1 X = Fixed X
M = Moving
B
T = Bucket
MT
Normal Quantiles
BX
BM
0 BT
XT
FM
FX
FT
-1 XM
FB
-20 0 20 40 60 80 100
yate(yate(yate(yate(pult3o$Dist))))[2:16]/8
ABCDEF × ABCDE = F.
We certainly don’t want to alias one of our main effects to the grand mean.
A better choice is to set ABCD = I and CDEF = I. We then also get
ABCD × CDEF = ABEF = I. Then we find that
and so we get
The cited books have tables of design choices for fractional factorial ex-
periments. Those tables can run to several pages. Probably nobody actually
memorizes or even uses them all but it is good to have those choices. One of
the key quantities in those designs is the “resolution” that we discuss next.
7.4 Resolution
The resolution of a fractional factorial experiment is the number of letters in
the shortest word in the defining relation. A defining relation is an expression
like ABCD = ±I. In this case the word is ABCD (not I!) and it has length 4.
The quarter fraction example from the previous section had defining relations
ABCD = I, CDEF = I and ABEF = I, so the shortest one is 4. In a 2k−1
fraction with defining relation setting the k-factor interaction equal to ±I the
shortest work length is k.
When there are k factors of two levels each and we have 2k−p runs in a
fractional factorial of resolution R, then the design is labeled 2k−p
R . Resolution
is conventionally given in Roman numerals. The three most important ones are
III, IV and V .
For resolution R = III, no main effects are aliased/confounded with each
other. Some main effects are confounded with two factor interactions.
For resolution R = IV , no main effects are confounded with each other,
no main effects are confounded with any two factor interactions, and some two
factor interactions are confounded with each other.
For esolution R = V , no main effects are confounded with each other, no
main effects are confounded with two factor interactions and no two factor in-
teractions are confounded with each other. However, some main effects are
confounded with four factor interactions, and some two factor interactions con-
founded with three factor interactions.
Table 7.4 has an informal summary of what these resolutions require. Res-
olution V has the least confounding but requires the most expense. Resolution
III has the least expense but could be misleading if we do not have enough
factor sparsity. Resolution IV is a compromise but it requires careful thought.
Some but not all of the two factor interactions may be aliased with others. We
can check the tables or defining relations to see which they are. If we have good
knowledge or guesses ahead of time we can keep the interactions most likely to
be important unconfounded with each other. Similarly, after the experiment
a better understanding of the underlying science would help us guess which
interaction in a confounded pair might actually be most important.
For a specific pattern 2k−p
R there can be multiple designs and they are not
all equally good. For instance with R = IV we would prefer a design with the
smallest number of aliased 2F I’s (two factor interactions). If there’s tie we
would break it in favor of the design with the fewest aliased 3F Is. Carrying on
until the tie is broken we reach a minimum aberration design. An investigator
would of course turn to tables of minimum aberration designs constructed by
researchers who specialize in this.
Table 7.4: Informal synopsis of what we you need for the three most common
resolutions.
For the 2k−1 experiment with the k-fold interaction aliased to ±I we find
that R = k. So k = 3 gives 23−1 4−1 5−1
III , k = 4 gives 2IV and k = 5 gives 2V .
There is a projection property for resolution R. If we select any R −1 factors
then we get all 2R−1 possible combinations the same number of times. That
has to be 2k−p /2R−1 times, that is 2k−p−R+1 times.
I A B AB C AC BC D=ABC
a + + − − − − + +
b
+ − + − − + − +
c
+ − − + + − − +
abc
+ + + + + + + +
bc
+ − + − + − + −
ac
+ + − − + + − −
ab + + + + − − − −
(1) + − − + − + + −
There is a special setting where we can dispense with any concern over
interactions. The classic example is when we are weighing objects. The weight
of a pair of objects is simply the sum of their weights, with interactions being
zero. Experimental designs tuned to a setting where interactions are known to
be impossible are called weighing designs.
A special kind of weighing design is known as Plackett-Burman designs.
They exist when n = 4m for (most) values of m, and so n does not have to be
a power of 2. We will see them later as Hadamard designs.
E(α̂A ) = αA + αBC ± · · · .
In the second part of our experiment, we could flip the signs of the A assignments
but leave everything else unchanged. In that second half we will have
E(α̂A ) = αA − αBC ∓ · · · .
1
α̂A = first expt α̂A + second expt α̂A
2
1
E(α̂A ) = (αA + αBC + · · · ) + (αA − αBC − · · · ) = αA .
2
We get rid of any aliasing for A. Of course if B had looked important after the
first 1/4 we could have flipped B. We would not have to know which one is
important before doing the first 1/4 of the factorial.
A second kind of followup is the foldover. We flip the signs of all the
factors. (Note that we cannot and don’t attempt to flip the intercept.) Foldovers
deconfound an main effect from any 2FI that it was confounded with. To see
this, suppose that A = BC in the first experiment. Flipping signs of everything
makes (−A) = (−B)(−C). That means −A = BC or A = −BC. Then
1h i
E(α̂A ) = (αA + αBC + · · · ) + (αA − αBC + · · · ) = αA + · · · .
2
Y vs Jittered A
●
●
●
15 ●
● ●
● ●
10
Y
●
●
●●
5
●
●
●
Figure 7.2: Exaggerated plot when A is a strong main effect and there is a
second strong main effect.
Y vs Jittered A
●●●
15
●●
● ●●
●
●
●●
10
Y
5
0
●
●
●
Figure 7.3: Exaggerated plot when A is a strong main effect and it has a strong
interaction with one or more other factors.
93
94 8. Analysis of covariance and crossover designs
where
k n
1 XX
x̄•• = xij .
nk i=1 j=1
P
This way µ = E(ȳ•• ). We also still impose i αi = 0.
We can similarly add the (xij − x̄•• )T β term to any of our other models:
blocked designs, Latin squares, factorials and fractional factorials. For instance,
in a randomized block experiment we would fit
Yij = α + µi + εij
Fleiss gives four significant figures for most numerical values in this example.
These data lead to an estimated benefit for treatment 2 of 0.1587. A t-test
(pooling the variance estimates) gives t = 3.56 which is significant enough.
[Exercise: figure out the degrees of freedom and the p-value.]
Prior to the treatment, however, the subjects started off as follows:
Trt n Mean s.dev.
1 74 0.6065 0.2541
2 64 0.5578 0.2293
So group 2 started out with an advantage of 0.0487. That is roughly one third
of the apparent post-treatment gain.
We could reasonably want to get a sharper estimate of the treatment benefit.
In some problems it might be enough to simply be confident about which of two
treatments is best. For other purposes it is important to estimate how much
better that treatment is.
The table of treatment differences xij − Yij is
Trt n Mean s.dev.
1 74 0.0551 0.2192
2 64 0.1651 0.2235
The estimated benefit from treatment 2 is now 0.1100 with t = 2.91 (still sig-
nificant).
Fleiss articulates two goals. One is to properly account for pre-treatment
differences. Another is to reduce the variance of the treatment effect estimate.
Differences do not alway achieve the second goal. To see why, let ρ =
corr(xij , Yij ) and σy2 = var(Yij ) and σx2 = var(xij ). Then var(xij − Yij ) =
σx2 + σy2 − 2ρσx σy . Now var(xij − Yij ) < var(Yij ) if and only if
1 σx
ρ> .
2 σy
If σx ≈ σy , then we need ρ ' 1/2 for differencing to improve accuracy. In the
hypothetical weight loss example it seems quite plausible that ρ would be large
enough to make differences more accurate than using post-treatment weight.
Now let’s look at a regression model
In this version we can think of µ + β(x − x̄) as an estimate of where the subject
would be post-treatment and then the experiment is comparing the average
amount that Y exceeds this baseline, between levels A and B of the treatment.
We don’t know β, but we can estimate it by least squares, and that’s what
ANCOVA does.
For the gingivitis data, the estimated treatment difference from ANCOVA
was 0.1263 and thet statistic was 3.57 (vs 3.56 for the original data and 2.91 for
differences). The standard error of the estimate was 0.0354 vs 0.0446 (for Y )
or 0.0374 (for x − Y ). In this instance the precision was improved by regression
on the prior measurement.
This model is a very bad choice if our goal is to test whether the treatment
has a causal impact on Y . Suppose that the treatment has a causal impact on
Z which then has a causal impact on Y . We might then find that Z explains Y
so well that the treatment variable appears to be insignificant. For instance, the
causal implications in an agriculture setting might be that a pesticide treatment
has a causal impact on the quantity X of insects in an orchard, and that in turn
has a causal impact on the size Y of the apple harvest. Including Z in the
regression could change the magnitude of the treatment differences and lead us
to conclude that the treatment does not causally affect the size of the apple
harvest. This could be exactly wrong if the treatment really does increase the
apple harvest because of its impact on the pests.
With data like this we could do two ANCOVAs. One relates Y to the
treatment and any pre-treatment predictors. Another relates z to the treatment
and any pre-treatment variables. We could then learn whether the treatment
affects z and that might be a plausible mechanism for how it affects Y .
If there is a strong predictive value in xT ij β then we get a more precise
comparison of treatment groups from the ANCOVA than we would from an
ANOVA that ignored xij .
Period 1 Period 2
Subject Group 1 A B
Subject Group 2 B A
period is long enough then we might bet on a cross-over design. Clearly if the
first period treatment can bring a permanent cure (or irreversible damage) then
the second period is affected and a cross-over is problematic. Cross-over designs
are well suited for chronic issues that require continual medication.
Here is a sketch from Brown (1980) of how cross-over data looks.
Group 1 Group 2
Subjects Subjects
Period Trt S11 , · · · , S1n1 Trt S21 , . . . , S2n2
1 A Y111 , · · · , Y1n1 1 B Y211 , · · · , Y2n2 1
2 B Y112 , · · · , Y1n1 2 A Y221 , · · · , Y2n2 2
There are two groups of subjects, two treatments and two periods. For instance,
in period 1, group 1 gets treatment A and yields observations Y111 , · · · , Y1n1 1 .
Here is his regression model after the carry-over term has been removed.
Notice the subscript on the treatment effect τ . The treatment level is u(i, k)
which is A if it is group i = 1 in period k = 1 or group i = 2 in period k = 2.
If i 6= k then the treatment is B.
As in ANCOVA, we will consider some differences. Let Dj be the period 2
value minus the period 1 value for subject j. Then
− π2 + τ1 − τ2
|π1 {z Group i = 1
} | {z }
E(Dj ) = period effect treatment effect
π1 − π2 + τ 2 − τ 1 Group i = 2
| {z } | {z }
period effect treatment effect
and so
E(Dj | i = 1) − E(Dj | i = 2) = 2(τ1 − τ2 ).
That is, the group differences of Dj inform us about treatment difference τ1 −τ2 .
We can use this to test for treatment differences via a t-statistic
(n1 − 1)s2Dgp1 + (n2 − 1)s2Dgp2
r
D̄gp1 − D̄gp2 n1 n2
t= s2D = .
sD n1 + n2 n1 + n2 − 2
101
102 9. Split-plot and nested designs
we want have purposely made them differ and want to study them in their own
right.
When we do computer experiments or Monte Carlo simulations, it often
makes sense to analyze them as designed experiments. If one factor A is set at
the beginning of an hour long computation and then a second factor B can be
set when there are just two minutes left in the computation, then a split-plot
design makes sense. We choose A, compute for 58 minutes, save the internal
state of the computation, and then vary factor B several times.
A split-plot design will ordinarily give us better comparisons for the inner
factor B because it is blocked by the outer factor and the outer factor is not
blocked. This is perhaps not surprising. If it is so much cheaper to vary factor
B then it is expected that we can study it with more precision.
To begin with, we will suppose that both A and B are fixed effects. We
consider random effects and mixed effects later.
Let’s vary factor A at I > 2 levels. We will have n plots at each of those
levels for a total of n × I plots. We depict them as follows
A1 A3 A2
z }| { z }| { z }| {
B2 B3 B1 B1 B3 B2 • • • B1 B2 B3
plot 1 plot 2 plot n × I
Here, factor A varies at the level of whole plots. Each level i appears n times.
Factor B varies at the level of sub-plots: j = 1, . . . , J, with J = 3 in the diagram.
Each level j appears nI times
The n appearances of each level of A could be from n replicates or from a
completely randomized design where I treatments were applied n times each in
random order to n × I plots. When replicates are used it is usual to include an
additive shift for them in the model.
We can compare levels of B using ‘within-plot’ differences such as Yijk −Yij 0 k
for levels j 6= j 0 of factor B. We can compare levels of A using ‘between-plot’
differences such as Ȳi•k − Ȳi0 •k . We expect between-plot differences to be less
informative when the plots vary a lot.
The AB interaction is estimated with ‘within-plot’ differences. More pre-
cisely, it uses between plot differences of within plot differences that are also
within plot differences of between plot differences:
Ȳij • − Ȳi•• − Ȳ•j • + Ȳ••• = Ȳij • − Ȳi•• − Ȳ•j • − Ȳ•••
J
1 X
Ȳij • − Ȳi•• = Ȳij • − Ȳij 0 • within
J 0
j =1
I
1X
Ȳ•j • − Ȳ••• = Ȳij • − Ȳi•• also within.
I i=1
We see from the last line that the interaction is an average of within-plot dif-
ferences and so it gets within-plot accuracy.
We still have to figure out the sums of squares that go in these tables. We
could do a full I × J × n ANOVA with replicates k = 1, . . . , n treated as a third
factor C crossed with factors A and B. Then the subplot error sum of squares
is SSABC + SSBC . The whole plot error sum of squares is SSAC . The replicates
sum of squares in the whole plot analysis is SSC . The sums of squares for A, B
and AB are, unsurprisingly, SSA , SSB and SSAB .
There is a different analysis in Montgomery (1997, Chapter 12-4). In our
notation, his model is
other effects, for better or worse. There would be no way to estimate E(ε2ijk )
separately from SSABC in that model, without another level of replication. We
will stay with the analysis based on the two tables described above. However,
if you are in a setting where you suspect that there could be meaningful inter-
actions between blocks and treatments, then Montgomery’s approach provides
a way forward.
Yandell (1997, Chapter 23) is a whole chapter on split-plot designs mostly
motivated by agriculture.
For each level i, it has J − 1 degrees of freedom and so it has I(J − 1) degrees
of freedom in total.
Recall that the AB interaction has (I − 1)(J − 1) df. This B(A) sum of
squares gets I(J − 1) − (I − 1)(J − 1) = J − 1 more df. They are the df for the
B main effect. The B main effect is meaningless when j = 1 has no persistent
meaning as i varies. As a result we lump the B main effect in with the prior
AB interaction to get SSB(A) = SSB + SSAB .
and so
2
2
σB(A) σ2
E(MSA ) = nJ σA + + = σ 2 + nσB(A)
2 2
+ nJσA .
J nJ
E(MSE ) = σ 2
2
and we see that σA is replaced by a sample variance among the αi . If both A
and B are fixed, then
PI
2 nJ i=1 αi2
E(MSA ) = σ +
I −1
PI PJ 2
i=1 j=1 βj(i)
E(MSB(A) ) = σ 2 + n , and
I(J − 1)
E(MSE ) = σ 2 .
The most straitforward analysis is to model the cluster level data using either
a randomization (permutation) approach or a one way PnANOVA. It seems like a
shame to greatly reduce the sample size from N = i=1 ni individuals to just
n N clusters. It is then tempting to analyze the data on the individual level.
It would however be quite wrong to analyze the individual data as if they were
N independent measurements. There are instead inter-cluster correlations. An
individual level analysis can be quite unreliable if it considers the individuals to
be independent when they are in fact correlated. Suppose for instance that εij
have variance σ 2 , correlations ρ within a cluster and are independent between
clusters. Then
1 X ni
var(Ȳi• ) = var(ε̄i• ) = var εij
ni j=1
1
= 2 ni σ 2 + ni (ni − 1)ρσ 2
ni
σ2
= 1 + (ni − 1)ρ .
ni
in terms of ρ and σ 2 and all the ni . Murray (1998) is an entire book on cluster
randomized trials also known as group randomized trials.
Taguchi methods
109
110 10. Taguchi methods
●
●
● ●
● ● ● ● ●
● ●
● ● ●
●
●
● ●
Figure 10.1: An archery example to illustrate bias and variance. The first
archer is quite good, or at least best. The second has higher bias. The third
has higher variance.
better they are. These then have a target of T = 0 and are scored by kY 2 . For
other quantities, the larger they are, the better, and those are scored via kY −2 .
That is smaller values of 1/Y are better.
Y = f (x1 , . . . , xk , z1 , . . . , zn ) + ε = f (x, z) + ε
and then
n
X ∂ 2 2
var(f (x0 , z)) ≈ f σj .
j=1
∂zj
Since z and hence σj2 is out of our control, our best chance to reduce this
variance is to reduce (∂f /∂zj )2 through our choice of x0 . That is, we want to
reduce the sensitivity of our output to fluctuations in the input.
125
120
Output Voltage
115
110
105
Transistor Gain
Figure 10.2: Hypothetical curve modeled after the account in Phadke (1989)
of power supply design.
This is the sample variance over the outer experiment recorded in decibels.
Recall that 10 log10 (·) converts a quantity to decibels. A logarithmic scale is
convenient because any model we get for E(η) when exponentiated to give a
variance (or inverse of a variance) will never give a negative value. It is also
more plausible that the factors in x might have mutiplicative effects on this
variance than an additive effect.
In its most basic form Taguchi’s method finds settings that maximize the
signal to noise ratio η. It is often an additive model in the components of x. In
the “bigger the better” setting, the analysis is of
1 X n2
1
ηi = −10 log10 .
n2 j=1 yij
1 X n2
ηi = −10 log10 yij .
n2 j=1
The experimental designs are usually orthgonal arrays like the following one
at 3 levels:
0 0 0 0
0 1 1 2
0 2 2 1
1 0 1 1
1 1 2 0
1 2 0 2
2 0 2 2
2 1 0 1
2 2 1 0
This design is called an orthogonal array because in any pair of columns each of
the 9 possible combinations appears the same number of times, i.e., once. We
will see more orthogonal arrays when we look at computer experiments. We
recognize this as a 34−2 design because it handles 4 variables at 3 levels each
using only 9 runs not 81.
It is usual for the outer experiment to also be an orthogonal array. Then
the design is made up of an inner array and an outer array.
There is a good worked example of robust design in Byrne and Taguchi
(1987) (which seems not to be available online). The value Y was the force
needed to pull a nylon tube off of a connector. This pull-off force was studied
with an inner array varying 4 quantities at three levels each (interference, wall
thickness, insertion depth and % adhesive). The outer array was a 23 experiment
in the time, temperature and relative humidity during conditioning. There were
thus 72 runs in all.
10.5 Controversy
There were some quite heated discussions about which aspects of Taguchi’s ro-
bust design methods were new and which were good. The round table discussion
in Nair et al. (1992) includes many points of view and dozens of references.
One issue is whether it is a good idea to use that combination of inner and
outer arrays. There every value of x is paired with every value of z. It might
be less expensive to run a joint factorial experiment on (x, z) ∈ Rk+n instead.
As usual, costs come into it. If the cost is dominated by the number of unique
x runs made, then a split plot structure like Taguchi uses is quite efficient. If
changing z and changing x cost the same then a joint factorial experiment and
the corresponding analysis will be more efficient.
Another issue is whether an analysis of signal to noise ratios is the best
way to solve the problem. In the roundtable discussion James Lucas says “The
designs that Taguchi recommends have the two most important characteristics
of experimental designs: (1) They have factorial structure, and (2) they get
run.” That is, an analysis that is not optimal may be better because more
people can understand it and use it.
Most of these notes are about designing how to experimentally gather data
with the assumption that they can be largely analyzed with methods familiar
from linear regression. Here we look at some ways of analyzing data that are
especially suited to designed experiments.
The course begin by grounding experimentation in causal inference. The
notions of potential outcomes, randomization, the SUTVA assumption and ex-
ternal validity help us think about experimentation. Then A/B testing and
bandit methods bridge us to problems of great current interest in industry.
Then we began with more classical experimental design.
The chapters so far have included a number of categorical quantities. There
are experimental units which may be plots or subjects and there are sub-units.
There are experimental factors. A combination of factors comprises a treatment
which may or may not involve important interactions. Those factors can be fixed
or random, nested or crossed. We also saw blocks and replicates and repeated
measures.
11.1 Contrasts
Beyond just rejecting H0 or not rejecting it, we have an interest in the different
expected values of Y . For a one way fixed effects model with
E(Yij ) = µ + αi ≡ µi
117
118 11. Some data analysis
µ1 − (µ2 + µ3 + µ4 )/3.
PI
These
P are all examples of contrasts. Contrasts take P the form i=1 λi µi
where i λi = 0 and, to remove an uninteresting case, i λ2i > 0. A contrast
PI
also satisfies i=1 λi αi . The reason why we have so much less interest in µ than
αi is that µ does not affect any comparisons of the levels of this factor and so
does not affect many of our choices. Perhaps if µ is bad enough we might not
want to use any of the levels of our factor, but when as usual we have to choose,
µ plays no role in E(Y ).
In the one way layout we can test a contrast with a t test, via
I
√
P
i λi Ȳi•
X
t = pP 2
∼ t(N −k) s= MSE N= ni .
s i λi /n i=1
We saw earlier that the presence of a random effect can complicate the inference
on a fixed effect with which it is crossed. If A is fixed and B is random we can
use
√
P
λi Ȳi••
t = pPi 2 ∼ t((I−1)(J−1)) s = MSAB.
s i λi /(nB)
This formula is for a balanced setting. When MSAB is the appropriate denom-
inator for our F test it provides the appropriate value of s for our t-test. The
degrees of freedom to use are the number underlying the estimate s.
Suppose that we have a factor that represents I equispaced levels of a con-
tinuous variable. For instance 20kg, 40kg, 60kg and 80kg of fertilizer. It is
then interesting to test the extent of a linear trend in the average responses
Y .PCentering these levels produces a contrast λ = (−30, −10, 10, 30). A test
of i λi αi = 0 is equivalent to one with λ = (−3, −1, 1, 3). This is a linear
contrast. When there are an odd number of levels then the central element
in the contrast has λi = 0. A test for curvature can be based on a quadratic
contrast. If the levels are linearly related to i then we can take
1X
λi = (i − ī)2 − (i − ī)2
I i
where ī = (I + 1)/2.
contrasts λ and λ0 are orthogonal if i λi λ0i = 0. Then i λi Ȳi• and
P P
P Two0
i λi Ȳi• are uncorrelated.
is the kurtosis of Y . The kurtosis is zero for Gaussian random variables but not
necessarily for other variables. When Y has ‘heavier tails’ than the Gaussian
distribution has, then κ > 0 and s2 has higher variance than under a Gaussian
assumption (and is not χ2 ). When σ̂12 and σ̂22 are not scaled χ2 random variables
then we cannot expect their ratio σ̂12 /σ̂22 to be approximately F distributed.
The situation is a better for
n
1 X
(Ȳi• − Ȳ•• )2
n − 1 i=1
because the CLT is making each Ȳi• more nearly normally distributed than
individual Yij are.
Yij = µ + ai + εij , i = 1, . . . , I j = 1, . . . , n
2 2
where ai ∼ N (0, σA ) and εij ∼ N (0, σE ) are all independent. We have renamed
2 2 2 2
σ to be σE here. The variances σA and σE are called variance components.
For a thorough treatment of variance components see Searle et al. (1992).
In this simple variance components model we have
2 2 2
E(MSA) = nσA + σE and E(MSE) = σE .
2 2 MSA − MSE
σ̂E = MSE and σ̂A = .
n
2
The estimate of σA is potentially awkward because it can be negative. It is
then common to take
MSA − MSE
2
σ̂A = max ,0 .
n
2 2
This estimate is no longer unbiased. It satisfies E(σ̂A ) > σA because we some-
times increase an unbiased estimate to zero, but never decrease it. If we are
averaging estimates like this over many data sets we might prefer to use any
negative values we get so as not to get a biased average.
We can also look at this setting through the correlation patterns in the data.
If j 6= j 0 then
2
cov(Yij , Yij 0 ) = cov(ai + εij , ai + εij 0 ) = cov(ai , ai ) = σA
and so
2
σA σ2
ρ ≡ corr(Yij , Yij 0 ) = = 2 A 2.
var(Yij ) σA + σE
2
We can interpret σ̂A as an indication that ρ < 0. Negative correlations are
impossible under our random effects model but distributions with those negative
correlations do exist. For instance the correlation matrix for Yij for j = 1, . . . , n
could be
1 ρ ρ ··· ρ
ρ 1 ρ · · · ρ
ρ ρ 1 · · · ρ
∈ Rn×n
.. .. .. . . ..
. . . . .
ρ ρ ρ ··· 1
for −1/(n − 1) 6 ρ 6 1. The lower limit on ρ is there to keep the correlation
matrix positive semi-definite.
In this correlation model
n
!
X
var Yij = nσ 2 + n(n − 1)ρσ 2
j=1
or equivalently
σ2
var(Ȳi• ) = (1 + (n − 1)ρ).
n
Here 1 + (n − 1)ρ) is the design effect we saw in clusterPrandomized trials. What
we see with ρ < 0 is that the variance of Ȳi• or of j Yij is less than what
it would be for independent observations. If there is some mechanism keeping
the total more constant than under independence that could explain negative
correlations. Cox (1958) considers animals that share a pen into which some
constant amount of food is placed. That could introduce negative correlations
in their weights. In a ride hailing setting with a fixed number of passengers we
might see negative correlations among the number of rides per driver. In both
of those settings we could get positive correlations too. The quantity of food or
of passengers could fluctuate up and down generating positive correlations.
If negative correlations are statistically convincing then we can move away
from the ANOVA and model the covariance matrix of the data instead.
and suppose that we want to estimate the grand mean µ. Two natural estimates
are
I ni I
1 XX 1X
µ̂ = Yij and µ̂ = Ȳi• .
N i=1 j=1 I i=1
That is we can average all of the data or average all of the group means. If
2 2
we actually knew σA and σE then we could compute the minimum variance
unbiased linear combination of Ȳi• as
I
X I
.X
µ̂ = Ȳi• /var(Ȳi• ) 1/var(Ȳi• )
i=1 i=1
I
X I
.X
2 2 2 2
= Ȳi• /(σA + σE /ni ) 1/(σA + σE /ni ).
i=1 i=1
2 2
Now if σA σE /ni for all i then averaging the Ȳi• would be nearly optimal.
2 2
If instead, σA σE /ni for all i then averaging all the data would be nearly
optimal. In practice we don’t ordinarily know these variance components but
this analysis would let us make a reasonable choice between the two natural
estimates above given a guess or assumption on the variance components.
For much more about variance components and unbalanced data, see Searle
et al. (1992).
Our best estimate of ai is E(ai | Ȳi• ) (after arguing that observations from i0 6= i
don’t help and that only the sufficient statistic Ȳi• is useful). Using properties
of the bivariate Gaussian distribution
2 2
Under very strong assumptions of normality and knowing µ, σA and σE we
would estimate (predict) ai by
2
σA
2 2 Ȳi• − µ .
σA + σE /ni
2 2
We shrink Ȳi• −µ towards zero, shirking it a lot if σA σE /ni . So we estimate
µ + ai by
2 2
σE /ni σA
2 2 µ+ 2 2 /n Ȳi• .
σA + σE /ni σA + σE i
This is a linear combination of the population mean µ and the average for unit i.
As ni increases we trust Ȳi• more. This estimate is the BLUP, for best linear
unbiased predictor. It minimizes variance among linear combinations of data.
With our simplifying assumptions here, the data is just Ȳi• . The approach
generalizes but becomes complex to depict.
Yij = µ + αi + bj + εij
viewing the block as a random effect, and the observation Yi0 j 0 is missing. We
could replace it with whatever minimizes
XX
(Yij∗ − µ − αi − bj )2
i j
for some new parameters. But we may have made the problem much harder by
introducing high order interactions.
In a setting like this, log(Y ) may have a more nearly additive model than Y
does. If log(Y ) is nearly additive then Y may not be. The expression above has
log(E(Yijk )) additive which is not the same has having E(log(Yijk )) additive.
Conversely, sometimes Y is more nearly additive than log(Y ).
There is a strong simplification from modeling on a nearly additive scale
because interactions bring in so many more parameters. Also many of our
models and methods use aliasing of the interactions and that is less harmful
when they are much smaller. It may then require some after thought to translate
a model for transformed Y to get conclusions for E(Y ). A very difficult situation
arises when Y is measured in dollars and the model works with log(Y )¿
As a second example, suppose that
$$\mathrm{E}(Y_{ijk}) \doteq \mu + \alpha_i + \beta_j + \gamma_k,$$
and let
$$\tilde Y = \begin{cases} 0, & |Y - \tau| > \delta\\ 1, & |Y-\tau| \le \delta.\end{cases}$$
Response surfaces
Very often we want to model E(Y | x) for continuously distributed x, not just
binary variables as we could handle with 2k factorials. Those can be used to
study continuous variables by choosing two levels for them. However, once we
know which variables are the most important and have perhaps a rough idea
of the range in which to explore them we may then want to map out E(Y | x)
more precisely for the subset of most important variables. We would like to
estimate an arbitrary response surface E(Y | x) in those key variables. The
literature on response surface models is mostly about estimating first order
(linear) and second order (quadratic) polynomial models in x, so most of the
practical methods do not have the full generality that the term ‘response surface’
suggests.
The material for this lecture is based largely on the texts Box and Draper (1987), Myers et al. (2016) and Wu and Hamada (2011), and on the survey articles Myers et al. (1989) and Khuri and Mukhopadhyay (2010). Additional material on optimal design comes from Atkinson et al. (2007).
though some of the cross terms might be subject to aliasing with higher order interactions, with each other (resolution IV) or with main effects (resolution III). Equation (12.2) leaves out the terms β_jj x_j².
If xij takes only two levels, then the most we can do with it is fit a two
parameter model such as a linear one. To fit a third parameter, such as cur-
vature, we need a third level. For that we can take some center points. When
we have been sampling xij ∈ {−1, 1} we might then take some additional runs
with xij = 0.
The simplest strategy is to add one or more center points with x_i = 0. E.g. for p = 2 we could use

    x1    x2
    -1    -1
    -1     1
     1    -1
     1     1
     0     0
     ⋮     ⋮
     0     0
From the repeated center points, we can estimate σ² or at least var(Y | x = 0). We can also use that data to estimate the model
$$\beta_0 + \sum_{j=1}^p \beta_j x_j + \sum_{1\le j<k\le p} \beta_{jk} x_j x_k + \gamma \sum_{j=1}^p x_j^2.$$
Notice that there is only one coefficient γ for all of the squared terms. This is γ = Σ_{j=1}^p β_jj in our usual notation. The reason is that in a design with two levels plus a center point we have x_ij = ±x_ij′ for all i = 1, . . . , n and all 1 ≤ j < j′ ≤ p. This then implies that x_ij² = x_ij′² and so all of the quadratic terms are perfectly confounded.
We might run this centerpoint design in a case where we expect little curvature but just want to be able to make a check on it. If there is convincing evidence that γ differs from zero by an important amount, then we know there is curvature that the two level part of the design cannot describe.
    Source          df
    A                2
      A_L            1
      A_Q            1
    B                2
      B_L            1
      B_Q            1
    AB               4
      A_L × B_L      1
      A_L × B_Q      1
      A_Q × B_L      1
      A_Q × B_Q      1
Given data from a 3^k factorial experiment we can fit the second order model
$$\beta_0 + \sum_{j=1}^k \beta_j x_{ij} + \sum_{j=1}^k \beta_{jj} x_{ij}^2 + \sum_{1\le j<j'\le k} \beta_{jj'}\, x_{ij} x_{ij'}$$
We say more about orthogonal arrays later. The top of Table 12.2 shows a construction in modular arithmetic. We will see more of that construction too.
When using an orthogonal array, be sure to randomize the levels. That is,
there are 6 possible ways to map the levels 0, 1 and 2 of the array onto the
experimental levels −1, 0 and 1 and one of those should be chosen at random.
An independent randomization should be made for each column. It would be
a very poor practice to just subtract 1 from each entry in the array. The run
order should also be randomized.
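The randomization described above is easy to script. The following sketch (Python; the seed is arbitrary) maps the level labels of each column to the experimental levels with an independent random permutation per column and then randomizes the run order:

    import numpy as np

    rng = np.random.default_rng(1)

    def randomize_three_level(A):
        """Randomize a three level array A with entries in {0, 1, 2}.

        Each column gets its own random one-to-one mapping of 0, 1, 2 onto the
        experimental levels -1, 0, 1, and then the run order is shuffled.
        """
        A = np.asarray(A)
        X = np.empty_like(A)
        levels = np.array([-1, 0, 1])
        for j in range(A.shape[1]):
            perm = rng.permutation(3)             # one of the 6 possible mappings
            X[:, j] = levels[perm][A[:, j]]
        return X[rng.permutation(A.shape[0]), :]  # randomized run order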
These 3^{k−p} designs can also be run in blocks whose size is a power of 3. There is an extensive selection of three level designs here: http://neilsloane.com/oadir/#3_2. For a comprehensive account of orthogonal arrays, see Hedayat et al. (1999).
for p = 1 + d + d(d − 1)/2 + d = 1 + d(d + 3)/2 parameters, and fit by least squares. That is,
$$\hat\beta = (F^T F)^{-1} F^T Y.$$
The central composite design collects three kinds of runs: the two level factorial points (or a fraction of them), 2k star points taken one factor at a time (OAAT) at ±α e_j, and n_0 center points. See https://www.jstor.org/stable/2983966 and https://www.google.com/search?q=central+composite+design.
It is also common to code the pure quadratic features as x_ij² − (1/n)Σ_{i′=1}^n x_{i′j}² to make them orthogonal to the intercept. This also makes them orthogonal to the linear terms because, breaking the design into its three parts, we find that
$$\sum_i \bigl(x_{ij}^2 - \overline{x_j^2}\bigr)x_{ij'} = \sum_i x_{ij}^2 x_{ij'} \qquad\text{using}\quad \sum_i x_{ij'} = 0,$$
$$= \sum_{i\in\mathrm{Factorial}} x_{ij'} + (\alpha \cdot 0) + (-\alpha \cdot 0) = 0,$$
because at every star or center point either x_ij or x_ij′ is zero, and the factorial points are balanced in x_ij′.
Exercise: show that x_ij² − \overline{x_j²} is orthogonal to x_ij x_ij′ for j′ ≠ j and to x_ij′ x_ij″ when no two of j, j′ and j″ are equal.
One very tricky issue is choosing the value of α. Taking α = 1 is convenient because it keeps all factors at three levels. We will have a subset of a 3^k factorial experiment. Another choice is to take α = √k because this keeps ‖x_i‖² = k on the star points just like it is for the factorial points. This is called a “spherical design” because then both the star and factorial points are embedded within a sphere of radius √k. When we choose a spherical design we need some center points or else Σ_{j=1}^k x_ij² = k for all i and we will have a singular matrix F. Exercise: is this exactly the same problem that we saw with a centerpoint design and the parameter γ, or is it different?

If we choose α = √k then we might find that values x_ij = ±√k are too far from the region of interest even though they are exactly the same distance from the center as the factorial points are. The issue stems from factor sparsity. If x_1 is a very important variable, much more so than the others, then taking x_i1 = ±√k represents a much more consequential change than just taking everything in {−1, 1}. Something is too far from the region of interest if the quadratic model that serves over [−1, 1]^k does not extrapolate well there. Perhaps changing a geometric parameter for a transistor by that much turns it into a diode. It is even possible that operating at x_ij = ±√k is unsafe if x_j represents temperature or pressure. This sort of non-statistical issue can be much more important than designing to reduce var(β̂) and it requires the input of a domain expert.
One possible way to choose α is to obtain orthogonality, that is a diagonal matrix F^T F ∈ R^{p×p}. If the pure quadratic terms are centered then it remains possible that they are not orthogonal to each other. There is one value of α that makes them orthogonal. After some algebra, this is
$$\alpha = \Bigl(\frac{QF}{4}\Bigr)^{1/4}$$
for Q = [(F + T)^{1/2} − F^{1/2}]², where F is the number of factorial observations and T = 2k + n_0. This then makes corr(β̂_jj, β̂_j′j′) = 0.
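A sketch putting the pieces together, building the factorial points, the one at a time star points and the center points, and using the formula above for α when none is supplied (the choice k = 3, n_0 = 4 in the example is arbitrary):

    import itertools
    import numpy as np

    def central_composite(k, n0, alpha=None):
        """Factorial points, star points ±alpha*e_j, and n0 center points.

        If alpha is None it is chosen by the orthogonality formula in the text,
        alpha = (Q*F/4)**(1/4) with Q = (sqrt(F + T) - sqrt(F))**2, T = 2k + n0.
        """
        F = 2 ** k                                  # number of factorial runs
        T = 2 * k + n0
        if alpha is None:
            Q = (np.sqrt(F + T) - np.sqrt(F)) ** 2
            alpha = (Q * F / 4.0) ** 0.25
        fact = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))
        star = np.vstack([a * np.eye(k)[j] for j in range(k) for a in (alpha, -alpha)])
        ctr = np.zeros((n0, k))
        return np.vstack([fact, star, ctr])

    X = central_composite(k=3, n0=4)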
     A     B     C
    ±1    ±1     0
     0    ±1    ±1
    ±1     0    ±1
     0     0     0
The first row denotes a 22 factorial in A and B with factor C held at zero.
There follow two similar rows with A and then B held at zero. Finally there is a
row representing n0 runs at x = 0. If, for instance n0 = 3 then this experiment
would have 15 runs. According to Box and Behnken (1960), “The exact number of center points is not critical”. Geometrically this design has 12 points on the surface of the cube [−1, 1]^3, one at the midpoint of each of the 12 edges connecting the 8 vertices within the six faces. There are also center points.
We can recognize the strategy in this design. There is a balanced incomplete
block structure designating some number r < k of the factors that take values
±1 while the remaining k − r factors are held at zero. Then some number of
center points are added.
Table 12.3 shows another Box-Behnken design, this time for 4 factors. It is
arranged in three blocks each of which has its own center point.
     A     B     C     D
    ±1    ±1     0     0
     0     0    ±1    ±1
     0     0     0     0
    ±1     0     0    ±1
     0    ±1    ±1     0
     0     0     0     0
    ±1     0    ±1     0
     0    ±1     0    ±1
     0     0     0     0

Table 12.3: A schematic for a Box-Behnken design in four factors with three blocks and one center point per block.
Exercise: are the block indicator variables for the Box-Behnken design in Table 12.3 orthogonal to the regression model? If we change it to n_0 = 4, are they orthogonal? Since there are three block variables you can drop the intercept column.
Like 3^{k−p} designs, Box-Behnken designs can easily handle categorical variables at 3 levels. The tabled designs in the literature and textbooks involve only
modest dimensions k. For large k, Box-Behnken designs use many more runs
than parameters. While that may be useful in some settings, in others there is
a premium on minimizing n by taking it just slightly larger than the number of
parameters.
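For concreteness, here is a sketch that generates the three factor Box-Behnken design from the pattern in the table above (the row order differs slightly from the table, which does not matter prior to randomization):

    import itertools
    import numpy as np

    def box_behnken_3(n0=3):
        """Box-Behnken design for 3 factors: a 2^2 factorial in each pair of
        factors with the remaining factor held at 0, plus n0 center points."""
        rows = []
        for pair in itertools.combinations(range(3), 2):
            for a, b in itertools.product([-1, 1], repeat=2):
                x = [0, 0, 0]
                x[pair[0]], x[pair[1]] = a, b
                rows.append(x)
        rows += [[0, 0, 0]] * n0
        return np.array(rows)

    print(box_behnken_3().shape)   # (15, 3) when n0 = 3, as in the text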
where g(·) ≥ 0 measures interest level. It could be a distribution but does not have to be. One could also minimize ∫_{X′} var(f(x)^T β̂) g(x) dx where the set X′ ⊂ X is a region of special interest.
Optimal design can also help us when we have a more complicated constraint
region X to work with than the usual cubes and balls. It is not always possible
to do the global optimization that we would want in optimal design.
Box and Draper (1987) are skeptical about the optimal design approaches sometimes called “alphabetic optimality”. They work some examples that optimize mean squared error taking account of bias and variance. The bias comes from the possibility that the model is a polynomial of higher order than the one fit by the response surface model. They find that the optimal designs for mean squared error are similar to those we would get optimizing for bias squared and they can be quite different from what we would find optimizing just for variance like the alphabetic optimality designs do. They ‘match moments’ over the region, making (1/n)Σ_{i=1}^n f_i f_i^T = E(f f^T).
Now we turn briefly to logistic regression. It is not a pre-requisite for this course but many readers will have encountered it. Logistic regression is for binary responses Y ∈ {0, 1}. The model relating Y to x ∈ R is
$$\Pr(Y = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} \equiv p(x;\beta)$$
and in general it uses Pr(Y = 1 | x) = exp(f(x)^T β)/(1 + exp(f(x)^T β)). The likelihood function is
$$L(\beta) = \prod_{i=1}^n p(x_i;\beta)^{Y_i}\,\bigl(1 - p(x_i;\beta)\bigr)^{1-Y_i}.$$
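For readers who want to compute with this, a minimal sketch of the log likelihood corresponding to the product above (here F holds the rows f(x_i); a numerically stable form is used):

    import numpy as np

    def logistic_loglik(beta, F, Y):
        """Log likelihood for Pr(Y=1|x) = exp(f(x)^T beta)/(1 + exp(f(x)^T beta)).

        F is the n x p matrix with rows f(x_i) and Y holds the 0/1 responses.
        """
        eta = F @ beta
        # log p_i = eta_i - log(1 + e^eta_i), log(1 - p_i) = -log(1 + e^eta_i)
        return float(np.sum(Y * eta - np.logaddexp(0.0, eta)))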
Super-saturated designs
instance, with n = 4,
$$\begin{pmatrix} +&+&+&+\\ +&+&-&-\\ +&-&+&-\\ +&-&-&+\end{pmatrix}$$
is a Hadamard matrix. We could use it as a saturated design to fit an intercept
and 3 binary variables. We have seen it before as a 2^{3−1} factorial. There is a
good account of Hadamard matrices in (Hedayat et al., 1999, Chapter 7). It is
definitive apart from a few recent contributions that one can find online either
at Wikipedia or Neil Sloane’s web site.
The Sylvester construction, which actually pre-dates Hadamard’s interest in these matrices, is as follows:
$$H_1 = (1), \qquad H_2 = \begin{pmatrix}1&1\\1&-1\end{pmatrix} = \begin{pmatrix}H_1&H_1\\H_1&-H_1\end{pmatrix}, \qquad H_4 = \begin{pmatrix}H_2&H_2\\H_2&-H_2\end{pmatrix}$$
and so on with
$$H_{2^{k+1}} = \begin{pmatrix}H_{2^k}&H_{2^k}\\H_{2^k}&-H_{2^k}\end{pmatrix}$$
for k ≥ 1 in general.
Sylvester’s construction is a special case of a Kronecker construction that works as follows. If A ∈ {−1, 1}^{n×n} and B ∈ {−1, 1}^{m×m} then their Kronecker product is
$$A \otimes B = \begin{pmatrix} A_{11}B & A_{12}B & \cdots & A_{1n}B\\ A_{21}B & A_{22}B & \cdots & A_{2n}B\\ \vdots & \vdots & \ddots & \vdots\\ A_{n1}B & A_{n2}B & \cdots & A_{nn}B \end{pmatrix} \in \{-1,1\}^{nm\times nm}.$$
If A and B are Hadamard matrices then so is A ⊗ B; every step of checking this follows directly from the definitions, so readers seeing Kronecker products for the first time should take a moment to verify it.
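A sketch of the Sylvester construction via Kronecker products (numpy's kron implements A ⊗ B exactly as displayed above):

    import numpy as np

    def sylvester_hadamard(m):
        """Hadamard matrix of order 2^m from the Sylvester/Kronecker recursion."""
        H = np.array([[1]])
        H2 = np.array([[1, 1], [1, -1]])
        for _ in range(m):
            H = np.kron(H2, H)   # H_{2^{k+1}} = H_2 kron H_{2^k}
        return H

    H8 = sylvester_hadamard(3)
    assert (H8 @ H8.T == 8 * np.eye(8, dtype=int)).all()   # Hadamard property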
It is known that if a matrix H_n exists then n = 1 or 2 or 4m for some integer m ≥ 1. The Hadamard conjecture is that H_n exists whenever n = 4m for m ≥ 1. There is no known counterexample, but matrices for n ∈ {668, 716, 892} have not (as of October 2020) been found and there are 10 more missing cases for n ≤ 2000. These missing values are not a problem for experimental design. If we want H_668 but have to use H_672 instead, it would not be a costly increase in sample size. The most plausible uses for such large matrices are in software and four additional function evaluations are unlikely to be problematic.

    n                   # distinct H_n
    1, 2, 4, 8, 12                   1
    16                               5
    20                               3
    24                              60
    28                             487
    32                      13,710,027

Table 13.1: The number of distinct Hadamard matrices for some small values of n.
Two Hadamard matrices are equivalent if one can be turned into the other
by permuting its rows, or by permuting its columns, or by flipping the signs
of an entire row or by flipping the signs of an entire column. The number of
distinct (non-equivalent) Hadamard matrices that exist for some small values of
n is given in Table 13.1.
Given a Hadamard matrix we can always find an equivalent one whose first
row and column are all +1. Hadamard matrices are often depicted in this form.
In an experiment we would then use the first column to represent the intercept
and the next n − 1 columns for n − 1 predictor variables. What we get is a
Resolution III design (main effects clear) in n − 1 binary variables.
In addition to the Sylvester construction there is a construction of Paley (1933) that is worth noting. If n = 4m and s = n − 1 = p^r for a prime number p and exponent r ≥ 1, then Paley’s first construction provides H_n. The construction is available whenever p^r ≡ 3 mod 4. Figure 13.1 shows one of these matrices for n = 252 and prime number p = 251 ≡ 3 mod 4. Apart from a border of +1 at the left and top, each row of this matrix is a cyclic shift of the row above it. That means we can construct the matrix ‘on the fly’ and need not store it. That feature would be very useful for n > 10^6. There is a construction in (Hedayat et al., 1999, Theorem 7.1) that is quite simple to use for a prime number p ≡ 3 mod 4. For n − 1 = p^r it would require finite field arithmetic that when r = 1 reduces to arithmetic modulo p. Note that the theorem gives a first row of H_n equal to (1, −1, −1, · · · , −1) instead of all 1s. However, we would randomize all the columns once constructed. (Be sure to pick one randomization for each of the n − 1 columns and use it for all n rows.) Paley (1933) has a second construction, but it does not have quite the same simply implemented striping pattern.
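A sketch of Paley’s first construction for a prime p ≡ 3 mod 4, using the quadratic residue character; this produces one standard, sign-equivalent normalization rather than exactly the bordered matrix drawn in Figure 13.1:

    import numpy as np

    def paley_hadamard(p):
        """Paley's first construction of H_{p+1} for a prime p with p % 4 == 3.

        Each row of the core Q is a cyclic shift of the row above it, which is
        the striping property noted in the text.
        """
        assert p % 4 == 3
        residues = {(x * x) % p for x in range(1, p)}
        chi = np.array([0] + [1 if x in residues else -1 for x in range(1, p)])
        Q = np.array([[chi[(j - i) % p] for j in range(p)] for i in range(p)])
        n = p + 1
        C = np.zeros((n, n), dtype=int)       # skew symmetric conference matrix
        C[0, 1:], C[1:, 0], C[1:, 1:] = 1, -1, Q
        H = np.eye(n, dtype=int) + C
        assert (H @ H.T == n * np.eye(n, dtype=int)).all()
        return H

    H252 = paley_hadamard(251)    # the n = 252 example above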
We can use foldovers of Hadamard matrices. First split the intercept column
Any three columns of this matrix have all eight points of {−1, 1}^3 m times each.
For a binary x_ij the most efficient allocation would have half of the observations at each of the two levels. With random balance they would be somewhat unequally split. The second inefficiency in random balance is that the terms x_ij′ β_j′ would raise the variance of Y_i in a regression model to σ² + Σ_{j′≠j} β_j′² var(x_ij′).
On the other hand, in a full regression model the matrix X^T X/n is far from the identity and then det((X^T X)^{−1}) is infinite for p > n no matter what design is used.
13.4 Quasi-regression
Suppose that we know the distribution of the feature vector f(x_i) ∈ R^p. Then the regression parameter that minimizes squared error in predicting a finite variance value Y_i with a linear combination of these features is
$$\beta = \mathrm{E}\bigl(f(x)f(x)^T\bigr)^{-1}\mathrm{E}\bigl(f(x)Y\bigr). \qquad (13.1)$$
This β minimizes E((Y − f(x)^T β)²) whether or not E(Y | x) = f(x)^T β. In linear regression with n ≥ p we estimate β by
$$\hat\beta = \Bigl(\frac{1}{n}F^T F\Bigr)^{-1}\frac{1}{n}F^T Y \qquad (13.2)$$
where F ∈ R^{n×p} has i’th row f(x_i) and Y ∈ R^n has i’th component Y_i. The expectations in (13.1) have been replaced by corresponding sample averages in (13.2) to get β̂.
A very popular choice is to take f (x) to be products of orthogonal polynomi-
als. The resulting expansion approximating E(Y | x) is known as polynomial
chaos.
Now suppose that the x_i have been sampled at random. When we choose the sampling distribution we may well know E(f(x_i)f(x_i)^T) because we also chose the features. For instance, if the features are all polynomials and the x_i have a distribution under which those polynomials are orthonormal, then E(f(x_i)f(x_i)^T) = I. Quasi-regression replaces the sample moment matrix (1/n)F^T F in (13.2) by its known expectation, giving
$$\tilde\beta = \mathrm{E}\bigl(f(x)f(x)^T\bigr)^{-1}\frac{1}{n}F^T Y,$$
which is simply β̃ = (1/n)F^T Y in the orthonormal case. That can be computed at O(np) cost instead of the O(np²) that least squares costs. Quasi-regression can be used when p > n and it avoids the O(p²) space required for linear regression.
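A sketch of the quasi-regression computation under the assumption, as above, that the features are orthonormal in expectation under the known sampling distribution:

    import numpy as np

    def quasi_regression(F, Y):
        """Quasi-regression estimate assuming E(f(x) f(x)^T) = I.

        F is n x p with rows f(x_i) orthonormal in expectation under the known
        sampling distribution of x.  The estimate is beta_tilde_j = mean_i F_ij Y_i,
        an O(np) computation with no p x p matrix ever formed.
        """
        return F.T @ Y / len(Y)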
When p > n then shrinkage estimators as in Jiang and Owen (2003) are advised. Ordinarily var(β̃) > var(β̂) when both are possible. This is a counterexample to the usual rule in Monte Carlo sampling where plugging in a known expectation in place of a sampled quantity ordinarily helps. Blatman and Sudret (2010) report that sparse regression methods outperform methods based on numerically estimating E(f f^T) and E(f Y).
Booth and Cox (1962), citing Box’s discussion of random balance, introduce some criteria. They consider the matrix X ∈ {−1, 1}^{n×p} with p ≥ n − 1 and each column of X containing n/2 values for each of ±1. Then they consider the matrix S = X^T X which is proportional to the sample correlations among the predictors. Their criterion is max_{1≤j<j′≤p} |S_jj′|. They break ties by preferring designs where a smaller number of column pairs attain the maximum of |S_jj′|. They give some small examples with n and p of a few dozen. The examples were found by computer search. When compared to random balance the designs they obtain have much better values of S. They also report the variance of the |S_jj′| values.
Georgiou (2014) considers the criterion
$$E(S^2) \equiv \binom{p}{2}^{-1}\sum_{1\le j<j'\le p} S_{jj'}^2$$
and remarks that algorithms that try to optimize it may possibly yield identical
columns. Incidentally, Georgiou (2014) gives this as the first criterion that
Booth and Cox (1962) consider but in preparing these notes I do not find that
in their article.
Georgiou (2014) reports some lower bounds on E(S²). The precise bounds depend on things like the value of n modulo 4. In all cases
$$E(S^2) \ge \frac{n^2(p - n + 1)}{(n - 1)(p - 1)}$$
where his definition of E(S²) includes the intercept column of all ones. If we normalize each S_jj′ to S_jj′/n then
$$E\bigl((S/n)^2\bigr) \ge \frac{p - n + 1}{(n - 1)(p - 1)}.$$
Let’s suppose that n ≫ 1 and p − n ≫ 1. Then, ignoring the ±1s above, we get a bound of roughly (p − n)/(np).
makes little difference. Exercise: compare the E(S²) criterion that Lin gets to what he would get choosing runs with −1 in the given column.
Lin (1993) runs forward stepwise regression on data from his design. Phoa
et al. (2009) use L1 regularization based on the Dantzig selector of Candes and
Tao (2007).
Lin (1995) looks for ways to maximize the number p of binary variables in
a model subject to a constraint on maxj6=j 0 |Sjj 0 |. He breaks ties based on the
number of pairs at the same maximum level.
Practical investigations: how well would it work to choose 1/4 or 1/8 et cetera of a Hadamard design? Would Paley or Sylvester constructions be about equally good or would one be better? This last point requires a way to make a fair comparison between Hadamard designs with different values of n.
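One way to start on these investigations is to code the two criteria above and apply them to candidate designs; a sketch (the random balance example below uses arbitrary n, p and seed):

    import numpy as np
    from itertools import combinations

    def s_criteria(X):
        """max |S_jj'| and E(S^2) over column pairs of X, where S = X^T X."""
        S = X.T @ X
        vals = np.array([S[j, k] for j, k in combinations(range(X.shape[1]), 2)])
        return np.abs(vals).max(), np.mean(vals ** 2)

    # e.g. a random balance design with balanced +-1 columns
    rng = np.random.default_rng(2)
    n, p = 14, 20
    Xrb = np.array([rng.permutation([-1] * (n // 2) + [1] * (n // 2))
                    for _ in range(p)]).T
    print(s_criteria(Xrb))

A fraction of a Hadamard matrix, with its intercept column removed, can be passed to s_criteria in the same way for comparison.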
(with zero based indexing), where ω = exp(2π√−1/n). They split out the real and imaginary parts of X_ij, doubling the number of columns obtained. (Those 2n real valued vectors in R^n cannot of course be mutually orthogonal.)
Another approach they describe is to take X ∈ R^{n×p} with IID N(0, 1) or IID U{−1, 1} entries multiplied by √(p/n). The amazing thing is that this, after many years, provides some justification for Satterthwaite’s random balance.
Part of the argument in Krahmer and Ward (2011) is based on the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984). Think of a saturated design as p − 1 column vectors in R^p, orthogonal to each other and to the vector of ones. Now we project those columns v_1, . . . , v_{p−1} into a lower dimensional space by multiplying them by a matrix Φ. That is we get u_i = Φv_i ∈ R^n for i = 1, . . . , p − 1 for a matrix Φ ∈ R^{n×p}. The Johnson-Lindenstrauss lemma shows that there is a mapping from R^p to R^n that preserves interpoint distances to within a factor of 1 ± ε. When that mapping is the linear one described above this means that
$$(1-\varepsilon)\,\|v_i - v_j\| \le \|u_i - u_j\| \le (1+\varepsilon)\,\|v_i - v_j\|.$$
Computer experiments
$$0.036\, S_w^{0.758}\, W_{fw}^{0.0035}\, \Bigl(\frac{A}{\cos^2\Lambda}\Bigr)^{0.6} q^{0.006}\, \lambda^{0.04}\, \Bigl(\frac{100\, t_c}{\cos\Lambda}\Bigr)^{-0.3} (N_x W_{dg})^{0.49} + S_w W_p,$$
$$x_{ij} = \frac{\pi_j(i-1) + u_{ij}}{n}, \qquad 1\le i\le n,\quad 1\le j\le d,$$
where π_j are uniform random permutations of 0, 1, . . . , n − 1 and u_ij ∼ U[0, 1).
All d permutations and nd uniform random variables are mutually independent.
Many computing environments include methods to make a uniform random
permutation. Figure 14.1 shows a small Latin hypercube sample. We see that
the n values of xij for i = 1, . . . , n are equispaced (stratified) and this is true
simultaneously for all j = 1, . . . , d. It balances nd prespecified rectangles in
[0, 1]d using just n points. If any one of those inputs is extremely important, we
have sampled it evenly without even knowing which one it was.
A notable strength of a Latin hypercube sample is that it allows d > n or even d ≫ n. It is also easy to show that each x_i ∼ U[0, 1]^d.
[Figure 14.1: a Latin hypercube sample of n = 25 points in [0, 1]^2.]

The name of this
design is connected to Latin squares. For Figure 14.1, imagine placing letters
A, B, C, · · · , Y in the 25 × 25 grid with each letter appearing once per row and
once per column. That would be a Latin square. In an LHS we sample within
the cells occupied by just one of those 25 letters, perhaps ‘A’. This design is also
used in computer graphics by Shirley (1991) who gives it the name n rooks
because if the sampled points were rooks on an n × n chessboard, no rook could
take any other.
Figure 14.1 showed points uniformly and randomly distributed inside tiny squares. If we prefer, we could evaluate f at the centers of those squares by taking
$$x_{ij} = \frac{\pi_j(i-1) + 1/2}{n}, \qquad 1\le i\le n,\quad 1\le j\le d.$$
This centered version of Latin hypercube sampling was described by Patterson
(1954) for an agricultural setting (crediting Yates).
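A sketch implementing both the randomized and the centered versions of the Latin hypercube construction above:

    import numpy as np

    def latin_hypercube(n, d, centered=False, rng=None):
        """Latin hypercube sample of n points in [0,1)^d.

        x_ij = (pi_j(i-1) + u_ij)/n with independent uniform permutations pi_j
        of 0,...,n-1; centered=True uses 1/2 in place of u_ij.
        """
        rng = np.random.default_rng() if rng is None else rng
        perms = np.column_stack([rng.permutation(n) for _ in range(d)])
        offset = 0.5 if centered else rng.random((n, d))
        return (perms + offset) / n

    x = latin_hypercube(25, 2, rng=np.random.default_rng(3))   # like Figure 14.1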
    x        y
    0        0
    0        1
    ⋮        ⋮
    0      p − 1
    1        0
    ⋮        ⋮
    1      p − 1
    ⋮        ⋮
    p − 1    0
    ⋮        ⋮
    p − 1  p − 1

Table 14.1: Here are the first 2 columns of a Bose OA(p², p + 1, p, 2) orthogonal array. We have labeled them x and y for future use.
Let’s pick any two of these columns indexed by c_1, c_2 with c_1 < c_2. The (a_1, a_2) that we need satisfies
$$\begin{pmatrix} a_1\\ a_2\end{pmatrix} = \begin{pmatrix} 1 & c_1\\ 1 & c_2\end{pmatrix}\begin{pmatrix} x\\ y\end{pmatrix} \bmod p$$
where (x, y)^T ∈ {0, 1, . . . , p − 1}² indexes the row of Table 14.1 that we want. We can solve this formally by taking
$$\begin{pmatrix} x\\ y\end{pmatrix} = \begin{pmatrix} 1 & c_1\\ 1 & c_2\end{pmatrix}^{-1}\begin{pmatrix} a_1\\ a_2\end{pmatrix} = \frac{1}{c_2 - c_1}\begin{pmatrix} c_2 & -c_1\\ -1 & 1\end{pmatrix}\begin{pmatrix} a_1\\ a_2\end{pmatrix}.$$
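A sketch of a Bose-type construction consistent with Table 14.1, with rows indexed by (x, y) and columns x, y and x + cy mod p; the balance of any two columns comes from the invertibility of the 2 × 2 matrix above:

    import numpy as np

    def bose_oa(p):
        """Bose OA(p^2, p+1, p, 2) for a prime p.

        Rows are indexed by (x, y) in {0,...,p-1}^2; the columns are x, y and
        x + c*y mod p for c = 1,...,p-1.  Any two columns contain each ordered
        pair of levels exactly once.
        """
        x, y = np.divmod(np.arange(p * p), p)   # the two columns of Table 14.1
        cols = [x, y] + [(x + c * y) % p for c in range(1, p)]
        return np.column_stack(cols)

    A = bose_oa(11)    # OA(121, 12, 11, 2) as used in Figure 14.2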
Figure 14.2: The left panel shows three columns of OA(121, 12, 11, 2) unscram-
bled. The right panel shows a scramble of them. From Owen (2020).
[0, 1]^{p+1} with just p² points. For p = 101 we have balanced more than 5 × 10^7 strata with only p² = 10,201 points.
To use an orthogonal array as a space filling design it is important to randomize the levels. In a randomized orthogonal array (Owen, 1992) we begin with an OA(n, d, b, t) matrix A and take
$$x_{ij} = \frac{\pi_j(a_{ij}) + u_{ij}}{b} \qquad\text{or}\qquad x_{ij} = \frac{\pi_j(a_{ij}) + 1/2}{b}$$
for independent uniform random permutations π_j of {0, 1, . . . , b − 1} and u_ij ∼ U[0, 1). Random offsets u_ij produce x_i ∼ U[0, 1)^d (which is the same distribution as U[0, 1]^d). These points are dependent because by construction they must avoid each other in any t-dimensional coordinate projection. The centered versions might be better for plotting contours of f in the plane given by x_j and x_j′ for two variables 1 ≤ j, j′ ≤ d. Exercise: is a Latin hypercube sample a randomized orthogonal array?
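A sketch of the randomization above, taking an array with levels 0, . . . , b − 1 (for instance the bose_oa sketch earlier) to points in [0, 1)^d:

    import numpy as np

    def randomize_oa(A, b, centered=False, rng=None):
        """Turn an OA with levels {0,...,b-1} into points in [0,1)^d.

        x_ij = (pi_j(a_ij) + u_ij)/b with an independent random permutation pi_j
        of the levels for each column; centered=True uses 1/2 in place of u_ij.
        """
        rng = np.random.default_rng() if rng is None else rng
        A = np.asarray(A)
        n, d = A.shape
        X = np.empty((n, d))
        for j in range(d):
            pi = rng.permutation(b)
            u = 0.5 if centered else rng.random(n)
            X[:, j] = (pi[A[:, j]] + u) / b
        return X

    X = randomize_oa(bose_oa(11), b=11, rng=np.random.default_rng(4))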
It is important to apply the permutations πj . Without permutation, the
points will lie in or near two planes and then not fill the space well. See Fig-
ure 14.2. Figure 14.3 shows several pairwise scatterplots from a randomized
orthogonal array.
Tang (1993) perturbs the points of an orthogonal array slightly, to make
them also a Latin hypercube sample. Compared to a randomized orthogonal
array the result is more uniform univariate marginal distributions.
There are many more orthogonal arrays to choose from. See Hedayat et al. (1999). As mentioned above, the Bose construction can be generalized to
[Figure 14.3: pairwise scatterplots (X2 vs X1, X3 vs X1, X4 vs X1, X3 vs X2, X4 vs X2, X4 vs X3) from a randomized Bose orthogonal array.]
Figure 14.4: Histogram of the wing weight function evaluated at 1024 scrambled
Sobol’ points.
that vary by a factor of about 3–fold. Perhaps the input ranges are quite wide
or there are strong effects in the corners. If we were interested in the lightest
wings, then we could select out the points with f (xi ) < 200 and plot their input
values as in Figure 14.5. In many of the scatterplots we see very empty corners
and densely sampled corners. For instance variables 3 and 8 together seem to
have a strong impact on whether the wing weight is below 200. Figures like this
are exploratory in nature; we might see something we did not anticipate, or we
might not see anything that we can interpret.
Figure 14.6 shows linear regression output for the wing weight function on the randomized Sobol’ inputs. This function is nearly linear with R² ≈ 0.9833 (adjusted R² virtually identical at 0.9831). Because the sampling model is not linear plus noise the usual interpretation of regression output does not apply.
Notwithstanding that, this function is surprisingly close to linear, even though
the formula did not look linear. By Taylor’s theorem, a smooth and very non-
linear function could look locally linear especially in a small region not centered
at a point where the gradient vanishes. Reading the variables’ described ranges
online does not make them appear to be very local. Also, a small region of
interest in design space would seem like it ought to restrict the wing’s weight
to a much narrower range than 3–fold. It appears that this nonlinear looking
function is actually close to linear over a wide range of inputs. A plain regression on a quadratic polynomial with cross terms and pure quadratic terms scores an adjusted R² of 0.9997. The function is simpler than it looks.
This is by no means what always happens with computer experiments. It
is however also quite common that potentially quite complicated functions that
come in real applications are not maximally complex.

Figure 14.5: Scatterplot matrix of sample points for which the wing weight was below 200.

There are additional
examples in Constantine (2015). See also Caflisch et al. (1997) who find that
a complicated 360-dimensional function arising in financial valuation is almost
perfectly additive.
14.5 Kriging
The usual way to predict f(x_0) at a point x_0 ∉ {x_1, . . . , x_n} where we have evaluated f is based on a method called kriging from geostatistics. Kriging
originated in the work of Krige (1951). Kriging is based on Gaussian process
models of the unknown function f . Stein (2012) gives the theoretical back-
ground. Sacks et al. (1989) applies it to computer experiments drawing on a
body of work developed by J. Sacks and D. Ylvisaker. Much of the description
below is based on Roustant et al. (2012) who present software for kriging.
Figure 14.6: Some regression output for the wing weight function.
If x = (x_1, x_2) is multivariate Gaussian with mean (µ_1, µ_2) and covariance blocks Σ_11, Σ_12 = Σ_21^T and Σ_22, then
$$x_1 \perp\!\!\!\perp x_2 \iff \Sigma_{12} = 0.$$
For our present purposes, the most useful property of the multivariate Gaussian distribution is that, if Σ_22 is invertible, then
$$\mathcal{L}(x_1 \mid x_2) = \mathcal{N}\bigl(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\bigr).$$
This condition is not very restrictive. If Σ22 is singular then some component of
x2 is linearly dependent on the others. We could just condition on those others
if they have a nonsingular covariance. More generally, we need only condition
on a subvector of x2 that has an invertible covariance.
with µ and Σ chosen as described below. Then, to predict f(x_0) we may use f̃(x_0) = E(f(x_0) | f(x_1), . . . , f(x_n)). We also get var(f(x_0) | f(x_1), . . . , f(x_n)) for uncertainty quantification, from the multivariate Gaussian distribution.
Because x0 could be anywhere in the set X of interest, we need a model
for f (x) at all x ∈ X . For this, we select functions µ(x) = E(f (x)) defined on
X and Σ(x, x′) = cov(f(x), f(x′)) defined on X × X. The covariance function Σ(·, ·) must satisfy
$$\mathrm{var}\Bigl(\sum_{i=1}^n a_i f(x_i)\Bigr) = \sum_{i=1}^n\sum_{i'=1}^n a_i a_{i'}\,\Sigma(x_i, x_{i'}) \ge 0 \qquad (14.1)$$
for all a ∈ R^n and all x_i ∈ X for all n ≥ 1, or else it yields negative variances that are invalid. Interestingly, the condition (14.1) is sufficient for us to get a well defined Gaussian process.
There is an interesting question about the precise sense in which f(·) is random. We simply treat it as if the function f were drawn at random from a set of possible functions that we could have been studying. The function is usually chosen to fit a scientific purpose, though perhaps a random function model describes our state of knowledge about f(·). Perhaps not. However, the kriging method is widely used because it often gives very accurate emulator functions f̃(·). They interpolate because, for 1 ≤ i ≤ n, f̃(x_i) = E(f(x_i) | f(x_1), . . . , f(x_n)) = f(x_i).
Knowledge about f(·) can be used to guide the choice of µ(·) and Σ(·, ·). After choosing Σ(·, ·), we may pick a model like
$$Y = f(x) = \mu(x) + Z(x)$$
where µ(·) is a known function and Z is a mean zero Gaussian process with covariance Σ. This method is known as simple kriging. For µ(·), we might choose an older/cheaper/simpler version of the function f.
A second choice, known as universal kriging, takes
$$Y = f(x) = \sum_j \beta_j f_j(x) + Z(x)$$
where once again Z(·) is a mean zero Gaussian process. Here β_j are unknown coefficients, while f_j(·) are known predictor functions. They could for instance be polynomials or sinusoids, or as above, some prior versions of f. Sometimes the β_j are given a Gaussian prior, and other times they are treated as unknown constants. That makes Σ_j β_j f_j a fixed effect term which in combination with a random effect Z(·) makes this model one of mixed effects. It has greater complexity than simple kriging. Roustant et al. (2012) describe how to analyze this situation.
The third model we consider is ordinary kriging. Here
$$Y = f(x) = \mu + Z(x)$$
for an unknown constant µ. For the covariance of Z it is common to take Σ(x, x′) = σ²R(x, x′) for a stationary correlation function
$$R(x, x') \equiv R(x - x'),$$
often of the product form R(x − x′) = Π_{j=1}^d ρ_j(x_j − x_j′; θ_j). This model fits with factor sparsity. The value of θ_j may make some x_j very important and others unimportant. It reduces our problem to finding covariances for the d = 1 case. If all d correlations ρ_j(·) are valid then so is their tensor product.
One common choice is the squared exponential covariance, ρ(h; θ) = exp(−(h/θ)²) in one common parameterization.
Figure 14.8: A Brownian motion path and some ‘skeletons of it’. From Owen
(2020).
[Figure: sample paths of Gaussian processes on [0, 1] with Matérn covariances, two paths for each of ν = 3/2, 5/2 and 7/2.]
four of them:
We get a function f˜ that matches all known values and derivatives and integrals
while also being consistently extendable to the unknown values and derivatives
and integrals. It is quite common for automatic differentiation codes to deliver
gradients along with function values. Kriging can make use of known gradients
to better predict function values.
If there is some Monte Carlo sampling embedded in f (x), then we may need
to model f as having an unknown true value somewhat different from the value
we computed. Let’s suppose that Y_i = f(x_i) + ε_i for measurement noise ε_i that is independent of the Gaussian process distribution of f. Then
$$\mathrm{cov}(Y_i, Y_{i'}) = \sigma^2 R(x_i, x_{i'}) + \sigma_0^2\, 1_{i=i'}$$
if we assume that the ε_i are IID with mean zero and variance σ_0². Here R(·, ·) may be any one of our prior correlation functions.
Now suppose that there is sporadic roughness in f(·) such as small step discontinuities that we described as numerical noise above. We can model that using what is called a nugget effect. The covariance is
$$\mathrm{cov}(Y_i, Y_{i'}) = \sigma^2 R(x_i, x_{i'}) + \sigma_0^2\, 1_{x_i=x_{i'}}.$$
This looks a lot like the way we handled measurement noise above. What changed is that 1_{i=i′} has become 1_{x_i=x_{i′}}. These would be different if we had two observations i ≠ i′ with x_i = x_{i′}, that is replicates. A nugget effect is a kind of “reproducible noise”.
In class we looked at and discussed Figures 1, 2 and 3 from Roustant et al. (2012). Figure 1 shows f̃(x) and 95% confidence bands. It is simple kriging with a quadratic curve 11x + 2x². The covariance is Matérn with ν = 5/2, θ = 0.4 and σ = 5. Figure 2 shows three different θ. The leftmost panel of Figure 3 shows a mean reversion issue. As we move x_0 away from the region sampled by x_1, . . . , x_n then f̃(x_0) reverts towards (the estimated value of) µ. The dissertation Lee (2017) provides an alternative form of kriging that reverts toward the nearest neighbor.
The interpolating prediction for simple kriging is
$$\tilde f(x_0) = \mu(x_0) + c(x_0)^T C^{-1}(Y - \mu)$$
where
$$Y = (f(x_1), \ldots, f(x_n))^T, \qquad c(x_0) = (\Sigma(x_1, x_0), \ldots, \Sigma(x_n, x_0))^T, \quad\text{and}$$
$$C = \begin{pmatrix} \Sigma(x_1, x_1) & \cdots & \Sigma(x_1, x_n)\\ \vdots & \ddots & \vdots\\ \Sigma(x_n, x_1) & \cdots & \Sigma(x_n, x_n)\end{pmatrix},$$
with µ = (µ(x_1), . . . , µ(x_n))^T. This follows from the multivariate Gaussian model. For Σ(x, x̃) = σ²R(x, x̃) we can replace Σ(·, ·) by R(·, ·).
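A sketch of this predictor and its conditional variance; the squared exponential covariance and the data in the example are invented for illustration and are not from the text:

    import numpy as np

    def simple_krige(x0, X, Y, mu, cov):
        """Simple kriging prediction at x0 given data Y_i = f(x_i).

        mu(x) is the known trend and cov(x, x') the covariance function; returns
        the conditional mean and variance from the multivariate Gaussian model.
        """
        C = np.array([[cov(xi, xj) for xj in X] for xi in X])
        c0 = np.array([cov(xi, x0) for xi in X])
        w = np.linalg.solve(C, c0)                        # C^{-1} c(x0)
        m = mu(x0) + w @ (Y - np.array([mu(xi) for xi in X]))
        v = cov(x0, x0) - c0 @ w                          # conditional variance
        return m, v

    # illustrative squared exponential covariance in one dimension (assumed form)
    cov = lambda x, z: 4.0 * np.exp(-(x - z) ** 2 / (2 * 0.3 ** 2))
    X = np.array([0.1, 0.4, 0.6, 0.9])
    Y = np.sin(2 * np.pi * X)
    print(simple_krige(0.5, X, Y, mu=lambda x: 0.0, cov=cov))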
One difficulty with kriging is that the cost of the linear algebra ordinarily grows proportionally to n³. This may be OK if f(·) is so expensive that only a tiny value of n is possible. Otherwise we might turn to polynomial chaos or quasi-regression. A second difficulty with kriging is that C is very often nearly singular. Indeed that is perhaps a sign of very good prediction. For instance if f(x_n) is almost identical to the linear combination of f(x_1), . . . , f(x_{n−1}) that we would have used to get f̃(x_n) from the first n − 1 points then C is nearly singular.
Ritter (1995) shows that kriging can attain prediction errors f̃(x_0) − f(x_0) that are O(n^{−r−1/2+ε}) for n evaluations of a function with r derivatives. Here ε > 0 hides powers of log(n). The interpolations in kriging have some connections to classical numerical analysis methods that may explain why they work so well (Diaconis, 1988). When the process is Brownian motion f(t) = B(t), then the predictions are linear splines (in one dimension). For a process that is once integrated Brownian motion f(t) = ∫_0^t B(x) dx, we find that f̃ is a cubic spline interpolator.
14.8 Optimization
One of the main uses of kriging is to approximately find the optimum of the function f(x) on X. That is, we might seek x_∗ = arg min_x f(x). Given f(x_1), . . . , f(x_n), we could estimate x_∗ by x̃_∗ = arg min_x f̃(x) where f̃(x) = E(f(x) | f(x_1), . . . , f(x_n)). However, if we are still searching for x_∗
then x̃∗ is not necessarily the best choice for xn+1 . Suppose for instance that
a confidence interval yields f (x̃∗ ) = 10 ± 1 while at a different location x0 we
have f (x0 ) = 11 ± 5. Then x0 could well be a better choice for the next function
evaluation.
The DiceOptim package of Roustant et al. (2012) chooses xn+1 to be the
point with the greatest expected improvement as described next. First let f^∗ ≡ min_i f(x_i). Then, if f(x) < f^∗ we improve by f^∗ − f(x). Otherwise, we improve by 0. The expected improvement at x is
$$\mathrm{EI}(x) = \mathrm{E}\bigl(\max(f^* - f(x), 0) \mid f(x_1), \ldots, f(x_n)\bigr).$$
There is a closed form expression for EI in terms of ϕ(·) and Φ(·). We could then take
$$x_{n+1} = \arg\max_x\, \mathrm{EI}(x).$$
Like bandit methods, this approach involves some betting on optimism. The
optimization of EI may be difficult because that function could be very mul-
timodal. However, if each evaluation of f (·) takes many hours, then there is
plenty of time available to search for the optimum of EI, since it will ordinarily
be inexpensive to evaluate. Figure 22 of Roustant et al. (2012) illustrates this
process. That paper also describes how to choose multiple candidate points at once, in case the corresponding evaluations of f can be computed in parallel. Balandat et al.
(2019) use randomized QMC in their search for the best expected improvement.
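For reference, a sketch of the closed form expected improvement in terms of ϕ(·) and Φ(·) mentioned above, for minimization with a Gaussian predictive distribution N(m(x), s(x)²):

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(m, s, f_best):
        """Closed form EI for a Gaussian predictive distribution N(m, s^2).

        EI = E(max(f_best - f(x), 0)) = (f_best - m)*Phi(z) + s*phi(z),
        with z = (f_best - m)/s, for minimization of f.
        """
        m, s = np.asarray(m, dtype=float), np.asarray(s, dtype=float)
        z = (f_best - m) / np.where(s > 0, s, 1.0)
        ei = (f_best - m) * norm.cdf(z) + s * norm.pdf(z)
        return np.where(s > 0, ei, np.maximum(f_best - m, 0.0))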
Figures 8(abc) of Koehler and Owen (1996) show some of these for d = 2. Another criterion is
$$\min_{x_1,\ldots,x_n} \int_{[0,1]^d} \mathrm{E}\bigl((\tilde f(x) - f(x))^2\bigr)\,\mathrm{d}x,$$
to minimize the integrated mean square error (MISE). Figures 9(ab) show some
examples for d = 2. The maximum entropy designs place points on the boundary
of [0, 1]2 while the MISE designs are internal.
Some other design approaches presented in Johnson et al. (1990) are called
minimax and maximin designs. They are related to packing and covering
as described in Conway and Sloane (2013).
If you were placing coffee shops at points x1 , . . . , xn you might want to minimize
the maximum distance that a customer (who might be at x0 ) has to go to get
to one of your shops. We can also think of “covering” the cube [0, 1]d with n
disks of small radius and centers x1 , . . . , xn .
The maximin design chooses x_1, . . . , x_n to maximize the minimum pairwise distance, min_{i≠i′} d(x_i, x_{i′}).
In coffee terms, we would not want any two coffee shops to be very close to each
other. We can think of this as successfully “packing” n disks into [0, 1]d without
any of them overlapping.
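Both criteria are easy to evaluate for a candidate design; a sketch, approximating the minimax criterion over a finite reference grid (an assumption made here for computability):

    import numpy as np
    from scipy.spatial.distance import cdist

    def maximin_criterion(X):
        """Smallest pairwise distance among design points (to be maximized)."""
        D = cdist(X, X)
        return D[np.triu_indices(len(X), k=1)].min()

    def minimax_criterion(X, grid):
        """Largest distance from a grid of candidate 'customers' to the nearest
        design point (to be minimized); the grid approximates [0,1]^d."""
        return cdist(grid, X).min(axis=1).max()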
Johnson et al. (1990) show that minimax designs are G-optimal in the θ → ∞
limit. This is the limit in which values of f(x) and f(x′) most quickly become independent as x and x′ move away from each other.
Park (1994) looks at ways to numerically optimize criteria such as the above among Latin hypercube sample designs.
Figure 14.10: The left panel shows 1024 random points. The middle panel
shows a Sobol’ sequence. The right panel shows some scrambled Sobol’ points.
Guest lectures and hybrids of experimental and observational data

We had two guest lectures by people using experimental design and developing new methods to handle the new problems. We also had a lecture on methods to mix some randomization in with what would otherwise be an observational causal inference.
different from raising 10% of them by 5 seconds. They compare 50th and 90th percentiles within the A and B populations. Comparing quantiles is more complicated than comparing means and bootstrapping is too slow at scale.

Some networks can be chopped up into pieces that barely overlap at all and then treatments can be randomized to those pieces. This becomes very difficult in networks of people where some may have thousands of neighbors.
A second SUTVA violation arises in two-sided markets visualized as bipartite
graphs. Think of links between advertisers and members. An experiment on one
side of the graph can affect participants on the other and indirectly spill over
to the first side. What that means is that, for instance, a difference observed
between members in treatment and control groups might not end up as the real
difference seen when making the change for all members.
priate to the business. In a tie-breaker they could offer it instead to the top 5%
of customers and randomly to half of the next 10% of customers.
The analysis in Owen and Varian (2020) shows that statistical efficiency is
monotonically increasing in the amount of experimentation. Of course there is
a cost issue preventing one from just making all of the awards at random. The
paper analyzes that tradeoff. It also shows that there is no benefit to making
the award probability take values other than 0%, 50% or 100%, perhaps on a
sliding scale.
Wrap-up
The final lecture of the class had several of the students presenting their final
projects. These were about understanding or tuning household tasks like cuisine
or entertaining young children or morning wakeup rituals or hobbies such as
horticulture. It was very nice to see a range of design ideas. From a survey
of the class it seemed that fractional factorials and analysis of covariance ideas
turned out to be most widely used.
Prior to those examples was a short note summarizing the topics of the
course. That was preceded by a brief overview of the statistics problem in
general.
Figure 16.1: The left figure shows how we envision one and the same statistical model producing both our known data and some unknown data of interest.
In inference we reverse the arrow from the model to the known data. Then the
known data tell us something about the model (with some quantified uncer-
tainty). From that we can derive consequences about the unknown data.
model, deductive inference lets us derive consequences for the unknown data.
Induction leaves us with some uncertainty about the model. When we derive
something about the unknown data we can propagate that uncertainty.
One could argue that it is logically impossible to learn the general case from
specific observations. For a survey of the problem of induction in philosophy,
see Henderson (2018). We do it anyway, accepting that the certainty possible
in mathematics may not be available in other settings.
A famous observation from George Box is that all models are wrong, but
some are useful. Nobody, not even Box, could give us a list of which one is
useful when. As applied statisticians, it is our responsibility to do that in any
case that we handle. There are settings where we believe that we can get a
usable answer from an incorrect model. Sometimes we know that small errors
in the model will yield only small errors in our inferences. This is a notion of
robustness. In other settings we can get consequences from our model that
can be tested later in new settings. This is a notion of validation, as if we
were doing “guess and check”. Then, even if the model had errors we can get a
measure of how well it performs.
There are approaches to inference that de-emphasize or perhaps remove the
model. We can imagine the path being like the right hand side of Figure 16.2
that avoids the model entirely and then unlocks more computational possibili-
ties. There may well be an implicit model there, such as unknown (x, Y ) pairs
being IID from the same distribution that produced the known ones. IID sam-
pling would provide a justification for using cross-validation or the bootstrap.
For a discussion of the role of models in statistics and whether they are really
necessary, see Breiman (2001).
[Figure 16.2: the inferential path through a statistical model alongside a direct algorithmic path from known to unknown data.]
The setting we considered in this course relates to the usual inference prob-
lem as shown in Figure 16.3. We were down in the lower left hand corner looking
at how to make the data that would then be fed into the inferential machinery.
[Figure 16.3: boxes for the Model, Known Data, Unknown Data and Expt Design, with experimental design feeding the known data.]
Angrist, J. D. and Pischke, J.-S. (2014). Mastering ‘metrics: The path from
cause to effect. Princeton University Press, Princeton, NJ.
Atkinson, A., Donev, A., and Tobias, R. (2007). Optimum experimental designs,
with SAS. Oxford University Press.
Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G.,
and Bakshy, E. (2019). Botorch: Programmable Bayesian optimization in
PyTorch. Technical report, arXiv:1910.06403.
181
182 Bibliography
Box, G. E. P., Hunter, J. S., and Hunter, W. G. (2005). Statistics for experimenters: design, innovation, and discovery. Wiley, New York.
Box, G. E. and Behnken, D. W. (1960). Some new three level designs for the
study of quantitative variables. Technometrics, 2(4):455–475.
Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for experimenters. John Wiley and Sons, New York.
Cochran, W. G. (1977). Sampling techniques. John Wiley & Sons, New York.
Fang, K.-T., Li, R., and Sudjianto, A. (2006). Design and modeling for computer
experiments. CRC press, Boca Raton, FL.
Fleiss, J. L. (1986). Design and analysis of clinical experiments. John Wiley &
Sons, New York.
Friedman, J., Hastie, T., and Tibshirani, R. (2001). The elements of statistical
learning. Springer, New York.
Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2017). Peeking at A/B
tests: Why it matters, and what to do about it. In Proceedings of the 23rd
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pages 1517–1525.
Johnson, M. E., Moore, L. M., and Ylvisaker, D. (1990). Minimax and maximin
distance designs. Journal of statistical planning and inference, 26(2):131–148.
Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. M. (2009). Con-
trolled experiments on the web: survey and practical guide. Data mining and
knowledge discovery, 18(1):140–181.
Kohavi, R., Tang, D., and Xu, Y. (2020). Trustworthy Online Controlled Ex-
periments: A Practical Guide to A/B Testing. Cambridge University Press.
Lee, M. R. and Shen, M. (2018). Winner’s curse: Bias estimation for total effects
of features in online controlled experiments. In Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining,
pages 491–499.
Mase, M., Owen, A. B., and Seiler, B. (2019). Explaining black box decisions
by shapley cohort refinement. Technical report, arXiv:1911.00467.
Myers, R. H., Khuri, A. I., and Carter, W. H. (1989). Response surface methodology: 1966–1988. Technometrics, 31(2):137–157.
Nair, V. N., Abraham, B., MacKay, J., Box, G., Kacker, R. N., Lorenzen,
T. J., Lucas, J. M., Myers, R. H., Vining, G. G., Nelder, J. A., Phadke,
M. S., Sacks, J., Welch, W. J., Shoemaker, A. C., Tsui, K. L., Taguchi, S.,
and Wu, C. F. J. (1992). Taguchi’s parameter design: a panel discussion.
Technometrics, 34(2):127–161.
Owen, A. B. (2017). Confidence intervals with control of the sign error in low
power settings. Technical report, Stanford University.
Pearl, J. and Mackenzie, D. (2018). The book of why: the new science of cause
and effect. Basic Books.
Phoa, F. K., Pan, Y.-H., and Xu, H. (2009). Analysis of supersaturated de-
signs via the Dantzig selector. Journal of Statistical Planning and Inference,
139(7):2362–2372.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). Why should I trust you?:
Explaining the predictions of any classifier. In Proceedings of the 22nd ACM
SIGKDD international conference on knowledge discovery and data mining,
pages 1135–1144, New York. ACM.
Rosenman, E., Owen, A. B., Baiocchi, M., and Banack, H. (2018). Propensity
score methods for merging observational and experimental datasets. Technical
report, arXiv:1804.07863.
Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989). Design and
analysis of computer experiments. Statistical science, pages 409–423.
Saltelli, A., Ratto, M., Andres, T., Campolongo, F., Cariboni, J., Gatelli,
D., Saisana, M., and Tarantola, S. (2008). Global Sensitivity Analysis. The
Primer. John Wiley & Sons, Ltd, New York.
Santner, T. J., Williams, B. J., and Notz, W. I. (2018). The design and analysis
of computer experiments. Springer, New York, second edition.
Youden, W., Kempthorne, O., Tukey, J. W., Box, G., and Hunter, J. (1959).
Discussion of the papers of Messrs. Satterthwaite and Budne. Technometrics,
1(2):157–184.