Statistical Design and Analysis of Biological Experiments

Hans-Michael Kaltenbach
Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
For Elke
Preface
This book serves as an introduction to the design and analysis of experiments and
uses examples from biology and life sciences. It is aimed at students and researchers
wanting to gain or refresh knowledge of experimental design. Previous exposure to
a standard statistics introduction would be helpful, but the required material is also
reviewed in the first chapters to make the book self-contained. Most calculations are
demonstrated in R, but should be easily transferable to other software.
The main feature of this book is the use of Hasse diagrams to construct and visu-
alize a design. These diagrams clarify and simplify many ideas in experimental design
yet have received little attention for teaching. They allow me to focus on the logical
structure of an experimental design, facilitate comparisons between designs, cleanly
separate treatments from units, and make the randomization more explicit. The visu-
alization by Hasse diagrams encourages trying different modifications of a design,
and an appropriate linear model and its specification in R are then easily derived
from a given diagram. The analysis of variance techniques are carefully developed
from basic principles, and I exploit Hasse diagrams to derive model specifications
for linear mixed models as a modern alternative for more complex designs.
I aimed at keeping the book concise enough to be read cover-to-cover, yet to
include the standard experimental designs found in biological research. I also discuss
fractional factorial and response surface designs that proved invaluable for optimizing
experimental conditions in my own practice. I believe that power analysis is an
important part of experimental design and discuss this topic in detail; this includes
‘portable power’ as a quick-and-dirty tool for approximate sample size calculations
that I have not seen anywhere else. Finally, I strongly emphasize estimation and effect
sizes over testing and p-values and therefore discuss linear contrasts at considerable
length; I also present standardized effect sizes usually not discussed in texts in the
biomedical sciences.
In advancing through the material, I rely on a single main example—drugs and
diets and their effect on mice—and use artificial data for illustrating the analyses.
This approach has the advantage that the scientific questions and the experimental
details can be handled rather easily, which allows me to introduce new designs and
analysis techniques in an already familiar setting. It also emphasizes how the same
treatments and experimental material can be organized and combined in different
ways, which results in different designs with very different properties. Real experiments always have their own idiosyncrasies, of course, and rarely conform to the nice and clean designs found in textbooks. To account for this fact, I discuss several real-life
examples with all their deviations from the most elegant designs, failed observations,
and alternative interpretations of the results; these mostly originate from consultation
and collaboration with colleagues at my department.
The book is organized in three main parts: Chaps. 1–3 introduce the design of
experiments and the associated vocabulary, and provide a brief but thorough review
of statistical concepts. Readers familiar with the material of a standard introductory
statistics class might read Chap. 1 and then skim through Chaps. 2–3 to absorb the
notation and get acquainted with Hasse diagrams and the introductory examples.
The main designs and analysis techniques are discussed in Chaps. 4–8. This mate-
rial provides the core of an introductory design of experiments class and includes
completely randomized and blocked designs, factorial treatment designs, and split-
unit designs. Chapters 9–10 introduce two more advanced methods and discuss the
main ideas of fractional factorial designs for handling larger experiments and factor
screening, and of response surface methods for optimization.
I am indebted to many people who contributed to this book. Lekshmi Dharmarajan
and Julia Deichmann taught tutorials and often suffered through last-minute changes
while the material and exposition took shape. They also provided valuable comments
on earlier versions of the text. Many students at the Department of Biosystems
Science and Engineering at ETH Zurich provided corrections and helpful feedback.
Christian Lohasz and Andreas Hierlemann were kind enough to allow me to use their
tumor diameter data, and Tania Roberts and Fabian Rudolf did the same for the yeast
medium example. Cristina Loureiro Casalderrey worked on the yeast transformation
example. My wife endured countless hours of me locked away in an office, staring
and mumbling at a computer screen. Jörg Stelling generously granted me time and
support for writing this book. Thank you all!
Finally, https://gitlab.com/csb.ethz/doe-book provides datasets and R-code for
most examples as well as errata.
Chapter 1
Principles of Experimental Design
1.1 Introduction
The validity of conclusions drawn from a statistical analysis crucially hinges on the
manner in which the data are acquired, and even the most sophisticated analysis
will not rescue a flawed experiment. Planning an experiment and thinking about
the details of data acquisition are so important for a successful analysis that R. A.
Fisher—who single-handedly invented many of the experimental design techniques
we are about to discuss—famously wrote
To call in the statistician after the experiment is done may be no more than asking him to
perform a post-mortem examination: he may be able to say what the experiment died of.
(Fisher 1938)
The (statistical) design of experiments provides the principles and methods for plan-
ning experiments and tailoring the data acquisition to an intended analysis. The
design and analysis of an experiment are best considered as two aspects of the same
enterprise: the goals of the analysis strongly inform an appropriate design, and the
implemented design determines the possible analyses.
The primary aim of designing experiments is to ensure that valid statistical and
scientific conclusions can be drawn that withstand the scrutiny of a determined skep-
tic. A good experimental design also considers that resources are used efficiently
and that estimates are sufficiently precise and hypothesis tests adequately powered.
It protects our conclusions by excluding alternative interpretations or rendering them
implausible. Three main pillars of experimental design are randomization, replica-
tion, and blocking, and we will flesh out their effects on the subsequent analysis as
well as their implementation in an experimental design.
An experimental design is always tailored toward pre-defined (primary) analyses,
and an efficient analysis and unambiguous interpretation of the experimental data is
often straightforward from a good design. This does not prevent us from conduct-
ing additional analyses of interesting observations after the data are acquired, but
these analyses can be subjected to more severe criticisms, and conclusions are more
tentative.
In this chapter, we provide the wider context for using experiments in a larger
research enterprise and informally introduce the main statistical ideas of experimental
design. We use a comparison of two samples as our main example to study how
design choices affect an analysis, but postpone a formal quantitative analysis to the
next chapters.
1.2 A Cautionary Tale

For illustrating some of the issues arising in the interplay of experimental design and
analysis, we consider a simple example. We are interested in comparing the enzyme
levels measured in processed blood samples from laboratory mice, when the sample
processing is done either with a kit from vendor A or a kit from a competitor B. For
this, we take 20 mice and randomly select 10 of them for sample preparation with kit
A, while the blood samples of the remaining 10 mice are prepared with kit B. The
experiment is illustrated in Fig. 1.1A and the resulting data are given in Table 1.1.
One option for comparing the two kits is to look at the difference in average
enzyme levels, and we find an average level of 10.32 for vendor A and 10.66 for
vendor B. We would like to interpret their difference of -0.34 as the difference due
to the two preparation kits and conclude whether the two kits give equal results or if
Fig. 1.1 Three designs to determine the difference between two preparation kits A and B based on
four mice. A One sample per mouse. Comparison between averages of samples with the same kit. B
Two samples per mouse treated with the same kit. Comparison between averages of mice with the
same kit requires averaging responses for each mouse first. C Two samples per mouse each treated
with a different kit. Comparison between two samples of each mouse, with differences averaged
Table 1.1 Measured enzyme levels from samples of 20 mice. Samples from 10 mice each were
processed using a kit from vendor A or vendor B, respectively
A 8.96 8.95 11.37 12.63 11.38 8.36 6.87 12.35 10.32 11.99
B 12.68 11.37 12.00 9.81 10.35 11.76 9.01 10.83 8.76 9.99
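As a minimal sketch in R, we can reproduce these averages and their difference directly from the values in Table 1.1 (the vector names kitA and kitB are our own):

```r
# Enzyme levels from Table 1.1
kitA <- c(8.96, 8.95, 11.37, 12.63, 11.38, 8.36, 6.87, 12.35, 10.32, 11.99)
kitB <- c(12.68, 11.37, 12.00, 9.81, 10.35, 11.76, 9.01, 10.83, 8.76, 9.99)
mean(kitA)               # average level for vendor A: approx. 10.32
mean(kitB)               # average level for vendor B: approx. 10.66
mean(kitA) - mean(kitB)  # difference between kits: approx. -0.34
```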
measurements based on one kit are systematically different from those based on the
other kit.
Such interpretation, however, is only valid if the two groups of mice and their
measurements are identical in all aspects except the sample preparation kit. If we use
one strain of mice for kit A and another strain for kit B, any difference might also be
attributed to inherent differences between the strains. Similarly, if the measurements
using kit B were conducted much later than those using kit A, any observed difference
might be attributed to changes in, e.g., mice selected, batches of chemicals used,
device calibration, and any number of other influences. None of these competing
explanations for an observed difference can be excluded from the given data alone, but
a good experimental design allows us to render them (almost) arbitrarily implausible.
The second aspect of our analysis is the inherent uncertainty in our calculated difference: if we repeat the experiment, the observed difference will change each time, and this fluctuation will be more pronounced for a smaller number of mice, among other factors.
If we do not use a sufficient number of mice in our experiment, the uncertainty asso-
ciated with the observed difference might be too large, such that random fluctuations
become a plausible explanation for the observed difference. Systematic differences
between the two kits, of practically relevant magnitude in either direction, might
then be compatible with the data, and we can draw no reliable conclusions from our
experiment.
In each case, the statistical analysis—no matter how clever—was doomed before
the experiment was even started, while simple ideas from statistical design of experi-
ments would have provided correct and robust results with interpretable conclusions.
1.3 The Language of Experimental Design

In our example, we selected the mice, used a single sample per mouse, deliberately
chose the two specific vendors, and had full control over which kit to assign to
which mouse. In other words, the two kits are the treatments and the mice are the
experimental units. We took the measured enzyme level of a single sample from a
mouse as our response, and samples are therefore the response units. The resulting
experiment is comparative because we contrast the enzyme levels between the two
treatment groups.
In this example, we can coalesce experimental and response units, because we
have a single response per mouse and cannot distinguish a sample from a mouse
in the analysis, as illustrated in Fig. 1.1A for four mice. Responses from mice with
the same kit are averaged, and the kit difference is the difference between these two
averages.
By contrast, if we take two samples per mouse and use the same kit for both
samples, then the mice are still the experimental units, but each mouse now groups
the two response units associated with it. Now, responses from the same mouse are
first averaged, and these averages are used to calculate the difference between kits;
even though eight measurements are available, this difference is still based on only
four mice (Fig. 1.1B).
If we take two samples per mouse but apply each kit to one of the two samples,
then the samples are both the experimental and response units, while the mice are
blocks that group the samples. Now, we calculate the difference between kits for
each mouse, and then average these differences (Fig. 1.1C).
If we only use one kit and determine the average enzyme level, then this investi-
gation is still an experiment but is not comparative.
To summarize, the design of an experiment determines the logical structure of the
experiment; it consists of (i) a set of treatments (the two kits), (ii) a specification of
the experimental units (such as animals, cell lines, or samples) (the mice in Fig. 1.1A,
B and the samples in Fig. 1.1C), (iii) a procedure for assigning treatments to units,
and (iv) a specification of the response units and the quantity to be measured as a
response (the samples and associated enzyme levels).
1.4 Experiment Validity

Construct validity concerns the choice of the experimental system for answering our
research question. Is the system even capable of providing a relevant answer to the
question?
Studying the mechanisms of a particular disease, for example, might require
a careful choice of an appropriate animal model that shows a disease phenotype
and is accessible to experimental interventions. If the animal model is a proxy for
drug development for humans, biological mechanisms must be sufficiently similar
between animal and human physiologies.
Another important aspect of the construct is the quantity that we intend to measure
(the measurand) and its relation to the quantity or property we are interested in. For
example, we might measure the concentration of the same chemical compound once
in a blood sample and once in a highly purified sample, and these constitute two
different measurands, whose values might not be comparable. Often, the quantity
of interest (e.g., liver function) is not directly measurable (or even quantifiable) and
we measure a biomarker instead. For example, pre-clinical and clinical investiga-
tions may use concentrations of proteins or counts of specific cell types from blood
samples, such as the CD4+ cell count used as a biomarker for an immune system
function.
The internal validity of an experiment concerns the soundness of the scientific ratio-
nale, statistical properties such as precision of estimates, and the measures taken
against the risk of bias. It refers to the validity of claims within the context of the
experiment. The statistical design of experiments plays a prominent role in ensuring
internal validity, and we briefly discuss the main ideas before providing the technical
details and an application to our example in the subsequent sections.
The scientific rationale prompts questions such as: what exactly is the treatment, and how is it administered? How do we make sure that the placebo group is comparable to the drug group in all other aspects? What do we measure, and what do we mean by 'difference': a shift in average response, a fold-change, or a change in response before and after treatment?
The scientific rationale also enters the choice of a potential control group to which we compare responses, as captured in the quote

The deep, fundamental question in statistical analysis is 'Compared to what?' (Tufte 1997)
Another aspect of internal validity is the precision of estimates and the expected
effect sizes. Is the experimental setup, in principle, able to detect a difference of
relevant magnitude? Experimental design offers several methods for answering this
question based on the expected heterogeneity of samples, the measurement error, and
other sources of variation: power analysis is a technique for determining the number
of samples required to reliably detect a relevant effect size and provide estimates
of sufficient precision. More samples yield more precision and more power, but we
have to be careful that replication is done at the right level: simply measuring a
biological sample multiple times as in Fig. 1.1B yields more measured values, but
is pseudo-replication for the analysis. Replication should also ensure that the statistical
uncertainties of estimates can be gauged from the data of the experiment itself,
without additional untestable assumptions. Finally, the technique of blocking, shown
in Fig. 1.1C, can remove a substantial proportion of the variation and thereby increase
power and precision if we find a way to apply it.
The external validity of an experiment concerns its replicability and the general-
izability of inferences. An experiment is replicable if its results can be confirmed
by an independent new experiment, preferably by a different lab and researcher.
Experimental conditions in the replicate experiment usually differ from the original
experiment, which provides evidence that the observed effects are robust to such
changes. A much weaker condition on an experiment is reproducibility, the prop-
erty that an independent researcher draws equivalent conclusions based on the data
from this particular experiment, using the same analysis techniques. Reproducibility
requires publishing the raw data, details on the experimental protocol, and a descrip-
tion of the statistical analyses, preferably with accompanying source code. Many
scientific journals subscribe to reporting guidelines to ensure reproducibility, and
these are also helpful for planning an experiment.
An important threat to replicability and generalizability is too tightly controlled experimental conditions, where inferences only hold for a specific lab under the very
specific conditions of the original experiment. Introducing systematic heterogeneity
and using multi-center studies effectively broadens the experimental conditions and
therefore the inferences for which internal validity is available.
For systematic heterogeneity, experimental conditions are systematically altered
in addition to the treatments, and treatment differences are estimated for each con-
dition. For example, we might split the experimental material into several batches
and use a different day of analysis, sample preparation, batch of buffer, measurement
device, and lab technician for each batch. A more general inference is then possible
if effect size, effect direction, and precision are comparable between the batches,
indicating that the treatment differences are stable over the different conditions.
In multi-center experiments, the same experiment is conducted in several different
labs and the results compared and merged. Multi-center approaches are very com-
mon in clinical trials and often necessary to reach the required number of patient
enrollments.
Generalizability of randomized controlled trials in medicine and animal studies
can suffer from overly restrictive eligibility criteria. In clinical trials, patients are often
included or excluded based on co-medications and co-morbidities, and the resulting
sample of eligible patients might no longer be representative of the patient population.
For example, Travers et al. (2007) used the eligibility criteria of 17 randomized controlled trials of asthma treatments and found that out of 749 patients, only a median of 6%
(45 patients) would be eligible for an asthma-related randomized controlled trial.
This puts a question mark on the relevance of the trials’ findings for asthma patients
in general.
1.5 Reducing the Risk of Bias

If systematic differences other than the treatment exist between our treatment groups,
then the effect of the treatment is confounded with these other differences and our
estimates of treatment effects might be biased.
We remove such unwanted systematic differences from our treatment comparisons
by randomizing the allocation of treatments to experimental units. In a completely
randomized design, each experimental unit has the same chance of being subjected to
any of the treatments, and any differences between the experimental units other than
the treatments are distributed over the treatment groups. Importantly, randomization
is the only method that also protects our experiment against unknown sources of
bias: we do not need to know all or even any of the potential differences and yet their
impact is eliminated from the treatment comparisons by random treatment allocation.
Randomization has two effects: (i) differences unrelated to treatment become
part of the ‘statistical noise’ rendering the treatment groups more similar; and (ii) the
systematic differences are thereby eliminated as sources of bias from the treatment
comparison.
Randomization transforms systematic variation into random variation.
In our example, proper randomization would select 10 out of our 20 mice fully at
random, such that the probability of any one mouse being picked is 1/20. These 10
mice are then assigned to kit A and the remaining mice to kit B. This allocation is
entirely independent of the treatments and any properties of the mice.
To ensure random treatment allocation, some kind of random process needs to be
employed. This can be as simple as shuffling a pack of 10 red and 10 black cards or
using a software-based random number generator. Randomization is slightly more
difficult if the number of experimental units is not known at the start of the experiment,
such as when patients are recruited for an ongoing clinical trial (sometimes called
rolling recruitment), and we want to have a reasonable balance between the treatment
groups at each stage of the trial.
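A minimal R sketch of such a completely randomized allocation for our 20 mice; the seed is an arbitrary choice of ours and only makes the allocation reproducible:

```r
# Completely randomized allocation of 20 mice to kits A and B
set.seed(123)
mice <- 1:20
kitA <- sample(mice, size = 10)  # each mouse has the same chance of selection
kitB <- setdiff(mice, kitA)      # the remaining 10 mice receive kit B
```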
Seemingly random assignments ‘by hand’ are usually no less complicated than
fully random assignments, but are always inferior. If surprising results ensue from the
experiment, such assignments are subject to unanswerable criticism and suspicion
of unwanted bias. Even worse are systematic allocations; they can only remove bias
from known causes, and immediately raise red flags under the slightest scrutiny.
Even with a fully random treatment allocation procedure, we might end up with an
undesirable allocation. For our example, the treatment group of kit A might—just
by chance—contain mice that are all bigger or more active than those in the other group.
An often overlooked source of bias has been termed the researcher degrees of free-
dom or garden of forking paths in the data analysis. For any set of data, there are
many different options for its analysis: some results might be considered outliers
and be discarded, assumptions are made on error distributions and appropriate test
statistics, and different covariates might be included into a regression model. Often,
multiple hypotheses are investigated and tested, and analyses are done separately
on various (overlapping) subgroups. Hypotheses formed after looking at the data
require additional care in their interpretation; almost never will p-values for these
ad hoc or post hoc hypotheses be statistically justifiable. Many different measured
response variables invite fishing expeditions, where patterns in the data are sought
without an underlying hypothesis. Only reporting those sub-analyses that gave ‘inter-
esting’ findings invariably leads to biased conclusions and is called cherry-picking
or p-hacking (or much less flattering names).
The statistical analysis is always part of a larger scientific argument, and we should
consider the necessary computations in relation to building our scientific argument
about the interpretation of the data. In addition to the statistical calculations, this
interpretation requires substantial subject-matter knowledge and includes (many)
non-statistical arguments. The following quote highlights that experiment and analysis are a means to an end and not an end in themselves.
There is a boundary in data interpretation beyond which formulas and quantitative deci-
sion procedures do not go, where judgment and style enter. (Abelson 1995)
Analysis Plan
The analysis plan is written before conducting the experiment and details the measur-
ands and estimands, the hypotheses to be tested together with a power and sample size
calculation, a discussion of relevant effect sizes, detection and handling of outliers
and missing data, as well as steps for data normalization such as transformations
and baseline corrections. If a regression model is required, its factors and covari-
ates are outlined. Particularly in biology, handling measurements below the limit of quantification and saturation effects requires careful consideration.
In the context of clinical trials, the problem of estimands has become a recent
focus of attention. An estimand is the target of a statistical estimation procedure, for
example, the true average difference in enzyme levels between the two preparation
kits. A common problem in many studies is post-randomization events that can
change the estimand, even if the estimation procedure remains the same. For example,
if kit B fails to produce usable samples for measurement in five out of 10 cases because
the enzyme level was too low, while kit A could handle these enzyme levels perfectly
fine, then this might severely exaggerate the observed difference between the two
kits. Similar problems arise in drug trials, when some patients stop taking one of the
drugs due to side-effects or other complications.
Notes
The problem of measurements and measurands is further discussed for statistics in
Hand (1996) and specifically for biological experiments in Coxon et al. (2019). A
general review of methods for handling missing data is Dong and Peng (2013). The
different roles of randomization are emphasized in Cox (2009).
Two well-known reporting guidelines are the ARRIVE guidelines for animal
research (Kilkenny et al. 2010) and the CONSORT guidelines for clinical trials
(Moher et al. 2010). Guidelines describing the minimal information required for
reproducing experimental results have been developed for many types of experi-
mental techniques, including microarrays (MIAME), RNA sequencing (MINSEQE),
metabolomics (MSI), and proteomics (MIAPE) experiments; the FAIRsharing initiative provides a more comprehensive collection (Sansone et al. 2019).
The problems of experimental design in animal experiments and particularly translational research are discussed in Couzin-Frankel (2013). Multi-center studies are now
considered for these investigations, and using a second laboratory already increases
reproducibility substantially (Richter 2017; Richter et al. 2010; Voelkl et al. 2018;
Karp 2018) and allows standardizing the treatment effects (Kafkafi et al. 2017). First
attempts are reported of using designs similar to clinical trials (Llovera and Liesz
2016). Exploratory–confirmatory research and external validity for animal studies
are discussed in Kimmelman et al. (2014) and Pound and Ritskes-Hoitinga (2018).
Further information on pilot studies is found in Moore et al. (2011), Sim (2019),
and Thabane et al. (2010).
The deliberate use of statistical analyses and their interpretation for support-
ing a larger argument was called statistics as principled argument (Abelson 1995).
Employing useless statistical analysis without reference to the actual scientific ques-
tion is surrogate science (Gigerenzer and Marewski 2014), and adaptive thinking is
integral to meaningful statistical analysis (Gigerenzer 2002).
Summary
In an experiment, the investigator has full control over the experimental conditions applied to the experimental material. The experimental design gives the logical struc-
ture of an experiment: the units describing the organization of the experimental
material, the treatments and their allocation to units, and the response. The statisti-
cal design of experiments includes techniques to ensure the internal validity of an
experiment and methods to make the inference from experimental data efficient.
References
Fisher, R. A. (1938). “Presidential Address to the First Indian Statistical Congress”. In: Sankhya:
The Indian Journal of Statistics 4, pp. 14–17.
Gigerenzer, G. (2002). Adaptive Thinking: Rationality in the Real World. Oxford Univ Press.
Gigerenzer, G. and J. N. Marewski (2014). “Surrogate Science: The Idol of a Universal Method for
Scientific Inference”. In: Journal of Management 41.2, pp. 421–440.
Hand, D. J. (1996). “Statistics and the theory of measurement”. In: Journal of the Royal Statistical
Society A 159.3, pp. 445–492.
Kafkafi, N. et al. (2017). “Addressing reproducibility in single-laboratory phenotyping experi-
ments”. In: Nature Methods 14.5, pp. 462–464.
Karp, N. A. (2018). “Reproducible preclinical research-Is embracing variability the answer?” In:
PLOS Biology 16.3, e2005413.
Kilkenny, C. et al. (2010). “Improving Bioscience Research Reporting: The ARRIVE Guidelines
for Reporting Animal Research”. In: PLOS Biology 8.6, e1000412.
Kimmelman, J., J. S. Mogil, and U. Dirnagl (2014). “Distinguishing between Exploratory and
Confirmatory Preclinical Research Will Improve Translation”. In: PLOS Biology 12.5, e1001863.
Llovera, G. and A. Liesz (2016). “The next step in translational research: lessons learned from the
first preclinical randomized controlled trial”. In: Journal of Neurochemistry 139, pp. 271–279.
Moher, D., S. Hopewell, et al. (2010). “CONSORT 2010 Explanation and Elaboration: updated
guidelines for reporting parallel group randomised trials”. In: BMJ: British Medical Journal 340.
Moore, C. G. et al. (2011). “Recommendations for planning pilot studies in clinical and translational
research.” In: Clinical and Translational Science 4.5, pp. 332–337.
Pound, P. and M. Ritskes-Hoitinga (2018). “Is it possible to overcome issues of external valid-
ity in preclinical animal research? Why most animal models are bound to fail”. In: Journal of
Translational Medicine 16.1, p. 304.
Richter, S. H. (2017). “Systematic heterogenization for better reproducibility in animal experimen-
tation”. In: Lab Animal 46.9, pp. 343–349.
Richter, S. H. et al. (2010). “Systematic variation improves reproducibility of animal experiments”.
In: Nature Methods 7.3, pp. 167–168.
Sansone, S.-A. et al. (2019). “FAIRsharing as a community approach to standards, repositories and
policies”. In: Nature Biotechnology 37.4, pp. 358–367.
Sim, J. (2019). “Should treatment effects be estimated in pilot and feasibility studies?” In: Pilot and
Feasibility Studies 5.107, e1–e7.
Thabane, L. et al. (2010). “A tutorial on pilot studies: the what, why and how”. In: BMC Medical
Research Methodology 10.1, p. 1.
Travers, J., S. Marsh, et al. (2007). “External validity of randomised controlled trials in asthma:
To whom do the results of the trials apply?” In: Thorax 62.3, pp. 219–233.
Tufte, E. (1997). Visual Explanations: Images and Quantities, Evidence and Narrative. 1st. Graphics
Press.
Voelkl, B. et al. (2018). “Reproducibility of preclinical animal research improves with heterogeneity
of study samples”. In: PLOS Biology 16.2, e2003693.
Würbel, H. (2017). “More than 3Rs: The importance of scientific validity for harm-benefit analysis
of animal research”. In: Lab Animal 46.4, pp. 164–166.
Chapter 2
Review of Statistical Concepts
2.1 Introduction
We briefly review some basic concepts in statistical inference for analyzing a given
set of data. The material roughly covers a typical introductory course in statistics:
describing variability, estimating parameters, and testing hypotheses in the context
of normally distributed data. The focus on the normal distribution avoids the need
for more advanced mathematical and computational machinery and allows us to
concentrate on the design rather than complex analysis aspects of an experiment in
later chapters.
2.2 Probability
We formally describe the observation for the $i$-th mouse by the outcome $y_i$ of its associated random variable $Y_i$. The (probability) distribution of a random variable $Y$ gives the probability of observing an outcome within a given interval. The probability density function (pdf) $f_Y(y)$ and the cumulative distribution function (cdf) $F_Y(y)$ of $Y$ both describe its distribution. The area under the pdf between two values $a$ and $b$ gives the probability that a realization of $Y$ falls into the interval $[a, b]$:
$$P(Y \in [a, b]) = P(a \le Y \le b) = \int_a^b f_Y(y')\,dy' = F_Y(b) - F_Y(a)\;,$$
while the cdf gives the probability that $Y$ will take any value lower or equal to $y$:
$$F_Y(y) = P(Y \le y) = \int_{-\infty}^{y} f_Y(y')\,dy'\;.$$
It is often reasonable to assume a normal (or Gaussian) distribution for the data. This distribution has two parameters $\mu$ and $\sigma^2$, and its probability density function is
$$f_Y(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\left(-\frac{1}{2}\cdot\left(\frac{y-\mu}{\sigma}\right)^2\right)\;,$$
which yields the famous bell-shaped curve, symmetric around a peak at $\mu$ and with width determined by $\sigma$. In our example, we have normally distributed observations with $\mu = 10$ and $\sigma^2 = 2$, which corresponds to the probability density function shown in Fig. 2.1.

We write $Y \sim N(\mu, \sigma^2)$ to say that the random variable $Y$ has a normal distribution with parameters $\mu$ and $\sigma^2$. The density is symmetric around $\mu$, and thus the probability that we observe an enzyme level lower than 10 is $P(Y \le 10) = 0.5$ in our example. The probability that the observed level falls above $a = 12.7$ is about $1 - F_Y(a) = \int_a^\infty f_Y(y')\,dy' \approx 0.025$ or 2.5%. More details on normal distributions and their properties are given in Sect. 2.2.6.
2.2.2 Quantiles
We are often interested in the $\alpha$-quantile $q_\alpha$ below which a realization falls with given probability $\alpha$, such that
$$P(Y \le q_\alpha) = \alpha\;.$$
Fig. 2.1 Normal density for $N(\mu = 10, \sigma^2 = 2)$ with 2.5% and 97.5% quantiles (arrows) and areas corresponding to 2.5% probability under the left and right tails (gray shaded areas)
The value $q_\alpha$ depends on the distribution of $Y$, and different distributions have different quantiles for the same $\alpha$.

For our example, a new observation will be below the quantile $q_{0.025} = 7.23$ with probability 2.5% and below $q_{0.975} = 12.77$ with probability 97.5%. These two quantiles are indicated by arrows in Fig. 2.1, and each shaded area corresponds to a probability of 2.5%.
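These probabilities and quantiles are available in R via pnorm() and qnorm(); a short sketch for our $N(10, 2)$ example (note that R parameterizes the normal distribution by its standard deviation, not its variance):

```r
# Tail probabilities and quantiles for Y ~ N(10, 2)
pnorm(10, mean = 10, sd = sqrt(2))         # P(Y <= 10) = 0.5
1 - pnorm(12.77, mean = 10, sd = sqrt(2))  # right-tail probability, approx. 0.025
qnorm(0.025, mean = 10, sd = sqrt(2))      # 2.5%-quantile, approx. 7.23
qnorm(0.975, mean = 10, sd = sqrt(2))      # 97.5%-quantile, approx. 12.77
```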
The joint distribution of two random variables $X$ and $Y$ is given by $F_{X,Y}(x, y) = P(X \le x, Y \le y)$ and is the probability that $X \le x$ and $Y \le y$ are simultaneously true. We can decompose it into the marginal distribution $P(X \le x)$ and the conditional distribution $P(Y \le y \mid X \le x)$. The conditional distribution is read as '$Y$ given $X$' and gives the probability that $Y \le y$ if we know that $X \le x$.
In our example, the enzyme levels $Y_i$ and $Y_j$ of two different mice are independent: their realizations $y_i$ and $y_j$ are both from the same distribution, but knowing the measured level of one mouse tells us nothing about the level of the other mouse, hence $P(Y_i \le y_i \mid Y_j \le y_j) = P(Y_i \le y_i)$.

The joint probability of two independent variables is the product of the two individual probabilities. For example, the probability that both mouse $i$ and mouse $j$ yield measurements below 10 is $P(Y_i \le 10, Y_j \le 10) = P(Y_i \le 10 \mid Y_j \le 10) \cdot P(Y_j \le 10) = P(Y_i \le 10) \cdot P(Y_j \le 10) = 0.5 \cdot 0.5 = 0.25$.
Table 2.2 Measured enzyme levels of 10 lab mice, each mouse measured twice
Mouse:     1     2     3     4     5     6     7     8     9     10
Sample 1:  8.96  8.95  11.37 12.63 11.38 8.36  6.87  12.35 10.32 11.99
Sample 2:  8.82  9.13  11.37 12.50 11.75 8.65  7.63  12.72 10.51 11.80
Instead of working with the full distribution of a random variable $Y$, it is often sufficient to summarize its properties by the expectation and variance, which roughly speaking give the position around which the density function spreads and the dispersion of values around this position, respectively.

The expected value (or expectation, also called mean or average) of a random variable $Y$ is a measure of its location. It is defined as the weighted average of all possible realizations $y$ of $Y$, which we calculate by integrating over the values $y$ and multiplying each value with the probability density $f_Y(y)$ of $Y$'s distribution:
$$E(Y) = \int_{-\infty}^{+\infty} y \cdot f_Y(y)\,dy\;.$$
The expectation is linear, and the following arithmetic rules apply for any two random variables $X$ and $Y$ and any non-random constant $a$:
$$E(a) = a\;,\quad E(a \cdot Y) = a \cdot E(Y)\;,\quad E(X \pm Y) = E(X) \pm E(Y)\;.$$
If $X$ and $Y$ are independent, then the expectation of their product is the product of the expectations:
$$E(X \cdot Y) = E(X) \cdot E(Y)\;,$$
but note that in general $E(X / Y) \ne E(X) / E(Y)$ even for independent variables.
The expectation is often denoted by $\mu$.

The variance of a random variable $Y$, often denoted by $\sigma^2$, is defined as
$$\text{Var}(Y) = E\left[(Y - E(Y))^2\right] = E(Y^2) - E(Y)^2\;,$$
the expected distance of a value of $Y$ from its expectation, where the distance is measured as the squared difference. It is a measure of dispersion, describing how widely values spread around their expected value.
For a non-random constant $a$ and two random variables $X$ and $Y$, the following arithmetic rules apply for variances:
$$\text{Var}(a) = 0\;,\quad \text{Var}(a \cdot Y) = a^2 \cdot \text{Var}(Y)\;,$$
and, if $X$ and $Y$ are independent,
$$\text{Var}(X \pm Y) = \text{Var}(X) + \text{Var}(Y)\;.$$
For a normally distributed random variable $Y \sim N(\mu, \sigma^2)$, the expectation and variance completely specify the full distribution, since $\mu = E(Y)$ and $\sigma^2 = \text{Var}(Y)$. For our example distribution in Fig. 2.1, the expectation $\mu = 10$ provides the location of the maximum of the density, and the variance $\sigma^2 = 2$ corresponds to the width of the density curve around this location. The relation between a distribution's parameters and its expectation and variance is less direct for many other distributions.
Given the expectation and variance of a random variable $Y$, we can define a new random variable $Z$ with expectation zero and variance one by shifting and scaling:
$$Z = \frac{Y - E(Y)}{\sqrt{\text{Var}(Y)}} \quad\text{has}\quad E(Z) = 0 \;\text{ and }\; \text{Var}(Z) = 1\;.$$
The arithmetic mean $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$ of $n$ independent random variables $Y_i$, each with expectation $\mu$ and variance $\sigma^2$, has expectation $E(\bar{Y}) = \mu$. However, its variance is smaller than the individual variances, and decreases with the number of random variables:
$$\text{Var}(\bar{Y}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n Y_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(Y_i) = \frac{1}{n^2} \cdot n \cdot \sigma^2 = \frac{\sigma^2}{n}\;.$$
For our example, the average of the 10 measurements of enzyme levels has expectation $E(\bar{Y}) = E(Y_i) = 10$, but the variance $\text{Var}(\bar{Y}) = \sigma^2/10 = 0.2$ is only one-tenth of the individual variances $\text{Var}(Y_i) = \sigma^2 = 2$. The average of 10 measurements therefore shows less dispersion around the mean than each individual measurement. Since the sum of normally distributed random variables is again normally distributed, we thus know that $\bar{Y} \sim N(\mu, \sigma^2/n)$ with $n = 10$ for our example.
We often describe the scale of the distribution by the standard deviation
$$\text{sd}(Y) = \sqrt{\text{Var}(Y)}\;,$$
which is a measure of dispersion in the same unit as the random variable. While the variance of the sum of independent random variables is the sum of their variances, the variance does not behave nicely with changes in the measurement scale: if $Y$ is measured in meters, its variance is given in square-meters. A shift to centimeters then multiplies all measurements by 100, but the variance by $100^2 = 10\,000$. In contrast, the standard deviation behaves nicely under changes in scale, but not under addition:
$$\text{sd}(a \cdot Y) = |a| \cdot \text{sd}(Y) \quad\text{but}\quad \text{sd}(X \pm Y) = \sqrt{\text{sd}(X)^2 + \text{sd}(Y)^2}$$
for independent $X$ and $Y$. For our example, $\text{sd}(Y_i) = 1.41$ and $\text{sd}(\bar{Y}) = 0.45$.
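A small simulation in R illustrates this variance reduction; the number of repetitions and the seed below are our own choices:

```r
# Dispersion of single measurements versus averages of n = 10 measurements
set.seed(1)
single   <- rnorm(1e5, mean = 10, sd = sqrt(2))
averages <- replicate(1e5, mean(rnorm(10, mean = 10, sd = sqrt(2))))
var(single)    # approx. 2   (= sigma^2)
var(averages)  # approx. 0.2 (= sigma^2 / n)
sd(averages)   # approx. 0.45
```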
The covariance of two random variables $X$ and $Y$ is defined as
$$\text{Cov}(X, Y) = E\left[(X - E(X)) \cdot (Y - E(Y))\right]\;,$$
which measures the dependency between $X$ and $Y$: the covariance is large and positive if a large deviation of $X$ from its expectation is associated with a large deviation of $Y$ in the same direction; it is large and negative if these directions are reversed.

The covariance of a random variable with itself is its variance, $\text{Cov}(X, X) = \text{Var}(X)$, and the covariance is zero if $X$ and $Y$ are independent (the converse is not true!). The covariance is linear in both arguments, in particular
$$\text{Cov}(a \cdot X + b \cdot Y,\, Z) = a \cdot \text{Cov}(X, Z) + b \cdot \text{Cov}(Y, Z)\;,$$
and similarly for the second argument. A related measure is the correlation
$$\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\text{sd}(X) \cdot \text{sd}(Y)}\;.$$
For dependent random variables, the variances of a sum and a difference are
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$$
and
$$\text{Var}(X - Y) = \text{Var}(X) + \text{Var}(Y) - 2\,\text{Cov}(X, Y)\;,$$
which both reduce to the previous formulas if the variables are independent and $\text{Cov}(X, Y) = 0$.
In our first example, the measurements of enzyme levels in 10 mice are independent. Therefore, $\text{Cov}(Y_i, Y_i) = \text{Var}(Y_i)$ and $\text{Cov}(Y_i, Y_j) = 0$ for two different mice $i$ and $j$.
In our second example, two samples were measured for each mouse. We can write the random variable $Y_{i,j}$ for the $j$-th measurement of the $i$-th mouse as the sum of a random variable $M_i$ capturing the difference of the true enzyme level of mouse $i$ from the expectation $\mu = 10$, and a random variable $S_{i,j}$ capturing the difference between the observed enzyme level $Y_{i,j}$ and the true level of mouse $i$. Then,
$$Y_{i,j} = \mu + M_i + S_{i,j}\;,$$
where $\mu = 10$, $M_i \sim N(0, \sigma_m^2)$, and $S_{i,j} \sim N(0, \sigma_e^2)$. The $i$-th mouse then has average response $\mu + M_i$, and the first measurement deviates from this by $S_{i,1}$. If $M_i$ and $S_{i,j}$ are all independent, then the overall variance $\sigma^2$ is decomposed into the two variance components: $\text{Var}(Y_{i,j}) = \sigma^2 = \sigma_m^2 + \sigma_e^2 = \text{Var}(M_i) + \text{Var}(S_{i,j})$.
If we plot the two measurements $Y_{i,1}$ and $Y_{i,2}$ of each mouse in a scatterplot as in Fig. 2.2A, we notice the high correlation: whenever the first measurement is high, the second measurement is also high. The covariance between these variables is
$$\begin{aligned}
\text{Cov}(Y_{i,1}, Y_{i,2}) &= \text{Cov}(\mu + M_i + S_{i,1},\, \mu + M_i + S_{i,2}) = \text{Cov}(M_i + S_{i,1},\, M_i + S_{i,2}) \\
&= \underbrace{\text{Cov}(M_i, M_i)}_{=\text{Var}(M_i)} + \underbrace{\text{Cov}(M_i, S_{i,2})}_{=0\ \text{(independence)}} + \underbrace{\text{Cov}(S_{i,1}, M_i)}_{=0\ \text{(independence)}} + \underbrace{\text{Cov}(S_{i,1}, S_{i,2})}_{=0\ \text{(independence)}} \\
&= \sigma_m^2\;,
\end{aligned}$$
and the correlation is
$$\text{Corr}(Y_{i,1}, Y_{i,2}) = \frac{\text{Cov}(Y_{i,1}, Y_{i,2})}{\text{sd}(Y_{i,1}) \cdot \text{sd}(Y_{i,2})} = \frac{\sigma_m^2}{\sigma_m^2 + \sigma_e^2}\;.$$
With $\sigma_m^2 = 1.9$ and $\sigma_e^2 = 0.1$, the correlation is extremely high at 0.95. This can also be seen in Fig. 2.2B, which shows the two measurements for each mouse separately. Here, the magnitude of the differences between the individual mice (gray points) is measured by $\sigma_m^2$, while the magnitude of differences between the two measurements of any mouse (plus and cross) is much smaller and is given by $\sigma_e^2$.
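A short R simulation of this variance-components model, using the variances $\sigma_m^2 = 1.9$ and $\sigma_e^2 = 0.1$ from the example (the number of simulated mice and the seed are our own choices):

```r
# Simulate Y_ij = mu + M_i + S_ij and check the induced covariance/correlation
set.seed(1)
n  <- 1e5                                # number of simulated mice
M  <- rnorm(n, mean = 0, sd = sqrt(1.9)) # mouse-to-mouse deviations
y1 <- 10 + M + rnorm(n, mean = 0, sd = sqrt(0.1))  # first sample per mouse
y2 <- 10 + M + rnorm(n, mean = 0, sd = sqrt(0.1))  # second sample per mouse
cov(y1, y2)  # approx. 1.9  = sigma_m^2
cor(y1, y2)  # approx. 0.95 = sigma_m^2 / (sigma_m^2 + sigma_e^2)
```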
As the third example, we consider the arithmetic mean of independent random variables and determine the covariance $\text{Cov}(Y_i, \bar{Y})$. Because $\bar{Y}$ is computed using $Y_i$, these two random variables cannot be independent, and we expect a non-zero covariance. Using the linearity of the covariance, we get
$$\text{Cov}(Y_i, \bar{Y}) = \text{Cov}\left(Y_i,\, \frac{1}{n}\sum_{j=1}^n Y_j\right) = \frac{1}{n}\sum_{j=1}^n \text{Cov}(Y_i, Y_j) = \frac{1}{n}\,\text{Cov}(Y_i, Y_i) = \frac{\sigma^2}{n}\;,$$
using the fact that $\text{Cov}(Y_i, Y_j) = 0$ for $i \ne j$. Thus, the covariance decreases with an increasing number of random variables, because the variation of the average $\bar{Y}$ depends less and less on each single variable $Y_i$.
Fig. 2.2 A Scatterplot of the enzyme levels of the first and second samples for each mouse. The
points lie close to a line, indicating a very high correlation between the two samples. B Average
(gray) and first (plus) and second (cross) sample enzyme levels for each mouse. The dashed line is
the average level in the mouse population
We encounter four families of distributions in the next chapters and briefly gather some of their properties here for future reference. We assume that all data follow some normal distribution, describing the errors due to sampling and measurement, for example. Derived from it are (i) the $\chi^2$-distribution, related to estimates of variance, (ii) the $F$-distribution, related to the ratio of two variance estimates, and (iii) the $t$-distribution, related to estimates of means and differences of means.
Normal Distribution
The normal distribution is well-known for its bell-shaped density, shown in Fig. 2.3A
for three combinations of its two parameters μ and σ 2 . The special case of μ = 0
and σ 2 = 1 is called the standard normal distribution.
The omnipresence of the normal distribution can be partly explained by the central limit theorem, which states that if we have a sequence of random variables with identical distribution (for example, describing the outcome of many measurements), then their average will have a normal distribution, no matter what distribution each single random variable has. Technically, if $Y_1, \ldots, Y_n$ are independent and identically distributed with mean $E(Y_i) = \mu$ and variance $\text{Var}(Y_i) = \sigma^2$, then the arithmetic mean $\bar{Y}_n = (Y_1 + \cdots + Y_n)/n$ approaches a normal distribution as $n$ increases:
$$\frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}} \;\longrightarrow\; N(0, 1) \quad\text{for}\quad n \to \infty\;.$$
Fig. 2.3 Probability densities. A Normal distributions: standard normal with mean $\mu = 0$, standard deviation $\sigma = 1$ (solid); shifted: $\mu = 5$, $\sigma = 1$ (dotted); scaled: $\mu = 0$, $\sigma = 3$ (dashed). B $\chi^2$-distributions with 3 (solid), 5 (dotted), and 10 (dashed) degrees of freedom. C $t$-distributions with 2 (dotted) and 10 (dashed) degrees of freedom and standard normal density (solid). D $F$-distributions with $(n = 2, m = 10)$ numerator/denominator degrees of freedom (solid), $(n = 10, m = 10)$ (dotted), and $(n = 10, m = 100)$ (dashed)
About 68% of the probability mass of a normal distribution lies within one standard deviation of the mean; for two standard deviations, the probability is about 95%; and for three standard deviations, it is larger than 99%. In particular, the 2.5% and 97.5% quantiles of a standard normal distribution are $z_{0.025} = -1.96$ and $z_{0.975} = +1.96$ and can be conveniently taken to be $\pm 2$ for approximate calculations.
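A quick R illustration of the central limit theorem, using the decidedly non-normal exponential distribution as an assumed example (sample sizes and seed are our choices):

```r
# Averages of skewed exponential samples are approximately normal
set.seed(1)
means <- replicate(1e4, mean(rexp(30, rate = 1)))
hist(means, breaks = 50)  # bell-shaped around 1, roughly N(1, 1/30)
```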
χ²-Distribution

If $Y_1, \ldots, Y_n$ are independent standard normal random variables, $Y_i \sim N(0, 1)$, then the sum of their squares has a $\chi^2$-distribution with $n$ degrees of freedom:
$$Y_1^2 + \cdots + Y_n^2 \sim \chi^2_n\;.$$
This distribution has expectation $n$ and variance $2n$, and the degrees of freedom $n$ is its only parameter. It occurs when estimating the variance of a normally distributed random variable from a set of measurements, and is used to establish confidence intervals of variance estimates (Sect. 2.3.4.2). We denote by $\chi^2_{\alpha,\,n}$ the $\alpha$-quantile on $n$ degrees of freedom.

Its densities for 3, 5, and 10 degrees of freedom are shown in Fig. 2.3B. Note that the densities are asymmetric, only defined for non-negative values, and the maximum is at $n - 2$ and does not coincide with the expected value (for $n = 5$ the expected value is 5, while the maximum is at 3).
The sum of two independent $\chi^2$-distributed random variables is again $\chi^2$-distributed, with degrees of freedom the sum of the two individual degrees of freedom:
$$X \sim \chi^2_n,\; Y \sim \chi^2_m \;\implies\; X + Y \sim \chi^2_{n+m}\;.$$
For $Y_i \sim N(\mu, 1)$, the sum of squared deviations from the estimated mean,
$$\sum_{i=1}^n (Y_i - \bar{Y})^2 \sim \chi^2_{n-1}\;,$$
also has a $\chi^2$-distribution, where one degree of freedom is 'lost' because we can calculate any single summand from the remaining $n - 1$.
If the random variables $Y_i \sim N(\mu_i, 1)$ are normally distributed with unit variance but individual (potentially non-zero) means $\mu_i$, then $\sum_i Y_i^2 \sim \chi^2_n(\lambda)$ has a noncentral $\chi^2$-distribution with noncentrality parameter $\lambda = \sum_i \mu_i^2$. This distribution plays a role in sample size determination, for example.
t-Distribution

If $X \sim N(0, 1)$ is a standard normal random variable and $Y \sim \chi^2_n$ is independent of $X$, then
$$\frac{X}{\sqrt{Y/n}} \sim t_n$$
has a $t$-distribution with $n$ degrees of freedom, its only parameter.
F-Distribution

If $X \sim \chi^2_n$ and $Y \sim \chi^2_m$ are two independent $\chi^2$-distributed random variables, then their ratio, scaled by the degrees of freedom, has an $F$-distribution with $n$ numerator degrees of freedom and $m$ denominator degrees of freedom, which are its two parameters:
$$\frac{X/n}{Y/m} \sim F_{n,m}\;.$$
If the numerator has a noncentral $\chi^2$-distribution, $X \sim \chi^2_n(\lambda)$, then the scaled ratio has a noncentral $F$-distribution with noncentrality parameter $\lambda$; hence $\chi^2_n(\lambda)/n = F_{n,\infty}(\lambda)$ and $t_m^2(\eta) = F_{1,m}(\lambda = \eta^2)$.
2.3 Estimation

We consider again the enzyme levels of our 10 mice and describe the observation of mouse $i$ by the model
$$y_i = \mu + e_i\;,$$
where the deviations $e_i \sim (0, \sigma^2)$ are distributed around a zero mean. There are two
parts to this model: one part (μ) describes the mean structure of our problem, that
is, the location of the distribution; the other (ei and the associated variance σ 2 ) the
variance structure, that is, the random deviations around the location. The deviations
ei = yi − μ are called residuals and capture the unexplained variation with residual
variance σ 2 . The diagram in Fig. 2.4 gives a visual representation of this situation:
reading from top to bottom, it shows the increasingly finer partition of the data.
On top, the factor M corresponds to the population mean μ and gives the coarsest
summary of the data. We write this factor in bold to indicate that its parameter is a fixed
number. The summary is then refined by the next finer partition, which here already
corresponds to 10 individual mice. To each mouse is associated the difference from
its value (the observations yi ) and the next-coarser partition (the population mean
μ), and the factor (Mouse) corresponds to the 10 residuals ei = μ − yi . In contrast
to the population mean, the residuals are random and will change in a replication of
the experiment. We indicate this fact by writing the factor in italics and parentheses.
The number of parameters associated with each granularity is given as a super-
script (one population mean and 10 residuals); the subscript gives the number of
independent parameters. While there are 10 residuals, there are only nine degrees
of freedom for their values, since knowing nine residuals and the population mean
allows calculation of the value for the tenth residual. The degrees of freedom are
easily calculated from the diagram: take the number of parameters (the superscript)
of each factor and subtract the degrees of freedom of every factor above it.
Since (Mouse) subdivides the partition of M into a finer partition of the data, we
say that (Mouse) is nested in M.
Our task is now threefold: (i) provide an estimate of the expected population
enzyme level μ from the given data y1 , . . . , y10 , (ii) provide an estimate of the pop-
ulation variance σ 2 , and (iii) quantify the uncertainty of those estimates.
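In R, tasks (i) and (ii) correspond to fitting an intercept-only linear model; a minimal sketch using the 10 enzyme levels from our example (first measurement of each mouse in Table 2.2):

```r
# Intercept-only model y_i = mu + e_i
y <- c(8.96, 8.95, 11.37, 12.63, 11.38, 8.36, 6.87, 12.35, 10.32, 11.99)
fit <- lm(y ~ 1)  # '1' specifies the constant mean structure
coef(fit)         # estimate of mu: approx. 10.32
sigma(fit)^2      # residual variance estimate: approx. 3.77
```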
The estimand is a population parameter θ and an estimator of θ is a function that
takes data y1 , . . . , yn as an input and returns a number that is a ‘good guess’ of the
true value of the parameter. There might be several sensible estimators for a given
parameter, and statistical theory and assumptions on the data usually provide insight
into which estimator is most appropriate.
We denote an estimator of a parameter θ by θ̂ and sometimes use θ̂n to empha-
size its dependence on the sample size. The estimate is the value of θ̂ that results
from a specific set of data. Standard statistical theory provides us with methods for
constructing estimators, such as least squares, which requires few assumptions, and
maximum likelihood, which requires postulating the full distribution of the data but
can then better leverage them. Since the data are random, so is the estimate, and the
estimator is therefore a random variable with an expectation and a variance.
The bias of an estimator is the difference between its expectation and the true value of the parameter:
$$\text{bias}(\hat{\theta}) = E(\hat{\theta} - \theta) = E(\hat{\theta}) - \theta\;.$$
An estimator is unbiased if its bias is zero. Maximum likelihood estimators, for example, are asymptotically normally distributed around the true value:
$$\frac{\hat{\theta} - \theta}{\text{sd}(\hat{\theta})} \sim N(0, 1) \quad\text{for}\quad n \to \infty\;.$$
An estimator is consistent if the estimates approach the true value of the parameter
when increasing the sample size indefinitely.
We now consider estimators for the mean and variance, as well as covariance and
correlations as concrete examples. These estimators form the basis for more complex
estimation problems that we encounter later on. We look at properties like bias and
consistency in more detail to illuminate these concepts with simple examples.
The most common estimator for the expectation $\mu$ is the arithmetic mean
$$\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i\;.$$
If the population mean $\mu$ were known, an intuitive estimator for the variance would be
$$\dot{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \mu)^2\;,$$
but we would need $\mu$ to calculate it. In order to make this estimator operational, we plug in the estimator $\hat{\mu}$ instead of $\mu$, which gives the estimator
$$\hat{\dot{\sigma}}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\mu})^2\;.$$
This estimator is biased, as we can easily verify with a direct, if somewhat lengthy, calculation:
$$\begin{aligned}
E(\hat{\dot{\sigma}}^2) &= E\left(\frac{1}{n}\sum_{i=1}^n (y_i - \hat{\mu})^2\right) = E\left(\frac{1}{n}\sum_{i=1}^n \bigl((y_i - \mu) - (\hat{\mu} - \mu)\bigr)^2\right) \\
&= \frac{1}{n}\sum_{i=1}^n \Bigl( E\bigl((y_i - \mu)^2\bigr) - 2 \cdot E\bigl((y_i - \mu)(\hat{\mu} - \mu)\bigr) + E\bigl((\hat{\mu} - \mu)^2\bigr) \Bigr) \\
&= \frac{1}{n}\sum_{i=1}^n \left( \sigma^2 - 2\,\text{Cov}(y_i, \hat{\mu}) + \frac{\sigma^2}{n} \right) = \frac{1}{n}\sum_{i=1}^n \left( \sigma^2 - 2\,\frac{\sigma^2}{n} + \frac{\sigma^2}{n} \right) \\
&= \frac{n-1}{n} \cdot \sigma^2 \;<\; \sigma^2\;.
\end{aligned}$$
Hence, this estimator systematically underestimates the true variance. The bias
decreases with increasing sample size, as (n − 1)/n approaches one for large n.
Moreover, the bias is known and can be explicitly calculated in advance, because
it only depends on the sample size n and not on the data yi . We can therefore remedy
this bias simply by multiplying the estimator with $n/(n-1)$, and thus arrive at an unbiased estimator for the variance
$$\hat{\sigma}^2 = \frac{n}{n-1}\,\hat{\dot{\sigma}}^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \hat{\mu})^2\;.$$
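A simulation sketch of this bias (sample size, seed, and number of repetitions are our own choices); note that R's built-in var() already uses the divisor $n - 1$:

```r
# Compare the biased (divisor n) and unbiased (divisor n-1) variance estimators
set.seed(1)
n <- 10
est <- replicate(1e5, {
  y <- rnorm(n, mean = 10, sd = sqrt(2))
  c(biased = sum((y - mean(y))^2) / n, unbiased = var(y))
})
rowMeans(est)  # biased: approx. 1.8 = (n-1)/n * sigma^2; unbiased: approx. 2
```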
Similarly, an unbiased estimator for the covariance of $X$ and $Y$ is
$$\widehat{\text{Cov}}(X, Y) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu}_X) \cdot (y_i - \hat{\mu}_Y)\;.$$
For our two-sample example, we find a covariance between first and second samples of $\widehat{\text{Cov}}(Y_{i,1}, Y_{i,2}) = 3.47$ and thus a very high correlation of $\hat{\rho} = 0.99$.
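With the paired data from Table 2.2, the built-in R functions reproduce these estimates:

```r
# Paired measurements from Table 2.2
y1 <- c(8.96, 8.95, 11.37, 12.63, 11.38, 8.36, 6.87, 12.35, 10.32, 11.99)
y2 <- c(8.82, 9.13, 11.37, 12.50, 11.75, 8.65, 7.63, 12.72, 10.51, 11.80)
cov(y1, y2)  # approx. 3.47
cor(y1, y2)  # approx. 0.99
```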
The standard error of an estimator $\hat{\theta}$ is its standard deviation, $\text{se}(\hat{\theta}) = \text{sd}(\hat{\theta})$; for the arithmetic mean, $\text{se}(\hat{\mu}) = \sigma/\sqrt{n}$, which we estimate by $\hat{\sigma}/\sqrt{n}$. For our example, we find that $\text{se}(\hat{\mu}) = 0.61$, and the standard error is only about 6% of the estimated expectation $\hat{\mu} = 10.32$, indicating high relative precision.
For normally distributed data, the scaled ratio
$$\frac{n\,\dot{\sigma}^2}{\sigma^2} = \sum_{i=1}^n \left(\frac{y_i - \mu}{\sigma}\right)^2 \sim \chi^2_n$$
has a $\chi^2$-distribution. To compensate for the larger uncertainty when using an estimate $\hat{\mu}$ instead of the true value, we need to adjust the degrees of freedom and arrive at the distribution of the variance estimator $\hat{\sigma}^2$:
$$\frac{(n-1)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-1}\;.$$
This distribution has variance $2(n-1)$, and from it we find the variance of the estimator to be
$$2(n-1) = \text{Var}\left(\frac{(n-1)\,\hat{\sigma}^2}{\sigma^2}\right) = \frac{(n-1)^2}{\sigma^4}\,\text{Var}(\hat{\sigma}^2) \iff \text{Var}(\hat{\sigma}^2) = \frac{2}{n-1}\,\sigma^4\;.$$
Hence, $\text{se}(\hat{\sigma}^2) = \sqrt{2/(n-1)}\,\sigma^2$, the estimator is consistent, and doubling the precision again requires about four times as much data. The standard error now depends on the true parameter value $\sigma^2$, and larger variances are more difficult to estimate precisely. For our example, we find that the variance estimate $\hat{\sigma}^2 = 3.77$ has an estimated standard error of $\sqrt{2/9} \cdot 3.77 \approx 1.78$.
The estimators for expectation and variance are examples of point estimators and
provide a single number as the ‘best guess’ of the true parameter from the data.
The standard error quantifies the uncertainty of a point estimate: an estimate of the
average enzyme level based on 100 mice is more precise than an estimate based on
only two mice.
A confidence interval of a parameter $\theta$ is another way of quantifying the uncertainty that additionally takes account of the full distribution of the estimator. The interval contains all values of the parameter that are compatible with the observed data up to a specified degree. The $(1-\alpha)$-confidence interval of an estimator $\hat{\theta}$ is an interval $[a(\hat{\theta}), b(\hat{\theta})]$ such that
$$P\left(a(\hat{\theta}) \le \theta \le b(\hat{\theta})\right) = 1 - \alpha\;.$$
We call a(θ̂) and b(θ̂) the lower and upper confidence limit, respectively, and abbre-
viate them as LCL and UCL. The confidence level 1 − α quantifies the degree of
being ‘compatible with the data’. The higher the confidence level (the lower α), the
wider the interval, until a 100%-confidence interval includes all possible values of
the parameter and becomes useless.
While not strictly required, we always choose the confidence limits a and b such
that the left and right tails each cover half of the required confidence level. This
provides the shortest possible confidence interval.
The confidence interval equation is a probability statement about a random inter-
val covering θ and not a statement about the probability that θ is contained in a
given interval. For example, it is incorrect to say that, having computed a 95%-
confidence interval of [−2, 2], the true population parameter has a 95% probability
of being larger than −2 and smaller than +2. Such a statement would be nonsensical,
because any given interval either contains the true value (which is a fixed number)
or it does not, and there is no probability attached to this.
One correct interpretation is that the proportion of intervals containing the correct
value θ is (1 − α) under repeated sampling and estimation. This interpretation is
helpful in contexts like quality control, where the ‘same’ experiment is done repeat-
edly and there is thus a direct interest in the proportion of intervals containing the true
value. For most biological experiments, we do not anticipate repeating them over and
over again. Here, an equivalent interpretation is that our specific confidence interval,
computed from the data of our experiment, has a (1 − α) probability to contain the
correct parameter value.
For computing a confidence interval, we need to derive the distribution of the estimator; for maximum likelihood estimators, this distribution is normal for large sample sizes, and
\[
\frac{\theta - \hat{\theta}}{se(\hat{\theta})} \sim N(0,1) ,
\]
and therefore
\[
z_{\alpha/2} \le \frac{\theta - \hat{\theta}}{se(\hat{\theta})} \le z_{1-\alpha/2} \quad\text{with probability } 1 - \alpha ,
\]
which we solve for θ to obtain the confidence interval [θ̂ + z_{α/2} · se(θ̂), θ̂ + z_{1−α/2} · se(θ̂)].
For some standard estimators, the exact distribution might not be normal but is
nevertheless known. If the distribution is unknown or difficult to compute, we can
use computational methods such as bootstrapping to find an approximate confidence
interval.
For the expectation of normally distributed data, for example, we have
\[
\frac{\hat{\mu} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)
\quad\text{and}\quad
\frac{\hat{\mu} - \mu}{\hat{\sigma}/\sqrt{n}} \sim t_{n-1} .
\]
The normal distribution is appropriate if the variance is known or if the sample size n is large.
We then construct a (1 − α)-confidence interval based on the normal distribution:
\[
\left[\hat{\mu} + z_{\alpha/2}\cdot\sigma/\sqrt{n},\;\; \hat{\mu} + z_{1-\alpha/2}\cdot\sigma/\sqrt{n}\right] = \hat{\mu} \pm z_{1-\alpha/2}\cdot\sigma/\sqrt{n} .
\]
The equality holds because z_{α/2} = −z_{1−α/2}. For small sample sizes n, we need to take account of the additional uncertainty from replacing σ by σ̂. The exact confidence interval of μ is then
\[
\hat{\mu} \pm t_{1-\alpha/2,\,n-1}\cdot se(\hat{\mu}) = \hat{\mu} \pm t_{1-\alpha/2,\,n-1}\cdot\hat{\sigma}/\sqrt{n} .
\]
For our data, the 95%-confidence interval based on the normal approximation is
[9.11, 11.52] and the interval based on the t-distribution [8.93, 11.71] is wider. For
larger sample sizes n, the difference between the two intervals quickly becomes
negligible.
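These two intervals are easily computed in R; a minimal sketch, assuming our ten enzyme level measurements (the first row of Table 2.3) are stored in a vector y:

y <- c(8.96, 8.95, 11.37, 12.63, 11.38, 8.36, 6.87, 12.35, 10.32, 11.99)
n <- length(y)
mu.hat <- mean(y)              # estimated expectation, 10.32
se.mu  <- sd(y) / sqrt(n)      # standard error of the mean, 0.61
# normal-approximation interval: [9.11, 11.52]
mu.hat + qnorm(c(0.025, 0.975)) * se.mu
# exact interval from the t-distribution with n-1 degrees of freedom: [8.93, 11.71]
mu.hat + qt(c(0.025, 0.975), df = n - 1) * se.mu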
To further illustrate confidence intervals, we return to our enzyme level example.
The true N (10, 2)-density of enzyme levels is shown in Fig. 2.6A (solid line) and
the N (10, 2/10)-density of the estimator of the expectation based on 10 samples is
shown as a dotted line. The rows of Fig. 2.6B are the enzyme levels of 10 replicates of 10 randomly sampled mice as gray points, with the true average μ = 10 shown as a vertical dotted line.
Fig. 2.6 A Distribution of enzyme levels in population (solid) and of average (dotted); B Levels
measured for 10 replicates of 10 randomly sampled mice each (gray points) with estimated mean
for each replicate (black points) and their 95%-confidence intervals (black lines) compared to true
mean (dotted line)
The resulting estimates of the expectation and their 95%-confidence intervals are given as black points and lines, respectively. The 10 estimates
vary much less than the actual measurements and fall both above and below the
true expectation. Since the confidence intervals are based on the random samples,
they have different lower and upper confidence limits and different lengths. For a confidence level of 95%, we expect that on average 1 out of 20 confidence intervals does not cover the true value; an example is the interval of replicate five, which covers only values below the true expectation.
For the variance estimator, recall that (n − 1)σ̂²/σ² has a χ²-distribution with n − 1 degrees of freedom. It follows that
\[
\chi^2_{\alpha/2,\,n-1} \le \frac{(n-1)\,\hat{\sigma}^2}{\sigma^2} \le \chi^2_{1-\alpha/2,\,n-1} \quad\text{with probability } 1 - \alpha ,
\]
which we solve for σ² to find the confidence interval
\[
\frac{(n-1)\,\hat{\sigma}^2}{\chi^2_{1-\alpha/2,\,n-1}} \le \sigma^2 \le \frac{(n-1)\,\hat{\sigma}^2}{\chi^2_{\alpha/2,\,n-1}} .
\]
This interval does not simplify to the form θ̂ ± q_α · se, because the χ²-distribution is not symmetric and χ²_{α,n} ≠ −χ²_{1−α,n}.
For our example, we estimate the population variance as σ̂ 2 = 3.77 (for a true
value of σ 2 = 2). With n − 1 = 9 degrees of freedom, the two quantiles for a 95%-
confidence interval are χ20.025, 9 = 2.7 and χ20.975, 9 = 19.02, leading to an interval
of [1.78, 12.56]. The interval covers the true value, but its width indicates that our
variance estimate is quite imprecise. A confidence interval for the standard deviation
σ is calculated by taking square-roots of the lower and upper confidence limits of
the variance. This yields a 95%-confidence interval of [1.34, 3.54] for our estimate
σ̂ = 1.94 (true value of σ = 1.41).
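A short R sketch of this calculation, reusing the vector y of measurements from above:

s2 <- var(y)                               # sigma.hat^2 = 3.77
q  <- qchisq(c(0.025, 0.975), df = n - 1)  # quantiles 2.7 and 19.02
# confidence interval for the variance; note the reversed quantiles: [1.78, 12.56]
ci.var <- (n - 1) * s2 / rev(q)
ci.var
sqrt(ci.var)                               # interval for the standard deviation: [1.34, 3.54]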
Table 2.3 Measured enzyme levels for 20 mice, measured by kits from two vendors A and B
A 8.96 8.95 11.37 12.63 11.38 8.36 6.87 12.35 10.32 11.99
B 12.68 11.37 12.00 9.81 10.35 11.76 9.01 10.83 8.76 9.99
We want to find out if the enzyme levels are systematically higher when measured based on kit A compared to kit B, or if measurements from one kit are more dispersed than from the other.
We denote by μ A , μ B the expectations and by σ 2A , σ 2B the variances of the enzyme
levels y_{i,A} and y_{i,B} measured by kit A and B, respectively. Our interest focuses on two effect sizes: we measure the systematic difference between the kits by the difference Δ = μ_A − μ_B of their expected values, and we measure the difference in variances by their ratio σ²_A/σ²_B. Since we already have 10 measurements with our
standard kit A, our proposed experiment is to select another 10 mice and measure
their levels using kit B. This results in the data shown in Table 2.3, whose first row
is identical to our previous data.
Our goal is to establish an estimate of the difference in means and the proportion
of variances, and to calculate confidence intervals for these estimates.
The logical structure of the experiment is shown in the diagram in Fig. 2.7.
The data from this experiment are described by the grand mean μ = (μ A + μ B )/2,
given by the factor M; this is the ‘best guess’ for the value of any datum yi j if nothing
else is known. If we know which vendor was assigned to the datum, a better ‘guess’
is the corresponding mean μ A or μ B . The factor Vendor is associated with the two
differences μ A − μ and μ B − μ. Since we can calculate μ A , say, from μ and μ B , the
degrees of freedom for this factor are one. Finally, the next finer partition of the data
is into the individual observations; their 2 · 10 residuals ei j = yi j − μi are associated
with the factor (Mouse), and only 18 of the residuals are independent given the two
group means. In this diagram, (Mouse) is nested in Vendor and Vendor is nested in
M since each factor further subdivides the partition of the factor above. This implies
that (Mouse) is also nested in M.
The diagram corresponds to the model
\[
y_{ij} = \mu + \delta_i + e_{ij}
\]
for the data, where δi = μi − μ are the deviations from grand mean to group mean
and μ A = μ + δ A and μ B = μ + δ B . The three parameters μ, δ A , and δ B are unknown
but fixed quantities, while the ei j are random and interest focuses on their variance
σ2 .
Because the two samples are independent, the ratio of the two scaled variance estimates has an F-distribution:
\[
\frac{\left((n-1)\,\hat{\sigma}_A^2/\sigma_A^2\right)/(n-1)}{\left((m-1)\,\hat{\sigma}_B^2/\sigma_B^2\right)/(m-1)}
= \frac{\hat{\sigma}_A^2/\sigma_A^2}{\hat{\sigma}_B^2/\sigma_B^2}
\sim F_{n-1,\,m-1} ,
\]
where n and m denote the number of mice measured with kit A and kit B, respectively.
For our example (n = m = 10), we find the two individual estimates σ̂²_A = 3.77 and σ̂²_B = 1.69 and an estimated variance ratio of σ̂²_A/σ̂²_B = 2.23. This yields a 95%-confidence interval for the ratio of [0.55, 8.98], which contains one (meaning equal variances) as well as values below and above one. Hence, given the data, we have no evidence that the two true variances are substantially different, even though their estimates differ by a factor of more than two.
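In R, this interval is provided by var.test(); a sketch, assuming the kit-B measurements from Table 2.3 are stored in a vector yB:

yB <- c(12.68, 11.37, 12.00, 9.81, 10.35, 11.76, 9.01, 10.83, 8.76, 9.99)
# F-test for the ratio of two variances; the output contains the estimated
# ratio (2.23) and its 95%-confidence interval ([0.55, 8.98])
var.test(y, yB)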
We estimate the difference Δ by
\[
\hat{\Delta} = \widehat{\mu_A - \mu_B} = \hat{\mu}_A - \hat{\mu}_B ,
\]
the difference between the estimates of the respective expected enzyme levels.
For the example, we estimate the two average enzyme levels as μ̂ A = 10.32 and
μ̂ B = 10.66 and the difference as Δ̂ = −0.34. The estimated difference is not exactly
zero, but this might be explainable by measurement error or the natural variation of
enzyme levels between mice.
We take account of the uncertainty by calculating the standard error and a 95%-
confidence interval for Δ̂. The lower and upper confidence limits then provide infor-
mation about a potential systematic difference: if the upper limit is below zero, then
only negative differences are compatible with the data, and we can conclude that
measurements of kit A are systematically lower than measurements of kit B. Con-
versely, a lower limit above zero indicates kit A yielding systematically higher values
than kit B. If zero is contained in the confidence interval, we cannot determine the
direction of a potential difference, and it is also plausible that no difference exists.
We already established that the data provides no evidence against the assumption
that the two variances are equal. We therefore assume a common variance σ 2 =
σ 2A = σ 2B , which we estimate by the pooled variance estimate
\[
\hat{\sigma}^2 = \frac{\hat{\sigma}_A^2 + \hat{\sigma}_B^2}{2} .
\]
For our data, σ̂ 2 = 2.73 and the estimated standard deviation is σ̂ = 1.65.
Compared to this standard deviation, the estimated difference is small (about
21%). Whether such a small difference is meaningful in practice depends on the
subject matter. A helpful dictum is
A difference which makes no difference is no difference at all. (attributed to William James)
To determine the confidence limits, we first need the standard error of the difference
estimate. The two estimates μ̂ A and μ̂ B are based on independently selected mice,
and are therefore independent. The simple application of the rules for variances then
yields
\[
\operatorname{Var}(\hat{\Delta}) = \operatorname{Var}(\hat{\mu}_A - \hat{\mu}_B) = \operatorname{Var}(\hat{\mu}_A) + \operatorname{Var}(\hat{\mu}_B) = 2\cdot\frac{\sigma^2}{n} .
\]
The standard error of the difference estimator is therefore
\[
se(\hat{\Delta}) = \sqrt{2\cdot\frac{\sigma^2}{n}} ,
\]
which we estimate by plugging in the pooled estimate σ̂²; for our data, se(Δ̂) = 0.74.
Table 2.4 Enzyme levels for vendors A and B based on two samples from each of 10 mice, one
sample per vendor
Mouse 1 2 3 4 5 6 7 8 9 10
A 9.14 9.47 11.14 12.45 10.88 8.49 7.62 13.05 9.67 11.63
B 9.19 9.70 11.12 12.62 11.50 8.99 7.54 13.38 10.94 12.28
Fig. 2.8 Paired design for estimating average enzyme level difference: 10 mice, each vendor
assigned to one of two samples per mouse
We find a very different result when we address the same question using the design
illustrated in Fig. 1.1C, where we randomly select 10 mice, draw two samples from
each, and randomly assign each kit to one sample. The resulting data are given in
Table 2.4.
The Hasse diagram for this experiment is shown in Fig. 2.8. The two factors
(Mouse) and Vendor are now crossed and written next to each other. Each sample
corresponds to a combination of one mouse and one vendor and is nested in both.
Since the mice are randomly selected, their average responses are also random.
The two observations from mouse i are
yi A = μ + δ A + m i + ei A and yi B = μ + δ B + m i + ei B ,
where Var(m i ) = σm2 and Var(ei j ) = σe2 . The parameters δ A = −δ B are the treatment
effects. The expected response with kit A is then μ + δ A , and the variance of any
observation is Var(y_ij) = σ²_m + σ²_e. This model corresponds directly to the diagram, with the m_i associated with the factor (Mouse) and the e_ij with (Sample). We estimate the difference by
\[
\hat{\Delta}' = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i,A} - y_{i,B}\right) = \frac{1}{n}\sum_{i=1}^{n} y_{i,A} - \frac{1}{n}\sum_{i=1}^{n} y_{i,B} = \hat{\mu}_A - \hat{\mu}_B .
\]
This is the same estimator as for two independent samples and yields a similar estimated difference of Δ̂′ = −0.37 for our example.
The variance of this estimator is very different, however, due to the correlation between each pair of samples:
\[
\operatorname{Var}(\hat{\Delta}') = \operatorname{Var}(\hat{\mu}_A) + \operatorname{Var}(\hat{\mu}_B) - 2\cdot\operatorname{Cov}(\hat{\mu}_A,\hat{\mu}_B)
= 2\,\frac{\sigma_m^2 + \sigma_e^2}{n} - 2\,\frac{\sigma_m^2}{n} = 2\,\frac{\sigma_e^2}{n} .
\]
In other words, contrasting the two kits within each mouse and then averaging these differences over the 10 mice eliminates the between-mouse variation σ²_m from the treatment comparison.
For the data in Table 2.4, we find the two means μ̂_A = 10.35 and μ̂_B = 10.73 together with the variances σ̂²_A = 3.1 and σ̂²_B = 3.38, and a non-zero sample covariance of 3.16 between the paired measurements. The standard error of the difference estimate is then se(Δ̂′) = 0.13, much lower than the previous standard error se(Δ̂) = 0.74.
The (1 − α)-confidence interval for Δ is then Δ̂′ ± t_{1−α/2, n−1} · se(Δ̂′), now based on n − 1 = 9 degrees of freedom.
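A sketch of these calculations in R, assuming the two rows of Table 2.4 are stored as vectors A and B of paired measurements:

A <- c(9.14, 9.47, 11.14, 12.45, 10.88, 8.49, 7.62, 13.05, 9.67, 11.63)
B <- c(9.19, 9.70, 11.12, 12.62, 11.50, 8.99, 7.54, 13.38, 10.94, 12.28)
n <- length(A)
delta.hat <- mean(A - B)           # paired difference estimate, -0.37
se.delta  <- sd(A - B) / sqrt(n)   # exploits Var(A) + Var(B) - 2 Cov(A, B); 0.13
# 95%-confidence interval with n - 1 = 9 degrees of freedom
delta.hat + qt(c(0.025, 0.975), df = n - 1) * se.delta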
The difference Δ is an example of a raw effect size and has the same units as the orig-
inal measurements. Subject-matter knowledge often provides information about the
relevance of specific effect sizes and might tell us, for example, whether a difference
in enzyme levels of more than 0.5 is biologically relevant.
Sometimes the raw effect size is difficult to interpret. This is a particular problem
with some current measurement techniques in biology, which give measurements in
arbitrary units (a.u.), making it difficult to directly compare results from two exper-
iments. In this case, a unitless standardized effect size might be more appropriate.
A popular choice is Cohen’s d, which compares the difference with the standard
deviation in the population:
\[
d = \frac{\mu_A - \mu_B}{\sigma} \quad\text{estimated by}\quad \hat{d} = \frac{\hat{\mu}_A - \hat{\mu}_B}{\hat{\sigma}} .
\]
It is a unitless effect size that measures the difference as a multiple of the standard
deviation. If |d| = 1, then the two means are one standard deviation apart. In the
original literature, Cohen suggests that |d| ≈ 0.2 should be considered a small effect, |d| ≈ 0.5 a medium-sized effect, and |d| ≈ 0.8 a large effect (Cohen 1988), but such definitive categorization should not be taken too literally.
For our example, we calculate d̂ = −0.21, a difference of 21% of a standard
deviation, indicating a small-to-medium effect size. The exact confidence interval
for d̂ is based on a noncentral t-distribution and cannot be given in closed form
(cf. Sect. 2.5). For large enough sample sizes, we can use a normal confidence interval d̂ ± z_{1−α/2} · se(d̂) based on an approximation of the standard error (Hedges and Olkin 1985):
\[
se(\hat{d}) = \sqrt{\frac{n_A + n_B}{n_A\cdot n_B} + \frac{\hat{d}^2}{2\cdot(n_A + n_B)}} ,
\]
where n A and n B are the respective sample sizes for the two groups. For our example,
this yields an approximate 95%-confidence interval of [−1.08, 0.67].
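The cohens_d() function from the effectsize package (see Sect. 2.5) performs this calculation, including the exact noncentral-t confidence interval; a sketch, reusing the vectors y and yB from above:

library(effectsize)
# Cohen's d with pooled standard deviation and exact confidence interval
cohens_d(y, yB)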
The underlying question in our kit vendor example is whether the statement μ A = μ B
about the underlying parameters is true or not. We argued that the data would not
support this statement if the confidence interval of Δ lies completely above or below
zero, excluding Δ = 0 as a plausible difference. Significance testing is an equivalent
way of using the observations for evaluating the evidence in favor or against a specific
null hypothesis, such as
H0 : μ A = μ B or equivalently H0 : Δ = 0 .
The null hypothesis is a statement about the true value of one or more parameters. The logic of testing follows the classical scientific method: we derive a testable prediction from a conjectured theory, and if prediction and observation disagree, the theory is rejected or falsified. If, on the other hand, prediction and observation are in agreement, then
the experiment fails to reject the theory. However, this does not mean that the theory
is proven or verified since the agreement might be due to chance, the experiment
not specific enough, or the data too noisy to provide sufficient evidence against the
theory:
Absence of evidence is not evidence of absence.
It is instructive to explicitly write down this logic more formally. The correctness of
the conjecture C implies the correctness of the prediction P:
C true =⇒ P true .
If we observe that P is false, we invoke the contrapositive ‘P false =⇒ C false’ and conclude that C cannot be true. We say P is necessary for C. There is thus
an asymmetry in the relation of ‘P is true’ and ‘P is false’ toward the correctness
of C, and we can falsify a theory (at least in principle), but never fully verify it.
The philosopher Karl Popper argued that falsification is a cornerstone of science,
and falsifiability (in principle) of conjectures separates a scientific theory from a
non-scientific one (Popper 1959).
Statistical testing of hypotheses follows a similar logic, but adds a probabilistic
argument to quantify the (dis-)agreement between hypothesis and observation. The
data may provide evidence to reject the null hypothesis, but can never provide evi-
dence for accepting the null hypothesis. We therefore formulate a null hypothesis
for the ‘undesired’ outcome such that if we don’t reject it, nothing is gained and we
don’t have any clue from the data how to proceed in our investigation. If the data
provides evidence to reject the hypothesis, however, we can reasonably exclude it as
a possible explanation of the observed data.
To appraise the evidence that our data provides against the null hypothesis, we
need to take the random variation in the data into account. Instead of a yes/no answer,
we can then only argue that “if the hypothesis is true, then it is (un-)likely that data
like ours are observed.” This is precisely the argument from the hypothesis to the
observable consequences, but with a probabilistic twist.
For our example, we know that if the true difference Δ is zero, then the estimated
difference Δ̂ divided by its standard error has a t-distribution. This motivates the
well-known t-statistic
\[
T = \frac{\hat{\Delta}}{se(\hat{\Delta})} = \frac{\hat{\Delta}}{\sqrt{2\,\hat{\sigma}^2/n}} \sim t_{2n-2} . \tag{2.1}
\]
Thus, our conjecture is “the null hypothesis is true and Δ = 0” from which we derive
the prediction “the observed test statistic T based on the two sample averages follows
a t-distribution with 2n − 2 degrees of freedom”.
For our data, we compute a t-statistic of t = −0.46, based on the two means 10.32
for vendor A, 10.66 for vendor B, their difference Δ̂ = −0.34, and the resulting
standard error of 0.74 of the difference on 18 degrees of freedom. The estimated
difference in means is expressed by t as about 46% of a standard error, and the sign
of t indicates that measurements with kit A might be lower than those with kit B.
The test statistic remains the same for the case of paired data as in Fig. 1.1C, with
standard error calculated as in Sect. 2.3.5.3; this is known as a paired t-test.
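Both variants are single calls to t.test() in R; a sketch, reusing the vectors y and yB for the two independent samples and A and B for the paired samples:

# two-sample t-test assuming equal variances: t = -0.46, p = 0.65
t.test(y, yB, var.equal = TRUE)
# paired t-test for the design in Table 2.4
t.test(A, B, paired = TRUE)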
p-values
If we assume that the hypothesis H0 is true, we can compute the probability that our
test statistic T exceeds any given value in either direction using its known distribution.
Calculating this probability for the observed value t provides us with a quantitative
measure of the evidence that the data provide against the hypothesis. This probability
is called the p-value
p = P(|T | ≥ |t| | H0 true) .
Because our test statistic is random (it is a function of the random samples), the
p-value is also a random variable. If the null hypothesis is true, then p has a uniform
distribution between 0 and 1.
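This uniformity is easily verified by simulation; a minimal sketch, drawing both groups from the same normal distribution so that H0 holds:

set.seed(1)
# 10,000 two-sample t-tests under a true null hypothesis
p <- replicate(10000, t.test(rnorm(10), rnorm(10), var.equal = TRUE)$p.value)
hist(p)           # approximately flat between 0 and 1
mean(p <= 0.05)   # roughly 5% of the p-values fall below 0.05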
For our example, we compute a p-value of 0.65, and we expect a t-statistic that
deviates from zero by 0.46 or more in either direction in 65 out of 100 cases whenever
the null hypothesis is true. We conclude that based on the variation in the data and
the sample size, observing a difference of at least this magnitude is very likely and
there is no evidence against the null hypothesis.
A small p-value is considered indicative of H0 being false, since it is unlikely
that the observed (or larger) value of the test statistic would occur if the hypothesis
were true. This leads to a dichotomy of explanations for small p-values: either the
data led to the large value just by chance, or the null hypothesis is wrong and the
test statistic does not follow the predicted distribution. On the other hand, a large
p-value might occur either because H0 is true or because H0 is false but our data did
not yield sufficient information to detect the true difference in means.
We may still decide to ignore a large p-value and nevertheless move ahead and
assume that H0 is wrong; but the argument for doing so cannot rest on the single
experimental result and its statistical analysis, but must include external arguments
such as subject-matter considerations (e.g., about plausible effects) or outcomes of
related experiments.
If possible, we should always combine an argument based on a p-value with the
observed effect size: a small p-value (indicating that the effect found can be distin-
guished from noise) is only meaningful if the observed effect size has a practically
relevant magnitude. Even a large effect size might yield a large p-value if the sample
size is small and the variation in the data is large. In our case, the estimated difference
of Δ̂ = −0.34 is small compared to the standard error se(Δ̂) = 0.74 and cannot be
distinguished from random fluctuation. That our test does not provide evidence for
a difference, however, does not mean it does not exist.
Conversely, a low p-value means that we were able to distinguish the observed
difference from a random fluctuation, but the detected difference might have a tiny
and irrelevant effect size if the random variation is low or the sample size large.
However, this provides only one (crucial) piece of the argument why we suggest a
particular interpretation and rule out others. It does not excuse us from proposing a
reasonable interpretation of the whole investigation and other experimental data.
Statistical Significance
Error Probabilities
Any statistical test has four possible outcomes under the significant/non-significant
dichotomy, which we summarize in Table 2.5. A significant result leading to rejection
of H0 is called a positive. If the null hypothesis is indeed false, then it is a true positive,
but if the null hypothesis is indeed true, it was incorrectly rejected: a false positive
or type-I error. The significance level α is the probability of a false positive and by
choosing a specific value, we can control the type-I error of our test. The specificity of
the test is the probability 1 − α that we correctly do not reject a true null hypothesis.
Conversely, a non-significant result is called a negative. Not rejecting a true null
hypothesis is a true negative, and incorrectly not rejecting a false null hypothesis is
a false negative or a type-II error. The probability of a false negative test outcome
is denoted by β. The power or sensitivity of the test is the probability 1 − β that we
correctly reject the null hypothesis. The larger the true effect size, the more power a
test has to detect it. Power analysis allows us to determine the test parameters (most
importantly, the sample size) to provide sufficient power for detecting an effect size
deemed scientifically relevant.
Everything else being equal, the two error probabilities α and β are adversaries:
by lowering the significance level α, larger effects are required to reject the null
hypothesis, which simultaneously increases the probability β of a false negative.
This has an important consequence: if our experiment has low power (it is under-
powered), then only large effect sizes yield statistical significance. A significant
p-value and a large effect size then tempt us to conclude that we reliably detected a
substantial difference, even though we must expect such a result with probability α
in the case that no difference exists.
Under the repeated application of a test with new data, α and β are the expected
frequencies—the false positive rate, respectively, the false negative rate—of the two
types of error. This interpretation is helpful in scenarios such as quality control.
On the other hand, we usually conduct only a limited number of repeated tests in
scientific experimentation. Here, we can interpret α and β as quantifying the severity
of a statistical test—its hypothetical capability to detect the desired difference in a
planned experiment. This is helpful for planning experiments to ensure that the
data generated will likely provide sufficient evidence against an incorrect scientific
hypothesis.
For a significance level α, we reject H0 whenever the observed difference Δ̂ deviates from zero by more than t_{1−α/2, 2n−2} · se(Δ̂) = 2.1 · 0.74 ≈ 1.55. The corresponding set of values is called the rejection region R, and we reject H0 whenever the observed difference Δ̂ falls inside R. In our example, R consists of the two sets (−∞, −1.55) and (1.55, +∞). Since Δ̂ = −0.34 does not fall into this region, we do not reject the null hypothesis.
There is a one-to-one correspondence between the rejection region R for a sig-
nificance level α, and the (1 − α)-confidence interval of the estimated difference.
The estimated effect Δ̂ lies inside the rejection region if and only if the value 0 is
outside its (1 − α)-confidence interval. This equivalence is illustrated in Fig. 2.9.
Fig. 2.9 Equivalence of testing and estimation. A Estimate inside rejection region and null value
outside the confidence interval. B Estimate outside rejection region, confidence interval contains
null value. Top: null value (dark gray dots) and rejection region (dark gray shades); bottom: estimate
Δ̂ (light gray dots) and confidence interval (light gray shades)
In hypothesis testing, we put ourselves at the value under H0 (e.g., Δ = 0) and check
if the estimated value Δ̂ is so far away that it falls inside the rejection region. In
estimation, we put ourselves at the estimated value Δ̂ and check if the data provide
no objection against the hypothesized value under H0 , which then falls inside the
confidence interval. Our previous argument that the 95%-confidence interval of Δ̂
does contain zero is therefore equivalent to our current result that H0 : Δ = 0 cannot
be rejected at the 5% significance level.
We briefly present four more significance test statistics for testing expectations and
variances. Their distributions under the null hypothesis are all based on normally
distributed data.
For large sample sizes n, the t-distribution quickly approaches a standard normal
distribution. If 2n − 2 is ‘large enough’ (typically, n > 20 suffices), we can then
compare the test statistic
Δ̂
T =
se(Δ̂)
with the quantiles z α of the standard normal distribution instead of the exact t-
quantiles, and reject H0 if T > z 1−α/2 or T < z α/2 . This test is sometimes called a
z-test. There is usually no reason to forego the exactness of the t-test, however.
For our example, we reject the null hypothesis at the 5% level if the test statistic T is below z_{0.025} = −1.96 or above z_{0.975} = 1.96, as compared to the exact thresholds t_{0.025,18} = −2.1 and t_{0.975,18} = 2.1.
χ²-Test of Variance
The test statistic
\[
T = d\cdot\frac{\hat{\sigma}^2}{\sigma_0^2}
\]
compares a variance estimate σ̂² with d degrees of freedom to a hypothesized value σ₀² and follows a χ²_d-distribution under the null hypothesis H0: σ² = σ₀². A one-sided null hypothesis is that the true variance is smaller than σ₀², such that we establish an upper bound for the variance. We reject this hypothesis if T > χ²_{1−α,d} (note that we use α here and not α/2), but the true variance can be arbitrarily smaller than σ₀² if we do not reject the null hypothesis.
For our example data for vendor A, we have n = 10 normally distributed samples
and the variance estimate is based on n − 1 = 9 degrees of freedom. We reject the
hypothesis H0 : σ 2A = 3 at a 5% level if T = 9σ̂ 2A /3 < 2.7 or T > 19.02. We calculate
t = 11.31 and find a p-value of 0.75.
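Base R has no ready-made function for this test, but it is quickly computed by hand; a sketch, reusing the kit-A vector y:

sigma0.sq <- 3                # hypothesized variance
d <- length(y) - 1            # degrees of freedom, here 9
T <- d * var(y) / sigma0.sq   # test statistic, t = 11.31
T
# two-sided rejection region at the 5% level: T < 2.7 or T > 19.02
qchisq(c(0.025, 0.975), df = d)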
Levene's test addresses the hypothesis of equal variances in k groups,
\[
H_0: \sigma_1^2 = \cdots = \sigma_k^2 .
\]
Its test statistic compares the variance of the average residuals between groups to the variance of the residuals within the groups, and has an F-distribution with k − 1 numerator and N − k denominator degrees of freedom if H0 is true.
Notes
Most material in this chapter is covered by any introductory book on statistics. Two
historical examples are Fisher’s Statistical Methods for Research Workers (Fisher
1925) and Snedecor and Cochran’s Statistical Methods (Snedecor and Cochran
1989), and a very readable account of the historical development of statistics is
Salsburg (2002). A recent introductory text is Wolfe and Schneider (2017), and a very broad treatment of many methods is Wasserman (2004). Other texts emphasize using R (Dalgaard 2008; Field et al. 2012; Shahbaba 2012). We only covered a statistical approach often called frequentist; an alternative approach is Bayesian statistics, covered at length in Gelman et al. (2013).
There has been a re-emphasis on estimates and their uncertainties rather than testing in recent years; a standard account of this approach is Altman et al. (2000).
In particular, there is a strong incentive to abandon the significant/non-significant
dichotomy in hypothesis testing in favor of p-values, estimates, and confidence inter-
vals; in 2019, the American Statistical Association devoted a special issue entitled
Statistical Inference in the 21st Century: A World Beyond p < 0.05 to this topic
(Wasserstein et al. 2019). The relation of sample size, confidence intervals, and
power is discussed in Greenland et al. (2016). An insightful discussion of statistical
significance is Cox (2020). An interesting perspective on statistical inference based
on the notion of severity is given in Mayo (2018).
Standardized effect sizes are popular in fields like psychology and social sciences,
but have also been advocated for biological research (Nakagawa and Cuthill 2007).
The bootstrap method was proposed in Efron (1979) and DiCiccio and Efron (1996)
is a more recent review of confidence interval estimation.
The diagrams in Figs. 2.4, 2.7, and 2.8 are examples of Hasse diagrams. Their
use to describe statistical models and analyses began with Tjur (1984), and variants
have been proposed regularly for visualizing experimental designs and automating
their analysis (Taylor and Hilton 1981; Brien 1983; Bergerud 1996; Darius et al.
1998; Vilizzi 2005; Großmann 2014; Bate and Chatfield 2016, b). A review of recent
developments is Bailey (2020). Their use for planning and discussing experimental
designs in non-technical terms is emphasized by Lohr (1995), but only two previous
textbooks make use of them (Oehlert 2000; Bailey 2008).
Exact confidence intervals for Cohen’s d
We can calculate an exact confidence interval for an estimated standardized effect size d̂ using the noncentral t-distribution with estimated noncentrality parameter η̂ = d̂ · √(n/2). We calculate a (1 − α)-confidence interval for η̂ and then transform it into a corresponding interval for d̂ by multiplying the lower and upper confidence limits by √(2/n). To find the confidence interval for η̂, we need to find the noncentrality parameters η_lcl and η_ucl such that
\[
P(T \ge \hat{\eta}\,;\,\eta_{lcl}) = \alpha/2 \quad\text{and}\quad P(T \le \hat{\eta}\,;\,\eta_{ucl}) = \alpha/2 ,
\]
where T has a noncentral t-distribution with the respective noncentrality parameter.
Using R
Basic R covers most frequently used distributions; for a distribution X, the functions
dX() and pX() provide the density and distribution function, qX() calculates
quantiles, and rX() provides a random number generator. For example, the χ2 -
distribution has functions pchisq(), dchisq(), qchisq(), and rchisq().
Noncentral distributions are accessed using the ncp= argument in the correspond-
ing functions, e.g., qt(p=0.05, df=6, ncp=1) gives the 5% quantile of a
noncentral t-distribution with η = 1 and six degrees of freedom.
Data is most conveniently brought into a data-frame, which is a rectangular table
whose columns can be accessed by name. Our vendor example uses a data-frame
with columns y for the enzyme level (a number per row), and vendor to encode the
vendor (with entry either A or B). The tidyverse encompasses a number of packages
to comfortably work with data-frames, for example, dplyr and tidyr. A good introduction to this framework is Wickham and Grolemund (2016), which is also freely available online.
Standard estimators are mean() for the expectation, and var() and sd()
for variance and standard deviation. Covariances and correlations are calculated by
cov() and cor(). Confidence intervals are not readily available and are computed
manually; for means and mean differences, we can use a linear model of the form y~1,
respectively, y~d (d encoding the two groups) in lm(), and apply confint() to
the resulting model.
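For example, a sketch of this approach for the two-vendor comparison, assuming a data-frame with columns y and vendor as described above:

dat <- data.frame(y = c(y, yB), vendor = rep(c("A", "B"), each = 10))
m <- lm(y ~ vendor, data = dat)  # intercept: mean of A; second coefficient: B minus A
confint(m)                       # confidence intervals for both parameters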
The t-test is computed using t.test(); the option paired=TRUE calculates the paired t-test, and var.equal=TRUE enforces that variances in both samples are considered equal. The F-test of two variances is available as var.test(), and Levene's test as leveneTest() from package car; the χ²-test of a single variance is easily computed manually using qchisq() and pchisq().
The effectsize package provides cohens_d() for calculating Cohen’s d
with an exact confidence interval.
Summary
Probability distributions describe random outcomes, such as measurements with
noise; they are characterized by the cumulative distribution and density functions.
An α-quantile of a distribution is the value such that a proportion of α of possible
realizations will fall below the quantile; this plays a prominent role in constructing
confidence intervals and rejection regions.
An estimator takes a random sample and calculates the best guess for the desired
parameter. It has a mean and a standard deviation (called the standard error), and precision and accuracy describe its important properties. The precision of an estimate is given
by the standard error, and is used to construct the confidence interval that contains
all values for the parameter that are compatible with the data.
Statements about parameters can be made in the form of statistical hypotheses,
which are then tested against available data. The hypothesis is rejected (or falsified)
if the deviation between the expected distribution and the observed outcome of a test
statistic is larger than a given threshold, defined by the significance level. The smallest significance level that would still result in rejection is the p-value, the probability of seeing the observed deviation, or a larger one, if the hypothesis is in fact true.
For scientific inference, hypothesis tests are usually of secondary interest, and
estimates of relevant parameters and effect sizes should be given instead, together
with their confidence intervals. The hypothesis testing framework is useful, however,
in the planning stage of an experiment to determine sample sizes, among others.
References
Altman, D. G. et al. (2000). Statistics with Confidence. 2nd. John Wiley & Sons, Inc.
Bailey, R. A. (2008). Design of Comparative Experiments. Cambridge University Press.
Bailey, R. A. (2020). “Hasse diagrams as a visual aid for linear models and analysis of variance”.
In: Communications in Statistics - Theory and Methods, pp. 1–34.
Bate, S. T. and M. J. Chatfield (2016a). “Identifying the Structure of the Experimental Design”. In:
Journal of Quality Technology 48.4, pp. 343–364.
Bate, S. T. and M. J. Chatfield (2016b). “Using the Structure of the Experimental Design and
the Randomization to Construct a Mixed Model”. In: Journal of Quality Technology 48.4, pp.
365–387.
Bergerud, W. A. (1996). “Displaying factor relationships in experiments”. In: The American Statis-
tician 50.3, pp. 228–233.
Brien, C. J. (1983). “Analysis of Variance Tables Based on Experimental Structure”. In: Biometrics
39.1, pp. 53–59.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd. Lawrence Erlbaum
Associates, Hillsdale.
Cox, D. R. (2020). “Statistical significance”. In: Annual Review of Statistics and Its Application 7,
pp. 1.1–1.10.
Dalgaard, P. (2008). Introductory Statistics with R. Statistics and Computing. Springer New York.
Darius, P. L., W. J. Coucke, and K. M. Portier (1998). “A Visual Environment for Designing
Experiments”. In: Compstat, pp. 257–262.
DiCiccio, T. J. and B. Efron (1996). “Bootstrap confidence intervals”. In: Statistical Science 11.3,
pp. 189–228.
Efron, B. (1979). “Bootstrap Methods: Another Look at the Jackknife”. In: Annals of Statistics 7.1, pp. 1–26.
Field, A., J. Miles, and Z. Field (2012). Discovering Statistics Using R. SAGE Publications Ltd.
Fisher, R. A. (1925). Statistical Methods for Research Workers. 1st. Oliver & Boyd, Edinburgh.
Fisher, R. A. (1971). The Design of Experiments. 8th. Hafner Publishing Company, New York.
Gelman, A. et al. (2013). Bayesian Data Analysis. 3rd. Taylor & Francis.
Greenland, S. et al. (2016). “Statistical tests, P values, confidence intervals, and power: a guide to
misinterpretations”. In: European Journal of Epidemiology 31.4, pp. 337–350.
Großmann, H. (2014). “Automating the analysis of variance of orthogonal designs”. In: Computa-
tional Statistics and Data Analysis 70, pp. 1–18.
Hedges, L. and I. Olkin (1985). Statistical methods for meta-analysis. Academic Press.
Lohr, S. L. (1995). “Hasse diagrams in statistical consulting and teaching”. In: The American
Statistician 49.4, pp. 376–381.
Mayo, D. G. (2018). Statistical Inference as Severe Testing. Cambridge University Press.
Nakagawa, S. and I. C. Cuthill (2007). “Effect size, confidence interval and statistical significance:
a practical guide for biologists.” In: Biological Reviews of the Cambridge Philosophical Society
82.4, pp. 591–605.
Oehlert, G. W. (2000). A First Course in Design and Analysis of Experiments. W. H. Freeman.
Popper, K. R. (1959). The Logic of Scientific Discovery. Routledge.
Salsburg, D. (2002). The Lady Tasting Tea. Holt Paperbacks.
Shahbaba, B. (2012). Biostatistics with R. Springer New York.
Snedecor, G. W. and W. G. Cochran (1989). Statistical Methods. 8th. Iowa State University Press.
Taylor, W. H. and H. G. Hilton (1981). “A Structure Diagram Symbolization for Analysis of Vari-
ance”. In: The American Statistician 35.2, pp. 85–93.
Tjur, T. (1984). “Analysis of variance models in orthogonal designs”. In: International Statistical
Review 52.1, pp. 33–65.
Vilizzi, L. (2005). “The linear model diagram: A graphical method for the display of factor rela-
tionships in experimental design”. In: Ecological Modelling 184.2-4, pp. 263–275.
Wasserman, L. (2004). All of Statistics. Springer Texts in Statistics. Springer New York.
Wasserstein, R. L., A. L. Schirm, and N. A. Lazar (2019). “Moving to a World Beyond “p < 0.05””.
In: The American Statistician 73.sup1, pp. 1–19.
Wickham, H. and G. Grolemund (2016). R for Data Science. O’Reilly.
Wolfe, D. A. and G. Schneider (2017). Intuitive Introductory Statistics. Springer.
Chapter 3
Planning for Precision and Power
3.1 Introduction
Without comment, we always used a balanced allocation, where the same number of
experimental units is allocated to each treatment group. This choice seems intuitively
sensible, and we quickly confirm that it indeed yields the highest precision and power
in our example. We will later see that unbalanced allocation not only decreases
precision, but might prevent the estimation of relevant treatment effects altogether
in more complex designs.
We denote by n A and n B the number of mice allocated to kits A and B, respectively.
The standard error of our estimate is then
\[
se(\hat{\mu}_A - \hat{\mu}_B) = \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}\cdot\sigma ,
\]
where we estimate the two expectations by μ̂_A = Σ_{i=1}^{n_A} y_{i,A}/n_A and correspondingly for μ̂_B.
For fixed total sample size n A + n B , this standard error is minimal for a balanced
allocation with treatment groups of equal size n A = n B , provided the variance σ 2 is
identical in both treatment groups. The more unbalanced the allocation is, the larger
the standard error will become.
To illustrate, we consider two experimental designs with a total sample size of
n A + n B = 20: first, we assign a single mouse to vendor B (n B = 1), and the remain-
ing mice to vendor A (n A = 19). Then,
\[
se_{19,1} = \sqrt{\frac{1}{19} + \frac{1}{1}}\cdot\sigma = 1.026\,\sigma ,
\]
and the standard error is even higher than the dispersion in the population! However,
if we assign the mice equally (n A = n B = 10), we get a substantially lower standard
error of
\[
se_{10,10} = \sqrt{\frac{1}{10} + \frac{1}{10}}\cdot\sigma = 0.45\,\sigma .
\]
The relative efficiency RE = se²_{19,1}/se²_{10,10} ≈ 5.3 allows us to directly compare the precision of the two allocation strategies. It is the
increase in sample size needed for the first experiment to match the precision of the
second. Here, the same unbalanced allocation would require about five times more
mice to match the precision of the balanced design. This would mean using at least
5 mice for vendor B and 95 mice for vendor A (100 mice in total). Dividing the
experimental material inaptly results in a substantial loss of precision, which is very
costly to make up for.
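A minimal sketch of this comparison in R, with the standard error expressed in units of σ:

# standard error of the difference (in units of sigma) for group sizes nA, nB
se.alloc <- function(nA, nB) sqrt(1 / nA + 1 / nB)
se.alloc(19, 1)                          # unbalanced: 1.026 sigma
se.alloc(10, 10)                         # balanced:   0.447 sigma
se.alloc(19, 1)^2 / se.alloc(10, 10)^2   # relative efficiency, about 5.3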
If the two treatment groups have very different standard deviations σ A and σ B ,
then the standard error is
\[
se(\hat{\mu}_A - \hat{\mu}_B) = \sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}} ,
\]
and it is minimized by allocating the sample sizes proportionally to the standard deviations, n_A/n_B = σ_A/σ_B.
We can increase precision and power by reducing the standard deviation σ of our response values. This option is very attractive, since reducing σ to one-half will also cut the standard error to one-half and increase the value of the t-statistic by a factor of two, without altering the necessary sample size.
Recall that the standard deviation describes how dispersed the measured enzyme
levels of individual mice are around the population mean in each treatment group.
This dispersion contains the biological variation σm from mouse to mouse, and the
variation σe due to within-mouse variability and measurement error, such that
\[
\sigma^2 = \sigma_m^2 + \sigma_e^2
\quad\text{and}\quad
se(\hat{\Delta}) = \sqrt{2\cdot\frac{\sigma_m^2 + \sigma_e^2}{n}} .
\]
3.3.1 Sub-sampling
We can employ the same strategy if the measurement error is large, and we decrease
its influence on the standard error by taking r measurements of each of m samples.
This strategy is called sub-sampling and is only successful in increasing precision
and power substantially if σe is not small compared to the between-mouse variation
σm, since the contribution of σm to the standard error depends only on the number n of mice, and not on the number m of samples per mouse. In biological experiments,
the biological (mouse-to-mouse) variation is typically much larger than the technical
(sample-to-sample) variation and sub-sampling is of very limited use. Indeed, a
very common mistake is to ignore the difference between technical and biological
replicates and treat all measurements as biological replicates. This flaw is known as
pseudo-replication and leads to overestimating the precision of an estimate and thus
to much shorter, incorrect, confidence intervals, and to overestimating the power of
a test, with too low p-values and high probability of false positives (Hurlbert 1984,
2009).
For our examples, the between-mouse variance is σm2 = 1.9 and much larger than
the within-mouse variance σe2 = 0.1. For n = 10 mice per treatment group and m = 1
samples per mouse, the standard error is 0.89. Increasing the number of samples to
m = 2 reduces this error to 0.88 and further increasing to an unrealistic m = 100
only reduces the error down to 0.87. In contrast, using 11 instead of 10 mice per
treatment reduces the standard error already to 0.85.
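A sketch of this diminishing return, using the two-sample formula se(Δ̂) = √(2 (σ²m + σ²e/m)/n) for n mice per group with m samples each; the absolute values depend on how the design is counted, but the pattern is the same:

# standard error of the treatment difference with n mice per group and
# m technical replicates (sub-samples) per mouse
se.subsample <- function(n, m, sm2 = 1.9, se2 = 0.1) {
  sqrt(2 * (sm2 + se2 / m) / n)
}
se.subsample(n = 10, m = c(1, 2, 100))   # more sub-samples: little gain
se.subsample(n = 11, m = 1)              # one more mouse: larger gain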
3.3.3 Blocking
A third strategy is blocking: we subdivide the experimental material into homogeneous groups (blocks) and randomly assign each treatment within each group. This effectively removes the variation between blocks from the treatment comparison, as we saw in Sect. 2.3.5.3.
If we consider our paired-vendor example again, each observation has variance
σm2 + σe2 , and the two treatment group mean estimates μ̂ A and μ̂ B both have vari-
ance (σm2 + σe2 )/n. However, the estimate Δ̂ = μ̂ A − μ̂ B of their difference only
has variance 2 · σe2 /n, and the between-mouse variance σm2 is completely eliminated
from this estimate. This is because each observation from the same mouse is equally
affected by any systematic deviation that exists between the specific mouse and the
overall average, and this deviation therefore cancels if we look at differences between
observations from the same mouse.
For σm2 = 1.9 and σe2 = 0.1, for example, blocking by mouse reduces the expected
standard error from 0.89 in the original two-vendor experiment to 0.2 in the paired-
vendor experiment. Simultaneously, the experiment size is reduced from 20 to 10
mice, while the number of observations remains the same. Importantly, the samples—
and not the mice—are the experimental units in this experiment, since we randomly
assign kits to samples and not to mice. In other words, we still have 20 experimental
units, the same as in the original experiment.
The relative efficiency between unblocked experiment and blocked experiment is
RE = 20, indicating that blocking allows a massive reduction in sample size while
keeping the same precision and power.
As expected, the t-test equally profits from the reduced standard error. The t-
statistic is now t = −2.9 leading to a p-value of p = 0.018 and thus a significant
result at the 5% significance level. This compares to the previous t-value of t = −0.46
for the unblocked design with a p-value of 0.65.
“How many samples do I need?” is arguably among the first questions a researcher
asks when thinking about an experimental design. Sample size determination is a
crucial component of experimental design in order to ensure that estimates are suf-
ficiently precise to be of practical value and that hypothesis tests are adequately
powered to be able to detect a relevant effect size. Sample size determination cru-
cially depends on deciding on a minimal effect size. While precision and power can
always be increased indefinitely by increasing the sample size, limits on resources—
time, money, and available experimental material—pose practical limits. There is
also a diminishing return, as doubling precision requires quadrupling the sample
size.
To provide a concrete example, let us consider our comparison of the two preparation
kits again and assume that the volume of blood required is prohibitive for more than
one sample per mouse. In the two-vendor experiment based on 20 mice, we found
that our estimate Δ̂ was too imprecise to determine with any confidence which—if
any—of the two kits yields lower responses than the other.
To determine a sufficient sample size, we need to decide which minimal effect
size is relevant for us, a question answerable only with experience and subject-matter
knowledge. For the sake of the example, let us say that a difference of δ0 = ±0.5 or
larger would mean that we stick with one vendor, but a smaller difference is not of
practical relevance for us. The task is therefore to determine the number n of mice per treatment group, such that the confidence interval of Δ has width no more than one, i.e., that
\[
UCL - LCL = \left(t_{1-\alpha/2,\,2n-2} - t_{\alpha/2,\,2n-2}\right)\cdot\sqrt{2}\,\sigma/\sqrt{n} \le 2\,|\delta_0| .
\]
We note that the t-quantiles and the standard error both depend on n, which prevents
us from solving this inequality directly. For a precise calculation, we can start at
some not too large n, calculate the width of the confidence interval, increase n if the
width is too large, and repeat until the desired precision is achieved.
If we have a reason to believe that n will not be very small, then we can reduce
the problem to
\[
UCL - LCL = \left(z_{1-\alpha/2} - z_{\alpha/2}\right)\cdot\sqrt{2}\,\sigma/\sqrt{n} \le 2\,|\delta_0|
\implies
n \ge 2\cdot z_{1-\alpha/2}^2\cdot\sigma^2/\delta_0^2 ,
\]
if we exploit the fact that the t-quantile tα,n is approximately equal to the standard
normal quantile z α , which does not depend on the sample size.
For a 95%-confidence interval, we have z 0.975 = +1.96, which we can approxi-
mate as z 0.975 ≈ 2 without introducing any meaningful error. This leads to the simple
formula
\[
n \ge 8\,\sigma^2/\delta_0^2 .
\]
In order to actually calculate the sample size with this formula, we need to know
the standard deviation of the enzyme levels or an estimate σ̂. Such an estimate might
be available from previous experiments on the same problem. If not, we have to
conduct a separate (usually small) experiment using a single treatment group for
getting such an estimate. In our case, we already have an estimate σ̂ = 1.65 from our
previous experiment, from which we find that a sample size of n = 84 mice per kit
is required to reach our desired precision (the approximation z 0.975 = 2 yields n =
87). This is a substantial increase in experimental material needed. We will have to
decide if an experiment of this size is feasible for us, but a smaller experiment will
likely waste time and resources without providing a practically relevant answer.
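A sketch of both calculations in R, using the earlier estimate σ̂ = 1.65 and the minimal relevant difference δ0 = 0.5:

sigma <- 1.65
delta0 <- 0.5
# normal-based formula: n >= 2 * z^2 * sigma^2 / delta0^2
2 * qnorm(0.975)^2 * sigma^2 / delta0^2   # 83.7, so n = 84 mice per kit
# portable approximation with z replaced by 2: n >= 8 * sigma^2 / delta0^2
8 * sigma^2 / delta0^2                    # about 87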
It is often useful to turn the question around: given we can afford a certain maximal
number of mice for our experiment, what precision can we expect? If this precision
turns out to be insufficient for our needs, we might as well call the experiment off or
start considering alternatives.
For example, let us assume that we have 40 mice at our disposal for the experiment.
From our previous discussion, we know that the variances of measurements using
kits A and B can be assumed equal, so a balanced assignment of n = 20 mice per
vendor is optimal. The expected width of a 95%-confidence interval is
\[
UCL - LCL = \left(z_{0.975} - z_{0.025}\right)\cdot\sqrt{2}\,\sigma/\sqrt{n} = 1.24\cdot\sigma .
\]
Using our previous estimate of σ̂ = 1.65, we find an expected width of the 95%-confidence interval of 2.05, compared to 2.9 for the previous experiment with 10 mice per vendor: a decrease in length by a factor of √2 = 1.41 due to doubling the sample size.
This is not even close to our desired length of one, and we should consider if this
experiment is worth doing, since it uses resources without a reasonable chance of
providing a precise-enough estimate.
A more common approach for determining the required sample size is via a hypoth-
esis testing framework which allows us to also consider acceptable false positive
and false negative probabilities for our experiment. For any hypothesis test, we can
calculate each of the following five parameters from the other four:
• The significance level α: the acceptable probability of a false positive.
• The power 1 − β: this probability is lower the more stringent our α is set (for a fixed difference), and larger effects will lead to fewer false negatives for the same α level. In practice, the desired power is often about 80–90%; higher power might require prohibitively large sample sizes.
• The minimal effect size δ0 (or its standardized version d0 = δ0/σ) that the experiment should be able to detect.
• The variance σ² of the observations, usually estimated from previous data.
• The sample size n per treatment group.
We start developing the main ideas for determining a required sample size in a
simplified scenario, where we know the variance exactly. Then, the standard error
of Δ̂ is also known exactly, and the test statistic Δ̂/se(Δ̂) has a standard normal
distribution under the null hypothesis H0 : Δ = 0. The same calculations can also
be used with a variance estimate, provided the sample size is not too small and the
t-distribution of the test statistic is well approximated by the normal distribution.
In the following, we assume that we decided on the false positive probability α,
the power 1 − β, and the minimal effect size Δ = δ0 . If the true difference is smaller
than δ0 , we might still detect it, but detection becomes less and less likely the smaller
the difference gets. If the difference is greater, our chance of detection increases.
If H0: Δ = 0 is true, then Δ̂ ∼ N(0, 2σ²/n) has a normal distribution with mean zero and variance 2σ²/n. We reject the null hypothesis if Δ̂ ≤ z_{α/2} · √2 σ/√n or Δ̂ ≥ z_{1−α/2} · √2 σ/√n. These two critical values are shown as dashed vertical lines in Fig. 3.1 (top) for sample sizes n = 10 (left) and n = 90 (right). As expected, the critical values move closer to zero with increasing sample size.
If H0 is not true and Δ = δ0 instead, then Δ̂ ∼ N (δ0 , 2σ 2 /n) has a normal dis-
tribution with mean δ0 and variance 2σ 2 /n. This distribution is shown in the bottom
row of Fig. 3.1 for the two sample sizes and a true difference of δ0 = 1; it also gets
narrower with increasing sample size n.
A false negative happens if H0 is not true, so Δ = δ0 , yet the estimator Δ̂ falls
outside the rejection region. The probability of this event is
\[
P\left(|\hat{\Delta}| \le z_{1-\alpha/2}\cdot\frac{\sqrt{2}\,\sigma}{\sqrt{n}}\,;\;\Delta = \delta_0\right) = \beta .
\]
Our goal is to find n such that the probability of a false negative stays below a
prescribed value β.
We can see this in Fig. 3.1: for a given α = 5%, the dashed lines denote the
rejection region, and the black shaded area corresponds to a probability of 5%. If
Δ = δ0 , we get the distributions in the bottom row, where all values inside the dashed
lines are false negatives, and the probability β corresponds to the gray shaded area.
Fig. 3.1 Distributions of difference in means if the null hypothesis is true and the difference between
means is zero (top) and when the alternative hypothesis is true and the difference is one (bottom)
for 10 (left) and 90 (right) samples. The dashed lines are the critical values for the test statistic.
Shaded black region: false positives (α). Shaded gray region: false negatives (β)
Solving this equation for the sample size yields
\[
n = 2\,\frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2\,\sigma^2}{\delta_0^2} = 2\,\frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{d_0^2} , \tag{3.2}
\]
where we used the fact that z_β = −z_{1−β}. The first formula uses the minimal raw effect size δ0 and requires knowledge of the residual variance, whereas the second formula is based on the minimal standardized effect size d0 = δ0/σ, which measures the difference between the means as a multiple of the standard deviation.
In our example, a hypothesis test with significance level α = 5% and a variance
of σ 2 = 2 has power 11, 35, and 100% to detect a true difference of δ0 = 1 based
on n = 2, n = 10, and n = 100 mice per treatment group, respectively. We require
at least 31 mice per vendor to achieve a power of 1 − β = 80%.
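A sketch of these power calculations under the known-variance normal approximation, with σ² = 2, δ0 = 1, and α = 5%:

sigma2 <- 2; delta0 <- 1; alpha <- 0.05
# power of the two-sided z-test for n mice per treatment group
power.z <- function(n) {
  se <- sqrt(2 * sigma2 / n)
  pnorm(-qnorm(1 - alpha / 2) + delta0 / se) +
    pnorm(-qnorm(1 - alpha / 2) - delta0 / se)
}
power.z(c(2, 10, 100))                                  # about 11%, 35%, 100%
# required sample size for 80% power, Eq. (3.2)
2 * (qnorm(0.975) + qnorm(0.8))^2 * sigma2 / delta0^2   # about 31.4 mice per vendor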
The same ideas apply to calculating the minimal effect size that is detectable with
a given significance level and power for any fixed sample size. For our example, we
might only have 20 mice per vendor at our disposal. For our variance of σ 2 = 2, a
significance level of α = 5% and a power of 1 − β = 80%, we find that for n = 20,
the achievable minimal effect size is δ0 = 1.25.
A small minimal standardized effect size of d0 = δ0 /σ = 0.2 requires at least n =
392 mice per vendor for α = 5% and 1 − β = 80%. This number decreases to n =
63 and n = 25 for a medium effect d0 = 0.5, respectively, a larger effect d0 = 0.8.
A convenient approximation for the required sample size per group is
\[
n \approx \frac{16}{(\delta_0/\sigma)^2} = \frac{16}{d_0^2} , \tag{3.3}
\]
based on the observation that the numerator in Eq. (3.2) is then roughly 16 for a
significance level α = 5% and a reasonable power of 1 − β = 80%.
Such approximate formulas were termed portable power (Wheeler 1974) and
enable quick back-of-napkin calculation during a discussion, for example.
We can translate the sample size formula to a relative effect based on the coefficient
of variation CV = σ/μ (Belle and Martin 1993):
\[
n \approx \frac{16\cdot(CV)^2}{\ln(\mu_A/\mu_B)^2} .
\]
This requires taking logarithms and is not quite so portable. A convenient further
shortcut exists for a variation of 35%, typical for biological systems (Belle 2008),
noting that the numerator then simplifies to 16 · (0.35)2 ≈ 2.
For example, a difference in enzyme level of at least 20% of vendor A compared
to vendor B and a variability for both vendors of about 30% means that
\[
\frac{\mu_A}{\mu_B} = 0.8
\quad\text{and}\quad
\frac{\sigma_A}{\mu_A} = \frac{\sigma_B}{\mu_B} = 0.3 ,
\]
and we need a sample size of about
\[
n \approx \frac{16\cdot(0.3)^2}{\ln(0.8)^2} \approx 29 .
\]
For a higher variability of 35%, the sample size increases, and our shortcut yields
n ≈ 2/ ln(μ A /μ B )2 ≈ 40.
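These shortcuts are quickly checked in R:

# portable power based on the coefficient of variation
n.cv <- function(cv, ratio) 16 * cv^2 / log(ratio)^2
n.cv(0.30, 0.8)   # about 29 mice per group
n.cv(0.35, 0.8)   # about 39, close to the shortcut 2 / log(0.8)^2 = 40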
In practice, the variance σ 2 is usually not known and the test statistic T uses an
estimate σ̂ 2 instead. If H0 is true and Δ = 0, then T has a t-distribution with 2n − 2
degrees of freedom, and its quantiles depend on the sample size. If H0 is false and
the true difference is Δ = δ0 , then the test statistic has a noncentral t-distribution
with noncentrality parameter
\[
\eta = \frac{\delta_0}{se(\hat{\Delta})} = \frac{\delta_0}{\sqrt{2}\,\sigma/\sqrt{n}} = \sqrt{n/2}\cdot\frac{\delta_0}{\sigma} = \sqrt{n/2}\cdot d_0 .
\]
Fig. 3.2 t-distribution for 5 (top) and 20 (bottom) degrees of freedom and three different noncentrality parameters (linetype)
For illustration, Fig. 3.2 shows the density of the t-distribution for different numbers of samples and different values of the noncentrality parameter; note that the noncentral t-distribution is not symmetric, and t_{α,n}(η) ≠ −t_{1−α,n}(η). The noncentrality parameter can be written as η² = 2n · (d0²/4), the product of the experiment size 2n and the (squared) effect size d0²/4.
An increase in sample size has two effects on the distribution of the test statistic T: (i) it moves the critical values inwards, although this effect is only pronounced for small sample sizes; (ii) it increases the noncentrality parameter with √n and thereby shifts the distribution away from zero. This is in contrast to our earlier discussion using a normal distribution, where an increase in sample size results in a decrease in the variance.
In other words, increasing the sample size slightly alters the shape of the dis-
tribution of our test statistic, but more importantly moves it away from the central
t-distribution under the null hypothesis. The overlap between the two distributions
then decreases and the same observed difference between the two treatment means
is more easily distinguished from a zero difference.
As an example, assume we have a reason to believe that the two kits indeed give
consistently different readouts. For a significance level of α = 5% and n = 10, we
calculate the power that we successfully detect a true difference of |Δ| = δ0 = 2,
of δ0 = 1, and of δ0 = 0.5. Under the null hypothesis, the test statistic T has a (central) t-distribution with 2n − 2 degrees of freedom, and we reject the hypothesis if |T| > t_{1−α/2, 2n−2} (Fig. 3.3 (top)).
Fig. 3.3 Distributions of the t-statistic if the null hypothesis is true and the true difference is zero (top) and when the alternative hypothesis is true and the true difference is δ0 = 2 (bottom), for 2 (left) and 10 (right) samples. The dashed lines are the critical values for the test statistic. Shaded black region: false positives (α). Shaded gray region: false negatives (β)
If, however, Δ = δ0 is true, then the distribution
of T changes to a noncentral t-distribution with 2n − 2 degrees of freedom and
noncentrality parameter η = δ0 /se(Δ̂) (Fig. 3.3 (bottom)). The power 1 − β is the
probability that this T falls into the rejection region and either stays above t1−α/2, 2n−2
or below tα/2, 2n−2 .
We compute the upper t-quantile for n = 10 as t0.975,18 = 2.1. If the true difference
is δ0 = 2, then the probability to stay above this value (and correctly reject H0 ) is
high with a power of 73%. This is because the standard error is 0.74 and thus the
precision of the estimate Δ̂ is large compared to the difference we attempt to detect.
Decreasing this difference while keeping the significance level and the sample size
fixed decreases the power to 25% for δ0 = 1 and further to 10% for δ0 = 0.5. In
other words, we can expect to detect a true difference of δ0 = 0.5 in only 10% of
experiments with 10 samples per vendor and it is questionable if such an experiment
is worth implementing.
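A sketch of these power calculations via the noncentral t-distribution, using the standard error se(Δ̂) = 0.74 from our data:

n <- 10; alpha <- 0.05; dof <- 2 * n - 2
t.crit <- qt(1 - alpha / 2, df = dof)    # critical value 2.1
power.nct <- function(delta0, se = 0.74) {
  ncp <- delta0 / se                     # noncentrality parameter
  # probability that T falls into the rejection region
  1 - pt(t.crit, df = dof, ncp = ncp) + pt(-t.crit, df = dof, ncp = ncp)
}
sapply(c(2, 1, 0.5), power.nct)          # about 73%, 25%, 10%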
It is not possible to find a closed formula for the sample size calculation, because
the central and noncentral t-quantiles depend on n, while the noncentrality param-
eter depends on n and additionally alters the shape of the noncentral t-distribution.
R’s built-in function power.t.test() uses an iterative approach and yields a
(deceptively precise!) sample size of n = 43.8549046 per vendor to detect a dif-
ference δ0 = 1 with 80% power at a 5% significance level, based on our previous
estimate σ̂ 2 = 2.73 of the variance. We provide an iterative algorithm for illustration
in Sect. 3.6.
Note that we replaced the unknown true variance with an estimate σ̂ 2 for this calcu-
lation, and the accuracy of the resulting sample size hinges partly on the assumption
that the variance estimate is reasonably close to the true variance.
When confronted with an undesired non-significant test outcome from their exper-
iment, researchers sometimes calculate the observed power or retrospective power
based on the effect size and residual variance estimated from the data. It is then
argued that large observed power provides evidence in favor of the null hypothesis. Appealing as this idea might seem, it is fundamentally flawed and based on the
improper use of the concept of power.
To see this, let us imagine that two t-tests are performed, with resulting test
statistics t1 and t2 , and associated non-significant p-values p1 > α and p2 > α.
Assume that t1 > t2 and the first hypothesis test indicates a larger deviation from the
null hypothesis. The first p-value is then smaller than the second, which we interpret
as stronger—yet not significant—evidence against the null hypothesis in the first
test. If we now calculate the observed power at t1 and t2, respectively, for our desired
significance level α, we find that this power is larger for the first experiment since
t1 is further away from the zero value. The proposed argument then claims that this
larger observed power provides more evidence in favor of the null hypothesis. This
directly contradicts our previous interpretation of the two p-values.
The fallacy arises because p-value and observed power are both based on the
same (random) values of the test statistic and residual variance estimated from the
specific data of each experiment. This always results in higher observed power for
lower p-values and leads to the apparent contradiction in the example. Indeed, it can
be shown that the observed power is in direct correspondence to the p-value and
therefore provides no additional information.
Similar problems result if the power is based on the observed residual variance, but
calculated at a specific effect size deemed scientifically relevant. Because it is again
based on the specific outcome of the experiment, this power cannot be interpreted as
the power to detect an effect of the specific size and provides no evidence in favor
or against the null hypothesis.
We have already seen a much better—and logically correct—way of interpreting a
(non-significant) test result by estimating the difference and calculating its confidence
interval. In contrast to observed power, p-value and confidence interval provide different pieces of information, and there is no direct correspondence between them. If the
interval is wide and contains the value zero, as in our unpaired-vendor example, we
conclude that the data provide little evidence for or against the null hypothesis. If the
interval is short, as in our paired-vendor example, we conclude that plausible values
are restricted to a narrow range. If this range includes zero, we have evidence that
the true value is unlikely to be far off zero.
An equivalent argument can be made using two one-sided significance tests, where
the null hypotheses are H0 : Δ < −δ0 and H0 : Δ > +δ0 . Note that these reverse
the burden of proof and a rejection of both hypotheses means that the true difference
is likely in the interval (−δ0 , +δ0 ). This is known as an equivalence test, where the
aim is to show that two treatments are equal (rather than different). These tests play
a prominent role in toxicity or environmental studies, where we try to demonstrate that responses from a treatment group exposed to a potential hazard differ by no more than ±δ0 from those of a non-exposed group.
Calculating power prospectively in the planning of an experiment ensures that tests
are adequately powered and estimates are sufficiently precise. Power analysis has no
role retrospectively in the analysis of data from an experiment. Here, estimates of
effect sizes and their confidence intervals are the appropriate quantities, augmented
by p-values if necessary.
3.6 Notes and Summary
Notes
A general discussion of power and sample size is given in Cohen (1992, 1988) and
Krzywinski and Altman (2013), and a gentle practical introduction is Lenth (2001).
Sample sizes for confidence intervals are addressed in Goodman and Berlin (1994),
Maxwell et al. (2008), and Rothman and Greenland (2018); the equivalence to power
calculation based on testing is discussed in Altman et al. (2000). The fallacies of
‘observed power’ are elucidated in Hoenig and Heisey (2001) and equivalence testing
in Schuirmann (1987). The free software G∗Power is an alternative to R for sample size determination (Faul et al. 2007).
Power analysis code
To further illustrate our power analysis, we implement the corresponding calculations
in an R function. The following code calculates the power of a t-test given the minimal
difference delta, the significance level alpha, the sample size n, and the standard
deviation s.
The function first calculates the degrees of freedom for the given sample size.
Then, the lower and upper critical values for the t-statistic under the null hypothesis
are computed. Next, the noncentrality parameter η = ncp is used to determine the
probability of correctly rejecting H0 if indeed |Δ| = δ0 ; this is precisely the power.
We start from a reasonably low n, calculate the power, and increase the sample
size until the desired power is reached.
# delta0: true difference
# alpha: significance level (false positive rate)
# n: sample size
# s: standard deviation
# return: power
getPowerT = function(delta0, alpha, n, s) {
df = 2*n-2 # degrees of freedom
q.H0.low = qt(p=alpha/2, df=df) # low rejection quantile
q.H0.high = qt(p=1-alpha/2, df=df) # high rejection quantile
ncp = abs(delta0) / (sqrt(2)*s/sqrt(n)) # noncentrality
# prob. to reject low or high values if H0 false
p.low = pt(q=q.H0.low, df=df, ncp=ncp)
p.high = 1 - pt(q=q.H0.high, df=df, ncp=ncp)
return( p.low + p.high )
}
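To illustrate the iterative search, a minimal sketch using the function just defined (and the σ̂² = 2.73 variance estimate from before); it should stop near the n = 44 that rounds up the power.t.test() result:

# Increase n until 80% power is reached for delta0 = 1, alpha = 5%
n = 2
while (getPowerT(delta0 = 1, alpha = 0.05, n = n, s = sqrt(2.73)) < 0.8)
  n = n + 1
n   # approximately 44 per vendor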
Using R
Base-R provides the power.t.test() function for power calculations based on
the t-distribution. It takes four of the five parameters and calculates the fifth.
Summary
Determining the required sample size of an experiment—at least approximately—is
part of the experimental design. We can use the hypothesis testing framework to
determine sample size based on the two error probabilities, a measure of variation,
and the required minimal effect size. The resulting sample size should then be used
to determine if estimates have sufficient expected precision. We can also determine
the minimal effect size detectable with desired power and sample size, or the power
achieved from a given sample size for a minimal effect size, all of which we can use
to decide if an experiment is worth doing. Precision and power can also be increased
without increasing the sample size, by balanced allocation, narrowing experimental
conditions, or blocking. Power analysis is often based on noncentral distributions,
whose noncentrality parameters are the product of experiment size and effect size;
portable power formulas use approximations of various quantities to allow back-of-the-envelope power analysis.
References
Altman, D. G. et al. (2000). Statistics with Confidence. 2nd. John Wiley & Sons, Inc.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd. Lawrence Erlbaum
Associates, Hillsdale.
Cohen, J. (1992). “A Power Primer”. In: Psychological Bulletin 112.1, pp. 155–159.
Faul, F. et al. (2007). “G∗Power3: a flexible statistical power analysis program for the social,
behavioral, and biomedical sciences”. In: Behavior Research Methods 39.2, pp. 175–191.
Goodman, S. N. and J. A. Berlin (1994). “The Use of Predicted Confidence Intervals When Planning
Experiments and the Misuse of Power When Interpreting Results”. In: Annals of Internal Medicine
121, pp. 200–206.
Hoenig, J. M. and D. M. Heisey (2001). “The abuse of power: The pervasive fallacy of power
calculations for data analysis”. In: American Statistician 55.1, pp. 19–24.
Hurlbert, S. H. (1984). “Pseudoreplication and the Design of Ecological Field Experiments”. In:
Ecological Monographs 54.2, pp. 187–211.
Hurlbert, S. H. (2009). “The ancient black art and transdisciplinary extent of pseudoreplication”.
In: Journal of Comparative Psychology 123.4, pp. 434–443.
Krzywinski, M. and N. Altman (2013). “Points of significance: Power and sample size”. In: Nature
Methods 10, pp. 1139–1140.
Lenth, R. V. (2001). “Some Practical Guidelines for Effective Sample Size Determination”. In: The
American Statistician 55.3, pp. 187–193.
Maxwell, S. E., K. Kelley, and J. R. Rausch (2008). “Sample Size Planning for Statistical Power
and Accuracy in Parameter Estimation”. In: Annual Review of Psychology 59.1, pp. 537–563.
Rothman, K. J. and S. Greenland (2018). “Planning Study Size Based on Precision Rather Than Power”. In: Epidemiology 29.5, pp. 599–603.
Schuirmann, D. J. (1987). “A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability”. In: Journal of Pharmacokinetics and Biopharmaceutics 15.6, pp. 657–680.
van Belle, G. (2008). Statistical Rules of Thumb. 2nd. John Wiley & Sons, Inc.
van Belle, G. and D. C. Martin (1993). “Sample size as a function of coefficient of variation and
ratio”. In: American Statistician 47.3, pp. 165–167.
Wheeler, R. E. (1974). “Portable Power”. In: Technometrics 16.2, pp. 193–201.
Chapter 4
Comparing More Than Two Groups:
One-Way ANOVA
4.1 Introduction
We extend our discussion from experiments with two treatment groups to experiments
with k treatment groups, assuming completely random treatment allocation. In this
chapter, we develop the analysis of variance framework to address the omnibus null
hypothesis that all group means are equal and there is no difference between the
treatments. The main idea is to partition the overall variation in the data into one part
attributable to differences between the treatment group means, and a residual part.
We can then test the equality of group means by comparing variances between group
means and within each group using an F-test.
We also look at corresponding effect size measures to quantify the overall dif-
ference between treatment group means. The associated power analysis uses the
noncentral F-distribution, where the noncentrality parameter is the product of exper-
iment size and effect size. Several simplifications allow us to derive a portable power
formula for quickly approximating the required sample size.
The analysis of variance is intimately linked to a linear model, and we formally
introduce Hasse diagrams to describe the logic of the experiment and to derive the
corresponding linear model and its specification in statistical software.
We consider investigating four drugs for their properties to alter the metabolism in
mice, and we take the level of a liver enzyme as a biomarker to indicate this alteration,
where higher levels are considered ‘better’. Metabolization and elimination of the
drugs might be affected by the fatty acid metabolism, but for the moment we control
this aspect by feeding all mice with the same low-fat diet and return to the diet effect
in Chap. 6.
Table 4.1 Measured enzyme levels for four drugs assigned to eight mice each
D1 13.94 16.04 12.70 13.98 14.31 14.71 16.36 12.58
D2 13.56 10.88 14.75 13.05 11.53 13.22 11.52 13.99
D3 10.57 8.40 7.64 8.97 9.38 9.13 9.81 10.02
D4 10.78 10.06 9.29 9.36 9.04 9.41 6.86 9.89
Fig. 4.1 Observed enzyme levels in response to four drug treatments with eight mice per treatment group. A Individual observations by treatment group. B Grand mean (horizontal line) and group means (diamonds) used in estimating the between-groups variance. C Individual responses (gray points) are compared to group means (black diamonds) for estimating the within-group variance
The data in Table 4.1 and Fig. 4.1A show the observed enzyme levels for N = n ·
k = 32 mice, with n = 8 mice randomly assigned to one of the k = 4 drugs D1, D2,
D3, and D4. We denote the four average treatment group responses as μ1 , . . . , μ4 ;
we are interested in testing the omnibus hypothesis H0 : μ1 = μ2 = μ3 = μ4 that the
group averages are identical and the four drugs therefore all have the same effect on
the enzyme levels.
Other interesting questions regard the estimation and testing of specific treatment
group comparisons, which we postpone to Chap. 5.
where $\hat\mu = \sum_{i=1}^{k}\hat\mu_i/k$ is an estimate of the grand mean μ. Since $\text{Var}(\hat\mu_i) = \sigma_e^2/n$, this provides us with an estimator
$$\tilde\sigma_e^2 = n\cdot\widehat{\text{Var}}(\hat\mu_i) = n\cdot\sum_{i=1}^{k}(\hat\mu_i - \hat\mu)^2/(k-1)$$
for the variance σe2 that only considers the dispersion of group means around the
grand mean and is independent of the dispersion of individual observations around
their group mean.
On the other hand, our previous estimator pooled over groups is
$$\hat\sigma_e^2 = \Biggl(\underbrace{\frac{\sum_{j=1}^{n}(y_{1j}-\hat\mu_1)^2}{n-1}}_{\text{variance group 1}} + \cdots + \underbrace{\frac{\sum_{j=1}^{n}(y_{kj}-\hat\mu_k)^2}{n-1}}_{\text{variance group }k}\Biggr)\Bigg/\,k \;=\; \sum_{i=1}^{k}\sum_{j=1}^{n}\frac{(y_{ij}-\hat\mu_i)^2}{N-k}$$
and also estimates the variance σe2 (Fig. 4.1C). It only considers the dispersion of
observations around their group means and is independent of the μi being equal. For
example, we could add a fixed number to all measurements in one group and this
would affect σ̃e2 but not σ̂e2 .
The two estimators have expectations
$$\text{E}(\tilde\sigma_e^2) = \sigma_e^2 + \underbrace{\frac{n}{k-1}\sum_{i=1}^{k}(\mu_i-\mu)^2}_{Q} \quad\text{and}\quad \text{E}(\hat\sigma_e^2) = \sigma_e^2\;,$$
respectively. Thus, while σ̂e2 is always an unbiased estimator of the residual vari-
ance, the estimator σ̃e2 has bias Q—the between-groups variance—but is unbiased
precisely if H0 is true and Q = 0. If H0 is true, then the ratio
$$F = \frac{\tilde\sigma_e^2}{\hat\sigma_e^2} \sim F_{k-1,N-k}$$
follows an F-distribution with k − 1 and N − k degrees of freedom.
4.3 One-Way Analysis of Variance

Our derivation of the omnibus F-test used the decomposition of the data into a
between-groups and a within-groups component. We can exploit this decomposition
further in the (one-way) analysis of variance (ANOVA) by directly partitioning the
overall variation in the data via sums of squares and their associated degrees of
freedom. In the words of its originator:
The analysis of variance is not a mathematical theorem, but rather a convenient method
of arranging the arithmetic. (Fisher 1934)
The arithmetic advantages of the analysis of variance are no longer relevant today, but
the decomposition of the data into various parts for explaining the observed variation
remains an easily interpretable summary of the experimental results.
To stress that ANOVA decomposes the variation in the data, we first write each datum yij as a sum of three components: the grand mean, the deviation of the group mean from the grand mean, and the deviation of the datum from its group mean:
$$y_{ij} = \bar{y}_{\cdot\cdot} + (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}) + (y_{ij} - \bar{y}_{i\cdot})\;.$$
Sums of Squares
We quantify the overall variation in the observations by the total sum of squares, the
summed squared distances of each datum yi j to the estimated grand mean ȳ·· .
Following the partition of each datum, the total sum of squares is also partitioned
into two parts: (i) the treatment (or between-groups) sum of squares which measures
the variation between group means and captures the variation explained by the sys-
tematic differences between the treatments, and (ii) the residual (or within-groups)
sum of squares which measures the variation of responses within each group and
thus captures the unexplained random variation:
$$SS_\text{tot} = \sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{\cdot\cdot})^2 = \underbrace{n\cdot\sum_{i=1}^{k}(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2}_{SS_\text{trt}} + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{i\cdot})^2}_{SS_\text{res}}\;.$$
The intermediate term $2\sum_{i,j}(y_{ij}-\bar{y}_{i\cdot})(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot}) = 0$ vanishes because SStrt is based on group means and grand mean, while SSres is independently based on observations and group means; the two are orthogonal.
For our example, we find a total sum of squares of SStot = 197.26, a treatment sum
of squares SStrt = 155.89, and a residual sum of squares SSres = 41.37; as expected,
the latter two add precisely to SStot . Thus, most of the observed variation in the data
is due to systematic differences between the treatment groups.
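The decomposition is easily verified numerically; a minimal sketch, using the drugs data frame with columns y and drug described in Sect. 4.5:

grand  = mean(drugs$y)                         # grand mean
means  = tapply(drugs$y, drugs$drug, mean)     # four group means
n.i    = table(drugs$drug)                     # group sizes (all 8)
SS.tot = sum((drugs$y - grand)^2)              # total: 197.26
SS.trt = sum(n.i * (means - grand)^2)          # treatment: 155.89
SS.res = sum((drugs$y - means[drugs$drug])^2)  # residual: 41.37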
Degrees of Freedom
Associated with each sum of squares term are its degrees of freedom, the number of
independent components used to calculate it.
The total degrees of freedom for SStot are dftot = N − 1, because we have N
response values, and need to compute a single value ȳ·· to find the sum of squares.
The treatment degrees of freedom are dftrt = k − 1, because there are k treatment
means, estimated by ȳi· , but the calculation of the sum of squares requires the overall
average ȳ·· .
Finally, there are N residuals, but we used up 1 degree of freedom for the overall
average, and k − 1 for the group averages, leaving us with dfres = N − k degrees of
freedom.
The degrees of freedom then decompose as
$$\text{df}_\text{tot} = N - 1 = (k-1) + (N-k) = \text{df}_\text{trt} + \text{df}_\text{res}\;.$$
This decomposition tells us how much of the data we ‘use up’ for calculating each
sum of squares component.
Mean Squares
Dividing a sum of squares by its degrees of freedom gives the corresponding mean
squares, which are exactly our two variance estimates. The treatment mean squares
are given by
$$MS_\text{trt} = \frac{SS_\text{trt}}{\text{df}_\text{trt}} = \frac{SS_\text{trt}}{k-1} = \tilde\sigma_e^2$$
and are our first variance estimate based on group means and grand mean, while the
residual mean squares
$$MS_\text{res} = \frac{SS_\text{res}}{\text{df}_\text{res}} = \frac{SS_\text{res}}{N-k} = \hat\sigma_e^2$$
are our second independent estimator for the within-group variance. We find MSres =
41.37/28 = 1.48 and MStrt = 155.89/3 = 51.96 for our example.
In contrast to the sums of squares, the mean squares do not decompose by factor: MStot = SStot/(N − 1) = 6.36 ≠ MStrt + MSres = 53.44.
Omnibus F-Test
The F-statistic is the ratio of the two mean squares, F = MStrt/MSres, and we reject H0 if the observed F-statistic exceeds the (1 − α)-quantile F1−α,dftrt,dfres.
Based on the sum of squares and degrees of freedom decompositions, we again
find the observed test statistic of F = 51.96/1.48 = 35.17 on dftrt = 3 and dfres = 28
degrees of freedom, corresponding to a p-value of p = 1.24 × 10−9 .
ANOVA Table
The results of an ANOVA are usually presented in an ANOVA table such as Table 4.2 for our example.

Table 4.2 ANOVA table for the drug example
Source     df   Sum Sq   Mean Sq   F value   Pr(>F)
Drug        3   155.89     51.96     35.17   1.24e-09
Residuals  28    41.37      1.48
An ANOVA table has one row for each source of variation, and the first column
gives the name of each source. The remaining columns give (i) the degrees of freedom
available for calculating the sum of squares (indicating how much of the data is ‘used’
for this source of variation), (ii) the sum of squares to quantify the variation attributed
to the source, (iii) the resulting mean squares used for testing, (iv) the observed value
of the F-statistic for the omnibus null hypothesis, and (v) the corresponding p-value.
The raw difference or the standardized difference d are both easily interpretable
effect size measures for the case of k = 2 treatment groups that we can use in con-
junction with the t-test. We now introduce three effect size measures for the case of
k > 2 treatment groups for use in conjunction with an omnibus F-test.
A simple effect size measure is the variation explained, which is the proportion
of the factor’s sum of squares of the total sum of squares:
$$\eta^2_\text{trt} = \frac{SS_\text{trt}}{SS_\text{tot}} = \frac{SS_\text{trt}}{SS_\text{trt} + SS_\text{res}}\;.$$
For our example, we find η²trt = 155.89/197.26 = 0.79, confirming that the differences between drugs are responsible for 79% of the variation in the data.
The raw effect size measures the average deviation between group means and the
grand mean:
$$b^2 = \frac{1}{k}\sum_{i=1}^{k}(\mu_i - \mu)^2 = \frac{1}{k}\sum_{i=1}^{k}\alpha_i^2\;.$$
The standardized effect size
$$f^2 = b^2/\sigma_e^2 = \frac{1}{k\,\sigma_e^2}\sum_{i=1}^{k}\alpha_i^2$$
measures the average deviation between group means and grand mean in units of the residual variance. It specializes to f = d/2 for k = 2 groups.
Extending from his classification of effect sizes d, Cohen proposed that values
of f > 0.1, f > 0.25, and f > 0.4 may be interpreted as small, medium, and large
effects (Cohen 1992). An unbiased estimate of f² is
$$\hat{f}^2 = \frac{SS_\text{trt}}{SS_\text{res}} = \frac{k-1}{N-k}\cdot F = \frac{N}{N-k}\cdot\frac{1}{k\,\hat\sigma_e^2}\sum_{i=1}^{k}\hat\alpha_i^2\;,$$
where F is the observed value of the F-statistic, yielding f̂² = 3.77 for our example; the factor N/(N − k) removes the bias.
The two effect sizes f² and η²trt are translated into each other via
$$f^2 = \frac{\eta^2_\text{trt}}{1 - \eta^2_\text{trt}} \quad\text{and}\quad \eta^2_\text{trt} = \frac{f^2}{1 + f^2}\;. \tag{4.1}$$
Much like the omnibus F-test, all three effect sizes quantify any pattern of group
mean differences, and do not distinguish if each group deviates slightly from the
grand mean, or if one group deviates by a substantial amount while the remaining
do not.
4.4 Power Analysis and Sample Size for Omnibus F-test

We are often interested in determining the necessary sample size such that the
omnibus F-test reaches the desired power for a given significance level. Just as
before, we need four out of the following five quantities for a power analysis: the
two error probabilities α and β, the residual variance σe2 , the sample size per group
n (we only consider balanced designs), and a measure of the minimal relevant effect
size (raw or standardized).
Recall that under the null hypothesis of equal treatment group means, the deviations
αi = μi − μ are all zero, and F = MStrt /MSres follows an F-distribution with k − 1
numerator and N − k denominator degrees of freedom.
If the treatment effects are not zero, then the test statistic follows a noncentral
F-distribution Fk−1,N −k (λ) with noncentrality parameter
$$\lambda = n\cdot k\cdot f^2 = \frac{n\cdot k\cdot b^2}{\sigma_e^2} = \frac{n}{\sigma_e^2}\sum_{i=1}^{k}\alpha_i^2\;.$$
The noncentrality parameter is thus the product of the overall sample size N = n · k and the standardized effect size f² (which can be translated from and to η²trt via Eq. (4.1)). For k = 2, this reduces to the previous case since f² = d²/4 and tn(η)² = F1,n(λ = η²).
The idea behind the power analysis is the same as for the t-test. If the omnibus
hypothesis H0 is true, then the F-statistic follows a central F-distribution with k − 1
and N − k degrees of freedom, shown in Fig. 4.2 (top) for two sample sizes n = 2
(left) and n = 10 (right). The hypothesis is rejected at the significance level α = 5%
(black shaded area) whenever the observed F-statistic is larger than the 95% quantile
F1−α, k−1, N −k (dashed line).
If H0 is false, then the F-statistic follows a noncentral F-distribution. Two cor-
responding examples are shown in Fig. 4.2 (bottom) for an effect size f 2 = 0.34
corresponding, for example, to a difference of δ = 2 between the first and the second
treatment group, with no difference between the remaining two treatment groups.

Fig. 4.2 Distribution of F-statistic if H0 is true (top) with false positives (black shaded area), respectively, if H0 is false and the first two groups differ by a value of two (bottom) with false negatives (gray shaded area). The dashed line indicates the 95% quantile of the central F-distribution. Left: sample size n = 2; right: sample size n = 10
We observe that this distribution shifts to higher values with increasing sample
size, since its noncentrality parameter λ = n · k · f 2 increases with n. For n = 2,
we have λ = 2 · 4 · f 2 = 2.72 in our example, while for n = 10, we already have
λ = 10 · 4 · f² = 13.6. In each case, the probability β of falsely not rejecting H0 (a false negative) is the gray shaded area under the density up to the rejection quantile
F1−α, k−1, N −k of the central F-distribution. For n = 2, the corresponding power is
then 1 − β = 13% which increases to 1 − β = 85% for n = 10.
The number of treatments k is usually predetermined, and we then exploit the relation
between noncentrality parameter, effect size, and sample size for the power analysis.
A frequent challenge in practice concerns defining a reasonable minimal effect size
f 02 or b02 that we want to reliably detect. Using a minimal raw effect size also requires
an estimate of the residual variance σe2 from previous data or a dedicated preliminary
experiment.
A simple method to provide a minimal effect size uses the fact that f 2 ≥ d 2 /2k for
the standardized effect size d between any pair of group means. The standardized dif-
ference d0 = δ0 /σ = (μmax − μmin )/σ between the largest and smallest group means
therefore provides a conservative minimal effect size f 02 = d02 /2k (Kastenbaum et al.
1970).
We can improve on the inequality for specific cases, and Cohen proposed three
patterns with minimal, medium, and maximal variability of treatment group dif-
ferences αi , and provided their relation to the minimal standardized difference d0
(Cohen 1988, p. 276ff).
• If only two groups show a deviation from the common mean, we have αmax =
+δ0 /2 and αmin = −δ0 /2 for these two groups, respectively, while αi = 0 for the
k − 2 remaining groups. Then, f 02 = d02 /2k and our conservative effect size is in
fact exact.
• If the group means μi are equally spaced with distances δ0 /(k − 1), then the
omnibus effect size is f 02 = d02 /4 · (k + 1)/3(k − 1). For k = 4 and d = 3, an
example is μ1 = 1, μ2 = 2, μ3 = 3, and μ4 = 4.
• If half of the groups are at the one extreme αmax = +δ0/2 while the other half are at the other extreme αmin = −δ0/2, then f0² = d0²/4 if k is even and f0² = d0²/4 · (1 − 1/k²) if k is odd. Again for k = 4 and d = 3, an example is α1 = α3 = −1.5,
α2 = α4 = +1.5. For μ = 10 and σ 2 = 1, this corresponds to μ1 = μ3 = 8.5 and
μ2 = μ4 = 11.5.
A simple power analysis function for R is given in Sect. 4.7 for illustration, while
the more flexible built-in procedure power.anova.test() directly provides the
necessary calculations. This procedure accepts the number of groups k (groups=),
the per-group sample size n (n=), the residual variance σe2 (within.var=), the
power 1 − β (power=), the significance level α (sig.level=), and a modified version of the raw effect size ν² = Σᵢ αᵢ²/(k − 1) (between.var=) as its arguments. Given any four of these parameters, it will calculate the remaining one.
We look at a range of examples to illustrate the use of power analysis in R. We
assume that our previous analysis was completed and we intend to explore new
experiments of the same type. This allows us to use the variance estimate σ̂e2 as our
assumed within-group variance for the power analyses, where we round this estimate
to σ̂e2 = 1.5 for the following calculations. In all examples, we set our false positive
probability to the customary α = 5%.
From a minimal raw effect size b0², we find the corresponding between.var argument as
$$\nu_0^2 = \frac{k}{k-1}\cdot b_0^2\;.$$
For a minimal standardized effect size f0², the corresponding argument is
$$\nu_0^2 = \frac{k}{k-1}\cdot\sigma_e^2\cdot f_0^2\;,$$
but this formula defeats the purpose of a standardized effect size, since it explicitly requires the residual variance. The solution to this problem comes from noticing that since we measure effects in units of standard deviation, we can set σe² = 1 in this formula and use it in conjunction with within.var=1 to achieve the desired result.
The standardized effect sizes of f0² = 0.01 (small), f0² = 0.06 (medium), and f0² = 0.16 (large) then translate to ν0² = 0.01, ν0² = 0.08, and ν0² = 0.21, respectively.
For a sample size of 10 mice per drug, these minimal effect sizes correspond to
a power of 7%, 21%, and 50%, respectively, and to achieve a desired power of 80%
would require 274, 45, and 18 mice per drug, respectively.
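For illustration, these numbers can be reproduced with the built-in function; a minimal sketch on the standardized scale (within.var=1), shown here for the medium effect size:

# Power for n = 10 mice per drug at a medium effect (f0^2 = 0.06)
power.anova.test(groups = 4, n = 10, between.var = 0.08,
                 within.var = 1, sig.level = 0.05)$power   # about 0.21
# Sample size per drug for 80% power at the same effect size
power.anova.test(groups = 4, between.var = 0.08,
                 within.var = 1, sig.level = 0.05, power = 0.8)$n  # about 45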
For finding the required sample size based on a minimal group difference, we plan
a follow-up experiment to detect a difference of at least δ0 = 1 between the average
enzyme levels of D1 and D2. With our assumed variance of σe2 = 1.5, this corre-
sponds to a standardized effect size of d0 = δ0 /σ = 0.82.
We use Cohen’s first pattern and set α1 = δ0 /2, α2 = −δ0 /2, α3 = α4 = 0, which
yields a standardized effect size of f 2 = 0.083, respectively, η 2 = 0.077. Using the
formulas above, this corresponds to a between.var parameter of ν02 = 0.17.
The required sample size for a power of 80% is 34 mice per group.
If the minimal effect size is instead specified as a variation explained η²0,drug, the corresponding between.var argument (again used together with within.var=1) is
$$\nu_0^2 = \frac{k}{k-1}\cdot\frac{\eta^2_{0,\text{drug}}}{1 - \eta^2_{0,\text{drug}}}\;.$$
We can also find the minimal effect size detectable for a given sample size and power. For example, we might plan a similar experiment with the same four drugs, but have only at most 20 mice per drug available. Plugging everything in, we can then calculate the minimal standardized effect size that such an experiment detects with the desired power.
The power calculations for the omnibus F-test rely on the same assumptions as
the test itself, and require identical residual variances for each treatment group and
normally distributed residuals. Results are only marginally affected by moderately
different group variances or moderate deviations from normality, but the seemingly
precise six decimals of the built-in power.anova.test() should not be taken too
literally. More severe errors result if observations are not independent, for example,
if correlations arise by measuring the same unit multiple times.
Even if all assumptions are matched perfectly, the calculated sample size is still
based on an educated guess or a previous estimate of the residual variance. We should
therefore make an allowance for a margin of error in our calculations.
We can again use a conservative approach and base our power calculations on
the upper confidence limit rather than the variance estimate itself. For our example,
the residual variance is σ̂e2 ≈ 1.5 with a 95%-confidence interval of [0.94, 2.74]. For
k = 4 groups, a desired power of 80% and a significance level of 5%, detecting a
raw effect of b02 = 0.12 ( f 02 = 0.08), requires 34 mice per group. Taking the upper
confidence limit, the required sample size increases by a factor of roughly UCL/σ̂e2 ≈
2.74/1.5 = 1.8 to 61 mice per group.
A less-conservative approach increases the ‘exact’ sample size by 20–30%; for
our example, this yields a sample size of 40 for a 20% margin, and of 44 for a 30%
margin, compared to the original exact sample size of 34.
The portable power procedure exploits the fact that for the common significance
level α = 5% and a commonly desired power of 1 − β = 80%, the noncentrality
parameter λ changes comparatively little, allowing us to use a crude approximation
for our calculations (Wheeler 1974). Such a procedure is very helpful in the early
stages of planning an experiment, when all that is needed are reasonably accurate
approximations for sample sizes to gauge the practical implications of an experiment
design.
The portable power procedure uses the quantity
$$\phi^2 = \lambda/k = n\cdot f^2\;.$$
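The near-constancy of λ at these conventional error rates can be checked directly; a small sketch computing the λ required for 80% power at α = 5% with k = 4 groups as the residual degrees of freedom grow:

# Required noncentrality parameter for 80% power, alpha = 5%, k = 4
sapply(c(8, 16, 32, 64, 128), function(df2) {
  gap = function(l) 1 - pf(qf(0.95, 3, df2), 3, df2, ncp = l) - 0.8
  uniroot(gap, c(1, 50))$root
})
# decreases slowly and settles near 11 for moderate df2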
4.5 Hasse Diagrams and Linear Model Specification

The analysis of variance has an intimate connection with classical linear regression
and both methods are based on describing the observed data by a linear mathemati-
cal model (cf. Sect. 4.7). The analysis of more complex designs becomes relatively
straightforward when this connection is exploited, and most statistical software will
internally run a linear regression procedure for computing an ANOVA. While this
relieves the practitioner from much tedious algebra, it still means that the appropriate
linear model for an experimental design has to be correctly specified for the software.
The specification has two parts: first, the experimental design has to be translated
into the linear model, such that the statistical inferences fully capture the logical
structure of our experiment. And second, the linear model has to be translated into a
model specification in the software. We can solve both problems by Hasse diagrams
that visualize the logical structure of an experiment, and from which both a linear
model formula and a symbolic representation of the model can be derived with relative
ease. We already saw some simple examples of these diagrams in Figs. 2.4, 2.7 and
2.8. We now work out the connection between design, diagram, and model more
systematically. Some of the following discussions might seem overly complicated
for the relatively simple designs discussed so far, but are necessary for more complex
designs in the following chapters.
The treatment structure of an experiment describes the treatment factors and their
relationships. In our drug example, the experiment has a single treatment factor Drug
with four levels D1, D2, D3, and D4. Other designs use several treatment factors,
and each applied treatment is then a combination of one level from each treatment
factor.
The unit (or design) structure describes the unit factors and their relationships.
A unit factor logically organizes the experimental material, and our experiment has
a single unit factor (Mouse) with 32 levels, each level corresponding to one mouse.
Unit factors are of several basic types: the smallest subdivision of the experimental
material to which levels of a treatment factor are randomly assigned is called the
experimental unit of this treatment factor; it provides the residual variance for testing
this treatment factor.
Groups of units are specified by a grouping factor, also known as a blocking factor;
these are often non-specific and of no direct inferential interest, but are used to remove
variation from comparisons or take account of units in the same group being more
similar than units in different groups. A blocking factor can also be intrinsic and
describe a non-randomizable property of another unit factor; a common example is
the sex of an animal, which we cannot deliberately choose (so it is not a treatment),
but which we need to keep track of in our inferences.
Treatment factors are often fixed factors with predetermined fixed levels, while
unit factors are often random factors whose levels are a random sample from a
population; in a replication of the experiment, the fixed factor levels would remain the
same (we use the same four drugs again), while the random factor levels change (we
do not use the same mice again). We denote treatment factors by an informative name
written in bold and unit factors in italics; we denote random factors by parentheses:
the treatment factor Drug is fixed for our experiment, while the unit factor (Mouse)
is random.
The observations are recorded on the response unit factor, and we mainly consider
experiments with a simple response structure where a single value is observed on
one unit factor in the design, which we denote by underlining.
The treatment and unit structures are created by nesting and crossing of factors. A
factor A is crossed with another factor B if each level of A occurs together with each
level of B and vice versa. This implicitly defines a third interaction factor, denoted A:B, whose levels are the possible combinations of levels of A with levels of B. In our paired design (Fig. 2.8), the treatment factor Vendor is crossed with (Mouse), since each kit (that is, each level of Vendor) is assigned to each mouse. We omitted the interaction factor, since it coincides with (Sample) in this case. The data layout for two crossed factors is shown in Fig. 4.3A; the cross-tabulation is completely filled.

Fig. 4.3 Examples of data layouts. A Factors Vendor and Mouse are crossed. B Factor Mouse is nested in factor Drug
A factor B is nested in another factor A if each level of B occurs together with
one and only one level of A. For our current example, the factor (Mouse) is nested in
Drug, since we have one or more mice per drug, but each mouse is associated with
exactly one drug; Fig. 4.3B illustrates the nesting for two mice per drug.
When designing an experiment, the treatment structure is determined by the pur-
pose of the experiment: what experimental factors to consider and how to combine
factor levels to treatments. The unit structure is then used to accommodate the treat-
ment structure and to maximize precision and power for the intended comparisons.
In other words:
The treatment structure is the driver in planning experiments, the design structure is the
vehicle. (van Belle 2008, p180)
Finally, the experiment structure combines the treatment and unit structures and is
constructed by making the randomization of each treatment on its experimental unit
explicit. It provides the logical arrangement of units and treatments.
Fig. 4.4 Hasse diagrams for a completely randomized design for determining the effect of four different drugs using 8 mice per drug and a single response measurement per mouse
The experiment structure diagram is derived from the treatment and unit structures by considering the randomization. The Hasse diagram visualizes the nesting/crossing relations between the factors. Each factor is
represented by a node, shown as the factor name. If factor B is nested in factor A,
we write B below A and connect the two nodes with an edge. The diagram is thus
‘read’ from top to bottom. If A and B are crossed, we write them next to each other
and connect each to the next factor that it is nested in. We then create a new factor
denoted by A : B, whose levels are the combinations of levels of A with levels of B,
and draw one edge from A and one edge from B to this factor. Each diagram has a
top node called M or M, which represents the grand mean, and all other factors are
nested in this top node.
The Hasse diagrams for our drug example are shown in Fig. 4.4. The treatment
structure contains the single treatment factor Drug, nested in the obligatory top node
M (Fig. 4.4A). Similarly, the unit structure contains only the factor (Mouse) nested
in the obligatory top node M (Fig. 4.4B).
We construct the experiment structure diagram as follows: first, we merge the two
top nodes M and M of the treatment and unit structure diagram, respectively, into a
single node M. We then draw an edge from each treatment factor to its experimental
unit factor. If necessary, we clean up the resulting diagram by removing unnecessary
‘shortcut’ edges: whenever there is a path A–B–C, we remove the edge A–C if it
exists since its nesting relation is already implied by the path.
In our example, we merge the two nodes M and M into a single node M. Both Drug
and (Mouse) are now nested under the same top node. Since Drug is randomized on
(Mouse), we write (Mouse) below Drug and connect the two nodes with an edge. This
makes the edge from M to (Mouse) redundant and we remove it from the diagram
(Fig. 4.4C).
We complete the diagram by adding the number of levels for each factor as a
superscript to its node, and by adding the degrees of freedom for this factor as a
subscript. The degrees of freedom for a factor A are calculated as the number of
levels minus the degrees of freedom of each factor that A is nested in. The number
of levels and the degrees of freedom of the top node M are both one.
The superscripts are the number of factor levels: 1 for M, 4 for Drug, and 32
for (Mouse). The degrees of freedom for Drug are therefore 3 = 4 − 1, the number
of levels minus the degrees of freedom of M. The degrees of freedom for (Mouse)
are 28, which we calculate by subtracting from its number of levels (32) the three
degrees of freedom of Drug and the single degree of freedom for M.
We can derive the omnibus F-test directly from the experiment diagram in Fig. 4.4C:
the omnibus null hypothesis claims equality of the group means for the treatment
factor Drug. This factor has four such means (given by the superscript), and the
F-statistic has three numerator degrees of freedom (given by the subscript). For
any experimental design, we find the corresponding experimental unit factor that
provides the within-group variance σe2 in the diagram by starting from the factor
tested and moving downwards along edges until we find the first random factor.
In our example, this trivially leads to identifying (Mouse) as the relevant factor,
providing N − k = 28 degrees of freedom (the subscript) for the F-denominator.
The test thus compares MSdrug to MSmouse .
Fig. 4.5 Completely randomized design for determining the effect of four different drugs using 8 mice per drug and four samples measured per mouse
The unit structure now contains an additional factor (Sample) that provides the observations. Since each sample belongs to one mouse, and each mouse
has several samples, the factor (Sample) is nested in (Mouse). The observations are
then partitioned first into 32 groups—one per mouse—and further into 128—one per
sample per mouse. For the experiment structure, we randomize Drug on (Mouse),
and arrive at the diagram in Fig. 4.5C.
The F-test for the drug effect again uses the mean squares for Drug on 3 degrees
of freedom. Using our rule, we find that (Mouse)—and not (Sample)—is the experi-
mental unit factor that provides the estimate of the variance for the F-denominator on
28 degrees of freedom. As far as this test is concerned, the 128 samples are technical
replicates or pseudo-replicates. They do not reflect the biological variation against
which we need to test the differences in enzyme levels for the four drugs, since drugs
are randomized on mice and not on samples.
For a completely randomized design with k treatment groups, we can write each
datum yi j explicitly as the corresponding treatment group mean and a random devi-
ation from this mean:
$$y_{ij} = \mu_i + e_{ij} = \mu + \alpha_i + e_{ij}\;. \tag{4.2}$$
The first model is called a cell means model, while the second, equivalent, model is
a parametric model. If the treatments had no effect, then all αi = μi − μ are zero
and the data are fully described by the grand mean μ and the residuals ei j . Thus, the
parameters αi measure the systematic difference of each treatment from the grand
mean and are independent of the experimental units.
It is crucial for an analysis that the linear model fully reflects the structure of
the experiment. The Hasse diagrams allow us to derive an appropriate model for
any experimental design with comparative ease. For our example, the diagram in
Fig. 4.4C has three factors: M, Drug, and (Mouse), and these are reflected in the
three sets of parameters μ, αi , and ei j . Note that there are four parameters αi to
produce the four group means, but given three and the grand mean μ, the fourth
parameter can be calculated; thus, there are four parameters αi , but only three can
be independently estimated given μ, as reflected by the three degrees of freedom for
Drug. Further, the ei j are 32 random variables, and this is reflected in the fact that
(Mouse) is a random factor. Given estimates for μ and αi , the ei j in each of the four
groups must sum to zero and only 28 values are independent.
For the sub-sampling example in Fig. 4.5, the linear model is
$$y_{ijk} = \mu + \alpha_i + m_{ij} + e_{ijk}\;,$$
where the mij describe the random deviations of each mouse from its group mean, and the eijk the deviations of each sample from its mouse's average.
The aov() function provides all the necessary functionality for calculating complex
ANOVAs and for estimating the model parameters of the corresponding linear mod-
els. It requires two arguments: data= indicates a data-frame with one column for
each variable in the model, and aov() uses the values in these columns as the input
data. The model is specified with the formula= argument using a formula. This for-
mula describes the factors in the model and their crossing and nesting relationships,
and can be derived directly from the experiment diagram.
For our first example, the data is stored in a data-frame called drugs which con-
sists of 32 rows and three columns: y contains the observed enzyme level, drug the
drug (D1 to D4), and mouse the number of the corresponding mouse (1 . . . 32). Our
data are then analyzed using the command aov(formula=y~1+drug+Error(mouse), data=drugs).
The corresponding formula has three parts: on the left-hand side, the name of
the column containing the observed response values (y), followed by a tilde. Then,
a part describing the fixed factors, which we can usually derive from the treatment
structure diagram: here, it contains the special symbol 1 for the grand mean μ and the
term drug encoding the four parameters αi . Finally, the Error() part describes the
random factors and is usually equivalent to the unit structure of the experiment. Here,
it contains only mouse. An R formula can often be further simplified; in particular,
aov() will always assume a grand mean 1, unless it is explicitly removed from the
model, and will always assume that each row is one observation relating to the lowest
random factor in the diagram. Both parts can be omitted from the formula and are
implicitly added; our formula is thus equivalent to y~drug. We can read a formula
as an instruction: explain the observations yi j in column y by the factors on the right:
a grand mean 1/μ, the refinements drug/αi that give the group means when added
to the grand mean, and the residuals mouse/ei j that cover each difference between
group mean and actual observation.
The function aov() returns a complex data structure containing the fitted model.
Using summary(), we produce a human-readable ANOVA table:
m = aov(y~drug, data=drugs)
summary(m)
It corresponds to our manually computed Table 4.2. Each factor in the diagram
produces one row in the ANOVA table, with the exception of the trivial factor M.
Moreover, the degrees of freedom correspond between each diagram factor and
its table row. This provides a quick and easy check if the model formula correctly
describes the experiment structure. While not strictly necessary here, this is an impor-
tant and useful feature of Hasse diagrams for more complex experimental designs.
From the Hasse diagrams, we construct the formula for the ANOVA as follows:
• There is one variable for each factor in the diagram;
• terms are added using +;
• R adds the factor M implicitly; we can make it explicit by adding 1;
• if factors A and B are crossed, we write A*B or equivalently A + B + A:B;
• if factor B is nested in A, we write A/B or equivalently A + A:B;
• the formula has two parts: one for the random factors inside the Error() term,
and one for the fixed factors outside the Error() term;
• in most cases, the unit structure describes the random factors and we can use its
diagram to derive the Error() formula;
• likewise, the treatment structure usually describes the fixed factors, and we can
use its diagram to derive the remaining formula.
In our sub-sampling example (Fig. 4.5), the unit structure contains (Sample) nested
in (Mouse), and we describe this nesting by the formula mouse/sample. The for-
mula for the model is thus y~1+drug+Error(mouse/sample), respectively,
y~drug+Error(mouse).
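As a sketch, the corresponding call could look as follows, assuming a hypothetical data frame drugs4 with columns y, drug, mouse, and sample holding four samples per mouse:

m2 = aov(y ~ drug + Error(mouse/sample), data = drugs4)
summary(m2)   # drug is tested in the mouse stratum on 3 and 28 df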
The aov() function does not directly provide estimates of the group means,
and an elegant way of estimating them in R is the emmeans() function from the
emmeans package. It calculates the expected marginal means (sometimes con-
fusingly called least squares means) as defined in (Searle and Milliken 1980) for
any combination of experiment factors using the model found by aov(). In our
case, we can request the estimated group averages μ̂i for each level of drug, which
emmeans() calculates from the model as μ̂i = μ̂ + α̂i :
library(emmeans)
em = emmeans(m, ~drug)
This yields the estimates, their standard errors, degrees of freedom, and 95%-
confidence intervals in Table 4.3.
Table 4.3 Estimated cell means, standard errors, and 95%-confidence intervals
Mean se df LCL UCL
D1 14.33 0.43 28 13.45 15.21
D2 12.81 0.43 28 11.93 13.69
D3 9.24 0.43 28 8.36 10.12
D4 9.34 0.43 28 8.46 10.22
4.6 Unbalanced Data

For groups of sizes n1, . . . , nk that are not necessarily equal, the partition of the total sum of squares into a treatment and a residual part still holds:
$$SS_\text{tot} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i + \bar{y}_i - \bar{y})^2 = \underbrace{\sum_{i=1}^{k} n_i\cdot(\bar{y}_i - \bar{y})^2}_{SS_\text{trt}} + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2}_{SS_\text{res}}\;.$$
Clearly, if no responses are observed for a treatment group, its average response
cannot be estimated, and no inference is possible about this group or its relation to
other groups.
The ratio of treatment to residual mean squares then again forms the F-statistic
$$F = \frac{SS_\text{trt}/(k-1)}{SS_\text{res}/(N-k)}\;,$$
which has an F-distribution with k − 1 and N − k degrees of freedom under the null
hypothesis H0 : μ1 = · · · = μk that all treatment group means are equal.
With unequal numbers n i of observations per cell, we now have two reasonable
estimators for the grand mean μ: one estimator is the weighted mean ȳ·· , which is
the direct equivalent to our previous estimator:
$$\bar{y}_{\cdot\cdot} = \frac{y_{\cdot\cdot}}{N} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij} = \frac{n_1\cdot\bar{y}_{1\cdot} + \cdots + n_k\cdot\bar{y}_{k\cdot}}{n_1 + \cdots + n_k} = \frac{n_1}{N}\cdot\bar{y}_{1\cdot} + \cdots + \frac{n_k}{N}\cdot\bar{y}_{k\cdot}\;.$$
This estimator weighs each estimated treatment group mean by the number of avail-
able response values and hence its value depends on the number of observations per
group via the weights ni/N. Its variance is
$$\text{Var}(\bar{y}_{\cdot\cdot}) = \frac{\sigma^2}{N}\;.$$
The weighted mean is often an undesirable estimator, because larger groups then
contribute more to the estimation of the grand mean. In contrast, the unweighted
mean
$$\tilde{y}_{\cdot\cdot} = \frac{1}{k}\sum_{i=1}^{k}\Biggl(\frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}\Biggr) = \frac{\bar{y}_{1\cdot} + \cdots + \bar{y}_{k\cdot}}{k}$$
first calculates the average of each treatment group based on the available obser-
vations, and then takes the mean of these group averages as the grand mean. This
is precisely the estimated marginal mean, an estimator for the population marginal
mean μ = (μ1 + · · · + μk )/k.
In the direct extension to our discussion in Sect. 3.2, its variance is
$$\text{Var}(\tilde{y}_{\cdot\cdot}) = \frac{\sigma^2}{k^2}\cdot\Bigl(\frac{1}{n_1} + \cdots + \frac{1}{n_k}\Bigr)\;.$$
4.7 Notes and Summary

Notes
Standard analysis of variance requires normally distributed, independent data in
each treatment group and homogeneous group variances. Moderate deviations from
normality and moderately unequal variances have little impact on the F-statistics,
but non-independence can have devastating effects (Cochran 1947; Lindman 1992).
The method by Kruskal and Wallis provides an alternative to the one-way ANOVA
based on ranks of observations rather than an assumption of normally distributed
data (Kruskal and Wallis 1952).
Effect sizes for ANOVA are comparatively common in psychology and related
fields (Cohen 1988; Lakens 2013), but have also been advertised for biological sci-
ences, where they are much less used (Nakagawa and Cuthill 2007). Like any esti-
mate, standardized effect sizes should also be reported with a confidence interval, but
standard software rarely provides out-of-the-box calculations (Kelley 2007); these
confidence intervals are calculated based on noncentral distributions, similar to our
calculations in Sect. 2.5 for Cohen’s d (Venables 1975; Cumming and Finch 2001).
The use of its upper confidence limit rather than the variance’s point estimate for
power analysis is evaluated in Kieser and Wassmer (1996). A pilot study can be con-
ducted to estimate the variance and enable informed power calculations for the main
experiment; a group size of n = 12 is recommended for clinical pilot trials (Julious 2005). The use of portable power analysis makes use of φ² for which tables were
originally given in Tang (1938).
A Simple Function for Calculating Sample Size
The following code illustrates the power calculations from Sect. 4.4. It accepts a
variety of effect size measures (b2 , f 2 , η 2 , or λ), either the numerator degrees of
freedom df1= or the number of treatment groups k=, the denominator degrees of
freedom df2= or the number of samples n=, and the significance level alpha=
and calculates the power of the omnibus F-test. This code is meant to be more
instructional than practical, but its more versatile arguments will also allow us to
perform power analysis for linear contrasts, which is not possible with the built-in
power.anova.test(). For a sample size calculation, we would start with a
small n, use this function to find the power, and then increment n until the desired
power is exceeded.
# Provide EITHER b2 (with s = residual variance) OR f2 OR lambda OR eta2
# Provide EITHER k OR df1,df2
getPowerF = function(n, k=NULL, df1=NULL, df2=NULL,
                     b2=NULL, s=NULL,
                     f2=NULL, lambda=NULL, eta2=NULL,
                     alpha) {
# If necessary, calculate degrees of freedom
if(is.null(df1)) df1 = k - 1
if(is.null(df2)) df2 = n*k - k
# Map effect size to ncp lambda=f2*n*k if necessary
if(!is.null(b2)) lambda = b2 * n * (df1+1) / s
if(!is.null(f2)) lambda = f2 * n * (df1+1)
if(!is.null(eta2)) lambda = (eta2 / (1-eta2)) * n * (df1+1)
# Critical quantile for rejection under H0
q = qf(p=1-alpha, df1=df1, df2=df2)
# Power
pwr = 1 - pf(q=q, ncp=lambda, df1=df1, df2=df2)
return(pwr)
}
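As a quick check against the earlier example, the function reproduces the roughly 80% power found for f0² = 0.083 with 34 mice per group:

getPowerF(n = 34, k = 4, f2 = 0.083, alpha = 0.05)   # approximately 0.8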
The one-way ANOVA model can also be written as a linear regression model
$$y_{ij} = \mu + \alpha_i + e_{ij} = \mu + \alpha_1\cdot x_{1j} + \cdots + \alpha_k\cdot x_{kj} + e_{ij}\;,$$
with regressors xij that encode the group membership of each observation. The parameters are not uniquely determined, and software imposes a coding (a constraint) to estimate them; all standard inferences concern quantities, such as group averages, which are independent of the coding for the parameters used in the model estimation procedure.
The treatment coding uses α̂r = 0 for a specific reference group r; μ̂ then estimates
the group mean of the reference group r , and α̂i estimates the difference between
group i and the reference group r . The vector of regressors for the response yi j is
$$(x_{1j},\dots,x_{kj}) = (0,\dots,0,\underset{i}{1},0,\dots,0)\ \text{ for } i\neq r\;,\qquad (x_{1j},\dots,x_{kj}) = (0,\dots,0)\ \text{ for } i = r\;.$$
In R, the treatment groups are usually encoded as a factor variable, and aov()
will use its alphabetically first level as the reference group by default. We can set
the reference level manually using relevel() in base-R or the more convenient
fct_relevel() from the forcats package. We can use the dummy.coef()
function to extract all parameter estimates from the fitted model, and
summary.lm() to see the estimated parameters together with their standard errors
and t-tests. The treatment coding can also be set manually using the contrasts=
argument of aov() together with contr.treatment(k), where k is the number
of treatment levels.
For our example, aov() uses the treatment level D1 as the reference level. This
gives the estimates (Intercept) = μ̂ = 14.33 for the average enzyme level in
the reference group, which serves as the grand mean, D1 = α̂1 = 0 as expected,
and the three differences from D2, D3, and D4 to D1 as D2 = α̂2 = −1.51, D3 =
α̂3 = −5.09, and D4 = α̂4 = −4.99, respectively. We find the group mean for D3,
for example, as μ̂3 = μ̂ + α̂3 = 14.33 + (−5.09) = 9.24.
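A minimal sketch for extracting these estimates, reusing the drugs data frame from Sect. 4.5:

m = aov(y ~ drug, data = drugs)   # default treatment coding, reference D1
dummy.coef(m)    # intercept 14.33 and differences of D2-D4 to D1
summary.lm(m)    # the same estimates with standard errors and t-tests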
The sum coding uses the constraint $\sum_{i=1}^{k}\hat\alpha_i = 0$. Then, μ̂ is the estimated grand mean, and α̂i is the estimated deviation of the ith group mean from this grand mean. The vector of regressors for the response yij is
$$(x_{1j},\dots,x_{kj}) = (0,\dots,0,\underset{i}{1},0,\dots,0)\ \text{ for } i = 1,\dots,k-1\;,\qquad (x_{1j},\dots,x_{kj}) = (-1,\dots,-1,0)\ \text{ for } i = k\;.$$
The aov() function uses the argument contrasts= to specify the desired
coding for each factor, and contrasts=list(drug=contr.sum(4)) pro-
vides the sum encoding for our example. The resulting parameter estimates are
(Intercept) = μ̂ = 11.43 for the average enzyme level over all groups, and
D1 = α̂1 = 2.9, D2 = α̂2 = 1.38, D3 = α̂3 = −2.19, and D4 = α̂4 = −2.09,
respectively, as the specific differences from each estimated group mean to the esti-
mated general mean. As required, the estimated differences add to zero and we find
the group mean for D3 as μ̂3 = μ̂ + α̂3 = 11.43 + (−2.19) = 9.24.
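The corresponding sketch for the sum coding:

m.sum = aov(y ~ drug, data = drugs,
            contrasts = list(drug = contr.sum(4)))
dummy.coef(m.sum)   # grand mean 11.43 and deviations that sum to zero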
Using R
Base-R provides the aov() function for calculating an analysis of variance. The
model is specified using R’s formula framework, which implements a previously pro-
posed symbolic description of models (Wilkinson and Rogers 1973). The
summary() function prints the ANOVA table of a fitted model. Alternatively,
the same model specification (but without Error()) can be used with the linear
regression function lm() in which case summary() provides the (independent)
regression parameter estimates and dummy.coef() gives a list of all parameters.
The functions summary.lm() and summary.aov() provide the respective other
view for these two equivalent models. Further details are provided in Sect. 4.5.3.
ANOVA tables can also be calculated from other types of regression models (we
look at linear mixed models in subsequent chapters) using the function anova()
with the fitted model. This function also allows specifying the approximation method
for calculating degrees of freedom using its ddf= option. Power analysis for a one-
way ANOVA is provided by power.anova.test(). Completely randomized
experiments can be designed using design.crd() from package agricolae,
which also provides randomization.
Exact confidence limits for f 2 and η 2 are translated from those calculated for the
noncentrality parameter λ of the F-distribution. The effectsize package pro-
vides functions cohens_f() and eta_squared() for calculating these effect
sizes and their confidence intervals.
Summary
The analysis of variance decomposes the overall variation in the data into parts
attributable to different sources, such as treatment factors and residuals. The decom-
position relies on a partition of the sum of squares, measuring the variation, and
the degrees of freedom, measuring the amount of data expended on each source.
The ANOVA table provides a summary of this decomposition: sums of squares and
degrees of freedom for each source, and the resulting mean squares. The F-test tests
the null hypothesis that all treatment group means are equal. It uses the ratio of the
treatment mean squares and the residual mean squares, which provide two inde-
pendent estimates of the residual variance under the null hypothesis. For two groups,
this is equivalent to a t-test with equal group variances.
Hasse diagrams visualize the logical structure of an experiment: we distinguish
the unit structure with the unit factors and their relations from the treatment structure
with the treatment factors and their relations. We combine both into the experiment
structure by linking each treatment factor to the unit factor on which it is randomized;
this is the experimental unit for that treatment factor and provides the residual mean
squares of the F-test. We derive the model specification directly from the Hasse
diagram and can verify the correct specification by comparing the degrees of freedom
from the diagram with those in the resulting ANOVA table.
References
Cochran, W. G. (1947). “Some Consequences When the Assumptions for the Analysis of Variance
are not Satisfied”. In: Biometrics 3.1, pp. 22–38.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd. Lawrence Erlbaum
Associates, Hillsdale.
Cohen, J. (1992). “A Power Primer”. In: Psychological Bulletin 112.1, pp. 155–159.
Cumming, G. and S. Finch (2001). “A primer on the understanding, use, and calculation of con-
fidence intervals that are based on central and noncentral distributions”. In: Educational and
Psychological Measurement 61.4, pp. 532–574.
Fisher, R. A. (1934). “Discussion to “Statistics in Agricultural Research””. In: Journal of the Royal
Statistical Society 1, pp. 26–61.
Julious, S. A. (2005). “Sample size of 12 per group rule of thumb for a pilot study”. In: Pharma-
ceutical Statistics 4.4, pp. 287–291.
Kastenbaum, M. A., D. G. Hoel, and K. O. Bowman (1970). “Sample size requirements: One-way
Analysis of Variance”. In: Biometrika 57.2, pp. 421–430.
Kelley, K. (2007). “Confidence intervals for standardized effect sizes: theory, application, and
implementation”. In: Journal of Statistical Software 20.8, e1–e24.
Kenward, M. G. and J. H. Roger (1997). “Small Sample Inference for Fixed Effects from Restricted
Maximum Likelihood”. In: Biometrics 53.3, pp. 983–997.
Kieser, M. and G. Wassmer (1996). “On the use of the upper confidence limit for the variance from
a pilot sample for sample size determination”. In: Biometrical Journal 38, pp. 941–949.
Kruskal, W. H. and W. A. Wallis (1952). “Use of Ranks in One-Criterion Variance Analysis”. In:
Journal of the American Statistical Association 47.260, pp. 583–621.
Lakens, D. (2013). “Calculating and reporting effect sizes to facilitate cumulative science: A prac-
tical primer for t-tests and ANOVAs”. In: Frontiers in Psychology 4, pp. 1–12.
Lenth, R. V. (2016). “Least-Squares Means: The R Package lsmeans”. In: Journal of Statistical
Software 69.1.
Lindman, H. R. (1992). Analysis of Variance in Experimental Design. Springer Berlin/Heidelberg.
Nakagawa, S. and I. C. Cuthill (2007). “Effect size, confidence interval and statistical significance:
a practical guide for biologists.” In: Biological Reviews of the Cambridge Philosophical Society
82.4, pp. 591–605.
Nelder, J. A. (1994). “The statistics of linear models: back to basics”. In: Statistics and Computing
4.4, pp. 221–234.
Nelder, J. A. (1998). “The great mixed-model muddle is alive and flourishing, alas!” In: Food
Quality and Preference 9.3, pp. 157–159.
Satterthwaite, F. E. (1946). “An approximate distribution of estimates of variance components”. In:
Biometrics Bulletin 2.6, pp. 110–114.
Searle, S. R., F. M. Speed, and G. A. Milliken (1980). “Population Marginal Means in the Linear-
Model - An Alternative to Least-Squares Means”. In: American Statistician 34.4, pp. 216–221.
Tang, P. C. (1938). “The power function of the analysis of variance test with tables and illustrations
of their use”. In: Statistical Research Memoirs 2, pp. 126–149.
van Belle, G. (2008). Statistical Rules of Thumb. 2nd. John Wiley & Sons, Inc.
Venables, W. (1975). “Calculation of confidence intervals for noncentrality parameters”. In: Journal
of the Royal Statistical Society B 37.3, pp. 406–412.
Wheeler, R. E. (1974). “Portable Power”. In: Technometrics 16.2, pp. 193–201.
Wilkinson, G. N. and C. E. Rogers (1973). “Symbolic description of factorial models for analysis
of variance”. In: Journal of the Royal Statistical Society C 22.3, pp. 392–399.
Chapter 5
Comparing Treatment Groups with
Linear Contrasts
5.1 Introduction
The omnibus F-test appraises the evidence against the hypothesis of identical group
means, but a rejection of this null hypothesis provides little information about which
groups differ and how. A very general and elegant framework for evaluating treatment
group differences is given by linear contrasts, which provide a principled way for construct-
ing corresponding t-tests and confidence intervals. In this chapter, we develop this
framework and apply it to our four drugs example; we also consider several more
complex examples to demonstrate its power and versatility.
If a set of contrasts is orthogonal, then we can reconstitute the result of the
F-test using the results from the contrasts, and a significant F-test implies that at
least one contrast is significant. If the F-test is not significant, we might still find
significant contrasts, because the F-test considers all deviations from equal group
means simultaneously, while a contrast looks for a specific set of deviations for which
it provides more power by ignoring other potential deviations.
Considering several contrasts leads to a multiplicity problem that requires adjust-
ing confidence and significance levels. Several multiple comparison procedures allow
us to calculate these adjustments for different sets of contrasts.
For our example, we might be interested in comparing the two drugs D1 and D2.
One way of doing this is by a simple t-test between the corresponding
observations. This yields a t-value of t = 2.22 and a p-value of p = 0.044 for a
difference of μ̂1 − μ̂2 = 1.52 with a 95%-confidence interval [0.05, 2.98]. While
this approach yields a valid estimate and test, it is inefficient because we completely
neglect the information available in the observations of drugs D3 and D4. Specifi-
cally, if we assume that the variances are the same in all treatment groups, we could
use these additional observations to get a better estimate of the residual variance σe²
and increase the degrees of freedom.
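A sketch of the naive comparison, using the chapter's drugs data; only the D1 and D2 observations enter the classical two-sample t-test with equal variances:

t.test(y ~ drug, data = droplevels(subset(drugs, drug %in% c("D1", "D2"))),
       var.equal = TRUE)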
We consider three example comparisons using our four drugs. We additionally
assume that D1 and D2 share the same active component and denote these drugs as
‘Class A’, while D3 and D4 share another component (‘Class B’):
$$\text{D1 versus D2}: \quad \mu_1 - \mu_2$$
$$\text{D3 versus D4}: \quad \mu_3 - \mu_4$$
$$\text{Class A versus Class B}: \quad \frac{\mu_1 + \mu_2}{2} - \frac{\mu_3 + \mu_4}{2}\;.$$
Note that a t-test for the third comparison requires manual calculation of the corre-
sponding estimates and their standard errors first.
Linear contrasts use all data for estimation and ‘automatically’ lead to the correct t-
test and confidence interval calculations. Their estimation is one of the main purposes
for an experiment:
Contrasts of interest justify the design, not the other way around.
Formally, a linear contrast Ψ (w) for a treatment factor with k levels is a linear
combination of the group means using a weight vector w = (w1 , . . . , wk ):
Ψ (w) = w1 · μ1 + · · · + wk · μk ,
where the entries in the weight vector sum to zero, such that w1 + · · · + wk = 0.
We compare the group means of two sets X and Y of treatment factor levels by
selecting the weights wi as follows: wi = +1/|X| for each level i in X, wi = −1/|Y| for
each level i in Y, and wi = 0 for all remaining levels.
Unbalancedness only affects the precision but not the interpretation of contrast esti-
mates, and we can make our exposition more general by allowing different numbers
of samples per group, denoting by n i the number of samples in group i. From the
properties of the group mean estimates μ̂i , we know that Ψ̂ (w) is an unbiased esti-
mator of the contrast Ψ (w) and has variance
$$\text{Var}\bigl(\hat\Psi(w)\bigr) = \text{Var}\bigl(w_1\cdot\hat\mu_1 + \cdots + w_k\cdot\hat\mu_k\bigr) = \sum_{i=1}^k w_i^2\cdot\text{Var}(\hat\mu_i) = \sigma_e^2\cdot\sum_{i=1}^k \frac{w_i^2}{n_i}\;,$$

with estimated standard error

$$\widehat{\text{se}}\bigl(\hat\Psi(w)\bigr) = \hat\sigma_e\cdot\sqrt{\sum_{i=1}^k \frac{w_i^2}{n_i}}\;.$$
Note that the precision of a contrast estimate depends on the sizes of the involved
groups (i.e., those with wi ≠ 0) in an unbalanced design, and standard errors are
higher for contrasts involving groups with low numbers of replicates in this case.
The estimate of a contrast is based on the normally distributed estimates of the
group means. We can use the residual variance estimate from the preceding ANOVA,
and the resulting estimator for any contrast has a t-distribution with all N − k degrees
of freedom.
Table 5.1 Estimates and 95%-confidence intervals for three example contrasts

Comparison               Contrast                              Estimate   LCL     UCL
D1-versus-D2             Ψ(w1) = μ1 − μ2                       1.51       0.27    2.76
D3-versus-D4             Ψ(w2) = μ3 − μ4                       −0.10      −1.34   1.15
Class A-versus-Class B   Ψ(w3) = (μ1 + μ2)/2 − (μ3 + μ4)/2     4.28       3.40    5.16
A (1 − α)-confidence interval for Ψ(w) is then

$$\hat\Psi(w) \pm t_{\alpha/2,\,N-k}\cdot\widehat{\text{se}}\bigl(\hat\Psi(w)\bigr) = \hat\Psi(w) \pm t_{\alpha/2,\,N-k}\cdot\hat\sigma_e\cdot\sqrt{\sum_{i=1}^k \frac{w_i^2}{n_i}}\;.$$
A linear contrast estimate has a t-distribution for normally distributed response val-
ues. This allows us to derive a t-test for testing the null hypothesis
H0 : Ψ (w) = 0
using the test statistic T = Ψ̂(w)/ŝe(Ψ̂(w)). For our second example contrast,
w = (0, 0, +1, −1), with 8 mice per group, we find

$$T = \frac{(0)\cdot\hat\mu_1 + (0)\cdot\hat\mu_2 + (+1)\cdot\hat\mu_3 + (-1)\cdot\hat\mu_4}{\hat\sigma_e\cdot\sqrt{\bigl((0)^2 + (0)^2 + (+1)^2 + (-1)^2\bigr)/8}} = \frac{\hat\mu_3 - \hat\mu_4}{\sqrt{2}\,\hat\sigma_e/\sqrt{8}}\;.$$
This is exactly the statistic for a two-sample t-test, but uses a pooled variance estimate
over all treatment groups and N − k = 28 degrees of freedom. For our data, we
calculate t = −0.16 and p = 0.88; the enzyme levels for drugs D3 and D4 cannot be
distinguished. Equivalent calculations for all three example contrasts are summarized
in Table 5.2. Note again that for our first contrast, the t-value is about 10% larger
than with the t-test based on the two groups alone, and the corresponding p-value is
about one-half.
The emmeans package does the heavy lifting: we calculate the analysis of
variance using aov(), estimate the group means using emmeans(), and define a
list of contrasts which we estimate using contrast().
We are usually more interested in confidence intervals for contrast estimates than
we are in t-values and test results. Conveniently, the confint() function takes a
contrast() result directly and by default yields 95%-confidence intervals for our
contrasts.
For our three example contrasts, the following code performs all required calcu-
lations in just five commands:
library(emmeans)
# fit the one-way ANOVA and estimate the four group means
m = aov(y~drug, data=drugs)
em = emmeans(m, ~drug)
# define the three example contrasts by their weight vectors
ourContrasts = list(
  "D1-vs-D2"=c(1,-1,0,0),
  "D3-vs-D4"=c(0,0,1,-1),
  "Class A-vs-Class B"=c(1/2,1/2,-1/2,-1/2)
)
# estimate contrasts and find their 95%-confidence intervals
estimatedContrasts = contrast(em, method=ourContrasts)
ci = confint(estimatedContrasts)
There is no limit on the number of contrasts that we might estimate for any set of
data. On the other hand, contrasts are linear combinations of the k group means
and we also need to estimate the grand mean.1 That means that we can find exactly
k − 1 contrasts that exhaust the information available in the group means and any
additional contrast can be calculated from results of these k − 1 contrasts. This idea
is made formal by saying that two contrasts Ψ (w) and Ψ (v) are orthogonal if
$$\sum_{i=1}^k \frac{w_i \cdot v_i}{n_i} = 0\;.$$
This requirement reduces to the more interpretable ‘usual’ orthogonality condition
$\sum_{i} w_i \cdot v_i = 0$ for a balanced design.
Our three example contrasts are all pair-wise orthogonal; for example, we have
$\sum_i w_{1,i}\cdot w_{2,i} = (+1 \cdot 0) + (-1 \cdot 0) + (0 \cdot (+1)) + (0 \cdot (-1)) = 0$ for the first and second
contrast. With k = 4, only three contrasts can be mutually orthogonal, and our
three contrasts thus fully exhaust the available information.

1 We can do this using a ‘contrast’ w = (1/k, 1/k, . . . , 1/k), even though its weights do not sum
to zero.
If two contrasts Ψ (w) and Ψ (v) are orthogonal, then the associated null hypothe-
ses
H0 : Ψ (w) = 0 and H0 : Ψ (v) = 0
are logically independent in the sense that we can learn nothing about one being true
or false by knowing the other being true or false.2
A set of k − 1 orthogonal contrasts decomposes the treatment sum of squares into
k − 1 contrast sums of squares. The sum of squares of a contrast Ψ (w) is
$$SS_w = \frac{\bigl(\sum_{i=1}^k w_i\,\hat\mu_i\bigr)^2}{\sum_{i=1}^k w_i^2/n_i}\;,$$
and each contrast has one degree of freedom: dfw = 1. We can use the F-statistic
$$F = \frac{SS_w/1}{\hat\sigma_e^2} = \frac{SS_w}{MS_{res}}$$

with one numerator and N − k denominator degrees of freedom to test the null
hypothesis H0 : Ψ(w) = 0 of a single contrast.
2 A small subtlety arises from estimating the contrasts: since all t-tests are based on the same estimate
of the residual variance, the tests are still statistically dependent. The effect is usually so small that
we ignore this subtlety in practice.
Fig. 5.1 Enzyme level by concentration (C0–C4)
The order of levels of Drug is completely arbitrary: we could just as well put the
drugs of Class B as the first two levels and those of Class A as levels three and
four. For some treatment factors, levels are ordered and level i is ‘smaller’ than
level i + 1 in some well-defined sense; such factors are called ordinal. For example,
our treatment might consist of our current drug D1 administered at equally spaced
concentrations C0 < C1 < C2 < C3 < C4 . Data for 8 mice per concentration are
shown in Fig. 5.1 and indicate that the drug’s effect is negligible for the first two or
three concentrations, then increases substantially and seems to decrease again for
higher concentrations.
We can analyze such data using ordinary contrasts, but with ordered treatment factor
levels, it makes sense to look for trends. We do this using orthogonal polynomials,
which are linear, quadratic, cubic, and quartic polynomials that decompose a potential
trend into different components. Orthogonal polynomials are formulated as a special
set of orthogonal contrasts. We use the emmeans() and contrast() combination
again, and ask for a poly contrast to generate the appropriate set of orthogonal
contrasts:
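A sketch of these commands; the data frame dose and factor conc are assumed names:

# assuming the concentration data are in a data frame `dose`
# with response `y` and treatment factor `conc` (levels C0, ..., C4)
m.conc = aov(y ~ conc, data = dose)
em.conc = emmeans(m.conc, ~ conc)
contrast(em.conc, method = "poly")   # linear, quadratic, cubic, quartic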
The contrast weight vectors for our example are shown in Table 5.3.
Each polynomial contrast measures the similarity of the shape of the data to the
pattern described by the weight vector. The linear polynomial measures an upward or
downward trend, while the quadratic polynomial measures curvature in the response
to the concentrations, such that the trend is not simply a proportional increase or
decrease in enzyme level. Cubic and quartic polynomials measure more complex
curvature, but become harder to interpret directly.
For our data, we get the result in Table 5.4. We find a highly significant positive
linear trend, which means that on average, the enzyme level increases with increasing
concentration. The negligible quadratic together with significant cubic and quartic
trend components means that there is curvature in the data, but it is changing with
the concentration. This reflects the fact that the data show a large increase for the
fourth concentration, which then levels off or decreases at the fifth concentration,
leading to a sigmoidal pattern.
Time Trends
3 It is plausible that measurements on the same mouse are more similar between timepoints close
together than between timepoints further apart, a fact that ANOVA cannot properly capture.
Another example of contrasts useful for ordered factors are the orthogonal Helmert
contrasts. They compare the second level to the first, the third level to the average of
the first and second level, the fourth level to the average of the preceding three, and
so forth.
Helmert contrasts can be used for finding a minimal effective dose in a dose-
response study. Since doses are typically in increasing order, we first test the second-
lowest against the lowest dose. If the corresponding average responses cannot be
distinguished, we assume that no effect of relevant size is present (provided the
experiment is not underpowered). We then pool the data of these two doses and use
their common average to compare against the third-smallest dose, thereby increasing
the precision compared to contrasting each level only to its preceding level.
Helmert contrasts are not directly available in emmeans, but the package manual
tells us how to define them ourselves:
helmert.emmc = function(ordered.levels, ...) {
  # Use built-in R contrast to find contrast matrix
  contrast.matrix = as.data.frame(contr.helmert(ordered.levels))
  # Provide useful name for each contrast
  names(contrast.matrix) = paste(ordered.levels[-1], "vs lower")
  # Provide name of contrast set
  attr(contrast.matrix, "desc") = "Helmert contrasts"
  return(contrast.matrix)
}
In our example data in Fig. 5.1, the concentration C3 shows a clear effect. The situation
is much less clear-cut for lower concentrations C0 , . . . , C2 ; there might be a hint of
linear increase, but this might be due to random fluctuations. Using the Helmert
contrasts (a usage sketch follows below) yields the contrasts in Table 5.5 and the
results in Table 5.6. Concentrations C0, C1, C2
show no discernible differences, while enzyme levels increase significantly for con-
centration C3 , indicating that the minimal effective dose is between concentrations
C2 and C3 .
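A usage sketch, assuming the emmeans object em.conc from the concentration example above; contrast() finds our helmert.emmc() function when we pass the method name "helmert":

contrast(em.conc, method = "helmert")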
Table 5.6 Estimates for Helmert contrasts indicate minimum effective dose of at most C3
Contrast Estimate se df t value P(>|t|)
C1 versus lower 0.11 0.31 35 0.35 7.28e-01
C2 versus lower 0.38 0.54 35 0.71 4.84e-01
C3 versus lower 5.47 0.77 35 7.11 2.77e-08
C4 versus lower 3.80 0.99 35 3.83 5.10e-04
Similar to our previous discussions for simple differences and omnibus F-tests, we
might sometimes profit from a standardized effect size measure for a linear contrast
Ψ (w), which provides the size of the contrast in units of standard deviation.
A first idea is to directly generalize Cohen’s d as Ψ (w)/σe and measure the
contrast estimate in units of the standard deviation. A problem with this approach
is that the measure still depends on the weight vector: if w = (w1 , . . . , wk ) is the
weight vector of a contrast, then we can define an equivalent contrast using, for exam-
ple, w′ = 2 · w = (2 · w1 , . . . , 2 · wk ). Then, Ψ(w′) = 2 · Ψ(w), and the above stan-
dardized measure also scales accordingly. In addition, we would like our standard-
ized effect measure to equal Cohen’s d if the contrast describes a simple difference
between two groups.
Both problems are resolved by Abelson’s standardized effect size measure dw
(Abelson and Prentice 1997):
$$d_w = \sqrt{\frac{2}{\sum_{i=1}^k w_i^2}}\cdot\frac{\Psi(w)}{\sigma_e}\;, \quad \text{estimated by} \quad \hat d_w = \sqrt{\frac{2}{\sum_{i=1}^k w_i^2}}\cdot\frac{\hat\Psi(w)}{\hat\sigma_e}\;.$$
For our scaled contrast w′, we find dw′ = dw and the standardized effect sizes for
w and w′ coincide. For a simple difference μi − μj, we have wi = +1 and wj = −1,
all other weights being zero. Thus ∑i wi² = 2 and dw reduces to Cohen's d.
The power calculations for linear contrasts can be done based on the contrast estimate
and its standard error, which requires calculating the power from the noncentral t-
distribution. We follow the equivalent approach based on the contrast’s sum of squares
and the residual variance, which requires calculating the power from the noncentral
F-distribution.
Exact Method
Note that the last term has exactly the same λ = n · k · f² form that we encountered
previously, since a contrast uses k = 2 (sets of) groups and we know that f² = d²/4
for direct group comparisons.
From the noncentrality parameter, we can calculate the power for testing the null
hypothesis H0 : Ψ(w) = 0 for any given significance level α, residual variance σe²,
sample size n, and assumed true value Ψ0 of the contrast (or, equivalently, the
assumed standardized effect size dw,0). We calculate the power using our getPowerF()
function and increase n until we reach the desired power.
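A minimal sketch of such a function, assuming it computes power from the noncentral F-distribution; the signature is an assumption:

getPowerF = function(lambda, df1, df2, alpha = 0.05) {
  Fcrit = qf(1 - alpha, df1, df2)         # critical value under H0
  1 - pf(Fcrit, df1, df2, ncp = lambda)   # exceedance probability under H1
}
# first example contrast: Psi0 = 2, sigma.e^2 = 1.5, sum(w^2) = 2, n = 7
n = 7
lambda = n * 2^2 / (1.5 * 2)   # lambda = n * Psi0^2 / (sigma2 * sum(w^2))
getPowerF(lambda, df1 = 1, df2 = n * 4 - 4)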
For our first example contrast μ1 − μ2, we find ∑i wi² = 2. For a minimal dif-
ference of Ψ0 = 2 and using a residual variance estimate σ̂e² = 1.5, we calculate a
noncentrality parameter of λ = 1.33 · n. The numerator degree of freedom is df1=1
and the denominator degrees of freedom are df2=n*4-4. For a significance level
of α = 5% and a desired power of 1 − β = 80%, we find a required sample size of
n = 7. We arrive at a very conservative estimate by replacing the residual variance by
the upper confidence limit UCL = 2.74 of its estimate. This increases the required
sample size to n = 12. The sample size increases substantially to n = 95 if we want
to detect a much smaller contrast value of Ψ0 = 0.5.
Similarly, our third example contrast (μ1 + μ2)/2 − (μ3 + μ4)/2 has ∑i wi² = 1.
A minimal value of Ψ0 = 2 can be detected with 80% power at a 5% significance
level for a sample size of n = 4 per group, with an exact power of 85% (for n = 3
the power is 70%). Even though the desired minimal value is identical to the first
example contrast, we need fewer samples since we are comparing the average of two
groups to the average of two other groups, making the estimate of the difference
more precise.
The overall experiment size is then the maximum sample size required for any
contrast of interest.
Portable Power
For making the power analysis for linear contrasts portable, we apply the same ideas
as for the omnibus F-test. The numerator degrees of freedom for a contrast F-test
is dfnum = 1, and we find a sample size formula of
$$n = \frac{2\cdot\phi^2\cdot\sigma_e^2\cdot\sum_{i=1}^k w_i^2}{\Psi_0^2} = \frac{\phi^2}{(d_w/2)^2} = \frac{\phi^2}{f_w^2}\;.$$
For a simple difference contrast, ∑i wi² = 2 and we have Ψ0 = δ0. With the approxima-
tion φ² ≈ 4 for a power of 80% at a 5% significance level and k = 2 factor levels,
we derive our old formula again: n = 16/(Ψ0/σe)².
For our first example contrast with Ψ0 = 2, we find an approximate sample size
of n ≈ 6 based on φ² ≈ 4, in reasonable agreement with the exact sample size of
n = 7. If we instead ask for a minimal difference of Ψ0 = 0.5, this number increases
to n ≈ 98 mice per drug treatment group (exact: n = 95).
The sample size is lower for a contrast between two averages of group means.
For our third example contrast with weights w = (1/2, 1/2, −1/2, −1/2), we find
n = 8/(Ψ0/σe)². With Ψ0 = 2, this gives an approximate sample size of n ≈ 5 (exact:
4).
5.3 Multiple Comparisons and Post-Hoc Analyses
5.3.1 Introduction
Since there is always one pair with greatest difference, it is incorrect to use a standard
t-test for this contrast. Rather, the resulting p-value needs to be adjusted for the fact
that we cherry-picked our contrast, making it larger on average than the contrast of
any randomly chosen pair. Properly adjusted tests for post-hoc contrasts thus have
lower power than those of pre-defined planned contrasts.
4 If 200 hypotheses seem excessive, consider a simple microarray experiment: here, the difference
in expression level is simultaneously tested for thousands of genes.
These methods apply generally for multiple hypotheses. Here, we focus on testing
q contrasts Ψ (wl ) with q null hypotheses
H0,l : Ψ (wl ) = 0,
where wl = (w1l , . . . , wkl ) is the weight vector describing the lth contrast.
The Bonferroni and Holm corrections are popular and simple methods for controlling
the family-wise error probability. Both work for arbitrary sets of planned contrasts,
but are conservative and lead to low significance levels for the individual tests, often
much lower than necessary.
The simple Bonferroni method is a single-step procedure to control the family-
wise error probability by adjusting the individual significance level from α to α/q. It
does not consider the observed data and rejects the null hypothesis H0 : Ψ (wl ) = 0
if the contrast exceeds the critical value based on the adjusted t-quantile:
$$\bigl|\hat\Psi(w_l)\bigr| > t_{1-\alpha/2q,\,N-k}\cdot\hat\sigma_e\cdot\sqrt{\sum_{i=1}^k \frac{w_{il}^2}{n_i}}\;.$$
It is easily applied to existing test results: just multiply the original p-values by the
number of tests q and declare a test significant if this adjusted p-value stays below
the original significance level α.
For our three example contrasts, we previously found unadjusted p-values of 0.019
for the first, 0.875 for the second, and 10⁻¹⁰ for the third contrast. The Bonferroni
adjustment consists of multiplying each by q = 3, resulting in adjusted p-values of
0.057 for the first, 2.626 for the second (which we cap at 1.0), and 3 × 10⁻¹⁰ for
the third contrast, moving the first contrast from significant to not significant at the
α = 5% level. The resulting contrast estimates and t-test are shown in Table 5.7.
The Bonferroni–Holm method is based on the same assumptions, but uses a multi-
step procedure to find an optimal significance level based on the observed data. This
increases its power compared to the simple procedure. Let us call the unadjusted p-
values of the q hypotheses p1, . . . , pq. The method first sorts the observed p-values
such that p(1) < p(2) < · · · < p(q), where p(i) is the ith smallest observed p-value. It
then compares p(1) to α/q, p(2) to α/(q − 1), p(3) to α/(q − 2), and so on, until a
p-value exceeds its corresponding threshold. This yields the smallest index j such
that

$$p_{(j)} > \frac{\alpha}{q + 1 - j}\;;$$

the hypotheses corresponding to p(1), . . . , p(j−1) are rejected, while the remaining
hypotheses are not.

Table 5.7 Estimated contrasts and hypothesis tests adjusted using Bonferroni correction
Contrast                 Estimate   se     df   t value   P(>|t|)
D1-versus-D2             1.51       0.61   28   2.49      5.66e-02
D3-versus-D4             −0.10      0.61   28   −0.16     1.00e+00
Class A-versus-Class B   4.28       0.43   28   9.96      3.13e-10
We gain more power if the set of contrasts has more structure, and Tukey’s method is
designed for the common case of all pair-wise differences. It considers the distribution
of the studentized range (the difference between the maximal and minimal group
means) and calculates honest significant differences (HSD) (Tukey 1949a). It requires
a balanced design and rejects H0,l : Ψ (wl ) = 0 if
$$\bigl|\hat\Psi(w_l)\bigr| > q_{1-\alpha,k-1,N}\cdot\hat\sigma_e\cdot\sqrt{\frac{1}{2}\sum_{i=1}^k \frac{w_{il}^2}{n}}\;, \quad\text{that is,}\quad |\hat\mu_i - \hat\mu_j| > q_{1-\alpha,k-1,N}\cdot\frac{\hat\sigma_e}{\sqrt{n}}\;,$$
where qα,k−1,N is the α-quantile of the studentized range based on k groups and
N = n · k samples.
The result is shown in Table 5.8 for our example. Since the difference between
the two drug classes is very large, all but the two comparisons within each class
yield highly significant estimates, but neither difference of drugs in the same class
is significant after Tukey’s adjustment.
Another very common type of biological experiment uses a control group and infer-
ence focuses on comparing each treatment with this control group, leading to k − 1
contrasts. These contrasts are not orthogonal, and the required adjustment is provided
by Dunnett’s method (Dunnett 1955). It rejects the null hypothesis H0,i : μi − μ1 = 0
that treatment group i shows no difference to the control group 1 if
$$|\hat\mu_i - \hat\mu_1| > d_{1-\alpha,k-1,N-k}\cdot\hat\sigma_e\cdot\sqrt{\frac{1}{n_i} + \frac{1}{n_1}}\;,$$
where d1−α,k−1,N −k is the quantile of the appropriate distribution for this test.
Table 5.8 Estimated contrasts of all pair-wise comparisons adjusted by Tukey’s method (top) and
versus the reference adjusted using Dunnett’s method (bottom). Note that contrast D1 − D2, for
example, yields identical estimates but different p-values
Contrast Estimate se df t value P(>|t|)
Pairwise-Tukey
D1 – D2 1.51 0.61 28 2.49 8.30e-02
D1 – D3 5.09 0.61 28 8.37 2.44e-08
D1 – D4 4.99 0.61 28 8.21 3.59e-08
D2 – D3 3.57 0.61 28 5.88 1.45e-05
D2 – D4 3.48 0.61 28 5.72 2.22e-05
D3 – D4 −0.10 0.61 28 −0.16 9.99e-01
Reference-Dunnett
D2 – D1 −1.51 0.61 28 −2.49 5.05e-02
D3 – D1 −5.09 0.61 28 −8.37 1.24e-08
D4 – D1 −4.99 0.61 28 −8.21 1.82e-08
For our example, let us assume that drug D1 is the best current treatment option,
and we are interested in comparing the alternatives D2 to D4 to this reference. The
required contrasts are the differences from each drug to the reference D1, resulting
in Table 5.8.
Unsurprisingly, enzyme levels for drug D2 are barely distinguishable from those
of the reference drug D1, and D3 and D4 show very different responses than the
reference.
The method by Scheffé is suitable for testing any group of contrasts, even if they were
suggested by the data (Scheffé 1959); in contrast, most other methods are restricted to
pre-defined contrasts. Naturally, this freedom of cherry-picking comes at a cost: the
Scheffé method is extremely conservative (so effects have to be huge to be deemed
significant), and is therefore only used if no other method is applicable.
The Scheffé method rejects the null hypothesis H0,l if
$$\bigl|\hat\Psi(w_l)\bigr| > \sqrt{(k-1)\cdot F_{1-\alpha,\,k-1,\,N-k}}\cdot\hat\sigma_e\cdot\sqrt{\sum_{i=1}^k \frac{w_{il}^2}{n_i}}\;.$$
This is very similar to the Bonferroni correction, except that the number of contrasts
q is irrelevant, and the quantile is a scaled quantile of an F-distribution rather than
a quantile from a t-distribution.
Table 5.9 Estimated contrasts of our three example contrasts, assuming they were suggested by
the data; adjusted using Scheffé's method
Contrast Estimate se df t value P(>|t|)
D1-versus-D2 1.51 0.61 28 2.49 1.27e-01
D3-versus-D4 −0.10 0.61 28 −0.16 9.99e-01
Class A-versus-Class B 4.28 0.43 28 9.96 2.40e-09
For illustration, imagine that our three example contrasts were not planned before
the experiment, but rather suggested by the data after the experiment was completed.
The adjusted results are shown in Table 5.9.
We notice that the p-values are much more conservative than with any other
method, which reflects the added uncertainty due to the post-hoc nature of the con-
trasts.
5.3.6 Remarks
5.4 A Real-Life Example—Drug Metabolization
We further illustrate the one-way ANOVA and use of linear contrasts using a real-life
example (Lohasz et al. 2020).5 The two anticancer drugs cyclophosphamide (CP) and
ifosfamide (IFF) become active in the human body only after metabolization in the
liver by the enzyme CYP3A4, among others. The function of this enzyme is inhibited
by the drug ritonavir (RTV), which more strongly affects metabolization of IFF than
CP. The experimental setup consisted of 18 independent channels distributed over
several microfluidic devices; each channel contained a co-culture of multi-cellular
liver spheroids for metabolization and tumor spheroids for measuring drug action.
The experiment used the diameter of the tumor (in µm) after 12 days as the
response variable. There are six treatment groups: a control condition without drugs,
a second condition with RTV alone, and the four conditions CP-only, IFF-only, and
the combined CP:RTV and IFF:RTV. The resulting data are shown in Fig. 5.2A for
each channel.
A preliminary analysis revealed that device-to-device variation and variation from
channel to channel were negligible compared to the within-channel variance, and
these two factors were consequently ignored in the analysis. Thus, data are pooled
over the channels for each treatment group, and the experiment is analyzed as a
(slightly unbalanced) one-way ANOVA. We discuss an alternative two-way ANOVA
in Sect. 6.3.8.
Inhomogeneous Variances
The omnibus F-test and linear contrast analysis require equal within-group variances
between treatment groups. This hypothesis was tested using the Levene test and
a p-value below 0.5% indicated that variances might differ substantially. If true,
this would complicate the analysis. Looking at the raw data in Fig. 5.2A, however,
reveals a potential error in channel 16, which was labeled as IFF:RTV, but shows
tumor diameters in excellent agreement with the neighboring CP:RTV treatment.
Including channel 16 in the IFF:RTV group then inflates the variance estimate. It
was therefore decided to remove channel 16 from further analysis; the hypothesis
of equal variances is then no longer rejected (p > 0.9), and visual inspection of the
data confirms that dispersions are very similar between channels and groups.
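A sketch of the variance test, assuming the channel data are in a data frame tumors with columns diameter and condition (assumed names); leveneTest() is from the car package:

library(car)
leveneTest(diameter ~ condition, data = tumors)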
5 The authors of this study kindly granted permission to use their data. Purely for illustration, we
provide some alternative analyses to those in the publication.
Fig. 5.2 A Observed diameters by channel. Point shape indicates treatment. Channel 16 appears
to be mislabelled. B Estimated treatment-versus-control contrasts and Dunnett-adjusted 95%-
confidence intervals based on data excluding channel 16, significant contrasts are indicated as
triangles. C As (B) for specific (unadjusted) contrasts of interest
Analysis of Variance
As is expected from looking at the data in Fig. 5.2A, the one-way analysis of variance
of tumor diameter versus treatment results in a highly significant treatment effect.
Linear Contrasts
Since the F-test does not elucidate which groups differ and by how much, we proceed
with a more targeted analysis using linear contrasts to estimate and test meaningful
and interpretable comparisons. With a first set of standard contrasts, we compare
5.4 A Real-Life Example—Drug Metabolization 117
each treatment group to the control condition. The resulting contrast estimates are
shown in Fig. 5.2B together with their Dunnett-corrected 95%-confidence intervals.
Somewhat surprisingly, the RTV-only condition shows tumor diameters signif-
icantly larger than those under the control condition, indicating that RTV alone
influences the tumor growth. Both conditions involving CP show reduced tumor
diameters, indicating that CP inhibits tumor growth, as does IFF alone. Lastly, RTV
seems to substantially decrease the efficacy of IFF, leading again to tumor diameters
larger than under the control condition, but (at least visually) comparable to the RTV
condition.
The large and significant difference between control and RTV-only poses a prob-
lem for the interpretation: we are interested in comparing CP:RTV against CP-only
and similarly for IFF. But CP:RTV could be a combined effect of tumor diameter
reduction by CP (compared to control) and increase by RTV (compared to control).
We have two options for defining a meaningful contrast: (i) estimate the difference
in tumor diameter between CP:RTV and CP. This is a comparison between the com-
bined and single drug actions. Or (ii) estimate the difference between the change in
tumor diameter from CP to control and the change from CP:RTV to RTV (rather
than to control). This compares the baseline tumor diameters under control and
RTV with those under addition of CP and is the net effect of CP (provided
that RTV increases tumor diameters equally with and without CP).
The two sets of comparisons lead to different contrasts, but both are meaningful for
these data. The authors of the study decided to go for the first type of comparison and
compared tumor diameters for each drug with and without inhibitor. The two contrasts
are IFF:RTV − IFF and CP:RTV − CP, shown in rows 2 and 3 of Fig. 5.2C. Both
contrasts show a large and significant increase in tumor diameter in the presence of
the inhibitor RTV, where the larger loss of efficacy for IFF yields a more pronounced
difference.
For a complete interpretation, we are also interested in comparing the two dif-
ferences between the CP and the IFF conditions: is the reduction in tumor diameter
under CP smaller or larger than under IFF? This question addresses a difference of
differences, a very common type of comparison in biology, when different condi-
tions are contrasted and a ‘baseline’ or control is available for each condition. We
express this question as (IFF:RTV − IFF) − (CP:RTV − CP), and we can sort the
terms to derive the contrast form (IFF:RTV + CP) − (CP:RTV + IFF). Thus, we
use weights of +1 for the treatment groups IFF:RTV and CP, and weights of −1 for
the groups CP:RTV and IFF to define the contrast. Note that this differs from our
previous example contrast comparing drug classes, where we compared averages of
several groups by using weights ±1/2. The estimated contrast and confidence inter-
val is shown in the fourth row in Fig. 5.2C: the two increases in tumor diameter under
co-administered RTV are significantly different, with about 150 µm more under IFF
than under CP.
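A sketch of this difference-of-differences contrast, assuming estimated marginal means em.tumor with the six treatment levels in the (assumed) order Control, CP, IFF, RTV, CP:RTV, IFF:RTV:

dod = list("(IFF:RTV - IFF) - (CP:RTV - CP)" = c(0, +1, -1, 0, -1, +1))
contrast(em.tumor, method = dod)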
For comparison, the remaining two rows in Fig. 5.2C show the tumor diameter
increase for each drug with and without inhibitor, where the no-inhibitor condition is
compared to the control condition, but the inhibitor condition is compared to the RTV-
only condition. Now, IFF shows less increase in tumor diameter than in the previous
comparison, but the result is still large and significant. In contrast, we do not find a
difference between the CP and CP:RTV conditions, indicating that loss of efficacy for
CP is only marginal under RTV. This is because the previously observed difference
can be explained by the difference in ‘baseline’ between control and RTV-only. The
contrasts are constructed as before: for CP, the comparison is (CP:RTV − RTV) −
(CP − Control) which is equivalent to (CP:RTV + Control) − (RTV + CP).
Conclusion
5.5 Notes and Summary
Notes
Linear contrasts for finding minimum effective doses are discussed in Ruberg (1989),
and general design and analysis of dose finding studies in Ruberg (1995a, b).
Whether and when multiple comparison procedures are required is sometimes a matter
of debate. Some people argue that such adjustments are rarely called for in designed
experiments (Finney 1988; O’Brien 1983). Further discussion of this topic is pro-
vided in Cox (1965), O’Brien (1983), Curran-Everett (2000), and Noble (2009).
An authoritative treatment of the issue is Tukey (1991). A book-length treatment is
Miller (2012).
In addition to weak and strong control of the family-wise error rate (Proschan and Brittain
2020), we can also control other types of error (Lawrence 2019), most prominently
the false discovery rate (FDR) (Benjamini and Hochberg 1995). A relatively recent
graphical method allows adapting the Bonferroni–Holm procedure to a specific set
of hypotheses and can yield less conservative adjustments (Bretz et al. 2009).
Using R
Estimation of contrasts in R is discussed in Sect. 5.2.4. A very convenient option for
applying multiple comparisons procedures is to use the emmeans package and fol-
low the same strategy as before: estimate the model parameters using aov() and esti-
mate the group means using emmeans(). We can then use the contrast() func-
tion with an adjust= argument to choose a multiple correction procedure to adjust
p-values and confidence intervals of contrasts. This function also has several fre-
quently used sets of contrasts built-in, such as method="pairwise" for generat-
ing all pair-wise contrasts or method="trt.vs.ctrl1" and
method="trt.vs.ctrlk" for generating contrasts comparing all treatments
to the first or the last level of the treatment factor, respectively. For estimated marginal
means em, and either the corresponding built-in contrasts or our manually defined
set of contrasts ourContrasts, we access the five procedures as
contrast(em, method = ourContrasts, adjust = "bonferroni")
contrast(em, method = ourContrasts, adjust = "holm")
contrast(em, method = "pairwise", adjust = "tukey")
contrast(em, method = "trt.vs.ctrl1", adjust = "dunnett")
contrast(em, method = ourContrasts, adjust = "scheffe")
By default, these functions provide the contrast estimates and associated t-tests. We
can use the results of contrast() as an input to confint() to get contrast
estimates and their adjusted confidence intervals instead.
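For example, Tukey-adjusted confidence intervals for all pair-wise comparisons of the drug example are found by

confint(contrast(em, method = "pairwise", adjust = "tukey"))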
The package Superpower provides functionality to perform power analysis of
contrasts in conjunction with emmeans.
Summary
Linear contrasts are a principled way for defining comparisons between two sets of
group means and constructing the corresponding estimators, their confidence inter-
vals, and t-statistics. While an ANOVA omnibus F-test looks for any pattern of
deviation between group means, linear contrasts use specific comparisons and are
more powerful in detecting the specified deviations. Without much exaggeration,
linear contrasts are the main reason for conducting comparative experiments and
their definition is an important part of an experimental design.
With more than one hypothesis tested, multiple comparison procedures are often
required to adjust for the inflation in false positives. General purpose procedures are
easy to use, but sets of contrasts often have more structure that can be exploited to
gain more power.
Power analysis for contrasts poses no new problems, but the adjustments by MCPs
can only be considered for single-step procedures, because multi-step procedures
depend on the observed p-values which are of course unknown at the time of planning
the experiment.
References
Abelson, R. P. and D. A. Prentice (1997). “Contrast tests of interaction hypothesis”. In: Psychological
Methods 2.4, pp. 315–328.
Benjamini, Y. and Y. Hochberg (1995). “Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing”. In: Journal of the Royal Statistical Society. Series B
(Methodological) 57.1, pp. 289–300.
Bretz, F. et al. (2009). “A graphical approach to sequentially rejective multiple test procedures”. In:
Statistics in Medicine 28.4, pp. 586–604.
Cox, D. R. (1965). “A remark on multiple comparison methods”. In: Technometrics 7.2, pp. 223–
224.
Curran-Everett, D. (2000). “Multiple comparisons: philosophies and illustrations”. In: American
Journal of Physiology-Regulatory, Integrative and Comparative Physiology 279, R1–R8.
Dunnett, C. W. (1955). “A multiple comparison procedure for comparing several treatments with a
control”. In: Journal of the American Statistical Association 50.272, pp. 1096–1121.
Finney, D. J. (1988). “Was this in your statistics textbook? III. Design and analysis”. In: Experimental
Agriculture 24, pp. 421–432.
Lawrence, J. (2019). “Familywise and per-family error rates of multiple comparison procedures”.
In: Statistics in Medicine 38.19, pp. 1–13.
Lohasz, C. et al. (2020). “Predicting Metabolism-Related Drug-Drug Interactions Using a Micro-
physiological Multitissue System”. In: Advanced Biosystems 4.11, p. 2000079.
Noble, W. S. (2009). “How does multiple testing correction work?” In: Nature Biotechnology 27.12,
pp. 1135–1137.
O’Brien, P. C. (1983). “The appropriateness of analysis of variance and multiple-comparison pro-
cedures”. In: Biometrics 39.3, pp. 787–788.
Proschan, M. A. and E. H. Brittain (2020). “A primer on strong vs weak control of familywise error
rate”. In: Statistics in Medicine 39.9, pp. 1407–1413.
Ruberg, S. J. (1989). “Contrasts for identifying the minimum effective dose”. In: Journal of the
American Statistical Association 84.407, pp. 816–822.
Ruberg, S. J. (1995a). “Dose response studies I. Some design considerations”. In: Journal of Bio-
pharmaceutical Statistics 5.1, pp. 1–14.
Ruberg, S. J. (1995b). “Dose response studies II. Analysis and interpretation”. In: Journal of Bio-
pharmaceutical Statistics 5.1, pp. 15–42.
Miller Jr., R. G. (2012). Simultaneous Statistical Inference. Springer Science & Business Media.
Scheffé, H. (1959). The Analysis of Variance. John Wiley & Sons, Inc.
Tukey, J. W. (1949a). “Comparing Individual Means in the Analysis of Variance”. In: Biometrics
5.2, pp. 99–114.
Tukey, J. W. (1991). “The philosophy of multiple comparisons”. In: Statistical Science 6, pp. 100–
116.
Chapter 6
Multiple Treatment Factors: Factorial
Designs
6.1 Introduction
The treatment design in our drug example contains a single treatment factor, and
one of four drugs is administered to each mouse. Factorial treatment designs use
several treatment factors, and a treatment applied to an experimental unit is then a
combination of one level from each factor.
While we analyzed our tumor diameter example as a one-way analysis of vari-
ance, we can alternatively interpret the experiment as a factorial design with two
treatment factors: Drug with three levels, ‘none’, ‘CP’, and ‘IFF’, and Inhibitor
with two levels, ‘none’ and ‘RTV’. Each of the six previous treatment levels is then
a combination of one drug and the absence/presence of RTV.
Factorial designs can be analyzed using a multi-way analysis of variance—a
relatively straightforward extension from Chap. 4—and linear contrasts. A new phe-
nomenon in these designs is the potential presence of interactions between two (or
more) treatment factors. The difference between the levels of the first treatment factor
then depends on the level of the second treatment factor, and greater care is required
in the interpretation.
6.2 Experiment
In Chap. 4, we suspected that the four drugs might show different effects under
different diets but ignored this aspect by using the same low-fat diet for all treatment
groups. We now consider the diet again and expand our investigation by introducing
a second diet with high fat content. We previously found that drugs D1 and D2 from
class A are superior to the two drugs from class B, and therefore concentrate on
class A exclusively. In order to establish and quantify the effect of these two drugs
compared to ‘no effect’, we introduce a control group using a placebo treatment with
no active substance to provide these comparisons within our experiment.
Fig. 6.1 Experiment with two crossed treatment factors with four mice in each treatment combi-
nation
Table 6.1 Enzyme levels for combinations of three drugs and two diets, four mice per combination
The proposed experiment is as follows: two drugs, D1 and D2, and one placebo
treatment are given to 8 mice each, and treatment allocation is completely at random.
For each drug, 4 mice are randomly selected and receive a low-fat diet, while the other
4 mice receive a high-fat diet. The treatment design then consists of two treatment
factors, Drug and Diet, with three and two levels, respectively. Since each drug is
combined with each diet, the two treatment factors are crossed, and the design is
balanced because each drug–diet combination is allocated to four mice (Fig. 6.1).
The 24 measured enzyme levels for the six treatment combinations are shown in
Table 6.1 and Fig. 6.2A. A useful alternative visualization is an interaction plot as
shown in Fig. 6.2B, where we show the levels of Diet on the horizontal axis, the
enzyme levels on the vertical axis, and the levels of Drug are represented by point
shapes. We also show the average response in each of the six treatment groups and
connect those of the same drug by a line, linetype indicating the drug.
Even a cursory glance reveals that both drugs have an effect on the enzyme level,
resulting in higher observed enzyme levels compared to the placebo treatment on
a low-fat diet. Two drug treatments—D1 and placebo—appear to have increased
enzyme levels under a high-fat diet, while levels are lower under a high-fat diet for
D2. This is an indication that the effect of the diet is not the same for the three drug
treatments and there is thus an interaction between drug and diet.
Fig. 6.2 A Enzyme levels for combinations of three drugs and two diets, four mice each in a
completely randomized design with two-way crossed treatment structure. Drugs are highlighted
as point shapes. B Same data as interaction plot, with shape indicating drug treatment and lines
connecting the average response values for each drug over diets
Hasse Diagrams
The treatment structure of this experiment is shown in Fig. 6.3A and is a factorial
treatment design with two crossed treatment factors Drug and Diet, which we write
next to each other in the treatment structure diagram. Each treatment combination is
a level of the new interaction factor Drug:Diet, nested in both treatment factors.
The unit structure only contains the single unit factor (Mouse), as shown in
Fig. 6.3B.
To derive the experiment structure in Fig. 6.3C, we note that each combination of a
drug and diet is randomly assigned to a mouse. This makes (Mouse) the experimental
unit for Drug and Diet and also for their interaction Drug:Diet. We consequently
draw an edge from Drug:Diet to (Mouse); the edges from Drug and from Diet
to (Mouse) are then ‘shortcuts’ and are omitted from the diagram.
Fig. 6.3 Completely randomized design for determining the effects of two different drugs and
placebo combined with two different diets using four mice per combination of drug and diet, and a
single response measurement per mouse
6.3 ANOVA for Two Treatment Factors
The analysis of variance framework readily extends to more than one treatment factor.
We already find all the new properties of this extension with only two factors, and
we can therefore focus on this case for the moment.
For our general discussion, we consider two treatment factors A with a levels and
B with b levels; the interaction factor A:B then has a · b levels. We again denote by
n the number of experimental units for each treatment combination in a balanced
design, and by N = a · b · n the total sample size. For our example, a = 3, b = 2,
n = 4, and thus N = 24.
The cell means model describes each observation as

$$y_{ijk} = \mu_{ij} + e_{ijk}\;,$$
in which each cell mean μij is the average response of the experimental units receiving
level i of the first treatment factor and level j of the second treatment factor, and
eijk ∼ N(0, σe²) is the residual of the kth replicate for the treatment combination
(i, j).
Our example leads to 3 × 2 = 6 cell means μ11 , . . . , μ32 ; the cell mean μ11 is
the average response to the placebo treatment under the low-fat diet, and μ32 is the
average response to the D2 drug treatment under the high-fat diet.
The cell means model does not explicitly reflect the factorial treatment design, and
we could analyze our experiment as a one-way ANOVA with six treatment levels, as
we did in the tumor diameter example in Sect. 5.4.
Parametric Model
The parametric model

$$\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$$

makes the factorial treatment design explicit. Each cell mean μij is decomposed into
four contributions: a grand mean μ, the average deviation of cell mean μi· for level
i of A (averaged over all levels of B), the average deviation of cell mean μ· j of B
(averaged over all levels of A), and an interaction for each level of A:B.
The model components are shown schematically in Fig. 6.4A for our drug–diet
example. Each drug corresponds to a row, and μi· is the average response for drug i,
where the average is taken over both diets in row i. Each diet corresponds to a column,
and μ· j is the average response for column j (over the three drugs). The interaction
is the difference between the additive prediction from the row and column deviations
μi· − μ·· and μ·j − μ··, and the actual cell mean's deviation μij − μ·· from the grand mean:

$$\mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu_{\cdot\cdot} = (\mu_{ij} - \mu_{\cdot\cdot}) - \bigl[(\mu_{i\cdot} - \mu_{\cdot\cdot}) + (\mu_{\cdot j} - \mu_{\cdot\cdot})\bigr]\;.$$
Fig. 6.4 A The 3-by-2 data table with averages per treatment. The main effects are differences
between row- or column means and the grand mean. Interaction effects are differences between cell
means and the two main effects. B Marginal data considered for drug pools data for each drug over
diets. The three averages correspond to the row means and point shapes correspond to diet. Dashed
line: grand mean; solid gray lines: group means; vertical lines: group mean deviations
The analysis of variance with more than one treatment factor is a direct extension of
the one-way ANOVA, and is again based on decomposing the total sum of squares
and degrees of freedom. Each treatment factor can be tested individually using a
corresponding F-test.
Sums of Squares
For balanced designs, we decompose the sum of squares into one part for each factor
in the design. The total sum of squares is
$$SS_{tot} = SS_A + SS_B + SS_{A:B} + SS_{res} = \sum_{i=1}^a\sum_{j=1}^b\sum_{k=1}^n (y_{ijk} - \bar y_{\cdot\cdot\cdot})^2\;,$$
and measures the overall variation between the observed values and the grand mean.
We now have two treatment sum of squares, one for factor A and one for B; they are
the squared differences of the group mean from the grand mean:
$$SS_A = \sum_{i=1}^a\sum_{j=1}^b\sum_{k=1}^n (\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2 = bn\sum_{i=1}^a (\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2 \quad\text{and}\quad SS_B = an\sum_{j=1}^b (\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot})^2\;.$$
Each treatment sum of squares is found by pooling the data over all levels of the
other treatment factor. For instance, SSdrug in our example results from pooling data
over the diet treatment levels, corresponding to a one-way ANOVA situation as in
Fig. 6.4B.
The interaction sum of squares is

$$SS_{A:B} = n\sum_{i=1}^a\sum_{j=1}^b (\bar y_{ij\cdot} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot})^2\;.$$
If interactions exist, parts of SStot − SSA − SSB are due to a systematic difference
between the additive prediction for each group mean and the actual mean of that
treatment combination.
The residual sum of squares measures the distances of responses to their corre-
sponding treatment means:
$$SS_{res} = \sum_{i=1}^a\sum_{j=1}^b\sum_{k=1}^n (y_{ijk} - \bar y_{ij\cdot})^2\;.$$
For our example, we find that SStot = 129.44 decomposes into treatments SSdrug =
83.73 and SSdiet = 0.59, interaction SSdrug:diet = 19.78, and residual SSres = 25.34
sums of squares.
Degrees of Freedom
The total degrees of freedom also partition by the corresponding factors, and can
be calculated quickly and easily from the Hasse diagram. For the general two-way
ANOVA, we have

$$df_A = a - 1\;,\quad df_B = b - 1\;,\quad df_{A:B} = (a-1)(b-1)\;,\quad df_{res} = ab(n-1)\;,\quad df_{tot} = abn - 1\;.$$
Mean Squares
Mean squares are found by dividing the sum of squares terms by their degrees of
freedom. For example, we can calculate the total mean squares
$$MS_{tot} = \frac{SS_{tot}}{abn - 1}\;,$$
which correspond to the variance in the data ignoring treatment groups. Further,
MSA = SSA /dfA , MSB = SSB /dfB , MSA:B = SSA:B /dfA:B , and MSres = SSres /dfres .
In our example, MSdrug = 41.87, MSdiet = 0.59, MSdrug:diet = 9.89, and MSres = 1.41.
F-tests
We can perform an omnibus F-test for each factor by comparing its mean squares
with the corresponding residual mean squares. We find the correct denominator mean
squares from the experiment structure diagram: it corresponds to the closest random
factor below the tested treatment factor and is MSres from (Mouse) for all three tests
in our example.
The two main effect tests for A and B are based on the F-statistics

$$F = \frac{MS_A}{MS_{res}} \sim F_{a-1,\,ab(n-1)} \quad\text{and}\quad F = \frac{MS_B}{MS_{res}} \sim F_{b-1,\,ab(n-1)}\;,$$
respectively, for testing the null hypotheses H0,A : α1 = · · · = αa = 0 and
H0,B : β1 = · · · = βb = 0. These hypotheses state that the corresponding treatment groups, aver-
aged over the levels of the other factor, have equal means, and the average effect of
the corresponding factor is zero. The treatment mean squares have expectations
$$E(MS_A) = \sigma_e^2 + \frac{nb}{a-1}\sum_{i=1}^a \alpha_i^2 \quad\text{and}\quad E(MS_B) = \sigma_e^2 + \frac{na}{b-1}\sum_{j=1}^b \beta_j^2\;,$$
which both provide an independent estimate of the residual variance σe2 if the corre-
sponding null hypothesis is true.
The interaction test for A:B is based on the F-statistic
$$F = \frac{MS_{A:B}}{MS_{res}} \sim F_{(a-1)(b-1),\,ab(n-1)}\;,$$
with expected mean squares

$$E(MS_{A:B}) = \sigma_e^2 + \frac{n}{(a-1)(b-1)}\sum_{i=1}^a\sum_{j=1}^b (\alpha\beta)_{ij}^2\;.$$
Thus, the corresponding null hypothesis is H0,A:B : (αβ)i j = 0 for all i, j. This is
equivalent to
H0,A:B : μi j = μ + αi + β j for all i, j ,
stating that each treatment group mean has a contribution from both factors, and
these contributions are independent of each other.
ANOVA Table
Using R, these calculations are done easily using aov(). We find the model specifica-
tion from the Hasse diagrams. All fixed factors are in the treatment structure and pro-
vide the terms 1+drug+diet+drug:diet which we abbreviate as drug*diet.
All random factors are in the unit structure, providing Error(mouse), which we
may drop from the specification since (Mouse) is the lowest random unit factor. The
model y~drug*diet yields the ANOVA table:
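A sketch of the corresponding call, assuming the 24 observations of Table 6.1 are in a data frame diets (an assumed name) with columns y, drug, and diet:

m2 = aov(y ~ drug * diet, data = diets)   # same as y ~ 1 + drug + diet + drug:diet
summary(m2)   # ANOVA table with rows for drug, diet, drug:diet, and residuals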
We find that the sums of squares for the drug effect are large compared to all other
contributors and Drug is highly significant. The Diet main effect does not achieve
significance at any reasonable level, and its sums of squares are tiny. However, we
find a large and significant Drug:Diet interaction and must therefore be very careful
in our interpretation of the results, as we discuss next.
The main effect of a treatment factor is the deviation between group means of its
factor levels, averaged over the levels of all other treatment factors.
In our example, the null hypothesis of the drug main effect is H0,drug : μ1· = μ2· = μ3·,
and that of the diet main effect is H0,diet : μ·1 = μ·2.
The first hypothesis asks if there is a difference in the enzyme levels between the
three drugs, ignoring the diet treatment by averaging the observations for low-fat
and high-fat diets for each drug. The second hypothesis asks if there is a difference
between enzyme levels for the two diets, averaged over the three drugs.
From the ANOVA table, we see that the main effect of the drug treatment is
highly significant, indicating that regardless of the diet, the drugs affect enzyme
levels differently. In contrast, the diet main effect is negligibly small with a large
p-value, but the raw data in Fig. 6.2A show that while the diet has no influence on the enzyme levels when averaged over all drug treatments, it has visible effects for some
drugs: the effect of D1 seems to be unaffected by the diet, while a low-fat diet yields
higher enzyme levels for D2, but lower levels for placebo.
The interaction effects quantify how different the effect of one factor is for different
levels of another factor. For our example, the large and significant interaction factor
shows that while the diet main effect is negligible, the two diets have a very different
effect depending on the drug treatment. That the diet main effect is not significant therefore does not mean that the diet effect can be neglected for each drug.
The most important rule when analyzing designs with factorial treatment structures
is therefore
Be careful when interpreting main effects in the presence of large interactions.
Four illustrative examples of interactions are shown in Fig. 6.5 for two factors A and
B with two levels each. In each panel, the two levels of A are shown on the horizontal
axis, and a response on the vertical axis. The two lines connect the same levels of
factor B between levels of A.
In the first case (Fig. 6.5A), the two lines are parallel and the difference in response
between the low to the high levels of B is the same for both levels of A, and vice
versa. The average responses $\mu_{ij}$ in each of the four treatment groups are then fully described by an additive model
$$\mu_{ij} = \mu + \alpha_i + \beta_j,$$
Fig. 6.5 Four stylized scenarios with different interactions. A Parallel lines indicate no interaction
and an additive model. B Small interaction and non-zero main effects. Factor B attenuates effect of
factor A. C Strongly pronounced interaction. Factor B on high level completely compensates effect
of factor A. D Complete effect reversal; all information on factor effects is in the interaction
with difference $\mu_{21} - \mu_{11} = \alpha_2 - \alpha_1$. Under a high-fat diet, the cell means are $\mu_{12} = \mu + \alpha_1 + \beta_2$ and $\mu_{22} = \mu + \alpha_2 + \beta_2$, and the difference $\mu_{22} - \mu_{12} = \alpha_2 - \alpha_1$ is the same. In the other cases, the lines are not parallel, and the interaction parameter
$$(\alpha\beta)_{ij} = \mu_{ij} - (\mu + \alpha_i + \beta_j)$$
quantifies exactly how much the observed cell mean $\mu_{ij}$ differs from the value predicted by the additive model $\mu + \alpha_i + \beta_j$.
For our example, the responses to placebo and drug D1 under a low-fat diet are then
$$\mu_{11} = \mu + \alpha_1 + \beta_1 + (\alpha\beta)_{11} \quad\text{and}\quad \mu_{21} = \mu + \alpha_2 + \beta_1 + (\alpha\beta)_{21},$$
and thus the difference is $\mu_{21} - \mu_{11} = (\alpha_2 - \alpha_1) + ((\alpha\beta)_{21} - (\alpha\beta)_{11})$, while for the high-fat diet, we find a difference between placebo and D1 of $\mu_{22} - \mu_{12} = (\alpha_2 - \alpha_1) + ((\alpha\beta)_{22} - (\alpha\beta)_{12})$. If at least one of the parameters $(\alpha\beta)_{ij}$ is not zero,
then the difference between D1 and placebo depends on the diet. The interaction is
the difference of differences
$$(\mu_{21} - \mu_{11}) - (\mu_{22} - \mu_{12}) = ((\alpha\beta)_{21} - (\alpha\beta)_{11}) - ((\alpha\beta)_{22} - (\alpha\beta)_{12}),$$
and is fully described by the four interaction parameters corresponding to the four
means involved.
It is then misleading to speak of a ‘drug effect’, since this effect is diet-dependent.
In our case of a quantitative interaction, we find that a placebo treatment always gives
lower enzyme levels, independent of the diet. The diet only modulates this effect,
and we can achieve a greater increase in enzyme levels for D1 over placebo if we
additionally use a low-fat diet.
The interaction becomes more pronounced the more ‘non-parallel’ the lines
become. The interaction in Fig. 6.5C is still quantitative, and a low level of B always
yields the higher response, but the modulation by A is much stronger than before.
In particular, A has no effect under a high level of B and using either a low or high
level yields the same average response, while under a low level of B, using a high
level for A yields even higher responses.
In our example, there is an additional quantitative interaction between diet and
D1/D2, with D1 consistently giving higher enzyme levels than D2 under both diets.
However, if we only considered a low-fat diet, then the small difference between the responses to D1 and D2 might make us prefer D2 if it is much cheaper or has a better safety profile, for example, while D2 is not an option under a high-fat diet.
Interactions can also be qualitative, in which case we might see an effect reversal.
Now, one level of B is no longer universally superior under all levels of A, as shown
in Fig. 6.5D for an extreme case. If A is on the low level, then higher responses are
gained if B is on the low level, while for A on its high level, the high level of B gives
higher responses.
We find this situation in our example when comparing the placebo treatment and
D2: under a low-fat diet, D2 shows higher enzyme levels, while enzyme levels are
similar or even lower for D2 than placebo under a high-fat diet. In other words, D1 is
always superior to placebo and we can universally recommend its use, but we should
only use D2 for a low-fat diet since it seems to have little to no (or even negative)
effect under a high-fat diet.
We measure the effect size of each factor using its $f^2$ value. For our generic two-factor design with factors A and B with $a$ and $b$ levels, respectively, the effect sizes are
$$f_A^2 = \frac{1}{a\sigma_e^2}\sum_{i=1}^{a}\alpha_i^2, \quad f_B^2 = \frac{1}{b\sigma_e^2}\sum_{j=1}^{b}\beta_j^2, \quad f_{A:B}^2 = \frac{1}{ab\sigma_e^2}\sum_{i=1}^{a}\sum_{j=1}^{b}(\alpha\beta)_{ij}^2,$$
which in our example are $f^2_\text{drug} = 0.43$ (huge), $f^2_\text{diet} = 0.02$ (tiny), and $f^2_\text{drug:diet} = 0.3$ (medium).
Alternatively, we can measure the effect of factor X by its variance explained $\eta_X^2 = SS_X/SS_\text{tot}$, which in our example yields $\eta^2_\text{drug} = SS_\text{drug}/SS_\text{tot} = 65\%$ for the drug main effect, $\eta^2_\text{diet} = 0.45\%$ for the diet main effect, and $\eta^2_\text{drug:diet} = 15\%$ for the interaction. This effect size measure has two caveats for multi-way ANOVAs,
however. First, the total sum of squares contains all those parts of the variation
‘explained’ by the other treatment factors. The magnitude of any $\eta_X^2$ therefore depends on the number of other treatment factors, and on the proportion of variation explained by them. The second caveat is a consequence of this: the explained variance $\eta_X^2$ does not have a simple relationship with the effect size $f_X^2$, which compares the variation due to the factor alone to the residual variance and excludes all effects due to other treatment factors.
We resolve both issues by the partial-η 2 effect size measure, which is the fraction
of variation explained by each factor over the variation that remains to be explained
after accounting for all other treatment factors:
$$\eta^2_{p,X} = \frac{SS_X}{SS_X + SS_\text{res}}.$$
Note that $\eta^2_{p,X} = \eta^2_X$ for a one-way ANOVA. The partial-$\eta^2$ are not a partition of the variation into single-factor contributions and therefore do not sum to one, but they have a direct relation to $f^2$:
$$f_X^2 = \frac{\eta^2_{p,X}}{1 - \eta^2_{p,X}}$$
for any treatment factor X. In our example, we find $\eta^2_{p,\text{drug}} = 77\%$, $\eta^2_{p,\text{diet}} = 2\%$, and $\eta^2_{p,\text{drug:diet}} = 44\%$.
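These effect sizes are readily computed from the ANOVA sums of squares; a sketch, reusing the fitted model m from above (these are sample-based values, so they can differ from effect sizes defined via the model parameters):

ss     <- summary(m)[[1]][, "Sum Sq"]   # SS for drug, diet, drug:diet, residuals
ss_res <- ss[length(ss)]
eta2   <- ss[-length(ss)] / sum(ss)                      # eta-squared
eta_p2 <- ss[-length(ss)] / (ss[-length(ss)] + ss_res)   # partial eta-squared
f2     <- eta_p2 / (1 - eta_p2)                          # corresponding f2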
In the one-way ANOVA, the omnibus F-test was of limited use for addressing our
scientific questions. With more factors in the experiment design, however, a non-
significant F-test and a small effect size would allow us to remove the corresponding
factor from the model. This is known as model reduction, and the reduced model is
often easier to analyze and interpret.
In order to arrive at an interpretable reduced model, we need to take account of
the principle of marginality for model reduction (Nelder 1994):
If we remove a factor from the model, then we also need to remove all interaction factors
involving this factor.
For a two-way ANOVA, the marginality principle implies that only the interaction
factor can be removed in the first step. We then re-estimate the resulting additive
model and might further remove one or both main effects factors if their effects
are small and non-significant. If the interaction factor cannot be removed, then the
two-way ANOVA model cannot be reduced.
Indeed, the interpretation of the model would suffer severely if we removed A but
kept A:B, and this model would then describe B as nested in A, in contrast to the
actual design which has A and B crossed. The resulting model is then of the form
$$y_{ijk} = \mu + \alpha_i + (\alpha\beta)_{ij} + e_{ijk}$$
Fig. 6.6 A Experiment structure diagram for a full two-way model with interaction. Factors A
and B are crossed. B Experiment structure resulting from removing one treatment factor but not its
interaction, violating the marginality principle. Factor B is now nested in A contrary to the actual
design
and has three sets of parameters: the grand mean, one deviation for each level of
A, and additionally different and independent deviations for the levels of B within
each level of A. We see this consequence of violating the marginality principle by
comparing the full model diagram in Fig. 6.6A to the diagram of the reduced model
in Fig. 6.6B and recalling that two nested factors correspond to a main effect and an
interaction effect since A/B=A+A:B.
In our example, we find that Diet has a non-significant p-value and small effect
size, which suggests that it does not have an appreciable main effect when averaged
over drugs. However, Diet is involved in the significant interaction factor Drug:Diet
with large effect size, and the marginality principle prohibits removing Diet from the
model. Indeed, we have already seen that the effects of the three drugs are appreciably
modified by the diet, so we cannot ignore the diet in our analysis.
Recall that the estimated marginal means give the average response in a treatment
group calculated from a linear model. To solidify our understanding of estimated
marginal means, we briefly consider their calculation and interpretation for our
example. This also further exemplifies the interpretation of interactions and the con-
sequences of model reduction.
Our example has six possible cell means, one for each combination of a drug
with a diet. We consider five linear models for this factorial design with two crossed
treatment factors: the full model with interaction, the additive model, two models with
a single treatment factor, and a model with no treatment factors. Each of these models
Table 6.2 Empirical means (Data) and predicted means for each cell for full model with interaction
(Full), additive model without interactions (Additive), one-way model containing only main effect
for drug (Drug only), respectively diet (Diet only), and trivial model with no treatment factors
(Average)
Drug | Diet | Data | Full | Additive | Drug only | Diet only | Average
Placebo | Low fat | 9.18 | 9.18 | 10.07 | 9.92 | 11.99 | 11.84
D1 | Low fat | 14.17 | 14.17 | 14.52 | 14.37 | 11.99 | 11.84
D2 | Low fat | 12.63 | 12.63 | 11.38 | 11.23 | 11.99 | 11.84
Placebo | High fat | 10.65 | 10.65 | 9.76 | 9.92 | 11.68 | 11.84
D1 | High fat | 14.56 | 14.56 | 14.21 | 14.37 | 11.68 | 11.84
D2 | High fat | 9.83 | 9.83 | 11.07 | 11.23 | 11.68 | 11.84
gives rise to different estimated marginal means for the six treatment combinations,
and their predicted cell means are shown in Table 6.2.
The Full model is $\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$ and has specification
y~1+drug+diet+drug:diet. It provides six different and independent cell
means, which are identical to the empirical averages (Data) for each cell.
The Additive model is $\mu_{ij} = \mu + \alpha_i + \beta_j$ and is specified by y~1+drug+diet.
It does not have an interaction term and the two treatment effects are hence additive,
such that the difference between placebo and D1, for example, is $10.07 - 14.52 = -4.45$ for a low-fat diet and $9.76 - 14.21 = -4.45$ for a high-fat diet. In other words, differences between drugs are independent of the diet and vice versa.
The two models Drug only and Diet only are $\mu_{ij} = \mu + \alpha_i$ (specified as y~1+drug) and $\mu_{ij} = \mu + \beta_j$ (specified as y~1+diet) and completely ignore
the respective other treatment. The first gives identical estimated marginal means for
all conditions under the same drug and thus predicts three different cell means, while
the second gives identical estimated marginal means for all conditions under the
same diet, predicting two different cell means. Finally, the Average model $\mu_{ij} = \mu$
is specified as y~1 and describes the data by a single common mean, resulting in
identical predictions for the six cell means.
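A sketch of how these predicted cell means can be computed in R, with the same assumed data frame:

models <- list(
  Full     = aov(y ~ drug * diet, data = data),
  Additive = aov(y ~ drug + diet, data = data),
  DrugOnly = aov(y ~ drug, data = data),
  DietOnly = aov(y ~ diet, data = data),
  Average  = aov(y ~ 1, data = data)
)
# predicted mean for each drug-diet combination under each model
grid <- expand.grid(drug = levels(data$drug), diet = levels(data$diet))
cbind(grid, sapply(models, predict, newdata = grid))   # one column per model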
Hence, the model used for describing the experimental data determines the esti-
mated marginal means, and therefore the contrast estimates. In particular, nonsense
will arise if we use an additive model such as drug+diet for defining the estimated
marginal means, and then try to estimate an interaction contrast (cf. Sect. 6.6.2) using
these means. The contrast is necessarily zero, and no standard error or t-statistic can
be calculated.
In Sect. 5.4, we looked into a real-life example and examined the effect of two
anticancer drugs (IFF and CP) and an inhibitor (RTV) on tumor growth. We analyzed
the data using a one-way ANOVA based on six treatment groups and concentrated
on relevant contrasts. There is, however, more structure in the treatment design,
since we are actually looking at combinations of an anticancer drug and an inhibitor.
To account for this structure, we reformulate the analysis to a more natural two-
way ANOVA with two treatment factors: Drug with the three levels none, CP, and
IFF, and Inhibitor with two levels none and RTV. Each drug is used with and
without an inhibitor, leading to a fully-crossed factorial treatment design with the
two factors and their interaction Drug:Inhibitor. We already used the two conditions
(none,none) and (none,RTV ) as our control conditions in the contrast analysis. The
model specification is Diameter~Drug*Inhibitor and leads to the ANOVA
table
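A sketch of the corresponding call, assuming the data are in a data frame tumor with columns Diameter, Drug, and Inhibitor:

m_tumor <- aov(Diameter ~ Drug * Inhibitor, data = tumor)
summary(m_tumor)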
We note that the previous sum of squares for Condition was 815204.90 in our
one-way analysis and now exactly partitions into the contributions of our three treat-
ment factors. The previous effect size of 93% is also partitioned into $\eta^2_\text{Drug} = 56\%$, $\eta^2_\text{Inhibitor} = 27\%$, and $\eta^2_\text{Drug:Inhibitor} = 9\%$. In other words, the differences between
conditions are largely caused by differences between the anticancer drugs, but the
presence of the inhibitor also explains more than one-quarter of the variation. The
interaction is highly significant and cannot be ignored, but its effect is much smaller
compared to the two main effects. All of this is of course in complete agreement with
our previous contrast analysis.
We briefly discuss the general advantages of factorial treatment designs and extend
our discussion to more than two treatment factors.
Slightly less obvious is the fact that factorial designs also require smaller sample sizes than one-variable-at-a-time (OVAT) designs, even if no interactions are present between the treatment factors.
For example, we can estimate the main effects of two factors A and B with two levels
each using two OVAT designs with 2n experimental units per experiment. Each main
effect is then estimated with 2n − 2 residual degrees of freedom. We can use the same
experimental resources for a 2 × 2-factorial design with 4n experimental units. If the
interaction A:B is large, then only this factorial design will allow correct inferences.
If the interaction is negligible, then we have two estimates of the A main effect: first
for B on the low level by contrasting the n observations for A on a low level with
those n for A on a high level, and second for the n + n observations for B on a high
level. The A main effect estimate is then the average of these two, based on all 4n
observations; this is sometimes called hidden replication. The inference on A is also
more general, since we observe the A effect under two conditions for B. The same
argument holds for estimating the B main effect. Since we need to estimate one grand
mean and one independent group mean each for A and B, the residual degrees of
freedom are 4n − 3.
Removing Interactions
Fig. 6.7 A Experiment structure for a × b two-way factorial with single observation per cell; no
residual degrees of freedom left. B Same experiment based on additive model removes interaction
factor and frees degrees of freedom for estimating residual variance
We can then remove the interaction factor from the full model, which leads to new residuals $\tilde e_{ij} = (\alpha\beta)_{ij} + e_{ij}$ and ‘frees up’ $(a-1)(b-1)$ degrees of freedom for estimating
their residual variance. This strategy corresponds to merging the A:B interaction
factor with the experimental unit factor (E), resulting in the experimental structure
in Fig. 6.7B.
An alternative that retains a test for non-additivity is Tukey's one-degree-of-freedom model (Tukey 1949b),
$$y_{ij} = \mu + \alpha_i + \beta_j + \tau \cdot \alpha_i \cdot \beta_j + e_{ij},$$
an additive model based solely on the main effect parameters, augmented by a special interaction term that introduces a single additional parameter $\tau$ and requires only one rather than $(a-1)(b-1)$ degrees of freedom. If the hypothesis $H_0: \tau = 0$ of no
interaction is true, then the test statistic
$$F = \frac{SS_\tau/1}{MS_\text{res}}$$
follows an $F_{1,\,ab-(a+b)}$-distribution.
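This test is commonly carried out by regressing on the squared fitted values of the additive model; a minimal sketch in R, assuming a data frame d with response y and two factors A and B:

m_add <- aov(y ~ A + B, data = d)           # additive model
d$aug <- fitted(m_add)^2                    # single regressor carrying the tau term
m_tukey <- aov(y ~ A + B + aug, data = d)   # augmented model
summary(m_tukey)                            # 1-df F-test on 'aug' tests H0: tau = 0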
$$s_0 = 1.5 \cdot \text{median}_j |c_j|,$$
where the $c_j$ are the estimated effects; the pseudo standard error is then $PSE = 1.5 \cdot \text{median}\{|c_j| : |c_j| < 2.5\, s_0\}$.
The margin of error (ME) (an upper limit of a confidence interval) is then
$$ME = t_{0.975,\,d} \cdot PSE,$$
and Lenth proposes to use d = m/3 as the degrees of freedom, where m is the
number of effects in the model. This limit is corrected for multiple comparisons by
adjusting the confidence level from 0.975 to $\gamma = (1 + 0.95^{1/m})/2$. The resulting simultaneous margin of error (SME) is then
$$SME = t_{\gamma,\,d} \cdot PSE.$$
Fig. 6.8 Analysis of active effects in unreplicated 2⁴-factorial with Lenth's method
Factors with effects exceeding SME in either direction are considered active, those
between the ME limits are inactive, and those between ME and SME have unclear
status. We therefore choose those factors that exceed SME as our safe choice, and
might include those exceeding ME as well for subsequent experimentation.
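A compact sketch of these quantities in R, for a vector c of estimated effects (the function name is illustrative):

lenth_limits <- function(c) {
  s0  <- 1.5 * median(abs(c))                      # initial scale estimate
  pse <- 1.5 * median(abs(c)[abs(c) < 2.5 * s0])   # pseudo standard error
  m   <- length(c)
  d   <- m / 3                                     # Lenth's degrees of freedom
  gam <- (1 + 0.95^(1 / m)) / 2                    # adjusted confidence level
  c(ME = qt(0.975, d) * pse, SME = qt(gam, d) * pse)
}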
In his paper, Lenth discusses a 2⁴-factorial experiment, where the effect of acid strength (S), time (t), amount of acid (A), and temperature (T) on the yield of isatin
is studied (Davies 1954). The experiment design and the resulting yield are shown
in Table 6.3.
The results are shown in Fig. 6.8. No factor seems to be active, with temperature,
acid strength, and the interaction of temperature and time coming closest. Note that
the marginality principle requires that if we keep the temperature-by-time interaction,
we must also keep the two main effects temperature and time in the model, regardless
of their size or significance.
We can construct factorial designs for any number of factors with any number of
levels, but the number of treatment combinations increases exponentially with the
number of factors, and the more levels we consider per factor, the more combinations
we get. For example, introducing the vendor as a third factor with two levels into
our drug–diet example already yields 3 · 2 · 2 = 12 combinations, and introducing a
fourth three-level factor increases this number further to 12 · 3 = 36 combinations.
In addition, both extensions yield higher-order interactions between three and four
factors.
For three factors A, B, and C, the full model $y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} + e_{ijkl}$ has three main effects, three two-way interactions, and one three-way
interaction. A large and significant three-way interaction $(\alpha\beta\gamma)_{ijk}$ arises, for exam-
ple, if the drug–diet interaction that we observed for D2 compared to placebo under
both diets only occurs for the kit of vendor A, but not for the kit of vendor B (or with
different magnitude). In other words, the two-way interaction itself depends on the
level of the third factor, and we are now dealing with a difference of a ‘difference of
differences’.
In this particular example, we might consider reducing the model by removing all
interactions involving Vendor, as there seems no plausible reason why the kit used
for preparing the samples should interact with drug or diet. The resulting diagram is
Fig. 6.9B; this model reduction adheres to the marginality principle.
A further example of a three-way analysis of variance inspired by an actual exper-
iment concerns the inflammation reaction in mice and potential drug targets to sup-
press inflammation. It was already known that a particular pathway gets activated to
trigger an inflammation response, and the receptor of that pathway is a known drug
target. In addition, there was preliminary evidence that inflammation sometimes trig-
gers even if the pathway is knocked out. A 2 × 2 × 2 = 2³-factorial experiment was
conducted with three treatment factors: (i) genotype: wildtype mice/knock-out mice
without the known receptor; (ii) drug: placebo treatment/known drug; (iii) waiting
time: 5 h/8 h between inducing inflammation and administering the drug or placebo.
Several mice were used for each of the 8 combinations and their protein levels mea-
sured using mass spectrometry. The main focus of inference is the genotype-by-drug
interaction, which would indicate if an alternate activation exists and how strong it is.
Fig. 6.9 Three-way factorial. A Full experiment structure with three two-way and one three-way
interaction. B Reduced experiment structure assuming negligible interactions with vendor
The three-way interaction then provides additional information about the difference
in activation times between the known and a potential unknown pathway. We discuss
this example in more detail in Sect. 9.8.6.
One strategy for interpreting a large higher-order interaction is to stratify the analysis: we fix one factor and analyze the remaining factors separately for each of its levels. For our example, with a significant and large three-way interaction,
we could stratify by vendor and produce two separate models, one for vendor A and
one for vendor B, studying the effect of drug and diet independently for both kits.
The advantage of this approach is the easier interpretation, but we are also effectively
halving the experiment size, such that information on drug and diet gained from one
vendor does not transfer easily into information for the other vendor.
With more than three factors, even higher-order interactions can be considered.
Statistical folklore and practical experience suggest that interactions often become
less pronounced the higher their order, a heuristic known as effect sparsity. Thus, if
there are no substantial two-way interactions, we do not expect relevant interactions
of order three or higher.
If we have several replicate observations per treatment, then we can estimate the
full model with all interactions and use the marginality principle to reduce the model
by removing higher-order interactions that are small and non-significant. If only one
observation per treatment group is available, a reasonable strategy is to remove the
highest-order interaction term from the model in order to free its degrees of freedom.
These are then used for estimating the residual variance, which in turn allows us
to find confidence intervals and perform hypothesis tests on the remaining factors.
The highest-order interaction typically has many degrees of freedom, and we can
continue with a model reduction based on F-tests with sufficient power.
Two more strategies are commonly used and are very powerful when applica-
ble. When considering many factors and their low-order interactions, we can use
fractional factorial designs to deliberately confound some effects and reduce the
experiment size. We discuss these designs in detail for the case that each factor has
two levels in Chap. 9. The second strategy applies when we are starting out with
our experiments, and are unsure which treatment factors to consider for our main
experiments. We can then conduct preliminary screening experiments to simultane-
ously investigate a large number of potential treatment factors, and identify those that
have sufficient effect on the response. These experiments concentrate on the main
effects only and are based on the assumption that only a small fraction of the factors
considered will be relevant. This allows experimental designs of small size, and we
discuss some options in Sect. 9.7.
Serious problems can arise in the calculation and interpretation of the analysis of
variance for factorial designs with unbalanced data. We briefly review the main
problems and some remedies. The defining feature of unbalanced data is succinctly
stated as
The essence of unbalanced data is that measures of effects of one factor depend upon
other factors in the model. (Littell 2002)
Fig. 6.10 Experiment layout for three drugs under two diets. A Unbalanced design with all cells
filled but some cells having more data than others. B, C Connected some-cells-empty data cause
problems in non-additive models. D Unconnected some-cells-empty data prohibit estimation of any
effects
We first consider the case of all-cells-filled data for an $a \times b$ factorial with $a$ rows and $b$ columns. The observations are $y_{ijk}$ with $i = 1\dots a$, $j = 1\dots b$, and $k = 1\dots n_{ij}$, and we can estimate the mean $\mu_{ij}$ of row $i$ and column $j$ by taking the average of the corresponding observations $y_{ijk}$ of cell $(i, j)$. The unweighted row means are the averages of the cell means over all columns:
$$\mu_{i\cdot} = \frac{1}{b}\sum_{j=1}^{b}\mu_{ij},$$
and are the population marginal means estimated by the estimated marginal means
$$\hat\mu_{i\cdot} = \tilde y_{i\cdot\cdot} = \frac{1}{b}\sum_{j=1}^{b}\bar y_{ij\cdot}.$$
The naive weighted row means, on the other hand, sum over all observations in each
row and divide by the total number of observations in that row; they are
$$\mu_{i\cdot}^w = \sum_{j=1}^{b}\frac{n_{ij}}{n_{i\cdot}}\,\mu_{ij} \quad\text{estimated as}\quad \hat\mu_{i\cdot}^w = \sum_{j=1}^{b}\frac{n_{ij}}{n_{i\cdot}}\,\bar y_{ij\cdot}.$$
The unweighted and weighted row means coincide for balanced data with n i j ≡ n,
but they are usually different for unbalanced data.
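A quick sketch of both quantities in R, assuming a data frame d with response y and factors diet (rows) and drug (columns):

cell_means <- tapply(d$y, list(d$diet, d$drug), mean)   # cell averages
rowMeans(cell_means)        # unweighted row means: average of cell means
tapply(d$y, d$diet, mean)   # weighted row means: average of raw observations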
We are interested in the row main effect and in testing the hypothesis
$$H_0: \mu_{1\cdot} = \cdots = \mu_{a\cdot},$$
but the weighted row means lead to the different hypothesis
$$H_0^w: \mu^w_{1\cdot} = \cdots = \mu^w_{a\cdot}.$$
This was not a problem for balanced data, where $\mu_{i\cdot} = \mu^w_{i\cdot}$, but the hypothesis based on weighted means is unlikely to be of any direct interest for unbalanced data. The same problems arise for column effects.
We consider an unbalanced version of our drug–diet example as an illustration:
Each entry in this table shows the sum yi j· over the responses in the corresponding
cell, the number of observations n i j for that cell in parentheses, and the cell average
ȳi j· = yi j· /n i j . Values in the table margins are the row and column totals and averages
and the overall total and average.
For these data, we find an unweighted mean for the first row (the low-fat diet) of $\hat\mu_{1\cdot} = (9 + 14 + 13.5)/3 = 12.2$, while the corresponding weighted row mean is $\hat\mu^w_{1\cdot} = (9 + 28 + 27)/(1 + 2 + 2) = 12.8$.
The main problem for a traditional analysis of variance is that for unbalanced
data, the sums of squares are not orthogonal and do not decompose uniquely, and
there is some part of the overall variation that can be attributed to either of the two
factors. For example, the model y~diet*drug would decompose the total sum of
squares into
$$SS_\text{tot} = SS_\text{Diet adj. mean} + SS_\text{Drug adj. Diet} + SS_\text{Diet:Drug} + SS_\text{res},$$
where $SS_\text{Diet adj. mean}$ is the sum of squares attributed to Diet after the grand mean has been accounted for, and $SS_\text{Drug adj. Diet}$ is the remaining sum of squares for Drug, once the diet has been accounted for. The traditional ANOVA based on the model y~diet+drug then tests the weighted diet main effect $H_0^w: \mu^w_{1\cdot} = \mu^w_{2\cdot}$.
On the other hand, the model y~drug*diet decomposes the total sum of squares
into
$$SS_\text{tot} = SS_\text{Drug adj. mean} + SS_\text{Diet adj. Drug} + SS_\text{Diet:Drug} + SS_\text{res},$$
and produces a larger sum of squares term for Drug and a smaller term for Diet.
In this model, the latter only accounts for the variation after the drug has been
accounted for. Even worse, the same analysis of variance based on the additive
model y~drug+diet tests yet another and even less interesting (or comprehensi-
ble) hypothesis, namely
b
b
H0w : n i j · μi j = n i j · μi·w for all i .
j=1 j=1
For our example, the full model y~drug*diet yields the ANOVA table
Note that it is only the attribution of variation to the two treatment factors that varies
between the two models, while the total sum of squares and the interaction and
residual sums of squares are identical. In particular, the sums of squares always
add to the total sum of squares. This analysis is known as a type-I sum of squares
ANOVA, where results depend on the order of factors in the model specification,
such that effects of factors introduced later in the model are adjusted for effects of
all factors introduced earlier.
In practice, unbalanced data are often analyzed using a type-II sum of squares
ANOVA based on an additive model without interactions. In this analysis, the treat-
ment sum of squares for any factor is calculated after adjusting for all other factors,
and the sum of squares do no longer add to the total sum of squares, as part of the
variation is accounted for in each adjustment and not attributed to the corresponding
factor. The advantage of this analysis is that for the additive model, each F-test cor-
responds to the respective interesting hypothesis based on the unweighted means,
which is independent of the specific number of observations in each cell. In R, we can
use the function Anova() from package car with option type=2 for this analysis:
first, we fit the additive model as m = aov(y~diet+drug, data=data) and
then call Anova(m, type=2) on the fitted model.
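As a sketch, with the same assumed data frame:

library(car)
m <- aov(y ~ diet + drug, data = data)   # additive model without interaction
Anova(m, type = 2)                       # type-II sums of squares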
For our example, this yields the ANOVA table
The treatment sums of squares in this table are both adjusted for the respective other
factor, and correspond to the second rows in the previous two tables. However, we
know that the interaction is important for the interpretation of the data, and the results
from this additive model are still misleading.
A similar idea is sometimes used for non-additive models with interactions, known
as type-III sum of squares. Here, a factor’s sum of squares term is calculated after
adjusting for all other main effects and all interactions involving the factor. These
can be calculated in R using Anova() with option type=3. Their usefulness is
heavily contested by some statisticians, though, since the hypotheses now relate to
main effects in the presence of relevant interactions, and are therefore difficult to
interpret (some would say useless). This sentiment is expressed in the following
scoffing remark:
The non-problem that Type III sums of squares tries to solve only arises because it is so
simple to do silly things with orthogonal designs. In that case main effect sums of squares
are uniquely defined and order independent. Where there is any failure of orthogonality,
though, it becomes clear that in testing hypotheses, as with everything else in statistics,
it is your responsibility to know clearly what you mean and that the software is faithfully
enacting your intentions. (Venables 2000)
Whichever sums of squares we choose, we eventually have to make sense of the data. This is most easily done by using estimated marginal means and
contrasts between them.
Standard analyses may or may not apply if some cells are empty, depending on the
particular pattern of empty cells. Several situations of some-cells-empty data are
shown in Fig. 6.10.
In Fig. 6.10B, only cell (1, 3) is missing and both row and column main effects
can be estimated if we assume an additive model. This model allows the prediction
of the average response in the missing cell, since the difference $\mu_{22} - \mu_{23}$ is then identical to $\mu_{12} - \mu_{13}$, and we can therefore estimate the missing cell mean by $\mu_{13} = \mu_{12} - (\mu_{22} - \mu_{23})$; this is an estimated marginal mean for $\mu_{13}$. We can also estimate
an interaction involving the first two columns and rows using our previous methods
when ignoring the third level of the column factor.
In Fig. 6.10C, several cells are missing, but the remaining cells are still connected:
by rearranging rows and columns, we can draw a connection from any filled cell to any
other filled cell using only vertical and horizontal lines. These data thus provide the
information needed to estimate an additive model, and we use the resulting model to
estimate the marginal means of the empty cells. In contrast to scenario (B), however,
we cannot find a filled 2 × 2 sub-table and are unable to estimate any interaction.
In Fig. 6.10D, the non-empty cells are not connected, and we cannot predict the
empty cell means based on observed data. Neither main nor interaction effects can
be estimated for the non-connected part of the table.
Each scenario poses severe problems when an additive model is not adequate and
interactions have to be considered. Estimated marginal means can then not be used
to compensate empty cells, and the analysis of models with interactions based on
some-cells-empty data is necessarily restricted to subsets of rows and columns of the
data for which all cells are filled. Similarly, contrast estimates can only be formed if
the contrast does not involve empty cells. Such analyses are therefore highly specific
to the situation at hand and defy simple recipes.
Returning to balanced data, linear contrasts allow a more targeted analysis of treat-
ment effects just like in the one-way ANOVA. As before, we define contrasts on
the treatment group means μi j , and we estimate these using the estimated marginal
means based on an appropriate model.
In our example, the F-test for the interaction factor indicates that the diet affects the
drugs differently, and we might use the following contrasts to compare each drug
between the diets:
$$\Psi(w_1) = \mu_{11} - \mu_{12}, \quad \Psi(w_2) = \mu_{21} - \mu_{22}, \quad\text{and}\quad \Psi(w_3) = \mu_{31} - \mu_{32},$$
where w1 = (1, 0, 0, −1, 0, 0) is the weight vector for comparing the placebo effect
between low fat and high fat, and w2 = (0, 1, 0, 0, −1, 0) and w3 = (0, 0, 1, 0,
0, −1) are the corresponding weight vectors for D1 and D2.
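These contrasts are easily estimated from the estimated marginal means; a sketch using the emmeans package, where m is the full model from before and the weight order assumes the factor levels placebo, D1, D2 and low fat before high fat:

library(emmeans)
em <- emmeans(m, ~ drug * diet)   # estimated marginal means for all six cells
confint(contrast(em, method = list(
  "Placebo: low - high" = c(1, 0, 0, -1, 0, 0),
  "D1: low - high"      = c(0, 1, 0, 0, -1, 0),
  "D2: low - high"      = c(0, 0, 1, 0, 0, -1)
)))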
The results in Table 6.4 give quantitative confirmation of our earlier suspicion:
both placebo and D1 show lower responses under the low-fat diet compared to high-
fat diet, but the resulting difference is explainable by sampling variation and cannot
be distinguished from zero with the given data. In contrast, the response for D2 is
substantially and significantly higher under a low-fat diet compared to a high-fat diet.
Confidence intervals are fairly wide due to the small sample size, and the experiment
is likely underpowered.
Each contrast’s standard error is based on 18 degrees of freedom, and the residual
variance estimate is based on all data. The three contrasts are orthogonal, but there
are two more orthogonal contrasts between the six cell means in Table 6.5, which we
consider next.
Similar to our previous considerations for the tumor diameter example in Sect. 5.4,
we have to ask whether main effect contrasts for the two active drugs D1 and D2
provide the correct comparisons. If a substantial difference in enzyme levels exists
between the diets in the placebo group, then we might rather want a contrast that
compares the difference in enzyme levels between diets for D1 (respectively D2) to
the corresponding difference in the placebo group. This contrast is then a difference
of differences and thus an interaction contrast.
Table 6.4 Contrast estimates and 95%-confidence intervals for comparing individual drugs under low- and high-fat diets
Contrast | Estimate | se | df | LCL | UCL
Placebo low – Placebo high | −1.48 | 0.84 | 18 | −3.24 | 0.29
D1 low – D1 high | −0.39 | 0.84 | 18 | −2.15 | 1.37
D2 low – D2 high | 2.80 | 0.84 | 18 | 1.04 | 4.57
Table 6.5 Estimated cell means for three drugs and two diets
Drug | Diet | Mean | Estimate | se | df | LCL | UCL
Placebo | Low fat | μ11 | 9.18 | 0.59 | 18 | 7.93 | 10.42
D1 | Low fat | μ21 | 14.17 | 0.59 | 18 | 12.93 | 15.42
D2 | Low fat | μ31 | 12.63 | 0.59 | 18 | 11.38 | 13.87
Placebo | High fat | μ12 | 10.65 | 0.59 | 18 | 9.41 | 11.90
D1 | High fat | μ22 | 14.56 | 0.59 | 18 | 13.31 | 15.81
D2 | High fat | μ32 | 9.83 | 0.59 | 18 | 8.58 | 11.07
For D1, we compare the difference in enzyme levels between the two diets under D1 to the difference under placebo with the contrast
$$(\mu_{21} - \mu_{22}) - (\mu_{11} - \mu_{12}).$$
This contrast is equivalent to first ‘adjusting’ each enzyme level under D1 by that of the placebo control group, and then determining the difference between the two adjusted values. The corresponding contrast is
$$(\mu_{21} - \mu_{11}) - (\mu_{22} - \mu_{12}),$$
which is identical after rearranging terms; the contrast for D2 is defined analogously.
The estimates and 95%-confidence intervals are shown in Table 6.6. Not unexpectedly, we find that the change in response for D1 is larger than the change in the placebo
group, but not significantly so. We already know that D1 enzyme levels are higher for
both diets from the main effect contrasts, so this new result means that the difference
between placebo and D1 is the same for both diets. In other words, the response to
D1 changes from low fat to high fat by the same amount as does the response to
placebo. The two lines in Fig. 6.2B should then be roughly parallel (not considering
sample variation), as indeed they are.
In contrast, the change for D2 is substantially and significantly larger than that
for the placebo group. This is also in accordance with Fig. 6.2B: while the response
to placebo is increasing slightly from low fat to high fat, the response to D2 does the
opposite and decreases substantially from low fat to high fat. This is reflected by the
significant large positive contrast, which indicates that the change in D2 between
the two diets follows another pattern than the change in placebo, and goes in the
opposite direction (placebo increases while D2 decreases).
These contrasts belong to the Drug:Diet interaction factor, and from the degrees
of freedom in the Hasse diagram we know that two orthogonal contrasts partition its
sum of squares into individual contributions. However, our two interaction contrasts
are not orthogonal and therefore do not partition the interaction sum of squares.
Two alternative interaction contrasts that are orthogonal use weight vectors w7 =
(0, −1, 1, 0, 1, −1) and w8 = (1, −1/2, 1/2, −1, −1/2, +1/2), respectively. The
first contrasts the effects of the two active drugs under the two diets, while the
second compares the placebo treatment to the average of the two active drugs.
The estimates for these two contrasts are shown in Table 6.7. They show that the
difference in enzyme levels between the two diets is much larger for D1 than for D2,
and much lower for placebo compared to the average of the two drugs. How far the
latter contrast is biologically meaningful is of course another question, and our first
set of non-orthogonal contrasts is likely more meaningful and easier to interpret.
Together with the three contrasts $\Psi(w_1)$, $\Psi(w_2)$, and $\Psi(w_3)$ defined in Sect. 6.6.1,
which individually contrast each drug between the two diets, these two orthogonal
interaction contrasts form a set of five orthogonal contrasts that fully exhaust the
information in the data. In principle, the three treatment F-tests of the two main drug and diet effects and the drug-by-diet interaction can then be reconstituted from the sums of squares of these five contrasts.
The power analysis and determination of sample sizes for omnibus F-tests in a multi-
way ANOVA follow the same principles as for a one-way ANOVA and are based
on the noncentral F-distribution. Similarly, the power analysis for linear contrasts
is identical to the one-way ANOVA case. The ideas of portable power can also be
applied, so we give only some additional remarks.
To determine the power of an omnibus F-test for a main effect of A, we need the
significance level $\alpha$, the effect size (such as $f_A^2$ or $\eta^2_{p,A}$), and the residual variance $\sigma_e^2$.
The sample size n A is now the number of samples per level of A, and is distributed
evenly over the levels of all other factors crossed with A. For our generic two-
way design with factors A and B with a and b levels, respectively, the number of
samples per treatment combination is then n = n A /b. The degrees of freedom for the
denominator are dfres = ab(n − 1) for the full model with interaction, and dfres =
nab − a − b + 1 for the additive model; we directly find them from the experiment
diagram.
For our drug–diet example, we consider $n_B = 12$ mice per diet and thus $n = 12/3 = 4$ mice per drug–diet combination, a significance level of $\alpha = 5\%$, and use our previous estimate for the residual variance $\hat\sigma_e^2 = 1.5$. These parameters provide a power of $1 - \beta = 64\%$ for detecting a minimal effect size of $d_0 = 1$ or, equivalently, $f_0^2 = d_0^2/4 = 1/4$.
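A sketch of this power calculation via the noncentral F-distribution, with the values just given:

a <- 3; b <- 2; n <- 4; f2 <- 1/4       # design and minimal effect size
lambda <- n * a * b * f2                # noncentrality parameter
df1 <- b - 1; df2 <- a * b * (n - 1)    # degrees of freedom for the diet test
1 - pf(qf(0.95, df1, df2), df1, df2, ncp = lambda)   # power, approx. 0.64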
As a second example, we consider a ‘medium’ standardized main drug effect of $f_0 = 0.25$ and a desired power of 80% at a 5% significance level. This requires a sample size of $n_A = 53$ mice per drug level, about $n = 27$ mice per drug–diet treatment combination. The standardized difference between two drug averages is then $d_0 = \sqrt{f_0^2 \cdot 6} = 0.61$, corresponding to a raw difference in enzyme levels of $\delta_0 = d_0 \cdot \sigma_e = 0.75$ for the assumed residual variance.
Extending these ideas to the main effects of higher-order factorial designs is
straightforward and only requires that we correctly calculate the residual degrees of
freedom to take account of the other factors in the design. The resulting sample size
is then again per level of the factor of interest, and has to be divided by the product of
the number of levels of all other factors to get to the sample size per treatment. For
example, with a three-way design with a = b = 4 and c = 5 factor levels, a resulting
sample size of n A = 60 for an A main effect means that we need n = n A /bc = 3
samples per treatment combination.
6.7.2 Interactions
The denominator degrees of freedom for an F-test of an interaction factor are still the
residual degrees of freedom, and we find the correct numerator degrees of freedom
from the Hasse diagram. For our generic two-way example, these are ab(n − 1) and
(a − 1)(b − 1), respectively, and the resulting sample size corresponds to n directly.
For higher-order factorial designs, we again have to divide the resulting sample size
by the product of levels of the factors not involved in the interaction.
There are several options for defining a minimal effect size for an interaction: (i) specify the interaction parameters $(\alpha\beta)_{ij}$ directly, (ii) define the expected responses $\mu_{ij}$ for each combination of levels of A and B and work out the parameters from there, and (iii) use information about the main effects from previous single-factor studies and define a level of attenuation of these effects due to the second factor.
We look into the third option more closely and denote the two main effect sizes of previous single-factor studies by $f^2_{A,\text{single}}$ and $f^2_{B,\text{single}}$; note that the main effects will likely change in a two-way design. The same situation occurs if we have reasonable guesses for the two isolated main effect sizes. We consider how much attenuation or increase in the difference in response we expect for A, say, when looking at the low level of B compared to its high level. The effect size of the interaction is then about the same size as the single main effects if we expect a complete effect reversal, and we can use $f^2_{A,\text{single}}$ as a reasonable expected effect size for the interaction (Fig. 6.5D).
If we expect full attenuation of the effect of one factor, such as in Fig. 6.5C, then the
interaction effect size is about one-half the main effect size. With lower attenuation,
the interaction effect decreases accordingly. From these considerations, we find a
crude guess for the required sample size for detecting an interaction effect: the total
sample size is roughly the same as for the original main effect for a complete effect
reversal. If we needed n samples for detecting a certain minimal A main effect with
given power, then we need again n samples for each level of A, which means n/b
samples per treatment. If we expect full attenuation, the total required sample size
doubles, and for a 50% attenuation, we are already at four times the sample size that
we would need to detect the single-factor main effect. The arguments become much
more involved for interactions of factors with more than two levels.
We can alternatively forsake a power analysis for the interaction omnibus F-test,
and instead specify (planned) interaction contrasts. The problem of power and sample
size is then shifted to corresponding analyses for contrasts, whose minimal effect
sizes are often easier to define.
6.7.3 Contrasts
The power analysis for contrasts is based on the same methods as for the one-way
design discussed in Sect. 5.2.8. For a two-way design with a and b levels, we can find
ab − 1 orthogonal contrasts, a − 1 for the first main effect, b − 1 for the second, and
(a − 1)(b − 1) to partition the interaction. A sensible strategy is to define the planned
contrasts for the experiment, use power analysis to determine the required sample
size for each contrast, and then use the largest resulting sample size for implementing
the experiment.
If the contrasts are not orthogonal, we should consider adjustments for multiple
testing. Since we do not have the resulting p-values yet, the single-step Bonferroni
correction with significance levels α/q for a set of q contrasts is the only real choice
for a set of general-purpose contrasts.
We note that the noncentrality parameter is again the product of effect size and total size of the experiment ($\lambda = n \cdot a \cdot b \cdot f^2$ for a two-factor ANOVA), and we can use the portable power formula $n = \phi^2/f^2$ to determine the number of experimental units per level of the factor considered.
For our drug–diet example with three drugs and two diets, the necessary sample size for detecting a medium drug main effect of at least $f_\text{drug} = 0.25$ with 80% power and significance level $\alpha = 5\%$ is $n_\text{drug} = \phi^2/0.25^2 = 48$ mice per drug (using $\phi^2 = 3$). This translates to 24 mice per treatment, in good agreement with our previous exact calculations in Sect. 6.7.1.
A useful shortcut exists for the most frequent type of interaction contrast, which uses two levels each of $m$ treatment factors, with weights $w_i = \pm 1$ for two chosen levels of each factor, and zero otherwise. The weight vector of such a contrast has $2^m$ non-zero entries, hence the estimator variance is $\sigma_e^2/n \cdot 2^m$. For detecting a minimal contrast size of $\Psi_0$ or a minimal distance $\delta_0$ between any two cell means in the contrast, the required sample size is
$$n = \frac{\phi^2}{\Psi_0^2/\sigma_e^2} \cdot 2^{m+1} \quad\text{respectively}\quad n = \frac{\phi^2}{\delta_0^2/\sigma_e^2} \cdot \frac{1}{2^{m-3}}.$$
The sample size n is the number of samples for each combination of the treatment
factors involved in the interaction.
Notes
The term factorial design was introduced by Fisher in the first edition of his book
Design of Experiments (Fisher 1971) in 1935. The advantages of this design are
universally recognized and re-iterated regularly for specific fields such as animal
experiments (Shaw and Festing 2002).
Interpretation of interactions is a specific problem in factorial designs, and depends
on the type of factors involved in an interaction (de Gonzalez and Cox 2007). Two
interesting perspectives are found in Finney (1948) and Abelson and Prentice (1997).
‘Spurious’ interactions can arise in factorial designs with heterogeneous variance
(Snee 1982).
Due to the potentially large number of treatment level combinations, methods for
factorial designs without replication are required. An additive model allows pooling
interactions and residual factors (Xampeny et al. 2018), and several methods for
detecting additivity are compared in Rusch (2009). A general discussion of pooling
in ANOVA is Hines (1996). The short note in Lachenbruch (1988) discusses sample
sizes for testing interactions.
References
Abelson, R. P. and D. A. Prentice (1997). “Contrast tests of interaction hypothesis”. In: Psychological
Methods 2.4, pp. 315–328.
Davies, O. L. (1954). The Design and Analysis of Industrial Experiments. Oliver & Boyd, London.
de Gonzalez, A. and D. R. Cox (2007). “Interpretation of interaction: a review”. In: The Annals of
Applied Statistics 1.2, pp. 371–385.
Finney, D. J. (1948). “Main Effects and Interactions”. In: Journal of the American Statistical Asso-
ciation 43.244, pp. 566–571.
Fisher, R. A. (1926). “The Arrangement of Field Experiments”. In: Journal of the Ministry of
Agriculture of Great Britain 33, pp. 503–513.
Fisher, R. A. (1971). The Design of Experiments. 8th. Hafner Publishing Company, New York.
Hector, A., S. von Felten, and B. Schmid (2010). “Analysis of variance with unbalanced data: An
update for ecology & evolution”. In: Journal of Animal Ecology 79.2, pp. 308–316.
Herr, D. G. (1986). “On the History of ANOVA in Unbalanced, Factorial Designs: The First 30
Years”. In: The American Statistician 40.4, pp. 265–270.
Hines, W. G. S. (1996). “Pragmatics of Pooling in ANOVA Tables”. In: The American Statistician
50.2, pp. 127–139.
Lachenbruch, P. A. (1988). “A note on sample size computation for testing interactions”. In: Statistics
in Medicine 7.4, pp. 467–469.
Lenth, R. V. (1989). “Quick and easy analysis of unreplicated factorials”. In: Technometrics 31.4,
pp. 469–473.
Littell, R. C. (2002). “Analysis of unbalanced mixed model data: A case study comparison of
ANOVA versus REML/GLS”. In: Journal of Agricultural, Biological, and Environmental Statis-
tics 7.4, pp. 472–490.
Mee, R. W. and X. Lu (2010). “Don’t use rank sum tests to analyze factorial designs”. In: Quality
Engineering 23.1, pp. 26–29.
Nelder, J. A. (1994). “The statistics of linear models: back to basics”. In: Statistics and Computing
4.4, pp. 221–234.
Rusch, T. et al. (2009). “Tests of additivity in mixed and fixed effect two-way ANOVA models with
single sub-class numbers”. In: Statistical Papers 50.4, pp. 905–916.
Searle, S. R. (1987). Linear Models for Unbalanced Data. John Wiley & Sons, Inc.
Shaw, R. G. and T. Mitchell-Olds (1993). “Anova for unbalanced data: an overview”. In: Ecology
74, pp. 1638–1645.
Shaw, R., M. F. W. Festing, et al. (2002). “Use of factorial designs to optimize animal experiments
and reduce animal use”. In: ILAR Journal 43.4, pp. 223–232.
Snee, R. D. (1982). “Nonadditivity in a two-way classification: Is it interaction or nonhomogeneous
variance?” In: Journal of the American Statistical Association 77.379, pp. 515–519.
Tukey, J. W. (1949b). “One Degree of Freedom for Non-Additivity”. In: Biometrics 5.3, pp. 232–
242.
Urquhart, N. S. and D. L. Weeks (1978). “Linear Models in Messy Data: Some Problems and
Alternatives”. In: Biometrics 34.4, pp. 696–705.
Venables, W. N. (2000). Exegeses on linear models. Tech. rep.
Xampeny, R., P. Grima, and X. Tort-Martorell (2018). “Selecting significant effects in factorial
designs: Lenth’s method versus using negligible interactions”. In: Communications in Statistics
- Simulation and Computation 47.5, pp. 1343–1352.
Yates, F. (1934). “The Analysis of Multiple Classifications with Unequal Numbers in the Different
Classes”. In: Journal of the American Statistical Association 29.185, pp. 51–66.
Chapter 7
Improving Precision and Power: Blocked
Designs
7.1 Introduction
In our discussion of the vendor examples in Sect. 3.3, we observed huge gains in
power and precision when allocating each vendor to one of two samples per mouse,
contrasting the resulting enzyme levels for each pair of samples individually, and
then averaging the resulting differences.
This design is an example of blocking, where we organize the experimental units
into groups (or blocks), such that units within the same group are more similar than
those in different groups. For k treatments, a block size of k experimental units per
group allows us to randomly allocate each treatment to one unit per block and we can
estimate any treatment contrast within each block and then average these estimates
over the blocks. Hence, this strategy removes any differences between blocks from
the contrast estimates and tests, resulting in lower residual variance and increased
precision and power without increases in sample size. As R. A. Fisher noted
Uniformity is only requisite between objects whose response is to be contrasted (that is,
objects treated differently). (Fisher 1971, p. 33)
We continue with our example of how three drug treatments in combination with
two diets affect enzyme levels in mice. To keep things simple, we only consider
the low-fat diet for the moment, so the treatment structure only contains Drug with
three levels. Our aim is to improve the precision of contrast estimates and increase
the power of the omnibus F-test. To this end, we arrange (or block) mice into groups
of three and randomize the drugs separately within each group such that each drug
occurs once per group. Ideally, the variance between animals in the same group
is much smaller than between animals in different groups. A common choice for
blocking mice is by litter, since sibling mice often show more similar responses as
compared to mice from different litters (Perrin 2014). Litter sizes are typically in
the range of 5–7 animals in mice (Watt 1934), which would easily allow us to select
three mice from each litter for our experiment.
Our experiment is illustrated in Fig. 7.1A: we use b = 8 litters of n = 3 mice each,
resulting in an experiment size of N = bn = 24 mice. In contrast to a completely
randomized design, our randomization is restricted since we require that each treat-
ment occurs exactly once per litter, and we randomize drugs independently for each
litter.
The data are shown in Fig. 7.1B, where we connect the three observed enzyme
levels in each block by a line, akin to an interaction plot. The vertical dispersion of
the lines indicates that enzyme levels within each litter are systematically different
from those in other litters. The lines are roughly parallel, which shows that all three
drug treatments are affected equally by these systematic differences, there is no litter-
by-drug interaction, and treatment contrasts are unaffected by systematic differences
between litters.
Hasse Diagram
Deriving the experiment structure diagram in Fig. 7.2C poses no new challenges:
the treatment structure contains the single treatment factor Drug with three levels
Fig. 7.1 Comparing enzyme levels for placebo, D1, and D2 under low-fat diet using randomized
complete block design. A Data layout with eight litters of size three, treatments independently
randomized in each block. B Observed enzyme levels for each drug; lines connect responses of
mice from the same litter
Fig. 7.2 Randomized complete block design for determining effect of two different drugs and
placebo using eight mice per drug and a single measurement per mouse. Each drug occurs once per
block and assignment to mice is randomized independently for each block
while the unit structure contains the blocking factor (Litter), which groups the units
from (Mouse). We consider the specific eight litters in our experiment as a random
sample, which makes both unit factors random. The factors Drug and (Litter) are
crossed and their combinations are levels of the interaction factor (Litter:Drug);
since (Litter) is random, so is the interaction. The randomization allocates levels
of Drug on (Mouse), and the experiment structure is similar to a two-way factorial
design, but (Litter) originates from the unit structure, is random, and is not randomly
allocated to (Mouse), but rather groups levels of (Mouse) by an intrinsic property of
each mouse.
160 7 Improving Precision and Power: Blocked Designs
Each treatment occurs only once per block, and the variations due to interactions
and residuals are completely confounded. This design is called a randomized com-
plete block design (RCBD). If we want to analyze data from an RCBD, we need to
assume that the block-by-treatment interaction is negligible. We can then merge the
interaction and residual factors and use the sum of their variation for estimating the
residual variance (Fig. 7.2D).
An appreciable block-by-treatment interaction means that the differences in
enzyme levels between drugs depend on the litter. Treatment contrasts are then litter-
specific and this systematic heterogeneity of treatment effects complicates the anal-
ysis and precludes a straightforward interpretation of the results. Such interaction is
likely caused by other biologically relevant factors that influence the effect of (some)
treatments but that have not been accounted for in our experiment and analysis.
We cannot test the interaction factor and therefore require a non-statistical argu-
ment to justify ignoring the interaction. Since we have full control over which prop-
erty we use for blocking the experimental units, we can often employ subject-matter
knowledge to exclude interactions between our chosen blocking factor and the treat-
ment factor. In our particular case, for example, it seems unlikely that the litter affects
drugs differently, which justifies treating the litter-by-drug interaction as negligible.
Linear Model
Because the blocking factor levels are random, so are the cell means μ_ij. Moreover,
there is a single observation per cell, and each cell mean is confounded with the
residual for that cell; a two-factor cell means model is therefore not useful for an
RCBD.
The parametric model (including the block-by-treatment interaction, Fig. 7.2C) is

y_ij = μ + α_i + b_j + (αb)_ij + e_ij ,

where μ is the grand mean, α_i the deviation of the average response for drug i
from the grand mean, and b_j the random effect of block j. We assume that the
residuals and block effects are normally distributed as e_ij ∼ N(0, σ_e²) and
b_j ∼ N(0, σ_b²), and that they are all mutually independent. The interaction of drug and block
effects is a random effect with distribution (αb)_ij ∼ N(0, σ_αb²).
For an RCBD, the interaction is completely confounded with the residuals; its
parameters (αb)_ij cannot be estimated from the experiment without replicating treatments
within each block. If interactions are negligible, then (αb)_ij = 0 and we arrive
at the additive model (Fig. 7.2D)

y_ij = μ + α_i + b_j + e_ij .   (7.1)
The estimated marginal mean μ̂_i of treatment group i averages over the b blocks and
has variance Var(μ̂_i) = (σ_b² + σ_e²)/b. We then base our contrast analysis on these
estimated marginal means, effectively using a cell means model for the treatment
factor after correcting for the block effects.
Analysis of Variance
The total sum of squares for the RCBD with an additive model decomposes into
sums of squares for each factor involved, such that

SS_tot = SS_block + SS_trt + SS_res ,

and mean squares are formed by dividing each sum of squares by its corresponding
degrees of freedom. The omnibus F-statistic for the treatment factor is then

F = MS_trt / MS_res = (SS_trt / df_trt) / (SS_res / df_res) .
For k treatment factor levels and b blocks, the test statistic has an F-distribution with
k − 1 and (b − 1)(k − 1) degrees of freedom under the omnibus null hypothesis
H₀: μ₁ = μ₂ = ··· = μ_k that all treatment group means are equal. Note that without
blocking, the denominator sum of squares would be SS_block + SS_res, with a
corresponding loss of power.
We derive the model specification from the Hasse diagrams in Fig. 7.2: the fixed
factors belong exclusively to the treatment structure and the random factors to the
unit structure, so the corresponding terms are 1+drug and Error(litter/mouse); we reduce
these terms to drug and Error(litter), and the model is fully specified as
y~drug+Error(litter). The analysis of variance table then contains two error
strata: one for the block effect and one for the within-block residuals.
We randomized the treatment on the mice within litters, and the within-block
variance σe2 therefore provides the mean squares for the F-test denominator; this
agrees with the fact that (Mouse) is the closest random factor below Drug in the
experiment structure diagram (Fig. 7.2D). The between-block error stratum contains
no further factors or tests, since there are no factors randomized on (Litter) in this
design.
The omnibus F-test for the treatment factor provides clear evidence that the drugs
affect the enzyme levels differently: the variation between the average enzyme levels
of the drug groups is about 85 times larger than the residual variance.
The effect size η²_drug = 63% is large, but it measures the variation explained by
the drug effects relative to the overall variation, including the variation removed
by blocking. It is more meaningful to compare the drug effect to the within-block
residuals alone using the partial effect size η²_p,drug = 92%, since the litter-to-litter
variation is removed by blocking and has no bearing on the precision of the treatment
effect estimates. The large effect sizes confirm that the overwhelming fraction of the
variation of enzyme levels in each litter is due to the drugs acting differently.
In contrast to a two-way ANOVA with factorial treatment structure, we cannot
simplify the analysis to a one-way ANOVA with a single treatment factor with b · k
levels. This is because the blocking factor is random, and the resulting one-way
factor would also be a random factor. The omnibus F-test for this factor is difficult to
interpret, and a contrast analysis would be a futile exercise, since we would compare
randomly sampled factor levels among each other.
In contrast to our previous designs, the analysis of an RCBD requires two variance
components in our model: the residual within-block variance σ_e² and the between-block
variance σ_b². The classical analysis of variance handles these by using one error
stratum per variance component, but this makes subsequent analyses more tedious.
For example, estimates of the two variance components are of direct interest, but while
the estimate σ̂_e² = MS_res is directly available from the ANOVA result, an estimate
for the between-block variance is

σ̂_b² = (MS_block − MS_res) / n ,

with n the number of observations per block (here, n = k = 3 treatments), and requires
manual calculation. For our example, we find σ̂_e² = 0.44 and σ̂_b² = 1.62
based on the ANOVA mean squares.
An attractive alternative is the linear mixed model, which explicitly considers the
different random factors for estimating variance components and parameters of the
linear model in Eq. (7.1). Linear mixed models offer a very general and powerful
extension to linear regression and analysis of variance, but their general theory and
estimation are beyond the scope of this book. For our purposes, we only need a
small fraction of their possibilities and we use the lmer() function from package
lme4 for all our calculations. In specifying a linear mixed model, we use terms of
the form (1|X) to introduce a random offset for each level of the factor X; this
construct replaces the Error()-term from aov(). The fixed effect part of the
model specification remains unaltered. For our example, the model specification is
then y~drug+(1|litter), which asks for a fixed effect α_i for each level of
Drug and allows a random offset b_j for each litter.
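A corresponding sketch of the mixed-model fit, again assuming the data frame d from above; loading lmerTest instead of lme4 additionally provides p-values:

    library(lmerTest)                        # loads lme4 and adds tests
    m_lmm <- lmer(y ~ drug + (1 | litter), data = d)
    summary(m_lmm)  # fixed drug effects and the two variance components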
The Fixed effects section of the result gives the parameter estimates for αi ,
but they are of secondary interest and depend on the coding (cf. Sect. 4.7).
Note that the degrees of freedom for estimating the intercept (which here corre-
sponds to the average enzyme level in the placebo group) is no longer an integer,
because block effects and residuals are used with different weights in its estimation.
The Random effects section of the result provides the variance and standard
deviation for each variance component: σ̂_b² = 1.62 for (Litter) and σ̂_e² = 0.44
for the residual, in agreement with the ANOVA results.
We calculate our familiar ANOVA table from an estimated linear mixed model m
using anova(m).
Linear mixed models and ‘traditional’ analysis of variance use the same linear model
to analyze data from a given design. Their main difference is the way they handle
models with multiple variance components. Analysis of variance relies on the crude
concept of error strata, which makes direct estimation of variance more cumbersome
and leads to loss in efficiency if information on effects is distributed between error
strata (such as in incomplete block designs, discussed in Sect. 7.3). Linear mixed
models use different techniques for estimation of the model’s parameters that make
use of all available information. Variance estimates are directly available, and linear
mixed models do not suffer from problems with unbalanced group sizes. Colloquially,
we can say that the analysis of variance approach provides a convenient framework
to phrase design questions, while the linear mixed model provides a more general
tool for the subsequent analysis.
7.2.3 Contrasts
In a blocked design with a random block factor, the only useful contrasts are between
levels of (fixed) treatment factors, since a contrast involving levels of the blocking
factor would compare specific instances of random factor levels. Estimating the
difference between the average enzyme levels of the first and second litter, for example,
has a useful interpretation only if these two arbitrary litters are again used
in a replication of the experiment. This is also reflected in the population marginal
means, of which there are only three in our example (μ₁, μ₂, μ₃, one per treatment
group), rather than one per block-treatment combination. We then define and
estimate treatment contrasts exactly as before, and briefly exemplify the procedure
for our experiment with three contrasts: the comparison of D1 and of D2 to
placebo, and the comparison of the average of D1 and D2 to placebo. The contrasts
are

Ψ(w₁) = μ₂ − μ₁ ,   Ψ(w₂) = μ₃ − μ₁ ,   and   Ψ(w₃) = (μ₂ + μ₃)/2 − μ₁ .
The following code defines and estimates the contrasts based on the linear mixed
model:
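A minimal sketch of such code, assuming the fitted model m_lmm from above and the drug levels ordered as Placebo, D1, D2:

    library(emmeans)
    em <- emmeans(m_lmm, ~ drug)             # estimated marginal means
    confint(contrast(em, method = list(
      "D1 - Placebo"        = c(-1, 1,   0),
      "D2 - Placebo"        = c(-1, 0,   1),
      "(D1+D2)/2 - Placebo" = c(-1, 0.5, 0.5))))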
The contrast estimates for our experiment are shown in Table 7.1 for an analysis based
on aov(), the above analysis using lmer(), and an incorrect analysis of variance
that ignores the blocking factor and treats the data as coming from a completely
randomized design. Ignoring the blocking factor results in a residual variance of
σ_b² + σ_e² instead of σ_e² for the correct analysis. The resulting decrease in precision is
clearly visible.
Table 7.1 Contrast estimates and confidence intervals based on three linear models
Contrast Estimate se df LCL UCL
ANOVA: aov(y~drug+Error(litter))
D1 – Placebo 4.23 0.33 14 3.51 4.94
D2 – Placebo 2.91 0.33 14 2.19 3.62
(D1+D2)/2 – Placebo 3.57 0.29 14 2.95 4.18
Mixed model: lmer(y~drug+(1|litter))
D1 – Placebo 4.23 0.33 14 3.51 4.94
D2 – Placebo 2.91 0.33 14 2.19 3.62
(D1+D2)/2 – Placebo 3.57 0.29 14 2.95 4.18
INCORRECT: aov(y~drug)
D1 – Placebo 4.23 0.72 21 2.73 5.72
D2 – Placebo 2.91 0.72 21 1.41 4.40
(D1+D2)/2 – Placebo 3.57 0.62 21 2.27 4.86
Once the data are recorded, we are interested in quantifying how well the blocking
performed in the experiment. This information allows us to better predict
the expected residual variance for a power analysis of our next experiment, and to
determine whether we should continue using the blocking factor.
One way of evaluating the blocking is to treat the blocking factor as fixed and
specify the ANOVA model as y~drug+litter. The ANOVA table then contains
a row for the blocking factor with an associated F-test (Samuels et al. 1991). This
test only tells us whether the between-block variance is significantly different from zero,
but does not quantify the advantage of blocking.
A more meaningful alternative is the calculation of appropriate effect sizes, and
we can determine the percentage of variation removed from the analysis by blocking
using the effect size η²_block, which evaluates to 31% for our example. In addition,
the partial effect size η²_p,block = 86% shows that the vast majority of the non-treatment sum
of squares is due to the litter-to-litter variation, and blocking by litter was a very
successful strategy.
Alternatively, the intraclass correlation coefficient ICC = σ_b²/(σ_b² + σ_e²) uses the
proportion of variance, and we find an ICC of 79% for our example, confirming that
the blocking works well. The ICC is directly related to the relative efficiency of the
RCBD compared to a non-blocked CRD, since

RE(CRD, RCBD) = (σ_b² + σ_e²) / σ_e² = 1 / (1 − ICC) ,   (7.2)
and σb2 + σe2 is the residual variance of a corresponding CRD. From our estimates of
the variance components, we find a relative efficiency of 4.66; to achieve the same
precision as our blocked design with 8 mice per treatment group would require about
37 mice per treatment group for a completely randomized design.
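These quantities are easily computed from the fitted mixed model; a sketch, assuming m_lmm from above:

    # Variance components, ICC, and relative efficiency (Eq. (7.2))
    vc   <- as.data.frame(VarCorr(m_lmm))
    s2_b <- vc$vcov[vc$grp == "litter"]      # between-block variance
    s2_e <- vc$vcov[vc$grp == "Residual"]    # within-block variance
    icc  <- s2_b / (s2_b + s2_e)             # intraclass correlation
    re   <- 1 / (1 - icc)                    # relative efficiency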
We can alternatively find the relative efficiency as

RE(CRD, RCBD) = MS_res,CRD / MS_res,RCBD ,

where we estimate the residual mean squares of the CRD from the ANOVA table
of the RCBD as a weighted average of the block and the residual mean squares,
weighted by the corresponding degrees of freedom:

MS_res,CRD = ( df_block · MS_block + (df_trt + df_res) · MS_res ) / ( df_block + df_trt + df_res ) .

For our example, MS_res,CRD = 1.92, resulting in a relative efficiency estimate of
RE(CRD, RCBD) = 4.34. It differs slightly from our previous estimate because it
uses an approximation based on the ANOVA results rather than estimates of the
variance components based on the linear mixed model.
The main purpose of blocking is to decrease the residual variance for improving
the power of tests and precision of estimates by grouping the experimental units
before the allocation of treatments, such that units within the same block are more
similar than units from different blocks. The blocking factor must therefore describe
a property of experimental units that affects all treatments equally to avoid a block-
by-treatment interaction, and that is independent of the treatment outcome, which
precludes, e.g., grouping ‘responders’ and ‘non-responders’ based on the measured
response.
Some examples of grouping units using intrinsic properties are (i) litters of ani-
mals; (ii) parts of a single animal or plant, such as left/right kidney or leaves; (iii)
initial weight; (iv) age or co-morbidities; (v) biomarker values, such as blood pres-
sure, state of tumor, or vaccination status. Properties non-specific to the experimental
units include (i) batches of chemicals used for the experimental unit; (ii) device used
for measurements; or (iii) date in multi-day experiments. These are often necessary
to account for systematic differences from the logistics of the experiment (such as
batch effects); they also increase the generalizability of inferences due to the broader
experimental conditions.
Fig. 7.3 Creating a randomized complete block design from a completely randomized design. A A
CRD randomizes drugs on mice. B Introducing a blocking factor 'above' groups the experimental
units. C Subdividing each experimental unit and randomizing the treatments on the lower level
creates a new experimental unit factor 'below'
As always, practical considerations should be taken into account when deciding upon
blocking and randomization. There is no point in designing an overly complicated
blocked experiment that becomes too difficult to implement correctly. On the other
hand, there is no harm if we use a blocking factor that turns out to have no or minimal
effect on the residual variance.
We can think about creating a blocked design by starting from a completely
randomized design and ‘splitting’ the experimental unit factor into a blocking and
a nested (potentially new) unit factor. Two examples are shown in Fig. 7.3, starting
from the CRD (Fig. 7.3A) randomly allocating drug treatments on mice. In the first
RCBD (Fig. 7.3B), we create a blocking factor ‘above’ the original experimental unit
factor and group mice by their litters. This restricts the randomization of Drug to
mice within litters. In the second RCBD (Fig. 7.3C), we subdivide the experimental
unit into smaller units by taking multiple samples per mouse. This re-purposes the
original experimental unit factor as the blocking factor and introduces a new factor
‘below’, but requires that we now randomize Drug on (Sample) to obtain an RCBD
and not pseudo-replication.
Fig. 7.4 Power for different sample sizes n for a completely randomized design with residual
variance two and four randomized complete block designs with same overall variance, but four
different within-block residual variances. Inset: same curves for small sample sizes show lower
power of RCBD than CRD for identical residual variance due to lower residual degrees of freedom
in the RCBD
Fig. 7.5 A (generalized) randomized complete block design (GRCBD) with four mice per drug
and laboratory: a completely randomized design for estimating the effects of three drugs, replicated
in two laboratories
The corresponding linear model is

y_ijk = μ + α_i + b_j + (αb)_ij + e_ijk ,

where α_i are the (fixed) treatment effect parameters, b_j ∼ N(0, σ_b²) the random block effect
parameters, (αb)_ij ∼ N(0, σ_αb²) describes the interaction of
treatment i and laboratory j, and e_ijk ∼ N(0, σ_e²) are the residuals.
We can typically consider the levels of a blocking factor as randomly drawn from a
set of potential levels. The blocking factor is then random, and we are not interested
in contrasts involving its levels, for example, but rather use the blocking factor to
increase precision and power by removing parts of the variation from treatment
contrasts.
A typical example of a non-random classification factor is the sex of an animal.
The Hasse diagrams in Fig. 7.6 show an experiment design to study the effects of
our three drugs on both female and male mice. Each treatment group has eight mice,
with half of them female, the other half male. The experiment design looks similar to
a factorial design of Chap. 6, but the interpretation of its analysis is rather different.
Most importantly, while the factor Sex is fixed with only two possible levels, its levels
are not randomly assigned to mice. This is reflected in the fact that Sex groups mice
by an intrinsic property and hence belongs to the unit structure. In contrast, levels of
Drug are randomly assigned to mice, and Drug therefore belongs to the treatment
structure of the experiment.
In contrast to a randomized complete block design, we cannot increase the number
of blocks to increase the replication. With two levels of Sex fixed, we instead need to
increase the experiment size by using multiple mice of each sex for each drug. Since
Sex and Drug are fixed, so is their interaction, and all fixed factors are tested against
the same error term from (Mouse).
Fig. 7.6 A blocked design using the sex of the mouse as a fixed classification factor, with four
female and four male mice in each of three treatment groups

The specification for the analysis of variance is y~drug*sex, and it looks exactly
like the specification of a two-way ANOVA. However, since Sex is a unit factor that
is not randomized, the roles of Sex and Drug are not symmetric, and the interpretation
is rather different from our previous two-way ANOVA with treatment factors Drug
and Diet. In the factorial design, the presence of a Drug:Diet interaction means that
we can modify the effect of a drug by using a different diet. In the fixed block design,
a Sex:Drug interaction means that the effect of drugs is not stable over the sexes and
some drugs affect female and male mice differently. We can use a drug to alter the
enzyme level in a mouse, but we cannot alter the sex of that mouse to modulate the
drug effect.
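As a sketch of the specification given above, assuming a hypothetical data frame d_sex with columns y, drug, and sex:

    # Sex is a fixed, non-randomized unit factor; Drug is randomized on mice
    m_sex <- aov(y ~ drug * sex, data = d_sex)
    summary(m_sex)  # drug, sex, and drug:sex all tested against (Mouse)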
Fig. 7.7 Experiment layout for three drugs under two diets in four litter blocks: a two-way crossed
treatment structure within each block

Fig. 7.8 Randomized complete block design for determining the effect of two different drugs and
placebo combined with two different diets, using four mice per drug and diet and a single response
measurement per mouse. All treatment combinations occur once per block and are randomized
independently

For constructing the Hasse diagrams in Fig. 7.8, we note that the blocking factor
is crossed with the full treatment structure, that is, with both treatment factors and the
treatment interaction factor. We assume that our blocking factor does not differentially
change the diet or drug effects, and we consequently ignore all three block-by-treatment
interactions (Litter:Diet), (Litter:Drug), and (Litter:Diet:Drug).
We can easily derive a corresponding linear model from the Hasse diagrams. It is
identical to our 'usual' model for an RCBD, except that its mean structure additionally
contains the parameters for the second treatment factor and the interaction:

y_ijk = μ + α_i + β_j + (αβ)_ij + b_k + e_ijk ,

where α_i and β_j are the main effects of drug and diet, respectively, (αβ)_ij the
interaction effects between them, b_k ∼ N(0, σ_b²) are the random block effects, and
e_ijk ∼ N(0, σ_e²) are the residuals. Again, we assume that all random effects are
mutually independent.
The enzyme levels y_ijk are shown in Fig. 7.9 as dark grey points in an interaction
plot, drawn separately for each block. We can clearly see the block effects, which
systematically shift the enzyme levels for all conditions. After fitting the linear model,
the shift b̂_k is estimated for each block, and the light grey points show the resulting
'normalized' data y_ijk − b̂_k after accounting for the block effects.
Fig. 7.9 Enzyme levels for placebo (point), drug D1 (triangle), and drug D2 (square). Data are
shown separately for each of four litters (blocks). Lines connect mean values over litters of same
drug under low and high-fat diets. Dark grey points are raw data, light grey points are enzyme levels
adjusted for block effect
ANOVA Table
We analyze the resulting data using either analysis of variance or a linear mixed
model. The two model specifications derive directly from the Hasse diagrams
and consist of the two main effects and the interaction for the treatment factors, and
an error stratum or a random intercept for the litter unit factor. The specifications
are y~drug*diet+Error(litter) and y~drug*diet+(1|litter),
respectively.
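As a sketch, assuming a hypothetical data frame d2 with columns y, drug, diet, and litter:

    # Factorial treatment structure within litter blocks
    m_aov2 <- aov(y ~ drug * diet + Error(litter), data = d2)
    m_lmm2 <- lmer(y ~ drug * diet + (1 | litter), data = d2)
    anova(m_lmm2)   # tests for drug, diet, and drug:diet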
The resulting ANOVA table based on the linear mixed model shows that blocking
by litter reduces the residual variance from 1.19 for the non-blocked design
in Chap. 6 to 0.64; the relative efficiency is RE(CRD, RCBD) = 1.87. Due to the
increase in power, the Diet main effect is now significant.
Contrasts
The definition and analysis of linear contrasts work exactly as for the two-way
ANOVA in Sect. 6.6, and contrasts are defined on the six treatment group means.
For direct comparison with our previous results, we estimate the two interaction
contrasts of Table 6.6 in the blocked design. They compare the difference in enzyme
levels for D1 (resp. D2) under low- and high-fat diet to the corresponding difference
for placebo.
7.3 Incomplete Block Designs

7.3.1 Introduction
The randomized complete block design can be too restrictive if the number of treat-
ment levels is large or the available block sizes small. Fractional factorial designs
offer a solution for factorial treatment structures by confounding some treatment
effects with blocks (Chap. 9). For a single treatment factor, we can use incomplete
block designs (IBD), where we deliberately relax the complete balance of the previous
designs and use only a subset of the treatments in each block.
The most important example is the balanced incomplete block design (BIBD),
in which each pair of treatments occurs together in the same number of blocks, and
consequently each individual treatment occurs in the same number of blocks. This
specific type of balance ensures that pair-wise contrasts are all estimated with the
same precision, but precision decreases if more than two treatment groups are compared.
We illustrate this design with our drug example, where three drug treatments are
allocated to 24 mice on a low-fat diet. If the sample preparation is so elaborate that
only two mice can be measured at the same time, then we have to run this experiment
in twelve batches, with each batch containing one of three treatment pairs: (placebo,
D1), (placebo, D2), or (D1, D2). A possible balanced incomplete block design
is shown in Fig. 7.10, where each treatment pair is used in four blocks, and the
treatments are randomized independently to mice within each block. The resulting
data are shown in Table 7.3.
Fig. 7.10 Experiment layout for a balanced incomplete block design with three drugs in twelve
batches of size two
Table 7.3 Data for a BIBD with 12 blocks of size 2 and 3 treatment levels
Batch Drug y Batch Drug y Batch Drug y
Block 1 Placebo 8.79 Block 5 D2 14.05 Block 9 D1 13.87
Block 1 D1 12.89 Block 5 Placebo 11.93 Block 9 D2 13.92
Block 2 D2 13.49 Block 6 D1 12.98 Block 10 Placebo 9.03
Block 2 Placebo 10.58 Block 6 D2 11.47 Block 10 D1 13.67
Block 3 D1 12.76 Block 7 Placebo 8.06 Block 11 D2 13.42
Block 3 D2 11.64 Block 7 D1 12.80 Block 11 Placebo 10.62
Block 4 Placebo 10.14 Block 8 D2 11.22 Block 12 D1 15.63
Block 4 D1 14.65 Block 8 Placebo 9.10 Block 12 D2 12.74
The key requirement of a BIBD is that all pairs of treatments occur the same number
of times in a block. This requirement restricts the possible combinations of number
of treatments, block size, and number of blocks. With three pairs of treatments as
in our example, the number of blocks in a BIBD is restricted to be a multiple of
three. The relations between the parameters of a BIBD are known as the two defining
equations

r·k = b·s   and   λ·(k − 1) = r·(s − 1) .   (7.3)
They relate the number of treatments k, number of blocks b, block size s to the number
of times r that each treatment occurs and the number of times λ that each pair of
treatments occurs in the same block. If these equations hold, then the corresponding
design is a BIBD.
The equations are derived as follows: with b blocks of s units each, the product
b·s is the total number of experimental units in the experiment. This number has to
equal r·k, since each of the k treatments occurs in r blocks. Moreover, each particular
treatment occurs λ times with each of the remaining k − 1 treatments. It also occurs
in r blocks, and is combined with s − 1 other treatments in each block.
For our example, we have k = 3 treatment levels, each treatment occurs in
r = 8 blocks, and we have b = 12 blocks each of size s = 2, resulting in λ = 4
co-occurrences of each pair in the same block. This satisfies the defining equations, and
our design is a BIBD. In contrast, with only b = 5 blocks we are unable to satisfy the
defining equations, and no corresponding BIBD exists.
In order to generate a balanced incomplete block design, we need to find parameters
s, r, and λ such that the defining Eq. (7.3) holds. A particularly simple way
is to generate an unreduced or combinatorial design and use as many blocks as
there are ways to select s out of k treatment levels. The number of blocks is then the
binomial coefficient

b = (k choose s) = k! / ((k − s)! · s!)

and increases rapidly with k.
We can often improve on the unreduced design and generate a valid BIBD with
substantially fewer blocks. Experimental design books sometimes contain tables of
BIBDs for different numbers of treatments and block sizes; see for example Cochran
and Cox (1957). As a further example, we consider constructing a BIBD for k = 6
treatments in blocks of size s = 3. The unreduced combinatorial design requires
b = (6 choose 3) = 20 blocks (experiment size N = 60). Not all of these are needed to fulfill
the defining equations, however, and a smaller BIBD exists with only b = 10 blocks
(and N = 30).
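As a sketch, the design.bib() function from the agricolae package (see also the Notes at the end of this chapter) searches for such a design; note that its argument k denotes the block size:

    library(agricolae)
    bib <- design.bib(trt = LETTERS[1:6], k = 3, seed = 42)
    bib$statistics   # reports lambda, replication r, and efficiency
    head(bib$book)   # randomized plan: block and treatment per unit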
7.3.3 Analysis
The balanced incomplete block design is not fully balanced: since only a fraction of
the available treatment levels occurs in each block, some combinations of block and
treatment levels are not observed, resulting in some-cells-empty data. The treatment
and block main effects are no longer independent, but the requirements encoded in the
two defining equations ensure that unbiased estimates exist for all treatment effects,
and that standard ANOVA techniques are available for the analysis. However, the
treatment and block factor sums of squares are not independent, and while their sum
remains the same, their individual values now depend on the order of the two factors
in the model specification. Moreover, part of the information about treatment effects
is captured by the differences in blocks; in particular, there are two omnibus F-tests
for the treatment factor, the first based on the inter-block information and located in
the ANOVA table in the block-factor error stratum, and the second based—as for
an RCBD—on the intra-block information and located in the residual error stratum.
A linear mixed model automatically combines this information, and analysis of
variance and mixed model results no longer concur exactly for a BIBD.
Linear Model

The linear model for a BIBD is the same additive model as for an RCBD,

y_ij = μ + α_i + b_j + e_ij ,

where μ is the grand mean, α_i the expected difference between the group mean for
treatment i and the grand mean, b_j ∼ N(0, σ_b²) are the random block effects, and
e_ij ∼ N(0, σ_e²) are the residuals. All random variables are mutually independent.
Not all block-treatment combinations occur in a BIBD, and some of the yi j there-
fore do not exist. As a consequence, naive parameter estimators are biased, and more
complex estimators are required for the parameters and treatment group means. We
forego a more detailed discussion of these problems and rely on statistical software
like R to provide appropriate estimates for BIBDs.
In an RCBD, we can estimate any treatment contrast and all effects independently
within each block, and then average over blocks. We can use the same intra-block
analysis for a BIBD by estimating contrasts and effects based on those blocks that
contain sufficient information and averaging over these blocks. The resulting esti-
mates are free of block effects.
In addition, the block totals—calculated by adding up all response values in a
block—also contain information about contrasts and effects if the block factor is ran-
dom. This information can be extracted by an inter-block analysis. We again refrain
from discussing the technical details, but provide some intuition for the recovery of
inter-block information.
We consider our example with three drugs in blocks of pairs and are interested
in estimating the average enzyme level under the placebo treatment. Using the intra-block
information, we would adjust each response value by the block effect and
average over the resulting adjusted placebo responses. For the inter-block analysis,
first note that the true treatment group means are μ + α₁, μ + α₂, and μ + α₃ for
placebo, D1, and D2, respectively. The expected block totals for the three types of
block are then T_P,D1 = (μ + α₁) + (μ + α₂) for (placebo, D1), T_P,D2 = (μ + α₁) + (μ + α₃)
for (placebo, D2), and T_D1,D2 = (μ + α₂) + (μ + α₃) for (D1, D2), and each
type of block occurs the same number of times.
The treatment group mean for placebo can then be calculated from the block totals
as (T_P,D1 + T_P,D2 − T_D1,D2)/2, since

(T_P,D1 + T_P,D2 − T_D1,D2)/2 = (4μ + 2α₁ + α₂ + α₃ − 2μ − α₂ − α₃)/2 = μ + α₁ .
Analysis of Variance
Based on the linear model, the specification for the analysis of variance is
y~drug+Error(block) for a random block factor, the same as for an RCBD.
The resulting ANOVA table has again two error strata, and the block error stratum
contains the inter-block information about the treatment effects. With two sums of
squares, the analysis provides two different omnibus F-tests for the treatment factor.
The first F-test is based on the inter-block information about the treatment, and is
in general (much) less powerful than the second F-test based on the intra-block
information. Both F-tests can be combined into an overall F-test that takes all
information into account, but the inter-block F-test is typically ignored in practice and only the
intra-block F-test is used and reported; this approximation leads to an (often small)
loss in power.
The dependence of the block- and treatment factors must be considered for fixed
block effects. The correct model specification contains the blocking factor before
the treatment factor in the formula, and is y~block+drug for our example. This
model adjusts treatments for blocks and the analysis is identical to an intra-block
analysis for random block factors. The model y~drug+block, on the other hand,
yields an entirely different ANOVA table and an incorrect F-test, as we discussed in
Sect. 6.5.
Table 7.4 Contrast estimates and 95%-confidence intervals for BIBD based on intra-block estimates
from classic analysis of variance (top) and linear mixed model (bottom)
Contrast Estimate se df LCL UCL
ANOVA: aov(y~drug+Error(block))
D1 – Placebo 4.28 0.31 10.00 3.59 4.97
D2 – Placebo 2.70 0.31 10.00 2.01 3.39
(D1+D2)/2 – Placebo 3.49 0.27 10.00 2.90 4.09
Mixed model: lmer(y~drug+(1|block))
D1 – Placebo 4.22 0.31 10.81 3.54 4.90
D2 – Placebo 2.74 0.31 10.81 2.06 3.42
(D1+D2)/2 – Placebo 3.48 0.27 10.81 2.89 4.07

The linear mixed model approach does not require error strata to cope with the two
variance components, and provides a single estimate of the drug effects and treatment
group means based on all available information. The resulting degrees of freedom are
no longer integers, and the resulting F-tests and p-values can deviate from the classical
analysis of variance. We specify the model as y~drug+(1|block).
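A sketch of this analysis, assuming the data of Table 7.3 are in a hypothetical data frame bibd with columns y, drug, and block:

    m_bibd <- lmer(y ~ drug + (1 | block), data = bibd)
    anova(m_bibd)   # combines intra- and inter-block information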
The resulting sums of squares and mean square estimates are slightly larger than
for the aov() analysis because the between-block information is taken into account;
this provides more power and results in a slightly larger value of the F-statistic.
This is an example of a design in which the deliberate violation of complete
balance still allows an analysis of variance, but where a mixed model analysis offers
advantages both in the calculation and in the interpretation of the results. Since
the BIBD is not fully balanced, the linear mixed model ANOVA table gives slightly
different results when we approximate the degrees of freedom with the more conservative
Kenward–Roger method rather than the Satterthwaite method reported here.
7.3.4 Contrasts
Contrasts are defined exactly as for our previous designs, but their estimation is based
only on the intra-block information if the estimated marginal means are calculated
from the ANOVA model. Estimates and confidence intervals then differ between
ANOVA and linear mixed model results, and the latter should be preferred. For
our example, we calculate three contrasts comparing each drug, respectively, their
average to placebo. The results are given in Table 7.4, and demonstrate the differences
in degrees of freedom and precision between the two underlying models.
Estimation of all treatment contrasts is unbiased, and contrast variances are free
of the block variance in a BIBD. The variance of a contrast estimate is

Var(Ψ̂(w)) = (s / (λ·k)) · σ_e² · ∑_{i=1}^{k} w_i² ,

and we can interpret

n* = λ·k / s = r · (k·(s − 1)) / (s·(k − 1))

as the "effective sample size" in each treatment group for a BIBD. It is smaller than
the actual replication r, since some information is lost as we cannot fully eliminate
the block variances from the group mean estimates.
The relative efficiency of a BIBD compared to an RCBD with r blocks is

RE(BIBD, RCBD) = n* / r = (k·(s − 1)) / (s·(k − 1)) .

The fewer treatments fit into a single block, the lower the relative efficiency. For
our example, k = 3 and s = 2 result in a relative efficiency of 3/4, and our BIBD
requires about 33% more samples to achieve the same precision as an RCBD.
Table 7.5 Balanced incomplete block design for estimating between-plate reproducibility for 10
plates A–J based on 18 patient samples, each patient sample in five aliquots
Aliquot Patient
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
1 D F E G F J E H A I A B A H H G H I
2 G H F F H D B C F F E D J J I H E B
3 A J I J E I F B C J D F E B A C I E
4 B A C I J B H J H D J I G I G D A A
5 C D G A G E D G B C C G B C D E C F
Such a design provides the same precision for all pair-wise contrasts between plates
and has about 89% efficiency. A possible layout is shown in Table 7.5, where the letters
correspond to the 10 plates.
We conduct sample size determination and power analysis for a balanced incomplete
block design in much the same fashion as for a randomized complete block design.
The denominator degrees of freedom for the omnibus F-test are df_res = b·s − b − k + 1.
The noncentrality parameter is

λ = k · ( r · (k·(s − 1)) / (s·(k − 1)) ) · f² = k · n* · f² ,

which is again of the form "total sample size times effect size", but uses the effective
sample size per group to adjust for the incompleteness.
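A sketch of this power calculation for our example, assuming a hypothetical standardized effect size f² = 0.5:

    # BIBD: k = 3 treatments, b = 12 blocks of size s = 2, r = 8 replicates
    k <- 3; b <- 12; s <- 2; r <- 8
    n_star <- r * k * (s - 1) / (s * (k - 1))  # effective sample size: 6
    f2     <- 0.5                              # assumed effect size
    ncp    <- k * n_star * f2                  # noncentrality parameter
    df1    <- k - 1
    df2    <- b * s - b - k + 1                # residual df: 10
    crit   <- qf(0.95, df1, df2)               # 5% critical value
    1 - pf(crit, df1, df2, ncp = ncp)          # power of the omnibus F-test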
The reference design or design for differential precision is a variation of the BIBD that
is useful when a main objective is comparing treatments to a control (or reference)
group. One concrete application is a microarray study, where several conditions are
to be tested, but only two dyes are available for each microarray. If all pair-wise
contrasts between treatment groups are of equal interest, then a BIBD is a reasonable
option, while we might prefer a reference design for comparing conditions against a
common control as in Fig. 7.11A.
Another application of reference designs is the screening of several new treatments
against a standard treatment.
Fig. 7.11 A Reference design for three treatments A, B, C and a reference condition R (top) and
BIBD without reference condition (bottom). B Reference design for three treatments and common
positive control as reference condition with block size three
7.4 Multiple Blocking Factors

7.4.1 Introduction
The unit structure of (G)RCBD and BIBD experimental designs consists of only two
unit factors: the blocks and the experimental units nested in blocks. We now discuss
unit structures with more than one blocking factor; these can be nested or crossed.
Fig. 7.12 Randomized complete block design for determining effect of two different drugs and
placebo treatments on enzyme levels in mice, replicated in two laboratories. A Treatment structure.
B Unit structure with three nested factors. C Full experiment structure with block-by-treatment
interactions. D Experiment structure if these interactions are negligible
In this design, the omnibus F-test and contrasts for Drug are calculated within each litter and then
averaged over litters within labs.
This type of design can be extended to an arbitrary number of nested blocks and
we might use two labs, two cages per lab, and two litters per cage for our example.
As long as each nested factor is replicated, we are able to estimate corresponding
variance components. If a factor is not replicated (e.g., we use a single litter per lab),
then there are no degrees of freedom for the nested blocking factor, and the effects
of both blocking factors are completely confounded. Effectively, such a design uses
a single blocking factor, where each level is a combination of lab and litter.
We should keep in mind that variance components are harder to estimate than
averages, and estimates based on low replication are imprecise. This is not a problem
if our goal is removal of variation from treatment contrasts, but sufficient replication is
necessary if estimation of variance components is a primary goal of the experiment.
Sample size determination for variance component estimation is covered in more
specialized texts.
Nesting blocking factors essentially results in replication of (parts of) the design. In
contrast, crossing blocking factors allows us to control several sources of variation
simultaneously. The most prominent example is the Latin square design, which
consists of two crossed blocking factors simultaneously crossed with a treatment
factor, such that each treatment level occurs once in each level of the two blocking
factors.
For example, we might be concerned about the effect of litters on our drug compar-
isons, but suspect that the position of the cage in the rack also affects the observations.
Fig. 7.13 Latin square with cage and litter as random row/column effects and three drug treatment
levels.
The Latin square design removes both between-litter and between-cage variation
from the drug comparisons. For three drugs, this design requires three litters of three
mice each, and three cages. Crossing litters and cages results in one mouse per litter
in each cage. The drugs are then randomized on the intersection of litters and cages
(i.e., on mice) such that each drug occurs once in each cage and once in each litter.
The Hasse diagrams for this design are shown in Fig. 7.13. The crucial observation
is that none of the potential block-by-block or block-by-treatment interactions can be
estimated, since each combination of block levels or block-treatment combinations
occurs only once. In particular, the intersection of a cage and a litter is a single mouse,
and thus (Cage:Litter) is completely confounded with (Mouse) in this example.
An example layout of this design is shown in Fig. 7.14A, where the two blocking
factors are given as rows and columns. Each treatment occurs exactly once per
row and once per column and the Latin square design imposes two simultaneous
constraints on the randomization of drugs on mice.
With all interactions negligible, data from a Latin square design are analyzed by
the additive model

y_ijk = μ + α_i + c_j + r_k + e_ijk ,

where μ is the grand mean, α_i the treatment parameters, and c_j (resp. r_k) are the
random column (resp. row) parameters. This model is again in direct correspondence
to the experiment diagram and is specified as y~drug+Error(cage+litter),
respectively, y~drug+(1|cage)+(1|litter).
We can interpret the Latin square design as a blocked RCBD: ignoring the cage
effect, for example, the experiment structure is identical to an RCBD blocked by
litter. Conversely, we have an RCBD blocked on cages when ignoring the litter
effect. The remaining blocking factor blocks the whole RCBD. The requirement
of having each treatment level once per cage and litter implies that the number of
cages must equal the number of litters, and that both must also equal the number
of treatments. These constraints pose some complications for the randomization of
treatment levels to mice, and randomization is usually done by software; many older
books on experimental design also contain tables of Latin square designs with few
treatment levels.

Fig. 7.14 A Latin square using cage and litter and three drug treatments. B Replication keeping
the same cages, and using more litters without (top) and with (bottom) forming Latin squares. C
Full replication with new rows and columns. D Two fully crossed blocks
A design with a single Latin square often lacks sufficient replication: for k = 2, k = 3,
and k = 4 treatment levels, we only have 0, 2, and 6 residual degrees of freedom,
respectively. We therefore consider several options for generating r replicates of
a Latin square design. In each case, the Hasse diagram allows us to calculate the
resulting error degrees of freedom and to specify an appropriate linear model.
First, we might consider replicating columns while keeping the rows, using r k
column factor levels instead of k. The same logic applies to keeping columns and
replicating rows, of course. The two experiments in Fig. 7.14B illustrate this design
for a two-fold replication of the 3 × 3 Latin square, where we use six litters instead
of three, but keep using the same three cages in both replicates. In the top part of
the panel, we do not impose any new restrictions on the allocation of drugs and
only require that each drug occurs the same number of times in each cage, and
that each drug is used with each litter. In particular, the first three columns do not
form a Latin square in themselves. This design is called a Latin rectangle, and its
experiment structure is shown in Fig. 7.15A with model specification
y~drug+Error(cage+litter) or y~drug+(1|cage)+(1|litter).

Fig. 7.15 Replication of a Latin square design. A Latin rectangle with six columns and three rows.
B Two sets of three columns with identical rows. C New rows and columns in each replicate. D Fully
crossed blocks, not a Latin square. E Keeping both row and column levels with two independently
randomized replicates. F Independent replication of rows and columns
We can also insist that each replicate forms a proper Latin square in itself. That
means we organize the columns in two groups of three as shown in the bottom of
Fig. 7.14B. In the diagram in Fig. 7.15B, this is reflected by a new grouping factor
(Rep) with two levels, in which the column factor (Litter) is nested. The model
is specified as y~drug+Error(cage+rep/litter) or
y~drug+(1|cage)+(1|rep/litter).
By nesting both blocking factors in the replication, we find yet another useful
design that uses different row and column factor levels in each replicate of the Latin
square, as shown in Fig. 7.14C. The corresponding diagram is Fig. 7.15C and yields
the model y~drug+Error(rep/(cage+litter)), respectively,
y~drug+(1|rep)+(1|rep:cage)+(1|rep:litter).
We can also replicate both rows and columns (potentially with different numbers
of replication) without restricting any subset of units to a Latin square. The exper-
iment is shown in Fig. 7.14D and extends the Latin rectangle to rows and columns.
Here, none of the replicates alone forms a Latin square, but all rows and columns
have the same number of units for each treatment level. The corresponding diagram
is shown in Fig. 7.15D.
Table 7.6 Noncentrality parameter for the omnibus F-test and residual degrees of freedom for a
Latin square design and three strategies for r-fold replication
Design NCP λ df_res
Single Latin square λ = k²·f² (k − 1)(k − 2)
Same rows, same columns λ = r·k²·f² (k − 1)(r(k − 1) − 3)
Same rows, new columns λ = r·k²·f² (k − 1)(rk − 2)
New rows, new columns λ = r·k²·f² (k − 1)(r(k − 1) − 1)
The determination of sample size and power for a crossed-block design requires no
new ideas. The noncentrality parameter λ is again the total sample size times the
effect size f², and we find the correct degrees of freedom for the residuals from the
experiment diagram.
For a single Latin square, the sample size is necessarily k² and increases to r·k²
if we use r replicates. The residual degrees of freedom then depend on the strategy
for replication. The residual degrees of freedom and the noncentrality parameter are
shown in Table 7.6 for a single Latin square and three replication strategies: using
the same levels for row and column factors, using the same levels for either row or
column factor and new levels for the other, and using new levels for both blocking
factors. The numerator degrees of freedom are k − 1 in each case.
Plant          I   II  III  IV  V   VI  VII
Leaf height
  low          D   A   G    F   C   B   E
  mid          A   C   D    G   B   E   F
  high         B   E   C    A   F   G   D
Fig. 7.16 Experiment layout for Youden square. Seven plants are considered (I–VII) with seven
different inoculation treatments (A–G). In each plant, three leaves at different heights (low/mid/high)
are inoculated. Each inoculation occurs once at each height, and the assignment of treatments to
plants forms a balanced incomplete block design
Youden Designs
The Latin square design requires identical number of levels for the row and column
factors. We can use two blocking factors with a balanced incomplete block design
to reduce the required number of levels for one of the two blocking factors. These
designs are called Youden squares and only use a fraction of the treatment levels in
each column (resp. row) and the full set of treatments in each row (resp. column). The
idea was first proposed by Youden for studying inoculation of tobacco plants against
the mosaic virus (Youden 1937), and his experiment layout is shown in Fig. 7.16.
In the experiment, seven plants were considered with seven different inoculation
treatments, applied to individual leaves. It was suspected that the height of a plant’s
leaf influences the effect of the virus, and leaf height was used as a second blocking
factor with three levels (low, middle, high). Crossing the two blocking factors leads to
a 3 × 7 layout. The treatment levels are randomly allocated such that each treatment
occurs once per leaf height (i.e., each inoculation occurs once in each column). The
columns form a balanced incomplete block design of k = 7 treatments in b = 7
blocks of size s = 3, leading to r = 3 occurrences of each treatment, and λ = 1
blocks containing each pair of treatments, in accordance with the defining Eq. (7.3).
7.5 Notes and Summary

Notes
Different ways of replication lead to variants of the GRCBD and are presented in
Addelman (1969) and Gates (1995). Variants of blocked designs with different block
sizes are discussed in Pearce (1964), and the question of treating blocks as random
or fixed in Dixon (2016). Evaluation and testing of blocks is reviewed in Samuels
et al. (1991). Excellent discussions of interactions between different types of factors
(e.g., treatment-treatment or treatment-classification) are given in Cox (1984) and
de Gonzalez and Cox (2007).
Blocking designs are also important in animal experiments (Festing 2014; Lazic
and Essioux 2013), and replicating pre-clinical experiments in at least two laborato-
ries can greatly increase reproducibility (Karp 2018).
The idea of a Latin square can be extended to more than two blocking factors;
with three blocking factors, such designs are called Graeco-Latin squares.
Using R
Linear mixed models are covered in R by the lme4 package (Bates et al. 2015) and its
lmer() function, which we use exclusively in this book. This package does not provide
p-values, which we can remedy by additionally loading the lmerTest package
(Kuznetsova et al. 2017). The linear mixed models are specified similarly to linear
models, and random effects are introduced using the (X|G) construct that produces
random effects for a factor X grouped by G. We only consider X=1, which produces
random intercepts (or offsets) for each level of the factor G. Note that while crossed
random factors R and C can be specified in aov() using Error(R+C), the equiv-
alent notation (1|R+C) is not allowed in lmer(), and we use (1|R)+(1|C)
instead. A comprehensive textbook on linear mixed models in R is Galecki and
Burzykowski (2013); it captures many more uses of these models, such as analysis
of longitudinal data.
Latin squares, balanced incomplete block designs and Youden designs are conve-
niently found and randomized by the functions design.lsd(), design.bib()
and design.youden() in the agricolae package. For our example,
design.bib(trt=c("Placebo", "D1", "D2"), k=2) yields our first
BIBD with k = 3 and s = 2, and design.youden(trt=LETTERS[1:7],
r=3) generates the Youden design of Fig. 7.16.
Contrast analysis is based on either aov() or lmer() for estimating the linear
model, and estimated marginal means from emmeans(), where results from aov()
are based exclusively on the intra-block information and can differ from those based
on lmer().
Summary
Crossing a unit factor with the treatment structure leads to a blocked design, where
each treatment occurs in each level of the blocking factor. This factor organizes the
experimental units into groups, and treatment contrasts can be calculated within each
group before averaging over groups. This effectively removes the variation captured
by the blocking factor from any treatment comparisons. If experimental units are
more similar within the same group than between groups, then this strategy can
lead to substantial increase in precision and power, without increasing the sample
size. The price we pay is slightly larger organizational effort to create the groups,
randomize the treatments independently within each group, and to keep track of
which experimental unit belongs to which group for the subsequent analysis.
The most common blocking design is the randomized complete block design,
where each treatment occurs once per block. Its analysis requires the assumption of
no block-by-treatment interaction, which the experimenter can ensure by suitable
choice of the blocking factor. The efficiency of blocking is evaluated by appropriate
effect sizes, such as the proportion of variation attributed to the blocking. It typically
deteriorates if the block size becomes too large, since experimental units then become
more heterogeneous. A balanced incomplete block design allows blocking of simple
treatment structures if only a subset of treatments can be accommodated in each
block.
Multiple blocking factors can be introduced in the unit structure. Nesting them
enables estimation of their respective variance components, while crossing leads to
row-column designs that control for two sources of variation simultaneously.
Blocked designs yield ANOVA results with multiple error strata, and only the
lowest—within-block—stratum is typically used for analysis. Linear mixed models
account for all information, and results might differ slightly from an ANOVA if the
design is not fully balanced.
References
Addelman, S. (1969). “The generalized randomized block design”. In: The American Statistician
23.4, pp. 35–36.
Bates, D. et al. (2015). “Fitting linear mixed-effects models using lme4”. In: Journal of Statistical
Software 67.1, pp. 1–48.
Cochran, W. G. and G. M. Cox (1957). Experimental Designs. John Wiley & Sons, Inc.
Cox, D. R. (1984). “Interaction”. In: International Statistical Review 52, pp. 1–31.
de Gonzalez, A. and D. R. Cox (2007). “Interpretation of interaction: a review”. In: The Annals of
Applied Statistics 1.2, pp. 371–385.
Dixon, P. (2016). “Should Blocks Be Fixed or Random?” In: Conference on Applied Statistics in
Agriculture.
Festing, M. F. W. (2014). “Randomized block experimental designs can increase the power and
reproducibility of laboratory animal experiments”. In: ILAR Journal 55.3, pp. 472–476.
Fisher, R. A. (1971). The Design of Experiments. 8th ed. Hafner Publishing Company, New York.
Galecki, A. and T. Burzykowski (2013). Linear mixed-effects models using R. Springer New York.
Gates, C. E. (1995). “What really is experimental error in block designs?” In: The American
Statistician 49, pp. 362–363.
Karp, N. A. (2018). “Reproducible preclinical research-Is embracing variability the answer?” In:
PLOS Biology 16.3, e2005413.
Kuznetsova, A., P. B. Brockhoff, and R. H. B. Christensen (2017). “lmerTest Package: Tests in
Linear Mixed Effects Models”. In: Journal Of Statistical Software 82.13, e1–e26.
Lazic, S. E. and L. Essioux (2013). “Improving basic and translational science by accounting for
litter-to-litter variation in animal models”. In: BMC Neuroscience 14.37, e1–e11.
Pearce, S. C. (1964). “Experimenting with Blocks of Natural Size”. In: Biometrics 20.4, pp. 699–
706.
Perrin, S. (2014). “Make mouse studies work”. In: Nature 507, pp. 423–425.
Samuels, M. L., G. Casella, and G. P. McCabe (1991). “Interpreting Blocks and Random Factors”.
In: Journal of the American Statistical Association 86.415, pp. 798–808.
Watt, L. J. (1934). “Frequency Distribution of Litter Size in Mice”. In: Journal of Mammalogy 15.3,
pp. 185–189.
Youden, W. J. (1937). “Use of incomplete block replications in estimating tobacco-mosaic virus”.
In: Contributions from Boyce Thompson Institute 9, pp. 41–48.
Chapter 8
Split-Unit Designs
8.1 Introduction
In previous designs, we randomized all treatment factors on the same unit factor and
these designs therefore have a single experimental unit factor. In some experimental
setups, however, some treatment factors are more conveniently applied to groups of
units while others can easily be allocated to individual units within groups.
For example, we might study the growth rate of a bacterium at different con-
centrations of glucose and different temperatures. Using 96-well plates for growing
the bacteria, we can use a different amount of glucose for each well, but incubation
restricts the whole plate to the same temperature. In other words, a well is the exper-
imental unit for the glucose treatment while a plate is the experimental unit for the
temperature treatment.
This kind of design is known as a split-unit (or split-plot) design, where (at least)
two treatment factors (glucose concentration and temperature) are randomized on
different nested unit factors (plates and wells nested in plates). The precision of a
contrast estimate then depends on the treatment factors involved and their respective
experimental units.
A related experimental design is the criss-cross design (commonly called split-
block or strip-plot), where the two experimental unit factors are crossed rather than
nested. This design naturally arises, e.g., when using a multi-channel pipette in a 96-
well experiment: with one treatment per channel, all wells in a row of the plate contain
the same treatment. Using different concentrations for a dilution series randomized
over columns yields the second treatment and experimental unit since all wells in a
column have the same dilution.
Both types of designs require care in the model specification to correctly reflect
the relations between treatments and units. Otherwise, precision and power are over-
stated for some contrasts, resulting in deceptively low uncertainties and erroneous
conclusions.
8.2 Simple Split-Unit Design

We begin our discussion using two nested unit factors and two crossed treatment
factors. A common application of the split-unit design is the accommodation of hard-
to-change factors where applying a different level of one treatment factor is much
more cumbersome than applying a different level of the other. To avoid frequent
simultaneous changes of both levels, we keep the first treatment factor constant for
a group of units, and randomize the second treatment factor within this group. This
sacrifices precision and power for main effects of the first factor for the benefit of
easier implementation. We call the first treatment factor the whole-unit treatment,
and the group unit factor the whole-unit. We randomize the second treatment factor
(the sub-unit treatment) on the nested unit factor (the sub-unit).
8.2.1 Experiment
We revisit our drug-diet example, with three drugs (placebo, D1, D2) combined with
two diets (low fat, high fat) in an experiment with four mice per treatment, 24 mice
in total, using enzyme level as our response.
In previous instances, we randomly assigned a drug-diet combination to each
mouse (or each mouse in each block). To implement such an experiment, we have
to individually apply the assigned drug to each mouse once at the beginning of the
experiment. But we also have to feed each mouse its respective diet throughout the
experiment; even if we hold several mice in one cage, we cannot apply the same diet
to the whole cage, but have to individually feed each mouse within each cage.
A more practical implementation of the experiment uses eight cages with three
mice, but while each mouse per cage is treated with a different drug, all mice in the
same cage are fed the same diet. This makes each cage a block for the drugs, but the
experimental unit for the diets. The experimental layout is shown in Fig. 8.1.
Fig. 8.1 Split-unit experiment with two diets randomized on cages of three mice, and three drugs
randomized on mice within cages
Fig. 8.2 Split-unit design with diets randomized on cages and drugs randomized on mice within
cages. Cages are blocks for the drug treatment, but experimental units for the diet treatment
The Hasse diagrams are constructed using our previous approaches and are shown
in Fig. 8.2. The treatment structure is a 3 × 2 factorial with interaction. The unit
structure consists of (Mouse) (quite literally) nested in (Cage); since we measure
one sample per mouse, (Mouse) is the response unit.
In contrast to previous designs, the two treatment factors now have different exper-
imental units: we feed all mice in a cage the same diet, and Diet is randomized
on (Cage), while Drug is randomized on (Mouse). Each level of the interaction
Diet:Drug is a combination of a diet and a drug, and is randomly assigned to a
mouse. As for most blocked designs, we assume that interactions between unit and
treatment factors are negligible and do not include the factors (Cage:Drug) and
(Cage:Diet:Drug).
The experiment design diagram shows that (Cage) is a blocking factor for Drug
and Drug:Diet; this removes the between-cage variation for contrasts of drug main
effects and drug-diet interactions, but not for contrasts involving only Diet. Likewise,
the presence of more than one mouse per cage looks like pseudo-replication for diet
main effects, and increasing the number of mice per cage does not increase replication
for Diet.
The F-test and contrasts for Diet are based on the degrees of freedom and the
variation associated with (Cage). Power and precision are therefore lower than for
Drug and Drug:Diet, whose F-tests and contrasts are based on (Mouse). The loss
of precision for the whole-unit factor is the principal disadvantage of a split-unit
design. For our purposes, the design is still successful: first, it achieves the desired
simplified implementation of the experiment. Second, our main research question
concerns the effects of the three drugs (the Drug main effect) and their modification
by the diet (the Drug:Diet interaction). Both are based on the full replication and
the lowest residual variance terms in the design. We are not interested in comparing
only the diets themselves and our intended analysis is therefore largely unaffected
by the comparatively low replication and precision for the Diet main effect.
The linear model for this design is
$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + c_{jk} + e_{ijk}\;,$$
where $\alpha_i$, $\beta_j$, and $(\alpha\beta)_{ij}$ are the drug and diet main effect parameters and the interaction parameters, with $i = 1,\dots,3$ and $j = 1,2$. The random variables $c_{jk} \sim N(0, \sigma_c^2)$ are the effects of the eight cages, and $e_{ijk} \sim N(0, \sigma_e^2)$ are the residuals within each cage, with $k = 1,\dots,4$.
We derive the model specification directly from the experiment design diagram
(Fig. 8.2C). All random factors are present in the unit structure, and the Error()
term is therefore Error(cage/mouse) or simply Error(cage). The fixed fac-
tors are all in the treatment structure, which is specified as drug*diet. The model
specification is hence y~drug*diet+Error(cage), leading to an ANOVA table
with two error strata. We find each treatment factor exclusively in the error stratum
of its experimental unit: Diet appears in the (Cage) error stratum and Drug and the
interaction appear in the residual (Mouse) error stratum. The correct denominator
for each F-test is found by starting from the corresponding treatment factor in the
diagram, and following the edges downward until we find the first random factor:
Diet is tested against the variation from cage to cage alone, and the F-test is based
on one numerator and six denominator degrees of freedom.
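A minimal sketch of this analysis in R, assuming the observations are collected in a data frame d with columns y, drug, diet, and cage (the data frame and its column names are assumptions for illustration):

m_aov <- aov(y ~ drug * diet + Error(cage), data = d)
summary(m_aov)  # two error strata: (Cage) for Diet, within-cage for Drug and Drug:Diet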
Comparing the degrees of freedom in this table with those from the diagram con-
firms that our model specification corresponds to the design. Between-cage variation
seems to be the dominant source of random variation in this experiment, and we are
unable to detect any significant main effect for Diet. Both Drug and Drug:Diet are
tested against the lower within-cage variation on twelve degrees of freedom, resulting
in higher power.
8.2 Simple Split-Unit Design 197
An equivalent analysis using the linear mixed model uses the specification y~drug*diet+(1|cage), where we directly find a between-cage variance of $\hat{\sigma}_c^2 = 0.45$, which is about half of the residual variance $\hat{\sigma}_e^2 = 0.9$, leading to an intra-class correlation of ICC = 33%. The cages provide less efficient blocking than litters, but this is unproblematic since we introduced this factor to simplify the experiment implementation, and blocking for the drug effects is simply a welcome benefit. The linear mixed model calculates sums of squares for Diet and (Cage) differently, but F-values and p-values are identical to those of the ANOVA.
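A corresponding sketch with lme4 and lmerTest, reusing the assumed data frame d:

library(lmerTest)              # loads lme4 and adds F-tests for mixed models
m_lmm <- lmer(y ~ drug * diet + (1 | cage), data = d)
VarCorr(m_lmm)                 # between-cage and residual variance components
anova(m_lmm)                   # F- and p-values matching the ANOVA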
The interaction explains about 19% of the variation and its resulting F-test is
statistically significant.
We define and estimate linear contrasts based on a split-unit design in the same
way as before, and can rely on estimated marginal means for providing the required
treatment group means. Contrasts of drugs and of drug-diet interactions profit from
higher replication and lower variance and are more precise than those comparing
diets.
As an illustration, we first compare D1 and D2 to the placebo treatment separately under both diets, using a Dunnett correction for multiple testing.
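A sketch of these contrasts with the emmeans package, assuming the placebo is the first level of the drug factor:

library(emmeans)
em_drug <- emmeans(m_lmm, ~ drug | diet)            # drug means within each diet
contrast(em_drug, method = "trt.vs.ctrl", ref = 1)  # Dunnett-type: D1, D2 vs placebo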
Precision decreases for contrasts that involve comparisons between diets, such as the contrast of the placebo averages between the two diets.
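Such between-diet contrasts can be sketched as:

em_diet <- emmeans(m_lmm, ~ diet | drug)   # diet means within each drug
contrast(em_diet, method = "pairwise")     # low fat vs high fat, e.g., for placebo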
This contrast had the same precision as the four other contrasts in our previous
designs, but has higher standard error and lower precision in this split-unit design.
The fact that several experimental unit factors are present requires particular care
in setting up the analysis, and split-unit experiments are notorious for the many
ways they can be incorrectly designed, analyzed, and interpreted. One problem is
misspecification of the model. Starting from the Hasse diagram, this problem is easily
avoided and the results can be checked by comparing the degrees of freedom between
diagram and ANOVA table.
Another common problem is the inadvertent split-unit design, where an exper-
iment is intended as, e.g., a completely randomized design but implemented as a
split-unit design. Examples are numerous, particularly (but by no means exclusively)
in the engineering literature on process optimization and quality control.
Inadvertent split-unit designs usually originate in the implementation phase, by
deviating from the design table for a more convenient implementation. For example, a
technician might realize that feeding mice by cage rather than individually simplifies
the experiment, and create a split-unit design out of an anticipated CRD.
8.3 A Historical Example—Oat Varieties

In his classic paper ‘Complex Experiments’, Frank Yates reviews and expands the
advances in statistical design of experiments since the 1920s (Yates 1935). The paper
contains an experiment to investigate different varieties of oat using several levels
of nitrogen as fertilizer, which we discuss as an additional example of a split-unit
design with additional blocking.
The experiment is illustrated in Fig. 8.3A: three oat varieties ‘Victory’, ‘Golden
Rain’, and ‘Marvellous’ (denoted $v_1,\dots,v_3$) are applied to plots of sufficient size. Meanwhile, four nitrogen levels $n_1,\dots,n_4$ are applied to smaller patches of land,
denoted subplots (nested in plots). This yields a split-unit design with varieties ran-
domized on plots, and nitrogen on subplots nested in plots.
A common problem in agricultural experimentation is the heterogeneity of the soil,
exposure to sunlight, irrigation, and other factors, which add substantial variability
between plots that are spatially more distant. In this example, the whole experiment
is replicated in six blocks I . . . VI, where each block consists of three neighboring
plots, and varieties are independently randomized to plots within each block. This
increases the replication to achieve precision of contrasts between varieties while
simultaneously controlling for spatial heterogeneity over a large area. The design
is therefore a split-unit design with a randomized complete block design on the
whole-plot level.
Fig. 8.3 A Split-unit design with three oat varieties randomized on plots, four nitrogen amounts randomized on subplots within plots, and replication in six blocks. B Data shown separately for each block. Point: Golden Rain; triangle: Marvellous; square: Victory

The resulting 72 observations are shown in Fig. 8.3B, individually for each block. Block effects are clearly visible, and patterns are very similar between blocks, so assuming no block-by-treatment interaction seems reasonable. We also observe a pronounced trend of increasing yield with increasing nitrogen level, and this trend
seems roughly linear. Differences between oat varieties are less obvious.
The Hasse diagrams are given in Fig. 8.4 and show the simple factorial treatment
structure and the chain of nested unit factors combined into a fairly complex design,
where the whole treatment structure is blocked, and the nitrogen and interaction
treatment factors are blocked by the plots.
The original analysis in 1930 was of course done using an analysis of variance approach. Here, we analyze the experiment using a linear mixed model and derive the model specification directly from the experiment diagram.

Fig. 8.4 Hasse diagram for Yates’ oat variety and nitrogen example with two treatment factors randomized on plots, respectively subplots in plots, and replication in six blocks
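A sketch of this model, assuming a data frame oats with columns yield, variety, nitrogen, block, and plot (names are illustrative):

library(lmerTest)
# whole plots nested in blocks; subplots form the residual stratum
m_oats <- lmer(yield ~ variety * nitrogen + (1 | block) + (1 | block:plot), data = oats)
anova(m_oats)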
The small and non-significant interaction shows that increasing the nitrogen level
has roughly the same effect on yield for all three oat varieties. In addition, differences between oat varieties are small (all pairwise differences are less than 10, at average yields between 80 and 175) and not significant. The nitrogen level, on the other hand,
shows a large and highly significant effect, and higher levels give more yield.
We further quantify these findings by estimating corresponding contrasts and
their confidence intervals. First, we compare the varieties within each nitrogen level
(Table 8.1). In each case, Marvellous provides higher yield than both Golden Rain
and Victory, and Golden Rain gives higher yield than Victory: the varieties have a
clear order, which is stable over all nitrogen levels. As the confidence intervals show,
however, none of the differences are significant, and the precision of estimates is
fairly low.
For quantifying the dose-response relationship between nitrogen level and yield,
we estimate the nitrogen main effect contrasts independently within each oat variety.
We use a polynomial contrast for Nitrogen, which provides information about
linear, quadratic, and cubic components of a dose-response. The results are shown
in Table 8.2.
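A sketch of these contrasts, reusing the model fit m_oats and assuming nitrogen is a factor with four ordered, equally spaced levels:

em_nitro <- emmeans(m_oats, ~ nitrogen | variety)
contrast(em_nitro, method = "poly")   # linear, quadratic, cubic components per variety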
For each variety, we find a substantial linear upward trend. Since both quadratic
and cubic terms are small and not significant, we can ignore all potential curvature in
the trends and arrive at an easy to interpret result: the yield increases proportionally
with increases in nitrogen level. We already determined that the average and nitrogen-
level-specific yields are almost identical between varieties. The current contrasts
Table 8.1 Comparing the three oat varieties within each level of nitrogen
Contrast Estimate se df LCL UCL
Nitrogen: 0.0
Golden Rain—Marvellous −6.67 9.71 30.23 −30.61 17.27
Golden Rain—Victory 8.50 9.71 30.23 −15.44 32.44
Marvellous—Victory 15.17 9.71 30.23 −8.77 39.11
Nitrogen: 0.2
Golden Rain—Marvellous −10.00 9.71 30.23 −33.94 13.94
Golden Rain—Victory 8.83 9.71 30.23 −15.11 32.77
Marvellous—Victory 18.83 9.71 30.23 −5.11 42.77
Nitrogen: 0.4
Golden Rain—Marvellous −2.50 9.71 30.23 −26.44 21.44
Golden Rain—Victory 3.83 9.71 30.23 −20.11 27.77
Marvellous—Victory 6.33 9.71 30.23 −17.61 30.27
Nitrogen: 0.6
Golden Rain—Marvellous −2.00 9.71 30.23 −25.94 21.94
Golden Rain—Victory 6.33 9.71 30.23 −17.61 30.27
Marvellous—Victory 8.33 9.71 30.23 −15.61 32.27
Table 8.2 Orthogonal contrasts for nitrogen levels within each oat variety show linear dose-
response relation
Contrast Estimate se df t value P(>|t|)
Golden Rain
Linear 150.67 24.30 45 6.20 0.00
Quadratic −8.33 10.87 45 −0.77 0.45
Cubic −3.67 24.30 45 −0.15 0.88
Marvellous
Linear 129.17 24.30 45 5.32 0.00
Quadratic −12.17 10.87 45 −1.12 0.27
Cubic 14.17 24.30 45 0.58 0.56
Victory
Linear 162.17 24.30 45 6.67 0.00
Quadratic −10.50 10.87 45 −0.97 0.34
Cubic −16.50 24.30 45 −0.68 0.50
additionally show that the estimates of the three linear components are all within
roughly one standard error of each other, demonstrating a comparable dose-response
relation for all three varieties. This of course agrees with the previous result that there
is no variety-by-nitrogen interaction.
8.4 Variations and Related Designs

We turned our previous drug-diet example into a split-unit design by grouping mice
into cages and using the new grouping factor as experimental unit for the diets.
This creates a whole-plot factor ‘above’ the original experimental unit. Similar to
our discussion of choosing a blocking factor for an RCBD, we can alternatively
sub-divide the original experimental unit further to create a sub-plot factor ‘below’.
To illustrate this idea, we consider the following situation: we start from our
original drug-diet design with factorial treatment structure randomized on mice (a
CRD). Previously, we also considered comparing two sample preparation kits from
vendors A and B based on the enzyme level measurements. Since we already have
our drug-diet experiment planned, we would like to ‘squeeze’ the comparison of the
two kits into that experiment without jeopardizing our main objective of estimating
contrasts of the drug-diet treatments.
The idea is simple: we draw two samples per mouse and randomly assign either kit
A or kit B to each sample. The resulting experiment structure is shown in Fig. 8.5A,
and we recognize it as a split-unit design. Here, the whole-plot unit (Mouse) is
combined with a factorial treatment structure, and the sub-plot unit (Sample) is
nested in (Mouse) to compare levels of Vendor. The resulting treatment structure
is a 3 × 2 × 2 factorial, where we removed all interactions involving Vendor under
the assumption that these are negligible. The original drug-diet experiment is then
unaffected by this augmentation of the design: even if vendor B’s kit is worse, we
still have the full data for vendor A; simply removing the B data yields the data for the originally anticipated design.

Fig. 8.5 A Split-unit design with diets and drugs completely randomized on mice as a CRD and vendor randomized on samples. B Same treatment structure with split-split-unit design
We use the linear mixed model framework to fit the corresponding model with specification y~drug*diet+vendor+(1|mouse) and to estimate the difference between the two vendors.
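As a sketch, assuming a data frame d2 with one row per sample and columns y, drug, diet, vendor, and mouse (reusing the packages loaded above):

m_vendor <- lmer(y ~ drug * diet + vendor + (1 | mouse), data = d2)
anova(m_vendor)   # vendor is compared within mice, on 23 residual degrees of freedom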
This contrast is estimated very precisely with 23 residual degrees of freedom, the
same as for a randomized complete block design with 24 mice as blocks and two
samples per mouse and no other treatment factors. It has much higher precision than
the drug or diet comparisons, because each mouse provides a block for Vendor to
compare the two kits within each mouse.
By introducing three nested unit factors and randomizing one treatment factor on
each, we arrive at a split-split-unit design. Further extensions to arbitrary levels of
nested factors are straightforward.
For example, we combine the split-unit design for drugs and diets with a compar-
ison of the two vendors. The new design is shown in Fig. 8.5B and uses (Cage) as
experimental unit for the hard-to-change factor Diet, (Mouse) in (Cage) as experi-
mental unit for Drug, and (Sample) in (Mouse) in (Cage) to accommodate Vendor
as an additional treatment factor.
From the diagram, we find one random intercept for each cage, leading to a ran-
dom effect term (1|cage), one random intercept for each mouse within a cage,
with (1|cage:mouse), and the omitted (1|cage:mouse:sample). A lin-
ear mixed model that ignores all interactions of Vendor with other factors is there-
fore specified as y~drug*diet+vendor+(1|cage)+(1|cage:mouse) and
yields the corresponding ANOVA table.
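A sketch, reusing the naming conventions above with an additional cage column in d2:

m_split2 <- lmer(y ~ drug * diet + vendor + (1 | cage) + (1 | cage:mouse), data = d2)
anova(m_split2)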
The results are very similar to our split-unit design without the additional Vendor
treatment. Interactions involving Vendor can of course be introduced, and lead to a
more complex analysis and interpretation of results.
In contrast to the split-unit design, we cross the two unit factors in a criss-cross design
and combine this unit structure with a factorial treatment structure. The simplest
instance of a criss-cross design is a row-column design with a rows and b columns,
where a treatment factor with a levels is randomized on the rows, and a crossed
treatment factor with b levels is randomized on columns. This treatment structure
is an a × b factorial, but each treatment factor has its own experimental unit. In
contrast to a split-unit design, the interaction of the two treatment factors does not
share its experimental unit with any of the main effect factors. For a two-way factorial
treatment structure, the criss-cross design therefore has three experimental units and
such a design needs to be replicated several times to arrive at suitable residual degrees
of freedom for all experimental unit factors. Usually, the rows and columns are
independently replicated, and randomization is done independently for each replicate
of the row-column criss-cross design.
Fig. 8.6 Criss-cross experiment layout: two replicates of four drugs (background shade) random-
ized on rows, dilutions (numbers) randomized on columns. Two replicate plates shown, random-
ization of rows kept constant while dilutions are randomized independently
The criss-cross design rather naturally arises in experiments on 96-well plates when
using multi-channel pipettes; common multi-channel pipettes offer eight channels
such that eight consecutive wells can be handled simultaneously.
This setup is advantageous in assays based on dilution series, where up to eight
different conditions are subjected to twelve dilutions each. A typical response is
the optical density in each well, for example. Using one pipette channel for each
condition allows randomization of the conditions on the rows of each plate, but the
same condition is then assigned to all wells in the same row. Similarly, the dilution steps can be randomized on columns, but all eight wells in a column then share the same dilution. This arrangement leads to a criss-cross design with conditions randomized
on rows by randomly assigning them to the channels of the pipette at the beginning
of the experiment, and dilutions randomized on columns.
The plate layouts in Fig. 8.6 show a version of this strategy for comparing the effect
of four drugs on bacterial growth in twelve glucose concentrations in the growth
medium. Two channels are randomly assigned to each drug and each glucose level is
used on one full column; we use two plates to provide higher replication. For easier
implementation, the assignment of drugs to pipette channels is only randomized
once, and then kept identical for both plates. The glucose levels are randomized
independently to columns for each plate. This provides an interesting variant of the
criss-cross design.
The Hasse diagrams for this example are shown in Fig. 8.7. The treatment structure
is a simple two-way factorial design of drug and glucose. In the unit structure,
columns are nested in plates since randomization is independent between plates,
but rows are crossed with plates since any row in the first plate is identified with the
corresponding row in the second plate. We omitted several interaction factors that we
assume negligible for brevity, but the experiment structure is already rather complex.
From the experiment diagram, we derive the model specifications y ~ drug * glucose + Error(plate/col + row) for an ANOVA and y ~ drug * glucose + (1|plate) + (1|plate:col) + (1|row) for an equivalent linear mixed model.

Fig. 8.7 Criss-cross design arising from use of a multi-channel pipette. Four drugs are tested with 12 glucose concentrations on each plate, and two plates provide replication. Use of an 8-channel pipette allows two replicates of each drug; random assignment of drugs to channels is kept constant for both plates, but assignment of glucose concentrations to columns is randomized independently
The denominator of the three treatment F-tests corresponds to the closest random
factor below its treatment factor. With several random factors crossed and nested,
traditional ANOVA and linear mixed model results differ; we would prefer the latter.
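A sketch of the mixed-model analysis, assuming a data frame wells with columns od, drug, glucose, plate, row, and col:

m_cc <- lmer(od ~ drug * glucose + (1 | plate) + (1 | plate:col) + (1 | row), data = wells)
anova(m_cc)   # each F-test uses the stratum of its own experimental unit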
A useful design for increasing precision and power is the cross-over design, where
different treatments are assigned in sequence to the same experimental unit. As a
basic example, we consider an experiment for determining the effect of the low- and
high-fat diet (with no drug treatment) on the enzyme levels. We use six mice, which
we split into two groups: we feed the mice in the first group on the low-fat diet for
some time, and then switch them to the high-fat diet. In the second group, we reverse
the order and feed first the high-fat diet, and then the low-fat diet. This is a two-period
two-treatment cross-over design. The experiment is illustrated in Fig. 8.8 for three
mice per group.
Fig. 8.8 Cross-over experiment with two diets assigned in one of two orders
Fig. 8.9 A Cross-over design uses two diet treatments sequentially on same mouse to provide
within-mouse contrasts. B Pretest-posttest design with measurement before and after application of
treatment to consider mouse-specific baseline response values. C Longitudinal repeated measures
design to allow multiple measurements of same mouse at different time-points.
Before each diet treatment, we feed all mice with a standard diet. This should
allow the enzyme level to reset to ‘normal’, such that the first diet does not affect
observations with the second diet. The observations are taken after several days on
the respective diet, with one observation per mouse per diet.
The experiment diagram is shown in Fig. 8.9A. The treatment factor Sequence
denotes the group: each mouse is assigned to either the low-high (L-H) sequence
of diets, or the high-low (H-L) sequence. The sequence is crossed with the second
treatment factor Diet, since each diet occurs in each sequence. Each level of the
interaction Sequence:Diet corresponds to the application of one diet at a specific part
in each sequence (the period). Each mouse is randomly assigned to one sequence,
so (Mouse) is the experimental unit for Sequence. Each sample corresponds to a
combination of a period and a diet, and is the experimental unit for Diet and the
interaction.
In the corresponding linear model, $\mu$ is the grand mean, $\alpha_i$ are the effects of the low- and high-fat diets, $\pi_j$ is the effect of period $j$, and $\gamma_k$ are the residual carry-over effects from the previous diet not eliminated by the washout period between diets. Writing $\mu_{LH1}$ for the expected response in the first period of the low-high sequence, and similarly for the other sequence-period combinations, the Sequence main effect is
$$\frac{1}{2}(\mu_{LH1} + \mu_{LH2}) - \frac{1}{2}(\mu_{HL1} + \mu_{HL2}) = \frac{1}{2}(\gamma_L - \gamma_H)\;,$$
with associated hypothesis $H_0: \gamma_L = \gamma_H$ that the two carry-over effects are equal (but not necessarily zero!). This test essentially asks if there is a difference between the two orders in which the diets are applied. If both carry-over effects are equal, then no difference exists since then $\gamma_L = \gamma_H = \gamma$, and we can merge $\gamma$ with the period effect $\pi_2$ (all observations are higher or lower by the same amount in the first compared to the second period).
The Diet main effect is
$$\frac{1}{2}(\mu_{LH1} + \mu_{HL2}) - \frac{1}{2}(\mu_{HL1} + \mu_{LH2}) = \alpha_L - \alpha_H - \frac{1}{2}(\gamma_L - \gamma_H)\;,$$
and is biased if the two carry-over effects are not equal. Note that we can in principle estimate and test the bias from the Sequence main effect, but this effect has the lowest replication in the design, and hence low precision and power. In the case of unequal carry-over effects, one often restricts the analysis to data from the first period alone, and estimates the treatment effect via $(\mu_{LH1} - \mu_{HL1})/2$.
The Sequence:Diet interaction effect is
$$\frac{1}{2}(\mu_{LH1} - \mu_{LH2}) - \frac{1}{2}(\mu_{HL1} - \mu_{HL2}) = \pi_1 - \pi_2 - \frac{1}{2}(\gamma_L + \gamma_H)\;,$$
and is biased whenever there are—even equal—carry-over effects.
Cross-over designs form an important class of designs and the two-period two-
treatment design is only the simplest instance. It does not allow estimation of the
carry-over effects, which is a major weakness in practice where carry-over can often
be suspected and the experiment should provide information about its magnitude.
Better variants of the cross-over design that allow explicit estimation of the carry-over should therefore be preferred whenever feasible. One variant also uses two
periods, but includes the two combinations H-H and L-L in addition to H-L and L-H.
Carry-over can then be estimated by comparing the H-H to the L-H observations,
for example. Another variant extends the design to three periods, with treatment
sequences including H-H-L and L-H-L, for example, such that one treatment is
observed twice in each sequence. The references in Sect. 8.5 provide more in-depth
coverage of different cross-over designs and associated analyses.
In the pretest-posttest design (Fig. 8.9B), each mouse is measured once before and once after applying its drug treatment. Of greatest interest is usually the PrePost:Drug interaction, which shows how different the changes of enzyme levels between baseline and post-treatment measurement are between drugs. This is the drug effect corrected for the baseline measurement. We can replicate the corresponding F-test as follows: for each mouse $i$, calculate the difference $\Delta_i = y_{i,\text{post}} - y_{i,\text{pre}}$ of the post-treatment and pre-treatment responses. This ‘adjusts’ the response to the treatment for the baseline value. Now, we perform a one-way ANOVA with Drug as the treatment factor and $\Delta_i$ as the response variable. The resulting F-ratio and p-value are identical to those of the PrePost:Drug test.
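As a sketch, assuming a wide-format data frame pp with one row per mouse and columns y_pre, y_post, and drug:

pp$delta <- pp$y_post - pp$y_pre        # per-mouse baseline-adjusted response
summary(aov(delta ~ drug, data = pp))   # F-test identical to the PrePost:Drug test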
Split-unit designs are sometimes still used for repeated measures and longitudinal designs, in which multiple response variables are measured for the same experimental unit, or the same response variable is measured on multiple occasions for the same experimental unit. Both designs thus have a more complex response structure than the classical approach can handle.
An example of a longitudinal design is shown in Fig. 8.9C, where three drugs are
randomized on two mice each, and each mouse is then measured at three time-points.
In this design, we randomize Drug on (Mouse), and the fixed unit factor Time groups
the samples from each mouse. We can then relate observations from the same mouse
to each other to analyze the temporal profile of each mouse. The advantage of the
longitudinal design is that observations can be contrasted within each mouse, and
the between-mouse variation is removed from such contrasts.
The main caveat of this approach is the crude approximation of the complex
longitudinal response structure by a fixed block factor Time. This assumes that any
pair of time-points has the same correlation, while observations closer in time often
tend to have stronger correlations than those further apart. This caveat does not apply
to the pretest-posttest designs, where only two time-points are considered.
8.5 Notes and Summary

Notes
Insightful accounts on split-unit designs are Federer (1975) and Box (1996), and a
gentle introduction is given in Kowalski and Potcner (2003). Recent developments in
split-unit designs are reviewed in Jones and Nachtsheim (2009). Analysis of split-unit
designs with more complex whole-unit and sub-unit treatment designs is discussed
in Goos and Gilmour (2012), and power analysis in Kanji and Liu (1984). Increasing
availability of liquid-handling robots renewed interest in split-unit and criss-cross
designs for microplate-based experiments (Buzas et al. 2011).
References
Abdi, H. (2010). “The Greenhouse-Geisser correction”. In: Encyclopedia of Research Design. Ed.
by Neil Salkind. SAGE Publications, Inc.
Bonate, P. L. (2000). Analysis of Pretest-Posttest Designs. Chapman & Hall/CRC.
Box, G. E. P. (1996). “Quality quandaries: Split plot experiments”. In: Quality Engineering 8.3, pp.
515–520.
Brogan, D. R. and M. H. Kutner (1980). “Comparative Analyses of Pretest-Posttest Research
Designs”. In: The American Statistician 34.4, pp. 229–232.
Buzas, J. S., C. G. Wager, and D. M. Lansky (2011). “Split-Plot Designs for Robotic Serial Dilution
Assays”. In: Biometrics 67.4, pp. 1189–1196.
Diggle, P. et al. (2013). Analysis of Longitudinal Data. 2nd. Oxford University Press.
Federer, W. T. (1975). “The misunderstood split plot”. In: Applied Statistics. Ed. by R. P. Gupta.
Amsterdam: North-Holland, pp. 9–39.
Finney, D. J. (1956). “Cross-Over Designs in Bioassay”. In: Proceedings of the Royal Society B:
Biological Sciences 145.918, pp. 42–61.
Fitzmaurice, G. M., N. M. Laird, and J. H. Ware (2011). Applied longitudinal analysis. John Wiley
& Sons, Inc.
Goos, P. and S. G. Gilmour (2012). “A general strategy for analyzing data from split-plot and
multistratum experimental designs”. In: Technometrics 54.4, pp. 340–354.
Greenhouse, S. W. and S. Geisser (1959). “On methods in the analysis of profile data”. In: Psychometrika 24.2, pp. 95–112.
Huynh, H. and L. S. Feldt (1976). “Estimation of the Box correction for degrees of freedom from
sample data in randomized block and split-plot designs”. In: Journal Of Educational Statistics
1.1, pp. 69–82.
Johnson, D. E. (2010). “Crossover experiments”. In: Wiley Interdisciplinary Reviews: Computational Statistics 2.5, pp. 620–625.
Jones, B. and C. J. Nachtsheim (2009). “Split-plot designs: what, why, and how”. In: Journal of
Quality Technology 41.4, pp. 340–361.
Kanji, G. K. and C. K. Liu (1984). “Power Aspects of Split-Plot Designs”. In: Journal of the Royal
Statistical Society. Series D 33.3, pp. 301–311.
Kowalski, S. M. and K. Potcner (2003). “How to recognize a split-plot experiment”. In: Quality
Progress 36.11, pp. 60–66.
Senn, S. J. (1994). “The AB/BA crossover: past, present and future?” In: Statistical Methods in
Medical Research 3, pp. 303–324.
Senn, S. J. (2002). Cross-over trials in clinical research. 2nd. Wiley, New York, p. 364.
Shuster, J. J. (2017). “Linear combinations come alive in crossover designs”. In: Statistics in
Medicine 36.24, pp. 3910–3918.
Yates, F. (1935). “Complex Experiments”. In: Journal of the Royal Statistical Society 2.2, pp.
181–247.
Chapter 9
Many Treatment Factors: Fractional
Factorial Designs
9.1 Introduction
Factorial treatment designs are necessary for estimating factor interactions and offer
additional advantages (Chap. 6). However, their implementation is challenging if we
consider many factors or factors with many levels, because the number of treatments
might then require prohibitive experiment sizes. Large factorial experiments also
pose problems for blocking, if reasonable block sizes that ensure homogeneity of the
experimental material within a block are smaller than the number of treatment level
combinations.
For example, a factorial treatment structure with five factors of two levels each has $2^5 = 32$ treatment combinations. An experiment with 32 experimental units then has no residual degrees of freedom, but two full replicates of this design already require 64 experimental units. If each factor has three levels, the number of treatment combinations increases drastically to $3^5 = 243$.
On the other hand, we might justify the assumption of effect sparsity: high-order
interactions are often negligible, especially if interactions of lower orders already
have small effect sizes. The key observation for reducing the experiment size is
that a large portion of model parameters relate to higher-order interactions: in a
$2^5$-factorial, there are 32 model parameters: one grand mean, five main effects, 10
two-way interactions, 10 three-way interactions, five four-way interactions, and one
five-way interaction. The number of higher-order interactions and their parameters
grows fast with an increasing number of factors, as shown in Table 9.1 for factorials with
two factor levels and 3 to 7 factors.
If we ignore three-way and higher interactions in the example, we remove 16
parameters from the model equation and only require 16 observations for estimating
the remaining model parameters; this is known as a half-fraction of the $2^5$-factorial.
Of course, the ignored interactions do not simply vanish, but their effects are now
confounded with those of lower-order interactions or main effects. The question then
arises: which 16 out of the 32 possible treatment combinations should we consider
such that no effect of interest is confounded with a non-negligible effect?
9.2 Aliasing in the $2^3$-Factorial

9.2.1 Introduction
We begin our discussion with the simple example of a $2^3$-factorial treatment structure in a completely randomized design. We denote the treatment factors as A, B, and C and their levels as $A$, $B$, and $C$ with values $-1$ and $+1$, generically called the low and high level, respectively. Recall that main effects and interactions (of any order) all have one degree of freedom in a $2^k$-factorial; hence, we can encode the two independent levels of an interaction as $-1$ and $+1$. We define the level of an interaction by multiplying the levels of the constituent factors: for $A = -1$, $B = +1$, $C = -1$, the level of A:B is $AB = A \cdot B = -1$ and the level of A:B:C is $ABC = A \cdot B \cdot C = +1$.
It is also convenient to use an additional shorthand notation for a treatment combi-
nation, where we use a character string containing the lower-case letter of a treatment
factor if it is present on its high level, and no letter if it is present on its low level. For
example, we write abc if A, B, C are on level +1, and all potential other factors are
on the low level −1, and ac if A and C are on the high level, and B on its low level.
We denote a treatment combination with all factors on their low level by (1). For a
$2^3$-factorial, the eight different treatments are then (1), a, b, c, ab, ac, bc, and abc.
Table 9.2 Eight treatment level combinations for the $2^3$-factorial with corresponding levels of interactions and shorthand notation
Carbon Nitrogen Vitamin A B C AB AC BC ABC Shorthand
Glc low Mix 1 −1 −1 −1 +1 +1 +1 −1 (1)
Glc low Mix 2 −1 −1 +1 +1 −1 −1 +1 c
Glc high Mix 1 −1 +1 −1 −1 +1 −1 +1 b
Glc high Mix 2 −1 +1 +1 −1 −1 +1 −1 bc
Fru low Mix 1 +1 −1 −1 −1 −1 +1 +1 a
Fru low Mix 2 +1 −1 +1 −1 +1 −1 −1 ac
Fru high Mix 1 +1 +1 −1 +1 −1 −1 −1 ab
Fru high Mix 2 +1 +1 +1 +1 +1 +1 +1 abc
For example, testing compositions for growth media with factors Carbon with
levels Glc (glucose) and Fru (fructose), Nitrogen with levels low and high, and
Vitamin with levels Mix 1 and Mix 2 leads to a $2^3$-factorial with the 8 possible
treatment combinations shown in Table 9.2.
For example, we estimate the A main effect as
$$\text{A main effect} = \frac{1}{4}\,\big( (a - (1)) + (ab - b) + (ac - c) + (abc - bc) \big)\;.$$
This is equivalent to calculating the Carbon main effect by averaging the difference between the sum of observations with fructose (‘high’) and the sum of observations with glucose (‘low’). In terms of Table 9.2, this amounts to adding all those observations for which $A = +1$, namely $a, ab, ac, abc$, and subtracting the sum of all observations for which $A = -1$, namely $(1), b, c, bc$. This yields
$$\text{A main effect} = \frac{1}{4}\Big(\underbrace{(a + ab + ac + abc)}_{A=+1} - \underbrace{((1) + b + c + bc)}_{A=-1}\Big)\;,$$
which we see is simply the previous calculation with terms grouped differently.
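As a small numerical sketch, with responses stored in a named vector y (the values are arbitrary and for illustration only; names follow the shorthand notation):

y <- c("(1)" = 10, a = 14, b = 11, c = 9, ab = 16, ac = 12, bc = 10, abc = 15)
(sum(y[c("a", "ab", "ac", "abc")]) - sum(y[c("(1)", "b", "c", "bc")])) / 4  # A main effect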
For example, the B:C interaction effect can be estimated separately within each level of A as
$$\frac{1}{2}\big((abc - ab) - (ac - a)\big) \text{ for } A = +1 \quad\text{and}\quad \frac{1}{2}\big((bc - b) - (c - (1))\big) \text{ for } A = -1\;,$$
and the overall estimate is the average of these two quantities.
This value is equivalently found by taking the difference between observations with
BC = +1 (the interaction at its ‘high’ level) and BC = −1 (the interaction at its
‘low’ level) and averaging. The other interaction effects are estimated by contrasting
the corresponding observations for AB = ±1, AC = ±1, and ABC = ±1, respec-
tively.
We are interested in reducing the size of the experiment and for reasons that will
become clear shortly, we choose a design based on measuring the response for four out
of the eight treatment combinations. This will only allow estimation of four param-
eters in the linear model, and exactly which parameters can be estimated depends
on the treatments chosen. The question then is which four treatment combinations
should we select?
We investigate three specific choices to get a better understanding of the conse-
quences for effect estimation. The designs are illustrated in Fig. 9.1, where treatment
level combinations form a cube with eight vertices, from which four are selected in
each case.
First, we arbitrarily select the four treatment combinations (1), a, b, ac (Fig. 9.1A).
With this choice, none of the main effects or interaction effects can be estimated using
all four observations. For example, an estimate of the A main effect involves a − (1),
ab − b, ac − c, and abc − bc, but only a − (1) is available in this experiment. Com-
pared to a factorial experiment in four runs, this choice of treatment combinations
thus allows using only one-half of the available data for estimating this effect. If we
follow the above logic and contrast the observations with A at the high level with
those with A at the low level, thereby using all data, then the main effect is estimated
as (ac + a) − (b + (1)) and leads to a biased and incorrect estimate of the main
effect, since the other factors are at ‘incompatible’ levels. Similar problems arise for the B and C main effects, where only $b - (1)$ and $ac - a$, respectively, are available. None of the interactions can be estimated from these data, and we are left with a very unsatisfactory muddle of biased estimates.

Fig. 9.1 Subsets of a $2^3$-factorial. A An arbitrary choice of treatment combinations leads to problems in estimating any effects properly. B One variable at a time (OVAT) design. C Keeping one factor at a constant level confounds this factor with the grand mean and creates a $2^2$-factorial of the remaining factors
Next, we try to be more systematic and select the four treatment combinations
(1), a, b, c (Fig. 9.1B) where all factors occur on low and high levels. Again, main
effect estimates are based on half of the data for each factor, but their calculation is
now simpler: a − (1), b − (1), and c − (1), respectively. Each estimate involves the
same level (1) and only two of four observations are used. This design resembles
a one variable at a time experiment, where effects can be estimated individually
for each factor, but no estimates of interactions are available. All advantages of a
factorial treatment design are then lost.
Finally, we select the four treatment combinations (1), b, c, bc with A on the low
level (Fig. 9.1C). This design is effectively a $2^2$-factorial with treatment factors B and
C and allows estimation of their main effects and their interaction, but no information
is available on any effects involving the treatment factor A. For example, we estimate
the B main effect as (bc + b) − (c + (1)) using all data, and the B:C interaction
as (bc − b) − (c − (1)). If we look more closely into Table 9.2, we find a simple
confounding structure: the level of B is always the negative of A:B. In other words,
the two effects are completely confounded in this design, and (bc + b) − (c + (1))
is in fact an estimate of the difference of the B main effect and the A:B interaction.
Similarly, C is the negative of A:C, and B:C is the negative of A:B:C. Finally,
the grand mean is confounded with the A main effect; this makes sense since any
estimate of the overall average is based only on the ‘low’ level of A.
Neither of the previous three choices provides a convincing reduction of the factorial
design. We now discuss a fourth possibility, the half-replicate of the $2^3$-factorial, called a $2^{3-1}$-fractional factorial. The main idea is to deliberately alias a high-order
interaction with the grand mean. For a $2^3$-factorial, we alias the three-way interaction
A:B:C by selecting either those four treatment combinations that have ABC = −1
or those that have ABC = +1. We call the corresponding equation the generator
of the fractional factorial; the two possible sets are shown in Fig. 9.2. With either
choice, we find three more effect aliases by consulting Table 9.2. For example, using
ABC = +1 as our generator yields the four treatment combinations a, b, c, abc and
we find that A is completely confounded with B:C, B with A:C, and C with A:B.
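A small base-R sketch that constructs this half-replicate from the generator (object names are illustrative):

d8 <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))  # full 2^3 factorial
half <- d8[with(d8, A * B * C) == +1, ]  # generator ABC = +1: runs a, b, c, abc
all(with(half, A == B * C))              # TRUE: A is completely aliased with B:C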
In this design, any estimate thus corresponds to the sum of two effects. For exam-
ple, (a + abc) − (b + c) estimates the sum of A and B:C: first, the main effect of
A is found as the difference of the runs a and abc with A on its high level, and the
runs b and c with A on its low level: (a + abc) − (b + c). Second, we contrast runs
with B:C on the high level (a and abc) with those with B:C on its low level (b and
c) for estimating the B:C interaction effect, which is again (a + abc) − (b + c).
The fractional factorial based on this generator hence deliberately aliases each
main effect with a two-way interaction, and the grand mean with the three-way
interaction. Each estimate is then the sum of the two aliased effects. Moreover, we
note that by pooling the treatment combinations over levels of one of the three factors,
we create three different $2^2$-factorials for the two remaining factors, as seen in Fig. 9.2.
For example, ignoring the level of C leads to the full factorial in A and B. This is a
consequence of the aliasing, as C is completely confounded with A:B.
The confounding of different effects can be described by the alias sets, where each set contains the effects that cannot be distinguished. For the generator $ABC = +1$, the alias sets are
$$\{+1,\, ABC\}\;,\quad \{A,\, BC\}\;,\quad \{B,\, AC\}\;,\quad \{C,\, AB\}\;,$$
and for the generator $ABC = -1$, the alias sets are
$$\{+1,\, -ABC\}\;,\quad \{A,\, -BC\}\;,\quad \{B,\, -AC\}\;,\quad \{C,\, -AB\}\;.$$
Estimation of the A main effect, for example, is only possible if the B:C interac-
tion is zero in line with our previous observations. A more detailed discussion of
confounding in terms of the parameters of the underlying linear model is given in
Sect. 9.9.
Fig. 9.2 The two half-replicates of a 23 -factorial with three-way interaction and grand mean con-
founded. Any projection of the design to two factors yields a full 22 -factorial design and main effects
are confounded with two-way interactions. A Design based on low level of three-way interaction;
B Complementary design based on high level
9.3 Aliasing in the $2^k$-Factorial

The half-replicate of a $2^3$-factorial still does not provide an entirely convincing example for the usefulness of fractional factorial designs due to the complete confounding
of main effects and two-way interactions, both of which are typically of great interest.
With more factors in the treatment structure, however, we are able to alias interac-
tions of higher order and confound low-order interactions of interest with high-order
interactions that we might assume negligible.
For example, the generator
$$ABC = +1$$
selects all those rows in Table 9.2 for which the relation is true and A:B:C is on the high level.
A generator determines the effect confounding of the experiment: the generator
itself is one confounding, and ABC = +1 describes the complete confounding of
the three-way interaction A:B:C with the grand mean.
From the generator, we can derive all other confoundings by simple algebraic
manipulation. By formally ‘multiplying’ the generator with an arbitrary word, we
find a new relation between effects. In this manipulation, the multiplication with the
letter +1 leaves the equation unaltered, multiplication with −1 inverses signs, and
a product of two identical letters yields +1. For example, multiplying our generator
ABC = +1 with the word B yields
ABC · B = (+1) · B ⇐⇒ AC = B .
In other words, the B main effect is confounded with the A:C interaction. Similarly,
we find AB = C and BC = A as two further confounding relations by multiplying
the generator with C and A, respectively.
Further trials with manipulating the generator show that we can obtain no addi-
tional relations. For example, multiplying ABC = +1 with the word AB yields
C = AB again, and multiplying this relation with C yields $C \cdot C = AB \cdot C \iff +1 = ABC$, the original generator. This means that indeed, we have fully confounded four pairs of effects and no others. In general, a generator for a $2^k$-factorial produces $2^k/2 = 2^{k-1}$ alias relations between factors, so we have a direct way to check if we found all. In our example, $2^3/2 = 4$, so our relations $ABC = +1$,
AB = C, AC = B, and BC = A cover all existing aliases.
This property also means that we arrive at exactly the same set of alias relations,
no matter which of them we choose as our generator. For example, instead of ABC =
+1, we might choose A = BC; this selects the same set of rows and implies the same
set of confounding relations. Usually, we use a generator that aliases the highest-order
interaction with the grand mean and yields the least severe confounding.
Generators provide a systematic way for aliasing that results in interpretable effect
estimates with known confoundings. A generator selects one-half of the possible
treatment combinations, and this is the reason why we set out to choose four rows in
our first example.
We briefly note that our first and second choices in Sect. 9.2.3 are not based on a
generator, leaving us with a complex partial confounding of effects. In contrast, our
third choice selected all treatments with A on the low level and does have a generator,
namely
A = −1 .
Algebraic manipulation then shows that this design implies the additional three
alias relations AB = −B, AC = −C, and ABC = −BC. In other words, any effect
involving the factor A is confounded with another effect not involving that factor,
which we easily verify from Table 9.2.
9.3.2 Half-Replicates
Generators and their algebraic manipulation provide an efficient way for finding the
confoundings in higher-order factorials, where looking at the corresponding table
of treatment combinations quickly becomes unfeasible. As we can see from the
algebra, the most useful generator is always confounding the grand mean with the
highest-order interaction.
For four factors, this generator is $ABCD = +1$ and we expect that there are $2^4/2 = 8$ relations in total. Multiplying with any letter reveals that main effects are then confounded with three-way interactions, such as $ABCD = +1 \iff BCD = A$ after multiplying with $A$, and similarly $B = ACD$, $C = ABD$, and $D = ABC$. Moreover, by multiplication with two-letter words we find that all two-way interactions are confounded with other two-way interactions, namely via the three relations $AB = CD$, $AC = BD$, and $AD = BC$. We thus found eight relations and can be
sure that there are no others.
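The same construction as before verifies these aliases numerically (a sketch, with illustrative names):

d16 <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1), D = c(-1, 1))
half16 <- d16[with(d16, A * B * C * D) == +1, ]  # generator ABCD = +1: 8 runs
all(with(half16, A == B * C * D))                # TRUE: A aliased with B:C:D
all(with(half16, A * B == C * D))                # TRUE: A:B aliased with C:D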
The resulting confounding is already an improvement over fractions of the $2^3$-factorial, especially if we can make the argument that three-way interactions can be
neglected and we thus have direct estimates of all main effects. If we find a significant
and large two-way interaction—A:B, say—then we cannot distinguish if it is A:B,
its alias C:D, or a combination of the two that produces the effect. Subject-matter
considerations might be available to separate these possibilities. If not, there is at least
a clear goal for a subsequent experiment to disentangle the two interaction effects.
Things improve further for five factors and the generator $ABCDE = +1$, which reduces the number of treatment combinations from $2^5 = 32$ to $2^{5-1} = 16$. Now,
main effects are confounded with four-way interactions, and two-way interactions
are confounded with three-way interactions. Invoking the principle of effect sparsity
and neglecting the three- and four-way interactions yields main effects and two-way
interactions as the estimated parameters.
Main effects and two-way interactions are confounded with interactions of order
four or higher for factorials with six factors and more, and we can often assume that
these interactions are negligible.
9.4 A Real-Life Example—Yeast Medium Composition

The growth of a yeast culture can be followed by recording its optical density over time. We use the increase in optical density (OD) between onset of growth and flattening of the growth curve at the diauxic shift as a rough but sufficient approximation for the increase in the number of cells.
To determine how the five medium components influence the growth of the yeast
culture, we use the composition of a standard medium as a reference point, and
simultaneously alter the concentrations of the five components. For this, we select
two concentrations per component, one lower, the other higher than the standard,
and consider these as two levels for each of five treatment factors. The treatment
structure is then a $2^5$-factorial and would in principle allow estimation of the main
effects and all two-, three-, four-, and five-factor interactions when we use all 32
possible combinations. However, a single replicate requires two-thirds of a 48-well
plate and this is undesirable because we would like sufficient replication and also
be able to compare several yeast strains in the same plate. Both requirements can
be accommodated by using a half-replicate of the $2^5$-factorial with 16 treatment
combinations, such that three independent experiments fit on a single plate.
A generator $ABCDE = +1$ confounds the main effects with four-way inter-
actions, which we consider negligible for this experiment. Still, two-way interac-
tions are confounded with three-way interactions, and in the first implementation
we assume that three-way interactions are much smaller than two-way interactions.
We can then interpret main effect estimates directly, and assume that estimates of
parameters involving two-way interactions have only small contributions from the
corresponding three-way interactions. The design is shown in Table 9.3.
We use two replicates of this design for adequate sample size, requiring 32 wells
in total. This could also accommodate the full $2^5$-factorial, but we would then have
no replication for estimating the residual variance. Moreover, our duplicate of the
same design enables inspection of reproducibility of measurements and detection of
errors and aberrant observations. The observed increase in optical density is shown
in Table 9.3 with columns ‘OD 1’ and ‘OD 2’ for the two replicates.
Clearly, the medium composition has a huge impact on the resulting growth,
ranging from a minimum of close to zero to a maximum of 216.6. The original
medium has an average ‘growth’ of OD ≈ 80, and this experiment already reveals
a condition with approximately 2.7-fold increase. We also see that observations with
N2 at the low level are abnormally low in the first replicate and we remove these
eight values from further analysis. (It later transpired that the low level of N2 had zero concentration in the first, but a low, non-zero concentration in the second replicate.)
Table 9.3 Treatment combinations for the half-replicate of the $2^5$-factorial design for determining yeast growth medium composition. The last two columns show the responses for the two replicates; observations in italics result from experimental error and are removed from analysis
Glucose Nitrogen 1 Nitrogen 2 Vitamin 1 Vitamin 2 OD 1 OD 2
−1 −1 −1 −1 1 1.7 35.68
1 −1 −1 −1 −1 0.1 67.88
−1 1 −1 −1 −1 1.5 27.08
1 1 −1 −1 1 0 80.12
−1 −1 1 −1 −1 120.2 143.39
1 −1 1 −1 1 140.3 116.30
−1 1 1 −1 1 181 216.65
1 1 1 −1 −1 40 47.48
−1 −1 −1 1 −1 5.8 41.35
1 −1 −1 1 1 1.4 5.70
−1 1 −1 1 1 1.5 84.87
1 1 −1 1 −1 0.6 8.93
−1 −1 1 1 1 106.4 117.48
1 −1 1 1 −1 90.9 104.46
−1 1 1 1 −1 129.1 157.82
1 1 1 1 1 131.5 143.33
9.4.2 Analysis
Our fractional factorial design has five treatment factors and several interaction fac-
tors, and we initially use an analysis of variance to determine which of the medium
components has an appreciable effect on growth, and how the components interact.
The model Growth~(Glc+N1+N2+Vit1+Vit2)^2 yields the ANOVA table below.
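As a sketch, assuming the data of Table 9.3 are in long format in a data frame yeast with response Growth and factor columns Glc, N1, N2, Vit1, and Vit2:

m_yeast <- aov(Growth ~ (Glc + N1 + N2 + Vit1 + Vit2)^2, data = yeast)
summary(m_yeast)   # all main effects and two-way interactions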
If only a single replicate is available, then we have to reduce the model to free
up degrees of freedom from parameter estimation for estimating the residual variance
(cf. Sect. 6.4.2). If subject-matter knowledge is available to decide which factors
can be safely removed without missing important effects, then a single replicate
can be successfully analyzed. For example, knowing that the two nitrogen
sources and the two vitamin components do not interact, we might specify the model
Growth ~ (Glc + N1 + N2 + Vit1 + Vit2)^2 - N1:N2 - Vit1:Vit2, which
removes the two corresponding interactions while keeping the three remaining ones.
This strategy is somewhat unsatisfactory, since we still have only two residual
degrees of freedom and correspondingly low precision and power, and we cannot
test whether removal of the factors was really justified. Without good subject-matter
knowledge, this strategy can give very misleading results if large and significant
effects are removed from the analysis.
9.5 Multiple Aliasing

For factorials with more factors, starting with the $2^5$-factorial, useful designs are also
available for fractions higher than one-half, such as quarter-replicates that require
only 8 of the 32 treatment combinations of a $2^5$-factorial. These designs are
constructed using more than one generator, and the combined aliasing leads to more
complex confounding of effects.
For example, a quarter-fraction requires two generators: the first generator specifies
one-half of the treatment combinations, and the second generator one-half
of those. Both generators introduce their own aliases, which we determine using
the generator algebra. In addition, multiplying the two generators introduces further
aliases through their generalized interaction. As an example, consider the two generators
$$G_1:\; ABCDE = +1 \quad\text{and}\quad G_2:\; BCDE = +1\;.$$
The resulting eight treatment combinations are shown in Table 9.4 (left). We see
that in addition to the two generators, we also have a further, highly undesirable
confounding of the main effect of A with the grand mean: the column A contains only
the high level. This is a consequence of the interplay of the two generators, and we
find this additional confounding directly by comparing the left- and right-hand sides
of their generalized interaction:
$$G_1 G_2 = ABCDE \cdot BCDE = A\,B^2 C^2 D^2 E^2 = A = +1\;.$$
A better choice is the pair of generators
$$G_1:\; ABD = +1 \quad\text{and}\quad G_2:\; ACE = +1\;,$$
whose generalized interaction is
$$G_1 G_2 = A^2 BCDE = BCDE = +1\;.$$
The resulting treatment combinations are shown in Table 9.4 (right). We note that
some—but not all—main effects and two-way interactions are now confounded.
Finding good pairs of generators is not entirely straightforward, and software or
tabulated designs are often used.
Recall that we used a $2^{5-1}$ half-replicate for our yeast medium example in Sect. 9.4,
but that we had to remove all observations with N2 at the low level from the first
replicate of this experiment. This effectively introduces a second generator for this
replicate, namely $C = +1$. Since N2 is only observed on one level, no effects involving
this factor can be estimated. In addition, the combination of the second generator
with the original generator $ABCDE = +1$ leads to the additional alias $AB = DE$
between the interaction Glc:N1 and the interaction Vit1:Vit2 for this replicate.
Fortunately, the corresponding observations from the second replicate were not affected
by this problem, such that the pooled data from both replicates could be analyzed as
planned.
For example, a $2^{7-2}$ design for seven factors can be constructed from the two generators
$$G_1:\; ABCDF = +1 \quad\text{and}\quad G_2:\; ABDEG = +1\;.$$

9.6 Characterizing Fractional Factorials

9.6.1 Resolution
A fractional factorial design has resolution K if the grand mean is confounded with
at least one factor of order K, and with no factor of lower order. The resolution is
typically given as a Roman numeral. For example, a $2^{3-1}$ design with generator
$ABC = +1$ has resolution III, and we denote such a design as $2^{3-1}_{III}$.
For a factor of any order, the resolution gives the lowest order of a factor con-
founded with it: a resolution-III design confounds main effects with two-way interac-
tions (III = 1 + 2), and the grand mean with a three-way interaction (III = 0 + 3). A
resolution-V design confounds main effects with four-way interactions (V = 1 + 4),
two-way interactions with three-way interactions (V = 2 + 3), and the five-way
interaction with the grand mean (V = 5 + 0).
Designs with more factors allow fractions of higher resolution. Our previous
$2^5$-factorial example admits a $2^{5-1}_{V}$ design with 16 combinations, and a $2^{5-2}_{III}$ design with
8 combinations. With the first design, we can estimate main effects and two-way
interactions free of other main effects and two-way interactions, while the second
design aliases main effects with two-way interactions. Our 7-factor example has
resolution IV.
In practice, resolutions III, IV, and V are the most common, and resolution V is often
the most useful if it is achievable, since main effects and two-way interactions are
then aliased only with interactions of order three and higher. Main effects and two-way
interactions are confounded for resolution III, and these designs are useful for
screening larger numbers of factors, but usually not for experiments where interaction
effects are of interest.
9.6.2 Aberration
For the $2^7$-factorial, both one-quarter and one-eighth reductions lead to a resolution-IV
design, even though these designs have very different severity of confounding.
The aberration provides an additional criterion to compare designs with identical
resolution. It is based on the idea that we prefer aliasing higher-order interactions to
aliasing lower-order interactions.
We find the aberration of a design as follows: we write down the generators and
derive their generalized interactions. We then sort the resulting set of alias relations
by word length and count how many relations there are of each length. The fewer
words of short length a set of generators produces, the more we would prefer it over
a set with more short words.
For example, the two generators
$$ABCF = +1 \quad\text{and}\quad ADEG = +1$$
also yield a $2^{7-2}_{IV}$ design, this time with generalized interaction $ABCF \cdot ADEG = BCDEFG = +1$.
The corresponding aliases thus contain two words of length four
and one word of length six, and we would prefer this set of generators over the first
set because of its less severe confounding.
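The word-length pattern of a set of generators is easily found mechanically; a small sketch that derives the generalized interaction as the symmetric difference of the two letter sets:

# generalized interaction of two generator words via symmetric difference
word    <- function(s) strsplit(s, "")[[1]]
gen.int <- function(w1, w2) sort(c(setdiff(w1, w2), setdiff(w2, w1)))
g1  <- word("ABCF"); g2 <- word("ADEG")
g12 <- gen.int(g1, g2)                # "B" "C" "D" "E" "F" "G"
sapply(list(g1, g2, g12), length)     # word lengths: 4 4 6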
One class of screening designs uses fractional factorials of resolution III. Noteworthy
examples are the $2^{15-11}_{III}$ design, which allows screening 15 factors in 16 runs, or the
$2^{31-26}_{III}$ design, which allows screening 31 factors in 32 runs!
A problem with this class of designs is that the ‘gap’ between useful screening
designs increases with an increasing number of factors, because we can only consider
fractions that are powers of two: reducing a $2^7$-design with 128 runs yields designs
with 64 runs ($2^{7-1}$) and 32 runs ($2^{7-2}$), but we cannot find designs with more than 32
and fewer than 64 runs, for example. On the other hand, fractional factorials are familiar
designs that are relatively easy to interpret, and if a reasonable design is available,
there is no reason not to consider it.
Factor screening experiments will typically use a single replicate of a (fractional)
factorial, and effects cannot then be tested formally. If only a minority of factors is active,
we can still identify the active factors by more informal comparisons using the
method by Lenth (Lenth 1989); see Sect. 6.4.2 for details on this method.
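Lenth’s procedure is simple enough to compute directly; a sketch, assuming a single unreplicated design in a hypothetical data frame d with coded −1/+1 factor columns and response y:

m    <- lm(y ~ .^2, data = d)        # main effects and two-way interactions
eff  <- 2 * coef(m)[-1]              # factor effects are twice the coefficients
s0   <- 1.5 * median(abs(eff))
pse  <- 1.5 * median(abs(eff)[abs(eff) < 2.5 * s0])  # pseudo standard error
me   <- qt(0.975, df = length(eff) / 3) * pse        # Lenth's margin of error
names(eff)[abs(eff) > me]            # effects flagged as active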
A different idea for constructing screening designs was proposed by Plackett and
Burman in a seminal paper (Plackett and Burman 1946). These designs require
that the number of runs is a multiple of four. The most commonly used are designs
with 12, 20, 24, and 28 runs, which can screen 11, 19, 23, and 27 factors,
respectively. Plackett–Burman designs do not have a simple confounding structure
that could be determined with generators. Rather, they are based on the idea of
partially confounding some fraction of each effect with other effects. These designs
are used for screening main effects only, as main effects are already confounded with
two-way interactions in rather complicated ways that cannot be easily disentangled by
follow-up experiments. Plackett–Burman designs considerably increase the available
options for the screening experiment sizes, and offer designs when no fractional
factorial design is available.
9.8 Blocking Factorial Experiments

With many treatments, blocking a design becomes challenging because the efficiency
of blocking deteriorates with increasing block size, and there are often practical limits
on the maximal number of units per block. The incomplete block designs in Sect. 7.3 are
a remedy for this problem for unstructured treatment levels. The idea of fractional
factorial designs is useful for blocking factorial treatment structures and exploits
their properties by deliberately confounding (higher-order) interactions with block
effects. This reduces the required block size to the size of the corresponding fractional
factorial.
We can further extend this idea by using different confoundings for different sets
of blocks, such that each set accommodates a different fraction of the same factorial
treatment structure. We are then able to recover most of the effects of the full factorial,
albeit with different precision.
We consider a blocked design with a $2^3$-factorial treatment structure in blocks of
size four as our main example. This is a realistic scenario when studying combinations of
three treatments on mice and blocking by litter, with typical litter sizes being below
eight. Two questions arise: (i) which treatment combinations should we assign to the
same block? and (ii) with replication of blocks, should we use the same assignment
of treatment combinations to blocks? If not, how should we determine treatment
combinations for sets of blocks?
9.8.1 Half-Fraction
A first idea is to use a half-replicate of the $2^3$-factorial and assign its four treatment
combinations to the four units in each block. For example, we can use the generator
$ABC = +1$ and randomize the same treatment combinations {a, b, c, abc}
independently within each block. A layout for four blocks is

Block  Generator   Units 1–4
I      ABC = +1    a   b   c   abc
II     ABC = +1    a   b   c   abc
III    ABC = +1    a   b   c   abc
IV     ABC = +1    a   b   c   abc
This design confounds the three-way interaction with the block effect and resem-
bles a replication of the same fractional factorial, where systematic differences
between replicates are accounted for by the block effects. The fractional factorial
has resolution III, and main effects are confounded with two-way interactions within
each block (and thereby also overall).
From the 16 observations, we require four degrees of freedom for estimating the
treatment parameters, and three degrees of freedom for the block effect, leaving us
with nine residual degrees of freedom. The latter can be increased by using more
blocks, where we gain four observations with each block and lose one degree of
freedom for the block effect. Since the effect aliases are the same in each block,
increasing the number of blocks does not change the confounding: no matter how
many blocks we use, we are unable to disentangle the main effect of A, say, and the
B:C interaction.
We can improve the design substantially by noting that it is not required to use the
same half-replicate in each block. For instance, we might instead use the generator
ABC = +1 with combinations {a, b, c, abc} to create a half-replicate of the treat-
ment structure for the first two of four blocks, and use the corresponding generator
ABC = −1 (the fold-over) with combinations {(1), ab, ac, bc} for the remaining
two blocks.
With two replicates for each of the two levels of the three-way interaction, its
parameters are estimable using the block totals. All other effects can be estimated
more precisely, since we now have two replicates of the full factorial design after we
account for the block effects.
The layout for four blocks is then

Block  Generator   Units 1–4
I      ABC = +1    a    b    c    abc
II     ABC = +1    a    b    c    abc
III    ABC = −1    (1)  ab   ac   bc
IV     ABC = −1    (1)  ab   ac   bc

and shows that while the half-fraction of a $2^3$-factorial is not an interesting option
in itself due to the severe confounding, it gives a very appealing design for reducing
block sizes.
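FrF2 can construct such blocked factorials directly; a sketch for one replicate of the $2^3$-factorial in two blocks of four, confounding A:B:C with the block contrast (the four-block layout above consists of two such replicates):

library(FrF2)
plan <- FrF2(nruns = 8, nfactors = 3, blocks = "ABC",
             randomize = FALSE)
plan   # two blocks of four runs; ABC confounded with blocks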
For example, we have confounding of A with B:C for observations from the
$ABC = +1$ half-replicate (with $A = BC$), but we can resolve this confounding
using observations from the other half-replicate, for which $A = -BC$. Indeed,
for blocks I and II, the estimate of the A main effect is $(a + abc) - (b + c)$, and
for blocks III and IV it is $(ab + ac) - (bc + (1))$. Similarly, the estimates for
B:C are $(a + abc) - (b + c)$ and $(bc + (1)) - (ab + ac)$, respectively. All of these
estimates are free of block effects, and so are their sums and differences, which are
proportional to
$$[(a + abc) - (b + c)] + [(ab + ac) - (bc + (1))] = (a + ab + ac + abc) - ((1) + b + c + bc)$$
for A, respectively
$$[(a + abc) - (b + c)] - [(ab + ac) - (bc + (1))] = ((1) + a + bc + abc) - (b + c + ab + ac)$$
for B:C. These are the same estimates as for a two-fold replicate of the full factorial
design. Somewhat simplified: the first two blocks allow estimation of the sum of the
A main effect and the B:C interaction, while the second pair allows estimation of
their difference. The sum of these two estimates is $2 \cdot A$, while the difference is $2 \cdot BC$.
The same argument does not hold for the A:B:C interaction, of course. Here, we
have to contrast observations in ABC = +1 blocks with observations in ABC = −1
blocks, and block effects do not cancel. If instead of four blocks, our design only uses
two blocks—one for each generator—then main effects and two-way interactions can
still be estimated, but the three-way interaction is completely confounded with the
block effect.
Using a classical ANOVA for the analysis, we find two error strata for the inter-
and intra-block errors, and the corresponding F-test for A:B:C in the inter-block
stratum with two denominator degrees of freedom: we have four blocks, and lose
one degree of freedom for the grand mean, and one degree of freedom for the A:B:C
parameters. All other tests are in the intra-block stratum and based on six degrees of
freedom: a total of 4 · 4 = 16 observations, with seven degrees of freedom spent on
the model parameters except the three-way interaction, and three degrees of freedom
spent on the block effects.
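A sketch of this analysis in R, assuming a hypothetical data frame d with two-level factors A, B, C, a block factor Block with levels I–IV, and response y:

m <- aov(y ~ A * B * C + Error(Block), data = d)
summary(m)
# A:B:C appears in the 'Error: Block' stratum (2 denominator d.f.);
# all other effects are tested in the within-block stratum (6 d.f.)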
A useful consequence of these considerations is the possibility of augmenting
a fractional factorial design with the complementary half-replicate. For example,
we might consider a half-replicate of a $2^5$-factorial with generator $ABCDE = +1$.
If we find large effects for the confounded two- and three-way interactions, we
can use a single second experiment with $ABCDE = -1$ to provide the remaining
treatment combinations; the combined data then allow us to separate the aliased
effects, with the two experiments acting as two blocks.
While using the highest-order interaction to define the confounding with blocks is
the natural choice, we could also use any other generator. In particular, we might use
A = +1 and A = −1 as our two generators, thereby allocating half the blocks to the
low level of A, and the other half to its high level. In other words, we randomize A
on the block factor, and the remaining treatment factors are randomized within each
block. This is precisely the split-unit design with the blocking factor as the whole-unit
factor, and A randomized on it. With four blocks, the four block totals provide three
degrees of freedom after accounting for the grand mean; these are split into
estimating the A main effect (1 d.f.) and the between-block residual variance (2 d.f.).
All other treatment effects profit from the removal of the block effect and are tested
with 6 degrees of freedom for the within-block residual variance.
The use of generators offers more flexibility than a split-unit design, because it
allows us to confound any effect with the blocking factor, not just a main effect.
Whether this is an advantage depends on the experiment: if application of the treat-
ment factors to experimental units is equally simple for all factors, then it is usually
more helpful to confound a higher-order interaction with the blocking factor. This
design then allows estimation of all main effects and their contrasts with equal preci-
sion, and lower-order interaction effects can also be estimated precisely. A split-unit
design, however, offers advantages for the logistics of the experiment if levels of one
treatment factor are more difficult to change than levels of the other factors. By con-
founding the hard-to-change factor with the blocking factor, the experiment becomes
easier to implement. Split-unit designs are also conceptually simpler than confound-
ing of interaction effects with blocks, but that should not be the sole motivation for
using them.
A layout that uses a different pair of generators for each pair of blocks, and thereby
partially confounds each of the interactions with blocks, is

Block  Generator   Units 1–4
I      ABC = +1    a    b    c    abc
II     ABC = −1    (1)  ab   ac   bc
III    AB = +1     (1)  c    ab   abc
IV     AB = −1     a    b    ac   bc
V      AC = +1     (1)  b    ac   abc
VI     AC = −1     a    c    ab   bc
VII    BC = +1     (1)  a    bc   abc
VIII   BC = −1     b    c    ab   ac
We can further reduce the required block size by considering higher fractions of a
factorial. As we saw in Sect. 9.5, these require several simultaneous generators, and
additional aliasing occurs due to the generalized interaction between the generators.
For example, the half-fraction of a $2^5$-factorial still requires a block size of 16,
which might not be practical. We further reduce the block size using the two pairs
of generators
$$ABC = \pm 1\;, \quad ADE = \pm 1\;,$$
Block  Generators            Units 1–8
I      ABC = −1, ADE = −1    (1)   bc    de    bcde   abd   acd    abe   ace
II     ABC = −1, ADE = +1    b     c     bde   cde    ad    abcd   ae    abce
III    ABC = +1, ADE = −1    d     bcd   e     bce    ab    ac     abde  acde
IV     ABC = +1, ADE = +1    bd    cd    be    ce     a     abc    ade   abcde
In this design, the two three-way interactions A:B:C and A:D:E and their gen-
eralized four-way interaction B:C:D:E are partially confounded with block effects.
All other effects, and in particular all main effects and all two-way interactions, are
free of block effects and estimated precisely. By carefully selecting the generators,
we are often able to confound effects that are known to be of limited interest to the
researcher.
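Again, FrF2 can generate such a design from the two block-generating interactions; a sketch (block generators given by name, as supported by FrF2’s blocks argument):

library(FrF2)
plan <- FrF2(nruns = 32, nfactors = 5, blocks = c("ABC", "ADE"),
             randomize = FALSE)
plan   # four blocks of eight; ABC, ADE, and BCDE confounded with blocks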
Expected Results
We can broadly distinguish three classes of proteins that we expect to find in this
experiment.
The first class is proteins directly involved in the known pathway. For these, we
expect low levels of abundance for a placebo treatment, because the placebo does
not activate the pathway. For the drug treatment, we expect to see high abundance
in the wild-type, as the pathway is then activated, but low abundance in the mutant,
since the drug cannot bind to the receptor and thus pathway activation is impeded.
In other words, we expect a large genotype-by-drug interaction.
The second class is proteins in the alternative pathway(s) activated by the drug but
via a different receptor. Here, we would expect to see high abundance in both
wild-type and mutant for the drug treatment and low abundance in both genotypes for
a placebo treatment, since the mutation does not affect receptors in these pathways.
This translates into a large drug main effect, but no genotype main effect and no
genotype-by-drug interaction.
The third class is proteins unrelated to any mechanisms activated by the drug.
Here, we expect to see the same abundance levels in both genotypes for both drug
treatments, and no treatment factor should show a large and significant effect.
We are somewhat unsure what to expect for the duration. It seems plausible that a
protein in an activated pathway will show lower abundance after a longer time, since
the pathway should trigger a response that lowers the inflammation.
This would mean that a three-way interaction exists at least for proteins involved in
the known or alternative pathways. A different scenario results if one pathway takes
longer to activate than another pathway, which would present as a two- or three-way
interaction of drug and/or genotype with the duration.
Fig. 9.3 Proteomics experiment. A $2^3$-factorial treatment structure with three-way interaction confounded in two blocks. B Mass spectra with four tags (symbol) for the same protein from two blocks (shading)
The first blocking confounds the three-way interaction, and the second confounds one
of the three two-way interactions. A promising candidate for the latter is the drug-by-duration interaction, since we are
very interested in the genotype-by-drug interaction and would like to detect different
activation times between the known and alternative pathways, but we do not expect a
drug-by-duration interaction of interest. This yields the data shown in Fig. 9.4, where
the eight resulting protein abundances are shown separately for short and long dura-
tion between drug administration and measurement, and for three typical proteins
in the known pathway, in an alternative pathway, and unrelated to the inflammation
response.
Notes
Deliberate effect confounding in factorial designs was fully developed in the 1940s
(Fisher 1941; Finney 1945) and is an active research area to this day. A general
review is given in Gunst and Mason (2009), and modern developments for multi-
stratum designs are given in Cheng (2019). Some specific designs are discussed for
engineering applications in Box (1992) and Box and Bisgaard (1993).
Fractional factorials can also be constructed for factors with more than two levels,
such as the $3^k$-series (Cochran and Cox 1957), or generally the $p^k$-series ($p$ a prime
number). A more general concept for confounding in factorials with mixed numbers
of factor levels are design keys (Patterson and Bailey 1978). For the analysis of non-replicated
designs, the methods by Lenth (1989) (discussed in Sect. 6.4.2) and Box
and Meyer (1986) are widely used.
The website for NIST’s Engineering Statistics Handbook provides tables with
commonly used $2^{k-l}$ fractional factorials, Plackett–Burman, and other useful designs
in its Chap. 5.
Fig. 9.4 Data of proteomics experiment. Round points: placebo; triangles: drug treatment. Panels show typical protein scenarios in columns and waiting duration in rows
To see how a generator affects the model parameters, consider the linear model of a $2^3$-factorial,
$$y_{ijkl} = \mu + \alpha_A\, a_i + \alpha_B\, b_j + \alpha_C\, c_k + \alpha_{AB}\, a_i b_j + \alpha_{AC}\, a_i c_k + \alpha_{BC}\, b_j c_k + \alpha_{ABC}\, a_i b_j c_k + e_{ijkl}\;,$$
where $a_i, b_j, c_k$ encode the factor levels of A, B, and C, respectively, for that specific
observation. With a sum-encoding, we have $a_i = -1$ if A is on the low level, and
$a_i = +1$ if A is on the high level, with values for $b_j, c_k$ accordingly. The seven
parameters $\alpha_X$ are the effects of the corresponding factors that we want to estimate.

Using the generator $ABC = +1$ then translates to imposing the relation $a_i \cdot b_j \cdot c_k = +1$ for each observation $i, j, k$, and we can replace $a_i \cdot b_j \cdot c_k$ with $+1$ in the
linear model equation. It follows that the parameter $\alpha_{ABC}$ of the three-way interaction
is completely confounded with the grand mean $\mu$. Similarly, we note that $a_i \cdot b_j = c_k$
for each observation, and we can replace $a_i \cdot b_j$ with $c_k$ in the model equation. Thus,
the two parameters $\alpha_{AB}$ and $\alpha_C$, encoding the effect of the two-way interaction A:B
and the main effect of C, respectively, are completely confounded and only their sum
$\alpha_{AB} + \alpha_C$ can be estimated. Continuing this way, we find that the generator implies
the linear model
$$y_{ijkl} = \beta_0 + \beta_1\, a_i + \beta_2\, b_j + \beta_3\, c_k + e_{ijkl}\;,$$
with $\beta_0 = \mu + \alpha_{ABC}$, $\beta_1 = \alpha_A + \alpha_{BC}$, $\beta_2 = \alpha_B + \alpha_{AC}$, and $\beta_3 = \alpha_C + \alpha_{AB}$.
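A small numerical illustration of this aliasing: with the generator forcing $c_k = a_i b_j$, a regression on the four runs of the half-replicate returns $\alpha_C + \alpha_{AB}$ as the coefficient of C (the effect values below are made up for the illustration):

A <- c(-1,  1, -1,  1)
B <- c(-1, -1,  1,  1)
C <- A * B                            # generator ABC = +1 forces C = AB
y <- 1 + 2*A - 1*B + 0.5*C + 1.5*A*B  # noise-free responses for clarity
coef(lm(y ~ A + B + C))               # coefficient of C is 0.5 + 1.5 = 2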
References
Box, G. E. P. (1992). “What can you find out from sixteen experimental runs?” In: Quality Engi-
neering 5.1, pp. 167–178.
Box, G. E. P. and R. D. Meyer (1986). “An analysis for unreplicated fractional factorials”. In:
Technometrics 28.1, pp. 11–18.
Box, G. E. P. and S. Bisgaard (1993). “Quality quandaries: iterative analysis from two-level facto-
rials”. In: Quality Engineering 6.2, pp. 319–330.
Cheng, C.-S. (2019). Theory of Factorial Design: Single- and Multi-Stratum Experiments. Chapman
& Hall/CRC.
Cochran, W. G. and G. M. Cox (1957). Experimental Designs. John Wiley & Sons, Inc.
Finney, D. J. (1945). “The fractional replication of factorial arrangements”. In: Annals of Eugenics
12, pp. 291–301.
Finney, D. J. (1955). Experimental Design and its Statistical Basis. The University of Chicago Press.
Fisher, R. A. (1941). “The theory of confounding in factorial experiments in relation to the theory
of groups”. In: Annals of Human Genetics 11.1, pp. 341–353.
Grömping, U. (2014). “R package FrF2 for creating and analyzing fractional factorial 2-level
designs”. In: Journal of Statistical Software 56.1, e1–e56.
Gunst, R. F. and R. L. Mason (2009). “Fractional factorial design”. In: WIREs Computational
Statistics 1, pp. 234–244.
Kobilinsky, A., A. Bouvier, and H. Monod (2012). PLANOR: an R package for the automatic
generation of regular fractional factorial designs. Tech. rep. INRA, e1–e97.
Kobilinsky, A., H. Monod, and R. A. Bailey (2017). “Automatic generation of generalised regular
designs.” In: Computational Statistics and Data Analysis 113, pp. 311–329.
Lenth, R. V. (1989). “Quick and easy analysis of unreplicated factorials”. In: Technometrics 31.4,
pp. 469–473.
Patterson, H. D. and R. A. Bailey (1978). “Design keys for factorial experiments”. In: Journal of
the Royal Statistical Society C 27.3, pp. 335–343.
Plackett, R. L. and J. P. Burman (1946). “The design of optimum multifactorial experiments”. In:
Biometrika 33.4, pp. 305–325.
Chapter 10
Experimental Optimization
with Response Surface Methods
10.1 Introduction
The key new idea is to consider the response as a smooth function of the quantitative
treatment factors. We generically call the treatment factor levels $x_1, \dots, x_k$ for $k$
factors, so that we have five such variables for our example, corresponding to the
five concentrations of medium components. We again denote the response variable
by $y$, which is the increase in optical density in a defined time-frame in our example.
The response surface $\phi(\cdot)$ relates the expected response to the experimental variables:
$$\mathrm{E}(y) = \phi(x_1, \dots, x_k)\;.$$
We assume that small changes in the variables will yield small changes in the
response, so the surface described by φ(·) is smooth enough that, given two points and
their resulting responses, we feel comfortable in interpolating intermediate responses.
The shape of the response surface and its functional form φ(·) are unknown, and each
measurement of a response is additionally subject to some variability.
The goal of a response surface design is to define $n$ design points $(x_{1,j}, \dots, x_{k,j})$,
$j = 1, \dots, n$, and use a reasonably simple yet flexible regression function $f(\cdot)$ to
approximate the true response surface $\phi(\cdot)$ from the resulting measurements $y_j$, so
that $f(x_1, \dots, x_k) \approx \phi(x_1, \dots, x_k)$, at least locally around a specified point. This
approximation allows us to predict the expected response for any combination of
factor levels that is not too far outside the region that we explored experimentally.
Having found such an approximation, we can determine the path of steepest ascent,
the direction along which the response surface increases fastest. For optimizing
the response, we design the next experiment with design points along this gradient
(the gradient pursuit experiment). Having found a new set of conditions that gives
higher responses than our starting condition, we iterate the two steps: locally approximate
the response surface around the new best treatment combination and follow
its path of steepest ascent in a subsequent experiment. We repeat these steps until
no further improvement of the measured response is observed, we find a treatment
combination that yields a satisfactorily high response, or we run out of resources.
This idea of a sequential experimental strategy is illustrated in Fig. 10.1 for two
treatment factors whose levels are on the horizontal and vertical axis, and a response
surface shown by its contours.
For implementing this strategy, we need to decide what point to start from;
how to approximate the surface $\phi(\cdot)$ locally; how many and which design points
$(x_{1,j}, \dots, x_{k,j})$ to choose to estimate this approximation; and how to determine the
path of steepest ascent.
Optimizing treatment combinations usually means that we already have a reason-
ably reliable experimental system, so we use the current experimental condition as
our initial point for the optimization. An important aspect here is the reproducibility
of the results: if the system or the process under study does not give reproducible
results at the starting point, there is little hope that the response surface experiments
will give improvements.
Fig. 10.1 Sequential experiments to determine optimum conditions. Three exploration experiments (A, C, E), each followed by a gradient pursuit (B, D, F). Dotted lines: contours of response surface. Black lines and dots: region and points for exploring and gradient pursuit. Inset curves in panels B, D, F show the slice along the gradient with measured points and resulting maximum
10.3 The First-Order Model

We start our discussion by looking into the first-order model for locally approximating
the response surface.
10.3.1 Model
The first-order model describes the expected response by a linear function of the factor levels:
$$y = f(x_1, \dots, x_k) + e = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + e = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + e\;.$$
We estimate its parameters using standard linear regression; the parameter βi gives
the amount by which the expected response increases if we increase the ith factor
from xi to xi + 1, keeping all other factors fixed. Without interactions, the predicted
change in response is independent of the values of all other factors, and interactions
could be added if necessary. We assume a constant error variance Var(e) = σ 2 and
justify this by the fact that we explore the response surface only locally.
The path of steepest ascent follows the gradient of the fitted model,
$$g(x_1, \dots, x_k) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_k} \right)^{\!\top} = (\beta_1, \dots, \beta_k)^\top\;,$$
which gives the direction with the fastest increase in response. This gradient is independent
of the factor levels (the function $g(\cdot)$ does not depend on any $x_i$), and the direction
of steepest ascent is the same no matter where we start.
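A sketch of this step with the rsm package discussed in Sect. 10.6, assuming the coded data of Table 10.1 in a hypothetical data frame d1:

library(rsm)
m1 <- rsm(Measured ~ FO(Glc, N1, N2, Vit1, Vit2), data = d1)
summary(m1)    # includes a lack-of-fit test against the pure error
steepest(m1)   # path of steepest ascent in coded units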
In the next iteration, we explore this direction experimentally to find a treatment
combination that yields a higher expected response than our starting condition. The
local approximation of the true response surface by our first-order model is likely
to fail once we venture too far from the starting condition, and we will encounter
decreasing responses and increasing discrepancies between the response predicted by
the approximation and the response actually measured. One iteration of exploration
and gradient pursuit is illustrated in Fig. 10.2.
Fig. 10.2 Sequential response surface experiment with first-order model. A The first-order model (solid lines) built around a starting condition (star) approximates the true response surface (dotted lines) only locally. B The gradient follows the increase of the plane and predicts indefinite increase in response. C The first-order approximation (solid line) deviates from the true surface (dotted line) at larger distances from the starting condition, and the measured responses (points) start to decrease
The first-order model with k factors has k + 1 parameters, and estimating these
requires at least as many observations at different design points. Two options for an
experimental design are shown in Fig. 10.3 for a two-factor design.
In both designs, we use several replicates at the starting condition as center points.
This allows us to estimate the residual variance independently of any assumed model.
Comparing this estimate with the observed discrepancies between any other point of
the predicted response surface and its corresponding measured value allows testing
the goodness of fit of our model. We also observe each factor at three levels, and can
use this information to detect curvature not accounted for by the first-order model.
In the first design (Fig. 10.3A), we keep all factors at their starting condition level,
and only change the level for one factor at a time. This yields two axial points for
each factor, and the design resembles a coordinate system centered at the starting
condition. It requires 2k + m measurements for k factors and m center point replicates
and is adequate for a first-order model without interactions.
The second design (Fig. 10.3B) is a full $2^2$-factorial where all factorial points are
included. This design allows estimation of interactions between the factors
and thus a more complex form of approximation. It requires $2^k + m$ measurements,
but fractional replication reduces the experiment size for larger numbers of factors.
A practical problem is the choice for the low and high level of each factor. When
chosen too close to the center point, the values for the response will be very similar
and it is unlikely that we detect anything but large effects if the residual variance is
not very small. When chosen too far apart, we might ‘brush over’ important features
of the response surface and end up with a poor approximation. Only subject-matter
knowledge can guide us in choosing these levels satisfactorily; for biochemical
experiments, one-half and double the starting concentration are often a reasonable first
guess.¹

¹ This requires that we use a log-scale for the levels of the treatment factors.

Fig. 10.3 Two designs for fitting a first-order RSM with two factors. A Center points and axial points that modify the level of one factor at a time only allow estimation of main effects, but not of interactions. B Center points and factorial points increase the experiment size for k > 2 but allow estimation of interactions
Table 10.1 Half-fraction of the $2^5$-factorial and center points. Variables are recoded from their original levels to −1/0/+1. Last column: observed changes in optical density
Glc N1 N2 Vit1 Vit2 Measured
Center point replicates
0 0 0 0 0 81.00
0 0 0 0 0 84.08
0 0 0 0 0 77.79
0 0 0 0 0 82.45
0 0 0 0 0 82.33
0 0 0 0 0 79.06
Factorial points
−1 −1 −1 −1 +1 35.68
+1 −1 −1 −1 −1 67.88
−1 +1 −1 −1 −1 27.08
+1 +1 −1 −1 +1 80.12
−1 −1 +1 −1 −1 143.39
+1 −1 +1 −1 +1 116.30
−1 +1 +1 −1 +1 216.65
+1 +1 +1 −1 −1 47.48
−1 −1 −1 +1 −1 41.35
+1 −1 −1 +1 +1 5.70
−1 +1 −1 +1 +1 84.87
+1 +1 −1 +1 −1 8.93
−1 −1 +1 +1 +1 117.48
+1 −1 +1 +1 −1 104.46
−1 +1 +1 +1 −1 157.82
+1 +1 +1 +1 +1 143.33
Table 10.2 ANOVA table for first-order response surface model. FO: first-order model in the given variables

                              Df   Sum Sq    Mean Sq   F value   Pr(>F)
FO(Glc, N1, N2, Vit1, Vit2)    5   38102.72  7620.54      7.94   6.30e-04
Residuals                     16   15355.32   959.71
  Lack of fit                 11   15327.98  1393.45    254.85   3.85e-06
  Pure error                   5      27.34     5.47
Table 10.3 Gradient of steepest ascent from first-order response surface model
Glc N1 N2 Vit1 Vit2
−0.32 0.17 0.89 −0.09 0.26
The considerable lack of fit in Table 10.2 is of little concern for planning the next
round of experimentation along the gradient. Recall that our main goal is to find
the direction of highest increase in the response values, and we are less interested in
an accurate description of the surface at the beginning of the experiment.
The pure error is based solely on the six replicates of the center points and conse-
quently has five degrees of freedom. Its variance is about 5.5 and very small compared
to the other contributors. This means that the starting point produces highly replicable
results and that small differences on the response surface are detectable.
The resulting gradient for this example is shown in Table 10.3, indicating the
direction of steepest ascent in standardized coordinates.
The parameter estimates $\hat\beta_0, \dots, \hat\beta_5$ are shown in Table 10.4. The intercept corresponds
to the predicted average response at the center point; the empirical average
is 81.12, in good agreement with the model.
We note that N2 has the largest impact and should be increased, that Vit2 should
also be increased, that Glc should be decreased, while N1 and Vit1 seem to have
only a comparatively small influence.
10.4 The Second-Order Model

10.4.1 Model

The second-order model adds purely quadratic (PQ) terms $x_i^2$ and two-way interaction
(TWI) terms $x_i \cdot x_j$ to the first-order model (FO), such that the regression model
equation becomes
$$y = f(x_1, \dots, x_k) + e = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \sum_{i=1}^{k} \beta_{ii} x_i^2 + \sum_{i<j} \beta_{ij} x_i x_j + e\;,$$
and we again assume a constant error variance $\mathrm{Var}(e) = \sigma^2$ for all points (and can
again justify this because we are looking at a local approximation of the response
surface). For example, the two-factor second-order response surface approximation
is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{1,1} x_1^2 + \beta_{2,2} x_2^2 + \beta_{1,2} x_1 x_2 + e$$
and only requires estimation of six parameters to describe the true response surface
locally. It performs satisfactorily for most problems, at least locally in the vicinity of
a given point. Importantly, the model is still linear in the parameters, and parameter
estimation can therefore be handled as a standard linear regression problem.
The second-order model allows curvature in all directions and interactions
between factors which provide more information about the shape of the response
surface. Canonical analysis then allows us to detect ridges along which two or more
factors can be varied with only little change to the response and to determine if a
stationary point on the surface is a maximum or minimum or a saddle-point where
the response increases along one direction, and decreases along another direction.
We do not pursue this technique and refer to the references in Sect. 10.6 for further
details.
In contrast to the first-order model, the gradient now depends on the factor levels;
its first component, for example, is
$$\bigl(g(x_1, \dots, x_k)\bigr)_1 = \frac{\partial}{\partial x_1} f(x_1, \dots, x_k) = \beta_1 + 2\beta_{1,1} x_1 + \beta_{1,2} x_2 + \cdots + \beta_{1,k} x_k\;,$$
and the components for the other factors follow accordingly.
For a second-order model, we need at least three points on each axis to estimate
its parameters and we need sufficient combinations of factor levels to estimate the
interactions. An elegant way to achieve this is a central composite design (CCD).
Fig. 10.4 A central composite design for three factors. A Design points for $2^3$ full factorial. B Center point location and axial points chosen along axes parallel to the coordinate axes and through the center point. C Central composite design with axial points chosen for rotatability. D Combined design points with axial points chosen on facets introduce no new factor levels
In the CCD, we combine axial points and factorial points, and there are two reasonable
alternatives for choosing their levels: first, we can choose the axial and factorial
points at the same distance from the center point, so they lie on a sphere around the
center point (Fig. 10.4C). This rotationally symmetric design yields identical standard
errors for all parameters and uses levels $\pm 1$ for factorial and $\pm\sqrt{k}$ for axial
points.
Second, we can choose axial points on the facets of the factorial hyper-cube
by using the same low/high levels as the factorial points (Fig. 10.4D). This has the
advantage that we do not introduce new factor levels, which can simplify the imple-
mentation of the experiment. The disadvantage is that the axial points are closer
to the center points than the factorial points and estimates of parameters then have
different standard errors, leading to different variances of the model’s predictions
depending on the direction.
The factorial and axial parts of a central composite design are orthogonal to each
other. This attractive feature allows a very simple blocking strategy, where we use
one block for the factorial points and some of the replicates of the center point, and
another block for the axial points and the remaining replicates of the center point. We
can then conduct the CCD of our example in two experiments, one with 16 + 3 = 19
measurements, and the other with 10 + 3 = 13 measurements, without jeopardizing
the proper estimation of parameters. The block size for the factorial part can be
further reduced using the techniques for blocking factorials from Sect. 9.8.
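A sketch of this design with the ccd() function from the rsm package; the numbers match our example, with a half-fraction cube generated by $x_5 = x_1 x_2 x_3 x_4$ and three center points per block:

library(rsm)
plan <- ccd(basis = 4, generators = x5 ~ x1 * x2 * x3 * x4,
            n0 = c(3, 3), alpha = "rotatable", randomize = FALSE)
# block 1: 16 factorial runs + 3 center points (19 runs)
# block 2: 10 axial runs + 3 center points (13 runs)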
In practice, this property provides considerable flexibility when conducting such
an experiment. For example, we might first measure the axial and center points and
estimate a first-order model from these data and determine the gradient.
With replicated center points, we are then able to quantify the lack of fit of the
model and the curvature of the response surface. We might then decide to continue
with the gradient pursuit based on the first-order model if curvature is small, or to
conduct a second experiment with factorial and center points to augment the data for
estimating a full second-order model.
Alternatively, we might start with the factorial points to determine a model with
main effects and two-way interactions. Again, we can continue with these data alone,
augment the design with axial points to estimate a second-order model, or augment
the data with another fraction of the factorial design to disentangle confounded factors
and gain higher precision of parameter estimates.
Next, we measure the response for several experimental conditions along the path
of steepest ascent. We then iterate the steps of approximation and gradient pursuit
based on the condition with highest measured response along the path.
The overall sequential procedure is illustrated in Fig. 10.5: the initial second-order
approximation (solid lines) of the response surface (dotted lines) is only valid locally
(A) and predicts an optimum far from the actual optimum, but pursuit of the steepest
ascent increases the response values and moves us closer to the optimum (B). Due
to the local character of the approximation, predictions and measurements start to
diverge further away from the starting point (C). The exploration around the new
best condition predicts an optimum close to the true optimum (D) and following
the steepest ascent (E) and a third exploration (F) achieves the desired optimization.
Fig. 10.5 Sequential experimentation for optimizing conditions using second-order approximations. A First approximation (solid lines) of true response surface (dotted lines) around a starting point (star). B Pursuing the path of steepest ascent yields a new best condition. C Slice along the path of steepest ascent. D Second approximation around new best condition. E Path of steepest ascent based on second approximation. F Third approximation captures the properties of the response surface around the optimum
The second-order model now has an optimum very close to the true optimum and
correctly approximates the factor influences and their interactions locally.
A comparison of predicted and measured responses along the path supplies infor-
mation about the range of factor levels for which the second-order model provides
accurate predictions. A second-order model also allows prediction of the optimal
condition directly. This prediction strongly depends on the approximation accuracy
of the model, and we should treat predicted optima with suspicion if they are far
outside the factor levels used for the exploration.
10.5 A Real-Life Example—Yeast Medium Optimization

We revisit our example of yeast medium optimization from Sect. 9.4. Recall that
our goal is to alter the composition of glucose (Glc), monosodium glutamate (nitro-
gen source N1), a fixed composition of amino acids (nitrogen source N2), and two
mixtures of trace elements and vitamins (Vit1 and Vit2) to maximize growth of
yeast. Growth is measured as increase in optical density, where higher increase
indicates higher cell density. We summarize the four sequential experiments that
are used to arrive at the specification of a new high cell density (HCD) medium
(Roberts et al. 2020).
Fig. 10.6 Measured increase in OD for first exploration experiment. Triangles indicate center point replicates
This shows that optimization cannot be done iteratively one factor at a time, but that
several factors need to be changed simultaneously. The non-significant quadratic
terms indicate that either the surface is sufficiently flat around the starting point or
that we chose factor levels too close together or too far apart to detect curvature. The
ANOVA table for this model (Table 10.5), however, shows that quadratic terms and
two-way interactions as a whole cannot be neglected, but that the first-order part of
the model is by far the most important.
The table also suggests considerable lack of fit, but we expect this at the begin-
ning of the optimization. This becomes very obvious when looking at the predicted
stationary point of the approximated surface (Table 10.6), which contains negative
concentrations and is far outside the region we explored in the first experiment (note
that the units are given in standardized coordinates, so a value of 10 means 10 times
farther from the center point than our axial points).
Based on this second-order model, we calculated the path of steepest ascent for the
second experiment in Table 10.7.
The first column shows the distance from the center point in standardized coor-
dinates, so this path moves up to 4.5 times further than our axial points. The next
columns are the resulting levels for our five treatment factors in standardized coor-
dinates. The column ‘Predicted’ provides the predicted increase in OD based on our
second-order approximation.
Table 10.7 Path of steepest ascent in standardized coordinates, predicted increase in OD, and
measured increase in OD based on first exploration data
Distance Glc N1 N2 Vit1 Vit2 Predicted Measured
0.0 0.0 0.0 0.0 0.0 0.0 81.8 78.3
0.5 −0.2 0.1 0.4 0.0 0.1 108.0 104.2
1.0 −0.4 0.3 0.8 0.0 0.3 138.6 141.5
1.5 −0.6 0.6 1.1 0.0 0.5 174.3 156.9
2.0 −0.8 0.8 1.4 0.1 0.8 215.8 192.8
2.5 −1.1 1.1 1.7 0.2 1.0 263.3 238.4
3.0 −1.3 1.4 1.9 0.3 1.3 317.2 210.5
3.5 −1.5 1.7 2.1 0.4 1.5 377.4 186.5
4.0 −1.7 2.1 2.4 0.5 1.8 444.3 144.2
4.5 −1.9 2.4 2.6 0.6 2.0 517.6 14.1
Fig. 10.7 Observed growth and predicted values along the path of steepest ascent. Model predictions deteriorate from a distance of about 2.5–3.0
In the next iteration, we explored the response surface locally using a new second-order
model centered at the current best condition. We proceeded just as for the first
RSM estimation, with a $2^{5-1}$-design combined with axial points and six center point
replicates.
The 32 observations from this experiment are shown in Fig. 10.8 (top row). The
new center conditions have consistently higher values than the conditions of the first
exploration (bottom row), which is a good indication that the predicted increase is
indeed happening in reality. Several conditions show further increase in response,
although the fold-changes in growth are considerably smaller than before. The maxi-
mal increase from the center point to any condition is about 1.5-fold. This is expected
when we assume that we approach the optimum condition.
The center points are more widely dispersed now, which might be due to more
difficult pipetting, especially as we originally used a standard medium. The response
surface might also show more curvature in the current region, such that deviations
from a condition result in larger changes in the response.
Importantly, we do not have to compare the observed values between the two
experiments directly, but need only compare a fold-change in each experiment over
its center point condition. That means that changes in the measurement scale (which
is in arbitrary units) do not affect our conclusions. Here, the observed center point
responses of around 230 agree very well with those predicted from the gradient
pursuit experiment, so we have confidence that the scales of the two experiments are
comparable.
The estimated parameters for a second-order model now present a much cleaner
picture: we find only main effects large and significant, and the interaction of glucose
with the second nitrogen source (amino acids); all other two-way interactions and the
quadratic terms are small. The ANOVA Table 10.8 still indicates some contributions
of interactions and purely quadratic terms, but the lack of fit is now negligible. In
essence, we are now dealing with a first-order model with one substantial interaction.
The model is still a very local approximation, as is evident from the stationary
point far outside the explored region with some values corresponding to negative
concentrations (Table 10.9).
Fig. 10.8 Distribution of measured values for first (bottom) and second (top) exploration experiment with 32 conditions each. Triangles indicate center point replicates
Based on this second model, we again calculated the path of steepest ascent and
followed it experimentally. The predictions and observed values in Fig. 10.9 show
excellent agreement, even though the observed values are consistently higher. The
systematic shift already occurs for the starting condition and a likely explanation is
the gap of several months between second exploration experiment and the gradient
pursuit experiment, leading to differently calibrated measurement scales.
This systematic shift is not a problem in practice, since we have the same center
point condition as a baseline measurement, and since we are mainly interested in
changes of the response along the path, and less in the actual values.

Fig. 10.9 Observed growth and predicted values along the second path of steepest ascent
10.5.5 Conclusion
Fig. 10.10 Summary of data acquired during four experiments in sequential response surface optimization. Additional conditions were tested during second exploration and gradient pursuit (point shape), and two replicates were measured for both gradients (point shading). Initial starting medium is shown in black
Figure 10.10 summarizes the data acquired during the four experiments of the
sequential optimization, showing the measured responses for each tested condition.
Round points indicate trial conditions of the experimental design for RSM exploration
or gradient pursuit, with grey shading denoting two independent replicates. During
the second iteration, additional standard media (YPD and SD) were also measured
to provide a direct comparison with established alternatives. The second gradient
pursuit experiment repeated previous exploration points for direct comparison.
We noted that while the increase in OD was excellent, the growth rate was much
slower than with the starting medium composition. We addressed this problem
by changing our optimality criterion from the simple increase in OD to an increase
penalized by duration. Using the data from the second response surface approxima-
tion, we calculated a new path of steepest ascent based on this modified criterion
and arrived at the final high-cell density medium which provides the same increase
in growth with rates comparable to the initial medium (Roberts et al. 2020).
Notes
Sequential experimentation for optimizing a response was already discussed in
Hotelling (1941), and the response surface methodology was introduced and popu-
larized for engineering statistics by George Box and co-workers (Box and Wilson
1951; Box and Hunter 1957; Box 1954); a current review is Khuri and Mukhopad-
hyay (2010). Two relevant textbooks are the introductory account in Box et al. (2005)
and the more specialized text by Box and Draper (2007); both also discuss canonical
analysis. The classic Cochran and Cox (1957) also provides a good overview. The
use of RSM in the context of biostatistics is reviewed in Mead and Pike (1975).
Using R
The rsm package (Lenth 2009) provides the function rsm() to estimate response
surface models; they are specified using R’s formula framework extended by convenience
functions (Table 10.10). Coding of data into −1/0/+1 coordinates is done
using coded.data(), and steepest() predicts the path of steepest ascent.
Central composite designs (with blocking) are generated by the ccd() function.
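A minimal sketch of this workflow (the data frame name and the coding scales are hypothetical):

library(rsm)
d <- coded.data(rawdata, x1 ~ (Glc - 20)/10, x2 ~ (N1 - 5)/2.5)
m <- rsm(y ~ SO(x1, x2), data = d)   # SO() = FO() + TWI() + PQ()
summary(m)                           # fit, lack of fit, stationary point
steepest(m, dist = seq(0, 2, 0.5))   # path of steepest ascent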
Summary
Response surface methods provide a principled way for finding optimal experimental
conditions that maximize (or minimize) the measured response. The main idea is to
create an experimental design for estimating a local approximation of the response
surface around a given point, determine the path of steepest ascent using the approx-
imation model and then experimentally explore this path. If a ‘better’ experimental
condition is found, the process is repeated from this point until a satisfactory condi-
tion is found.
A commonly used design for estimating the approximation model is the central
composite design. It consists of a (fractional) factorial design augmented by axial
points. Conveniently, axial points and the factorial points are orthogonal and this
allows us to implement the design in several stages, where individual smaller exper-
iments are run for the axial points and for (parts of the fractional) factorial points.
Multiple replicates of the center point allow separation of lack of fit of the model
from residual variance.
References
Box, G. E. P. (1954). “The Exploration and Exploitation of Response Surfaces: Some General
Considerations and Examples”. In: Biometrics 10.1, p. 16.
Box, G. E. P. and N. R. Draper (2007). Response Surfaces, Mixtures, and Ridge Analyses. Wiley &
Sons, Inc.
Box, G. E. P. and J. S. Hunter (1957). “Multi-Factor Experimental Designs for Exploring Response
Surfaces”. In: The Annals of Mathematical Statistics 28.1, pp. 195–241.
Box, G. E. P. and K. B. Wilson (1951). “On the Experimental Attainment of Optimum Conditions”.
In: Journal of the Royal Statistical Society Series B (Methodological) 13.1, pp. 1–45.
Box, G. E. P., J. S. Hunter, and W. G. Hunter (2005). Statistics for Experimenters. Wiley, New York.
Cochran, W. G. and G. M. Cox (1957). Experimental Designs. John Wiley & Sons, Inc.
Hotelling, H. (1941). “Experimental Determination of the Maximum of a Function”. In: The Annals
of Mathematical Statistics 12.1, pp. 20–45.
Khuri, A. I. and S. Mukhopadhyay (2010). “Response surface methodology”. In: WIREs Compu-
tational Statistics 2, pp. 128–149.
Lenth, R. V. (2009). “Response-surface methods in R, using RSM”. In: Journal of Statistical Soft-
ware 32.7, pp. 1–17.
Mead, R. and D. J. Pike (1975). “A Biometrics Invited Paper. A Review of Response Surface
Methodology from a Biometric Viewpoint”. In: Biometrics 31.4, pp. 803–851.
Roberts, T. M., H.-M. Kaltenbach, and F. Rudolf (2020). “Development and optimisation of a
defined high cell density yeast medium”. In: Yeast 37 (5-6), pp. 336–347.
Index 269
internal, 5 of F-distribution, 26
Variance of independent difference, 38
arithmetic rules, 19 of mean estimator, 31
definition, 19 of t-distribution, 26
estimator, 29 of variance estimator, 31
between-groups, 71 pooled estimate, 38
within-groups, 71 residual, 27
inhomogeneous Variance component, 22
example, 115 RCBD, 162, 163
of a difference, 19
of χ2 -distribution, 25
of dependent difference, 40 Y
of dependent variables, 21 Youden design, 189