A Guide to Sample Size for Animal-Based Studies
Penny S. Reynolds
Department of Anesthesiology, College of Medicine
Department of Small Animal Clinical Sciences
College of Veterinary Medicine
University of Florida, Gainesville
Florida, USA
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic,
mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is
available at http://www.wiley.com/go/permissions.
The right of Penny S. Reynolds to be identified as the author of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book
may not be available in other formats.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other
countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not
associated with any product or vendor mentioned in this book.
Preface
'How large a sample size do I need for my study?' Although this is one of the most commonly asked questions in statistics, the importance of proper sample size estimation seems to be overlooked by many preclinical researchers. Over the past two decades, numerous reviews of the published literature indicate many studies are too small to answer the research question and results are too unreliable to be trusted. Few published studies present adequate justification of their chosen sample sizes or even report the total number of animals used. On the other hand, it is not unusual for protocols (usually those involving mouse models) to request preposterous numbers of animals, sometimes in the tens or even hundreds of thousands, 'because this is an exploratory study, so it is unknown how many animals we will require'.

This widespread phenomenon of sample sizes based on nothing more than guesswork or intuition illustrates the pervasiveness of what Amos Tversky and Daniel Kahneman identified in 1971 as the 'belief in the law of small numbers'. Researchers overwhelmingly rely on best judgement in planning experiments, but judgement is almost always misleading. Researchers choose sample sizes based on what 'worked' before or because a particular sample size is a favourite with the research community. Tversky and Kahneman showed that researchers who gamble their research results on small intuitively-based samples consistently have the odds stacked against their findings (even if results are true). They overestimate the stability and precision of their results, and fail to account for sampling variation as a possible reason for observed pattern. The result is research waste on a colossal scale, especially of animals, that is increasingly difficult to justify.

This book was written to assist non-statisticians who use animals in research to 'right-size' experiments, so they are statistically, operationally, and ethically justifiable. A 'right-sized' experiment has a clear plan for sample size justification and transparently reports the numbers of all animals used in the study. For basic and veterinary researchers, appropriate sample sizes are critical to the design and analysis of a study. The best sample sizes optimise study design to align with available resources and ensure the study is adequately powered to detect meaningful, reliable, and generalisable results. Other stakeholders not directly involved in animal experimentation can also benefit from understanding the basic principles involved. Oversight veterinarians and ethical oversight committees are responsible for appraising animal research protocols for compliance with best-practice, ethical, and regulatory standards. An appreciation of sample size construction can help assess scientific and ethical justifications for animal use and whether the proposed sample size is fit for purpose. Funding agencies and policymakers use research results to inform decisions related to animal welfare, public health, and future scientific benefit. Understanding the logic behind sample size justification can assist in evaluation of study quality and reliability of research findings, and ultimately promote more informed evidence-based decision-making.

An extensive background in statistics is not required, but readers should have had some basic statistical training. The emphasis throughout is on the upstream components of the research process – statistical process, study planning, and sample size calculations rather than analysis. I have used real data in nearly all examples and provided formulae and code, so sample size approximations can be reproduced by hand or by computer. By training and inclination I prefer SAS, but whenever possible I have provided R code or links to R libraries.
Acknowledgements
Many thanks to Anton Bespalov (PAASP, Heidelberg, Germany); Cori Astrom, Christina Hendricks, and Bryan Penberthy (University of Florida); Cora Mezger, Maria Christodoulou, and Mariagrazia Zottoli (Department of Statistics, University of Oxford); and Megan Lafollette (North American 3Rs Collaborative), who kindly reviewed various chapters of this book whilst it was in preparation and provided much helpful feedback. Thanks to the University of Florida IACUC chairs Dan Brown and Rebecca Kimball, who encouraged researchers to consult the original 10-page handout I had devised for sample size estimation. And last, but certainly not least, special thanks to Tim Morey, Chair of the Department of Anesthesiology, University of Florida, who encouraged me to put that handout into book form.

Thanks are also due to the University of Florida Faculty Endowment Fund for providing me with a Faculty Enhancement Opportunities grant to allow me to devote some concentrated time to writing. A generous honorarium from the Scientist Center for Animal Welfare (SCAW) and an award from the UK Animals in Science Education Trust enabled me to upgrade my home computer system, making working on this project immeasurably easier.

The book was nearing completion when I came across the Icelandic word sprakkar, which means 'extraordinary women'. I have been fortunate to encounter many sprakkar whilst writing this book. In addition to the women (and men!) already mentioned, special thanks to researchers Amara Estrada, Francesca Griffin, Autumn Harris, Maggie Hull, Wendy Mandese, and Elizabeth Nunamaker, who generously allowed me to use some of their data as examples. And special thanks to Jane Buck and Julie Laskaris for their wonderful friendship and hospitality over the years. Jane Buck, Professor Emerita of Psychology, Delaware State University, and past president of the American Association of University Professors, continues to amaze and show what is possible for a statistician 'with attitude'. Julie advised me that the only approach to properly edit one's own work on a book-length project was to 'slit its throat', then told me to do as she said, not as she actually did. Cheers.
I
What is Sample Size?
of resources, and the ethical requirement to minimise waste and suffering of research animals. Thus, sample size calculations are not a single calculation but a set of calculations, involving iteration through formal estimates, followed by reality checks for feasibility and ethical constraints (Reynolds 2019). Additional challenges to right-sizing experiments include those imposed by experimental design and biological variability (Box 1.2). In The Principles of Humane Experimental Technique (1959), Russell and Burch were very clear that Reduction is achieved by systematic strategies of experimentation rather than trial and error. In particular, they emphasised the role of the statistically based family of experimental designs and design principles proposed by Ronald Fisher, still relatively new at the time. Formal experimental designs customised to address the particular research question increase the experimental signal through the reduction of variation. Design principles that reduce bias, such as randomisation and allocation concealment (blinding), increase validity. These methods increase the amount of usable information that can be obtained from each animal (Parker and Browne 2014).

Although it has now been almost a century since Fisher-type designs were developed, many researchers in biomedical sciences still seem unaware of their existence. Many preclinical studies reported in the literature consist of numerous two-group designs. However, this approach is inefficient, inflexible, and unsuited to exploratory studies with multiple explanatory variables (Reynolds 2022). Statistically based designs are rarely reported in the preclinical literature. In part, this is because the design of experiments is seldom taught in introductory statistics courses directed towards biomedical researchers.

Power calculations are the gold standard for sample size justification. However, they are commonly misapplied, with little or no consideration of study design, type of outcome variable, or the purpose of the study. The most common power calculation is for two-group comparisons of independent samples. However, this is inappropriate when the study is intended to examine multiple independent factors and interactions. Power calculations for continuous variables are not appropriate for correlated observations or count data with a high prevalence of zeros. Power calculations cannot be used at all when statistical inference is not the purpose of the study, for example, assessment of operational and ethical feasibility, descriptive or natural history studies, and species inventories.

BOX 1.2
Challenges for Right-Sizing Animal-Based Studies

Ethics and welfare considerations. The three Rs (Replacement, Reduction, and Refinement) should be the primary driver of animal numbers.
Experimental design. Animal-based research has no design culture. Clinical trial models are inappropriate for exploratory research. Multifactorial agriculture/industrial designs may be more suitable in many cases, but they are unfamiliar to most researchers.
Biological variability. Animals can display significant differences in responses to interventions, making it challenging to estimate an appropriate sample size.
Cost and resource constraints. The financial cost of conducting animal-based research, including the cost of housing, caring for, and monitoring the animals, must be considered in estimates of sample size.

Evidence of right-sizing is provided by a clear plan for sample size justification and transparent reporting of the number of all animals used in the study. This is why these items are part of best-practice reporting standards for animal research publications (Kilkenny et al. 2010; Percie du Sert et al. 2020) and are essential for the assessment of research reproducibility (Vollert et al. 2020). Unfortunately, there is little evidence that either sample size justification or sample size reporting has improved over the past decade. Most published animal research studies are underpowered and biased (Button et al. 2013; Henderson et al. 2013; Macleod et al. 2015), with poor validity (Würbel 2017; Sena and Currie 2019), severely limiting reproducibility and translation potential (Sena et al. 2010; Silverman et al. 2017). A recent cross-sectional survey of mouse cancer model papers published in high-impact oncology journals found that fewer than 2% reported formal power calculations, and less than one-third reported sample size per group. It was impossible to determine attrition losses, or how many experiments (and therefore animals) were discarded due to failure to achieve statistical significance (Nunamaker and Reynolds 2022).
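As a minimal illustration of the two-group power calculation just described, here is a sketch in base R; the effect size, variability, and error rates are hypothetical placeholders, not values taken from the text:

    # Hypothetical two-group comparison of independent samples:
    # detect a difference of 1 standard deviation, two-sided
    # significance level 0.05, power 0.80.
    power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.80,
                 type = "two.sample", alternative = "two.sided")
    # Gives approximately n = 17 animals per group (N = 34 in total),
    # valid only for independent experimental units and a continuous outcome.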
The most common sample size mistake is not performing any calculations at all (Fosgate 2009). Instead, researchers make vague and unsubstantiated statements such as 'Sample size was chosen because it is what everyone else uses' or 'experience has shown this is the number needed for statistical significance'. Researchers often game, or otherwise adjust, calculations to obtain a preferred sample size (Schulz and Grimes 2005; Fitzpatrick et al. 2018). In effect, these studies were performed without justification of the number of animals used.

Statistical thinking is both a mindset and a set of skills for understanding and making decisions based on data (Tong 2019). Reproducible data can only be obtained by sustained application of statistical thinking to all experimental processes: good laboratory procedure, standardised and comprehensive operating protocols, appropriate design of experiments, and methods of collecting and analysing data. Appropriate strategies of sample size justification are an essential component.

1.1 Organisation of the Book
This book is a guide to methods of approximating sample sizes. There will never be one number or approach, and sample size will be determined for the most part by study objectives and choice of the most appropriate statistically based study design. Although advanced statistical or mathematical skills are not required, readers are expected to have at least a basic course on statistical analysis methods and some familiarity with the basics of power and hypothesis testing. SAS code is provided in appendices at the end of each chapter, and references to specific R packages are given in the text. It is strongly recommended that everyone involved in devising animal-based experiments take at least one course in the design of experiments, a topic not often covered by statistical analysis courses.

This book is organised into four sections (Figure 1.1).

Figure 1.1: Overview of book organisation. For animal numbers to be justifiable (Are they feasible? appropriate? ethical? verifiable?), sample size should be determined by formal quantitative calculations (arithmetic, probability-based, precision-based, power-based) and consideration of operational constraints.

Part I Sample size basics discusses definitions of sample size, elements of sample size determination, and strategies for maximising information power without increasing sample size.

Part II Feasibility. This section presents strategies for establishing study feasibility with pilot studies. Justification of animal numbers must first address questions of operational feasibility ('Can it work?' Is the study possible? suitable? convenient? sustainable?). Once operational logistics are standardised, pilot studies can be performed to establish empirical feasibility ('Does it work?' Is the output large enough to be measured? consistent enough to be reliable?)
and translational feasibility ('Will it work?' proof of concept and proof of principle) before proceeding to the main experiments. Power calculations are not appropriate for most pilots. Instead, common-sense feasibility checks include basic arithmetic (with structured back-of-the-envelope calculations), simple probability-based calculations, and graphics (a worked arithmetic sketch follows this overview).

Part III Description. This section presents methods for summarising the main features of the sample data and results. Basic descriptive statistics provide a simple and concise summary of the data in terms of central tendency and dispersion or spread. Graphical representations are used to identify patterns and outliers and explore relationships between variables. Intervals computed from the sample data are the range of values estimated to contain the true value of a population parameter with a certain degree of confidence. Four types of intervals are discussed: confidence intervals, prediction intervals, tolerance intervals, and reference intervals. Intervals shift emphasis away from significance tests and P-values to more meaningful interpretation of results.

Part IV Comparisons. Power-based calculations for sample size are centred on understanding effect size in the context of specific experimental designs and the choice of outcome variables. Effect size provides information about the practical significance of the results beyond considerations of statistical significance. Specific designs considered are two-group comparisons, ANOVA-type designs, and hierarchical designs.
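The following base R sketch illustrates the kind of structured back-of-the-envelope arithmetic mentioned under Part II; all quantities are invented for illustration, not values from the book:

    n_per_group <- 10                  # animals needed per group for analysis
    attrition   <- 0.20                # anticipated 20% loss (deaths, exclusions)
    n_enrol     <- ceiling(n_per_group / (1 - attrition))  # 13 enrolled per group
    k_groups    <- 4
    n_total     <- k_groups * n_enrol  # 52 animals requested in total
    n_total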
References

Button, K.S., Ioannidis, J.P.A., Mokrysz, C. et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14: 365–376.
Fitzpatrick, B.G., Koustova, E., and Wang, Y. (2018). Getting personal with the "reproducibility crisis": interviews in the animal research community. Lab Animal (NY) 47: 175–177.
Fosgate, G.T. (2009). Practical sample size calculations for surveillance and diagnostic investigations. Journal of Veterinary Diagnostic Investigation 21: 3–14. https://doi.org/10.1177/104063870902100102.
Graham, M.L. and Prescott, M.J. (2015). The multifactorial role of the 3Rs in shifting the harm-benefit analysis in animal models of disease. European Journal of Pharmacology 759: 19–29. https://doi.org/10.1016/j.ejphar.2015.03.040.
Henderson, V.C., Kimmelman, J., Fergusson, D. et al. (2013). Threats to validity in the design and conduct of preclinical efficacy studies: a systematic review of guidelines for in vivo animal experiments. PLoS Medicine 10: e1001489.
Kilkenny, C., Browne, W.J., Cuthill, I.C. et al. (2010). Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research. PLoS Biology 8 (6): e1000412. https://doi.org/10.1371/journal.pbio.1000412.
Macleod, M.R., Lawson McLean, A., Kyriakopoulou, A. et al. (2015). Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biology 13: e1002301. https://doi.org/10.1371/journal.pbio.1002273.
Nunamaker, E.A. and Reynolds, P.S. (2022). "Invisible actors"—how poor methodology reporting compromises mouse models of oncology: a cross-sectional survey. PLoS ONE 17 (10): e0274738. https://doi.org/10.1371/journal.pone.0274738.
Parker, R.M.A. and Browne, W.J. (2014). The place of experimental design and statistics in the 3Rs. ILAR Journal 55 (3): 477–485.
Percie du Sert, N., Hurst, V., Ahluwalia, A. et al. (2020). The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biology 18 (7): e3000410. https://doi.org/10.1371/journal.pbio.3000410.
Reynolds, P.S. (2019). When power calculations won't do: Fermi approximation of animal numbers. Lab Animal (NY) 48: 249–253.
Reynolds, P.S. (2021). Statistics, statistical thinking, and the IACUC. Lab Animal (NY) 50 (10): 266–268. https://doi.org/10.1038/s41684-021-00832-w.
Reynolds, P.S. (2022). Between two stools: preclinical research, reproducibility, and statistical design of experiments. BMC Research Notes 15: 73. https://doi.org/10.1186/s13104-022-05965-w.
Russell, W.M.S. and Burch, R.L. (1959). The Principles of Humane Experimental Technique. London: Methuen.
Schulz, K.F. and Grimes, D.A. (2005). Sample size calculations in randomised trials: mandatory and mystical. Lancet 365 (9467): 1348–1353. https://doi.org/10.1016/S0140-6736(05)61034-3.
Sena, E.S. and Currie, G.L. (2019). How our approaches to assessing benefits and harms can be improved. Animal Welfare 28: 107–115.
Sena, E.S., van der Worp, H.B., Bath, P.M. et al. (2010). Publication bias in reports of animal stroke studies leads to major overstatement of efficacy. PLoS Biology 8 (3): e1000344. https://doi.org/10.1371/journal.pbio.1000344.
Silverman, J., Macy, J., and Preisig, P. (2017). The role of the IACUC in ensuring research reproducibility. Lab Animal (NY) 46: 129–135.
Tong, C. (2019). Statistical inference enables bad science; statistical thinking enables good science. American Statistician 73: 246–261.
Vollert, J., Schenker, E., Macleod, M. et al. (2020). Systematic review of guidelines for internal validity in the design, conduct and analysis of preclinical biomedical experiments involving laboratory animals. BMJ Open Science 4 (1): e100046. https://doi.org/10.1136/bmjos-2019-100046.
Würbel, H. (2017). More than 3Rs: the importance of scientific validity for harm-benefit analysis of animal research. Lab Animal (NY) 46: 164–166.
2
Sample Size Basics
Figure 2.1: Units of replication. (a) Experimental unit = individual animal = biological unit: the entire entity to which an experimental or control intervention can be independently applied. There are two treatment interventions, A or B. Here each mouse receives a separate intervention, and the individual mouse is the experimental unit (EU). The individual mouse is also the biological unit. (b) Experimental unit = group of animals. There are two treatment interventions, A or B. Each dam receives either A or B, but measurements are conducted on the pups in each litter. The experimental unit is the dam (N = 2), and the biological unit is the pup (n = 8). For this design, the number of pups cannot contribute to the test of the central hypothesis. (c) Experimental unit with repeated observations. The experimental unit is the individual animal (= biological unit), with four sequential measurements made on each animal. The sample size is N = 2. (d) Experimental unit = part of each animal. There are two treatment interventions, A or B. Treatment A is randomised to either the right or left flank of each mouse, and B is injected into the opposite flank of that mouse. The experimental unit is the flank (N = 8). The individual mouse is the biological unit. Each mouse can be considered statistically as a block with paired observations within each animal.
applied (Figure 2.1a). Cox and Donnelly (2011) define it as the 'smallest subdivision of the experimental material such that two distinct units might be randomized (randomly allocated) to different treatments.' Whatever happens to one experimental unit will have no bearing on what happens to the others (Hurlbert 2009). If the test intervention is applied to a 'grouping' other than the individual animal (e.g. a litter of mice, a cage or tank of animals, a body part; Figure 2.1b–d), then the sample size N will not be the same as the number of animals.

The total sample size N refers to the number of independent experimental units in the sample. The classic meaning of a 'replicate' refers to the number of experimental units within a treatment or intervention group. Therefore, replicating experimental units (and hence increasing N) contributes to statistical power for testing the central statistical hypothesis. Power calculations estimate the number of experimental units required to test the hypothesis.
The assignment of treatments and controls to experimental units should be randomised if the intention is to perform statistical hypothesis tests on the data (Cox and Donnelly 2011).
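A minimal base R sketch of one way to generate such a randomised allocation (the animal labels, group sizes, and seed are arbitrary illustrations, not a scheme prescribed by the text):

    set.seed(101)
    mice      <- paste0("mouse", 1:8)                # eight experimental units
    treatment <- sample(rep(c("A", "B"), each = 4))  # balanced random allocation
    data.frame(mice, treatment)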
Independence of experimental units is essential for most null hypothesis statistical tests and methods of analysis and is the most important condition for ensuring the validity of statistical inferences (van Belle 2008). Non-independence of experimental units occurs with repeated measures and multi-level designs and must be handled by the appropriate statistically based designs and analyses for hypothesis tests to be valid.

2.3 Biological Unit
The biological unit is the entity about which inferences are to be made. Replicates of the biological unit are the number of unique biological samples or individuals used in an experiment. Replication of biological units captures biological variability between and within these units (Lazic et al. 2018). The biological unit is not necessarily the same as the experimental unit. Depending on how the treatment intervention is randomised, the experimental unit can be an individual biological unit, a group of biological units, a sequence of observations on a single biological unit, or a part of a biological unit (Lazic and Essioux 2013; Lazic et al. 2018). The biological unit of replication may be the whole animal or a single biological sample, such as strains of mice, cell lines, or tissue samples (Table 2.1).
Table 2.1: Units of Replication in a Hypothetical Single-Cell Gene Expression RNA Sequencing Experiment. Designating a
given replicate unit as an experimental unit depends on the central hypothesis to be tested and the study design.
2.4 Technical Replicates
Technical replicates or repeats are multiple measurements made on subsamples of an experimental unit (Figure 2.2). Technical replicates are used to obtain an estimate of measurement error, the difference between a measured quantity and its true value. Technical replicates are essential for assessing internal quality control of experimental procedures and processes, and for ensuring that results are not an artefact of processing variation (Taylor and Posch 2014). Differences between operators and instruments, instrument drift, subjectivity in determination of measurement landmarks, or faulty calibration can result in measurement error. Cell cultures and protein-based experiments can also show considerable variation from run to run, so in vitro experiments are usually repeated several times. At least three technical replicates of Western blots, PCR measurements, or cell proliferation assays may be necessary to assess reliability of technique and confirm validity of observed changes in protein levels or gene expression (Taylor and Posch 2014).

The variance calculated from the multiple measurements is an estimate of the precision, and therefore the repeatability, of the measurement. Technical replicates measure the variability between measurements on the same experimental units. Repeated measurements increase the precision only for estimates of the measurement error; they do not measure variability either within or between treatment groups. Therefore, increasing the number of technical replicates does not improve power or contribute to the sample size for testing the central hypothesis. Analysing technical repeats as independent measurements is pseudo-replication.

High-dimensionality studies produce large amounts of output information per subject. Examples include multiple DNA/RNA microarrays, biochemistry assays, biomarker studies, proteomics, metabolomics, and inflammasome profiles. These studies may require a number of individual animals, either for operational purposes (for example, to obtain enough tissue for processing) or as part of the study design (for example, to estimate biological variation). Sample size will then be determined by the amount of tissue required for the assay technical replicates, or by design-specific requirements for power. Design features include anticipated response/expression rates, expected false positive rate, and number of sampling time points (Lee and Whitmore 2002; Lin et al. 2010; Jung and Young 2012).

Example: Experimental Units with Technical Replication
Two treatments A and B are randomly allocated to six individually housed mice, with three mice receiving A and three receiving B. Lysates are obtained from each mouse in three separate aliquots (Figure 2.2).
The individual mouse is the experimental unit because treatments can be independently and randomly allocated to each mouse. There are three subsamples or technical replicates per mouse. The total sample size is N = 6, with k = 2 treatments, n = 3 mice per treatment group, and j = 3 technical replicates per mouse. The total sample size N is 6, not 18.
Figure 2.2: Experimental unit versus technical replicates. Two treatments A and B are randomly allocated to six mice.
The individual mouse is the experimental unit. Three lysate aliquots are obtained from each mouse. These are technical
replicates. The total sample size N is 6, not 18.
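A minimal base R sketch of the analysis this example implies, with data simulated purely for illustration: technical replicates are first averaged within each mouse, so the test is performed on the N = 6 experimental units, not the 18 aliquots.

    set.seed(1)
    dat <- data.frame(
      mouse     = rep(paste0("m", 1:6), each = 3),  # 6 experimental units
      treatment = rep(c("A", "B"), each = 9),       # 3 mice per treatment
      y         = rnorm(18)                         # 3 aliquots per mouse
    )
    mouse_means <- aggregate(y ~ mouse + treatment, data = dat, FUN = mean)
    t.test(y ~ treatment, data = mouse_means)  # correct: analysis on N = 6
    # t.test(y ~ treatment, data = dat) would wrongly treat N as 18:
    # pseudo-replication.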
2.5 Repeats, Replicates, and Pseudo-Replication
Confusion of repeats with replicates is a problem of study design, and pseudo-replication is a problem of analysis. Study validity is compromised by incorrect identification of the experimental unit. A replicate is

correct replication unit, and almost half showed pseudo-replication, suggesting that inferences based on hypothesis tests were likely invalid.

Three of the most common types of pseudo-replication are simple, sacrificial, and temporal. Others are described in Hurlbert and White (1993) and Hurlbert (2009).

Simple pseudo-replication occurs when there is only a single replicate per treatment. There may be multiple observations, but they are not obtained from independent experimental replicates. The artificial inflation of sample size results in estimates of the standard error that are too small, contributing to an increased Type I error rate and an increased number of false positives.

Example: Mouse Photoperiod Exposure
A study on circadian rhythms was conducted to assess the effect of two different photoperiods on mouse wheel-running. Mice in one environmental chamber were exposed to a long photoperiod with 14 hours of light, and mice in a second chamber to a short photoperiod with 6 hours of light. There were 15 cages in each chamber, with four mice per cage. What is the effective sample size?
This is simple pseudo-replication. The experimental unit is the chamber, so the effective sample size is one per treatment. Analysing the data as if there is a sample size of n = 60 (or even n = 15) per treatment is incorrect. The number of mice and cages in each chamber is irrelevant. This design implicitly assumes that chamber conditions are uniform and chamber effects are zero. However, variation both between chambers and between repeats for the same chamber can be considerable (Potvin and Tardif 1988; Hammer and Hopper 1997). Increasing the sample size of mice will not remedy this situation, because chamber environment is confounded with photoperiod. It is, therefore, not possible to estimate experimental error, and inferential statistics cannot be applied. Analysis should be restricted to descriptive statistics only. The study should be re-designed, either to allow replication across several chambers or, if chambers are limited, as a multi-batch design replicated at two or more time points.

Sacrificial pseudo-replication occurs when there are multiple replicates within each treatment arm and the data are structured as a feature of the design (such as pairing, clustering, or nesting), but the design structure is ignored in the analyses. The units are treated as independent, so the degrees of freedom for testing treatment effects are too large. Sacrificial pseudo-replication is especially common in studies with categorical outcomes when the χ2 test or Fisher's exact test is used for analysis (Hurlbert and White 1993; Hurlbert 2009).

Example: Sunfish Foraging Preferences
Dugatkin and Wilson (1992) studied feeding success and tankmate preferences in 12 individually marked sunfish housed in two tanks. Preference was evaluated for each fish for all possible pairwise combinations of two other tankmates. There were 2 groups × 60 trials per group × 2 replicate sets of trials, for a total of 240 observations. They concluded that feeding success was weakly but statistically significantly correlated with aggression (P < 0.001), based on 209 degrees of freedom, and that fish in each group strongly preferred (P < 0.001) the same individual in each of the two replicate preference experiments, based on 60 observations.
The actual number of experimental units is 12, with 6 fish per tank. The correct degrees of freedom for the regression analysis is 4, not 209. Suggested analyses for preference data included one-sample t-tests with 5 degrees of freedom or a one-tailed Wilcoxon matched-pairs test with N = 12. Correct analyses would produce much larger P-values, suggesting that interpretation of these data requires substantial revision (Lombardi and Hurlbert 1996).

Temporal (or spatial) pseudo-replication occurs when multiple measurements are obtained sequentially on the same experimental units but are analysed as if they represent independent experimental units. Sequential observations (or repeated measures) are correlated within each individual. Repeated measures increase the precision of within-unit estimates, but the number of repeated measures does not increase the power for estimating treatment effects.
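A minimal sketch of one common remedy for temporal pseudo-replication: summarise the repeated measures within each experimental unit before a simple between-group test. The data and structure below are invented for illustration; a mixed-effects model is the more general approach.

    set.seed(2)
    reps <- data.frame(
      animal    = rep(paste0("a", 1:10), each = 4),  # 10 animals, 4 time points each
      treatment = rep(c("A", "B"), each = 20),       # 5 animals per treatment
      y         = rnorm(40)
    )
    unit_means <- aggregate(y ~ animal + treatment, data = reps, FUN = mean)
    t.test(y ~ treatment, data = unit_means)  # df reflect 10 animals, not 40 rows
    # Alternatively, model the within-animal correlation directly, e.g. with lme4:
    # lme4::lmer(y ~ treatment + (1 | animal), data = reps)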
testable question. A well-constructed research question consists of four concept areas: the study population or problem of interest, the test intervention, the comparators or controls, and the outcome. The format is modified according to study type (Box 3.1 and Figure 3.1). Structuring the research question enables clear identification and discrimination of causes (factors that are manipulated or serve as comparators), effects (the outcomes that are measured to assess causality), and the test platform (the animals used to assess cause and effect). Breaking the research question into components allows the identification and correction of metrics that are otherwise poorly defined or unmeasurable.

BOX 3.1
The 'Well-Built' Research Question

Experimental/intervention studies: PICOT
Population/Problem
Intervention
Comparators/Controls
Outcome
Time frame, follow-up

Observational studies: PECOT
Population/Problem
Exposure
Comparators
Outcome
Time frame, follow-up

Diagnostic studies: PIRT
Population/Problem
Index test
Reference/gold standard
Target condition

Figure 3.1: System diagram for the 'well-built' research question, linking the 'Patient'/'Platform'/'Population' (P), Interventions (I), Comparators/controls (C), Outcomes (O), and Time (T).

A well-constructed research question is essential for effective literature searches. Comprehensive literature reviews provide current evidence-based assessments of the scientific context, the research gaps to be addressed, suitability of the proposed animal and disease model, and more realistic assessments of potential harms and benefits of the proposed research (Ritskes-Hoitinga and Wever 2018; Ormandy et al. 2019). Collaborative research groups such as CAMARADES (https://www.ed.ac.uk/clinical-brain-sciences/research/camarades/about-camarades) and SYRCLE (https://www.syrcle.network/) are excellent resources for certain specialities, such as stroke, neuropathic pain, and toxicology, and provide a number of e-training resources and tools for assessing research quality. Construction of the research question in the PICOT framework was originally developed for evidence-based medicine. Information on constructing research questions and designing literature searches can be obtained from university library resources sections and the Oxford Centre for Evidence-based Medicine website.

The research question dictates formation of both the research hypothesis and related statistical hypotheses. These are often confused or conflated. The research hypothesis is a testable and quantifiable proposed explanation for an observed or predicted relationship between variables or patterns of events. It should be rooted in a plausible mechanism as to why the observation occurred. One or more testable predictions should follow logically from the central hypothesis ('If A happens, then B should occur, otherwise C'). A description of the scientific hypothesis provides justification for the experiments to be performed, why animals are needed, and the rationale for the species, type or strain of animals, and justification of animal numbers.

The statistical hypothesis is a mathematically-based statement about a specific statistical population parameter.
Hypothesis tests are the formal testing procedures based on the underlying probability distribution of the relevant sample test statistic. Choice of statistical test will depend upon the statistical hypothesis, the study design, and the types of variables (continuous, categorical, ordinal, time-to-event) designated as inputs or outcomes. Statistical hypotheses should be a logical extension of the research hypothesis (Bolles 1962). However, the research hypothesis may not immediately conform to any one statistical hypothesis, and multiple statistical hypotheses may be required to adequately test all predictions generated from the research hypothesis.

3.3 Structured Inputs (Experimental Design)
The design of an animal-based study will affect estimates of sample size. Good study designs are an essential part of the 3Rs (Russell and Burch 1959; Kilkenny et al. 2009; Karp and Fry 2021; Gaskill and Garner 2020; Eggel and Würbel 2021). Rigorous, statistically-based experimental designs consist of the formal arrangement and structuring of independent (or explanatory) variables hypothesised to affect the outcome (Box 3.2). The optimum design will depend on the specific research problem addressed. However, to be fit for purpose, all designs must facilitate discrimination of signal from noise by identifying and separating out contributions of explanatory variables from different sources of variation (Reynolds 2022). By increasing the power to detect real treatment differences, a properly designed experiment requires far less time, money, and resources (including animals) for the amount of information obtained.

Well-designed studies start with a well-constructed research question and well-defined input and output variables. A good design also incorporates specific design features that ensure results are reliable and valid. These include correct specification of the unit of analysis (or experimental unit), relevant inclusion and exclusion criteria, bias minimisation methods (such as allocation concealment and randomisation; Section 3.10), and minimisation of variation (Addelman 1970). Useful designs for animal-based research include completely randomised designs, randomised complete block designs, factorial designs, and split-plot designs (Festing and Altman 2002; Montgomery 2017; Festing 2020; Karp and Fry 2021).

BOX 3.2
Statistical Design of Experiments: Components

Study design. Formal structuring of input or explanatory variables according to statistically-based design principles.
Study design features. Unit of analysis (experimental unit), inclusion/exclusion criteria, bias minimisation methods, sources of variation.

Example: Research Versus Statistical Hypotheses
Based on a comprehensive literature review, an investigator determined that ventricular dysrhythmia after myocardial infarction is associated with high risk of subsequent sudden cardiac arrest in humans (clinical observation, clinical pattern). The investigator wished to design a series of experiments using a mouse model of myocardial infarction to test the effects of several candidate drugs, with the goal of reducing sudden cardiac death.

Scientific hypothesis. Pharmacological suppression of ventricular dysrhythmia should result in clinically important reductions in the incidence of sudden cardiac death.
Research question. In a mouse model of dysrhythmia following myocardial infarction (P), will drug X (I), when compared to a saline vehicle solution (C), result in fewer deaths (O) at four weeks post-administration (T)?
Quantified outcomes. Number of deaths (n) in each group and proportion of deaths (p) in each group.
Statistical hypothesis. The null hypothesis of no difference in the proportion of deaths for mice treated with drug X (pX) versus mice treated with control C (pC) is H0: pX = pC, or equivalently H0: pX − pC = 0.
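A minimal sketch of the corresponding two-proportion power calculation in base R; the mortality proportions below are hypothetical placeholders, not values given in the example:

    # Hypothetical: 50% mortality with saline vehicle (pC), 20% with drug X (pX),
    # two-sided alpha = 0.05, power = 0.80.
    power.prop.test(p1 = 0.50, p2 = 0.20, sig.level = 0.05, power = 0.80)
    # Approximately 39 mice per group would be required to detect this difference.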
Sample size balance. Balanced allocation designs with equal sample sizes n per group have the highest precision and power for a given total sample size. Equal sample sizes may reduce the effect of chance imbalances in measured covariates, therefore minimising the potential for bias in the results. Unbalanced sample size allocation increases the variability of the sample, reducing precision and power and making it more difficult to detect true differences between groups.

3.7 Appropriate Comparators and Controls
Comparators and controls are used to establish cause-and-effect relationships between interventions and responses (Box 3.6). By providing a standard of comparison, controls act as a reference for evaluating whether results are due to the effect of the independent or explanatory variables or are instead an artefact of some unknown confounder or time dependency in the data. Controls, therefore, provide the benchmark against which intervention effects can be judged.

BOX 3.6
Appropriate Comparators and Controls
Consider:
1. Type: positive, negative, standard, sham, vehicle, strain, and self.
2. Allocation: concurrent versus historical.
3. Necessity: Is a specific type of control required?
4. Non-redundancy: Is a specific type of control informative?

Positive controls assess the validity of the test; that is, whether or not the response can actually be measured or distinguished from background noise. Negative controls assess the magnitude of response that would be expected if the test intervention had no effect. Standard controls are procedures or agents for which the effect is already known and verified (standard of care), to be compared to the new intervention. Sham controls are common in experiments where surgical procedures are required for instrumentation or to induce iatrogenic disease or injury. Shams account for inflammatory, morphological, and behavioural effects associated with surgical procedures, anaesthesia, analgesia, and post-surgical pain. They are used to discriminate systemic effects resulting from the procedures from those actually induced by the test intervention. However, sham procedures may be inappropriate controls for some models and can profoundly confound results (Cole et al. 2011). They may be unnecessary for certain models (see below).

Vehicle controls assess the potential physiological effect of the active agent by accounting for the potential effects of the excipient, carrier, or solubilising solution in which the agent is suspended. Vehicles may not be physiologically inert and can radically affect experimental outcomes, so they must be chosen carefully (Moser 2019).

Matching is appropriate for both experimental and observational studies, especially when sample sizes are small and experimental units
However, appropriate and informative estimates of priors require historical studies to be high-quality and well-powered to begin with. If prior data are biased and inadequate, priors should not be used, and sample sizes for the new experiment should be obtained from appropriate power calculations in the usual way (Bonapersona et al. 2021).

Studies that generate preliminary data for subsequent studies should include controls (and, if possible, be randomised and blinded) to strengthen the evidence for or against proceeding with the investigation. Study designs where animals act as their own control may not require a separate control group (see Bate and Clark 2014; Lazic 2016; Karp and Fry 2021).
investigators confuse the technical probability-based meaning of 'random' with the lay meaning of 'unplanned' or 'haphazard' (Reynolds and Garvan 2020). Non-randomisation seriously compromises study validity, both by inflation of the false positive rate and by invalidation of statistical hypothesis tests.

3.10 Think Sequentially
The most efficient and cost-effective experiments are iterative (Box 3.9). That is, results of previous experiments should inform the approach for the next set of experiments, allowing rapid convergence to an optimal solution (Box et al. 2005). In practice, however, preclinical research is typified by numerous two-group, trial-and-error, OFAT (one-factor-at-a-time) comparisons. It is sometimes argued that the OFAT approach 'promotes creativity' and discovery of serendipitous results. However, there are several major disadvantages of OFAT. These include the inability to detect interactions between multiple factors (which results in most 'discovery by chance'), prioritising statistical significance over biological or clinical relevance, and extremely inefficient use of animals compared to statistically-based designs (Montgomery 2017; Festing 2020; Gaskill and Garner 2020).

Adaptive designs are one type of iterative design common in human randomised clinical trials for assessing efficacy, futility, or safety. They are characterised by planned changes based on observed data that allow refinement of sample size, dropping of unpromising interventions or drug doses, and early termination for futility, thus increasing power and efficiency and reducing costs (Pallmann et al. 2018). However, most adaptive design methods are unsuited for preclinical studies because of different study objectives, complex methodology, and sample sizes that would be prohibitively large for animal-based experiments.

Sequential assembly is an alternative approach that will accomplish many of the same objectives as adaptive designs and is more suited to animal-based studies (Box et al. 2005; Montgomery 2017). Experiments are designed to run in series, with results from the preceding stage informing the design of the next stage. A pilot phase may be necessary to determine logistics and standardise procedures. The next phase is focused on factor reduction: 'screening out the trivial many from the significant few'. In this phase, experiments consist of two-level factorial designs, where a relatively large number of candidate inputs or interventions thought to affect the response are reduced to a few of the more promising or important. Subsequent experiments can then be designed to optimise responses based on the specific factors identified in the preceding phases. In many cases, only two or three iterations are required to obtain the best or optimal response. Advantages of the sequential assembly approach include overall reduction of sample sizes (and therefore the number of animals), estimation of treatment effects with greater precision for a given N, a better understanding of relationships between variables (such as dose-response relationships), and improved decision-making (such as earlier termination of less-promising directions of enquiry). A further advantage of this structured approach, especially for exploratory animal-based research, is the increased chance of 'discovering the unexpected', or serendipitous results not possible with OFAT or trial-and-error (Box et al. 2005).

3.11 Think 'Right-Sizing', Not 'Significance'
Right-sizing experiments, or sample size optimisation, refers to the process of determining the most efficient and optimal sample size for a study to achieve its objectives with the least amount of resources (Box 3.10). This involves finding the balance between adequate effect size and study feasibility. Evidence of right-sizing is provided by a clear plan for sample size justification and reporting of the number of all animals used in the study. In contrast, 'chasing statistical significance' refers to the experimental goal of obtaining a statistically significant (P < 0.05) result (Marín-Franch 2018). Chasing significance is a particularly insidious and harmful practice contributing to the waste of countless numbers of animals in so-called 'negative'

BOX 3.9
Think Sequentially
'Several small rather than one large.'
Each phase informs the conduct and structure of the next: pilot, screening, optimisation, confirmatory.
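To make the screening phase of Section 3.10 (and Box 3.9) concrete, here is a minimal base R sketch of a two-level factorial layout; the three factors and their levels are invented examples, not a design taken from the book:

    factors <- list(
      dose    = c("low", "high"),
      diet    = c("standard", "enriched"),
      housing = c("single", "group")
    )
    design <- expand.grid(factors)            # 2^3 = 8 treatment combinations
    set.seed(3)
    design$run_order <- sample(nrow(design))  # randomise the run order
    design[order(design$run_order), ]
    # Each combination is then assigned to replicate experimental units, and main
    # effects and interactions screened with an ANOVA-type analysis.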
The Experimental Design Assistant. https:// experiments with historical control data. Nature Neu-
eda.nc3rs.org.uk/ roscience 24: 470–477.
Jackson Laboratories: Choosing appropriate con- Box, G.E.P., Hunter, J.S., and Hunter, W.G. (2005). Sta-
trols for transgenic animals https://www.jax.org/ tistics for Experimenters, 2e. New York: Wiley.
jax-mice-and-services/customer-support/tech- Carbone, L. (2011). Pain in laboratory animals: the ethical
nical-support/breeding-and-husbandry-support/
and regulatory imperatives. PLoS ONE 6 (9): e21578.
https://doi.org/10.1371/journal.pone.0021578.
considerations-for-choosing-controls
Chavalarias, D., Wallach, J.D., Li, A.H., and Ioannidis, J.P.
(2016). Evolution of reporting P values in the biomedi-
References cal literature, 1990-2015. JAMA 315 (11): 1141–1148.
https://doi.org/10.1001/jama.2016.1952.
Aarts, E., Dolan, C.V., Verhage, M., and van der Sluis, Cole, J.T., Yarnell, A., Kean, W.S. et al. (2011). Craniot-
S. (2015). Multilevel analysis quantifies variation in omy: true sham for traumatic brain injury, or a sham of
the experimental effect while optimizing power a sham? Journal of Neurotrauma 28: 359–369.
and preventing false positives. BMC Neuroscience Conradi, U. and Joffe, A.R. (2017). Publication bias in ani-
16: 93. https://doi.org/10.1186/s12868-015- mal research presented at the 2008 Society of Critical Care
0228-5. Medicine Conference. BMC Research Notes 10 (1): 262.
Addelman, S. (1970). Variability of treatments and exper- https://doi.org/10.1186/s13104-017-2574-0.
imental units in the design and analysis of experi- Cook, A.J., Delong, E., Murray, D.M. et al. (2016). Statis-
ments. Journal of the American Statistical Association tical lessons learned for designing cluster randomized
65 (331): 1095–1109. pragmatic clinical trials from the NIH Health Care Sys-
Åhlgren, J., and Voikar, V. (2019). Experiments done tems Collaboratory Biostatistics and Design Core. Clin-
in Black-6 mice: what does it mean? Lab Animal ical Trials 13: 504–512. https://doi.org/10.1177/
(NY) 48 (6): 171–180. https://doi.org/10.1038/ 1740774516646578.
s41684-019-0288-8. Dugnani, E., Pasquale, V., Marra, P. et al. (2018). Four-
Altman, D.G. (1980). Statistics and ethics in medical class tumor staging for early diagnosis and monitoring
research: misuse of statistics is unethical. BMJ 281: of murine pancreatic cancer using magnetic resonance
1182–1184. and ultrasound. Carcinogenesis 39 (9): 1197–1206.
André, V., Gau, C., Scheideler, A. et al. (2018). Labora- https://doi.org/10.1093/carcin/bgy093.
tory mouse housing conditions can be improved using Eggel, M. and Würbel, H. (2021). Internal consistency
common environmental enrichment without compro- and compatibility of the 3Rs and 3Vs principles for
mising data. PLoS Biology 16 (4): e2005019. https:// project evaluation of animal research. Laboratory Ani-
doi.org/10.1371/journal.pbio.2005019. mals 55 (3): 233–243. https://doi.org/10.1177/
Bailoo, J.D., Murphy, E., Boada-Saña, M. et al. (2018). 0023677220968583.
Effects of cage enrichment on behavior, welfare and Festing, M.F.W. (2020). The ‘completely randomised’
outcome variability in female mice. Frontiers in Behav- and the ‘randomised block’ are the only experimental
ioral Neuroscience 26: https://doi.org/10.3389/ designs suitable for widespread use in pre-clinical
fnbeh.2018.00232. research. Scientific Reports 10: 17577. https://doi.
Bate, S.T. and Clark, R. (2014). The Design and Statistical Analysis of Animal Experiments. Cambridge: Cambridge University Press.
Bespalov, A., Wicke, K., and Castagné, V. (2019). Blinding and randomization. In: Good Research Practice in Non-Clinical Pharmacology and Biomedicine, Handbook of Experimental Pharmacology 257 (ed. A. Bespalov, M. Michel, and T. Steckler). Cham: Springer. https://doi.org/10.1007/164_2019_279.
Bolles, R.C. (1962). The difference between statistical hypotheses and scientific hypotheses. Psychological Reports 11 (3): 639–645. https://doi.org/10.2466/pr0.1962.11.3.639.
Bonapersona, V., Hoijtink, H., and RELACS Consortium (2021). Increasing the statistical power of animal experiments. https://doi.org/10.1038/s41598-020-74538-3.
Festing, M.F.W. and Altman, D.G. (2002). Guidelines for the design and statistical analysis of experiments using laboratory animals. ILAR Journal 43 (4): 244–258.
Garland, T. Jr. and Adolph, S.C. (1994). Why not to do two-species comparative studies: limitations on inferring adaptation. Physiological Zoology 67 (4): 797–828.
Garner, J.P., Gaskill, B.N., Weber, E.M. et al. (2017). Introducing therioepistemology: the study of how knowledge is gained from animal research. Lab Animal (NY) 46 (4): 103–113. https://doi.org/10.1038/laban.1223.
Gaskill, B.N. and Garner, J.P. (2020). Power to the people: power, negative results and sample size. Journal of the American Association for Laboratory Animal Science (JAALAS) 59 (1): 9–16. https://doi.org/10.30802/AALAS-JAALAS-19-000042.
Gerlach, B., Untucht, C., and Stefan, A. (2019). Electronic lab notebooks and experimental design assistants. In: Good Research Practice in Non-Clinical Pharmacology and Biomedicine, Handbook of Experimental Pharmacology 257 (ed. A. Bespalov, M. Michel, and T. Steckler). Cham: Springer. https://doi.org/10.1007/164_2019_287.
Gouveia, K. and Hurst, J.L. (2019). Improving the practicality of using non-aversive handling methods to reduce background stress and anxiety in laboratory mice. Scientific Reports 9: 20305. https://doi.org/10.1038/s41598-019-56860-7.
Greenland, S. and Morgenstern, H. (1990). Matching and efficiency in cohort studies. American Journal of Epidemiology 131: 151–159.
Greenland, S., Senn, S.J., Rothman, K.J. et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 31 (4): 337–350. https://doi.org/10.1007/s10654-016-0149-3.
Hatswell, A., Freemantle, N., Baio, G. et al. (2020). Summarising salient information on historical controls: a structured assessment of validity and comparability across studies. Clinical Trials 17 (6): 607–616. https://doi.org/10.1177/1740774520944855.
Hawkins, P., Morton, D.B., Burman, O. et al. (2010). A guide to defining and implementing protocols for the welfare assessment of laboratory animals. Eleventh report of the BVAAWF/FRAME/RSPCA/UFAW Joint Working Group on Refinement. www.rspca.org.uk/sciencegroup/researchanimals/implementing3rs/refinement (accessed 2021).
Head, M.L., Holman, L., Lanfear, R. et al. (2015). The extent and consequences of P-hacking in science. PLoS Biology 13 (3): e1002106.
Holmdahl, R. and Malissen, B. (2012). The need for littermate controls. European Journal of Immunology 42: 45–47. https://doi.org/10.1002/eji.201142048.
Hooijmans, C.R., Rovers, M.M., de Vries, R.B. et al. (2014). SYRCLE's risk of bias tool for animal studies. BMC Medical Research Methodology 14: 43. https://doi.org/10.1186/1471-2288-14-43.
Johnson, P.D. and Besselsen, D.G. (2002). Practical aspects of experimental design in animal research. ILAR Journal 43 (4): 202–206. https://doi.org/10.1093/ilar.43.3.202.
Karp, N.A. (2018). Reproducible preclinical research—is embracing variability the answer? PLoS Biology 16 (3): e2005413. https://doi.org/10.1371/journal.pbio.2005413.
Karp, N.A. and Fry, D. (2021). What is the optimum design for my animal experiment? BMJ Open Science 5: e100126.
Keenan, C., Elmore, S., Francke-Carroll, S. et al. (2009). Best practices for use of historical control data of proliferative rodent lesions. Toxicologic Pathology 37 (5): 679–693. https://doi.org/10.1177/0192623309336154.
Kernan, W.N., Viscoli, C.M., Makuch, R.W. et al. (1999). Stratified randomization for clinical trials. Journal of Clinical Epidemiology 52 (1): 19–26. https://doi.org/10.1016/s0895-4356(98)00138-3.
Kilkenny, C., Parsons, N., Kadyszewski, E. et al. (2009). Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS ONE 4: e7824. https://doi.org/10.1371/journal.pone.0007824.
Kramer, M. and Font, E. (2017). Reducing sample size in experiments with animals: historical controls and related strategies. Biological Reviews of the Cambridge Philosophical Society 92 (1): 431–445. https://doi.org/10.1111/brv.12237.
Landis, S., Amara, S., Asadullah, K. et al. (2012). A call for transparent reporting to optimize the predictive value of preclinical research. Nature 490: 187–191. https://doi.org/10.1038/nature11556.
Lazic, S. (2016). Experimental Design for Laboratory Biologists: Maximising Information and Improving Reproducibility. Cambridge: Cambridge University Press.
Lilley, E., Stanford, S.C., Kendall, D.E. et al. (2020). ARRIVE 2.0 and the British Journal of Pharmacology: updated guidance for 2020. British Journal of Pharmacology 177 (16): 3611–3616. https://doi.org/10.1111/bph.15178.
Löscher, W., Ferland, R.J., and Ferraro, T.N. (2017). The relevance of inter- and intra-strain differences in mice and rats and their implications for models of seizures and epilepsy. Epilepsy and Behaviour 73: 214–235. https://doi.org/10.1016/j.yebeh.2017.05.040.
MacCallum, C.J. (2010). Reporting animal studies: good science and a duty of care. PLoS Biology 8: e1000413.
Macleod, M.R., Lawson McLean, A., Kyriakopoulou, A. et al. (2015). Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biology 13 (11): e1002301.
Mahajan, V.S., Demissie, E., Mattoo, H. et al. (2016). Striking immune phenotypes in gene-targeted mice are driven by a copy-number variant originating from a commercially available C57BL/6 strain. Cell Reports 15 (9): 1901–1909.
Marín-Franch, I. (2018). Publication bias and the chase for statistical significance. Journal of Optometry 11 (2): 67–68. https://doi.org/10.1016/j.optom.2018.03.001.
Martin, B., Ji, S., Maudsley, S., and Mattson, M.P. (2010). 'Control' laboratory rodents are metabolically morbid: why it matters. Proceedings of the National Academy of Sciences USA 107 (14): 6127–6133. https://doi.org/10.1073/pnas.0912955107.
Montgomery, D.C. (2017). Design and Analysis of Experiments, 8e. New York: Wiley.
Moser, P. (2019). Out of control? Managing baseline variability in experimental studies with control groups. In: Good Research Practice in Non-Clinical Pharmacology and Biomedicine, Handbook of Experimental Pharmacology 257 (ed. A. Bespalov, M. Michel, and T. Steckler). Cham: Springer. https://doi.org/10.1007/164_2019_280.
Muhlhausler, B.S., Bloomfield, F.H., and Gillman, M.W. (2013). Whole animal experiments should be more like human randomized controlled trials. PLoS Biology 11 (2): e1001481. https://doi.org/10.1371/journal.pbio.1001481.
Ormandy, E.H., Weary, D.M., Cvek, K. et al. (2019). Animal research, accountability, openness and public engagement: report from an international expert forum. Animals 9 (9): 622. https://doi.org/10.3390/ani9090622.
Pallmann, P., Bedding, A.W., Choodari-Oskooei, B. et al. (2018). Adaptive designs in clinical trials: why use them, and how to run and report them. BMC Medicine 16: 29. https://doi.org/10.1186/s12916-018-1017-7.
Percie du Sert, N., Alfieri, A., Allan, S.M. et al. (2017a). The IMPROVE guidelines (Ischaemia models: procedural refinements of in vivo experiments). Journal of Cerebral Blood Flow and Metabolism 37 (11): 3488–3517. https://doi.org/10.1177/0271678X17709185.
Percie du Sert, N., Bamsey, I., Bate, S.T. et al. (2017b). The experimental design assistant. PLoS Biology 15 (9): e2003779. https://doi.org/10.1371/journal.pbio.2003779 [web-based Experimental Design Assistant: https://eda.nc3rs.org.uk/].
Percie du Sert, N., Hurst, V., Ahluwalia, A. et al. (2020). The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biology 18 (7): e3000410. https://doi.org/10.1371/journal.pbio.3000410.
Preece, D.A. (1982). The design and analysis of experiments: what has gone wrong? Utilitas Mathematica 21A: 201–244.
Preece, D.A. (1984). Biometry in the Third World: science not ritual. Biometrics 40: 519–523.
Reynolds, P.S. (2021). Statistics, statistical thinking, and the IACUC. Lab Animal (NY) 50 (10): 266–268. https://doi.org/10.1038/s41684-021-00832-w.
Reynolds, P.S. (2022). Between two stools: preclinical research, reproducibility, and statistical design of experiments. BMC Research Notes 15 (1): 73. https://doi.org/10.1186/s13104-022-05965-w.
Reynolds, P.S. and Garvan, C.W. (2020). Gap analysis of swine-based hemostasis research: houses of brick or mansions of straw? Military Medicine 185 (S1): 88–95. https://doi.org/10.1093/milmed/usz249.
Reynolds, P.S., McCarter, J., Sweeney, C. et al. (2019). Informing efficient pilot development of animal trauma models through quality improvement strategies. Laboratory Animals 53 (4): 394–403. https://doi.org/10.1177/0023677218802999.
Richter, S.H., Garner, J.P., and Würbel, H. (2009). Environmental standardization: cure or cause of poor reproducibility in animal experiments? Nature Methods 6 (4): 257–261.
Richter, S.H., Garner, J.P., Auer, C. et al. (2010). Systematic variation improves reproducibility of animal experiments. Nature Methods 7 (3): 167–168. https://doi.org/10.1038/nmeth0310-167.
Ritskes-Hoitinga, M. and Wever, K. (2018). Improving the conduct, reporting, and appraisal of animal research. BMJ 360: j4935. https://doi.org/10.1136/bmj.j4935.
Russell, W.M.S. and Burch, R.L. (1959). The Principles of Humane Experimental Technique. London: Methuen and Co. Limited.
Sena, E.S., van der Worp, H.B., Bath, P.M. et al. (2010). Publication bias in reports of animal stroke studies leads to major overstatement of efficacy. PLoS Biology 8 (3): e1000343. https://doi.org/10.1371/journal.pbio.1000343.
Simon, M.M., Greenaway, S., White, J.K. et al. (2013). A comparative phenotypic and genomic analysis of C57BL/6J and C57BL/6N mouse strains. Genome Biology 14 (7): R82. https://doi.org/10.1186/gb-2013-14-7-r82.
Smith, A.J., Clutton, R.E., Lilley, E. et al. (2018). PREPARE: guidelines for planning animal research and testing. Laboratory Animals 52: 135–141. https://doi.org/10.1177/0023677217724823.
Smith, G.D. and Ebrahim, S. (2002). Data dredging, bias, or confounding. BMJ 325: 1437–1438.
Souren, N.Y., Fusenig, N.E., Heck, S. et al. (2022). Cell line authentication: a necessity for reproducible biomedical research. EMBO Journal 27: e111307. https://doi.org/10.15252/embj.2022111307.
Szucs, D. (2016). A tutorial on hunting statistical significance by chasing N. Frontiers in Psychology 7: 1443. https://doi.org/10.3389/fpsyg.2016.01444.
ter Riet, G., Korevaar, D.A., Leenaars, M. et al. (2012). Publication bias in laboratory animal research: a survey on magnitude, drivers, consequences and potential solutions. PLoS ONE 7 (9): e43403. https://doi.org/10.1371/journal.pone.0043403.
Tremoleda, J.L., Watts, S.A., Reynolds, P.S. et al. (2017). Modeling acute traumatic hemorrhagic shock injury: challenges and guidelines for preclinical studies. Shock 48 (6): 610–623. https://doi.org/10.1097/SHK.0000000000000901.
Uchio-Yamada, K., Kasai, F., Ozawa, M., and Kohara, A. (2017). Incorrect strain information for mouse cell lines: sequential influence of misidentification on sublines. In Vitro Cellular and Developmental Biology 53 (3): 225–230. https://doi.org/10.1007/s11626-016-0104-3.
Ujita, A., Seekford, Z., Kott, M. et al. (2021). Habituation protocols improve behavioral and physiological responses of beef cattle exposed to students in an animal handling class. Animals 11 (8): 2159. https://doi.org/10.3390/ani11082159.
Vetter, T.R. and Mascha, E.J. (2017). Defining the primary outcomes and justifying secondary outcomes of a study: usually, the fewer, the better. Anesthesia and Analgesia 125 (2): 678–681. https://doi.org/10.1213/ANE.0000000000002223.
Voelkl, B., Vogt, L., Sena, E.S., and Würbel, H. (2018). Reproducibility of pre-clinical animal research improves with heterogeneity of study samples. PLoS Biology 16 (2): e2003693. https://doi.org/10.1371/journal.pbio.2003693.
Wasserstein, R.L. and Lazar, N.A. (2016). The ASA's statement on P-values: context, process, and purpose. The American Statistician 70: 129–133. https://doi.org/10.1080/00031305.2016.1154108.
Wickham, H. (2014). Tidy data. Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.
Wieschowski, S., Biernot, S., Deutsch, S. et al. (2019). Publication rates in animal research. Extent and characteristics of published and non-published animal studies followed up at two German university medical centres. PLoS ONE 14 (11): e0223758. https://doi.org/10.1371/journal.pone.0223758.
Workman, P., Aboagye, E.O., Balkwill, F. et al. (2010). Guidelines for the welfare and use of animals in cancer research. British Journal of Cancer 102 (11): 1555–1577. https://doi.org/10.1038/sj.bjc.6605642.
Zingarelli, B., Coopersmith, C.M., Drechsler, S. et al. (2019). Part I: minimum quality threshold in preclinical sepsis studies (MQTIPSS) for study design and humane modeling endpoints. Shock 51 (1): 10–22. https://doi.org/10.1097/SHK.0000000000001243.
II
Sample Size for Feasibility and Pilot Studies

Chapter 4: Why Pilot Studies?
Chapter 5: Operational Pilot Studies: ‘Can It Work?’
Chapter 6: Empirical and Translational Pilots
Chapter 7: Feasibility Calculations: Arithmetic
Chapter 8: Feasibility: Counting Subjects
4
Why Pilot Studies?
4.1 Introduction
Pilot studies are small preparatory studies meant to inform the design and conduct of the later main experiment (Box 4.1). In 1866, Augustus De Morgan wrote ‘The first experiment already illustrates a truth of the theory, well confirmed by practice, whatever can happen will happen if we make trials enough.’ Now enshrined as Murphy’s three laws, problems of implementation and timing are integral to experimentation and research. Pilot studies are a tool to anticipate, and to some extent circumvent, these problems (Box 4.2). The focus of all pilot studies is on feasibility, process, and description, rather than statistical tests of inference (Lancaster et al. 2004; Thabane et al. 2010; Moore et al. 2011; Shanyinde et al. 2011; Abbott 2014; Lee et al. 2014; Kistin and Silverstein 2015). The main reason for a pilot study is the early identification and correction of potential problems.

BOX 4.1
Pilot Trials

Pilot trials are SMALL.
What: Small, preparatory studies for informing the design and conduct of subsequent ‘definitive’ experiments.
Why: Essential tools to ensure:
▪ Methodological rigour
▪ Feasibility and trial viability
▪ Internal and external validity
▪ Problem identification and troubleshooting
▪ Minimal effect of Murphy’s law
Pilot trials should not be used to:
▪ Test hypotheses
▪ Assess efficacy
▪ Assess safety
Pilot trial designs and rules of thumb used for human clinical trials may be appropriate for ‘regulated’ veterinary clinical trials, when veterinary clinical data are to be submitted to regulatory authorities (such as the FDA) for approval (Davies et al. 2017). Guidelines are available for pilots intended to inform human clinical trials (Lancaster et al. 2004; Arnold et al. 2009; Thabane et al. 2010), and these provide invaluable advice for planning.

4.2.3 Pilot Study Results Should Be Reported

It is an ethical obligation to make the best use of research and research animals by reporting all aspects of the research process, including pilot studies (van Teijlingen et al. 2001; Thabane et al. 2010; Friedman 2013). Many preclinical studies are never written up or published if experiments are perceived to have ‘gone nowhere’ because results were ‘negative’, not ‘statistically significant’, or otherwise unfavourable (Kimmelman and Federico 2017). Carefully performed pilots provide useful information on the identification and description of problems, potential solutions, and best practices; they will not only benefit other researchers but also contribute to minimising waste of animals. Venues include dedicated journals and protocol description sites (e.g. Lancaster 2015; Macleod and Howells 2016) and are increasingly encouraged by mainstream journals (Dolgin 2013).

4.3 Pilot Studies: What They Are Not

Six categories of studies are often labelled incorrectly – and misleadingly – as ‘pilot studies’ (Box 4.4):

BOX 4.4
What a Pilot Study Is Not
1. A throw-away/unfunded study (or ‘student project’)
2. A study lacking methodological rigour
3. A study misaligned with stated aims and objectives
4. A study without purpose
5. An underpowered, too-small study
6. A completed study that cannot reject the null hypothesis

Throw-away studies lacking methodological rigour. These include trial-and-error studies to see what ‘might work’, studies conducted with little or no resources and funding, or studies conducted by personnel (usually students) without the requisite skills for meaningful research (Moore et al. 2011).

Misaligned studies, where the objectives of the pilot study consist of vague or undefined statements of ‘feasibility’, but the study goes on to present hypothesis tests of efficacy on numerous outcomes. Misaligned studies usually confuse or conflate the purpose of pilots with that of exploratory studies (see next).

Unsubstantiated studies that make vague statements of intent that any resulting data will be used to inform future (funded) studies, but give no plans for subsequent action or how the data will be used.

Futility loop studies that consist of an endless cycle of small experiments conducted to ‘find something’, but lack specific decision criteria or action plans that would enable assessment of progress (Moore et al. 2011).

Sometimes completed studies are retrospectively relabelled as ‘pilots’ because they are either underpowered or null. Underpowered studies did not have a large enough sample size in the first place to test a meaningful hypothesis. Underpowered studies can occur because of insufficient time, resources, funding, or merely because of poor planning (Arain et al. 2010). In contrast, null studies may not produce statistically significant results regardless of sample size or power (Abbott 2014). A completed study that is too small or cannot reject the null hypothesis is not ipso facto a pilot study.

4.3.1 Pilot Studies Differ from Exploratory and Screening Studies

A pilot is a small study acting as a preliminary to the main definitive study, and is used to demonstrate feasibility, not to provide results of hypothesis testing.
▪ Clear and specific documentation of methods and procedures (What will be done, and how will it be done?).
▪ Clearly defined, specific, and measurable ‘success’ criteria (How will you know it ‘works’?).
▪ An evaluation component (Did it ‘work’?).
▪ An action plan with decision criteria (If the pilot is successful, what are the next steps?).

The type of pilot dictates the specific metrics used to determine whether or not the pilot works and how the pilot will inform the next step of the experimentation process. Measurable decision criteria (‘go/no-go’) are used to decide if the experiment can proceed without modification or with only a few minor changes (‘green light’), if major changes to the protocol need to be made and what those changes would entail (‘yellow light’), or if the experiment should be terminated altogether (‘red light’). For veterinary clinical trials, recruitment issues will be especially important to identify at an early stage. Recruitment metrics to be assessed will include projected subject recruitment and accrual rates, client consent rates, treatment compliance, and enrolment strategies (Hampson et al. 2018).

4.6 How Large a Pilot?

There are no hard and fast sample size guidelines for animal-based pilot studies. However, the chief guiding principle for animal-based pilot studies is that they are small: the goal is to maximise the information obtained from the fewest possible number of animals. Pilot studies should also be designed to be nimble. They should have sufficient flexibility to change scope, procedures, and direction as necessary to establish normal best-practice standard operating procedures. Formal power calculations are not appropriate because, by design and intent, pilots are small-scale assessments of feasibility and are therefore too small to obtain reliable estimates of effect size or variance, or to find ‘statistical significance’ (Kraemer et al. 2006; Lee et al. 2014). Multiple small pilots may be preferable to one larger study, especially if different types of information are required.

Animal numbers for a pilot can be based on one or more of the following strategies: zero-animal, pragmatic, and precision-directed (Box 4.7).

4.6.1 Zero-Animal Sample Size

The principles of the 3Rs must always be considered before planning any experiment. Pilot studies may not require any animals at all. For personnel training, skill acquisition, and development of consistent levels of technical standardisation, simulators, carcasses, and/or ex vivo models should always be considered before live animals are obtained for purpose. For example, medical residents trained in the repair of penetrating cardiac injury on either an ex vivo model or animals obtained for purpose had similar skill and management competencies at assessment (Izawa et al. 2016). Equipment testing for reliability (calibration, drift, measurement errors) should always be conducted on non-animal standard and test materials and confirmed before using animals.

Even ‘proof of concept’ studies may not necessarily require data from live animals. Systematic literature reviews and meta-analyses (Rice et al. 2008; Hooijmans et al. 2018; Soliman et al. 2020; Vollert et al. 2020) promote animal-based study validity and improve translation potential by synthesising all available external data; they therefore provide the best possible estimates of the likely true effect size and variation. The Collaborative Approach to Meta-Analysis and Review of Animal Experimental Studies (CAMARADES) research group specialises in systematic reviews of preclinical neurological disease models such as stroke, neuropathic pain, and Alzheimer’s disease, and provides guidance on study design and current best practice (https://www.ed.ac.uk/clinical-brain-sciences/research/camarades/about-camarades). Targeted surveys of academic and/or industry research organisations can be an alternative to systematic literature reviews. Survey results can be used to determine the range of likely outcomes or responses anticipated for candidate animal models, identify current best practice, and explore the range of study designs and methods that best minimise animal use (e.g. Chapman et al. 2012).
4.6.2 Pragmatic Sample Size

Pragmatic sample size is determined by the availability of time, budget, resources, and technical personnel. Pragmatic sample size justification may also apply to retrospective chart reviews, where the total sample size may be restricted by the relative rarity of the condition of interest, or the number of records with complete or relevant information.

Pragmatic sample size approximation does not require power calculations. Arithmetic approximations use basic ‘back of the envelope’ calculations to confirm and cross-validate numbers for operational feasibility (Chapter 7, Arithmetic). Calculations based on sequential probabilities are useful when identification of subjects requires screening of a large number of candidates, or if animals enter the trial sequentially rather than being randomised simultaneously (Chapter 8, Probability).

The amount of useable data obtained from each subject is inversely related to the number of subjects (Morse 2000). When resources are limited, animal-based pilot studies can be designed to be as information-dense as possible by maximising the amount of useable data obtained from the fewest number of subjects. Designing a pilot study based on information power (Malterud et al. 2016) requires descriptions of the intended scope of the pilot, the data collection methods, and the study design. For example, before-after and longitudinal studies produce more data than do cross-sectional studies with only a single observation per subject. High-dimensionality technologies such as mass cytometry, immunohistochemistry, single-cell RNA sequencing, microfluidics, and bioinformatics can provide massive datasets for relatively few animals.
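As a toy illustration of these ‘back of the envelope’ checks, the sketch below computes how many animals must be started, and how many candidates screened, to reach a target number of completers. The attrition and eligibility rates are invented placeholders, not recommended values:

```python
# Back-of-the-envelope feasibility arithmetic. All rates are assumed
# placeholder values for illustration only.
import math

target_completers = 8     # animals needed with complete data
attrition = 0.20          # assumed fraction lost to surgery or exclusion
eligible = 0.60           # assumed fraction of screened candidates eligible

to_start = math.ceil(target_completers / (1 - attrition))
to_screen = math.ceil(to_start / eligible)
print(f"Start {to_start} animals; screen roughly {to_screen} candidates.")
```

Cross-checking such arithmetic against cage space, budget, and calendar time is usually enough to flag an infeasible plan before any formal design work.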
4.6.3 Precision-Based Sample Size

The emphasis of pilot trials should be on descriptions based on point estimates and confidence intervals (or some other measure of precision). Because ‘significance’ as such is relatively meaningless, Lee et al. (2014) recommend that precision intervals be constructed to provide an estimated range of possible treatment effects. Simulations and sensitivity analyses are extremely valuable for estimating uncertainties associated with various projected sample sizes and for exploring best- and worst-case scenarios for the statistical models, covariates, and analysis methods proposed for the definitive study (Bell et al. 2018; Gelman et al. 2020). Chapter 6 presents methods for evaluating pilot data based on preliminary estimates of precision and inspection of data plots.

Sample size calculations for external and internal pilot trials have been developed for application to human clinical trials (Julious 2005; Whitehead et al. 2014; Machin et al. 2018). External pilots are conducted prior to the definitive trial, and although the data are used to inform subsequent trial design, they are not included in later data analyses. Internal pilots are conducted to permit reassessment of sample size within the ongoing definitive trial. Sample sizes based on considerations of human clinical trials are usually prohibitively large for most lab-based animal studies and are therefore not recommended. They may be appropriate for large-scale veterinary clinical trials.
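A simple way to explore precision against projected sample size, in the spirit of the simulations described above, is to tabulate the expected confidence interval half-width for a range of candidate n. The sketch below assumes a normally distributed outcome and an illustrative standard deviation of 2.0 units; both are assumptions for the example, not defaults:

```python
# Sketch: 95% confidence interval half-width for a mean versus pilot size n.
# The standard deviation (sd) is an assumed placeholder value.
from scipy import stats

sd = 2.0  # assumed outcome standard deviation (illustrative)
for n in (3, 5, 8, 12):
    t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% t quantile
    half_width = t_crit * sd / n ** 0.5     # half-width of the CI for the mean
    print(f"n = {n:2d}: mean +/- {half_width:.2f}")
```

Replacing the analytical formula with simulation from the proposed statistical model extends the same idea to sensitivity analyses over models and covariates.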
References

Abbott, J.H. (2014). The distinction between randomized clinical trials (RCTs) and preliminary feasibility and pilot studies: what they are and are not. Journal of Orthopaedic and Sports Physical Therapy 44 (8): 555–558.
Albers, C. and Lakens, D. (2018). When power analyses based on pilot data are biased: inaccurate effect size estimators and follow-up bias. Journal of Experimental Social Psychology 74: 187–195. https://doi.org/10.1016/j.jesp.2017.09.004.
Altman, D.G. (1980). Statistics and ethics in medical research. Collecting and screening data. British Medical Journal 281 (6252): 1399–1401. https://doi.org/10.1136/bmj.281.6252.1399.
Arain, M., Campbell, M.J., Cooper, C.L., and Lancaster, G.A. (2010). What is a pilot or feasibility study? A review of current practice and editorial policy. BMC Medical Research Methodology 10: 67. https://doi.org/10.1186/1471-2288-10-67.
Arnold, D.M., Burns, K.E., Adhikari, N.K. et al. (2009). The design and interpretation of pilot trials in clinical research in critical care. Critical Care Medicine 37 (Suppl 1): 69–74. https://doi.org/10.1097/CCM.0b013e3181920e33.
Bell, M.L., Whitehead, A.L., and Julious, S.A. (2018). Guidance for using pilot studies to inform the design of intervention trials with continuous outcomes. Clinical Epidemiology 10: 153–157. https://doi.org/10.2147/CLEP.S146397.
Bowen, D.J., Kreuter, M., Spring, B. et al. (2009). How we design feasibility studies. American Journal of Preventive Medicine 36 (5): 452–457. https://doi.org/10.1016/j.amepre.2009.02.002.
Box, G.E.P. and Draper, N.R. (1987). Empirical Model-Building and Response Surfaces. New York: Wiley.
Box, G.E.P., Hunter, W.G., and Hunter, J.S. (2005). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 2e. New York: Wiley.
Chapman, K.L., Andrews, L., Bajramovic, J.J. et al. (2012). The design of chronic toxicology studies of monoclonal antibodies: implications for the reduction in use of non-human primates. Regulatory Toxicology and Pharmacology 62 (2): 347–354. https://doi.org/10.1016/j.yrtph.2011.10.016.
Clayton, M. and Nordheim, E.V. (1991). [Avoiding statistical pitfalls]: comment. Statistical Science 6 (3): 255–257.
Davies, R., London, C., Lascelles, B., and Conzemius, M. (2017). Quality assurance and best research practices for non-regulated veterinary clinical studies. BMC Veterinary Research 13: 242. https://doi.org/10.1186/s12917-017-1153-x.
De Wever, B., Fuchs, H.W., Gaca, M. et al. (2012). Implementation challenges for designing integrated in vitro testing strategies (ITS) aiming at reducing and replacing animal experimentation. Toxicology In Vitro 26 (3): 526–534. https://doi.org/10.1016/j.tiv.2012.01.009.
Dolgin, E. (2013). Publication checklist proposed to boost rigor of pilot trials. Nature Medicine 19: 795–796. https://doi.org/10.1038/nm0713-795.
Eisenhart, C. (1952). The reliability of measured values: fundamental concepts. National Bureau of Standards Report, Publication Number 1600. Commerce Department, National Institute of Standards and Technology (NIST). https://www.govinfo.gov/content/pkg/GOVPUB-C13-78b559d7a3b0f60880eba04f045e72f8/pdf/GOVPUB-C13-78b559d7a3b0f60880eba04f045e72f8.pdf (accessed 2022).
Eldridge, S.M., Lancaster, G.A., Campbell, M.J. et al. (2016). Defining feasibility and pilot studies in preparation for randomised controlled trials: development of a conceptual framework. PLoS ONE 11 (3): e0150205. https://doi.org/10.1371/journal.pone.0150205.
Friedman, L. (2013). Commentary: why we should report results from clinical trial pilot studies. Trials 14: 14.
Gelman, A., Hill, J., and Vehtari, A. (2020). Regression and Other Stories. Cambridge: Cambridge University Press.
Glasziou, P. and Chalmers, I. (2018). Research waste is still a scandal—an essay by Paul Glasziou and Iain Chalmers. BMJ 363: k4645. https://doi.org/10.1136/bmj.k4645.
Hahn, G.J., Hill, W.J., Hoerl, R.W., and Zinkgraf, S.A. (1999). The impact of Six Sigma improvement—a glimpse into the future of statistics. The American Statistician 53 (3): 208–215.
Hampson, L.V., Williamson, P.R., Wilby, M.J., and Jaki, T. (2018). A framework for prospectively defining progression rules for internal pilot studies monitoring recruitment. Statistical Methods in Medical Research 27 (12): 3612–3627. https://doi.org/10.1177/0962280217708906.
Hooijmans, C.R., de Vries, R.B.M., Ritskes-Hoitinga, M. et al. (2018). Facilitating healthcare decisions by assessing the certainty in the evidence from preclinical animal studies. PLoS ONE 13: e0187271.
Izawa, Y., Hishikawa, S., Muronoi, T. et al. (2016). Ex-vivo and live animal models are equally effective training for the management of a penetrating cardiac injury. World Journal of Emergency Surgery 11: 45. https://doi.org/10.1186/s13017-016-0104-3.
Johnson, J., Brown, G., and Hickman, D.L. (2013). Is ‘duplicative’ really duplication? Lab Animal 42 (3): 81–82.
Julious, S.A. (2005). Sample size of 12 per group rule of thumb for a pilot study. Pharmaceutical Statistics 4: 287–291.
Kimmelman, J. and Federico, C. (2017). Consider drug efficacy before first-in-human trials. Nature 542: 25–27. https://doi.org/10.1038/542025a.
Kimmelman, J., Mogil, J.S., and Dirnagl, U. (2014). Distinguishing between exploratory and confirmatory preclinical research will improve translation. PLoS Biology 12 (5): e1001863. https://doi.org/10.1371/journal.pbio.1001863.
Kistin, C. and Silverstein, M. (2015). Pilot studies: a critical but potentially misused component of interventional research. JAMA 314 (15): 1561–1562.
Kraemer, H.C., Mintz, J., Noda, A. et al. (2006). Caution regarding the use of pilot studies to guide power calculations for study proposals. Archives of General Psychiatry 63 (5): 484–489.
Lancaster, G.A. (2015). Pilot and feasibility studies come of age! Pilot and Feasibility Studies 1: 1. https://doi.org/10.1186/2055-5784-1-1.
Lancaster, G.A., Dodd, S., and Williamson, P.R. (2004). Design and analysis of pilot studies: recommendations for good practice. Journal of Evaluation in Clinical Practice 10 (2): 307–312. https://doi.org/10.1111/j.2002.384.doc.x.
Lazic, S.E. (2016). Experimental Design for Laboratory Biologists: Maximising Information, Improving Reproducibility. Cambridge: Cambridge University Press.
Lazic, S.E. (2018). Four simple ways to increase power without increasing the sample size. Laboratory Animals 52 (6): 621–629.
Lee, E.C., Whitehead, A.L., Jacques, R.M., and Julious, S.A. (2014). The statistical interpretation of pilot trials: should significance thresholds be reconsidered? BMC Medical Research Methodology 14: 41.
Leon, A.C., Davis, L.L., and Kraemer, H.C. (2011). The role and interpretation of pilot studies in clinical research. Journal of Psychiatric Research 45 (5): 626–629. https://doi.org/10.1016/j.jpsychires.2010.10.008.
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. (2018). Sample Sizes for Clinical, Laboratory and Epidemiology Studies, 4e. Wiley-Blackwell.
Macleod, M. and Howells, D. (2016). Protocols for laboratory research. Evidence-Based Preclinical Medicine. https://doi.org/10.1002/ebm2.21.
Malterud, K., Siersma, V.D., and Guassora, A.D. (2016). Sample size in qualitative interview studies: guided by information power. Qualitative Health Research 26 (13): 1753–1760. https://doi.org/10.1177/1049732315617444.
Mogil, J. and Macleod, M. (2017). No publication without confirmation. Nature 542: 409–411. https://doi.org/10.1038/542409a.
Montgomery, D.C. (2012). Design and Analysis of Experiments, 8e. New York: Wiley.
Montgomery, D.C. and Jennings, C.L. (2006). An overview of industrial screening experiments. In: Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics (ed. A. Dean and S. Lewis), 1–20. New York: Springer.
Moore, C.G., Carter, R.E., Nietert, P.J., and Stewart, P.W. (2011). Recommendations for planning pilot studies in clinical and translational research. Clinical and Translational Science 4 (5): 332–337. https://doi.org/10.1111/j.1752-8062.2011.00347.x.
Morgan, D. (2011). Avoiding duplication of research involving animals. Occasional Paper No. 7. New Zealand National Animal Ethics Advisory Committee.
Morgan, D.G. and Stewart, N.J. (2002). Theory building through mixed-method evaluation of a dementia special care unit. Research in Nursing and Health 25 (6): 479–488. https://doi.org/10.1002/nur.10059.
Morse, J.M. (2000). Determining sample size. Qualitative Health Research 10 (1): 3–5.
Percie du Sert, N., Hurst, V., Ahluwalia, A. et al. (2020). The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biology 18 (7): e3000410. https://doi.org/10.1371/journal.pbio.3000410.
Reynolds, P.S. (2022a). Between two stools: preclinical research, reproducibility, and statistical design of experiments. BMC Research Notes 15: 73. https://doi.org/10.1186/s13104-022-05965-w.
Reynolds, P.S. (2022b). Introducing non-aversive mouse handling with ‘squnnels’ in a mouse breeding facility. Animal Technology and Welfare Journal 21 (1): 42–45.
Reynolds, P.S., McCarter, J., Sweeney, C. et al. (2019). Informing efficient pilot development of animal trauma models through quality improvement strategies. Lab Animal 53 (4): 394–404. https://doi.org/10.1177/0023677218802999.
Rice, A.S.C., Cimino-Brown, D., Eisenach, J.C. et al. (2008). Animal models and the prediction of efficacy in clinical trials of analgesic drugs: a critical appraisal and call for uniform reporting standards. Pain 139: 243–247. https://doi.org/10.1016/j.pain.2008.08.017.
Robinson, G.K. (2000). Practical Strategies for Experimentation. Chichester: Wiley.
Sackett, D.L. (1979). Bias in analytic research. Journal of Chronic Diseases 32: 51–63. https://doi.org/10.1016/0021-9681(79)90012-2.
Shanyinde, M., Pickering, R.M., and Weatherall, M. (2011). Questions asked and answered in pilot and feasibility randomized controlled trials. BMC Medical Research Methodology 11: 117.
Smith, A.J., Clutton, R.E., Lilley, E. et al. (2018). PREPARE: guidelines for planning animal research and testing. Laboratory Animals 52 (2): 135–141. https://doi.org/10.1177/0023677217724823.
Soliman, N., Rice, A., and Vollert, J. (2020). A practical guide to preclinical systematic review and meta-analysis. Pain 161 (9): 1949–1954. https://doi.org/10.1097/j.pain.0000000000001974.
Tan, Y.J., Crowley, R.J., and Ioannidis, J.P.A. (2019). An empirical assessment of research practices across 163 clinical trials of tumor-bearing companion dogs. Scientific Reports 9: 11877. https://doi.org/10.1038/s41598-019-48425-5.
Thabane, L., Ma, J., Chu, R. et al. (2010). A tutorial on pilot studies: the what, why and how. BMC Medical Research Methodology 10: 1. http://www.biomedcentral.com/1471-2288/10/1.
van Teijlingen, E.R., Rennie, A.M., Hundley, V., and Graham, W. (2001). The importance of conducting and reporting pilot studies: the example of the Scottish Births Survey. Journal of Advanced Nursing 34: 289–295. https://doi.org/10.1046/j.1365-2648.2001.01757.x.
Vollert, J., Schenker, E., Macleod, M. et al. (2020). Systematic review of guidelines for internal validity in the design, conduct and analysis of preclinical biomedical experiments involving laboratory animals. BMJ Open Science 4: e100046.
Wareham, K.J., Hyde, R.M., Grindlay, D. et al. (2017). Sponsorship bias and quality of randomised controlled trials in veterinary medicine. BMC Veterinary Research 13 (1): 234. https://doi.org/10.1186/s12917-017-1146-9.
Whitehead, A.L., Sully, B.G.O., and Campbell, M.J. (2014). Pilot and feasibility studies: is there a difference from each other and from a randomised controlled trial? Contemporary Clinical Trials 38: 130–133.
5
Operational Pilot Studies: ‘Can It Work?’
Figure 5.1: Simplified workflow map for a swine trauma model (sedation → intubation → check of correct placement → surgical site preparation → surgical instrumentation → physiological assessment → stability check → proceed with experiment). There are two process decision points requiring clinical assessment of surgical quality and where corrective actions are performed if necessary. Source: Adapted from Reynolds et al. (2019).
Table 5.1: Sample checklist for surgical procedures and pre-intervention animal physiological status. All checks were to be documented in the surgical record for each animal. (Table columns: experimental phase; task.)
The target was 0/14 errors per experiment. The graph shows that the addition of a dedicated data scribe after the first six experiments was associated with increased protocol compliance.

Figure 5.2: Process chart of critical errors made during twenty sequential experiments (number of critical errors plotted against experiment order, with a zero-error target line). Errors decreased after a dedicated data scribe was added to the personnel roster. Source: Adapted from Reynolds et al. (2019).

Run charts are also useful for assessing variation in animal behaviour. Animal behaviour studies often require performance stability after a period of habituation time. Habituation time is adequate if task performance times or learning trajectories stabilise by the end of a certain number of training sessions. Run charts are also useful for identifying floor-ceiling artefacts in behavioural tasks. Floor-ceiling artefacts occur if the task is either too difficult, such that most or all animals fail, or too easy, and all animals succeed.
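A run chart of this kind takes only a few lines to draw. In the sketch below the error counts, target line, and intervention point are invented placeholders, not the Reynolds et al. (2019) data:

```python
# Minimal run-chart sketch with a target line and an intervention marker.
# The data are invented placeholder values for illustration.
import matplotlib.pyplot as plt

errors = [8, 7, 9, 6, 7, 8, 3, 2, 2, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
plt.plot(range(1, len(errors) + 1), errors, marker="o")
plt.axhline(0, linestyle="--", label="Target: zero errors")
plt.axvline(6.5, linestyle=":", label="Dedicated scribe added")
plt.xlabel("Experiment order")
plt.ylabel("Number of critical errors")
plt.legend()
plt.show()
```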
Figure 5.3: Sequential plot of platform location times for four mice tested over five days in a Morris Water Maze. Search times were truncated at 120 seconds. Between-mouse variation in search times increased with day. Mouse 1 showed no improvement, whereas the remaining mice showed modest to substantial reductions. Source: Data from Higaki et al. (2018).

▪ Specified proportion of subjects achieve specified pharmacological/physiological criteria at a given dose or intervention level.
▪ Specified proportion of subjects achieve steady-state behaviour measures by the completion of a pre-specified number of sessions.
▪ All measurements are obtained at each time point for each subject (zero missing data).

5.3 Performance Metrics

Performance metrics are used to assess if the operational pilot has achieved pre-specified objectives (the operational pilot ‘worked’). Performance metrics are outcome-neutral; that is, they are not response variables and do not test the research hypothesis. However, they must be relevant to major operational objectives. Metrics must be simple, quantifiable, and transparent to everyone involved with the project.

5.3.1 Measurement Times

Times required for a given task may be categorised as ‘set’ times, ‘target’ times, or ‘untargeted’ times. Set times are protocol-driven, used for standardising operational procedures.
▪ Fixed sampling time points are specified for obtaining blood and tissue.
▪ A pre-specified duration is defined for physiological stabilisation of an animal following surgical instrumentation and before experimental intervention.

Target times are mandated by best-practice requirements.

Example: Best-Practice Target Times
The definitive airway (intubation and confirmation of correct placement) for large-animal surgery models must be secured in <60 seconds to prevent injury and physiological compromise.
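Compliance with a target time like this reduces to a simple proportion, as in the sketch below (the recorded times are invented placeholder values):

```python
# Sketch: proportion of procedures meeting the <60-second airway target.
# Times are invented placeholder values for illustration.
times_s = [42, 55, 71, 38, 59, 64, 47]   # seconds to secure definitive airway
met = sum(t < 60 for t in times_s)
print(f"Target met in {met}/{len(times_s)} procedures "
      f"({100 * met / len(times_s):.0f}%)")
```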
ard (Box 5.3).
instrumentation of an experimental animal is
completion of all procedures without physiologi-
cal compromise. Total preparation time cannot
be too rigidly fixed because different animals will
5.4 Pilots for Retrospective
respond differently to anaesthesia and surgery, Chart Review Studies
and require different corrective measures and
In retrospective chart review studies, subject data
time to achieve stability. Personnel should not
have already been collected and are contained in
be compelled to sacrifice the quality of the prep-
patient medical records or registry databases. Med-
aration by rushing or taking shortcuts to meet
ical record data are commonly used in veterinary
an arbitrary time limit. However, if unduly pro-
medicine to assess clinical outcomes, treatment
longed preparation times result in physiological
patterns, clinical care resources, and costs. Retro-
compromise, this is grounds for exclusion of that
spective chart reviews may also be conducted for
subject and its data (Reynolds et al. 2019).
quality assurance and improvement assessments
(Jones et al. 2019).
Chart review studies are highly susceptible to
bias. To prevent wasting time and effort in a futile
5.3.2 Subjective Measurements study, piloting a chart review study begins by asses-
Some outcome variables are subjective. Assessors sing a very small number of records (say 5–10 cases)
differ in skill levels, experience, and even interest. (Box 5.4). Preliminary scans of these records should
To ensure that all variables are collected and rated show if there are enough case samples with
54 A Guide to Sample Size for Animal-based Studies
strongly suggested for larger definitive studies (Altman 1991).

Revise and repeat. There should be frequent discussions with all personnel (including the study lead) to resolve ambiguities, discrepancies, questions, and problems. Areas of disagreement are resolved by consensus.

Finalise. Once standardised, the definitive data collection spreadsheets and SOPs can be constructed, and all data extractors and assessors can be instructed in their use. Planning of the definitive study will require identification of the study design (matched case, case–control, cohort), matching criteria, grouping criteria, and formal sample size justification. Formal procedures must be put in place for data screening, cleaning, storage, and security (Benchimol et al. 2015).

5.5 Sample Size Considerations

Operational pilots do not require statistical significance testing. The size of operational pilot studies is determined by pragmatic considerations and logistic constraints. There may be no need for animals at all. For planning highly complex large experiments, researchers should be prepared to invest up to 20% of all resources (time, personnel, budget, and equipment). If animals must be used, bounds on animal numbers are computed relative to available resources, using simple arithmetic (Chapter 7) or probability-based (Chapter 8) approximations. To assess stability and consistency of processes and procedures, charts of counts, averages, and ranges are constructed from at least two sequential groups of 2–10 measurements each (Gygi et al. 2012).

5.A Conducting Paired Data Assessments in Excel

1. Two raters X and Y score items independently and enter data in separate Excel spreadsheets. The sheet format must be the same for each assessor (variable names, order, number of rows and columns). The first row of each sheet is reserved for variable header names.
2. Copy the separate spreadsheets into the same workbook. Each sheet is given an assessor ID (X, Y).
3. Cell-by-cell comparison: Create a third (empty) sheet with the same variable header row. Copy the Excel formula =IF(X!A2<>Y!A2,"X: "&X!A2&CHAR(10)&"Y: "&Y!A2,"") into the first cell A2. Drag and drop so all row-column dimensions are covered in the third sheet. Any discrepancies between the two assessors will be flagged with the assessor ID and entry values.
4. Examine original entries indicated by flags for the cause of the difference. Differences may be the result of transcription and entry errors (e.g. typos, stray characters, missed information) or because of assessor differences in interpretation.
5. Correct obvious coding errors. Discrepancies occurring because of ambiguities in data recording, interpretation, or coding are resolved by discussion and consensus. Retraining of data collection personnel, and revision of the data dictionary and SOPs, may be required. (A scripted equivalent of this comparison is sketched after the chapter references.)

References

Altman, D.G. (1991). Practical Statistics for Medical Research. London: Chapman and Hall.
American Society for Quality (ASQ) (2023). https://asq.org/quality-resources.
Backhouse, A. and Ogunlayi, F. (2020). Quality improvement into practice. BMJ 368: m865. https://doi.org/10.1136/bmj.m865.
Benchimol, E.I., Smeeth, L., Guttmann, A. et al. (2015). The reporting of studies conducted using observational routinely-collected health data (RECORD) statement. PLoS Medicine 12 (10): e1001885.
Chakraborty, A. and Tan, K.C. (2012). Qualitative and quantitative analysis of Six Sigma in service organizations. In: Total Quality Management and Six Sigma (ed. T. Aized). https://doi.org/10.5772/46104; https://www.intechopen.com/chapters/38085.
Chambers, C.D., Feredoes, E., Muthukumaraswamy, S.D., and Etchells, P.J. (2014). Instead of ‘playing the game’ it is time to change the rules: registered reports at AIMS Neuroscience and beyond. AIMS Neuroscience. https://doi.org/10.3934/Neuroscience.2014.1.4.
Davies, R., London, C., Lascelles, B., and Conzemius, M. (2017). Quality assurance and best research practices for non-regulated veterinary clinical studies. BMC Veterinary Research 13: 242. https://doi.org/10.1186/s12917-017-1153-x.
Deming, W.E. (1987). On the statistician's contribution to quality. Invited paper, 46th session of the International Statistical Institute (ISI), Tokyo. https://deming.org/wp-content/uploads/2020/06/On-The-Statisticians-Contribution-to-Quality-1987.pdf.
von Elm, E., Altman, D.G., Egger, M. et al.; STROBE Initiative (2007). The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. Epidemiology 18: 800–804. https://doi.org/10.1097/EDE.0b013e3181577654.
Gawande, A. (2009). The Checklist Manifesto: How to Get Things Right. New York: Picador.
Gilbert, E.H., Lowenstein, S.R., Koziol-McLain, J., Barta, D.C., and Steiner, J. (1996). Chart reviews in emergency medicine research: where are the methods? Annals of Emergency Medicine 27 (3): 305–308. https://doi.org/10.1016/s0196-0644(96)70264-0.
Gygi, C., Williams, B., and DeCarlo, N. (2012). Six Sigma for Dummies, 4e. NJ: John Wiley & Sons.
Higaki, A., Mogi, M., Iwanami, J. et al. (2018). Recognition of early stage thigmotaxis in Morris water maze test with convolutional neural network. PLoS ONE 13 (5): e0197003. https://doi.org/10.1371/journal.pone.0197003.
Jones, B., Vaux, E., and Olsson-Brown, A. (2019). How to get started in quality improvement. BMJ 364: k5408.
Kaji, A.H., Schriger, D., and Green, S. (2014). Looking through the retrospectoscope: reducing bias in emergency medicine chart review studies. Annals of Emergency Medicine 64 (3): 292–298. https://doi.org/10.1016/j.annemergmed.2014.03.025.
NHS Institute for Innovation and Improvement (2008). Improvement Leaders' Guide: Process Mapping, Analysis and Redesign. Warwick: NHS.
Reynolds, P.S., McCarter, J., Sweeney, C. et al. (2019). Informing efficient pilot development of animal trauma models through quality improvement strategies. Lab Animal 53 (4): 394–404. https://doi.org/10.1177/0023677218802999.
Schulman, J. (2001). ‘Thinking upstream’ to evaluate and to improve the daily work of the Newborn Intensive Care Unit. Journal of Perinatology 21: 307–311. https://doi.org/10.1038/sj.jp.7200528.
Tague, N.R. (2005). The Quality Toolbox, 2e, 584 pp. ASQ Quality Press, American Society for Quality.
Trebble, T.M., Hansi, N., Hydes, T., Smith, M.A., and Baker, M. (2010). Process mapping the patient journey: an introduction. BMJ 341: c4078. https://doi.org/10.1136/bmj.c4078.
UNDP/World Bank/WHO Special Programme for Research and Training in Tropical Diseases & Scientific Working Group on Quality Practices in Basic Biomedical Research (2006). Handbook: Quality Practices in Basic Biomedical Research. World Health Organization. https://apps.who.int/iris/handle/10665/43512.
Vassar, M. and Holzmann, M. (2013). The retrospective chart review: important methodological considerations. Journal of Educational Evaluation for Health Professions 10: 12. https://doi.org/10.3352/jeehp.2013.10.12.
Walton, M. (1986). The Deming Management Method. New York: Perigree.
World Health Organization (2009). WHO Guidelines for Safe Surgery 2009: Safe Surgery Saves Lives. Geneva, Switzerland: WHO Press, World Health Organization.
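As flagged in step 5 of Appendix 5.A, the same paired comparison can be scripted for larger datasets. The following is a minimal Python/pandas sketch of the Excel procedure; the workbook name 'ratings.xlsx' and sheet names 'X' and 'Y' are hypothetical assumptions for the example:

```python
# Hypothetical pandas analogue of the Appendix 5.A cell-by-cell comparison.
# Assumes both sheets share identical variable headers, order, and dimensions.
import pandas as pd

x = pd.read_excel("ratings.xlsx", sheet_name="X")  # rater X
y = pd.read_excel("ratings.xlsx", sheet_name="Y")  # rater Y

# DataFrame.compare returns only the cells where the two raters disagree,
# side by side: the scripted analogue of the flagged third sheet in step 3.
discrepancies = x.compare(y)
print(discrepancies)
```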
6
Empirical and Translational Pilots
6.1 Introduction

Empirical and translational pilot studies provide preliminary evidence of efficacy and translation potential. They are used to determine if further experimentation will be worthwhile (Box 6.1; Table 6.1). Empirical pilot studies provide evidence that the intervention ‘might work’ (Bowen et al. 2009). Efficacy potential is assessed by the quality of the outcome, or response, variables. The response should be large enough to be measured (magnitude), be stable and consistent over multiple determinations (accuracy and precision), and behave as predicted by the research hypothesis (relevance). The goal is to obtain the largest possible signal from the data. Therefore, empirical pilots should be conducted under controlled and relatively ideal conditions, rather than attempting a more realistic replication of the clinical or disease model. Generalisation is not a consideration at this phase.

BOX 6.1
Empirical and Translational Pilot Studies

Sample size is optimised for information density and information power.

Empirical pilots provide evidence of potential efficacy under ideal conditions.
Goal: Signal maximisation
Is the response large enough to be measured? Stable enough to be reliable? Expected from the research hypothesis? Realistic compared to previously published data?

Translational pilots provide evidence of generalisability for a variety of models and conditions.
Goal: Signal robustness
When assessed over a range of animal/disease models and operating conditions, are results consistent? Reliable? Concordant?
Table 6.1: Key questions, validity criteria, tools, and deliverables for empirical versus translational pilot studies.

Empirical pilots
Is there preliminary evidence of intervention and outcome feasibility (the intervention ‘might work’)?
Are outcomes and endpoints
▪ large enough to be measured? Can they be measured easily? Are laboratory readouts within limits of detection?
▪ stable and consistent (minimal trend, minimal variation)?
▪ clinically or biologically meaningful (concept validity)?
▪ Does the intervention do what it was supposed to do?
▪ Is the intervention potentially cost- and time-effective to implement?
Validity
▪ Internal validity
▪ Construct validity
Tools
▪ Small, nimble, designed, randomised controlled experiments
Deliverables
▪ Progression decision metrics
▪ Effect size and confidence intervals

Translational pilots
Is there preliminary evidence of efficacy and concordance?
Efficacy
▪ Are models and endpoints clinically relevant?
▪ Are results clinically or biologically meaningful?
Concordance
▪ Are results robust?
▪ Are results consistent and reproducible over multiple independent models/locations/conditions?
Validity
▪ Internal validity
▪ External validity: construct validity, face validity, predictive validity
Tools
▪ Appropriately designed experiments; adequate and appropriate sample sizes, randomisation, blinding, and controls
▪ Multi-batch experiments
▪ Systematic heterogeneity
Deliverables
▪ Progression decision metrics
▪ Effect size and confidence intervals
▪ Proof of principle, proof of concept
▪ Statistical significance tests may be appropriate, depending on study objectives
Translational pilot studies (‘Will it work?’) provide preliminary evidence of generalisability (Bowen et al. 2009). Generalisable results are broadly applicable over a range of disease models, animal models, and operating conditions. Preclinical proof of principle (pPoP) experiments demonstrate the effect of the intervention on the target disease process or pathophysiology. Preclinical proof of concept (pPoC) experiments demonstrate the effect of the intervention in clinically relevant animal models, assessed by measured outcomes with a direct relationship to the clinical disease under study (Dolgos et al. 2016; Table 6.2). The goal of translational research is the evidence-based transition from ‘bench to bedside’ – the progression from basic science and/or mechanistic studies to clinically relevant animal models, clinical proof of concept studies, and eventually to randomised clinical trials (Rossello and Yellon 2016; Heusch 2017). Animal-based laboratory studies are usually exploratory, designed to investigate multiple factors with small sample sizes. Therefore, sample sizes for preclinical pilot studies will be determined by information power rather than by statistical power.
There are no hard and fast criteria as to how to maximise external validity, or how to decide how much face or construct validity is needed for predictive validity. External validity criteria tend to be subjective and are therefore subject to bias in both interpretation and practical application. Trade-offs between the various types of external validity will be necessary for models of complex multifactorial diseases, such as oncological and neurodegenerative diseases (Pitkänen et al. 2013; Tadenev and Burgess 2019).

For a model to be ‘good enough’ for translational application, it needs to be informative and reasonably predictive of potential efficacy. A model can be useful without strict model fidelity, which in any case will rarely be possible (Pitkänen et al. 2013; Galanopoulou et al. 2017), nor is it always necessary (Russell and Burch 1959). The best option is to understand the general principles of external validity and learn how to apply them on a study-by-study basis. A practical guide for ensuring construct validity was developed by Henderson et al. (2013). This is a 25-item checklist in four domains: the animal model, the clinical condition or disease to be modelled, experimental outputs, and experimental operations (Table 6.3).

For preclinical drug development models, Ferreira et al. (2019) developed a rigorous scoring system and score calculator for comparing animal models and identifying the model that will best fit specific research objectives. The calculator is based on 22 validation criteria in eight domains: epidemiological, symptomology and disease natural history, genetic, biochemical, aetiological, histological, pharmacological, and endpoints.

6.2.2.2 Sex as a Biological Variable

Sex is an essential variable for ensuring construct validity. Sex differences in basic biology, pathophysiology, pharmacokinetics, and response to interventions have been reported for numerous diverse animal models (Karp et al. 2017; Wilson et al. 2022). However, female animals are still markedly under-represented in much of biomedical research (Will et al. 2017; Karp and Reavey 2019). There is no evidence that females are inherently more variable than males, a common reason for justifying male-only studies (Beery and Zucker 2011; Beery 2018). As it is, many published animal-based studies are based on only a single sex or are too underpowered to detect potential interaction of sex with other factors.
Clinical setting
Because sex bias in preclinical research greatly coefficient b is obtained by nonlinear regression
limits translation, National Institute of Health Y = aXb, or more commonly, by least-squares
best-practice standards strongly encourage inclu- regression on log-transformed variables: log
sion of female animals in addition to males (Y) = log(a) + b log (X).
(Clayton and Collins 2014 and incorporation of Ratios (for example, mg/kg body mass) should be
sex as a biological variable in study design used only with considerable caution. Ratios implic-
(Miller et al. 2017; Honarpisheh and McCullough itly assume the response Y is directly proportional to
2019). Regulatory guidelines covering chemical body size (b = 1). If b is not equal to 1, then ratio data
and drug toxicity testing may mandate testing of will result in considerable statistical bias, spurious
both sexes at each dose level (e.g. OECD Test correlation, and greatly reduced precision. Body size
Guidelines https://www.oecd.org/). differences should be accounted for by regression-
based methods, such as analysis of covariance
Sample size considerations. Incorporating both or multiple regression (Tanner 1949; Kronmal
sexes into the study design does not require dou- 1993; Packard and Boardman 1988, 1999; Jasienski
bling of sample size, nor does it need to cost and Bazzaz 1999). Body mass is useful for laboratory
more money (Clayton 2015). Including sex as a animal-to-human drug dosage conversions if
biological variable is handled by appropriate sta- exponents are verified by current evidence or
tistical study design. If sex is not of interest as a appropriate pharmacological models. However,
primary explanatory (independent) variable, body surface area (e.g. mg/cm3) has been long dis-
variation due to sex differences can be controlled credited as an imprecise, flawed, and obsolete met-
by statistically based methods, such as stratified ric without any biological basis and should not be
randomisation on sex and sample size balance. If used (Tang and Mayersohn 2011; Blanchard and
sex is included as an independent variable, Smoliga 2015).
statistical power for detecting sex effects is Allometric equations may be extremely unreli-
increased by factorial and repeated-measures able when used for prediction, especially for drug
designs, and further improved by incorporating dosages and pharmacokinetics. General scaling
sex-related covariates, such as weight. relationship approximations can provide initial
approximations for expected forms of the rela-
6.2.2.3 Body Size and Allometric Scaling tionship between size and a given physiological
Body size differences between model and target spe- response (Calder 1984; Schmidt-Nielsen 1984).
cies are a major challenge to construct validity. However, extrapolation to other species based
Commonly, allometric relationships are determined on inappropriate conversion assumptions and
for cross-sectional data over a range of body sizes insufficient evidence of efficacy can result in cat-
from multiple species. Allometric models are used astrophic outcomes (Leist and Hartung 2013;
for scaling pharmacokinetics, risk, drug dosages Kimmelman and Federico 2017; Van Norman
(Caldwell et al. 2004; Huang and Riviere 2014), 2019). Validity of scaling exponents depends on
tumour growth (Pérez-García et al. 2020), and phys- the type of comparison (interspecific or intraspe-
iological responses. They can also be used to esti- cific), the data used to construct the relationship,
mate the likely range of responses if specific data and validity of the allometric model itself (choice
for a given species are unavailable (Lindstedt and of the best-fit line, data distribution, influence of
Schaeffer 2002). outliers, measurement error). The allometric
The relationship between body size for the refer- models assume that the response is determined
ence and target species is by size-related factors alone and do not account
for other potential determinants, such as age,
b
X target species or strain, body composition, and sex
Y=
X ref (Eleveld et al. 2022). Sex effects in mice may be
trait-specific; case-by-case evaluation of drug dos-
where X is a measure of body size, such as body age scaling for mice is recommended (Wilson
mass, and b is the scaling coefficient. The scaling et al. 2022).
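To make the two calculations above concrete, here is a minimal sketch (not from the text) that estimates b by least-squares regression on log-transformed values and then applies the ratio form Y = (X_target / X_ref)^b. All species masses and response values are invented placeholders for illustration only.

```python
# Minimal sketch: fit log(Y) = log(a) + b*log(X), then scale a reference
# response to a target body size. All data values are hypothetical.
import math

# Hypothetical cross-sectional data: body mass X (kg) and response Y
X = [0.02, 0.25, 3.0, 10.0, 70.0]
Y = [0.9, 6.5, 45.0, 110.0, 540.0]

lx = [math.log(x) for x in X]
ly = [math.log(y) for y in Y]
n = len(X)
mx, my = sum(lx) / n, sum(ly) / n

# Least-squares slope b of the log-log regression
b = sum((xi - mx) * (yi - my) for xi, yi in zip(lx, ly)) / \
    sum((xi - mx) ** 2 for xi in lx)
log_a = my - b * mx
print(f"estimated scaling coefficient b = {b:.3f}, a = {math.exp(log_a):.3f}")

# Scale a reference-species response to a target species via (X_target/X_ref)^b
X_ref, X_target, Y_ref = 0.25, 70.0, 6.5   # e.g. rat -> human; made-up values
Y_target = Y_ref * (X_target / X_ref) ** b
print(f"predicted target-species response = {Y_target:.1f}")
```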
Sample size considerations. For a simple allometric regression based on two variables, a rule-of-thumb sample size is N ≥ 25 if variance is expected to be high (Jenkins and Quintana-Ascencio 2020). Sample size can be determined more formally by iteration for a pre-specified precision of the regression coefficient b. Precision is calculated from the two-sided 100(1 − α)% confidence interval, b ± t(1−α/2, n−2) SE(b). For multiple regression models, sample size can be determined from Cohen’s effect size f² = R²/(1 − R²) (Chapter 15). Disadvantages of this method are the requirement for preliminary estimates of R², and the assumption of a straight-line fit. More advanced methods for sample size determination, based on joint estimation of the intercept and slope and estimation of the non-centrality parameter, have been described (Colosimo et al. 2007; Jan and Shieh 2019). Large formal analyses involving multiple strains or species should consider phylogenetic regression methods (Cooper et al. 2016).
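A minimal sketch of the iterative precision-based calculation described above: increase n until the confidence half-width t(1−α/2, n−2) × SE(b) falls below a pre-specified target. SE(b) is approximated here by the standard simple-regression result s / (s_x √(n − 1)); the values of s, s_x, and the target half-width are assumed pilot inputs, not figures from the text.

```python
# Iterate n until the two-sided CI half-width for the slope b is small enough.
from scipy.stats import t

s = 0.40        # anticipated residual SD of log(Y) (assumed)
s_x = 1.5       # anticipated SD of log(X), i.e. spread of body sizes (assumed)
target = 0.10   # required CI half-width for b (assumed)
alpha = 0.05

n = 4
while True:
    se_b = s / (s_x * (n - 1) ** 0.5)                  # SE(b) ~ s / (s_x * sqrt(n-1))
    half_width = t.ppf(1 - alpha / 2, df=n - 2) * se_b
    if half_width <= target:
        break
    n += 1

print(f"n = {n} gives a CI half-width of {half_width:.3f} for b")
```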
6.3 Sample Size Determination

Sample sizes for preclinical pilots are determined by information density and information power, rather than by statistical power (Box 6.5). The goal of pilot studies is to obtain the maximum amount of useable, scientifically productive data from the fewest number of animals. This contrasts with sample size for definitive or confirmatory experiments, which must be based on a predefined primary outcome, effect size, and power. The concept of information power is borrowed from qualitative research: the more relevant the information contained in the sample, the fewer subjects are needed (Malterud et al. 2016).

BOX 6.5
Information Density and Power

Information density
Maximum amount of useable, scientifically productive data obtained from each animal.
Sample size features: Trade-off between efficiency and effectiveness.

Information power
Maximum amount of useable, scientifically productive data from the fewest number of animals.
Sample size features: Statistical study design, sequential testing strategy, internal validity.

6.3.1 Information Density

Information density is the maximum amount of useable, scientifically productive data that can be obtained from each animal. However, measuring as many outcome variables as possible is not the same as maximising the amount of useable information. There is a trade-off between technical efficiency and effectiveness (or thoroughness; Hollnagel 2009). Efficiency refers to procedural leanness: an efficient experiment minimises the investment of animals, resources, time, and workload needed to obtain the information required to meet study objectives. Effectiveness (or thoroughness) refers to task quality: an effective study produces data that are correct, accurate, timely, and of uniformly high quality. An effective study will also include recognition and prompt correction of mistakes or errors in protocol implementation.

Design strategies for information density will involve trade-offs between animal welfare measures (refinement) and overall reduction of animal numbers (Nunamaker et al. 2021). Emphasis on technical speed and efficiency comes at the expense of data quality. Measuring too many outcome variables can produce results that are too difficult to interpret (Vetter and Mascha 2017), but measuring too few may miss important relationships or fail to properly capture the test effect. More data-intensive procedures per subject will reduce the number of animals that can be processed in a given time. On the other hand, as complexity increases, procedural times also increase, together with the potential for increased suffering and risk of adverse events (such as the animal dying before completion of the procedure). After methodology and protocol have been standardised, experimental performance can shift to an emphasis on efficiency, although frequent monitoring of task quality throughout the research cycle is advised. Emphasis on task quality over task completion may necessitate a shift in lab culture, and will require adequate oversight, communication, and support from investigators (Reynolds et al. 2019).
Sample size for information density. Pilot studies should prioritise thoroughness over efficiency. Therefore, operational constraints (time, skilled technical personnel, budget, resources) will determine the number of endpoints measured on each animal, and therefore the number of animals that can be processed in a given time (Chapter 5). The number of endpoints will be determined by procedural needs per task and by variable priority (‘need to know’ mission-critical variables versus less important ‘nice to know’ variables). When surgical instrumentation is required, the number of procedures that can be realistically completed per animal is assessed by placing strict limits on maximum permissible surgical duration and time under anaesthesia.
6.3.2 Information Power

Information power is the maximum amount of useable, scientifically productive data obtained from the fewest number of animals (sensu Malterud et al. 2016). Even for pilot studies, formal statistical experimental designs are strongly recommended. Compared to trial-and-error experimentation, experimental designs make the variance of the responses as small as possible, and avoid or minimise bias in estimates of effects. Bias is further reduced by randomisation and blinding. Designed pilot experiments greatly increase the chances of detecting a difference in responses between test interventions.

An experiment involves changing one or more explanatory variables (factors) and observing the effects on one or more response variables. With exploratory animal-based studies, there will be a large number of potential factors and very little prior knowledge of how they might be expected to affect the response or each other. The conventional approach is experiments limited to two-group and one-factor-at-a-time (OFAT) comparisons, with sample size calculated as multiples of the number of experimental ‘groups’ (Box and Draper 1987; Czitrom 1999). This approach is slow, inefficient, misses potentially useful information, and wastes a large number of animals. In contrast, designed experiments, especially factorial-type designs, permit simultaneous assessment of multiple input variables and are flexible, efficient, and very economical. They can therefore contribute to large savings in animal numbers for much more information. In general, designed experiments involve selecting the most likely set of candidate factors, deciding on a small fixed number of levels for each factor (usually a high and a low value), then conducting the experiment on all possible combinations of those factors and factor levels. Each run combination is randomly assigned to an experimental unit. Genuine replicates made at the same run combination, and the addition of centre points, provide an estimate of the variance for the effect (usually a regression coefficient). Therefore, sample size is determined by the number of replicate runs, not by ‘group size’ (Box and Draper 1987; Box et al. 2005; Chapter 19).
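The enumerate-all-combinations logic described above is mechanical, so a minimal sketch may help. The snippet below builds a two-level full factorial for three factors, replicates each run combination, adds centre points, and randomises the run order; the factor names and replicate counts are hypothetical placeholders, not a prescription from the text.

```python
# Minimal sketch: enumerate a 2^3 full factorial (coded -1/+1), replicate,
# add centre points, and randomise run order.
import itertools
import random

factors = ["dose", "interval", "age"]      # hypothetical factor names
levels = [-1, 1]

runs = [dict(zip(factors, combo))
        for combo in itertools.product(levels, repeat=len(factors))]

n_replicates = 2                           # genuine replicates per run combination
n_centre = 3                               # centre points (0,0,0) estimate pure-error variance
design = runs * n_replicates + [dict.fromkeys(factors, 0)] * n_centre

random.shuffle(design)                     # each randomised run goes to one experimental unit
for i, run in enumerate(design, 1):
    print(i, run)
print(f"total experimental units = {len(design)}")   # 8*2 + 3 = 19
```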
The recommended experimental strategy is sequential testing in three phases (Trutna et al. 2012):

1. Factor reduction. Screening designs are used to narrow down a large number of candidate factors to a smaller subset of the most important. Factor ‘importance’ is assessed by the size of main effects and two-way interactions. Classic screening designs include fractional factorials and Plackett-Burman designs (Montgomery and Jennings 2006). Modern, computer-generated definitive screening designs are more efficient and less biased than older designs, resulting in more precise and reliable results. Definitive screening designs should be used only in the early stages of experimentation, and only if there are four or more factors to be assessed. Factors are usually continuous variables, but designs can accommodate a few two-level categorical factors. It may be necessary to conduct separate experiments on each level of a multi-level categorical factor to avoid problems with interpretation (Jones and Nachtsheim 2011).

2. Factor importance. After a smaller subset of the most important factors has been identified, experiments are conducted on these few remaining factors to determine the significance of interaction effects and to identify regions of the optimal or target response. Typical designs include full factorial and response surface designs. Responses are quantified by linear or polynomial regression and ANOVA table statistics (Box et al. 2005; Montgomery 2017).

3. Factor testing. Optimising experimental conditions will maximise the experimental signal if it exists. Therefore, definitive experiments can proceed with greater certainty of success.

Results from each experiment provide feedback on whether or not the experiment is converging to the target solution predicted by the research hypothesis. Sequential feedback means that the study factors can be changed strategically and sensibly as information from each stage becomes available, rather than having to resort to trial-and-error tweaking (Czitrom 1999; Box et al. 2005). Because the factorial approach allows much greater precision for a given number of experimental runs, it is much more likely that a true effect will be detectable and not swamped by experimental error and unknown variation.
Sample size for information power. Sample size recommendations for multi-batch studies are a minimum of three animals per ‘group’ (Karp et al. 2020). However, with screening designs, fewer animals can be used, because sample size is determined not by group size but by the number of replicate runs. Statistical power increases both with the size of the regression coefficients and with coefficient hierarchy; that is, more power is associated with main effects and less power with interactions (Jones and Nachtsheim 2011). Dean and Lewis (2006) is an excellent introduction to screening designs. Customised screening designs can be generated in commercially available packages such as SAS, SAS JMP Pro, and R (e.g. Lawson 2020).

Example: Information Power: Conventional Versus Screening Designs

Investigators wished to evaluate the efficacy potential of a trial vaccine in mice. They thought that measurements of efficacy would be most strongly affected by mouse strain, vaccine dose, challenge-killing interval, mouse age at vaccination, and mouse sex. The initial study plan was to evaluate six mouse strains, three vaccine doses, three challenge-killing intervals, and three ages, on both male and female mice. The goals were to determine what combination of factors contributes to the ‘best’ response, and to quantify any possible sex differences.

Conventional approach. The investigators proposed a series of experiments designed as a series of two-group comparisons or ‘one-way ANOVAs’ on combinations of the major factors. They intended to use five mice per group ‘because that was necessary for statistical significance’ and was ‘what everyone else does’. They concluded that at least 1620 mice would be required, for a total of 324 experimental ‘groups’ with five animals per group: strain × dose × interval × age × sex = 6 × 3 × 3 × 3 × 2 = 324 groups, and 324 groups × 5 mice = 1620 mice.

This piecemeal approach has serious disadvantages. It is likely to miss the most important factors or the optimal factor levels giving the ‘best’ response, and it cannot estimate interactions between factors. It is highly unlikely that an experiment involving over a thousand animals could be processed in a reasonable period of time. Animals will be wasted as they age out of the study before they can be used. Finally, a large un-designed experiment is difficult to control and risks potentially large, unmanageable, and undetectable sources of variation that will swamp true experimental signals (Reynolds 2021).

Definitive screening design. The same study could be more efficiently and economically designed as a screening study, using far fewer animals and with an increased probability of detecting a real effect. Strain and sex are categorical factors. Dose, interval, and age are continuous factors with three levels each.
A reasonable approach for a screening study is to run separate definitive screening experiments on each strain, with each strain experiment run in blocks so that experimental effort can be distributed over two sessions. Table 6.4 is an example of a definitive screening design to be run in two blocks. For each trial, the 18 runs for each strain are performed in random order. Nine males and nine females are required for each experiment. Each factor level is replicated seven times at each of the low and high values, and four times at the intermediate levels. The centre points (0, 0, 0) provide the variance estimates for assessing the significance of main effects. The total number of mice required is now 6 strains × 18 runs = 108 mice. This design requires fewer than 10% of the number of animals originally proposed.

Table 6.4: Example of a definitive screening experiment for assessing vaccine efficacy in mice. The design was generated in SAS JMP Pro 16. Experiments are conducted on separate strains, in two blocks, on the factors of sex, dose, interval, and age. This type of design requires less than 10% of the animals originally requested.

Run | Block | Dose | Interval | Age | Sex
1 | 1 | −1 | −1 | 1 | Female
2 | 1 | −1 | −1 | 1 | Male
3 | 1 | 0 | 1 | 1 | Female
4 | 1 | 0 | −1 | −1 | Male
5 | 1 | −1 | 1 | 0 | Male
6 | 1 | 1 | −1 | 0 | Female
7 | 1 | 0 | 0 | 0 | Male
8 | 1 | 1 | 1 | −1 | Male
9 | 1 | 0 | 0 | 0 | Female
10 | 1 | 1 | 1 | −1 | Female
11 | 2 | −1 | 0 | −1 | Male
12 | 2 | 1 | 1 | 1 | Male
13 | 2 | 1 | −1 | 1 | Male
14 | 2 | −1 | −1 | −1 | Female
15 | 2 | −1 | 1 | 1 | Male
16 | 2 | −1 | 1 | −1 | Female
17 | 2 | 1 | 0 | 1 | Female
18 | 2 | 1 | −1 | −1 | Female
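The properties quoted above can be re-derived directly from the coded runs in Table 6.4. The short check below (a sketch in Python, which the book itself does not use) counts the factor-level replicates, the sex balance, and the centre points, and compares the screening total against the conventional proposal.

```python
# Arithmetic check of the example, using the coded runs from Table 6.4:
# (block, dose, interval, age, sex) for each of the 18 runs.
runs = [
    (1, -1, -1,  1, "F"), (1, -1, -1,  1, "M"), (1,  0,  1,  1, "F"),
    (1,  0, -1, -1, "M"), (1, -1,  1,  0, "M"), (1,  1, -1,  0, "F"),
    (1,  0,  0,  0, "M"), (1,  1,  1, -1, "M"), (1,  0,  0,  0, "F"),
    (1,  1,  1, -1, "F"), (2, -1,  0, -1, "M"), (2,  1,  1,  1, "M"),
    (2,  1, -1,  1, "M"), (2, -1, -1, -1, "F"), (2, -1,  1,  1, "M"),
    (2, -1,  1, -1, "F"), (2,  1,  0,  1, "F"), (2,  1, -1, -1, "F"),
]

doses = [r[1] for r in runs]
print("dose level counts:", {lvl: doses.count(lvl) for lvl in (-1, 0, 1)})  # 7 low, 4 centre, 7 high
print("females:", sum(r[4] == "F" for r in runs),
      "males:", sum(r[4] == "M" for r in runs))                             # 9 and 9
print("centre points (0,0,0):", sum(r[1:4] == (0, 0, 0) for r in runs))

conventional = 6 * 3 * 3 * 3 * 2 * 5        # 324 groups x 5 mice = 1620
screening = 6 * len(runs)                   # 6 strains x 18 runs = 108
print(f"conventional = {conventional}, screening = {screening}, "
      f"ratio = {screening / conventional:.1%}")                            # ~6.7%, i.e. <10%
```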
Definitive studies based on factorial designs will require formal simulation-based power analyses, especially if interaction effects are of primary interest (Chapter 19).

6.3.3 Veterinary Clinical Trials

Data from empirical clinical pilots can be used to assess feasibility and also to obtain estimates of sample size parameters (such as outcome variance, event rates, and effect size) for the later clinical trial (Chapter 16.4). The suggested sample size for a clinical trial pilot is a minimum of 12 subjects per arm, for a total of 24 subjects (Julious 2005). Up to 70 subjects per arm may be necessary, depending on the effect size to be estimated and the amount of precision required for the estimate of the variance (Teare et al. 2014).

6.3.4 A Note on Safety and Tolerability

Empirical and translational pilot studies are usually too small to reliably assess the safety or tolerability of a clinical intervention, especially if the safety metric is the occurrence of an adverse event (yes/no). Adverse events are usually rare, and zero events in a small study do not mean that the intervention is safe. The sample size N needed to detect at least one adverse event can be approximated if the probability of detection α and the expected prevalence p of adverse events are known (or can be guessed). The sample size required for a safety study can be approximated by probability-based feasibility calculations (Chapter 8).
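The approximation referred to above follows from the probability of seeing at least one event: 1 − (1 − p)^N. Setting this equal to the desired detection probability α and solving gives N ≥ ln(1 − α)/ln(1 − p). A minimal sketch, with an assumed prevalence and detection probability:

```python
# Smallest N with P(at least one adverse event) >= alpha, given prevalence p.
import math

def n_to_detect_one(p: float, alpha: float) -> int:
    """N >= ln(1 - alpha) / ln(1 - p); both p and alpha in (0, 1)."""
    return math.ceil(math.log(1 - alpha) / math.log(1 - p))

# e.g. a guessed prevalence of 5% and a 90% chance of observing >= 1 event
print(n_to_detect_one(p=0.05, alpha=0.90))   # 45 animals
```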
6.4 Assessing Evidentiary Strength

Strength of evidence means having confidence in study results. There is no sense continuing a given line of enquiry or method of experimentation if only poor or unreliable results are produced. Strong evidence is obtained by consistency of results across multiple lines of experimentation. Exploratory data analysis by data visualisation methods enables rapid assessment of the size, direction, and precision of results. The most promising findings are verified by sequential testing and replication, first with different models of the same syndrome, followed by independent replication within and across laboratories or locations (Box 6.6).
[Figure 6.2 (image): Profile plot. Mean daily intake of essential nutrients for 27 obese dogs, with 95% simultaneous confidence intervals (y-axis: standardised mean ratio and 95% CI; x-axis: nutrients). Values are standardised to a common ratio based on recommended daily allowance levels. (a) Standardised mean values and confidence intervals for 20 nutrients; the dotted line is the line of equality for the standardised recommended intake. (b) Mean observed values and confidence intervals for the subset of nutrients with standardised means less than 10. Source: Data from German et al. (2015).]
[Figure 6.4 (image): Interaction plots for assessing generalisability. The graphs show mean differences between two mouse strains, with 95% confidence intervals, across four experimental replicates (x-axis: replicates 1–4). (a) Poor repeatability across replicates; evidence for generalisability is poor. (b) Good repeatability across replicates; evidence for generalisability is good, as the data suggest treatment effects are relatively consistent and independent of replicate effects. Source: Adapted from von Kortzfleisch et al. (2020).]
Table 6.6: Six types of replication studies, defined by number of experimental locations, ‘conditions’, model system (animal or disease model), and variables measured (explanatory and/or response). Replication studies can be the same across all components (direct replication), or differ in one or more components (conceptual replication).

Location | Experimental conditions | Model system | Variables | What is evaluated?

Direct replication
Same | Same | Same | Same | Single laboratory study with identical repeats (direct internal replication). Can assess sampling error, quality standards, mistakes, fraud, etc.; measurement error, precision. External validity (limited) with multi-batch designs (planned heterogeneity).
Differ | Same | Same | Same | Multi-laboratory study (independent replication). Assesses effects of different lab procedures, personnel, equipment, environments, etc. in different labs. Robustness (between laboratories); external validity.

Conceptual replication
Same | Differ | Same | Same | Systematic variation in experimental operating conditions (planned heterogeneity) between replications. Conditions that are varied include environment, batch, days, time of day, operators, suppliers, vendors, etc. Robustness (procedural/process replication).
Same | Same | Same | Differ | Systematic changes in explanatory factors, response variables, and/or methods for measuring or quantifying the response. Assesses operationalisation of the research question (e.g. ‘benefit’ = symptom relief versus improved function versus improved survival). Robustness (outcome replication); external validity.
Same | Same | Differ | Same | Systematic changes in the animal model/species/strain and/or disease model. Tests generalisation and robustness of results in a new model system, related to the measures of efficacy or effect tested. Robustness (animal and disease model); external validity.
Same or differ | Same or differ | Same or differ | Same or differ | Multiple simultaneous systematic controlled changes in any or all components of study design and operations. Robustness; external validity.
improved coverage probability of the true effect size for a fixed number of animals (von Kortzfleisch et al. 2020). Replications across as few as two laboratories can produce substantial improvements in power and predictive validity, as long as protocols are standardised and internal validity is ensured (Karp 2018; Voelkl et al. 2018; Drude et al. 2021).

6.4.8 Design and Sample Size for Replication

Any statistically based experimental design appropriate for the study can be used for replication studies (factorial, split-plot, randomised complete block, etc.). Internal validity must be high, with clearly specified methods for randomisation and allocation concealment (blinding), and defined inclusion and exclusion criteria (Drude et al. 2021).

Sample size is greatly reduced by the adoption of multi-batch designs, rather than the conventional approach of replicating independently powered experiments. Multi-batch designs consist of several small independent experiments conducted at separate time points by the same laboratory, or even in different laboratories. They are in effect large-scale randomised complete block experiments, where each block is a batch or replicate. Results are combined to assess intervention effects. The suggested minimum is three batches (Karp et al. 2020; von Kortzfleisch et al. 2020).
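A minimal simulation sketch of the block structure described above, with all variance components and effect sizes assumed for illustration: each batch contributes its own shift, the treatment effect is estimated within each batch, and the batch-level differences are combined with a one-sample t-test on df = batches − 1. This is only one simple way to analyse such a design, not the method prescribed by the text.

```python
# Simulated power for a multi-batch (randomised complete block) comparison.
import random
from scipy.stats import t

def simulate_power(batches=3, n_per_arm=5, effect=1.0, sd_batch=0.8,
                   sd_resid=1.0, alpha=0.05, n_sims=2000, seed=1):
    rng = random.Random(seed)
    t_crit = t.ppf(1 - alpha / 2, df=batches - 1)
    hits = 0
    for _ in range(n_sims):
        diffs = []
        for _ in range(batches):
            shift = rng.gauss(0, sd_batch)    # batch effect cancels within a batch
            ctrl = [shift + rng.gauss(0, sd_resid) for _ in range(n_per_arm)]
            trt = [shift + effect + rng.gauss(0, sd_resid) for _ in range(n_per_arm)]
            diffs.append(sum(trt) / n_per_arm - sum(ctrl) / n_per_arm)
        mean_d = sum(diffs) / batches
        var_d = sum((d - mean_d) ** 2 for d in diffs) / (batches - 1)
        t_stat = mean_d / (var_d / batches) ** 0.5
        hits += abs(t_stat) > t_crit
    return hits / n_sims

print(f"estimated power: {simulate_power():.2f}")
```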
References

Beery, A.K. (2018). Inclusion of females does not increase variability in rodent research studies. Current Opinion in Behavioral Sciences 23: 143–149. https://doi.org/10.1016/j.cobeha.2018.07.016.
Beery, A.K. and Zucker, I. (2011). Sex bias in neuroscience and biomedical research. Neuroscience and Biobehavioral Reviews 35: 565–572. https://doi.org/10.1016/j.neubiorev.2010.07.002.
Belzung, C. and Lemoine, M. (2011). Criteria of validity for animal models of psychiatric disorders: focus on anxiety disorders and depression. Biology of Mood & Anxiety Disorders 1 (1): 9. https://doi.org/10.1186/2045-5380-1-9.
Berkman, N.D., Lohr, K.N., Ansari, M. et al. (2013). Grading the strength of a body of evidence when assessing health care interventions for the effective health care program of the Agency for Healthcare Research and Quality: an update. In: Methods Guide for Comparative Effectiveness Reviews (prepared by the RTI-UNC Evidence-based Practice Center under Contract No. 290-2007-10056-I). AHRQ Publication No. 13(14)-EHC130-EF. Rockville, MD: Agency for Healthcare Research and Quality. www.effectivehealthcare.ahrq.gov/reports/final.cfm (accessed 2022).
Bespalov, A., Wicke, K., and Castagné, V. (2019). Blinding and randomization. In: Good Research Practice in Non-Clinical Pharmacology and Biomedicine, Handbook of Experimental Pharmacology, vol. 257 (ed. A. Bespalov, M. Michel, and T. Steckler). Springer Cham. https://doi.org/10.1007/164_2019_279.
Bishop, D.V.M., Thompson, J., and Parker, A.J. (2022). Can we shift belief in the ‘Law of Small Numbers’? Royal Society Open Science 9: 211028. https://doi.org/10.1098/rsos.211028.
Blanchard, O.L. and Smoliga, J.M. (2015). Translating dosages from animal models to human clinical trials – revisiting body surface area scaling. FASEB Journal 29 (5): 1629–1634. https://doi.org/10.1096/fj.14-269043.
Bowen, D.J., Kreuter, M., Spring, B. et al. (2009). How we design feasibility studies. American Journal of Preventive Medicine 36 (5): 452–457. https://doi.org/10.1016/j.amepre.2009.02.002.
Box, G.E.P. and Draper, N.R. (1987). Empirical Model-Building and Response Surfaces. New York: Wiley.
Box, G.E.P., Hunter, W.G., and Hunter, J.S. (2005). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 2e. New York: Wiley.
Calder, W.A. III (1984). Size, Function and Life History. Cambridge, MA: Harvard University Press.
Caldwell, G.W., Masucci, J.A., Yan, Z., and Hageman, W. (2004). Allometric scaling of pharmacokinetic parameters in drug discovery: can human CL, Vss and t1/2 be predicted from in-vivo rat data? European Journal of Drug Metabolism and Pharmacokinetics 29: 133–143.
Clayton, D. and Hills, M. (1993). Statistical Models in Epidemiology. Oxford: Oxford University Press.
Clayton, J.A. (2015). Studying both sexes: a guiding principle for biomedicine. FASEB Journal 30 (2): 519–524. https://doi.org/10.1096/fj.15-279554.
Clayton, J.A. and Collins, F.S. (2014). Policy: NIH to balance sex in cell and animal studies. Nature 509 (7500): 282–283. https://doi.org/10.1038/509282a.
Colosimo, E.A., Cruz, F.R.B., Miranda, J.L.O., and Van Woensel, T. (2007). Sample size calculation for method validation using linear regression. Journal of Statistical Computation and Simulation 77 (6): 505–516. https://doi.org/10.1080/00949650601151729.
Cook, J.A., Hislop, J., Altman, D.G. et al. (2015). Specifying the target difference in the primary outcome for a randomised controlled trial: guidance for researchers. Trials 16: 12. https://doi.org/10.1186/s13063-014-0526-8.
Cook, J.A., Julious, S.A., Sones, W. et al. (2017). Choosing the target difference (‘effect size’) for a randomised controlled trial – DELTA2 guidance protocol. Trials 18 (1): 271. https://doi.org/10.1186/s13063-017-1969-5.
Cooper, N., Thomas, G.H., and FitzJohn, R.G. (2016). Shedding light on the ‘dark side’ of phylogenetic comparative methods. Methods in Ecology and Evolution 7 (6): 693–699. https://doi.org/10.1111/2041-210X.12533.
Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.
Czitrom, V. (1999). One-factor-at-a-time versus designed experiments. The American Statistician 53: 126–131.
Daniel, C. (1959). Use of half-normal plots in interpreting factorial two-level experiments. Technometrics 1 (4): 311–341.
Dean, A. and Lewis, S. (2006). Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics. New York: Springer. https://link.springer.com/content/pdf/10.1007%2F0-387-28014-6.
Dirnagl, U., Bannach-Brown, A., and McCann, S. (2021). External validity in translational biomedicine: understanding the conditions enabling the cause to have an effect. EMBO Molecular Medicine 14 (2): 1757–4676. https://doi.org/10.15252/emmm.202114334.
Dolgos, H., Trusheim, M., Gross, D. et al. (2016). Translational Medicine Guide transforms drug development processes: the recent Merck experience. Drug Discovery Today 21 (3): 517–526. https://doi.org/10.1016/j.drudis.2017.01.003.
Drude, N.I., Gamboa, L.M., Danziger, M. et al. (2021). Science Forum: improving preclinical studies through replications. eLife 10: e62101. https://doi.org/10.7554/eLife.62101.
Eisenhart, C. (1968). Expression of the uncertainties of final results. Science 160: 1201–1204. https://doi.org/10.1126/science.160.3833.1201.
Eleveld, D.J., Koomen, J.V., Absalom, A.R. et al. (2022). Allometric scaling in pharmacokinetic studies in anesthesiology. Anesthesiology 136 (4): 609–617. https://doi.org/10.1097/ALN.0000000000004115.
Errington, T.M., Denis, A., Perfito, N. et al. (2021a). Reproducibility in cancer biology: challenges for assessing replicability in preclinical cancer biology. eLife 10: e67995. https://doi.org/10.7554/eLife.67995.
Errington, T.M., Mathur, M., Soderberg, C.K. et al. (2021b). Investigating the replicability of preclinical cancer biology. eLife 10: e71601. https://doi.org/10.7554/eLife.71601.
Ferreira, G.S., Veening-Griffioen, D.H., Boon, W.P.C. et al. (2019). A standardised framework to identify optimal animal models for efficacy assessment in drug development. PLoS ONE 14 (6): e0218014. https://doi.org/10.1371/journal.pone.0218014.
Filliben, J.J. and Heckert, A. (2012). Exploratory data analysis. In: NIST/SEMATECH e-Handbook of Statistical Methods (ed. C. Croarkin and P. Tobias). http://www.itl.nist.gov/div898/handbook/. https://doi.org/10.18434/M32189.
Fitts, D.A. (2011). Ethics and animal numbers: informal analyses, uncertain sample sizes, inefficient replications, and Type I errors. Journal of the American Association for Laboratory Animal Science 50: 445–453.
Fraser, H., Barnett, A., Parker, T.H., and Fidler, F. (2019). The role of replication studies in ecology. Ecology and Evolution 10: 5197–5207.
Freedman, L.P., Cockburn, I.M., and Simcoe, T.S. (2015). The economics of reproducibility in preclinical research. PLoS Biology 13: e1002165.
Frommlet, F. and Heinze, G. (2021). Experimental replications in animal trials. Laboratory Animals 55 (1): 65–75. https://doi.org/10.1177/0023677220907617.
Galanopoulou, A.S., Pitkänen, A., Buckmaster, P.S., and Moshé, S.L. (2017). What do models model? What needs to be modeled? In: Models of Seizures and Epilepsy, 2e (ed. A. Pitkänen, P.S. Buckmaster, A.S. Galanopoulou, and S.L. Moshé), 1107–1119. Elsevier.
German, A.J., Holden, S.L., Serisier, S. et al. (2015). Assessing the adequacy of essential nutrient intake in obese dogs undergoing energy restriction for weight loss: a cohort study. BMC Veterinary Research 11: 253. https://doi.org/10.1186/s12917-015-0570-y.
Goodman, S.N. and Royall, R. (1988). Evidence and scientific research. American Journal of Public Health 78 (12): 1568–1574. https://doi.org/10.2105/ajph.78.12.1568.
Henderson, V.C., Kimmelman, J., Fergusson, D. et al. (2013). Threats to validity in the design and conduct of preclinical efficacy studies: a systematic review of guidelines for in vivo animal experiments. PLoS Medicine 10 (7): e1001489. https://doi.org/10.1371/journal.pmed.1001489.
Heusch, G. (2017). Critical issues for the translation of cardioprotection. Circulation Research 120 (9): 1477–1486. https://doi.org/10.1161/CIRCRESAHA.117.310820.
Higgins, J.P., Altman, D.G., Gøtzsche, P.C. et al. (2011). The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ 343: d5928. https://doi.org/10.1136/bmj.d5928.
Higgins, J.P.T., Savović, J., Page, M.J. et al. (2022). Chapter 8: Assessing risk of bias in a randomized trial. In: Cochrane Handbook for Systematic Reviews of Interventions, version 6.3 (updated February 2022) (ed. J.P.T. Higgins, J. Thomas, J. Chandler et al.). Cochrane. www.training.cochrane.org/handbook.
Hollnagel, E. (2009). The ETTO Principle: Efficiency-Thoroughness Trade-Off: Why Things That Go Right Sometimes Go Wrong. Taylor & Francis Group, ProQuest Ebook Central. https://ebookcentral.proquest.com/lib/ufl/detail.action?docID=438714.
Honarpisheh, P. and McCullough, L.D. (2019). Sex as a biological variable in the pathology and pharmacology of neurodegenerative and neurovascular diseases. British Journal of Pharmacology 176 (21): 4173–4192. https://doi.org/10.1111/bph.14675.
Howick, J., Glasziou, P., and Aronson, J.K. (2009). The evolution of evidence hierarchies: what can Bradford Hill’s ‘guidelines for causation’ contribute? Journal of the Royal Society of Medicine 102 (5): 186–194. https://doi.org/10.1258/jrsm.2009.090020.
Huang, Q. and Riviere, J.E. (2014). The application of allometric scaling principles to predict pharmacokinetic parameters across species. Expert Opinion on Drug Metabolism & Toxicology 10: 1241–1253.
Huang, W., Percie du Sert, N., Vollert, J., and Rice, A.S.C. (2020). General principles of preclinical study design. In: Handbook of Experimental Pharmacology, vol. 257, 55–69. https://doi.org/10.1007/164_2019_277.
Jan, S.L. and Shieh, G. (2019). Sample size calculations for model validation in linear regression analysis. BMC Medical Research Methodology 19: 54. https://doi.org/10.1186/s12874-019-0697-9.
Jasienski, M. and Bazzaz, F.A. (1999). The fallacy of ratios and the testability of models in biology. Oikos 321–327.
Jenkins, D.G. and Quintana-Ascencio, P.F. (2020). A solution to minimum sample size for regressions. PLoS ONE 15 (2): e0229345. https://doi.org/10.1371/journal.pone.0229345.
Jones, B. and Nachtsheim, C.J. (2011). A class of three-level designs for definitive screening in the presence of second-order effects. Journal of Quality Technology 43 (1): 1–15. https://doi.org/10.1080/00224065.2011.11917841.
Julious, S.A. (2005). Sample size of 12 per group rule of thumb for a pilot study. Pharmaceutical Statistics 4: 287–291.
Karp, N.A. (2018). Reproducible preclinical research – is embracing variability the answer? PLoS Biology 16 (3): e2005413. https://doi.org/10.1371/journal.pbio.2005413.
Karp, N.A., Mason, J., Beaudet, A.L. et al. (2017). Prevalence of sexual dimorphism in mammalian phenotypic traits. Nature Communications 8: 15475. https://doi.org/10.1038/ncomms15475.
Karp, N.A. and Reavey, N. (2019). Sex bias in preclinical research and an exploration of how to change the status quo. British Journal of Pharmacology 176 (21): 4107–4118. https://doi.org/10.1111/bph.14539.
Karp, N.A., Wilson, Z., Stalker, E. et al. (2020). A multi-batch design to deliver robust estimates of efficacy and reduce animal use – a syngeneic tumour case study. Scientific Reports 10 (1): 6178. https://doi.org/10.1038/s41598-020-62509-7.
Kimmelman, J. and Federico, C. (2017). Consider drug efficacy before first-in-human trials. Nature 542: 25–27. https://doi.org/10.1038/542025a.
Kimmelman, J., Mogil, J.S., and Dirnagl, U. (2014). Distinguishing between exploratory and confirmatory preclinical research will improve translation. PLoS Biology 12 (5): e1001863. https://doi.org/10.1371/journal.pbio.1001863.
Kronmal, R.A. (1993). Spurious correlation and the fallacy of the ratio standard revisited. Journal of the Royal Statistical Society Series A (Statistics in Society) 156 (3): 379–392. https://doi.org/10.2307/2983064.
Landis, S.C., Amara, S.G., Asadullah, K. et al. (2012). A call for transparent reporting to optimize the predictive value of preclinical research. Nature 490: 187–191. https://doi.org/10.1038/nature11556.
Lawson, J. (2020). daewr: design and analysis of experiments with R. R package version 1.2-5. http://www.r-qualitytools.org (accessed 2022).
Lee, E.C., Whitehead, A.L., Jacques, R.M., and Julious, S.A. (2014). The statistical interpretation of pilot trials: should significance thresholds be reconsidered? BMC Medical Research Methodology 14: 41. https://doi.org/10.1186/1471-2288-14-41.
Leist, M. and Hartung, T. (2013). Inflammatory findings on species extrapolations: humans are definitely not 70-kg mice. Archives of Toxicology 87: 563–567. https://doi.org/10.1007/s00204-013-1038-0.
Lindstedt, S.L. and Schaeffer, P.J. (2002). Use of allometry in predicting anatomical and physiological parameters of mammals. Laboratory Animals 36 (1): 1–19. https://doi.org/10.1258/0023677021911731.
Lynch, J. (1982). On the external validity of experiments in consumer research. Journal of Consumer Research 9 (3): 225–239. https://doi.org/10.1086/208919.
Macleod, M.R., Lawson McLean, A., Kyriakopoulou, A. et al. (2015). Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biology 13 (10): e1002273. https://doi.org/10.1371/journal.pbio.1002273.
Malterud, K., Siersma, V.D., and Guassora, A.D. (2016). Sample size in qualitative interview studies: guided by information power. Qualitative Health Research 26 (13): 1753–1760. https://doi.org/10.1177/1049732315617444.
Markou, A., Chiamulera, C., Geyer, M. et al. (2009). Removing obstacles in neuroscience drug discovery: the future path for animal models. Neuropsychopharmacology 34: 74–89. https://doi.org/10.1038/npp.2008.173.
McGonigle, P. and Ruggeri, B. (2014). Animal models of human disease: challenges in enabling translation. Biochemical Pharmacology 87 (1): 162–171. https://doi.org/10.1016/j.bcp.2013.08.007.
McKinney, W.T. and Bunney, W.E. (1969). Animal model of depression: I. Review of evidence: implications for research. Archives of General Psychiatry 21 (2): 240–248. https://doi.org/10.1001/archpsyc.1969.01740200112015.
Miller, L.R., Marks, C., Becker, J.B. et al. (2017). Considering sex as a biological variable in preclinical research. FASEB Journal 31 (1): 29–34. https://doi.org/10.1096/fj.201600781R.
Moher, D., Hopewell, S., Schulz, K.F. et al. (2010). CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 340: c869. https://doi.org/10.1136/bmj.c869.
Montgomery, D.C. (2017). Design and Analysis of Experiments, 8e. New York: Wiley.
Montgomery, D.C. and Jennings, C.L. (2006). Chapter 1: An overview of industrial screening experiments. In: Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics (ed. A. Dean and S. Lewis), 1–20. New York: Springer.
Muhlhausler, B.S., Bloomfield, F.H., and Gillman, M.W. (2013). Whole animal experiments should be more like human randomized controlled trials. PLoS Biology 11: e1001481.
Nunamaker, E.A., Davis, S., O’Malley, C.I., and Turner, P.V. (2021). Developing recommendations for cumulative endpoints and lifetime use for research animals. Animals (Basel) 11 (7): 2031. https://doi.org/10.3390/ani11072031.
Packard, G. and Boardman, T. (1988). The misuse of ratios, indices, and percentages in ecophysiological research. Physiological Zoology 61: 1–9. https://doi.org/10.1086/physzool.61.1.30163730.
Packard, G. and Boardman, T. (1999). The use of percentages and size-specific indices to normalize physiological data for variation in body size: wasted time, wasted effort? Comparative Biochemistry and Physiology, Part A 122 (1): 37–44.
Pérez-García, V.M., Calvo, G.F., Bosque, J.J. et al. (2020). Universal scaling laws rule explosive growth in human cancers. Nature Physics 16: 1232–1237.
Pitkänen, A., Nehlig, A., Brooks-Kayal, A.R. et al. (2013). Issues related to development of antiepileptogenic therapies. Epilepsia 54 (Suppl 4): 35–43. https://doi.org/10.1111/epi.12297.
Porter, W.P., Hinsdill, R., Fairbrother, A. et al. (1984). Toxicant-disease-environment interactions associated with suppression of immune system, growth, and reproduction. Science 224 (4652): 1014–1017.
Pound, P. and Ritskes-Hoitinga, M. (2018). Is it possible to overcome issues of external validity in preclinical animal research? Why most animal models are bound to fail. Journal of Translational Medicine 16 (1): 304. https://doi.org/10.1186/s12967-018-1678-1.
Reynolds, P. (2021). Statistics, statistical thinking, and the IACUC. Lab Animal 50: 266–268. https://doi.org/10.1038/s41684-021-00832-w.
Reynolds, P.S., McCarter, J., Sweeney, C. et al. (2019). Informing efficient pilot development of animal trauma models through quality improvement strategies. Laboratory Animals 53 (4): 394–404. https://doi.org/10.1177/0023677218802999.
Richter, H. (2017). Systematic heterogenization for better reproducibility in animal experimentation. Lab Animal (NY) 46: 343–349. https://doi.org/10.1038/laban.1330.
Rossello, X. and Yellon, D.M. (2016). Cardioprotection: the disconnect between bench and bedside. Circulation 134: 574–575. https://doi.org/10.1161/circulationaha.116.022829.
Rothman, K.J. and Greenland, S. (2018). Planning study size based on precision rather than power. Epidemiology 29: 599–603.
Russell, W.M.S. and Burch, R.L. (1959). The Principles of Humane Experimental Technique. London: Methuen & Co.
Schmidt-Nielsen, K. (1984). Scaling: Why Is Animal Size So Important? Cambridge: Cambridge University Press.
Steidl, R.J., Hayes, J.P., and Schauber, E. (1997). Statistical power analysis in wildlife research. Journal of Wildlife Management 61 (2): 270–279.
Tadenev, A.L.D. and Burgess, R.W. (2019). Model validity for preclinical studies in precision medicine: precisely how precise do we need to be? Mammalian Genome 30 (5–6): 111–122. https://doi.org/10.1007/s00335-019-09798-0.
Tang, H. and Mayersohn, M. (2011). Controversies in allometric scaling for predicting human drug clearance: an historical problem and reflections on what works and what does not. Current Topics in Medicinal Chemistry 11 (4): 340–350. https://doi.org/10.2174/156802611794480945.
Tanner, J.M. (1949). Fallacy of per-weight and per-surface area standards, and their relation to spurious correlation. Journal of Applied Physiology 2 (1): 1–15. https://doi.org/10.1152/jappl.1949.2.1.1.
Teare, M.D., Dimairo, M., Shephard, N. et al. (2014). Sample size requirements to estimate key design parameters from external pilot randomised controlled trials: a simulation study. Trials 15: 264. https://doi.org/10.1186/1745-6215-15-264.
Trutna, L., Sapgon, P., Del Castillo, E. et al. (2012). Process improvement. In: NIST/SEMATECH e-Handbook of Statistical Methods. https://doi.org/10.18434/M32189.
Tukey, J. (1977). Exploratory Data Analysis. Reading: Addison-Wesley.
US EPA (Environmental Protection Agency) (2017). Causal Analysis/Diagnosis Decision Information System (CADDIS). Washington: Office of Research and Development. https://www.epa.gov/caddis-vol1/consistency-evidence.
Usui, T., Macleod, M.R., McCann, S.K. et al. (2021). Meta-analysis of variation suggests that embracing variability improves both replicability and generalizability in preclinical research. PLoS Biology 19 (5): e3001009. https://doi.org/10.1371/journal.pbio.3001009.
Van Norman, G.A. (2019). Limitations of animal studies for predicting toxicity in clinical trials: is it time to rethink our current approach? JACC: Basic to Translational Science 4 (7): 845–854. https://doi.org/10.1016/j.jacbts.2019.10.008.
Vetter, T.R. and Mascha, E.J. (2017). Defining the primary outcomes and justifying secondary outcomes of a study: usually, the fewer, the better. Anesthesia & Analgesia 125 (2): 678–681. https://doi.org/10.1213/ANE.0000000000002224.
Voelkl, B., Vogt, L., Sena, E.S., and Würbel, H. (2018). Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLoS Biology 16 (2): e2003693. https://doi.org/10.1371/journal.pbio.2003693.
von Kortzfleisch, V.T., Karp, N.A., Palme, R. et al. (2020). Improving reproducibility in animal research by splitting the study population into several ‘mini-experiments’. Scientific Reports 10 (1): 16579. https://doi.org/10.1038/s41598-020-73503-4.
Will, T.R., Proaño, S.B., Thomas, A.M. et al. (2017). Problems and progress regarding sex bias and omission in neuroscience research. eNeuro 4 (6): eneuro.0278-17.2017.
Willner, P. (1984). The validity of animal models of depression. Psychopharmacology 83: 1–17. https://doi.org/10.1007/BF00427414.
Wilson, L.A.B., Zajitschek, S.R.K., Lagisz, M. et al. (2022). Sex differences in allometry for phenotypic traits in mice indicate that females are not scaled males. Nature Communications 13: 7502. https://doi.org/10.1038/s41467-022-35266-6.
Würbel, H. (2017). More than 3Rs: the importance of scientific validity for harm-benefit analysis of animal research. Lab Animal 46: 164–167. https://doi.org/10.1038/laban.1220.
entering the study sequentially and at variable intervals. For the trial to have a chance of success, anticipated enrolment rates must agree with the numbers obtained from power-based sample size calculations (Box 7.4).

Example: Recruitment of Client-Owned Companion Animals

Investigators determined from power calculations that at least 100 subjects would be required to detect a meaningful difference between two intervention groups, with 50 subjects per comparison arm (healthy versus diseased). Clinic records indicated that a total of 90 subjects meeting study eligibility criteria had visited the clinic in the previous year, including 20 subjects with the disease of interest. However, only 50% of owners had consented to participate in previous trials at the institution. How long will the trial have to run to enrol sufficient subjects for this new trial?

Information required. Size of the patient recruitment pool, anticipated client consent rate, and expected number of patients with the disease.

The expected number that can be recruited from the total pool of eligible subjects (N_T) is approximated as

N = N_T × p_E

where p_E is the anticipated probability of enrolment (consent).
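A minimal sketch of the arithmetic N = N_T × p_E applied to the clinic numbers in the example. Note that the diseased comparison arm is the bottleneck, because only the eligible diseased patients can fill it; the code below simply makes both rates explicit.

```python
# Expected enrolment under N = N_T * p_E, using the example's numbers.
consent_rate = 0.50          # p_E: anticipated owner consent rate
eligible_per_year = 90       # N_T: eligible subjects seen per year
diseased_per_year = 20       # eligible subjects with the disease of interest

overall = eligible_per_year * consent_rate        # ~45 recruited subjects/year
diseased = diseased_per_year * consent_rate       # ~10 diseased subjects/year

print(f"years to enrol 100 subjects overall: {100 / overall:.1f}")       # ~2.2
print(f"years to fill the 50-subject diseased arm: {50 / diseased:.1f}") # 5.0
```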
Example: Trial Size for a Grant Proposal

A researcher planned a grant proposal to study a novel disease biomarker in client-owned companion animals. The plan was to develop a clinical prediction model featuring an additional 11 predictor variables already confirmed in the primary literature to be associated with severity of that disease. The study was intended to be a single-centre prospective clinical cohort study. How many animals will need to be recruited for a two-year study?

To develop an adequately robust clinical prediction model using 12 predictors (including the biomarker of interest), formal sample size calculations (Riley et al. 2020) suggested at least 260–300 subjects are required. Is this number of subjects a feasible recruitment goal?

Information required. Number of eligible subjects visiting the clinic per week or per month, anticipated client consent rate, total operating budget, processing costs per patient, and total study duration.

Clinic volume. Based on prior clinical record data, approximately 2–3 eligible animals visited the clinic per week. Fully subsidised clinical and laboratory costs encouraged high client consent rates, which were expected to average about 80%.

Budget. Laboratory and clinical work-ups were approximately $1500 per animal. The grant allowed a total operating budget of $120,000:

Maximum number processed = $120,000 / ($1500 per cat) = 80 animals in two years.

Reality check. The sample size based on power calculations does not align with either anticipated enrolment or budget constraints.

Refinement. Reducing the number of laboratory tests performed cut costs from $1500 to $1000 per animal. The revised sample size is now $120,000 / ($1000 per cat) = 120 animals. However, this still does not align with the power-based sample size estimates in the original proposal. The proposal was therefore revised in the direction of more modest research goals that could be accommodated by the budget. The investigator prioritised the marker of interest plus the top four candidate markers, with priority based on clinical relevance and importance. The reduced subset of five candidate predictors would allow reasonably precise estimates to be obtained with an anticipated enrolment of 100–120 animals, and remain within budget.
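A quick check of the budget arithmetic in the example above: maximum animals processed is simply the operating budget divided by the per-animal cost, compared against the power-based recruitment target.

```python
# Budget feasibility arithmetic from the example.
budget = 120_000

for cost_per_animal in (1500, 1000):      # original and refined work-up costs
    max_animals = budget // cost_per_animal
    print(f"${cost_per_animal}/animal -> at most {max_animals} animals")
# 80 and 120 animals respectively - well short of the 260-300 subjects the
# full 12-predictor model required, hence the revised, smaller model.
```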
7.3.3 High Dimensionality Studies

These studies typically involve the harvest and processing of cells or tissues from multiple animals. Typically, large volumes of output information are produced per subject. Examples include DNA/RNA microarrays, biochemistry assays, biomarker studies, and gene expression, proteomic, metabolomic, and inflammasome profiles.

For identification of differentially expressed genes, sample size refers to either the number of arrays or the number of biological replicates, depending on the research question. For array-based studies, sample sizes are determined by the desired fold change to be detected, the number of replicate arrays, investigator-specified sensitivity (proportion of detected differentially expressed genes), the number of expected false positives, the correlation between expression levels of different genes, and the number of sampling time points (Jung and Young 2012; Lin et al. 2010).

The number of animals required is less easy to estimate. In general, the number of animals will be determined by the total amount of tissue or cells required to perform the assays (M) and the amount of viable tissue or cells obtainable per animal (m):

N = M / m

Example: Number of Mice Required for a Microarray Study

Suppose a cell culture requires M = 10^6 cells per plate to obtain a sufficient amount of RNA for analysis. However, only m = 2.5 × 10^5 cells of the relevant type can be isolated per mouse. The number of mice required is

N = 10^6 cells / (2.5 × 10^5 cells per mouse) = 4 mice per plate.

The evolution of microarray and RNA-seq technology means that very small amounts (for example, <~1 pg) of mRNA can be extracted from tissue, and even from single cells, for gene expression profiling (Amit et al. 2009; Shalik et al. 2013; Ye et al. 2018). Therefore very few animals, or only one, may be all that are necessary. Variability in array preparation and background determination will determine the number of technical replicates required. However, technical replicates affect only measurement precision; they do not contribute to reducing the variance of the overall effect.

If the research question involves mapping heterogeneity at the subject level, the study will have to be designed to accommodate the true experimental unit, which is the whole animal. If the number of biological replicates is too small, statistical power for detecting differentially expressed genes will be too low, and false positive rates will be high. For detecting differential gene expression between two groups (e.g. knockout versus wild-type mice; tumour versus non-tumour tissue), sample size is determined by power calculations. Sample size will depend on the fold change to be detected, the power to detect that change (the true positive rate, or power 1 − β), the confidence α (Type I error probability), and the variance in the sample. The variance can be estimated from the 75th percentile of the standard deviation of the log ratio of expression levels (the variance for the 75% least variable genes in the array). Sample size can then be estimated as usual for a two-sample t-test, or iteratively using the formula for power and the non-centrality parameter (Wei et al. 2004).
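A minimal sketch of the two-sample calculation described above, using the normal approximation to the t-test: the per-group number of biological replicates needed to detect a given fold change on the log2 scale. The values of sigma (taken here as the 75th-percentile SD of the log expression ratios), fold change, α, and power are assumed placeholders, not figures from the text.

```python
# Per-group biological replicates for detecting a fold change (normal approx.).
import math
from scipy.stats import norm

def n_per_group(fold_change=2.0, sigma=0.7, alpha=0.001, power=0.90):
    delta = math.log2(fold_change)                  # effect size on the log2 scale
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)   # small alpha: many genes tested
    return math.ceil(2 * (z * sigma / delta) ** 2)

print(n_per_group())   # e.g. ~21 biological replicates per group under these inputs
```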
7.3.4 Training, Teaching, Skill Acquisition

Animal numbers for training and skill acquisition are determined primarily by the number of repetitions of the task required to meet predetermined competency standards. For teaching purposes, the trainer:trainee ratio, the number of assessors, and the availability of teaching-related resources must be factored into estimates (Box 7.5).

BOX 7.5
▪ Number of sessions required for competency
▪ Resource availability
▪ Number of trainees
▪ Number of trainees per session
▪ Number of instructors, assessors
▪ Time available per session
▪ Assessor availability

Competency in essential clinical skills and surgical techniques is of major importance for preclinical laboratory experiments and for veterinary and human medicine. On-the-job training is usually not sufficient to acquire competency (Bergmeister et al. 2020). High-quality and consistent skill sets are developed and maintained only with structured training and sufficient practice. Poor technique and inconsistent, unstandardised procedures will contribute to considerable variability in experiments, potentially hiding any effect of the experimental intervention.

Use of live animals for surgical skills training is still part of many medical and veterinary training programmes (DeMasi et al. 2016). Use of live animals obtained for purpose should be considered a last resort, and only after critical skillsets have been developed on non-animal and/or ex vivo models. Computer simulations, virtual models, and simulators are rapidly replacing animals. Carcasses and culls can be used for skills acquisition (Baillie et al. 2016). Ex vivo models can be as effective as animals obtained for purpose, as well as considerably cheaper (e.g. Izawa et al. 2016). Additional refinement measures include supplementary educational materials (lectures, texts, webinars, etc.) for orientation to equipment, techniques, and procedures. Detailed standard operating procedures and training plans should be devised. Instructional plans should describe how specific critical skills and skill sets will be taught before live animals are used. Specific and measurable proficiency metrics and competency assessments must be identified. Training must be species- and procedure-specific. Strategies for the development of animal-based training programs are described by Conarello and Shepherd (2007).

Example: Surgical Training Laboratory

A teaching lab requested swine to be used for surgical training of residents in multiple invasive procedures. Students were to have received several weeks of intensive training on basic skill acquisition on simulators and excised organs prior to this lab. There are 16 trainees. Eight trainees can be accommodated per training session. One animal can be allocated to every two trainees. Four instructors are available. How many animals are required?

Calculations. Number of sessions = 16 trainees / (8 trainees/session) = 2 sessions. With one animal per two trainees, each session requires 4 animals. Total number of animals = 4 animals/session × 2 sessions = 8 animals.
▪ Number of instructors, assessors 7.3.5 Rodent Breeding Production
▪ Time available per session
▪ Assessor availability. Breeding protocols are required for in-house pro-
duction of research animals that are too expensive
7.3.5 Rodent Breeding Production

Breeding protocols are required for in-house production of research animals that are too expensive to purchase in sufficient quantity or cannot be obtained commercially. Examples include the creation of new transgenic, knockout, or other genetically modified animals; back-crossing of genetically modified lines; or production of prenatal or early-neonate subjects.

The projected total pup production and the anticipated number of pups of a specific genotype depend on the number of breeding adults, litter size, the number of litters over the productivity lifespan, and weaning success (Box 7.6). Reasonable predictions of pup production can be based on breeding colony records, current breeding stock numbers, facility and personnel capability, and projections from past demand. For genetic analyses involving mice, the number of individuals and breeding pairs per line can be approximated by simple rules of thumb (Table 7.1). The number of animals subjected to embryonic or foetal manipulations must be included in the total number of animals requested for a given study.

Table 7.1: Rules of Thumb for Estimating Mouse Numbers Required for Genetic Analyses.

Purpose | Number of breeding pairs | Number required per line
Maintenance and characterisation of transgenic or knockout line | up to 5 | 80–100
Strain construction with congenic genotyping | 10–12 | 750–1200
Quantitative trait loci analysis | 4–6 pairs inbred parental strains; 2–4 reciprocal F1 hybrid pairs | 500–1000 F2
Gene mapping | 10–12 | 1200

Source: Adapted from Pitts (2002).

Scientific justification must be provided for breeding protocols and for plans for the disposition of unused animals. When target genotypes are of research interest, potentially large numbers of animals will be euthanised because they are of the unwanted genotype, or because they are produced in such quantities that they age out or otherwise cannot be used in the experimental protocol. Justifications for breeding protocols must include end-use disposition plans to minimise unacceptably large collateral losses (Reynolds 2021).

BOX 7.6
Rodent Breeding Production

How many pups can be produced (N)?
How many breeding adults are required to produce N pups?
How many pups are born?
How many pups are successfully weaned?

Information required:
- Initial number of breeders
- Sex ratio (number of females per breeding male)
- Litter size (number of pups/litter/female)
- Estimated total number of pups produced
- Proportion of pups lost (attrition)
- Expected genotype distribution
- Proportion of desired genotype required
- Time frame (pups produced per unit time).

Example: Estimating Number of Mice With a Desired Genotype

From preliminary power calculations, an investigator determined that 50 homozygous knockout (KO −/−) and 50 homozygous wild-type (WT +/+) mouse pups of a certain strain are required to study a specific disease. Only KO and WT mice were to be used for experiments.

Ten pairs of heterozygous breeder mice (HET +/−) were available: 10 males and 10 females. From past breeding records and vendor specifications, the expected average litter size for this strain was approximately 5 pups, and each pair was expected to produce 6 litters over the 6 months of the productivity lifespan. The genotype distribution was thought to be approximately Mendelian; that is, approximately 50% of pups will be HET, 25% KO, and 25% WT. However, perinatal losses of KO pups were 15–20%. How many litters will be required?
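The expected yield implied by these assumptions can be computed directly; the following SAS sketch simply multiplies out the example's stated numbers (it is not the book's worked answer).

/* Expected pup production for 10 HET x HET pairs:
   6 litters/pair x 5 pups/litter, Mendelian ratios,
   with 15-20% perinatal loss applied to KO pups */
data breeding;
  pairs = 10; litters_per_pair = 6; littersize = 5;
  pups  = pairs*litters_per_pair*littersize;  /* 300 pups expected   */
  ko    = 0.25*pups;                          /* 75 KO pups expected */
  wt    = 0.25*pups;                          /* 75 WT pups expected */
  ko_lo = ko*(1 - 0.20);                      /* 60 KO after 20% loss  */
  ko_hi = ko*(1 - 0.15);                      /* ~64 KO after 15% loss */
run;
proc print data=breeding; run;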
… score based on the standard normal distribution and pre-specified confidence level. For example, for a two-tailed test with 95% confidence, α = 0.05 and z = 1.96. The expected number of subjects with the condition (n) is obtained by multiplying the total sample size N by the prevalence p:

n = N × p

Although simple to perform, this method will give only poor approximations for most practical purposes (Newcombe 1998). Bias-correction adjustment methods (such as the Agresti–Coull method) are described in Chapter 10.

8.3 Binomial (Exact) Distribution

The binomial distribution is the probability distribution of a binomial random variable. It is used to determine the number of successes s in N selections. It is applicable when the population is very large relative to the target sample and the N selections are independent; that is, the outcome on one selection does not affect the outcome on other selections, and the probability of success p is the same on every selection. The binomial probability is the probability that a binomial 'experiment' results in exactly s successes. In practice, the cumulative binomial probability will be applicable to most studies. This is the probability that a binomial experiment results in s successes occurring within a specified range (either greater than or equal to a given lower limit, or less than or equal to a given upper limit).

The binomial distribution is given by

$$\Pr(X = s) = \binom{N}{s} p^s (1-p)^{N-s}$$

where N is the total sample size (number of 'trials' or selections), s is the number of subjects with the condition ('successes'), p is the expected proportion of successes, and (1 − p) is the expected proportion without the condition or event ('failures'). The probability of success p is the same on every trial. Because the binomial variable is the sum of N independent binomial variables, the mean is μ = N p, and the variance is σ² = N p(1 − p) = μ(1 − p).

Sample size is calculated using the binomial exact method:

1. Specify the range of candidate sample sizes n.
2. Specify the form of the cumulative binomial probability. For example, if the target value for the expected number of successes is at least 10 in a sample of N subjects, then Pr(X ≥ 10).
3. Compute the probability y for each n using the cumulative distribution function and/or the probability density function of the binomial distribution (Appendix 8.A).
4. Select the sample size that equals or exceeds the target confidence 1 − α.

Example: Genetically Linked Bleeding in Doberman Pinschers

Investigators planned a prospective study to determine differences in coagulation profiles between healthy Doberman Pinschers and those that are autosomally recessive for the mutation for von Willebrand disease (vWD), and therefore at high risk for severe bleeding. From the literature, they estimated a 25% prevalence of homozygous affected Dobermans. The investigators needed to estimate the number of dogs that would have to be screened (N) and the number of homozygous affected dogs n they could expect to obtain in that sample, with a probability of 95% and a precision of 5%.

Normal approximation. For a probability of 95%, α = 0.05 and z(1−α/2) = 1.96, prevalence p = 0.25, and precision δ = 0.05. The total number of animals to be screened is

$$N \geq \frac{1.96^2 \times 0.25\,(1-0.25)}{0.05^2} = 288.1 \approx 290$$

and the expected number of homozygous affected is n = N × p ≈ 290 × 0.25 ≈ 73.

When the Agresti–Coull adjustment (Agresti and Coull 1998) is applied (Chapter 11):

$$\tilde{p} = \frac{25+2}{100+4} = 0.2596$$

$$N \geq \frac{1.96^2 \times 0.2596\,(1-0.2596)}{0.05^2} - 4 = 291.4 \approx 292$$

and the expected number of homozygous affected is n = 76.
With the asymptotic normal approximation, approximately 290–292 dogs need to be screened to have a 95% probability of capturing 73–76 homozygous affected dogs within the general Doberman population.

Exact binomial calculation. Investigators wished to determine the total number of dogs that would have to be screened to have 95% confidence of obtaining at least 75 homozygous affected dogs from a population with an expected prevalence of 25%. The cumulative binomial probability is Pr(X ≥ 75) = 0.95. Sample size is then approximated by iteration over a range of potential sample sizes to find the sample size closest to Pr(X ≥ 75) = 0.95. For this example, a potential sample size range of 10–400 was chosen.
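A minimal SAS sketch of this iterative search (not the original listing) is given below; it computes Pr(X ≥ 75) as 1 − Pr(X ≤ 74) for each candidate N and prints those meeting the 0.95 target.

/* Iterate over candidate sample sizes N = 10 to 400 for p = 0.25 */
data exact;
  p = 0.25;
  do N = 10 to 400;
    prob = 1 - cdf('BINOMIAL', 74, p, N);  /* Pr(X >= 75) */
    output;
  end;
run;
proc print data=exact;
  where prob >= 0.95;    /* candidate sample sizes meeting the target */
run;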
When no events have occurred ('Rule of Threes'). If there are no observed events in n trials, then x = 0 and the observed proportion p is 0/n. The approximation for the upper confidence limit is obtained from

$$\Pr(X = 0) = \binom{n}{0} p^0 (1-p)^{n-0}$$

which reduces to

$$(1-p)^n = \alpha$$

Setting α to 0.05 and solving for p, the upper bound of the 95% confidence limit for p is approximately 3/n (Louis 1981; Hanley and Lippman-Hand 1983; Jovanovic and Levy 1997). Alternatively, the exact upper confidence limit can be calculated from the cumulative probability distribution of the binomial distribution; solving the expression above for p gives the exact upper limit

$$p_U = 1 - \alpha^{1/n}$$
… the size of the population that needs to be tested. Batch testing can result in considerable savings per unit of information. It is especially useful if diseases are rare (Hou et al. 2017); if reagents and other resources are scarce; if the number of available tests is limited relative to the number of potential test subjects that require them (Litvak et al. 1994; Hughes-Oliver 2006); or if sample volumes are too small for processing singly and must be pooled over several subjects for analysis (Giles et al. 2021).

The concept of batch testing was first developed during World War II for detecting syphilis in US army conscripts (Dorfman 1943). It has been adapted for large-scale testing of diseases such as HIV and other sexually transmitted diseases (Hughes-Oliver 2006; Shipitsyna et al. 2007), bacteriological screening of livestock herds and poultry flocks (Arnold et al. 2005, 2009), screening for environmental contaminants such as lead, and assessment of large numbers of molecular targets for drug discovery (Hughes-Oliver 2006). The SARS-CoV-2 pandemic of 2020 renewed interest in further applications to the problems of an emergent disease pandemic (Hitt et al. 2020; Zhou and O'Leary 2021; FDA https://www.fda.gov/medical-devices/coronavirus-covid-19-and-medical-devices/pooled-sample-testing-and-screening-testing-covid-19). Batch testing is also used when individual animals are too small to allow collection of sufficient tissue or blood for single-sample analyses, such as surveillance testing of bat colonies (Giles et al. 2021).

In batch testing, samples from m subjects are pooled and tested for the presence of the disease (yes = positive; no = negative). The simplest batch-testing design tests samples in two stages. The first stage tests the pooled samples, where each individual sample is part of one batch. If a batch tests positive, each sample in that batch is retested individually in the second stage. The probability that a batch tests positive is

$$\Pr(B = 1) = 1 - (1-p)^m$$

where p is the disease prevalence and m is the batch size or number of samples in each batch. The 'most efficient' group size m is that which minimises E(T)/m, where E(T) is the expected number of tests per batch:

$$E(T) = 1 + m\,\Pr(B = 1)$$

As a first-pass approximation, m can be estimated by

$$m = 1/\sqrt{p}$$

with rounding up to the nearest integer. Then the expected number of tests needed is approximately

$$E(T) \approx 2N\sqrt{p}$$

for a population of N subjects (Finucan 1964).

For feasibility and planning purposes, it is useful to estimate the percentage reduction in the number of tests with batch testing compared to testing individual samples:

$$R_T = 100\left[1 - E(T)/m\right]$$

The expected percentage increase in testing capacity is approximated by:

$$I_{TC} = 100\left[\frac{1}{E(T)/m} - 1\right]$$

Exact binomial determination. The above formulations assume that both the sensitivity (the probability of detecting a true positive) and the specificity (the probability of detecting a true negative) of the tests are equal to 1.0. Exact binomial determinations that incorporate more realistic test sensitivity and specificity provide more rigorous estimates of m and E(T).
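A minimal SAS sketch of these first-pass approximations is given below. It assumes, as above, perfect test sensitivity and specificity; the prevalence values are illustrative.

/* First-pass batch-testing feasibility for a range of prevalences */
data batch;
  do p = 0.001, 0.01, 0.05, 0.10;
    m  = ceil(1/sqrt(p));      /* approximate optimal batch size   */
    PB = 1 - (1-p)**m;         /* probability a batch is positive  */
    ET = 1 + m*PB;             /* expected tests per batch         */
    RT = 100*(1 - ET/m);       /* % reduction vs individual tests  */
    output;
  end;
run;
proc print data=batch; run;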
An open-access Shiny app (Hitt et al. 2020) numerically iterates over a range of batch sizes to determine the optimal batch size m, E(T), E(T)/m, the percentage reduction in the expected number of tests with batch testing, and the expected increase in testing efficiency.

An analogous batch-size determination algorithm has been developed for microarray or microplate studies (Bilder et al. 2019). In microarray studies, samples are arranged in a two-dimensional grid. In the first stage of testing, samples are pooled by row and by column. In the second stage, only samples located at the intersections of positive rows and columns are retested. See Hitt et al. (2019, 2020) and https://bilder.shinyapps.io/PooledTesting/ for more details.
Example: Screening for SARS-CoV-2 (COVID-19)

Abdalhamid et al. (2020) report an evaluation of batch-testing strategies for SARS-CoV-2 in specimens collected by nasopharyngeal swabs (the 'gold standard' for SARS-CoV-2 diagnosis). They used the Shiny application for pooled testing (available at https://www.chrisbilder.com/shiny). For an expected disease prevalence of 5% (p = 0.05), test sensitivity of 0.95, and test specificity of 1.0, with testing in two stages, they obtained an optimal batch size m of 5, E(T) = 2.07, and an expected percentage reduction in the number of tests of RT = 100[1 − 0.41] = 59%. However, if prevalence is only 0.1% (p = 0.001), then m = 33, E(T) = 2.02, and RT = 94%.

Zhou and O'Leary (2021) report lower sensitivity (82–88%) for anterior nares or mid-turbinate (nasal) swabs. If sensitivity is assumed to be 0.85, with prevalence of 5% and test specificity of 1, then m = 6, E(T) = 2.35, and RT = 61%.

Suppose an investigator wishes to pilot a new drug therapy on five cats with primary brain tumours. The rarity of the condition means that the total number of subjects to be screened is potentially very large. The investigator will need to know how many potentially eligible subjects must be screened until five subjects are obtained, at which point screening will stop. The number N of repeated trials required to produce s successes is a negative binomial random variable. It is assumed that the trials are independent; that is, the outcome of one trial does not affect the outcome of other trials.

The negative binomial distribution is the probability distribution of a negative binomial random variable. It describes the probability that N = s + f trials will be required to obtain s successes. Because sampling continues until the predetermined number of successes s is observed, there will be many failures in the sequence (f subjects will be selected that do not have the condition). When the pre-specified number of successes has been achieved, sampling stops, so s defines the stopping criterion. The exact probability is given by:

$$\Pr(X = f) = \binom{f+s-1}{s-1} p^s (1-p)^f$$

where the number of successes s is fixed and the number of failures is f. Because N is the number of 'trials' or draws, f = N − s. The expected proportion of 'successes' is p, and (1 − p) is the expected proportion of 'failures'. The quantity to be determined is then the probability of f failures given s successes.

The geometric distribution is the special case of the negative binomial distribution in which sampling stops at the first subject with the target condition: s = 1.
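A minimal SAS sketch of the stopping calculation for the cats scenario is given below. The prevalence p here is a hypothetical value, since none is stated in the example.

/* Smallest total N = s + f such that Pr(at most f failures) >= 0.95 */
data negbin;
  s = 5;  p = 0.01;                     /* p is a hypothetical prevalence */
  do f = 0 to 5000;
    prob = cdf('NEGBINOMIAL', f, p, s); /* Pr(X <= f failures)            */
    if prob >= 0.95 then leave;
  end;
  N = s + f;                            /* total screens at stopping      */
run;
proc print data=negbin; run;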
The hypergeometric distribution is used to determine a sample size when the total population is small (Box 8.4). Subsequent draws without replacement of each item during the selection process reduce the total population size. As a result, the probability of selecting a subject with the condition changes with each draw.

BOX 8.4
Examples of Small Defined Target Populations for Hypergeometric Sampling Problems

- Rosters listing all clients visiting a clinic within a given time period.
- All individuals listed in a breed or disease registry.
- All animals surveyed in a defined location.
- Pollinators visiting flowers in a given location.
- All blood samples in a given batch.

Example: MYBPC3 Mutation in Maine Coon Cats

Approximately 30% of Maine Coon cats have an MYBPC3 gene mutation associated with hypertrophic cardiomyopathy (HCM). There were 320 cats listed in the Maine Coon cat registry. An investigator wished to determine how many cats would need to be screened to have 95% probability that at least five cats in the sample would have the mutation.

The total number of subjects in the registry population is N = 320. The prevalence of the condition is approximately 30%; therefore R = 0.3 N = 96. The number of mutation-positive cats required is at least 5, so Pr(x ≥ 5). SAS code for the computations is given in Appendix 8.B. Results are shown in Figure 8.1. A random selection of 28 cats from the registry list must be screened to meet this target.
Figure 8.1: Probability of obtaining at least five mutation-positive cats, as a function of sample size n.
1. Calculate the z-score from values for the target change d, the mean x̄, the sample standard error SE, and sample size n:

$$z = \frac{d - \bar{x}}{SE\sqrt{n}}$$

2. Convert z to the probability determined by the pdf of the standard normal distribution:

$$\Pr = \phi(z)$$

8.A Determining Cumulative Probabilities for Binomial, Negative Binomial, and Hypergeometric Distributions with SAS

Calculation of probabilities requires the specification of the cumulative distribution function (cdf) and probability density function (pdf). The cumulative distribution function (cdf) gives the probability that a random variable is less than or equal to a given value.
The negative binomial distribution is applicable for estimating the total sample size N necessary to obtain a pre-specified number of subjects with the condition. The 'experiment' continues until a predetermined number of successes s is observed, and s defines the stopping criterion (there is a 'success' on the last trial). The total number of samples required to obtain s successes is N = s + f trials. In the SAS syntax, s is the number of successes, f is the number of failures (= N − s), and p is the probability of success:

Pr(X = f):   y = PDF('NEGBINOMIAL', f, p, s);
Pr(X ≤ f):   y = CDF('NEGBINOMIAL', f, p, s);

For the hypergeometric distribution (the Maine Coon example; Appendix 8.B), Pr(X ≥ 5) can be computed over a range of candidate sample sizes n and plotted:

data prob;
do n = 1 to 50;    /* candidate sample sizes */
  Y = PDF('HYPER', 5, 320, 96, n) + 1 - CDF('HYPER', 5, 320, 96, n);  /* Pr(X >= 5) */
  output;
end;
run;
proc print;
run;
/* plot change in probability with sample size n */
proc sgplot data=prob;
  series x=n y=y;
  xaxis label="Sample size" grid values=(0 to 50 by 1);
  yaxis label="Probability" grid values=(0 to 1 by 0.1);
  refline 0.9 0.95 0.99;
run;
9
Descriptions and Summaries
… about the mean for the whole population. The uncertainty in the estimates is quantified by confidence intervals. Several other types of intervals (prediction, tolerance, or reference) can be computed, depending on study objectives.

The best-practice reporting standard is to present descriptive statistics in at least two summary tables. The so-called 'Table 1' reports study animal characteristics ('Who was studied?'). 'Table 2' summarises the results ('What was found?'). Total sample size, sample sizes per group, and numbers lost through attrition must always be reported. Summary descriptive data and meticulous sample size accounting enable the assessment of internal and external validity and are required for systematic reviews and meta-analyses (Macleod et al. 2015). Reporting of both sets of descriptive and summary statistics is recommended by all international consensus reporting guidelines, including ARRIVE 2.0 for animal-based research (Percie du Sert et al. 2020a, 2020b) and STROBE-VET for observational clinical studies (O'Connor et al. 2016; Sargeant et al. 2016).

9.2 Describing Sample Data

'Table 1' presents summary data for basic descriptors: animal signalment (source, species, strain, age, sex, body weight) and baseline (or pre-intervention) traits (e.g. baseline biochemistry, haematology, haemodynamics, scores, behaviour metrics) (Box 9.2). Table 1 information allows assessment of the representativeness of the sample and any potential biases (Roberts and Torgerson 1999). The more representative the study animals are of the target population, the more likely the results will be generalisable. The descriptive statistics in these tables provide information about the animals in the study sample and are analogous to the patient demographic data reported for human clinical trials. Information for variables recorded at baseline is important for providing context for subsequent interpretation of results. Effects of the test intervention and potential benefit are assessed by the magnitude of post-intervention changes compared to baseline (Fishel et al. 2007). Between-study differences in animal characteristics and baseline or 'starting' conditions may also contribute to substantial differences in results (Mahajan et al. 2016). Standard deviations, not standard errors, should be used to describe variability in the sample; the standard error describes the sampling distribution of the mean (Altman and Bland 2005). Performing significance tests for differences on Table 1 characteristics is usually illogical and not recommended (Altman 1985; Senn 1994; Roberts and Torgerson 1999; de Boer et al. 2015; Pijls 2022).

BOX 9.2
Describing Descriptive Data: 'Who Was Studied?'

'Table 1' information:
What: Summary of major signalment, clinical, and pre-intervention characteristics for each study group.
Descriptors: Sample statistics: sample size per group n, mean (SD), median (interquartile range), and counts (per cent).

Example: Sample 'Table 1' Data for a Two-Arm Crossover Study of Stress in Dogs

Table 9.1 shows group sample sizes and descriptive statistics for signalment, baseline heart rate, and baseline fear, anxiety, and stress (FAS) scores for dogs in a crossover study of clinical exam location effects.

Table 9.1: Example of Descriptive Statistics Reporting for Animal Signalment and Baseline Characteristics.

Variable | Group 1 | Group 2
Number of dogs, n | 24 | 20
Sex, n males (%) | 12 (50%) | 12 (70%)
Age, years; median (IQR; minimum, maximum) | 5.0 (2.5, 9.0; 1, 15) | 5.7 (4.7, 5.0; 2, 11)
Body weight, kg; median (IQR; minimum, maximum) | 19 (9, 26; 3, 65) | 22 (10, 30; 3, 39)
Heart rate, bpm; mean (SD) | 103 (21) | 107 (25)
Baseline FAS scores, number of dogs: 0–1 | 9 | 11
Baseline FAS scores, number of dogs: 2–3 | 7 | 7
Baseline FAS scores, number of dogs: >3 | 2 | 3

Source: Data from Mandese et al. (2021).
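A minimal SAS sketch for generating 'Table 1'-style summaries is given below; the dataset DOGS and its variables are hypothetical, not the Mandese et al. (2021) data.

/* Descriptive statistics by group for a hypothetical dataset DOGS
   with variables GROUP, AGE, WEIGHT, and HR */
proc means data=dogs n mean std median q1 q3 min max;
  class group;
  var age weight hr;
run;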
9.3 Describing Results

'Table 2' information consists of the summary data for results per group and/or differences between intervention and control. These are reported as effect sizes (such as mean differences) and confidence (or other) intervals (Box 9.3). Intervals measure the uncertainty around the point estimate and are the best estimate of how far the estimate is likely to deviate from the true value (Altman and Bland 2014a). Intervals should be used to present the results of statistical inference so that the practical or biological importance of differences can be properly evaluated (Gardner and Altman 1986; Cumming 2012; Altman and Bland 2014a, b).

Table 9.2: Example of Descriptive Statistics Reporting for Results of Hypothesis Tests.

Variable | Adjusted mean difference (95% confidence interval) | P-value
Sample size, n breeding pairs | 59 |
Number of extra pups born per pair, tunnel versus tail-lift | +1.04 (0.95, 1.14) | 0.41
Number of pups weaned per pair, tunnel versus tail-lift | +1.07 (0.94, 1.20) | 0.33
BOX 9.3
Describing Descriptive Data: 'What Was Found?'

'Table 2' information:
What: Summary of major results for each study group, and results of hypothesis tests for group differences in primary and secondary outcomes; population-based measures of precision.
Descriptors: Sample size per group n, means, confidence intervals (or other relevant interval measure).

Example: Sample 'Table 2' Results from a Two-Arm Randomised Controlled Study of Mouse Productivity

Table 9.2 provides descriptive statistics expressed as adjusted means and 95% confidence limits for differences in productivity for two methods of mouse handling. Means were adjusted for parental and parity effects. The differences in favour of tunnel handling were operationally important but not statistically significant.

9.4 Confidence and Other Intervals

There are four main types of intervals: confidence, prediction, tolerance, and reference (Box 9.4). Choice of interval will depend on study objectives. Confidence intervals are related to the use of significance tests but provide more useful, actionable, and interpretable information on the size, direction, and uncertainty associated with the observed effect than a P-value (Gardner and Altman 1986; Steidl et al. 1997; Cumming 2012; Rothman and Greenland 2018).

The interval is an estimate based on the sampling distribution of the statistic. The general form of the interval is:

sample point estimate ± width(SD, n, α)

The width of the interval is therefore determined by the variation in the sample (the standard deviation SD), the inverse square root of the sample size (1/√N), and the pre-specified significance level α.
9.5 Relationship Between Interval Width, Power, and Significance

Intervals are related to statistical tests of significance but provide more practical information about the true size of the specified difference or effect. Figure 9.1 shows the relationship between interval width, statistical significance, and power (Clayton and Hills 1993). The power of the study is the probability that the lower confidence limit is greater than the value specified by the null hypothesis, the hypothesis of no difference between means. The probability of type I error is specified by the significance threshold α: the probability of obtaining a false positive, or incorrectly rejecting the null hypothesis when it is true. The confidence interval (1 − α) is a range that is expected to contain the population parameter with a specified probability.

For sufficient power to detect a specified effect, the lower limit of the confidence interval must exceed the upper limit for the null hypothesis value. The bounds for the null value are defined by the range of d standard deviations from the null value. The confidence interval covers a range of c standard deviations on either side of the mean, so the limits of the confidence interval are defined by ±c·SD.

Figure 9.1: Relationship between the confidence interval around the expected effect (±c·SD) and power (d·SD above the value expected under the null hypothesis).

The values of c and d are obtained from the standard normal z-scores for the desired level of significance and power, respectively. The z-score is the number of standard deviations between an observation and the mean:

$$z = \frac{\text{observed value} - \text{mean}}{SD} = \frac{x - \mu}{\sigma}$$

The z-score is therefore a standardised normal value for the observation. The maximum possible z-score is (n − 1)/√n. Probability and z-scores are obtained from the standard normal distribution (Box 9.5). For the normal distribution, the probability for the value in that interval is the area under the normal curve. For example, a 95% confidence interval is defined by the lower and upper confidence limits zl and zu. The value of zl is −1.96 and that of zu is 1.96. The area under the curve to the left of zl is 0.025, and the area to the right of zu is 0.025. The area under the curve between the two limits is 0.95. For power of 90%, z = d = 1.282.

The difference between the null value and the lower limit of the observed value is expressed as a multiple of the number of standard deviations. To be statistically significant, the effect must exceed the distance (c + d)·SD, so that the lower limit of the confidence interval (effect − c·SD) is larger than the upper limit for power (null value + d·SD). Because confidence and power are pre-specified, sample size is chosen so that SD ≤ effect/(c + d); with 95% confidence and 80–90% power, (c + d) is approximately three. That is, the effect to be detected must be equal to or greater than three standard deviations (Clayton and Hills 1993; Table 9.4).

BOX 9.5
Calculating z-Scores for Confidence and Power (SAS Code)
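A minimal sketch of the Box 9.5 computation, assuming SAS's QUANTILE function for the standard normal quantiles (the dataset and variable names are illustrative), is:

*significance alpha;
*0.1 for 90% power, 0.2 for 80% power;
data zvalues;
  alpha = 0.05;
  beta  = 0.1;
  z_confidence = quantile('NORMAL', 1 - alpha/2);  /* c = 1.96  */
  z_power      = quantile('NORMAL', 1 - beta);     /* d = 1.282 */
run;
proc print data=zvalues; run;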
Table 9.4: Confidence, Power, and Corresponding z-Values for a Standard Normal Distribution.

α | Confidence 100(1 − α)% | z1−α/2 | β | Power 100(1 − β)% | z1−β
0.01 | 99% | 2.576 | 0.05 | 95% | 1.645
0.05 | 95% | 1.960 | 0.10 | 90% | 1.282
0.10 | 90% | 1.645 | 0.20 | 80% | 0.842
Interval plots with different confidence levels are especially useful for evaluating empirical and translational pilot studies (Chapter 6). Pilot studies are small by design and therefore underpowered to detect any but very large effect sizes. As a result, non-significant p-values are often interpreted as evidence of 'no effect', and potentially meaningful results are rejected (Altman and Bland 1995; Lee et al. 2014). However, the data may still provide sufficient evidence of promise (always assuming the pilot study was designed with sufficient internal validity), although the study may be too small to statistically detect an effect size with a specified power. Lee et al. (2014) recommend that the strength of preliminary evidence can be assessed by comparing the effect relative to the null with confidence intervals of different widths (e.g. 95%, 90%, 80%, 75%).
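A minimal SAS sketch of this multiple-width comparison is given below; the pilot summary values are illustrative, not data from the text.

/* Confidence intervals of several widths for a pilot-study mean */
data pilot_ci;
  xbar = 5.5;  s = 6.2;  n = 10;      /* hypothetical pilot summary */
  do conf = 0.95, 0.90, 0.80, 0.75;
    t  = quantile('T', 1 - (1-conf)/2, n-1);
    lo = xbar - t*s/sqrt(n);
    hi = xbar + t*s/sqrt(n);
    output;
  end;
run;
proc print data=pilot_ci; run;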
References

Altman, D.G. (1985). Comparability of randomised groups. The Statistician 34: 125–136. https://doi.org/10.2307/2987510.
Altman, D.G. and Bland, J.M. (1995). Absence of evidence is not evidence of absence. BMJ 311 (7003): 485. https://doi.org/10.1136/bmj.311.7003.485.
Altman, D.G. and Bland, J.M. (2005). Standard deviations and standard errors. BMJ 331: 903. https://doi.org/10.1136/bmj.331.7521.903.
Altman, D.G. and Bland, J.M. (2014a). Uncertainty and sampling error. BMJ 349: g7064. https://doi.org/10.1136/bmj.g7064.
Altman, D.G. and Bland, J.M. (2014b). Uncertainty beyond sampling error. BMJ 349: g7065. https://doi.org/10.1136/bmj.g7065.
Bland, J.M. and Altman, D.G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1: 307–310.
Bland, J.M. and Altman, D.G. (2003). Applying the right statistics: analyses of measurement studies. Ultrasound in Obstetrics & Gynecology 22 (1): 85–93. https://doi.org/10.1002/uog.122.
Clayton, D. and Hills, M. (1993). Statistical Models in Epidemiology. Oxford: Oxford University Press.
Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.
de Boer, M.R., Waterlander, W.E., Kuijper, L.D. et al. (2015). Testing for baseline differences in randomized controlled trials: an unhealthy research behavior that is hard to eradicate. International Journal of Behavioral Nutrition and Physical Activity 12: 4. https://doi.org/10.1186/s12966-015-0162-z.
Fishel, S.R., Muth, E.R., and Hoover, A.W. (2007). Establishing appropriate physiological baseline procedures for real-time physiological measurement. Journal of Cognitive Engineering and Decision Making 1: 286–308.
Gardner, M.J. and Altman, D.G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal 292 (6522): 746–750. https://doi.org/10.1136/bmj.292.6522.746.
Gerring, J. (2012). Mere description. British Journal of Political Science 42 (4): 721–746. https://www.jstor.org/stable/23274165.
Hull, M.A., Reynolds, P.S., and Nunamaker, E.A. (2022). Effects of non-aversive versus tail-lift handling on breeding productivity in a C57BL/6J mouse colony. PLoS ONE 17 (1): e0263192. https://doi.org/10.1371/journal.pone.0263192.
Jennen-Steinmetz, C. and Wellek, S. (2005). A new approach to sample size calculation for reference interval studies. Statistics in Medicine 24 (20): 3199–3212. https://doi.org/10.1002/sim.2177.
Lee, E.C., Whitehead, A.L., Jacques, R.M., and Julious, S.A. (2014). The statistical interpretation of pilot trials: should significance thresholds be reconsidered? BMC Medical Research Methodology 14: 41. https://doi.org/10.1186/1471-2288-14-41.
Macleod, M.R., Lawson McLean, A., Kyriakopoulou, A. et al. (2015). Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biology 13 (11): e1002301.
Mahajan, V.S., Demissie, E., Mattoo, H. et al. (2016). Striking immune phenotypes in gene-targeted mice are driven by a copy-number variant originating from a commercially available C57BL/6 strain. Cell Reports 15 (9): 1901–1909. https://doi.org/10.1016/j.celrep.2016.04.080.
Mandese, W.W., Griffin, F.C., Reynolds, P.S. et al. (2021). Stress in client-owned dogs related to clinical exam location: a randomised crossover trial. Journal of Small Animal Practice 62 (2): 82–88. https://doi.org/10.1111/jsap.13248.
Meeker, W.Q., Hahn, G.J., and Escobar, L.A. (2017). Statistical Intervals: A Guide for Practitioners and Researchers, 2e. New York: Wiley.
O'Connor, A.M., Sargeant, J.M., Dohoo, I.R. et al. (2016). Explanation and elaboration document for the STROBE-Vet statement: Strengthening the Reporting of Observational Studies in Epidemiology – Veterinary extension. Journal of Veterinary Internal Medicine 30 (6): 1896–1928. https://doi.org/10.1111/jvim.14592.
Percie du Sert, N., Hurst, V., Ahluwalia, A. et al. (2020a). The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biology 18 (7): e3000410. https://doi.org/10.1371/journal.pbio.3000410.
Percie du Sert, N., Ahluwalia, A., Alam, S. et al. (2020b). Reporting animal research: explanation and elaboration for the ARRIVE guidelines 2.0. PLoS Biology 18 (7): e3000411. https://doi.org/10.1371/journal.pbio.3000411.
Pijls, B.G. (2022). The Table I fallacy: P values in baseline tables of randomized controlled trials. Journal of Bone and Joint Surgery 104 (16): e71. https://doi.org/10.2106/JBJS.21.01166.
Roberts, C. and Torgerson, D.J. (1999). Understanding controlled trials: baseline imbalance in randomised controlled trials. BMJ 319 (7203): 185. https://doi.org/10.1136/bmj.319.7203.185.
Rothman, K.J. and Greenland, S. (2018). Planning study size based on precision rather than power. Epidemiology 29: 599–603.
Sargeant, J.M., O'Connor, A.M., Dohoo, I.R. et al. (2016). Methods and processes of developing the Strengthening the Reporting of Observational Studies in Epidemiology – Veterinary (STROBE-Vet) statement. Preventive Veterinary Medicine 134: 188–196. https://doi.org/10.1016/j.prevetmed.2016.09.005.
Senn, S. (1994). Testing for baseline balance in clinical trials. Statistics in Medicine 13 (17): 1715–1726. https://doi.org/10.1002/sim.4780131703.
Steidl, R.J., Hayes, J.P., and Schauber, E. (1997). Statistical power analysis in wildlife research. Journal of Wildlife Management 61 (2): 270–279.
10
Confidence Intervals and Precision
It can also be expressed as a proportional measure of the deviation from the mean. For example, if the desired confidence interval width is 10% of the mean (or 0.1), then the precision d will be ±(0.1/2), or ±5% (±0.05).

2. Calculate the standard error of the estimate. The formula for the standard error (SE) depends on the type of variable and the associated distribution. For example, the SE for the mean of a continuous normal variable Ȳ is s/√n; the standard error for a proportion is SE(p) = √(p(1 − p)/n).

3. Specify confidence. The most common α is 0.05 (5%), corresponding to a 95% confidence interval. Other commonly used settings for α are 0.01 (1%; 99% confidence) and 0.1 (10%; 90% confidence). A one-sided confidence interval is constructed by using α rather than α/2 in the expression for the lower or upper limit. For most practical purposes, and for ease of calculation, the z-score is used; for example, for a two-tailed 95% confidence interval, z(1−α/2) = 1.96. If sample size is small, the t statistic can be substituted; however, sample size approximations will then require iteration over a range of sample sizes to determine the value closest to the target power for a given effect size.

4. Divide the confidence interval by the square of the precision, d².

10.3.1 Absolute Versus Relative Precision

'Precision' can be either absolute or relative (Box 10.2). Research objectives will determine which measure of precision to use (Lwanga and Lemeshow 1991).

Absolute precision is expressed in the same units as the estimate and refers to the actual uncertainty in the metric itself. Absolute precision is used when the goal is to estimate the population parameter to within d percentage points of the true value (estimate ± d), where a percentage point is the unit for the arithmetic difference of two percentages. For example, if the prevalence of a given disease is 25% with a precision of 10%, the absolute precision of the estimate is 10%.

Relative precision is used when the goal is to estimate the population parameter to within a given percentage of the value itself. For example, if prevalence is 25% ± 10%, the relative uncertainty is 10% of 25%, or 2.5%. Relative precision scales the desired amount of precision to the parameter estimate; that is, the variation in the sample is a fraction or proportion of the true value of the parameter: estimate ± d × estimate (Lwanga and Lemeshow 1991).

10.4 Continuous (Normal) Outcome Data

The confidence interval is computed from the product of the standard error for the estimate and the critical value of the test statistic. For a mean Ȳ with standard error SE(Ȳ) = s/√n, the confidence interval for the mean is

$$\bar{Y} \pm z_{1-\alpha/2}\; s/\sqrt{n}$$

Re-arranging, sample size is obtained from n = z²(1−α/2) s²; for absolute precision, this term is divided by the square of the precision:

$$n = \frac{z^2_{1-\alpha/2}\, s^2}{d^2}$$
If the investigator wished to determine T with a precision of 1 °C, then d = 0.5 °C and

$$N = \frac{1.96^2 \times 6.3}{0.5^2} \approx 97$$

… where F(α, p, n−p) is the critical F value. Appendix 10.A provides sample SAS code for calculating simultaneous confidence bands and generating profile plots.

… precision can produce impossibly wide confidence intervals for small values or when events are rare. If sample size is based on relative precision, sample size increases as prevalence decreases (Figure 10.2). Relative precision can estimate the interval without a lot of bias, but may require substantial and often impossibly large sample sizes. Sample size can be reduced by increasing α and, more drastically, by increasing d; however, these make estimates less precise.

Figure 10.2: Sample size requirements for relative precision. Sample size N declines with increasing proportion p (x-axis: proportion p, 0.0–1.0).

Example: Planning Trial Recruitment Numbers: von Willebrand Disease in Dobermans

Investigators planned a prospective observational study to determine differences in coagulation profiles between healthy Doberman Pinschers and those autosomally recessive for the mutation for von Willebrand disease (vWD) and, therefore, at high risk for associated bleeding. They determined from the literature that they could expect 25% of Dobermans to be homozygous affected (pHOM).

How many dogs need to be recruited and screened (N) to ensure the proportion of homozygous affected (nHOM) dogs in the sample is within five percentage points of the true value with 95% confidence?

This is an absolute precision problem. The absolute precision range is ±5% of the population prevalence, or 20–30%. The expected number needed to screen (N) is

$$N \geq \frac{1.96^2 \times 0.25\,(1-0.25)}{0.05^2} = 288.12 \approx 289$$

The expected number of homozygous affected, nHOM = N × pHOM, is therefore

$$n_{HOM} \approx 289 \times 0.25 \approx 72$$

Therefore, the investigators need to recruit approximately 300 subjects to capture at least 72 homozygous affected dogs within the general Doberman population.

How many dogs must be sampled to capture the expected proportion of affected dogs within 10% of the true population prevalence with 95% confidence?

This is the relative precision question. The relative precision range is 10% of the true prevalence of 25% (0.1 × 0.25), or 22.5–27.5%:

$$N = z^2_{1-\alpha/2}\,\frac{1-p}{d^2\,p} = 1.96^2 \times \frac{0.75}{0.1^2 \times 0.25} = 1152.48 \approx 1153$$

Example: Estimating Prevalence of Feline Calicivirus (FCV) in Shelter Cats

A researcher wished to estimate the prevalence of feline calicivirus (FCV) in shelter cats. Prior data established that true prevalence was unlikely to exceed 30%.

How many cats need to be sampled to estimate the prevalence of FCV within 5 percentage points of the true value with 95% confidence? This is an absolute precision problem, so:

$$N = z^2_{1-\alpha/2}\,\frac{p(1-p)}{d^2} = 1.96^2 \times \frac{0.3 \times 0.7}{0.05^2} = 322.7 \approx 323 \text{ cats}$$

How many cats need to be sampled if the estimate of FCV prevalence is to be within 10% of the true value with 95% confidence? This is a relative precision problem:

$$N = z^2_{1-\alpha/2}\,\frac{1-p}{d^2\,p} = 1.96^2 \times \frac{0.7}{0.1^2 \times 0.3} = 896.4 \approx 897 \text{ cats}$$

With 5% relative precision, sample size increases to 3586 cats.
Poisson: count data expressed as a rate (events per unit time, length, area, or volume); variance = mean.
Negative binomial: count data showing aggregation, usually with over-representation of zeros; variance >> mean.

… metric such as distance, area, or volume. A key feature of the Poisson model is that the mean and variance are equal (equidispersion).

Aggregated data are more appropriately modelled using a negative binomial distribution. Many biological count distributions are characterised by a highly skewed and uneven frequency distribution of subjects for a given characteristic, such that only a few subjects have the characteristic and many or most subjects do not. This is called aggregation or clumping. Aggregated distributions are characterised by over-representation of zeros and relatively few observations in the remaining categories. The negative binomial distribution differs from the Poisson in that the variance exceeds the mean, a property referred to as over-dispersion. If over-dispersion is present and a Poisson distribution is incorrectly assumed, the confidence intervals will be too narrow. The negative binomial distribution is widely used for ecological applications (White and Bennetts 1996; Stoklosa et al. 2022). The best-known examples are patterns of macroparasite distribution among hosts (Pennycuick 1971; Shaw et al. 1998; McVinish and Lester 2020).

10.7.1 Poisson Distribution

The Poisson distribution is the limiting case of the binomial distribution where the binomial probability p for each 'trial' or event occurrence is unknown, but the average p is known. Examples are given in Box 10.5.

BOX 10.5
- Number of birds visiting a feeder over a day
- Cells per square in a haemocytometer
- Number of foalings per year
- Number of adverse events in a clinical trial
- Bacterial cell counts per mm²
- Arrival rate of dogs at a clinic or shelter
- Number of cases of bubonic plague per year
- Biological dosimetry, mutation detection and mutation rates
- Spiking activity rates of individual neurons; inter-spike intervals (sometimes)
- Lifetime encounter rate between parasitoid and host
- Visits of pollinators to a flower
- Predator attack rates
- Accident rates

The probability of x events during an interval t is

$$\Pr(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$$

where λ is the mean occurrence rate (number of events x per unit time) and e is a constant (2.718282). Because the mean and variance are equal, both are estimated as

$$\hat{\lambda} = x/N$$

The number of events occurring during a time interval t is estimated by substituting λ = a t, so that the number of events expected during any time interval t is a t, where a is the rate per unit time.

Model fit is assessed by evaluating the scaled deviance goodness-of-fit statistic. If the Poisson model is not a good fit, the deviance statistic will be >1.

Sample size. The large-sample approximation for sample size N for a target precision d is

$$N = \left(\frac{z_{1-\alpha/2}}{d}\right)^2 \lambda$$

Confidence intervals. The Poisson distribution converges to the binomial distribution at large sample sizes (or as the interval between events …
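A minimal SAS sketch of the Poisson sample-size approximation is given below; the values of λ and d are illustrative, not from the text.

/* Poisson: N for target precision d around an anticipated rate lambda */
data n_pois;
  z = quantile('NORMAL', 0.975);
  lambda = 4;  d = 0.5;            /* illustrative rate and precision */
  N = ceil( (z/d)**2 * lambda );
run;
proc print data=n_pois; run;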
… Worm count data were obtained for 67 Soay sheep that died during the 1989 population crash on St Kilda. Worm counts averaged μ = 47.5 with aggregation parameter k = 0.841, giving a variance of

$$s^2 = 47.5 + \frac{47.5^2}{0.841} = 2730.3$$

What sample size for a new study would enable determination of the expected mean worm count with a precision of 10% and 95% confidence?

$$N = \frac{1.96^2}{0.1^2}\left(\frac{1}{47.5} + \frac{1}{0.841}\right) = 317.5 \approx 318 \text{ sheep}$$

Figure 10.3: Counts of female mites on apple leaves follow a negative binomial distribution (x-axis: number of mites, 0–7; y-axis: number of leaves). Source: Data from Bliss and Fisher (1953).
Example: Counts of Red Mites on Apple Leaves: Estimating μ and k from Raw Data

(Data from Bliss and Fisher 1953.) A total of 172 female mites were counted on 150 apple leaves (Figure 10.3). Appendix 10.E provides the data and SAS code.

Simple descriptive statistics (mean, variance) were calculated from the raw count data. The mean number of mites per leaf is 172/150 = 1.147, and the variance is s² = 2.274. The data are clearly over-dispersed, with an over-representation of zeros. The Poisson distribution is not a good fit: the variance is nearly twice as large as the mean, and the scaled deviance statistic is 1.91 (much greater than 1.0).

The negative binomial provided a better fit to the data. The maximum likelihood estimate for k is 0.976. The standard error is therefore

$$SE = \sqrt{1.147 + \frac{0.967}{150}} = 1.074$$

The asymptotic 95% confidence interval is 1.147 ± 1.96 (1.074) = [−0.95, 3.18], or approximately (−1, 3.2) mites per leaf.
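A generic sketch of this Poisson-versus-negative-binomial comparison using PROC GENMOD is given below; it is not the Appendix 10.E listing, and the dataset MITES and variable COUNT are assumed names.

/* Fit intercept-only Poisson and negative binomial models to the counts */
proc genmod data=mites;
  model count = / dist=poisson;  /* scaled deviance >> 1 flags overdispersion */
run;
proc genmod data=mites;
  model count = / dist=negbin;   /* ML estimate of the dispersion parameter   */
run;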
10.A SAS Code for Computing Simultaneous Confidence Intervals (Data Example from German et al. 2015)

%let p=20;
DATA DOG;
INPUT ID $ PROTEIN ARGIN HIST ISOLEUC METCYS LEUC LYSINE PHETYR THRE TRYPT VALINE TFAT LINOL CALCIUM PHOS MG SODIUM K CHLORIDE IRON COPPER ZN MANG SEL IODINE VITA VITD3 VITE THIAMIN RIBO PYRID NIACIN PANTO COBAL FOLIC;

*calculate standardised variables;
data new;
set DOG;
var="PROTEIN"; ratio=PROTEIN/3.28; output;
var="METCYS";  ratio=METCYS/0.21;  output;
var="CALCIUM"; ratio=CALCIUM/0.13; output;
var="PHOS";    ratio=PHOS/0.1;     output;
var="MG";      ratio=MG/19.7;      output;
keep var ratio; run;

proc sort; by var; run;

*calculate summary statistics (mean, SD, and n) for each standardised variable;
proc means;
by var;
var ratio;
output out=a n=n mean=xbar var=s2;
run;

*calculate simultaneous confidence intervals;
data b;
set a;
Lwanga, S.K. and Lemeshow, S. (1991). Sample Size Determination in Health Studies: A Practical Manual. Geneva: World Health Organization.
McVinish, R. and Lester, R.J.G. (2020). Measuring aggregation in parasite populations. Journal of the Royal Society Interface 17: 20190886. https://doi.org/10.1098/rsif.2019.0886.
Meeker, W.Q., Hahn, G.J., and Escobar, L.A. (2017). Statistical Intervals: A Guide for Practitioners and Researchers, 2e. New York: Wiley.
Newcombe, R.G. (1998). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17: 857–872.
Pennycuick, L. (1971). Frequency distributions of parasites in a population of three-spined sticklebacks, Gasterosteus aculeatus L., with particular reference to the negative binomial distribution. Parasitology 63 (3): 389–406. https://doi.org/10.1017/S0031182000079920.
Ross, G.J.S. and Preece, D.A. (1985). The negative binomial distribution. The Statistician 34: 323–336.
Saha, K. and Paul, S. (2005). Bias-corrected maximum likelihood estimator of the negative binomial dispersion parameter. Biometrics 61: 179–185.
Shaw, D.J., Grenfell, B.T., and Dobson, A.P. (1998). Patterns of macroparasite aggregation in wildlife host populations. Parasitology 117: 597–610. https://doi.org/10.1017/S0031182098003448.
Shilane, D., Evans, S.N., and Hubbard, A.E. (2010). Confidence intervals for negative binomial random variables of high dispersion. The International Journal of Biostatistics 6 (1): Article 10. https://doi.org/10.2202/1557-4679.1164.
Stoklosa, J., Blakey, R.V., and Hui, F.K.C. (2022). An overview of modern applications of negative binomial modelling in ecology and biodiversity. Diversity 14 (5): 320. https://doi.org/10.3390/d14050320.
Thompson, S.K. (1987). Sample size for estimating multinomial proportions. The American Statistician 41 (1): 42–46.
Tortora, R. (1978). A note on sample size estimation for multinomial populations. The American Statistician 32 (3): 100–102.
White, G.C. and Bennetts, R.E. (1996). Analysis of frequency count data using the negative binomial distribution. Ecology 77 (8): 2549–2557.
11
Prediction Intervals

BOX 11.1
Applications of Prediction Intervals

- 'Personalised' reference intervals
- Regression models
- Forecasting models
- Relationship between two traits for an unmeasured species in phylogenetic analyses
- Replication studies
- Meta-analyses.

11.2 Prediction Intervals: Continuous Data

The prediction interval is calculated as

$$\hat{y}_{new} \pm t\text{-value} \times SE_{new}$$

where ŷ(new) is the sample estimate for the new predicted observation and SE(new) is the standard error of the new predicted value.
For a single future observation, based on a previous sample of size N with standard deviation s, the two-sided interval is

$$\hat{y}_{new} \pm t_{1-\alpha/2,\,N-1}\; s\sqrt{1 + \frac{1}{N}}$$

The two-sided 100(1 − α)% prediction interval for the difference between means ȳ1 − ȳ2 is

$$\bar{y}_1 - \bar{y}_2 \pm t_{1-\alpha/2,\,n_1+n_2-2}\; SE$$

and the one-sided lower prediction limit for the difference between future observations is

$$\bar{y}_1 - \bar{y}_2 - t_{\alpha,\,n_1+n_2-2}\; SE$$

with pooled standard deviation

$$s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$$

Here, n1 and n2 are the sample sizes for each group, s1² and s2² are the variances for each group, and m1 and m2 are the numbers of new observations.

Example: Predicting a Single Future Value: Creatinine Levels in Obese Dogs

(Data from German et al. 2015.) Serum creatinine was measured in 27 obese dogs before entering a weight-loss programme, averaging 81.5 (SD 19.6) μmol/L. What is the predicted value of creatinine for a single future obese dog at α = 0.05?

Example: Predictions Based on a Fixed Sample Size: Anaesthetic Immobilisation Times in Mice

(Data from Dholakia et al. 2017.) Time of complete immobilisation following anaesthesia was measured in two groups of 16 CD-1 male mice randomly allocated to receive intraperitoneal injections of either ketamine-xylazine (KX) or ketamine-xylazine combined with lidocaine (KXL). Immobilisation time averaged 38.8 (SD 7.9) min for KX mice and 33.3 (SD 3.9) min for KXL mice, a difference of 5.5 minutes.

If a new study is planned using only five mice per group, what is the expected difference in future immobilisation times for KX compared to KXL mice with 95% confidence?
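A minimal SAS sketch of the single-future-value calculation, using the creatinine example's summary values with the formula above:

/* Prediction interval for one future observation (N = 27, mean 81.5, SD 19.6) */
data pi_single;
  N = 27;  ybar = 81.5;  s = 19.6;  alpha = 0.05;
  t  = quantile('T', 1 - alpha/2, N - 1);
  lo = ybar - t*s*sqrt(1 + 1/N);
  hi = ybar + t*s*sqrt(1 + 1/N);
run;
proc print data=pi_single; run;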
Therefore, immobilisation times for KXL could be as much as 12 min longer than KX, or 1.2 min shorter.

Example: Predicting a Range of Future Observations: Creatinine Levels in Obese Dogs

(Data from German et al. 2015.) Serum creatinine was measured in 27 obese dogs before entering a weight-loss programme, averaging 81.5 (SD 19.6) μmol/L. Suppose a new study was planned using a sample size of 100 dogs. What range of mean creatinine values can be expected with 95% confidence?

For this query, the prediction interval on the difference between two means must be computed; both the predicted mean and sample standard deviation are hypothetical but assumed to be equal to those in the original sample. The prediction interval for the new study is then

$$\text{original mean} \pm t_{df,\,N_1-1}\sqrt{\frac{s_1^2}{N_1} + \frac{s_1^2}{N_2}} = 81.5 \pm 2.052\sqrt{\frac{19.6^2}{27} + \frac{19.6^2}{100}} = 81.5 \pm 8.7 = (72.8,\ 90.2)$$

Therefore the range of average creatinine values that can be expected in a future study with a sample size of 100 is between 72.8 and 90.2 μmol/L.

For prediction from a fitted regression model, the interval for a new observation at a given value of X is

$$\hat{y}_{new} \pm t_{\alpha/2,\,N-(K+1)}\sqrt{MSE + \left[SE(\hat{y}_{new})\right]^2}$$

where MSE is the mean square error of the regression with associated degrees of freedom N − (K + 1), and √(MSE + [SE(ŷ(new))]²) is the standard error of the prediction. In SAS, the 95% prediction interval for a new observation at a given value of X is requested with the cli option in the MODEL statement.

The ordinary least squares regression is fitted under the assumption that the independent variable is measured without error, or at least that the error is negligible compared with that for the response variable (Draper and Smith 1998). Prediction intervals must be interpreted with caution if there is substantial error in the values of the independent variable. del Río et al. (2001) describe methods for constructing prediction intervals for linear regression that account for errors on both axes. Gelman et al. (2021) provide an excellent guide to modern methods of regression, including Bayesian inference methods (incorporation of prior information into inferences) and R commands for computation.

Example: Rodent P50 in Relation to Body Mass

Oxygen affinity for haemoglobin is quantified by P50, the partial pressure of oxygen at which haemoglobin is 50% saturated. In mammals, P50 scales negatively with body mass, such that small mammals have a larger P50 than large mammals (Schmidt-Nielsen and Larimer 1958). Body mass (kg) and P50 (mm Hg) data for 23 rodent species are given in Table 11.1.
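A minimal sketch of the corresponding SAS call is given below; the dataset RODENTS and its variables are assumed names, with mass and P50 taken on log scales as is conventional for allometric scaling.

/* Regression prediction intervals via PROC REG's cli option */
proc reg data=rodents;
  model logp50 = logmass / cli;  /* 95% prediction limits per observation */
run;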
…for mean effects in a meta-analysis will usually be too narrow for an adequate description of the range of possible study effects. If the number of studies in the meta-analysis is >10, the 100(1 − α)% prediction interval can be estimated as

Ȳ ± t(1−α, k−2) √[τ² + SE(Ȳ)²]

where Ȳ is the summary mean of absolute measures of effect (e.g. risk difference, mean difference, standardised mean difference), SE(Ȳ)² is its variance, t(1−α, k−2) is the critical t-value for 1 − α and k − 2 degrees of freedom, and τ² is the estimated variance of the true effects (between-study heterogeneity). For relative measures, such as RR and OR, the interval must be calculated from the logarithm (ln) of the summary estimate, as is the case for confidence intervals (Higgins et al. 2009).

These prediction intervals may have very poor coverage when k is small, resulting in intervals that are too narrow (Partlett and Riley 2017). Nagashima et al. (2019a, b) have developed an R package pimeta that compiles several alternative methods for constructing prediction intervals based on bootstrapping. The package also includes methods for estimating prediction intervals when the number of studies is very small (k < 5).
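The interval of Higgins et al. (2009) is straightforward to compute in base R. A minimal sketch, using hypothetical study-level mean differences and standard errors (k = 12 invented studies) and the DerSimonian–Laird estimate of τ²; the pimeta package mentioned above implements the bootstrap alternatives:

yi  <- c(1.2, 0.8, 1.5, 0.3, 1.1, 0.9, 1.4, 0.6, 1.0, 1.3, 0.7, 1.6)  # effects
sei <- c(0.30, 0.25, 0.40, 0.35, 0.28, 0.32, 0.38, 0.27, 0.30, 0.36, 0.29, 0.33)
k   <- length(yi)
wi  <- 1/sei^2                                    # inverse-variance weights
ybar <- sum(wi*yi)/sum(wi)
Q    <- sum(wi*(yi - ybar)^2)                     # Cochran's Q
tau2 <- max(0, (Q - (k - 1))/(sum(wi) - sum(wi^2)/sum(wi)))  # DL estimator
w.re  <- 1/(sei^2 + tau2)                         # random-effects weights
y.re  <- sum(w.re*yi)/sum(w.re)                   # summary mean
se.re <- sqrt(1/sum(w.re))                        # its standard error
y.re + c(-1, 1) * qt(0.975, df = k - 2) * sqrt(tau2 + se.re^2)  # 95% PI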
References

Coşkun, A., Sandberg, S., Unsal, I. et al. (2021). Personalized reference intervals in laboratory medicine: a new model based on within-subject biological variation. Clinical Chemistry 67 (2): 374–384. https://doi.org/10.1093/clinchem/hvaa233.
Deeks, J.J., Higgins, J.P.T., and Altman, D.G. (2022). Chapter 10: Analysing data and undertaking meta-analyses. In: Cochrane Handbook for Systematic Reviews of Interventions version 6.3 (updated February 2022) (ed. J.P.T. Higgins, J. Thomas, J. Chandler, et al.). Cochrane. Available from www.training.cochrane.org/handbook.
del Río, F.J., Riu, J., and Rius, F.X. (2001). Prediction intervals in linear regression taking into account errors on both axes. Journal of Chemometrics 15: 773–788. https://doi.org/10.1002/cem.663.
Dholakia, U., Clark-Price, S.C., Keating, S.C.J., and Stern, A.W. (2017). Anesthetic effects and body weight changes associated with ketamine-xylazine-lidocaine administered to CD-1 mice. PLoS ONE 12 (9): e0184911. https://doi.org/10.1371/journal.pone.0184911.
Draper, N.R. and Smith, H. (1998). Applied Regression Analysis, 3e. New York: Wiley.
Gelman, A., Hill, J., and Vehtari, A. (2021). Regression and Other Stories. Cambridge: Cambridge University Press.
German, A.J., Holden, S.L., Serisier, S. et al. (2015). Assessing the adequacy of essential nutrient intake in obese dogs undergoing energy restriction for weight loss: a cohort study. BMC Veterinary Research 11: 253. https://doi.org/10.1186/s12917-015-0570-y.
Hahn, G.J. and Meeker, W.Q. (1991). Statistical Intervals: A Guide for Practitioners. New York: Wiley.
Hartnack, S. and Roos, M. (2021). Teaching: confidence, prediction, and tolerance intervals in scientific practice: a tutorial on binary variables. Emerging Themes in Epidemiology 18: 17. https://doi.org/10.1186/s12982-021-00108-1.
Higgins, J.P.T., Thompson, S.G., and Spiegelhalter, D.J. (2009). A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society) 172: 137–159.
IntHout, J., Ioannidis, J.P.A., Rovers, M.M., and Goeman, J.J. (2016). Plea for routinely presenting prediction intervals in meta-analysis. BMJ Open 6: e010247. https://doi.org/10.1136/bmjopen-2015-010247.
Joss, R., Baschnagel, F., Ohlerth, S. et al. (2019). The risk of a shod and unshod horse kick to create orbital fractures in equine cadaveric skulls. Veterinary and Comparative Orthopaedics and Traumatology 32 (4): 282–288.
Landon, J. and Singpurwalla, N.D. (2008). Choosing a coverage probability for prediction intervals. The American Statistician 62 (2): 120–124. https://doi.org/10.1198/000313008x304062.
Meeker, W.Q., Hahn, G.J., and Escobar, L.A. (2017). Statistical Intervals: A Guide for Practitioners and Researchers, 2e. New York: Wiley.
Nagashima, K., Noma, H., and Furukawa, T.A. (2019a). Prediction intervals for random-effects meta-analysis: a confidence distribution approach. Statistical Methods in Medical Research 28 (6): 1689–1702. https://doi.org/10.1177/0962280218773520.
Nagashima, K., Noma, H., and Furukawa, T.A. (2019b). pimeta: prediction intervals for random-effects meta-analysis. R package version 1.1.2. https://CRAN.R-project.org/package=pimeta (accessed 2022).
NIST/SEMATECH (2012). e-Handbook of Statistical Methods. www.itl.nist.gov/div898/handbook. https://doi.org/10.18434/M32189.
k(2) = z(1+p)/2 √[df (1 + 1/N) / χ²(1−α, df)]

with coverage p (the proportion of observations that need to lie within the interval), where χ²(1−α, df) is the critical value of the χ² distribution with df = N − 1 degrees of freedom, and z(1+p)/2 is the critical value of the normal distribution with cumulative probability (1 + p)/2. For example, suppose the required coverage is 99% (p = 0.99). Then (1 + p)/2 = (1 + 0.99)/2 = 0.995, and z0.995 is 2.576. For 95% coverage, (1 + p)/2 = 0.975 and z0.975 = 1.96.

For small samples (N < 30), Guenther (1977) suggests a weighted correction for k(2) as k(2)′ = k(2) w, where

w = √{1 + [N − 3 − χ²(α, N−1)] / [2(N + 1)]}

12.3.2 One-Sided Limits

For continuous normally distributed data with mean ȳ and standard deviation s, the upper limit of a one-sided tolerance interval is ȳ + k(1) s, and the lower limit is ȳ − k(1) s. The one-sided normal tolerance intervals have an exact solution based on the non-central t-distribution. The z-distribution can be used for large-sample approximations (N > 100). In general, there is no difference between a one-sided tolerance bound and a one-sided confidence bound on a given quantile of the distribution. For example, a 95% confidence limit on the upper 95th percentile and an upper tolerance limit on the 95th percentile at 95% confidence are the same (Meeker et al. 2017).

For small-sample one-sided tolerance intervals based on the non-central t-distribution,

k(1) = t(α, N−1, λ) / √N

where λ is the non-centrality parameter zp√N. The sample size N is obtained by iteration to find the minimum N that satisfies t(α, N−1, λ) ≥ t(1−α, N−1, λ).

For one-sided tolerance intervals based on the large-sample normal distribution, k(1) is calculated as

k(1) = [zp + √(zp² − ab)] / a

where a = 1 − zα²/[2(N − 1)] and b = zp² − zα²/N.

Example: Regulatory Threshold for Racehorse Medication Withdrawal

(Data from RMTC 2016.) The Racehorse Medication and Testing Consortium [RMTC] Scientific Advisory Committee has determined the regulatory threshold for specific medications to be the upper limit of the 95/95 tolerance interval; that is, the specified tolerance interval has 95% coverage with 95% confidence. Samples from 20 research horses were collected 24 hours after administration of a certain medication and assayed to determine medication residue. Observed values were:

y = 6.8, 3.4, 6.2, 5.4, 0.3, 0.5, 2.6, 0.1, 0.1, 4.5, 1.0, 2.3, 10.0, 3.5, 0.2, 1.2, 0.8, 1.0, 1.4, 20.0

The range is 0.1–20.0 ng/mL, with mean 3.565 ng/mL (SD = 4.596 ng/mL) and median 1.85 (IQR 0.725, 4.725) ng/mL. Because the data were non-normal and right-skewed, they were ln-transformed for analysis. The transformed mean and standard deviation are 0.43 and 1.50, respectively, with p = 0.95 and confidence 1 − α = 0.95.

Sample SAS and R (Young 2010) code for calculating tolerance limits is provided in Appendix 12.A. The value for k(1) is 2.383. The one-sided upper tolerance limit is exp(0.43 + 2.383 × 1.50) = exp(4.005), for a threshold value of approximately 54.6 ng/mL.

Example: One-Sided Lower Tolerance Interval: Osprey Eggshell Thickness

(Data from Odsjö and Sondell 2014.) Poor reproductive success in ospreys is related to eggshell thinning, mostly resulting from bioaccumulation of environmental contaminants. Investigators measured shell thicknesses in 166 eggs; average shell thickness was 0.51 (SD 0.039) mm.
From historical data, the investigators determined that a reduction in shell thickness by approximately 20% was associated with markedly increased rates of reproductive failure. A 20% reduction in shell thickness from the mean corresponds to an absolute eggshell thickness of 0.4 mm. Suppose the effectiveness of environmental remediation was defined as at least 95% of eggs in the population being above the breakage threshold of 0.4 mm with 90% confidence. What is the lower one-sided tolerance limit?

Using the normtol.int option in the R library tolerance:

library(tolerance)
set.seed(166)
x <- rnorm(166, 0.515, 0.039)
out <- normtol.int(x = x, alpha = 0.10, P = 0.95,
                   side = 1, method = "HE", log.norm = FALSE)
out

The lower one-sided tolerance limit is 0.448 mm.

Example: Reference Intervals: Normally Distributed Data

Reference intervals for common veterinary haematology and biochemistry variables are usually based on the mean ± 2 SD if it can be assumed the data are normally distributed (Klaassen 1999). How do reference intervals based on this approximation compare with prediction and tolerance intervals for the same data?

(Data from Liu et al. 2021.) Fasting blood glucose values were obtained from 210 subjects and averaged 95.54 (SD 7.42) mg/dL.

The 95% confidence interval for the mean is

CI = Ȳ ± z(1−α/2) s/√n = 95.54 ± 1.96 (7.42/√210) = (94.5, 96.5)

The reference interval approximated by ±2 SD (Klaassen 1999) is

RI = 95.54 ± 2 (7.42) = (80.71, 110.37) mg/dL

The prediction interval for a single future observation is

PI = Ȳ ± t(1−α/2, N−1) s √(1 + 1/N) = 95.54 ± 1.97 (7.42) √(1 + 1/210) = (80.9, 110.2)

Using the normtol.int option in the R package tolerance and the Howe method for estimating the two-sided k, the tolerance interval for 95% coverage and 95% confidence is (80.1, 110.9) mg/dL.
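A minimal R sketch tying these intervals together. Because normtol.int requires raw data, values with the published summary statistics are simulated here (as in the osprey code above); the last line also computes the one-sided k factor used in the racehorse example directly from the non-central t-distribution:

library(tolerance)
N <- 210; m <- 95.54; s <- 7.42
m + c(-1, 1) * qnorm(0.975) * s/sqrt(N)             # 95% CI for the mean
m + c(-1, 1) * 2 * s                                # RI, mean +/- 2 SD
m + c(-1, 1) * qt(0.975, N - 1) * s * sqrt(1 + 1/N) # 95% prediction interval
set.seed(210)
x <- rnorm(N, m, s)                                 # simulated stand-in data
normtol.int(x = x, alpha = 0.05, P = 0.95, side = 2, method = "HE")
# One-sided k for N = 20, p = 0.95, 95% confidence (racehorse example):
qt(0.95, df = 19, ncp = qnorm(0.95) * sqrt(20)) / sqrt(20)   # approx. 2.4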
12.4 Non-parametric Tolerance Limits

Non-parametric, or distribution-free, tolerance limits are used if the data are not normally distributed and cannot be readily transformed, or if the investigator chooses not to make distributional assumptions about the data. The only assumption is that the underlying distribution function is a non-decreasing continuous probability distribution. Non-parametric tolerance intervals are approximated by rank-order methods. A major disadvantage of non-parametric approximations is the requirement for large sample sizes.

For count or binary data, the tolerance interval specifies the upper and lower bounds on the number of observations that are expected to show the event of interest (yes/no) in a future sample of m observations with a specified confidence 100(1 − α)%. For two-sided tolerance intervals with an upper and a lower limit, a specified proportion P of the population is contained within the bounds with a specified level of confidence. For one-sided tolerance intervals, the upper (or lower) limit describes the specified proportion P that meets or exceeds (or falls below) a specified threshold value. For example, to provide evidence of safety, an investigator might wish to show with 90% confidence that ≤1% of subjects exposed to a test substance demonstrated adverse effects (Meeker et al. 2017; Hartnack 2019; Hartnack and Roos 2021).

For continuous data, non-parametric tolerance intervals are calculated from quantiles of the ranked sample data, using the largest and smallest values in the sample.
1 − α = 1 − p^N

Therefore, the smallest p to achieve a desired confidence 1 − α is

p = exp(ln α / N)

and the minimum sample size is

N = ln α / ln p

Another approximation for N is

N ≈ (1/4) χ²(α, 4) (1 + p)/(1 − p) + 1/2

where χ²(α, 4) is the critical value of the χ² distribution with 4 degrees of freedom that is exceeded with probability α (Hahn and Meeker 1991; NIST/SEMATECH 2012). Kirkpatrick (1977) provides tabulated values for determining sample size for one-sided and two-sided tolerance limits for both normally distributed and distribution-free data.

Example: Sample Size for Osprey Egg Study

How many eggs need to be randomly sampled to be able to claim that 95% of the osprey egg population will exceed the lower tolerance bound with 95% confidence? Sample size is estimated from the approximation N = ln α / ln p = ln(0.05)/ln(0.95) ≈ 58.4, or 59 eggs.

12.6 Sample Size for Tolerance Based on Permissible Number of Failures

When trialling novel medical procedures or devices, the consequences of error can be catastrophic. For sequential processes with risk of harm, a predetermined level of acceptable risk must be included with 'confidence' (percentage of occurrences) and 'reliability' (the proportion of the population sample) in sample size calculations. Durivage (2016a, b) recommends setting confidence and reliability levels based on risk acceptance (Table 12.1). As risk increases, the predetermined level of reliability (or proportion of the sample population evaluated without failure) increases.

When the number of failures is predetermined, sample size N is approximated as

N = 0.5 χ²(1−C, 2(r+1)) / (1 − R)

where r is the number of failures, C is the confidence level, R is the 'reliability', and the value of χ² is determined for (1 − C) confidence and 2(r + 1) degrees of freedom.

If zero failures are allowed:

N = ln(1 − C) / ln(R)

Table 12.1: Confidence and reliability levels can be specified based on subjective levels of risk acceptance.
Figure 13.2: Construction of the reference interval requires specification of three items: interval coverage, reference limits, and cut-point precision. (a) By convention, coverage p is specified as 95%; the central 95% of the sample measurements for the reference distribution is therefore bounded by the 2.5th and 97.5th percentiles, the reference limits. (b) Cut-point precision is defined by the coverage over the reference limits themselves. The desired coverage for the reference limit is usually specified as 90% (β = 0.90).

Example: Reference Range: Normally Distributed Data

(Data from Liu et al. 2021.) Fasting blood glucose values obtained from 210 subjects averaged 95.54 (SD 7.42) mg/dL. Reference ranges calculated from these summary data are summarised below.

Using normtol.int in the R library tolerance, with the Howe method for estimating the two-sided k (Young 2010, 2013): RI95/95 = (79.9, 111.8) mg/dL.

RI95/95 = (35.5, 477.6) μmol/L.

Using the non-parametric method with nptol.int in the R library tolerance…
13.3.3 Parametric Sample Size Estimates

Wellek et al. (2014) give the approximate sample size formula for a two-sided parametric reference interval based on the standard normal probability density function. Sample size is

N ≥ (1 + z1²/2) [Φ z2 / (δ/2)]²

where z1 is the (1 + q)/2 percentile for coverage, and z2 is the (1 + β)/2 percentile for the cut-point precision δ. Typically, q = 0.95 and β = 0.90, so that z1 = z2 = 1.96 for the two-sided case. For the one-sided reference interval, z1 is the q percentile, cut-point precision is δ (rather than δ/2), and z1 = z2 = 1.6449. The quantity Φ is the pdf of the standard normal density function at x = z1 and is calculated as

Φ = [1/√(2π)] e^(−z1²/2)

If z1 = 1.96, then Φ = 0.05845, and if z1 = 1.6449, then Φ = 0.10314.

13.3.4 Non-parametric Sample Size Estimates

Non-parametric (distribution-free) formulations are based on weighted averages of the rank orders of measurements. Reference intervals are determined by percentiles of the distribution of clinical biomarker values. The reference interval is defined by the qth and [1 − (1 − q)/2]th quantiles of the normal distribution. If a one-sided reference interval is required, the two-sided quantile [1 − (1 − q)/2] is replaced by [1 − (1 − q)] = q.

The approximate sample size formula for a two-sided non-parametric reference interval (Wellek et al. 2014) is

N ≥ [(1 + q)/2] [1 − (1 + q)/2] [z2/(δ/2)]²

and for the one-sided case:

N ≥ q (1 − q) (z2/δ)²

(From Wellek et al. 2014.) For two-sided reference coverage of 95%, cut-point precision δ of 0.01, and β of 90%, the parametric estimate of sample size is

N ≈ (1 + 1.96²/2) (0.05845 × 1.96/0.005)² ≈ 1534

and the non-parametric sample size is

N ≈ [(1 + 0.95)/2] [1 − (1 + 0.95)/2] (1.96/0.005)² = (0.975)(0.025)(1.96/0.005)² ≈ 3746
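The two worked values above can be verified in a few lines of R; a minimal sketch, assuming the same q, δ, and z2 as the worked example:

q <- 0.95; delta <- 0.01; z2 <- 1.96
z1  <- qnorm((1 + q)/2)                           # 1.96
Phi <- dnorm(z1)                                  # 0.05845
(1 + z1^2/2) * (Phi * z2/(delta/2))^2             # parametric: approx. 1534
((1 + q)/2) * (1 - (1 + q)/2) * (z2/(delta/2))^2  # non-parametric: approx. 3746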
13.3.5 Covariate-Dependent Sample Size Estimates

When clinical measurements vary with one or more covariates, sample size estimates need to account for the structure of the regression model and how covariates are distributed, as well as the desired degree of precision. In general, sample sizes based on covariate-dependent samples will be very much larger than those that are independent of covariates. Bellera and Hanley (2007) developed a simple regression-based method for estimating sample size when the value of the response depends on a single covariate, for example, age or body weight. Four variations of this method are available to account for different distributions of the covariate in the study sample.

Uniform distribution. If the covariate is uniformly distributed across its range, minimum sample size N is

N ≥ z²(1−α) (4 + zp²) p / [2 z²(1−β) D²]

for the 100(1 − α)% confidence interval, the 100p% reference limit compared to the 100(1 − β)% reference range, and for margin of error D.

Normal distribution. If the covariate is approximately normally distributed, and its range is
Henny, J. (2009). The IFCC recommendations for determining reference intervals: strengths and limitations. Journal of Laboratory Medicine 33 (2): 45–51. https://doi.org/10.1515/JLM.2009.0162.
Horowitz, G.L., Altaie, S., Boyd, J.C. et al. (2010). Defining, Establishing, and Verifying Reference Intervals in the Clinical Laboratory; Approved Guideline, 3e. CLSI Document EP28-A3C, Vol. 28, No. 30. Wayne, PA: Clinical and Laboratory Standards Institute.
Ichihara, K. and Boyd, J.C. (2010). An appraisal of statistical procedures used in derivation of reference intervals. Clinical Chemistry and Laboratory Medicine 48: 1537–1551.
Jennen-Steinmetz, C. (2014). Sample size determination for studies designed to estimate covariate-dependent reference quantile curves. Statistics in Medicine 33 (8): 1336–1348. https://doi.org/10.1002/sim.6024.
Jennen-Steinmetz, C. and Wellek, S. (2005). A new approach to sample size calculation for reference interval studies. Statistics in Medicine 24 (20): 3199–3212. https://doi.org/10.1002/sim.2177.
Katayev, A., Balciza, C., and Seccombe, D.W. (2010). Establishing reference intervals for clinical laboratory test results: is there a better way? American Journal of Clinical Pathology 133 (2): 180–186. https://doi.org/10.1309/ajcpn5bmtsf1cdyp.
Klaassen, J.K. (1999). Reference values in veterinary medicine. Laboratory Medicine 30 (3): 194–197. https://doi.org/10.1093/labmed/30.3.194.
Koenker, R. (2022). R package quantreg. https://cran.r-project.org (accessed 2022).
Koenker, R. and Hallock, K.F. (2001). Quantile regression. Journal of Economic Perspectives 15 (4): 143–156.
Liu, W., Bretz, F., and Cortina-Borja, M. (2021). Reference range: which statistical intervals to use? Statistical Methods in Medical Research 30 (2): 523–534. https://doi.org/10.1177/0962280220961793.
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. (2018). Sample Sizes for Clinical, Laboratory and Epidemiology Studies, 4e. New York: Wiley.
Misbach, C., Lefebvre, H.P., Concordet, D. et al. (2014). Echocardiography and conventional Doppler examination in clinically healthy adult Cavalier King Charles Spaniels: effect of body weight, age, and gender, and establishment of reference intervals. Journal of Veterinary Cardiology 16 (2): 91–100. https://doi.org/10.1016/j.jvc.2014.03.001.
Mulherin, S.A. and Miller, W.C. (2002). Spectrum bias or spectrum effect? Subgroup variation in diagnostic test evaluation. Annals of Internal Medicine 137 (7): 598–602. https://doi.org/10.7326/0003-4819-137-7-200210010-00011.
Ransohoff, D.F. and Feinstein, A.R. (1978). Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. New England Journal of Medicine 299 (17): 926–930. https://doi.org/10.1056/NEJM197810262991705.
Smith, D., Silver, E., and Harnly, M. (2006). Environmental samples below the limits of detection – comparing regression methods to predict environmental concentrations. http://www.lexjansen.com/wuss/2006/Analytics/ANL-Smith.pdf (accessed 2022).
Tarr, G. (2012). Small sample performance of quantile regression confidence intervals. Journal of Statistical Computation and Simulation 82 (1): 81–94. https://doi.org/10.1080/00949655.2010.527844.
Tivers, M.S., Handel, I., Gow, A.G. et al. (2014). Hyperammonemia and systemic inflammatory response syndrome predicts presence of hepatic encephalopathy in dogs with congenital portosystemic shunts. PLoS ONE 9 (1): e82303. https://doi.org/10.1371/journal.pone.0082303.
Wei, Y., Kehm, R.D., Goldberg, M., and Terry, M.B. (2019). Applications for quantile regression in epidemiology. Current Epidemiology Reports 6: 191–199. https://doi.org/10.1007/s40471-019-00204-6.
Wellek, S. and Jennen-Steinmetz, C. (2022). Reference ranges: why tolerance intervals should not be used [comment on Liu, Bretz and Cortina-Borja, Reference range: which statistical intervals to use? SMMR, 2021, 30 (2): 523–534]. Statistical Methods in Medical Research 31 (11): 2255–2256. https://doi.org/10.1177/09622802221114538.
Wellek, S., Lackner, K.J., Jennen-Steinmetz, C. et al. (2014). Determination of reference limits: statistical concepts and tools for sample size calculation. Clinical Chemistry and Laboratory Medicine 52 (12): 1685–1694. https://doi.org/10.1515/cclm-2014-0226.
Wright, E.M. and Royston, P. (1997a). Simplified estimation of age-specific reference intervals for skewed data. Statistics in Medicine 16: 2785–2803.
Wright, E.M. and Royston, P. (1997b). A comparison of statistical methods for age-related reference intervals. Journal of the Royal Statistical Society, Series A 160: 47–69.
Young, D.S. (2010). tolerance: an R package for estimating tolerance intervals. Journal of Statistical Software 36 (5): 1–39.
Young, D.S. (2013). Regression tolerance intervals. Communications in Statistics: Simulation and Computation 42 (9): 2040–2055. https://doi.org/10.1080/03610918.2012.689064.
Young, D.S. (2014). Computing tolerance intervals and regions using R. In: Handbook of Statistics: Computational Statistics with R, vol. 32 (ed. M.B. Rao and C.R. Rao), 309–338. Amsterdam: North Holland-Elsevier.
Young, D.S., Gordon, C.M., Zhu, S., and Olin, B.D. (2016). Sample size determination strategies for normal tolerance intervals using historical data. Quality Engineering 28 (3): 337–351. https://doi.org/10.1080/08982112.2015.1124279.
Yu, K., Lu, Z., and Stander, J. (2003). Quantile regression: applications and current research areas. The Statistician 52 (3): 331–350.
IV
Sample Size for Comparison
Figure 14.2: Central (λ = 0) and non-central (λ > 0) t-distributions with 10 degrees of freedom. As λ gets larger, the curve shifts to the right and becomes more asymmetrical.

14.4 Estimating Sample Size

Sample size using non-centrality is computed by iterative numerical methods (Fenschel et al. 2011; Stroup 2011, 2012). Depending on the type of information available, estimates of the non-centrality parameter λ are determined from:

1. Summary statistics for each group. These can be means and standard deviations, or standard error and sample size obtained from prior data.
2. An estimate of the mean difference and the pooled variance of the difference.
3. Exemplary or raw data. An exemplary or 'dummy' data set is artificial data that consist of values obtained from historical data or simulated from statistical models used to model the expected responses. The λ is calculated from the mock data by fitting a prespecified analysis model (e.g. ANOVA, mixed models, generalised linear models) and extracting the necessary statistics from the output (O'Brien and Muller 1993; Castelloe and O'Brien 2001). The model output step provides F-values and degrees of freedom for calculating λ.

Calculations for λ and the critical value for the test statistic (λ = 0) are made by iterating over a range of candidate sample sizes (Littell et al. 2006; Fenschel et al. 2011; Stroup 2012). The sample size is chosen that results in a value of the computed power that equals or exceeds the pre-specified target power. Total sample size N, sample size per group ni, or maximum number of groups k for a specified total sample size can be calculated with this method.

14.4.1 Non-central t-Distribution

The one-sample t-statistic follows the Student's t-distribution with n − 1 degrees of freedom:

t = (x̄ − μ) / (s/√n)

where x̄ is the sample mean of a random sample from a normal population, s is the standard deviation, n is the sample size, and μ is the population mean. The effect size is

d = (x̄ − μ) / s

The central t is distributed around zero. Suppose the true population mean is actually μ1; then the t-statistic will follow a non-central t-distribution with non-centrality parameter

λ = (μ1 − μ) / (σ/√n)

and values of t distributed around λ. Under the null hypothesis, the difference between μ and μ1 is zero, and the test statistic has a standard t-distribution. The distributions move to the right as λ increases. Under the alternative hypothesis, the test statistic has a non-zero-mean t-distribution.

The power to detect the difference d with significance α is approximately

1 − β = T(n−1) [t(α/2, n−1); λ = d√n/σ]

where T(n−1) is the cumulative distribution function of the non-central t-distribution with degrees of freedom n − 1 and non-centrality parameter λ. (Box 14.5 gives definitions for cumulative and probability density functions.)

The critical value of t is estimated by the inverse function of t, calculated from the total sample size n, degrees of freedom df = n − 1, and significance level α (Harrison and Brady 2004). In SAS, the tinv function returns the pth quantile from the Student's t-distribution (qt is the R equivalent):

t_crit = tinv(alpha/2, n-1);

For a one-sided test, the one-sided significance level α replaces the two-sided α/2. Power is then computed from the critical value of t, the associated degrees of freedom, and the non-centrality parameter (ncp):

in SAS as

power = 1 - probt(t_crit, n-1, ncp);

and in R as

ncp <- d/(s/sqrt(n))
t <- qt(0.975, df = n-1)
pt(t, df = n-1, ncp = ncp) - pt(-t, df = n-1, ncp = ncp)       # central probability
1 - (pt(t, df = n-1, ncp = ncp) - pt(-t, df = n-1, ncp = ncp)) # power

BOX 14.5
Cumulative Distribution Function (cdf) and Probability Density Function (pdf)

The cumulative distribution function (cdf) gives the probabilities of a random variable X being smaller than or equal to some value x, Pr(X ≤ x) = F(x). The inverse of the cdf gives the value x that would make F(x) return a particular probability p: F⁻¹(p) = x.

The probability density function (pdf) returns the relative frequency of the value of a statistic or the probability that a random variable X takes on a certain value. The pdf is the derivative of the cdf.
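Putting the pieces together: a minimal sketch of a one-sample t-test power calculation in R using the non-central t-distribution, with illustrative values for d, s, and n, cross-checked against the built-in power.t.test function:

d <- 5; s <- 10; n <- 20; alpha <- 0.05     # illustrative values
ncp    <- d/(s/sqrt(n))                     # non-centrality parameter
t.crit <- qt(1 - alpha/2, df = n - 1)       # critical t
1 - (pt(t.crit, df = n - 1, ncp = ncp) -
     pt(-t.crit, df = n - 1, ncp = ncp))    # power, approx. 0.56
power.t.test(n = n, delta = d, sd = s, sig.level = alpha,
             type = "one.sample")$power     # agrees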
Figure 14.3: Central (λ = 0) and non-central (λ > 0) F-distributions with degrees of freedom (3, 20). The critical F(3, 20) = 3.098. Area under the curve to the right of the critical value (dotted line) for each non-central curve is the power of the test.

Under the null hypothesis, the non-centrality parameter is zero (λ = 0). Under the alternate hypothesis, the sampling distribution of the F-ratio is the non-central F-distribution F(ν1, ν2, λ), λ > 0, and the F-distribution stretches out to the right (Figure 14.3).

The critical F-value is obtained from the 100(1 − α)% quantile value of the central F-distribution and the corresponding degrees of freedom. For example, for α = 0.05, the quantile is 100(1 − 0.05), or the 95th quantile value of the central F-distribution with λ = 0. This is calculated from the inverse function of the corresponding cumulative density function (cdf), with the non-centrality parameter set to zero:

In SAS this is

F_crit = Finv(1-alpha, numdf, dendf, 0);

and in R

F.crit <- qf(alpha, numdf, dendf, lower.tail = FALSE)

where numdf is the numerator degrees of freedom ν1 = (k − 1), and dendf is the denominator degrees of freedom ν2 = k(n − 1) = (N − k). The associated power for each N is then estimated as Power = Pr[F(ν1, ν2, λ) ≥ F(ν1, ν2)]:

in SAS as

Power = 1 - probf(F_crit, NumDF, DenDF, NCP);

and in R as

pf(F.crit, NumDF, DenDF, ncp = delta, lower.tail = FALSE)

Example: Determining Sample Size from Exemplary Data: Mouse Breeding Productivity

(Data from Hull et al. 2022.) A study was designed to compare two handling methods on reproductive indices for laboratory mice. The study design was a two-arm randomised controlled trial, with breeding pair as random effect and handling method as the fixed effect.

The primary outcome was the number of pups produced per pair. The outcome data were discrete counts, which are non-Gaussian. Analyses required a generalised linear mixed model with a Poisson distribution for counts. Therefore, conventional power calculation formulae for determining sample size were inappropriate, and power and sample size were estimated by simulation (Littell et al. 2006; Stroup 2011, 2012).

Sample size was estimated in three steps (sample SAS code is in Appendix 14.A). First, an exemplary data set was created for the total number of pups expected in each group. Data were based on historical information for controls and the anticipated effect size. The expected total number of pups for control animals in this study was:

5 litters × 6 pups/litter = 30 pups per pair

The expected difference was an additional pup per litter, or 4–5 additional pups per pair, for a total of 34–35 pups in the test group. It was determined a priori that an increase of one extra pup per pair was operationally significant enough to justify the facility switching to the new handling method. In this example, exemplary data were generated for a sample size of 40 pairs per treatment arm. The 40 'observations' in each group had the values specified by the anticipated pup counts (30 and 35).

Second, a generalised linear model and Poisson distribution were fitted to the exemplary data in SAS proc glimmix. The model output step provides expected F-values and degrees of freedom (df) for calculating the non-centrality parameter, approximated as (numerator df) × (F-value). The critical F-value was estimated from the inverse function for the χ² distribution, given the pre-specified confidence (1 − α) and the degrees of freedom.
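The critical-value and power steps described above can be collected into a few lines of R; the number of groups, per-group sample size, and λ used here are illustrative assumptions:

alpha <- 0.05
k <- 4; n <- 6                              # groups and animals per group
numdf <- k - 1; dendf <- k*(n - 1)
F.crit <- qf(alpha, numdf, dendf, lower.tail = FALSE)   # 3.098 for (3, 20)
lambda <- 8                                 # non-centrality parameter
pf(F.crit, numdf, dendf, ncp = lambda, lower.tail = FALSE)  # power, approx. 0.6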
Table 14.3: Relative efficiency of balanced versus unbalanced allocation designs. The total sample size N = nA + nB is 20. (Columns: nA; nB; 1/nA + 1/nB; relative efficiency; correction; adjusted sample sizes nA, nB, N.)
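The body of Table 14.3 is not reproduced in this extraction, but the calculation behind it is simple: the variance of a two-group mean difference is proportional to 1/nA + 1/nB, which is smallest when allocation is balanced. A minimal sketch that regenerates the relative-efficiency column:

N  <- 20
nA <- 2:18
nB <- N - nA
rel.eff <- (4/N) / (1/nA + 1/nB)     # = 1 when nA = nB = 10
data.frame(nA, nB, rel.eff = round(rel.eff, 3))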
14.A Sample SAS Code

/* Input expected pup counts for each arm; */

References

…F test in ANOVA. Journal of Educational and Behavioral Statistics 29 (2): 251–255. https://doi.org/10.3102/10769986029002251.
O'Brien, R.G. and Muller, K.E. (1993). Unified power analysis for t-tests through multivariate hypotheses. In: Applied Analysis of Variance in Behavioral Science (ed. L.K. Edwards), 297–344. New York: Marcel Dekker.
Pearson, E. and Hartley, H.O. (1958). Biometrika Tables for Statisticians, vol. 1. Cambridge: Cambridge University Press.
Prodinger, P.M., Bürklein, D., Foehr, P. et al. (2018). Improving results in rat fracture models: enhancing the efficacy of biomechanical testing by a modification of the experimental setup. BMC Musculoskeletal Disorders 19: 243. https://doi.org/10.1186/s12891-018-2155-y.
Sena, E.S., van der Worp, H.B., Bath, P.M. et al. (2010). Publication bias in reports of animal stroke studies leads to major overstatement of efficacy. PLoS Biology 8 (3): e1000344. https://doi.org/10.1371/journal.pbio.1000344.
Stroup, W.W. (2011). Living with generalized linear mixed models. SAS Global Forum 2011: Statistics and Data Analysis. https://support.sas.com/resources/papers/proceedings11/349-2011.pdf (accessed 2019).
Stroup, W.W. (2012). Generalized Linear Mixed Models: Modern Concepts, Methods and Applications. Boca Raton: Chapman & Hall/CRC.
Szucs, D. (2016). A tutorial on hunting statistical significance by chasing N. Frontiers in Psychology 7: 1444. https://doi.org/10.3389/fpsyg.2016.01444.
Ware, J.J. and Munafò, M.R. (2016). Significance chasing in research practice: causes, consequences, and possible solutions. Addiction 110 (1): 4–8. https://doi.org/10.1111/add.126714.
Wasserstein, R.L. and Lazar, N.A. (2016). The ASA's statement on P-values: context, process, and purpose. The American Statistician 70: 129–133. https://doi.org/10.1080/00031305.2016.1154108.
Winer, B.J., Brown, D.R., and Michels, K.M. (1991). Statistical Principles in Experimental Design, 3e. New York: McGraw-Hill.
Zar, J.H. (2010). Biostatistical Analysis, 5e. Upper Saddle River: Prentice-Hall.
15
A Bestiary of Effect Sizes
BOX 15.1
Effect Sizes

Choose an effect size (or effect size family) most appropriate for testing study hypotheses.
The effect size must be relevant to study objectives (context).
Effect size does not imply causality.
Effect size does not imply statistical significance.
Effect size does not imply practical significance.

15.2 Effect Size Basics

Choosing the most appropriate effect size depends on the study objectives and the study context: the study design or statistical model, and the type of effect most relevant to the research question and study objectives. How the effect size is calculated depends on whether or not the effect size is to be standardised, and if standardised, how the standardiser is to be calculated (Cumming 2012).

Regardless of which family of effect sizes is selected, all effect size calculations require the following information:

▪ The primary outcome. The type of outcome variable dictates which class or family of effect sizes can be used.
▪ The pre-specified difference between groups most likely to be of practical or biological significance.
▪ The number of groups to be compared.
▪ The desired probabilities for type I error (α) and type II error (β).
▪ Whether comparisons are one-sided or two-sided.
▪ The allocation ratio, or sample size balance among groups.

During planning and initial sample size calculations, careful thought should be given to choosing preliminary values for first-pass approximations. Reasonable initial values can be obtained from systematic reviews and meta-analyses, targeted literature reviews, pilot data, and data from previous experiments (Rosenthal 1994; Ferguson 2009; Cumming 2012; Lakens 2013). Lakens (2013) provides an excellent tutorial on calculating and reporting effect sizes.

The 'best' effect size is that which provides the most power for a given sample size (Box 15.2). This is usually an effect size based on a continuous variable (rather than binary or time-to-event variables) and a balanced design (equal numbers of experimental units per treatment arm). Studies using binary or time-to-event outcomes, or designs with unequal allocation, require many more subjects than do balanced studies of continuous outcomes.

15.3 d Family Effect Sizes

The d family of effect sizes describes differences between independent groups. For continuous variables, these are expressed as differences between group means. Effect sizes based on group differences may be unstandardised or standardised (Box 15.3).

BOX 15.3
d-Family Effect Sizes

Effect sizes describing differences between independent groups.
Unstandardised effect size is a simple difference.
Standardised effect size is the difference scaled by the sample variation.

An unstandardised effect size expresses the difference between groups without adjusting for variation in the sample. Raw effect sizes retain the original units of measurement and thus have the enormous advantage of being easy to interpret when the units of measurement are themselves meaningful. For these reasons, raw effect sizes are often used in meta-analyses of health research (Vedula and Altman 2010). Examples of unstandardised effect sizes
…of two means divided by the standard deviation of the paired difference sD:

ds,pair = (x̄2 − x̄1) / sD

where sD = √[(s1² + s2²)/2]. Accounting for the pairing between differences reduces the variation. Therefore sD will be smaller, estimates more precise, and effect sizes larger than if the standard deviation was calculated assuming independence of observations.

15.4 r Family (Strength of Association) Effect Sizes

The r family effect sizes describe the relationship between variables as ratios of shared variance. They are used for correlation, regression, and analysis of variance (ANOVA) models (Box 15.4). The total variance of the data is partitioned into several different pieces or sources of variation.

BOX 15.4
r Family Effect Sizes

Effect sizes that describe the relationship between variables as ratios of shared variance (strength of association).

15.4.1 Correlation

The most familiar metric from the r family is Pearson's correlation coefficient r. Correlation is a measure of the strength of linear association between variables. It is estimated as the ratio of shared variance between two variables:

r(x,y) = cov(x, y) / (sx sy)

where sx and sy are the sample standard deviations for each variable, and the covariance between two sample variables x and y is

cov(x, y) = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1)

The effect size is specified by the difference between the correlation expected for the population under the null, r0, and the correlation r hypothesised under the alternative hypothesis (the postulated effect of the test intervention), such that

f = r − r0

15.4.2 Regression

The linear regression model describing the relationship of Y in relation to X is

Y = β0 + β1 X + ε

where β0 is the intercept, β1 is the change in Y with a unit change in X (the slope), and ε is the random error, which is normally distributed with mean zero and variance σ².

Effect size. For regression models, the effect size can be quantified by the regression coefficient or the coefficient of determination R². Regression coefficients β provide the most immediately obvious information about the magnitude of the effect. In the simplest case, with two treatment groups and no other predictors, the groups can be defined by a single indicator variable such that X = 0 for the comparator and X = 1 for the test group. Then the effect d is the estimate of the true difference in means between the two groups: d = β1. The standardised effect size is the regression coefficient divided by the pooled within-group standard deviation of the outcome measure (Feingold 2015). The residual standard deviation (root mean square error) of the regression is used when there are two or more predictors. However, the estimated size of the effect for models with multiple predictors will depend on the scale of each independent variable. Therefore effect sizes based on different variables in the same study, or variables across multiple studies, cannot be compared directly (Lorah 2018).

Regression models have at least three sources of variation: the total sum of squares, the regression sum of squares, and the error sum of squares. The structure of the model and how variance is partitioned are shown by constructing an ANOVA table (Table 15.2).
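A minimal R sketch of the two ideas just described: the unstandardised effect d recovered as the slope β1 of an indicator-variable regression, and Pearson's r computed from its covariance definition. All data are simulated for illustration:

set.seed(15)
x <- rep(c(0, 1), each = 20)          # 0 = comparator, 1 = test group
y <- 10 + 3*x + rnorm(40, 0, 4)       # true group difference = 3
coef(lm(y ~ x))["x"]                  # d = beta1, the mean difference
diff(tapply(y, x, mean))              # identical to the slope
u <- rnorm(40); v <- 0.6*u + rnorm(40, 0, 0.8)
cov(u, v)/(sd(u)*sd(v))               # r from the definition
cor(u, v)                             # same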
Table 15.2: Analysis of variance table for the simple linear regression with one predictor variable X. (Columns: source of variation; degrees of freedom (df); sum of squares (SS); mean square (MS); F.)
The total sum of squares (SST) is the sum of squared deviations of observations yi from the mean of those observations ȳ.

The regression sum of squares (RSS) describes the amount of variation among the y values that can be attributed to the linear model. The least squares estimate for the coefficient β1 is calculated as

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

Correlation r is related to the slope b1 as

r = b1 (sx/sy)

The adjusted R² (R²adj) is a corrected goodness-of-fit metric that adjusts for the number of terms in the model:

R²adj = 1 − (1 − R²)(N − 1)/(N − k)
15.4.3 Analysis of Variance (ANOVA) Methods

ANOVA models belong to the r family and are mathematically equivalent to regression. The F-statistic is the ratio of the estimate of the between-groups, or treatment, mean square to the error mean square. The single-factor ANOVA partitions variation into between-group variation with k − 1 degrees of freedom (where k is the number of groups) and residual (within-group, or error) variation with k(n − 1) degrees of freedom. Between-group variation assesses treatment effects.

Effect size. The effect size f for ANOVA designs is

f = √[η² / (1 − η²)]

where η² is the correlation ratio. The correlation ratio η² is analogous to R², and for a single-factor ANOVA, η²p = R². It describes the proportion of the total variation (total sum of squares, SStotal) in the response variable accounted for by differences between groups (between-group sum of squares, SSbetw):

η² = SSbetw / SStotal

For example, an η² of 0.2 means that 20% of the total variance can be attributed to group membership. It is the uncorrected effect size estimate determined by the variance explained by group membership in the sample for a single study (Lakens 2013). However, studies based on η² will be considerably underpowered; several authors recommend strongly against using η² as an ANOVA effect size estimator (Troncoso Skidmore and Thompson 2013; Albers and Lakens 2018). If there are only two independent groups, f is related to Cohen's ds as f = ds/2.

The partial correlation ratio η²p is the 'standardised' effect size. It is appropriate only for fixed-factor designs without covariates. Software programs such as G∗Power require η²p for sample size calculations. It is calculated as

η²p = SSbetw / (SSbetw + SSerror) = SSbetw / SStotal

η²p can also be estimated from existing or exemplary F-values and degrees of freedom (df) as

η²p = (F × dfnum) / (F × dfnum + dfdenom)

where dfnum is the numerator, or between-groups, degrees of freedom, and dfdenom is the denominator, or error, degrees of freedom (Cohen 1988).

A major drawback to both η² and η²p is that they are biased. Their application in power calculations can lead to underpowered studies because the sample size estimates will be too small. Estimates of η²p become more complicated to calculate as the study design becomes more complex, because η²p is determined by the different sources of variation contributed by additional factors, factor interactions, nesting, and blocking (Lakens 2013). η²p can be used for comparisons of effect sizes across studies, but only if study designs are similar and there are no covariates or blocking variables (Lakens 2013). When reporting effect sizes for ANOVA, it is recommended to report both the generalised η² and the standardised η²p.

When sample size is small, the partial omega-squared ω²p is a less-biased alternative to η² and η²p (Olejnik and Algina 2003; Albers and Lakens 2018). It estimates the proportion of variance in the response variables accounted for by the explanatory variables. For the completely randomised single-factor ANOVA, ω²p can be calculated from the F-statistic and the associated degrees of freedom for treatment and error:

ω²p = (F − 1) / [F + (dfdenom + 1)/dfnum]

Warning: This formula cannot be used for repeated-measures designs. Neither η² nor ω²p effect sizes are applicable to ANOVA with observational categorical explanatory variables (e.g. sex) or if covariates are included.
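These conversions are one-liners in R. A minimal sketch, with illustrative F and degrees of freedom; the f conversion follows the f = √[η²/(1 − η²)] definition above:

F.val <- 4.2; df.num <- 3; df.den <- 20               # illustrative values
eta2p   <- (F.val*df.num)/(F.val*df.num + df.den)     # partial eta-squared
f       <- sqrt(eta2p/(1 - eta2p))                    # Cohen's f
omega2p <- (F.val - 1)/(F.val + (df.den + 1)/df.num)  # partial omega-squared
c(eta2p = eta2p, f = f, omega2p = omega2p)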
15.5 Risk Family Effect Sizes

When outcome measures are binary (0/1; yes/no; present/absent), the effect size is assessed as a comparison between proportions. It is common in epidemiological investigations for effect sizes for proportions to be expressed as the risk difference, relative risk (RR), or odds ratio (OR) (Sánchez-Meca et al. 2003; Machin et al. 2018). Cramer's V is a measure of the strength of association between two nominal variables. It is typically obtained from χ² contingency tables (Box 15.5).

BOX 15.5
Risk Family Effect Sizes

Effect sizes for binary outcomes: risk difference, relative risk, odds ratio.
Effect size for nominal variables: Cramer's V.

15.5.1 Risk Difference, Relative Risk, and Odds Ratio

Table 15.3 shows how proportions are structured for comparisons between the test group and a comparator, or standard. Sample SAS code for calculating RR and OR is provided in Appendix 15.A. Effect size metrics are calculated as follows:

The risk difference, or absolute risk reduction (ARR), is the difference between proportions:

ARR = b/n1 − a/n0 = p1 − p0

If the absolute risk difference is zero, this indicates there is no effect.

The number needed to treat (NNTT) is the reciprocal of the risk difference:

NNTT = 1/ARR

This is an estimate of the number of subjects that need to be treated for one extra subject to benefit.

The RR, or risk ratio, is the ratio of two proportions. It compares the probability, or risk, of an event occurring in one group relative to that for a comparator group:

RR = (b/n1)/(a/n0) = p1/p0

When RR is 1, risk for the two groups is the same; that is, an RR of 1 indicates no effect. If RR > 1, there is increased risk in group 1 relative to group 2; if RR < 1, risk is decreased in group 1 relative to group 2. Therefore an RR of 0.5 means that the risk has been reduced by one-half, or 50%. Confidence intervals should be reported to give an idea of the precision of the estimate.

There are two possible ways to compute RR, depending on whether the presence or absence of an event is of interest. Scientific or clinical judgement is required to determine which RR is appropriate, as they are not simple reciprocals of one another.

The OR is the ratio of the odds of occurrence of an event (or 'success') p relative to the odds of failure (1 − p) in a test group relative to those for a comparator group:

OR = [p1/(1 − p1)] / [p0/(1 − p0)]

In case–control studies, the OR is the measure of association between an exposure or risk factor and occurrence of disease.
Table 15.3: Calculation of proportions for comparisons of a test group against a control.
When OR is 1, the odds associated with success are similar in the two groups; that is, an OR of 1 indicates no effect. If OR > 1, odds are increased in group 1 relative to group 2, and if OR < 1, odds are decreased in group 1 relative to group 2. Rule-of-thumb values for OR effect sizes have been reported as 'small' for OR = 1.68, 'medium' for OR = 3.47, and 'large' for OR = 6.71 (Chen et al. 2010). If the ORs are obtained from logistic regression coefficients, they must be standardised using the root mean square before use. ORs obtained from logistic regression are usually unstandardised, as they depend on the scale of the predictor.

15.5.2 Interpretation

Both RR and OR are relative measures of effect, and therefore unaffected by changes in baseline risk. However, proper interpretation requires that both raw numbers and baseline risk be reported. Absolute changes in risk are more important than relative changes in risk. Sensible interpretation requires consideration of both RR and OR as ratio problems; that is, assessment of magnitudes is based on reciprocals, not simple arithmetic differences.

When the outcome event is rare, RR and OR will be approximately equal. This is because both (1 − p1) and (1 − p2) will be close to 1, so OR will approach RR. As a result, OR can be regarded as an estimate of RR when incidence of an event is low in both groups (usually <10%). As incidence or baseline risk increases, RR and OR become more dissimilar.

If there are zero events in the comparator group (p0 = 0), neither the RR nor the OR can be calculated (because the denominator is zero). If the occurrence of the event in the intervention group is 100% (p1 = 1), the OR cannot be calculated.

To more readily interpret OR as a change in number of events, Higgins et al. (2022) recommend converting the OR to RR and then interpreting RR relative to a range of assumed comparator or baseline group risks (ACR):

RR = OR / [1 − ACR (1 − OR)]

ORs can present difficulties with interpretation, especially for assessing benefits and harms. For example, an OR = 0.20 cannot be interpreted as an 80% reduction in risk. It means that the odds of the event occurring in the first group (e.g. the test group) are 0.2 times the odds in the second group (e.g. the control). However, the OR has several advantages over the RR. ORs are symmetrical with respect to the outcome definition; that is, interpretation does not depend on which group is used as the numerator or denominator, or whether the emphasis is on the occurrence of an event or its failure to occur. In contrast, interpretation of the RR does depend on the outcome definition. Because the choice of reference comparison can greatly affect interpretation, it is important to clearly identify which group is the 'control' or reference comparator group and whether that group is the numerator or denominator of the ratio. For example, Cochrane Reviews use OR < 1 as favouring the treatment group, not the control (Higgins et al. 2022).

For observational studies, the study design (e.g. case-control or cohort studies), sampling strategy, case type (incident or prevalent), and source population (fixed or dynamic) will determine which measure of association can be used (Knol et al. 2008; Labrecque et al. 2021).

Example: Survival or Mortality: Which?

A study assessed the number of subjects that survived or died following a course of experimental treatment A relative to control treatment B. Results were as follows:

Intervention   Alive   Dead   Total
Group A          75      25     100
Group B          50      50     100

The RR of death for group B relative to group A is

RR = (50/100)/(25/100) = 0.50/0.25 = 2.0
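A minimal R sketch computing the full set of risk-family effect sizes for this example:

pA <- 25/100                        # risk of death, group A
pB <- 50/100                        # risk of death, group B
ARR <- pB - pA                      # risk difference = 0.25
NNT <- 1/ARR                        # 4 treated per extra survivor
RR  <- pB/pA                        # RR of death, B vs A = 2.0
OR  <- (pB/(1 - pB))/(pA/(1 - pA))  # odds ratio = 3.0
c(ARR = ARR, NNT = NNT, RR = RR, OR = OR)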
Effect size for time-to-event data is most appropriately expressed as a hazard ratio (HR). The HR accounts for both the number and the timing of events.

Various rule-of-thumb benchmark values have been proposed to indicate how 'meaningful' an effect size is, usually in subjective terms such as 'small', 'medium', and 'large'. Benchmarks may be useful as first-pass sample size approximations for study planning purposes. They can also be useful for assessing results in comparison with results that are already well understood or validated. However, in most circumstances, effect sizes should be directly calculated from the individual components (Box 15.7). Predetermined ('canned') benchmarks should be avoided for the following reasons (Box 15.8):

BOX 15.7
Interpreting Effect Size

Unless you know what the effect size index is and how it was calculated, you cannot interpret it.
Benchmark effect sizes should be avoided because they are
▪ Irrelevant to animal-based research
▪ Too strict
▪ Subject to technical problems
▪ Easily gamed

BOX 15.8
Determining and Interpreting Effect Sizes

▪ Do not use subjective effect size benchmarks ('small', 'medium', 'large') for planning studies or interpreting results.
▪ The magnitude of the effect size may be determined more by study design and variation than by the anticipated difference between intervention groups.
▪ Consider effect sizes like any other ratio (both numerator AND denominator are important).

Predetermined effect sizes are irrelevant to animal-based studies. By far the leading objection to benchmark criteria is that they were originally developed for social science and psychology studies and are meaningless without a practical frame of reference (Thompson 2002). Therefore, the criteria defining the 'size' of an effect may be neither appropriate nor relevant for pre-clinical or veterinary clinical studies.

Predetermined effect sizes are too strict. This was recognised by Cohen himself (Cohen 1988) and others (Thompson 2002; Funder and Ozer 2019). Effect size must always be determined and interpreted by study context and practical significance. For example, an effect size of R² = 0.1 in a large clinical trial with death and major morbidity as a primary outcome is of far greater practical significance than it would be in a small laboratory study measuring weakly validated biomarker concentrations or behaviour characterised by large between- and within-subject variability.

Predetermined effect sizes have numerous technical problems. The magnitude of an effect (and therefore any resulting interpretation of its biological importance) will be affected by the presence of covariates, data with non-normal error structure and/or variances, non-independence of observations, and small sample sizes (Nakagawa and Cuthill 2007). Frequently, conversion of one type of effect size metric to another can result in discordant and nonsensical benchmark values, for example conversion of d effect sizes to r (Lakens 2013). Sampling distributions will generally be non-normal and highly skewed, so the smaller the sample size, the less reliable the effect size estimate will be (Albers and Lakens 2018).

Predetermined effect sizes are subject to 'gaming'. An unfortunately common practice among investigators is manipulating effect size calculations to obtain a preferred sample size. This will often result in estimates of implausibly large effect sizes, so that future studies will be extremely underpowered for more biologically realistic and plausible effect sizes.

15.7.1 Interpreting Effect Sizes as Ratios

Cumming (2012) points out that sensible construction and interpretation of effect sizes should consider effect size as a ratio problem. That is, both the numerator and denominator are important and should be considered equally in the
interpretation of effect size. Two studies with the same effect sizes may differ in the magnitude of the difference between groups (numerator), the amount of variation in the two samples (denominator), or both. If variation between experimental units is small, then the denominator is small, resulting in large effect sizes. This may conceal the fact that differences between groups are trivial and of no practical importance. Conversely, large standardised effect sizes may be a reflection of better study design (control of variation), not larger effects per se (differences between groups). Neglect of basic design features, such as ensuring quality of data sampling, collection, and measurement, and especially incorporation of allocation concealment (or blinding), can also account for non-trivial effect sizes (Rosenthal 1994; Funder and Ozer 2019). For example, a meta-investigation of mouse models of amyotrophic lateral sclerosis provided strong evidence that unattributed sources of variation (such as litter effects) were major contributors to apparent therapeutic efficacy of candidate drugs, rather than true treatment effects (Scott et al. 2008; Perrin 2014).

15.7.2 What Is a Meaningful Effect Size?

A 'meaningful' effect size (Box 15.8) is described by the effect size estimate together with a measure of precision (such as 95% confidence intervals). Meaning is context-dependent and is determined by model validity, goals of the intervention, predefined criteria for what constitutes biological or clinical 'importance' (Kazdin 1999; Thompson 2002; Nakagawa and Cuthill 2007), and evaluation of the estimated effect size in comparison with values reported for other similar studies (Schuele and Justice 2006). Unfortunately, discussions of research results are often framed in terms of 'statistical significance' and P-values alone, which do not convey much useful scientific information (Cohen 1990; Thompson 2002; Nakagawa and Cuthill 2007; Cumming 2012; Kelley and Preacher 2012; Schober et al. 2018). 'Statistical significance' does not mean that the observed difference is large enough to be of any practical or biological importance. Common misunderstanding of P-values and significance has been discussed in detail elsewhere (Greenland et al. 2016).

Effect sizes and 95% confidence intervals in original measurement units have the advantage of being more readily interpretable than standardised effect sizes. Reporting of effect sizes and measures of precision, whether or not they are 'statistically significant', has the further advantage of allowing inclusion in systematic reviews and meta-analyses (Nakagawa and Cuthill 2007). Systematic reviews and meta-analyses increase data 'shelf life' and minimise waste of animals and resources in additional non-informative experiments (Reynolds and Garvan 2020).

15.A Using SAS Proc Genmod to Calculate OR and RR

(Adapted from Spiegelman and Hertzmark 2005.) As an alternative to fitting a logistic regression model, OR and RR can be calculated by maximum likelihood with the appropriate binomial link function:

*to calculate OR, use the logit link;
proc genmod descending;
  class x;
  model y = x / dist = binomial link = logit;
  estimate 'Beta' x 1 -1 / exp;
run;

*to calculate RR, use the log link;
proc genmod descending;
  class x;
  model y = x / dist = binomial link = log;
  estimate 'Beta' x 1 -1 / exp;
run;
A Bestiary of Effect Sizes 179
clinical trials. Statistics in Medicine 38: 301–314. Hojat, M. and Xu, G. (2004). A visitor’s guide to effect
https://doi.org/10.1002/sim.7981. sizes – statistical significance versus practical (clinical)
Albers, C. and Lakens, D. (2018). When power analyses importance of research findings. Advances in Health
based on pilot data are biased: inaccurate effect size Sciences Education 9 (3): 241–249.
estimators and follow-up bias. Journal of Experimental Kazdin, A.E. (1999). The meanings and measurement of
Social Psychology 74: 187–195. https://doi.org/ clinical significance. Journal of Consulting and Clinical
10.1016/j.jesp.2017.09.004. Psychology 67: 332–339.
Chen, H., Cohen, P., and Chen, S. (2010). How big is a big Kelley, K. and Preacher, K.J. (2012). On effect size. Psy-
odds ratio? Interpreting the magnitudes of odds ratios chological Methods 17 (2): 137–152. https://doi.
in epidemiological studies. Communications in Statis- org/10.1037/a0028086.
tics: Simulation and Computation 39: 860–864. Knol, M.J., Vandenbroucke, J.P., Scott, P. et al. (2008).
Cohen, J. (1988). Statistical Power Analysis for the Behav- What do case-control studies estimate? Survey of
ioral Sciences. Hillsdale: Lawrence Erlbaum methods and assumptions in published case-control
Associates. research. American Journal of Epidemiology 168 (9):
Cohen, J. (1990). Things I have learned (so far). American 1073–1081.
Psychologist 45 (12): 1304–1312. https://doi.org/ Labrecque, J.A., Hunink, M.M.G., Ikram, M.A., and
10.1037/0003-06615.45.12.1304. Ikram, M.K. (2021). Do case-control studies always
Cumming, G. (2012). Understanding the New Statistics: estimate odds ratios? American Journal of Epidemiol-
Effect sizes, Confidence Intervals, and Meta-Analysis. ogy 190 (2): 318–321. https://doi.org/10.1093/
New York: Routledge. aje/kwaa167.
Draper, N., and Smith, H. (1998). Applied Regression Lakens, D. (2013). Calculating and reporting effect sizes
Analysis, 3rd ed. New York: Wiley. to facilitate cumulative science: a practical primer for t-
Feingold, A. (2015). Confidence interval estimation for tests and ANOVAs. Frontiers in Psychology 2013: 4.
standardized effect sizes in multilevel and latent https://doi.org/10.3389/fpsyg.2013.00863.
growth modeling. Journal of Consulting and Clinical Lorah, J. (2018). Effect size measures for multilevel mod-
Psychology 83 (1): 157–168. https://doi.org/ els: definition, interpretation, and TIMSS example.
10.1037/a0037721. Large-Scale Assessments in Education 6: 8. https://
Ferguson, C.J. (2009). An effect size primer: a guide for doi.org/10.1186/s40536-018-0061-2.
clinicians and researchers. Professional Psychology: Machin, D., Cheung, Y.B., and Parmar, M. (2006) Sur-
Research and Practice 40 (5): 532–538. vival Analysis: A Practical Approach 2nd edition, Wiley.
Funder, D.C. and Ozer, D.J. (2019). Evaluating effect size Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H.
in psychological research: sense and nonsense. (2018) Sample Sizes for Clinical, Laboratory and Epide-
Advances in Methods and Practices in Psychological miology Studies. 4th Ed. Wiley.
Science 2: 156–168. Nakagawa, S. and Cuthill, I.C. (2007). Effect size,
Glass, G.V. (1976). Primary, secondary, and meta- confidence interval and statistical significance: a prac-
analysis of research. Educational Researcher 5: 3–8. tical guide for biologists. Biological Reviews of the Cam-
https://doi.org/10.2307/1174772. bridge Philosophical Society 82 (4): 591–605. https://
Greenland, S., Senn, S.J., Rothman, K.J. et al. (2016). Sta- doi.org/10.1111/j.1469-18515.2007.00027.15.
tistical tests, P values, confidence intervals, and power: Erratum in: Biol Rev Camb Philos Soc. 2009 84(3):515.
a guide to misinterpretations. European Journal of Epi- Olejnik, S. and Algina, J. (2003). Generalized eta and
demiology 31: 337–350. https://doi.org/10.1007/ omega squared statistics: measures of effect size for
s10654-016-0149-3. some common research designs. Psychological Methods
Hedges, L.V. and Olkin, I. (1985). Statistical Methods for 8 (4): 434–447.
meta-analysis. San Diego, CA: Academic Press. Perrin, S. (2014). Make mouse studies work. Nature 507:
Higgins, J.P.T., Li, T., and Deeks, J.J. (2022). Chapter 6: 423–425.
Choosing effect measures and computing estimates Reynolds, P.S. and Garvan, C.S. (2020). Gap analysis of
of effect. In: Cochrane Handbook for Systematic swine-based hemostasis research: “Houses of brick or
Reviews of Interventions version 6.3 (Updated February mansions of straw?”. Military Medicine 185 (Suppl 1):
2022) (ed. H. JPT, J. Thomas, J. Chandler, et al.). 88–95. https://doi.org/10.1093/milmed/usz249.
Cochrane Available from www.training.cochrane. Rosenthal, R. (1994). Parametric measures of effect size.
org/handbook. In: The Handbook of Research Synthesis (ed. H. Cooper
and L.V. Hedges), 231–244. New York: Sage.
180 A Guide to Sample Size for Animal-based Studies
Sánchez-Meca, J., Marín-Martínez, F., and Chacón- psychology. Journal of Educational Psychology 102
Moscoso, S. (2003). Effect-size indices for dichoto- (4): 989–1004.
mized outcomes in meta-analysis. Psychological Meth- Thompson, B. (2002). Statistical, practical, and clinical:
ods 8 (4): 448–467. https://doi.org/10.1037/ how many kinds of significance do counselors need
1082-98915.8.4.448. to consider? Journal of Counseling and Development
Schober, P., Bossers, S.M., and Schwarte, L.A. (2018). Sta- 80 (1): 64–71.
tistical significance versus clinical importance of Tierney, J.F., Stewart, L.A., Ghersi, D. et al. (2007). Prac-
observed effect sizes: what do p values and confidence tical methods for incorporating summary time-to-
intervals really represent? Anesthesia and Analgesia event data into meta-analysis. Trials 8: 16. https://
126 (3): 1068–1072. https://doi.org/10.1213/ doi.org/10.1186/1745-6215-8-16.
ANE.0000000000002798. Troncoso Skidmore, S. and Thompson, B. (2013). Bias
Schuele, C.M. and Justice, L.M. (2006). The importance and precision of some classical ANOVA effect sizes
of effect sizes in the interpretation of research: primer when assumptions are violated. Behavior Research
on research: part 3. The ASHA Leader. 11 (10): Methods 45: 536–546. https://doi.org/10.3758/
https://doi.org/10.1044/leader. s13428-012-0257-2.
FTR4.11102006.14. van Belle, G.G. (2008). Statistical Rules of Thumb, 2nd
Scott, S., Kranz, J.E., Cole, J. et al. (2008). Design, edition. New York: Wiley.
power, and interpretation of studies in the standard Vedula, S.S. and Altman, D.G. (2010). Effect size estima-
murine model of ALS. Amyotrophic Lateral Sclerosis tion as an essential component of statistical analysis.
9: 4–15. Archives of Surgery 145 (4): 401–402. https://doi.
Spiegelman, D. and Hertzmark, E. (2005). Easy SAS cal- org/10.1001/archsurg.2010.33.
culations for risk or prevalence ratios and differences. Zwetsloot, P.P., Van Der Naald, M., Sena, E.S. et al.
American Journal of Epidemiology 162: 199–205. (2017). Standardized mean differences cause funnel
Sun, S., Pan, W., and Wang, L.L. (2010). A comprehen- plot distortion in publication bias assessments. eLife
sive review of effect size reporting and interpreting 6: e24260. https://doi.org/10.7554/eLife.24260.
practices in academic journals in education and
16
Comparing Two Groups:
Continuous Outcomes
BOX 16.1
16.1 Introduction Two-Group Designs
The simplest experimental design is for compari- The simplest between-group design usually compar-
son of two groups (Box 16.1). In general, these ing a test intervention against a control.
designs involve a test intervention and a control, Two-arm designs are powerful tools for assessing
randomly allocated to experimental units in each efficacy, especially for veterinary clinical trials.
group. Due to their simple structure, two-arm Two-arm are not recommended for exploratory
designs are useful for definitive veterinary clinical studies with multiple variables; t-tests are commonly
trials, which require large sample sizes and used to compare continuous outcomes for two
sufficient power for testing efficacy. However, groups.
A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
182 A Guide to Sample Size for Animal-based Studies
two-arm designs are not the best choice for most and confidence. However, sample sizes can also
laboratory rodent studies, which are mostly explor- be determined to obtain a pre-specified precision
atory with multiple explanatory variables. Alterna- or tolerance for the difference, or for a range of sam-
tive designs are discussed in Chapter 19. ple variation that can detect the pre-specified differ-
When outcome variables are continuous and nor- ence with a given power.
mally distributed, sample size can be approximated
by formulae based on the t or the z distribution. The
t-distribution describes the standardised distances 16.2.1 Asymptotic Large-Scale
of sample means to the population mean. The t- Approximation
statistic is based on the sample variance. The shape Sample size based on the z-distribution for a two-
of a t-distribution depends on the degrees of free- group comparison is:
dom (df), where df = n − 1, so the critical t-value tcrit 2
is given by tα/2, n − 1 for a two-sided test and tα, n − 1 z1 − α 2 + z1 − β
N= 2
for a one-sided test. In contrast, the approximation d s
based on the z-distribution assumes the population
where the difference d is the difference between
variance σ is known, and the population is normally
sample means d = x 1 − x 2 , s is the sample standard
distributed. The z-distribution can be used when
deviation, d/s is the effect size, and z1 − α/2 and
sample sizes n are large (n > 30) because as n
z1 − β are the z-scores for confidence and power,
increases, the t-distribution approaches the z-distri-
respectively.
bution. However, the large-scale asymptotic approx-
To detect an increase in the average by one stand-
imation based on the z-distribution should be used
ard deviation, the relation becomes
with caution. Although sample size estimates may
useful for initial planning, sample sizes will be 2
z1 − α 2 + z1 − β 2
seriously underestimated compared to exact deter- N= 2 = z1 − α 2 + z1 − β
1 1
minations from the t-distribution. A drawback of
the exact method is that it must be determined by or the square of the summed z-scores for confidence
numerical iteration over a candidate range of n, and power.
although this is readily performed with simple
computer code. Sample SAS codes are given in
Appendix 16.A. 16.2.2 Sample Size Based on the
t-Distribution
16.2 Sample Size Calculation When the study is small, estimating sample size
directly from the t-distribution will provide more
Methods exact approximations. The sample size calculation
Sample size determinations for two-group compar- for a two-sided comparison of two samples is
isons require the minimum information for sample
2
size planning (Chapter 14): pre-specified confidence N = t2α 2,n − 1 d s
and power, and the minimum biologically relevant
difference to be detected. In addition, information is where t is the critical t-value for confidence α/2 and
required as to directionality (if the comparison is n − 1 degrees of freedom (df).
one-sided or two-sided, as this determines α or The degrees of freedom that define the t-
α/2, respectively), and the type of comparison to distribution is calculated from the total sample size
be made. These include a single sample compared itself. Therefore, sample size has to be estimated by
to a reference value, comparison of two independent iterating over a range of candidate sample sizes, and
samples, or comparison of paired observations on the final sample size is obtained by solving for β:
the same subject. The type of comparison will deter-
mine how the sample standard deviation is calcu- d n
1 − β = tinv t α 2,n − 1
lated and therefore affect estimates of the effect size. s
Commonly, sample sizes for two-group compari-
sons are computed to detect a pre-specified biolog- The inverse function tinv computes the pth quantile
ically relevant difference d with a given power from the t-distribution, with degrees of freedom df
Comparing Two Groups: Continuous Outcomes 183
2
Example: Anaesthesia Duration in Mice di − d
s=
(Data from Dholakia et al. 2017.) Anaesthesia n−1
duration was reported for 16 mice randomly
assigned to receive intraperitoneal injections of Within-subject correlation is a feature of many
either ketamine-xylazine (KX) or ketamine- association-type effect sizes, such as ANOVA–type
xylazine plus lidocaine (KXL). Mean and stand- models, nested designs, and regression. Because
ard deviation for each group were KX: 39 (SD the variance of the paired differences of the n sub-
8) min, and KXL: 30 (SD 9) min. What sample jects is
size is required to estimate a difference between
groups with a 95% confidence interval that is
no wider than 15 min, with power of 80%? s2 = var x 1 + var x 2 − 2 cov x 1 , x 2
In this example, α = 0.05, and β = 0.2. The
pooled standard deviation is and the within-subject correlation r is
Comparing Two Groups: Continuous Outcomes 185
cov x 1 , x 2
r= and 11.7 (SD 3.64) ppm at t1. The mean and stand-
var x 1 var x 2 ard deviation for the paired differences (t1 − t0)
were −12 ppm and 5.8 ppm, respectively. The
(deVore 1987), the variance s2 is also computed as correlation r between baseline and t1 measure-
(s1 + s2 − 2 r s1 s2). Standard statistical packages ments was −0.33.
such as SAS require an estimate of r to estimate What sample size is required to detect a mini-
power for paired t-tests. This is not required in mum difference of 10 ppm 90% of the time with
G∗Power if the mean and standard deviation of 95% confidence?
the differences are used to determine effect size. This is a paired design. Here, α = 0.05 and the
Calculating r on raw data using standard statisti- desired power is 1 − β = 0.9. The 95% confidence
cal packages (e.g. SAS proc corr, R stats cor.test) interval for the difference is
ignores non-independence of the observations and
will result in inaccurate estimates. Instead, sum- d ± tα 2,n − 1 sd = − 12 ± 2 262 5 8
marise the correlation at the subject level by calcu- − 25, +1 ppm
lating the averages for each pair of observations
(x1 + x2)/2 for each subject, then calculating the The effect size is 10/5.8 = 1.724. The sample
correlation between these averages and the size to detect a minimum paired difference of
difference for each pair. This is the foundation of 10 ppm is 12 paired measurements, or 6 cows
the Bland–Altman method used for method with realised power of 0.97 (Appendix 16.A).
comparison.
Table 16.1: Urinary fluoride concentrations (ppm) of 10 cows measured at two different times.
Cattle ID Urinary fluoride concentrations ppm Difference (t1 − t0) Average (t0 + t1)/2
Time 0 Time 1
The standardised effect size for the new trial is then 43 per arm, for a total of 86 dogs.
calculated from the desired mean difference to be
detected (d) divided by the new standard deviation: 1
d/snew. The total sample size for the new trial is com- Standard deviations were estimated by the method of Wan
et al. (2014) from medians, IQR and range data described in
puted as before: the original paper.
Comparing Two Groups: Continuous Outcomes 187
18 meandiff = mean2-mean1;
k= = 1 3845 s2= [VALUE]; *calculate pooled variance;
9 3905
alpha=0.05;
Therefore, the corrected effect size is 0.51. The
sample size for the new trial is * Calculate non-centrality parameter ncp;
ncp = n*(meandiff**2)/(4*s2);
1+1 2
1 96 + 1 282 2 1 962 F_crit = finv(1-alpha,1,n-2,0);
N= + *Solve for power for each n;
1 4 7 82242 2 2 power = 1-probF(F_crit,1,n-2, NCP);
output;
82 per arm, for a total of 164 dogs. end;
With the simplified non-central t-method, θ = run;
*output realised power and corresponding N.
3.485. The sample size for the new trial is Pick N with power that ≥ target power;
2 proc print data=tsample; run;
1+1 3 4852
N=
1 0 71 2
16.A.3 Sample Size for a Fixed Power:
49–50 per arm, for a total of 98–100 dogs. Crossover Design (Cattle Example)
proc power;
pairedmeans test=diff
alpha=0.05
meandiff = 10
std = 5.8
16.A Sample SAS Code for corr = -0.33
Calculating Sample Size for npairs = .
power = .9;
Two-Group Comparisons run;
beta=0.1;
z = quantile("Normal", 1-alpha/2);
References
zbeta=quantile("Normal", 1-beta);
Browne, R.H. (1995). On the use of a pilot sample for
Nmain= (z*z/2)+4*((zbeta+z)**2)/(meandiff/ sample size determination. Statistics in Medicine 14
Spilot)**2; (17): 1933–1940. https://doi.org/10.1002/
run; sim.4780141709.
proc print; run; Debackere, M. and Delbeke, F.T. (1978). Fluoride
pollution caused by a brickworks in the Flemish
countryside of Belgium. International Journal of
Environmental Studies 11: 245–252.
16.B.2 Upper Confidence Interval DeVore, J.L. (1987). Probability and Statistics for Engi-
Correction neering and the Sciences, 2e. Brooks/Cole Publishing.
Dholakia, U., Clark-Price, S.C., Keating, S.C.J., and
data power; Stern, A.W. (2017). Anesthetic effects and body weight
*pilot data;
meandiff = 4;
changes associated with ketamine-xylazine-lidocaine
df=18; administered to CD-1 mice. PLoS ONE 12 (9):
stddev = 5.65; e0184911. https://doi.org/10.1371/journal.
pone.0184911.
alpha = 0.05; Julious, S.A. and Owen, R.J. (2006). Sample size calcula-
beta = 0.1; tions for clinical studies allowing for uncertainty about
the variance. Pharmaceutical Statistics 5: 29–37.
z = quantile("Normal", 1-alpha/2);
zbeta = quantile("Normal", 1-beta); https://doi.org/10.1002/pst.197.
C_crit = quantile('CHISQ',alpha,df ); Karlin, E.T., Rush, J.E., and Freeman, L.M. (2019). A
pilot study investigating circulating trimethylamine
k=sqrt(df/C_Crit); N-oxide and its precursors in dogs with degenerative
UCL=k*stddev; mitral valve disease with or without congestive heart
failure. Journal of Veterinary Internal Medicine 33
Nmain= (z*z/2)+4*((zbeta+z)**2)/(meandiff/ (1): 46–53. https://doi.org/10.1111/jvim.15347.
UCL)**2;
run;
Lehr, R.S. (1992). Sixteen s-squared over d-squared:
proc print; run; A relation for crude sample size estimates. Statistics
in Medicine 11: 1099–1102.
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H.
16.B.3 Simplified Non-Central t (2018). Sample Sizes for Clinical, Laboratory and Epide-
Correction miology Studies, 4e. Wiley-Blackwell.
van Belle, G. (2008). Statistical Rules of Thumb, 2nd
data power; edition. New York: Wiley.
*pilot data; Wan, X., Wang, W., Liu, J., and Tong, T. (2014). Estimat-
df=18; *pilot degrees of freedom; ing the sample mean and standard deviation from the
meandiff=4;
sample size, median, range and/or interquartile range.
spilot=5.65;
BMC Medical Research Methodology 14: 135. https://
ES = meandiff/spilot; doi.org/10.1186/1471-2288-14-135.
Whitehead, A.L., Julious, S.A., Cooper, C.L., and Camp-
alpha = 0.05; bell, M.J. (2016). Estimating the sample size for a pilot
beta=0.1; randomised trial to minimise the overall trial sample
size for the external pilot and main trial for a continu-
z = quantile("Normal", 1-alpha/2);
ous outcome variable. Statistical Methods in Medical
theta = tinv(1-beta,df,z);
Research 25 (3): 1057–1073. https://doi.org/
Nmain = 4*(theta**2)/ES**2; 10.1177/0962280215588241.
run;
proc print;
run;
17
Comparing Two Groups:
Proportions
A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
190 A Guide to Sample Size for Animal-based Studies
=300 5
where (p1 − p0) is the difference d, and p =
p0 + p1 2 is the mean of the two proportions. or approximately 301 surgeries.
The margin of error, or precision, is one-half of
the confidence interval range (u − l)/2. It can also
be calculated as z1-α/2 SE(pdiff). Sample size can
be obtained for a target precision by iterating over 17.2.2 Sex Ratios
a range of candidate sample sizes, calculating the Sex ratio, and factors inducing sex ratio shifts, are
confidence interval for each, and choosing the sam- important information for a variety of disciplines
ple size that approximates the desired precision. such as evolutionary biology, ecology, reproductive
biology, genetics, animal husbandry, and manage-
ment of laboratory populations. In most species
17.2 Difference Between with separate sexes, males and females are pro-
Two Proportions duced in approximately equal numbers, so the
expected sex ratio is 1:1; that is, p0 = p1 = 0.5. How-
17.2.1 One Proportion Known ever, sex ratio can vary in response to environmen-
tal factors such as temperature, food supply,
When one proportion is already known (reference
population density, and parental effects.
proportion p0), then sample size for detecting a dif-
The sex ratio R is calculated as a proportion of the
ference d of the new or anticipated p from the refer-
total number of animals:
ence level, with a significance level of α and power
1 − β is: R = n1 N
2
z1 − α p0 1 − p0 + z1 − β p 1−p where the total sample size N = n1 + n2. Calculating
N= direct ratios of one sex to the other (R = n1/n2) is not
d2
recommended. If there are no representatives of one
sex in the sample, the resulting ‘ratio’ will be
This is usually a one-sided test of the hypothesis either undefined or zero depending on which is
pnew < p0 or pnew > p0, so the significance is α rather the numerator or denominator (Wilson and
than α/2. Hardy 2002).
Comparison of observed sex ratio when the
expected sex ratio is 1:1. Under the null hypothesis,
Example: Before-After Comparison: Mouse
the sex ratio is 1:1, and the probability p of obtaining
Surgery Success
a member of the target sex will be exactly one-half,
A series of surgeries performed on mice had a loss or 0.5 (p0 = p1 = 0.5). Usually, the hypothesis of
rate of 15%. The investigator initiated a rigorous interest is a test of a shift in sex ratio from p0 =
series of retraining and procedural changes, with 0.5 (the null value) to an alternative value p1 with
the goal of reducing losses to 5% or less before power 1 − β.
beginning the new protocol. How many surgeries The equation for sample size is a modification of
need to be observed to be 90% confident that the the former equation for a one-sample test:
change in procedures has reduced attrition to 5%?
2
This tests the hypothesis that pnew < p0, with z1 – α 0 5 + z1 − β 0 25 – d2
proportions p0 = 0.15 and pnew = 0.05, α = 0.05, N=
d2
Comparing Two Groups: Proportions 191
where the reference proportion is p0 = 0.5, p1 is the range of candidate sample sizes. Sample sizes n0 and
new proportion, and d = p0 − p1 is the difference n1 should be equal for simplicity and to minimise
from 0.5 under the alternate hypothesis. the total sample size required for a target power.
The equation is
N = 1 96 + 0 8416 2
0 75 1 − 0 75 + 0 4 0 4
17.2.3 Two Independent Samples = 27 4
0 75 − 0 4 2
When sample sizes in each group are equal, then the
large sample approximation (Wald equation) for
28 for each group, for a total of 56.
total sample size is
With the exact method, the sample size is
32 per group, for a total of 64 and power of
2 p0 1 − p0 + p1 1 − p1
N ≥ z1 − α 2 + z1 − β 80%. For power of 90%, the sample size is
d2 42 per group, for a total of 84 (Figure 17.1).
where subscripts 0 and 1 indicate the control (or ref-
erence) and intervention groups, respectively, and
the difference of interest is d = (p1 − p0). A conserv- 17.2.4 Confidence Intervals
ative estimate for sample size is obtained by assum-
ing that p = 0.5, because p(1 − p) is a maximum at The conventional Wald confidence interval for one
p = 0.5. sample is
Better estimates for sample size are based on the
exact method: solving for power and iterating over a p ± z1 − α 2 p 1–p n
192 A Guide to Sample Size for Animal-based Studies
0.7
w= p0 −l0 2 + u1 −p1 2 + p1 −l1 2 + u0 −p0 2
0.6
The lower (l) and upper (u) confidence
Power
0.5
limits are (A − B)/C and (A + B)/C,
0.4 respectively, where A = 2 ni pi + z21 − α 2 ; B = z1 − α 2
0.3
z21 − α 2 + 4 ni pi 1 − pi ; and C = 2 ni + z21 − α 2 .
0.2
Agresti-Caffo method. The unadjusted propor-
0.1
tions are pi = xi/ni. The adjusted confidence interval
0.0 ‘adds two successes and two failures’ where
10 20 30 40 50 60 70 80 90 100 ni = n + 4 trials and pi = x + 2 n + 4 . The
n per group
adjusted xi are x i = x i + z21 − α 2 4 , adjusted ni are
Figure 17.1: Sample size for evaluating risk difference in
a mouse infection model estimated by the exact method.
ni = ni + z21 − α 2 2 , and adjusted pi are pi = x ni .
For power of 80%, the sample size is 32 per group, and for These adjusted values are substituted into the con-
power of 90%, the sample size is 42 per group. fidence interval and sample size equations
(Agresti and Caffo 2000).
For the difference between two independent pro-
portions, the confidence interval is Example: Mouse Infections: Confidence
Intervals for Difference Between
p1 − p0 ± z 1 − α 2 SE p1 − p0 Proportions
(Data from Zar 2010). Two species of mice were
where the standard error for the difference SE is examined for the presence of a certain parasite.
For species 1, 18 of 24 mice were infected (p0 =
0.75), and for species 2, 10 of 25 were infected
p1 1 − p1 p 1 − p0
SE p1 − p0 = + 0 (p1 = 0.40).
n1 n0 The 95% confidence intervals for the risk differ-
ence (p1 − p0) of 0.35 were constructed by the
However, the Wald equation is not recommended conventional Wald, Newcombe, and Agresti-
for small samples (n < 30 per arm) because it is Caffo methods (SAS code in Garner 2016) and
unstable and usually very inaccurate even for very are summarised in Table 17.1. The Wald interval
large samples. It cannot be used when proportions is wider than either the Newcombe or Agresti-
are extreme (p < 0.1, p > 0.9). Caffo intervals, indicating less precision. The
Two preferred alternatives discussed here
are the Newcombe (Wilson score interval) and Table 17.1: 95% Confidence intervals constructed for
mouse infection data p0 = 0.75 and p1 = 0.40 with a risk
Agresti-Caffo methods. Newcombe (1998) describes difference (p1 − p0) of 0.35.
11 methods for constructing confidence intervals
for two-group comparisons of proportions. SAS Method 95% confidence
interval
code for these and six additional methods are
described by Garner (2016). Wald 0.091 0.609
Newcombe (Wilson score intervals). This method Agresti-Caffo 0.072 0.576
incorporates the lower and upper confidence limits
Newcombe (Wilson score) 0.073 0.561
(l, u) estimated for each proportion (Newcombe
1998; Newcombe and Altman 2000). This method Source: SAS code adapted from Garner (2016).
Comparing Two Groups: Proportions 193
4
ngroup = 2
17.3 Relative Risk and p0 RR − 1
Odds Ratio
Alternatively, it can be estimated by the ln-based
Relative risk (RR), risk differences, and odds ratio
formulation
(OR) are commonly used in epidemiological studies
to describe the probability of an event occurring in
one group relative to another. Study designs can be 8 RR + 1 RR
randomised controlled trials or observational cohort ngroup =
p0 ln RR 2
and case-control studies.
1 − p0 1 − p1 Alternatively,
ln RR ± zα +
n0 p 0 n1 p 1
8 5+1 5
n= = 370 6 ≈ 371, for a total of 742
then back-transforming to the linear scale. For the 0 01 ln 5 2
95% confidence interval zα = 1.96, and the confi-
dence limits are
Rule of thumb. RR reduction is the difference
1 −p0 1− p1 between two RRs (RR1 − RR0). Studies evaluat-
Lower limit l = exp lnRR − 1 96 +
n0 p 0 n1 p 1 ing a rare binary outcome require approxi-
mately 50 events in the control group and an
1 −p0 1− p1 equal number in the exposed group to detect
Upper limit u = exp lnRR + 1 96 + a 50% RR reduction (or halving of risk) with
n0 p 0 n1 p1
80% power (Glasziou and Doll 2006).
194 A Guide to Sample Size for Animal-based Studies
1+1 2
1 96 + 1 2816 2 most common distributions being the Poisson and
N= 2 = 203 8 negative binomial (Cundill and Alexander 2015).
1 0 533 1 − 0 533 ln 3
1 1 1 1
where φi = 2 arcsin( pi). The arcsine transforma- where a = and b = , and sub-
p1 λ 1 p0 λ0
tion is used to stabilise the variance for the binomial
scripts 0 and 1 indicate reference (or control) and
distribution (Rücker et al. 2008, 2009). The total
intervention groups, respectively.
sample size N is then approximated by
Rules of thumb. The square root of a Poisson-
2 distributed sample is approximately normal.
z1 − α 2+ z1 − β
N=4 With Lehr’s approximation (van Belle 2008),
h
sample size per group is approximately
BOX 18.1
18.1 Introduction Examples of Time-to-Event Applications
A time to event outcome combines two separate
Time from disease diagnosis to death.
pieces of information: whether an event did or did
Time to nest failure
not occur in the designated time (Y/N) and the time
Number of nestlings surviving to fledging.
at which the event occurred. The ‘event’ of interest
Comparison of dogs completing guide dog training
may be adverse or classified as a ‘failure’ (death of a
to those withdrawn early.
subject, time to humane endpoint, time to tumour
Time from disease induction to palpable tumour
appearance), neutral (pollinator visits, duration of
appearance
shelter animal stay), or positive (time to complete
Length of pregnancy in cattle following artificial
wound healing, time to conception; Box 18.1). Cen-
insemination.
soring is when an individual subject does not expe-
Time from disease diagnosis to disease resolution.
rience the event of interest in the study time frame,
Time from initial disease exposure to first major
and is a distinguishing feature of time-to-event data.
symptom.
Sample size calculations for time-to-event data
Time from completion of cancer treatment to
are more complicated than those for continuous
tumour recurrence.
or binary endpoints and usually require large sam-
Time between seizures.
ple sizes. This makes ‘survival’ as a primary out-
Time between flower visits by pollinators
come unacceptable for most laboratory studies
A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
200 A Guide to Sample Size for Animal-based Studies
Proportion surviving
be planned for uncertainties in availability to avoid 0.75
unintentionally uneven allocation ratios. Likewise,
follow-up times should be planned with respect 0.50
to the time allotted to enrolment period. If the
study is terminated at a fixed time regardless of
study entry time, individual follow-up times will 0.25
sample size calculations for continuous or binary Figure 18.1: Patterns of survival over time. Type I: Constant
endpoints. However, staggered entry will pro- mortality throughout the study duration. Type II: Low
early mortality; high survival over most of the study period.
foundly affect sample size calculations for time-to-
Type III: High early mortality for most subjects, with a few
event data, as they depend on the number of events longer-term survivors
observed, and therefore both patterns of study entry
and whether or not event rates are low or high
(Witte 2002).
Pattern interpretation should be made with cau-
tion, especially when sample sizes are small,
18.2.4 Data Structure because very small numbers in the tails of the curves
Time-to-event data are characterised by differing make survival estimates unstable (Altman 1991;
time-dependent patterns of survival (time-dependent Altman and Bland 1998).
trajectories or survivorship curve shape) and censoring. Censoring occurs when the event of interest is not
Survivorship curves illustrate the distribution of observed for a subject, either because the event has
survival over time, measured as the number, or pro- not occurred by the time the study ends or because
portion, of individuals surviving at each time t. The it occurred before the subject entered the study.
shape of the curve will be modified by the strength Censored observations are not ‘missing’ in the con-
of the test intervention or agent, genotype, and ventional sense. Subjects may be right-censored
environmental effects. Graphical examination of pat- (subjects do not have the event before the study
terns can provide insight into potential mechanisms ends), left-censored (subjects experience the event
affecting survival in laboratory and clinical investiga- before the study begins), or interval-censored (the
tions. The three main patterns (Figure 18.1) were event occurs within some known time interval,
described in the last century (Pearl and Miner but the exact time cannot be precisely identified)
1935; Demetrius 1978). (Table 18.1; Figure 18.2).
Because of censoring, time-to-event data cannot
Type I survival indicates a constant proportion of be analysed by conventional summary statistics
individuals dying over the course of the study. used for continuous data. Attempting to impose a
normal distribution on these data will result in large
Type II survival exhibits high survival for most of
apparent outliers. If conventional analysis methods
the study. Mortality increases near the end of
are used, data from censored subjects will be
the monitoring period.
excluded because exact survival duration cannot
Type III survival exhibits high early mortality be calculated for censored individuals; times are
followed by low mortality for relatively few indi- known only for those subjects for which the event
viduals over the remainder of the monitoring occurred. Omitting censored data from analyses
period. contributes to survivorship bias.
202 A Guide to Sample Size for Animal-based Studies
Right censoring: The most common type. Occurs when a subject completes the study without experiencing the event.
Type I right censoring occurs when study duration is fixed beforehand.
Type II right censoring occurs when the study is completed when a fixed number of events has occurred (study duration
is variable).
Left censoring occurs when the subject experiences the event before enrolment or monitoring begins.
Examples. In oncology trials, an animal presents with a tumour before the study begins.
Interval censoring: The subject experiences the event within some pre-specified time interval, but the exact time is
unknown. Interval censoring occurs when subjects can be monitored only periodically (for example, at weekly or
monthly intervals).
Examples. Veterinary studies of client-owned animals followed up at scheduled clinic visits; disease progression studies
(Finkelstein and Wolfe 1985; Radke 2003; Rotolo et al. 2017); monitoring of bird nest boxes (Heisey et al. 2007;
Fieberg and DelGiudice 2008).
Pre-study
window Study window
Time to event
Figure 18.2: Types of censoring. (a) Right censoring occurs when a subject completes the study without experiencing the
event. (b) Left censoring occurs when the event occurs before the study begins. (c) Interval censoring occurs when the subject
experiences the event within some pre-specified time interval, but the exact time is unknown (for example, death of the
animal between clinic visits).
Time-to-Event (Survival) Data 203
function S(t) is obtained by multiplying the prob- relationship between the proportion surviving (ps)
abilities at each time point. and the hazard rate h(t) at a given time t, is therefore
S(t) = p1 p2 . = (1 – d1/n1)(1 – d2/n2)…. This is
formally expressed as h t = − ln ps t
tor group
0.50 HR = 1: Both groups (treatment and control) are
experiencing an equal number of events at any
0.25
time t.
HR > 1: More events occurred in the treatment
groups at any time t compared to the control group.
0.0 HR < 1: Fewer events occurred in the treatment
Time
groups at any point in time compared to the control
Figure 18.3: Kaplan-Meier curve of cumulative failure
probability.
group.
Time-to-Event (Survival) Data 205
curves that cross each other are a strong indicator compared to balanced designs: approximately 12%
of non-proportionality. Peto et al. (1977) describe for 2:1 and 33% for 3:1 (Hey and Kimmelman
common errors in survival analyses. 2014). If unequal ratios are planned, sample size
estimates must be corrected by the pre-specified
allocation ratio.
18.4 Time-to-Event Sample
Effect size. The ratio of median survival times is
Size Calculations often the simplest method for estimating the haz-
Sample size estimates for time-to-event data require ard ratio for a new study (Machin et al. 2018):
an estimate of the effect size (hazard ratio), the
expected proportion of subjects in each group that h1 t ln 2 M 1 M0
HR = = =
might experience the event, and the expected h0 t ln 2 M 0 M1
number of subjects showing the event of interest
(Box 18.5). where h1(t) and h0(t) are the hazard rates for the
In effect, two sample sizes must be estimated, the test group and the control or comparator group,
number of subjects expected to have the event (nE), respectively, and M1 and M0 are the correspond-
and the total number of subjects (N). The number of ing medians.
events that occur determines the variance used in
the sample size calculation. If censoring does not Number of events. The total number of subjects
occur for any subject, then E = N. The prevalence with the event nE that need to be observed in
of the disease (or proportion of subjects expected to a two-group study is
have the event) is PE. Then the total sample size is 2
2
r+1 z1 − α 2 + z1 − β
nE nE = 2
N= r ln HR
PE
The allocation ratio describes the balance of sub-
Other information needed for time-to-event sam- jects in each arm. The allocation ratio adjustment
ple size calculations is the allocation ratios or num- is (r + 1)2/r. For an allocation ratio of 1 (balanced
ber of subjects allocated to each group, the desired group sample size), the adjustment factor is
confidence and power. Two-sided tests are recom- 1 + 1 2 1 = 4.
mended because the existence of a survival differ-
ence to be detected or the direction of that The hazard ratio can be approximated from an
difference is usually not known beforehand. Bal- expected clinically relevant effect. For example, to
anced group sizes with a 1:1 allocation ratio result determine a 50% decrease in events with a new
in the most power. For a given power, unbalanced treatment relative to controls, the planning value
designs will require much large sample sizes for HR would be 1/0.5 = 2.0.
N = nE 40 150 N adj = N 1 − pf
be of interest. Suppose a study is proposed evaluat- myxomatous mitral valve disease and cardiomegaly:
ing two factors A and B. With a 2 × 2 factorial the EPIC study-a randomized clinical trial. Journal
design, there are four ‘groups’ to be evaluated: of Veterinary Internal Medicine 30 (6): 1765–1779.
A only, B only, both A and B, and neither A or B. https://doi.org/10.1111/jvim.14586.
Therefore, to assess the effect of each factor, there Dahal, P., Simpson, J.A., Dorsey, G. et al. (2017). Statis-
tical methods to derive efficacy estimates of anti-
are four median survival times to be estimated
malarials for uncomplicated Plasmodium falciparum
and two hazard ratios to be compared.
malaria: pitfalls and challenges. Malaria Journal 16
(1): 430. https://doi.org/10.1186/s12936-017-
1. Estimate hazard ratios for A and B separately 2074-7.
(assuming no interaction). Calculate N Demetrius, L. (1978). Adaptive value, entropy and survi-
for each. vorship curves. Nature 275: 213–214.
2. If the N for each group is roughly similar, Fieberg, J. and DelGiudice, G. (2008). Exploring migra-
choose the larger N. If hazard ratios (and tion data using interval-censored time-to-event
hence Ns) are very different, prioritise factors models. Journal of Wildlife Management 72 (5):
based on importance to the research ques- 1211–1219. https://doi.org/10.2193/2007-403.
tion. For example, suppose A is a comparison Finkelstein, D.M. and Wolfe, R.A. (1985). A semipara-
of the therapeutic effect of the test drug metric model for regression analysis of interval-
against a control, versus B which is a compar- censored failure time data. Biometrics 41 (4): 933–945.
ison of intravenous versus oral drug delivery Furuya, Y., Wijesundara, D.K., Neeman, T., and Metzger,
D.W. (2014). Use and misuse of statistical significance
routes. Factor A will be of higher priority if
in survival analyses. MBio 5 (2): e00904-14. https://
the goal is to determine efficacy. Power off
doi.org/10.1128/mBio.00904-14.
the most important factor. Heisey, D.M., Shaffer, T.L., and White, G.C. (2007). The
3. If interactions (A × B) to assess synergism or ABCs of nest survival: theory and application from a
antagonism are of greater interest than main biostatistical perspective. Studies in Avian Biology 34:
effects, then estimate the size of the interac- 13–33.
tion, and power the study based on the inter- Hey, S.P. and Kimmelman, J. (2014). The questionable
action. Be aware that, in general, estimating use of unequal allocation in confirmatory trials. Neu-
interactions requires roughly 16 times the rology 82 (1): 77–79. https://doi.org/10.1212/
sample size compared to that for estimating 01.wnl.0000438226.10353.1c.
a main effect for equivalent power.2 Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H.
(2018). Sample Sizes for Clinical, Laboratory and Epide-
miology Studies, 4e. Wiley-Blackwell 390 pp.
Machin, D., Cheung, Y.B., and Parmar, M. (2006). Sur-
References vival Analysis: A Practical Approach, 2e. 278 pp.
Mantel, N., Bohider, N.R., and Ciminera, J.L. (1977).
Altman, D.G. (1991). Practical Statistics for Medical Mantel-Haenszel analyses of litter-matched time-to-
Research. London: Chapman & Hall/CRC. response data, with modifications for recovery of
Altman, D.G. and Bland, J.M. (1998). Time to event (sur- inter-litter information. Cancer Research 37:
vival) data. BMJ 317 (7156): 468–469. 3863–3868.
Atkinson, G., Williamson, P., and Batterham, A.M. Moffatt, C.J., Franks, P.J., Oldroyd, M. et al. (1992). Com-
(2019). Issues in the determination of ‘responders’ munity clinics for leg ulcers and impact on healing.
and ‘non-responders’ in physiological research. Exper- BMJ 305 (6866): 1389–1392. https://doi.org/
imental Physiology 104 (8): 1215–1225. 10.1136/bmj30568661389.
Boswood, A., Häggström, J., Gordon, S.G. et al. (2016). National Research Council (2011). Guide for the Care and
Effect of pimobendan in dogs with preclinical Use of Laboratory Animals, 8e. Washington: The
National Academies Press.
Pearl, R. and Miner, J.R. (1935). Experimental studies on
2
Gelman A (2018) https://statmodeling.stat.colum- the duration of life. XIV. The comparative mortality of
bia.edu/2018/03/15/need-16-times-sample-size- certain lower organisms. Quarterly Review of Biology 10
estimate-interaction-estimate-main-effect/ (1): 60–79.
Time-to-Event (Survival) Data 209
Peto, R., Pike, M.C., Armitage, P. et al. (1977). Design and Tierney, J.F., Stewart, L.A., Ghersi, D. et al. (2007). Prac-
analysis of randomized clinical trials requiring pro- tical methods for incorporating summary time-to-
longed observation of each patient. II. Analysis and event data into meta-analysis. Trials 8: 16.
examples. British Journal of Cancer 35 (1): 1–39. Weiss, G.B., Bunce, H. 3rd, and Hokanson, J.A. (1983).
https://doi.org/10.1038/bjc.1977.1. Comparing survival of responders and nonresponders
Radke, B.R. (2003). A demonstration of interval-censored after treatment: a potential source of confusion in
survival analysis. Preventive Veterinary Medicine 59: interpreting cancer clinical trials. Controlled Clinical
241–256. Trials 4 (1): 43–52. https://doi.org/10.1016/
Rotolo, M.L., Sun, Y., Wang, C. et al. (2017). Sampling s0197-2456(83)80011-7.
guidelines for oral fluid-based surveys of group-housed Witte, J. (2002). Sample size calculations for randomized
animals. Veterinary Microbiology 209 (20-29): 2017. controlled trials. Epidemiologic Reviews 24 (1): 39–53.
Senn, S. (2018). Statistical pitfalls of personalized medi-
cine. Nature 563 (7733): 619–621.
19
Comparing Multiple Factors
A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
212 A Guide to Sample Size for Animal-based Studies
the idea that the total information of an experiment variance for estimates for each factor and factor inter-
with N experimental units can be determined from actions, the effect of blocking, the magnitude of each
the total variation based on N − 1 degrees of free- relevant residual error term, potential confounding,
dom. In a simple experiment, the resource equation and information about the power of hypothesis tests.
partitions the variation among three sources: the A skeleton ANOVA is especially useful for estimating
treatment component T, the blocking component sample sizes for complex designs. Sample size formu-
B, and the error component E: lae may not even exist, and determine the optimal
allocation of experimental units among multiple
T + B + E = N −1 sources of variation can be challenging. Examples
The treatment degrees of freedom are k, where k of complex designs include split-plot and factorial
is the number of groups. The blocking variable B designs, designs with complex blocking and cluster-
with b − 1 degrees of freedom, is a known categori- ing, and repeated measures on different factors
cal variable used to minimise variation from sources (Brien 1983; Draper and Smith 1991; Brien and Bailey
not of direct interest to the experimenter. The error 2006, 2009, 2010). The planning rule of thumb is that
component is used for testing hypotheses. Mead there should be at least one degree of freedom for all
(1988) recommends a sample size that allows main factors and interactions of interest and 10–20
10–20 degrees of freedom for the error term. degrees of freedom for the residual error term for test-
Mead’s resource equation is most appropriate for ing hypotheses (Mead 1988).
single-factor designs. A further disadvantage is that
the inclusion of a block term seemingly penalises 19.4 Completely
the experiment by reducing degrees of freedom
for the error term. However, this is misleading Randomised Design, Single
because blocking removes the variation associated Factor
with the nuisance variable from the error term
and therefore can contribute substantially to the The single factor or one-way ANOVA consists of
power and precision of the experiment. one factor with k levels or ‘groups’. Experimental
units are independent of each other, and treatments
are randomly allocated to each experimental unit.
Example: Mead’s Resource Equation: The main disadvantage of the completely rando-
Simple Experiment mised design is that different sources of variation
not accounted for by the model will go into the error
An experiment is planned with four treatments
term. This inflates the mean square error, increas-
with five mice per group, for a total of 20 mice.
ing the error variance and reducing power.
Here, N = 20, B = 0, T = 4, so E = 19 − (4 − 1)
There are three variation components for a
= 16, suggesting the sample size is adequate.
single-factor completely randomised design:
If the experiment is to be run in 5 blocks, then
variation between units receiving different treat-
E = 19 − (5 − 1) − (4 − 1) = 12.
ments (between-treatment or between-group), the
variation between experimental units receiving
the same treatment (within-group), and the residual
19.3.4 Skeleton ANOVA error term (mean square error MSE) that estimates
The skeleton ANOVA can be thought of as an exten- the random variation between experimental units.
sion of the resource equation. The skeleton ANOVA The ratio of the between-group to within-group
table lists the sources of variation and correspond- mean square is the F-statistic (Table 19.1).
ing degrees of freedom for a specific experimental For a balanced design with equal sample sizes for
design (Table 19.1). Candidate values for the number k levels (commonly referred to as ‘groups’), the min-
of factors, levels of each factor, and number of exper- imal sample size per group n is
imental units can be selected, and corresponding
degrees of freedom calculated from those values. 2 F 1 − α,k − 1,k n − 1 MSE
The skeleton analysis of variance is therefore a n≥ 2
relatively simple method of evaluating the likely d
Table 19.1: Skeleton ANOVA for common study designs. N is the total number of experimental units (EUs), n is sample size
per group (number of replications of EUs), and σ2ε is the error or residual variance.
D. Crossover design
Due to ‘treatment’ (A versus B) k−1
Period p−1
Sequence (AB versus BA) s−1
Subject (sequence) n(t − 1)
Within-group variation n–1
Error (k − 1)(nk − 2)
Total (corrected) nsk − 1 = N − 1
(continued)
216 A Guide to Sample Size for Animal-based Studies
G. Two-factor design on factors A and B, with EUs randomised to A and repeated measures on B, such that subjects
are nested within A: S(A)
Between treatments
A a–1
B b−1
A×B (a – 1)(b − 1)
Error ab − 1
Within treatments
S(A) n−1
B × S(A) (n − 1)(b − 1)
Error ab(n − 1)
Total (corrected) abn − 1 = N − 1
H. Split-plot design
Whole ‘plot’ (effects)
Replicate on A r−1
Whole plot A a−1
Whole plot error (r − 1)(a − 1)
Whole plot error
Subplot B b−1
Subplot B × Main A (a − 1)(b − 1)
Subplot error rabn − a(r + b − 1)
where d is the mean difference to be detected, MSE For two groups with group size n, the total sample
is the mean square error (the estimate of the size is 2 n. For multiple treatment levels k, the total
residual variance), and F 1 − α,υ1 ,υ2 is the critical F- sample size is k n.
value with confidence (1 − α), and υ1 = (k − 1) The following examples show how to calculate
for between-treatment degrees of freedom, and sample size from the non-centrality parameter
υ2 = k(n − 1) for within-group degrees of freedom. and iterating over a range of candidate n
The effect size is d / MSE The non-centrality (Appendix 19.A). Pre-specified values for mean dif-
parameter is ference, variance, and confidence are entered, and
2 the corresponding non-centrality parameter and
n d power are calculated for each n. The appropriate
λ= 2
2 MSE sample size is selected by matching n to that which
Comparing Multiple Factors 217
0.8
Example: Guinea-Pig Weight Gain 0.7
(Data from Zar 2010.) A small trial was conducted 0.6
Power
to evaluate the effect of four different diets on 0.5
weight gain (g) of guinea pigs. The factor was diet
0.4
with four levels (A, B, C, D). Diets were randomly d = 2.5
assigned to each of 20 guinea-pigs. There were 0.3
five guinea-pigs per diet group (n = 5/group). 0.2
The raw data are:
0.1
Number 1.0
of groups
0.9
k
1.0 2 0.8
0.9 4 0.7
0.8 6 0.6
Power
0.7 0.5
0.6 0.4 n per
Power
0.5 group
0.3
10
0.4
0.2 8
6
0.3 4
0.1 2
0.2
0.0
0.1
0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
0.0 Variance
2 4 6 8 10 12 14 16 18 20
Figure 19.3: Sample size and variance. Increasing error
n per group
variance greatly increases the sample size required to
Figure 19.2: Guinea-pig growth: Sample size per group in a obtain a given power.
single-factor four-level design for two to six levels (or
groups). Increasing the number of groups to be compared
greatly increases the sample size required to obtain a the minimal difference term. For the guinea pig
given power.
data, the grand mean is
experiment, a sample size n of 12–13 per group is Y = 8 16 + 5 38 + 5 84 + 7 64 4 = 6 8 g
required for a total 24–26 animals. For 6 groups
(k = 6), the sample size required is 19–20 per group, The non-centrality parameter λ is
for a total of 114–120 animals. Sample sizes per
group become smaller as the number of groups is ai 2
reduced. λ=n
MSE
2 2 2 2
8 16 − 6 8 + 5 38 − 6 8 + 5 84 − 6 8 + 7 64 − 6 8
λ=n
MSE
In this example, sample sizes were approxi- to detect the target difference of 2.5 g if variance
mated over a range of MSE from 0.5 to 9.5 is much greater than 2.5.
(Figure 19.3). These values represent effect sizes In the presence of large variation, the number of
of approximately 0.55 (small) to over 10 (implausi- animals required to obtain a desired power of
bly large). In this example, a sample size of 4 per at least 80% power will be unacceptably large or will
group might be adequate if variance is less than require implausibly large effect sizes (Figure 19.4).
~20% of the mean difference. However, the Study refinement to minimise the effects of
required sample size per group increases consider- between-animal variation and alternative designs
ably with increasing variance. Even a sample size should be considered. Candidate designs include ran-
of 10 per group will not provide sufficient power domised complete block and factorial designs.
Comparing Multiple Factors 219
4.75 (a)
4.25
3.75
3.25
Effect size
n=2
2.75
n=4
2.25 n=6
n=8
n = 10
1.75 (b)
1.25
0.75
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Power
Method
up as statistically significant. Although the block
1. Define the blocking variable so that homoge- effect required (5 − 1) = 4 degrees of freedom
neous experimental units are grouped (with correspondingly fewer degrees of freedom
together. for the error term), the mean square error is
2. Randomise treatments allocated to each unit 0.773, a seven-fold reduction in variation. The lar-
or animal within each block. Ideally, ensure gest difference between means is for diets A and
that all treatments are represented equally B is 2.8 g with 95% confidence interval of (1.6,
in each block. 4.0). The precision for the estimate of the mean
3. Include the (treatment x block) interaction in difference is therefore 2.4/2 = 1.2. The R2 is
the analysis to assess the effect of the block as 0.91, indicating considerable improvement in
a source of variation. model fit.
For a sample size calculation based on this
Example: Guinea-Pig Growth: Randomised revised estimate of variance, only 4 guinea-pigs
Complete Block Design per group (total sample size N = 16) are required
to detect a mean difference of 2.5 g for a realised
The 20 animals in the study were actually meas- power of 0.827, and 5 per group (N = 20) for rea-
ured in ‘batches’ over five weeks. A batch con- lised power of 0.93.
sisted of one complete replicate of all four diets.
The blocking variable is the week of study entry.
Each block contained four animals, with one of
four diets randomly allocated to each guinea- Example: Experiment to be Conducted over
pig in each block. Several Days
studied in fractional factorial designs when the run of each combination of treatment factors
cost and size of full factorials become too prohib- (Montgomery 2012). Therefore, rather than think-
itive (Box et al. 2005), or if certain combinations ing of treatments as randomised to ‘groups’ of ani-
of factor levels are too toxic or otherwise logisti- mals and basing sample size of number per group,
cally unacceptable (Collins et al. 2009). For gen- it is more constructive to evaluate sample size for
uine replication (Box et al. 2005) and so that a factorial design as the number of replicate runs
statistical inference is valid, run combination is required to estimate the error variance. In a bal-
randomly allocated to each experimental units, anced factorial design, power to detect the main
and run order randomised for all treatment com- effect of a factor is determined by the sample size
binations and replicate runs. per level of each factor, not the sample size of the
‘group’. This means that in a two-factor design, each
animal is doing double duty by contributing to the
Example: Weight Gain in Mice on
main effect estimates for both factors at the same
Antibiotics
time. That is, when estimating the main effect for
(Adapted from Snedecor and Cochran 1980). An each factor, two means are compared based on
experiment was planned to measure weight gain m = 2 n observations from n experimental units.
of newly weaned mice given an antibiotic at a The minimum number of replicates per treatment
high dose (‘High’ =1), an intermediate dose combination is two.
(intermediate = 0) or vehicle control (‘low’ = −1) The effect size ES is obtained from regression
and vitamin B12 or vehicle control. This is a coefficients for each main effect, and standardising
3 × 3 factorial design with nine treatment them by the expected standard deviation of the
combinations: response variable Y:
Power
MSE 0.5
Table 19.2: Skeleton ANOVA for a proposed split-plot design with a = 3 levels of radiation intensity, r = 1 replication of
radiation at the whole plot level, b = 4 drugs, and n = 4 mice per drug. Zero degrees of freedom for the radiation effect
indicate that the design must be revised if radiation effects are of interest.
Source Degrees of freedom
19.8 Repeated-Measures means are averaged over time. The total sample size
N increases as correlation r increases. This is
(Within-Subject) Designs because the average group effect size is a
Repeated-measures designs consist of observations between-subjects comparison of averages, and the
made on the same subject so that observations variance of the estimate increases with r. However,
are correlated (Box 19.2). Types of repeated mea- when comparing group differences across time
sures designs included before-after and crossover (time × intervention interaction) as correlation r
designs (with paired observations on the same sub- increases, then total sample size N decreases, as
ject), repeated measures on treatment (different the contribution of individual variance to the esti-
treatments randomly applied to the same subject), mate of the rate of change in response is reduced
repeated measures on time (observations on the (Diggle et al. 2002).
same subject are obtained for multiple time Repeated-measures designs can test one or more
points), and spatial autocorrelation designs (obser- of three hypotheses for overall group differences
vations are correlated over space rather than time). (test on main effects), the overall trend over time
Sample size calculations for repeated measures (time effect), and differences between groups with
data require information on the number of time (time × intervention interaction) (Winer
repeated observations per experimental unit and et al. 1991). Therefore, three sources of variation
an estimate of the correlation among the repeated must be accounted for:
observations (Diggle et al. 2002).
Observations are correlated when pairs of obser- Between-subject variation (random effects): the
vations close to each other are more similar than variation between animals due to the character-
pairs of observations that are more widely istics of the animals in the sample.
separated. Recall for a series of n independent Within-subject variation: the variation within a
observations with mean Y and var(Y ) = s2/n, the single subject resulting from the effects of time
large-sample normal approximation for sample and measurement error. Time dependencies
size is: result when measurements are obtained on
N ≥ z21 − α s d 2 the same subject over time. Measurements
2
taken at short time intervals will be more highly
However, if there is a correlation r between obser- correlated than those taken further apart.
vations made at t time points, then Interaction effects: the variation attributable to
2 t−1 treatment differences across time (time × inter-
2
N ≥ z21 − α 2 s d 1+r vention interaction).
t
When comparing the overall difference in Sample size calculations for repeated–measures
response between two or more groups, the group designs should include preliminary estimates of
the variance and the expected correlation structure
BOX 19.2 for the repeated measurements. The four most com-
Repeated Measures Designs mon correlation patterns (Guo et al. 2013) are:
1. Paired measurements: Before-after, crossover 1. zero correlation, if observations are
designs independent.
2. Repeated measures on treatment: Treatments 2. compound symmetry, if any two sequential
randomly applied to the same subject observations are equally correlated.
3. Repeated measures on time, longitudinal designs: 3. autoregressive, when time points are equally
Measurements taken on the same subject at two spaced, and correlation decreases with
or more time points increasing distance between observations.
4. Spatial autocorrelation: Measurements are cor- 4. unstructured, if correlation is present but
related in space rather than time without any specific pattern.
226 A Guide to Sample Size for Animal-based Studies
Choice of a specific correlation structure must be treatments in reverse order. The great advantage
done with care, as the wrong structure or a structure of the crossover design is that two or more factors
that is too simple will inflate type I error and can be investigated on the same subjects, resulting
increase false positives. The conventional Pearson in a conservable reduction in animal numbers.
correlation coefficient does not account for the However, the disadvantage of crossover trials is
within-subject correlation structure of responses the potential for carry-over effect: measurements
so is unsuited for estimating correlation for a obtained for the second intervention may be
repeated-measures study. More appropriate prelim- affected by the previous intervention, resulting in
inary estimates can be obtained by calculating the interaction of the two treatments. The study must
correlation based on subject averages of the two be designed so that the washout period between
measurements (Bland and Altman 1994, 1995). Iri- treatment applications is sufficiently long to mini-
mata et al. (2018) describe several alternative meth- mise carry-over effects. Senn (2002) argues against
ods for estimating correlation for repeated- multi-stage testing for carry-over effects, as has been
measures designs. The best is to obtain relevant var- recommended in the past.
iance components from a mixed-model analysis of Sample size in an AB/BA crossover trial can be
variance on prior or exemplary data (Castelloe estimated by using power calculation methods for
and O’Brien 2001; Hamlett et al. 2004; Irimata a paired t-test, two-sample t test, or equivalence test,
et al. 2018). Mixed-model analyses provide the most depending on the hypothesis to be tested. In SAS,
reliable (and therefore smallest) estimates of total the power analysis for the paired sample t-test can
sample size (Hedeker et al. 1999; Guo et al. 2013; Iri- be performed with proc power. It requires an esti-
mata et al. 2018; Shan et al. 2020). mate of the within-subject correlation r. In R, the
power analysis can be performed with the pwr.t.test
function
19.8.1 Before-After and Crossover
Designs
Example: Crossover Trial: Heart Rate
In these designs, the ‘groups’ consist of observations Elevation in Cats
paired on the individual subject or experimental
unit. Because each subject serves as its own control, (Data from Griffin et al. 2021.) Heart rates of
between-subject variation is eliminated, and preci- 21 healthy adult cats presenting for routine well-
sion is greatly increased. The advantage of these ness exams were measured in a randomised two-
designs is that the same of level of statistical power period two-treatment crossover trial to assess anx-
can be achieved with fewer subjects than would be iety levels as result of exam location. The two loca-
the case if observations were obtained from inde- tions were an isolated exam room with the owner
pendent, parallel arm groups. Disadvantages present (A) or a treatment room without the owner
include carry-over effects (when treatment effects (B). Data from the trial indicated a reduction in
from the first test period persist to the second) heart rate of approximately 30 bpm when cats were
and missing data with subject dropout (Senn 2002). examined with owner present. How many cats are
A before-after design consists of two sets of obser- required for a new study if it was desired to detect a
vations made in a fixed sequence. The first set of clinically relevant difference of 25 bpm over the
observations is made on each subject at baseline baseline heart rate of 185 (SD 34) bpm for a confi-
before application of the experimental intervention, dence of 95% and power of 80%?
and the second set after the intervention. The correlation r was estimated from the vari-
A crossover design has subjects randomly allo- ance components from the original data using
cated to one of two sequences of two (or more) treat- mixed-model analysis of variance and equal
ments given consecutively. The sequence order AB correlation (compound symmetry). The subject
or BA is randomised to subject. Subjects allocated to variance was 543.59, and residual variance
AB receive treatment A first, followed by treatment was 343.98, so the correlation was 543.59/
B, and subjects allocated to BA receive the (543.59 + 343.98) = 0.6. The sample size for the
Comparing Multiple Factors 227
2
new study could then be approximated using the 2 zα + z1 − β 1 + r t−1
n= 2
formula for computing power for a paired means t d s2
t-test with baseline mean heart rate of 185 bpm,
mean difference of 25 bpm, and standard devia- where r is the correlation of the repeated measures,
tion 34. The new study would require approxi- s2 is the assumed common variance in the two
mately 15 cats. groups (or MSE), and t is the number of time points
(Diggle et al. 2002). Sample SAS code is provided in
Appendix 19.B.
19.8.2 Repeated Measures on Time:
Continuous Outcome 19.8.3 Repeated Measures on Time:
If the mean difference between levels d can be Proportions Outcome
assumed to be constant over time, then the sample Assuming there is a stable difference in proportions
size per group n to estimate the time-averaged p1 − p2 between two groups across t time points, the
response is number of subjects (n) in each of two groups is
2
z1 − α 2 p 1 − p + z1 − β p1 1 − p1 + p2 1 − p2 1 + r t−1
n= 2
t p1 − p2
reduced factorial designs. Psychological Methods 14 (3): Methodology 13: 100. https://doi.org/10.1186/
202–224. https://doi.org/10.1037/a0015826. 1471-2288-13-100.
Conquest, L.L. (1993). Statistical approaches to environ- Hamlett A, Ryan L, Wolfinger R (2004). On the use of proc
mental monitoring: did we teach the wrong things? mixed to estimate correlation in the presence of repeated
Environmental Monitoring and Assessment 26 (2-3): measures. SUGI29 Paper 198-29 https://support.
107–124. https://doi.org/10.1007/BF00547490. sas.com/resources/papers/proceedings/pro-
Cooper, N., Thomas, G.H., and FitzJohn, R.G. (2016). ceedings/sugi29/198-29.pdf (accessed 2022).
Shedding light on the ’dark side’ of phylogenetic com- Hedeker, D., Gibbons, R.D., and Waternaux, C. (1999).
parative methods. Methods in Ecology and Evolution 7 Sample size estimation for longitudinal designs with
(6): 693–699. https://doi.org/10.1111/2041- attrition. Journal of Educational and Behavioral Statis-
21019.12533. tics 24: 70–93. https://doi.org/10.3102/10769
Czitrom, V. (1999). One-factor-at-a-time versus designed 986024001070.
experiments. The American Statistician 53: 126–131. Hurlbert, S.H. and Lombardi, C.M. (2004). Research
https://doi.org/10.1080/ methodology: experimental design sampling design,
00031305.1999.10474445. statistical analysis. In: Encyclopedia of Animal Behav-
Diggle, P.J., Heagerty, P., Liang, K.-Y., and Zeger, S.L. ior, vol. 2 (ed. M.M. Bekoff), 755–762. London: Green-
(2002). Analysis of Longitudinal Data, Oxford Statisti- wood Press.
cal Science Series, 2e. Oxford: Oxford University Press. Irimata K, Wakim P, Li X (2018). Estimation of correla-
Divine, G., Kapke, A., Havstad, S., and Joseph, C.L. tion coefficient in data with repeated measures. Paper
(2010). Exemplary data set sample size calculation 2424-2018, Proceedings SAS Global Forum, 2018:8-11
for Wilcoxon-Mann-Whitney tests. Statistics in Medi- Karp, N.A. and Fry, D. (2021). What is the optimum
cine 29 (1): 108–115. https://doi.org/10.1002/ design for my animal experiment? BMJ Open Science
sim.3770. 5: e100126.
Draper, N. and Smith, K. (1991). Applied Regression Anal- Karp, N.A., Wilson, Z., Stalker, E. et al. (2020). A multi-
ysis, 2e. New York: Wiley. batch design to deliver robust estimates of efficacy and
Festing, M.F.W. (2014). Randomized block experimental reduce animal use – a syngeneic tumour case study.
designs can increase the power and reproducibility of lab- Scientific Reports 10 (1): 6178. https://doi.org/
oratory animal experiments. ILAR Journal 55: 472–476. 10.1038/s41598-020-62509-7.
Festing, M.F.W. (2018). On determining sample size in Kowalski, S.M. and Potcner, K.J. (2003). How to recog-
experiments involving laboratory animals. Laboratory nize a split-plot experiment. Quality Progress 36 (11):
Animals 52 (4): 341–350. https://doi.org/10.1177/ 60–66.
00236772177382. Lakens, D. and Caldwell, A.R. (2021). Simulation-based
Festing, M.F.W. (2020). The “completely randomized” power analysis for factorial analysis of variance
and the “randomized block” are the only experimental designs. Advances in Methods and Practices in Psycho-
designs suitable for widespread use in preclinical logical Science 4 (1): https://doi.org/10.1177/
research. Scientific Reports 10: 17577. 2515245920951503.
Gamble, C., Krishan, A., Stocken, D. et al. (2017). Guide- Lazic, S.E. (2016). Experimental Design for Laboratory
lines for the content of statistical analysis plans in clin- Biologists. Cambridge: Cambridge University Press.
ical trials. JAMA 318 (23): 2337–2343. Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H.
Garland, T. and Adolph, S.C. (1994). Why not to do two- (2018). Sample Sizes for Clinical, Laboratory and Epide-
species comparative studies: limitations on inferring miology Studies, 4e. Wiley-Blackwell.
adaptation. Physiological Zoology 67: 797–828. Mead, R. (1988). The Design of Experiments. Cambridge
Goldman, A.I. and Hillman, D.W. (1992). Exemplary data: UK: Cambridge University Press.
sample size and power in the design of event-time clin- Montgomery, D.C. (2012). Design and Analysis of Experi-
ical trials. Controlled Clinical Trials 13 (4): 256–271. ments, 8th ed. New York: Wiley.
https://doi.org/10.1016/0197-2456(92)90010-w. O’Brien, R.G. and Muller, K.E. (1993). Unified power
Griffin, F.C., Mandese, W.W., Reynolds, P.S. et al. (2021). analysis for t-tests through multivariate hypotheses.
Evaluation of clinical examination location on stress in In: Applied Analysis of Variance in Behavioral Science
cats: a randomized crossover trial. Journal of Feline (ed. L.K. Edwards), 297–344. New York: Marcel
Medicine and Surgery 23 (4): 364–369. https://doi. Dekker.
org/10.1177/1098612X20959046. Reynolds, P.S. (2022). Between two stools: preclinical
Guo, Y., Logan, H.L., Glueck, D.H., and Muller, K.E. research, reproducibility, and statistical design of
(2013). Selecting a sample size for studies with experiments. BMC Research Notes 15: 73. https://
repeated measures. BMC Medical Research doi.org/10.1186/s13104-022-05965-w.
Comparing Multiple Factors 231
Senn, S. (2002). Cross-Over Trials in Clinical Research, PLoS ONE 10 (4): e0122880. https://doi.org/
2nd ed. New York: Wiley. 10.1371/journal.pone.0122880.
Shan, G., Zhang, H., and Jiang, T. (2020). Correlation Voelkl, B., Vogt, L., Sena, E.S., and Würbel, H. (2018).
coefficients for a study with repeated measures. Com- Reproducibility of preclinical animal research
putational and Mathematical Methods in Medicine improves with heterogeneity of study samples. PLoS
2020: 7398324. https://doi.org/10.1155/2020/ Biology 16 (2): e2003693. https://doi.org/
7398324. 10.1371/journal.pbio.2003693.
Simpson, S.H. (2015). Creating a data analysis plan: what Winer, B.J., Brown, D.R., and Michels, K.M. (1991). Sta-
to consider when choosing statistics for a study. The tistical Principles in Experimental Design, 3e. Boston
Canadian Journal of Hospital Pharmacy 68 (4): 311– MA: McGraw-Hill.
317. https://doi.org/10.4212/cjhp.v68i4.1471. Würbel, H., Voelkl, B., Altman, N.S. et al. (2020). Reply to ‘It
Snedecor, G.W. and Cochran, W.G. (1980). Statistical is time for an empirically informed paradigm shift in ani-
Methods, 7e. Ames: Iowa State University Press. mal research’. Nature Reviews Neuroscience 21: 661–662.
Sødring, M., Oostindjer, M., Egelandsdal, B., and Paul- https://doi.org/10.1038/s41583-020-0370-7.
sen, J.E. (2015). Effects of hemin and nitrite on intes- Zar, J.H. (2010). Biostatistical Analysis, 5th ed. Upper
tinal tumorigenesis in the A/J min/+ mouse model. Saddle River: Prentice-Hall.
20
Hierarchical or Nested Data
A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
234 A Guide to Sample Size for Animal-based Studies
BOX 20.1 when both anticipated sample sizes and effect size
Examples of Hierarchical Data to be detected are small (Goldstein 2003; Moer-
beek 2004).
▪ Nestlings in different nests One solution to these problems is to average
▪ Pups in each litter from each multiparous across all observations for each cluster, then analys-
breeding pair ing the means. However, averaging loses a lot of
▪ One of two drugs is randomly assigned to each information and does not permit exploration of
mouse, the brain of each mouse is harvested, the effects of predictor variables acting at the differ-
and cells from each brain region are isolated ent hierarchical levels of organisation.
and plated Hierarchical or multilevel design models (also
▪ Brains from two strains of mouse are harvested, called nested models, random effects models, mixed
cells are isolated and plated, then one of two models, and variance components models) specifically
drugs is randomly assigned to cells in each well account for the nested structure of organismal units
in a plate and the correlation within clusters. Design features
▪ Multisite veterinary trials where dogs with a adjust for the different variances that occur between
particular cancer within each site are randomly and within clusters (thus ensuring estimation of
assigned to one of two or more treatments correct standard errors). Predictor or explanatory vari-
▪ Synaptic vesicles counts are obtained for each ables (covariates) can be incorporated at both the level
set of neurons harvested from the brains of sev- of the individual observation and the cluster. There are
eral strains of mice two main types of multilevel designs: between-cluster
▪ Cows within herds within locations and within-cluster (Figure 20.1). Between-cluster
▪ Survey studies: Individuals within clinics
within city (a)
▪ Meta-analyses
Cluster A B A B
Units A A B B A A B B
Two- and three-level models are common in studies
of neurobiology (Aarts et al. 2014, 2015), rodent (b)
breeding (e.g. pups within litter within dam; Lazic Cluster
and Essiou 2013; Hull et al. 2022), ecology (e.g. nest-
lings within nest within site), agriculture (e.g. live- Units A C C B B A C B
stock within herds within locations), multisite
B A C B
veterinary clinical trials (e.g. subjects within clinic
within city), and meta-analyses (individuals within (c)
studies).
When units are nested, observations for the units Cluster B A A B
designs randomise the factor or treatment to the strategy as aggregate. Sample size determina-
whole cluster; experimental units within each cluster tions prioritise the number of clusters k, not
are essentially subsamples. Within-cluster designs the number of individuals n (Aarts et al. 2014,
randomise treatment to units within each cluster. 2015). Cluster sample size is the priority if the
Repeated measures designs (where multiple observa- focus of the study is testing the effects of an
tions are nested within each subject) and randomised intervention applied at the cluster level, or if
block designs (where at least one replicate from each the number of subjects contained in each clus-
treatment is nested within blocks) are a type of ter is anticipated to be small, or else must be
multilevel design. fixed a priori during the design phase. Clus-
There are at least two sample size calculations ter-level designs have the advantage of relative
required for multilevel designs: the sample size simplicity (only one treatment is allocated to
per cluster (the number of units per cluster) the entire cluster). However, they are liable to
and the total number of clusters. When clustering a type of representation bias if there is large
is present, the correlation between observations between-cluster variation in subject character-
within clusters (the intra-class correlation, ICC) istics. The requirement for large number of
will affect statistical power. Larger sample sizes clusters to be sampled to achieve adequate
will be required if variance components are power may make these studies very expensive,
large and the sample size at the lowest level especially for multisite trials or large-scale sur-
(e.g. number of individuals per cluster) is very veys (Raudenbush and Liu 2000).
small (Konstantopoulos 2008, 2010; Austin and
Leckie 2018). Within-cluster assignment. When interventions
are assigned randomly to different individuals
within the same cluster, the individual is
20.2 Steps in Multilevel the unit of randomisation, and all treatments
are represented in each cluster. Within-cluster
Sample Size Determinations designs are sometimes referred to as multilevel
There are four steps for sample size determination blocked trials. Individual subjects are within
for multilevel models: ‘blocks’ because all treatments are represented
in each cluster (Spybrook et al. 2011; Konstan-
1. Identify the unit of randomisation topoulos 2012). Individual randomisation is
2. Determine design features often preferable to cluster randomisation
3. Decide on the effect size: δ, ICC, f2, R2 because sample sizes required are usually much
4. Other considerations: Balance, sparse data, smaller, and it is easier to control for represen-
and costs tation bias. However, individual treatment
allocations will not be possible if subjects can-
not be individually identified, and ‘treatment
20.2.1 Identify the Unit of contamination’ (inadvertent application of the
Randomisation wrong intervention) may be a consideration
In randomised experimental studies, test or control (Torgerson 2001). For testing effects of an inter-
interventions can be applied to the whole cluster, to vention applied at the individual level, sample
individuals within each cluster, or different inter- size determinations prioritise the number of
ventions may be randomised at different levels. individuals n. The number of clusters k will
be determined by the anticipated cluster size,
Cluster-level assignment. When interventions are or number of individuals within each cluster.
assigned randomly to whole clusters, the clus-
ter is the unit of randomisation. All individuals If there are more treatments than individuals per
from the same cluster will therefore belong to cluster, so that all treatments cannot be represented
the same experimental or treatment group. in each block, incomplete block or fractional facto-
Machin et al. (2018) refer to this assignment rial designs should be considered. More complex
236 A Guide to Sample Size for Animal-based Studies
experiments can be designed where different inter- or ordinal data), the assumptions of normally
ventions can be separately assigned to cluster and distributed error distributions no longer apply.
individual units. The outcome must be transformed with the
Multilevel designs are best understood as a appropriate non-linear link function in a gener-
regression model. Features of the regression model alised multilevel model. The response ηi is the
to be considered are number of levels, or hierarchi- log odds of the event occurring for the ith indi-
cal units, and specification of predictor variables (or vidual in the jth cluster:
covariates) at the different levels (if applicable).
μ
ηij = log
1−μ
20.2.2 No Predictors
The most basic multilevel model has no predictor where μ = 1/[1 + exp(−β0)]. The model is
variables and two hierarchical levels – the individ-
ual level (level-1) and the cluster level (level-2). Level 1 ηij = β0j
The individual is nested within cluster.
Level 2 β0j = γ 00 + u0j , u0j N 0, τ00
Continuous outcome. The model is described by
regression on the two levels as:
where β0j is the intercept or the average log odds of
Level 1 Y ij = β0j + eij , eij N 0, σ 2e the event occurring in cluster j and the level-2 error
variance term is designated by τ00. Because the var-
Level 2 β0j = γ 00 + u0j , u0j N 0, σ 2u iance of a non-Gaussian model is determined by the
population mean, the ‘level-1 error variance term’ as
Here Yij is the response measured for the ith indi- such is not computed as it is with Gaussian data but
vidual in the jth cluster, the intercept β0j is the mean is estimated as π2 / 3 = 3.29 (Ene et al. 2015; Hox
response for the jth cluster, and γ 00 is the grand et al. 2018). The level-2 random component is
mean (mean across all clusters). The covariance assumed to belong to the exponential family of dis-
matrix of the random effects is designated as the tributions: binary, binomial, Poisson, geometric,
G-matrix. It consists of the variance for the inter- and negative binomial distributions. Count data
cept, the variance for the slopes, and the covariance are usually modelled with a Poisson distribution.
between intercepts and slopes. The level-1 random Ordinal data are described by proportional odds,
effect, or error, term is eij, and is assumed to be nor- or cumulative logit, models.
In SAS, the variance components are obtained
mally distributed with mean 0 and variance σ 2e , the
from the variance components option in the ran-
variance between individuals. The level-2 random
dom statement of proc glimmix. The grand mean
component is normally distributed with mean 0
(with 95% confidence intervals) is obtained from
and variance σ 2u , the variance between clusters.
the fixed effects intercept (Appendix 20.A). Other
The total observed variance for the response is thus
distributions can be specified by the relevant distri-
σ 2e +σ 2u (Bell et al. 2013). Combining the two equa- bution and link functions.
tions gives the full model:
Y ij = γ 00 + ui + eij
20.2.3 Multilevel Models with
that is, the response is the sum of the grand mean Predictors
plus the between- and within-cluster variances. In The basic model is easily expanded to include one or
SAS, the variance components σ 2u , σ 2e are obtained more predictors, or covariates, at any or all levels.
from the variance components option in the ran- The level to which a treatment, or intervention, pre-
dom statement of proc mixed (SAS code is provided dictor is assigned depends on how the unit of rando-
in Appendix 20.A). misation is defined. The associated slope terms
(regression coefficients β1, β2, etc.) are of primary
Categorical outcome. When outcome variables are interest as they describe the effects of the interven-
categorical (such as binary, proportions, count, tion on the outcome. Interpretation is simplified if
Hierarchical or Nested Data 237
Dummy variables are used to streamline calculations when categorical variables with two or more distinct ‘levels’ are
used as predictors (Draper and Smith 1998).
1. Two category ‘levels’ A and B. One dummy variable Z that takes values of 0 or 1 is required to uniquely code for each
group.
Example. One of two treatments is randomly assigned to individual within a cluster. Treatment is represented by
indicators Z1 = 0 (control) and Z1 = 1 (test intervention). The regression equation is
Yij = β0j + β1 Z1 + eij
Example. One of two treatments is randomly assigned to equal number of individuals of both sexes, male or female.
Sex is represented by indicators Z2 = 0 (male) and Z2 = 1 (female). The regression equation is
Y = β 0 + β 1Z 1 + β 2 Z 2 + e
2. More than two category ‘levels’. More than two levels require k − 1 dummy variables z1, z2, …., zk-1, with each taking the
values of 0 or 1.
Example. Animals are sampled from each of three locations A, B, C. To code for location, two dummy variables z1 and
z2 are required:
Location A z1 = 1, z2 = 0
Location B z1 = 0, z2 = 1
Location C z1 = 0, z2 = 0
The regression equation is
Y = β0 + β1X1 + a1z1 + a2z2 + e
The regression coefficients a1 and a2 estimate the difference between each group and the reference group (0, 0).
intervention variables are categorical and dummy- β0j = γ 00 + u0j ; u0j N 0, τ00
coded (Table 20.1). For example, suppose a two-
arm experiment is planned comparing a test drug β1j = γ 10 + u1j ; u1j N 0, τ11
A versus a vehicle control C. Then ‘treatment’ can
be coded with a single dummy variable to indicate
Here γ 00 is the grand mean of the response, γ 10 is
the two groups (0 for C and 1 for A). The regression
the treatment effect, or the difference in means
coefficient will be the quantitative difference in
between the experimental group and the compara-
responses between groups and is thus a measure
of unstandardised effect size. tor group (Y t − Y c ), τ00 is the between-cluster vari-
Additional covariates can be added to increase ation, and τ11 is the variance for the treatment
precision (reduce standard errors) of the estimated effect between clusters.
intervention effects, and reduce noise, therefore
improving signal and increasing power to detect Level-2 predictors. Predictors at the cluster level
treatment effects if they exist (Hedges and Hedberg (level-2) are indicated by W, rather than 20.
2007; Raudenbush et al. 2007; Konstantopoulos The cluster-level regressions are
2008, 2012). Signalment variables (age, sex, repro-
β0j = γ 00 + γ 01 W j + u0j level − 2 intercept
ductive status, weight, etc) are useful covariates.
The cross-level term X1ij W1j indicates the clustered studies, there are multiple sources of var-
strength of the adjustment of level-2 characteristics iation and therefore several choices of the standard
on the level-1 (subject) responses within each clus- deviation to be used for the standardisation: the
ter (Sample SAS code is provided in Appendix 20.B). total variance, the variance for the subjects nested
within clusters, the variance between clusters, and
20.2.4 Constructing the Model when applicable, the variance for the main effect
of treatment.
Bell et al. (2013) and Ene et al. (2015) recommend a Initial effect sizes for continuous data can be
‘bottom-up’ model-building strategy. The initial approximated using Cohen’s d and ignoring for
model is intercept-only with no predictors, then pre- level-2 clustering and associated predictors. Recall
dictors and random effects are added one at a time. that effect size is expressed in standard deviation
Predictors should be specified a priori, and selection units as d = Y 1 − Y 0 /σ total , where Y 1 − Y 0 is
based on best available knowledge or scientific
the biologically important difference δ between
justification. Final model specification is based on
treatment means, and σ total is the standard deviation
coefficients that are statistically significant (or
of the outcome, estimated as the square root of the
‘meaningfully large’) and model fit based on the
total variance σ 2total . For the two-level model, the
deviance test, which assesses differences in the
−2 log likelihood values between candidate models. standard deviation of the outcome is σ 2u + σ 2e ,
Model misspecification is indicated by failure to
converge, or a non-positive definite g-matrix. The and for three levels σ 2u + σ 2v + σ 2e Values for δ
latter indicates that one or more of the random can be a ‘best guess’ or hypothesised scientifically
effects variance components are zero. There may meaningful difference Y1 − Y0 to be detected. Basic
be too few observations to properly estimate the ran- descriptive or summary statistics can be used to
dom effects (there should be at least p + 1 observa- obtain the standard deviation of the outcome Y.
tions to estimate p random effects), too many If there is little or no prior information on antici-
random effects specified in the model, or there is lit- pated mean differences and variances, Cohen’s
tle to no variation between units at the given level. benchmark values for small, medium, and large
Dropping the zero variance random effects is an effect sizes could be substituted. However, these
option. An alternative is to use a ‘population aver- values (and their interpretation) will be highly unre-
age’ model and dispense entirely with random liable and are not recommended for animal-based
effects. The disadvantage of this approach is that studies (see Chapter 15).
these models cannot capture between-cluster differ- When outcome variables are categorical, the
ences, but this may not matter if cluster effects are effect size is estimated by the difference in log odds
not of primary interest or are too small to be of prac- ratios:
tical importance (McNeish et al. 2016).
p1 1 − p0
δ = η1 − η0 = ln
p0 1 − p1
20.3 Estimating Effect Size
Effect sizes can be estimated by Cohen’s d, unstan- where η0 and η1 are the log odds ratios for the con-
dardised or standardised regression coefficients, trol and experimental groups, respectively, with the
multilevel R2, Cohen’s f 2, and/or the intraclass cor- average treatment effect defined in terms of the log
relation coefficient ICC (Lorah 2018). odds of the proportions of ‘successes’ (the number of
events occurring) in each group.
of most biological interest). Feingold (2015) recom- Selya et al. (2012) describe how to calculating f2
mends that the regression coefficient for slope β is using SAS proc mixed. Bates et al. (2011) and Naka-
divided by the pooled within-group standard devia- gawa and Schielzeth (2013) have developed
tion of the response Y. The estimated effect size will R packages for estimating R2 for multilevel models,
be somewhat smaller if the effect size is estimated further extended by Johnson (2014) and Nakagawa
without accounting for cluster variances. Because et al. (2017) to include generalised linear models for
the magnitude of the fixed effects depend on the non-normal outcomes.
scale of each independent variable, effect sizes
based on different variables in the same study, or
variables across multiple studies, cannot be com- Example: Calculating R2 and f 2: Sparrow
pared directly (Lorah 2018). Model of Early Life Stress
An alternative measure of effect size for regres- (Data from Grace et al. 2017.) Nestling house
sion models is the proportion of variation explained sparrows (Passer domesticus) were used as a
by the predictors. Cohen’s f 2 estimates the propor- model of the effects of early life stress on growth
tion of variance in the sample explained by the (cat- and survival. Nestlings were fed mealworms con-
egorical) variable for the intervention effect relative taining either corticosterone or vehicle control to
to all other covariates in the sample, or R2. Here f 2 is assess the effect of glucocorticoid exposure (a
calculated by running the regression model several stress indicator) on body mass gain (‘treatment’,
times to obtain three different variance compo- TRT). Measurements were obtained for mass
nents: the variance components for the full model (M) and tarsus length (TL) on day 12, and SEX
with all fixed effects and covariates included, vari- (male or female). For this example, data were lim-
ance components for the covariates-only model ited to observations for four nestlings per nest,
(which estimates the variance in the response with- two of which received corticosterone and two of
out the fixed effects), and variance components for which received control. There were 31 nestlings
the null model (intercept-only model with no pre- in the experimental and control groups, respec-
dictors). The respective R2 for the full and covariate tively, for a total of 72 nestlings in 18 nests.
models are calculated from these variance compo- The model is a two-level design, with individ-
nents as ual nestlings within nest as the unit of randomi-
σ 2null − σ 2full sation to which treatment interventions were
σ 2null − σ 2other
R2full = ; R2other = applied. Mass M is the dependent variable,
σ 2null σ 2null TRT is the fixed effect (1 = corticosterone,
0 = control), and TL and sex are covariates.
For a binary response variable, R2 is: R2 and Cohen’s f 2 were calculated in SAS
σ 2fixed (adapted from Selya et al. 2012). Variances and
R2binary = corresponding R2 are shown in Table 20.2.
σ 2fixed + τ200 + π 2 3 Cohens f 2 is
where σ 2fixed is the sample variance of the fixed 0 234 − 0 07
f2 = = 0 215
effects linear predictor, and τ00 is the cluster 1 − 0 234
(level-2) error variance term (Austin and
Merlo 2017).
Cohen’s f2 is then calculated as:
Table 20.2: Calculation of R2 and f 2 for sparrow data. where u and v refer to the level-2 and level-3 clus-
Source Variance R2 ters, respectively.
For non-Gaussian outcomes (e.g, binary, count,
Full model (TRT, 2.209 R2full = (2.886-2.209)/ proportions), the ICC for a two-level model is
SEX, TL) 2.886 = 0.234
Covariates only 2.683 R2other = (2.886-2.683)/
τ00
ICC =
(SEX, TL) 2.886 = 0.070 τ00 + π2 3
Intercept-only (no 2.886
predictors) where τ00 is the intercept variance. For binary and
proportion response data, the level-1 error variance
term is determined by the population mean. The
second term for calculating the total variance is esti-
size in the same way as a conventional correlation mated from the logistic link function with scale fac-
coefficient r (Snijders and Bosker 1993, 1999; Lorah tor of 1 and is π2 / 3 = 3.29 (Ene et al. 2015; Hox
2018). It is also a measure of the ‘information con- et al. 2018).
tent’ contained in each cluster. As the correlation Initial values of ICC may be obtained from previ-
between observations increases, the level-1 variance ously published data. If raw data are available, the
(σ e2) becomes smaller, and ICC increases. ICC is ICC can be calculated from the sum of the variance
zero if responses for all individuals are independent components of the intercept-only regression model
of one another. A non-zero ICC denotes responses (Bell et al. 2013; Ene et al. 2015).
are not independent, and ICC of 1 indicates all
the responses in all clusters are the same. A large
ICC indicates the information contained in individ- Example: Calculating ICC: Magpie Egg
ual observations is similar. Therefore, if ICC is large, Weights
increasing within-cluster sample size will be redun-
dant and will not contribute much in the way of new (Data from Reynolds 1996.) Weights were
information for tests of treatment effects. (When obtained for 118 eggs from 19 Black-billed Mag-
ICC is calculated as a measure of inter-rater agree- pie (Pica pica) nests. The fitted model is an inter-
ment, a high value is desirable as a measure of cept-only no-predictor model:
method reliability).
Egg weight per egg per nest = a0 + ui + eij
If the design is balanced (equal n within each
cluster), the between-cluster ICC in a two-level where the random intercept term a0 is an esti-
model with a continuous outcome is mate of mean egg weight, ui is the random effects
error term for the between-nest effects, and eij is
σ 2u the residual variance, or the error term for egg. In
ICCu =
σ 2u + σ 2e SAS, the variance components σ 2u , σ 2e are obtained
from the variance components option in the ran-
dom statement, and mean egg weight (with 95%
That is, the ICC is the ratio of the between-cluster confidence intervals) from the fixed effects inter-
variance to the total variance. For a three-level cept (Appendix 20.A).
model, there are two cluster ICCs to be calculated: The covariance parameter estimates are 0.68
for level-2 nest effects (intercept) and 0.12
σ 2u for level-2 egg effects (residual). Mean egg weight
ICCu =
σ 2u + σ 2v + σ 2e is 9.3 (95% confidence interval 8.9, 9.7) g. The
ICC = σ 2u σ 2u + σ 2e = 0.68/(0.68+0.12) = 0.85.
σ 2v
ICC v = The ICC indicates that 85% of the variation in
σ 2u + σ 2v + σ 2e egg weight can be accounted for by nest of origin.
Hierarchical or Nested Data 241
20.5.2 Sample Size Based on approximately 10 subjects per cluster, the study
Design Effect requires 325/10 = 32.5 clusters, rounded up to
If an estimate of ICC is available, sample size can be 33 clusters.
approximated using the design effect formula. The
design effect quantifies the increase in sample size
expected for a nested design over that of a simple
random sample with no clusters (Kish 1987). That 20.5.4 Asymptotic Normal
is, if a sample size N has been approximated using Approximation: Balanced Cluster Sizes
conventional power calculations, then N must be
multiplied by the design effect (Deff) to adjust for Sample size for a balanced two-arm study with
the effects of clustering. The design effect increases level-1 randomisation (non-aggregate design) is
with both increased ICC and n per cluster, therefore
larger N would need to be sampled. 2
z1 − α 2 + z1 − β z21 − α 2
The design effect factor Deff will be determined N = 4 Deff 2 +
ES 2
by whether cluster sample sizes are balanced or
unbalanced.
where ES is the effect size δ/stotal.
Balanced cluster sizes. If sample sizes can For a study with level-2 randomisation (aggre-
assumed to be approximately equal for each gated design),
cluster, then the design effect factor is:
2
z1 − α + z1 − β z21 − α
Deff = 1 + n − 1 ICC N=4
2
+
2
2
δ sy 2
where n is the within-cluster sample size.
Deff s2total
where sy = n
20.5.3 Initial Approximations
Perform power calculations under the assumption 20.5.5 Asymptotic Normal
of simple random sampling in the usual way. Then
multiply that sample size estimate by Deff.
Approximation: Unbalanced
Cluster Sizes
For many types of animal-based studies, there will
Example: Adjusting Sample Size for Cluster be a considerable amount of between-cluster imbal-
Effects ance. Common examples are litter sizes in rodents
A new study was proposed where power calcula- and clutch sizes in birds, reptiles, and amphibians.
tions based on the assumption of simple random Relative to balanced cluster designs or unclustered
sampling indicated that approximately 100 ani- studies, large variation in cluster size n contributes
mals would be required. However, it is antici- to loss of power. As a result, projected sample sizes
pated that clustering effects will be important. for clustered studies need to be adjusted for the
There are 10 subjects per cluster with ICC of anticipated variation in cluster sizes.
0.25. What is the revised sample size? Machin et al. (2018) recommend an adjusted Deff
based on the coefficient of variation CV for within-
The effective sample size would require cluster sample sizes:
Deff = 1 + 10 − 1 0 25 = 3 25 CV n = sn n
that is, 3.25 times as many subjects as a non- where sn and n are the standard deviation and mean
clustered random sample. Then the new study cluster sizes, respectively. Then the adjusted design
would require 3.25 (100) = 325 subjects. With effect will be
Hierarchical or Nested Data 243
m d2
1 96 + 1 2816 2
1 96 2 λ= σ 2e
N≥4 32 + ≈ 212 τ11 +
08 2 2 n
Figure 20.2: Number of nests to be sampled to attain 2. Binary outcome. The variance components
prespecified power in a study of sparrow early life stress.
are obtained from the variance components
Hierarchical or Nested Data 245
run;
%let numClust = 100; quit;
data cluster; ods output close;
*iterate through range of cluster sample
sizes; *3. Calculate variance for null model (no
do m = 5 to &numClust by 5; predictors);
*prespecified difference; ods output CovParms = VarNull;
meandiff = 1.5; proc mixed data=sparrow covtest CL method=ML;
s2_subj= 2.228; *pooled variance; class nest;
s2_clus= 2.209; model mass = / s CL;
n = 6; *assume median random INT / subject=nest type=vc;
clutch size of 6; run;
quit;
alpha=0.05; ods output close;
246 A Guide to Sample Size for Animal-based Studies
Julian, M. (2001). The consequences of ignoring multi- Behavioral Research, 39, 129–149. https://doi.
level data structures in non-hierarchical covariance org/10.1207/s15327906mbr3901_
modeling. Structural Equation Modeling 8: 325–352. Nakagawa, S., Johnson, P.C.D., and Schielzeth, H.
https://doi.org/10.1207/S15328007SEM0803_1. (2017). The coefficient of determination R2 and
Kiernan, K. (2018). Insights into using the GLIMMIX intra-class correlation coefficient from generalized lin-
procedure to model categorical outcomes with random ear mixed-effects models revisited and expanded. Jour-
effects. Paper SAS2179-2018. https://support.sas. nal of the Royal Society Interface 14: 20170214.
com/resources/papers/proceedings18/2179- https://doi.org/10.1098/rsif.2017.0213.
2018.pdf. Nakagawa, S. and Schielzeth, H. (2013). A general and
Kish, L. (1987). Statistical Design for Research. New simple method for obtaining R2 from generalized lin-
York: Wiley. ear mixed-effects models. Methods in Ecology and Evo-
Konstantopoulos, S. (2008). The power of the test for lution 4 (2): 133–142. https://doi.org/10.1111/
treatment effects in three-level cluster randomized j.2041-21020.2012.00261.x.
designs. Journal of Research on Educational Effective- Raudenbush, S.W. and Liu, X. (2000). Statistical power
ness 1 (1): 66–88. https://doi.org/10.1080/ and optimal design for multisite randomized trials.
19345740701692522. Psychological Methods 5 (2): 199–213. https://doi.
Konstantopoulos, S. (2010). Power analysis in two-level org/10.1037/1082-98920.5.2.199.
unbalanced designs. Journal of Experimental Educa- Raudenbush, S.W., Martinez, A., and Spybrook, J. (2007).
tion 78 (3): 291–317. Strategies for improving precision in group-
Konstantopoulos, S. (2012). The impact of covariates on randomized experiments. Educational Evaluation
statistical power in cluster randomized designs: which and Policy Analysis 29 (1): 5–29. https://doi.org/
level matters more? Multivariate Behavioral Research 10.3102/0162373707299460.
47 (3): 392–420. https://doi.org/10.1080/ Reynolds, P.S. (1996). Brood reduction and siblicide in
00273171.2012.673898. Black-Billed Magpies (Pica pica). The Auk 113 (1):
Lazic, S.E. (2010). The problem of pseudoreplication in 189–199. https://doi.org/10.2307/4088945.
neuroscientific studies: is it affecting your analysis? Selya, A.S., Rose, J.S., Dierker, L.C. et al. (2012). A practical
BMC Neuroscience 11: 5. https://doi.org/ guide to calculating Cohen’s f2, a measure of local effect
10.1186/1471-2202-11-5. size, from PROC MIXED. Frontiers in Psychology 3: 111.
Lazic, S.E., Clarke-Williams, C.J., and Munafò, M.R. https://doi.org/10.3389/fpsyg.2012.00111.
(2018). What exactly is ‘N’ in cell culture and animal Snijders, T.A. and Bosker, R.J. (1993). Standard errors
experiments? PLoS Biology 16 (4): e2005282. https:// and sample sizes for two-level research. Journal of
doi.org/10.1371/journal.pbio.2005282. Educational and Behavioural Statistics 18: 237–259.
Lazic, S.E. and Essiou, L. (2013). Improving basic and Snijders, T.A.B. and Bosker, R.J. (1999). Multilevel Anal-
translational science by accounting for litter-to-litter ysis: An Introduction to Basic and Advanced Multilevel
variation in animal models. BMC Neuroscience 14: Modeling. Thousand Oaks, CA: Sage.
37. https://doi.org/10.1186/1471-2202-14-37. Spybrook, J., Bloom, H., Congdon, R., et al. (2011). Opti-
Lorah, J. (2018). Effect size measures for multilevel mod- mal design plus empirical evidence: documentation for
els: definition, interpretation, and TIMSS example. the “Optimal Design” software. http://hlmsoft.
Large-Scale Assessments in Education 6: 8. https:// net/od/ (accessed 2022).
doi.org/10.1186/s40536-018-0061-2. Torgerson, D.J. (2001). Contamination in trials: is cluster
Maas, C.J.M. and Hox, J.J. (2005). Sufficient sample sizes randomisation the answer? BMJ 322: 355. https://
for multilevel modelling. Methodology 1 (3): 86–92. doi.org/10.1136/bmj.322.7282.355.
https://doi.org/10.1027/1614-1881.1.3.86. van Breukelen, G.J., Candel, M.J., and Berger, M.P.
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. (2007). Relative efficiency of unequal versus equal
(2018). Sample Sizes for Clinical, Laboratory and Epide- cluster sizes in cluster randomized and multicentre
miology Studies, 4e. Wiley-Blackwell. trials. Statistics in Medicine 26 (13): 2589–2603.
McNeish, D., Stapleton, L.M., and Silverman, R.D. https://doi.org/10.1002/sim.2740.
(2016). On the unnecessary ubiquity of hierarchical Voillemot, M., Hine, K., Zahn, S. et al. (2012). Effects of
linear modeling. Psychological Methods 22: 114–140. brood size manipulation and common origin on phe-
https://doi.org/10.1037/met0000078. notype and telomere length in nestling collared fly-
Moerbeek M (2004). The consequences of ignoring a level catchers. BMC Ecology 12: 17. https://doi.org/
of nesting in multilevel analysis. Multivariate 10.1186/1472-6785-12-17.
21
Ordinal Data
A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
250 A Guide to Sample Size for Animal-based Studies
(a) (b)
40
40
35 35
30 30
25 25
Frequency
20 20
15 15
10 10
5 5
0 0
1 2 3 4 5 1 2 3 4 5
Scores
Figure 21.1: Two simulated ordinal data sets on a 5-point scale, both with median score 3. (a) The distribution mirrors
the cognitive ‘expectation’. (b) The median does not represent the clinical picture, which shows most subjects occurring
at either extreme and relatively few occurring at the median.
Basic arithmetic operations (addition, subtrac- an odds ratio, the expected distribution of propor-
tion, multiplication, division) cannot be performed tions across categories, and the allocation ratio
on ordinal data. Descriptive statistics used for con- (Campbell et al. 1995; Machin et al. 2018).
tinuous data (e.g. mean, median, mode, standard Sample size increases substantially as proportion
deviation) often do not have a clinically or biologi- representation becomes more unbalanced toward
cally sensible interpretation when applied to ordinal dominance of one category. If the mean proportions
data. Although it is sometimes suggested that ordi- pi are expected to be roughly similar across all cate-
nal data can be summarised by the median and gories, then pi depends on the number of categories
mode, such summary statistics can be very mislead- k and pi is approximately 1/k (Table 21.1). For a two-
ing. Figure 21.1 shows two data sets with the same arm study with i = 4 categories of response, and
median score of 3. However, biological or clinical responses across all four categories are expected to
interpretation will be very different. In the first sam- be similar, then p1 = p2 = p3 = p4 = 1/4 = 0.25.
ple, the distribution mirrors the cognitive expecta- In contrast, if there is one dominant category with
tion that the summary statistic reflects the central sparse representation in the other categories, sam-
tendency of the data. In second sample, most sub- ple size will be inflated by approximately 43% over
jects occur at both extremes of the ranking, with rel- the uniform distribution case.
atively few occurring at the median. Maximum power will be obtained if sample sizes
Ordinal data can be analysed by conventional for each intervention group are equal. To correct for
rank-based non-parametric tests such as the allocation ratio, the sample size formula is adjusted
Mann-Whitney U-test for unpaired two samples, by (1 + r)2/r. If there is an equal number of subjects
Kruskal-Wallis (more than two unmatched groups) in each group, then r = n1/n2 = 1, and the correction
and Friedman tests (more than two matched pair is (1 + 1)2/1 = 4. The expected distribution of propor-
groups or repeated measures). Cumulative ordinal tions across categories can be obtained from previous
regression or cumulative link models are more pow- data. If there is no information on the likely response
erful alternatives (Agresti 2013). pattern, sample size can be approximated based on
the most likely pattern of distribution (Table 21.1).
Table 21.1: Sample size inflation resulting from proportion imbalance over four categories. Five common scenarios are
presented. The sample size for the uniform category distribution is represented by n0. The ratio n/n0 is the sample size
inflation caused by imbalances. Sample size requirements increase substantially with increasing dominance of a single
category.
Category 0 1 2 3
Proportions p1 p2 p3 p4 n/n0
Equally probable occurrence (uniform) 0.25 0.25 0.25 0.25 1.00
Graduated occurrence 0.10 0.20 0.30 0.40 1.04
Single under-represented category 0.10 0.30 0.30 0.30 1.14
Bimodal representation 0.05 0.05 0.45 0.45 1.15
Single dominant category 0.10 0.10 0.10 0.70 1.43
Test 0 2 1 0 1 2 0 0 0 0
A ‘summary’ odds ratio for planning pur-
poses can be estimated from the geometric 1 0 0 0 3 0 1 0 0 0
mean of all the odds ratios.
4. Calculate the average proportions p of the Control 0 0 2 2 2 1 3 1 3 1
treatment and control probabilities for each 0 0 2 0 0 2 2 0 1 1
category i.
Suppose it is desired to conduct a new study,
Then, sample size is for which it is assumed that the distribution of
lung damage scores will be similar to the previous
2 2 study. There will be equal sample sizes in each of
3 1+r z1 − α 2 + z1 − β
Nk = k 2 the control and intervention groups. What is the
1− i = 1p
3 r ln OR
new sample size?
First, reorganise the data in tabular form
If allocation is equal, then (r + 1)2/r is 4. If all and calculate counts and proportions for
average proportions p are approximately equal, then each category. The proportions p for each of
k
3 1− i = 1p
3
depends only on the total number i = 4 histopathology categories are shown in
of categories k, and reduces to 3/[1 − (1/k2)]. Table 21.2.
252 A Guide to Sample Size for Animal-based Studies
Table 21.3: Mouse lung tissue histopathology: calculation of expected proportions for a new study based on prior odds
ratios.
Example: Sample Size for Detecting expected odds ratio in favour of the test inter-
a Pre-Specified Odds Ratio vention is 2. What sample size is required to
detect ORnew = 2 with 95% confidence and
Suppose a new study is planned where the dis-
80% power?
tribution of subjects in the control arm is
Here the proportions and cumulative propor-
expected to be similar to those in the initial
tions for the test group must be estimated from
study. Disproportionately more mice in the test
the control group data and the planned odds ratio
group are expected to be in lowest lung dam-
(Machin et al. 2018). First, calculate the expected
age categories and more mice in the control
cumulative proportion
group with higher lung damage scores. The
Ordinal Data 253
Table 21.4: Mouse lung tissue histopathology: Calculating expected proportions in each histopathology category when
the odds ratio for a new study is pre-specified.
Histopathology Proportion of mice New Cumulative proportion
category i average p
Control, Expected Control, Expected
p0 test p1 P0,i test P1,i
2
P3 = 2 0 90 1 − 0 90 + 2 0 90 = 0 947, N k = 3 38 4
1 96 + 0 8416
and P4 = 1 ln 2 2
= 220.8, or approximately 221.
Table 21.5: In paired studies with binary outcomes (positive = 1, negative = 0), there are four possible outcomes for the first and second responses.

First response | Second response | Sum of first and second responses
Finally, the total number of paired observations Npair is

$$N_{pair} = \frac{\left[z_{1-\alpha/2}\,(OR + 1) + z_{1-\beta}\sqrt{(OR + 1)^2 - (OR - 1)^2\,p_{disc}}\,\right]^2}{p_{disc}\,(OR - 1)^2}$$

If there is a single dominant category, Julious et al. (1999) suggest that preliminary sample size estimates can be obtained by dichotomising the data into the dominant group and one other by pooling observations from the other categories, then estimating sample size as above.

Example: Stress Reduction in Healthy Dogs

Mandese et al. (2021) reported that 27/44 dogs (61%) showed a major increase in stress levels relative to baseline following separation from owners during routine physical exams. There were 16/44 (36%) dogs that showed a reversal in stress levels between baseline and the subsequent assessment (discordance).

Suppose a new study was to be designed to detect a 50% reduction in risk of stress (OR = 0.5) with 80% power and 95% confidence. Then the number of discordant paired observations required will be

$$n_{disc} = \frac{\left[1.96\,(0.5 + 1) + 2\,(0.8416)\,\sqrt{0.5}\,\right]^2}{(0.5 - 1)^2} \approx 68$$

From the previous data, it was expected that p_disc = s + t = 0.36, and the new OR = 0.5. Therefore, the approximate sample size is 68/0.36 ≈ 189 pairs.
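A hedged R sketch of this paired calculation (the function name is illustrative; the printed values of 68 and 189 reflect the rounding used in the worked example):

# Paired (McNemar-type) sample size from a planning odds ratio.
# OR: planning odds ratio; p_disc: expected proportion of discordant pairs.
n.paired <- function(OR, p_disc, alpha = 0.05, power = 0.80) {
  za <- qnorm(1 - alpha / 2)
  zb <- qnorm(power)
  n_disc <- (za * (OR + 1) + 2 * zb * sqrt(OR))^2 / (OR - 1)^2
  c(n_disc = n_disc, N_pair = n_disc / p_disc)
}
n.paired(OR = 0.5, p_disc = 0.36)   # approximately 68 discordant pairs, 189 total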
21.5 Sample Size for Observer Agreement Studies

Observer agreement studies assess reproducibility (the similarity in scores for multiple observers on the same specimen) and repeatability (the similarity in scores for the same observer over multiple assessments on the same specimen). The reliability of the measurements is interpreted in the context of an a priori definition of the minimally acceptable difference between observers (Hallgren 2012). Detailed coverage of the design of reproducibility and repeatability (R & R) studies is beyond the scope of this chapter; however, detailed guidance can be found elsewhere (e.g. Montgomery and Runger 1993a, b; https://sixsigmastudyguide.com/repeatability-and-reproducibility-rr).

Rater agreement studies are a special case of repeated binary or categorical ordinal data (Fleiss and Cohen 1973; Fleiss 1981). In these studies, two or more observers assess the same specimens or items, often several times. Cohen's kappa coefficient (κ) is used to measure both inter-rater and intra-rater reliability for two raters only; Fleiss' κ is used for more than two raters (Fleiss 1971) but is not a generalisation of Cohen's κ. For estimating inter-rater reliability, the number of specimens to be assessed, N, can be estimated from the expected proportion of discordant ratings p_disc and Cohen's κ:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where p_o is the observed agreement between raters, and p_e is the agreement expected by chance (Appendix 21.A). Then the number of specimens N to be assessed by at least two individual raters with precision δ, a pre-specified κ, and confidence interval 100(1 − α)% is

$$N = z_{1-\alpha/2}^{2}\,\frac{4\,(1-\kappa)}{\delta^{2}}\left[(1-\kappa)(1-2\kappa) + \frac{\kappa\,(2-\kappa)}{2\,p_{disc}\,(1-p_{disc})}\right]$$
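A hedged R sketch of this precision calculation (the interpretation of δ as the full width of the confidence interval is an assumption, consistent with the factor of 4 above):

# Number of specimens for estimating Cohen's kappa with CI width delta.
n.kappa <- function(kappa, p_disc, delta, alpha = 0.05) {
  bracket <- (1 - kappa) * (1 - 2 * kappa) +
    kappa * (2 - kappa) / (2 * p_disc * (1 - p_disc))
  ceiling(qnorm(1 - alpha / 2)^2 * 4 * (1 - kappa) / delta^2 * bracket)
}
n.kappa(kappa = 0.6, p_disc = 0.3, delta = 0.2)   # illustrative values only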
Table 21.6: Data for two pathologists evaluating the same 50 histology specimens.
Rater 1 Rater 2 Number of specimens
21.A Sample SAS Code for Calculating Cohen's κ from Raw Data

data test;
  input rater1 rater2 count @@;
  datalines;
............
;
run;

proc freq data=test;
  tables rater1*rater2 / agree;
  weight count;
  test kappa;
  exact kappa;
  title "Cohen's Kappa Coefficients";
run;

The macro MKAPPA to calculate κ for more than two raters is described by Chen et al. (2012).
References

Agresti, A. (2013). Categorical Data Analysis, 3rd ed. New York: Wiley.

Campbell, M.J., Julious, S.A., and Altman, D.G. (1995). Estimating sample sizes for binary, ordered categorical, and continuous outcomes in two-group comparisons. BMJ 311: 1145–1148.

Chen, B., Zaebast, D., and Seel, L. (2012). A macro to calculate kappa statistics for categorizations by multiple raters. https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/155-30.pdf (accessed 2022).

Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin 76: 378–382.

Fleiss, J.L. (1981). Statistical Methods for Rates and Proportions, 2nd ed. New York: Wiley.

Fleiss, J.L. and Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 33 (3): 613–619. https://doi.org/10.1177/001316447303300309.

Hallgren, K.A. (2012). Computing inter-rater reliability for observational data: an overview and tutorial. Tutorials in Quantitative Methods for Psychology 8 (1): 23–34. https://doi.org/10.20982/tqmp.08.1.p023.

Julious, S.A. and Campbell, M.J. (1998). Sample size calculations for paired or matched ordinal data. Statistics in Medicine 17: 1635–1642.

Julious, S.A., Campbell, M.J., and Altman, D.G. (1999). Estimating sample sizes for continuous, binary, and ordinal outcomes in paired comparisons: practical hints. Journal of Biopharmaceutical Statistics 9 (2): 241–251.

Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. (2018). Sample Sizes for Clinical, Laboratory and Epidemiology Studies, 4th ed. Wiley-Blackwell.

Mandese, W., Griffin, F., Reynolds, P., Blew, A., Deriberprey, A., and Estrada, A. (2021). Stress in client-owned dogs related to clinical exam location: a randomised crossover trial. Journal of Small Animal Practice 62 (2): 82–88. https://doi.org/10.1111/jsap.13248.

Montgomery, D.C. and Runger, G.C. (1993a). Gauge capability analysis and designed experiments. Part I: basic methods. Quality Engineering 6 (1): 115–135.

Montgomery, D.C. and Runger, G.C. (1993b). Gauge capability analysis and designed experiments. Part II: experimental design models and variance component estimation. Quality Engineering 6 (2): 289–305.

Whitehead, J. (1993). Sample size calculation for ordered categorical data. Statistics in Medicine 12: 2257–2271.
22
Dose-Response Studies
by crossover designs (Senn 2002), factorial designs (Dmitrienko et al. 2007), or Monte Carlo-based methods that simultaneously assess multiple heterogeneous tumour models (Ciecior et al. 2021).

Efficacy metric type: binary, ordinal, time to event, continuous. Much smaller sample sizes are needed when the outcome is continuous compared to categorical or time-to-event outcomes.

Dose range. The dose range must be wide enough to cover the expected response range and discriminate between different models of trend (e.g. linear versus quadratic or nonlinear). Doses should be chosen to bracket the benchmark dose (usually zero) and the likely maximum dose (Slob et al. 2005). The choice of the maximum dose must be made with care: if the dose is too high, unexpected or severe adverse events related to safety and tolerance may occur; if too low, efficacy effects will not be detectable (Pinheiro et al. 2006; Bretz et al. 2008).

Dose placement. The selection of dosage levels and how they are spaced are crucial design components for reliable descriptions of the dose-response relationship. At least five or six dose groups are recommended, especially when outcomes are highly variable (coefficient of variation of 18% or more; Slob et al. 2005). When the shape of the response is not known beforehand, equally spaced designs are more robust than optimal designs (e.g. D-optimal and c-optimal designs with dose levels determined by optimisation techniques; Holland-Letz and Kopp-Schneider 2015).

Expected response frequency. It is usually assumed that higher doses will result in more pronounced responses compared to lower doses (Dmitrienko et al. 2007).

Model choice. The shape expected for the response curve will determine the form of the model to be fitted to the data, and therefore the number of regression coefficients to be estimated (Bretz et al. 2008). Linear regression is useful for first-pass approximations of the overall dose-response relation, and for initial estimates of nonlinear parameter values. Splines have the advantage of being able to fit almost any curve shape with only two parameters (Crippa and Orsini 2016). Interpreting the appearance of curve shapes based on historical data must be performed with care, as artefacts can be introduced by small sample sizes, lack of randomisation, confounders, large variability, and 'fishing' (analysing a large number of endpoints) (Thayer et al. 2005). However, design features that increase precision, such as dose placement and number of dose levels, coupled with consistency in experimental procedures, are more important than model choice (Slob et al. 2005).

22.2.1 Translational Considerations

Dose-response sample size planning should consider three major factors affecting the translation potential of the model: sex, allometric scaling, and application of 3Rs principles. These are described in more detail in Chapter 6.

Sex. National Institutes of Health best-practice standards for animal-based research strongly encourage the inclusion of female animals (Clayton and Collins 2014) and consideration of sex as a biological variable (Miller et al. 2017). Regulatory guidelines may mandate testing of both sexes at each dose level (e.g. OECD Test Guidelines, https://www.oecd.org/). Sample sizes can be minimised by careful choice of statistically based experimental designs.

Allometric scaling. Care must be taken when determining dose ranges derived from interspecific or intraspecific allometric dose conversions. General scaling relationships may be useful for initial approximations. However, allometric predictions for pharmacokinetic relationships should be based on quality data and validated, up-to-date methodologies (Blanchard and Smoliga 2015).

3Rs principles. Regulatory agencies promote the incorporation of reduction and refinement methods and processes into protocols and, increasingly, replacement with non-animal models, such as cell- and tissue-based assays. For example, OECD guidelines for chemical testing in animals¹ provide practical

¹ https://www.oecd.org/chemicalsafety/testing/oecdguidelinesforthetestingofchemicals.htm
22.3 Dose-Response Regression Models

The basic dose-response design is the parallel-line assay, modelled as a linear regression:

Y = β0 + β1 D + β2 X

Here β0 is the intercept, β1 is the slope describing the change in Y with dose D, and β2 is the coefficient describing the difference between the control (X = 0) and test treatment (X = 1) arms (Finney 1978; Figure 22.2). When the response is binary, logit or probit transformations are used to linearise the data. Nonlinear regression models are used to estimate more complex relationships between dose and response. For continuous response data, the 3-parameter or 4-parameter nonlinear Emax models are commonly used. When dose concentrations span a very wide range or increase exponentially, analyses can be simplified by log-transforming the doses.

A preliminary planning value for the slope is

$$\beta_{1,plan} = \frac{mean_{max} - mean_{min}}{dose_{max} - dose_{min}}$$

Preliminary estimates for minimum and maximum efficacy are obtained from the anticipated maximum and minimum values of the response variable. Preliminary estimates of the median effective dose ED50 can be obtained from the median value of planned doses covering the range of the anticipated therapeutic effect.

Hypothesis tests are one-tailed. This is because most dose-response studies are designed to assess differences in response to increasing doses, so there is only one pre-specified test direction.

Figure 22.2: The probit regression model for quantifying efficacy between a test and control group. Efficacy is the horizontal difference between parallel test and control regression lines. The Y axis is the probit (or logit) score for a binary response, and the X axis is the log-transformed dose.

22.4 Sample Size: Binary Response

Binary responses are modelled as proportions p for each dose group. Therefore the distribution of the response will be sigmoid, and the relation is linearised by the logit or probit transformation of the proportions (Finney 1978; Kodell et al. 2010). The form of the regression is

Y = F⁻¹(P) = β0 + β1 log10 D + β2 X

where P is the proportion of subjects exposed to dose D that shows the response. For logit analysis, F is the logistic cumulative distribution function (cdf), where F(x) = 1/(1 + e⁻ˣ). For probit analysis, F is the normal (Gaussian) cdf.

Efficacy (ρ) is the ratio of the median dose for the test agent relative to control:

ρ = ED50,Test / ED50,Control

(Finney 1978).
The difference in efficacy (log10 ρ) between test and control groups is the difference in the log10-transformed values. It is shown graphically as the horizontal distance between parallel test and control regression lines (Figure 22.2).
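The parallel-line model can be fitted directly in R with glm(); a hedged sketch (not the authors' code), using the dataset given in Appendix 22.A.2, with log10(ρ) obtained from the fitted treatment coefficient and common slope:

# Parallel-line probit fit: probit(P) = b0 + b1*log10(dose) + b2*trt.
d <- data.frame(trt  = rep(c("C", "T"), each = 5),
                dose = rep(7:11, 2),
                N    = 8,
                Y    = c(0, 1, 3, 8, 8, 0, 0, 0, 4, 5))
fit <- glm(cbind(Y, N - Y) ~ log10(dose) + trt,
           family = binomial(link = "probit"), data = d)
# Horizontal offset between the parallel lines:
log10_rho <- -coef(fit)["trtT"] / coef(fit)["log10(dose)"]
rho <- 10^log10_rho   # relative efficacy, ED50(Test)/ED50(Control)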
Figure 22.3: Power curves for a range of sample sizes per group n and relative efficacy ρ (curves shown for ρ = 1.05, 1.10, 1.15, and 1.2). Source: Data from Kodell et al. (2010).
α = 0.05, and the t value t7,0.95 is 1.895. Based on the previous data, a common slope of 28.5 is assumed for the calculations. Power is calculated by deriving the cdf for each value of t1−β, iterating over the entire range of candidate n and choosing the n that results in power >0.8. Total sample size is then calculated as 2(5)n.

Results are shown in Figure 22.3. A sample size of n = 8 per group (or total N = 80) has power of only 46% to detect a relative efficacy of 1.05 with α = 0.05. Approximately 20 mice per group (N = 200) are required to detect an efficacy this small with 95% confidence. In contrast, to detect a larger relative efficacy of ρ = 1.2, only two mice per group (N = 20) are required for power >90%. However, it must be determined ahead of time whether an efficacy of this magnitude is even biologically feasible before considering such small sample sizes for a study.
22.5 Sample Size: Continuous Response

22.5.1 Linear Dose-Response

When the response is continuous and normally distributed, and the relation between response and dose levels is linear, total sample size N can be approximated by

$$N = g\,n = g\left[\frac{2\left(z_{1-\alpha} + z_{1-\beta}\right)^2}{D\,(b/s)^2}\right]$$

(Machin et al. 2018). Here g is the number of dose levels, and D is the adjustment for the doses dᵢ and dose range:

$$D = \sum_{i=0}^{g-1}\left(d_i - \bar{d}\right)^2$$

If the doses dᵢ are equally spaced, then D reduces to g(g² − 1)/12. The effect size b/s is estimated from the regression slope b, obtained from the difference in expected mean response values at the maximum and minimum dose, b = (μmax − μmin)/(dmax − dmin), and the anticipated standard deviation s, respectively. The total sample size N is the product of the sample size per group, the number of groups in each treatment arm, and the number of treatment arms, so that N = 2 g nᵢ (Kodell et al. 2010).

Example: Dose-Response Effects of Vitamin D on Mouse Retinoblastoma

(Data from Dawson et al. 2003.) To determine the effectiveness of a vitamin D analogue in inhibiting retinoblastoma, transgenic mice (N = 175) were fed a vitamin D- and calcium-restricted diet, then randomised to five dose groups of a vitamin D analogue: 0.0 (vehicle control), 0.1, 0.3, 0.5, or 1.0 μg. Outcomes were tumour area (μm) and serum calcium (mg/dL). The authors concluded there was no dose-dependent response. However, examination of the data suggests that the most dramatic tumour reductions occurred at doses between 0 and 0.1 μg. It was observed that undesirable adverse events, such as hypercalcaemia (calcium > 10 mg/dL) and mortality, were minimised at very low doses.

Suppose a new dose-response study is planned for candidate doses <0.1 μg, with the objective of detecting a per cent reduction in tumour area of at least 65%, with standard deviation of 20%, one-sided α = 0.05, and power 1 − β = 0.8. The candidate doses were 0.025, 0.05, and 0.1 μg against a vehicle control of 0 μg. In the original study, the initial tumour area was approximately 90 × 10³ μm³ for the control group. Then the expected tumour size at 0.1 μg is 31.5, the slope is (90 − 31.5)/(0 − 0.1) = −585, and s = 0.2 × 585 = 117. There will be g = 4 new dose groups. The doses are not equally spaced, so D = Σᵢ₌₀³ (dᵢ − d̄)².
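A hedged R sketch of this calculation (per-group n under the formula above; rounding conventions may differ):

# Linear dose-response approximation, vitamin D example planning values.
doses <- c(0, 0.025, 0.05, 0.1)          # g = 4 dose groups (unequally spaced)
D     <- sum((doses - mean(doses))^2)    # dose-placement adjustment
b     <- (90 - 31.5) / (0 - 0.1)         # planning slope, -585
s     <- 0.2 * 585                       # anticipated SD, 117
z     <- qnorm(0.95) + qnorm(0.80)       # one-sided alpha = 0.05, power = 0.80
n     <- 2 * z^2 / (D * (b / s)^2)       # sample size per dose group
ceiling(n)                               # about 91 per group under this sketch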
Example: Dose-Response Effects of Vitamin D on Mouse Retinoblastoma: Contrasts Method

A new mouse retinoblastoma study was planned to test the hypothesis of a linear dose-response with a relative change of 15% in tumour areas of 90 and 30 μm. The relative change in tumour area at each dose level (expressed as a proportion relative to placebo) can be approximated as θ0 = 0, θ1 = 0.15, θ2 = 0.30, θ3 = 0.65, with standard deviation s of approximately 0.3.

To test the initial hypothesis 'Is there a response?', the contrasts are set up to compare the control group against the other three dose levels: cᵢ = (−3, 1, 1, 1). Then

ε = (−3)(0) + (1)(0.15) + (1)(0.3) + (1)(0.65) = 1.1

To test for a linear trend, the contrasts are cᵢ = (−3, −1, 1, 3). Then ε = (−3)(0) + (−1)(0.15) + (1)(0.3) + (3)(0.65) = 2.1. For one-sided α = 0.05 and power of 80%, the approximate sample size per group is

$$n = \frac{\left(z_{1-\alpha} + z_{1-\beta}\right)^2 s^2 \sum_i c_i^2}{\varepsilon^2} \approx 27,$$

for a total sample size of 27 × 4 = 108.
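A generic R sketch of the single-contrast formula (cf. Stewart and Ruberg 2000); this is a schematic form only and, depending on the planning values used, may not reproduce the rounded figure above exactly:

# Sample size per group for a single contrast of means.
# cvec: contrast coefficients; theta: planning means; s: anticipated SD.
n.contrast <- function(cvec, theta, s, alpha = 0.05, power = 0.80) {
  eps <- sum(cvec * theta)   # contrast effect size
  ceiling((qnorm(1 - alpha) + qnorm(power))^2 * s^2 * sum(cvec^2) / eps^2)
}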
The R package 'DoseFinding' provides a comprehensive set of tools for design, contrast generation, multiple contrast tests, and nonlinear model fits for dose-response models (Bornkamp et al. 2023; https://cran.r-project.org/web/packages/DoseFinding).

22.A.1 Sample SAS Code for Calculating Logit and Probit Weights

data prob;
  input P @@;   * response proportions P (input step reconstructed);
  datalines;
0.05 0.275 0.5 0.725 0.95
;
run;

data prob;
  set prob;
  *calculate logit weights;
  w_logit = P*(1-P);
  *calculate probit weights;
  norminv = quantile('normal', P);
  f = pdf('NORMAL', norminv, 0, 1);
  w_probit = f*f/w_logit;
run;

22.A.2 Determining Relative Efficacy in a Dose-Response Study (Data from Kodell et al. 2010)

/* TRT is the two treatment groups, where C is the control group and
   T is the test drug treatment group */
/* N is the number of mice; the response Y is the number of deaths */

data probit;
  input trt $ dose N Y;
  ldose = log10(dose);   *log10 transformation of dose;
  datalines;
C 7 8 0
C 8 8 1
C 9 8 3
C 10 8 8
C 11 8 8
T 7 8 0
T 8 8 0
T 9 8 0
T 10 8 4
T 11 8 5
;
run;
References

Macdougall, J. (2006). Analysis of dose-response studies – Emax model, chap. 9. In: Dose Finding in Drug Development, Statistics for Biology and Health (ed. N. Ting), 127–145. New York: Springer. https://doi.org/10.1007/0-387-33706-7_9.

Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. (2018). Sample Sizes for Clinical, Laboratory and Epidemiology Studies, 4th ed. Wiley-Blackwell.

Miller, L.R., Marks, C., Becker, J.B., et al. (2017). Considering sex as a biological variable in preclinical research. FASEB Journal 31 (1): 29–34. https://doi.org/10.1096/fj.201600781R.

Nunamaker, E.A. and Reynolds, P.S. (2022). 'Invisible actors' – how poor methodology reporting compromises mouse models of oncology: a cross-sectional survey. PLoS ONE 17 (10): e0274738. https://doi.org/10.1371/journal.pone.0274738.

Pinheiro, J.C., Bornkamp, B., and Bretz, F. (2006). Design and analysis of dose-finding studies combining multiple comparisons and modeling procedures. Journal of Biopharmaceutical Statistics 16: 639–656.

Salsberg, D. (2006). Dose finding based on preclinical studies, chap. 2. In: Dose Finding in Drug Development, Statistics for Biology and Health (ed. N. Ting), 18–29. New York: Springer. https://doi.org/10.1007/0-387-33706-7_9.

Senn, S. (2002). Crossover Trials in Clinical Research. Chichester: Wiley.

Slob, W., Moerbeek, M., Rauniomaa, E., and Piersma, A.H. (2005). A statistical evaluation of toxicity study designs for the estimation of the benchmark dose in continuous endpoints. Toxicological Sciences 84: 167–185. https://doi.org/10.1093/toxsci/kfi004.

Stewart, W.H. and Ruberg, S.J. (2000). Detecting dose response with contrasts. Statistics in Medicine 19 (7): 913–921. https://doi.org/10.1002/(sici)1097-0258(20000415)19:7<913::aid-sim397>3.0.co;2-2.

Thayer, K.A., Melnick, R., Burns, K., Davis, D., and Huff, J. (2005). Fundamental flaws of hormesis for public health decisions. Environmental Health Perspectives 113 (10): 1271–1276. https://doi.org/10.1289/ehp.7811.

Tukey, J.W., Ciminera, J.L., and Heyse, J.F. (1985). Testing the statistical certainty of a response to increasing doses of a drug. Biometrics 41: 295–301.

Wong, W.K. and Lachenbruch, P.A. (1996). Tutorial in biostatistics: designing studies for dose response. Statistics in Medicine 15 (4): 343–359. https://doi.org/10.1002/(SICI)1097-0258(19960229)15:4<343::AID-SIM163>3.0.CO;2-F.
Index
Please note that page references to Figures will be followed by the letter ‘f’, to Tables by the letter ‘t’.
bias
  controls, 24
  correction adjustment methods, 91
  increasing, 13
  internal validity, 60
  minimising, 13, 17, 19, 23, 26–27
  prior data, 25
  in published animal research studies, 4
  reducing, 4
  removing, 24
  replication vs. repeats, 13
  representation, 235–
  research results, 15
  selection, 60, 144
  spectrum, 144
binary data, 130, 137
binomial distribution, 90–94
  and batch testing, 93–95
  cumulative probabilities for, 97–98
  determining screening numbers, 95
  exact calculation, 93
  negative see negative binomial distribution
  rare or non-existent events, 92–93
  'Rule of Threes,' 92
  see also adverse events; multinomial samples
biological unit, 11
biological variability, 4, 11
birds
  avian influenza, 195
  Black-billed Magpie (Pica pica) clutch size, 240
  breeding, 241
  clutch sizes, 243
  osprey eggshell thickness, 135–136, 138
  Poisson applications, 119
  sparrows
    house sparrow clutch size, 243
    house sparrow data, calculations of R² and Cohen's f² for, 245–246
    model of early life stress, 239, 244
  sunbirds, visitation rates to flowers, 120
bladder disease, in dogs, 118
bleeding, 91, 117
blocking, 212–214
  balanced block design, 219
  blocking factor, 212
  mouse oncology, 221
  randomised complete block design, 22, 219–221
  by treatment interaction, 73
  variables, 22, 73, 173, 219, 220
body size
  and allometric scaling, 63–64
  in large animals, 53, 83, 129
bootstrapping, 112, 131, 147
butterflies, sex ratios, 191

C
calculation/determination of sample size, 112–113
  adverse events, 64
  asymptotic large-scale approximation, 182
  asymptotic normal approximation, 242–243
  based on t-distribution, 182–183
  clusters see clusters/clustering
  confidence intervals and standard deviation, 69
  conventional approach, 89–90
  covariate-dependent estimates, 148–149
  derived from percentage change in means, 183
  design effect, based on, 242
  effect sizes, 213
  empirical and translational pilots, 64–67
  exact methods, 90
  from exemplary data, 161
  hierarchical/nested data, 242–244
  initial approximations, 242
  Lehr's formula, 183, 195
  Mead's resource equation, 213–214
  multiple factors, comparing, 213–214
  non-centrality parameter (NCP), 213, 243
  non-parametric or parametric estimates, 148
  reference intervals, 147–149
  rules of thumb, 147, 183, 241
  sample-based coverage methods, 147–148
  skeleton ANOVA, 214
  time-to-event (survival) data, 205–206
  tolerance intervals, 137–138
  two-level model, subjects within cluster, 243
  units of randomisation, 243–244
  see also feasibility calculations; sample size
CAMARADES (collaborative research group), 18, 42
cancer, in mice
  blocking on donor, 221
  dose-response effects of vitamin D on retinoblastoma, 262, 264
  effect of diet on tumorigenesis, 223
  model papers, cross-sectional survey, 4
  tumour proliferation, 15
canine pulmonary stenosis, 149
cardiac disease, in dogs, 207
cats
  feline calicivirus (FCV) in shelter cats, 117–118
  heart rate elevation in, 226–227
  hypertrophic cardiomyopathy markers in Maine Coon cats, 96–97, 98
  MYBPC3 gene mutation, in Maine Coon cats, 96
  overweight or obese, weight loss in, 97
cattle, 185, 187
Causal Analysis/Diagnosis Decision Information System (CADDIS), 68
cause and effect, 18
Cavalier King Charles Spaniels, echocardiography, 147
mean square error (MSE), 129, 172, 212, 213, 214, 216, 217, 219, 220
measurement error, 9, 11, 12, 13, 20, 41, 63, 74, 114
measurement times, 52–53
meta-analyses, prediction intervals, 130–131
mice
  amyotrophic lateral sclerosis, 178
  anaesthesia duration, 184
  anaesthetic immobilisation times, 128
  on antibiotics, weight gain in, 222
  body size and allometric scaling, 63
  breeding productivity, 161
  C57BL/6 mouse strain, 24
  cancer in, 4, 15, 221, 223, 262
  environmental toxins, effects on growth, 72
  estimating number with desired genotype, 87–88
  heterozygous breeder, 87
  infections, 191, 192–193, 194
  irradiation study on, 224
  laboratory processing capacity experiment, 81
  LD50 radiation lethality study on, 261
  longitudinal data plots for, 52
  lung tissue histopathology changes in, 251, 252t
  number required for a microarray study, 85
  perception of as 'furry test tubes,' 28
  photoperiod exposure, 14
  pups, 10f, 72, 87, 88, 105t, 161–164, 233, 234
  repeating of entire experiments, 13
  surgical success study on, 190
  tumorigenesis, 223
  two-arm drug study, 183
  two-arm randomised controlled study of productivity, 105
  two-group comparisons, 190
  see also rodents
microarray studies, 85
MSE see mean square error (MSE)
multi-batch designs, 75
multinomial samples, 118
multiple factors, comparing, 211–231
  before-after designs, 226
  constructing the F-distribution, 212–213
  continuous outcome, 227
  crossover designs, 226
  design components, 212–213
  factorial designs, 221–223
  proportions outcome, 227
  randomised complete block design, 219–221
  repeated measures
    simple repeated-measures design, 229
    in space, 227–228
    over time, 227
  within-subject designs, 225–228
  sample size approximations
    based on a number of levels for a single factor, 217–218
    based on an expected range of variations, 218, 219f
    based on mean differences, 217
  sample size determination methods, 213–214
  split-plot design, 223–224
mussel shell length, 228

N
National Institutes of Health (NIH), 13, 63
negative binomial distribution, 90, 95–96, 119, 121–122
  cumulative probabilities for, 97–98
  evaluating, 124
  fitting counts of red mites on apple leaves, 124–125
  osteosarcoma pilot study, 95–96
  skewed count data, 196
Newcombe (Wilson score intervals) method, 137, 192, 193
N-hacking, 28, 157
NIH see National Institutes of Health (NIH)
non-centrality parameter (NCP), 64, 86, 135, 156, 157–162, 164, 183, 186–187, 213, 216, 218, 222–223, 243–245
non-Gaussian data, 163–164
normal distribution, 90–91
nuisance variables, 22
numbers of animals
  justifiable, 5f
  reducing, 24
  vs. sample size, 9
  verifiable, feasible and fit for purpose, 3
  see also animal-based research; right-sizing

O
observational studies (PECOT), 18
observer agreement studies, sample size for, 254–255
odds ratio (OR), 174–175, 203
  pre-specified, sample size for detecting, 252
  two-group comparisons, 194–195
  using SAS proc genmod to calculate, 178
OFAT see one-factor-at-a-time (OFAT)
oncology, 4, 24, 221, 260
  see also cancer, in mice
one-factor-at-a-time (OFAT) design, 25, 27, 221
  empirical and translational pilots, 65, 66
operating procedures, standardised, 54
operational feasibility, determining, 81, 82–88
  available time for large animal experiments, 83
  basic science/laboratory studies, 82–83
  high dimensionality studies, 85–86
  laboratory processing capacity, 83
  large animal, available time for, 83
  operation and logistics, 83
  training, teaching and skill acquisition, 86
  veterinary clinical trials, 83–85
operational pilot studies, 25, 40, 47–56
  animal surgery procedural checklist, 50
  animal-based research, 48
  benchmark metrics, 47
  best-practice target times, 53
  'Can It Work?' 48
  checklists, 50
  constructing, 49
  deliverables, 47
  goal, 47
  measurement times, 52–53
  no pre-specified performance times, 53
  operational tools, 48–52
  paired data assessments, in Excel, 55
  performance metrics, 52–53
  process/workflow maps, 48–50
  retrospective chart review studies, 53–55
  run charts, 50–52
  sample size considerations, 55
  set-time metrics, 53
  standardising performance, 54–55
  subjective measurements, 53
  surgical implementation for physiological monitoring in swine model, 49–50
  task identification, 47
  see also pilot studies
operational tools, 48–52
OR see odds ratio (OR)
ordinal data, 249–256
  common examples, 249
  paired or matched, 253–254
  rater agreement, 254–255
  sample size approximations, 250–253
osprey eggshell thickness, 135–136, 138
osteosarcoma pilot study, 95–96
outcome data
  basic equation for continuous data, 170
  count data, 203
  hazard rate and ratio, 204–205
  survival times, 203–204
Oxford Centre for Evidence-based Medicine website, 18

P
PECOT framework, 18
performance metrics, 52–53
  no-fail surgical performance, 139
  standardising performance, 54–55
P-hacking, 28, 157
phenotypic responses, 21
PICOT framework, 18
pilot studies
  applications, 36–38
  arithmetic approximations, 42
  categories of studies labelled incorrectly as, 38–39
  defining, 35
  distinguished from exploratory and screening studies, 38–39
  empirical pilots, 57–80
  experiments, 37–39, 48, 50, 51f, 58, 60, 64–66, 70, 75
  general role of, 36
  justification, 40
  literature reviews, 40
  Murphy's First, Second and Third Law, 36
  operational, 47–56
  planning, 39–40
  principles, 39
  proportionate investment, 39
  rationale for, 35–45
  reporting of results, 38
  role in laboratory animal-based research, 37
  role in veterinary research, 37–38
  sample size, 41–43
  sequential probabilities, 42
  similitude, 39
  stakeholder requirements, 40
  translational pilots, 57–80
  trials, 35, 40–41
  utility, 39
  well-conducted, 36
  see also evidentiary strength; osteosarcoma pilot study
PIRT framework, 18
planned heterogeneity, 21, 75, 219
plots
  coverage, 68–69
  half-normal, 70, 72f
  interaction, 72–73
  longitudinal data, 52
  profile, 70, 71f
  sequential, 52f
Pocock criteria, concurrent controls, 24
Poisson distribution, 118, 119–121
  applications, 119
  evaluating, 124
  examples of applications, 119
  fitting counts of red mites on apple leaves, 124–125
  rules of thumb, 195
  skewed count data, 118–119, 195
  visitation rates of sunbirds to flowers, 120
power
  calculations, 4–5, 10–11, 81, 163–164
  descriptions, 107–108
  hypothesis testing, 155–157
  information, 58, 64, 65–67
  simulation-based analyses, 67
  statistical, 58
pragmatic sample size, 42
precision
  absolute vs. relative, 113
  and confidence intervals, 111–126
  defining, 12, 106
  role of design to increase, 19–23
  internal validity, 60
precision (cont'd)
  precision-based sample size, 42–43
  see also margin of error
predetermined effect sizes, 177
prediction intervals, 105, 106, 127–132
  binary data, 130
  continuous data, 127–130
  defining, 127
  meta-analyses, 130–131
  see also confidence intervals; reference intervals; tolerance intervals
predictive validity, 61
probability density function (pdf), 97, 98, 159
probability distribution, 89, 90
  see also binomial (exact) distribution; hypergeometric distribution; negative binomial distribution; normal distribution; Poisson distribution
probit, 121, 123–124, 229, 260f
procedural variability, 20
process behaviour charts, 50–52
process control, 20–21
process maps, 48–50
profile plots, 70, 71f
proof of concept (pPoC), 58, 59t
proof of principle (pPoP), 58, 59t
proportional hazards assumption, 204
proportions, 115–118, 227
protocols, 5, 20–22, 37, 50, 103
  approval, 50
  breeding, 86, 87
  compliance, 51
  coverage plots, 69
  effect sizes, 169
  errors, 51
  experimental, 36, 37, 47, 48, 74, 87
  fixed-span, 160
  fully articulated, 41
  habituation, 22
  implementation, 64
  individual idealised-span, 160
  large animal experiments, 83
  measurement times, 52
  mouse surgical success, 190
  pilot studies, 38, 40
  randomisation, 24
  research, 37
  standardised, 76
  translational considerations, 259
pseudo-replication, 12, 13–15, 219, 234
P-values, 59, 156, 157

Q
quality assurance (QA), 20, 48, 53
quality improvement (QI), 20, 48
quantified outcomes, 19

R
R code, 116, 136–139, 145–146, 160–161, 244
racehorse medication, 135, 139
randomisation, 4, 13, 19, 22–27, 39, 58t, 60, 63, 65, 76, 156, 200–201, 234f
  in ANOVA-type design, 212
  bias minimisation, 26
  block, 22
  cluster
    cluster-level assignment, 235
    as units of randomisation, 243–244
    within-cluster assignment, 235
    see also clusters/clustering
  formal, 23
  individual, 236
  lack of, 259
  level-2, 242
  non-randomisation, 27
  protocols, 24
  schedule, 200–201
  strategies, 201
  units of, 236, 241, 243–244, 245
    clusters as, 243–244
    identifying, 235–236
  see also randomised complete block design (RCBD)
randomised complete block design (RCBD), 22, 212, 219–221, 228–229
rare or non-existent events, 92–93
  see also adverse events
ratios
  allocation, 162–163, 169, 186, 194, 201, 205, 250
  effect sizes, 177–178
  hazard ratio (HR), 176, 204–208, 168t
  odds ratio see odds ratio (OR)
  sex, 190–191
rats, 160, 206
recommended daily allowance (RDA), 114
red mites, 122, 124–125
Reduction principle, 3, 4, 17, 22, 27, 64
reference intervals, 105, 106, 136, 143–151
  confidence probability, 145
  constructing, 144–147
  cut-point precision, 144, 145
  defining, 143
  interval coverage, 144
  markers, 143, 145
  reference limits, 144, 145
  regression-based reference ranges, 146–147
  sample size determination, 147–149
  selection bias, 144
  specificity, 143
  spectrum bias, 144
  see also confidence intervals; prediction intervals; tolerance intervals
variation (cont'd)
  research animals, 21–22
  sources of, 19, 22, 24, 37, 66, 171–173, 178, 212–215, 228
  statistical control, 22–23
  within-subject, 177, 225
veterinary clinical trials, 67, 83–85
  Food and Drug Administration (FDA)-regulated, 48
  loss to follow-up, 206
  non-central t-distribution, 186
  pilot studies, 37–38
  recruitment, 41
  sample SAS code for calculating sample size, 187–188
  sample size for two-arm, 185–188
    conventional method, uncorrected s, 186
    upper confidence limit method, 186
  time-to-event (survival) data, 206–207
von Willebrand disease (vWD), in Doberman Pinschers, 91, 117

W
Wald method
  adjusted Wald see Agresti-Coull method
  confidence interval, 123, 191–192
  equation, 191
  interval limits, 193
  for proportions, 115, 116, 121
water maze testing, 52
welfare indicators, 22, 26, 249
'well-built' research question, 17–19
Wilcoxon matched-pairs test, 14
Wilcoxon test, skewed count data, 118
Wilson method, for proportions, 115, 116
workflow maps, 48–50
World Health Organisation (WHO), 48, 50

Z
zebrafish mortality, 120
zero-animal sample size, 41–42