
A Guide to Sample Size for Animal-based Studies

Penny S. Reynolds
Department of Anesthesiology, College of Medicine
Department of Small Animal Clinical Sciences
College of Veterinary Medicine
University of Florida, Gainesville
Florida, USA



This edition first published 2024
© 2024 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic,
mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is
available at http://www.wiley.com/go/permissions.

The right of Penny S. Reynolds to be identified as the author of this work has been asserted in accordance with law.

Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book
may not be available in other formats.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other
countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not
associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty


While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy
or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability
or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for
this work. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained
herein may not be suitable for your situation. You should consult with a specialist where appropriate. The fact that an organization, website, or product is
referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or
services the organization, website, or product may provide or recommendations it may make. Further, readers should be aware that websites listed in this
work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any
loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data applied for

Paperback ISBN: 9781119799979

Cover Design: Wiley


Cover Images: © MOLEKUUL/SCIENCE PHOTO LIBRARY/Getty Images; Verin/Shutterstock; n.tati.m/Shutterstock; Mariia Zotova/Getty Images; RF Pictures/Getty Images

Set in 11.5/13.5pt STIXTwoText by Straive, Pondicherry, India



To Nyx, Mel, Finnegan, and Fat Boy Higgins
Holly, Molly, and Abby
and all their nameless, uncounted kindred
who have done so much to advance science and medicine
Contents
Preface vii
Acknowledgements viii

PART I. What is Sample Size?

1 The Sample Size Problem in Animal-Based Research 3
2 Sample Size Basics 9
3 Ten Strategies to Increase Information (and Reduce Sample Size) 17

PART II. Sample Size for Feasibility and Pilot Studies

4 Why Pilot Studies? 35
5 Operational Pilot Studies: ‘Can It Work?’ 47
6 Empirical and Translational Pilots 57
7 Feasibility Calculations: Arithmetic 81
8 Feasibility: Counting Subjects 89

PART III. Sample Size for Description

9 Descriptions and Summaries 103
10 Confidence Intervals and Precision 111
11 Prediction Intervals 127
12 Tolerance Intervals 133
13 Reference Intervals 143

PART IV. Sample Size for Comparison

14 Sample Size and Hypothesis Testing 155
15 A Bestiary of Effect Sizes 167
16 Comparing Two Groups: Continuous Outcomes 181
17 Comparing Two Groups: Proportions 189
18 Time-to-Event (Survival) Data 199
19 Comparing Multiple Factors 211
20 Hierarchical or Nested Data 233
21 Ordinal Data 249
22 Dose-Response Studies 257

Index 267
Preface
‘How large a sample size do I need for my study?’ Although this is one of the most commonly asked questions in statistics, the importance of proper sample size estimation seems to be overlooked by many preclinical researchers. Over the past two decades, numerous reviews of the published literature indicate many studies are too small to answer the research question and results are too unreliable to be trusted. Few published studies present adequate justification of their chosen sample sizes or even report the total number of animals used. On the other hand, it is not unusual for protocols (usually those involving mouse models) to request preposterous numbers of animals, sometimes in the tens or even hundreds of thousands, ‘because this is an exploratory study, so it is unknown how many animals we will require’.

This widespread phenomenon of sample sizes based on nothing more than guesswork or intuition illustrates the pervasiveness of what Amos Tversky and Daniel Kahneman identified in 1971 as the ‘belief in the law of small numbers’. Researchers overwhelmingly rely on best judgement in planning experiments, but judgement is almost always misleading. Researchers choose sample sizes based on what ‘worked’ before or because a particular sample size is a favourite with the research community. Tversky and Kahneman showed that researchers who gamble their research results on small intuitively-based samples consistently have the odds stacked against their findings (even if results are true). They overestimate the stability and precision of their results, and fail to account for sampling variation as a possible reason for observed pattern. The result is research waste on a colossal scale, especially of animals, that is increasingly difficult to justify.

This book was written to assist non-statisticians who use animals in research to ‘right-size’ experiments, so they are statistically, operationally, and ethically justifiable. A ‘right-sized’ experiment has a clear plan for sample size justification and transparently reports the numbers of all animals used in the study. For basic and veterinary researchers, appropriate sample sizes are critical to the design and analysis of a study. The best sample sizes optimise study design to align with available resources and ensure the study is adequately powered to detect meaningful, reliable, and generalisable results. Other stakeholders not directly involved in animal experimentation can also benefit from understanding the basic principles involved. Oversight veterinarians and ethical oversight committees are responsible for appraising animal research protocols for compliance with best practice, ethical, and regulatory standards. An appreciation of sample size construction can help assess scientific and ethical justifications for animal use and whether the proposed sample size is fit for purpose. Funding agencies and policymakers use research results to inform decisions related to animal welfare, public health, and future scientific benefit. Understanding the logic behind sample size justification can assist in evaluation of study quality and reliability of research findings, and ultimately promote more informed evidence-based decision-making.

An extensive background in statistics is not required, but readers should have had some basic statistical training. The emphasis throughout is on the upstream components of the research process – statistical process, study planning, and sample size calculations rather than analysis. I have used real data in nearly all examples and provided formulae and code, so sample size approximations can be reproduced by hand or by computer. By training and inclination I prefer SAS, but whenever possible I have provided R code or links to R libraries.
Acknowledgements
Many thanks to Anton Bespalov (PAASP, Heidelberg, Germany); Cori Astrom, Christina Hendricks, and Bryan Penberthy (University of Florida); Cora Mezger, Maria Christodoulou, and Mariagrazia Zottoli (Department of Statistics, University of Oxford); and Megan Lafollette (North American 3Rs Collaborative), who kindly reviewed various chapters of this book whilst it was in preparation and provided much helpful feedback. Thanks to the University of Florida IACUC chairs Dan Brown and Rebecca Kimball, who encouraged researchers to consult the original 10-page handout I had devised for sample size estimation. And last, but certainly not least, special thanks to Tim Morey, Chair of the Department of Anesthesiology, University of Florida, who encouraged me to put that handout into book form.

Thanks are also due to the University of Florida Faculty Endowment Fund for providing me with a Faculty Enhancement Opportunities grant to allow me to devote some concentrated time to writing. A generous honorarium from Scientist Center for Animal Welfare (SCAW) and an award from the UK Animals in Science Education Trust enabled me to upgrade my home computer system, making working on this project immeasurably easier.

The book was nearing completion when I came across the Icelandic word sprakkar, which means ‘extraordinary women’. I have been fortunate to encounter many sprakkar whilst writing this book. In addition to the women (and men!) already mentioned, special thanks to researchers Amara Estrada, Francesca Griffin, Autumn Harris, Maggie Hull, Wendy Mandese, and Elizabeth Nunamaker, who generously allowed me to use some of their data as examples. And special thanks to Jane Buck and Julie Laskaris for their wonderful friendship and hospitality over the years. Jane Buck, Professor Emerita of Psychology, Delaware State University, and past president of the American Association of University Professors, continues to amaze and show what is possible for a statistician ‘with attitude’. Julie advised me that the only approach to properly edit one’s own work on a book-length project was to ‘slit its throat’, then told me to do as she said, not as she actually did. Cheers.
PART I. What is Sample Size?

Chapter 1: The Sample Size Problem in Animal-Based Research.


Chapter 2: Sample Size Basics.
Chapter 3: Ten Strategies to Increase Information (and Reduce Sample Size).
1
The Sample Size Problem in Animal-Based Research

CHAPTER OUTLINE HEAD

1.1 Organisation of the Book 5
References 6

Good Numbers Matter. This is especially true when animals are research subjects. Researchers are responsible for minimising both direct harms to research animals and the indirect harms that result from wasting animals in poor-quality studies (Reynolds 2021). The ethical use of animals in research is framed by the ‘Three Rs’ principles of Replacement, Reduction, and Refinement. Originating over 60 years ago (Russell and Burch 1959), the 3Rs strategy is framed by the premise that maximal information should be obtained for minimal harms. Harms are minimised by Replacement, methods or technologies that substitute for animals; Reduction, the methods using the fewest animals for the most robust and scientifically valid information; and Refinement, the methods that improve animal welfare through minimising pain, suffering, distress, and other harms (Graham and Prescott 2015).

BOX 1.1
Right-Sizing Checklist

Statistically defensible: Are numbers verifiable? (Calculations)
Outcome variable identified
Difference to be detected
Expected variation in response
Number of groups
Anticipated statistical test (if hypothesis tests used)
All calculations shown

Operationally defensible: Are numbers feasible? (Resources)
Qualified technical staff
Time
Space
Resources
Equipment
Funding

Ethically defensible: Are numbers fit for purpose? (3Rs)
Appropriate for study objectives?
Reasonable number of groups?
Are collateral losses accounted for and minimized?
Are loss mitigation plans described?
Are 3Rs strategies described?

Source: Adapted from Reynolds (2021).

The focus of this book is on Reduction and methods of ‘right-sizing’ experiments. A right-sized experiment is an optimal size for a study to achieve its objectives with the least amount of resources, including animals. However, simply minimising the total number of animals is not the same as right-sizing. A right-sized experiment has a sample size that is statistically, operationally, and ethically defensible (Box 1.1). This will mean compromising between the scientific objectives of the study, production of scientifically valid results, availability
of resources, and the ethical requirement to minimise waste and suffering of research animals. Thus, sample size calculations are not a single calculation but a set of calculations, involving iteration through formal estimates, followed by reality checks for feasibility and ethical constraints (Reynolds 2019). Additional challenges to right-sizing experiments include those imposed by experimental design and biological variability (Box 1.2).

BOX 1.2
Challenges for Right-Sizing Animal-Based Studies

Ethics and welfare considerations. The three Rs (Replacement, Reduction, and Refinement) should be the primary driver of animal numbers.
Experimental design. Animal-based research has no design culture. Clinical trial models are inappropriate for exploratory research. Multifactorial agricultural/industrial designs may be more suitable in many cases, but they are unfamiliar to most researchers.
Biological variability. Animals can display significant differences in responses to interventions, making it challenging to estimate an appropriate sample size.
Cost and resource constraints. The financial cost of conducting animal-based research, including the cost of housing, caring for, and monitoring the animals, must be considered in estimates of sample size.

In The Principles of Humane Experimental Technique (1959), Russell and Burch were very clear that Reduction is achieved by systematic strategies of experimentation rather than trial and error. In particular, they emphasised the role of the statistically based family of experimental designs and design principles proposed by Ronald Fisher, still relatively new at the time. Formal experimental designs customised to address the particular research question increase the experimental signal through the reduction of variation. Design principles that reduce bias, such as randomisation and allocation concealment (blinding), increase validity. These methods increase the amount of usable information that can be obtained from each animal (Parker and Browne 2014).

Although it has now been almost a century since Fisher-type designs were developed, many researchers in the biomedical sciences still seem unaware of their existence. Many preclinical studies reported in the literature consist of numerous two-group designs. However, this approach is both inefficient and inflexible, and unsuited to exploratory studies with multiple explanatory variables (Reynolds 2022). Statistically based designs are rarely reported in the preclinical literature. In part, this is because the design of experiments is seldom taught in introductory statistics courses directed towards biomedical researchers.

Power calculations are the gold standard for sample size justification. However, they are commonly misapplied, with little or no consideration of study design, type of outcome variable, or the purpose of the study. The most common power calculation is for two-group comparisons of independent samples. However, this is inappropriate when the study is intended to examine multiple independent factors and interactions. Power calculations for continuous variables are not appropriate for correlated observations or count data with a high prevalence of zeros. Power calculations cannot be used at all when statistical inference is not the purpose of the study, for example, assessment of operational and ethical feasibility, descriptive or natural history studies, and species inventories.
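The two-group case just described is worth making concrete, both because it is the most common power calculation and because it shows how sharply animal numbers rise as the detectable effect shrinks. A minimal sketch using base R’s power.t.test() function (the effect size, standard deviation, power, and significance level are illustrative assumptions, not values from this book):

## Sample size per group for a two-sample comparison of means,
## assuming the smallest meaningful difference is 0.5 SD units
power.t.test(delta = 0.5,        # difference in group means
             sd = 1,             # common standard deviation
             sig.level = 0.05,   # two-sided Type I error rate
             power = 0.80,
             type = "two.sample")
## Gives n = 63.8 per group: round up to 64 animals per group.
## Halving the detectable difference (delta = 0.25) roughly
## quadruples the requirement, to about 253 animals per group.

The point of the sketch is that the assumed effect size, not the software, drives animal numbers, which is why the inputs must be justified rather than adjusted to yield a preferred n.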
Evidence of right-sizing is provided by a clear plan for sample size justification and transparent reporting of the number of all animals used in the study. This is why these items are part of best-practice reporting standards for animal research publications (Kilkenny et al. 2010; Percie du Sert et al. 2020) and are essential for the assessment of research reproducibility (Vollert et al. 2020). Unfortunately, there is little evidence that either sample size justification or sample size reporting has improved over the past decade. Most published animal research studies are underpowered and biased (Button et al. 2013; Henderson et al. 2013; Macleod et al. 2015), with poor validity (Würbel 2017; Sena and Currie 2019), severely limiting reproducibility and translation potential (Sena et al. 2010; Silverman et al. 2017). A recent cross-sectional survey of mouse cancer model papers published in high-impact oncology journals found that fewer than 2% reported formal power calculations, and less than one-third reported sample size per group. It was impossible to determine attrition losses, or how many experiments (and therefore animals) were discarded due to failure to achieve statistical significance (Nunamaker and Reynolds 2022). The most common sample size mistake is not performing any calculations at all (Fosgate 2009). Instead, researchers make vague and unsubstantiated statements such as ‘Sample size was chosen because it is what everyone else uses’ or ‘experience has shown this is the number needed for statistical significance’. Researchers often game, or otherwise adjust, calculations to obtain a preferred sample size (Schulz and Grimes 2005; Fitzpatrick et al. 2018). In effect, these studies were performed without justification of the number of animals used.

Statistical thinking is both a mindset and a set of skills for understanding and making decisions based on data (Tong 2019). Reproducible data can only be obtained by sustained application of statistical thinking to all experimental processes: good laboratory procedure, standardised and comprehensive operating protocols, appropriate design of experiments, and methods of collecting and analysing data. Appropriate strategies of sample size justification are an essential component.

1.1 Organisation of the Book

This book is a guide to methods of approximating sample sizes. There will never be one number or approach, and sample size will be determined for the most part by study objectives and choice of the most appropriate statistically based study design. Although advanced statistical or mathematical skills are not required, readers are expected to have had at least a basic course on statistical analysis methods and some familiarity with the basics of power and hypothesis testing. SAS code is provided in appendices at the end of each chapter, with references to specific R packages in the text. It is strongly recommended that everyone involved in devising animal-based experiments take at least one course in the design of experiments, a topic not often covered by statistical analysis courses.

This book is organised into four sections (Figure 1.1).

Part I Sample size basics discusses definitions of sample size, elements of sample size determination, and strategies for maximising information power without increasing sample size.

Part II Feasibility. This section presents strategies for establishing study feasibility with pilot studies. Justification of animal numbers must first address questions of operational feasibility (‘Can it work?’ Is the study possible? suitable? convenient? sustainable?). Once operational logistics are standardised, pilot studies can be performed to establish empirical feasibility (‘Does it work?’ Is the output large enough to be measured? consistent enough to be reliable?)

Figure 1.1: Overview of book organisation. For animal numbers to be justifiable (Are they feasible? appropriate? ethical? verifiable?), sample size should be determined by formal quantitative calculations (arithmetic, probability-based, precision-based, power-based) and consideration of operational constraints.
and translational feasibility (‘Will it work?’ proof of concept and proof of principle) before proceeding to the main experiments. Power calculations are not appropriate for most pilots. Instead, common-sense feasibility checks include basic arithmetic (with structured back-of-the-envelope calculations), simple probability-based calculations, and graphics.

Part III Description. This section presents methods for summarising the main features of the sample data and results. Basic descriptive statistics provide a simple and concise summary of the data in terms of central tendency and dispersion or spread. Graphical representations are used to identify patterns and outliers and explore relationships between variables. Intervals computed from the sample data are the range of values estimated to contain the true value of a population parameter with a certain degree of confidence. Four types of intervals are discussed: confidence intervals, prediction intervals, tolerance intervals, and reference intervals. Intervals shift emphasis away from significance tests and P-values to more meaningful interpretation of results.

Part IV Comparisons. Power-based calculations for sample size are centred on understanding effect size in the context of specific experimental designs and the choice of outcome variables. Effect size provides information about the practical significance of the results beyond considerations of statistical significance. Specific designs considered are two-group comparisons, ANOVA-type designs, and hierarchical designs.

References

Button, K.S., Ioannidis, J.P.A., Mokrysz, C. et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14: 365–376.
Fitzpatrick, B.G., Koustova, E., and Wang, Y. (2018). Getting personal with the “reproducibility crisis”: interviews in the animal research community. Lab Animal (NY) 47: 175–177.
Fosgate, G.T. (2009). Practical sample size calculations for surveillance and diagnostic investigations. Journal of Veterinary Diagnostic Investigation 21: 3–14. https://doi.org/10.1177/104063870902100102.
Graham, M.L. and Prescott, M.J. (2015). The multifactorial role of the 3Rs in shifting the harm-benefit analysis in animal models of disease. European Journal of Pharmacology 759: 19–29. https://doi.org/10.1016/j.ejphar.2015.03.040.
Henderson, V.C., Kimmelman, J., Fergusson, D. et al. (2013). Threats to validity in the design and conduct of preclinical efficacy studies: a systematic review of guidelines for in vivo animal experiments. PLoS Medicine 10: e1001489.
Kilkenny, C., Browne, W.J., Cuthill, I.C. et al. (2010). Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research. PLoS Biology 8 (6): e1000412. https://doi.org/10.1371/journal.pbio.1000412.
Macleod, M.R., Lawson McLean, A., Kyriakopoulou, A. et al. (2015). Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biology 13: e1002273. https://doi.org/10.1371/journal.pbio.1002273.
Nunamaker, E.A. and Reynolds, P.S. (2022). “Invisible actors”—how poor methodology reporting compromises mouse models of oncology: a cross-sectional survey. PLoS ONE 17 (10): e0274738. https://doi.org/10.1371/journal.pone.0274738.
Parker, R.M.A. and Browne, W.J. (2014). The place of experimental design and statistics in the 3Rs. ILAR Journal 55 (3): 477–485.
Percie du Sert, N., Hurst, V., Ahluwalia, A. et al. (2020). The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biology 18 (7): e3000410. https://doi.org/10.1371/journal.pbio.3000410.
Reynolds, P.S. (2019). When power calculations won’t do: Fermi approximation of animal numbers. Lab Animal (NY) 48: 249–253.
Reynolds, P.S. (2021). Statistics, statistical thinking, and the IACUC. Lab Animal (NY) 50 (10): 266–268. https://doi.org/10.1038/s41684-021-00832-w.
Reynolds, P.S. (2022). Between two stools: preclinical research, reproducibility, and statistical design of experiments. BMC Research Notes 15: 73. https://doi.org/10.1186/s13104-022-05965-w.
Russell, W.M.S. and Burch, R.L. (1959). The Principles of Humane Experimental Technique. London: Methuen.
Schulz, K.F. and Grimes, D.A. (2005). Sample size calculations in randomised trials: mandatory and mystical. Lancet 365 (9467): 1348–1353. https://doi.org/10.1016/S0140-6736(05)61034-3.
Sena, E.S. and Currie, G.L. (2019). How our approaches to assessing benefits and harms can be improved. Animal Welfare 28: 107–115.
Sena, E.S., van der Worp, H.B., Bath, P.M. et al. (2010). Publication bias in reports of animal stroke studies leads to major overstatement of efficacy. PLoS Biology 8 (3): e1000344. https://doi.org/10.1371/journal.pbio.1000344.
Silverman, J., Macy, J., and Preisig, P. (2017). The role of the IACUC in ensuring research reproducibility. Lab Animal (NY) 46: 129–135.
Tong, C. (2019). Statistical inference enables bad science; statistical thinking enables good science. American Statistician 73: 246–261.
Vollert, J., Schenker, E., Macleod, M. et al. (2020). Systematic review of guidelines for internal validity in the design, conduct and analysis of preclinical biomedical experiments involving laboratory animals. BMJ Open Science 4 (1): e100046. https://doi.org/10.1136/bmjos-2019-100046.
Würbel, H. (2017). More than 3Rs: the importance of scientific validity for harm-benefit analysis of animal research. Lab Animal (NY) 46: 164–166.
2
Sample Size Basics

CHAPTER OUTLINE HEAD

2.1 Introduction 9
2.2 Experimental Unit 9
2.3 Biological Unit 11
2.4 Technical Replicates 11
2.5 Repeats, Replicates, and Pseudo-Replication 12
2.5.1 Repeats of Entire Experiments 13
2.5.2 Pseudo-Replication 13
References 15

2.1 Introduction

Investigators frequently assume ‘sample size’ is the same as ‘the number of animals’. This is not necessarily true. Reliable sample size estimates are determined by the correct identification of the experimental units, the true unit of replication (Box 2.1). Replication of experimental units increases both precision of estimates and statistical power for testing the central hypothesis. Replicates on the same subject over time provide an estimate of time dependencies in response. Technical replicates are used to obtain an estimate of measurement error and are essential for quality control of experimental procedures. Pseudo-replication is a serious statistical error that occurs when the number of data points (evaluation units) is confused with the number of independent samples, or experimental units (Hurlbert 2009; Lazic 2010). Incorrect specification of the true sample size results in erroneous estimates of the standard error, inflated Type I error rates, and an increased number of false positives (Cox and Donnelly 2011). Research results will therefore be biased and misleading.

BOX 2.1
What Is Sample Size?

A replicate is one unit in one group.
Sample size is determined by the number of replicates of the experimental unit.
Experimental unit: the entire entity to which a treatment or control intervention can be independently and individually applied.
Biological replicate: a biologically distinct and independent experimental unit.
Technical replicate: one of multiple measurements on subsamples of the experimental unit, used to obtain an estimate of measurement error.

Definitions of ‘replicates’ and ‘replication’ are frequently confused in the literature, and further conflated with study replication. Planning experiments using formal statistical designs can help differentiate between the different types of replicates and sampling units, and determine which is best suited for the intended study.

2.2 Experimental Unit

The experimental unit or unit of analysis is the smallest entire entity to which a treatment or control intervention can be independently and randomly
applied (Figure 2.1a). Cox and Donnelly (2011) define it as the ‘smallest subdivision of the experimental material such that two distinct units might be randomized (randomly allocated) to different treatments’. Whatever happens to one experimental unit will have no bearing on what happens to the others (Hurlbert 2009). If the test intervention is applied to a ‘grouping’ other than the individual animal (e.g. a litter of mice, a cage or tank of animals, a body part; Figure 2.1b–d), then the sample size N will not be the same as the number of animals.

Figure 2.1: Units of replication. (a) Experimental unit = individual animal = biological unit. The entire entity to which an experimental or control intervention can be independently applied. There are two treatment interventions, A or B. Here each mouse receives a separate intervention, and the individual mouse is the experimental unit (EU). The individual mouse is also the biological unit. (b) Experimental unit = groups of animals. There are two treatment interventions, A or B. Each dam receives either A or B, but measurements are conducted on the pups in each litter. The experimental unit is the dam (N = 2), and the biological unit is the pup (n = 8). For this design, the number of pups cannot contribute to the test of the central hypothesis. (c) Experimental unit with repeated observations. The experimental unit is the individual animal (= biological unit), with four sequential measurements made on each animal. The sample size N = 2. (d) Experimental unit = part of each animal. There are two treatment interventions, A or B. Treatment A is randomised to either the right or left flank of each mouse, and B is injected into the opposite flank of that mouse. The experimental unit is the flank (N = 8). The individual mouse is the biological unit. Each mouse can be considered statistically as a block with paired observations within each animal.
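Design (d) is worth a concrete illustration: because both treatments are applied within each mouse, the correct analysis is paired, not two independent groups. A minimal R sketch with invented data (the number of mice and all values are hypothetical, not taken from the figure):

## Two treatments randomised to opposite flanks of the same mouse:
## each mouse is a block, so observations are paired within mouse.
set.seed(7)
n_mice  <- 4
flank_A <- rnorm(n_mice, mean = 120, sd = 15)          # e.g. tumour volume
flank_B <- flank_A + rnorm(n_mice, mean = -10, sd = 8) # within-mouse response
t.test(flank_A, flank_B, paired = TRUE)
## The paired test uses the n_mice within-mouse differences;
## analysing the 2 * n_mice flanks as independent groups would
## misstate both the degrees of freedom and the error variance.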

The total sample size N refers to the number of independent experimental units in the sample. The classic meaning of a ‘replicate’ refers to the number of experimental units within a treatment or intervention group. Therefore, replicating experimental units (and hence increasing N) contributes to statistical power for testing the central statistical hypothesis. Power calculations estimate the number
of experimental units required to test the hypothesis. The assignment of treatments and controls to experimental units should be randomised if the intention is to perform statistical hypothesis tests on the data (Cox and Donnelly 2011).

Independence of experimental units is essential for most null hypothesis statistical tests and methods of analysis and is the most important condition for ensuring the validity of statistical inferences (van Belle 2008). Non-independence of experimental units occurs with repeated measures and multi-level designs and must be handled by the appropriate statistically based designs and analyses for hypothesis tests to be valid.

2.3 Biological Unit

The biological unit is the entity about which inferences are to be made. Replicates of the biological unit are the number of unique biological samples or individuals used in an experiment. Replication of biological units captures biological variability between and within these units (Lazic et al. 2018). The biological unit is not necessarily the same as the experimental unit. Depending on how the treatment intervention is randomised, the experimental unit can be an individual biological unit, a group of biological units, a sequence of observations on a single biological unit, or a part of a biological unit (Lazic and Essioux 2013; Lazic et al. 2018). The biological unit of replication may be the whole animal or a single biological sample, such as strains of mice, cell lines, or tissue samples (Table 2.1).

Table 2.1: Units of replication in a hypothetical single-cell gene expression RNA sequencing experiment. Designating a given replicate unit as an experimental unit depends on the central hypothesis to be tested and the study design.

Animals
  Colonies: biological replicate
  Strains: biological replicate
  Cohoused animals in a cage: biological replicate
  Sex (male, female): biological replicate
  Individuals: biological replicate
Sample preparation
  Organs from animals killed for purpose: biological replicate
  Methods for dissociating cells from tissue: technical replicate
  Dissociation runs from a given tissue sample: technical replicate
  Individual cells: biological replicate
  RNA-seq library construction: technical replicate
Sequencing
  Runs from the library of a given cell: technical replicate
  Readouts from different transcript molecules: biological or technical replicate
  Readouts with unique molecular identifier (UMI) from a given transcript molecule: technical replicate

Source: Adapted from Blainey et al. (2014).

2.4 Technical Replicates

Technical replicates or repeats are multiple measurements made on subsamples of an experimental unit (Figure 2.2). Technical replicates are used to obtain an estimate of measurement error, the difference between a measured quantity and its true value. Technical replicates are essential for assessing internal quality control of experimental procedures and processes, and for ensuring that results are not an artefact of processing variation (Taylor and Posch 2014). Differences between operators and instruments, instrument drift, subjectivity in determination of measurement landmarks, or faulty calibration can result in measurement error. Cell cultures and protein-based experiments can also show considerable variation from run to run, so in vitro experiments are usually repeated several
times. At least three technical replicates of Western blots, PCR measurements, or cell proliferation assays may be necessary to assess reliability of technique and confirm validity of observed changes in protein levels or gene expression (Taylor and Posch 2014).

The variance calculated from the multiple measurements is an estimate of the precision, and therefore the repeatability, of the measurement. Technical replicates measure the variability between measurements on the same experimental units. Repeated measurements increase the precision only of estimates of the measurement error; they do not measure variability either within or between treatment groups. Therefore, increasing the number of technical replicates does not improve power or contribute to the sample size for testing the central hypothesis. Analysing technical repeats as independent measurements is pseudo-replication.

High-dimensionality studies produce large amounts of output information per subject. Examples include multiple DNA/RNA microarrays, biochemistry assays, biomarker studies, proteomics, metabolomics, and inflammasome profiles. These studies may require a number of individual animals, either for operational purposes (for example, to obtain enough tissue for processing) or as part of the study design (for example, to estimate biological variation). Sample size will then be determined by the amount of tissue required for the assay technical replicates, or by design-specific requirements for power. Design features include anticipated response/expression rates, expected false positive rate, and number of sampling time points (Lee and Whitmore 2002; Lin et al. 2010; Jung and Young 2012).

Example: Experimental Units with Technical Replication

Two treatments A and B are randomly allocated to six individually housed mice, with three mice receiving A and three receiving B. Lysates are obtained from each mouse in three separate aliquots (Figure 2.2).

The individual mouse is the experimental unit because treatments can be independently and randomly allocated to each mouse. There are three subsamples or technical replicates per mouse. The total sample size is N = 6, with k = 2 treatments, n = 3 mice per treatment group, and j = 3 technical replicates per mouse. The total sample size N is 6, not 18.

Figure 2.2: Experimental unit versus technical replicates. Two treatments A and B are randomly allocated to six mice. The individual mouse is the experimental unit. Three lysate aliquots are obtained from each mouse. These are technical replicates. The total sample size N is 6, not 18.
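A minimal R sketch of the correct handling of this example (simulated values, since the lysate data are not reproduced here): the three aliquots are first averaged within each mouse, so the hypothesis test is performed on the N = 6 experimental units rather than on 18 correlated measurements.

## 6 mice (experimental units), 3 technical replicates (aliquots) each
set.seed(1)
d <- data.frame(
  mouse     = rep(paste0("m", 1:6), each = 3),
  treatment = rep(c("A", "B"), each = 9),
  y         = rnorm(18, mean = rep(c(10, 12), each = 9), sd = 1)
)
## Wrong: t.test(y ~ treatment, data = d) treats the 18 aliquots as
## independent observations (pseudo-replication).
## Right: average the technical replicates within each mouse first.
mouse_means <- aggregate(y ~ mouse + treatment, data = d, FUN = mean)
t.test(y ~ treatment, data = mouse_means)  # df based on N = 6 mice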
2.5 Repeats, Replicates, and Pseudo-Replication

Confusion of repeats with replicates is a problem of study design, and pseudo-replication is a problem of analysis. Study validity is compromised by incorrect identification of the experimental unit. A replicate is a new experimental run on a new experimental unit. Randomisation of interventions to experimental units, and randomising the order in which experimental units are measured (sequence allocation randomisation), minimise the effects of systematic error or bias. A repeat is a consecutive run of the same treatment or factor combination. It does not minimise bias and may actually increase bias if there are time-dependencies in the data. Repeats are not valid replicates.

Example: Replication Versus Repeats

In Figure 2.3, the experimental units are eight mice that receive one of two interventions. In the first scenario, both the treatment allocated to each mouse and the measurement sequence are randomised. Bias is minimised and treatment variance will be appropriately estimated. In the second scenario, the treatment intervention may or may not have been randomly allocated to mice, but measurements were obtained for all mice in the first group followed by those in the second group. Bias results from confounding of outcome measurements with potential time-dependencies (for example, increasing skill levels or learning) and differences in assessment, especially if treatment allocation is not concealed (blinded).

Figure 2.3: Replicates versus repeats. True replicates are separate runs of the same treatment on separate experimental units. Both treatment allocation to units and sequence allocation for the processing of individual experimental units are randomised. In this experiment, the eight measurements on eight mice are taken in random order. Repeat measurements are taken during the same experimental run or consecutive runs. Unless processing order is randomised, there will be confounding with systematic sources of variability caused by other variables that change over time. In this experiment, eight measurements on eight mice are obtained consecutively, with units in the first treatment measured first.
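Both randomisation steps in the first scenario are straightforward to generate in software. A minimal base-R sketch (the mouse labels are hypothetical):

set.seed(42)
mice <- paste0("mouse", 1:8)
## Randomly allocate the two interventions, four mice each
alloc <- data.frame(mouse = mice,
                    treatment = sample(rep(c("A", "B"), each = 4)))
## Randomise the measurement sequence (sequence allocation randomisation)
alloc$measure_order <- sample(seq_along(mice))
alloc[order(alloc$measure_order), ]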
2.5.1 Repeats of Entire Experiments

A common practice in mouse studies is to repeat an entire experiment two or three times. It has been argued that this practice provides evidence that results are robust. However, NIH directives are clear that replication is justifiable only for major or key results, and that replications be independent. Repeating an experiment in triplicate by a single laboratory is not independent replication. These repeats can provide only an estimate of the overall measurement error of that experiment for that lab. A major consideration is study quality. If the study is poorly designed and underpowered, replicating it only wastes animals. Unless the purpose of direct internal replications is scientifically justified, experiments are appropriately designed and conducted to maximise internal validity, and experimental, biological, and technical replicates are clearly distinguished, simple direct repeats of experiments on whole animals are rarely ethically justifiable. Chapter 6 provides practical guidelines for experiment replication.

2.5.2 Pseudo-Replication

In a classic paper, Hurlbert (1984) defines pseudo-replication as occurring when inferential statistics are used ‘to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or experimental units are not statistically independent’ (Hurlbert 1984, 2009). Pseudo-replication in animal-based research is disturbingly prevalent. Lazic et al. (2018) reported that less than one-quarter of the studies they surveyed identified the correct replication unit, and almost half showed pseudo-replication, suggesting that inferences based on hypothesis tests were likely invalid.

Three of the most common types of pseudo-replication are simple, sacrificial, and temporal. Others are described in Hurlbert and White (1993) and Hurlbert (2009).

Simple pseudo-replication occurs when there is only a single replicate per treatment. There may be multiple observations, but they are not obtained from independent experimental replicates. The artificial inflation of sample size results in estimates of the standard error that are too small, contributing to an increased Type I error rate and an increased number of false positives.

Example: Mouse Photoperiod Exposure

A study on circadian rhythms was conducted to assess the effect of two different photoperiods on mouse wheel-running. Mice in one environmental chamber were exposed to a long photoperiod with 14 hours of light, and mice in a second chamber to a short photoperiod with 6 hours of light. There were 15 cages in each chamber, with four mice per cage. What is the effective sample size?

This is simple pseudo-replication. The experimental unit is the chamber, so the effective sample size is one per treatment. Analysing the data as if there is a sample size of n = 60 (or even n = 15) per treatment is incorrect. The number of mice and cages in each chamber is irrelevant. This design implicitly assumes that chamber conditions are uniform and chamber effects are zero. However, variation both between chambers and between repeats for the same chamber can be considerable (Potvin and Tardif 1988; Hammer and Hopper 1997). Increasing the sample size of mice will not remedy this situation because chamber environment is confounded with photoperiod. It is, therefore, not possible to estimate experimental error, and inferential statistics cannot be applied. Analysis should be restricted to descriptive statistics only. The study should be re-designed either to allow replication across several chambers or, if chambers are limited, as a multi-batch design replicated at two or more time points.

Sacrificial pseudo-replication occurs when there are multiple replicates within each treatment arm and the data are structured as a feature of the design (such as pairing, clustering, or nesting), but the design structure is ignored in the analyses. The units are treated as independent, so the degrees of freedom for testing treatment effects are too large. Sacrificial pseudo-replication is especially common in studies with categorical outcomes when the χ2 test or Fisher’s exact test is used for analysis (Hurlbert and White 1993; Hurlbert 2009).

Example: Sunfish Foraging Preferences

Dugatkin and Wilson (1992) studied feeding success and tankmate preferences in 12 individually marked sunfish housed in two tanks. Preference was evaluated for each fish for all possible pairwise combinations of two other tankmates. There were 2 groups × 60 trials per group × 2 replicate sets of trials, for a total of 240 observations. They concluded that feeding success was weakly but statistically significantly correlated with aggression (P < 0.001), based on 209 degrees of freedom, and that fish in each group strongly preferred (P < 0.001) the same individual in each of the two replicate preference experiments, based on 60 observations.

The actual number of experimental units is 12, with 6 fish per tank. The correct degrees of freedom for the regression analysis is 4, not 209. Suggested analyses for the preference data included one-sample t-tests with 5 degrees of freedom or a one-tailed Wilcoxon matched-pairs test with N = 12. Correct analyses would produce much larger P-values, suggesting that interpretation of these data requires substantial revision (Lombardi and Hurlbert 1996).

Temporal (or spatial) pseudo-replication occurs when multiple measurements are obtained sequentially on the same experimental units but analysed as if they represent individual experimental units. Sequential observations (or repeated measures) are correlated within each individual. Repeated measures increase the precision of within-unit estimates, but the number of repeated measures does not increase the power for estimating treatment effects.
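The common thread in all three types is that correlated observations are analysed as if they were independent, and the false positive rate inflates accordingly. A small simulation sketch (not from the book; all numbers are invented) makes this concrete for clustered data such as cage-mates:

## No true treatment effect; 8 cages of 5 mice share a cage effect
set.seed(99)
p <- replicate(2000, {
  cage <- rep(1:8, each = 5)
  trt  <- rep(c("A", "B"), each = 20)            # 4 cages per treatment
  y    <- rnorm(8)[cage] + rnorm(40)             # cage effect + mouse noise
  c(wrong = t.test(y ~ trt)$p.value,             # 40 mice as 'independent'
    right = t.test(tapply(y, cage, mean) ~ rep(c("A", "B"), each = 4))$p.value)
})
rowMeans(p < 0.05)
## 'wrong' rejects far more often than the nominal 5%;
## 'right' (analysing the 8 cage means) stays close to 0.05.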
Example: Tumour Proliferation in Mouse Models of Cancer

Sequential measurements of solid tumour volume in mice are commonly reported as a measure of disease progression or response to an intervention. Mull et al. (2020) tested the effects of low-dose UCN-01 to promote survival of tumour-bearing mice with lower tumour burden. Mice in four treatment groups were weighed daily for 30 days, then twice weekly to day 75. Differences in tumour volume between groups were assessed by t-tests and one-way ANOVA at five time points.

This is temporal pseudo-replication because the same groups of mice are repeatedly sampled over time, but separate hypothesis tests were performed at different time points. However, successive observations on the same mice are correlated, and sample size is expected to decline as mice die or are humanely euthanised at different times during the study. Traditional ANOVA or repeated-measures ANOVA methods cannot handle missing data or imbalance in the number of repeated responses and do not incorporate the actual correlation structure of the data. Mixed models are much more appropriate, because the true variation in the repeated measurements can be modelled directly by incorporating time dependencies and allowing customisation of the correlation structure; they can also accommodate missing data due to subject loss.
cific, reversible G1 arrest by UCN-01 in vivo provides
References cytostatic protection of normal cells against cytotoxic
chemotherapy in breast cancer. British Journal of
Blainey, P., Krzywinski, M., and Altman, N. (2014). Rep- Cancer 122 (6): 812–822. https://doi.org/
lication. Nature Methods 11: 879–880. 10.1038/s41416-019-0707-z.
Cox, D.R. and Donnelly, C.A. (2011). Principles of Applied Potvin, C. and Tardif, S. (1988). Sources of variability and
Statistics. Cambridge: Cambridge University Press. experimental designs in growth chambers. Functional
Dugatkin, L.A. and Wilson, D.S. (1992). The prerequi- Ecology 2: 123–130.
sites for strategic behaviour in bluegill sunfish, Lepo- Taylor, S.C. and Posch, A. (2014). The design of a quan-
mis macrochirus. Animal Behaviour 44: 223–230. titative Western Blot. BioMed Research International
Hammer, P.A. and Hopper, D.A. (1997). Experimental https://doi.org/10.1155/2014/361590.
design. In: Plant Growth Chamber Handbook (ed. R. van Belle, G. (2008). Statistical Rules of Thumb, 2nd edi-
W. Langhans and T.W. Tibbitts), 177–188. Iowa State tion. New York: Wiley.
University NCR-101 Publication No. 340. https:// Vaux, D., Fidler, F., and Cumming, G. (2012). Replicates
www.controlledenvironments.org/wp-content/ and repeats—what is the difference and is it signifi-
uploads/sites/6/2017/06/Ch13.pdf. cant? EMBO Reports 13: 291–296.
3
Ten Strategies to Increase Information (and Reduce Sample Size)

CHAPTER OUTLINE HEAD

3.1 Introduction 17
3.2 The ‘Well-Built’ Research Question 17
3.3 Structured Inputs (Experimental Design) 19
3.4 Reduce Variation I: Process Control 20
3.5 Reduce Variation II: Research Animals 21
3.6 Reduce Variation III: Statistical Control 22
3.7 Appropriate Comparators and Controls 23
3.7.1 Types of Controls 23
3.7.2 When Are Controls Unnecessary? 25
3.8 Informative Outcome Variables 25
3.9 Minimise Bias 26
3.10 Think Sequentially 27
3.11 Think ‘Right-Sizing’, Not ‘Significance’ 27
3.A Resources for Animal-Based Study Planning 28
References 29

3.1 Introduction

Reduction of animal numbers is a key tenet of the 3Rs strategy, but at times it may seem to conflict with the goal of maximising statistical power. Large power results in part from increasing sample size. However, a large sample size does not guarantee adequate power, and high power alone does not ensure that results are informative. This section outlines ten complementary strategies for maximising experimental signal and reducing noise, and therefore increasing the information content of study data. Highlighted are strategies for reducing experimental variation before, rather than after, the experiment is conducted. Incorporating all ten strategies will also increase experimental efficiency – the ability of an experiment to achieve study objectives with minimal expenditure of time, money, and animals.

The ten strategies are as follows:

1. ‘Well-built’ research questions
2. Structured inputs (statistical study designs)
3. Reduce variation I: Process control
4. Reduce variation II: Research animals
5. Reduce variation III: Statistical control
6. Appropriate comparators and controls
7. Informative outcomes
8. Minimise bias
9. Think sequentially
10. Think ‘right-sizing’, not ‘significance’

3.2 The ‘Well-Built’ Research Question

Once the investigator has identified an interesting clinical or biological research problem, the challenge is to turn it into an actionable, focused, and

testable question. A well-constructed research question consists of four concept areas: the study population or problem of interest, the test intervention, the comparators or controls, and the outcome. Format is modified according to study type (Box 3.1 and Figure 3.1).

BOX 3.1
The ‘Well-Built’ Research Question

Experimental/intervention studies: PICOT
Population/Problem
Intervention
Comparators/Controls
Outcome
Time frame, follow-up

Observational studies: PECOT
Population/Problem
Exposure
Comparators
Outcome
Time frame, follow-up

Diagnostic studies: PIRT
Population/Problem
Index test
Reference/gold standard
Target condition

Structuring the research question enables clear identification and discrimination of causes (factors that are manipulated or serve as comparators), effects (the outcomes that are measured to assess causality), and the test platform (the animals used to assess cause and effect). Breaking the research question into components allows the identification and correction of metrics that are otherwise poorly defined or unmeasurable.

Figure 3.1: System diagram for the ‘well-built’ research question. (Diagram elements: Interventions (I); ‘Patient’/‘Platform’/‘Population’ (P); Comparators, controls (C); Outcomes (O); Time (T).)

A well-constructed research question is essential for effective literature searches. Comprehensive literature reviews provide current evidence-based assessments of the scientific context, the research gaps to be addressed, suitability of the proposed animal and disease model, and more realistic assessments of potential harms and benefits of the proposed research (Ritskes-Hoitinga and Wever 2018; Ormandy et al. 2019). Collaborative research groups such as CAMARADES (https://www.ed.ac.uk/clinical-brain-sciences/research/camarades/about-camarades) and SYRCLE (https://www.syrcle.network/) are excellent resources for certain specialities such as stroke, neuropathic pain, and toxicology, and provide a number of e-training resources and tools for assessing research quality. The PICOT framework for constructing research questions was originally developed for evidence-based medicine. Information on constructing research questions and designing literature searches can be obtained from university library resources sections and the Oxford Centre for Evidence-based Medicine website.

The research question dictates formation of both the research hypothesis and related statistical hypotheses. These are often confused or conflated. The research hypothesis is a testable and quantifiable proposed explanation for an observed or predicted relationship between variables or patterns of events. It should be rooted in a plausible mechanism as to why the observation occurred. One or more testable predictions should follow logically from the central hypothesis (‘If A happens, then B should occur, otherwise C’). A description of the scientific hypothesis provides justification for the experiments to be performed, why animals are needed, and the rationale for the species, type or strain of animals, and justification of animal numbers.

The statistical hypothesis is a mathematically-based statement about a specific statistical population

The statistical hypothesis is a mathematically-based statement about a specific statistical population parameter. Hypothesis tests are the formal testing procedures based on the underlying probability distribution of the relevant sample test statistic. The choice of statistical test will depend upon the statistical hypothesis, the study design, and the types of variables (continuous, categorical, ordinal, time to event) designated as inputs or outcomes. Statistical hypotheses should be a logical extension of the research hypothesis (Bolles 1962). However, the research hypothesis may not immediately conform to any one statistical hypothesis, and multiple statistical hypotheses may be required to adequately test all predictions generated from the research hypothesis.

Example: Research Versus Statistical Hypotheses

Based on a comprehensive literature review, an investigator determined that ventricular dysrhythmia after myocardial infarction is associated with high risk of subsequent sudden cardiac arrest in humans (clinical observation, clinical pattern). The investigator wished to design a series of experiments using a mouse model of myocardial infarction to test the effects of several candidate drugs, with the goal of reducing sudden cardiac death.

Scientific hypothesis. Pharmacological suppression of ventricular dysrhythmia should result in clinically important reductions in the incidence of sudden cardiac death.

Research question. In a mouse model of dysrhythmia following myocardial infarction (P), will drug X (I), when compared to a saline vehicle solution (C), result in fewer deaths (O) at four weeks post-administration (T)?

Quantified outcomes. Number of deaths (n) in each group and proportion of deaths (p) in each group.

Statistical hypothesis. The null hypothesis is that of no difference in the proportion of deaths for mice treated with drug X (pX) versus mice treated with control C (pC): H0: pX = pC, or equivalently H0: pX – pC = 0.
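A hypothesis framed this way feeds directly into a sample size calculation. Below is a minimal sketch using Python and the statsmodels power routines; the assumed death proportions (40% with vehicle, 15% with drug X), the two-sided alpha of 0.05, and the power of 0.80 are invented for illustration, not values from the example.

# Sketch: sample size for H0: pX = pC (illustrative proportions).
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_control, p_drug = 0.40, 0.15                 # assumed death proportions
h = proportion_effectsize(p_control, p_drug)   # Cohen's h (arcsine scale)

n_per_group = NormalIndPower().solve_power(
    effect_size=h, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"Cohen's h = {h:.2f}; n per group = {n_per_group:.1f}")

Under these assumptions the calculation gives roughly 48 mice per group; any standard power-analysis tool will reproduce it. The point is that the quantified outcome and statistical hypothesis from the PICOT breakdown supply exactly the inputs the calculation needs.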

3.3 Structured Inputs (Experimental Design)

The design of an animal-based study will affect estimates of sample size. Good study designs are an essential part of the 3Rs (Russell and Burch 1959; Kilkenny et al. 2009; Karp and Fry 2021; Gaskill and Garner 2020; Eggel and Würbel 2021). Rigorous, statistically-based experimental designs consist of the formal arrangement and structuring of independent (or explanatory) variables hypothesised to affect the outcome (Box 3.2). The optimum design will depend on the specific research problem addressed. However, to be fit for purpose, all designs must facilitate discrimination of signal from noise by identifying and separating out the contributions of explanatory variables from different sources of variation (Reynolds 2022). By increasing the power to detect real treatment differences, a properly designed experiment requires far less time, money, and resources (including animals) for the amount of information obtained.

Well-designed studies start with a well-constructed research question and well-defined input and output variables. A good design also incorporates specific design features that ensure results are reliable and valid. These include correct specification of the unit of analysis (or experimental unit), relevant inclusion and exclusion criteria, bias minimisation methods (such as allocation concealment and randomisation; Section 3.9), and minimisation of variation (Addelman 1970). Useful designs for animal-based research include completely randomised designs, randomised complete block designs, factorial designs, and split-plot designs (Festing and Altman 2002; Montgomery 2017; Festing 2020; Karp and Fry 2021).

BOX 3.2
Statistical Design of Experiments: Components

Study design. Formal structuring of input or explanatory variables according to statistically-based design principles.
Study design features. Unit of analysis (experimental unit), inclusion/exclusion criteria, bias minimisation methods, sources of variation.
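To make ‘structured inputs’ concrete, the sketch below (Python standard library only; the factors, levels, group size, and seed are invented for illustration) lays out a 2 × 2 factorial design and randomly allocates animals to the four treatment combinations, as in a completely randomised design.

# Sketch: a 2 x 2 factorial layout with completely randomised allocation.
import itertools
import random

random.seed(1)                                  # fixed seed for a reproducible allocation

drug = ["vehicle", "drug X"]                    # factor 1 (hypothetical)
diet = ["standard", "enriched"]                 # factor 2 (hypothetical)
cells = list(itertools.product(drug, diet))     # the 4 treatment combinations

n_per_cell = 4
animals = [f"M{i:02d}" for i in range(1, n_per_cell * len(cells) + 1)]
random.shuffle(animals)                         # randomise animal order

groups = [animals[i * n_per_cell:(i + 1) * n_per_cell] for i in range(len(cells))]
for cell, group in zip(cells, groups):
    print(cell, sorted(group))

Sixteen animals arranged this way support estimates of both main effects and their interaction; running the same comparisons as separate two-group trials would require substantially more animals for less information.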
3.4 Reduce Variation I: Process Control

‘Running a tight ship’ to minimise process-related variation ensures high-quality research results (Box 3.3). Reliable and consistent results are achieved by a commitment to ongoing quality improvement (QI) and process control. The mechanism for process control is standardisation of procedures and human performance. Poor planning and lack of management can introduce additional uncontrolled and often unrecognised variation into experimental processes, and seriously jeopardise both research quality and reproducibility. When protocols have been previously approved by ethical oversight committees, deviations and non-compliances must be reported to institutional and federal authorities.

BOX 3.3
Process Control

Minimise process-related variation.
▪ Standardise protocol and operational procedures
▪ Standardise human performance; minimise human performance errors
▪ Regular quality checks
▪ Simple performance benchmarks
▪ Commit to upgrading and improving performance standards
▪ Good data management practices.

Protocols should present a focused blueprint of work that could realistically be accomplished in the given time with the resources, budget, and personnel actually available, and with best-practice care and welfare plans in place. It follows that investigators, technical staff, and other personnel involved with the study should be familiar with the protocol and standard operating procedures before the study begins, and routinely review procedures throughout.

Procedural variability is reduced by standardising operating procedures, using simple, relevant performance benchmark metrics to evaluate procedural compliance, and incorporating quality control tools (such as checklists and visual aids) into the experimental workflow (Reynolds et al. 2019). Variation in operator performance, and hence measurement error, can be reduced by training all personnel up to predetermined best-practice standards, performing regular quality assurance (QA) checks for equipment calibration and drift, and frequent checks for data anomalies (Preece 1982, 1984).

Data management is an essential but often overlooked skill that contributes to the information power of a study. Datasets need to be properly structured and standardised to facilitate data collection, checking, cleaning, visualisation, and eventual analysis. Data entry should be formatted according to the rules of ‘tidy data’ (Wickham 2014; Box 3.3A). Incorrectly measured or recorded data and missing data reduce sample size and statistical power and invalidate results. Therefore, sufficient time and resources must be allotted to QA methods for ensuring data are reliable, error-free, and complete. Plans must also be in place for appropriate data documentation, archiving, and security. Many journals now require that all data necessary to replicate research findings are made available, either in appropriate public data repositories or as part of the published manuscript.

BOX 3.3A
Data Management Rules

1. One variable per column
2. One observation per row
3. One ‘observational unit’ per table
▪ Each column header is a variable name.
▪ Date and time are separate variables (therefore separate columns).
▪ No special characters.
▪ Do not code missing data as zero or leave cells empty.
▪ Keep separate records of original raw data before manipulation.
▪ Always create copies with multiple backups.
▪ Always include a data dictionary.
▪ Label auxiliary readouts with appropriate tracking information.

The data dictionary or variable key should be created on a separate spreadsheet. It should include the names of all variables, a description, and definitions for each variable, together with measurement units and coding descriptions. All codes should be clearly defined and interpretable. Regular quality checks should be performed to ensure that data required for the study are actually being collected, that data are not contaminated or missing due to technical artefacts, and that recording and transcription errors are caught and corrected at an early stage. Ensure all samples and readouts can be matched to the correct subject and sample collection time (date, time, subject ID). Track and trace systems ensure data are reliable, enable troubleshooting, and may be required for regulatory compliance.

Formatting and organisation of data collection spreadsheets may evolve, so it is essential to test-drive them first to find the most efficient methods for recording and quality review. Plans should also be in place for data archiving, security, and safeguarding of sensitive information. An excellent reference on electronic documentation of experimental procedures is Gerlach et al. (2019).
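The rules in Box 3.3A translate directly into a data layout. The sketch below (Python with pandas; the variable names and values are invented for illustration) builds a small tidy dataset, with one variable per column and one observation per row, alongside a data dictionary held in a separate table.

# Sketch: a tidy dataset and a separate data dictionary (invented values).
import pandas as pd

data = pd.DataFrame({
    "animal_id":   ["M01", "M01", "M02", "M02"],
    "date":        ["2024-03-01", "2024-03-08", "2024-03-01", "2024-03-08"],
    "time":        ["09:15", "09:20", "09:40", "09:45"],
    "treatment":   ["drug X", "drug X", "vehicle", "vehicle"],
    "body_mass_g": [24.1, 24.8, 23.5, None],   # missing value left as NA, never coded 0
})

dictionary = pd.DataFrame({
    "variable":    ["animal_id", "date", "time", "treatment", "body_mass_g"],
    "description": ["unique animal identifier", "measurement date (ISO 8601)",
                    "measurement time (24 h clock)", "allocated treatment group",
                    "body mass in grams"],
})

print(data.isna().sum())   # routine quality check: missing values per column

Keeping the raw file read-only and re-running the missing-value check after each data-entry session covers several of the Box 3.3A rules at once.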

3.5 Reduce Variation II: Research Animals

Differences between animals are a major source of variation in experiments. Between-individual variability results from the characteristics of individual animals, the environment in which they are maintained, and the conditions of handling and experimentation.

The choice of animal model depends on the disease or injury condition to be simulated, or the characteristics that best serve as a test platform for evaluating candidate therapeutics. Signalment, housing and husbandry, and care and welfare must be clearly described and scientifically justified (Lilley et al. 2020; Box 3.4).

BOX 3.4
Research Animals

Minimise animal-related variation and unnecessary waste of animals.
Account for:
▪ Signalment
▪ Housing, husbandry, and routine handling
Have plans in place for:
▪ Care, housing, and husbandry
▪ Welfare and pain and stress management
▪ End-use disposition.

Animal signalment is analogous to human demographic information. Signalment information includes species, strain, age, sex, weight, reproductive status, and source or vendor. These data provide information necessary for assessing model (construct) validity, generalisability (external validity), and the similarity of group baseline characteristics for hypothesis-testing studies. Source and strain information is especially crucial for rodent models. Phenotype changes can occur as the result of random and quiet mutations, genetic drift, and even subtle differences in environmental factors (e.g. Mahajan et al. 2016). Functional variants and sub-strain differences in phenotypic responses have been reported for the same strains even with the same vendor but from different barrier facilities (e.g. Simon et al. 2013). Misidentification, mislabelling, and/or cross-contamination of mouse strains and cell lines (Uchio-Yamada et al. 2017; Souren et al. 2022) are symptomatic of management failure. Investigators should be prepared to include genotyping and cell line authentication in protocol work descriptions and to document thoroughly all strain- and cell-related authentication procedures (Souren et al. 2022).

It is a common misconception that between-animal variation can be reduced by ‘standardising’ housing and husbandry conditions, that is, by denying enrichment or opportunities for normal activity and behaviours. Such ‘standardised’ conditions have been shown to shorten rodent lifespan and adversely affect animal health and welfare. Animals kept in barren conditions cannot be considered representative of baseline or control states (Martin et al. 2010). Another misconception is that environmental enrichment compromises research data. Instead, there is abundant evidence to show that refinements such as enrichment improve rather than compromise internal and external validity and reproducibility (Richter et al. 2009, 2010; André et al. 2018; Voelkl et al. 2018). Incorporating enrichment as ‘planned heterogeneity’ significantly reduces animal stress and minimises extraneous and uncontrolled variation, thus promoting more reliable results (Bailoo et al. 2018; Karp 2018; Eggel and Würbel 2021).

Habituation protocols (Reynolds et al. 2019; Ujita et al. 2021) and non-aversive handling (Gouveia and Hurst 2019) also reduce distress and anxiety in animals and make results less variable.

Unrelieved or inadequately treated pain, suffering, and stress modify animal behaviour and physiology and are, therefore, additional confounders of study outcomes. Protocols should include prospective and comprehensive welfare plans for the recognition of pain and distress behaviours, best-practice palliation measures, choice of analgesic agents, and clearly defined humane endpoints, with criteria for deployment (Carbone 2011).

Careful end-use disposition planning is an important part of sample size reduction strategies (Smith et al. 2018; Reynolds 2021). Rodent breeding programmes in particular can involve considerable waste if animals are simply euthanised without being used; for example, if they age out of the study or are of an unwanted genotype. Strategies for minimising collateral losses and animal waste should incorporate preliminary estimates of the target number of animals required for experiments, the proportion of animals to be produced with both the desired and unwanted genotypes, and plans for reducing wasted animals (e.g. a revised production schedule to minimise litter drop at any one time, transfer to other protocols, or reduced study scope).

3.6 Reduce Variation III: Statistical Control

Extraneous nuisance or confounding factors increase variation or ‘noise’ in experiments and may obscure true effects. Statistical methods for reducing variation include controlling for known sources of variability, grouping similar experimental units together, and ensuring balanced sample sizes across groups (Box 3.5).

BOX 3.5
Statistical Control

Minimise extraneous variation.
1. Controlling for known sources of variation
   Blocking
   Stratification
   Clustering
2. Sample size balance.

Blocking, stratification, and clustering are used in statistics to reduce variability in experimental design and improve the precision of results. Known sources of variation are categorised as nuisance variables. Nuisance variables are not of direct interest for testing the scientific hypotheses but may be expected to influence the outcome and increase variation.

Blocks consist of similar experimental units or subjects, grouped into homogeneous subgroups or blocks, categorised by a nuisance variable. Randomisation is conducted independently for each block, so each block contains all treatments. Blocking variables may be physical (e.g. a cage of mice, a litter of piglets, a pen of calves, a tank of fish) or temporal (e.g. by week). The simplest blocking design is the randomised complete block design (RCBD). Permuted block randomisation is especially useful in preventing sample size imbalance and controlling for time dependencies when participants enter the study sequentially (Montgomery 2017).

Stratification involves dividing the experimental units into homogeneous subgroups based on characteristics of the subjects, such as age, sex, or baseline characteristics. Randomisation is conducted independently for each stratum (Kernan et al. 1999). Examples of stratification variables include age class (e.g. neonate, juvenile, adult), sex, body condition (e.g. below normal, normal, overweight, obese), clinical severity (e.g. IRIS stages of feline chronic kidney disease), tumour staging (Dugnani et al. 2018), and time. Stratification is common in survey research and may be useful for relatively small studies when responsiveness to a drug intervention or clinical prognosis will be strongly affected by a given confounding variable. However, if sample size is small, there is the possibility that some strata will have too few or zero experimental units unless the number of strata is reduced, in which case baseline imbalances would not be correctable.

Clustering involves grouping experimental units into homogeneous subgroups based on a hierarchical relationship to a natural grouping variable, such that participants are contained or nested within clusters. Examples include litters clustered within breeding pairs, nestlings within nests, or participants clustered within clinics. The simplest cluster design has two levels, with individual subjects nested within clusters. Depending on the research question, randomisation can occur at either the cluster level (clusters are the experimental unit and clusters are randomised) or the individual level (subjects within clusters are the experimental unit and subjects within each cluster are randomised). Multilevel hierarchical designs can also be considered (Aarts et al. 2015; Cook et al. 2016).

Sample size balance. Balanced allocation designs with equal sample sizes n per group have the highest precision and power for a given total sample size. Equal sample sizes may reduce the effect of chance imbalances in measured covariates, thereby minimising the potential for bias in the results. Unbalanced sample size allocation increases the variability of the sample, reducing precision and power and making it more difficult to detect true differences between groups.
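Permuted-block randomisation is short enough to sketch directly (Python standard library; the block size, group labels, number of blocks, and seed are arbitrary illustrative choices). Every block contains each treatment equally often, so group sizes stay balanced however far the study has progressed.

# Sketch: permuted-block randomisation with block size 4 (illustrative).
import random

random.seed(7)
treatments = ["A", "B"]                       # coded group labels
block_size = 4                                # each block: 2 x A, 2 x B
n_blocks = 5                                  # 20 allocation slots in total

schedule = []
for _ in range(n_blocks):
    block = treatments * (block_size // len(treatments))
    random.shuffle(block)                     # permute within the block
    schedule.extend(block)

print(schedule)
print("A:", schedule.count("A"), "B:", schedule.count("B"))   # always balanced

Balance matters because the standard error of a two-group comparison scales with the square root of (1/n1 + 1/n2), which is smallest when n1 = n2 for a fixed total number of animals.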

3.7 Appropriate Comparators and Controls

Comparators and controls are used to establish cause-and-effect relationships between interventions and responses (Box 3.6). By providing a standard of comparison, controls act as a reference for evaluating whether results are due to the effect of the independent or explanatory variables or are instead an artefact of some unknown confounder or time dependency in the data. Controls, therefore, provide a method to distinguish experimental ‘signal’ from ‘noise’. For valid comparisons, the experimental units or animals used as controls should always be as similar to those in the test groups as possible. To minimise bias, experiments with controls should also include formal randomisation and blinding (Moser 2019).

BOX 3.6
Appropriate Comparators and Controls

Consider:
1. Type: positive, negative, standard, sham, vehicle, strain, and self.
2. Allocation: concurrent versus historical.
3. Necessity: Is a specific type of control required?
4. Non-redundancy: Is a specific type of control informative?

3.7.1 Types of Controls

Seven types of controls typically used to establish causality in animal research are positive, negative, standard, sham, vehicle (Johnson and Besselsen 2002), matching, and strain (Moser 2019). Baseline or ‘own’ controls are discussed separately.

Positive controls assess the validity of the test; that is, whether or not the response can actually be measured or distinguished from background noise. Negative controls assess the magnitude of response that would be expected if the test intervention had no effect. Standard controls are procedures or agents for which the effect is already known and verified (standard of care), to be compared with the new intervention. Sham controls are common in experiments where surgical procedures are required for instrumentation or to induce iatrogenic disease or injury. Shams account for the inflammatory, morphological, and behavioural effects associated with surgical procedures, anaesthesia, analgesia, and post-surgical pain. They are used to discriminate systemic effects resulting from the procedures from those actually induced by the test intervention. However, sham procedures may be inappropriate controls for some models and can profoundly confound results (Cole et al. 2011). They may be unnecessary for certain models (see below).

Vehicle controls assess the potential physiological effect of the active agent by accounting for the potential effects of the excipient, carrier, or solubilising solution in which the agent is suspended. Vehicles may not be physiologically inert and can radically affect experimental outcomes, so they must be chosen carefully (Moser 2019).

Matching is appropriate for both experimental and observational studies, especially when sample sizes are small and experimental units differ considerably in baseline characteristics. Matching removes some of this bias by pairing subjects on specific baseline covariates, such as age, sex, sibling relationship, disease stage, and sample collection times. Regression-based models are often more powerful than simple matched-pairs comparisons (Greenland and Morgenstern 1990).

Strain controls. Several different control strains or sub-strains should be included in the experimental design (another example of planned heterogeneity; Åhlgren and Voikar 2019). Alternatively, wild-type littermates could be used as controls (for example, for homozygous knockouts; Holmdahl and Malissen 2012) in blocked designs.

Selection of an appropriate control strain or species can be difficult. Availability and cost will affect sample size determinations. Comparisons of test strains with a single control or founder strain are highly problematic and not recommended (Garland and Adolph 1994). Observed differences between strains could simply be the result of random genetic drift (rather than the experimental intervention) and are also statistically confounded with strain membership. Different strains and sub-strains of rodents diverge both genetically and phenotypically, depending on source. For example, the C57BL/6 mouse strain is one of the most common and widely used inbred strains and is widely regarded as a ‘gold standard’ comparator. However, substantial differences in phenotype between sub-strains and vendors contribute to the contradictory results reported for common behavioural and physiological tests (Åhlgren and Voikar 2019) and disease conditions (Löscher et al. 2017).

Own control or within-subject designs have a baseline or covariate measurement on an animal that adjusts for subsequent measurements taken on the same animal. Designs with repeated measures are preferred to simple parallel-arm (‘between-subjects’) designs that do not incorporate covariates, because they minimise the effects of between-subject variance (Karp and Fry 2021) and therefore have more power and require fewer animals. Methods of accounting for baseline differences are especially important when the experiment is small. Examples of within-subject designs include paired, before-after, crossover, classic repeated measures, and longitudinal designs (Bate and Clark 2014; Hooijmans et al. 2014; Lazic 2016).

Concurrent controls are the most relevant comparators for determining treatment-related effects and the most appropriate for ensuring internal validity. Historical controls are comparators for new experiments that consist of data from experiments completed in the past. Their use is occasionally suggested as a method to reduce animal numbers. However, historical controls are strongly discouraged for naïve hypothesis tests against new test data. First, randomisation is essential for valid statistical inference, but historical data cannot be integrated into randomisation protocols. Therefore, comparing new experimental data to historical data introduces serious biases that will compromise results. Second, historical controls will be valid only if the stringent ‘Pocock criteria’ can be met (Hatswell et al. 2020). The Pocock criteria specify that all experimental conditions (animals, animal housing and husbandry, experimental procedures, study design, endpoints, and methods of measurement) must be identical to the historical condition, except for the treatment to be tested. However, these restrictions are unlikely to be met in practice (Kramer and Font 2017). In reality, historical controls will be subject to sources of variation not present in the new experiment, including changes in animal husbandry, laboratory techniques, personnel and personnel skill levels, strain, vendor or source, animal eligibility criteria, routes of agent administration, chemical reagents, and suppliers.

Historical control data can be extremely useful for planning new studies and informing study design; for example, for estimating variance and baseline means, or establishing a plausible range of effect sizes (Kramer and Font 2017). They are also useful in oncology studies for interpreting rare tumour incidence, severity, and/or proliferation patterns, or for identifying evolving trends in tumour biology and behaviour (Keenan et al. 2009). Animal numbers can be reduced considerably if sample size estimates are based on Bayesian priors calculated from data obtained from previous experiments.
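One legitimate, planning-only use of historical control data can be sketched in a few lines (Python with statsmodels; the historical values and target difference are invented for illustration). The historical data supply the variance estimate for a power calculation; they are not used as a comparison arm.

# Sketch: historical controls used for planning, not testing (invented values).
import statistics
from statsmodels.stats.power import TTestIndPower

historical_controls = [7.8, 8.4, 9.1, 8.0, 8.8, 7.5, 8.9, 8.2]   # past control outcomes
sd = statistics.stdev(historical_controls)    # variance estimate for planning
target_difference = 1.0                       # smallest difference worth detecting
d = target_difference / sd                    # standardised effect size

n = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
print(f"sd = {sd:.2f}, d = {d:.2f}, n per group = {n:.1f}")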

However, appropriate and informative estimates of priors require the historical studies to be high-quality and well-powered to begin with. If prior data are biased and inadequate, priors should not be used, and sample sizes for the new experiment should be obtained from appropriate power calculations in the usual way (Bonapersona et al. 2021).

3.7.2 When Are Controls Unnecessary?

Inclusion (or omission) of specific types of control depends on study objectives and specific aims. Failing to include appropriate controls may result in an inconclusive experiment, and thus the waste of all the animals used in it. Conversely, the use of large numbers of animals in unnecessary or redundant control groups is equally wasteful and should be discouraged. Sham controls may be unnecessary when the sham condition consists of animals left untreated after acute, well-established procedures with highly predictable experimental endpoints, and spontaneous recovery is not possible (Kramer and Font 2017). Examples include untreated spinal cord crush or transection (Kramer and Font 2017), severe haemorrhagic shock (Tremoleda et al. 2017), or sepsis (Zingarelli et al. 2019). Sham surgical controls may not be required if the only goal is to determine the specific effects of the surgical procedure. Reducing variation in personnel technical skills, training up to a high and consistent performance level, and randomisation can eliminate the need for shams. Redundant controls occur if studies are designed as numerous one-factor-at-a-time (OFAT) two-group trials, each with its own ‘control’. This approach is inefficient, has low precision, and cannot detect potential interactions, such as synergisms or inhibitory effects, which are usually of most interest. The result is the waste of large numbers of animals for limited information. More appropriate designs are available that enable screening of multiple predictors with small sample sizes and do not require multiple redundant control groups (Bate and Clark 2014; Lazic 2016).

Operational pilots, used to determine, standardise, and finalise details of logistics, procedures, methods, resources, and processes, will usually not need a comparator group (and may not need any animals at all). However, pilots designed to generate preliminary data for subsequent studies should include controls (and if possible be randomised and blinded) to strengthen the evidence for or against proceeding with the investigation. Study designs where animals act as their own control may not require a separate control group (see Bate and Clark 2014; Lazic 2016; Karp and Fry 2021).

3.8 Informative Outcome Variables

Experimental endpoints or outcome variables are the dependent or response variables used to assess whether the test intervention ‘works’ (Box 3.7). The choice of variable used as the experimental outcome (continuous, binary, time to event) will affect estimates of effect size and sample size. Specific outcomes will also determine the methods and procedures used in the experiments to obtain the data. Limitations in methods or technology may constrain the ability to measure the outcome. Humane endpoints and welfare intervention points must be accounted for when determining the choice of outcome variable. Outcome variables must be clearly described, specific, and measurable (‘What is being measured?’), and include the frequency of measurements obtained per experimental unit (‘How often will it be measured?’). Outcomes should be clinically or biologically relevant to the condition being studied. Outcomes measured in preclinical studies are often surrogates for clinical conditions, but if target biomarkers or molecular pathways are not clinically relevant, they will be impossible to translate. To maximise power and minimise sample size, continuous outcome variables are preferred to binary outcomes.

BOX 3.7
Informative Outcome Variables

▪ Can it be measured? Is it straightforward to measure?
▪ Is it biologically or clinically relevant?
▪ Consider choice of variable: continuous rather than binary.
▪ Humane endpoints and welfare intervention points accounted for.
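The cost of a binary rather than a continuous outcome is easy to quantify. In the sketch below (Python with scipy and statsmodels; the assumed effect of 0.5 SD is an illustrative assumption), the same underlying treatment effect is analysed either as a continuous measurement or after splitting responses at the control median.

# Sketch: sample size penalty for dichotomising a continuous outcome.
from scipy.stats import norm
from statsmodels.stats.power import TTestIndPower, NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

d = 0.5                                        # assumed true shift: 0.5 SD
n_cont = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)

p_control = 0.5                                # split at the control median
p_treated = norm.cdf(d)                        # ~0.69 for a 0.5 SD shift
h = proportion_effectsize(p_treated, p_control)
n_bin = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.80)

print(f"continuous outcome:   {n_cont:.0f} animals per group")
print(f"dichotomised outcome: {n_bin:.0f} animals per group")

Under these assumptions the dichotomised version needs roughly 103 animals per group against about 64 for the continuous version: around 60% more animals for the same question.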

Many investigators collect data for large numbers of outcomes, presumably to maximise the amount of information obtained from each animal. The choice of outcomes should be prioritised based on relevance and applicability to the central hypotheses, study goals, and objectives. Without such discrimination, studies will be overly large and unfocused, and therefore difficult to interpret (Vetter and Mascha 2017). Measurement and analysis of large numbers of irrelevant variables will greatly increase the probability of type I error and the number of false positives. Such studies are also liable to P-hacking (Head et al. 2015) and data dredging (Smith and Ebrahim 2002).

Outcomes should be prioritised by relevance and importance to study goals on a ‘need to know’ versus ‘nice to know’ basis. ‘Need to know’ or primary outcome variables are mission-critical and are the most important for addressing study objectives. The primary outcome drives power calculations and sample size estimates and is central to the interpretation of results. Secondary outcomes (‘nice to know’) provide supporting or corroborating evidence for conclusions based on the primary outcome. Different objectives to be addressed in the same study may require very different outcome variables for separate experiments.

Choice and measurement of outcome variables must also account for humane endpoints: the study-relevant welfare indicators determining the point at which the animal will be removed from the study, and pain and distress terminated (for example, by euthanasia). Validity, reliability, and the translation potential of the data collected depend on the health and welfare of the animals used in the study. It is also an ethical mandate to minimise pain, suffering, and distress of animals used in experiments. Appropriate welfare indicators are one or more predetermined, objective, and easily recognisable behavioural and physical criteria used to assess individual animal welfare and the severity of animal distress. Multiple welfare indicators and appropriate monitoring schedules provide a complete welfare picture for each animal and minimise interpretation errors (Hawkins et al. 2010). Published guidelines provide information on the choice of appropriate humane endpoints for specific research models (e.g. oncology: Workman et al. 2010; ischaemia: Percie du Sert et al. 2017a; sepsis: Zingarelli et al. 2019).

3.9 Minimise Bias

Bias is the systematic deviation of estimates based on observations from the true value of the population parameter. Bias can occur at all stages of the research cycle. However, most types of bias can be accounted for, or minimised, only during the planning and design phases, and cannot be removed statistically after data are collected. The SYRCLE Risk of Bias tool (Hooijmans et al. 2014) was developed for assessing the methodological quality of published research for inclusion in systematic reviews and is extremely useful for identifying the most common types of bias in animal studies (Box 3.8).

BOX 3.8
Minimise Bias

▪ Randomisation
▪ Allocation concealment (blinding)
▪ Consider other sources of bias during planning – selection, performance, detection, attrition, reporting, and publication.

Large sample size does not reduce bias. Only bias minimisation methods reduce bias. Essential best-practice methods are randomisation and allocation concealment (blinding). These should be incorporated early in the planning process to introduce and maintain research rigour (Kilkenny et al. 2009; Macleod et al. 2015; Bespalov et al. 2019; Karp and Fry 2021). Strategies are described in Bespalov et al. (2019). Randomisation ensures that a specific intervention is assigned with a given probability to any experimental unit. Randomisation minimises the effect of systematic bias and, more crucially, is the cornerstone of inferential statistical hypothesis testing. Allocation concealment involves the coded relabelling of group identifiers, so that the association between any specific intervention and the group is concealed from relevant personnel until after data collection or analysis. Allocation concealment minimises biases in selection or allocation, detection, and outcome assessment.

Unfortunately, routine incorporation of randomisation and allocation concealment into animal experiments is not standard practice (Kilkenny et al. 2009; Muhlhausler et al. 2013; Macleod et al. 2015).

There is also considerable evidence that investigators confuse the technical, probability-based meaning of ‘random’ with the lay meaning of ‘unplanned’ or ‘haphazard’ (Reynolds and Garvan 2020). Non-randomisation seriously compromises study validity, both by inflating the false positive rate and by invalidating statistical hypothesis tests.
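Both practices take only a few lines to implement. A minimal sketch (Python standard library; the animal IDs, group names, and seed are illustrative): animals are assigned at random, and the treatment identities are then hidden behind neutral codes, with the key stored away from the people scoring outcomes.

# Sketch: randomisation plus allocation concealment (illustrative labels).
import random

random.seed(42)
animals = [f"R{i:02d}" for i in range(1, 13)]
treatments = ["saline", "drug X"] * 6          # balanced: 6 animals per group
random.shuffle(treatments)                     # randomised assignment
assignment = dict(zip(animals, treatments))

labels = ["group 1", "group 2"]
random.shuffle(labels)                         # concealment: neutral, randomly ordered codes
key = dict(zip(["saline", "drug X"], labels))

blinded = {animal: key[trt] for animal, trt in assignment.items()}
print(blinded)     # outcome assessors see only 'group 1' / 'group 2'
# The unblinding key is held by someone outside data collection
# and opened only after the data are locked and analysed.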
3.10 Think Sequentially

The most efficient and cost-effective experiments are iterative (Box 3.9). That is, the results of previous experiments should inform the approach for the next set of experiments, allowing rapid convergence to an optimal solution (Box et al. 2005). In practice, however, preclinical research is typified by numerous two-group, trial-and-error, OFAT comparisons. It is sometimes argued that the OFAT approach ‘promotes creativity’ and discovery of serendipitous results. However, OFAT has several major disadvantages. These include the inability to detect interactions between multiple factors (which produce most ‘discovery by chance’), the prioritisation of statistical significance over biological or clinical relevance, and extremely inefficient use of animals compared to statistically-based designs (Montgomery 2017; Festing 2020; Gaskill and Garner 2020).

BOX 3.9
Think Sequentially

‘Several small rather than one large’
Each phase informs the conduct and structure of the next: pilot, screening, optimisation, confirmatory.

Adaptive designs are one type of iterative design common in human randomised clinical trials for assessing efficacy, futility, or safety. They are characterised by planned changes based on observed data that allow refinement of sample size, dropping of unpromising interventions or drug doses, and early termination for futility, thus increasing power and efficiency and reducing costs (Pallmann et al. 2018). However, most adaptive design methods are unsuited to preclinical studies because of different study objectives, complex methodology, and sample sizes that would be prohibitively large for animal-based experiments.

Sequential assembly is an alternative approach that accomplishes many of the same objectives as adaptive designs and is more suited to animal-based studies (Box et al. 2005; Montgomery 2017). Experiments are designed to run in series, with results from the preceding stage informing the design of the next stage. A pilot phase may be necessary to determine logistics and standardise procedures. The next phase is focused on factor reduction: ‘screening out the trivial many from the significant few’. In this phase, experiments consist of two-level factorial designs in which a relatively large number of candidate inputs or interventions thought to affect the response are reduced to a few of the more promising or important. Subsequent experiments can then be designed to optimise responses based on the specific factors identified in the preceding phases. In many cases, only two or three iterations are required to obtain the best or optimal response. Advantages of the sequential assembly approach include an overall reduction of sample sizes (and therefore the number of animals), estimation of treatment effects with greater precision for a given N, a better understanding of the relationships between variables (such as dose-response relationships), and improved decision-making (such as earlier termination of less-promising directions of enquiry). A further advantage of this structured approach, especially for exploratory animal-based research, is the increased chance of ‘discovering the unexpected’ or serendipitous results not possible with OFAT or trial-and-error (Box et al. 2005).

3.11 Think ‘Right-Sizing’, Not ‘Significance’

Right-sizing experiments, or sample size optimisation, refers to the process of determining the most efficient and optimal sample size for a study to achieve its objectives with the least amount of resources (Box 3.10). This involves finding the balance between adequate effect size and study feasibility. Evidence of right-sizing is provided by a clear plan for sample size justification and by reporting the number of all animals used in the study. In contrast, ‘chasing statistical significance’ refers to the experimental goal of obtaining a statistically significant (P < 0.05) result (Marín-Franch 2018).

Chasing significance is a particularly insidious and harmful practice contributing to the waste of countless numbers of animals in so-called ‘negative’ experiments.

BOX 3.10
Think ‘Right-Sizing’, Not ‘Significance’

A ‘right-sized’ experiment has the most efficient and optimal sample size to ethically achieve scientific objectives with the least amount of resources and without wasting animals.
‘Chasing statistical significance’ has the sole goal of obtaining a statistically significant result. It leads to questionable research practices, selective reporting, and the waste of large numbers of animals in ‘negative’ experiments.

Strength of conclusions is derived from model relevance, study design, bias control, and appropriate data collection methods, not P-values (Greenland et al. 2016). P-values do not necessarily provide support for the research hypothesis, have no clinical or biological meaning in themselves, and do not mean that results are scientifically important or meaningful. This is because small P-values can be an artefact of poor study design, inappropriate sample sizes, sampling variation, methodological errors, incorrect analyses, and bias.

Animals are obvious collateral damage resulting from the three immediate consequences of chasing significance: selective reporting, questionable research practices, and distortion of the evidence base. Perceived bias against the publication of negative results means that investigators probably do not report all experiments with all animals, but only those with statistically significant findings (Sena et al. 2010; ter Riet et al. 2012; Conradi and Joffe 2017; Wieschowski et al. 2019). Nearly all recent publications report statistically significant results (Chavalarias et al. 2016), indicating a high preponderance of selective reporting and discarding of so-called ‘failed’ experiments with negative results. The perception of research animals, especially mice, as disposable ‘furry test tubes’ (Garner et al. 2017) further contributes to the non-reporting of results. Chasing significance encourages questionable research practices such as P-hacking and N-hacking. P-hacking is the manipulation of data and analysis methods to produce statistically significant results (Head et al. 2015). N-hacking is the selective manipulation of sample size to achieve statistical significance by increasing sample size, cherry-picking observations, or excluding outliers without justification (Szucs 2016). Statements such as ‘We continuously increased the number of animals until statistical significance was reached to support our conclusions’ are particularly egregious forms of N-hacking. Both P-hacking and N-hacking increase the false positive rate (meaning that results are essentially wrong), make reported P-values ‘essentially uninterpretable’, and violate standards of ethical scientific and statistical practice (Altman 1980; Wasserstein and Lazar 2016).
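The effect of N-hacking on the false positive rate is simple to demonstrate by simulation. In the sketch below (Python with numpy and scipy; the starting group size, step, and cap are arbitrary choices), data are generated with no true effect, and animals are added whenever the test is not yet ‘significant’ – exactly the practice described above.

# Sketch: simulated N-hacking under a true null (no group difference).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_sim, n_start, n_step, n_max = 5000, 10, 5, 30

false_pos = 0
for _ in range(n_sim):
    n = n_start
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    while True:
        if ttest_ind(a, b).pvalue < 0.05:
            false_pos += 1                    # 'significant', yet no effect exists
            break
        if n >= n_max:
            break
        a = np.append(a, rng.normal(size=n_step))   # add more animals...
        b = np.append(b, rng.normal(size=n_step))   # ...and test again
        n += n_step

print(f"false positive rate: {false_pos / n_sim:.3f} (nominal 0.05)")

With these settings the realised false positive rate comes out well above the nominal 5%, roughly doubling it, even though every individual test looks legitimate.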
significance: selective reporting, questionable
research practices, and distortion of the evidence
base. Perceived bias against the publication of neg-
ative results means that investigators probably do 3.A Resources for Animal-
not report all experiments with all animals but
only those with statistically-significant findings
Based Study Planning
(Sena et al. 2010; ter Riet et al. 2012; Conradi Oxford Centre for Evidence-Based Medicine
and Joffe 2017; Wieschowski et al. 2019). Nearly (CEBM) https://www.cebm.ox.ac.uk/resources
all recent publications report statistically signifi- https://www.cebm.ox.ac.uk/resources/ebm-
cant results (Chavalarias et al. 2016), indicating a tools/finding-the-evidence-tutorial; https://
high preponderance of selective reporting and dis- www.cebm.ox.ac.uk/resources/data-extraction-
carding of so-called ‘failed’ experiments with neg- tips-meta-analysis/no-intervention
ative results. The perception of research animals, FRAME https://frame.org.uk/resources/
especially mice, as disposable ‘furry test tubes’ experimental-design/
(Garner et al. 2017) further contributes to non- EQIPD https://quality-preclinical-data.eu
reporting of results. Chasing significance PREPARE https://norecopa.no/PREPARE
encourages questionable research practices such NC3Rs Resource Library https://www.nc3rs.
as P-hacking and N-hacking. P-hacking is the org.uk/
manipulation of data and analysis methods to pro- Percie du Sert, N., Bamsey, I., et al. (2017).
duce statistically significant results (Head et al. The experimental design assistant. PLoS Biology
2015). N-hacking is the selective manipulation of 15(9): e2003779. https://doi.org/10.1371/jour-
sample size to achieve statistical significance by nal.pbio.2003779

References

Aarts, E., Dolan, C.V., Verhage, M., and van der Sluis, S. (2015). Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives. BMC Neuroscience 16: 93. https://doi.org/10.1186/s12868-015-0228-5.
Addelman, S. (1970). Variability of treatments and experimental units in the design and analysis of experiments. Journal of the American Statistical Association 65 (331): 1095–1109.
Åhlgren, J. and Voikar, V. (2019). Experiments done in Black-6 mice: what does it mean? Lab Animal (NY) 48 (6): 171–180. https://doi.org/10.1038/s41684-019-0288-8.
Altman, D.G. (1980). Statistics and ethics in medical research: misuse of statistics is unethical. BMJ 281: 1182–1184.
André, V., Gau, C., Scheideler, A. et al. (2018). Laboratory mouse housing conditions can be improved using common environmental enrichment without compromising data. PLoS Biology 16 (4): e2005019. https://doi.org/10.1371/journal.pbio.2005019.
Bailoo, J.D., Murphy, E., Boada-Saña, M. et al. (2018). Effects of cage enrichment on behavior, welfare and outcome variability in female mice. Frontiers in Behavioral Neuroscience 26. https://doi.org/10.3389/fnbeh.2018.00232.
Bate, S.T. and Clark, R. (2014). The Design and Statistical Analysis of Animal Experiments. Cambridge: Cambridge University Press.
Bespalov, A., Wicke, K., and Castagné, V. (2019). Blinding and randomization. In: Good Research Practice in Non-Clinical Pharmacology and Biomedicine, Handbook of Experimental Pharmacology 257 (ed. A. Bespalov, M. Michel, and T. Steckler). Cham: Springer. https://doi.org/10.1007/164_2019_279.
Bolles, R.C. (1962). The difference between statistical hypotheses and scientific hypotheses. Psychological Reports 11 (3): 639–645. https://doi.org/10.2466/pr0.1962.11.3.639.
Bonapersona, V., Hoijtink, H., and RELACS Consortium (2021). Increasing the statistical power of animal experiments with historical control data. Nature Neuroscience 24: 470–477.
Box, G.E.P., Hunter, J.S., and Hunter, W.G. (2005). Statistics for Experimenters, 2e. New York: Wiley.
Carbone, L. (2011). Pain in laboratory animals: the ethical and regulatory imperatives. PLoS ONE 6 (9): e21578. https://doi.org/10.1371/journal.pone.0021578.
Chavalarias, D., Wallach, J.D., Li, A.H., and Ioannidis, J.P. (2016). Evolution of reporting P values in the biomedical literature, 1990-2015. JAMA 315 (11): 1141–1148. https://doi.org/10.1001/jama.2016.1952.
Cole, J.T., Yarnell, A., Kean, W.S. et al. (2011). Craniotomy: true sham for traumatic brain injury, or a sham of a sham? Journal of Neurotrauma 28: 359–369.
Conradi, U. and Joffe, A.R. (2017). Publication bias in animal research presented at the 2008 Society of Critical Care Medicine Conference. BMC Research Notes 10 (1): 262. https://doi.org/10.1186/s13104-017-2574-0.
Cook, A.J., Delong, E., Murray, D.M. et al. (2016). Statistical lessons learned for designing cluster randomized pragmatic clinical trials from the NIH Health Care Systems Collaboratory Biostatistics and Design Core. Clinical Trials 13: 504–512. https://doi.org/10.1177/1740774516646578.
Dugnani, E., Pasquale, V., Marra, P. et al. (2018). Four-class tumor staging for early diagnosis and monitoring of murine pancreatic cancer using magnetic resonance and ultrasound. Carcinogenesis 39 (9): 1197–1206. https://doi.org/10.1093/carcin/bgy093.
Eggel, M. and Würbel, H. (2021). Internal consistency and compatibility of the 3Rs and 3Vs principles for project evaluation of animal research. Laboratory Animals 55 (3): 233–243. https://doi.org/10.1177/0023677220968583.
Festing, M.F.W. (2020). The ‘completely randomised’ and the ‘randomised block’ are the only experimental designs suitable for widespread use in pre-clinical research. Scientific Reports 10: 17577. https://doi.org/10.1038/s41598-020-74538-3.
Festing, M.F.W. and Altman, D.G. (2002). Guidelines for the design and statistical analysis of experiments using laboratory animals. ILAR Journal 43 (4): 244–258.
Garland, T. Jr. and Adolph, S.C. (1994). Why not to do two-species comparative studies: limitations on inferring adaptation. Physiological Zoology 67 (4): 797–828.
Garner, J.P., Gaskill, B.N., Weber, E.M. et al. (2017). Introducing therioepistemology: the study of how knowledge is gained from animal research. Lab Animal (NY) 46 (4): 103–113. https://doi.org/10.1038/laban.1223.
Gaskill, B.N. and Garner, J.P. (2020). Power to the people: power, negative results and sample size. Journal of the American Association for Laboratory Animal Science (JAALAS) 59 (1): 9–16. https://doi.org/10.30802/AALAS-JAALAS-19-000042.
Gerlach, B., Untucht, C., and Stefan, A. (2019). Electronic lab notebooks and experimental design assistants. In: Good Research Practice in Non-Clinical Pharmacology and Biomedicine, Handbook of Experimental Pharmacology 257 (ed. A. Bespalov, M. Michel, and T. Steckler). Cham: Springer. https://doi.org/10.1007/164_2019_287.
Gouveia, K. and Hurst, J.L. (2019). Improving the practicality of using non-aversive handling methods to reduce background stress and anxiety in laboratory mice. Scientific Reports 9: 20305. https://doi.org/10.1038/s41598-019-56860-7.
Greenland, S. and Morgenstern, H. (1990). Matching and efficiency in cohort studies. American Journal of Epidemiology 131: 151–159.
Greenland, S., Senn, S.J., Rothman, K.J. et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 31 (4): 337–350. https://doi.org/10.1007/s10654-016-0149-3.
Hatswell, A., Freemantle, N., Baio, G. et al. (2020). Summarising salient information on historical controls: a structured assessment of validity and comparability across studies. Clinical Trials 17 (6): 607–616. https://doi.org/10.1177/1740774520944855.
Hawkins, P., Morton, D.B., Burman, O. et al. (2010). A guide to defining and implementing protocols for the welfare assessment of laboratory animals. Eleventh report of the BVAAWF/FRAME/RSPCA/UFAW Joint Working Group on Refinement. www.rspca.org.uk/sciencegroup/researchanimals/implementing3rs/refinement (accessed 2021).
Head, M.L., Holman, L., Lanfear, R. et al. (2015). The extent and consequences of P-hacking in science. PLoS Biology 13 (3): e1002106.
Holmdahl, R. and Malissen, B. (2012). The need for littermate controls. European Journal of Immunology 42: 45–47. https://doi.org/10.1002/eji.201142048.
Hooijmans, C.R., Rovers, M.M., de Vries, R.B. et al. (2014). SYRCLE’s risk of bias tool for animal studies. BMC Medical Research Methodology 14: 43. https://doi.org/10.1186/1471-2288-14-43.
Johnson, P.D. and Besselsen, D.G. (2002). Practical aspects of experimental design in animal research. ILAR Journal 43 (4): 202–206. https://doi.org/10.1093/ilar.43.3.202.
Karp, N.A. (2018). Reproducible preclinical research—is embracing variability the answer? PLoS Biology 16 (3): e2005413. https://doi.org/10.1371/journal.pbio.2005413.
Karp, N.A. and Fry, D. (2021). What is the optimum design for my animal experiment? BMJ Open Science 5: e100126.
Keenan, C., Elmore, S., Francke-Carroll, S. et al. (2009). Best practices for use of historical control data of proliferative rodent lesions. Toxicologic Pathology 37 (5): 679–693. https://doi.org/10.1177/0192623309336154.
Kernan, W.N., Viscoli, C.M., Makuch, R.W. et al. (1999). Stratified randomization for clinical trials. Journal of Clinical Epidemiology 52 (1): 19–26. https://doi.org/10.1016/s0895-4356(98)00138-3.
Kilkenny, C., Parsons, N., Kadyszewski, E. et al. (2009). Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS ONE 4: e7824. https://doi.org/10.1371/journal.pone.0007824.
Kramer, M. and Font, E. (2017). Reducing sample size in experiments with animals: historical controls and related strategies. Biological Reviews of the Cambridge Philosophical Society 92 (1): 431–445. https://doi.org/10.1111/brv.12237.
Landis, S., Amara, S., Asadullah, K. et al. (2012). A call for transparent reporting to optimize the predictive value of preclinical research. Nature 490: 187–191. https://doi.org/10.1038/nature11556.
Lazic, S. (2016). Experimental Design for Laboratory Biologists: Maximising Information and Improving Reproducibility. Cambridge: Cambridge University Press.
Lilley, E., Stanford, S.C., Kendall, D.E. et al. (2020). ARRIVE 2.0 and the British Journal of Pharmacology: updated guidance for 2020. British Journal of Pharmacology 177 (16): 3611–3616. https://doi.org/10.1111/bph.15178.
Löscher, W., Ferland, R.J., and Ferraro, T.N. (2017). The relevance of inter- and intra-strain differences in mice and rats and their implications for models of seizures and epilepsy. Epilepsy and Behaviour 73: 214–235. https://doi.org/10.1016/j.yebeh.2017.05.040.
MacCallum, C.J. (2010). Reporting animal studies: good science and a duty of care. PLoS Biology 8: e1000413.
Macleod, M.R., Lawson McLean, A., Kyriakopoulou, A. et al. (2015). Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biology 13 (11): e1002301.
Mahajan, V.S., Demissie, E., Mattoo, H. et al. (2016). Striking immune phenotypes in gene-targeted mice are driven by a copy-number variant originating from a commercially available C57BL/6 strain. Cell Reports 15 (9): 1901–1909.
Marín-Franch, I. (2018). Publication bias and the chase for statistical significance. Journal of Optometry 11 (2): 67–68. https://doi.org/10.1016/j.optom.2018.03.001.
Martin, B., Ji, S., Maudsley, S., and Mattson, M.P. (2010). ‘Control’ laboratory rodents are metabolically morbid: why it matters. Proceedings of the National Academy of Sciences USA 107 (14): 6127–6133. https://doi.org/10.1073/pnas.0912955107.
Montgomery, D.C. (2017). Design and Analysis of Experiments, 8e. New York: Wiley.
Moser, P. (2019). Out of control? Managing baseline variability in experimental studies with control groups. In: Good Research Practice in Non-Clinical Pharmacology and Biomedicine, Handbook of Experimental Pharmacology 257 (ed. A. Bespalov, M. Michel, and T. Steckler). Cham: Springer. https://doi.org/10.1007/164_2019_280.
Muhlhausler, B.S., Bloomfield, F.H., and Gillman, M.W. (2013). Whole animal experiments should be more like human randomized controlled trials. PLoS Biology 11 (2): e1001481. https://doi.org/10.1371/journal.pbio.1001481.
Ormandy, E.H., Weary, D.M., Cvek, K. et al. (2019). Animal research, accountability, openness and public engagement: report from an international expert forum. Animals 9 (9): 622. https://doi.org/10.3390/ani9090622.
Pallmann, P., Bedding, A.W., Choodari-Oskooei, B. et al. (2018). Adaptive designs in clinical trials: why use them, and how to run and report them. BMC Medicine 16: 29. https://doi.org/10.1186/s12916-018-1017-7.
Percie du Sert, N., Alfieri, A., Allan, S.M. et al. (2017a). The IMPROVE guidelines (Ischaemia models: procedural refinements of in vivo experiments). Journal of Cerebral Blood Flow and Metabolism 37 (11): 3488–3517. https://doi.org/10.1177/0271678X17709185.
Percie du Sert, N., Bamsey, I., Bate, S.T. et al. (2017b). The experimental design assistant. PLoS Biology 15 (9): e2003779. https://doi.org/10.1371/journal.pbio.2003779 [web-based The Experimental Design Assistant: https://eda.nc3rs.org.uk/].
Percie du Sert, N., Hurst, V., Ahluwalia, A. et al. (2020). The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biology 18 (7): e3000410. https://doi.org/10.1371/journal.pbio.3000410.
Preece, D.A. (1982). The design and analysis of experiments: what has gone wrong? Utilitas Mathematica 21A: 201–244.
Preece, D.A. (1984). Biometry in the Third World: science not ritual. Biometrics 40: 519–523.
Reynolds, P.S. (2021). Statistics, statistical thinking, and the IACUC. Lab Animal (NY) 50 (10): 266–268. https://doi.org/10.1038/s41684-021-00832-w.
Reynolds, P.S. (2022). Between two stools: preclinical
Reynolds, P.S. and Garvan, C.W. (2020). Gap analysis of swine-based hemostasis research: houses of brick or mansions of straw? Military Medicine 185 (S1): 88–95. https://doi.org/10.1093/milmed/usz249.
Reynolds, P.S., McCarter, J., Sweeney, C. et al. (2019). Informing efficient pilot development of animal trauma models through quality improvement strategies. Laboratory Animals 53 (4): 394–403. https://doi.org/10.1177/0023677218802999.
Richter, S.H., Garner, J.P., Auer, C. et al. (2010). Systematic variation improves reproducibility of animal experiments. Nature Methods 7 (3): 167–168. https://doi.org/10.1038/nmeth0310-167.
Richter, S.H., Garner, J.P., and Würbel, H. (2009). Environmental standardization: cure or cause of poor reproducibility in animal experiments? Nature Methods 6 (4): 257–261.
Ritskes-Hoitinga, M. and Wever, K. (2018). Improving the conduct, reporting, and appraisal of animal research. BMJ 360: j4935. https://doi.org/10.1136/bmj.j4935.
Russell, W.M.S. and Burch, R.L. (1959). The Principles of Humane Experimental Technique. London: Methuen and Co. Limited.
Sena, E.S., van der Worp, H.B., Bath, P.M., Howells, D.W., and Macleod, M.R. (2010). Publication bias in reports of animal stroke studies leads to major overstatement of efficacy. PLoS Biology 8 (3): e1000343. https://doi.org/10.1371/journal.pbio.1000343.
Simon, M.M., Greenaway, S., White, J.K. et al. (2013). A comparative phenotypic and genomic analysis of C57BL/6J and C57BL/6N mouse strains. Genome Biology 14 (7): R82. https://doi.org/10.1186/gb-2013-14-7-r82.
Smith, A.J., Clutton, R.E., Lilley, E. et al. (2018). PREPARE: guidelines for planning animal research and testing. Laboratory Animals 52: 135–141. https://doi.org/10.1177/0023677217724823.
Smith, G.D. and Ebrahim, S. (2002). Data dredging, bias, or confounding. BMJ 325: 1437–1438.
Souren, N.Y., Fusenig, N.E., Heck, S. et al. (2022). Cell line authentication: a necessity for reproducible biomedical research. EMBO Journal 27: e111307. https://doi.org/10.15252/embj.2022111307.
ter Riet, G., Korevaar, D.A., Leenaars, M. et al. (2012). Publication bias in laboratory animal research: a survey on magnitude, drivers, consequences and potential solutions. PLoS ONE 7 (9): e43403. https://doi.org/10.1371/journal.pone.0043403.
Szucs, D. (2016). A tutorial on hunting statistical significance by chasing N. Frontiers in Psychology 7: 1443.
research, reproducibility, and statistical design of https://doi.org/10.3389/fpsyg.2016.01444.
experiments. BMC Research Notes 15 (1): 73. Tremoleda, J.L., Watts, S.A., Reynolds, P.S. et al. (2017).
https://doi.org/10.1186/s13104-022-05965-w. Modeling acute traumatic hemorrhagic shock injury:
32 A Guide to Sample Size for Animal-based Studies

challenges and guidelines for preclinical studies. Wasserstein, R.L. and Lazar, N.A. (2016). The ASA’s
Shock 48 (6): 610–623. https://doi.org/10.1097/ statement on P-values: context, process, and purpose.
SHK.0000000000000901. The American Statistician 70: 129–133. https://
Uchio-Yamada, K., Kasai, F., Ozawa, M., and Kohara, A. doi.org/10.1080/00031305.2016.1154108.
(2017). Incorrect strain information for mouse cell Wickham, H. (2014). Tidy data. Journal of Statistical Soft-
lines: sequential influence of misidentification on ware 59 (10): 1–23. https://doi.org/10.18637/
sublines. In Vitro Cellular and Developmental Biology jss.v059.i109.
53 (3): 225–230. https://doi.org/10.1007/ Wieschowski, S., Biernot, S., Deutsch, S. et al. (2019).
s11626-016-0104-3. Publication rates in animal research. Extent and char-
Ujita, A., Seekford, Z., Kott, M. et al. (2021). Habituation acteristics of published and non-published animal
protocols improve behavioral and physiological studies followed up at two German university medical
responses of beef cattle exposed to students in an ani- centres. PLoS ONE 14 (11): e0223758. https://doi.
mal handling class. Animals 11 (8): 2159. https:// org/10.1371/journal.pone.0223758.
doi.org/10.3390/ani11082159. Workman, P., Aboagye, E.O., Balkwill, F. et al. (2010).
Vetter, T.R. and Mascha, E.J. (2017). Defining the pri- Guidelines for the welfare and use of animals in cancer
mary outcomes and justifying secondary outcomes of research. British Journal of Cancer 102 (11): 1555–1577.
a study: usually, the fewer, the better. Anesthesia https://doi.org/10.1038/sj.bjc.6605642.
and Analgesia 125 (2): 678–681. https://doi.org/ Zingarelli, B., Coopersmith, C.M., Drechsler, S. et al.
10.1213/ANE.0000000000002223. (2019). Part I: minimum quality threshold in preclinical
Voelkl, B., Vogt, L., Sena, E.S., and Würbel, H. (2018). sepsis studies (MQTIPSS) for study design and humane
Reproducibility of pre-clinical animal research modeling endpoints. Shock 51 (1): 10–22. https://
improves with heterogeneity of study samples. PLoS doi.org/10.1097/SHK.0000000000001243.
Biology 16 (2): e2003693. https://doi.org/
10.1371/journal.pbio.2003992.
II
Sample Size for Feasibility and Pilot Studies

Chapter 4: Why Pilot Studies?
Chapter 5: Operational Pilot Studies: 'Can It Work?'
Chapter 6: Empirical and Translational Pilots
Chapter 7: Feasibility Calculations: Arithmetic
Chapter 8: Feasibility: Counting Subjects
4
Why Pilot Studies?

CHAPTER OUTLINE

4.1 Introduction
4.2 Pilot Study Applications
  4.2.1 The Role of Pilot Studies in Laboratory Animal-Based Research
  4.2.2 The Role of Pilot Studies in Veterinary Research
  4.2.3 Pilot Study Results Should Be Reported
4.3 Pilot Studies: What They Are Not
  4.3.1 Pilot Studies Differ from Exploratory and Screening Studies
4.4 Pilot Study Planning
  4.4.1 Principles
  4.4.2 Justification
4.5 What Kind of Pilot Trial?
4.6 How Large a Pilot?
  4.6.1 Zero-Animal Sample Size
  4.6.2 Pragmatic Sample Size
  4.6.3 Precision-Based Sample Size
References

4.1 Introduction

Pilot studies are small preparatory studies meant to inform the design and conduct of the later main experiment (Box 4.1). In 1866, Augustus De Morgan wrote 'The first experiment already illustrates a truth of the theory, well confirmed by practice, whatever can happen will happen if we make trials enough.' Now enshrined as Murphy's three laws, problems of implementation and timing are integral to experimentation and research. Pilot studies are a tool to anticipate and to some extent circumvent these problems (Box 4.2). The focus of all pilot studies is on feasibility, process, and description, rather than statistical tests of inference (Lancaster et al. 2004; Thabane et al. 2010; Moore et al. 2011; Shanyinde et al. 2011; Abbott 2014; Lee et al. 2014; Kistin and Silverstein 2015).

BOX 4.1
Pilot Trials
Pilot trials are SMALL.
What: Small, preparatory studies for informing design and conduct of subsequent 'definitive' experiments.
Why: Essential tools to ensure:
▪ Methodological rigour
▪ Feasibility and trial viability
▪ Internal and external validity
▪ Problem identification and troubleshooting
▪ Minimised effects of Murphy's law
Pilot trials should not be used to:
▪ Test hypotheses
▪ Assess efficacy
▪ Assess safety

The main reason for a pilot study is the early identification and correction of potential problems
before considerable time, resources, money, and animals have been invested in a full study (Clayton and Nordheim 1991). When properly designed and conducted, pilot studies can be cost-effective and powerful tools for reducing experiment failures and increasing data quality. Thoughtful and well-conducted pilot studies coupled with sufficient planning can avoid futility and waste of animals in non-informative experiments and increase the probability that a true effect of the test intervention can be detected if it exists. Sample sizes are small, and use of animals may be completely unnecessary.

BOX 4.2
The Role of Pilot Studies
One or more pilot studies should be included in the research plan.
Murphy's First Law: Whatever can go wrong, will go wrong.
Pilots identify potential problems and enable development of safeguards and workarounds before beginning the definitive study.
Murphy's Second Law: Nothing is as easy as it looks.
Pilots are opportunities to road-test, improve, and standardise methods and identify potential problems before they occur.
Murphy's Third Law: Everything takes longer than originally thought.
Pilots enable identification and realistic assessment of timelines, task scheduling, and milestones.

Pilot studies are not intended to formally test hypotheses or provide definitive evidence of benefit or safety. Because the purpose is to develop and refine subsequent experimental protocols, pilot studies are methodologically unstable. A single noisy estimate of effect size from a pilot will result in seriously underpowered definitive studies (Leon et al. 2011; Albers and Lakens 2018).

The main purpose of pilot studies is to provide methods for experimental quality control, improvement, and assurance. From a statistical point of view, unrecognised and uncontrolled variation is one of the biggest contributors to unreproducible results. Unwanted variation arises in part from unreliability of measurement processes (Eisenhart 1952; Hahn et al. 1999). Once uncontrolled variation is introduced early in the study cycle, the resulting problems cannot be fixed by later statistical analyses (Sackett 1979; Altman 1980; Gelman et al. 2020). Therefore, pilots enable identification and control of unwanted variation from multiple sources (Hahn et al. 1999; Reynolds et al. 2019). Reducing variation at all stages of the research process can greatly increase statistical power without increasing sample size, an important consideration for animal-based studies (Lazic 2016, 2018). Ensuring quality and reliability of research data also increases the probability that research results can be successfully translated from basic research to veterinary and human clinical practice (Davies et al. 2017; Vollert et al. 2020).

4.2 Pilot Study Applications

BOX 4.3
Pilot Study Applications
Feasibility assessment
Performance gap identification
Equipment function testing
Locating procedural bottlenecks
Identifying animal welfare concerns
Identifying stakeholder issues
Obtaining data for preliminary proof of concept

Definitions and terminology for pilot studies can be confusing and may vary considerably between sources (Arain et al. 2010; Whitehead et al. 2014; Eldridge et al. 2016). In part, this may be because pilots are remarkably versatile. They can be designed to perform a range of functions vital to successful performance of the main experiment (Box 4.3), including:

Assessment of overall feasibility ('Can it work?', 'Does it work?', 'Will it work?').

Assessment of specific logistic, procedural, and resource requirements.
Identification of performance gaps in personnel skill and experience.

Assessment of equipment function, calibration, drift, reliability, and integration with computational and information technology.

Obtaining data for preliminary proof of concept.

Refinement, standardisation, adjustment, and correction of specific tasks, procedures, and protocols.

Identification of potential bottlenecks, procedural hurdles, and human error.

Identification of major sources of variation in experimental processes and how they might be controlled.

Determining whether or not the outcome variables can actually be measured.

Estimating variation in outcome variables.

Identifying potential 'cultural' challenges faced by different stakeholders from implementation of processes and procedures.

Identifying animal welfare concerns, such as severity of the proposed procedures or interventions; effectiveness of remediation and palliative care measures; and definition of early humane endpoints.

Regulatory and ethical oversight bodies and funding agencies encourage preliminary pilots for all types of research investigations when:

Experimental procedures are novel, complex, or difficult to implement.

The 'best' or optimum sample size is difficult to determine.

The best or optimum subset of agents or interventions must be selected from a larger candidate pool.

Large numbers of animals are requested for definitive studies without clear justification.

There is a conflict between animal welfare expectations and the perceived goals of the experiment (for example, requests to exceed established humane endpoints 'because the experiment outcomes could not be measured otherwise').

4.2.1 The Role of Pilot Studies in Laboratory Animal-Based Research

The majority of preclinical laboratory-based studies are essentially exploratory rather than confirmatory (Kimmelman et al. 2014; Mogil and Macleod 2017). They typically consist of multiple factors to be investigated, coupled with very small sample sizes. Structurally and heuristically, they are more closely aligned with agricultural and industrial-type study designs pioneered by Fisher and Box, rather than the two-group, large-N designs more typical of clinical trials. As a result, sample size guidelines designed for human clinical pilot trials are, for the most part, inappropriate for laboratory animal-based studies (Reynolds 2022a). However, before animals are used in experiments, all experimental methods, processes, and procedures should be standardised and operational variation minimised. This includes correction of study-specific performance issues and assurance that all technical personnel are trained to the highest possible standard and are fully compliant with experiment protocols.

4.2.2 The Role of Pilot Studies in Veterinary Research

Veterinary clinical trials (like human clinical trials) are generally intended to be confirmatory tests of efficacy. They typically consist of only two or a few treatment arms (usually a test intervention and a comparator), with treatments randomly assigned to many subjects representative of the target population (Reynolds 2022a). However, in practice, many published veterinary clinical trials are small, underpowered, and poorly designed (Wareham et al. 2017; Tan et al. 2019). Trials conducted in the clinical setting may involve considerable management logistics, especially when research protocols involve client-owned animals and require integration with individual case management and clinical decision-making processes (Morgan and Stewart 2002). The research protocol may be further complicated by difficulties in patient recruitment, client consent, compliance, and retention. Administration of test interventions, measurement of experimental outcomes, data collection, and other procedures or standard-of-care interventions can be difficult to implement, especially if they deviate from normal practice.
Pilot trial designs and rules of thumb used for human clinical trials may be appropriate for 'regulated' veterinary clinical trials, where veterinary clinical data are to be submitted to regulatory authorities (such as the FDA) for approval (Davies et al. 2017). Guidelines are available for pilots intended to inform human clinical trials (Lancaster et al. 2004; Arnold et al. 2009; Thabane et al. 2010), and these provide invaluable advice for planning.

4.2.3 Pilot Study Results Should Be Reported

It is an ethical obligation to make the best use of research and research animals by reporting all aspects of the research process, including pilot studies (van Teijlingen et al. 2001; Thabane et al. 2010; Friedman 2013). Many preclinical studies are never written up or published if experiments are perceived to have 'gone nowhere' because results were 'negative', not 'statistically significant', or otherwise unfavourable (Kimmelman and Federico 2017). Carefully performed pilots provide useful information on identification and description of problems, potential solutions, and best practices; they will not only benefit other researchers but also contribute to minimising waste of animals. Venues include dedicated journals and protocol description sites (e.g. Lancaster 2015; Macleod and Howells 2016), and reporting of pilots is increasingly encouraged by mainstream journals (Dolgin 2013).

4.3 Pilot Studies: What They Are Not

BOX 4.4
What a Pilot Study Is Not
1. A throw-away/unfunded study (or 'student project')
2. A study lacking methodological rigour
3. A study misaligned with stated aims and objectives
4. A study without purpose
5. An underpowered, too-small study
6. A completed study that cannot reject the null hypothesis

Six categories of studies are often labelled incorrectly – and misleadingly – as 'pilot studies' (Box 4.4):

Throw-away studies lacking methodological rigour. These include trial-and-error studies to see what 'might work', studies conducted with little or no resources and funding, or studies conducted by personnel (usually students) without the requisite skills for meaningful research (Moore et al. 2011).

Misaligned studies, where the objectives of the pilot study consist of vague or undefined statements of 'feasibility', but the study goes on to present hypothesis tests of efficacy on numerous outcomes. Misaligned studies usually confuse or conflate the purpose of pilots with that of exploratory studies (see next).

Unsubstantiated studies that make vague statements of intent that any resulting data will be used to inform future (funded) studies, but give no plans for subsequent action or for how data will be used.

Futility loop studies that consist of an endless cycle of small experiments conducted to 'find something', but lack specific decision criteria or action plans that would enable assessment of progress (Moore et al. 2011).

Sometimes completed studies are retrospectively relabelled as 'pilots' because they are either underpowered or null. Underpowered studies did not have a large enough sample size in the first place to test a meaningful hypothesis. Underpowered studies can occur because of insufficient time, resources, or funding, or merely because of poor planning (Arain et al. 2010). In contrast, null studies may not produce statistically significant results regardless of sample size or power (Abbott 2014). A completed study that is too small or cannot reject the null hypothesis is not ipso facto a pilot study.

4.3.1 Pilot Studies Differ from Exploratory and Screening Studies

A pilot is a small study acting as a preliminary to the main definitive study, and is used to demonstrate feasibility, not provide results of hypothesis testing.
Exploratory studies are usually stand-alone heuristic studies, which may include identification of large-scale mechanisms and patterns of response (for example, in gene arrays).

A screening study is a type of exploratory study where the objective is to whittle down a large number of predictor variables, test agents, or genes to a smaller, more manageable subset of the most promising candidates thought to influence a response (Montgomery and Jennings 2006).

Any study based on trial and error is inefficient and wasteful, producing results that are hard to interpret. Therefore, formal statistically based designs are strongly recommended for all experimental studies to provide maximal effectiveness and information content (Box and Draper 1987; Box et al. 2005; Montgomery and Jennings 2006; Montgomery 2012). Like definitive confirmatory studies, exploratory studies should also prioritise internal validity (randomisation, blinding, appropriate controls) and study sensitivity (control of variation), both to increase the probability of detecting an effect if it exists and to minimise the risk of false negatives.

4.4 Pilot Study Planning

Sound planning requires both understanding the general principles of pilot study design and having a clear justification for the information to be obtained from a specific pilot study. The focus of all pilots is on 'feasibility, process, and description' (Kistin and Silverstein 2015), rather than hypothesis testing (Lee et al. 2014). The four principles of pilot study planning are the 3Rs, utility, similitude, and proportionate investment; justification is supported by comprehensive literature review and stakeholder requirements (Box 4.5).

BOX 4.5
Pilot Study Planning
Planning priorities for all pilot types include:
Principles
  The 3Rs
  Utility
  Similitude
  Proportionate investment
Justification
  Literature reviews
  Stakeholder requirements

4.4.1 Principles

The 3Rs. Before conducting any pilot, determine if animals are required at all. In the early phases of experimental process development, animal use can be eliminated entirely (e.g. equipment testing, personnel training) or minimised (e.g. using carcasses or culls). Determine if similar information can be obtained from in vitro sources (e.g. cell cultures), systematic literature reviews, surveys of academic or industry research groups, or updated reviews of current best practices established for the speciality.

Utility. A pilot trial should result in useful information for planning future experiments. There must be a clear description of the information to be obtained from the pilot and how it will be used to inform the conduct of later studies. An inadequate sample size or budget or non-significant results are not justifications for declaring an experiment to be a pilot trial.

Similitude. Because the pilot is in effect a rehearsal of the main trial, the pilot model for animals, disease, and/or process must be as similar as possible to the target model. Results obtained from dissimilar pilots will be misleading and essentially useless as evidence for potential feasibility or decisions to proceed with further experiments.

Proportionate Investment. Prepare to invest 10–20% of resources (time, personnel, budget, equipment) in a pilot or a small series of pilots. Unanticipated problems will occur in any experiment. It is better to identify and correct problems before time, money, resources, and animals are invested in the full study.
4.4.2 Justification

The choice of pilot design depends on the information to be obtained from that pilot and how that information is to be used. The information to be acquired by the pilot is determined by the scientific or procedural information gaps to be addressed, the specific information items required to address those gaps, the availability of current information, and the quality of that information. Justification is based on literature reviews and stakeholder input.

4.4.2.1 Literature Reviews

Literature reviews are international consensus best practice for study preparation (PREPARE; Smith et al. 2018) and research reporting (ARRIVE 2.0; Percie du Sert et al. 2020). Systematic literature reviews of previous research are used to identify specific gaps that the proposed study will be designed to address. One or several pilots will be justifiable if the literature and prior data present numerous diverse methodological approaches (e.g. computer-based, in vitro, in vivo) or different animal and disease models, or if there is little or no relevant information on either mechanism or effect of test agents (De Wever et al. 2012).

Literature reviews have two additional goals. First, it is necessary to affirm that the proposed research is not an unnecessary duplication of previous research (Morgan 2011; Johnson et al. 2013). Second, a systematic and comprehensive description of 'lessons learned' prior to beginning a project is advisable for avoiding obvious design flaws or problems, and for determining if the proposed methods are best practice or could be improved (Glasziou and Chalmers 2018). Information obtained from the literature must be carefully evaluated because it is certain to vary in quality and relevance to the proposed study.

4.4.2.2 Stakeholder Requirements

Different stakeholders have different roles to play in the project and therefore will require different kinds of information. Stakeholders include the investigators, technical personnel, lab managers, animal care staff, veterinarians, facilities and infrastructure staff, fiscal administrators, and the ethical oversight committee. For example, veterinarians are primarily concerned with animal health and welfare, fiscal administrators with how much it is going to cost (and who pays), and technical and animal care staff with the amount of extra work and time that will be involved (Reynolds 2022b). It is frequently helpful to identify and confer with major stakeholders beforehand to ensure compliance with existing procedures, to communicate how project-specific operating procedures might affect normal operations (Smith et al. 2018), and to determine the specific information that should be generated from the pilot.

4.5 What Kind of Pilot Trial?

Pilot studies are operationalised by clearly defining the research needs, objectives, and specific aims. The specific rationale determines pilot focus, design, and choice of assessment metrics. The objectives and aims will influence the larger strategy for execution; for example, whether the pilot is to be conducted in iterative stages, or whether each objective requires its own separate pilot (Robinson 2000; Moore et al. 2011).

There are three main categories of pilot studies: operational, empirical, and translational (Box 4.6). These categories have been broadly adapted from those proposed for study feasibility assessment in the public health arena (Bowen et al. 2009).

BOX 4.6
What Kind of Pilot?
'Can it work?' Operational
'Does it work?' Empirical, intervention-related
'Will it work?' Translational, generalisation
All types of pilots require a protocol:
▪ Procedural documentation
▪ Clearly defined 'success' metrics
▪ Evaluation component
▪ Follow-on action plan

'Can it work?' Operational pilots are used to determine and finalise details of logistics, procedures, methods, resources, and processes. Operational pilots may not require any animals at all (Chapter 5).
'Does it work?' Empirical pilots are used to determine if outcome measurements are fit for purpose and capable of meeting study objectives (Chapter 6).

'Will it work?' Translational pilots are used to determine generalisability, that is, whether results are robust, consistent, and broadly applicable over a wider range of models and conditions. Proof of concept and proof of principle studies fit into this category (Chapter 6).

All types of pilots require a fully articulated protocol that should include:

Clear and specific documentation of methods and procedures (What will be done, and how will it be done?).

Clearly defined, specific, and measurable 'success' criteria (How will you know it 'works'?).

An evaluation component (Did it 'work'?).

An action plan with decision criteria (If the pilot is successful, what are the next steps?).

The type of pilot dictates the specific metrics used to determine whether or not the pilot works and how the pilot will inform the next step of the experimentation process. Measurable decision criteria ('go/no-go') are used to decide if the experiment can proceed without modification or with only a few minor changes ('green light'), if major changes to the protocol need to be made and what those changes would entail ('yellow light'), or if the experiment should be terminated altogether ('red light'). For veterinary clinical trials, recruitment issues will be especially important to identify at an early stage. Recruitment metrics to be assessed will include projected subject recruitment and accrual rates, client consent rates, treatment compliance, and enrolment strategies (Hampson et al. 2018).
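As an illustration only, such criteria can be written down as an explicit rule before the pilot begins. The following minimal Python sketch assumes two invented recruitment metrics and invented thresholds; in practice the metrics and cut-offs would be pre-specified in the study protocol.

    # Hypothetical 'go/no-go' traffic-light rule for a veterinary pilot.
    # Metric names and thresholds are invented for illustration only.

    def traffic_light(consent_rate: float, accrual_per_month: float) -> str:
        """Map pre-specified recruitment metrics to a decision."""
        if consent_rate >= 0.70 and accrual_per_month >= 4:
            return "green"   # proceed without modification
        if consent_rate >= 0.50 and accrual_per_month >= 2:
            return "yellow"  # proceed only after protocol amendment
        return "red"         # terminate

    print(traffic_light(consent_rate=0.62, accrual_per_month=3))  # 'yellow'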
4.6 How Large a Pilot?

BOX 4.7
Pilot Study Sample Size
a. How large a pilot?
Pilot studies should be SMALL: obtain the most information from the fewest possible animals.
▪ Zero-animal
▪ Pragmatic
▪ Precision-based
b. How many pilots?
▪ Several small versus one large

There are no hard and fast sample size guidelines for animal-based pilot studies. However, the chief guiding principle for animal-based pilot studies is that they are small: the goal is to obtain the most information from the fewest possible number of animals. Pilot studies should also be designed to be nimble. They should have sufficient flexibility to change scope, procedures, and direction as necessary to establish normal best-practice standard operating procedures. Formal power calculations are not appropriate because, by design and intent, pilots are small-scale assessments of feasibility and are therefore too small to obtain reliable estimates of effect size or variance or to find 'statistical significance' (Kraemer et al. 2006; Lee et al. 2014). Multiple small pilots may be preferable to one larger study, especially if different types of information are required.

Animal numbers for a pilot can be based on one or more of the following strategies: zero-animal, pragmatic, and precision-based (Box 4.7).

4.6.1 Zero-Animal Sample Size

The principles of the 3Rs must always be considered before planning any experiment. Pilot studies may not require any animals at all. For personnel training, skill acquisition, and development of consistent levels of technical standardisation, simulators, carcasses, and/or ex vivo models should always be considered before live animals obtained for purpose. For example, medical residents trained in the repair of penetrating cardiac injury on either an ex vivo model or animals obtained for purpose had similar skill and management competencies at assessment (Izawa et al. 2016). Equipment testing for reliability (calibration, drift, measurement errors) should always be conducted on non-animal standard and test materials and confirmed before using animals.

Even 'proof of concept' studies may not necessarily require data from live animals. Systematic literature reviews and meta-analyses (Rice et al. 2008; Hooijmans et al. 2018; Soliman et al. 2020; Vollert et al. 2020) promote animal-based study validity and improve translation potential by synthesising all available external data. Therefore, these provide the best possible estimates of the likely true effect size and variation. The Collaborative Approach to Meta-Analysis and Review of Animal Experimental Studies (CAMARADES) research group specialises in systematic reviews of preclinical neurological disease models such as stroke, neuropathic pain, and Alzheimer's disease, and provides guidance on study design and current best practice (https://www.ed.ac.uk/clinical-brain-sciences/research/camarades/about-camarades). Targeted surveys of academic and/or industry research organisations can be an alternative to systematic literature reviews. Survey results can be used to determine the range of likely outcomes or responses anticipated for candidate animal models, identify current best practice, and explore the range of study designs and methods that best minimise animal use (e.g. Chapman et al. 2012).

4.6.2 Pragmatic Sample Size

Pragmatic sample size is determined by the availability of time, budget, resources, and technical personnel. Pragmatic sample size justification may also apply to retrospective chart reviews, where the total sample size may be restricted by the relative rarity of the condition of interest, or by the number of records with complete or relevant information.

Pragmatic sample size approximation does not require power calculations. Arithmetic approximations use basic 'back of the envelope' calculations to confirm and cross-validate numbers for operational feasibility (Chapter 8, Arithmetic). Calculations based on sequential probabilities are useful when identification of subjects requires screening of large numbers of candidates, or when animals enter the trial sequentially rather than being randomised simultaneously (Chapter 9, Probability).
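As a minimal sketch of such back-of-the-envelope checks (all numbers are invented assumptions, not recommendations), two common calculations are inflating enrolment for expected attrition and projecting the number of candidates to screen:

    import math

    def inflate_for_attrition(n_required: int, attrition_rate: float) -> int:
        # Animals to enrol so that n_required are expected to complete,
        # given an assumed attrition (loss) rate.
        return math.ceil(n_required / (1.0 - attrition_rate))

    def expected_screened(n_enrolled: int, eligibility_rate: float) -> int:
        # Expected number of candidates screened to obtain n_enrolled
        # eligible subjects, assuming independent eligibility.
        return math.ceil(n_enrolled / eligibility_rate)

    # Assumptions (illustrative only): 8 completers needed, ~20% attrition,
    # ~30% of screened candidates eligible.
    n_enrol = inflate_for_attrition(8, 0.20)      # 10 animals to enrol
    n_screen = expected_screened(n_enrol, 0.30)   # 34 candidates to screen
    print(n_enrol, n_screen)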
The amount of useable data obtained from each subject is inversely related to the number of subjects (Morse 2000). When resources are limited, animal-based pilot studies can be designed to be as information-dense as possible by maximising the amount of useable data from the fewest number of subjects. Designing a pilot study based on information power (Malterud et al. 2016) requires descriptions of the intended scope of the pilot, data collection methods, and study design. For example, before-after and longitudinal studies produce more data than do cross-sectional studies with only a single observation per subject. High-dimensionality technologies such as mass cytometry, immunohistochemistry, single-cell RNA sequencing, microfluidics, and bioinformatics can provide massive datasets for relatively few animals.

4.6.3 Precision-Based Sample Size

The emphasis of pilot trials should be on descriptions based on point estimates and confidence intervals (or some other measure of precision). Because 'significance' as such is relatively meaningless, Lee et al. (2014) recommend that precision intervals be constructed to provide an estimated range of possible treatment effects. Simulations and sensitivity analyses are extremely valuable for estimating the uncertainties associated with various projected sample sizes and for exploring best- and worst-case scenarios for the statistical models, covariates, and analysis methods proposed for the definitive study (Bell et al. 2018; Gelman et al. 2020). Chapter 6 presents methods for evaluating pilot data based on preliminary estimates of precision and inspection of data plots.

Sample size calculations for external and internal pilot trials have been developed for application to human clinical trials (Julious 2005; Whitehead et al. 2014; Machin et al. 2018). External pilots are conducted prior to the definitive trial; although the data are used to inform subsequent trial design, they are not included in later data analyses. Internal pilots are conducted to permit reassessment of sample size within the ongoing definitive trial. Sample sizes based on considerations of human clinical trials are usually prohibitively large for most lab-based animal studies and are therefore not recommended. They may be appropriate for large-scale veterinary clinical trials.
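As a minimal sketch of one such precision-based calculation (a normal approximation; for very small pilots a t-quantile would be used iteratively), the number of subjects needed so that a 95% confidence interval for a mean has a chosen half-width d is n = (z * sigma / d)^2. The standard deviation and half-width below are invented assumptions:

    import math

    def n_for_ci_half_width(sigma: float, half_width: float, z: float = 1.96) -> int:
        # Smallest n with z * sigma / sqrt(n) <= half_width.
        return math.ceil((z * sigma / half_width) ** 2)

    # Assumption (illustrative only): outcome SD ~ 2.5 units, and the mean
    # should be estimated to within +/- 1.5 units with 95% confidence.
    print(n_for_ci_half_width(sigma=2.5, half_width=1.5))  # 11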
References

Abbott, J.H. (2014). The distinction between randomized clinical trials (RCTs) and preliminary feasibility and pilot studies: what they are and are not. Journal of Orthopaedic and Sports Physical Therapy 44 (8): 555–558.
Albers, C. and Lakens, D. (2018). When power analyses based on pilot data are biased: inaccurate effect size estimators and follow-up bias. Journal of Experimental Social Psychology 74: 187–195. https://doi.org/10.1016/j.jesp.2017.09.004.
Altman, D.G. (1980). Statistics and ethics in medical research. Collecting and screening data. British Medical Journal 281 (6252): 1399–1401. https://doi.org/10.1136/bmj.281.6252.1399.
Arain, M., Campbell, M.J., Cooper, C.L., and Lancaster, G.A. (2010). What is a pilot or feasibility study? A review of current practice and editorial policy. BMC Medical Research Methodology 10: 67. https://doi.org/10.1186/1471-2288-10-67.
Arnold, D.M., Burns, K.E., Adhikari, N.K. et al. (2009). The design and interpretation of pilot trials in clinical research in critical care. Critical Care Medicine 37 (Suppl 1): 69–74. https://doi.org/10.1097/CCM.0b013e3181920e33.
Bell, M.L., Whitehead, A.L., and Julious, S.A. (2018). Guidance for using pilot studies to inform the design of intervention trials with continuous outcomes. Clinical Epidemiology 10: 153–157. https://doi.org/10.2147/CLEP.S146397.
Bowen, D.J., Kreuter, M., Spring, B. et al. (2009). How we design feasibility studies. American Journal of Preventive Medicine 36 (5): 452–457. https://doi.org/10.1016/j.amepre.2009.02.002.
Box, G.E.P. and Draper, N.R. (1987). Empirical Model-Building and Response Surfaces. New York: Wiley.
Box, G.E.P., Hunter, W.G., and Hunter, J.S. (2005). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 2e. New York: Wiley.
Chapman, K.L., Andrews, L., Bajramovic, J.J. et al. (2012). The design of chronic toxicology studies of monoclonal antibodies: implications for the reduction in use of non-human primates. Regulatory Toxicology and Pharmacology 62 (2): 347–354. https://doi.org/10.1016/j.yrtph.2011.10.016.
Clayton, M. and Nordheim, E.V. (1991). [Avoiding statistical pitfalls]: comment. Statistical Science 6 (3): 255–257.
Davies, R., London, C., Lascelles, B., and Conzemius, M. (2017). Quality assurance and best research practices for non-regulated veterinary clinical studies. BMC Veterinary Research 13: 242. https://doi.org/10.1186/s12917-017-1153-x.
De Wever, B., Fuchs, H.W., Gaca, M. et al. (2012). Implementation challenges for designing integrated in vitro testing strategies (ITS) aiming at reducing and replacing animal experimentation. Toxicology In Vitro 26 (3): 526–534. https://doi.org/10.1016/j.tiv.2012.01.009.
Dolgin, E. (2013). Publication checklist proposed to boost rigor of pilot trials. Nature Medicine 19: 795–796. https://doi.org/10.1038/nm0713-795.
Eisenhart, C. (1952). The reliability of measured values: fundamental concepts. National Bureau of Standards Report, Publication Number 1600. Commerce Department, National Institute of Standards and Technology (NIST). https://www.govinfo.gov/content/pkg/GOVPUB-C13-78b559d7a3b0f60880eba04f045e72f8/pdf/GOVPUB-C13-78b559d7a3b0f60880eba04f045e72f8.pdf (accessed 2022).
Eldridge, S.M., Lancaster, G.A., Campbell, M.J. et al. (2016). Defining feasibility and pilot studies in preparation for randomised controlled trials: development of a conceptual framework. PLoS ONE 11 (3): e0150205. https://doi.org/10.1371/journal.pone.0150205.
Friedman, L. (2013). Commentary: why we should report results from clinical trial pilot studies. Trials 14: 14.
Gelman, A., Hill, J., and Vehtari, A. (2020). Regression and Other Stories. Cambridge: Cambridge University Press.
Glasziou, P. and Chalmers, I. (2018). Research waste is still a scandal—an essay by Paul Glasziou and Iain Chalmers. BMJ 363: k4645. https://doi.org/10.1136/bmj.k4645.
Hahn, G.J., Hill, W.J., Hoerl, R.W., and Zinkgraf, S.A. (1999). The impact of Six Sigma improvement—a glimpse into the future of statistics. The American Statistician 53 (3): 208–215.
Hampson, L.V., Williamson, P.R., Wilby, M.J., and Jaki, T. (2018). A framework for prospectively defining progression rules for internal pilot studies monitoring recruitment. Statistical Methods in Medical Research 27 (12): 3612–3627. https://doi.org/10.1177/0962280217708906.
Hooijmans, C.R., de Vries, R.B.M., Ritskes-Hoitinga, M. et al. (2018). Facilitating healthcare decisions by assessing the certainty in the evidence from preclinical animal studies. PLoS ONE 13: e0187271.
Izawa, Y., Hishikawa, S., Muronoi, T. et al. (2016). Ex-vivo and live animal models are equally effective training for the management of a penetrating cardiac injury. World Journal of Emergency Surgery 11: 45. https://doi.org/10.1186/s13017-016-0104-3.
Johnson, J., Brown, G., and Hickman, D.L. (2013). Is 'duplicative' really duplication? Lab Animal 42 (3): 81–82.
Julious, S.A. (2005). Sample size of 12 per group rule of thumb for a pilot study. Pharmaceutical Statistics 4: 287–291.
Kimmelman, J. and Federico, C. (2017). Consider drug efficacy before first-in-human trials. Nature 542: 25–27. https://doi.org/10.1038/542025a.
Kimmelman, J., Mogil, J.S., and Dirnagl, U. (2014). Distinguishing between exploratory and confirmatory preclinical research will improve translation. PLoS Biology 12 (5): e1001863. https://doi.org/10.1371/journal.pbio.1001863.
Kistin, C. and Silverstein, M. (2015). Pilot studies: a critical but potentially misused component of interventional research. JAMA 314 (15): 1561–1562.
Kraemer, H.C., Mintz, J., Noda, A. et al. (2006). Caution regarding the use of pilot studies to guide power calculations for study proposals. Archives of General Psychiatry 63 (5): 484–489.
Lancaster, G.A. (2015). Pilot and feasibility studies come of age! Pilot and Feasibility Studies 1: 1. https://doi.org/10.1186/2055-5784-1-1.
Lancaster, G.A., Dodd, S., and Williamson, P.R. (2004). Design and analysis of pilot studies: recommendations for good practice. Journal of Evaluation in Clinical Practice 10 (2): 307–312. https://doi.org/10.1111/j.2002.384.doc.x.
Lazic, S.E. (2016). Experimental Design for Laboratory Biologists: Maximising Information, Improving Reproducibility. Cambridge: Cambridge University Press.
Lazic, S.E. (2018). Four simple ways to increase power without increasing the sample size. Laboratory Animals 52 (6): 621–629.
Lee, E.C., Whitehead, A.L., Jacques, R.M., and Julious, S.A. (2014). The statistical interpretation of pilot trials: should significance thresholds be reconsidered? BMC Medical Research Methodology 14: 41.
Leon, A.C., Davis, L.L., and Kraemer, H.C. (2011). The role and interpretation of pilot studies in clinical research. Journal of Psychiatric Research 45 (5): 626–629. https://doi.org/10.1016/j.jpsychires.2010.10.008.
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. (2018). Sample Sizes for Clinical, Laboratory and Epidemiology Studies, 4e. Wiley-Blackwell.
Macleod, M. and Howells, D. (2016). Protocols for laboratory research. Evidence-Based Preclinical Medicine. https://doi.org/10.1002/ebm2.21.
Malterud, K., Siersma, V.D., and Guassora, A.D. (2016). Sample size in qualitative interview studies: guided by information power. Qualitative Health Research 26 (13): 1753–1760. https://doi.org/10.1177/1049732315617444.
Mogil, J. and Macleod, M. (2017). No publication without confirmation. Nature 542: 409–411. https://doi.org/10.1038/542409a.
Montgomery, D.C. (2012). Design and Analysis of Experiments, 8e. New York: Wiley.
Montgomery, D.C. and Jennings, C.L. (2006). An overview of industrial screening experiments. In: Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics (ed. A. Dean and S. Lewis), 1–20. New York: Springer.
Moore, C.G., Carter, R.E., Nietert, P.J., and Stewart, P.W. (2011). Recommendations for planning pilot studies in clinical and translational research. Clinical and Translational Science 4 (5): 332–337. https://doi.org/10.1111/j.1752-8062.2011.00347.x.
Morgan, D. (2011). Avoiding duplication of research involving animals. Occasional Paper No. 7. New Zealand National Animal Ethics Advisory Committee.
Morgan, D.G. and Stewart, N.J. (2002). Theory building through mixed-method evaluation of a dementia special care unit. Research in Nursing and Health 25 (6): 479–488. https://doi.org/10.1002/nur.10059.
Morse, J.M. (2000). Determining sample size. Qualitative Health Research 10 (1): 3–5.
Percie du Sert, N., Hurst, V., Ahluwalia, A. et al. (2020). The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biology 18 (7): e3000410. https://doi.org/10.1371/journal.pbio.3000410.
Reynolds, P.S. (2022a). Between two stools: preclinical research, reproducibility, and statistical design of experiments. BMC Research Notes 15: 73. https://doi.org/10.1186/s13104-022-05965-w.
Reynolds, P.S. (2022b). Introducing non-aversive mouse handling with 'squnnels' in a mouse breeding facility. Animal Technology and Welfare Journal 21 (1): 42–45.
Reynolds, P.S., McCarter, J., Sweeney, C. et al. (2019). Informing efficient pilot development of animal trauma models through quality improvement strategies. Laboratory Animals 53 (4): 394–403. https://doi.org/10.1177/0023677218802999.
Rice, A.S.C., Cimino-Brown, D., Eisenach, J.C. et al. (2008). Animal models and the prediction of efficacy in clinical trials of analgesic drugs: a critical appraisal and call for uniform reporting standards. Pain 139: 243–247. https://doi.org/10.1016/j.pain.2008.08.017.
Robinson, G.K. (2000). Practical Strategies for Experimentation. Chichester: Wiley.
Sackett, D.L. (1979). Bias in analytic research. Journal of Chronic Diseases 32: 51–63. https://doi.org/10.1016/0021-9681(79)90012-2.
Shanyinde, M., Pickering, R.M., and Weatherall, M. (2011). Questions asked and answered in pilot and feasibility randomized controlled trials. BMC Medical Research Methodology 11: 117.
Smith, A.J., Clutton, R.E., Lilley, E. et al. (2018). PREPARE: guidelines for planning animal research and testing. Laboratory Animals 52 (2): 135–141. https://doi.org/10.1177/0023677217724823.
Soliman, N., Rice, A., and Vollert, J. (2020). A practical guide to preclinical systematic review and meta-analysis. Pain 161 (9): 1949–1954. https://doi.org/10.1097/j.pain.0000000000001974.
Tan, Y.J., Crowley, R.J., and Ioannidis, J.P.A. (2019). An empirical assessment of research practices across 163 clinical trials of tumor-bearing companion dogs. Scientific Reports 9: 11877. https://doi.org/10.1038/s41598-019-48425-5.
Thabane, L., Ma, J., Chu, R. et al. (2010). A tutorial on pilot studies: the what, why and how. BMC Medical Research Methodology 10: 1. http://www.biomedcentral.com/1471-2288/10/1.
Van Teijlingen, E.R., Rennie, A.M., Hundley, V., and Graham, W. (2001). The importance of conducting and reporting pilot studies: the example of the Scottish Births Survey. Journal of Advanced Nursing 34: 289–295. https://doi.org/10.1046/j.1365-2648.2001.01757.x.
Vollert, J., Schenker, E., Macleod, M. et al. (2020). Systematic review of guidelines for internal validity in the design, conduct and analysis of preclinical biomedical experiments involving laboratory animals. BMJ Open Science 4: e100046.
Wareham, K.J., Hyde, R.M., Grindlay, D. et al. (2017). Sponsorship bias and quality of randomised controlled trials in veterinary medicine. BMC Veterinary Research 13 (1): 234. https://doi.org/10.1186/s12917-017-1146-9.
Whitehead, A.L., Sully, B.G.O., and Campbell, M.J. (2014). Pilot and feasibility studies: is there a difference from each other and from a randomised controlled trial? Contemporary Clinical Trials 38: 130–133.
5
Operational Pilot Studies: 'Can It Work?'

CHAPTER OUTLINE

5.1 Introduction
5.2 Operational Tools
  5.2.1 Process, or Workflow, Maps
  5.2.2 Checklists
  5.2.3 Run Charts, Process Behaviour Charts
5.3 Performance Metrics
  5.3.1 Measurement Times
  5.3.2 Subjective Measurements
5.4 Pilots for Retrospective Chart Review Studies
  5.4.1 Standardising Performance
5.5 Sample Size Considerations
5.A Conducting Paired Data Assessments in Excel
References

5.1 Introduction

Operational pilots are a crucial preliminary to determine logistic feasibility of planned experiments ('Can it work?'). Quality can only be built into the experimental process, not added on after the experiment has been performed. Useful high-quality data can be obtained only if quality is built into the experimental process from the beginning, before animals are used. This requires understanding of the entire experimental process, knowledge of variation, organisation, and frequent evaluation and appraisal (Box 5.1).

The goal of operational pilot studies is to standardise all the logistics, procedures, tasks, and task sequences necessary to conduct the experiment. High-quality and reliable data can be obtained only if all operational and logistic aspects are identified, standardised, and mistake-proofed before the definitive experiments are performed. Operational issues are the largest contributor to uncontrolled variation in experiments. Therefore, the pre-experiment stage is indispensable for ensuring the research is both rigorous and reproducible. Operational pilots will not require statistical significance testing (Deming 1987), and may not even need animals. However, researchers should be prepared to invest up to 20% of resources (time, personnel, budget, and equipment) in operational pilots.

There are four components to an operational pilot. The first step is task identification and the order in which procedures must be performed. Benchmark, or 'success', metrics are used to evaluate whether or not the operational pilot 'works' (Chambers et al. 2014). The success of operational pilots is determined by process and procedure stabilisation without large fluctuations in day-to-day performance, and by achievement of pre-defined target performance standards. Deliverables for an operational pilot include development of standardised operating procedures (SOPs), attainment of uniformly high and consistent skill levels by technical staff, and total compliance with all experimental protocols. SOPs should be in document form for reference purposes. They enable
48 A Guide to Sample Size for Animal-based Studies

BOX 5.1 are underway (Reynolds et al. 2019). Trial-and-error


‘Can It Work?’ Operational Pilots ‘tweaking’ or ‘adjustment’ of experiments already in
progress will seriously compromise data integrity.
If you can’t describe what you are doing as a process, All animal-based research would benefit by incor-
you don’t know what you’re doing. porating best-practice quality improvement and,
– W E Deming quality control methodologies into experimental pro-
What are the operational goals and objectives? tocols. Food and Drug Administration (FDA)-
What is the work process? regulated veterinary trials are expected to be compli-
What will be measured? ant with Good Clinical Practice (GCP) guidelines
Where do bottlenecks occur? How can they be fixed? (VICH GL9. No 85) and other regulatory and QA
Where can the process be improved? standards to ensure data quality and patient safety.
Davies et al. (2017) suggest several strategies (with
What: Workflow maps, checklists, run charts. checklists) for integrating similar best-practice qual-
How will you know it ‘works’? ‘Success’ metrics: ity management practices into basic research.
Outcome-neutral performance measures. There are no mandated best-practice quality require-
ments for non-regulated academic research. How-
What next? Protocol decision criteria ever, numerous quality methodologies, resources,
Proceed as is? Amend? Terminate? and tools are available through the American Society
Is the process as good as it can be? for Quality (ASQ) and World Health Organisation
What are the next steps? Action plan (UNDP/World Bank/WHO 2006) that could be easily
adapted to research protocols and experimental pro-
What do we achieve? cesses. All methods share the common goals of
Deliverables: improving outcomes and achieving predictable
▪ Standardised operating procedures results through standardisation of processes, reduc-
▪ Consistent high-quality technical skill levels tion of variation, and use of iterative, data-driven stra-
and competencies tegies to drive improvement (Tague 2005, Reynolds
▪ 100% compliance with all protocol specifications

…consistent performance over the course of complicated sets of experiments, minimise human performance error, and reduce miscommunication and non-compliance.

The operational pilot stage is an invaluable opportunity for research personnel to acquire the best possible research practices and skills before animals are used. Technical implementation of experiments can be challenging. Experimental protocols are often complex, with many moving parts and a high potential for deviation and error. Operator training, technical skill sets, and competencies may vary considerably among staff performing the same task, and on-the-job learning will result in large changes in technical ability and performance over time. Choice between different procedures and non-test interventions (e.g. analgesia, anaesthesia) can profoundly influence both the animal's baseline status and experimental endpoints. Systemic problems with study design and conduct may not become apparent until after experiments (… et al. 2019, Backhouse and Ogunlayi 2020).

5.2 Operational Tools

There are three major tools used to identify and control variability in procedures: the workflow (or process) map, checklists, and run charts. These tools can also be used for day-to-day management of the main or definitive studies. They serve as memoranda or memory-joggers, ensuring that all personnel are compliant with the established protocol and know what to do, and where, when, and how to do it (Tague 2005; Gygi et al. 2012). These tools are simple, user-friendly, and not highly technical. Above all, these tools can enable a complete understanding of the experiment as a dynamic system without formal data analyses.

5.2.1 Process, or Workflow, Maps

Improving data quality begins by considering each experiment as a process system. The workflow map is a graphic showing the separate, in-sequence steps for all tasks in the study system (Box 5.2; Fig 5.1). Understanding, identifying, and mapping experiment components and dependencies beforehand enables a more complete understanding of variation and its causes, and of how to control the process to maximise data quality and utility (Walton 1986; NHS Institute for Innovation and Improvement 2008; Trebble et al. 2010). Informative 'downstream' experimental results occur as the consequence of minimising 'upstream' errors (Schulman 2001).

BOX 5.2
Constructing Workflow Maps

Title: Define the process.
Landmarks: Where and when does the process start? End?
Brainstorm with personnel who actually do the work to identify all necessary activities, tasks, and decision points.
Write tasks on separate cards or sticky notes.
Arrange all tasks in proper sequence.
Review with all relevant personnel involved – is the process accurately represented?
Revise as necessary.
See https://asq.org/quality-resources/flowchart; Tague (2005).

The study workflow map identifies and describes all tasks and task ordering necessary to accomplish study objectives; the specific actions, resources, personnel, time, and decisions to be made at each step; and areas of unnecessary complexity, bottlenecks, and targets for process improvement. It also provides documentation of study methods and processes. Reporting guidelines, such as ARRIVE 2.0, CONSORT, STROBE, and STROBE-VET (www.equator-network.org), and many journals strongly suggest incorporation of workflow diagrams in submitted manuscripts. Workflow diagrams require far less explanation in the text, and enable complicated methodology and study design to be explained and evaluated more easily.

[Figure 5.1 shows the mapped task sequence: Sedation → Intubation → correct placement? (NO: corrective actions) → Surgical site preparation → Surgical instrumentation → Physiological assessment → stable? (NO: corrective actions; YES: proceed with experiment).]

Figure 5.1: Simplified workflow map for a swine trauma model. There are two process decision points requiring clinical assessment of surgical quality and where corrective actions are performed if necessary. Source: Adapted from Reynolds et al. (2019).
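The same task ordering and decision points can also be recorded in machine-readable form, which makes the map easy to audit, version, and regenerate as the protocol evolves. A minimal Python sketch of the sequence in Fig 5.1 (the function and data structure are illustrative assumptions, not part of any published workflow tooling):

# Tasks in order; a decision check (if any) gates progression past a task
workflow = [
    ("Sedation", None),
    ("Intubation", "Correct placement?"),
    ("Surgical site preparation", None),
    ("Surgical instrumentation", None),
    ("Physiological assessment", "Stable?"),
]

def walk(steps, passed):
    """passed(question) returns True/False; a failed check means corrective actions."""
    for task, check in steps:
        print("Task:", task)
        if check is not None and not passed(check):
            print("  ", check, "NO -> corrective actions, then re-assess")
    print("Proceed with experiment")

walk(workflow, passed=lambda question: True)  # demo: all checks pass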

Example: Surgical Instrumentation for Physiological Monitoring in a Swine Model

(Adapted from Reynolds et al. 2019). A swine model was developed to assess the efficacy of certain fluid resuscitation interventions following acute shock trauma. The task map is shown in Fig 5.1. Anaesthetised animals were surgically instrumented to enable measurement of multiple physiological variables. The experiment proper began with administration of the experimental or control intervention. However, pre-intervention inconsistencies in procedures and uncorrected physiological disruptions (such as acidemia, hypothermia, and unstable hemodynamics) will result in unreliable and highly variable outcomes. Therefore, a priority of the pilot study was to establish procedures resulting in consistent and stable physiological baseline measurements. Two process decision points were identified where the potential for physiological disruption was great, and where corrective actions could be easily performed if necessary. The first was confirmation of correct endotracheal tube placement, with objective assessment of adequate oxygenation and ventilation. The second was after a 30-minute post-instrumentation rest period, with blood gas measurements used to assess and correct pre-intervention physiological status before application of the experimental intervention.

5.2.2 Checklists

Checklists consist of simple, specific, precise, and convenient step-by-step lists of actions or tasks for standard procedures or task groups. Checklists are not intended as a detailed list of procedures; instead, they act as an aide memoire. Checklists reduce errors of omission and memory lapse, ensure staff compliance with procedures and protocol, and enable personnel to maintain a consistent and high standard of performance and safety. Checklists also simplify management and oversight by increasing staff investment in the experimental process, so there is less reliance on the laboratory chief or principal investigator for routine protocol implementation and monitoring (Gawande 2009, Reynolds et al. 2019). Checklists can also be designed for other purposes, including troubleshooting (how to identify what went wrong and how to fix it), to-do lists (what needs to be done today), and consultation (what outside entities must be called in and when). Checklists are used routinely for mistake-proofing complex operational procedures in such diverse fields as industry, manufacturing, surgery, and aircraft operations (World Health Organization 2009, Gawande 2009).

Example: Animal Surgery Procedural Checklist

(Adapted from Reynolds et al. 2019). Table 5.1 shows a sample checklist of essential compliance items during surgical instrumentation and monitoring of anaesthetised swine. Technical staff were already familiar with the approved protocols and detailed standard operating procedures. Checklists were posted throughout the lab to remind personnel of tasks to be performed in complex procedures.

5.2.3 Run Charts, Process Behaviour Charts

Run charts are dot and line graphs of performance measurements plotted in time order. Run charts are useful for monitoring the improvement or deterioration of personnel skills over time, flagging any major changes or unusual events, pinpointing areas requiring remediation, or indicating if any protocol changes improved quality (Chakraborty and Tan 2012, Gygi et al. 2012).

Construction of a control chart requires a clearly defined set of measurable performance metrics and the time period for data collection. Formal control charts include a horizontal line indicating the average measure and upper and lower lines indicating the control limits (usually derived from historical data). These may not be necessary if the goal is to achieve some pre-specified target value (e.g. zero errors by the end of a pre-specified number of sessions).

Example: Process Chart for Technical Performance Improvement

(Adapted from Reynolds et al. 2019). Figure 5.2 shows a process chart of critical errors made over twenty sequential experiments. Fourteen specific critical errors were identified and defined a priori and in consultation with all study personnel. Critical procedural errors were associated with airway management, surgical procedure, and animal physiological monitoring. Errors per experiment were counted and plotted in time order.

Table 5.1: Sample checklist for surgical procedures and pre-intervention animal physiological status. All checks were to be documented in the surgical record for each animal.

Pre-surgical:
  Surgical skin site preparation × 3
  Sedation: correct dose, route, time to effect adequate?
Airway:
  Intubation attempt <60 s?
  Pulse oximetry readings >90%?
  Bilateral breath sounds heard?
Surgery:
  Isoflurane, oxygen levels adequate?
  Pre-emptive analgesia given?
  Anaesthetic depth checks every 15 min
Baseline:
  Mean arterial pressure >70 mmHg?
  Arterial pCO2 35–55 mmHg; pO2 95–110 mmHg?
  Arterial lactate ≤2 mmol/L?
Monitoring (animal):
  Pulse oximetry readings >90%?
  Core temperature 36–40 °C?
  Heart rate <120 bpm?
  Surgical plane adequate? Check every 15 min
Monitoring (procedural):
  Blood sampled at designated times?
  Catheters patent?
  Cardiac output signal present?
  Mechanical sigh breaths every 15 min

Source: Adapted from Reynolds et al. (2019).

[Figure 5.2: run chart of number of critical errors (y-axis, 0–9) against experiment order (x-axis, 0–20), with annotations 'Sample times on whiteboard', 'Dedicated scribe added', and 'Target zero errors'.]

Figure 5.2: Process chart of critical errors made during twenty sequential experiments. Errors decreased after a dedicated data scribe was added to the personnel roster. Source: Adapted from Reynolds et al. (2019).

The target was 0/14 errors per experiment. The graph shows that the addition of a dedicated data scribe after the first six experiments was associated with increased protocol compliance and reduced protocol errors.

Run charts are also useful for assessing variation in animal behaviour. Animal behaviour studies often require performance stability after a period of habituation time. Habituation time is adequate if task performance times or learning trajectories stabilise by the end of a certain number of training sessions. Run charts are also useful for identifying floor-ceiling artefacts in behavioural tasks. Floor-ceiling artefacts occur if the task is either too difficult, such that most or all animals fail, or too easy, such that all animals succeed.
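A simple numerical screen can back up the visual inspection of such plots. The sketch below flags sessions in which most animals sit at either performance limit; the data layout and the 0.8 cut-off are illustrative assumptions, not a published rule:

def floor_ceiling_flags(sessions, fail_limit, easy_limit, threshold=0.8):
    """sessions: one list of per-animal scores per session.
    Flag 'floor' if most animals hit the failure limit (task too hard),
    'ceiling' if most hit the easy limit (task too easy)."""
    flags = []
    for scores in sessions:
        p_fail = sum(s >= fail_limit for s in scores) / len(scores)
        p_easy = sum(s <= easy_limit for s in scores) / len(scores)
        flags.append("floor" if p_fail >= threshold
                     else "ceiling" if p_easy >= threshold
                     else "ok")
    return flags

# e.g. Morris water maze search times truncated at 120 s:
# floor_ceiling_flags(daily_times, fail_limit=120, easy_limit=10)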

Example: Process Chart for Animal Behaviour

(Data from Higaki et al. 2018). Morris water maze (MWM) testing is used to evaluate spatial learning, navigation, and reference memory in laboratory rodents. In this study, mice were tested over five consecutive days. Each session was terminated at 120 seconds of search time. Longitudinal data plots for four mice (Fig. 5.3) show increasing between-mouse variation in search times over five days and subject-specific patterns in learning trajectories. One mouse showed no improvement, and the remainder showed modest to substantial reductions.

[Figure 5.3: line plot of time to locate platform (seconds, 0–120) against day (1–5), one trace per mouse (Mouse 1–Mouse 4).]

Figure 5.3: Sequential plot of platform location times for four mice tested over five days in a Morris water maze. Search times were truncated at 120 seconds. Between-mouse variation in search times increased with day. Mouse 1 showed no improvement, whereas the remaining mice showed modest to substantial reductions. Source: Data from Higaki et al. (2018).

5.3 Performance Metrics

Performance metrics are used to assess if the operational pilot has achieved pre-specified objectives (the operational pilot 'worked'). Performance metrics are outcome-neutral; that is, they are not response variables and do not test the research hypothesis. However, they must be relevant to major operational objectives. Metrics must be simple, quantifiable, and transparent to everyone involved with the project (Gygi et al. 2012). Metrics that are not clearly defined will be inconsistently measured and collected and will not provide reliable information.

Choice of which specific metrics to use will require a preliminary process or workflow map to understand which personnel are involved with what parts of the project, what tasks must be completed, the amount of time taken to perform each task, and the resources needed for each task.

Examples: Training Metrics

All personnel to complete 100% of training requirements by a pre-specified time.
All personnel to complete all defined critical tasks with zero performance errors.
All subjects receive the study drug within a pre-specified time from study start.
All subjects are sampled within ±2 minutes of pre-specified sampling times.
All necessary dose adjustments are made in response to pre-defined laboratory criteria.
Complete (100%) follow-up of 100% of enrolled subjects.
Specified proportion of subjects achieve specified pharmacological/physiological criteria at a given dose or intervention level.
Specified proportion of subjects achieve steady-state behaviour measures by the completion of a pre-specified number of sessions.
All measurements are obtained at each time point for each subject (zero missing data).
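Most of these metrics reduce to a simple pass/fail computation once the relevant times or counts are recorded. As one illustration, a minimal sketch for the sampling-time metric above (the times shown are hypothetical):

# Scheduled vs. actual sampling times (minutes from study start)
scheduled = [0, 30, 60, 120]
actual = [1, 29, 63, 121]

deviations = [a - s for s, a in zip(scheduled, actual)]
passed = all(abs(d) <= 2 for d in deviations)   # within +/- 2 minutes?
print(deviations, "PASS" if passed else "FAIL")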
5.3.1 Measurement Times

Times required for a given task may be categorised as 'set' times, 'target' times, or 'untargeted' times. Set times are protocol-driven and are used for standardising operational procedures.

Examples: Set-Time Metrics for an Operational Pilot

Fixed sampling time points are specified for obtaining blood and tissue.
Pre-specified duration is defined for physiological stabilisation of an animal following surgical instrumentation and before experimental intervention.

Target times are mandated by best-practice requirements.

Example: Best-Practice Target Times

The definitive airway (intubation and confirmation of correct placement) for large-animal surgery models must be secured in <60 seconds to prevent injury and physiological compromise.

Untargeted times do not have rigid performance limits.

Example: No Pre-specified Performance Time

The primary objective of a successful surgical instrumentation of an experimental animal is completion of all procedures without physiological compromise. Total preparation time cannot be too rigidly fixed, because different animals will respond differently to anaesthesia and surgery and require different corrective measures and time to achieve stability. Personnel should not be compelled to sacrifice the quality of the preparation by rushing or taking shortcuts to meet an arbitrary time limit. However, if unduly prolonged preparation times result in physiological compromise, this is grounds for exclusion of that subject and its data (Reynolds et al. 2019).

5.3.2 Subjective Measurements

Some outcome variables are subjective. Assessors differ in skill levels, experience, and even interest. To ensure that all variables are collected and rated consistently with minimal bias, an operational pilot will determine if measurements are valid (data and data assessments capture what they were intended to measure), reliable (data are consistent), and assessors show agreement (the frequency with which different assessors assign the same score to the same items). For high-priority items and welfare assessments, at least two data evaluators should be used and trained to a uniform standard (Box 5.3).

BOX 5.3
Subjective Metric Applications

▪ Devising codes for variables and endpoints
▪ Classification of subjects into disease categories
▪ Behavioural research and behaviour outcomes
▪ Assessment of humane endpoints
▪ Subjective evaluations of human or animal performance
▪ Clinical decisions based on signs and symptoms
▪ Decisions based on score thresholds
▪ Retrospective chart reviews
▪ Questionnaire and survey instrument validation
▪ Qualitative research
▪ Systematic reviews and meta-analyses

5.4 Pilots for Retrospective Chart Review Studies

In retrospective chart review studies, subject data have already been collected and are contained in patient medical records or registry databases. Medical record data are commonly used in veterinary medicine to assess clinical outcomes, treatment patterns, clinical care resources, and costs. Retrospective chart reviews may also be conducted for quality assurance and improvement assessments (Jones et al. 2019).

Chart review studies are highly susceptible to bias. To prevent wasting time and effort in a futile study, piloting a chart review study begins by assessing a very small number of records (say 5–10 cases) (Box 5.4).

Preliminary scans of these records should show if there are enough case samples with characteristics that are representative of the population of interest, that the variables needed to address the research question are actually available in the records, and if there are enough complete records for later statistical analyses. Routinely collected medical data may not include the variables necessary to answer the research study question.

BOX 5.4
Piloting Retrospective Chart Reviews: Checklist

Research question and objectives defined
Identify case inclusion and exclusion criteria
Identify all databases and registries to be reviewed
Define search terms
Define and operationalise all variables (signalment, primary outcomes, secondary outcomes, predictor variables or covariates)
Determine methods of data collection (manual, automated)
Standardise data collection/entry templates
Assessors:
▪ Number of data extractors, assessors
▪ Number of replicates used to establish agreement
▪ Are ratings conducted independently?
Assessor training:
▪ Sufficient? (define performance metrics)
▪ Adequate monitoring and oversight?
▪ SOPs or procedure handbook?
Assessor performance:
▪ Reliable?
▪ Consistent?
▪ High agreement?
Identify methods for data management, screening, archiving, and security

Once it is established that a larger study is realistic, the second stage of a pilot is to develop a standardised and user-friendly data collection instrument and a standardised set of operational procedures. Pilot case records are used to define and operationalise variables for data collection, define inclusion and exclusion criteria, and develop consistent data extraction procedures. Variables that are not clearly defined, ambiguous, or inconsistently recorded in the medical records will be inconsistently measured and collected and cannot provide reliable information (Gilbert et al. 1996, Vassar and Holzmann 2013, Kaji et al. 2014).

5.4.1 Standardising Performance

Reliable and unbiased data are obtained by developing clear rules for systematic data extraction and coding (data dictionary, standardised operating procedures), then rehearsing data extraction and coding procedures before the large study. To minimise misclassification bias, at least two people should perform data extraction and collection. Agreement between data collectors is essential for ensuring consistent and reliable data extraction and coding (Kaji et al. 2014). Agreement measures are usually required by journals as part of best-practice reporting for chart review studies, for example, RECORD (Benchimol et al. 2015) and STROBE (von Elm et al. 2007).

Data dictionary. This is a list of all variables to be measured and their corresponding definitions and measurement units. The data dictionary can also include brief descriptions of how variables will be measured or assessed, other accepted meanings, and guidance on interpretation. Data that are ambiguous or have conflicting identifiers (e.g. drug names) will require clear handling instructions.

Standardised operating procedures describe how variables will be interpreted, extracted from the database, rated, counted, or categorised. Inclusion and exclusion criteria must be clearly defined. Standardised and user-friendly data collection templates are essential to guide data collection and minimise abstraction errors.

After data collection methods have been agreed upon, test-drive procedures with a few representative records. Have all personnel extract and code data independently, then review data records for discrepancies, inconsistencies, misunderstandings, ambiguities, major or consistent errors, inaccuracies, and/or omissions. Discrepancies between extractors and assessors can be flagged electronically and examined for patterns (Appendix 5.A). Formal statistical tests of inter-rater agreement are unnecessary at this stage and will be unreliable. However, inter-rater agreement measures are strongly suggested for larger definitive studies (Altman 1991).

Revise and repeat. There should be frequent discussions with all personnel (including the study lead) to resolve ambiguities, discrepancies, questions, and problems. Areas of disagreement are resolved by consensus.

Finalise. Once standardised, the definitive data collection spreadsheets and SOPs can be constructed, and all data extractors and assessors can be instructed in their use. Planning of the definitive study will require identification of the study design (matched case, case–control, cohort), matching criteria, grouping criteria, and formal sample size justification. Formal procedures must be put in place for data screening, cleaning, storage, and security (Benchimol et al. 2015).

5.5 Sample Size Considerations

Operational pilots do not require statistical significance testing. The size of operational pilot studies is determined by pragmatic considerations and logistic constraints. There may be no need for animals at all. For planning highly complex large experiments, researchers should be prepared to invest up to 20% of all resources (time, personnel, budget, and equipment). If animals must be used, bounds on animal numbers are computed relative to available resources, using simple arithmetic (Chapter 7) or probability-based (Chapter 8) approximations. To assess stability and consistency of processes and procedures, charts of counts, averages, and ranges are constructed from at least two sequential groups of 2–10 measurements each (Gygi et al. 2012).

5.A Conducting Paired Data Assessments in Excel

1. Two raters X and Y score items independently and enter data in separate Excel spreadsheets. The sheet format must be the same for each assessor (variable names, order, number of rows and columns). The first row of each sheet is reserved for variable header names.
2. Copy the separate spreadsheets into the same workbook. Each sheet is given an assessor ID (X, Y).
3. Cell-by-cell comparison: Create a third (empty) sheet with the same variable header row. Copy the Excel formula =IF(X!A2<>Y!A2,"X: "&X!A2&CHAR(10)&"Y: "&Y!A2,"") into the first cell A2. Drag and drop so all row-column dimensions are covered in the third sheet. Any discrepancies between the two assessors will be flagged with the assessor ID and entry values.
4. Examine the original entries indicated by flags for the cause of the difference. Differences may be the result of transcription and entry errors (e.g. typos, stray characters, missed information) or of assessor differences in interpretation.
5. Correct obvious coding errors. Discrepancies occurring because of ambiguities in data recording, interpretation, or coding are resolved by discussion and consensus. Retraining of data collection personnel, and revision of the data dictionary and SOPs, may be required.
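The same cell-by-cell comparison can also be scripted rather than done with spreadsheet formulas. A minimal Python/pandas sketch under the same two-rater layout (the workbook and sheet names are hypothetical):

import pandas as pd

# One identically formatted sheet per rater (steps 1-2)
x = pd.read_excel("ratings.xlsx", sheet_name="X")
y = pd.read_excel("ratings.xlsx", sheet_name="Y")

# Cell-by-cell comparison (step 3): True where the raters disagree;
# two empty cells count as agreement
mismatch = (x != y) & ~(x.isna() & y.isna())

# Flag each discrepancy with its Excel-style row, column, and both entries
rows, cols = mismatch.to_numpy().nonzero()
for r, c in zip(rows, cols):
    print(f"Row {r + 2}, {x.columns[c]}: X={x.iat[r, c]!r}, Y={y.iat[r, c]!r}")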
References

Altman, D.G. (1991). Practical Statistics for Medical Research. London: Chapman and Hall.
American Society for Quality (ASQ) (2023). Quality resources. https://asq.org/quality-resources
Backhouse, A. and Ogunlayi, F. (2020). Quality improvement into practice. BMJ 368: m865. https://doi.org/10.1136/bmj.m865
Benchimol, E.I., Smeeth, L., Guttmann, A. et al. (2015). The reporting of studies conducted using observational routinely-collected health data (RECORD) statement. PLoS Medicine 12 (10): e1001885.
Chakraborty, A. and Tan, K.C. (2012). Qualitative and quantitative analysis of Six Sigma in service organizations. In: Total Quality Management and Six Sigma (ed. T. Aized). https://doi.org/10.5772/46104; https://www.intechopen.com/chapters/38085
Chambers, C.D., Feredoes, E., Muthukumaraswamy, S.D., and Etchells, P.J. (2014). Instead of 'playing the game' it is time to change the rules: registered reports at AIMS Neuroscience and beyond. AIMS Neuroscience. https://doi.org/10.3934/Neuroscience.2014.1.4

Davies, R., London, C., Lascelles, B., and Conzemius, M. (2017). Quality assurance and best research practices for non-regulated veterinary clinical studies. BMC Veterinary Research 13: 242. https://doi.org/10.1186/s12917-017-1153-x
Deming, W.E. (1987). On the statistician's contribution to quality. Invited paper, 46th session of the International Statistical Institute (ISI), Tokyo. https://deming.org/wp-content/uploads/2020/06/On-The-Statisticians-Contribution-to-Quality-1987.pdf
von Elm, E., Altman, D.G., Egger, M., Pocock, S.J., Gøtzsche, P.C., Vandenbroucke, J.P., STROBE Initiative (2007). The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. Epidemiology 18: 800–804. https://doi.org/10.1097/EDE.0b013e3181577654
Gawande, A. (2009). The Checklist Manifesto: How to Get Things Right. New York: Picador.
Gilbert, E.H., Lowenstein, S.R., Koziol-McLain, J., Barta, D.C., and Steiner, J. (1996). Chart reviews in emergency medicine research: where are the methods? Annals of Emergency Medicine 27 (3): 305–308. https://doi.org/10.1016/s0196-0644(96)70264-0
Gygi, C., Williams, B., and DeCarlo, N. (2012). Six Sigma for Dummies, 4e. Hoboken, NJ: John Wiley & Sons.
Higaki, A., Mogi, M., Iwanami, J. et al. (2018). Recognition of early stage thigmotaxis in Morris water maze test with convolutional neural network. PLoS ONE 13 (5): e0197003. https://doi.org/10.1371/journal.pone.0197003
Jones, B., Vaux, E., and Olsson-Brown, A. (2019). How to get started in quality improvement. BMJ 364: k5408. https://doi.org/10.1136/bmj.k5437
Kaji, A.H., Schriger, D., and Green, S. (2014). Looking through the retrospectoscope: reducing bias in emergency medicine chart review studies. Annals of Emergency Medicine 64 (3): 292–298. https://doi.org/10.1016/j.annemergmed.2014.03.025
NHS Institute for Innovation and Improvement (2008). Improvement Leaders' Guide: Process Mapping, Analysis and Redesign. Warwick: NHS.
Reynolds, P.S., McCarter, J., Sweeney, C., Mohammed, B.M., Brophy, D.F., Fisher, B., Martin, E.J., and Natarajan, R. (2019). Informing efficient pilot development of animal trauma models through quality improvement strategies. Lab Animal 53 (4): 394–404. https://doi.org/10.1177/0023677218802999
Schulman, J. (2001). 'Thinking upstream' to evaluate and to improve the daily work of the Newborn Intensive Care Unit. Journal of Perinatology 21: 307–311. https://doi.org/10.1038/sj.jp.7200528
Tague, N.R. (2005). The Quality Toolbox, 2nd ed. Milwaukee, WI: ASQ Quality Press, American Society for Quality.
UNDP/World Bank/WHO Special Programme for Research and Training in Tropical Diseases & Scientific Working Group on Quality Practices in Basic Biomedical Research (2006). Handbook: Quality Practices in Basic Biomedical Research. World Health Organization. https://apps.who.int/iris/handle/10665/43512
Trebble, T.M., Hansi, N., Hydes, T., Smith, M.A., and Baker, M. (2010). Process mapping the patient journey: an introduction. BMJ 341: c4078. https://doi.org/10.1136/bmj.c4078
Vassar, M. and Holzmann, M. (2013). The retrospective chart review: important methodological considerations. Journal of Educational Evaluation for Health Professions 10: 12. https://doi.org/10.3352/jeehp.2013.10.12
Walton, M. (1986). The Deming Management Method. New York: Perigee.
World Health Organization (2009). WHO Guidelines for Safe Surgery 2009: Safe Surgery Saves Lives. Geneva: WHO Press, World Health Organization.
6 Empirical and Translational Pilots

CHAPTER OUTLINE

6.1 Introduction
6.2 Building in Evidentiary Strength
  6.2.1 Internal Validity
  6.2.2 External Validity
6.3 Sample Size Determination
  6.3.1 Information Density
  6.3.2 Information Power
  6.3.3 Veterinary Clinical Trials
  6.3.4 A Note on Safety and Tolerability
6.4 Assessing Evidentiary Strength
  6.4.1 Exploratory Data Analysis
  6.4.2 Coverage Plots
  6.4.3 Sample Size from Confidence Intervals and Standard Deviation
  6.4.4 Profile Plots
  6.4.5 Half-Normal Plots
  6.4.6 Interaction Plots
  6.4.7 Replication
  6.4.8 Design and Sample Size for Replication
References

6.1 Introduction

Empirical and translational pilot studies provide preliminary evidence of efficacy and translation potential. They are used to determine if further experimentation will be worthwhile (Box 6.1; Table 6.1). Empirical pilot studies provide evidence that the intervention 'might work' (Bowen et al. 2009). Efficacy potential is assessed by the quality of the outcome, or response, variables. The response should be large enough to be measured (magnitude), be stable and consistent over multiple determinations (accuracy and precision), and behave as predicted by the research hypothesis (relevance). The goal is to obtain the largest possible signal from the data. Therefore, empirical pilots should be conducted under controlled and relatively ideal conditions, rather than attempting a more realistic replication of the clinical or disease model. Generalisation is not a consideration at this phase.

BOX 6.1
Empirical and Translational Pilot Studies

Sample size is optimised for information density and information power.

Empirical pilots provide evidence of potential efficacy under ideal conditions.
Goal: Signal maximisation.
Is the response large enough to be measured? Stable enough to be reliable? Expected from the research hypothesis? Realistic compared to previously published data?

Translational pilots provide evidence of generalisability for a variety of models and conditions.
Goal: Signal robustness.
When assessed over a range of animal/disease models and operating conditions, are results consistent? Reliable? Concordant?




Table 6.1: Comparison of empirical and translational pilot studies.

'Might it work?' Empirical pilots
Is there preliminary evidence of intervention and outcome feasibility (the intervention 'might work')?
Are outcomes and endpoints
▪ large enough to be measured? Can they be measured easily? Are laboratory readouts within limits of detection?
▪ stable and consistent (minimal trend, minimal variation)?
▪ clinically or biologically meaningful (concept validity)?
▪ Does the intervention do what it was supposed to do?
▪ Is the intervention potentially cost- and time-effective to implement?
Success metrics
▪ Outcome or outcome surrogates compared to an a priori biologically relevant standard ('minimum difference to be detected')
Validity
▪ Internal validity
▪ Construct validity
Tools
▪ Small, nimble, designed, randomised controlled experiments
Deliverables
▪ Progression decision metrics
▪ Effect size and confidence intervals
▪ Proof of principle, proof of concept

'Will it work?' Translational pilots
Is there preliminary evidence of efficacy and concordance?
Efficacy
▪ Are models and endpoints clinically relevant?
▪ Are results clinically or biologically meaningful?
Concordance
▪ Are results robust?
▪ Are results consistent and reproducible over multiple independent models/locations/conditions?
Success metrics
▪ Outcome or outcome surrogates compared to an a priori biologically relevant standard ('minimum difference to be detected')
Validity
▪ Internal validity
▪ External validity: construct validity, face validity, predictive validity
Tools
▪ Appropriately designed experiments; adequate and appropriate sample sizes, randomisation, blinding, and controls
▪ Multi-batch experiments
▪ Systematic heterogeneity
Deliverables
▪ Progression decision metrics
▪ Effect size and confidence intervals
▪ Statistical significance tests may be appropriate, depending on study objectives

Translational pilot studies ('Will it work?') provide preliminary evidence of generalisability (Bowen et al. 2009). Generalisable results are broadly applicable over a range of disease models, animal models, and operating conditions. Preclinical proof of principle (pPoP) experiments demonstrate effect of the intervention on the target disease process or pathophysiology. Preclinical proof of concept (pPoC) experiments demonstrate effect of the intervention in clinically relevant animal models, assessed by measured outcomes with a direct relationship to the clinical disease under study (Dolgos et al. 2016; Table 6.2). The goal of translational research is the evidence-based transition from 'bench to bedside' – the progression from basic science and/or mechanistic studies to clinically relevant animal models, clinical proof of concept studies, and eventually to randomised clinical trials (Rossello and Yellon 2016; Heusch 2017).

Animal-based laboratory studies are usually exploratory, designed to investigate multiple factors with small sample sizes. Therefore, sample sizes for preclinical pilot studies will be determined by information power rather than statistical power.




Table 6.2: Proof of principle and proof of concept: definitions.

Preclinical proof of principle (pPoP)


▪ Demonstrates effect on the target disease process or pathophysiology
▪ Outcome variables are surrogates for the disease process.
Example: Evidence of potential efficacy of a new therapeutic agent for cancer is assessed by biomarkers for cell
proliferation and apoptosis as surrogate measures of ‘benefit’ (Dolgos et al. 2016).

Preclinical proof of concept (pPoC)


▪ Demonstrates effect in disease-relevant preclinical animal models,
▪ Outcome variables have a direct relationship to the clinical disease
Example: Evidence of potential efficacy of a new therapeutic agent for cancer is assessed by tumour remission in patient-derived xenograft mouse models (Dolgos et al. 2016).

Clinical proof of concept (cPoC)


Early phase randomised clinical trials used to
▪ demonstrate product or device viability
▪ decide if confirmatory phase III trials can proceed (‘go/no-go’) and inform their planning.

Information power is the maximum amount of useful data that can be collected from each subject (information density) for the fewest total number of animals. In veterinary clinical research, pilot studies are usually a preliminary to large randomised controlled trials. Pilot studies are used to assess feasibility, and pilot data to estimate effect size components. Methods for estimating sample sizes for each type of research are discussed separately.

Evidentiary strength of pilot data is determined by study validity, with the emphasis on internal validity and reduction of systematic biases (Berkman et al. 2013). It is measured by effect size and precision, and assessed by data visualisation and data plots. P-values alone are not evidence for whether or not the intervention 'works'. P-values do not provide information about alternative hypotheses, an essential component of evidence assessment (Goodman and Royall 1988), and because they are based on small and usually methodologically unstable samples, P-values will be unreliable. Cautious use of P-values may be justifiable for larger replication studies and clinically oriented pilot studies. Translation potential is assessed by generalisability or external validity. Generalisability is established by consistency across multiple lines of evidence, concordance with clinically relevant conditions, and replication of model results across diverse operating conditions (Pitkänen et al. 2013; Kimmelman et al. 2014; Kimmelman and Federico 2017; Karp et al. 2020).

6.2 Building in Evidentiary Strength

Study strength and data quality are determined by study validity (Box 6.2). Validity is the qualitative assessment of 'the degree to which a result from a study is likely to be true and free from bias' (Higgins et al. 2022). The objective of a good experiment is to establish cause-and-effect between the experimental intervention and the study results. However, biases in study design, data collection, outcome assessments, and analysis methods will lead to misleading conclusions. Without validity, results will merely reflect the 'comparison of uncomparable groups', rather than a meaningful effect of the experimental intervention.

BOX 6.2
Validity

Validity = 'the degree to which a result from a study is likely to be true and free from bias'

Internal validity
Methods-based ('truth within the study').
Purpose: Minimises systematic error (bias).

External validity
Model-based ('truth beyond the study').
Purpose: Increases study generalisability.




For a study to produce high-quality data, methods that ensure study validity are far more important than either large sample sizes or experiment replication (Muhlhausler et al. 2013).

Bias is unrelated to sample size. However, minimising bias is essential for reducing animal use and the total number of animals used. Biased studies result in misleading claims of efficacy and translation potential. A major consequence of biased studies is the unintended waste of all animals used in those studies.

There are two main categories of validity: internal validity and external validity.

Internal validity is methods-based. It refers to the extent of systematic error (bias) in the study and the methods used to minimise and control bias ('truth within the study').

External validity is model-based. It refers to the extent to which the results obtained under the specific study conditions (the model) can be applied (generalised) to the larger population ('truth beyond the study').¹

Internal validity is necessary for both empirical and translational pilots, and is an essential prerequisite for all hypothesis-testing studies. Internal validity alone does not confer external validity. However, without internal validity, studies cannot demonstrate external validity (Moher et al. 2010). Both types of validity must be built into the study design before experiments are conducted. They cannot be produced by statistical analyses after the fact (Bespalov et al. 2019).

6.2.1 Internal Validity

Internal validity refers to the risk of bias or systematic error inherent to a specific study (Box 6.3). All measurements incorporate two sources of error: random error and systematic error. Random error measures the precision of results and is determined by sample size. Precision is summarised by the standard deviation and standard error. In contrast, systematic error (bias) is unrelated to either sample size or precision. Bias is the consistent directional mismatch of observed results with the true effect, resulting from deficiencies in study design, conduct, and/or reporting (Eisenhart 1968; Higgins et al. 2011). Therefore, internal validity is a measure of confidence in the cause-effect relationship demonstrated by the study. Internal validity provides grounds for conclusions that treatment effects or observed differences between groups are real, and not an artefact of other potential explanations for the observed effects (Henderson et al. 2013; Pound and Ritskes-Hoitinga 2018).

BOX 6.3
Internal Validity

Internal validity refers to the amount of bias incurred by flawed study design, conduct, and/or reporting.

Bias is minimised by
Appropriate comparators
Randomisation
Allocation concealment (blinding)
Predefined inclusion and exclusion criteria
Clearly defined outcome measures

Bias is not lack of precision.
An unbiased study does not necessarily have good external validity.

Bias adversely affects both internal and external validity; however, lack of bias does not confer external validity. For example, a study may be unbiased because it was conducted with high internal validity safeguards, but it cannot demonstrate external validity if the study sample is not representative of the larger target population.

Bias occurs during subject selection, allocation of interventions to subjects, subject processing order, measurement and assessment of outcomes, and as a result of subject drop-out and missing data. The major bias domains are selection bias, performance bias, detection bias, and attrition bias (Huang et al. 2020). Bias is minimised by appropriate controls or comparators, randomisation, allocation concealment (blinding), clearly defined inclusion and exclusion criteria, and clearly defined outcome measures (Higgins et al. 2011; Huang et al. 2020). These are described more fully in Chapter 3.

¹ https://courses.internal.vetmed.wsu.edu/jmgay/clinical-epidemiology-evidence-based-medicine-glossary/clinical-study-design-and-methods-terminology




6.2.2 External Validity

External validity refers to the extent to which results generated by the study (the model system) are representative and robust, and therefore more widely applicable (generalisable) to the larger population (Box 6.4). Representativeness refers to how well the animal model captures essential features of the condition being modelled (model fidelity). Robustness refers to consistency of results obtained for diverse models, and how well models and constructs perform over a variety of conditions. Study replication contributes to assessment of robustness, and external validity to translation potential. If results cannot be replicated under different conditions ('generalising across'), then translational success ('generalising to') is unlikely.

BOX 6.4
External Validity

Extent to which results are representative, robust, and generalisable.
External validity is not possible without internal validity.

Representative ('generalising to')
Face validity
Construct validity
Predictive validity

Replicable ('generalising across')
Direct replication
Conceptual replication

Unfortunately, successful translation of preclinical results to clinical applications has been rare (Henderson et al. 2013; Freedman et al. 2015; Pound and Ritskes-Hoitinga 2018; Errington et al. 2021a, b). This is primarily because the internal validity of most preclinical studies is poor (Henderson et al. 2013; Macleod et al. 2015). External validity is not possible without internal validity, and will not be improved if methods used to increase study realism compromise internal validity (Lynch 1982). However, even high internal validity will not produce externally valid results if the model system is not sufficiently representative or clinically relevant. A non-representative animal model compromises external validity (Pound and Ritskes-Hoitinga 2018).

6.2.2.1 Representativeness

The animal model will not be representative if the model does not match the target population in essentials. Model signalment may not match demographics of the target population, the disease model may not match the relevant pathophysiology of the target condition, and the overall model may lack clinical relevance (model fidelity). Representativeness is assessed by face validity, construct validity, and predictive validity. These criteria were first described in 1969 as principles for the development of animal models of psychiatric affective disorders (McKinney and Bunney 1969). They were subsequently elaborated and refined by Willner (1984) and have since been extended to other basic science animal models (e.g. Markou et al. 2009).

Face validity is assessed by similarity of the animal model to the clinical or disease phenotype, and by whether model results are specific to the clinical condition being modelled ('Does it look right?').

Construct validity is assessed by the similarity of the animal model and its performance to the target clinical condition (the model is homologous), and by whether results are related theoretically, empirically, and unambiguously to the target condition ('Does it act right?').

Predictive validity is assessed by the similarity of the animal model to the target population, and the specificity of response or therapeutic outcome to the target outcome ('Do results from the animal model correctly predict performance in the clinical setting?'). In drug discovery and pharmacological intervention studies, predictive validity refers to similarity of responses of humans to the animal model ('human-animal correlation of therapeutic outcomes'; Belzun and Lemoine 2011). It also refers to the clinical relevance of the disease model or therapeutic biomarkers. Predictive validity is generally considered the most important validity item for translation (Belzun and Lemoine 2011; McGonigle and Ruggeri 2014; Tadenev and Burgess 2019).




There are no hard and fast criteria as to how to maximise external validity, or how to decide how much face or construct validity is needed for predictive validity. External validity criteria tend to be subjective and are therefore subject to bias in both interpretation and practical application. Trade-offs between the various types of external validity will be necessary for models of complex multifactorial diseases, such as oncological and neurodegenerative diseases (Pitkänen et al. 2013; Tadenev and Burgess 2019).

For a model to be 'good enough' for translational application, it needs to be informative and reasonably predictive of potential efficacy. A model can be useful without strict model fidelity, which in any case will be rarely possible (Pitkänen et al. 2013; Galanopoulou et al. 2017), nor always necessary (Russell and Burch 1959). The best option is to understand the general principles of external validity, and learn how to apply them on a study-by-study basis. A practical guide for ensuring construct validity was developed by Henderson et al. (2013). This is a 25-item checklist in four domains: the animal model, the clinical condition or disease to be modelled, experimental outputs, and experimental operations (Table 6.3). For preclinical drug development models, Ferreira et al. (2019) developed a rigorous scoring system and score calculator for comparing animal models and identifying the model that will best fit specific research objectives. The calculator is based on 22 validation criteria in eight domains: epidemiological, symptomology and disease natural history, genetic, biochemical, aetiological, pharmacological, and endpoints.

6.2.2.2 Sex as a Biological Variable

Sex is an essential variable for ensuring construct validity. Sex differences in basic biology, pathophysiology, pharmacokinetics, and response to interventions have been reported for numerous diverse animal models (Karp et al. 2017; Wilson et al. 2022). However, female animals are still markedly under-represented in much of biomedical research (Will et al. 2017; Karp and Reavey 2019). There is no evidence that females are inherently more variable than males, a common reason for justifying male-only studies (Beery and Zucker 2011; Beery 2018). As it is, many published animal-based studies are based only on a single sex, or are too underpowered to detect potential interaction of sex with other factors.

Table 6.3: Construct validity checklist.

Animal model – Do features of the animal model represent those of the clinical patient population?
  Signalment
  Age (juvenile, adult, aged)
  Sex
  Body mass
  Baseline characteristics: behaviour, physiology, other phenotype markers
  Comorbidities
  Inclusion and exclusion criteria

Clinical/disease model – Does the lab-based disease model represent the clinical disease syndrome?
  Mechanistic pathways
  Signs and symptoms
  Pathophysiology
  Chronicity (acute or chronic)
  Severity
  Clinical treatment: how defined? Co-interventions
  Treatment delivery: timing, route/method, duration, exposure
  Clinical setting

Outputs – Do response variables represent clinically relevant outcomes?
  Biomarkers
  Physiological response
  Allometric scaling

Experimental/operational – Are they appropriate and correctly performed?
  Appropriate controls
  Appropriate methodology
  Technical skill
  Experimental confounds
  Location

Source: Adapted from Henderson et al. (2013).




Because sex bias in preclinical research greatly limits translation, National Institutes of Health best-practice standards strongly encourage inclusion of female animals in addition to males (Clayton and Collins 2014) and incorporation of sex as a biological variable in study design (Miller et al. 2017; Honarpisheh and McCullough 2019). Regulatory guidelines covering chemical and drug toxicity testing may mandate testing of both sexes at each dose level (e.g. OECD Test Guidelines https://www.oecd.org/).

Sample size considerations. Incorporating both sexes into the study design does not require doubling of sample size, nor does it need to cost more money (Clayton 2015). Including sex as a biological variable is handled by appropriate statistical study design. If sex is not of interest as a primary explanatory (independent) variable, variation due to sex differences can be controlled by statistically based methods, such as stratified randomisation on sex and sample size balance. If sex is included as an independent variable, statistical power for detecting sex effects is increased by factorial and repeated-measures designs, and further improved by incorporating sex-related covariates, such as weight.

6.2.2.3 Body Size and Allometric Scaling

Body size differences between model and target species are a major challenge to construct validity. Commonly, allometric relationships are determined for cross-sectional data over a range of body sizes from multiple species. Allometric models are used for scaling pharmacokinetics, risk, drug dosages (Caldwell et al. 2004; Huang and Riviere 2014), tumour growth (Pérez-García et al. 2020), and physiological responses. They can also be used to estimate the likely range of responses if specific data for a given species are unavailable (Lindstedt and Schaeffer 2002).

The relationship between body size for the reference and target species is

Y = (X_target / X_ref)^b

where X is a measure of body size, such as body mass, and b is the scaling coefficient. The scaling coefficient b is obtained by nonlinear regression Y = aX^b or, more commonly, by least-squares regression on log-transformed variables: log(Y) = log(a) + b log(X).

Ratios (for example, mg/kg body mass) should be used only with considerable caution. Ratios implicitly assume the response Y is directly proportional to body size (b = 1). If b is not equal to 1, then ratio data will result in considerable statistical bias, spurious correlation, and greatly reduced precision. Body size differences should be accounted for by regression-based methods, such as analysis of covariance or multiple regression (Tanner 1949; Kronmal 1993; Packard and Boardman 1988, 1999; Jasienski and Bazzaz 1999). Body mass is useful for laboratory animal-to-human drug dosage conversions if exponents are verified by current evidence or appropriate pharmacological models. However, body surface area (e.g. mg/cm³) has been long discredited as an imprecise, flawed, and obsolete metric without any biological basis, and should not be used (Tang and Mayersohn 2011; Blanchard and Smoliga 2015).

Allometric equations may be extremely unreliable when used for prediction, especially for drug dosages and pharmacokinetics. General scaling relationships can provide initial approximations for expected forms of the relationship between size and a given physiological response (Calder 1984; Schmidt-Nielsen 1984). However, extrapolation to other species based on inappropriate conversion assumptions and insufficient evidence of efficacy can result in catastrophic outcomes (Leist and Hartung 2013; Kimmelman and Federico 2017; Van Norman 2019). Validity of scaling exponents depends on the type of comparison (interspecific or intraspecific), the data used to construct the relationship, and the validity of the allometric model itself (choice of the best-fit line, data distribution, influence of outliers, measurement error). Allometric models assume that the response is determined by size-related factors alone, and do not account for other potential determinants, such as age, species or strain, body composition, and sex (Eleveld et al. 2022). Sex effects in mice may be trait-specific; case-by-case evaluation of drug dosage scaling for mice is recommended (Wilson et al. 2022).
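In practice, the exponent is usually estimated from the log-log fit. A minimal Python sketch (the body masses and responses below are hypothetical illustration values, not data from any cited study):

import numpy as np

# Hypothetical cross-sectional data: body mass (kg) and response
X = np.array([0.02, 0.25, 3.0, 10.0, 70.0])
Y = np.array([0.9, 6.1, 36.0, 84.0, 320.0])

# log(Y) = log(a) + b*log(X); slope = b, intercept = log(a)
b, log_a = np.polyfit(np.log(X), np.log(Y), 1)
a = np.exp(log_a)

# Scale a reference response to a target body size: Y_t = Y_ref*(X_t/X_ref)**b
def scale(y_ref, x_target, x_ref):
    return y_ref * (x_target / x_ref) ** b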




Sample size considerations. For a simple allometric regression based on two variables, a rule-of-thumb sample size is N ≥ 25 if variance is expected to be high (Jenkins and Quintana-Ascencio 2020). Sample size can be determined more formally by iteration for a pre-specified precision of the regression coefficient b. Precision is calculated from the two-sided 100(1 − α)% confidence interval, b ± t(1−α/2, n−2) SE(b). For multiple regression models, sample size can be determined from Cohen's effect size f² = R²/(1 − R²) (Chapter 15). Disadvantages of this method are the requirement for preliminary estimates of R², and the assumption of a straight-line fit. More advanced methods for sample size determination, based on the joint estimation of the intercept and slope and estimation of the non-centrality parameter, have been described (Colosimo et al. 2007; Jan and Shieh 2019). Large formal analyses involving multiple strains or species should consider phylogenetic regression methods (Cooper et al. 2016).
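The iteration itself is easy to automate. One way to implement it, assuming pilot estimates of the residual standard deviation and the standard deviation of the (log-transformed) predictor are available (the values below are hypothetical placeholders):

import math
from scipy import stats

sigma_e, sd_x = 0.25, 0.9   # pilot estimates: residual SD, SD of log(X)
target = 0.10               # required half-width of the 95% CI for b

n = 3
while True:
    se_b = sigma_e / (sd_x * math.sqrt(n - 1))        # SE of the slope
    half_width = stats.t.ppf(0.975, df=n - 2) * se_b  # t(1-a/2, n-2)*SE(b)
    if half_width <= target:
        break
    n += 1
print("minimum n:", n)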
6.3 Sample Size Determination

Sample size features: statistical study design, sequential testing strategy, internal validity.

Sample size for preclinical pilots is determined by information density and information power rather than statistical power (Box 6.5). The goal of pilot studies is to obtain the maximum amount of useable, scientifically productive data from the fewest number of animals. This contrasts with sample size for definitive or confirmatory experiments, which must be based on a predefined primary outcome, effect size, and power. The concept is borrowed from qualitative research studies and is based on the idea that the more relevant information is contained in the sample, the fewer subjects will be needed (Malterud et al. 2016).

BOX 6.5
Information Density and Power

Information density
Maximum amount of useable, scientifically productive data obtained from each animal.
Sample size features: Trade-off between efficiency and effectiveness.

Information power
Maximum amount of useable, scientifically productive data from the fewest number of animals.

6.3.1 Information Density

Information density is the maximum amount of useable, scientifically productive data that can be obtained from each animal. However, measuring as many outcome variables as possible is not the same as maximising the amount of useable information. There is a trade-off between technical efficiency and effectiveness (or thoroughness; Hollner 2009). Efficiency refers to procedural leanness: an efficient experiment minimises investment of animals, resources, time, and workload to obtain the information required to meet study objectives. Effectiveness (or thoroughness) refers to task quality: an effective study produces data that are correct, accurate, timely, and of uniformly high quality. An effective study will also include recognition and prompt correction of mistakes or errors in protocol implementation.

Design strategies for information density will involve trade-offs between animal welfare measures (refinement) and overall reduction of animal numbers (Nunamaker et al. 2021). Emphasis on technical speed and efficiency comes at the expense of data quality. Measuring too many outcome variables can produce results that are too difficult to interpret (Vetter and Mascha 2017), but measuring too few may miss important relationships or not properly capture the test effect. More data-intensive procedures per subject will reduce the number of animals that can be processed in a given time. On the other hand, as complexity increases, procedural times also increase, together with the potential for increased suffering and risk of adverse events (such as the animal dying before completion of the procedure). After methodology and protocol have been standardised, experimental performance can shift to an emphasis on efficiency, although frequent monitoring of task quality throughout the research cycle is advised. Emphasis on task quality over task completion may necessitate a shift in lab culture, and will require adequate oversight, communication, and support from investigators (Reynolds et al. 2019).
tive data from the fewest number of animals. completion may necessitate a shift in lab culture



Empirical and Translational Pilots 65

Sample size for information density. Pilot studies should prioritise thoroughness over efficiency. Therefore, operational constraints (time, skilled technical personnel, budget, resources) will determine the number of endpoints measured on each animal, and therefore the number of animals that can be processed in a given time (Chapter 5). The number of endpoints will be determined by procedural needs per task and variable priority ('need to know' mission-critical variables versus less important 'nice to know' variables). When surgical instrumentation is required, the number of procedures that can be realistically completed per animal is assessed by placing strict limits on maximum permissible surgical duration and time under anaesthesia.

6.3.2 Information Power

Information power is the maximum amount of useable, scientifically productive data obtained from the fewest number of animals (sensu Malterud et al. 2016). Even for pilot studies, formal statistical experimental designs are strongly recommended. Compared to trial-and-error experimentation, experimental designs make the variance of the responses as small as possible, and avoid or minimise bias in estimates of effects. Bias is further reduced by randomisation and blinding. Designed pilot experiments greatly increase the chances of detecting a difference in responses between test interventions.

An experiment involves changing one or more explanatory variables (factors) and observing the effects on one or more response variables. With exploratory animal-based studies, there will be a large number of potential factors and very little prior knowledge of how they might be expected to affect the response or each other. The conventional approach is experiments limited to two-group and one-factor-at-a-time (OFAT) comparisons, with sample size calculated as multiples of the number of experimental 'groups' (Box and Draper 1987; Czitrom 1999). This approach is slow, inefficient, misses potentially useful information, and wastes large numbers of animals. In contrast, designed experiments, especially factorial-type designs, permit simultaneous assessment of multiple input variables and are flexible, efficient, and very economical. They can therefore contribute to large savings in animal numbers for much more information. In general, designed experiments involve selection of the most likely set of candidate factors, deciding on a small fixed number of levels for each factor (usually a high and low value), then conducting the experiment on all possible combinations of those factors and factor levels. Each run combination is randomly assigned to an experimental unit. Genuine replicates made at the same run combination, and the addition of centre points, provide an estimate of the variance for the effect (usually a regression coefficient). Therefore, sample size is determined by the number of replicate runs, not by 'group size' (Box and Draper 1987; Box et al. 2005; Chapter 19).
0005604125 3D 65 10/8/2023 5:14:35 PM


few remaining factors to determine the significance of interaction effects and identify regions of the optimal or target response. Typical designs include full factorial and response surface designs. Responses are quantified by linear or polynomial regression and ANOVA table statistics (Box et al. 2005; Montgomery 2017).

3. Factor testing. Optimising experimental conditions will maximise the experimental signal if it exists. Therefore definitive experiments can proceed with greater certainty of success.

Results from each experiment provide feedback on whether or not the experiment is converging to the target solution predicted by the research hypothesis. Sequential feedback means that the study factors can be changed strategically and sensibly as information from each stage becomes available, rather than having to resort to trial-and-error tweaking (Czitrom 1999; Box et al. 2005). Because the factorial approach allows a much greater amount of precision for a given number of experimental runs, it is much more likely that a true effect will be detectable and not swamped by experimental error and unknown variation.
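For readers working in R, classic two-level screening designs of the kind described above can be generated in a few lines. This is a minimal sketch assuming the contributed FrF2 package is installed (object names are illustrative; see Lawson 2020 for the related daewr package):

# Regular two-level fractional factorial and Plackett-Burman
# screening designs, generated with the FrF2 package
library(FrF2)

# 2^(5-1) half-fraction: five factors screened in 16 randomised runs
half_fraction <- FrF2(nruns = 16, nfactors = 5, randomize = TRUE)

# 12-run Plackett-Burman design, screening up to 11 two-level factors
pb_design <- pb(nruns = 12)

summary(half_fraction)
summary(pb_design)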
Sample size for information power. Sample size recommendations for multi-batch studies are a minimum of three animals per 'group' (Karp et al. 2020). However, with screening designs, fewer animals can be used because sample size is determined, not by group size, but by the number of replicate runs. Statistical power increases both with the size of the regression coefficients and with coefficient hierarchy; that is, more power is associated with main effects and less power with interactions (Jones and Nachtsheim 2011). Dean and Lewis (2006) is an excellent introduction to screening designs. Customised screening designs can be generated in commercially available packages such as SAS, SAS JMP Pro, and R (e.g. Lawson 2020).

Example: Information Power: Conventional Versus Screening Designs

Investigators wished to evaluate the efficacy potential of a trial vaccine in mice. They thought that measurements of efficacy would be most strongly affected by mouse strain, vaccine dose, challenge-killing interval, mouse age at vaccination, and mouse sex. The initial study plan was to evaluate six mouse strains, three vaccine doses, three challenge-killing intervals, and three ages, on both male and female mice. The goals were to determine what combination of factors contributes to the 'best' response, and to quantify any possible sex differences.

Conventional approach. The investigators proposed a series of experiments designed as a series of two-group comparisons or 'one-way ANOVAs' on combinations of the major factors. They intended to use five mice per group 'because that was necessary for statistical significance' and was 'what everyone else does'. They concluded that at least 1620 mice would be required: strain × (dose × interval × age × sex) = 6 × (3 × 3 × 3 × 2) = 324 experimental 'groups', and 324 groups × 5 mice per group = 1620 mice.

This piecemeal approach has serious disadvantages. It is likely to miss the most important factors or optimal factor levels giving the 'best' response, and it cannot estimate interactions between factors. It is highly unlikely that an experiment involving over a thousand animals could be processed in a reasonable period of time. Animals will be wasted as they age out of the study before they can be used. Finally, a large un-designed experiment is difficult to control and risks potentially large, unmanageable, and undetectable sources of variation that will swamp true experimental signals (Reynolds 2021).

Definitive screening design. The same study could be more efficiently and economically designed as a screening study, using far fewer animals, with increased probability of detecting a real effect. Strain and sex are categorical factors; dose, interval, and age are continuous factors with three levels each.

A reasonable approach for a screening study is to run separate definitive screening experiments on
each strain, with each strain experiment run in blocks so that experimental effort can be distributed over two sessions. Table 6.4 is an example of a definitive screening design to be run in two blocks. For each trial, 18 runs for each strain are performed in random order. Nine males and nine females are required for each experiment. Each factor level is replicated seven times at each of the low and high values, and four times at the intermediate levels. The centre points (0, 0, 0) provide the variance estimates for assessing the significance of main effects. The total number of mice required is now (6 strains × 18 runs) = 108 mice. This design requires fewer than 10% of the number of animals originally proposed. Definitive studies based on factorial designs will require formal simulation-based power analyses, especially if interaction effects are of primary interest (Chapter 19).

Table 6.4: Example of a definitive screening experiment for assessing vaccine efficacy in mice. The design was generated in SAS JMP Pro 16. Experiments are conducted on separate strains in two blocks on the factors of sex, dose, interval, and age. This type of design requires less than 10% of the animals originally requested.

Run | Block | Dose | Interval | Age | Sex
1 | 1 | −1 | −1 | 1 | Female
2 | 1 | −1 | −1 | 1 | Male
3 | 1 | 0 | 1 | 1 | Female
4 | 1 | 0 | −1 | −1 | Male
5 | 1 | −1 | 1 | 0 | Male
6 | 1 | 1 | −1 | 0 | Female
7 | 1 | 0 | 0 | 0 | Male
8 | 1 | 1 | 1 | −1 | Male
9 | 1 | 0 | 0 | 0 | Female
10 | 1 | 1 | 1 | −1 | Female
11 | 2 | −1 | 0 | −1 | Male
12 | 2 | 1 | 1 | 1 | Male
13 | 2 | 1 | −1 | 1 | Male
14 | 2 | −1 | −1 | −1 | Female
15 | 2 | −1 | 1 | 1 | Male
16 | 2 | −1 | 1 | −1 | Female
17 | 2 | 1 | 0 | 1 | Female
18 | 2 | 1 | −1 | −1 | Female
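The animal-number arithmetic of this example is easily checked. A minimal R sketch, using the values given above:

# Conventional 'group'-based plan: one group per factor-level combination
groups <- 6 * (3 * 3 * 3 * 2)    # strain x (dose x interval x age x sex) = 324
mice_conventional <- groups * 5  # five mice per 'group' = 1620

# Definitive screening plan: sample size is set by runs, not group size
mice_screening <- 6 * 18         # one 18-run experiment per strain = 108

mice_screening / mice_conventional  # 0.067, i.e. fewer than 10% of the original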
6.3.3 Veterinary Clinical Trials

Data from empirical clinical pilots can be used to assess feasibility and also to obtain estimates of sample size parameters (such as outcome variance, event rates, and effect size) for the later clinical trial (Chapter 16.4). Suggested sample size for a clinical trial pilot is a minimum of 12 subjects per arm, for a total of 24 subjects (Julious 2005). Up to 70 subjects per arm may be necessary, depending on the effect size to be estimated and the amount of precision required for the estimate of the variance (Teare et al. 2014).

6.3.4 A Note on Safety and Tolerability

Empirical and translational pilot studies are usually too small to reliably assess safety or tolerability of a clinical intervention, especially if the safety metric is occurrence of an adverse event (yes/no). Adverse events are usually rare, and zero events in a small study do not mean that the intervention is safe. The sample size N to detect at least one adverse event can be approximated if the probability of detection α and the expected prevalence p of adverse events are known (or can be guessed). Sample size required for a safety study can be approximated by probability-based feasibility calculations (Chapter 8).
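For orientation, the detection calculation referred to above can be sketched as follows. If adverse events are assumed to occur independently with prevalence p, the probability of observing at least one event in N subjects is 1 − (1 − p)^N, so N ≥ ln(1 − α)/ln(1 − p). A minimal R sketch (the function name is illustrative; see Chapter 8 for the formal treatment):

# Smallest N giving probability alpha of observing at least one
# adverse event when the true event prevalence is p
n_detect_one <- function(p, alpha = 0.95) {
  ceiling(log(1 - alpha) / log(1 - p))
}

n_detect_one(p = 0.05)  # 59 subjects for a 5% event, 95% detection probability
n_detect_one(p = 0.01)  # 299 subjects for a rarer (1%) event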
6.4 Assessing Evidentiary Strength

Strength of evidence means having confidence in study results. There is no sense continuing a certain line of enquiry or method of experimentation if only poor or unreliable results are produced. Strong evidence is obtained by consistency of results across multiple lines of experimentation. Exploratory data analysis by data visualisation methods enables rapid assessment of the size, direction, and precision of results. The most promising findings are verified by sequential testing and replication, first with different models of the same syndrome, followed by independent replication within and across laboratories or locations (Box 6.6).
BOX 6.6
Assessing Strength of Evidence

A. Causal appraisal
  1. Consistency
  2. Exploratory data analysis (data visualisation, graphs, and plots)

B. Sequential testing
  1. Preliminary evidence of efficacy
    a. Input factor screening and reduction
    b. Tests of efficacy
  2. Replication
    a. Different models of the same syndrome
    b. Independent replication across multiple laboratories

BOX 6.7
Exploratory Data Analysis

Always plot your data.

Routine assessments
Why: Assessing data patterns, anomalies, outliers, distributional assumptions
What: Basic graphics (dot plots, histograms, boxplots, scatterplots, etc.)

Evidentiary strength
Why: Assessment of effect size, direction, and precision
What: Coverage plots, profile plots, half-normal plots

Simple preliminary appraisal checks are easily performed by plotting the data, examining patterns, then informally scoring observed patterns as consistent with the scientific hypothesis, inconsistent, or undetermined (Berkman et al. 2013; US EPA 2017). Data plots can be used to assess effect strength, direction, heterogeneity, and transitivity. Evidentiary strength guidelines were first proposed for observational epidemiological studies in the 1960s and have been updated (Howick et al. 2009). The Causal Analysis/Diagnosis Decision Information System (CADDIS; US EPA 2017) is a useful guide to performing causal assessments and can be readily adapted for laboratory studies.

6.4.1 Exploratory Data Analysis

Exploratory data analysis (EDA) with graphs and data plots is an indispensable first step for all studies, including pilots. Graphics provide easy and rapid visualisation of effect size, direction, and precision without formal statistical testing (Box 6.7). Routine visual assessments of raw data with simple graphics (histograms, boxplots, scatterplots, cumulative distribution function graphs, etc.) are essential for identifying patterns, anomalies, and outliers, and for checking distributional assumptions prior to analyses (Tukey 1977; Filliben and Heckert 2012).
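A minimal base-R sketch of these routine assessments, run on simulated pilot data (all data and names are illustrative):

set.seed(101)
pilot <- data.frame(
  group   = rep(c("control", "treated"), each = 15),
  outcome = c(rnorm(15, mean = 10, sd = 2), rnorm(15, mean = 12, sd = 2))
)

hist(pilot$outcome)                            # distribution, anomalies
boxplot(outcome ~ group, data = pilot)         # group patterns, outliers
plot(ecdf(pilot$outcome))                      # cumulative distribution function
qqnorm(pilot$outcome); qqline(pilot$outcome)   # distributional assumptions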
Evidentiary strength is evaluated by specialised plots showing the direction, magnitude, and precision of results. These include coverage plots (means and confidence intervals), profile plots (means and simultaneous confidence intervals), and half-normal plots (regression coefficients). Detailed descriptions of methods for calculating confidence intervals and other measures of precision are given in Part III.

6.4.2 Coverage Plots

Coverage plots are graphs of means and corresponding confidence intervals relative to the null hypothesis and the target, or minimum biologically important, difference (MBID) to be detected (Clayton and Hills 1993). These plots allow simultaneous assessment of observed effect size, precision, confidence, and power, and allow immediate evaluation and interpretation of the size, direction, and variation associated with the observed effect (Steidl et al. 1997; Cumming 2012; Rothman and Greenland 2018).

The target difference is a measure of effect size and drives subsequent power calculations. It is the measured difference in the primary outcome variable between interventions and/or controls. It must be determined a priori and clearly defined, and the biological importance of the difference must be scientifically justified (Cook et al. 2015, 2017). The target difference and rationale should be
reported in the methods section of protocols and manuscripts. For the study to have sufficient power to detect the target difference, the lower limit of the confidence interval must be greater than the upper limit of the null hypothesis value. However, intervals containing the null value may also contain values that are not statistically significant but may be of practical or biological importance (Figure 6.1).

Figure 6.1: Coverage plot showing hypothetical mean effects and associated confidence intervals, positioned relative to the null value and the minimum biologically important difference. (Panel legend pairs statistical with scientific significance: Yes/Yes; No/Equivocal; Yes/No; No/No.) Evidence for a biologically important effect is suggested if confidence intervals cross or exceed the predefined target difference to be detected ('yes'); results are 'equivocal' if neither the target difference nor the expected null difference can be excluded; and results are unpromising if confidence intervals include the null but exclude the target difference ('no').
exclude the target difference (‘no’). Sample size greatly affects variability but does not
affect estimates of the expected mean of the sample
reported in the methods section of protocols and (Bishop et al. 2022). Lee et al. (2014) recommend
manuscripts. For the study to have sufficient power that strength of preliminary evidence can be
to detect the target difference, the lower limit of the assessed by comparing the effect to null and refer-
confidence interval must be greater than the upper ence values with confidence intervals of different
limit for null hypothesis value. However, intervals widths (e.g. 95%, 90%, 80%, 75%). This information
containing the null value may also contain values can then be used to calculate appropriately pow-
that are not statistically significant but may be of ered sample sizes by working backwards from a
practical or biological importance (Figure 6.1). given confidence interval to compute the sample
Therefore evaluation of pilot study results consists size. A visual approach simplifies evaluation of
of a three-part decision tree: responses relative to some specific reference or tar-
get value. To obtain 80% power, the sample size
1. Does the confidence interval for the observed must be large enough so that 80% of all possible
difference cross or exceed the target differ- estimates will be at least 1.96 standard errors from
ence and exclude the null value? the reference point. If the observed effect does not
attain statistical significance, the sample size in a
If ‘Yes’, results are promising. Sample sizes
new study must be increased by a factor calculated
for subsequent definitive studies can be
as the square of the ratio of the current value of the
obtained by working backwards from the
standard deviation to that value required to obtain
confidence intervals to compute the sample
a desired power with specified confidence (Clayton
size. Plotting confidence interval widths indi-
and Hills 1993).
cates how subsequent studies may be pow-
ered to increase precision of the mean
difference, and confidence interval widths Example: Calculating Sample Size
indicates how subsequent studies may be
powered to increase precision. The target difference in a hypothetical experi-
2. Does the confidence interval cross both the ment is 0.5. The standard deviation (SD) observed
target difference and the null value? in a pilot study with sample size of 30 is 0.3. How-
ever, the standard deviation required to detect the
If ‘Yes’, results are equivocal. Such large target difference with confidence of 95% and
confidence intervals cannot preclude the power 80% is
possibility of either no true difference at
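The decision tree lends itself to a simple computation. A minimal R sketch classifying a pilot confidence interval against the null value and the MBID (the function name and numerical values are illustrative):

classify_pilot <- function(lower, upper, null = 0, mbid) {
  if (upper < mbid) return("unpromising")  # CI excludes the target difference
  if (lower > null) return("promising")    # CI reaches target and excludes null
  "equivocal"                              # CI covers both null and target
}

# 95% CI for an observed mean difference of 0.40 with standard error 0.15:
ci <- 0.40 + c(-1, 1) * qnorm(0.975) * 0.15          # 0.106 to 0.694
classify_pilot(ci[1], ci[2], null = 0, mbid = 0.5)   # "promising"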

6.4.3 Sample Size from Confidence Intervals and Standard Deviation

Sample size greatly affects variability but does not affect estimates of the expected mean of the sample (Bishop et al. 2022). Lee et al. (2014) recommend that the strength of preliminary evidence can be assessed by comparing the effect to null and reference values with confidence intervals of different widths (e.g. 95%, 90%, 80%, 75%). This information can then be used to calculate appropriately powered sample sizes by working backwards from a given confidence interval to compute the sample size. A visual approach simplifies evaluation of responses relative to some specific reference or target value. To obtain 80% power, the sample size must be large enough so that 80% of all possible estimates will be at least 1.96 standard errors from the reference point. If the observed effect does not attain statistical significance, the sample size in a new study must be increased by a factor calculated as the square of the ratio of the current value of the standard deviation to the value required to obtain the desired power with specified confidence (Clayton and Hills 1993).

Example: Calculating Sample Size

The target difference in a hypothetical experiment is 0.5. The standard deviation (SD) observed in a pilot study with a sample size of 30 is 0.3. However, the standard deviation required to detect the target difference with confidence of 95% and power 80% is

SD = 0.5/(1.96 + 1.282) = 0.154

Therefore, to detect a difference of 0.5, the sample size for the new study must be increased by a factor of (0.3/0.154)² = 3.8. The sample size in the new study would have to be

N = 3.8 × 30 ≈ 114 subjects
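The same back-calculation can be scripted directly. A minimal R transcription of the worked example (note it pairs z = 1.96 for 95% confidence with z = 1.282 as the power term, exactly as printed above):

target   <- 0.5   # target difference
sd_pilot <- 0.3   # standard deviation observed in the pilot
n_pilot  <- 30

sd_required <- target / (1.96 + 1.282)      # 0.154
inflation   <- (sd_pilot / sd_required)^2   # 3.8
ceiling(inflation * n_pilot)                # 114 subjects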
Complex experiments with multiple factors can be
modelled by multiple or polynomial regression,
(see Chapter 10 for details). Transformed means and and regression coefficients displayed in a half-
corresponding confidence bands for all variables are normal plot. The regression coefficients are the
plotted on the same graph. Confidence intervals effect sizes. The half-normal plot allows rapid visual
that do not intersect the pre-specified null value assessment of the strength of all main effects and
suggest evidence for the existence of an effect. No interactions, and discrimination of the most prom-
evidence is provided against the null hypothesis if ising factors to include in subsequent experiments.
confidence intervals contain the null value. Appen- The magnitude of each effect is assessed by the posi-
dix 10.A provides sample SAS code for calculating tion of the coefficient relative to a reference line
simultaneous confidence bands and generating pro- constructed from all the effects closest to zero.
file plots. Effects furthest from zero on the x-axis have greater
magnitude and are ‘significant’ (Daniel 1959).
A factor is important if it has a large main effect
Example: Nutrient Screening of Obese Dogs or is involved in a large two-factor interaction
in a Weight Loss Study (furthest from line of zero effect).
(Data from German et al. 2015). Daily intake of There are six steps in the construction of a
20 essential nutrients was measured for 27 obese half-normal plot:
dogs on an energy-restricted weight loss diet. The
goal was to establish if average daily nutrient 1. Calculate all k regression coefficients for
intakes complied with recommended dietary main effects and two-way interactions
levels. Observations for each nutrient were first 2. Sort coefficients in ascending rank order
standardised to its respective specific minimum from smallest to largest, i = 1 to N.

0005604125 3D 70 10/8/2023 5:14:36 PM
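A minimal R sketch of the band computation above, applied to simulated standardised intakes (the data and variable names are illustrative; Appendix 10.A gives the book's SAS version):

# Simultaneous confidence bands for p standardised variables,
# implementing the F-based formula from Section 6.4.4
sci_bands <- function(Z, alpha = 0.05) {
  n <- nrow(Z); p <- ncol(Z)
  means <- colMeans(Z)
  s     <- apply(Z, 2, sd)
  half  <- (s / sqrt(n)) *
           sqrt(p * (n - 1) / (n - p) * qf(1 - alpha, p, n - p))
  data.frame(variable = colnames(Z), mean = means,
             lower = means - half, upper = means + half)
}

set.seed(7)
Z <- matrix(rnorm(27 * 3, mean = 1.2, sd = 0.4), ncol = 3,
            dimnames = list(NULL, c("protein", "calcium", "zinc")))
sci_bands(Z)   # bands that exclude 1 suggest departure from the reference ratio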


Figure 6.2: Profile plot. Mean daily intake of essential nutrients for 27 obese dogs, with 95% simultaneous confidence intervals (y-axis: standardised mean ratio and 95% CI). Values are standardised to a common ratio based on recommended daily allowance levels. (a) Standardised mean values and confidence intervals for all 20 nutrients; the dotted line is the line of equality for the standardised recommended intake. (b) Mean observed values and confidence intervals for the subset of nutrients with standardised means less than 10. Source: Data from German et al. (2015).


6.4.5 Half-Normal Plots

Complex experiments with multiple factors can be modelled by multiple or polynomial regression, and the regression coefficients displayed in a half-normal plot. The regression coefficients are the effect sizes. The half-normal plot allows rapid visual assessment of the strength of all main effects and interactions, and discrimination of the most promising factors to include in subsequent experiments. The magnitude of each effect is assessed by the position of the coefficient relative to a reference line constructed from all the effects closest to zero. Effects furthest from zero on the x-axis have greater magnitude and are 'significant' (Daniel 1959). A factor is important if it has a large main effect or is involved in a large two-factor interaction (furthest from the line of zero effect).

There are six steps in the construction of a half-normal plot:

1. Calculate all k regression coefficients for main effects and two-way interactions.
2. Sort the coefficients in ascending rank order from smallest to largest, i = 1 to k.
3. Calculate the median order statistic for each of the k values: mᵢ = (i − 0.5)/k.
4. Obtain the z value for each order statistic mᵢ from the standard normal distribution.
5. Plot the z-transformed order statistics against the effect sizes (regression coefficients).
6. Fit a reference line to the effects clustered most closely to zero.

Example: Effects of Environmental Toxins on Mouse Growth

(Data from Porter et al. 1984). An experiment was designed to simultaneously assess interactions of food and water restriction (factors 1 and 2), an immunosuppressant chemical (factor 3), an infectious agent (factor 4), and a diet supplement (factor 5) on growth rates of Swiss-Webster white mice. The study design was a five-factor half-fraction factorial design at two levels. This design used 2⁵⁻¹ = 16 runs replicated twice, totalling 32 female mice. Growth rates were measured on pups from a single litter from each mouse.

There were k = 15 regression coefficients obtained from the 5 main effects and 10 interactions (Table 6.5). The half-normal plot (Figure 6.3) shows the most important factors were food restriction (factor 1), water restriction (factor 2), and the infectious agent (factor 4). Diet supplement (factor 5) and the interaction effects were not important. Factor 5 was dropped from subsequent experiments and a second environmental contaminant chemical was substituted.

Table 6.5: Environmental toxins and mouse growth: screening for effect size. Regression coefficients obtained from polynomial regression on five factors and all two-way interactions.

Factor ID | Regression coefficient | Median order statistic | z
1 | 0.100 | 0.967 | 0.8331
2 | 0.053 | 0.900 | 0.8159
3 | −0.008 | 0.233 | 0.5922
4 | −0.033 | 0.033 | 0.5133
5 | 0.008 | 0.633 | 0.7367
1×2 | 0.011 | 0.700 | 0.7580
1×3 | 0.015 | 0.833 | 0.7977
1×4 | −0.017 | 0.100 | 0.5398
1×5 | 0.013 | 0.767 | 0.7784
2×3 | 0.001 | 0.500 | 0.6915
2×4 | −0.005 | 0.367 | 0.6431
2×5 | −0.010 | 0.167 | 0.5662
3×4 | −0.007 | 0.300 | 0.6179
3×5 | −0.001 | 0.433 | 0.6676
4×5 | 0.003 | 0.567 | 0.7145

Source: Data from Porter et al. (1984).

Figure 6.3: Half-normal plot. The experiment screened five environmental stressors thought to affect mouse pup growth. The plot displays regression coefficients (effect sizes, x-axis) computed for the five factors and all two-way interactions against the normal order median statistic (y-axis). Source: Data from Porter et al. (1984).
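The Table 6.5 columns can be reproduced directly. A minimal R sketch (note that the z column as printed corresponds to pnorm() of the median order statistic; the reference-line indices are chosen by eye):

coefs <- c(`1` = 0.100, `2` = 0.053, `3` = -0.008, `4` = -0.033,
           `5` = 0.008, `1x2` = 0.011, `1x3` = 0.015, `1x4` = -0.017,
           `1x5` = 0.013, `2x3` = 0.001, `2x4` = -0.005, `2x5` = -0.010,
           `3x4` = -0.007, `3x5` = -0.001, `4x5` = 0.003)

k   <- length(coefs)              # 15 coefficients
ord <- order(coefs)               # step 2: ascending rank order
m   <- (seq_len(k) - 0.5) / k     # step 3: median order statistics
z   <- pnorm(m)                   # step 4: values in the z column

plot(coefs[ord], z,
     xlab = "Effect size (coefficients)",
     ylab = "Normal order median statistic")
text(coefs[ord], z, labels = names(coefs)[ord], pos = 3, cex = 0.7)
abline(lm(z[2:8] ~ coefs[ord][2:8]))   # step 6: reference line through the
                                       # effects clustered nearest zero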
6.4.6 Interaction Plots

Interaction plots are a simple and rapid method for assessing generalizability. In general, the interaction plot shows the means of one independent categorical variable on the x-axis and displays the means of a second categorical variable as separate lines, showing how the means change with the levels of the first variable (Chapter 19). Interaction plots used to assess generalizability plot values of the treatment variables on the y-axis against categorical confounder variables on the x-axis.


Interactions are negligible if the line is horizontal, parallel to the x-axis. Alternatively, if the interaction is significant, the line is not horizontal and the mean treatment response is not consistent across all the levels of the confounder. This suggests that treatment main effects are dominated by the confounder, and may not be biologically important or are too variable (Lynch 1982).

Lynch (1982) suggests four potential sources of confounding interactions with treatment; a sketch of such a plot follows this list.

Subject group by treatment interactions, when the experimental units comprise distinct groups or cohorts. For example, young adult mice will not be representative of aged mice.

Block by treatment interactions. Lynch defines these as 'situational specifics' that include potential blocking variables such as location, lighting, noise, treatment administration, investigator, timing of measurement, etc.

Time by treatment interactions are suggested if cause-effect relationships differ between baseline and subsequent measurement periods.

Replicate by treatment interactions. Consistency between replicates suggests the replicate × treatment effect is negligible and provides evidence of the robustness of results.
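A minimal base-R sketch of an interaction plot, simulating a treatment whose effect is consistent across replicates (data and names are illustrative):

set.seed(42)
dat <- expand.grid(replicate = factor(1:4),
                   treatment = c("control", "treated"))
dat <- dat[rep(seq_len(nrow(dat)), each = 8), ]
dat$response <- 10 + 2 * (dat$treatment == "treated") + rnorm(nrow(dat))

# Near-parallel, near-horizontal traces suggest a negligible
# replicate-by-treatment interaction
interaction.plot(x.factor     = dat$replicate,
                 trace.factor = dat$treatment,
                 response     = dat$response,
                 xlab = "Replicate", ylab = "Mean response",
                 trace.label  = "Treatment")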

Example: Interaction Plots Across Replicate Experiments

(From von Kortzfleisch et al. 2020). Investigators compared the effectiveness of a conventional experimental design and a multi-batch design for evaluating behavioural and physiological differences between four mouse strains. Strain differences were evaluated across four replicate experiments. The overall effect size was the mean strain difference. Mean strain differences were estimated from the replicate experiments for each paired strain comparison. Evidence of reproducibility was suggested by consistency of the effect of mouse strain across replicate experiments; that is, the (strain × replicate) interaction is statistically negligible. They found that the multi-batch approach improved reproducibility and increased discovery of treatment effects. Figure 6.4 shows an exemplary interaction plot presenting mean differences between strains and confidence intervals; the first plot shows poor repeatability across replicates, and the second good repeatability with a negligible effect of replicate.

Figure 6.4: Interaction plots for assessing generalizability. The graphs show mean differences between two mouse strains and 95% confidence intervals across four experimental replicates, relative to the null effect. (a) Poor repeatability across replicates; evidence for generalizability is poor. (b) Good repeatability across replicates; evidence for generalizability is good, as the data suggest treatment effects are relatively consistent and independent of replicate effects. Source: Adapted from von Kortzfleisch et al. (2020).


6.4.7 Replication

Replication studies consist of multiple experiments conducted to assess and validate the robustness of research results (Box 6.8). They are considered the ultimate test of external validity (Drude et al. 2021). However, there is considerable misunderstanding of how replication is defined, and how replication studies are to be designed and conducted. Frequently, investigators think a 'replication study' consists of two or more repetitions of an entire experiment over identical conditions (Fitts 2011; Frommlet and Heinze 2021). However, repeating an unrepresentative, biased, and poor-quality study does not provide evidence of external validity or translation potential, no matter how many times the study is repeated.

BOX 6.8
Replication

Multiple experiments conducted to assess the robustness of research findings.

Direct replications: Multiple experiments similar across all study components, and performed by the same laboratory.

Conceptual replications: Multiple experiments differing in one or more key inputs: location, model, operating conditions, variables.

Repeating unrepresentative, biased, and poor-quality studies does not provide evidence of external validity or translation potential.

Pitkänen et al. (2013) recommend a three-phase experimental approach:

1. Preliminary studies designed to show evidence of efficacy.
2. Independent replication studies in a different laboratory, guided by data from the first study in a given model.
3. Confirmatory testing in another model of the same syndrome.

Replication studies should be conducted only after there is substantive preliminary evidence of efficacy. Animal use oversight committees expressly discourage unnecessary duplication of animal experiments. Justification for replication studies must be based on consistency of evidence across multiple sources (including a comprehensive literature review), evidentiary strength of prior data (especially strong internal validity), and clinical relevance of the models and outcome variables (Drude et al. 2021). Initial studies are followed by testing in another animal and/or disease model of the same syndrome, culminating in independent replication studies in different laboratories (Landis et al. 2012; Henderson et al. 2013; Pitkänen et al. 2013; Dirnagl et al. 2021; Drude et al. 2021).

Table 6.6 distinguishes the different types of replication experiments. The main components of a replication study are location (single laboratory, multiple laboratories), study conditions (animal environment, lab environment, equipment, reagents, personnel, etc.), model system (animal, animal model, disease model), and variables measured (explanatory variables, response variables).

Direct replications are multiple experiments that are the same across all study components, and performed by the same laboratory (Fraser et al. 2019). Simple direct repeats of experiments ('repeating an experiment in triplicate') by a single laboratory are not recommended unless internal validity is high and the experimental protocol is rigorously defined (Würbel 2017; Drude et al. 2021; Frommlet and Heinze 2021). Simple repeats can only estimate differences due to sampling error between repeats, and thus estimate only the overall measurement error of the experiment, not the validity of experimental results. Moreover, if variation is non-systematic and uncontrolled, simple direct repeats are expensive and waste animals without increasing information. In contrast, if variation is systematically introduced with appropriate study designs (e.g. multi-batching), direct replication studies can provide a preliminary assessment of robustness across experiments, and inform decisions to proceed with replication between laboratories (Richter 2017; Drude et al. 2021).

Conceptual replications are multiple experiments that differ in one or more key inputs that influence the effect through different mechanisms


or pathways. These include tests of treatment effect as a function of different animal models (species, strain), disease models, disease severity levels, age groups, treatment interventions, and/or experimental outcomes. Conceptual replications are an example of planned heterogeneity. Independent replication experiments performed across multiple laboratories are regarded as the gold standard for external validity and reproducibility (Karp 2018; Voelkl et al. 2018; Usui et al. 2021). However, single-laboratory replications consisting of a series of mini-experiments separated over time can be as effective as a multi-lab study, as well as more economical (von Kortzfleisch et al. 2020).

Table 6.6: Six types of replication studies defined by number of experimental locations, 'conditions', model system (animal or disease model), and variables measured (explanatory and/or response). Replication studies can be the same across all components (direct replication), or differ in one or more components (conceptual replication).

Location | Experimental conditions | Model system | Variables | What is evaluated?

Direct replication
Same | Same | Same | Same | Single-laboratory study with identical repeats (direct internal replication). Can assess sampling error, quality standards, mistakes, fraud, etc.; measurement error, precision. External validity (limited) with multi-batch designs (planned heterogeneity).
Differ | Same | Same | Same | Multi-laboratory study (independent replication). Assesses effects of different lab procedures, personnel, equipment, environments, etc. in different labs. Robustness (between laboratories), external validity.

Conceptual replication
Same | Differ | Same | Same | Systematic variation in experimental operating conditions (planned heterogeneity) between replications. Conditions that are varied include environment, batch, days, time of day, operators, suppliers, vendors, etc. Robustness (procedural/process replication).
Same | Same | Same | Differ | Systematic changes in explanatory factors, response variables, and/or methods for measuring or quantifying the response. Assesses operationalisation of the research question (e.g. 'benefit' = symptom relief versus improved function versus improved survival). Robustness (outcome replication); external validity.
Same | Same | Differ | Same | Systematic changes in the animal model/species/strain and/or disease model. Tests generalisation and robustness of results in a new model system related to measures of efficacy or effect tested. Robustness (animal and disease model); external validity.
Same or differ | Same or differ | Same or differ | Same or differ | Multiple simultaneous systematic controlled changes in any or all components of study design and operations. Robustness, external validity.

Source: Adapted from Fraser et al. (2019).

Efficiency-effectiveness trade-offs are inherent to replication studies. The relative efficiency of repeating experiments in different laboratories compared to a single laboratory can be judged by criteria such as greater statistical power, relative reduction in animal numbers to detect a given effect size, and


improved coverage probability of the true effect size for a fixed number of animals (von Kortzfleisch et al. 2020). Replications across as few as two laboratories can produce substantial improvements in power and predictive validity, as long as protocols are standardised and internal validity is ensured (Karp 2018; Voelkl et al. 2018; Drude et al. 2021).

6.4.8 Design and Sample Size for Replication

Any statistically based experimental design appropriate for the study can be used for replication studies (factorial, split-plot, randomised complete block, etc.). Internal validity must be high, with clearly specified methods for randomisation and allocation concealment (blinding), and defined inclusion and exclusion criteria (Drude et al. 2021).

Sample size is greatly reduced by adoption of multi-batch designs rather than the conventional approach of replicating independently powered experiments. Multi-batch designs consist of several small independent experiments conducted at separate time points by the same laboratory or even in different laboratories. They are in effect large-scale randomised complete block experiments where each block is a batch or replicate. Results are combined to assess intervention effects. The suggested minimum is three batches (Karp et al. 2020; von Kortzfleisch et al. 2020).
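A minimal R sketch of the corresponding analysis, treating batch as a blocking factor in a randomised complete block layout (simulated data; three batches, as per the suggested minimum):

set.seed(11)
dat <- expand.grid(batch     = factor(1:3),
                   treatment = c("control", "treated"))
dat <- dat[rep(seq_len(nrow(dat)), each = 6), ]

batch_effect <- rnorm(3, sd = 0.8)   # between-batch variation
dat$y <- 10 + 1.5 * (dat$treatment == "treated") +
         batch_effect[as.integer(dat$batch)] + rnorm(nrow(dat))

# Intervention effect estimated across batches, blocking on batch
summary(aov(y ~ batch + treatment, data = dat))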


Empirical and Translational Pilots 77

Colosimo, E.A., Cruz, F.R.B., Miranda, J.L.O., and Van anesthesiology. Anesthesiology 136 (4): 609–617.
Woensel, T. (2007). Sample size calculation for method https://doi.org/10.1097/ALN.00000000000
validation using linear regression. Journal of Statistical 04115 .
Computation and Simulation 77 (6): 505–516. Errington, T.M., Denis, A., Perfito, N. et al. (2021a).
https://doi.org/10.1080/00949650601151729. Reproducibility in cancer biology: challenges for asses-
Cook, J.A., Hislop, J., Altman, D.G. et al. (2015). Specify- sing replicability in preclinical cancer biology. eLife 10:
ing the target difference in the primary outcome for a e67995. https://doi.org/10.7554/eLife.67995.
randomised controlled trial: guidance for researchers. Errington, T.M., Mathur, M., Soderberg, C.K. et al.
Trials 16: 12. https://doi.org/10.1186/s13063- (2021b). Investigating the replicability of preclinical
014-0526-8. cancer biology. eLife 10: e71601. https://doi.org/
Cook, J.A., Julious, S.A., Sones, W. et al. (2017). Choosing 10.7554/eLife.71601.
the target difference (‘effect size’) for a randomised Ferreira, G.S., Veening-Griffioen, D.H., Boon, W.P.C.
controlled trial – DELTA2 guidance protocol. Trials et al. (2019). A standardised framework to identify
18 (1): 271. https://doi.org/10.1186/s13063- optimal animal models for efficacy assessment in drug
017-1969-5. development. PLoS ONE 14 (6): e0218014. https://
Cooper, N., Thomas, G.H., and FitzJohn, R.G. (2016). doi.org/10.1371/journal.pone.0218014.
Shedding light on the ‘dark side’ of phylogenetic com- Filliben, J.J. and Heckert, A. (2012). Exploratory data
parative methods. Methods in Ecology and Evolution analysis. In: NIST/SEMATECH e-Handbook of
7 (6): 693–699. https://doi.org/10.1111/2041- Statistical Methods (ed. C. Croakin and P. Tobias).
210X.12533. http://www.itl.nist.gov/div898/handbook/2012.
Cumming, G. (2012). Understanding the New Statistics: https://doi.org/10.18434/M32189.
Effect sizes, Confidence Intervals, and Meta-Analysis. Fitts, D.A. (2011). Ethics and animal numbers: informal
New York: Routledge. analyses, uncertain sample sizes, inefficient replica-
Czitrom, V. (1999). One-factor-at-a-time versus designed tions, and Type I errors. Journal of the American Asso-
experiments. The American Statistician 53: 126–131. ciation for Laboratory Animal Science 50: 445–453.
Daniel, C. (1959). Use of half-normal plots in interpreting Fraser, H., Barnett, A., Parker, T.H., and Fidler, F. (2019).
factorial two-level experiments. Technometrics 1 (4): The role of replication studies in ecology. Ecology and
311–341. Evolution 10: 5197–5207.
Dean, A. and Lewis, S. (2006). Screening Methods for Freedman, L.P., Cockburn, I.M., and Simcoe, T.S. (2015).
Experimentation in Industry, Drug Discovery, and The economics of reproducibility in preclinical
Genetics. Springer, 330 p. https://link.springer. research. PLoS Biology 13: e1002165.
com/content/pdf/10.1007%2F0-387-28014-6. Frommlet, F. and Heinze, G. (2021). Experimental repli-
Dirnagl, U., Bannach-Brown, A., and McCann, S. (2021). cations in animal trials. Laboratory Animals 55 (1): 65–
External validity in translational biomedicine: 75. https://doi.org/10.1177/0023677220907617.
understanding the conditions enabling the cause to Galanopoulou, A.S., Pitkänen, A., Buckmaster, P.S., and
have an effect. EMBO Molecular Medicine 14 (2): Moshé, S.L. (2017). What do models model? What
1757–4676. https://doi.org/10.15252/emmm.2021 needs to be modeled? In: Models of Seizures and Epi-
14334. lepsy, 2e (ed. A. Pitkänen, P.S. Buckmaster, A.S. Gala-
Dolgos, H., Trusheim, M., Gross, D. et al. (2016). Trans- nopoulou, and S.L. Moshé), 1107–1119. Elsevier.
lational Medicine Guide transforms drug development German, A.J., Holden, S.L., Serisier, S. et al. (2015).
processes: the recent Merck experience. Drug Discovery Assessing the adequacy of essential nutrient intake
Today 21 (3): 517–526. https://doi.org/10.1016/ in obese dogs undergoing energy restriction for weight
j.drudis.2017.01.003. loss: a cohort study. BMC Veterinary Research 11: 253.
Drude, N.I., Gamboa, L.M., Danziger, M. et al. (2021). https://doi.org/10.1186/s12917-015-0570-y.
Science Forum: improving preclinical studies through Goodman, S.N. and Royall, R. (1988). Evidence and sci-
replications. eLife 10: e62101. https://doi.org/ entific research. American Journal of Public Health 78
10.7554/eLife.62101. (12): 1568–1574. https://doi.org/10.2105/ajph.
Eisenhart, C. (1968). Expression of the uncertainties of 78.12.1568.
final results. Science 160: 1201–1204. https://doi. Henderson, V.C., Kimmelman, J., Fergusson, D. et al.
org/10.1126/science.160.3833.1201. (2013). Threats to validity in the design and conduct
Eleveld, D.J., Koomen, J.V., Absalom, A.R. et al. (2022). of preclinical efficacy studies: a systematic review of
Allometric scaling in pharmacokinetic studies in guidelines for in vivo animal experiments. PLoS

0005604125 3D 77 10/8/2023 5:14:36 PM


78 A Guide to Sample Size for Animal-based Studies

Medicine 10 (7): e1001489. https://doi.org/ Jones, B. and Nachtsheim, C.J. (2011). A class of three
10.1371/journal.pmed.1001489. level designs for definitive screening in the presence
Heusch, G. (2017). Critical issues for the translation of of second-order effects. Journal of Quality Technology
cardioprotection. Circulation Research 120 (9): 43 (1): 1–15. https://doi.org/10.1080/00224065.
1477–1486. https://doi.org/10.1161/CIRCRESAHA. 2011.11917841.
117.310820. Julious, S.A. (2005). Sample size of 12 per group rule of
Higgins, J.P., Altman, D.G., Gøtzsche, P.C. et al. (2011). thumb for a pilot study. Pharmaceutical Statistics 4:
The Cochrane collaboration’s tool for assessing risk of 287–291.
bias in randomised trials. BMJ 343: d5928. https:// Karp, N.A. (2018). Reproducible preclinical research—Is
doi.org/10.1136/bmj.d5928. embracing variability the answer? PLoS Biology 16 (3):
Higgins, J.P.T., Savović, J., Page, M.J. et al. (2022). e2005413. https://doi.org/10.1371/journal.
Chapter 8: Assessing risk of bias in a randomized trial. pbio.2005413.
In: Cochrane Handbook for Systematic Reviews of Inter- Karp, N.A., Mason, J., Beaudet, A.L. et al. (2017). Preva-
ventions, version 6.3 (updated February 2022) (ed. H. lence of sexual dimorphism in mammalian phenotypic
JPT, J. Thomas, J. Chandler, et al.). Cochrane www. traits. Nature Communications 8: 15475. https://
training.cochrane.org/handbook. doi.org/10.1038/ncomms15475.
Hollner, E. (2009). The ETTO Principle: Efficiency- Karp, N.A. and Reavey, N. (2019). Sex bias in preclinical
Thoroughness Trade-Off: Why Things That Go Right research and an exploration of how to change the sta-
Sometimes Go Wrong. Taylor & Francis Group, tus quo. British Journal of Pharmacology 176 (21):
ProQuest Ebook Central https://ebookcentral. 4107–4118. https://doi.org/10.1111/bph.14539.
proquest.com/lib/ufl/detail.action?docID= Karp, N.A., Wilson, Z., Stalker, E. et al. (2020). A multi-
438714 . batch design to deliver robust estimates of efficacy and
Honarpisheh, P. and McCullough, L.D. (2019). Sex as a reduce animal use – a syngeneic tumour case study.
biological variable in the pathology and pharmacology Scientific Reports 10 (1): 6178. https://doi.org/
of neurodegenerative and neurovascular diseases. Brit- 10.1038/s41598-020-62509-7.
ish Journal of Pharmacology 176 (21): 4173–4192. Kimmelman, J. and Federico, C. (2017). Consider drug
https://doi.org/10.1111/bph.14675. efficacy before first-in-human trials. Nature 542:
Howick, J., Glasziou, P., and Aronson, J.K. (2009). The 25–27. https://doi.org/10.1038/542025a.
evolution of evidence hierarchies: what can Bradford Kimmelman, J., Mogil, J.S., and Dirnagl, U. (2014). Dis-
Hill’s ‘guidelines for causation’ contribute? Journal tinguishing between exploratory and confirmatory
of the Royal Society of Medicine 102 (5): 186–194. preclinical research will improve translation. PLoS
https://doi.org/10.1258/jrsm.2009.090020. Biology 12 (5): e1001863. https://doi.org/10.1371/
Huang, Q. and Riviere, J.E. (2014). The application of journal.pbio.1001863.
allometric scaling principles to predict pharmacoki- Kronmal, R.A. (1993). Spurious correlation and the fal-
netic parameters across species. Expert Opinion on lacy of the ratio standard revisited. Journal of the Royal
Drug Metabolism & Toxicology 10: 1241–1253. Statistical Society Series A (Statistics in Society) 156 (3):
Huang, W., Percie du Sert, N., Vollert, J., and Rice, A.S.C. 379–392. https://doi.org/10.2307/2983064.
(2020). General principles of preclinical study design. Landis, S.C., Amara, S.G., Asadullah, K. et al. (2012). A
In: Handbook of Experimental Pharmacology, call for transparent reporting to optimize the predictive
vol. 257, 55–69. https://doi.org/10.1007/164_ value of preclinical research. Nature 490: 187–191.
2019_277. https://doi.org/10.1038/nature11556.
Jan, S.L. and Shieh, G. (2019). Sample size calculations Lawson J (2020). daewr: design and analysis of experi-
for model validation in linear regression analysis. ments with R. R package version 1.2-5. http://www.
BMC Medical Research Methodology 19: 54. https:// r-qualitytools.org (accessed 2022).
doi.org/10.1186/s12874-019-0697-9. Lee, E.C., Whitehead, A.L., Jacques, R.M., and Julious, S.
Jasienski, M. and Bazzaz, F.A. (1999). The fallacy of ratios A. (2014). The statistical interpretation of pilot trials:
and the testability of models in biology. Oikos 321–327. should significance thresholds be reconsidered? BMC
Jenkins, D.G. and Quintana-Ascencio, P.F. (2020). A Medical Research Methodology 14: 41. https://doi.
solution to minimum sample size for regressions. PLoS org/10.1186/1471-2288-14-41.
ONE 15 (2): e0229345. https://doi.org/10.1371/ Leist, M. and Hartung, T. (2013). Inflammatory findings
journal.pone.0229345. on species extrapolations: humans are definitely not

0005604125 3D 78 10/8/2023 5:14:36 PM


Empirical and Translational Pilots 79

70-kg mice. Archives of Toxicology 87: 563–567. human randomized controlled trials. PLoS Biology 11:
https://doi.org/10.1007/s00204-013-1038-0. e1001481.
Lindstedt, S.L. and Schaeffer, P.J. (2002). Use of allometry Nunamaker, E.A., Davis, S., O’Malley, C.I., and Turner,
in predicting anatomical and physiological parameters P.V. (2021). Developing recommendations for cumula-
of mammals. Laboratory Animals 36 (1): 1–19. tive endpoints and lifetime use for research animals.
https://doi.org/10.1258/0023677021911731. Animals (Basel) 11 (7): 2031. https://doi.org/
Lynch, J. (1982). On the external validity of experiments 10.3390/ani11072031.
in consumer research. Journal of Consumer Research Packard, G. and Boardman, T. (1988). The misuse of
9 (3): 225–239. https://doi.org/10.1086/208919. ratios, indices, and percentages in ecophysiological
Macleod, M.R., Lawson McLean, A., Kyriakopoulou, A. research. Physiological Zoology 61: 1–9. https://
et al. (2015). Risk of bias in reports of in vivo doi.org/10.1086/physzool.61.1.30163730.
research: a focus for improvement. PLoS Biology Packard, G. and Boardman, T. (1999). The use of percen-
13 (10): e1002273. https://doi.org/10.1371/ tages and size-specific indices to normalize physiolog-
journal.pbio.1002273. ical data for variation in body size: wasted time, wasted
Malterud, K., Siersma, V.D., and Guassora, A.D. (2016). effort? Comparative Biochemistry and Physiology, Part
Sample size in qualitative interview studies: guided by A 122 (1): 37–44.
information power. Qualitative Health Research 26 Pérez-García, V.M., Calvo, G.F., Bosque, J.J. et al. (2020).
(13): 1753–1760. https://doi.org/10.1177/104973 Universal scaling laws rule explosive growth in human
2315617444. cancers. Nature Physics 16: 1232–1237.
Markou, A., Chiamulera, C., Geyer, M. et al. (2009). Pitkänen, A., Nehlig, A., Brooks-Kayal, A.R. et al. (2013).
Removing obstacles in neuroscience drug discovery: Issues related to development of antiepileptogenic
the future path for animal models. Neuropsychophar- therapies. Epilepsia 54 (Suppl 4): 35–43. https://
macology 34: 74–89. https://doi.org/10.1038/ doi.org/10.1111/epi.12297.
npp.2008.173. Porter, W.P., Hinsdill, R., Fairbrother, A. et al. (1984).
McGonigle, P. and Ruggeri, B. (2014). Animal models of Toxicant-disease-environment interactions associated
human disease: challenges in enabling translation. with suppression of immune system, growth, and
Biochemical Pharmacology 87 (1): 162–171. https:// reproduction. Science 224 (4652): 1014–1017.
doi.org/10.1016/j.bcp.2013.08.007. Pound, P. and Ritskes-Hoitinga, M. (2018). Is it possible
McKinney, W.T. and Bunney, W.E. (1969). Animal to overcome issues of external validity in preclinical
model of depression: I. Review of evidence: implica- animal research? Why most animal models are bound
tions for research. Archives of General Psychiatry 21 to fail. Journal of Translational Medicine 16 (1): 304.
(2): 240–248. https://doi.org/10.1001/archpsyc. https://doi.org/10.1186/s12967-018-1678-1.
1969.01740200112015. Reynolds, P. (2021). Statistics, statistical thinking, and
Miller, L.R., Marks, C., Becker, J.B. et al. (2017). Consid- the IACUC. Lab Animal 50: 266–268. https://doi.
ering sex as a biological variable in preclinical org/10.1038/s41684-021-00832-w.
research. FASEB Journal 31 (1): 29–34. https:// Reynolds, P.S., McCarter, J., Sweeney, C. et al. (2019).
doi.org/10.1096/fj.201600781R. Informing efficient pilot development of animal
Moher, D., Hopewell, S., Schulz, K.F. et al. (2010). CON- trauma models through quality improvement strate-
SORT 2010 explanation and Elaboration: updated gies. Laboratory Animals 53 (4): 394–404. https://
guidelines for reporting parallel group randomised doi.org/10.1177/0023677218802999.
trials. BMJ 340: c869. https://doi.org/10.1136/ Richter, H. (2017). Systematic heterogenization for better
bmj.c869. reproducibility in animal experimentation. Lab Ani-
Montgomery, D.C. (2017). Design and Analysis of Experi- mal (NY) 46: 343–349. https://doi.org/10.1038/
ments, 8th ed. New York: Wiley 752 pp. laban.1330.
Montgomery, D.C. and Jennings, C.L. (2006). Chapter 1: Rossello, X. and Yellon, D.M. (2016). Cardioprotec-
An overview of industrial screening experiments. In: tion: the disconnect between bench and bedside.
Screening: Methods for Experimentation in Industry, Circulation 134: 574–575. https://doi.org/10.
Drug Discovery, and Genetics (ed. A. Dean and S. 1161/circulationaha.116.022829.
Lewis), 1–20. New York: Springer 332 pp. Rothman, K.J. and Greenland, S. (2018). Planning study
Muhlhausler, B.S., Bloomfield, F.H., and Gillman, M.W. size based on precision rather than power. Epidemiol-
(2013). Whole animal experiments should be more like ogy 29: 599–603.

0005604125 3D 79 10/8/2023 5:14:36 PM


80 A Guide to Sample Size for Animal-based Studies

Russell, W.M.S. and Burch, R.L. (1959). The Principles of Humane Experimental Technique. London: Methuen & Co.
Schmidt-Nielsen, K. (1984). Scaling: Why Is Animal Size So Important? Cambridge: Cambridge University Press.
Steidl, R.J., Hayes, J.P., and Schauber, E. (1997). Statistical power analysis in wildlife research. Journal of Wildlife Management 61 (2): 270–279.
Tadenev, A.L.D. and Burgess, R.W. (2019). Model validity for preclinical studies in precision medicine: precisely how precise do we need to be? Mammalian Genome 30 (5-6): 111–122. https://doi.org/10.1007/s00335-019-09798-0.
Tang, H. and Mayersohn, M. (2011). Controversies in allometric scaling for predicting human drug clearance: an historical problem and reflections on what works and what does not. Current Topics in Medicinal Chemistry 11 (4): 340–350. https://doi.org/10.2174/156802611794480945.
Tanner, J.M. (1949). Fallacy of per-weight and per-surface area standards, and their relation to spurious correlation. Journal of Applied Physiology 2 (1): 1–15. https://doi.org/10.1152/jappl.1949.2.1.1.
Teare, M.D., Dimairo, M., Shephard, N. et al. (2014). Sample size requirements to estimate key design parameters from external pilot randomised controlled trials: a simulation study. Trials 15: 264. https://doi.org/10.1186/1745-6215-15-264.
Trutna, L., Sapgon, P., Del Castillo, E. et al. (2012). Process improvement. In: NIST/SEMATECH e-Handbook of Statistical Methods. https://doi.org/10.18434/M32189.
Tukey, J. (1977). Exploratory Data Analysis. Reading: Addison-Wesley.
US EPA (Environmental Protection Agency) (2017). Causal Analysis/Diagnosis Decision Information System (CADDIS). Washington: Office of Research and Development. https://www.epa.gov/caddis-vol1/consistency-evidence.
Usui, T., Macleod, M.R., McCann, S.K. et al. (2021). Meta-analysis of variation suggests that embracing variability improves both replicability and generalizability in preclinical research. PLoS Biology 19 (5): e3001009. https://doi.org/10.1371/journal.pbio.3001009.
Van Norman, G.A. (2019). Limitations of animal studies for predicting toxicity in clinical trials: is it time to rethink our current approach? JACC: Basic to Translational Science 4 (7): 845–854. https://doi.org/10.1016/j.jacbts.2019.10.008.
Vetter, T.R. and Mascha, E.J. (2017). Defining the primary outcomes and justifying secondary outcomes of a study: usually, the fewer, the better. Anesthesia & Analgesia 125 (2): 678–681. https://doi.org/10.1213/ANE.0000000000002224.
Voelkl, B., Vogt, L., Sena, E.S., and Würbel, H. (2018). Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLoS Biology 16 (2): e2003693. https://doi.org/10.1371/journal.pbio.2003693.
von Kortzfleisch, V.T., Karp, N.A., Palme, R. et al. (2020). Improving reproducibility in animal research by splitting the study population into several 'mini-experiments'. Scientific Reports 10 (1): 16579. https://doi.org/10.1038/s41598-020-73503-4.
Will, T.R., Proaño, S.B., Thomas, A.M. et al. (2017). Problems and progress regarding sex bias and omission in neuroscience research. eNeuro 4 (6): eneuro.0278-17.2017.
Willner, P. (1984). The validity of animal models of depression. Psychopharmacology 83: 1–17. https://doi.org/10.1007/BF00427414.
Wilson, L.A.B., Zajitschek, S.R.K., Lagisz, M. et al. (2022). Sex differences in allometry for phenotypic traits in mice indicate that females are not scaled males. Nature Communications 13: 7502. https://doi.org/10.1038/s41467-022-35266-6.
Würbel, H. (2017). More than 3Rs: the importance of scientific validity for harm-benefit analysis of animal research. Lab Animal 46: 164–167. https://doi.org/10.1038/laban.1220.
7

Feasibility Calculations: Arithmetic

CHAPTER OUTLINE
7.1 Introduction
7.2 The Process
    7.2.1 Problem Structuring
    7.2.2 Calculations
    7.2.3 Reality Checks
    7.2.4 Refinement
7.3 Determining Operational Feasibility
    7.3.1 Basic Science/Laboratory Studies
    7.3.2 Veterinary Clinical Trials
    7.3.3 High Dimensionality Studies
    7.3.4 Training, Teaching, Skill Acquisition
    7.3.5 Rodent Breeding Production
References

7.1 Introduction

Operational feasibility is determined by sufficient money, space, equipment, time, and properly trained and competent technical personnel to meet study objectives (Box 7.1). Typically, formal assessments of operational feasibility are required by various stakeholders, such as funding agencies, product investment stakeholders, and ethical oversight committees. Funders and investors may require some projection of costs and resources to determine if the research is worth their investment. Animal care and use oversight committees need to be assured that animals will not be wasted in unfeasible and impractical studies that have little chance of completion.

BOX 7.1
Operational Feasibility

Operational: Are current work practices, procedures, and trained personnel sufficient to support the project?
Budgetary: Are financial resources sufficient to support the project? How many samples can be processed for a fixed amount of money?
Time or scheduling: How long will it take to accomplish each task/all necessary tasks? Can the project be completed in the allotted time?
Subjects: How many subjects can be processed, given operational constraints?

Power calculations and hypothesis tests are not appropriate for operational feasibility. Simple 'back of the envelope' calculations using basic arithmetic may be all that are required to confirm that the projected number of subjects makes sense or that the study is even feasible. Even for studies where power calculations are necessary, basic arithmetic should be used to cross-validate estimates.

7.2 The Process

There are numerous examples of simple arithmetic approximation for solving problems in physics, engineering, economics, claims validation, ecology, and tests of critical thinking (Weinstein and Adams 2008). The process is sometimes known as Fermi estimation, named after the physicist Enrico Fermi,
who had an extraordinary aptitude for finding quick and accurate answers to practical problems when data were sparse or absent, and without using sophisticated mathematics (Weinstein and Adams 2008; Reynolds 2019).
Arithmetic approximation is a four-step cyclical process (Box 7.2) consisting of:

1. Formulation of the estimation problem through the logical structuring of the research question and identification of the quantitative elements.
2. Calculation of numbers for each element with simple arithmetic.
3. 'Reality checks' to determine if the estimates obtained by arithmetic approximation are both sensible and feasible.
4. Revision of the estimates if necessary (Reynolds 2019).

BOX 7.2
Approximation Process

1. Problem structuring: What information is needed?
2. Calculations: What are the initial sample size approximations?
3. 'Reality checks': Do the numbers make sense? Do initial approximations align with available resources?
4. Revision: What protocol changes are required so numbers align with available resources?

7.2.1 Problem Structuring

The first step is to specify the approximation problem, the sub-problems or problem sub-components, and the items in each sub-problem that need quantification. If information is not readily available, then a few common-sense assumptions can be made to complete the calculations. A brief justification of the problem should be included.

7.2.2 Calculations

Simple arithmetic is used to 'guesstimate' numbers for each item of information. Any formula used should include the variable names for any quantity to be approximated and estimated. Calculations involve both approximations and substitution of any relevant information into the equations. The total is obtained by adding all component parts together. Because totals are approximate, it is recommended that a reasonable range of estimates be calculated by bounding the initial total with minimum and maximum numbers along the lines of best- and worst-case scenarios (Weinstein and Adams 2008).
To facilitate calculations and troubleshooting, measurement units for each variable should be retained, and all calculations should be checked for unit balance and conversion errors. Variables without formally specified units should be assigned some sort of unit based on context (Chizeck et al. 2009).

7.2.3 Reality Checks

Feasibility or 'reality checks' confirm that estimates align with available resources, that the proposed work practices, procedures, and personnel are sufficient to support the project, and that the project can be completed in the allotted time and within the budget. Estimates should also make sense and should be within a reasonable range of possible starting values. Tracking units helps to find simple maths errors or incorrect unit conversions.

7.2.4 Refinement

If the proposed number of animals is too large for study resources, then sample sizes will need to be reformulated or refined to meet more realistic operational capabilities.

7.3 Determining Operational Feasibility

Resources should be sufficient to complete the study. These include the number of trained technical personnel, duration of each task, procedures required to collect data, costs per procedure, cost for processing each sample, total study duration, and study budget (Box 7.3).
BOX 7.3
Operations and Logistics

Number of trained personnel
Duration of each task, procedure
Cost per procedure
Cost per subject
Total study duration
Total study budget.

7.3.1 Basic Science/Laboratory Studies

The total animal sample size will be constrained by the availability of resources: the number of trained personnel, estimated task and procedure durations, costs per procedure, total study duration, and study budget. The number of animals should be adjusted to the number that can actually be processed in a reasonable amount of time.

Example: Laboratory Processing Capacity

An investigator requested 500,000 mice for a three-year project. Two technicians were listed on the protocol. The experimental surgeries to be performed on each animal took 30–60 minutes per mouse. Is the number of mice realistic?

Calculations. Even if it is assumed that, at most, each technician could work 350 days per year for 8 hours per day, the total available time is:

2 persons × 350 days/year × 3 years = 2100 person-days
2100 person-days × 8 hours/day = 16,800 person-hours

Processing 500,000 mice would therefore require approximately 500,000 animals/2100 person-days, or about 238 animals per person per day. Assuming an 8-hour work day, 500,000 animals/16,800 person-hours is approximately 30 animals per person per hour, or one mouse approximately every 2 minutes.
Alternatively, the total procedural throughput possible per technician is

(1 mouse/30 min) × (60 min/hour) × (8 hours/day) = 16 mice per day
16 mice/day × 350 days/year = 5600 mice/year

or approximately 45 years for 250,000 mice per technician.

Reality check. Even if the realities of staff welfare and animal housing are ignored, procedures take 30–60 minutes per mouse. The projected numbers are unrealistic.
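Back-of-the-envelope arithmetic of this kind can also be scripted, which makes it easy to recompute the reality check when assumptions change. The following is a minimal SAS sketch of the throughput calculation above; the data set and variable names are illustrative only:

data capacity;
  n_tech    = 2;     *technicians listed on the protocol;
  days_yr   = 350;   *maximum working days per technician per year;
  hrs_day   = 8;     *working hours per day;
  min_mouse = 30;    *best-case procedure time per mouse, minutes;
  years     = 3;     *project duration;
  person_hrs = n_tech * days_yr * hrs_day * years;  *total person-hours available;
  max_mice   = person_hrs * 60 / min_mouse;         *maximum mice processable;
  put person_hrs= max_mice=;                        *writes 16800 and 33600 to the log;
run;

At best, roughly 33,600 mice could be processed in three years, confirming that the requested 500,000 is unrealistic by more than an order of magnitude.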
Example: Time Available for Large Animal Experiments

An investigator requested 120 swine for a three-year project. A single experiment on one animal required a full working day for five people. An additional 8–16 hours per week was allocated for equipment setup and takedown, supply inventory, data management, and related tasks. To prevent staff fatigue and burnout, the investigator considered two experiments per week to be a reasonable performance goal. Before the study could be initiated, it was anticipated that at least four to six months would be required for protocol oversight paperwork to be filed and approved and to obtain necessary supplies and equipment. Can this number of animals realistically be processed in three years?

Calculations.

Total time = 3 years × 12 months/year = 36 months
Available time = 36 months − 6 months = 30 months
Time required = 120 animals/(2 animals/week × 4 weeks/month) = 15 months

Reality check. The investigator has allowed adequate time to complete the project, without compromising the well-being of staff or study quality, and with an adequate time cushion in case of unforeseen problems or breaks in the workflow.

7.3.2 Veterinary Clinical Trials

Veterinary clinical trials must consider both availability of resources and availability of eligible subjects, especially if these are client-owned animals entering the study sequentially and at variable intervals. For the trial to have a chance of success, anticipated enrolment rates must agree with numbers obtained from power-based sample size calculations (Box 7.4).

BOX 7.4
Veterinary Clinical Trials

Availability of resources
Clinic intake rate per week, month
Number and proportion of eligible subjects
Client consent rate
Cost per procedure, per subject
Total study duration
Total study budget.

Example: Recruitment of Client-Owned Companion Animals

Investigators determined from power calculations that at least 100 subjects would be required to detect a meaningful difference between two intervention groups, with 50 subjects per comparison arm (healthy versus diseased). Clinic records indicated that a total of 90 subjects meeting study eligibility criteria had visited the clinic in the previous year, including 20 subjects with the disease of interest. However, only 50% of owners consented to participate in previous trials at the institution. How long will the trial have to run to enrol sufficient subjects for this new trial?

Information required. Size of the patient recruitment pool, anticipated client consent rate, and expected number of patients with the disease.
The expected number that can be recruited from the total pool of eligible subjects (NT) is approximated as

N = NT pE

where pE is the proportion of subjects enrolled. The number of subjects that can both be enrolled and have the condition of interest (ND) is then

ND = N pD

Calculations. The expected number recruited is N = 90(0.50) = 45 per year. The number expected to have the disease is ND = 45(20/90), or approximately 10 subjects per year. To obtain the minimum number of subjects with the condition of interest, the study would have to run for

(50 subjects required)/(10 subjects/year) = 5 years
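Recruitment arithmetic of this kind is easily scripted for re-use as assumptions change. A minimal SAS sketch of the calculation in the example above (variable names are illustrative only):

data recruit;
  NT     = 90;     *eligible subjects visiting the clinic per year;
  pE     = 0.50;   *anticipated client consent (enrolment) rate;
  pD     = 20/90;  *proportion of eligible subjects with the disease;
  target = 50;     *required subjects with the condition of interest;
  N     = NT * pE;      *expected enrolment per year (45);
  ND    = N * pD;       *expected enrolees with the disease per year (10);
  years = target / ND;  *anticipated trial duration (5 years);
  put N= ND= years=;
run;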
Example: Trial Size for a Grant Proposal

A researcher wished to apply for a grant to study a novel disease biomarker in client-owned companion animals. The plan was to develop a clinical prediction model featuring an additional 11 predictor variables already confirmed in the primary literature to be associated with severity of that disease. The study was intended to be a single-centre prospective clinical cohort study. How many animals will need to be recruited for a two-year study?

To develop an adequately robust clinical prediction model using 12 predictors (including the biomarker of interest), formal sample size calculations (Riley et al. 2020) suggested at least 260–300 subjects are required. Is this number of subjects a feasible recruitment goal?

Information required. Number of eligible subjects visiting the clinic per week or per month, anticipated client consent rate, total operating budget, processing costs per patient, and total study duration.

Clinic volume. Based on prior clinical record data, approximately 2–3 eligible animals visited the clinic per week. Fully subsidised clinical and laboratory costs encouraged high client consent rates, which were expected to average about 80%.

Anticipated intake = 2–3 animals/week × 50 weeks = 100–150 animals per year
Anticipated enrolment = 100(0.8) to 150(0.8) = 80–120 animals per year

Budget. Laboratory and clinical work-ups were approximately $1500 per animal. The grant allowed a total operating budget of $120,000.

Maximum number processed = $120,000/($1500 per cat) = 80 animals in two years

Reality check. Sample size based on power calculations does not align with either anticipated enrolment or budget constraints.

Refinement. Reducing the number of laboratory tests performed cut costs from $1500 to $1000 per animal. The revised sample size is now $120,000/($1000 per cat) = 120 animals. However, this still does not align with the power-based sample size estimates in the original proposal. The proposal was revised in the direction of more modest research goals that could be accommodated by the budget. The investigator prioritised the marker of interest plus the top four candidate markers, with priority based on clinical relevance and importance. The reduced subset of five candidate predictors would allow reasonably precise estimates to be obtained with an anticipated enrolment of 100–120 animals and remain within budget.

7.3.3 High Dimensionality Studies

These studies typically involve the harvest and processing of cells or tissues from multiple animals, and typically produce large volumes of output information per subject. Examples include DNA/RNA microarrays, biochemistry assays, biomarker studies, and gene expression, proteomic, metabolomic, and inflammasome profiles.
For identification of differentially expressed genes, sample size refers to either the number of arrays or the number of biological replicates, depending on the research question. For array-based studies, sample sizes are determined by the desired fold change to be detected, the number of replicate arrays, investigator-specified sensitivity (proportion of detected differentially expressed genes), the number of expected false positives, the correlation between expression levels of different genes, and the number of sampling time points (Jung and Young 2012; Lin et al. 2010).
The number of animals required is less easy to estimate. In general, the number of animals will be determined by the total amount of tissue or cells required to perform the assays (M) and the amount of viable tissue or cells obtainable per animal (m/animal):

N = M/(m/animal)

Example: Number of Mice Required for a Microarray Study

Suppose a cell culture requires M = 10^6 cells per plate to obtain a sufficient amount of RNA for analysis. However, only m = 2.5 × 10^5 cells of the relevant type can be isolated per mouse.

Number of mice required = (10^6 cells)/(2.5 × 10^5 cells per mouse) = 4 mice per plate

The evolution of microarray and RNA-seq technology means that very small amounts (for example, less than ~1 pg) of mRNA can be extracted from tissue, and even from single cells, for gene expression profiling (Amit et al. 2009; Shalek et al. 2013; Ye et al. 2018). Therefore very few animals, or only one, may be all that are necessary. Variability in array preparation and background determination will determine the number of technical replicates required. However, technical replicates only affect measurement precision; they do not contribute to reducing the variance of the overall effect.
If the research question involves mapping heterogeneity at the subject level, the study will have to be designed to accommodate the true experimental unit, which is the whole animal. If the number of biological replicates is too small, statistical power for detecting differentially expressed genes will be too low, and false positive rates will be high. For detecting differential gene expression between two groups (e.g. knockout versus wild-type mice; tumour versus non-tumour tissue), sample size is determined by power calculations. Sample size will depend on the fold change to be detected, the power to detect that change (the true positive rate, or power 1 − β), the confidence α (Type I error
probability), and the variance in the sample. The variance can be estimated from the 75th percentile of the standard deviation of the log ratios of expression levels (the variance for the 75% least variable genes in the array). Sample size can then be estimated as usual for a two-sample t-test, or iteratively using the formula for power and the non-centrality parameter (Wei et al. 2004).
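For a fixed design, the two-sample t-test calculation can be carried out directly in SAS with PROC POWER. The following is a minimal sketch only; the fold change, standard deviation, and α shown are assumed values for illustration (a two-fold change analysed on the log2 scale, with the 75th-percentile standard deviation taken to be 0.7):

proc power;
  twosamplemeans test=diff
    meandiff  = 1      /* log2 of the 2-fold change to be detected */
    stddev    = 0.7    /* e.g. 75th percentile of per-gene SDs of log ratios */
    alpha     = 0.001  /* stringent alpha to limit false positives across many genes */
    power     = 0.9    /* target power, 1 - beta */
    npergroup = .;     /* solve for biological replicates per group */
run;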
7.3.4 Training, Teaching, Skill Acquisition

Animal numbers for training and skill acquisition are determined primarily by the number of repetitions of the task required to meet predetermined competency standards. For teaching purposes, the trainer:trainee ratio, the number of assessors, and the availability of teaching-related resources must be factored into estimates (Box 7.5).

BOX 7.5
Training and Teaching

▪ Number of sessions required for competency
▪ Resource availability
▪ Number of trainees
▪ Number of trainees per session
▪ Number of instructors, assessors
▪ Time available per session
▪ Assessor availability.

Competency in essential clinical skills and surgical techniques is of major importance for preclinical laboratory experiments and for veterinary and human medicine. On-the-job training is usually not sufficient to acquire competency (Bergmeister et al. 2020). High-quality and consistent skill sets are developed and maintained only with structured training and sufficient practice. Poor technique and inconsistent, unstandardised procedures will contribute to considerable variability in experiments, potentially hiding any effect of the experimental intervention.
Use of live animals for surgical skills training is still part of many medical and veterinary training programmes (DeMasi et al. 2016). Use of live animals obtained for purpose should be considered a last resort, and only after critical skill sets have been developed on non-animal and/or ex vivo models. Computer simulations, virtual models, and simulators are rapidly replacing animals. Carcasses and culls can be used for skills acquisition (Baillie et al. 2016). Ex vivo models can be as effective as animals obtained for purpose, as well as considerably cheaper (e.g. Izawa et al. 2016). Additional refinement measures include supplementary educational materials (lectures, texts, webinars, etc.) for orientation to equipment, techniques, and procedures. Detailed standard operating procedures and training plans should be devised. Instructional plans should describe how specific critical skills and skill sets will be taught before live animals are used. Specific and measurable proficiency metrics and competency assessments must be identified. Training must be species- and procedure-specific. Strategies for the development of animal-based training programmes are described by Conarello and Shepherd (2007).

Example: Surgical Training Laboratory

A teaching lab requested swine to be used for surgical training of residents in multiple invasive procedures. Students were to have received several weeks of intensive training in basic skill acquisition on simulators and excised organs prior to this lab. There are 16 trainees. Eight trainees can be accommodated per training session. One animal can be allocated to every two trainees. Four instructors are available. How many animals are required?

Calculations.

Number of sessions = 16 trainees/(8 trainees/session) = 2 sessions
Number of animals per session = (8 trainees/session) × (1 animal/2 trainees) = 4 animals/session
Total number of animals = (4 animals/session) × 2 sessions = 8 animals

7.3.5 Rodent Breeding Production

Breeding protocols are required for in-house production of research animals that are too expensive
to purchase in sufficient quantity or cannot be obtained commercially. Examples include the creation of new transgenic, knockout, or other genetically modified animals; back-crossing of genetically modified lines; or production of prenatal or early neonate subjects.
The projected total pup production and the anticipated number of pups of a specific genotype depend on the number of breeding adults, litter size, number of litters over the productive lifespan, and weaning success (Box 7.6). Reasonable predictions of pup production can be based on breeding colony records, current breeding stock numbers, facility and personnel capability, and projections from past demand. For genetic analyses involving mice, the number of individuals and breeding pairs per line can be approximated by simple rules of thumb (Table 7.1). The number of animals subjected to embryonic or foetal manipulations must be included in the total number of animals requested for a given study.

BOX 7.6
Rodent Breeding Production

How many pups can be produced (N)?
How many breeding adults are required to produce N pups?
How many pups are born?
How many pups are successfully weaned?

Information required:
Initial number of breeders
Sex ratio (number of females per breeding male)
Litter size (number of pups/litter/female)
Estimated total number of pups produced
Proportion of pups lost (attrition)
Expected genotype distribution
Proportion of desired genotype required
Time frame (pups produced per unit time).

Table 7.1: Rules of thumb for estimating mouse numbers required for genetic analyses.

Purpose | Number of breeding pairs per line | Number required
Maintenance and characterisation of transgenic or knockout line | up to 5 | 80–100
Strain construction with congenic genotyping | 10–12 | 750–1200
Quantitative trait loci analysis | 4–6 pairs of inbred parental strains; 2–4 reciprocal F1 hybrid pairs | 500–1000 F2
Gene mapping | 10–12 | 1200

Source: Adapted from Pitts (2002).

Scientific justification must be provided for breeding protocols, together with plans for the disposition of unused animals. When target genotypes are of research interest, potentially large numbers of animals will be euthanised because they are of the unwanted genotype, or because they are produced in such quantities that they age out or otherwise cannot be used in the experimental protocol. Justifications for breeding protocols must include end-use disposition plans to minimise unacceptably large collateral losses (Reynolds 2021).

Example: Estimating the Number of Mice with a Desired Genotype

From preliminary power calculations, an investigator determined that 50 homozygous knockout (KO; −/−) and 50 homozygous wild-type (WT; +/+) mouse pups of a certain strain were required to study a specific disease. Only KO and WT mice were to be used for experiments.
Ten pairs of heterozygous breeder mice (HET; +/−) were available, 10 males and 10 females. From past breeding records and vendor specifications, the expected average litter size for this strain was approximately 5 pups, and each pair was expected to produce 6 litters over the 6 months of the productive lifespan. Genotype distribution was thought to be approximately Mendelian; that is, approximately 50% of pups would be HET, 25% KO, and 25% WT. However, perinatal losses of KO pups were 15–20%. How many litters will
need to be produced, and how long will it take to get the requisite number of pups?

Calculations. Total number of pups: N = 10 females × (5 pups/litter/female) × (6 litters/female/6 months) = 300 pups in 6 months.
Assuming a Mendelian 1:2:1 distribution of genotypes, the proportions of each genotype are pWT = pKO = 0.25 and pHET = 0.50.
The anticipated perinatal loss of KO pups is p = 0.15–0.20. The number of surviving KO pups therefore ranges from pKO N (1 − p) = 0.25(300)(1 − 0.20) = 60 pups to 0.25(300)(1 − 0.15), approximately 63 pups. The number of WT pups = (0.25)(300) = 75, and the number of HET pups = (0.50)(300) = 150.
Therefore, for 10 adult pairs, we expect the production of 60–63 viable knockout pups and 75 wild-type pups every 6 months. These are sufficient to meet the objectives of the study. The total number of mice required is 20 adults + 300 pups = 320 mice, with 'waste' of approximately 150 heterozygotes. Instead of being euthanised, the heterozygotes could be transferred to another protocol or used to replace the breeding stock.
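The breeding projections are also easy to script so that colony parameters can be varied. A minimal SAS sketch of the calculation above (variable names are illustrative only):

data breeding;
  pairs   = 10;    *heterozygous breeding pairs;
  litter  = 5;     *expected pups per litter;
  litters = 6;     *litters per pair over 6 months;
  pKO     = 0.25;  *Mendelian proportion of KO pups;
  pWT     = 0.25;  *Mendelian proportion of WT pups;
  loss    = 0.20;  *worst-case perinatal loss of KO pups;
  N       = pairs * litter * litters;  *total pups (300) in 6 months;
  KO_live = pKO * N * (1 - loss);      *surviving KO pups (60);
  WT      = pWT * N;                   *WT pups (75);
  put N= KO_live= WT=;
run;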
References

Amit, I., Garber, M., Chevrier, N. et al. (2009). Unbiased reconstruction of a mammalian transcriptional network mediating pathogen responses. Science 326 (5950): 257–263. https://doi.org/10.1126/science.1179050.
Baillie, S., Booth, N., Catterall, A., et al. (2016). A guide to veterinary clinical skills laboratories. https://www.researchgate.net/publication/296964846 (accessed 2022).
Bergmeister, K.D., Aman, M., Kramer, A. et al. (2020). Simulating surgical skills in animals: systematic review, costs & acceptance analyses. Frontiers in Veterinary Science 7. https://doi.org/10.3389/fvets.2020.570852.
Chizeck, H.J., Butterworth, E., and Bassingthwaighte, J.B. (2009). Error detection and unit conversion: automated unit balancing in modeling interface systems. IEEE Engineering in Medicine & Biology 28 (3): 50–57.
Conarello, S.L. and Shepherd, M.J. (2007). Training strategies for research investigators and technicians. ILAR Journal 48 (2): 120–130.
DeMasi, S.C., Katsuta, E., and Takabe, K. (2016). Live animals for preclinical medical student surgical training. Edorium Journal of Surgery 3 (2): 24–31.
Izawa, Y., Hishikawa, S., Muronoi, T. et al. (2016). Ex-vivo and live animal models are equally effective training for the management of a penetrating cardiac injury. World Journal of Emergency Surgery 11: 45. https://doi.org/10.1186/s13017-016-0104-3.
Jung, S.-H. and Young, S.S. (2012). Power and sample size calculation for microarray studies. Journal of Biopharmaceutical Statistics 22 (1): 30–42.
Lin, W.-J., Hsueh, H.-M., and Chen, J.J. (2010). Power and sample size estimation in microarray studies. BMC Bioinformatics 11: 47.
Pitts, M. (2002). Institutional Animal Care and Use Committee Guidebook, 2e. Bethesda: Office of Laboratory Animal Welfare, National Institutes of Health.
Reynolds, P.S. (2019). When power calculations won't do: Fermi approximations of animal numbers. Lab Animal 48: 249–253.
Reynolds, P.S. (2021). Statistics, statistical thinking, and the IACUC. Lab Animal (NY) 50 (10): 266–267. https://doi.org/10.1038/s41684-021-00832-w.
Riley, R.D., Ensor, J., Snell, K.I.E. et al. (2020). Calculating the sample size required for developing a clinical prediction model. BMJ 368: m441. https://doi.org/10.1136/bmj.m441.
Shalek, A., Satija, R., Adiconis, X. et al. (2013). Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature 498: 236–240. https://doi.org/10.1038/nature12172.
Wei, C., Li, J., and Bumgarner, R.E. (2004). Sample size for detecting differentially expressed genes in microarray experiments. BMC Genomics 5: 87. https://doi.org/10.1186/1471-2164-5-87.
Weinstein, L. and Adams, J.A. (2008). Guesstimation: Solving the World's Problems on the Back of a Cocktail Napkin. Princeton, NJ: Princeton University Press.
Ye, Y., Song, H., and Shi, S. (2018). Understanding the biology and pathogenesis of the kidney by single-cell transcriptomic analysis. Kidney Diseases 4: 214–225. https://doi.org/10.1159/000492470.
8

Feasibility: Counting Subjects

CHAPTER OUTLINE
8.1 Introduction
8.2 Normal Distribution
8.3 Binomial (Exact) Distribution
    8.3.1 Rare or Non-Existent Events
8.4 Batch Testing for Disease Detection
8.5 Negative Binomial Distribution
8.6 Hypergeometric Distribution
    8.6.1 Estimating the Proportion of Subjects with the Target Effect
8.A Determining Cumulative Probabilities for Binomial, Negative Binomial, and Hypergeometric Distributions with SAS
    8.A.1 Binomial Distribution
    8.A.2 Negative Binomial Distribution
    8.A.3 Hypergeometric Distribution
8.B SAS Programme for Maine Coon Cat Example
References

8.1 Introduction

Many studies require preliminary screening of large numbers of potential subjects in order to capture a much smaller target number of subjects with a specific trait or characteristic (Box 8.1). This type of sample size problem is essentially a series of 'experiments' analogous to flipping a coin many times. Each experiment has one of two possible outcomes: 'success', representing the selection of a subject with the condition, or 'failure', if the subject is without the condition. The outcome is therefore a binomial variable with only two possible outcomes, success s or failure f. The probability of having the condition is the proportion p of successes s in the total number of selected subjects N, and (1 − p) is the expected proportion without the condition ('failures').

BOX 8.1
Examples of Probability-Based Sampling

Determining the total number of subjects to be screened to ensure the inclusion of a pre-specified number with a given trait or condition.
Determining the number of subjects required to observe at least one specified event.
Estimating risk of adverse events when none were observed during the preceding sequence of procedures.
Determining the number of batches and batch size to estimate disease prevalence.

The conventional approach to determining sample size is to use standard sample size formulae based on the normal distribution. As sample size increases towards infinity, distributions of count data converge to an approximately normal distribution (large-sample approximation). However, sample size estimates tend to be inaccurate and biased when applied to discrete data based on small samples. Better estimates can be obtained with exact methods. These are based on specific discrete probability distributions, enabling an explicit model of the
probability of obtaining a pre-specified number of successes. They are valid regardless of sample size. Although computationally intensive (Newman 2001), this is a trivial consideration given the availability of computer programmes. Worked examples are provided, with exemplary SAS code, in each section and in Appendices 8.A and 8.B.

BOX 8.2
Determining Feasibility of Counts by Exact Methods

Sample size formulae based on the normal distribution are inaccurate and biased when applied to count data.
1. Choose the appropriate discrete probability distribution.
2. Choose the appropriate cumulative distribution function (is the probability of possible outcomes equal to, less than, or greater than the specified number?).

The families of discrete probability distributions considered here are the binomial, geometric, negative binomial, and hypergeometric distributions (Box 8.2). Choice of sampling distribution is determined by feasibility and sampling objectives. The binomial distribution is used to determine the total number of successes s, or the prevalence p in the population, for a fixed number of selections N. The geometric distribution is used to determine the number of selections N required to obtain the first success (s = 1), after which sampling stops. The negative binomial distribution is used to determine the number of trials N needed to obtain a specific number of successes s = x. Hypergeometric sampling is used when the objective is to determine the number of samples to be screened to obtain a pre-specified number of successes in a small finite population.
After identifying the distribution family most appropriate for addressing the feasibility objectives, the form of the cumulative distribution function must be specified (Table 8.1). The cumulative distribution function (cdf) of a random variable X, evaluated at x, is the probability function that X will take a value equal to, less than, or greater than x. That is, to determine the target number of subjects s ('successes'), we need first to decide whether the problem requires a given probability of obtaining exactly s, or an s that is more or less than some predetermined limit X. In practice, the probability of choosing exactly s subjects is too restrictive. It is usually more feasible to determine the probability of getting at least as many as, or more than, s subjects.
Finally, sample size is calculated using the appropriate exact method as follows:

1. Specify a range of candidate sample sizes n.
2. Specify the form of the cumulative probability distribution. For example, if the target value for the expected number of successes was to be at least 10 in a sample of N subjects, then Pr(X ≥ 10).
3. Compute the probability y for each n using the cumulative distribution function and/or the probability density function of the relevant distribution family. Computation methods and SAS code are given in Appendix 8.A.
4. Select the sample size for which the computed probability equals or exceeds the target confidence 1 − α.
Table 8.1: Cumulative probabilities: interpretation, representation, and computation.

Interpretation | Representation | Computation
Probability of getting exactly x | Pr(X = x) |
Probability of getting fewer than x | Pr(X < x) | Pr(X = 0) + Pr(X = 1) + … + Pr(X = x − 1)
Probability of getting at most x | Pr(X ≤ x) | Pr(X = 0) + Pr(X = 1) + … + Pr(X = x)
Probability of getting more than x | Pr(X > x) | Pr(X = x + 1) + … + Pr(X = n)
Probability of getting at least x | Pr(X ≥ x) | Pr(X = x) + … + Pr(X = n)
8.2 Normal Distribution

Conventional sample size formulas are based on the normal distribution. The large-sample approximation is

N = [z(1−α/2)]² p(1 − p)/d²

where N is the total sample size, p is the expected proportion with the condition of interest (the prevalence), d is the target precision, and z is the standard score based on the standard normal distribution and the pre-specified confidence level. For example, for a two-tailed test with 95% confidence, α = 0.05 and z = 1.96. The expected number of subjects with the condition (n) is obtained by multiplying the total sample size N by the prevalence p:

n = N p

Although simple to perform, this method will give only poor approximations for most practical purposes (Newcombe 1998). Bias correction adjustment methods (such as the Agresti-Coull method) are described in Chapter 10.
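The large-sample approximation is a one-line computation. A minimal SAS sketch, using for illustration the 95% confidence, 25% prevalence, and 5% precision of the example that follows:

data normapprox;
  alpha = 0.05;  p = 0.25;  d = 0.05;
  z = quantile('NORMAL', 1 - alpha/2);  *1.96 for 95% confidence;
  N = ceil(z**2 * p * (1 - p) / d**2);  *total subjects to be screened;
  n = round(N * p);                     *expected subjects with the condition;
  put z= N= n=;
run;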
8.3 Binomial (Exact) Distribution

The binomial distribution is the probability distribution of a binomial random variable. It is used to determine the number of successes s in N selections. It is applicable when the population is very large relative to the target sample and the N selections are independent; that is, the outcome of one selection does not affect the outcome of other selections, and the probability of success p is the same on every selection. The binomial probability is the probability that a binomial 'experiment' results in exactly s successes. In practice, the cumulative binomial probability will be applicable to most studies. This is the probability that a binomial experiment results in s successes occurring within a specified range (either greater than or equal to a given lower limit, or less than or equal to a given upper limit).
The binomial distribution is given by

Pr(X = s) = C(N, s) p^s (1 − p)^(N−s)

where N is the total sample size (number of 'trials' or selections), s is the number of subjects with the condition ('successes'), C(N, s) is the binomial coefficient ('N choose s'), p is the expected proportion of successes, and (1 − p) is the expected proportion without the condition or event ('failures'). The probability of success p is the same on every trial. Because the binomial variable is the sum of N independent binomial variables, the mean is μ = N p and the variance is σ² = (N p)(1 − p) = μ(1 − p).
Sample size is calculated using the binomial exact method:

1. Specify the range of candidate sample sizes n.
2. Specify the form of the cumulative binomial probability. For example, if the target value for the expected number of successes was to be at least 10 in a sample of N subjects, then Pr(X ≥ 10).
3. Compute the probability y for each n using the cumulative distribution function and/or the probability density function of the binomial distribution (Appendix 8.A).
4. Select the sample size for which the computed probability equals or exceeds the target confidence 1 − α.

Example: Genetically Linked Bleeding in Doberman Pinschers

Investigators planned a prospective study to determine differences in coagulation profiles between healthy Doberman Pinschers and those that are autosomally recessive for the von Willebrand disease (vWD) mutation, and therefore at high risk for severe bleeding. From the literature, they estimated a 25% prevalence of homozygous affected Dobermans. The investigators needed to estimate the number of dogs that would have to be screened (N) and the number of homozygous affected dogs (n) they could expect to obtain in that sample, with a probability of 95% and a precision of 5%.

Normal approximation. For a probability of 95%, α = 0.05 and z(1−α/2) = 1.96; prevalence p = 0.25; and precision d = 0.05. Then the total number of animals to be screened is

N ≈ (1.96)² (0.25)(1 − 0.25)/(0.05)² = 288.1 ≈ 290

and the expected number of homozygous affected is n = N p ≈ 290 × 0.25 ≈ 73.
When the Agresti-Coull adjustment (Agresti and Coull 1998) is applied (Chapter 11):

p~ = (25 + 2)/(100 + 4) = 0.2596
N ≈ (1.96)² (0.2596)(1 − 0.2596)/(0.05)² − 4 = 291.4 ≈ 292

and the expected number of homozygous affected is n = 76.
With the asymptotic normal approximation, approximately 290–292 dogs need to be screened to have a 95% probability of capturing 73–76 homozygous affected dogs within the general Doberman population.

Exact binomial calculation. Investigators wished to determine the total number of dogs that would have to be screened to have 95% confidence of obtaining at least 75 homozygous affected dogs from a population with an expected prevalence of 25%. The cumulative binomial probability is Pr(X ≥ 75) = 0.95. Sample size is then approximated by iteration over a range of potential sample sizes to find the sample size that is closest to Pr(X ≥ 75) = 0.95. For this example, a potential sample size range of 10–400 was chosen. SAS code is:

%let numSamples = 400;  *set maximum N subjects;
data prob;
  do n = 10 to &numSamples by 1;
    *Pr(X >= 75) = Pr(X = 75) + Pr(X > 75);
    y = PDF('BINOMIAL',75,0.25,n) + 1 - CDF('BINOMIAL',75,0.25,n);
    output;
  end;
run;

With the exact method, it is estimated that 351 dogs need to be screened to have a 95% probability of obtaining at least 75 homozygous affected dogs. This revised estimate suggests that sample size based on the asymptotic normal approximation may underestimate the number of subjects that will have to be screened.

8.3.1 Rare or Non-Existent Events

Non-occurrence of an event in a series does not mean that it cannot happen at all, especially if the series is relatively short. Assessing the probability that an event could yet occur is especially important if the study is assessing the risk of rare but potentially catastrophic adverse events (for example, serious injury or death in a series of surgeries). The question then becomes one of determining the worst-case scenario based on these data. This is done by estimating the one-sided upper bound of the 95% confidence interval for p. For a one-sided confidence limit, all of α must apply to that limit, not half of it (α/2). Therefore, for a one-sided 95% confidence limit, we need to solve the equation for α = 0.05 (not α/2 = 0.025, as would be the case for a two-sided limit).

When no events have occurred ('Rule of Threes'). If there are no observed events in n trials, then x = 0 and the observed proportion p is 0/n. The approximation for the upper confidence limit is obtained from

Pr(X = 0) = C(n, 0) p^0 (1 − p)^(n−0)

which reduces to

(1 − p)^n = α

Setting α to 0.05 and solving for p, the upper bound of the 95% upper confidence limit for p is approximately 3/n (Louis 1981; Hanley and Lippman-Hand 1983; Jovanovic and Levy 1997).
Alternatively, the exact upper confidence limit can be calculated from the cumulative probability distribution of the binomial distribution as follows:

%let p = 1;   *maximum proportion;
%let n = 25;  *number of trials observed; set to the length of the series;
data prob;
  *iterate over range of proportions from 0 to 1;
  do p = 0 to &p by 0.001;
    *Pr(X = 0) describes the upper confidence limit;
    y = CDF('BINOMIAL', 0, p, &n);
    output;
  end;
run;

When events are rare but not zero. The number of observations needed to capture one adverse event in n trials with a given probability is approximated by rearranging (1 − p)^n = α to solve for n:

n = ln α / ln(1 − p)

Example: Occurrence of Adverse Events

Suppose one adverse event is expected for every 100 surgeries (p = 1%, or 0.01). Then the number of observations needed to have 95% probability of seeing one event is

(1 − p)^n = α
(1 − 0.01)^n = 0.05

Solving for n:

n ln(0.99) = ln(0.05)
n = ln(0.05)/ln(0.99) = 298.1
Given an expected prevalence of adverse events of 1% and rounding up, we will need to observe approximately 300 surgeries to capture one adverse event with 95% probability.

Exact binomial calculation. Exact confidence intervals for nonzero rare events can be calculated from binomial simulations. Two confidence limits must be calculated: the upper confidence limit at pU = α/2 and the lower confidence limit at pL = α/2.

Example: Computing an Expected Range of Rare Events

We observe 3 events out of 100 trials. Then p = 3/100 = 0.03. The lower limit of the 95% confidence interval is the value of p for which Pr(X ≥ 3) = 0.025, and the upper limit is the value of p for which Pr(X ≤ 3) = 0.025.

%let p = 1;  *maximum proportion p;
data prob;
  *iterate over a range of p from 0 to 1 with desired step size;
  do p = 0 to &p by 0.001;
    yL = PDF('BINOMIAL',3,p,100) + 1 - CDF('BINOMIAL',3,p,100);  *lower CL;
    yU = CDF('BINOMIAL', 3, p, 100);                             *upper CL;
    output;
  end;
run;

From the output, select the values of p for which yL and yU = α/2 = 0.025. Here yL = 0.025 corresponds to p = 0.006, and yU = 0.025 corresponds to p = 0.085. Therefore the 95% confidence interval for p = 0.03 is (0.006, 0.085), or 0.6–8.5%.
The large confidence intervals in this example show that there can be considerable loss of precision with small sample sizes, and therefore estimates are much less reliable.

Example: Predicting the Probabilities of Adverse Events

Previous data suggest 5% of animals administered a certain drug will show serious adverse effects. What is the probability that more than five will show adverse effects? More than 10?
The probability for a range of target counts (s = 1–50) is calculated from the binomial distribution for a prevalence of 5% (an expected 5 affected subjects out of 100), so p = 0.05. Choose the probabilities y for the target counts (s = 5 and s = 10). Sample SAS code is as follows:

%let numSamples = 50;  *maximum of s subjects;
data prob;
  *iterate over a range of s from 1 to 50;
  do s = 1 to &numSamples by 1;
    y = 1 - CDF('BINOMIAL',s,0.05,100);  *Pr(X > s), more than s;
    output;
  end;
run;

The probability that more than s = 5 animals will show an adverse effect is 0.384 (38.4%), and more than 10 is 0.0115 (1.2%).

Example: Risk of Adverse Events in Future Surgeries

(Data from Eypasch et al. 1995). A 'serious adverse event' during laparoscopic appendectomy is defined as an intraoperative vascular injury resulting in limb loss or death. A series of 25 laparoscopic surgeries had no observed serious adverse events. What is the risk that a serious adverse event could still occur, with 95% probability?
The approximate 95% upper bound on the rate of occurrence is 3/N = 3/25 = 0.12, or 12%. The exact bound for α = 0.05 is p = 0.113, or 11.3%. In this case, the rule-of-thumb approximation is very close to the exact calculation. The 95% confidence interval for zero events is (0, 0.113). Even though no adverse events had been observed in the preceding series of 25 procedures, there is still the risk that a severe adverse event could occur in 11–12 out of every 100 procedures.
8.4 Batch Testing for Disease Detection

A special case of binomial sampling is estimating sample sizes for batch testing of pooled samples (Box 8.3). Pooled samples are used when the prevalence of disease-positive individuals must be estimated and resources are limited relative to the size of the population that needs to be tested. Batch testing can result in considerable savings per unit of information. It is especially useful if diseases are rare (Hou et al. 2017); if reagents and other resources are scarce and the number of available tests is limited relative to the number of potential test subjects that require them (Litvak et al. 1994; Hughes-Oliver 2006); or if sample volumes are too small for processing singly and must be pooled over several subjects for analysis (Giles et al. 2021).

BOX 8.3
Applications of Batch Testing

▪ Number of available test kits is limited relative to the size of the population pool that needs to be tested.
▪ Diagnostic screening for disease prevalence in livestock and poultry operations.
▪ Rabies surveillance testing in bat colonies.

The concept of batch testing was first developed during World War II for detecting syphilis in US army conscripts (Dorfman 1943). It has been adapted for use in large-scale testing of diseases such as HIV and other sexually transmitted diseases (Hughes-Oliver 2006; Shipitsyna et al. 2007), bacteriological screening of livestock herds and poultry flocks (Arnold et al. 2005, 2009), screening for environmental contaminants such as lead, and assessment of large numbers of molecular targets for drug discovery (Hughes-Oliver 2006). The SARS-CoV-2 pandemic of 2020 renewed interest in further applications to the problems of an emergent disease pandemic (Hitt et al. 2020; Zhou and O'Leary 2021; FDA https://www.fda.gov/medical-devices/coronavirus-covid-19-and-medical-devices/pooled-sample-testing-and-screening-testing-covid-19). Batch testing is also used when individual animals are too small to allow the collection of sufficient tissue or blood for single-sample analyses, such as surveillance testing of bat colonies (Giles et al. 2021).
In batch testing, samples from m subjects are pooled and tested for the presence of the disease (yes = positive; no = negative). The simplest batch-testing design tests samples in two stages. The first stage tests the pooled samples, where each individual sample is part of one batch. If a batch tests negative, no further testing is required. If a batch tests positive, then each sample in the batch is retested separately.
The sample size problem is determining the optimum batch size m. If the population is large and the expected proportion is small, then the probability that a batched sample B is positive is derived from the binomial distribution as

P(B = 1) = 1 − (1 − p)^m

where p is the disease prevalence and m is the batch size, or number of samples in each batch.
The 'most efficient' group size m is that which minimises E(T)/m, where E(T) is the expected number of tests:

E(T) = 1 + m P(B = 1)

As a first-pass approximation, m can be estimated by

m = 1/√p

with rounding up to the nearest integer. Then the expected number of tests needed is approximately

E(T) ≈ 2 N √p

for a population of N subjects (Finucan 1964).
For feasibility and planning purposes, it is useful to estimate the percentage reduction in the number of tests with batch testing compared to testing individual samples:

RT = 100 [1 − E(T)/m]

The expected percentage increase in testing capacity is approximated by

ITC = 100 [1/(E(T)/m) − 1]

Exact binomial determination. The above formulations assume that both the sensitivity (the probability of detecting a true positive) and the specificity (the probability of detecting a true negative) of the tests are equal to 1.0. Exact binomial determinations that incorporate more realistic test sensitivity and specificity provide more rigorous estimates of m and E(T). An open-access Shiny app (Hitt et al. 2020) numerically iterates over a range of batch sizes to determine the optimal batch size m, E(T), E(T)/m, the percent reduction in the expected number of tests with batch testing, and the expected increase in testing efficiency.
An analogous batch size determination algorithm has been developed for microarray or microplate studies (Bilder et al. 2019). In microarray studies, samples are arranged in a two-dimensional grid. In the first stage of testing, samples are pooled by row and by column. In the second stage, only samples located at the intersections of positive rows and columns are retested. See Hitt et al. (2019, 2020) and https://bilder.shinyapps.io/PooledTesting/ for more details.
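When sensitivity and specificity can be taken as 1.0, the optimal batch size can also be found by direct search over candidate batch sizes using the formulas above. A minimal SAS sketch, assuming for illustration a prevalence of 5%:

data batch;
  p = 0.05;        *assumed disease prevalence;
  do m = 2 to 50;  *candidate batch sizes;
    ET       = 1 + m*(1 - (1 - p)**m);  *expected tests per batch, E(T);
    per_unit = ET / m;                  *E(T)/m, the quantity to be minimised;
    RT       = 100*(1 - per_unit);      *percent reduction in number of tests;
    output;
  end;
run;
proc sql;
  select m, ET, per_unit, RT
    from batch
    having per_unit = min(per_unit);  *most efficient batch size;
quit;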
Example: Screening for SARS-CoV-2 (COVID-19)

Abdalhamid et al. (2020) report an evaluation of batch-testing strategies for SARS-CoV-2 in specimens collected by nasopharyngeal swabs (the 'gold standard' for SARS-CoV-2 diagnosis). They used the Shiny application for pooled testing (available at https://www.chrisbilder.com/shiny).
For an expected disease prevalence of 5% (p = 0.05), test sensitivity of 0.95, and test specificity of 1.0, with testing in two stages, they obtained an optimal batch size m of 5, E(T) = 2.07, and an expected percentage reduction in the number of tests of RT = 100[1 − 0.41] = 59%. However, if prevalence is only 0.1% (p = 0.001), then m = 33, E(T) = 2.02, and RT = 94%.
Zhou and O'Leary (2021) report lower sensitivity (82–88%) for anterior nares or mid-turbinate (nasal) swabs. If sensitivity is assumed to be 0.85, with prevalence 5% and test specificity of 1, then m = 6, E(T) = 2.35, and RT = 61%.

8.5 Negative Binomial Distribution

The negative binomial distribution can be used to estimate the total number of samples required to obtain a pre-specified number of subjects with a given condition. For example, suppose an investigator wishes to pilot a new drug therapy on five cats with primary brain tumours. Because the condition is rare, the total number of subjects to be screened will potentially be very large. The investigator will need to know how many potentially eligible subjects must be screened until five subjects are obtained, at which point screening will stop. The number N of repeated trials required to produce s successes is a negative binomial random variable. It is assumed that the trials are independent – the outcome of one trial does not affect the outcome of other trials.
The negative binomial distribution is the probability distribution of a negative binomial random variable. It describes the probability that N = s + f trials will be required to obtain s successes. Because sampling continues until the predetermined number of successes s is observed, there will be many failures in the sequence (f subjects will be selected that do not have the condition). When the pre-specified number of successes has been achieved, sampling stops, so s defines the stopping criterion. The exact probability is given by

Pr(X = f) = C(f + s − 1, s − 1) p^s (1 − p)^f

where the number of successes s is fixed and the number of failures is f. Because N is the number of 'trials' or draws, f = N − s. The expected proportion of 'successes' is p, and (1 − p) is the expected proportion of 'failures'. The probability to be determined is then that of f failures given s successes.
The geometric distribution is a special case of the negative binomial distribution in which sampling stops at the first subject with the target condition: s = 1.

Example: Osteosarcoma Pilot Study: Determining Screening Numbers

Investigators wish to enrol 10 client-owned dogs with osteosarcoma for a pilot clinical trial of a new drug treatment. They expect only 40% of clients will consent to participate. How many clients of eligible dogs will need to be screened to have a 95% probability of obtaining 10 that agree to participate?
The expected proportion of successes p is 0.4, and the number of desired successes is s = 10. Enrolment of 10 dogs means that the total number of dogs screened is N = f + 10. To find N, the number of expected 'failures' f needs to be estimated. The probability of obtaining f or fewer failures is Pr(X ≤ f). The probabilities over a range of f are computed using the cumulative distribution function of the negative binomial distribution. The target sample size is selected for which the computed probability is equal to or exceeds (1 − α) = 0.95:

%let numSamples = 50; *maximum number of failures to evaluate;
data prob;
do f = 1 to &numSamples by 1; *set increment to desired step size;
y = CDF('NEGBINOMIAL',f,0.4,10); *y is the cumulative probability of f or fewer failures;
output;
end;
run;

The number of failures associated with a target probability ≥ 0.95 is f = 26 (with a realised probability of 0.955). The total number of dogs to be screened is therefore approximately N = 26 + 10 = 36.

8.6 Hypergeometric Distribution

The hypergeometric distribution is used to determine a sample size when the total population is small (Box 8.4). Subsequent draws without replacement of each item during the selection process reduce the total population size. As a result, the probability of selection of each item will change, and sampling events are not independent. (In contrast, for binomial 'experiments', the size of the total population is extremely large relative to the size of the sample to be drawn (>95%), so the pool will not be depleted by repeated sampling.)

BOX 8.4
Examples of Small Defined Target Populations for Hypergeometric Sampling Problems

Rosters listing all clients visiting a clinic within a given time period.
All individuals listed in a breed or disease registry.
All animals surveyed in a defined location.
Pollinators visiting flowers in a given location.
All blood samples in a given batch.

The hypergeometric distribution describes the probability of drawing a predefined number of subjects with the 'condition' (x) out of a random sample n drawn without replacement from a finite population of size N, with K members having the condition:

Pr(X = x) = [C(K, x) C(N − K, n − x)] / C(N, n)

Each subject in a sample of size n drawn from a population pool of size N will fall into one of four categories: the subject has the condition and is selected (x), has the condition and is not selected (K − x), does not have the condition and is selected (n − x), or does not have the condition and is not selected ((N − K) − (n − x)). Sampling is performed without replacement; that is, once each subject is selected, it is not returned to the pool.

Example: Maine Coon Cats and Hypertrophic Cardiomyopathy Markers

Approximately 30% of Maine Coon cats have an MYBPC3 gene mutation associated with hypertrophic cardiomyopathy (HCM). There were 320 cats listed in the Maine Coon cat registry. An investigator wished to determine how many cats would need to be screened to have 95% probability that at least five cats in the sample would have the mutation.

The total number of subjects in the registry population is N = 320. The prevalence of the condition is approximately 30%; therefore R = 0.3 N = 96. The number of mutation-positive cats required is at least 5, so Pr(x ≥ 5). SAS code for computations is given in Appendix 8.B.
Results are shown in Figure 8.1. A random selection of 28 cats from the registry list must be made to have a greater than 95% probability of obtaining at least 5 cats with the mutation, and 34 cats for a probability of 99%.

Figure 8.1: Determining the number of Maine Coon cats that must be selected at random from a small population to have a pre-specified probability of obtaining at least five with the MYBPC3 gene mutation. Twenty-eight cats must be selected at random to have a probability of 95% of obtaining at least five with the mutation, and 34 cats for a probability of 99%.

8.6.1 Estimating the Proportion of Subjects with the Target Effect

The extended hypergeometric function can be used to estimate the anticipated proportion of subjects that will show a target response change from the expected mean value, d − x̄ (Zelterman 1999). The method requires three steps:

1. Calculate the z-score from values for the target change d, the mean x̄, the sample standard error SE, and sample size n:

z = (d − x̄)/(SE √n)

2. Convert z to the probability determined by the pdf of the standard normal distribution:

Pr = φ(z)

3. The probability is the proportion of the sample expected to show the desired change in the response. In SAS, this is obtained by

Pr = probnorm(z)

Example: Weight Loss in Overweight and Obese Cats

(Data from Flanagan et al. 2018). Cats on a weight reduction diet were fed dry commercially available weight loss food and monitored for three months. Weight loss of 10–15% of initial body weight is necessary for a one-unit change in body condition score. Female cats (n = 155) lost an average of 11.34% body weight (SE 0.556), and males (n = 180) lost 8.95% (SE 0.467). What percentage of cats in a new study could be expected to show a 15% decrease in body weight?

In this example, d = −15, and the mean observed weight loss x̄ is −11.34 for females and −8.95 for males.

The z-score for females is [−15 − (−11.34)]/(0.556 √155) = −0.5289, and the proportion is probnorm(−0.5289) = 0.298, or 30%.

The z-score for males is [−15 − (−8.95)]/(0.467 √180) = −0.8067, and the proportion is probnorm(−0.8067) = 0.210, or 21%.

Therefore, in a new sample of obese and overweight cats, approximately 30% of females and 21% of males are anticipated to show the desired 15% decrease in weight for the target one-unit change in body condition score.

8.A Determining Cumulative Probabilities for Binomial, Negative Binomial, and Hypergeometric Distributions with SAS
Calculation of probabilities requires the specification of the cumulative distribution function (cdf) and the probability density function (pdf). The cumulative distribution function (cdf) gives the probability of a random variable X being smaller than or equal to some value x: Pr(X ≤ x) = F(x). The inverse of the cdf gives the value x that would make F(x) return a particular probability p: F⁻¹(p) = x. The probability density function (pdf) returns the relative frequency of the value of a statistic, or the probability that a random variable X takes on a certain value. The pdf is the derivative of the cdf.

8.A.1 Binomial Distribution

This distribution is applicable when the population is very large relative to the sample, so that the outcome on one trial does not affect the outcome on other trials, the probability of success p is the same on every trial, and the trials are independent. In SAS, the binomial distribution is coded as ('BINOMIAL', s, p, n), where s is the number of subjects with the condition ('successes'), p is the prevalence (probability of success, or proportion of subjects with the condition), and n is the number of 'trials' (sample size).

The exact and cumulative probabilities are:

Pr(X = s)  y = PDF('BINOMIAL',s,p,n);
Pr(X < s)  y = CDF('BINOMIAL',s,p,n) - PDF('BINOMIAL',s,p,n);
Pr(X ≤ s)  y = CDF('BINOMIAL',s,p,n);
Pr(X > s)  y = 1 - CDF('BINOMIAL',s,p,n);
Pr(X ≥ s)  y = PDF('BINOMIAL',s,p,n) + 1 - CDF('BINOMIAL',s,p,n);

8.A.2 Negative Binomial Distribution

The negative binomial distribution is applicable for estimating the total sample size N necessary to obtain a pre-specified number of subjects with the condition. The 'experiment' continues until a predetermined number of successes s is observed, and s defines the stopping criterion (there is a 'success' on the last trial).

The total number of samples required to obtain s successes is N = s + f trials. In SAS, the syntax is set up so that s is the number of successes, f is the number of failures (= N − s), and p is the probability of success.

Pr(X = f)  y = PDF('NEGBINOMIAL', f, p, s);
Pr(X ≤ f)  y = CDF('NEGBINOMIAL', f, p, s);

8.A.3 Hypergeometric Distribution

This distribution is applicable when the total target population N is small and finite relative to the target sample. Successive selection of each subject in a small population reduces the population size, so that the probability of selection of each item changes with each sampling event.

In SAS, the hypergeometric distribution is called by ('HYPER', s, N, R, n), where y is the probability, s is the target number of successes to be obtained in the sample, N is the total size of the target population, R is the number of subjects with the condition in the study population, and n is the number of subjects to be screened (sample size or number of 'trials'). The respective probabilities are:

Pr(X = s)  y = PDF('HYPER',s,N,R,n);
Pr(X < s)  y = CDF('HYPER',s,N,R,n) - PDF('HYPER',s,N,R,n);
Pr(X ≤ s)  y = CDF('HYPER',s,N,R,n);
Pr(X > s)  y = 1 - CDF('HYPER',s,N,R,n);
Pr(X ≥ s)  y = PDF('HYPER',s,N,R,n) + 1 - CDF('HYPER',s,N,R,n);

8.B SAS Programme for Maine Coon Cat Example

%let numSamples = 50; *maximum of n subjects to be sampled;
data prob;
do n = 1 to &numSamples by 1; *set start n and increment to desired values;
*iterate over a range of potential sample sizes to estimate the probability of capturing at least 5 events;
y = PDF('HYPER', 5, 320, 96, n) + 1 - CDF('HYPER', 5, 320, 96, n);
output;
end;
run;
proc print;
run;

/*plot change in probability with sample size n*/
proc sgplot data=prob;
series x=n y=y;
xaxis label="Sample size" grid values=(0 to 50 by 1);
yaxis label="Probability" grid values=(0 to 1 by 0.1);
refline 0.9 0.95 0.99;
run;
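As a cross-check on the osteosarcoma example in Section 8.5, the required number of failures can also be read off directly with the SAS QUANTILE function, which returns the smallest value whose cumulative probability meets the target. The following minimal sketch is illustrative only and is not part of the original appendix code:

*Sketch: direct computation of the number of failures f
 for the osteosarcoma example, with p = 0.4 and s = 10;
data nb_direct;
f = quantile('NEGBINOMIAL', 0.95, 0.4, 10); *smallest f with cumulative probability at least 0.95 (returns f = 26);
N = f + 10; *total number of dogs to screen (N = 36);
put f= N=;
run;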
References

Abdalhamid, B., Bilder, C.R., McCutchen, E.L. et al. (2020). Assessment of specimen pooling to conserve SARS COV-2 testing resources. American Journal of Clinical Pathology 153 (6): 715–718. https://doi.org/10.1093/ajcp/aqaa064.
Agresti, A. and Coull, B.A. (1998). Approximate is better than 'exact' for interval estimation of binomial proportions. The American Statistician 52: 119–126.
Arnold, M.E., Cook, A., and Davies, R. (2005). A modelling approach to estimate the sensitivity of pooled faecal samples for isolation of Salmonella in pigs. Journal of the Royal Society Interface 2: 365–372. https://doi.org/10.1098/rsif.2005.0057.
Arnold, M.E., Mueller-Doblies, D., Carrique-Mas, J.J., and Davies, R.H. (2009). The estimation of pooled-sample sensitivity for detection of Salmonella in turkey flocks. Journal of Applied Microbiology 107: 936–943. https://doi.org/10.1111/j.1365-2672.2008.04273.x.
Bilder, C., Tebbs, J., and McMahan, C. (2019). Informative group testing for multiplex assays. Biometrics 75 (1): 278–288.
Dorfman, R. (1943). The detection of defective members of large populations. Annals of Mathematical Statistics 14: 436–440.
Eypasch, E., Lefering, R., Kum, C.K., and Troidl, H. (1995). Probability of adverse events that have not yet occurred: a statistical reminder. BMJ 311 (7005): 619–620.
Finucan, H.M. (1964). The blood testing problem. Journal of the Royal Statistical Society: Series C: Applied Statistics 13 (1): 43–50. https://doi.org/10.2307/2985222.
Flanagan, J., Bissot, T., Hours, M.A. et al. (2018). An international multi-centre cohort study of weight loss in overweight cats: differences in outcome in different geographical locations. PLoS ONE 13 (7): e0200414. https://doi.org/10.1371/journal.pone.0200414.
Giles, J.R., Peel, A.J., Wells, K. et al. (2021). Optimizing noninvasive sampling of a zoonotic bat virus. Ecology and Evolution 11: 12307–12321. https://doi.org/10.1002/ece3.7830.
Hanley, J.A. and Lippman-Hand, A. (1983). If nothing goes wrong, is everything all right? Interpreting zero numerators. JAMA 249 (13): 1743–1745.
Hitt, B., Bilder, C., Tebbs, J., and McMahan, C. (2019). The objective function controversy for group testing: much ado about nothing? Statistics in Medicine 38 (24): 4912–4923.
Hitt, B., Bilder, C., Tebbs, J., and McMahan, C. (2020). A shiny app for pooled testing. https://bilder.shinyapps.io/PooledTesting/ (accessed 2022).
Hou, P., Tebbs, J., Bilder, C., and McMahan, C. (2017). Hierarchical group testing for multiple infections. Biometrics 73 (2): 656–665.
Hughes-Oliver, J.M. (2006). Chapter 3: Pooling experiments for blood screening and drug discovery. In: Screening Methods for Experimentation in Industry, Drug Discovery, and Genetics (ed. A. Dean and S. Lewis), 48–68. Springer 330 pp.
Jovanovic, B.D. and Levy, P.S. (1997). A look at the rule of three. The American Statistician 51 (2): 137–139.
Litvak, E., Tu, X.M., and Pagano, M. (1994). Screening for the presence of a disease by pooling sera samples. Journal of the American Statistical Association 89: 424–434.
Louis, T.A. (1981). Confidence intervals for a binomial parameter after observing no successes. The American Statistician 35 (3): 154.
Newcombe, R.G. (1998). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17: 857–872.
Newman, S.C. (2001). Biostatistical Methods in Epidemiology. New York: Wiley.
Shipitsyna, E., Shalepo, K., Savicheva, A. et al. (2007). Pooling samples: the key to sensitive, specific and cost-effective genetic diagnosis of Chlamydia trachomatis in low-resource countries. Acta Dermato-Venereologica 87: 140–143.
Zelterman, D. (1999). Models for Discrete Data. Oxford: Clarendon Press.
Zhou, Y. and O'Leary, T.J. (2021). Relative sensitivity of anterior nares and nasopharyngeal swabs for initial detection of SARS-CoV-2 in ambulatory patients: rapid review and meta-analysis. PLoS ONE 16 (7): e0254558. https://doi.org/10.1371/journal.pone.0254559.
III
Sample Size for Description

Chapter 9: Descriptions and Summaries.
Chapter 10: Confidence Intervals and Precision.
Chapter 11: Prediction Intervals.
Chapter 12: Tolerance Intervals.
Chapter 13: Reference Intervals.
9
Descriptions and Summaries

CHAPTER OUTLINE HEAD


9.1 Introduction
9.2 Describing Sample Data
9.3 Describing Results
9.4 Confidence and Other Intervals
9.5 Relationship Between Interval Width, Power, and Significance
References

9.1 Introduction

Descriptive statistics describe the properties of sample data and summarise the results of statistical inference testing. Many studies are inherently descriptive rather than evaluative and inferential. These include surveys, species inventories, demand–supply assessments, feasibility studies, documentation of novel methodologies, and protocols. For these studies, the goal is to provide empirical descriptions of specific events, outcomes, or categories in the sample, not to test hypotheses (Gerring 2012). However, hypothesis-testing studies require descriptive statistics to summarise results and synthesise information across multiple groups or studies (Box 9.1).

BOX 9.1
Descriptive Statistics

1. Describing sample data
Measures of central tendency
Measures of variation
Five-number summary
Counts and percentages

2. Describing results of inference tests
Range of estimates containing the unknown population parameter with a specified probability:
Confidence intervals
Prediction intervals
Tolerance intervals
Reference intervals

Descriptive statistics are a condensation of large amounts of data down to two or a few metrics: measures of central tendency and measures of variation. The central tendency is a single value identifying the central point of the data distribution (point estimate). It is a measure of the magnitude or size of the variable. Measures of central tendency include the mean and median for continuous sample data. Measures of variation describe the spread or dispersion of the data around the central point. Variability between observations within the sample is described by the range, standard deviation, interquartile range, and variance. The five-number summary (minimum, lower quartile, median, upper quartile, maximum) is appropriate for non-normally distributed continuous data. Categorical data are summarised as counts and percentages. Descriptive statistics require reporting of total sample size, sample sizes for each experimental group, and the number of any animals lost through attrition.
The results of statistical hypothesis tests are used to make inferences about the mean for the whole population. The uncertainty in the estimates is quantified by confidence intervals. Several other types of intervals (prediction, tolerance, or reference) can be computed, depending on study objectives.

The best-practice reporting standard is to present descriptive statistics in at least two summary tables. So-called 'Table 1' is for reporting study animal characteristics ('Who was studied?'). 'Table 2' summarises the results ('What was found?'). Total sample size, sample sizes per group, and numbers lost through attrition must always be reported. Summary descriptive data and meticulous sample size accounting enable the assessment of internal and external validity and are required for systematic reviews and meta-analyses (Macleod et al. 2015). Reporting of both sets of descriptive and summary statistics is recommended by all international consensus reporting guidelines, including ARRIVE 2.0 for animal-based research (Percie du Sert et al. 2020a, 2020b) and STROBE-VET for observational clinical studies (O'Connor et al. 2016; Sargeant et al. 2016).

9.2 Describing Sample Data

'Table 9.1' presents summary data for basic descriptors: animal signalment (source, species, strain, age, sex, body weight) and baseline (or pre-intervention) traits (e.g. baseline biochemistry, haematology, hemodynamics, scores, behaviour metrics, etc.) (Box 9.2). Table 9.1 information allows assessment of the representativeness of the sample and any potential biases (Roberts and Torgerson 1999). The more the study animals are representative of the target population, the more likely the results will be generalisable. The descriptive statistics in these tables provide information about the animals in the study sample and are analogous to patient demographic data reported for human clinical trials. Information for variables recorded at baseline is important for providing context for subsequent interpretation of results. Effects of the test intervention and potential benefit are assessed by the magnitude of post-intervention changes compared to baseline (Fishel et al. 2007). Between-study differences in animal characteristics and baseline or 'starting' conditions may also contribute to substantial differences in results (Mahajan et al. 2016). Standard deviations are used to describe the sampling distribution of the mean (Altman and Bland 2005). Performing significance tests for differences on Table 1 characteristics is usually illogical and not recommended (Altman 1985; Senn 1994; Roberts and Torgerson 1999; de Boer et al. 2015; Pijls 2022).

BOX 9.2
Describing Descriptive Data: 'Who Was Studied?'

'Table 1' information:
What: Summary of major signalment, clinical, and pre-intervention characteristics for each study group.
Descriptors: Sample statistics: sample size per group n, mean (SD), median (interquartile range), and counts (per cent).

Example: Sample 'Table 1' Data for a Two-Arm Crossover Study of Stress in Dogs

'Table 9.1' shows group sample sizes and descriptive statistics for signalment, baseline heart rate, and baseline fear, anxiety, and stress (FAS) scores for dogs in a crossover study of clinical exam location effects.

Table 9.1: Example of Descriptive Statistics Reporting for Animal Signalment and Baseline Characteristics.

Variable | Group 1 | Group 2
Number of dogs, n | 24 | 20
Sex, n males (%) | 12 (50%) | 12 (70%)
Age, years; median (IQR; minimum, maximum) | 5.0 (2.5, 9.0; 1, 15) | 5.7 (4.7, 5.0; 2, 11)
Body weight, kg; median (IQR; minimum, maximum) | 19 (9, 26; 3, 65) | 22 (10, 30; 3, 39)
Heart rate, bpm; mean (SD) | 103 (21) | 107 (25)
Baseline FAS scores | Number of dogs |
0–1 | 9 | 11
2–3 | 7 | 7
>3 | 2 | 3

Source: Data from Mandese et al. (2021).
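Summaries of this kind can be assembled directly with standard procedures. The sketch below is a minimal illustration only; the data set dogs and its variables (group, sex, age, weight_kg, heart_rate) are hypothetical names and are not the variables of Mandese et al. (2021):

*Sketch: descriptive statistics for a 'Table 1' summary;
proc means data=dogs n mean std median q1 q3 min max;
class group;
var age weight_kg heart_rate;
run;
proc freq data=dogs;
tables group*sex; *counts and percentages by group;
run;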
9.3 Describing Results

'Table 2' information consists of the summary data for results per group and/or differences between intervention and control. These are reported as effect sizes (such as mean differences) and confidence (or other) intervals (Box 9.3). Intervals measure the uncertainty around the point estimate and are the best estimate of how far the estimate is likely to deviate from the true value (Altman and Bland 2014a). Intervals should be used to present the results of statistical inference so that the practical or biological importance of differences can be properly evaluated (Gardner and Altman 1986; Cumming 2012; Altman and Bland 2014a, b).

BOX 9.3
Describing Descriptive Data: 'What Was Found?'

'Table 2' information:
What: Summary of major results for each study group, and results of hypothesis tests for group differences in primary and secondary outcomes; population-based measures of precision.
Descriptors: Sample size per group n, means, confidence intervals (or other relevant interval measure).

Example: Sample 'Table 2' Results from a Two-Arm Randomised Controlled Study of Mouse Productivity

'Table 9.2' provides descriptive statistics expressed as adjusted means and 95% confidence limits for differences in productivity for two methods of mouse handling. Means were adjusted for parental and parity effects. The differences in favour of tunnel handling were operationally important but not statistically significant.

Table 9.2: Example of Descriptive Statistics Reporting for Results of Hypothesis Tests.

Variable | Adjusted mean difference (95% confidence interval) | P-value
Sample size, n | 59 breeding pairs |
Number of extra pups born per pair, tunnel versus tail-lift | +1.04 (0.95, 1.14) | 0.41
Number of pups weaned per pair, tunnel versus tail-lift | +1.07 (0.94, 1.20) | 0.33

Source: Data from Hull et al. (2022).

9.4 Confidence and Other Intervals

There are four main types of intervals: confidence, prediction, tolerance, and reference (Box 9.4). Choice of interval will depend on study objectives. Confidence intervals are related to the use of significance tests but provide more useful, actionable, and interpretable information on the size, direction, and uncertainty associated with the observed effect compared to a P-value (Gardner and Altman 1986; Steidl et al. 1997; Cumming 2012; Rothman and Greenland 2018).

The interval is an estimate based on the sampling distribution of the statistic. The general form of the interval is:

sample point estimate ± width(SD, n, α)

The width of the interval is therefore determined by the variation in the sample (the standard deviation SD), the inverse square root of the sample size 1/√n, and the pre-specified significance level α.
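For a normally distributed outcome, interval estimates of this general form can be obtained directly from proc means. A minimal sketch, assuming a hypothetical data set mydata with outcome variable y:

*Sketch: sample mean with 95% confidence limits;
proc means data=mydata alpha=0.05 n mean stddev lclm uclm;
var y;
run;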
BOX 9.4
Four Types of Descriptive Intervals

Confidence intervals describe the range within which a given population parameter is expected to be located with a specified probability. Precision is one-half the confidence interval or margin of error. Interval width is determined by sampling error.
Prediction intervals describe the range within which one or more future observations are expected to be located with a specified probability. Interval width is determined by sampling error and the expected variation in the predicted values.
Tolerance intervals describe the range within which a specified proportion of the population is expected to be located with a specified probability. Interval width is determined by sampling error and expected variation in the population.
Reference intervals describe the range of clinical read-out values between which typical results for a healthy subject might be expected to occur. Interval width is determined by sampling error variation and variation around the cut-off limits separating healthy from abnormal values.

Confidence intervals define the range of values from a sample that is likely to contain the population parameter (such as a mean or variance). The confidence interval puts bounds on the variation around a single, or point, estimate of a population parameter with a given significance α. The endpoints of the confidence interval are the confidence limits. The width of the confidence interval is determined by the variation in the observed responses, or sampling error. As sample size increases, sampling error becomes smaller as the sample converges on the true population parameter, and the confidence interval approaches zero. Because of sampling variation, the confidence interval for any given sample may not contain the parameter. The coverage probability means that for a large number of samples of size n, the proportion of intervals that contain the population parameter will be 1 − α. For example, for a significance level of α = 0.05, the 95% confidence interval (1 − 0.05 = 0.95) means that approximately 95% of confidence intervals calculated for a large number of samples of size n will contain the population parameter (Table 9.3). Precision is defined as one-half of the confidence interval, or margin of error.

Table 9.3: Coverage Probabilities for 95% and 99% Confidence Intervals for the Mean of a Normally Distributed Continuous Variable with Sample Size n of 10, 20, 50, and 100. Coverage probabilities were obtained by N = 1000 simulations for each sample size and confidence interval in R coversim.

Sample size | 95% confidence interval | 99% confidence interval
n = 10 | 0.928 | 0.988
n = 20 | 0.934 | 0.984
n = 50 | 0.944 | 0.993
n = 100 | 0.948 | 0.994

When comparing two methods or devices that measure the same thing on the same scale of measurement, the limits of agreement describe the interval within which a proportion of the differences between measurements occur with a given probability (usually 95%) for a sample of n paired measurements (Bland and Altman 1986, 2003).

Prediction intervals describe the bounds on variation for one or more future observations, given the distribution of a sample of previous observations. Prediction interval width is determined by two variance components: the variance in the observed responses and the variance due to the prediction. Therefore prediction intervals are wider than confidence intervals. Prediction intervals are most frequently calculated for regression-based models where the magnitude of a response depends on one or more covariates (Bland and Altman 2003).

Tolerance intervals are constructed to estimate a range that contains a specified proportion of the population with a specified confidence or coverage. Tolerance intervals are most appropriate if the interval is intended to capture coverage for many future samples. The width of the tolerance interval is determined by both sampling error and the variance in the population (Meeker et al. 2017).

Reference intervals describe the range of clinical read-out values between which typical results for a healthy subject might be expected to occur. The width of the reference interval is defined by the lower and upper bounds on the central 95% of the reference sample measurements. These cut-points also have an associated precision, defined as a proportion of the reference interval width (Jennen-Steinmetz and Wellek 2005).
9.5 Relationship Between Interval Width, Power, and Significance

Intervals are related to statistical tests of significance but provide more practical information about the true size of the specified difference or effect. Figure 9.1 shows the relationship between interval width, statistical significance, and power (Clayton and Hills 1993). The power of the study is the probability that the lower confidence limit is greater than the value specified by the null hypothesis, the hypothesis of no difference between means. The probability of a type I error (obtaining a false positive, or incorrectly rejecting the null hypothesis when it is true) is specified by the significance threshold α. The confidence interval (1 − α) is a range that is expected to contain the population parameter with a specified probability.

For sufficient power to detect a specified effect, the lower limit of the confidence interval must exceed the upper limit for the null hypothesis value. The bounds for the null value are defined by the range of d standard deviations from the null value. The confidence interval covers a range of c standard deviations on either side of the mean, so the limits of the confidence interval are defined by ± c SD.

Figure 9.1: Relationship between interval width, statistical significance, and power. The significance level α defines the bounds for the confidence interval as a multiple of c standard deviations. Power defines how far the lower confidence limit is from the null value and is a multiple of d standard deviations. Source: Adapted from Clayton and Hills (1993).

The values of c and d are obtained from the standard normal z-scores for the desired level of significance and power, respectively. The z-score is the number of standard deviations between an observation and the mean:

z = (observed value − mean)/SD = (x − μ)/σ

The z-score is therefore a standardised normal value for the observation. The maximum possible z-score is (n − 1)/√n. Probability and z-scores are obtained from the standard normal distribution (Box 9.5). For the normal distribution, the probability for a value in a given interval is the area under the normal curve. For example, a 95% confidence interval is defined by the lower and upper confidence limits zl and zu. The value of zl is −1.96 and of zu is 1.96. The area under the curve to the left of zl is 0.025, and the area to the right of zu is 0.025. The area under the curve between the two limits is 0.95. For power of 90%, z = d = 1.282.

BOX 9.5
Calculating z-Scores for Confidence and Power (SAS Code)

data zscores;
alpha = 0.05; *significance alpha;
beta = 0.2; *0.1 for 90% power, 0.2 for 80% power;
zleft = quantile('normal', alpha/2);
zright = quantile('normal', 1 - alpha/2);
zbeta = quantile('normal', 1 - beta);
run;

The difference between the null value and the lower limit of the observed value is expressed as a multiple of the number of standard deviations. To be statistically significant, the distance (c + d) SD must be greater than the effect, so that the lower limit of the confidence interval (effect − c SD) is larger than the upper limit for power (null value + d SD). Because confidence and power are pre-specified, sample size is chosen so that SD ≤ effect/(c + d), and (c + d) is approximately three. That is, the effect to be detected must be equal to or greater than three standard deviations (Clayton and Hills 1993; Table 9.4).
Table 9.4: Confidence, Power, and Corresponding z-Values for a Standard Normal Distribution.

α | Confidence 100(1 − α)% | z1−α/2 | β | Power 100(1 − β)% | z1−β
0.01 | 99 | 2.576 | 0.25 | 75 | 0.675
0.05 | 95 | 1.960 | 0.20 | 80 | 0.842
0.10 | 90 | 1.645 | 0.10 | 90 | 1.282
0.20 | 80 | 1.282 | 0.05 | 95 | 1.645

Source: Adapted from Clayton and Hills (1993).
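The z-values in Table 9.4, and the sum (c + d) discussed above, can be reproduced with the QUANTILE function; a minimal sketch:

*Sketch: z-scores for confidence and power and their sum;
data zsum;
alpha = 0.05; *95 percent confidence;
beta = 0.20; *80 percent power;
c = quantile('normal', 1 - alpha/2); *1.960;
d = quantile('normal', 1 - beta); *0.842;
c_plus_d = c + d; *2.802, approximately three;
put c= d= c_plus_d=;
run;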
Interval plots with different confidence levels are especially useful for evaluating empirical and translational pilot studies (Chapter 6). Pilot studies are small by design and therefore underpowered to detect any but very large effect sizes. As a result, non-significant p-values are often interpreted as evidence of 'no effect', and potentially meaningful results are rejected (Altman and Bland 1995; Lee et al. 2014). However, the data may still provide sufficient evidence of promise (always assuming the pilot study was designed with sufficient internal validity), although the study may be too small to statistically detect an effect size with a specified power. Lee et al. (2014) recommend that the strength of preliminary evidence can be assessed by comparing the effect relative to the null with confidence intervals of different widths (e.g. 95%, 90%, 80%, 75%).

References

Altman, D.G. (1985). Comparability of randomised groups. The Statistician 34: 125–136. https://doi.org/10.2307/2987510.
Altman, D.G. and Bland, J.M. (1995). Absence of evidence is not evidence of absence. BMJ 311 (7003): 485. https://doi.org/10.1136/bmj.311.7003.485.
Altman, D.G. and Bland, J.M. (2005). Standard deviations and standard errors. BMJ 331: 903. https://doi.org/10.1136/bmj.331.7521.903.
Altman, D.G. and Bland, J.M. (2014a). Uncertainty and sampling error. BMJ 349: g7064. https://doi.org/10.1136/bmj.g7064.
Altman, D.G. and Bland, J.M. (2014b). Uncertainty beyond sampling error. BMJ 349: g7065. https://doi.org/10.1136/bmj.g7065.
Bland, J.M. and Altman, D.G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1: 307–310.
Bland, J.M. and Altman, D.G. (2003). Applying the right statistics: analyses of measurement studies. Ultrasound in Obstetrics & Gynecology 22 (1): 85–93. https://doi.org/10.1002/uog.122.
de Boer, M.R., Waterlander, W.E., Kuijper, L.D. et al. (2015). Testing for baseline differences in randomized controlled trials: an unhealthy research behavior that is hard to eradicate. International Journal of Behavioral Nutrition and Physical Activity 12: 4. https://doi.org/10.1186/s12966-015-0162-z.
Clayton, D. and Hills, M. (1993). Statistical Models in Epidemiology. Oxford: Oxford University Press 375 pp.
Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.
Fishel, S.R., Muth, E.R., and Hoover, A.W. (2007). Establishing appropriate physiological baseline procedures for real-time physiological measurement. Journal of Cognitive Engineering and Decision Making 1: 286–308.
Gardner, M.J. and Altman, D.G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal 292 (6522): 746–750. https://doi.org/10.1136/bmj.292.6522.746.
Gerring, J. (2012). Mere description. British Journal of Political Science 42 (4): 721–746. https://www.jstor.org/stable/23274165.
Hull, M.A., Reynolds, P.S., and Nunamaker, E.A. (2022). Effects of non-aversive versus tail-lift handling on breeding productivity in a C57BL/6J mouse colony. PLoS ONE 17 (1): e0263192. https://doi.org/10.1371/journal.pone.0263192.
Jennen-Steinmetz, C. and Wellek, S. (2005). A new approach to sample size calculation for reference interval studies. Statistics in Medicine 24 (20): 3199–3212. https://doi.org/10.1002/sim.2177.
Lee, E.C., Whitehead, A.L., Jacques, R.M., and Julious, S.A. (2014). The statistical interpretation of pilot trials: should significance thresholds be reconsidered? BMC Medical Research Methodology 14: 41. https://doi.org/10.1186/1471-2288-14-41.
Macleod, M.R., Lawson McLean, A., Kyriakopoulou, A. et al. (2015). Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biology 13 (11): e1002301.
Mahajan, V.S., Demissie, E., Mattoo, H. et al. (2016). Striking immune phenotypes in gene-targeted mice are driven by a copy-number variant originating from a commercially available C57BL/6 strain. Cell Reports 15 (9): 1901–1909. https://doi.org/10.1016/j.celrep.2016.04.080.
Mandese, W.W., Griffin, F.C., Reynolds, P.S. et al. (2021). Stress in client-owned dogs related to clinical exam location: a randomised crossover trial. Journal of Small Animal Practice 62 (2): 82–88. https://doi.org/10.1111/jsap.13248.
Meeker, W.Q., Hahn, G.J., and Escobar, L.A. (2017). Statistical Intervals: A Guide for Practitioners and Researchers, 2e. New York: Wiley.
O'Connor, A.M., Sargeant, J.M., Dohoo, I.R. et al. (2016). Explanation and elaboration document for the STROBE-Vet statement: Strengthening the Reporting of Observational Studies in Epidemiology – Veterinary extension. Journal of Veterinary Internal Medicine 30 (6): 1896–1928. https://doi.org/10.1111/jvim.14592.
Percie du Sert, N., Hurst, V., Ahluwalia, A. et al. (2020a). The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biology 18 (7): e3000410. https://doi.org/10.1371/journal.pbio.3000410.
Percie du Sert, N., Ahluwalia, A., Alam, S. et al. (2020b). Reporting animal research: explanation and elaboration for the ARRIVE guidelines 2.0. PLoS Biology 18 (7): e3000411. https://doi.org/10.1371/journal.pbio.3000411.
Pijls, B.G. (2022). The Table I fallacy: P values in baseline tables of randomized controlled trials. Journal of Bone and Joint Surgery 104 (16): e71. https://doi.org/10.2106/JBJS.21.01166.
Roberts, C. and Torgerson, D.J. (1999). Understanding controlled trials: baseline imbalance in randomised controlled trials. BMJ 319 (7203): 185. https://doi.org/10.1136/bmj.319.7203.185.
Rothman, K.J. and Greenland, S. (2018). Planning study size based on precision rather than power. Epidemiology 29: 599–603.
Sargeant, J.M., O'Connor, A.M., Dohoo, I.R. et al. (2016). Methods and processes of developing the Strengthening the Reporting of Observational Studies in Epidemiology – Veterinary (STROBE-Vet) statement. Preventive Veterinary Medicine 134: 188–196. https://doi.org/10.1016/j.prevetmed.2016.09.005.
Senn, S. (1994). Testing for baseline balance in clinical trials. Statistics in Medicine 13 (17): 1715–1726. https://doi.org/10.1002/sim.4780131703.
Steidl, R.J., Hayes, J.P., and Schauber, E. (1997). Statistical power analysis in wildlife research. Journal of Wildlife Management 61 (2): 270–279.
10
Confidence Intervals and Precision

CHAPTER OUTLINE HEAD

10.1 Introduction
10.2 Definitions
10.3 Sample Size Calculations
10.3.1 Absolute Versus Relative Precision
10.4 Continuous (Normal) Outcome Data
10.4.1 Simultaneous Confidence Intervals
10.5 Proportions
10.6 Multinomial Samples
10.7 Skewed Count Data
10.7.1 Poisson Distribution
10.7.2 Negative Binomial Distribution
10.A SAS Code for Computing Simultaneous Confidence Intervals (Data from German et al. 2015)
10.B Sample SAS Code for Computing Confidence Intervals for a Single Sample Proportion where x is the Number of Events, N is the Sample Size, and Proportion p = x/N (Adapted from Newcombe 1998; Hu 2015)
10.C SAS Code for Calculating the Critical Values for z(α/2)/k and χ²α/k,1
10.C.1 Calculating the Critical Values for z(α/2)/k
10.C.2 Calculating the Critical Values for χ²α/k,1
10.D SAS Code for Calculating Confidence Limits for Poisson Data
10.E Evaluating Poisson and Negative Binomial Distributions for Fitting Counts of Red Mites on Apple Leaves (Data from Bliss and Fisher 1953)
References

10.1 Introduction

The confidence interval defines the limits within which a given population parameter (such as the mean) is expected to lie with a specified degree of confidence. The confidence interval enables an estimate of the precision of the sample estimate relative to the true population value (Box 10.1). The width of the confidence interval is entirely determined by the amount of variation in the sample due to sampling error. Therefore a 'precise' estimate indicates that measurements are stable and consistent, with a relatively narrow range of variation around the point estimate of the sample. Because a precise estimate is a reliable estimate, Cumming (2012) refers to precision as a measure of the 'informative-ness' of an experiment.

BOX 10.1
Definitions

▪ Confidence interval is the range within which a given population parameter is expected to be located with a specified probability.
▪ Precision is one-half the confidence interval. Sometimes called the margin of error.
10.2 Definitions

The confidence interval puts bounds on the variation around a single estimate of a population parameter (for example, the mean or variance) with a given confidence level. The confidence level defines how close the confidence limits are to the point estimate for the effect size. The significance level α is used to compute the confidence level 100(1 − α)%. For example, a 95% confidence level is set by α = 0.05. The endpoints of the confidence interval are the confidence limits. The upper and lower limits, or bounds of the confidence interval, are defined by the effect size ± precision d.

The amount of precision is determined entirely by sampling error. Therefore, as sample size increases, the interval converges to a single point (the true population parameter), and the width converges to zero. There is a trade-off between precision, confidence levels, and sample size. A very precise estimate (high precision) has narrow confidence intervals. Narrower confidence intervals require larger sample sizes. As the confidence level increases, the confidence interval becomes wider and precision is accordingly reduced. A confidence level of 99% requires a larger sample size than a confidence level of 95%. Therefore, increasing the information content of a study can be accomplished by reducing sample variation, as well as by simply increasing sample size.

Precision is one-half of the confidence interval: (upper limit − lower limit)/2. It can also be computed from the product of the critical z-score for the confidence level and the standard error of the estimate. Estimating sample size for precision depends on:

Type of primary outcome variable. Variable type (continuous, proportion, time to event) determines the underlying distribution of the population parameter of interest.
Effect size. The biologically or clinically important difference to be detected.
Expected variability. Larger variation in the sample requires larger sample sizes.
The desired confidence level 100(1 − α)%.
Precision required on either side of the parameter estimate.
Spatial or temporal correlation. Larger sample sizes are required if observations are not independent.

Confidence intervals can be obtained by bootstrapping, as an alternative to direct computation. Bootstrapping is a computationally intense method for estimating the sampling distribution of most statistics. It uses random sampling with replacement to generate an approximating distribution for observed data (Efron and Tibshirani 1993; Davison and Hinkley 1997).

Bootstrapping involves four steps:

1. Generate new data by drawing a random sample with replacement multiple times from the observed data set. It is usually recommended that 5000–10,000 draws are performed.
2. Fit the model and calculate the mean (or mean difference) and SE for each bootstrap sample.
3. Calculate the confidence intervals for each bootstrap sample.
4. Estimate confidence intervals from the empirical quantiles. For example, for a 95% confidence interval, the quantiles for the limits are 2.5% and 97.5%.

10.3 Sample Size Calculations

The basis for sample size selection is the level of desired precision for the estimate of the true difference in the outcome. For example, a study may be designed to estimate the true range of biomarker expression within ±5% of the population value. The confidence interval will be bounded by 5% on either side of the sample estimate, for a total width of 10%.

Sample size for confidence interval width is determined in four steps:

1. Decide target precision d. Precision is one-half the width of the desired confidence interval.
It can also be expressed as a proportional measure of the deviation from the mean. For example, if the desired confidence interval width is 10% of the mean (or 0.1), then precision d will be ±(0.1/2), or ±5% (±0.05).

2. Calculate the standard error of the estimate. The formula for the standard error (SE) depends on the type of variable and the associated distribution. For example, the SE for the mean of a continuous normal variable Ȳ is s/√n; the standard error for a proportion is SE(p) = √(p(1 − p)/n).

3. Specify confidence. The most common α is 0.05 (5%), corresponding to a 95% confidence interval. Other commonly used settings for α are 0.01 (1%; 99% confidence) and 0.1 (10%; 90% confidence). A one-sided confidence interval is constructed by using α rather than α/2 in the expression for the lower or upper limit. For most practical purposes and for ease of calculation, the z-score is used. For example, for a two-tailed 95% confidence interval, z1−α/2 = 1.96. If sample size is small, the t statistic can be substituted; however, sample size approximations will then require iterations over a range of sample sizes to determine the value that is closest to the target power for a given effect size.

4. Compute sample size by dividing the product of the squared critical value and the variance by the square of the precision, d².

10.3.1 Absolute Versus Relative Precision

'Precision' can be either absolute or relative (Box 10.2). Research objectives will determine which measure of precision to use (Lwandga and Lemeshow 1991).

BOX 10.2
Absolute Versus Relative Precision

Absolute precision is a fixed difference of percentage points on either side of the parameter estimate.
Relative precision is the fraction of the true value of the parameter estimate.

Absolute precision is a fixed difference of percentage points on either side of the parameter estimate and refers to the actual uncertainty in the metric itself. Absolute precision is used when the goal is to estimate the population parameter to within d percentage points of the true value (estimate ± d), where a percentage point is the unit for the arithmetic difference of two percentages. For example, if the prevalence of a given disease is 25% with a precision of 10%, the absolute precision of the estimate is 10%.

Relative precision is used when the goal is to estimate the population parameter to within a given percentage of the value itself. For example, if prevalence is 25% ± 10%, the relative uncertainty is 10% of 25%, or 2.5%. Relative precision scales the desired amount of precision to the parameter estimate; that is, the variation in the sample is a fraction or proportion of the true value of the parameter: estimate ± d(estimate) (Lwandga and Lemeshow 1991).

10.4 Continuous (Normal) Outcome Data

The confidence interval is computed from the product of the standard error for the estimate and the critical value of the test statistic. For a mean Ȳ and standard error SE(Ȳ) = s/√n, the confidence interval for the mean is

Ȳ ± z1−α/2 (s/√n)

Re-arranging, sample size is proportional to:

n = z²1−α/2 s²

For absolute precision, this quantity is divided by the square of the precision:

n = z²1−α/2 s² / d²

For relative precision, the divisor also includes the squared mean:

n = z²1−α/2 s² / (d² Ȳ²)

The coefficient of variation (CV) is the ratio of the sample standard deviation to the sample mean, s/Ȳ. It is a form of relative precision.
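These formulas can be evaluated directly in a data step. The following minimal sketch uses the absolute-precision form with the values of the lizard example that follows (s² = 6.3, d = 1.25):

*Sketch: sample size for target absolute precision,
 continuous normal outcome;
data n_normal;
z = quantile('normal', 0.975); *95 percent confidence;
s2 = 6.3; *sample variance;
d = 1.25; *precision (half the interval width);
n = ceil(z**2 * s2 / d**2); *n = 16;
put n=;
run;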
Example: Lizard Body Temperatures

An investigator studying lizard body temperatures wished to estimate the mean difference between preferred temperature and critical thermal maximum temperature (Tdiff) for a population of lizards with 95% confidence. Mean Tdiff for the species was estimated as 10 °C, with variance s² = 6.3 (°C)². The investigator considered that a precision of 2.5 °C around mean Tdiff was biologically relevant and operationally feasible (that is, within the measurement error of the device).

Assuming the data are normally distributed, then d = 2.5 °C/2 = 1.25 °C, and the required sample size is

N = (1.96)² (6.3)/(1.25)² = 15.5 ≈ 16

If the investigator wished to determine mean Tdiff with a precision of 1 °C, then d = 0.5 °C and

N = (1.96)² (6.3)/(0.5)² ≈ 97

10.4.1 Simultaneous Confidence Intervals

Simultaneous confidence intervals (SCI) are constructed for assessing multiple outcome variables. Simultaneous assessment is inherently a multiple comparisons problem. Therefore, SCI must be computed to reduce the probability of Type I error (false positives) and adjust for the family-wise error rate. For a given sample size, SCI are much wider than conventional confidence intervals computed for each variable separately. Confidence intervals that do not intersect the pre-specified null value suggest evidence against the null hypothesis and for the existence of an effect. No evidence is provided against the null hypothesis if confidence intervals contain the null value.

There are four steps in SCI construction:

1. If units of measurement for the variables are not the same, the variables must be standardised before analysis. For each of the p variables, compute its ratio r as the observed value divided by the corresponding reference value for that variable:

r = yi / reference value

The reference ratio across all standardised variables will equal 1.

2. For each variable, compute its mean ratio r̄j, standard deviation sj, and sample size nj.

3. Calculate the pooled standard error szj/√n across all variables.

4. Compute the simultaneous confidence bands 100(1 − α)% as:

r̄j ± (szj/√nj) √[pj (nj − 1) Fα,p,n−p / (nj − pj)]

where Fα,p,n−p is the critical F value. Appendix 10.A provides sample SAS code for calculating simultaneous confidence bands and generating profile plots.

Example: Nutrient Screening of Obese Dogs in a Weight Loss Study

(Data from German et al. 2015). Daily intake of 20 essential nutrients was measured for 27 obese dogs on an energy-restricted weight loss diet. The goal was to establish if average daily nutrient intakes complied with recommended dietary levels.

In this example, observations were first standardised to the specific minimum recommended daily allowance value (RDA) for each nutrient, so the null value is 1 for all variables. For example, crude protein intake is standardised by dividing each observation by 3.28, the NRC recommended daily intake amount. Evidence for nutrient levels exceeding RDA is suggested if the lower confidence limit exceeds 1, below RDA if the upper confidence limit is less than 1, and no evidence against correspondence with RDA recommendations if confidence intervals contain 1.

A comparison of conventional and simultaneous confidence intervals for five nutrients is given in Table 10.1. Note that the SCI are much wider than conventional confidence intervals.

Table 10.1: Comparison of Conventional and Simultaneous Confidence Intervals for Standardised Nutrient Values in a Feeding Study of Obese Dogs.

Nutrient | Standardised value | 95% confidence interval | 95% simultaneous confidence interval
Potassium | 1.22 | 1.17, 1.28 | 0.81, 1.63
Selenium | 0.43 | 0.41, 0.45 | 0.29, 0.56
Crude protein | 1.93 | 1.85, 2.01 | 1.32, 2.55
Total fat | 1.12 | 1.07, 1.16 | 0.76, 1.47
Vitamin D3 | 32.7 | 31.4, 34.1 | 22.3, 43.2

10.5 Proportions

The proportion p = x/N, where x is the number of subjects responding or showing the condition of interest, and N is the total number of subjects. Choosing 'correct' confidence intervals for proportions needs some care (Box 10.3).

BOX 10.3
Confidence Intervals for a Proportion

The classic Wald method for computing the standard error is best known but not recommended.
Clopper-Pearson method is based on the exact binomial distribution.
Wilson method solves for the lower and upper confidence levels of p separately.
Agresti-Coull (or adjusted Wald) 'adds two successes and two failures'.
Jeffreys method is based on the beta distribution and is Bayesian, incorporating prior information for the proportion and the probability of the observed data.

The classic Wald method uses the z-score from the large-sample normal approximation. The confidence interval is p ± z1−α/2 SE(p), where the standard error is

SE(p) = √(p(1 − p)/N)

Sample size is proportional to:

N = z²1−α/2 p(1 − p)

This method may be useful as a first approximation for a sample size estimate, but it is not recommended as a descriptor. It is highly unstable and is reliable only for intermediate values of p, and only if N is large (>50). It may result in inappropriate negative values for the lower limit of a proportion, which by definition can only occur between 0 and 1.

The Clopper-Pearson method is based on the exact binomial distribution and is usually estimated from quantiles of the beta distribution. It is accurate when np > 5 or n(1 − p) > 5, and unlike the Wald method, it can be calculated when p = 0 or 1. However, the coverage probability can be very much larger than (1 − α), so it is not recommended for most practical applications (Brown et al. 2001).

The Wilson method is calculated by solving for the lower and upper levels of p separately, with zα/2 to solve for the lower limit of p and z1−α/2 to solve for the upper limit. It performs better than either the Wald or Clopper-Pearson intervals.

The Agresti-Coull (or adjusted Wald) method is recommended when assessing if p is statistically significantly different than some specified benchmark. The method involves 'adding two successes and two failures' to the numerator and denominator of the proportion. This fix stabilises the confidence interval and improves coverage probability (that is, confidence intervals more closely match the designated confidence level) compared to the conventional Wald interval. A further advantage is that the lower limit cannot be negative (Agresti and Coull 1998; Agresti and Caffo 2000).

The Jeffreys method. The previous four methods are based on frequentist statistics; that is, the parameter is a fixed quantity that is expected to occur with a desired confidence (usually 95%) within the confidence interval.
The Jeffreys method is Bayesian, in that it assumes the parameter is a random variable that lies within a credible interval that incorporates both prior information for the binomial proportion and the probability of the observed data. It is based on the beta distribution, with confidence limits:

[β(α/2; n1 + 0.5, n − n1 + 0.5), β(1 − α/2; n1 + 0.5, n − n1 + 0.5)]

where β(α; b, c) is the αth percentile of the beta distribution with shape parameters b and c, and priors set to 0.5.

Brown et al. (2001) and Newcombe (1998) provide comprehensive reviews of all methods. They recommend either the Wilson or Jeffreys interval for small N, and the Agresti-Coull interval for larger N. Calculations for all methods can be performed in SAS, either in proc freq (Wald, Clopper-Pearson, Wilson, Agresti-Coull, Jeffreys) with option CL = ALL, or customised (Newcombe 1998; Hu 2015; Appendix 10.B). R code is available with 'descTools' (https://cran.r-project.org/package=DescTools/). Hartnack and Roos (2021) have an excellent tutorial on binary variables.

Example: Simulated Equine Leg Bone Fractures

(Data from Hartnack and Roos 2021). Investigators used an impact device to simulate kicks on cadaveric leg bones from N = 16 horses to test the effect of horseshoe material on fracture rates. Fractures from a horn impactor were observed in 2 of 16 leg bones, p = 2/16 = 0.125. What is the 95% confidence interval?

The following 95% confidence intervals were estimated using SAS proc freq (Appendix 10.B), based on Newcombe (1998):

Method | 95% confidence interval
Wald | −0.037, 0.287
Wilson | 0.035, 0.360
Clopper-Pearson (exact) | 0.016, 0.384
Agresti-Coull | 0.022, 0.373
Jeffreys Bayesian | 0.027, 0.344

Note that the large-sample Wald approximation gives inappropriate negative values for the lower confidence limit, although the interval width is approximately the same as that for the Wilson method. The Clopper-Pearson (exact) method gives the widest interval, and the Jeffreys the narrowest.

Sample size determinations. For absolute precision d, sample size is

N = z²1−α/2 p(1 − p)/d²

Sample size for relative precision is estimated as

N = z²1−α/2 (1 − p)/(d² p)

When little or no preliminary information on precision is available, d can be approximated using simple rules of thumb:

When events are 'rare' (p < 10%), then d is ~p/2.
For most studies, p will be between 10% and 90%; then d is ~5–10%.
When events are 'very common' (p > 90%), then d is ~(1 − p)/2.

For absolute precision, maximum sample size is attained at 50% prevalence; that is, N is maximum when p = 0.5. Therefore, when there is no information on p, sample size can be approximated with the so-called 'worst case' value p = 0.5 (Figure 10.1).

Figure 10.1: Sample size requirements for absolute precision of a proportion or prevalence p. Note that with absolute precision, maximum sample size N occurs at p = 0.5.
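Returning to the equine fracture example, the interval estimates can be obtained with proc freq, as noted earlier. The following minimal sketch is illustrative only (it is not the Appendix 10.B programme); with order=data, the binomial option evaluates the first-listed level ('yes'):

*Sketch: confidence intervals for p = 2/16 fractures;
data fractures;
input fracture $ count;
datalines;
yes 2
no 14
;
run;
proc freq data=fractures order=data;
tables fracture / binomial(cl=all); *all available interval methods;
weight count;
run;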
Because it is a fixed value, absolute precision can produce impossibly wide confidence intervals for small values or when events are rare.

If sample size is based on relative precision, sample size increases as prevalence decreases (Figure 10.2). Relative precision can estimate the interval without a lot of bias, but may require substantial and often impossibly large sample sizes. Sample size can be reduced by increasing α and, more drastically, by increasing d. However, these make estimates less precise.

Figure 10.2: Sample size requirements for relative precision. Sample size N declines with increasing p.

Example: Planning Trial Recruitment Numbers: von Willebrand Disease in Dobermans

Investigators planned a prospective observational study to determine differences in coagulation profiles between healthy Doberman Pinschers and those autosomally recessive for the mutation for von Willebrand disease (vWD) and, therefore, at high risk for associated bleeding. They determined from the literature that they could expect that 25% of Dobermans are homozygous affected (pHOM).

How many dogs need to be recruited and screened (N) to ensure the proportion of homozygous affected (nHOM) dogs in the sample is within five percentage points of the true value with 95% confidence?

This is an absolute precision problem. The absolute precision range is ±5% of the population prevalence, or 20–30%. The expected number needed to screen (N) is

N = 1.96² × 0.25 (1 − 0.25)/0.05² ≈ 288.12 = 289

The expected number of homozygous affected nHOM = N pHOM is therefore

nHOM ≈ 289 × 0.25 ≈ 72

Therefore, the investigators need to recruit approximately 300 subjects to capture at least 72 homozygous affected dogs within the general Doberman population.

How many dogs must be sampled to capture the expected proportion of affected dogs within 10% of the true population prevalence with 95% confidence?

This is the relative precision question. The relative precision range is 10% of the true prevalence of 25% (0.1 × 0.25), or 22.5–27.5%:

N = z²1−α/2 (1 − p)/(d² p) = 1.96² × 0.75/(0.1² × 0.25) = 1152.48 ≈ 1153

Example: Estimating Prevalence of Feline Calicivirus (FCV) in Shelter Cats

A researcher wished to estimate the prevalence of feline calicivirus (FCV) in shelter cats. Prior data established that true prevalence was unlikely to exceed 30%.

How many cats need to be sampled to obtain prevalence of FCV within 5 percentage points of the true value with 95% confidence?

This is an absolute precision problem, so:

N = z²1−α/2 p(1 − p)/d², then N = 1.96² (0.3 × 0.7)/0.05² = 322.7 ≈ 323 cats

How many cats need to be sampled if the estimate of FCV prevalence is to occur within 10% of the true value with 95% confidence?

This is a relative precision problem:

N = z²1−α/2 (1 − p)/(d² p) = 1.96² × 0.7/(0.1² × 0.3) = 896.4 ≈ 897 cats

With 5% relative precision, sample size increases to 3586 cats.
118 A Guide to Sample Size for Animal-based Studies

10.6 Multinomial samples Example: Categories of Bladder Disease


in Dogs
Multinomial outcomes describe the proportion of
observations occurring in each of k categories. Data from Jones et al. (2020) Histological diag-
For some applications, the most appropriate sam- noses of five categories of bladder disease were
ple size for a specified precision will be based on made for 267 dogs. Percentages were cystitis
estimates of the SCI for all categories rather than 30%; neoplasia 28%, normal 15%, urolithiasis
confidence intervals calculated for separate cate- 14%, other 13%.
gory cells (Tortora 1978; Thompson 1987). Sam- What sample size for a new study would enable
ple size is found numerically by selecting the determination of the observed sample proportion
largest p (or the p closest to 0.5), then calculating with a precision of 5% with 95% confidence? With
N for that pi. The probability threshold is a precision of 10%?
adjusted by dividing the probability by the num- There are k = 5 categories, α = 0.05 for 100
ber of groups, k (See Appendix 10.C for SAS com- (1 − α) = 95%; d = 0.05. The critical value for
putation code). z20 025 5 is 3.291. The largest p is for the cystitis
Sample size for absolute precision is category, so p = 0.30. Then
2 2
N abs = max z2α 2 k p i 1 − p i d2 N abs = 3 291 0 3 1−0 3 0 05 = 909 5 910

For a precision of 10%, d = 0.10, and


where d is the target precision, z2α 2 k is the two-
tailed critical z-value with probability level α/2/k N abs = 3 291 2
0 3 1−0 3 0 10 2
= 227 4 228
for k categories, and max is the maximum value
across all k cells.
Sample size for relative precision is calcu-
lated as

N rel = max z2α 2 k 1 − pi d2 pi 10.7 Skewed Count Data


The Poisson and the negative binomial distributions
‘Worst case’ sample size. If there is little or no are applicable to discrete count data (Box 10.4). Sim-
information as to what the expected proportions ulation results show that likelihood-based methods
should be, set p = 0.5. Then for calculating sample size and power for negative
binomial or Poisson distributed data are preferred
N worst = z2α 0 5 1 − 0 5 d2 to nonparametric methods, such as Wilcoxon test
2 k
(Aban et al. 2008). Count data are not normally dis-
tributed, usually highly skewed, so it is inappropri-
Older papers may present the multinomial ate to describe these data with normal distribution
sample size calculation as a product of the parameters such as the mean and standard devia-
χ 2 distribution: = max χ 2α k,1 p 1 − p d2i , with tion. Assuming a normal distribution to estimate
the critical value of χ 2 at probability level α/k and confidence intervals will result in confidence inter-
1 degree of freedom. Under certain conditions, χ 2 vals that are too narrow unless the sample size is
and z2 are mathematically identical to t2 and F- very large.
statistics: Poisson distributions are used for modelling inde-
pendent items distributed randomly in time or
space. The Poisson parameter estimates a ‘rate’ of
z2α 2 k = χ2α k,1 = t 2α 2 k = F α 1 ,1
events occurring per unit time or per some linear
Confidence Intervals and Precision 119

BOX 10.4 BOX 10.5


Distributions for Count Data Examples of Poisson applications

Poisson. Count data expressed as a rate (events/unit ▪ Number of birds visiting a feeder over a day
time, length, area, volume) ▪ Cells per square in a haemocytometer
▪ Number of foalings per year
Variance = mean ▪ Number of adverse events in a clinical trial
▪ Bacterial cell counts per mm2
Negative binomial. Count data showing aggrega-
▪ Arrival rate of dogs at a clinic or shelter
tion, usually with over-representation of zeros
▪ Number of cases of bubonic plague per year
Variance > > mean ▪ Biological dosimetry, mutation detection &
mutation rates
▪ Spiking activity rates of individual neurons;
inter-spike intervals (sometimes)
metric such as distance, area, or volume. A key fea- ▪ Lifetime encounter rate between parasitoid
and host
ture of the Poisson model is that the mean and var-
iance are equal (equidispersion). ▪ Visits of pollinators to a flower
Aggregated data are more appropriately modelled ▪ Predator attack rates
using a negative binomial distribution. Many biolog- ▪ Accident rates
ical count distributions are characterised by a highly
skewed and uneven frequency distribution of sub-
jects for a given characteristic, such that only a e − λ λx
Pr X = x =
few subjects have the characteristic, and many or x
most subjects do not. This is called aggregation or
clumping. Aggregated distributions are charac- where λ is the mean occurrence rate or number of
terised by over-representation of zeros and relatively events x per unit time and e is a constant
few observations in the remaining categories. The (2.718282). Because the mean and variance are
negative binomial distribution differs from the equal, they are both estimated as
Poisson in that the variance exceeds the mean, a
property referred to as over-dispersion. If over- λ=x N
dispersion is present and a Poisson distribution is
incorrectly assumed, then the confidence intervals The number of events x that occur during a time
will be too wide. The negative binomial distribution interval t is estimated by substituting λ = at, so that
is widely used for ecological applications (White and the number of events x expected during any time
Bennetts 1996; Stoklosa et al. 2022). The best-known interval t is a.
examples are patterns of macroparasite distribution Model fit is assessed by evaluation of the scaled
among hosts (Pennycuick 1971; Shaw et al. 1998; deviance goodness of fit statistic. If the Poisson
McVinish and Lester 2020). model is not a good fit, the deviance statistic will
be >1.
Sample size. The large-scale approximation for
10.7.1 Poisson Distribution sample size N for a target precision d is
The Poisson distribution is the limiting case of the
binomial distribution where the binomial probabil- z1 − α 2
2
N= λ
ity p for each ‘trial’ or event occurrence is unknown, d
but the average p is known. Examples are given in
Box 10.5. Confidence intervals. The Poisson distribution
The probability of x number of events during an converges to the binomial distribution at large sam-
interval t is ple sizes (or as the interval between events
120 A Guide to Sample Size for Animal-based Studies

approaches zero). Therefore, for x > 20 and for first-


objectives (Andersson and Kettunen 2021).
pass approximations, the two-sided confidence
A new study on fish lifespan and senescence is
interval using the critical z-value can be used. Con-
planned, with stocking density as an explanatory
fidence intervals estimated by the large-sample nor-
variable. However, it is necessary to establish pre-
mal approximations are given by:
liminary baseline estimates for the maximum
number of deaths that might be expected and
λ ± z1 − α 2 λ N
the amount of time until all fish in a tank might
be expected to die.
When the data are x events in a specified time
Outbred zebrafish have an average lifespan of
period t, the confidence interval is computed by sub-
approximately 42 months (Gerhard et al. 2002).
stituting t for N in the denominator.
Suppose routine husbandry data indicated that
When x < 20, exact confidence intervals are
zebrafish mortality per tank averaged approxi-
recommended. The lower and upper confidence
mately three fish per week after 42 months.
limits are obtained from the χ2 distribution as
Given this attrition rate, what is the maximum
0 5 χ 2α 2,2x 0 5 χ 21 − α 2,2 x + 1
number of weekly deaths per tank that could be
, expected 95% of the time?
t t
λ = Mean number of deaths per tank per week = 3
where x is the number of events and t is the time
period (Hahn and Meeker 1991; Meeker et al. x = Total number of deaths
2017). (Appendix 10.D shows SAS code for calcu-
lating the critical χ2 values and confidence e − 3 3x
intervals). Solve for = 0 95 by iterating over the
x
candidate values of x and selecting the value of
x that gives a cumulative probability closest to
Example: Visitation Rates of Sunbirds to 0.95. If no calculated cumulative probabilities
Flowers: Confidence Intervals are exactly equal to 0.95, round up to the next
(Data from Burd 1994) In a study of avian polli- value. The SAS code to approximate the target
nators, sunbirds made 114 visits to giant lobelia 95% confidence level is given in Appendix 10.D.
flowers over approximately 90 hours of observa- The output is as follows:
tion, averaging 1.05 visits/plant/h. It is assumed
arrivals occur independently of each other at a
Number of Exact Cumulative
constant rate. What is the 95% confidence inter- deaths probability probability
val for the visitation rate?
0 0.050 0.050
The visitation rate λ = 1 05 The conservative
1 0.149 0.199
95 confidence interval is 1 05 ± 1 96 1 05 90 2 0.224 0.423
= 0 84, 1 26 visits plant h 3 0.224 0.647
4 0.168 0.815
5 0.101 0.916
6 0.050 0.966
Example: Zebrafish Mortality and
Experiment Requirements 7 0.022 0.988
8 0.008 0.996
Current recommended stocking densities for lab-
oratory zebrafish (Danio rerio) welfare are 5/L, or 9 0.003 0.999
50 per 10 L tank, although density requirements 10 0.001 1.000
may vary according to fish life-stage and research
Confidence Intervals and Precision 121

unless the sample is large and represents a random


The cumulative Poisson probability indicates
sample from the population (Gardner et al. 1995).
that a maximum number of six deaths per week
The mean of the negative binomial is
can be expected at least 95% of the time.
If the experiment starts with 50 zebrafish in m=x p
each tank, at the expected rate of attrition of 3 fish
per week per tank after 42 months (168 weeks), where m is the mean, x is ‘success’, and p is the prob-
how long can the experiment last? ability of a ‘success’ on any given trial. The mean
can be calculated from total number of ‘events’
Since λ = a t, the time to 50 deaths (E = x i ) divided by the number of experimental
t = a λ = 50 3 17 weeks units N:
m=E N
The expected duration is 17 + 168 = 185 weeks
46 months. The variance is s2 = m + m2 k and
k = m 2 s2 − m

10.7.2 Negative Binomial Distribution The parameter k is estimated by maximum like-


Counts of individuals in nature frequently follow a lihood (Cox 1983; Ross and Preece 1985; Shaw
negative binomial distribution, where individuals et al. 1998; Saha and Paul 2005; Lloyd-Smith
are clumped rather than being randomly dispersed. 2007; Shilane et al. 2010). In SAS, it is most easily
Examples are given in Box 10.6. The negative bino- obtained by fitting a generalised linear model in
mial is described by two parameters, the mean μ and proc genmod, proc countreg, proc logistic, or proc
the dispersion parameter k. When k is small, the probit, and obtaining k. If k is known, it can be
population becomes more clumped, and therefore incorporated into the model as a constant.
shows much more skew in the distribution than a Sample size is approximated as
Poisson with the same mean. As the variance of
the negative binomial approaches the mean, k zα 2 1 1
N= 2 +
becomes very large and the distribution converges d m k
to the Poisson distribution. Unfortunately, it is usu-
ally difficult to distinguish between competing dis- (Cundill and Alexander 2015). Methods for sample
tributional models by the usual inference tests size determination for two-group comparisons are
described in Chapter 17.
Confidence intervals. Methods for estimating con-
BOX 10.6 fidence intervals depend on the size of the sample N
Examples of Negative Binomial Applications and the magnitude of k (Shilane et al. 2010). Confi-
dence intervals estimated from the conventional
▪ Parasite distribution among hosts
large-sample approximations (±z1-α SE) can result
▪ Hospital length of stay
in under-coverage (confidence intervals that are too
▪ Number of clinic readmissions
narrow). This is primarily because k is over-
▪ Brain lesion counts
estimated when N is too small, or bias due to
▪ Frequency of disease exacerbations
under-counting zero class events (Lloyd-Smith
▪ Field quadrat counts of organisms
2007). If k is large, the Wald approximation or boot-
▪ Bacterial counts per microscope field of view
strapping will provide adequate coverage, regard-
▪ Cells per haemocytometer square
less of the size of n. When the number of
▪ Ear posture in sheep
successes r is known, the confidence interval for p
▪ Incidence rate of mastitis in cattle.
may be calculated from the quantiles of the χ2 dis-
▪ Incidence of infectious diseases
tribution, but these may result in over-coverage.
▪ Fish net catch data
Methods for testing differences in means and k
▪ Species richness and diversity
between subgroups are described in Shaw
▪ Tag counts across multiple gene libraries
et al. (1998).
122 A Guide to Sample Size for Animal-based Studies

Example: Parasitic worm burdens in Soay


sheep: Estimates from prior data 60

(Data from Gulland 1992). Lungworm counts

Number of leaves
were obtained for 67 Soay sheep that died during
the 1989 population crash on St Kilda. Worm 40

burdens were described by a negative binomial


distribution, with mean of 47.5 worms/sheep
and k = 0.841. The variance of the worm count is 20

s2 = 47 5 + 47 52 0 841 = 2730 3
0
What sample size for a new study would 0 1 2 3 4 5 6 7
Number of mites
enable the determination of the expected mean
worm count with a precision of 10% and 95% Figure 10.3: Counts of female mites on apple leaves
confidence? follow a negative binomial distribution.
Source: Data from Bliss and Fisher (1953).

1 96 2 1 1
N= 2 + = 317 5 318 sheep
01 47 5 0 841
10.A SAS Code for
Computing Simultaneous
Confidence Intervals (Data
Example: Counts of Red Mites on Apple from German et al. 2015)
Leaves: Estimating μ and k from Raw Data %let p=20;
DATA DOG;
(Data from Bliss and Fisher 1953). A total of 172 INPUT ID $ PROTEIN ARGIN HIST ISOLEUC METCYS
female mites were counted on 150 apple leaves LEUC LYSINE PHETYR THRE TRYPT VALINE TFAT
(Figure 10.3). Appendix 10.E provides data and LINOL CALCIUM PHOS MG SODIUM K CHLORIDE IRON
SAS code. COPPER ZN MANG SEL IODINE VITA VITD3 VITE
THIAMIN RIBO PYRID NIACIN PANTO COBAL FOLIC;
Simple descriptive statistics (mean, variance)
were calculated from the raw count data. The *calculate standardised variables;
mean number of mites per leaf is 172/ data new;
150 = 1.147 and the variance is s2 = 2.274. The set DOG;
data are clearly over-dispersed with an over- var="PROTEIN"; ratio=PROTEIN/3.28; output;
var="METCYs"; ratio=METCYS/0.21; output;
representation of zeros. The Poisson distribution var="CALCIUM"; ratio=CALCIUM/0.13; output;
is not a good fit, as the variance is nearly twice as var="PHOS"; ratio=PHOS/0.1; output;
large as the mean, and the scaled deviance statis- var="MG"; ratio=MG/19.7; output;
tic is 1.91 (much greater than 1.0). keep var ratio; run;
The negative binomial provided a better fit to
proc sort; by var; run;
the data. The maximum likelihood estimate for
k is 0.976. The standard error is, therefore *calculate summary statistics means SD and n
for each standardised variable;
proc means;
SE = 1 147 + 0 967 150 = 1 074 by var;
var ratio;
output out=a n=n mean=xbar var=s2;
The asymptotic 95% confidence interval is run;
1.147 ± 1.96(1.074) = [−0.95, 3.18], or approxi-
mately *calculate simultaneous confidence intervals;
(−1, 3.2) mites per leaf. data b;
set a;
Confidence Intervals and Precision 123

f=finv(0.95,&p,n-&p); *set confidence level, *calculate upper limit;


here (1-alpha) = 0.95; ratio =xbar+sqrt(&p*(n-1)*f*s2/(n-&p)/n);
output;
ratio=xbar; output; run;

*calculate lower limit; proc print; run;


ratio =xbar-sqrt(&p*(n-1)*f*s2/(n-&p)/n); quit;
output;

10.B Sample SAS Code for Computing Confidence Intervals


for a Single Sample Proportion where x is the Number of
Events, N is the Sample Size, and Proportion p = x/N (Adapted
from Newcombe 1998; Hu 2015)

SAS proc freq Wilson method


proc freq data=test; data Wilson;
tables outcome / binomialc (CL=ALL); x = 2;
weight Count; n = 16;
run; alpha = 0.05;
p = x / n;
Wald method (large-sample normal) q = 1-p;
data wald; z = probit (1-alpha/2);
x = 2; L = ( 2*x+z**2 - (z*sqrt(z**2+4*x*q))) /
n = 16; (2*(n+z**2));
alpha = 0.05; U = ( 2*x+z**2 +(z*sqrt(z**2+4*x*q)) ) /
p = x / n; (2*(n+z**2));
z = probit (1-alpha/2); put L= U=;
*calculate standard error; run;
se = (sqrt(n*p*(1-p)))/n;
L = p - z * se; *Lower confidence
limit; Agresti-Coull method
U = p + z * se; *Upper confidence data AC;
limit; x= 2;
put L= U= ; n = 16;
run; alpha = 0.05;
z = probit (1-alpha/2);
do psi = z**2/2, 2, 1, 3;
Clopper –Pearson method p2=(x+psi)/(n+2*psi);
data CP; L = p2 - z*(sqrt(p2*(1-p2)/(n+2*psi)));
x = 2; U = p2 + z*(sqrt(p2*(1-p2)/(n+2*psi)));
n = 16; put L= U= ;
alpha = 0.05; output;
L = 1 - betainv(1 - alpha/2,n-x+1,r); end;
U = betainv(1 - alpha/2,r+1 ,n-x); run;
put L= U=; *Lower & upper confidence
limit;
run; Jeffreys Bayesian method
data jeffreys;
x = 2;
n = 16;
alpha = 0.05;
*priors are set to 0.5;
L = betainv( alpha/2, x+0.5, n-x+0.5);
U = betainv(1-alpha/2, x+0.5, n-x+0.5);
put L= U=;
run;
124 A Guide to Sample Size for Animal-based Studies

*defining exact confidence intervals when


10.C SAS Code for Calculating number of events
the Critical Values for z(α/2)/k data exact;
alpha = 0.05; *define desired type
and χ 2α k,1 I probability;
x = [..]; *define desired number of
events x;
lower = quantile('CHISQ', alpha/2,2*x)/2;
upper = quantile('CHISQ',1-alpha/2,2*(x+1))/2;
10.C.1 Calculating the critical values run;
for z(α/2)/k proc print; *output;
run;
data zval;
alpha=0.05;
k = 5; *number of categories - in this Code to approximate the target 95% confidence
example k equals 5; level by iterating over a range of sample sizes n
Pr = (alpha/2)/k; *for a two-sided and selecting the value that gives a cumulative prob-
probability; ability closest to 0.95.
z = quantile('normal',Pr);
z2 = z*z;
%let nmax = 10; *define a reasonable estimate
run;
of the maximum;
proc print; *output;
data event;
run;
lambda = 3; * the expected mean event
rate λ;
10.C.2 Calculating the critical values *calculate n! factorial;
for χ 2α k,1 do n = 0 to &nmax by 1;
f=fact(n);
data chisq;
alpha = 0.05; *desired type I probability; *calculate probabilities for number of events;
k = 5; *number of categories - in this prob = exp(-lambda)*(lambda**n)/f;
example k equals 5; if n=0 then cumprob=prob;
df = 1; else cumprob=cumprob+prob;
Pr = 1- alpha/k; output;
chi_crit=cinv(Pr,1); end;
put chi_crit = ; run;
run; proc print;
proc print; *output; run;
run;

10.D SAS Code for 10.E Evaluating Poisson and


Calculating Confidence Limits negative binomial
for Poisson Data distributions for fitting counts
*large scale normal;
data large;
of red mites on apple leaves
alpha = 0.05; *define desired type I (Data from Bliss and
probability;
x = [...]; *define desired number of Fisher 1953)
events x;
lambda= [event rate]; data mite;
lower = lambda - probit(alpha/2)*sqrt(lambda/ input n @@;
x); datalines;
upper = lambda + probit(1- alpha/2,)*sqrt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
(lambda/x); 0 0
run; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
proc print; *output; 0 0
run;
Confidence Intervals and Precision 125

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 Cox, D.R. (1983). Some remarks on overdispersion. Bio-


1 1 metrika 70: 269–274.
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Cumming, G. (2012). Understanding the New Statistics:
1 1 Effect Sizes, Confidence Intervals, and Meta-Analysis.
1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3
3 3 New York: Routledge.
3 3 3 3 3 4 4 4 4 4 4 4 4 4 5 5 5 6 6 7; Cundill, B. and Alexander, N.D. (2015). Sample size cal-
culations for skewed distributions. BMC Medical
run; Research Methodology 5: 28. https://doi.org/
proc means n sum mean var maxdec=2; 10.1186/s12874-015-0023-0.
run; Davison, A.C. and Hinkley, D.V. (1997). Bootstrap Meth-
ods and Their Application. Cambridge: Cambridge
*Poisson;
proc genmod data=mite; University Press.
model n = / dist=poisson scale=pearson Efron, B. and Tibshirani, R.J. (1993). An Introduction to
CL; the Bootstrap. Boca Raton: Chapman & Hall/CRC.
run; Gardner, W., Mulvey, E.P., and Shaw, E.C. (1995).
Regression analyses of counts and rates: Poisson, over-
*negative binomial; dispersed Poisson, and negative binomial models. Psy-
proc genmod data=mite; chological Bulletin 118: 392–404.
model n = / dist=negbin CL;
run;
Gerhard, G.S., Kauffman, E.J., Wang, X. et al. (2002). Life
spans and senescent phenotypes in two strains of Zeb-
rafish (Danio rerio). Experimental Gerontology 37 (8-9):
1055–1068. https://doi.org/10.1016/s0531-5565
(02)00088-8.
References German, A.J., Holden, S.L., Serisier, S. et al. (2015).
Aban, I.B., Cutter, G.R., and Mavinga, N. (2008). Infer- Assessing the adequacy of essential nutrient intake
ences and power analysis concerning two negative in obese dogs undergoing energy restriction for weight
binomial distributions with an application to MRI loss: a cohort study. BMC Veterinary Research 11: 253.
lesion counts data. Computational Statistics and Data https://doi.org/10.1186/s12917-015-0570-y.
Analysis 53 (3): 820–833. https://doi.org/ Gulland, F.M.D. (1992). The role of nematode parasites
10.1016/j.csda.2008.07.034. in Soay sheep (Ovis aries L.) mortality during a popu-
Agresti, A. and Caffo, B. (2000). Simple and effective con- lation crash. Parasitology 105 (Pt 3): 493–503.
fidence intervals for proportions and differences of https://doi.org/10.1017/s0031182000074679.
proportions result from adding two successes and Hahn, G.J. and Meeker, W.Q. (1991). Statistical Intervals:
two failures. The American Statistician 54: 280–288. A Guide for Practitioners. New York: Wiley.
https://doi.org/10.2307/2685779. Hartnack, S. and Roos, M. (2021). Teaching: confidence,
Agresti, A. and Coull, B.A. (1998). Approximate is better prediction and tolerance intervals in scientific practice:
than ‘exact’ for interval estimation of binomial propor- a tutorial on binary variables. Emerging Themes in Epi-
tions. The American Statistician 52: 119–126. https:// demiology 18 (1): 17. https://doi.org/10.1186/
doi.org/10.2307/2685469. s12982-021-00108-1.
Andersson, M. and Kettunen, P. (2021). Effects of hold- Hu, J. (2015). Confidence intervals for binomial propor-
ing density on the welfare of zebrafish: a systematic tion using SAS®: the all you need to know and no
review. Zebrafish 2021: 297–306. https://doi.org/ more. Paper SD103. https://www.lexjansen.com/
10.1089/zeb.2021.0018. sesug/2015/103_Final_PDF.pdf (accessed 2022).
Bliss, C.I. and Fisher, R.A. (1953). Fitting the negative Jones, E., Alawneh, J., Thompson, M. et al. (2020). Pre-
binomial distribution to biological data. Biometrics dicting diagnosis of Australian canine and feline uri-
9 (2): 176–200. https://doi.org/10.2307/3001850. nary bladder disease based on histologic features.
Brown, L.D., Cai, T.T., and Das Gupta, A. (2001). Interval Veterinary Sciences 7 (4): 190. https://doi.org/
estimation for a binomial proportion. Statistical Sci- 10.3390/vetsci7040190.
ence 16 (2): 101–133. https://doi.org/10.1214/ Lloyd-Smith, J.O. (2007). Maximum likelihood estima-
ss/1009213286. tion of the negative binomial dispersion parameter
Burd, M. (1994). A probabilistic analysis of pollinator for highly overdispersed data, with applications to
behaviour and seed production in Lobelia deckenii. infectious diseases. PLoS ONE 2 (2): e180. https://
Ecology 75: 1635–1646. doi.org/10.1371/journal.pone.0000180.
126 A Guide to Sample Size for Animal-based Studies

Lwandga, S.K. and Lemeshow, S. (1991). Sample Size Shaw, D.J., Grenfell, B.T., and Dobson, P. (1998). Pat-
Determination in Health Studies: A Practical Manual. terns of macroparasite aggregation in wildlife host
Geneva: World Health Organization. populations. Parasitology 117: 597–610. https://
McVinish, R. and Lester, R.J.G. (2020). Measuring aggre- doi.org/10.1017/S0031182098003448.
gation in parasite populations. Journal of the Royal Shilane, D., Evans, S.N., and Hubbard, A.E. (2010). Con-
Society Interface 17: 20190886. https://doi.org/ fidence intervals for negative binomial random vari-
10.1098/rsif.2019.0886. ables of high dispersion. The International Journal of
Meeker, W.Q., Hahn, G.J., and Escobar, L.A. (2017). Sta- Biostatistics 6 (1): Article 10. https://doi.org/
tistical Intervals: A Guide for Practitioners and 10.2202/1557-4679.1164.
Researchers, 2e. New York: Wiley. Stoklosa, J., Blakey, R.V., and Hui, F.K.C. (2022). An
Newcombe, R.G. (1998). Two-sided confidence intervals overview of modern applications of negative binomial
for the single proportion: comparison of seven meth- modelling in ecology and biodiversity. Diversity 2022
ods. Statistics in Medicine 17: 857–872. (14): 320. https://doi.org/10.3390/d14050320.
Pennycuick, L. (1971). Frequency distributions of Thompson, S.K. (1987). Sample size for estimating mul-
parasites in a population of three-spined sticklebacks, tinomial proportions. The American Statistician 41 (1):
Gasterosteus aculeatus L, with particular reference 42–46.
to the negative binomial distribution. Parasitology Tortora, R. (1978). A note on sample size estimation for
63 (3): 389–406. https://doi.org/10.1017/ multinomial populations. The American Statistician
S0031182000079920. 32 (3): 100–102.
Ross, G.J.S. and Preece, D.A. (1985). The negative bino- White, G.C. and Bennetts, R.E. (1996). Analysis of fre-
mial distribution. Statistician 34: 323–336. quency count data using the negative binomial distri-
Saha, K. and Paul, S. (2005). Bias-corrected maximum bution. Ecology 77 (8): 2549–2557.
likelihood estimator of the negative binomial disper-
sion parameter. Biometrics 61: 179–185.
11
Prediction Intervals

CHAPTER OUTLINE HEAD


11.1 Introduction 127 11.2.3 Continuous Data, Linear
11.2 Prediction Intervals: Continuous Data 127 Regression 129
11.2.1 Continuous Data, Single 11.3 Prediction Intervals: Binary Data 130
Observation 128 11.4 Prediction Intervals: Meta-Analyses 130
11.2.2 Continuous Data, Comparing References 131
Two Means 128

analyses (Riley et al. 2011; Inthout et al. 2016; Deeks


11.1 Introduction et al. 2022). Karl Pearson called prediction ‘the
A prediction interval is an interval within which one fundamental problem of practical statistics’
or more future observations from a population will (Pearson 1920).
fall with a specified degree of confidence (1 − α). Prediction intervals differ from confidence inter-
Prediction intervals are useful for a variety of vals in that a prediction interval accounts for both
applications (Box 11.1), although they are most the random variation in the observations and the
often used for regression when the variable of inter- uncertainty in the estimation of the population
est is expected to vary with a second explanatory mean, rather than random variation alone. As a
variable or covariate. Other applications include result, prediction intervals are always wider than
reliability and survival analysis (Landon and the corresponding confidence interval and, unlike
Singpurwalla 2008), ‘personalised’ reference inter- confidence intervals, will not converge to zero as
vals in laboratory medicine (Coşkun et al. 2021), N increases (Hahn and Meeker 1991; NIST/SEMA-
assessing feasibility based on pilot data, replication TECH 2012; Meeker et al. 2017).
studies (Spence and Stanley 2016), and meta-

BOX 11.1
11.2 Prediction Intervals:
Applications of Prediction Intervals Continuous Data
▪ ‘Personalised’ reference intervals Regression The prediction interval is calculated as
models
▪ Forecasting models ynew ± t − value SEnew
▪ Relationship between two traits for an unmeas-
ured species in phylogenetic analyses where ynew is the sample estimate for the new pre-
▪ Replication studies dicted observation and SEnew is the standard error
▪ Meta-analyses. of the new predicted value.

A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
128 A Guide to Sample Size for Animal-based Studies

11.2.1 Continuous Data, Single 81 5 ± 2 052 19 6


Observation
1
The standard error SE for a prediction interval for a 1+ = 40 5, 122 5 μmol L
27
single new observation ynew is
Therefore creatinine of a single future randomly
1 1 selected obese dog will occur between (40.5,
SE = s +
m N 122.5) μmol/L in 95 out of 100 cases.

where s is the standard deviation and N is the num-


ber of observations in the original sample, and m is
the number of new observations. Then the predic-
tion interval is
11.2.2 Continuous Data, Comparing
Two Means
The standard error for the prediction interval for the
1 1
ynew ± t 1 − α 2, N − 1 s + difference between two means is
m N
1 1 1 1
where t1 − α/2,N − 1 is the 100(1 − α/2) percentile of SE = sp + + +
m1 m2 n1 n2
the t-distribution with N − 1 degrees of freedom.
For a single future observation, m = 1 and the equa- where the pooled standard deviation is
tion reduces to

n1 − 1 s21 + n2 − 1 s22
1 sp =
ynew ± t1 − α s 1+ n1 + n2 − 2
2, N − 1
N
Here, n1 and n2 are the sample sizes for each group,
The two-sided prediction interval 100(1 − α)% for s21 and s22 are the variances for each group, and m1
the difference between means y1 − y2 is and m2 are the number of new observations.

y1 − y2 ± t1 − α 2,n1 + n2 − 2 SE
Example: Predictions Based on a Fixed
Sample Size: Anaesthetic Immobilisation
The one-sided lower prediction limit of the differ- Times in Mice
ence between future observations is
(Data from Dholakia et al. 2017). Time of com-
y1 − y2 − t α,n1 + n2 − 2 SE plete immobilisation following anaesthesia was
measured in two groups of 16 CD-1 male mice
randomly allocated to receive intraperitoneal
injections of either ketamine-xylazine (KX) or
Example: Predicting a Single Future Value: ketamine-xylazine combined with lidocaine
Creatinine Levels in Obese Dogs (KXL). Immobilisation time averaged 38.8 (SD
7.9) min for KX mice and 33.3 (SD 3.9) min for
(Data from German et al. 2015). Serum creatinine KXL mice for a difference of 5.5 minutes.
was measured in 27 obese dogs before entering a If a new study is planned using only five mice
weight loss programme, averaging 81.5 (SD per group, what is the expected difference in
19.6) μmol/L. What is the predicted value of cre- future immobilisation times for KX mice com-
atinine for a single future obese dog at α = 0.05? pared to KXL with 95% confidence?
Prediction Intervals 129

Summary data for the original study are y1 =


11.2.3 Continuous Data, Linear
38.8, y2 = 33.3, n1 = n2 = 8, and t1−0 05 2,n1 + n2 −2 Regression
Prediction intervals are most often used for regres-
= 2.145. For the new study, m1 = m2 = 5
sion when one or more future observations are pre-
Then dicted from the relations between the response and
one or more explanatory (or independent) variables.
Therefore the prediction interval on the new obser-
1 1 1 1 vation will be conditional on the predictors in the
38 8 − 33 3 ± 2 145 3 9 + + +
8 8 5 5 regression model.
= 5 5 ± 6 74 = − 1 2, 12 2 min The two-sided prediction interval is

2
ynew ± tα 2,N − K + 1 MSE + SE ynew
Therefore, immobilisation times for KXL could
be as much as 12 min longer than KX or 1.2 min
shorter. where MSE is the mean square error of the regres-
sion with associated degrees of freedom N − (K + 1)
and the term MSE + SE ynew 2 is the standard
error of the prediction. In SAS, the 95% prediction
Example: Predicting a Range of Future interval for a new observation at a given value of
Observations: Creatinine Levels in X is called by the command cli in the model option
Obese Dogs statement.
(Data from German et al. 2015). Serum creatinine The ordinary least squares regression is fitted
was measured in 27 obese dogs before entering a under the assumption that the independent variable
weight loss programme, averaging 81.5 (SD is measured without error, or at least the error is
19.6) μmol/L. Suppose a new study was planned negligible compared with that for the response var-
using a sample size of 100 dogs. What range of iable (Draper and Smith 1998).
mean creatinine values can be expected with Prediction intervals must be interpreted with cau-
95% confidence? tion if there is a substantial error for values of the
For this query, the prediction interval on the independent variable. del Río et al. (2001) describe
difference between two means must be com- methods for constructing prediction intervals for
puted, and both the predicted mean and sample linear regression that account for errors on both
standard deviation are hypothetical but assumed axes. Gelman et al. (2021) provide an excellent guide
to be equal to those in the original sample. The to modern methods of regression, including Bayes-
prediction interval for the new study is then ian inference methods (incorporation of prior infor-
mation into inferences) and R commands for
computation.
s21 s2
original mean ± tdf , N 1 − 1 + 1
N1 N2 Example: Rodent p50 in Relation
19 62 19 62 to Body Mass.
= 81 5 ± 2 052 +
27 100 Oxygen affinity for haemoglobin is quantified by
= 81 5 ± 8 7 = 72 8, 90 2 P50, the partial pressure of oxygen at which hae-
moglobin is 50% saturated. In mammals, P50
scales negatively with body mass, such that small
Therefore the range of average creatinine mammals have larger P50 than large mammals
values that can be expected in a future study with (Schmidt-Nielsen and Larimer 1958). Body mass
a sample size of 100 is between 72.8 and (kg) and P50 (mm Hg) data for 23 rodent species
90.2 μmol/L. are given in Table 11.1.
130 A Guide to Sample Size for Animal-based Studies

The prediction interval is


Table 11.1: Mean body mass (kg) and P50 (mm Hg)
data for 23 rodent species.
m+n
1 2 3 4 5 m p ± z1 − α 2 m p 1−p
Species n
Body mass 0.018 0.018 0.020 0.025 0.030
Hartnack and Roos (2021) recommend the Bayesian
P50 52.0 40.0 28.8 33.5 33.2 Jeffreys prediction interval. This is based on the pos-
6 7 8 9 10 terior predictive distribution combining the bino-
Body mass 0.045 0.047 0.049 0.068 0.088 mial distribution with a conjugate beta prior.
Prediction intervals can be computed in R using
P50 33.8 53.0 32.0 41.0 29.0
the package DescTools; see Hartnack and Roos
11 12 13 14 15 (2021) for details.
Body mass 0.100 0.157 0.162 0.193 0.196
P50 38.4 25.0 39.0 23.0 29.5
Example: Predicting Number of Future
16 17 18 19 20 Events: Simulated Equine Orbital Fractures
Body mass 0.196 0.226 0.454 0.500 0.517 The major cause of long bone and facial fractures
P50 29.5 39.0 26.8 27.8 36.0 in horses is kicks from other horses. The type of
shoe can greatly affect the severity of the injury
21 22 23
and probability of bone fracture (Sprick
Body mass 1.000 1.200 3.500 et al. 2017).
P50 24.0 22.0 27.0 Investigators used a drop-weight impact device
to simulate kicks on cadaveric skulls from
Source: Compiled from Schmidt-Nielsen and Larimer (1958).
17 horses to test the effect of shod versus unshod
hoof on fracture rates. Orbital fractures from steel
The regression on log10-transformed data was shoes were observed in 12 of 17 skulls (Joss et al.
2019). How many orbital fractures can be expected
log10 P50 = 1 43 − 0 1 log10 mass in a future sample of m = 10 skulls at α = 0.05?
R2 = 0 34, MSE = 0 00736 The observed proportion of fractures is
12/17 = 0.706, n = 17, and m = 10. The approxi-
mate normal 95% prediction interval is
The 95% confidence and prediction intervals
for P50 were estimated for body mass of 0.1 kg
(values were back-transformed to original units). 37
10 0 706 ± 1 96 10 0 706 0 294
The predicted mean P50 is 33.2. The 95% confi- 27
dence intervals are (30.4, 36.2) mmHg and the = 3 75, 10 37
95% prediction intervals are (21.8, 50.5) mmHg.
Therefore, for a future sample of 10 skulls, we
can expect to find that between 4 and 11 fractures
will occur with 95% confidence.
11.3 Prediction Intervals:
Binary Data
Binary data are expressed as a proportion p. The 11.4 Prediction Intervals:
standard error SE for a prediction interval for a
new observation from a binomial distribution is Meta-Analyses
For a random-effects meta-analysis, the mean
m+n describes systematically different effects for the
SE = z1 − α 2 m p 1−p
mn compiled studies. Confidence intervals calculated
Prediction Intervals 131

for mean effects in a meta-analysis will usually be administered to CD-1 mice. PLoS ONE 12 (9):
too narrow for an adequate description of the range e0184911. https://doi.org/10.1371/journal.
of possible study effects. If the number of studies in pone.0184911.
the meta-analysis is >10, then the 100(1 − α)% pre- Draper, N.R., and Smith, H. (1998). Applied Regression
diction interval can be estimated as Analysis, 3e. New York: Wiley.
Gelman, A., Hill, J., and Vehtari, A. (2021). Regression
and Other Stories. Cambridge University Press.
2
Y ± t 1 − α,k − 2 τ2 + SE Y German, A.J., Holden, S.L., Serisier, S. et al. (2015).
Assessing the adequacy of essential nutrient intake
in obese dogs undergoing energy restriction for weight
where Y is the summary mean of absolute measures loss: a cohort study. BMC Veterinary Research 11: 253.
of effect (e.g. risk difference, mean difference, stan- https://doi.org/10.1186/s12917-015-0570-y.
2
dardised mean difference), SE Y is the variance, Hahn, G.J. and Meeker, W.Q. (1991). Statistical Intervals:
t1 − α,k − 2 is the critical t-value for 1 − α and k − 2 A Guide for Practitioners. New York: Wiley.
degrees of freedom, and τ is the estimate of the var- Hartnack, S. and Roos, M. (2021). Teaching: confidence,
iation of the true effects (heterogeneity). For relative prediction, and tolerance intervals in scientific prac-
measures, such as RR and OR, the interval must be tice: a tutorial on binary variables. Emerging Themes
calculated from the logarithm (ln) of the summary in Epidemiology 18: 17. https://doi.org/10.1186/
s12982-021-00108-1.
estimate, as is the case for confidence intervals
Higgins, J.P.T., Thompson, S.G., and Spiegelhalter, D.J.
(Higgins et al. 2009).
(2009). A re-evaluation of random-effects meta-analy-
These prediction intervals may have very poor sis. Journal of the Royal Statistical Society: Series
coverage when k is small, resulting in intervals that A (Statistics in Society) 172: 137–159.
are too narrow (Partlett and Riley 2017). Nagashima IntHout J, Ioannidis JPA, Rovers MM, Goeman JJ (2016)
et al. (2019a, b) have developed an R package pimeta Plea for routinely presenting prediction intervals in
that compiles several alternative methods for con- meta-analysis. BMJ Open, 6: e010247. https://doi.
structing prediction intervals based on bootstrap- org/10.1136/bmjopen-2015-010247
ping. The package also includes methods for Joss, R., Baschnagel, F., Ohlerth, S. et al. (2019). The risk
estimating prediction intervals when the number of a shod and unshod horse kick to create orbital
of studies is very small (k < 5). fractures in equine cadaveric skulls. Veterinary and
Comparative Orthopaedics and Traumatology 32 (4):
282–288.
References Landon, J. and Singpurwalla, N.D. (2008). Choosing a
coverage probability for prediction intervals. The
Coşkun, A., Sandberg, S., Unsal, I. et al. (2021). Persona- American Statistician 62 (2): 120–124. https://doi.
lized reference intervals in laboratory medicine: a new org/10.1198/000313008x304062.
model based on within-subject biological variation. Meeker, W.Q., Hahn, G.J., and Escobar, L.A. (2017).
Clinical Chemistry 67 (2): 374–384. https://doi. Statistical Intervals: A Guide for Practitioners and
org/10.1093/clinchem/hvaa233. Researchers, 2e. New York: Wiley.
Deeks, J.J., Higgins, J.P.T., and Altman, D.G. (2022). Nagashima, K., Noma, H., and Furukawa, T.A. (2019a).
Chapter 10: Analysing data and undertaking meta-ana- Prediction intervals for random-effects meta-analysis:
lyses. In: Cochrane Handbook for Systematic Reviews of a confidence distribution approach. Statistical Methods
Interventions version 6.3 (updated February 2022) (ed. H. in Medical Research 28 (6): 1689–1702. https://doi.
JPT, J. Thomas, J. Chandler, et al.). Cochrane Available org/10.1177/0962280218773520.
from www.training.cochrane.org/handbook. Nagashima, K., Noma, H., and Furukawa, T.A. (2019b)
del Río, F.J., Riu, J., and Rius, F.X. (2001). Prediction pimeta: prediction intervals for random-effects meta-
intervals in linear regression taking into account errors analysis. R package version 1.1.2. Available from:
on both axes. Journal of Chemometrics 15: 773–788. https://CRAN.R-project.org/package=pimeta
https://doi.org/10.1002/cem.663. (accessed 2022).
Dholakia, U., Clark-Price, S.C., Keating, S.C.J., and NIST/SEMATECH (2012) e-Handbook of Statistical
Stern, A.W. (2017). Anesthetic effects and body weight Methods www.itl.nist.gov/div898/handbook,
changes associated with ketamine-xylazine-lidocaine https://doi.org/10.18434/M32189
132 A Guide to Sample Size for Animal-based Studies

Partlett, C. and Riley, R.D. (2017). Random effects meta- https://doi.org/10.1152/ajplegacy.1958.


analysis: coverage performance of 95% confidence and 195.2.424.
prediction intervals following REML estimation. Sta- Spence, J.R. and Stanley, D.J. (2016). Prediction interval:
tistics in Medicine 36 (2): 301–317. what to expect when you’re expecting … a replication.
Pearson, K. (1920). The fundamental problem of practi- PLoS ONE 11 (9): e0162874. https://doi.org/
cal statistics. Biometrika 13 (1): 1–16. https://doi. 10.1371/journal.pone.0162874.
org/10.2307/2331720. Sprick, M., Fürst, A., Baschnagel, F. et al. (2017). The
Riley, R.D., Higgins, J.P.T., and Deeks, J.J. (2011). Inter- influence of aluminium, steel and polyurethane
pretation of random effects meta-analyses. BMJ shoeing systems and of the unshod hoof on the injury
342: d549. risk of a horse kick: an ex vivo experimental study.
Schmidt-Nielsen, K. and Larimer, J.L. (1958). Oxygen disso- Veterinary and Comparative Orthopaedics and
ciation curves of mammalian blood in relation to body Traumatology 30: 339–345. https://doi.org/
size. American Journal of Physiology 195 (2): 424–428. 10.3415/VCOT-17-01-0003.
12
Tolerance Intervals

CHAPTER OUTLINE HEAD


12.1 Introduction 133 12.A SAS and R Code for Calculating
12.2 Tolerance Interval Width and Bounds 134 Tolerance 139
12.3 Parametric Formulations 134 12.A.1 Solving for k 139
12.3.1 Two-Sided Limits 134 12.A.2 SAS Code Racehorse Medication
12.3.2 One-Sided Limits 135 Threshold Limits for N = 20,
Mean = 0.43 and Standard
12.4 Non-parametric Tolerance Limits 136
Deviation STD = 1.50 139
12.5 Determining Sample Size for Tolerance 12.A.3 R Code for Package Tolerance
Intervals 137 (Young 2010) for Racehorse
12.6 Sample Size for Tolerance Based on Medication Threshold Limits 139
Permissible Number of Failures 138 References 140

12.1 Introduction BOX 12.1


Applications of Tolerance Intervals
A tolerance interval is defined as the interval
(coverage) between which a pre-specified propor- ▪ Diagnostic and clinical reference ranges
tion (p) of observations fall with a pre-specified level ▪ Regulatory thresholds for performance horse
of confidence. That is, statistical tolerance limits are medication
limits within which a given proportion of the popu- ▪ Medical device performance
lation is expected to lie. Unlike prediction intervals ▪ Method comparison tests for medical devices
for which coverage accounts for a pre-specified ▪ Quality control applications
number of future observations, coverage for toler- ▪ Safety limits
ance limits accounts for any number of future obser- ▪ Medication residue limits
vations (Hahn and Meeker 1991; Vardeman 1992; ▪ Environmental regulatory limits for pesticide
NIST/SEMATECH 2012; Meeker et al. 2017; Hart- concentrations
nack and Roos 2021). Potential applications
Tolerance intervals have numerous biological
and clinical applications for the quantification of
▪ Humane endpoints
‘acceptable’ or ‘reference’ performance limits
▪ Physiological ‘tolerance polygons’
(Box 12.1). Examples include drug and device
▪ Critical thermal maxima and minima for
ectotherms.
quality control, environmental and toxicology

A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
134 A Guide to Sample Size for Animal-based Studies

monitoring (Smith 2002; Gibbons et al. 2009; BOX 12.2


Komaroff 2018), safety (Chen and Kishino 2015), One-Sided or Two-Sided Tolerance Intervals?
medication levels in performance horses (RMTC
2016) and dairy animals (CVMP 2000), compara- Two-sided tolerance interval: When it is important
tive performance of biosimilar drugs (Chiang to define an accurate and reliable range for a
et al. 2021), and method comparison assessments given proportion of observations.
for medical devices (Francq et al. 2020). Tolerance What length of the interval will contain p observations
intervals have been recommended overconfidence with a specified level of confidence?
and prediction intervals for constructing clinical
One-sided tolerance interval: When it is important to
reference ranges (Liu et al. 2021).
define an accurate and reliable performance
For process validation and quality control appli-
threshold.
cations, ‘confidence’ is defined as the amount of
certainty that the tolerance interval contains a spe- What length of interval will ensure that p observations
cified percentage of each individual measurement do not fall below a lower threshold limit L (or exceed
in the population. ‘Reliability’ is the proportion of an upper threshold limit U) with a specified level of
the population sample contained by the interval. confidence?
For clinical purposes, such as medical device test-
ing, a defined ‘risk’ component can be incorporated
that combines the probability of occurrence and
included the determination of upper threshold
potential severity of harm resulting from product
for environmental contamination or toxicology
failure or defect (Durivage 2016 a, b).
responses, or for determining a specific cut-point
in biomarker expression when evaluating response
to a drug (Pan 2015).
12.2 Tolerance Interval
Width and Bounds
The width of a tolerance interval results from both 12.3 Parametric
sampling error and population variation. As sample Formulations
size increases, sampling error is reduced so that the
For normally distributed observations obtained
percentiles estimated from the sample approach
from a random sample, two-sided tolerance inter-
are the true population percentiles. Therefore the
vals are bounded by the corresponding lower (L)
bounds of a tolerance interval are the upper or lower
and upper (U) tolerance limits, calculated as differ-
confidence interval bounds of a quantile of the
underlying data distribution. For example, a com- ences from the mean Y Therefore the lower limit YL
mon upper threshold for regulatory purposes is is Y − k s and the upper limit Y U is Y + k s. The
the 95th quantile (95% of the sampled population value for k is a function of the coverage or propor-
should fall at or below the threshold). Then the tion of the population p to be covered with confi-
tolerance interval limit is the upper bound of the dence α. If the data are not normally distributed,
(1 − α) confidence interval for the 95th quantile. log-transformation may be sufficient to ensure nor-
Tolerance intervals may be either two-sided or malisation. Values for k are obtained from the
one-sided (Box 12.2). A two-sided tolerance interval appropriate non-central distributions, with critical
is bounded by both an upper and lower limit. values generated from the respective inverse func-
Examples include quality control applications in tions (Appendix 12.A).
pharmaceutical development, device performance
in comparison to a reference standard, and compar- 12.3.1 Two-Sided Limits
ative performance of biosimilar drugs (Chiang
and Hsiao 2021; Chiang et al. 2021). A one-sided tol- For normally distributed continuous data,
erance interval is calculated when the objective is to the two-sided tolerance interval is Y ± k 2 s,
determine if a given proportion of observations fall where k(2) is based on the non-central chi-squared
outside some upper or lower threshold. Examples distribution:
Tolerance Intervals 135

1 zp + z2p − ab
df 1 +
N k 1 =
k = z 1+p 2 a
χ 21 − α,df
z2α z2
where a = 1 − , and b = z2p − α
with coverage p (the proportion of observations 2 N −1 N
that need to lie within the interval), χ 21 − α,df is the
critical value distribution, with df = N − 1 degrees
of freedom, and z(1 + p)/2 is the critical value of the Example: Regulatory Threshold for Race-
normal distribution with cumulative probability horse Medication Withdrawal
(1 + p)/2. For example, suppose the required cover-
(Data from RMTC 2016.) The Racehorse Medi-
age is 99% (p = 0.99). Then (1 + p)/2 = (1 + 0.99)/2
cation and Testing Consortium [RMTC] Scien-
= 0.995, and z0.995 is 2.576. For 95% coverage,
tific Advisory Committee has determined the
(1 + p)/2 = 0.975 and z0.975 = 1.96.
regulatory threshold for specific medications to
For small samples (N < 30), Guenther (1977) sug-
be the upper limit of the 95/95 tolerance inter-
gests a weighted correction for k(2) as k 2 = k 2 w,
val; that is, the specified tolerance interval has
where w is 95% coverage with 95% confidence. Samples
from 20 research horses were collected 24 hours
N − 3 − χ 2N − 1,α after administration of a certain medication
w= 1+ 2
2 N+1 and assayed to determine medication residue.
Observed values were:
y = 6 8, 3 4, 6 2, 5 4, 0 3, 0 5, 2 6, 0 1, 0 1, 4 5,
12.3.2 One-Sided Limits 1 0, 2 3, 10 0, 3 5, 0 2, 1 2, 0 8, 1 0, 1 4, 20 0
For continuous normally distributed data with
The range is 0.1–20.0 ng/mL, with mean 3.565
mean y and standard deviation s, the upper limit
ng/mL (SD = 4.596 ng/mL); median 1.85 (IQR
of a one-sided tolerance interval is y + k 1 s, and
0.725, 4.725) ng/mL. Because the data were
the lower limit is y – k 1 s. The one-sided normal non-normal and right-skewed, they were ln-
tolerance intervals have an exact solution based transformed for analysis. The transformed mean
on the non-central t-distribution. The z-distribution and variance are 0.43 and 1.50, respectively, with
can be used for large-sample approximations (N > p = 0.95, and α = 0.95.
100). In general, there is no difference between a Sample SAS and R (Young 2010) codes for cal-
one-sided tolerance bound and a one-sided confi- culating tolerance limits are provided in Appen-
dence bound on a given quantile of the distribution. dix 12.A. The value for k1 is 2.383. The one-
For example, a 95 per cent confidence limit on the sided upper tolerance limit is exp(0.43 + 2.283
upper 95th percentile and an upper tolerance limit 1.50) = exp(4.00523) for a threshold value of
on the 95th percentile at 95% confidence are the 54.6 ng/mL.
same (Meeker et al. 2017).
For small-sample one-sided tolerance intervals
based on the non-central t-distribution
t α,N − 1,λ Example: One-Sided Lower Tolerance
k1 =
N Interval: Osprey Eggshell Thickness
(Data from Odsjö and Sondell 2014.) Poor repro-
where λ is the non-centrality parameter zp N. The
ductive success in ospreys is related to eggshell
sample size N is obtained by iteration to find the
thinning, mostly resulting from bioaccumulation
minimum N that satisfies tα,N − 1,λ ≥ t1 − α,N − 1,λ.
of environmental contaminants. Investigators
For one-sided tolerance intervals based on the
measured shell thicknesses in 166 eggs; average
large-sample normal distribution, k(1) is calcu-
shell thicknesses were 0.51 (SD 0.039) mm.
lated as
136 A Guide to Sample Size for Animal-based Studies

From historical data, the investigators deter- The prediction interval for a single future
mined that a reduction in shell thickness by observation is
approximately 20% was associated with markedly
increased rates of reproductive failure. A 20% RI = Y ± t 1 − α 2,N − 1 s 1+1 N
reduction in shell thickness from the mean corre- = 95 54 ± 1 97 7 42 1 + 1 210
sponds to an absolute eggshell thickness of 0.4 mm. = 80 9, 110 18
Suppose the effectiveness of environmental
remediation was defined as at least 95% of eggs Using the normtol.int option in R package
in the population to be above the breakage tolerance and the Howe method for estimat-
threshold of 0.4 mm with 90% confidence. What ing the two-sided k, the tolerance interval
is the lower one-sided tolerance limit? for 95% coverage and 95% confidence is (80.1,
Using the normtol.int option in R library 110.9) mg/dL
tolerance:
set.seed(166)
x <- rnorm(166, 0.515, 0.039)
out <- normtol.int(x = x, alpha = 0.10, P = 12.4 Non-parametric
0.95, side = 1, method = "HE", log.norm =
FALSE) Tolerance Limits
out
Non-parametric, or distribution-free, tolerance lim-
The lower one-sided tolerance limit is its are used if the data are not normally distributed
0.448 mm. and cannot be readily transformed or if the investi-
gator chooses not to make distributional assump-
tions about the data. The only assumption is that
the underlying distribution function is a non-
decreasing continuous probability distribution.
Example: Reference Intervals: Normally
Non-parametric tolerance intervals are approxi-
Distributed Data
mated by rank order methods. A major disadvan-
Reference intervals for common veterinary hae- tage of non-parametric approximations is the
matology and biochemistry variables are usually requirement for large sample sizes.
based on the mean ± 2 SD if it can be assumed the For count or binary data, the tolerance interval
data are normally distributed (Klaassen 1999). specifies the upper and lower bounds on the
How do reference intervals based on this approx- number of observations that are expected to show
imation compare with prediction and tolerance the event of interest (yes/no) in a future sample
intervals for the same data? of m observations with a specified confidence 100
(Data from Liu et al. 2021.) Fasting blood glu- (1 − α)%. For two-sided tolerance intervals with
cose values were obtained from 210 subjects and an upper and a lower limit, a specified proportion
averaged 95.54 (SD 7.42) mg/dL. P of the population is contained within the bounds
The 95% confidence interval is: with a specified level of confidence. For one-sided
tolerance intervals, the upper (or lower) limit
RI = Y ± z1 − α 2 s n = 95 54 describes the specified proportion P that meets or
exceeds (or falls below) a specified threshold value
± 1 96 7 42 210 P . For example, to provide evidence of safety, an
investigator might wish to show with 90% confi-
= 94 5, 96 5
dence that ≤1% of subjects exposed to a test sub-
stance demonstrated adverse effects (Meeker et al.
The reference interval RI approximated by ±2 2017; Hartnack 2019; Hartnack and Roos 2021).
SD (Klaassen 1999) is For continuous data, non-parametric tolerance
intervals are calculated from quantiles of the ranked
RI = 95 54 ± 2 7 42 = 80 71, 110 37 mg dL sample data using the largest and smallest values in
Tolerance Intervals 137

the sample. The major disadvantage of this method


fractures following simulated hoof impact. The
is that very large sample sizes are required for
observed proportion was 12/17 = 0.706. How
reasonable precision. Alternatively, if too high a
does the 95/95 tolerance interval compare to
tolerance is specified, this method may result in esti-
the 95% prediction interval for m = 10 future
mates of impossibly large sample sizes. Determining
observations?
non-parametric tolerance limits usually require
Tolerance limits describe the proportion of
sample size N of at least 60 to ensure 90% coverage
‘defective’ items (skulls with fractures) that
with 95% confidence (Hahn and Meeker 1991;
bound the number of fractures expected in future
Meeker et al. 2017).
samples. The two-sided 95/95 tolerance interval
for 95% coverage and 95% confidence can be cal-
Example: Blood Ammonia Tolerance Inter- culated with the bintol.int option in R library
vals: Non-normal Continuous Data tolerance. Either the Wilson score method or
Jeffrey’s Bayesian method can be used to calcu-
(Data from Tivers et al. 2014.) Blood ammonia late the lower and upper bounds.
concentration is measured routinely for dogs The approximate two-sided 95% prediction
and cats with hepatic encephalopathy (HE). interval for m = 10 future observations is
A study of 90 dogs without clinical signs of HE (4, 11). The two-sided 95/95 tolerance
reported blood ammonia concentration aver- interval for 95% coverage and 95% confidence
aging Y = 152.6 μmol/L with s = 101.6 μmol/L. is (2, 10).
These data are clearly non-normal and probably
right-skewed. Therefore, the estimates for the
mean and standard deviation are ln-transformed
prior to analysis, and then back-transformed to
obtain the tolerance limits in the original units. 12.5 Determining Sample
Size for Tolerance Intervals
2
The transformed SD is st = ln s2 + Y − 2 ln Y Depending on the research problem, either of two
different approaches can be used to choose sample
= ln 101 62 + 152 62 − 2 ln 152 6 = 0 605738 size N so that the probability of either (a) including
The transformed mean is μ = ln Y – s2 2 = more than a specified proportion p of the population
4 844361 . Using the normtol.int option in R in the tolerance interval is small; or (b) the entire tol-
package tolerance and the Howe method erance interval being within the specification limits
for estimating the two-sided k, the tolerance is large.
interval for 95% coverage and 95% confidence Two-sided interval distribution-free sample size.
is (3.568984, 6.168753), which back-transform The sample size required for a coverage p is esti-
to (35.5, 477.6) μmol/L. However, the tolerance mated by iterating over candidate values for N to
interval estimated by this method is probably solve for confidence 100(1 − α)%, where
too wide to be practical or informative and seems
to include values that lie outside the established 1 − α = 1 − NpN − 1 + N − 1 pN
normal range.
Alternatively, N can be obtained by sorting the
data from lowest to highest, then iterating over a
range of coverage p from 0.5 to 0.999, which repre-
Example: Equine Skull Fractures: sent the proportion ‘interval’ between the smallest
Binary Data and largest observations in the sample. The two-
(Data from Joss et al. 2019.) Twelve out of sided non-parametric tolerance interval will be
17 cadaveric equine skulls sustained orbital between the kth largest and (n − kth largest + 1)
138 A Guide to Sample Size for Animal-based Studies

values for these data (Faulkenberry and Daly 1970;


Using the norm.ss option in R library toler-
Meeker et al. 2017).
ance (Young 2010, Young et al. 2016) for 95%
One-sided interval distribution-free sample size.
coverage 95% confidence, and P = 0.97, N 525.
For the one-sided tolerance interval, the above
equality reduces to

1 − α = 1 − pN
12.6 Sample Size for
Therefore, the smallest p to achieve a desired con-
fidence interval is (1 − α) is
Tolerance Based on
Permissible Number of
p = exp ln α N
Failures
and the minimum sample size is When trialling novel medical procedures or devices,
the consequences of error can be catastrophic. For
sequential processes with risk of harm, a predeter-
N = ln α ln p mined level of acceptable risk must be included with
‘confidence’ (percentage of occurrences) and ‘relia-
Another approximation for N is bility’ (the proportion of the population sample)
in sample size calculations. Durivage (2016 a, b)
1+p 1 recommends setting confidence and reliability
N χ 2α,4 + levels based on risk acceptance (Table 12.1). As risk
4 1−p 2 increases, the predetermined level of reliability (or
proportion of the sample population evaluated
where χ 2 α,4 is the critical value of the χ 2 distribution without failure) increases.
with 4 degrees of freedom that is exceeded with When number of failures is predetermined, sam-
probability α (Hahn and Meeker 1991, NIST/SEMA- ple size N is approximated as
TECH 2012). Kirkpatrick (1977) provides tabulated
values for determining sample size for one-sided 0 5 χ 21 − C,2 r + 1
N=
and two-sided tolerance limits for both normally 1−R
distributed and distribution-free data.
where r is the number of failures, C is the confidence
level, R is the ‘reliability’, and the value of χ 2 is
Example: Sample Size for Osprey Egg Study determined for (1 − C) confidence and 2(r + 1)
degrees of freedom.
How many eggs need to be randomly sampled to If zero failures are allowed:
be able to claim that 95% of the osprey egg popu-
lation will exceed the lower tolerance bound with ln 1 − C
N=
95% confidence? ln R
Sample size is estimated from the
Table 12.1: Confidence and reliability levels can be
approximation: specified based on subjective levels of risk acceptance.

1+p 1 Risk Defect, adverse Confidence Reliability


N χ 2α,4 + event
4 1−p 2
Low Minor 95% 90%
For p = 0.95 and α = 0.95 with = χ 20 95,4 Medium Major 95% 95%
9 488, N = 93 and for p = 0.99 and α = 0.95, High Critical 95% 99%
N = 473.
Source: Adapted from Durivage (2016a).
Tolerance Intervals 139

Equivalent tolerance values can be calculated in


Example: Surgical Procedure: No-Fail
R using the tolerance package (Young 2010).
Performance
A new surgical procedure is being trialled where 12.A.2 SAS Code Racehorse
the consequences of error can be catastrophic.
Therefore, the risk level is high. How many sur-
Medication Threshold Limits for
geries must be performed to ensure the investiga- N = 20, Mean = 0.43 and Standard
tors can have 95% confidence that the process is Deviation STD = 1.50
99% reliable (99% of the sample is expected to data tolerance;
have a successful procedure) when zero failures set (keep=N mean STD);
occur in the series? Two failures? N = N;
In this example, C = 0.95, R = 0.99 and r = 0 p=0.95; *coverage of 95%;
alpha=0.05;
or 2. alpha2=1-alpha; *confidence 95%;
c=(1+p)/2;
ln 1 − 95 DF=N-1; *degrees of freedom;
Zero failures N = = 298 1 299
ln 99 *1. Small-sample method based on the non-
central t-distribution;
lambda =zp*sqrt(N);
0 5 χ 21 − 0 95,2 2 + 1 t_crit = quantile("T",alpha,DF,lambda);
Two failures N = = 629 6 k1_1 = t_crit/sqrt(N);
1 − 0 99
*calculate the upper tolerance limit;
630 *Because observations were ln-transformed,
they must be back-transformed to get the
original units;
UL_t = exp(mean + k1_1*STD);

12.A SAS and R Code for *2. Large-sample method;


zc = quantile("normal", c);
Calculating Tolerance za = quantile("normal", alpha2);
zp = quantile("normal", p);
12.A.1 Solving for k a = 1-((za*za)/(2*DF));
b = zp*zp - (za*za/N);
The value k for the calculation of tolerance intervals k1 = (zp + sqrt(zp*zp - a*b))/a;
is obtained from the non-central distributions for χ 2,
and either z or t, with critical values generated from *Calculate the upper tolerance limit;
UL = exp(mean + k1*STD);
the respective quantile functions. The quantile func- run;
tion is the inverse of the CDF function.
For t, the format is t = quantile (‘[distribution]’, *print output;
[coverage/confidence], [degrees of freedom]). proc print; run;
For z, the format is z = quantile (‘[distribution]’,
[coverage/confidence]). 12.A.3 R Code for Package
For χ 2, the format is chisq = quantile(‘[distribu- Tolerance (Young 2010) for
tion]’, alpha, [degrees of freedom]).
For the iterations, start at N = 2 and increase N by
Racehorse Medication Threshold
one until k1 ≤ k2. Limits
In SAS, for coverage p and confidence α set.seed(20)
x <- rnorm(20, 0.43, 1.50)
t_alpha = quantile("T",(1+p)/2,N - 1); out <- normtol.int(x = x, alpha = 0.05,
t_p = quantile("T",p,N - 1); P = 0.95, side = 1, method = "HE", log.norm =
FALSE)
z_alpha = quantile("normal", (1+p)/2); out
z_p = quantile("normal", p);
plottol(out, x, plot.type = "both", side =
chi_crit = quantile("chisq",alpha, N-1); "upper", 12.lab = "Normal Data")
140 A Guide to Sample Size for Animal-based Studies

References Hahn, G.J. and Meeker, W.Q. (1991). Statistical Intervals:


A Guide for Practitioners. New York: Wiley.
Chen, H. and Kishino, H. (2015). Hypothesis testing of Hartnack, S. (2019). Confidence, Prediction and Tolerance
inclusion of the tolerance interval for the assessment Intervals in Classical and Bayesian Settings Master The-
of food safety. PLoS ONE 10 (10): e0141117. sis in Biostatistics. Zurich: University of Zurich, Faculty
https://doi.org/10.1371/journal. of Science 82 pp. https://www.math.uzh.ch/li/
pone.0141117. index.php?file&key1=112446.
Chiang, C., and Hsiao, C.-F. (2021). Tolerance interval Hartnack, S., and Roos, M. (2021). Teaching: confidence,
testing for assessing accuracy and precision simultane- prediction and tolerance intervals in scientific practice:
ously. PLoS ONE 16 (2): e0246642. https://doi.org/ a tutorial on binary variables. Emerging Themes in
10.1371/journal.pone.0246642. Epidemiology 18: 17. https://doi.org/10.1186/
Chiang, C., Chen, C.T., and Hsiao, C.F. (2021). Use of a s12982-021-00108-1.
two-sided tolerance interval in the design and evalua- Howe, W.G. (1969). Two-sided tolerance limits for nor-
tion of biosimilarity in clinical studies. Pharmaceutical mal populations – some improvements. Journal of
Satistics 20 (1): 175–184. https://doi.org/10.1002/ the American Statistical Association 64 (326): 610–620.
pst.2065. Joss, R., Baschnagel, F., Ohlerth, S. et al. (2019). The risk of
Committee for Veterinary Medicinal Products. (2000). a shod and unshod horse kick to create orbital fractures
Note for guidance for the determination of withdrawal in equine cadaveric skulls. Veterinary and Comparative
periods for milk. EMEA/CVMP/473/98-FINAL. Orthopaedics and Traumatology 32 (4): 282–288.
https://www.ema.europa.eu/documents/scien- Kirkpatrick, R.L. (1977). Sample sizes to set tolerance
tific-guideline/note-guidance-determination- limits. Journal of Quality Technology 9 (1): 6–12.
withdrawal-periods-milk_en.pdf (accessed 2022). https://doi.org/10.1080/
Durivage, M.A. (2016a). Risk-based approaches to 00224065.1977.11980758.
establishing sample sizes for process validation Klaassen, J.K. (1999). Reference values in veterinary
https://www.meddeviceonline.com/doc/risk- medicine. Laboratory Medicine 30 (3): 194–197.
based-approaches-to-establishing-sample-sizes- https://doi.org/10.1093/labmed/30.3.194.
for-process-validation-0004 (accessed 2022). Komaroff, M. (2018). The applications of tolerance inter-
Durivage, M.A. (2016b). How to establish sample vals: make it easy. PharmaSUG 2018 – Paper AA-09.
sizes for process validation using the Success-Run https://www.pharmasug.org/proceedings/2018/
theorem. Pharmaceutical Online https://qscom- AA/PharmaSUG-2018-AA09.pdf (accessed 2022).
pliance.com/wp-content/uploads/2017/10/How- Liu, W., Bretz, F., and Cortina-Borja, M. (2021). Refer-
To-Establish-Sample-Sizes-For-Process-Val- ence range: which statistical intervals to use? Statisti-
idation-Using-The-Success-Run-Theorem.pdf cal Methods in Medical Research 30 (2): 523–534.
(accessed 2022). https://doi.org/10.1177/0962280220961793.
Eberhardt, K.R., Mee, R.W., and Reeve, C.P. (1989). Meeker, W.Q., Hahn, G.J., and Escobar, L.A. (2017). Sta-
Computing factors for exact two-sided tolerance limits tistical Intervals: A Guide for Practitioners and
for a normal distribution. Communications in Statistics Researchers, 2e. New York: Wiley.
Part B. 1989: 397–413. NIST/SEMATECH. (2012). e-Handbook of Statistical
Faulkenberry, G.D., and Daly, J.C. (1970). Sample size for Methods. http://www.itl.nist.gov/div898/hand-
tolerance limits on a normal distribution. Techno- book. http://doi.org/10.18434/M32189
metrics 12 (4): 813–821. Odeh, R.E., Chou, Y.-M., and Owen, D.B. (1987). The
Francq, B.G., Berger, M., and Boachie, C. (2020). To tol- precision for coverages and sample size requirements
erate or to agree: a tutorial on tolerance intervals for normal tolerance intervals. Communications in
in method comparison studies with BivRegBLS Statistics: Simulation and Computation 16: 969–985.
R package. Statistics in Medicine 39: 4334–4349. Odsjö, T., and Sondell, J. (2014). Eggshell thinning of osprey
https://doi.org/10.1002/sim.87092. (Pandion haliaetus) breeding in Sweden and its signifi-
Gibbons, R.D., Bhaumik, D.K., and Aryal, S. (2009). cance for egg breakage and breeding outcome. Science
Statistical Methods for Groundwater Monitoring, 2e. of the Total Environment 470–471: 1023–1029. https://
Hoboken: Wiley. doi.org/10.1016/j.scitotenv.2013.10.051.
Guenther, W.C. (1977). Sampling Inspection in Statistical Pan, J. (2015). The application of tolerance interval in
Quality Control. Griffin’s Statistical Monographs, defining drug response for biomarker. PharmaSUG
Number 37. London and High Wycombe: Griffin. China 2015 – Paper 57 https://www.pharmasug.org/
Tolerance Intervals 141

Racehorse Medication and Testing Consortium [RMTC]. 9 (1): e82303. https://doi.org/10.1371/journal.


(2016). Explaining the 95/95 tolerance interval. pone.0082303.
https://rmtcnet.com/wp-content/uploads/2016- Vardeman, S.B. (1992). What about the other intervals?
02-Explaining-the-95_95-Tolerance-Interval. The American Statistician 46 (3): 193–197. https://
pdf (accessed 2022). doi.org/10.2307/2685212.
Smith, R.W. (2002). The use of random-model tolerance Young, D.S. (2010). tolerance: an R package for estimat-
intervals in environmental monitoring and regulation. ing tolerance intervals. Journal of Statistical Software
Journal of Agricultural, Biological, and Environmental 36 (5): 1–39.
Statistics 7 (1): 74–94. Young, D.S., Gordon, C.M., Zhu, S., and Olin, B.D.
Tivers, M.S., Handel, I., Gow, A.G. et al. (2014). Hyper- (2016). Sample size determination strategies for nor-
ammonemia and systemic inflammatory response syn- mal tolerance intervals using historical data. Quality
drome predicts presence of hepatic encephalopathy in Engineering 28 (3): 337–351. https://doi.org/
dogs with congenital portosystemic shunts. PLoS ONE 10.1080/08982112.2015.1124279.
13
Reference Intervals

CHAPTER OUTLINE HEAD


13.1 Introduction 143 13.3.3 Parametric Sample Size Estimates 148
13.2 Constructing the Reference Interval 144 13.3.4 Non-parametric Sample Size
13.2.1 Regression-Based Reference Estimates 148
Ranges 146 13.3.5 Covariate-Dependent Sample
13.3 Sample Size Determination 147 Size Estimates 148
13.3.1 Rules of Thumb 147 References 149
13.3.2 Sample-Based Coverage Methods 147

population (Figure 13.1). Because the reference


13.1 Introduction interval is the proportion of healthy subjects for
Reference intervals describe the range of clinical which a correct negative test result is obtained, it
read-out values between which typical results for corresponds to diagnostic specificity.
a healthy subject might be expected to occur. If Typically, reference intervals are constructed for
the measurement falls outside these limits, it is a single quantitative marker, usually a haematology
flagged as potentially abnormal (Box 13.1). The ref- or biochemistry variable or some other marker of
erence interval is the proportion of measurements disease. Markers are obtained from a sample of ‘rep-
for healthy subjects contained between the upper resentative’ reference subjects, usually healthy, in
and lower reference limits. The reference limits numbers ‘sufficient’ to provide reliable and consist-
are defined by the percentiles of the reference ent results. Defining what is ‘representative’ and
‘sufficient’ will be context-dependent. Reference
intervals for the same marker may vary considera-
BOX 13.1 bly between laboratories due to differences in sub-
What Are Reference Intervals? ject populations, measurement devices, reagents,
methods of analysis, and methods of verification.
The reference interval is the range of clinical read-
For a reference interval to be both clinically
out values between which typical results for a
useful and generalisable, studies describing its
healthy subject might be expected to occur.
construction must thoroughly document all meth-
Reference limits define potential abnormalities.
odology, including subject characteristics (signal-
Reference intervals are descriptive; they do not test
ment, health status, inclusion and exclusion
hypotheses.
criteria), methods of subject selection, sampling
Reference intervals are sample-based, not
methods, and the type and number of biomarkers
risk-based.
from which reference values are to be derived.

A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
144 A Guide to Sample Size for Animal-based Studies

percentiles 13.2 Constructing the


95th
97.5th
Reference Interval
99th Construction of the reference interval and deter-
mination of an adequate sample size is a four-
Reference Abnormal step process (Table 13.1). The process requires
population population
specification of the reference or target popula-
tion (and health status), the subject pool, the
marker variable, and coverage and confidence.
Biomarker level
Sample size can be justified by both rule-of-
thumb methods and formal calculations, but in
Figure 13.1: Read-out distributions for reference ‘healthy’
versus ‘abnormal’ subject populations.
most real-life clinical settings it will be deter-
Source: Adapted from Ekelund (2018).
mined by convenience and the pool of available
and accessible subjects. Convenience sampling is
cheap and easy but is not a probability-based
BOX 13.2 sampling method and is subject to sampling bias
Definitions for Selection and Spectrum Bias and spectrum bias (Ransohoff and Feinstein
1978, Mulherin and Miller 2002). Most formal
Spectrum bias. Diagnostic performance differences sample size justification methods are based on
resulting from differences in patient and subject the assumption that the sample is obtained by
composition between clinical settings. Differ- random probability-based selection. Reference
ences in case mix affect diagnostic specificity interval determination studies must clearly
(when the proportion of health subjects differs), describe methods of sampling and discuss the
sensitivity (when the proportion of subjects with associated limitations.
the target condition differs), and estimates of Construction of the reference interval requires
prevalence. specification of three items: interval coverage, refer-
Selection bias. Group differences in signalment, ence limits, and cut-point precision (Figure 13.2)
baseline characteristics, and/or outcome due to Similar to 95% confidence intervals, bounds for a
differences in subject characteristics resulting
from poorly defined inclusion/exclusion criteria,
sampling bias (different rates of selection), attri- Table 13.1: Four steps in construction of a reference interval.
tion bias (different rates of failure or dropout),
1. Specify the reference population
and under-coverage bias (usually resulting from
▪ Healthy or diseased?
convenience sampling) ▪ If healthy, is subclinical disease an issue?
▪ Inclusion/exclusion criteria
▪ Consider laboratory analytics, methodology
Selection bias and spectrum bias (Box 13.2) will 2. Specify methods for subject selection
limit the application of the reference interval ▪ Prospective or retrospective
because test subjects will not be representative of ▪ Random or cohort
the desired target group (Ransohoff and Feinstein
▪ Consider selection and spectrum bias
3. Specify marker variable
1978; Goehring et al. 2004). ▪ One or multiple?
Although commonly used for clinical applica- ▪ Covariates? (age, sex, weight, other)
tions, reference intervals do not measure risk. They ▪ Distribution? (uniform, normal, non-normal)
are sample-based and descriptive. Therefore, sam- 4. Specify coverage and confidence settings
▪ Target coverage for the interval
ple characteristics are the pre-eminent determinant ▪ Confidence around the limits
of reliability and generalisability. Unless the refer- 5. Determine number of subjects
ence interval is well-designed and verified, clinical ▪ Rule of thumb = 120
interpretation must be performed with care. At ▪ Formal determinations
worst, the reference interval will be neither ▪ Parametric or non-parametric?
▪ One- or two-tailed?
applicable nor useful (Machin et al. 2018).
Reference Intervals 145

(a) 2.5.5th 97.5th the cut-point. This is estimated as a designated pro-


percentile percentile
portion of the reference interval width, usually 1 to
95% Coverage 5% (Figure 13.2b). The precision of the cut-points is
the confidence probability. Coverage and confidence
probability are analogous to the concepts of confi-
dence and power used in statistical significance
Reference testing.
population Frequently, reference ranges for common veteri-
nary haematology and biochemistry markers are
constructed on either the 95% confidence interval
or the mean ± 2 SD for normally distributed
(b) 2.5th
variables (Klaassen 1999). However, these approxi-
Reference limits
percentile
90% Coverage
mations will be biased (especially if data are non-
normal), resulting in under-estimation of true
coverage. Regression-based reference intervals are
used when clinical measurements vary with one
or more covariates, such as age or body weight
(Altman 1993; Wellek et al. 2014). Reference
Reference
intervals may also be constructed using prediction
population
intervals and tolerance intervals (Liu et al. 2021).

97.5th
percentile
Example: Reference Range: Normally
Figure 13.2: Construction of the reference interval requires Distributed Data
specification of three items: interval coverage, reference
limits, and cut-point precision. (a) By convention, coverage p is (Data from Liu et al. 2021.) Fasting blood
specified as 95%. Therefore the central 95% of the sample glucose values obtained from 210 subjects aver-
measurements for the reference distribution is bounded
by the 2.5th and 97.5th percentiles, the reference limits. aged 95.54 (SD 7.42) mg/dL. Reference ranges
(b) Cut-point precision is defined by the coverage over the calculated from these summary data are sum-
reference limits themselves. The desired coverage for the marised below.
reference limit is usually specified as 90% (β = 0.90).

The ± 2 SD approximation is RI2SD


reference interval for continuous normally distribu- = 95 54 ± 2 7 42 = 80 7, 110 4 mg dL
ted data are determined from percentiles of the
standard normal distribution. For non-normal data, The reference range based on the 95% confi-
non-parametric methods based on percentiles of dence interval is: RICI = Y ± z1 −α 2 s n =
rank-ordered values are used (Jennen-Steinmetz 95 54 ± 1 96 7 42 210 = 94 5,96 5
and Wellek 2005). The reference interval itself is The 95% prediction interval for a single future
defined as the central range of read-out values observation is
bounded by reference limit values. By convention,
the central 95% of the sample measurements (cover- RIpred = Y ± t1 − α 2,N − 1 s 1+1 N
age) fall between the lower 2.5th percentile and = 95 54 ± 1 97 7 42 1 + 1 210
upper 97.5th percentile bounds (reference limits or
= 80 9, 110 2
cut-points). This means that 2.5% of observations
are below the lower limit, and 2.5% are above the
upper limit (Figure 13.2a). The reference interval The 95/95 tolerance interval calculated with
requires an additional measure for the precision of the normtol.int option in R library
146 A Guide to Sample Size for Animal-based Studies

tolerance, with the Howe method for estimat- RI95 95 = 35 5, 477 6 μmol L
ing the two-sided k (Young 2010, 2013) is
Using the non-parametric method with
RI95 95 = 79 9, 111 8 mg dL nptol.int in the R library tolerance

RI95 95 = − 78 2, 414 9 μmol L

The 95/95 tolerance interval is wider than the


Example: Reference Range: Non-normal ±2 SD approximation, but results are roughly
Continuous Data comparable. In contrast, the interval calculated
with non-parametric methods is extremely wide
(Data from Tivers et al. 2014.) Blood ammonia and includes impossible values. The sample size
concentration is measured routinely for dogs is not large enough to obtain realistic precision
and cats with suspected or overt hepatic enceph- with non-parametric methods.
alopathy (HE). A study of 90 dogs without clinical
signs of HE reported blood ammonia concentra-
tions averaging Y = 152.6 μmol/L with SD
101.6 μmol/L. What is the reference interval 13.2.1 Regression-Based Reference
based on (a) the ±2 SD approximation; (b) the Ranges
95/95 tolerance interval? Regression-based reference ranges are constructed
Because the mean is smaller than twice the when clinical measurements vary with one or more
standard deviation (Altman and Bland 1996), covariates, such as age or body weight (Altman
these data are clearly non-normal and right- 1993; Jennen-Steinmetz 2013; Wellek et al. 2014).
skewed. Summary statistics were ln-transformed In the simplest case, with one covariate X, the linear
for analysis, as follows: regression model is Y = b0 + b1X + ε, with intercept
b0, slope b1 and error term ε, which is normally dis-
σ ln = ln SD2 + mean2 − 2 ln mean and μln tributed with mean 0 and variance σ2. The relative
margin of error D is a pre-specified proportion of the
= ln mean – σ 2ln 2 width of the 100(1 − α)% confidence interval, say
In this example, σ ln = 0.605738 and μln = 10–30%. Sample sizes are calculated such that N is
4.844361 sufficiently large to obtain a pre-specified margin
(a) The reference interval based on ±2SD app- of error. Non-normal data may be log-transformed
roximation is 4.844361 ± 2(0.605738), and back- to approximate a normal distribution, then back-
transformed to give transformed to obtain values in the original units.
If data cannot be readily transformed or the data
RI2SD = 37 8, 426 6 mg dL do not conform to the assumptions required for tra-
ditional least-squares regression, semi-parametric
The tolerance interval was estimated using or non-parametric rank methods can be used
normtol.int in the R library tolerance (Wellek et al. 2014; Machin et al. 2018). Altman
(1993) described a simple method using centiles
set.seed(20)
on the absolute residuals (see also Wright and Roy-
x <- rnorm(210, 4.844361, 0.605738)
out <- normtol.int(x = x, alpha = 0.05, ston 1997a, b). Quantile regression (Wei et al. 2019)
P = 0.95, side = 2, method = "HE", log.norm may be indicated when clinical markers show une-
= FALSE) ven patterns of variation across the range of the cov-
out ariate (Cade et al. 1999; Yu et al. 2003; Jennen-
Steinmetz 2014) or when measurements falling
The output for the lower and upper limits are
below (or above) limits of detection may be an issue
(3.568984, 6.168753). Back-transforming to
(Smith et al. 2006). Quantile regression can be per-
obtain the original units gives:
formed in SAS proc quantreg or R quantreg
Reference Intervals 147

(Koenker 2022), and confidence intervals con-


tolerance interval (24.5, 36.2) mm. These values
structed by rank test statistics or bootstrap resam-
are very similar to the interval limits originally
pling. Bootstrap resampling is recommended for
reported.
small samples (<200) and covariate skew
(Koenker and Hallock 2001; Yu et al. 2003; Tarr
2012; Jennen-Steinmetz 2014). The Harrell-Davis
quantile estimator (Harrell and Davis 1982) is 13.3 Sample Size
another distribution-free method that constructs
somewhat more efficient confidence intervals of
Determination
quartiles than other non-parametric methods. The 13.3.1 Rules of Thumb
Harrell-Davis estimator routine can be found in
the R Hmisc package (hdquantile function). The Clinical and Laboratory Standards Institute
Methods for constructing regression-based toler- (CLSI) recommends a minimum of 120 subjects as
ance intervals are available for several types of a rule of thumb for a reference interval study
regression models, including linear regression with (Bautsch 2009; Hortowitz et al. 2010). At least 400
one or more predictors, polynomial regression, non- subjects in each homogeneous subgroup are recom-
linear regression, and non-parametric regression mended for multi-centre studies (Ichihara and Boyd
(Young 2013, Young et al. 2016; Liu et al. 2021). 2010). For verification of a previously established
Regression tolerance intervals can be computed reference interval, the minimum recommended
in R tolerance: regtol.int for linear regression, sample size is 20 subjects (Henny 2009; Hortowitz
nlregtol.int for nonlinear regression, and et al. 2010; Katayev et al. 2010).
npregtol.int for non-parametric regression Rule of thumb approximations implicitly assume
(Young 2010, 2014). However, because tolerance the data are normally distributed. This is frequently
interval estimation was developed for engineering not the case for biological or biochemical data.
quality control applications, it may not be generally Another disadvantage is that rules of thumb do
applicable to clinical reference interval construc- not account for design and operational specifics or
tion. Reference intervals based on tolerance estima- sample population characteristics. Rule of thumb
tion may be too wide, sacrificing sensitivity for estimates may considerably over- or under-estimate
specificity (Wellek and Jennen-Steinmetz 2022). numbers actually required to achieve the necessary
specificity for the new study.

Example: Echocardiography Metrics in 13.3.2 Sample-Based Coverage


Cavalier King Charles Spaniels
Methods
(Data from Misbach et al. 2014.) Readings for
Sample size estimation for a reference interval
13 echocardiography and pulsed-wave Doppler
requires specification of three items: the coverage
variables were obtained for 134 clinically healthy
(probability content p) of the reference interval, the
adult Cavalier King Charles Spaniels, and effects
precision with which the reference limits or ‘cut-
of body weight, age, and sex were used to con-
points’ are estimated (confidence probability β), and
struct 90% confidence regression-based reference
cut-point precision (δ) (Figure 13.2.) The cut-point
intervals. They reported the fitted linear regres-
precision is expressed as a proportion of the reference
sion of left ventricular end-diastolic diameter
interval width, usually 1–5%.
(LVDd, mm) on body weight (BW, kg) as
Reference intervals may be determined by either
LVDd = 20.57 + 0.987 BW, with 90% reference
parametric or non-parametric methods of estima-
range (23.4, 35.6) mm.
tion. Parametric methods are appropriate for nor-
Computation of regression tolerance intervals
mally distributed continuous data, or data that
for a 10 kg male dog (R tolerance: regtol.
can be normalised with a log-transformation. For
int; Young 2013) indicated that the predicted
non-normal data, non-parametric methods are
LVDd is approximately 30.4 mm, with the 90/95
used. Non-parametric methods require much larger
148 A Guide to Sample Size for Animal-based Studies

sample sizes than parametric approximations. Ref-


Example: Determining Sample Size for a
erence intervals may be either one- or two-sided.
Non-Parametric Reference Interval

13.3.3 Parametric Sample Size (From Wellek et al. 2014.) For two-sided reference
coverage of 95%, cut-point precision δ of 0.01, and
Estimates
β of 90%, the parametric estimate of sample size is
Wellek et al. (2014) give the approximate sample
size formula for a two-sided parametric reference 1 962 2
N≈ 1+ 0 05845 1 96 0 01 2
interval based on the standard normal probability 2
density function. Sample size is ≈ 1534
2
N ≥ 1 + z21 2 Φ z2 δ 2
and the non-parametric sample size is
2
where z1 is the (1 + q)/2 percentile for coverage, and 1+0 95 2 1− 1+0 95 2 1 96 0 01 12
z2 = (1 + β)/2 percentile for the cut-point precision, ≈ 0 975 0 025 1 96 0 005 2 ≈3746
δ. Typically, q = 0.95 and β = 0.90, so that z1 = z2 =
1.96 for the two-sided case, For the one-sided
reference interval, z1 is the q percentile and cut-
point precision is δ (rather than δ/2), and z1 = 13.3.5 Covariate-Dependent Sample
z2 = 1.6440. The quantity Φ is the pdf of the standard Size Estimates
normal density function at x = z1 and is calculated as When clinical measurements vary with one or more
covariates, sample size estimates need to account for
1
e − z1
2
Φ= 2
the structure of the regression model and how cov-
2π ariates are distributed, as well as the desired degree
of precision. In general, sample sizes based on cov-
If z1 = 1.96, then Φ = 0.05845 and if z1 = 1.6448
ariate-dependent samples will be very much larger
then Φ = 0.10314.
than those that are independent of covariates.
Bellera and Hanley (2007) developed a simple
13.3.4 Non-parametric Sample Size regression-based method for estimating sample size
Estimates when the value of the response depends on a single
covariate, for example, age or body weight. Four
Non-parametric (distribution-free) formulations are variations of this method are available to account
based on weighted average of the rank orders of different distributions of the covariate in the study
measurements. Reference intervals are determined sample.
by percentiles of the distribution of clinical bio-
marker values. The reference interval is defined Uniform distribution. If the covariate is uniformly
by the qth and [1 − (1 − q)/2]th quantiles of the nor- distributed across its range, minimum sample
mal distribution. If a one-sided reference interval is size N is:
required, the two-sided quantile [1 − (1 − q)/2] is
replaced by [1 − (1 − q)] = q. z21 − α 4 + z2p p
2
The approximate sample size formula for a two- N≥
sided non-parametric reference interval (Wellek z21 − β 2 D2
et al. 2014) is
for the 100(1 − α)% confidence interval, the 100 p%
N ≥ 1+q 2 1− 1 + q 2 z2 δ 2 2 reference limit compared to the 100(1 − β)% refer-
ence range, and for margin of error D.
and for the one-sided case:
Normal distribution. If the covariate is approxi-
2 mately normally distributed, and its range is
N ≥ q 1−q z2 δ
Reference Intervals 149

approximately four times that of its standard


deviation: Tertile method. One potential sampling scheme
for a new study would be to divide the study
sample into three weight classes, or tertiles
z21 − α 2 5 + z2p 2 (<10 kg, 10–20 kg, and >20 kg). Equal num-
N≥ ber of dogs would then be enrolled from each
z21 − β D2
2 tertile.
Tertile method. If the covariate does not conform The estimated minimum sample size for the
to either type of distribution, sampling for a new study based on the tertile method is
new study could be approached by partitioning
the covariate into thirds, or tertiles, and then 1 962 5 2 + 1 6452 2
N≥ = 755 2
obtaining samples of size N/3 from each tertile: 1 962 0 102
for a total of 756 dogs, with 252 dogs randomly
z21 − α 2 5 2 + z2p 2 sampled from each weight class.
N≥
z21 − β 2 D2

Estimating at the covariate average. For estimat- References


ing the reference range at the average value of
Ackerman, L.H., Reynolds, P.S., Aherne, M., and Swift, S.
the covariate, the sample size is:
T. (2022). Right axis deviation in the canine electrocar-
diogram for predicting severity of pulmonic stenosis: a
z21 − α 2 1 + z2p 2 retrospective cohort analysis. American Journal of Vet-
N≥ erinary Research 83 (4): 312–316. https://doi.org/
z21 − β 2 D2 10.2460/ajvr.21.09.0138.
Altman, D.G. (1993). Construction of age-related refer-
ence centiles using absolute residuals. Statistics in
Medicine 12: 917–924.
Example: Canine Pulmonary Stenosis
Altman, D.G. and Bland, J.M. (1996). Detecting skewness
(Data from Ackerman et al. 2022). Pulmonary ste- from summary information. BMJ 313 (7066): 1200.
nosis severity in dogs showed a negative linear https://doi.org/10.1136/bmj.313.7066.1200.
association with body weight, and body weights Bautsch, W. (2009). Requirements and assessment of lab-
showed a right-skewed distribution. Median body oratory tests: part 5 of a series on evaluation of scien-
weight for 90 dogs was 14 kg (IQR 7, 25 kg; range tific publications. Deutsches Ärzteblatt International
106 (24): 403–406. https://doi.org/10.3238/
1–46 kg). If a new study was planned based on
arztebl.2009.0403.
these data, how many dogs would need to be
Bellera, C.A. and Hanley, J.A. (2007). A method is pre-
sampled to obtain the 95% confidence interval sented to plan the required sample size when estimat-
for the 95% reference limit with a relative margin ing regression-based reference limits. Journal of
of error D of 10%? Clinical Epidemiology 60: 610–615.
Cade, B.S., Terrell, J.W., and Schroede, R.L. (1999). Esti-
Normal method. If weight is ln-transformed, mating effects of limiting factors with regression quan-
the distribution of the covariate is approxi- tiles. Ecology 80 (1): 311–323.
mately normal, with mean 2.617 (SD 0.74) Ekelund, S. (2018). Reference intervals and percentiles –
and range 0.1–3.8. Then implications for the healthy patient. https://acute-
caretesting.org/ (accessed 2022).
Goehring, C., Perrier, A., and Morabia, A. (2004). Spec-
1 962 5 + 1 962 2 trum bias: a quantitative and graphical analysis of
N≥ = 692 1
1 962 0 102 the variability of medical diagnostic test performance.
Statistics in Medicine 23 (1): 125–113.
or approximately 693 dogs. Harrell, F.E. and Davis, C.E. (1982). A new distribution-
free quantile estimator. Biometrika 69 (3): 635–640.
150 A Guide to Sample Size for Animal-based Studies

Henny, J. (2009). The IFCC recommendations for deter- Mulherin, S.A. and Miller, W.C. (2002). Spectrum bias or
mining reference intervals: strengths and limitations. spectrum effect? Subgroup variation in diagnostic test
Journal of Laboratory Medicine 33 (2): 45–51. evaluation (PDF). Annals of Internal Medicine 137 (7):
https://doi.org/10.1515/JLM.2009.0162. 598–602. https://doi.org/10.7326/0003-4819-
Hortowitz GL, Altaie S, Boyd JC, Ceriotti F, Garg U, 137-7-200210010-00011.
Horn P, Pesce A, Sine HE, Zakwski J. Defining, estab- Ransohoff, D.F, Feinstein, A.R. (1978). Problems of spec-
lishing, and verifying reference intervals in the clinical trum and bias in evaluating the efficacy of diagnostic
laboratory; Approved guideline. 3 Clinical and Labora- tests. New England Journal of Medicine (NEJM),
tory Standards Institute (CLSI) Document EP28-A3C, 299 (17): 926–30. doi:https://doi.org/10.1056/
Vol. 28, No. 30 Wayne, PA: Clinical and Laboratory NEJM197810262991705.
Standards Institute; 2010. Smith, D., Silver, E., and Harnly, M. (2006). Environmen-
Ichihara, K. and Boyd, J.C. (2010). An appraisal of statis- tal samples below the limits of detection – comparing
tical procedures used in derivation of reference inter- regression methods to predict environmental concen-
vals. Clinical Chemistry and Laboratory Medicine 48: trations. http://www.lexjansen.com/wuss/2006/
1537–1551. Analytics/ANL-Smith.pdf (accessed 2022).
Jennen-Steinmetz, C. (2013). Sample size determination Tarr, G. (2012). Small sample performance of quantile
for studies designed to estimate covariate-dependent regression confidence intervals. Journal of Statistical
reference quantile curves. Statistics in Medicine 33: Computation and Simulation 82 (1): 81–94. https://
1336–1348. doi.org/10.1080/00949655.2010.527844.
Jennen-Steinmetz, C. (2014). Sample size determination Tivers, M.S., Handel, I., Gow, A.G. et al. (2014). Hyper-
for studies designed to estimate covariate-dependent ammonemia and systemic inflammatory response syn-
reference quantile curves. Statistics in Medicine, 33 drome predicts presence of hepatic encephalopathy in
(8):1336–48. https://doi.org/10.1002/sim.6024. dogs with congenital portosystemic shunts. PLoS ONE
Jennen-Steinmetz, C. and Wellek, S. (2005). A new 9 (1): e82303. https://doi.org/10.1371/journal.
approach to sample size calculation for reference inter- pone.0082303.
val studies. Statistics in Medicine 24 (20): 3199–3212. Wei, Y., Kehm, R.D., Goldberg, M., and Terry, M.B.
https://doi.org/10.1002/sim.2177. (2019). Applications for quantile regression in
Katayev, A., Balciza, C., and Seccombe, D.W. (2010). epidemiology. Current Epidemiology Reports 6:
Establishing reference intervals for clinical laboratory 191–199. https://doi.org/10.1007/s40471-019-
test results: is there a better way? American Journal of 00204-6.
Clinical Pathology 133 (2): 180–186. https://doi. Wellek, S. and Jennen-Steinmetz, C. (2022). Reference
org/10.1309/ajcpn5bmtsf1cdyp. ranges: Why tolerance intervals should not be used.
Klaassen, J.K. (1999). Reference values in veterinary [Comment on Liu, Bretz and Cortina-Borja, Reference
medicine. Laboratory Medicine 30 (3): 194–197. range: Which statistical intervals to use? SMMR,
https://doi.org/10.1093/labmed/30.3.194. 2021,Vol. 30(2) 523–534]. Statistical Methods in Medi-
Koenker, R. (2022). R package quantreg. https://cran. cal Research 31 (11): 2255–2256. https://doi.org/
r-project.org (accessed 2022). 10.1177/09622802221114538.
Koenker, R. and Hallock, K.F. (2001). Quantile regres- Wellek, S., Lackner, K.J., Jennen-Steinmetz, C. et al.
sion. Journal of Economic Perspectives 15 (4): 143–156. (2014). Determination of reference limits: statistical
Liu, W., Bretz, F., and Cortina-Borja, M. (2021). Refer- concepts and tools for sample size calculation. Clinical
ence range: which statistical intervals to use? Statisti- Chemistry and Laboratory Medicine (CCLM)
cal Methods in Medical Research 30 (2): 523–534. 52 (12): 1685–1694. https://doi.org/10.1515/
https://doi.org/10.1177/0962280220961793. cclm-2014-0226.
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. Wright, E.M. and Royston, P. (1997a). Simplified estima-
(2018). Sample Sizes for Clinical, Laboratory and Epide- tion of age-specific reference intervals for skewed data.
miology Studies, 4e. New York: Wiley. Statistics in Medicine 16: 2785–2803.
Misbach, C., Lefebvre, H.P., Concordet, D. et al. (2014 Wright, E.M. and Royston, P. (1997b). A comparison
Jun). Echocardiography and conventional Doppler of statistical methods for age-related reference
examination in clinically healthy adult Cavalier King intervals. Journal of the Royal Statistical Society, A
Charles Spaniels: effect of body weight, age, and gen- 160: 47–69.
der, and establishment of reference intervals. Journal Young, D.S. (2010). tolerance: an R package for estimat-
of Veterinary Cardiology 16 (2): 91–100. https:// ing tolerance intervals. Journal of Statistical Software
doi.org/10.1016/j.jvc.2014.03.001. 36 (5): 1–39.
Reference Intervals 151

Young, D.S. (2013). Regression tolerance intervals. Com- Young, D.S., Gordon, C.M., Zhu, S., and Olin, B.D.
munications in Statistics: Simulation and Computation (2016). Sample size determination strategies for nor-
42 (9): 2040–2055. https://doi.org/10.1080/ mal tolerance intervals using historical data. Quality
03610918.2012.689064. Engineering 28 (3): 337–351. https://doi.org/
Young, D.S. (2014). Computing tolerance intervals and 10.1080/08982112.2015.1124279.
regions using R. In: Handbook of Statistics: Computa- Yu, K., Lu, Z., and Stander, J. (2003). Quantile regression:
tional Statistics with R, vol. 32 (ed. M.B. Rao and applications and current research areas. The Statisti-
C.R. Rao), 309–338. Amsterdam: North Holland- cian 52 (3): 331–350.
Elsevier.
IV
Sample Size for
Comparison

Chapter 14: Sample Size and Hypothesis Testing.


Chapter 15: A Bestiary of Effect Sizes.
Chapter 16: Comparing Two Groups: Continuous Outcomes.
Chapter 17: Comparing Two Groups: Proportions.
Chapter 18: Time-to-Event (Survival) Data.
Chapter 19: Comparing Multiple Factors.
Chapter 20: Hierarchical or Nested Data.
Chapter 21: Ordinal Data.
Chapter 22: Dose-Response Studies.
14
Sample Size and
Hypothesis Testing

CHAPTER OUTLINE HEAD


14.1 Introduction 155 14.A Sample SAS Code for Estimating
14.2 Power and Significance 155 Power for a Given Sample Size Based
14.3 Non-centrality 157 on Exemplary Non-Gaussian Data.
Modified from Stroup (2011, 2012)
14.4 Estimating Sample Size 159
and Littell et al. (2006) 163
14.4.1 Non-central t-Distribution 159
14.4.2 Non-central F-Distribution 160 References 164
14.5 Sample Size Balance and Allocation
Ratio 162

14.1 Introduction BOX 14.1


Definitions
Comparisons usually involve tests of statistical
hypotheses. The purpose of hypothesis testing is to Power: probability of correctly rejecting the null
determine if there is sufficient evidence to support a hypothesis if the alternative hypothesis is true,
claim (the statistical hypothesis) about a population true positive (1 − β). The probability of a false neg-
parameter based on a sample of data. A hypothesis ative is Type II error β.
test can be designed to minimise the likelihood of Significance: probability of observing the sample
making false claims and to maximise the ability of result if the null hypothesis is true. The probabil-
the test to detect a difference when one exists. Increas- ity of a false positive is Type I error α.
ing sample size is one method of increasing the relia- Non-centrality parameter: a measure of the differ-
bility of test interpretations and precision of estimates. ence between the mean under the alternative
However, increasing sample size increases costs and hypothesis compared to that expected for the null.
time to obtain the data and, of course, increases P-value: probability of observing the result or a
the number of animals. Right-sizing experiments result more extreme in the sample data if the null
involved trade-offs between statistical precision, hypothesis is true.
balancing the probabilities of different kinds of false Statistical significance is declared when P ≤ α.
claims, and the constraints related to resource availa-
bility and ethical considerations. the conditional probability of correctly rejecting
the null hypothesis H0 when the alternate hypothe-
14.2 Power and Significance sis HA is true:
Power and significance are related but distinct Power = Pr Reject H 0 HA is true
statistical concepts (Box 14.1, Table 14.1). Power is

A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
156 A Guide to Sample Size for Animal-based Studies

Table 14.1: The relationship between significance,


confidence, power, and Type I and Type II errors.
BOX 14.2
Minimum Information Planning Sample Size for
If the null True False Comparative Studies
hypothesis is

Rejected α False positive 1 − β (power)


▪ Type I and type II error rates, α and β
Type I error True positive ▪ Primary outcome (response) variable
1−α β False negative
▪ Expected variance of the response
Not rejected
True negative Type II error ▪ The minimal biologically relevant difference to
be detected
▪ Definition of the experimental unit
It is the probability of detecting a real effect or dif- ▪ Randomisation strategy for experimental unit
ference when one exists (the probability of a true selection, treatment allocation, and sequence
positive). The Type II error β is the probability of allocation.
obtaining a false negative, or not rejecting the null
hypothesis when it is false. Therefore, power is
(1 − β). High power means that the test is less likely Table 14.2 Sample size, power, and significance.
to miss a real effect – there is a low probability of
Sample size is determined by significance (α), power
making a type II error. Low power means that the (1 − β), and the number of tails of the test (one-tailed
test has a high risk of Type II error and is more likely or two-tailed).
to miss a real effect. For a given effect size, significance (α), and power
Recall that the Type I error is the probability of (1 − β), larger sample sizes are required for a
obtaining a false positive or incorrectly rejecting two-tailed test compared to a one-tailed test.
the null hypothesis when it is true. The probability Power increases as effect size increases.
of Type I error is specified by the significance thresh-
Power increases with increasing significance (α) but
old α. The confidence level (1 − α) is the probability smaller confidence (1 − α). Larger values of α
of not rejecting the null hypothesis given it is correspond to smaller confidence levels and
true (true positive). When a statistical test is applied therefore greater precision. A study with α at 5%
to the sample data, the resulting P-value is (95% confidence) has greater power to detect a true
effect than if α is at 1% (99% confidence).
compared to the significance threshold value α.
A statistically significant result occurs when P ≤ α. Power is maximised as β (the probability of type II
In addition to pre-specification of power and con- error) is minimised. However, if false positives (Type
I errors) are considered more serious than false
fidence, the minimum items of information for a negatives (Type II errors), a rule of thumb to ensure
comparative study will include definitions for the a target power of 80% is to use a ratio of β to α as 4:1.
primary outcome variable, the biologically relevant
difference between groups, and the measure of var-
iance. The study is powered off the number of exper- have both insufficient power to detect true effects if
imental units, so these must be defined, and a they exist (essentially wasting all the animals in the
randomisation plan devised both for allocation study) and a high probability of false positives, result-
of treatments to units, and the measurements ing in over-estimation of efficacy or benefit of a test
sequence (Box 14.2). intervention (Sena et al. 2010). Too-large a sample
Determining an appropriate sample size involves size wastes the excess animals, without contributing
trade-offs between the risks of false positives and false much additional information.
negatives. Power is also determined by sample size, Statistical significance is frequently misinter-
significance level, and effect size (Table 14.2). Increas- preted. The P-value is a statement of the statistical
ing sample size can increase the power of a test and probability of observing the result or a result more
reduce standard error of the estimate, making it easier extreme if the null hypothesis is true. Statistical sig-
to detect an effect if it exists. However, ‘wrong sample nificance does not mean the results are biologically
size bias’ occurs when sample size is either too small meaningful or clinically actionable. Conversely,
or too large to meet study objectives (Catalogue of failure to reject the null hypothesis does not ‘prove’
Bias Collaboration et al. 2017). A non-significant result from a too-small study neither proves the null hypothesis nor signifies that results are unimportant (Altman 1991; Altman and Bland 1995; Gardner and Altman 2012; Cumming 2012; Greenland et al. 2016; Amrhein et al. 2019). The threshold for significance (conventionally 0.05) is arbitrary and has no clinical or biological meaning (Altman 1991). Therefore, dichotomising results into 'significant' or 'non-significant' loses almost all quantitative information necessary for interpretation. In addition, P-values are unstable, especially when effect size is small and variation is large. P-values can vary considerably between studies simply due to random variation (Greenland et al. 2016). Statistically significant results from small underpowered studies are likely to be false positives and are unlikely to replicate (Button et al. 2013).

Performing sample size calculations to 'find a statistically significant result' should never be a study goal (Chapter 3). 'Chasing significance' (Ware and Munafò 2016; Szucs 2016; Marín-Franch 2018) leads to questionable research practices such as P-hacking and N-hacking. P-hacking is the selection of data and analysis methods to produce statistically significant results (Head et al. 2015). N-hacking is the selective manipulation of sample size to achieve statistical significance, usually by increasing sample size, cherry-picking observations, or excluding outliers without justification (Szucs 2016). Both practices increase the false positive rate and violate standards of ethical scientific and statistical practice (Wasserstein and Lazar 2016).

BOX 14.3
Centrality and Non-centrality

The central probability distribution describes the distribution of the test statistic under the null hypothesis.

The non-central distribution describes the distribution of the test statistic under an alternate hypothesis.

The non-centrality parameter λ is the amount by which the distribution of the test statistic estimated from sample data differs from that expected under the null hypothesis.

14.3 Non-centrality

The non-centrality parameter (Box 14.3) is a measure of the difference between the population means under the alternative hypothesis. It provides a method of evaluating the power of the test with respect to the alternative hypothesis (Carlberg 2014). A large non-centrality parameter represents a large effect size, providing more evidence against the null hypothesis and increasing the power of the test. Calculations based on non-central distributions of the relevant test statistic can be used to determine power for a given sample size or, alternatively, to find the sample size necessary to achieve a target power (O'Brien and Muller 1993; Harrison and Brady 2004). This property is especially useful for animal-based studies, which are frequently small, often involve data that do not conform to a normal distribution, or require complex study designs. For these studies, sample sizes based on large-sample normal approximations will be highly inaccurate, and simple sample size formulae may not even exist.

The central probability distribution describes how the test statistic is distributed under the null hypothesis. Figure 14.1 shows an example of the central distribution for the F-statistic. The critical value of the test statistic defines the threshold of statistical significance. The critical F-value is defined by Fα,ν1,ν2, with degrees of freedom ν1 and ν2 and probability of Type I error α. If the null hypothesis is true, the values of the test statistic calculated from observed data will occur outside the region defined by the cut-off with probability α, the probability of a false positive.

Figure 14.1: The central F-distribution with degrees of freedom (3, 20) and α = 0.05. The critical value F0.05,3,20 is 3.098. The significance threshold is α. Values of the F-statistic obtained from the sample data are statistically significant if they exceed the critical F-value (the region labelled 'Reject H0').

The non-central distribution describes the distribution of the test statistic when the null hypothesis is false and an alternative hypothesis is true. The non-centrality parameter λ is a measure of the difference by which the distribution of a test statistic estimated from sample data differs from the distribution expected under the null hypothesis of no difference (Kirk 2012). When there are no differences in group means, λ is zero, and the test statistic follows the central distribution. As λ gets larger, the distribution peak moves further away from zero and the distribution stretches out. The power of the test is the area to the right of the critical value under the non-central distribution (Figure 14.2). Therefore, as λ gets larger, power increases. Power is the probability that the non-central test statistic is equal to, or greater than, the critical value of the test statistic for a given level of significance:

Power = Pr(Non-central test statistic ≥ Critical central value of the test statistic)

The power of a test is not one value but a range of values that depend on the parameters in the alternate hypothesis. Sample size is only one factor that determines the magnitude of the non-centrality parameter. It is also affected by the difference between the true population mean and the hypothesised mean, the variance, the distribution of the test statistic, the covariance structure of the statistical model, and the study design (Stroup 2012).

BOX 14.4
Relationships Between Non-centrality Parameter λ, ϕ, and Cohen's Effect Size f

λ = f²·N
λ = F × (numerator df)
λ = SStreatment / MSE
λ = ϕ²·k
ϕ = f·√(N/k) = f·√n   (n = sample size per group)

Definitions, terminology, and notation for the non-centrality parameter are inconsistent and can be confusing (Carlberg 2014). Older textbooks that rely on power charts and nomograms usually refer to the quantity phi (ϕ) for tabulated values of power (Pearson and Hartley 1958; Zar 2010); ϕ is sometimes identified as the non-centrality parameter itself. However, ϕ is not the same as λ. For example, for single-factor ANOVA, ϕ is equivalent to the square root of the ratio of the non-centrality parameter to the number of groups, ϕ = √(λ/k). Conversion relationships for λ, ϕ, and f (Winer et al. 1991; Dollins 1995) are given in Box 14.4.

In G∗Power, the value of the non-centrality parameter is part of the automatic output for sample size calculations. Dudek (2022) provides a useful overview of power concepts with accompanying R code.

Figure 14.2: Central (λ = 0) and non-central (λ > 0) t-distributions with 10 degrees of freedom, shown for λ = 0, 4, 8, and 16. As λ gets larger, the curve shifts to the right and becomes more asymmetrical.
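To make the Box 14.4 relationships concrete, the following minimal R sketch computes λ and ϕ from an assumed Cohen's f for a single-factor design and confirms the identities above (all numerical values are invented for illustration):

# Illustrative values (assumed): f = 0.40, k = 4 groups, n = 10 per group
f <- 0.40
k <- 4
n <- 10
N <- k * n

lambda <- f^2 * N        # non-centrality parameter: lambda = f^2 * N
phi    <- f * sqrt(n)    # Pearson-Hartley phi: phi = f * sqrt(n)
all.equal(lambda, phi^2 * k)   # TRUE: lambda = phi^2 * k

# Power for single-factor ANOVA from the non-central F-distribution
alpha  <- 0.05
df1    <- k - 1
df2    <- N - k
F.crit <- qf(1 - alpha, df1, df2)
pf(F.crit, df1, df2, ncp = lambda, lower.tail = FALSE)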

14.4 Estimating Sample Size

Sample size using non-centrality is computed by iterative numerical methods (Fenschel et al. 2011; Stroup 2011, 2012). Depending on the type of information available, estimates of the non-centrality parameter λ are determined from:

1. Summary statistics for each group. These can be means and standard deviations, or standard error and sample size obtained from prior data.

2. An estimate of the mean difference and the pooled variance of the difference.

3. Exemplary or raw data. An exemplary or 'dummy' data set is artificial data consisting of values obtained from historical data or simulated from the statistical models used to model the expected responses. The λ is calculated from the mock data by fitting a prespecified analysis model (e.g. ANOVA, mixed models, generalised linear models) and extracting the necessary statistics from the output (O'Brien and Muller 1993; Castelloe and O'Brien 2001). The model output step provides F-values and degrees of freedom for calculating λ.

Calculations for λ and the critical value for the test statistic (λ = 0) are made by iterating over a range of candidate sample sizes (Littell et al. 2006; Fenschel et al. 2011; Stroup 2012). The sample size chosen is that for which the computed power equals or exceeds the pre-specified target power. Total sample size N, sample size per group ni, or maximum number of groups k for a specified total sample size can be calculated with this method.

14.4.1 Non-central t-Distribution

The one-sample t-statistic follows the Student's t-distribution with n − 1 degrees of freedom:

t = (x̄ − μ) / (s/√n)

where x̄ is the sample mean of a random sample from a normal population, s is the standard deviation, n is the sample size, and μ is the population mean. The effect size is

d = (x̄ − μ) / s

The central t-distribution is centred on zero. Suppose the true population mean is actually μ1; then the t-statistic will follow a non-central t-distribution with non-centrality parameter

λ = (μ1 − μ) / (σ/√n)

and values of t distributed around λ. Under the null hypothesis, the difference between μ and μ1 is zero, and the test statistic has a standard t-distribution. The distributions move to the right as λ increases. Under the alternative hypothesis, the test statistic now has a non-zero-mean t-distribution.

The power to detect the difference d with significance α is approximately

1 − β = Tn−1( tα/2,n−1 | λ ),   λ = (d/σ)·√n

where Tn−1 is the cumulative distribution function of the non-central t-distribution with degrees of freedom n − 1 and non-centrality parameter λ (Box 14.5 gives definitions for cumulative and probability density functions).

BOX 14.5
Cumulative Distribution Function (cdf) and Probability Density Function (pdf)

The cumulative distribution function (cdf) gives the probability of a random variable X being smaller than or equal to some value x, Pr(X ≤ x) = F(x). The inverse of the cdf gives the value x that would make F(x) return a particular probability p: F−1(p) = x.

The probability density function (pdf) returns the relative frequency of the value of a statistic, or the probability that a random variable X takes on a certain value. The pdf is the derivative of the cdf.

The critical value of t is estimated by the inverse function of t, calculated from the total sample size n, degrees of freedom df = n − 1, and significance level α (Harrison and Brady 2004). In SAS, the tinv function returns the pth quantile from the Student's t-distribution (the R equivalent is qt):

t_crit = tinv(1-alpha/2, n-1);

For a one-sided test, the one-sided significance level α replaces the two-sided α/2.

Power is then computed from the critical value of t, the associated degrees of freedom, and the non-centrality parameter (ncp), in SAS as

power = 1 - probt(t_crit, n-1, ncp);

and in R as

ncp    <- d/(s/sqrt(n))
t.crit <- qt(1 - alpha/2, df = n - 1)
power  <- 1 - (pt(t.crit, df = n - 1, ncp = ncp) - pt(-t.crit, df = n - 1, ncp = ncp))
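The R snippet above can be wrapped into a small function and iterated over candidate sample sizes, as described in Section 14.4. A minimal sketch, with the difference d and standard deviation s assumed purely for illustration:

# Power of a two-sided one-sample t-test via the non-central t-distribution
power.nct <- function(n, d, s, alpha = 0.05) {
  ncp    <- d / (s / sqrt(n))
  t.crit <- qt(1 - alpha/2, df = n - 1)
  1 - (pt(t.crit, df = n - 1, ncp = ncp) - pt(-t.crit, df = n - 1, ncp = ncp))
}

# Iterate over candidate n until computed power meets the target
target <- 0.80
for (n in 2:200) if (power.nct(n, d = 5, s = 10) >= target) break
n   # smallest n with power >= 0.80; compare with
    # power.t.test(delta = 5, sd = 10, power = 0.80, type = "one.sample")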

The two-sample t-statistic follows the Student's t-distribution with n − 2 degrees of freedom. Here, the t-statistic is

t = (x̄1 − x̄2) / [sp·√(1/n1 + 1/n2)]

where sp is the pooled standard deviation of the two samples. The effect size is

d = (x̄1 − x̄2) / sp

Example: Increasing Power By Reducing Variation: Rat Model of Bone Fracture

(Data from Prodinger et al. 2018.) Biomechanical evaluations of rodent long bone fractures frequently use a three-point bending apparatus, with support bars set to a fixed span. Femur failure loads in a fixed-span model of femur fracture in n = 20 rats averaged 206 (SD 30) N. However, it was found that adjusting the span to accommodate individual bone length may reduce variation in response by up to 30%.

Suppose it was determined that a 20 N increase in a second experimental group was biologically important. What sample size would be required for this level of variation, and what is the effect on power and sample size if the variation could be reduced by changing from the fixed-span to the individualised-span protocol?

Simple simulations can be performed using G∗Power for a two-sample t-test for the difference between two independent means. Assuming the standard deviation is similar to that for the original fixed-span protocol, the effect size is d/s = 20/30 = 0.67. For a one-tailed test with α = 0.05 and nominal power of 0.8, the critical t-value is 1.673 and the non-centrality parameter is 2.551. The total sample size is 58, with 29 rats in each group.

If the standard deviation can be reduced by 30%, then SD is approximately 20, and the effect size is now 20/20 = 1.0. Then the critical t-value is 1.706 and the non-centrality parameter is 2.646. The total sample size is 28, with only 14 per group, and an increase in realised power to 0.82.

Suppose the target effect size is 2.5, for a non-centrality parameter of >3. The total sample size required for a power of 0.8 is now 6, or 3 per group. However, an effect size this large would require a standard deviation of approximately 8 N, which is unlikely to be practically or biologically feasible.
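As a cross-check, the G∗Power results in this example can be approximated in R with power.t.test, which uses the same non-central t machinery:

# Fixed-span protocol: detect a 20 N difference with SD = 30 (d = 0.67)
power.t.test(delta = 20, sd = 30, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "one.sided")
# returns n of about 29 per group; 58 rats in total

# Individualised-span protocol: SD reduced by ~30% to 20 (d = 1.0)
power.t.test(delta = 20, sd = 20, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "one.sided")
# returns n of about 13.5 per group; round up to 14 (28 rats in total)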
14.4.2 Non-central F-Distribution

The F-statistic is the ratio of the treatment (between-group) variance to the error variance:

F = MST/MSE = (σ²T + σ²e) / σ²e

where σ²T is the estimate of the variation due to differences between group means ('treatments'), and σ²e is the estimate of the variation due to error. The distribution of the F-ratio is determined by the degrees of freedom for the treatment group, the numerator degrees of freedom ν1 = (k − 1), and the error, or denominator, degrees of freedom ν2 = k(n − 1) = (N − k), where k is the number of 'groups', N is the total sample size, and n is the sample size per group (if group sizes are balanced).

If group sizes are equal, the non-centrality parameter for the non-central F is the weighted sum of squares of treatment means:

λ = SST/MSE = (k − 1)·F

where (k − 1) is the numerator degrees of freedom (Liu and Raudenbush 2004). Because t² = F, the methods used to calculate the non-central F-variate can also be used to calculate the non-central t-variate (Winer et al. 1991).

The shape of the non-central F-distribution is determined by the numerator and denominator degrees of freedom and the variance of the treatment group means, σ²T (Carlberg 2014). If the null hypothesis cannot be rejected, there are no differences between group means, σ²T is zero, and the F-value is equal to 1. Therefore the ratio of variances will follow the central F-distribution, and the non-centrality parameter is zero (λ = 0).
Under the alternate hypothesis, the sampling distribution of the F-ratio is the non-central F-distribution F(ν1, ν2, λ), λ > 0, and the F-distribution stretches out to the right (Figure 14.3).

Figure 14.3: Central (λ = 0) and non-central (λ > 0) F-distributions with degrees of freedom (3, 20), shown for λ = 0, 4, 8, and 16. The critical F3,20 = 3.098. The area under the curve to the right of the critical value for each non-central curve is the power of the test.

The critical F-value is obtained from the 100(1 − α)% quantile value of the central F-distribution and corresponding degrees of freedom. For example, for α = 0.05, the quantile is the 100(1 − 0.05), or 95th, quantile value of the central F-distribution with λ = 0. This is calculated from the inverse function of the corresponding cumulative distribution function (cdf), setting the non-centrality parameter to zero. In SAS this is

F_crit = finv(1-alpha, numdf, dendf, 0);

and in R

F.crit <- qf(alpha, numdf, dendf, lower.tail = FALSE)

where numdf is the numerator degrees of freedom ν1 = (k − 1), and dendf is the denominator degrees of freedom ν2 = k(n − 1) = (N − k).

The associated power for each N is then estimated as Power = Pr[F(ν1, ν2, λ) ≥ Fcrit(ν1, ν2)], in SAS as

power = 1 - probf(F_crit, numdf, dendf, ncp);

and in R as

power <- pf(F.crit, numdf, dendf, ncp = lambda, lower.tail = FALSE)
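For the distributions shown in Figure 14.3, the following R lines reproduce the critical value and compute the power at each non-centrality parameter:

# Critical value of the central F-distribution with df (3, 20) at alpha = 0.05
F.crit <- qf(0.05, 3, 20, lower.tail = FALSE)   # 3.098

# Power = area to the right of F.crit under each non-central F-distribution
sapply(c(4, 8, 16), function(ncp)
  pf(F.crit, 3, 20, ncp = ncp, lower.tail = FALSE))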

Example: Determining Sample Size from Exemplary Data: Mouse Breeding Productivity

(Data from Hull et al. 2022.) A study was designed to compare two handling methods on reproductive indices for laboratory mice. The study design was a two-arm randomised controlled trial, with breeding pair as the random effect and handling method as the fixed effect.

The primary outcome was the number of pups produced per pair. The outcome data were discrete counts, which are non-Gaussian. Analyses required a generalised linear mixed model with a Poisson distribution for counts. Therefore, conventional power calculation formulae for determining sample size were inappropriate, and power and sample size were estimated by simulation (Littell et al. 2006; Stroup 2011, 2012).

Sample size was estimated in three steps (sample SAS code is in Appendix 14.A). First, an exemplary data set was created for the total number of pups expected in each group. Data were based on historical information for controls and the anticipated effect size. The expected total number of pups for control animals in this study was:

5 litters × 6 pups/litter = 30 pups per pair

The expected difference was an additional pup per litter, or 4–5 additional pups per pair, for a total of 34–35 pups in the test group. It was determined a priori that an increase of one extra pup per pair was operationally significant enough to justify the facility switching to the new handling method. In this example, exemplary data were generated for a sample size of 40 pairs per treatment arm. The 40 'observations' in each group had the values specified by the anticipated pup counts (30 and 35).

Second, a generalised linear model with a Poisson distribution was fitted to the exemplary data in SAS proc glimmix. The model output step provides the expected F-values and degrees of freedom (df) for calculating the non-centrality parameter, approximated as (numerator df) × (F-value). The critical F-value was estimated from the inverse function for the χ² distribution, given the pre-specified confidence (1 − α), the degrees of freedom for the numerator (here k − 1 = 2 − 1 = 1), and λ = 0 for the central distribution.

Finally, power was calculated by estimating the probability that the computed F-statistic exceeded the critical F-value under the non-central F.

For a total sample size of 80 pairs (40 pairs per group), the power was 0.88 for a difference of four extra pups per pair and 0.97 for five extra pups per pair. However, the power for an operational difference of one extra pup per pair with this sample size was only 0.114. To achieve power >0.8 with such a small effect size, the study would have required approximately 250 pairs. This was clearly unfeasible in terms of housing space and time. Furthermore, the number of pups produced would have greatly exceeded demand, which would have necessitated the euthanasia of excess animals that could not be used.
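The three steps can also be sketched in R with a Poisson generalised linear model. This is an illustrative translation of the SAS approach in Appendix 14.A, not the study's original code:

# Exemplary data: counts set to their expected values, 40 pairs per arm
n    <- 40
mice <- data.frame(trt  = rep(c("control", "test"), each = n),
                   pups = rep(c(30, 35), each = n))

# Fit the intended analysis model to the exemplary data
fit <- glm(pups ~ trt, family = poisson, data = mice)

# Likelihood-ratio chi-square for the treatment effect approximates the
# non-centrality parameter (numerator df = 1 here)
ncp <- anova(fit, test = "Chisq")$Deviance[2]

# Critical value under the central chi-square, then power under the
# non-central chi-square (Stroup 2012)
alpha <- 0.05
crit  <- qchisq(1 - alpha, df = 1)
1 - pchisq(crit, df = 1, ncp = ncp)   # ~0.97 for five extra pups per pair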
14.5 Sample Size Balance and Allocation Ratio

Balanced designs have equal numbers of experimental units in each group. A balanced design has the most power for a given sample size N. The allocation ratio r describes the balance of subjects in each arm of the trial. If sample size is balanced, then the allocation ratio r = n1/n2 = 1.

In general, unequal allocation and unbalanced designs are not recommended. Occasionally more power might be obtained by an unequal allocation ratio if the response in one group is known to be more variable than that in the other. Then the appropriate r can be computed as the ratio of standard deviations, r = n1/n2 = s1/s2, and the group with the larger expected standard deviation is allocated more subjects. However, compared to balanced designs, unbalanced designs require substantially more subjects to detect the same effect size with the same power (e.g. 12% more for a 2:1 allocation ratio), making them more costly, exposing more subjects to potential risk (Hey and Kimmelman 2014), and requiring the use of far more animals.

The effect of design imbalance on sample size can be assessed by computing the relative efficiency of the design. Relative efficiency (RE) is the increase in sample size needed for a candidate design to match the precision of a perfectly balanced completely randomised design. The RE of a design is assessed by comparing an unbalanced to a balanced design as the ratio of the respective sample sizes and standard deviations. As group sample sizes become more unbalanced, RE (the ratio of variances) decreases, and total N increases to achieve the same precision as a design with equal sample sizes per treatment group. Relative efficiencies can be estimated for a variety of designs, methods of analysis, and comparisons (difference, matched, stratified, covariance-adjusted), and for both experimental and observational studies.

Example: Relative Efficiency of a Balanced Versus an Unbalanced Allocation Design

Suppose a study design is proposed with two treatments A and B, a fixed total sample size N of 20, and variance σ². The sample sizes of each group are nA and nB, respectively, and nA + nB = 20. The standard error of the difference between treatment means is

SEdiff = σ·√(1/nA + 1/nB)

The RE of the design is the ratio of the variance of the balanced design (nA = nB) to that of the unbalanced design (nA ≠ nB):

RE = SE²balanced / SE²unbalanced = [σ²bal·(1/nA + 1/nB)balanced] / [σ²unb·(1/nA + 1/nB)unbalanced]

If variances are assumed to be equal, the variance terms cancel, so RE is seen to depend directly on the ratio of the sample-size terms for each design.

Table 14.3 shows the relative efficiencies for each allocation ratio, from the most balanced (r = 1) to the least balanced (r = 1/19) condition. In this example, the RE for an unbalanced design with nA = 3 and nB = 17 is 0.51: the unbalanced design requires roughly twice as many animals to achieve the same level of precision as a balanced design (nA = nB = 10). For the unbalanced design to obtain the same precision as the balanced design, the sample size required is 1/RE = 1/0.51 = 1.96 times that of the balanced design. The adjusted sample sizes are nA = 3 × 1.96 = 5.88 ≈ 6 and nB = 17 × 1.96 = 33.33 ≈ 34, for a total N = 40.
Table 14.3: Relative efficiency of balanced versus unbalanced allocation designs. The total sample size N = nA + nB is 20.

nA   nB   √(1/nA + 1/nB)   Relative efficiency   Correction   Adjusted nA   Adjusted nB   Adjusted N(a)
10   10   0.447            1.00                  1.00         10            10            20
 9   11   0.449            0.99                  1.01         9.09          11.11         22
 8   12   0.456            0.96                  1.04         8.33          12.50         23
 7   13   0.469            0.91                  1.10         7.69          14.29         23
 6   14   0.488            0.84                  1.19         7.14          16.67         25
 5   15   0.516            0.75                  1.33         6.67          20.00         27
 4   16   0.559            0.64                  1.56         6.25          25.00         32
 3   17   0.626            0.51                  1.96         5.88          33.33         40
 2   18   0.745            0.36                  2.78         5.56          50.00         56
 1   19   1.026            0.19                  5.26         5.26          100.00        106

(a) Adjusted N values are rounded up.
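The columns of Table 14.3 can be generated directly. A minimal R sketch:

# Relative efficiency of each allocation against the balanced design (10, 10)
nA <- 10:1
nB <- 20 - nA
se.term <- sqrt(1/nA + 1/nB)               # the sqrt(1/nA + 1/nB) column
RE      <- (1/10 + 1/10) / (1/nA + 1/nB)   # relative efficiency vs balanced
corr    <- 1 / RE                          # correction factor
data.frame(nA, nB, se.term = round(se.term, 3), RE = round(RE, 2),
           correction = round(corr, 2),
           adj.nA = round(nA * corr, 2), adj.nB = round(nB * corr, 2))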

14.A Sample SAS Code for Estimating Power for a Given Sample Size Based on Exemplary Non-Gaussian Data

Modified from Stroup (2011, 2012) and Littell et al. (2006).

1. Create exemplary data. The primary outcome was number of pups per pair, with an expected number of 30 for the control and 35 for the test group. The do loop assigns the number of experimental units for each treatment arm.

data mice;
/* trt = treatment identifier, pups = expected total pup count for each arm */
input trt $ pups;
/* best-guess sample size per treatment arm;
   compare different values until the desired power is obtained */
n=40;
do obs=1 to n;
output;
end;
/* input expected pup counts for each arm */
datalines;
control 30
test 35
;
run;
proc print data=mice;
run;

2. Perform the analysis on the exemplary data using the anticipated statistical model. This extracts the non-central F-value and degrees of freedom to enable calculation of the non-centrality parameter.

proc glimmix data=mice;
class trt;
model pups = trt / chisq link=log dist=poisson;
contrast 'control vs experimental' trt 1 -1 / chisq;
ods output tests3=F_overall contrasts=F_contrasts;
run;

3. Calculate power. The non-centrality parameter (ncp) is calculated from the F statistics extracted from the output. F_crit defines the critical value for F under the central distribution.

data power;
set F_overall F_contrasts;

/* calculate non-centrality parameter */
ncp=numdf*Fvalue;

/* calculate power for a pre-specified confidence (alpha) */
alpha=0.05;

/* for count data the inverse chi-square is used (Stroup 2012) */
f_crit=cinv(1-alpha,numdf,0);
power=1-probchi(f_crit,numdf,ncp);

/* the inverse F is: */
f_crit=finv(1-alpha,numdf,dendf,0);
power=1-probf(f_crit,numdf,dendf,ncp);
run;

/* output power and compare to the desired value */
proc print data=power;
run;

References

Altman, D.G. (1991). Practical Statistics for Medical Research. Chapman & Hall/CRC.
Altman, D.G. and Bland, J.M. (1995). Absence of evidence is not evidence of absence. BMJ 311 (7003): 485. https://doi.org/10.1136/bmj.311.7003.485.
Amrhein, V., Greenland, S., and McShane, B. (2019). Scientists rise up against statistical significance. Nature 567 (7748): 305–307. https://doi.org/10.1038/d41586-019-00857-9.
Button, K.S., Ioannidis, J.P., Mokrysz, C. et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14 (5): 365–376. https://doi.org/10.1038/nrn3475.
Carlberg, C. (2014). Statistical Analysis: Microsoft Excel 2013. Pearson Education.
Castelloe, J.M. and O'Brien, R.G. (2001). Power and sample size determination for linear models. SUGI 26, Paper 240-26. https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p240-26.pdf.
Catalogue of Bias Collaboration, Spencer, E.A., Brassey, J., Mahtani, K., and Heneghan, C. (2017). Wrong sample size bias. Catalogue of Bias. https://catalogofbias.org/biases/wrong-sample-size-bias/ (accessed 2022).
Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York, NY: Routledge.
Dollins, A.B. (1995). A Computational Guide to Power Analysis of Fixed Effects in Balanced Analysis of Variance Designs. Report DODPI95-R-0003. Department of Defense Polygraph Institute, Fort McClellan, AL.
Dudek, B. (2022). Test statistics, null and alternative distributions: type II errors, power, effect size, and non-central distributions. https://bcdudek.net/power1/non_central.html#power-facilities-in-r (accessed 2023).
Fenschel, M.C., Amin, R.S., and van Dyke, R.D. (2011). A macro to estimate sample-size using the non-centrality parameter of a non-central F-distribution. SA18-2011. https://www.mwsug.org/proceedings/2011/stats/MWSUG-2011-SA18.pdf (accessed 2019).
Gardner, M.J. and Altman, D.G. (2012). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal 292 (6522): 746–750. https://doi.org/10.1136/bmj.292.6522.746.
Greenland, S., Senn, S.J., Rothman, K.J. et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 31 (4): 337–350. https://doi.org/10.1007/s10654-016-0149-3.
Harrison, D.A. and Brady, A.R. (2004). Sample size and power calculations using the noncentral t-distribution. The Stata Journal 4 (2): 142–153.
Head, M.L., Holman, L., Lanfear, R. et al. (2015). The extent and consequences of P-hacking in science. PLoS Biology 13 (3): e1002106.
Hey, S.P. and Kimmelman, J. (2014). The questionable use of unequal allocation in confirmatory trials. Neurology 82 (1): 77–79. https://doi.org/10.1212/01.wnl.0000438226.10353.1c.
Hull, M.A., Reynolds, P.S., and Nunamaker, E.A. (2022). Effects of non-aversive versus tail-lift handling on breeding productivity in a C57BL/6J mouse colony. PLoS ONE 17 (1): e0263192. https://doi.org/10.1371/journal.pone.0263192.
Kirk, R. (2012). Experimental Design: Procedures for Behavioral Sciences. SAGE.
Littell, R.C., Milliken, G.A., Stroup, W.W. et al. (2006). SAS for Mixed Models, 2e. Cary, NC: SAS Institute, Inc.
Liu, X. and Raudenbush, S. (2004). A note on the non-centrality parameter and effect size estimates for the F test in ANOVA. Journal of Educational and Behavioral Statistics 29 (2): 251–255. https://doi.org/10.3102/10769986029002251.
Marín-Franch, I. (2018). Publication bias and the chase for statistical significance. Journal of Optometry 11 (2): 67–68. https://doi.org/10.1016/j.optom.2018.03.001.
O'Brien, R.G. and Muller, K.E. (1993). Unified power analysis for t-tests through multivariate hypotheses. In: Applied Analysis of Variance in Behavioral Science (ed. L.K. Edwards), chap. 8, 297–344. New York: Marcel Dekker.
Pearson, E. and Hartley, H.O. (1958). Biometrika Tables for Statisticians, vol. 1. Cambridge: Cambridge University Press.
Prodinger, P.M., Bürklein, D., Foehr, P. et al. (2018). Improving results in rat fracture models: enhancing the efficacy of biomechanical testing by a modification of the experimental setup. BMC Musculoskeletal Disorders 19: 243. https://doi.org/10.1186/s12891-018-2155-y.
Sena, E.S., van der Worp, H.B., Bath, P.M. et al. (2010). Publication bias in reports of animal stroke studies leads to major overstatement of efficacy. PLoS Biology 8 (3): e1000344. https://doi.org/10.1371/journal.pbio.1000344.
Stroup, W.W. (2011). Living with generalized linear mixed models. SAS Global Forum 2011: Statistics and Data Analysis. https://support.sas.com/resources/papers/proceedings11/349-2011.pdf (accessed 2019).
Stroup, W.W. (2012). Generalized Linear Mixed Models: Modern Concepts, Methods and Applications. Boca Raton: Chapman & Hall/CRC.
Szucs, D. (2016). A tutorial on hunting statistical significance by chasing N. Frontiers in Psychology 7: 1444. https://doi.org/10.3389/fpsyg.2016.01444.
Ware, J.J. and Munafò, M.R. (2016). Significance chasing in research practice: causes, consequences, and possible solutions. Addiction 110 (1): 4–8. https://doi.org/10.1111/add.12673.
Wasserstein, R.L. and Lazar, N.A. (2016). The ASA's statement on P-values: context, process, and purpose. The American Statistician 70: 129–133. https://doi.org/10.1080/00031305.2016.1154108.
Winer, B.J., Brown, D.R., and Michels, K.M. (1991). Statistical Principles in Experimental Design, 3e. New York: McGraw-Hill.
Zar, J.H. (2010). Biostatistical Analysis, 5e. Upper Saddle River: Prentice-Hall.
15
A Bestiary of Effect Sizes

CHAPTER OUTLINE HEAD
15.1 Introduction 167
15.2 Effect Size Basics 169
15.3 d Family Effect Sizes 169
  15.3.1 The Basic Equation for Continuous Outcome Data 170
  15.3.2 Two-Group Comparisons, Continuous Outcomes, Independent Samples 170
15.4 r Family (Strength of Association) Effect Sizes 171
  15.4.1 Correlation 171
  15.4.2 Regression 171
  15.4.3 Analysis of Variance (ANOVA) Methods 173
15.5 Risk Family Effect Sizes 174
  15.5.1 Risk Difference, Relative Risk, and Odds Ratio 174
  15.5.2 Interpretation 175
  15.5.3 Nominal Variables: Cramer's V 176
15.6 Time to Event Effect Size 176
15.7 Interpreting Effect Sizes 176
  15.7.1 Interpreting Effect Sizes as Ratios 177
  15.7.2 What Is a Meaningful Effect Size? 178
15.A Using SAS Proc Genmod to Calculate OR and RR (Adapted from Spiegelman and Hertzmark 2005). As an alternative to fitting a logistic regression model, OR and RR can be calculated using the log-binomial maximum likelihood estimators. 178
References 178

15.1 Introduction

An effect size is an estimate that describes the magnitude and direction of the difference between two groups or the relationship between two or more variables. For interpretation of results, effect sizes are preferred to 'statistical significance' or p-values. This is because an effect size uses all information about the size and direction of the difference and is the best summary of the 'effect' of the test intervention, along with a measure of precision (Hojat and Xu 2004; Sun et al. 2010; Cumming 2012).

Understanding how effect sizes are constructed for different types of study is essential for both preliminary sample size estimation and interpretation of results (Box 15.1). First, there is no one definition of 'effect size' (Cumming 2012). For practical purposes, there are four major categories or families of effect sizes: group difference effect sizes (d family), strength of association effect sizes (r family), and risk and time-to-event effect sizes (Table 15.1). However, in practice, over 70 different effect size indices have been identified (Schober et al. 2018). Second, different software programmes may calculate the 'same'

BOX 15.1
Effect Sizes

Choose an effect size (or effect size family) most appropriate for testing study hypotheses.
The effect size must be relevant to study objectives (context).
Effect size does not imply causality.
Effect size does not imply statistical significance.
Effect size does not imply practical significance.

Table 15.1: Categories of effect size indices.

Group difference indices
- Mean: Unadjusted average.
- Difference between two means: Unadjusted difference.
- Cohen's d family: Standardised difference in two independent sample means, standardised by the average or pooled standard deviation. If samples are paired, use the standard deviation of the paired differences.
- Hedges' g family: Standardised difference in two independent sample means when N < 20, standardised using the weighted standard deviation.
- Glass's Δ: Standardised difference in two independent sample means, standardised using the control standard deviation.

Strength of association indices
- Correlation r: The strength of linear association between two or more independent variables.
- Regression slope β: The change in the dependent variable with a unit change in the predictor.
- Proportion of variation R²: Also called the coefficient of determination. The strength of association between one dependent variable and one or multiple predictors.
- η²: Correlation ratio. Estimates the proportion of variance in the response variables accounted for by the explanatory variables. Ratio of variance terms SSbetween/SStotal.
- η²p: Partial correlation ratio.
- ω²: Less-biased alternative to η² for small N.
- Cohen's f²: Proportion-of-variance effect size appropriate for multiple regression, multilevel, and repeated-measures data.
- Cramer's V: Strength of association between two nominal variables, obtained from χ² analyses.

Risk estimate indices
- Relative risk, RR: Ratio of the proportion for one group to the proportion for the second group, p1/p2.
- Odds ratio, OR: Ratio of the odds of occurrence of an event (or 'success') p relative to the odds of failure (1 − p) in a test group relative to those for a comparator group: [p1/(1 − p1)] / [p2/(1 − p2)].
- Cohen's h: Difference between two arcsine-transformed proportions.

Time to event indices
- Hazard ratio, HR: Ratio of events p in each group, HR = ln(p1)/ln(p0), or ratio of median survival times, HR = M0/M1.
effect size differently (for example, G∗Power and SPSS; Lakens 2013), and terminology can be idiosyncratic. Therefore, to avoid misunderstanding and misinterpretation, the particular effect size to be used and how it is to be calculated must be clearly described in the protocol and methods. Effect sizes used and presented without context can result in sample size approximations and subsequent interpretations that are extremely misleading.

15.2 Effect Size Basics

Choosing the most appropriate effect size depends on the study objectives and the study context: the study design or statistical model, and the type of effect most relevant to the research question and study objectives. How the effect size is calculated depends on whether or not the effect size is to be standardised, and if standardised, how the standardiser is to be calculated (Cumming 2012).

Regardless of which family of effect sizes is selected, all effect size calculations require the following information:

- The primary outcome. The type of outcome variable dictates which class or family of effect sizes can be used.
- The pre-specified difference between groups most likely to be of practical or biological significance.
- Number of groups to be compared.
- Desired probability for type I error (α) and type II error (β).
- Whether comparisons are one-sided or two-sided.
- Allocation ratio, or sample size balance among groups.

During planning and initial sample size calculations, careful thought should be given to choosing preliminary values for first-pass approximations. Reasonable initial values can be obtained from systematic reviews and meta-analyses, targeted literature reviews, pilot data, and data from previous experiments (Rosenthal 1994; Ferguson 2009; Cumming 2012; Lakens 2013). Lakens (2013) provides an excellent tutorial on calculating and reporting effect sizes.

BOX 15.2
The 'Best' Effect Size

▪ results in maximum power for a given sample size
▪ is based on continuous variables and equal sample sizes per group
▪ is determined by study objectives, study context, and relevance to the research question.

The 'best' effect size is that which provides the most power for a given sample size (Box 15.2). This is usually an effect size based on a continuous variable (rather than binary or time-to-event variables) and a balanced design (equal number of experimental units per treatment arm). Studies using binary or time-to-event outcomes, or designs with unequal allocation, require many more subjects than do balanced studies on continuous outcomes.

15.3 d Family Effect Sizes

The d family of effect sizes describes differences between independent groups. For continuous variables, these are expressed as the differences between group means. Effect sizes based on group differences may be unstandardised or standardised (Box 15.3).

BOX 15.3
d-Family Effect Sizes

Effect sizes describing differences between independent groups.
An unstandardised effect size is a simple difference.
A standardised effect size is the difference scaled by the sample variation.

An unstandardised effect size expresses the difference between groups without adjusting for variation in the sample. Raw effect sizes retain the original units of measurement and thus have the enormous advantage of being easy to interpret when units of measurement are themselves meaningful. For these reasons, raw effect sizes are often used in meta-analyses of health research (Vedula and Altman 2010). Examples of unstandardised effect sizes
include a simple difference between means or proportions, and the sample standard deviation.

A standardised effect size is the difference in means scaled or adjusted by some measure of variation in the sample. In the simplest form, the difference is scaled by the within-group standard deviation. The best-known measure is Cohen's d. This gives the effect size as the difference in means expressed as a multiple of standard deviation units in the sample. As such, a standardised effect size is a dimensionless ratio, independent of units of measurement. This allows a readier comparison of effect sizes across diverse studies with different measurement scales. Standardised effect sizes are thus used extensively in meta-analyses (Vedula and Altman 2010). Standardised effect sizes will also be useful when units of measurement do not have an intrinsic meaning, e.g. ordinal scores such as client satisfaction (Schober et al. 2018). However, standardised effect sizes have several crucial disadvantages. Practical interpretation is extremely difficult because there is no information about the measurement units. When used in meta-analyses, standardised effect sizes may distort diagnostics, such as funnel plots, and result in increased false positives (Zwetsloot et al. 2017).

15.3.1 The Basic Equation for Continuous Outcome Data

The basic equation is also referred to as the fundamental equation (Machin et al. 2006) or the basic formula (van Belle 2008). When outcome data are continuous and normally distributed, the power of a two-sided test to detect a difference between groups is:

z1−β = (d/σ)·√(n/2) − z1−α/2

where d is the difference between group means (μ1 − μ2). Assuming equal group sizes (N = 2n), the sample size per group is

n ≥ 2·[(z1−α/2 + z1−β) / (d/σ)]²

The unstandardised effect size is d. The standardised effect size is d/σ.
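The basic equation is easy to evaluate directly. A minimal R sketch for a two-sided test with α = 0.05 and power 0.80; the standardised effect size of 0.67 is an assumed illustration:

# Sample size per group from the basic equation (normal approximation)
n.basic <- function(d.std, alpha = 0.05, power = 0.80) {
  z.a <- qnorm(1 - alpha/2)   # two-sided critical value
  z.b <- qnorm(power)
  ceiling(2 * ((z.a + z.b) / d.std)^2)
}

n.basic(0.67)   # 35 per group; the large-sample z-approximation runs
                # slightly below the exact non-central t-based answer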
15.3.2 Two-Group Comparisons, Continuous Outcomes, Independent Samples

Cohen's ds (Cohen 1988) is the best-known metric of standardised effect size. This is the difference between two means, divided by the pooled standard deviation:

ds = (x̄1 − x̄2) / sp

That is, the effect size measures the difference in means in units of the pooled sample standard deviation.

Hedges' g should be used when n is small (n < 20), because Cohen's ds is a biased estimate of the population effect size when sample sizes are small. The results for ds and g are similar if n > 20. Hedges' g applies a small-sample correction based on the weighted pooled standard deviation:

g = [(x̄1 − x̄2)/sp]·[1 − 3/(4N − 9)] = ds·[1 − 3/(4N − 9)]

where N = n1 + n2 (Hedges and Olkin 1985; Cumming 2012; Lakens 2013).

Glass's Δ is used if the intent is to compare one group against a baseline, or if standard deviations differ substantially between groups. This statistic uses the standard deviation of the control sample only, s0, rather than the pooled standard deviation sp. Therefore an effect size based on Glass's Δ measures the difference in means in units of the control sample standard deviation (Glass 1976):

Δ = (x̄2 − x̄1) / s0

Two-group comparisons, paired samples. Paired designs compare observations taken on the same experimental unit. The observations are therefore correlated within each experimental unit. Common applications include before-after assessments and paired measurements taken on both the right and left sides of a study animal. The effect size for a paired sample, ds,pair, is estimated as the difference of two means divided by the standard deviation of the paired difference sD:

ds,pair = (x̄2 − x̄1) / sD

where sD = √[(s²1 + s²2)/2]. Accounting for the pairing between differences reduces the variation. Therefore sD will be smaller, estimates more precise, and effect sizes larger than if the standard deviation was calculated assuming independence of observations.
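These standardised differences can be computed from summary statistics alone. A minimal R sketch; all summary values are invented for illustration:

# Illustrative summary statistics for two independent groups
m1 <- 206; s1 <- 30; n1 <- 20   # e.g. control
m2 <- 226; s2 <- 30; n2 <- 20   # e.g. treated

# Pooled standard deviation
sp <- sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2))

d.s   <- (m2 - m1) / sp                        # Cohen's d_s
g     <- d.s * (1 - 3 / (4*(n1 + n2) - 9))     # Hedges' g (small-sample correction)
delta <- (m2 - m1) / s1                        # Glass's Delta (control SD only)
c(d.s = d.s, g = g, delta = delta)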
15.4 r Family (Strength of Association) Effect Sizes

The r family effect sizes describe the relationship between variables as ratios of shared variance. They are used for correlation, regression, and analysis of variance (ANOVA) models (Box 15.4). The total variance of the data is partitioned into several different pieces or sources of variation.

BOX 15.4
r Family Effect Sizes

Effect sizes that describe the relationship between variables as ratios of shared variance (strength of association).

15.4.1 Correlation

The most familiar metric from the r family is Pearson's correlation coefficient r. Correlation is a measure of the strength of linear association between variables. It is estimated as the ratio of shared variance between two variables:

r(x,y) = cov(x, y) / (sx·sy)

where sx and sy are the sample standard deviations for each variable, and the covariance between two sample variables x and y is

cov(x, y) = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1)

The effect size is specified by the difference between the correlation expected for the population under the null, r0, and the correlation r hypothesised under the alternative hypothesis (the postulated effect of the test intervention), such that

f = r − r0

15.4.2 Regression

The linear regression model describing the relationship of Y in relation to X is

Y = β0 + β1X + ε

where β0 is the intercept, β1 is the change in Y with a unit change in X (the slope), and ε is the random error, which is normally distributed with mean zero and variance σ².

Effect size. For regression models, the effect size can be quantified by the regression coefficient or by the coefficient of determination R².

Regression coefficients β provide the most immediately obvious information about the magnitude of the effect. In the simplest case with two treatment groups and no other predictors, the groups can be defined by a single indicator variable such that X = 0 for the comparator and X = 1 for the test group. Then the effect d is the estimate of the true difference in means between the two groups: d = β1. The standardised effect size is the regression coefficient divided by the pooled within-group standard deviation of the outcome measure (Feingold 2015). The residual standard deviation (root mean square error) of the regression is used when there are two or more predictors. However, the estimated size of the effect for models with multiple predictors will depend on the scale of each independent variable. Therefore effect sizes based on different variables in the same study, or variables across multiple studies, cannot be compared directly (Lorah 2018).

Regression models have at least three sources of variation: the total sum of squares, the regression sum of squares, and the error sum of squares. The structure of the model and how variance is partitioned are shown by constructing an ANOVA table (Table 15.2).
Table 15.2: Analysis of variance table for the simple linear regression with one predictor variable X.

Source of variation                         df      Sum of squares (SS)   Mean square (MS)        F
Due to explanatory variables (regression)   1       SSR                   MSR = SSR/1             MSR/MSE
Residual                                    N − 2   SSE = SSTc − SSR      MSE = SSE/(N − 2) = σ²
Total variation corrected for mean          N − 1   SSTc
Total variation                             N       SST

The total sum of squares (SST) is the sum of squared deviations of the observations yi from the mean of those observations, ȳ.

The regression sum of squares (RSS) describes the amount of variation among the y values that can be attributed to the linear model. The least squares estimate for the coefficient β1 is calculated as

β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

and the SSreg as β1·Σ(xi − x̄)(yi − ȳ), with one degree of freedom. The regression mean square MSreg is SSreg/1 = SSreg.

The error sum of squares (SSE) is the difference between the SST and RSS. The residual mean square, or mean square error (MSE), is the estimate of the variance s², with n − 2 degrees of freedom (the loss of 2 df is because two parameters, β0 and β1, are estimated). It is a measure of the amount of variation not attributable to the regression model.

The F-statistic tests the null hypothesis that β1 = 0 and is calculated as the ratio MSR/MSE.

The coefficient of determination R² is the proportion of variance explained by the linear relation of Y and X:

R² = 1 − SSE/SST

where SST is the total sum of squares, and SSE is the error sum of squares, or the residual amount remaining after subtracting the contribution made by the linear fit.

Correlation r is related to the slope b1 as

r = b1·(sx/sy)

The adjusted R² (R²adj) is a corrected goodness-of-fit metric that adjusts for the number of terms in the model:

R²adj = 1 − (1 − R²)·(N − 1)/(N − k)

The unadjusted R² increases as more predictor terms are added to the model, regardless of their predictive or explanatory effect. In contrast, R²adj will increase only as useful predictors are added and will decrease if variables with no predictive value are added. R²adj will always be less than or equal to the unadjusted R². Therefore, R²adj is useful for assessing if the appropriate variables are included in the model (Draper and Smith 1998).

The correlation r (the shared variation between Y and X) is not interchangeable with the unadjusted R² (the variation in Y attributable to the linear variation in X) as a measure of effect. The square root of the unadjusted R² is a biased measure of effect, especially when the sample size is small (Nakagawa and Cuthill 2007). The square root of the adjusted R², radj = √(R²adj), is the appropriate effect size measure if a correlation is required (Nakagawa and Cuthill 2007).

The effect size f² for the fixed effects model is obtained from the adjusted R² as follows:

f² = R²adj / (1 − R²adj)
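In R, these quantities drop out of an ordinary least squares fit. A minimal sketch on simulated data (all values illustrative):

# Simulated example: one continuous predictor
set.seed(1)
x <- rnorm(40)
y <- 2 + 0.8 * x + rnorm(40)

fit   <- lm(y ~ x)
r2    <- summary(fit)$r.squared
r2adj <- summary(fit)$adj.r.squared

f2 <- r2adj / (1 - r2adj)   # Cohen's f^2 from the adjusted R^2
c(R2 = r2, R2.adj = r2adj, f2 = f2)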
15.4.3 Analysis of Variance (ANOVA) Methods

ANOVA models belong to the r family and are mathematically equivalent to regression. The F-statistic is the ratio of the estimate of the between-groups, or treatment, mean square to the error mean square. The single-factor ANOVA partitions variation into between-group variation with k − 1 degrees of freedom (where k is the number of groups) and residual (within-group, or error) variation with k(n − 1) degrees of freedom. Between-group variation assesses treatment effects.

Effect size. The effect size f for ANOVA designs is

f = √[η² / (1 − η²)]

where η² is the correlation ratio. The correlation ratio η² is analogous to R², and for a single-factor ANOVA, η²p = R². It describes the proportion of the total variation (total sum of squares, SSTotal) in the response variable accounted for by differences between groups (between-group sum of squares, SSbetw):

η² = SSbetw / SSTotal

For example, an η² of 0.2 means that 20% of the total variance can be attributed to group membership. It is the uncorrected effect size estimate determined by the variance explained by group membership in the sample for a single study (Lakens 2013). However, studies based on η² will be considerably underpowered; several authors recommend strongly against using η² as an ANOVA effect size estimator (Troncoso Skidmore and Thompson 2013; Albers and Lakens 2018). If there are only two independent groups, f is related to Cohen's ds as f = ds/2.

The partial correlation ratio η²p is the 'standardised' effect size. It is appropriate only for fixed-factor designs without covariates. Software programs such as G∗Power require η²p for sample size calculations. It is calculated as

η²p = SSbetw / (SSbetw + SSerror) = SSbetw / SStotal

The partial correlation ratio η²p can also be estimated from existing or exemplary F-values and degrees of freedom (df) as

η²p = (F·dfnum) / (F·dfnum + dfdenom)

where dfnum is the numerator, or between-groups, degrees of freedom, and dfdenom is the denominator, or error, degrees of freedom (Cohen 1988).

A major drawback to both η² and η²p is that they are biased. Their application in power calculations can lead to underpowered studies because the sample size estimates will be too small. Estimates of η²p become more complicated to calculate as the study design becomes more complex, because η²p is determined by the different sources of variation contributed by additional factors, factor interactions, nesting, and blocking (Lakens 2013). η²p can be used for comparisons of effect sizes across studies, but only if study designs are similar and there are no covariates or blocking variables (Lakens 2013). When reporting effect sizes for ANOVA, it is recommended to report both the generalised η² and the standardised η²p.

When sample size is small, the partial omega-squared ω²p is a less-biased alternative to η² and η²p (Olejnik and Algina 2003; Albers and Lakens 2018). It estimates the proportion of variance in the response variables accounted for by the explanatory variables. For the completely randomised single-factor ANOVA, ω²p can be calculated from the F-statistic and associated degrees of freedom for the treatment and error:

ω²p = (F − 1) / [F + (dfdenom + 1)/dfnum]

Warning: this formula cannot be used for repeated-measures designs. Neither η² nor ω²p effect sizes are applicable to ANOVA with observational categorical explanatory variables (e.g. sex) or if covariates are included.
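Given a published or exemplary F-value and its degrees of freedom, the conversions above are one-liners in R. A sketch with assumed values, F(3, 36) = 4.5:

# Assumed ANOVA summary: F(3, 36) = 4.5
F.val <- 4.5; df.num <- 3; df.den <- 36

eta2p   <- (F.val * df.num) / (F.val * df.num + df.den)    # partial eta-squared
f       <- sqrt(eta2p / (1 - eta2p))                       # Cohen's f
omega2p <- (F.val - 1) / (F.val + (df.den + 1) / df.num)   # partial omega-squared
c(eta2p = eta2p, f = f, omega2p = omega2p)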
15.5 Risk Family Effect Sizes

When outcome measures are binary (0/1; yes/no; present/absent), the effect size is assessed as a comparison between proportions. It is common in epidemiological investigations for effect sizes for proportions to be expressed as the risk difference, relative risk (RR), or odds ratio (OR) (Sánchez-Meca et al. 2003; Machin et al. 2018). Cramer's V is a measure of the strength of association between two nominal variables. It is typically obtained from χ² contingency tables (Box 15.5).

BOX 15.5
Risk Family Effect Sizes

Effect sizes for binary outcomes: risk difference, relative risk, odds ratio.
Effect size for nominal variables: Cramer's V.

15.5.1 Risk Difference, Relative Risk, and Odds Ratio

Table 15.3 shows how proportions are structured for comparisons between the test group and a comparator, or standard. Sample SAS code for calculating RR and OR is provided in Appendix 15.A.

Table 15.3: Calculation of proportions for comparisons of a test group against a control.

Group                           Event 'yes' ('success')   Event 'no' ('failure')   Total        Proportion of events ('successes')
Comparator group ('standard')   a                         c                        a + c = n0   p0 = a/n0
Test group                      b                         d                        b + d = n1   p1 = b/n1
Total                           a + b                     c + d                    n0 + n1 = N

Effect size metrics are calculated as follows:

The risk difference, or absolute risk reduction (ARR), is the difference between proportions:

ARR = b/n1 − a/n0 = p1 − p0

If the absolute risk difference is zero, this indicates there is no effect.

The number needed to treat (NNTT) is the reciprocal of the risk difference:

NNTT = 1/ARR

This is an estimate of the number of subjects that need to be treated for one extra subject to benefit.

The RR, or risk ratio, is the ratio of two proportions. It compares the probability, or risk, of an event occurring in one group relative to that for a comparator group:

RR = (b/n1) / (a/n0) = p1/p0

When RR is 1, risk for the two groups is the same; that is, an RR of 1 indicates no effect. If RR > 1, there is increased risk in group 1 relative to group 2; if RR < 1, risk is decreased in group 1 relative to group 2. Therefore an RR of 0.5 means that the risk has been reduced by one-half, or 50%. Confidence intervals should be reported to obtain an idea of the precision of the estimate.

There are two possible ways to compute RR, depending on whether the presence or absence of an event is of interest. Scientific or clinical judgement is required to determine which RR is appropriate, as they are not simple reciprocals of one another.

The OR is the ratio of the odds of occurrence of an event (or 'success') p relative to the odds of failure (1 − p) in a test group relative to those for a comparator group:

OR = [p1/(1 − p1)] / [p0/(1 − p0)]

In case–control studies, the OR is the measure of association between an exposure or risk factor and occurrence of disease.
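All four metrics follow directly from the 2 × 2 layout of Table 15.3. A minimal R sketch; the counts are invented for illustration:

# Illustrative 2x2 counts following Table 15.3
ev0 <- 20; non0 <- 80   # comparator: events (a), non-events (c); n0 = 100
ev1 <- 35; non1 <- 65   # test group: events (b), non-events (d); n1 = 100

p0 <- ev0 / (ev0 + non0)
p1 <- ev1 / (ev1 + non1)

ARR  <- p1 - p0                             # risk difference
NNTT <- 1 / ARR                             # number needed to treat
RR   <- p1 / p0                             # relative risk
OR   <- (p1 / (1 - p1)) / (p0 / (1 - p0))   # odds ratio
c(ARR = ARR, NNTT = NNTT, RR = RR, OR = OR)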
When OR is 1, the odds associated with success are similar in the two groups; that is, an OR of 1 indicates no effect. If OR > 1, odds are increased in group 1 relative to group 2; if OR < 1, odds are decreased in group 1 relative to group 2. Rule-of-thumb values for OR effect sizes have been reported as 'small' for OR = 1.68, 'medium' for OR = 3.47, and 'large' for OR = 6.71 (Chen et al. 2010). If the ORs are obtained from logistic regression coefficients, they must be standardised using the root mean square before use. ORs obtained from logistic regression are usually unstandardised, as they depend on the scale of the predictor.

15.5.2 Interpretation

Both RR and OR are relative measures of effect and are therefore unaffected by changes in baseline risk. However, proper interpretation requires that both raw numbers and baseline risk be reported. Absolute changes in risk are more important than relative changes in risk. Sensible interpretation requires consideration of both RR and OR as ratio problems; that is, assessment of magnitudes is based on reciprocals, not simple arithmetic differences.

When the outcome event is rare, RR and OR will be approximately equal. This is because both (1 − p1) and (1 − p2) will be close to 1, so OR will approach RR. As a result, OR can be regarded as an estimate of RR when the incidence of an event is low in both groups (usually <10%). As incidence or baseline risk increases, RR and OR become more dissimilar.

If there are zero events in the comparator group (p0 = 0), neither the RR nor the OR can be calculated (because the denominator is zero). If the occurrence of the event in the intervention group is 100% (p1 = 1), the OR cannot be calculated.

To more readily interpret OR as a change in the number of events, Higgins et al. (2022) recommend converting the OR to RR and then interpreting RR relative to a range of assumed comparator or baseline group risks (ACR):

RR = OR / [1 − ACR·(1 − OR)]

RR is more easily understood than OR. Disadvantages of OR include overestimation of risk and difficulties with interpretation, especially for assessing benefits and harms.
For example, an OR = 0.20 cannot be interpreted as an 80% reduction in risk. It means that the odds of the event occurring in the first group (e.g. the test group) are 0.2 times the odds in the second group (e.g. the control). However, the OR has several advantages over RR. ORs are symmetrical with respect to the outcome definition; that is, interpretation does not depend on which group is used as the numerator or denominator, or whether the emphasis is on the occurrence of an event or its failure to occur. In contrast, interpretation of the RR does depend on the outcome definition. Because the choice of reference comparison can greatly affect interpretation, it is important to clearly identify which group is the 'control' or reference comparator group and whether that group is the numerator or denominator of the ratio. For example, Cochrane Reviews use OR < 1 as favouring the treatment group, not the control (Higgins et al. 2022).

For observational studies, the study design (e.g. case-control or cohort studies), sampling strategy, case type (incident or prevalent), and source population (fixed or dynamic) will determine which measure of association can be used (Knol et al. 2008; Labrecque et al. 2021).

Example: Survival or Mortality: Which?

A study assessed the number of subjects that survived or died following a course of experimental treatment A relative to control treatment B. Results were as follows:

Intervention   Alive   Dead   Total
Group A        75      25     100
Group B        50      50     100

The RR of death for group B relative to group A is

RR = (50/100) / (25/100) = 0.50/0.25 = 2.0

However, the RR of survival is

RR = (50/100) / (75/100) = 0.50/0.75 = 0.67

When expressed as RR, survival and mortality are not reciprocals of one another and are not substitutable.

In contrast, the OR of death for group B relative to group A is

OR = (50/50) / (25/75) = 1.0/0.333 = 3.0

The OR of survival for group B relative to group A is

OR = (50/50) / (75/25) = 1.0/3 = 0.333

The OR of survival is the reciprocal of the OR for mortality. The OR does not depend on whether the emphasis is on the occurrence of an event or its failure to occur.

15.5.3 Nominal Variables: Cramer's V

Cramer's V measures the strength of association between two nominal variables (categorical variables with no natural ordering or ranking). It is calculated as:

V = √[ (χ²/N) / min(c − 1, r − 1) ]

where N is the total sample size, c is the number of columns, r is the number of rows, and χ² is the chi-squared statistic. V ranges between 0 and 1, with 0 indicating no association and 1 indicating a strong association.
A Bestiary of Effect Sizes 177

BOX 15.7 Predetermined effect sizes are too strict. This was
Interpreting Effect Size recognised by Cohen himself (Cohen 1988)
and others (Thompson 2002, Funder and Ozer
Unless you know what the effect size index is and how 2019). Effect size must always be determined
it was calculated, you cannot interpret it. and interpreted by study context and practical
Benchmark effect sizes should be avoided because significance. For example, an effect size R2 =
they are 0.1 in large clinical trial with death and major
morbidity as a primary outcome is of far greater
▪ Irrelevant to animal-based research
practical significance than it would be for a
▪ Too strict
small laboratory study measuring weakly vali-
▪ Subject to technical problems
dated biomarker concentrations or behaviour
▪ Easily gamed
characterised by large between- and within-
subject variability.
Predetermined effect sizes have numerous technical
BOX 15.8 problems. The magnitude of an effect (therefore
Determining and Interpreting Effect Sizes any resulting interpretation of its biological
importance) will be affected by presence of cov-
▪ Do not use subjective effect size benchmarks
ariates, data with non-normal error structure
(‘small’, ‘medium’, ‘large’) for planning studies
and/or variances, non-independence of obser-
or interpreting results.
vations, and small sample sizes (Nakagawa
▪ The magnitude of the effect size may be deter-
and Cuthill 2007). Frequently, conversion of
mined more by study design and variation than
one type of effect size metric to another can
the anticipated difference in intervention
result in discordant and nonsensical bench-
groups.
mark values, for example conversion of d effect
▪ Consider effect sizes like any other ratio (both
sizes to r (Lakens 2013). Sampling distributions
numerator AND denominator are important).
will generally be non-normal and highly
skewed, so the smaller the sample size, the less
reliable the effect size estimate will be (Albers
‘medium’, and ‘large’. Benchmarks may be useful as and Lakens 2018).
first-pass sample size approximations for study pla- Predetermined effect sizes are subject to ‘gaming’.
nning purposes. They can also be useful for asses- An unfortunately common practice among
sing results in comparison results already well- investigators is manipulating effect size calcula-
understood or validated. However, in most circum- tions to obtain a preferred sample size. This will
stances, effects sizes should be directly calculated from often result in estimates of implausibly large
the individual components (Box 15.7). Predeter- effect sizes so that future studies will be
mined (‘canned’) benchmarks should be avoided extremely underpowered for more biologically
for the following reasons (Box 15.8): realistic and plausible effect sizes.
Predetermined effect sizes are irrelevant to animal-
based studies. By far, the leading objection to 15.7.1 Interpreting Effect Sizes as
benchmark criteria is that they were originally
developed for social and psychology studies and Ratios
are meaningless without a practical frame of ref- Cumming (2012) points out that sensible construc-
erence (Thompson 2002). Therefore, the criteria tion and interpretation of effect sizes should con-
defining the ‘size’ of an effect may be neither sider effect size as a ratio problem. That is, both
appropriate nor relevant for pre-clinical or veter- the numerator and denominator are important
inary clinical studies. and should be considered equally in the
178 A Guide to Sample Size for Animal-based Studies

interpretation of effect size. Two studies with the significance has been discussed in detail elsewhere
same effect sizes may differ in the magnitude of (Greenland et al. 2016).
the difference between groups (numerator), the Effect sizes and 95% confidence intervals in orig-
amount of variation in the two samples (denomina- inal measurement units have the advantage of being
tor), or both. If variation between experimental more readily interpretable compared to standar-
units is small, then the denominator is small, result- dised effect sizes. Reporting of effect sizes and mea-
ing in large effect sizes. This may conceal the fact sures of precision, whether or not they are
that differences between groups are trivial and of ‘statistically significant’, have the further advantage
no practical importance. Conversely, large standar- of allowing inclusion in systematic reviews and
dised effect sizes may be a reflection of better study meta-analyses (Nakagawa and Cuthill 2007). Sys-
design (control of variation), not larger effects per se tematic reviews and meta-analyses increase data
(differences between groups). Neglect of basic ‘shelf life’, and minimise waste of animals and
design features such as ensuring quality of data resources in additional non-informative experi-
sampling, collection, and measurement, and espe- ments (Reynolds and Garvan 2020).
cially incorporation of allocation concealment (or
blinding) can also account for non-trivial effect sizes
(Rosenthal 1994, Funder and Ozer 2019). For exam-
ple, a meta-investigation of mouse models of amyo-
15.A Using SAS Proc Genmod
trophic lateral sclerosis provided strong evidence to Calculate OR and RR
that unattributed sources of variation (such as litter
effects) were major contributors to apparent thera-
(Adapted from Spiegelman
peutic efficacy of candidate drugs, rather than true and Hertzmark 2005). As an
treatment effects (Scott et al. 2008; Perrin 2014).
alternative to fitting a logistic
regression model, OR and RR
15.7.2 What Is a Meaningful
Effect Size?
can be calculated using the
A ‘meaningful’ effect size (Box 15.8) is described by
log-binomial maximum
the effect size estimate together with a measure of likelihood estimators.
precision (such as 95% confidence intervals). Mean- *to calculate OR, use the logit link;
ing is context-dependent and is determined by proc genmod descending;
model validity, goals of the intervention, predefined class x;
criteria for what constitutes biological or clinical model y = x / dist = binomial link = logit;
‘importance’ (Kazdin 1999; Thompson 2002; Naka- estimate 'Beta' x 1 -1/ exp;
run;
gawa and Cuthill 2007), and evaluation in compar-
ison with estimated effect size in comparison to *to calculate RR use the log link;
values reported for other similar studies (Schuele proc genmod descending;
and Justice 2006). Unfortunately, discussions of class x;
model y = x/ dist = binomial link = log;
research results are often framed in terms of ‘statis- estimate 'Beta' x 1 -1/ exp;
tical significance’ and P-values alone, which do not run;
convey much useful scientific information (Cohen
1990; Thompson 2002; Nakagawa and Cuthill
2007; Cumming 2012; Kelley and Preacher 2012;
Schober et al. 2018). ‘Statistical significance’ does
References
not mean that the observed difference is large Abberbock, J., Anderson, S., Rastogi, P., and Tang, G.
enough to be of any practical or biological impor- (2019). Assessment of effect size and power for survival
tance. Common misunderstanding of P-values and analysis through a binary surrogate endpoint in
A Bestiary of Effect Sizes 179

clinical trials. Statistics in Medicine 38: 301–314. Hojat, M. and Xu, G. (2004). A visitor’s guide to effect
https://doi.org/10.1002/sim.7981. sizes – statistical significance versus practical (clinical)
Albers, C. and Lakens, D. (2018). When power analyses importance of research findings. Advances in Health
based on pilot data are biased: inaccurate effect size Sciences Education 9 (3): 241–249.
estimators and follow-up bias. Journal of Experimental Kazdin, A.E. (1999). The meanings and measurement of
Social Psychology 74: 187–195. https://doi.org/ clinical significance. Journal of Consulting and Clinical
10.1016/j.jesp.2017.09.004. Psychology 67: 332–339.
Chen, H., Cohen, P., and Chen, S. (2010). How big is a big Kelley, K. and Preacher, K.J. (2012). On effect size. Psy-
odds ratio? Interpreting the magnitudes of odds ratios chological Methods 17 (2): 137–152. https://doi.
in epidemiological studies. Communications in Statis- org/10.1037/a0028086.
tics: Simulation and Computation 39: 860–864. Knol, M.J., Vandenbroucke, J.P., Scott, P. et al. (2008).
Cohen, J. (1988). Statistical Power Analysis for the Behav- What do case-control studies estimate? Survey of
ioral Sciences. Hillsdale: Lawrence Erlbaum methods and assumptions in published case-control
Associates. research. American Journal of Epidemiology 168 (9):
Cohen, J. (1990). Things I have learned (so far). American 1073–1081.
Psychologist 45 (12): 1304–1312. https://doi.org/ Labrecque, J.A., Hunink, M.M.G., Ikram, M.A., and
10.1037/0003-06615.45.12.1304. Ikram, M.K. (2021). Do case-control studies always
Cumming, G. (2012). Understanding the New Statistics: estimate odds ratios? American Journal of Epidemiol-
Effect sizes, Confidence Intervals, and Meta-Analysis. ogy 190 (2): 318–321. https://doi.org/10.1093/
New York: Routledge. aje/kwaa167.
Draper, N., and Smith, H. (1998). Applied Regression Lakens, D. (2013). Calculating and reporting effect sizes
Analysis, 3rd ed. New York: Wiley. to facilitate cumulative science: a practical primer for t-
Feingold, A. (2015). Confidence interval estimation for tests and ANOVAs. Frontiers in Psychology 2013: 4.
standardized effect sizes in multilevel and latent https://doi.org/10.3389/fpsyg.2013.00863.
growth modeling. Journal of Consulting and Clinical Lorah, J. (2018). Effect size measures for multilevel mod-
Psychology 83 (1): 157–168. https://doi.org/ els: definition, interpretation, and TIMSS example.
10.1037/a0037721. Large-Scale Assessments in Education 6: 8. https://
Ferguson, C.J. (2009). An effect size primer: a guide for doi.org/10.1186/s40536-018-0061-2.
clinicians and researchers. Professional Psychology: Machin, D., Cheung, Y.B., and Parmar, M. (2006) Sur-
Research and Practice 40 (5): 532–538. vival Analysis: A Practical Approach 2nd edition, Wiley.
Funder, D.C. and Ozer, D.J. (2019). Evaluating effect size Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H.
in psychological research: sense and nonsense. (2018) Sample Sizes for Clinical, Laboratory and Epide-
Advances in Methods and Practices in Psychological miology Studies. 4th Ed. Wiley.
Science 2: 156–168. Nakagawa, S. and Cuthill, I.C. (2007). Effect size,
Glass, G.V. (1976). Primary, secondary, and meta- confidence interval and statistical significance: a prac-
analysis of research. Educational Researcher 5: 3–8. tical guide for biologists. Biological Reviews of the Cam-
https://doi.org/10.2307/1174772. bridge Philosophical Society 82 (4): 591–605. https://
Greenland, S., Senn, S.J., Rothman, K.J. et al. (2016). Sta- doi.org/10.1111/j.1469-18515.2007.00027.15.
tistical tests, P values, confidence intervals, and power: Erratum in: Biol Rev Camb Philos Soc. 2009 84(3):515.
a guide to misinterpretations. European Journal of Epi- Olejnik, S. and Algina, J. (2003). Generalized eta and
demiology 31: 337–350. https://doi.org/10.1007/ omega squared statistics: measures of effect size for
s10654-016-0149-3. some common research designs. Psychological Methods
Hedges, L.V. and Olkin, I. (1985). Statistical Methods for 8 (4): 434–447.
meta-analysis. San Diego, CA: Academic Press. Perrin, S. (2014). Make mouse studies work. Nature 507:
Higgins, J.P.T., Li, T., and Deeks, J.J. (2022). Chapter 6: 423–425.
Choosing effect measures and computing estimates Reynolds, P.S. and Garvan, C.S. (2020). Gap analysis of
of effect. In: Cochrane Handbook for Systematic swine-based hemostasis research: “Houses of brick or
Reviews of Interventions version 6.3 (Updated February mansions of straw?”. Military Medicine 185 (Suppl 1):
2022) (ed. H. JPT, J. Thomas, J. Chandler, et al.). 88–95. https://doi.org/10.1093/milmed/usz249.
Cochrane Available from www.training.cochrane. Rosenthal, R. (1994). Parametric measures of effect size.
org/handbook. In: The Handbook of Research Synthesis (ed. H. Cooper
and L.V. Hedges), 231–244. New York: Sage.
180 A Guide to Sample Size for Animal-based Studies

Sánchez-Meca, J., Marín-Martínez, F., and Chacón- psychology. Journal of Educational Psychology 102
Moscoso, S. (2003). Effect-size indices for dichoto- (4): 989–1004.
mized outcomes in meta-analysis. Psychological Meth- Thompson, B. (2002). Statistical, practical, and clinical:
ods 8 (4): 448–467. https://doi.org/10.1037/ how many kinds of significance do counselors need
1082-98915.8.4.448. to consider? Journal of Counseling and Development
Schober, P., Bossers, S.M., and Schwarte, L.A. (2018). Sta- 80 (1): 64–71.
tistical significance versus clinical importance of Tierney, J.F., Stewart, L.A., Ghersi, D. et al. (2007). Prac-
observed effect sizes: what do p values and confidence tical methods for incorporating summary time-to-
intervals really represent? Anesthesia and Analgesia event data into meta-analysis. Trials 8: 16. https://
126 (3): 1068–1072. https://doi.org/10.1213/ doi.org/10.1186/1745-6215-8-16.
ANE.0000000000002798. Troncoso Skidmore, S. and Thompson, B. (2013). Bias
Schuele, C.M. and Justice, L.M. (2006). The importance and precision of some classical ANOVA effect sizes
of effect sizes in the interpretation of research: primer when assumptions are violated. Behavior Research
on research: part 3. The ASHA Leader. 11 (10): Methods 45: 536–546. https://doi.org/10.3758/
https://doi.org/10.1044/leader. s13428-012-0257-2.
FTR4.11102006.14. van Belle, G.G. (2008). Statistical Rules of Thumb, 2nd
Scott, S., Kranz, J.E., Cole, J. et al. (2008). Design, edition. New York: Wiley.
power, and interpretation of studies in the standard Vedula, S.S. and Altman, D.G. (2010). Effect size estima-
murine model of ALS. Amyotrophic Lateral Sclerosis tion as an essential component of statistical analysis.
9: 4–15. Archives of Surgery 145 (4): 401–402. https://doi.
Spiegelman, D. and Hertzmark, E. (2005). Easy SAS cal- org/10.1001/archsurg.2010.33.
culations for risk or prevalence ratios and differences. Zwetsloot, P.P., Van Der Naald, M., Sena, E.S. et al.
American Journal of Epidemiology 162: 199–205. (2017). Standardized mean differences cause funnel
Sun, S., Pan, W., and Wang, L.L. (2010). A comprehen- plot distortion in publication bias assessments. eLife
sive review of effect size reporting and interpreting 6: e24260. https://doi.org/10.7554/eLife.24260.
practices in academic journals in education and
16
Comparing Two Groups:
Continuous Outcomes

CHAPTER OUTLINE HEAD


16.1 Introduction 181 16.A.2 Total Sample Size Based on
16.2 Sample Size Calculation Methods 182 the Non-Centrality Parameter
16.2.1 Asymptotic Large-Scale for t 187
Approximation 182 16.A.3 Sample Size for a Fixed Power:
16.2.2 Sample Size Based on the Crossover Design (Cattle
t-Distribution 182 Example) 187
16.2.3 Sample Size Derived From 16.B Sample SAS Code for Calculating Sample
Percentage Change in Means 183 Size for a Veterinary Clinical Trial. The
16.2.4 Sample Size Rule of Thumb 183 standard deviation obtained from pilot
16.3 Which Standard Deviation? 183 data is corrected by computation of
16.3.1 One-Sample Comparison 184 its upper confidence interval (UCL)
16.3.2 Two Independent Samples 184 or by the inverse function of power
16.3.3 Paired Samples or A/B and pilot degrees of freedom 187
Crossover Designs 184 16.B.1 Conventional (Uncorrected
16.4 Sample Size for Standard Deviation) 187
Two-Arm Veterinary 16.B.2 Upper Confidence Interval
Clinical Trials 185 Correction 188
16.B.3 Simplified Non-Central t
16.A Sample SAS Code for Calculating
Correction 188
Sample Size for Two-Group
Comparisons 187 References 188
16.A.1 Sample Size Based on
z-Distribution 187

BOX 16.1
16.1 Introduction Two-Group Designs
The simplest experimental design is for compari- The simplest between-group design usually compar-
son of two groups (Box 16.1). In general, these ing a test intervention against a control.
designs involve a test intervention and a control, Two-arm designs are powerful tools for assessing
randomly allocated to experimental units in each efficacy, especially for veterinary clinical trials.
group. Due to their simple structure, two-arm Two-arm are not recommended for exploratory
designs are useful for definitive veterinary clinical studies with multiple variables; t-tests are commonly
trials, which require large sample sizes and used to compare continuous outcomes for two
sufficient power for testing efficacy. However, groups.

A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
182 A Guide to Sample Size for Animal-based Studies

two-arm designs are not the best choice for most and confidence. However, sample sizes can also
laboratory rodent studies, which are mostly explor- be determined to obtain a pre-specified precision
atory with multiple explanatory variables. Alterna- or tolerance for the difference, or for a range of sam-
tive designs are discussed in Chapter 19. ple variation that can detect the pre-specified differ-
When outcome variables are continuous and nor- ence with a given power.
mally distributed, sample size can be approximated
by formulae based on the t or the z distribution. The
t-distribution describes the standardised distances 16.2.1 Asymptotic Large-Scale
of sample means to the population mean. The t- Approximation
statistic is based on the sample variance. The shape Sample size based on the z-distribution for a two-
of a t-distribution depends on the degrees of free- group comparison is:
dom (df), where df = n − 1, so the critical t-value tcrit 2
is given by tα/2, n − 1 for a two-sided test and tα, n − 1 z1 − α 2 + z1 − β
N= 2
for a one-sided test. In contrast, the approximation d s
based on the z-distribution assumes the population
where the difference d is the difference between
variance σ is known, and the population is normally
sample means d = x 1 − x 2 , s is the sample standard
distributed. The z-distribution can be used when
deviation, d/s is the effect size, and z1 − α/2 and
sample sizes n are large (n > 30) because as n
z1 − β are the z-scores for confidence and power,
increases, the t-distribution approaches the z-distri-
respectively.
bution. However, the large-scale asymptotic approx-
To detect an increase in the average by one stand-
imation based on the z-distribution should be used
ard deviation, the relation becomes
with caution. Although sample size estimates may
useful for initial planning, sample sizes will be 2
z1 − α 2 + z1 − β 2
seriously underestimated compared to exact deter- N= 2 = z1 − α 2 + z1 − β
1 1
minations from the t-distribution. A drawback of
the exact method is that it must be determined by or the square of the summed z-scores for confidence
numerical iteration over a candidate range of n, and power.
although this is readily performed with simple
computer code. Sample SAS codes are given in
Appendix 16.A. 16.2.2 Sample Size Based on the
t-Distribution
16.2 Sample Size Calculation When the study is small, estimating sample size
directly from the t-distribution will provide more
Methods exact approximations. The sample size calculation
Sample size determinations for two-group compar- for a two-sided comparison of two samples is
isons require the minimum information for sample
2
size planning (Chapter 14): pre-specified confidence N = t2α 2,n − 1 d s
and power, and the minimum biologically relevant
difference to be detected. In addition, information is where t is the critical t-value for confidence α/2 and
required as to directionality (if the comparison is n − 1 degrees of freedom (df).
one-sided or two-sided, as this determines α or The degrees of freedom that define the t-
α/2, respectively), and the type of comparison to distribution is calculated from the total sample size
be made. These include a single sample compared itself. Therefore, sample size has to be estimated by
to a reference value, comparison of two independent iterating over a range of candidate sample sizes, and
samples, or comparison of paired observations on the final sample size is obtained by solving for β:
the same subject. The type of comparison will deter-
mine how the sample standard deviation is calcu- d n
1 − β = tinv t α 2,n − 1
lated and therefore affect estimates of the effect size. s
Commonly, sample sizes for two-group compari-
sons are computed to detect a pre-specified biolog- The inverse function tinv computes the pth quantile
ically relevant difference d with a given power from the t-distribution, with degrees of freedom df
Comparing Two Groups: Continuous Outcomes 183

and non-centrality parameter λ = d/s. Because t2 =


F, the critical value for the test statistic can Example: Mouse Two-Arm Drug Study:
be computed from either the non-central t or Percent Difference
F-distribution (Chapter 14). An investigator wished to obtain a preliminary
sample size estimate for a two-arm study in mice
testing a new drug against an inert control. Based
16.2.3 Sample Size Derived From on literature data, the investigator thought that a
Percentage Change in Means 20% improvement in response in the drug group
compared to control would be clinically mean-
Investigators sometimes define a ‘minimum clini-
ingful and that CV would be between 20% and
cally or biologically important difference’ between
50% of the mean.
two groups as a percentage change in mean values.
The desired per cent change difference is
A 20% difference between the control group and the
(1 − 0.2) = 0.8, and the ratio of CV to the mean
intervention group as an indication that the test
is 0.3/1–0.5/1. Then
intervention ‘works’ is a popular rule of thumb,
although there may be no particular clinical or bio- 2 2
If CV = 20 , n = 16 0 20 ln 1 − 0 20 13
logical rationale for this number. However, use of
2 2
percentage differences can be used for crude sample If CV = 50 , n = 16 0 50 ln 1 − 0 20 80
size estimates when few data are available.
The difference between two group means The investigator will require ~13–80 subjects
x 0 − x 1 expressed as a percentage change, is per group to detect a 20% difference in the group
means, depending on the amount of variation in
d = x0 − x1 x0
the sample. It is apparent that if the response is
This is rearranged to express the ratio of the means: expected to be highly variable (and hence less pre-
cise), more subjects will be needed.
d = 1 − x1 x0 = 1 − p .

where p is the per cent (or proportion) change.


Estimates of the sample variation are obtained
from the coefficient of variation (CV). The CV is 16.2.4 Sample Size Rule of Thumb
the ratio of the standard deviation (s) to
A crude estimate for sample size can be obtained
the mean x , so that CV = s x (and so can be
from the standardised effect size for the difference
thought of as the inverse of a simple standardised
between two groups (van Belle 2008) as
effect size).
Using a modification of Lehr’s formula (Lehr 4 s 2
16
1992; van Belle, 2008 and assuming equal CVs for N= = 2
d d s
the two groups, s0 x 0 = s1 x 1 , then sample size is

16 CV 2 16 CV 2 and for a single sample against a reference value as


n= 2 =
ln x 0 − ln x 1 ln 1 − p 2
2
2 s 4
where n is the sample size per group. The ln- N= = 2
d d s
transformation is used to stabilise the variance
because the CV assumes the standard deviation is
proportional to the mean.
The advantage of this method for preliminary 16.3 Which Standard
planning is that it is simple and does not require
much detailed information. However, this method
Deviation?
should not substitute for formal sample size calcula- Solving for sample size requires an estimate of effect
tions and without careful consideration of the spe- size. Effect size for two groups is measured by the
cifics for the planned study. standardised difference between the two means,
184 A Guide to Sample Size for Animal-based Studies

d/s where d is the difference between means


8 − 1 82 + 8 − 1 92 82 + 92
x 1 − x 2 , and s is the sample standard deviation. s2pool = = = 8 515
8 + 8−2 2
However, estimates of s will be determined by
whether the comparison is between a single sample
and a reference value, a comparison of two inde-
The specified width of the confidence interval is
pendent groups, or a comparison of paired
15 min, so the precision d is 15/2 = 7.5 min and
observations.
the effect size is 7.5/8.515 = 0.88. Using the
large-scale approximation, z1 − α/2 = 1.96, and
16.3.1 One-Sample Comparison z1 − β = 0.8416, and sample size per group is
If the sample consists of a single group compared to approximately
some reference value, then s is the standard
deviation of that sample. The standard error of 1 96 + 0 8416 2
the mean is N= = 10 14
0 88 2
SE x = s n
11 mice per group for a total of 22 mice. Sample
16.3.2 Two Independent Samples size derived from direct approximation based on
the non-central t (Chapter 14) results in 23 per
For a comparison of two independent samples, x1 group, for a total of 46 mice. This total is more
and x2, the standard deviations for the two samples than double that of the simpler large-scale
must be pooled. The pooled sample variance is cal- approximation. Therefore, studies based on the
culated from existing data for the two groups as z-approximation may be considerably underpow-
n1 − 1 s21 + n2 − 1 s22 ered to detect a true effect.
s2pool =
n1 + n2 − 2
If sample sizes are equal for the two groups, then
the pooled variance is the average of the two
variances: 16.3.3 Paired Samples or A/B
s2 + s22 Crossover Designs
s2pool = 1
2 The paired t-test is equivalent to a one-sample t-test
and the pooled standard deviation is on the differences. The mean and standard devia-
tion for the paired differences d is computed from
s21 + s22 the differences for each paired observation.
spool =
2 The standard deviation for the differences is

2
Example: Anaesthesia Duration in Mice di − d
s=
(Data from Dholakia et al. 2017.) Anaesthesia n−1
duration was reported for 16 mice randomly
assigned to receive intraperitoneal injections of Within-subject correlation is a feature of many
either ketamine-xylazine (KX) or ketamine- association-type effect sizes, such as ANOVA–type
xylazine plus lidocaine (KXL). Mean and stand- models, nested designs, and regression. Because
ard deviation for each group were KX: 39 (SD the variance of the paired differences of the n sub-
8) min, and KXL: 30 (SD 9) min. What sample jects is
size is required to estimate a difference between
groups with a 95% confidence interval that is
no wider than 15 min, with power of 80%? s2 = var x 1 + var x 2 − 2 cov x 1 , x 2
In this example, α = 0.05, and β = 0.2. The
pooled standard deviation is and the within-subject correlation r is
Comparing Two Groups: Continuous Outcomes 185

cov x 1 , x 2
r= and 11.7 (SD 3.64) ppm at t1. The mean and stand-
var x 1 var x 2 ard deviation for the paired differences (t1 − t0)
were −12 ppm and 5.8 ppm, respectively. The
(deVore 1987), the variance s2 is also computed as correlation r between baseline and t1 measure-
(s1 + s2 − 2 r s1 s2). Standard statistical packages ments was −0.33.
such as SAS require an estimate of r to estimate What sample size is required to detect a mini-
power for paired t-tests. This is not required in mum difference of 10 ppm 90% of the time with
G∗Power if the mean and standard deviation of 95% confidence?
the differences are used to determine effect size. This is a paired design. Here, α = 0.05 and the
Calculating r on raw data using standard statisti- desired power is 1 − β = 0.9. The 95% confidence
cal packages (e.g. SAS proc corr, R stats cor.test) interval for the difference is
ignores non-independence of the observations and
will result in inaccurate estimates. Instead, sum- d ± tα 2,n − 1 sd = − 12 ± 2 262 5 8
marise the correlation at the subject level by calcu- − 25, +1 ppm
lating the averages for each pair of observations
(x1 + x2)/2 for each subject, then calculating the The effect size is 10/5.8 = 1.724. The sample
correlation between these averages and the size to detect a minimum paired difference of
difference for each pair. This is the foundation of 10 ppm is 12 paired measurements, or 6 cows
the Bland–Altman method used for method with realised power of 0.97 (Appendix 16.A).
comparison.

Example: Urinary Fluoride Concentrations


in Cattle: Paired Comparisons 16.4 Sample Size for
(Data from Debackere and Delbeke 1978.) Uri-
nary fluoride concentrations were measured at
Two-Arm Veterinary
two different time points for the same 10 cows Clinical Trials
randomly selected from a herd located in a con- Sample size for a definitive clinical trial can be
taminated area (Table 16.1). Fluoride concentra- obtained from pilot data (Machin et al. 2018). Exter-
tions at baseline t0 averaged 23.7 (SD 5.1) ppm nal pilot data are not included in later data analyses

Table 16.1: Urinary fluoride concentrations (ppm) of 10 cows measured at two different times.

Cattle ID Urinary fluoride concentrations ppm Difference (t1 − t0) Average (t0 + t1)/2

Time 0 Time 1

1 24.7 12.4 −12.3 18.6


2 18.5 7.6 −10.9 13.1
3 29.5 9.5 −20.0 19.5
4 26.3 19.7 −6.6 23.0
5 33.9 10.6 −23.3 22.3
6 23.1 9.1 −14.0 16.1
7 20.7 11.5 −9.2 16.1
8 18.0 13.3 −4.7 15.7
9 19.3 8.3 −11.0 13.8
10 23.0 15.0 −8.0 19.0

Source: Data from Debackere and Delbeke (1978).


186 A Guide to Sample Size for Animal-based Studies

and are not used to test hypotheses about the inter- 2


r+1 2
z1 − α 2 + z1 − β z21 − α 2
vention. Instead, pilot data are used to estimate N= 2 +
r d snew 2
effect size components, specifically the mean clini-
cally relevant difference between groups and sam- Non-central t-distribution. The correction factor
ple variation. However, pilot trials are small, so θ is computed from the inverse function of the
the standard deviation s estimated from pilot data t-distribution: θ = tinv (1 − β, dfpilot; t1−α/2,df,λ).
will be noisy and imprecise. If the pilot s is used The inverse function tinv computes the pth quantile
in subsequent sample size calculations, the main from the t-distribution, with degrees of freedom df
trial is likely to be considerably underpowered. and non-centrality parameter λ. Then the sample
Conventional method: Uncorrected s. The conven- size for the definitive trial is
tional method of estimating sample size for a two- 2
r+1 θ2
arm trial is N= 2
r d spilot
2
r+1 2
z1 − α 2 + z1 − β z21 − α 2 Solving requires multiple iterations until the
N= 2 +
r d s 2 estimate for N converges. However, a good
approximation is obtained by substituting z1 − α/2
where r is the allocation ratio (Chapter 14). If sam- for t1 − α2,df ,λ . Then
ple sizes are equal, r is 1, and (r + 1)2/r = 4. θ = tinv (1 − β, dfpilot; z1 − α/2), and 1 is added to the
More realistic estimates of s can be obtained by final estimate (Julious and Owen 2006; Whitehead
applying a correction factor to adjust for small sam- et al. 2016).
ple size of the pilot. Two methods are adjustment by
the upper confidence limit of the standard deviation Example: External Pilot: Dogs With Degen-
(Browne 1995; Machin et al. 2018) and adjustment erative Mitral Valve Disease
by the non-central t-distribution (Julious and Owen
2006; Whitehead et al. 2016). SAS code for each (Data from Karlin et al. 2019.) A pilot study was
method is given in Appendix 16.B. performed to assess the potential of circulating
Upper confidence limit. The correction factor k is trimethylamine N-oxide (TMAO, μmol/L) as a
computed from the upper confidence limit (UCL) of marker for dogs with degenerative mitral valve
the standard deviation obtained from the pilot disease (DMVD). In this example, sample size
(spilot). The confidence limit is computed from the for a new main trial was estimated to compare
χ 2 distribution and the pilot study degrees of dogs with asymptomatic DMVD to those with
freedom, dfpilot = Npilot − 2 (Browne 1995; Machin both DMVD and congestive heart failure. The dif-
et al. 2018). ference in TMAO between symptomatic and
The correction factor k for the pilot standard devi- asymptomatic dogs was approximately 4 μmol/
ation is L, with a standard deviation of 5.65 and 10 dogs
in each group.1 The effect size was approximately
df pilot 0.71, with (10 + 10) − 2 = 18 degrees of freedom.
k=
χ 2 1 − α, df pilot What sample size would be required for the main
trial to detect a mean difference of 4 μmol/L, with
The adjusted standard deviation for the definitive α = 0.05 and power 0.9?
trial is With the conventional method of estimating
sample size for a two-arm study, sample size is
df pilot 1+1 2
1 96 + 1 282 2
1 962
snew = k spilot = UCL = spilot N= +
χ 2 1 − α, df pilot 1 4 5 65 2 2

The standardised effect size for the new trial is then 43 per arm, for a total of 86 dogs.
calculated from the desired mean difference to be
detected (d) divided by the new standard deviation: 1
d/snew. The total sample size for the new trial is com- Standard deviations were estimated by the method of Wan
et al. (2014) from medians, IQR and range data described in
puted as before: the original paper.
Comparing Two Groups: Continuous Outcomes 187

*calculate difference to be detected;


With the upper confidence limit method, the mean1= [VALUE];
correction factor k is mean2= [VALUE];

18 meandiff = mean2-mean1;
k= = 1 3845 s2= [VALUE]; *calculate pooled variance;
9 3905
alpha=0.05;
Therefore, the corrected effect size is 0.51. The
sample size for the new trial is * Calculate non-centrality parameter ncp;
ncp = n*(meandiff**2)/(4*s2);
1+1 2
1 96 + 1 282 2 1 962 F_crit = finv(1-alpha,1,n-2,0);
N= + *Solve for power for each n;
1 4 7 82242 2 2 power = 1-probF(F_crit,1,n-2, NCP);
output;
82 per arm, for a total of 164 dogs. end;
With the simplified non-central t-method, θ = run;
*output realised power and corresponding N.
3.485. The sample size for the new trial is Pick N with power that ≥ target power;
2 proc print data=tsample; run;
1+1 3 4852
N=
1 0 71 2
16.A.3 Sample Size for a Fixed Power:
49–50 per arm, for a total of 98–100 dogs. Crossover Design (Cattle Example)
proc power;
pairedmeans test=diff
alpha=0.05
meandiff = 10
std = 5.8
16.A Sample SAS Code for corr = -0.33
Calculating Sample Size for npairs = .
power = .9;
Two-Group Comparisons run;

16.A.1 Sample Size Based on


16.B Sample SAS Code for
z-Distribution Calculating Sample Size for a
data zsample; Veterinary Clinical Trial. The
meandiff= [value];
sd= [value]; standard deviation obtained
ES= meandiff/sd; from pilot data is corrected by
alpha = 0.05;
beta= 0.1; [0.1 for 90% power, 0.2 for
80% power]
computation of its upper
z = quantile("Normal", 1-alpha/2); confidence interval (UCL) or
zbeta=quantile("Normal", 1-beta);
N = ((z + zbeta)**2)/(ES**2); by the inverse function of
run;
proc print data=zsample; run; power and pilot degrees of
freedom
16.A.2 Total Sample Size Based on the
Non-Centrality Parameter for t 16.B.1 Conventional (Uncorrected
*set maximum sample size for iterations; Standard Deviation)
%let numSamples = 100;
data tsample; data power;
*pilot data;
*iterate through range of sample sizes; meandiff=4;
do n = 4 to &numSamples by 1; spilot=5.65;
alpha = 0.05;
188 A Guide to Sample Size for Animal-based Studies

beta=0.1;
z = quantile("Normal", 1-alpha/2);
References
zbeta=quantile("Normal", 1-beta);
Browne, R.H. (1995). On the use of a pilot sample for
Nmain= (z*z/2)+4*((zbeta+z)**2)/(meandiff/ sample size determination. Statistics in Medicine 14
Spilot)**2; (17): 1933–1940. https://doi.org/10.1002/
run; sim.4780141709.
proc print; run; Debackere, M. and Delbeke, F.T. (1978). Fluoride
pollution caused by a brickworks in the Flemish
countryside of Belgium. International Journal of
Environmental Studies 11: 245–252.
16.B.2 Upper Confidence Interval DeVore, J.L. (1987). Probability and Statistics for Engi-
Correction neering and the Sciences, 2e. Brooks/Cole Publishing.
Dholakia, U., Clark-Price, S.C., Keating, S.C.J., and
data power; Stern, A.W. (2017). Anesthetic effects and body weight
*pilot data;
meandiff = 4;
changes associated with ketamine-xylazine-lidocaine
df=18; administered to CD-1 mice. PLoS ONE 12 (9):
stddev = 5.65; e0184911. https://doi.org/10.1371/journal.
pone.0184911.
alpha = 0.05; Julious, S.A. and Owen, R.J. (2006). Sample size calcula-
beta = 0.1; tions for clinical studies allowing for uncertainty about
the variance. Pharmaceutical Statistics 5: 29–37.
z = quantile("Normal", 1-alpha/2);
zbeta = quantile("Normal", 1-beta); https://doi.org/10.1002/pst.197.
C_crit = quantile('CHISQ',alpha,df ); Karlin, E.T., Rush, J.E., and Freeman, L.M. (2019). A
pilot study investigating circulating trimethylamine
k=sqrt(df/C_Crit); N-oxide and its precursors in dogs with degenerative
UCL=k*stddev; mitral valve disease with or without congestive heart
failure. Journal of Veterinary Internal Medicine 33
Nmain= (z*z/2)+4*((zbeta+z)**2)/(meandiff/ (1): 46–53. https://doi.org/10.1111/jvim.15347.
UCL)**2;
run;
Lehr, R.S. (1992). Sixteen s-squared over d-squared:
proc print; run; A relation for crude sample size estimates. Statistics
in Medicine 11: 1099–1102.
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H.
16.B.3 Simplified Non-Central t (2018). Sample Sizes for Clinical, Laboratory and Epide-
Correction miology Studies, 4e. Wiley-Blackwell.
van Belle, G. (2008). Statistical Rules of Thumb, 2nd
data power; edition. New York: Wiley.
*pilot data; Wan, X., Wang, W., Liu, J., and Tong, T. (2014). Estimat-
df=18; *pilot degrees of freedom; ing the sample mean and standard deviation from the
meandiff=4;
sample size, median, range and/or interquartile range.
spilot=5.65;
BMC Medical Research Methodology 14: 135. https://
ES = meandiff/spilot; doi.org/10.1186/1471-2288-14-135.
Whitehead, A.L., Julious, S.A., Cooper, C.L., and Camp-
alpha = 0.05; bell, M.J. (2016). Estimating the sample size for a pilot
beta=0.1; randomised trial to minimise the overall trial sample
size for the external pilot and main trial for a continu-
z = quantile("Normal", 1-alpha/2);
ous outcome variable. Statistical Methods in Medical
theta = tinv(1-beta,df,z);
Research 25 (3): 1057–1073. https://doi.org/
Nmain = 4*(theta**2)/ES**2; 10.1177/0962280215588241.
run;
proc print;
run;
17
Comparing Two Groups:
Proportions

CHAPTER OUTLINE HEAD


17.1 Introduction 189 17.4 Skewed Count Data 195
17.2 Difference Between Two Proportions 190 17.4.1 Poisson Distribution 195
17.2.1 One Proportion Known 190 17.4.2 Negative Binomial Distribution 196
17.2.2 Sex Ratios 190 17.A SAS Code for Computing Sample Size:
17.2.3 Two Independent Samples 191 Difference Between Two Independent
17.2.4 Confidence Intervals 191 Proportions. Proportions are calculated
17.3 Relative Risk and Odds Ratio 193 over a range of candidate n,
17.3.1 Relative Risk 193 then solved for power 196
17.3.2 Odds Ratios 194 References 197

17.1 Introduction BOX 17.1


Proportions p summarise count data. The propor- Comparing Proportions
tion is the ratio of the number of events or subjects Proportions p are used to summarise binary and
with the condition of interest divided by the total: count data
p = x/N. Count data do not contain as much
information as continuous data, so sample size Sample size requires specifying both
based on proportion outcomes will be much larger. absolute difference p1 − p0
Because proportions are not continuous and nor- relative difference with respect to control or baseline
mally distributed variables, comparisons cannot proportion
involve the use of t-tests or other statistical analysis The effect size is
methods that are standard for continuous data p1 − p0
(Box 17.1).
p 1−p
Sample size determinations require pre-
specification of
Type I and type II errors, α and β Analysis methods used for continuous data cannot
One-sided or two-sided comparison (α or α/2 be used for comparisons of proportions.
respectively)
The minimum biologically relevant difference to
be detected. This is the absolute difference between The value of the control or reference group p0.
proportions d = p1 − p0. This is also known as the Sample size for proportions is also determined by
risk difference. the difference relative to the baseline.

A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
190 A Guide to Sample Size for Animal-based Studies

The standardised effect size for comparing two


and power 1 − β = 0.9. The number of surgeries
proportions with equal sample sizes is
that will need to be observed is
p1 − p0 1 645 0 15 1−0 15 +1 282 0 05 1−0 05
2
ES = N=
p 1−p 0 15−0 05 2

=300 5
where (p1 − p0) is the difference d, and p =
p0 + p1 2 is the mean of the two proportions. or approximately 301 surgeries.
The margin of error, or precision, is one-half of
the confidence interval range (u − l)/2. It can also
be calculated as z1-α/2 SE(pdiff). Sample size can
be obtained for a target precision by iterating over 17.2.2 Sex Ratios
a range of candidate sample sizes, calculating the Sex ratio, and factors inducing sex ratio shifts, are
confidence interval for each, and choosing the sam- important information for a variety of disciplines
ple size that approximates the desired precision. such as evolutionary biology, ecology, reproductive
biology, genetics, animal husbandry, and manage-
ment of laboratory populations. In most species
17.2 Difference Between with separate sexes, males and females are pro-
Two Proportions duced in approximately equal numbers, so the
expected sex ratio is 1:1; that is, p0 = p1 = 0.5. How-
17.2.1 One Proportion Known ever, sex ratio can vary in response to environmen-
tal factors such as temperature, food supply,
When one proportion is already known (reference
population density, and parental effects.
proportion p0), then sample size for detecting a dif-
The sex ratio R is calculated as a proportion of the
ference d of the new or anticipated p from the refer-
total number of animals:
ence level, with a significance level of α and power
1 − β is: R = n1 N
2
z1 − α p0 1 − p0 + z1 − β p 1−p where the total sample size N = n1 + n2. Calculating
N= direct ratios of one sex to the other (R = n1/n2) is not
d2
recommended. If there are no representatives of one
sex in the sample, the resulting ‘ratio’ will be
This is usually a one-sided test of the hypothesis either undefined or zero depending on which is
pnew < p0 or pnew > p0, so the significance is α rather the numerator or denominator (Wilson and
than α/2. Hardy 2002).
Comparison of observed sex ratio when the
expected sex ratio is 1:1. Under the null hypothesis,
Example: Before-After Comparison: Mouse
the sex ratio is 1:1, and the probability p of obtaining
Surgery Success
a member of the target sex will be exactly one-half,
A series of surgeries performed on mice had a loss or 0.5 (p0 = p1 = 0.5). Usually, the hypothesis of
rate of 15%. The investigator initiated a rigorous interest is a test of a shift in sex ratio from p0 =
series of retraining and procedural changes, with 0.5 (the null value) to an alternative value p1 with
the goal of reducing losses to 5% or less before power 1 − β.
beginning the new protocol. How many surgeries The equation for sample size is a modification of
need to be observed to be 90% confident that the the former equation for a one-sample test:
change in procedures has reduced attrition to 5%?
2
This tests the hypothesis that pnew < p0, with z1 – α 0 5 + z1 − β 0 25 – d2
proportions p0 = 0.15 and pnew = 0.05, α = 0.05, N=
d2
Comparing Two Groups: Proportions 191

where the reference proportion is p0 = 0.5, p1 is the range of candidate sample sizes. Sample sizes n0 and
new proportion, and d = p0 − p1 is the difference n1 should be equal for simplicity and to minimise
from 0.5 under the alternate hypothesis. the total sample size required for a target power.
The equation is

Example: Butterfly Sex Ratios − zα p q n0 + p q n1 − d


2
Power = 1− β = Pr z ≤
For many years, the sex ratio of butterflies col- p0 q0 n0 + p1 q1 n1
lected in the field was 1:1; that is, the proportion
of female butterflies was approximately 50%. zα 2 p q n 0 + p q n1 − d
+ Pr z ≥
However, recent studies have reported an appar- p0 q0 n0 + p1 q1 n1
ent decline in females (possibly the result of cli-
mate change and habitat loss). How many
butterflies would need to be collected to detect where p = p0 + p1 2 ; q0 = 1 − p0; q1 = 1 − p1;
a shift in the proportion of females from 0.50 to q = 1 − p, and d = (p1 − p0) (Zar 2010). SAS code
0.45 with 90% power and 95% confidence? for a two-sided test is provided in Appendix 17.A.

Example: Mouse Infections: Sample Size for


The null hypothesis of the expected proportion of Risk Difference
p0 = 0.50 will be tested against the one-tailed alter-
native p1 = 0.45, with α = 0.05 (z1 − α = 1.645) and (Data from Zar 2010). Two species of mice were
power 1 − β = 0 9 z1 − β = 1 282 . The number of examined for the presence of a certain parasite.
For species 1, 18 of 24 mice were infected (p0 =
butterflies N to be collected to detect this shift will
0.75), and for species 2, 10 of 25 were infected
be approximately:
(p1 = 0.40). What sample size in a future study
2 would be required to obtain the same risk differ-
0 5 1 645 + 1 282 0 25 − 0 50 − 0 45 2 ence p1 − p0 = 0.35, for equal sample sizes
N≥ n1 = n2, confidence of 95% and power of 80%?
2
0 50 − 0 45 Sample size based on the Wald large-scale
= 350 approximation is

N = 1 96 + 0 8416 2
0 75 1 − 0 75 + 0 4 0 4
17.2.3 Two Independent Samples = 27 4
0 75 − 0 4 2
When sample sizes in each group are equal, then the
large sample approximation (Wald equation) for
28 for each group, for a total of 56.
total sample size is
With the exact method, the sample size is
32 per group, for a total of 64 and power of
2 p0 1 − p0 + p1 1 − p1
N ≥ z1 − α 2 + z1 − β 80%. For power of 90%, the sample size is
d2 42 per group, for a total of 84 (Figure 17.1).
where subscripts 0 and 1 indicate the control (or ref-
erence) and intervention groups, respectively, and
the difference of interest is d = (p1 − p0). A conserv- 17.2.4 Confidence Intervals
ative estimate for sample size is obtained by assum-
ing that p = 0.5, because p(1 − p) is a maximum at The conventional Wald confidence interval for one
p = 0.5. sample is
Better estimates for sample size are based on the
exact method: solving for power and iterating over a p ± z1 − α 2 p 1–p n
192 A Guide to Sample Size for Animal-based Studies

1.0 performs well for all sample sizes. The width of


0.9 the confidence interval is
0.8

0.7
w= p0 −l0 2 + u1 −p1 2 + p1 −l1 2 + u0 −p0 2

0.6
The lower (l) and upper (u) confidence
Power

0.5
limits are (A − B)/C and (A + B)/C,
0.4 respectively, where A = 2 ni pi + z21 − α 2 ; B = z1 − α 2
0.3
z21 − α 2 + 4 ni pi 1 − pi ; and C = 2 ni + z21 − α 2 .
0.2
Agresti-Caffo method. The unadjusted propor-
0.1
tions are pi = xi/ni. The adjusted confidence interval
0.0 ‘adds two successes and two failures’ where
10 20 30 40 50 60 70 80 90 100 ni = n + 4 trials and pi = x + 2 n + 4 . The
n per group
adjusted xi are x i = x i + z21 − α 2 4 , adjusted ni are
Figure 17.1: Sample size for evaluating risk difference in
a mouse infection model estimated by the exact method.
ni = ni + z21 − α 2 2 , and adjusted pi are pi = x ni .
For power of 80%, the sample size is 32 per group, and for These adjusted values are substituted into the con-
power of 90%, the sample size is 42 per group. fidence interval and sample size equations
(Agresti and Caffo 2000).
For the difference between two independent pro-
portions, the confidence interval is Example: Mouse Infections: Confidence
Intervals for Difference Between
p1 − p0 ± z 1 − α 2 SE p1 − p0 Proportions
(Data from Zar 2010). Two species of mice were
where the standard error for the difference SE is examined for the presence of a certain parasite.
For species 1, 18 of 24 mice were infected (p0 =
0.75), and for species 2, 10 of 25 were infected
p1 1 − p1 p 1 − p0
SE p1 − p0 = + 0 (p1 = 0.40).
n1 n0 The 95% confidence intervals for the risk differ-
ence (p1 − p0) of 0.35 were constructed by the
However, the Wald equation is not recommended conventional Wald, Newcombe, and Agresti-
for small samples (n < 30 per arm) because it is Caffo methods (SAS code in Garner 2016) and
unstable and usually very inaccurate even for very are summarised in Table 17.1. The Wald interval
large samples. It cannot be used when proportions is wider than either the Newcombe or Agresti-
are extreme (p < 0.1, p > 0.9). Caffo intervals, indicating less precision. The
Two preferred alternatives discussed here
are the Newcombe (Wilson score interval) and Table 17.1: 95% Confidence intervals constructed for
mouse infection data p0 = 0.75 and p1 = 0.40 with a risk
Agresti-Caffo methods. Newcombe (1998) describes difference (p1 − p0) of 0.35.
11 methods for constructing confidence intervals
for two-group comparisons of proportions. SAS Method 95% confidence
interval
code for these and six additional methods are
described by Garner (2016). Wald 0.091 0.609
Newcombe (Wilson score intervals). This method Agresti-Caffo 0.072 0.576
incorporates the lower and upper confidence limits
Newcombe (Wilson score) 0.073 0.561
(l, u) estimated for each proportion (Newcombe
1998; Newcombe and Altman 2000). This method Source: SAS code adapted from Garner (2016).
Comparing Two Groups: Proportions 193

Sample size. When prevalence p0 in the control


Wald interval limits are symmetrical around the group is relatively low (<20%) the sample size
mean; the Newcombe and Agresti-Caffo intervals to achieve a target RR can be approximated
are not. by a Poisson-based formula:

4
ngroup = 2
17.3 Relative Risk and p0 RR − 1

Odds Ratio
Alternatively, it can be estimated by the ln-based
Relative risk (RR), risk differences, and odds ratio
formulation
(OR) are commonly used in epidemiological studies
to describe the probability of an event occurring in
one group relative to another. Study designs can be 8 RR + 1 RR
randomised controlled trials or observational cohort ngroup =
p0 ln RR 2
and case-control studies.

This formula is more conservative than the simple


17.3.1 Relative Risk Poisson-based formulation, so estimated sample
The RR or risk ratio is the ratio of two proportions, sizes will be larger. As baseline prevalence increases,
usually expressed as the ratio of two ‘success’ pro- the required sample size decreases.
portions. The RR compares the probability or risk
of an event occurring in one group relative to that
Example: Detecting a Given Relative Risk
of a comparator group:
To detect an RR of at least 5 with prevalence of
RR = p1 p0 1% in the unexposed group, the required sample
size is
Subscripts 0 and 1 indicate the control (reference
or unexposed) and intervention (exposed) groups,
respectively. 4
n= 2
0 01 5−1
Confidence intervals. The confidence interval is
obtained by first calculating the natural log = 261 8 ≈ 262, for a total of 524
(ln) of the RR:

1 − p0 1 − p1 Alternatively,
ln RR ± zα +
n0 p 0 n1 p 1
8 5+1 5
n= = 370 6 ≈ 371, for a total of 742
then back-transforming to the linear scale. For the 0 01 ln 5 2
95% confidence interval zα = 1.96, and the confi-
dence limits are
Rule of thumb. RR reduction is the difference
1 −p0 1− p1 between two RRs (RR1 − RR0). Studies evaluat-
Lower limit l = exp lnRR − 1 96 +
n0 p 0 n1 p 1 ing a rare binary outcome require approxi-
mately 50 events in the control group and an
1 −p0 1− p1 equal number in the exposed group to detect
Upper limit u = exp lnRR + 1 96 + a 50% RR reduction (or halving of risk) with
n0 p 0 n1 p1
80% power (Glasziou and Doll 2006).
194 A Guide to Sample Size for Animal-based Studies

then back-transformed to the linear scale. The 95%


Example: Pre-specified Risk Reduction confidence interval is
Suppose the risk of incurring a given disease is
10%, and a researcher wanted to determine if a
Lower limit = exp
new intervention will reduce risk to 5%. How
many subjects are needed? 1 1 1 1
ln OR − 1 96 + + +
p0 1 − p0 p1 1 − p1
ngroup = 50 0 1 = 500 subjects per
group or 1000 subjects total
Upper limit = exp
1 1 1 1
ln OR + 1 96 + + +
p0 1 − p0 p1 1 − p1
17.3.2 Odds Ratios
The OR is the ratio of the odds of occurrence of an
Confidence intervals for more complex regres-
event (‘success’, p) relative to the odds of it not
sion model-based estimates for ORs can be derived
occurring (‘failure’, 1 − p) for a test group versus
from simulation or percentiles of bootstrap distribu-
the comparator group. Subscripts 0 and 1 indicate
tions (Greenland 2004).
the control (reference or unexposed) and interven-
tion (exposed) groups, respectively.
Example: Mouse Infections: Sample Size for
p1 1 − p1 a Change in Odds Ratio
OR =
p0 1 − p0 Mice in two groups were examined for parasites.
In group 1, 18 of 24 mice were infected (p1 =
Sample size. The sample size to achieve a target 0.75), and in group 2, 10 of 25 were infected
OR (ORnew) is approximately (p0 = 0.40). In this example, group 1 is the test
group ptest and group 2 is the comparator group
8 σ 2lnOR p0. The OR is
ngroup = 2
ln ORnew 0 75 1 − 0 75
OR = =45
0 40 1 − 0 40
1 1 1 1
where σ 2lnOR = + + +
p0 1 − p0 p1 1 − p1 and the mean proportion p = (0.75 + 0.40)/
2 = 0.575. What sample size would be required
More precise estimates are obtained by to detect an OR of 3 with α = 0.05 and power of
2
0.9 and equal sample sizes in each group?
2
r+1 z1 − α 2 + z1 − β For planning a new study, it can be assumed the
N=
r p 1−p ln ORnew 2 baseline proportion in the new study will be the
same as the old. For a new OR = 3 and the same
where p is the mean proportion (p0 + p1)/2. The baseline proportion, the anticipated proportion
allocation ratio r = n1/n2. If sample sizes are equal, for the test group must be estimated first:
r is 1, and (r + 1)2/r = 4 (Machin et al. 2018).
OR p0 3 0 40
ptest,new = = = 0 67
Confidence intervals. The 100(1 − α)% confidence 1 − p0 + OR p0 1 −0 40 + 3 0 40
interval is calculated from ln(OR):
The new mean proportion is
1 1 1 1 p = 0 67 + 0 40 2= 0 533. Then the new sample
ln OR ± zα + + +
p0 1 − p0 p1 1 − p1 size is
Comparing Two Groups: Proportions 195

1+1 2
1 96 + 1 2816 2 most common distributions being the Poisson and
N= 2 = 203 8 negative binomial (Cundill and Alexander 2015).
1 0 533 1 − 0 533 ln 3

≈204 mice, with 102 per group.


When expected proportions are unknown. When there is no information as to the proportions expected for the two groups, effect size can be approximated from Cohen's h for ORs (Cohen 1988). The effect size h is

\[ h = \varphi_1 - \varphi_2 \]

where \( \varphi_i = 2 \arcsin(\sqrt{p_i}) \). The arcsine transformation is used to stabilise the variance for the binomial distribution (Rücker et al. 2008, 2009). The total sample size N is then approximated by

\[ N = 4\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{h}\right)^2 \]

Because it is easy to calculate (Rücker et al. 2009), h may be appropriate for first-pass approximations. However, this approximation should be used with considerable caution. Effect sizes based on standardised 'small', 'medium', and 'large' benchmark values are not appropriate for animal-based studies. Apart from that, calculated values of h are difficult to interpret. The transformation is biased because the transformed values do not match the difference between proportions (p1 − p2) (Cohen 1988). Bias increases with small sample sizes and large imbalances between groups (Rücker et al. 2009).

17.4 Skewed Count Data

When count data are concentrated in relatively few categories, the data distribution will be skewed or imbalanced. These data are often characterised by a large number of zeros or counts in a few categories and many outlying values. Sample size calculations for two-group comparisons of skewed data are based on a number of underlying distributions, the two most common being the Poisson and negative binomial (Cundill and Alexander 2015).

17.4.1 Poisson Distribution

The Poisson distribution is appropriate for count data describing event rate, or the number of events per unit time. The mean event rate is λ. Total sample size is

\[ N \geq \frac{(z_{1-\alpha/2} + z_{1-\beta})^2\,(a + b)}{(\ln \lambda_0 - \ln \lambda_1)^2} \]

where \( a = \frac{1}{p_1 \lambda_1} \) and \( b = \frac{1}{p_0 \lambda_0} \), the p's are the proportions of the total sample allocated to each group, and subscripts 0 and 1 indicate reference (or control) and intervention groups, respectively.

Rules of thumb. The square root of a Poisson-distributed sample is approximately normal. With Lehr's approximation (van Belle 2008), sample size per group is approximately

\[ n_{group} = \frac{4}{\left(\sqrt{\lambda_0} - \sqrt{\lambda_1}\right)^2} \]

To determine sample size for detecting a given rate above background, the sample size is

\[ N = 4\sqrt{\lambda^{*}} \]

Example: Avian Influenza

There has been a global increase in the reported number of avian influenza outbreaks in wild birds and poultry. Low pathogenicity strains occur naturally without causing illness. Suppose baseline mortality for a given population was approximately 25,000 deaths per year. How many excess deaths would have to occur to be flagged as an unusual occurrence?

\[ N \geq 4\sqrt{25{,}000} \approx 633 \]

for an increase of approximately 2.5%.
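Both rules of thumb are one-line calculations. The SAS data step below sketches them together: the per-group Lehr approximation (using hypothetical rates λ0 = 10 and λ1 = 5 chosen purely for illustration) and the excess-event count from the avian influenza example.

*Poisson rules of thumb (van Belle 2008);
data poisson_rules;
  lambda0 = 10;   *hypothetical baseline event rate;
  lambda1 = 5;    *hypothetical intervention event rate;
  n_group = 4/(sqrt(lambda0) - sqrt(lambda1))**2;   *Lehr approximation per group;
  lambda_bg = 25000;          *background count, avian influenza example;
  excess = 4*sqrt(lambda_bg); *excess events flagged as unusual (approx. 633);
  output;
run;

proc print data=poisson_rules; run;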

17.4.2 Negative Binomial Distribution

The negative binomial is a generalisation of the Poisson distribution, with mean count μ and an additional parameter k describing over-dispersion. This distribution is used frequently to describe distributions of parasites across hosts. Total sample size is

\[ N \geq \frac{\left[ z_{1-\alpha/2}\sqrt{\left(\tfrac{1}{\mu_0} + \tfrac{1}{k_0}\right)\left(\tfrac{1}{p_1} + \tfrac{1}{p_0}\right)} \;+\; z_{1-\beta}\sqrt{\tfrac{1}{p_1}\left(\tfrac{1}{\mu_1} + \tfrac{1}{k_1}\right) + \tfrac{1}{p_0}\left(\tfrac{1}{\mu_0} + \tfrac{1}{k_0}\right)} \right]^2}{(\ln \mu_0 - \ln \mu_1)^2} \]

A good approximation can be made if proportions are set to 0.5 (p0 = p1 = 0.5), sample sizes are balanced (n0 = n1), and the negative binomial parameter is fixed (k0 = k1 = k). Then the equation reduces to

\[ N \geq \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \cdot 2\left(\tfrac{1}{\mu_1} + \tfrac{1}{k} + \tfrac{1}{\mu_0} + \tfrac{1}{k}\right)}{(\ln \mu_0 - \ln \mu_1)^2} \]

(Brooker et al. 2005). Estimates for μ and k can be based on prior values or calculated from raw data using maximum likelihood (Chapter 10).

Example: Soay Sheep Parasite Burden

The mean lungworm burden of 67 Soay sheep found dead on St Kilda was 47.5 worms per sheep (Gulland 1992). Suppose a new vaccine is proposed with expected efficacy of 50%, with the new vaccine tested against a control in a two-arm trial. How many sheep will be required to detect this level of efficacy with 95% power and two-sided confidence of 95%?

Reference worm burden μ0 is 47.5, and with 50% efficacy the new worm burden μ1 is 23.8. The negative binomial parameter k reported in the prior study was 0.841. Then with p0 = p1 = 0.5 and k0 = k1 = 0.841, sample size for the new study is

\[ N \geq \frac{(1.96 + 1.6449)^2 \cdot 2\left(\tfrac{1}{23.8} + \tfrac{1}{0.841} + \tfrac{1}{47.5} + \tfrac{1}{0.841}\right)}{(\ln 47.5 - \ln 23.8)^2} \approx 133 \]

Therefore, a total of approximately 133 sheep (or 66–67 per treatment arm) will be required.

17.A SAS Code for Computing Sample Size: Difference Between Two Independent Proportions

Proportions are calculated over a range of candidate n, then solved for power.

%let numSamples = 100;

data proppower;
do n1 = 10 to &numSamples by 2;
  alpha = 0.05;

  *input proportions;
  p1 = 0.75;
  p2 = 0.40;

  *for maximum power set n1=n2;
  n2 = n1;

  *calculate difference in proportions and related metrics;
  pdiff = p1 - p2;
  q1 = 1 - p1;
  q2 = 1 - p2;
  pmean = (n1*p1 + n2*p2)/(n1 + n2);
  qmean = 1 - pmean;
  f1 = pmean*qmean/n1;
  f2 = pmean*qmean/n2;

  z = quantile('normal', 1 - alpha/2);

  *calculate power of the two-sample test;
  pow = (z*sqrt(f1 + f2) - pdiff)/sqrt(f1 + f2);
  power = 1 - probnorm(pow);
  output;
end;
run;

proc print data=proppower;
run;

*Plot power vs sample size per arm;
proc sgplot data=proppower aspect=1;
  series x=n1 y=power / lineattrs=(thickness=3);
  yaxis label="Power" values=(0 to 1 by 0.1)
    valueattrs=(color=black size=12) labelattrs=(color=black family="Arial" size=14);
  xaxis label="n per group" values=(10 to 100 by 5)
    labelattrs=(color=black family="Arial" size=14) valueattrs=(color=black size=12);
  refline 0.8 0.9 / axis=y lineattrs=(thickness=3 color=black pattern=shortdash);
  refline 32 42 / axis=x lineattrs=(thickness=3 color=black pattern=dash);
run;
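As a cross-check on the data-step calculation, the same two-proportion problem can be handed to PROC POWER directly; the sketch below requests power over the same range of per-group sample sizes.

*Built-in cross-check: power for the same proportions over a range of n;
proc power;
  twosamplefreq test=pchi
    groupproportions=(0.40 0.75)
    alpha=0.05
    npergroup=10 to 100 by 10
    power=.;
run;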
References

Agresti, A. and Caffo, B. (2000). Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. The American Statistician 54: 280–288. https://doi.org/10.2307/2685779.
Brooker, S., Bethony, J., Rodrigues, L. et al. (2005). Epidemiologic, immunologic and practical considerations in developing and evaluating a human hookworm vaccine. Expert Review of Vaccines 4: 35–50.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Hillsdale: Lawrence Erlbaum Associates.
Cundill, B. and Alexander, N.D. (2015). Sample size calculations for skewed distributions. BMC Medical Research Methodology 15: 28. https://doi.org/10.1186/s12874-015-0023-0.
Garner, W. (2016). Constructing confidence intervals for the differences of binomial proportions in SAS®. https://www.lexjansen.com/wuss/2016/127_Final_Paper_PDF.pdf.
Glasziou, P. and Doll, H. (2006). Was the study big enough? Two café rules. Evidence-Based Medicine 11 (3): 69–70. https://doi.org/10.1136/ebm.11.3.69.
Greenland, S. (2004). Model-based estimation of relative risks and other epidemiologic measures in studies of common outcomes and in case-control studies. American Journal of Epidemiology 160 (4): 301–305. https://doi.org/10.1093/aje/kwh221.
Gulland, F.M.D. (1992). The role of nematode parasites in Soay sheep (Ovis aries L.) mortality during a population crash. Parasitology 105 (Pt 3): 493–503. https://doi.org/10.1017/s0031182000074679.
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. (2018). Sample Sizes for Clinical, Laboratory and Epidemiology Studies, 4e. Wiley-Blackwell.
Newcombe, R.G. (1998). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17 (8): 873–890. Erratum in: Statistics in Medicine 18 (10): 1293.
Newcombe, R.G. and Altman, D.G. (2000). Proportions and their differences. In: Statistics with Confidence, 2e (ed. D.G. Altman, D. Machin, T.N. Bryant, and M.J. Gardner), 45–56. Bristol: BMJ Books.
Rücker, G., Schwarzer, G., and Carpenter, J. (2008). Arcsine test for publication bias in meta-analyses with binary outcomes. Statistics in Medicine 27 (5): 746–763.
Rücker, G., Schwarzer, G., Carpenter, J., and Olkin, I. (2009). Why add anything to nothing? The arcsine difference as a measure of treatment effect in meta-analysis with zero cells. Statistics in Medicine 28 (5): 721–738. https://doi.org/10.1002/sim.3511.
van Belle, G. (2008). Statistical Rules of Thumb, 2e. Wiley.
Wilson, K. and Hardy, I. (2002). Statistical analysis of sex ratios: an introduction. In: Sex Ratios: Concepts and Research Methods (ed. I. Hardy), 48–92. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511542053.004.
Zar, J.H. (2010). Biostatistical Analysis, 5e. Upper Saddle River: Prentice-Hall.
18
Time-to-Event (Survival) Data

CHAPTER OUTLINE HEAD
18.1 Introduction 199
18.2 Methodological Considerations 200
    18.2.1 Define the Research Question 200
    18.2.2 Define All Survival-Related Items 200
    18.2.3 Randomisation Schedule 200
    18.2.4 Data Structure 201
    18.2.5 'Responders' Versus 'Non-Responders' and Post Hoc Dichotomisation 203
18.3 Outcome Data 203
    18.3.1 Count Data 203
    18.3.2 Survival Times 203
    18.3.3 Hazard Rate and Hazard Ratio 204
18.4 Time-to-Event Sample Size Calculations 205
18.5 Veterinary Clinical Trials 206
18.6 Other Study Design Considerations 207
    18.6.1 More Than Two Groups 207
References 208

18.1 Introduction

A time-to-event outcome combines two separate pieces of information: whether an event did or did not occur in the designated time (Y/N), and the time at which the event occurred. The 'event' of interest may be adverse or classified as a 'failure' (death of a subject, time to humane endpoint, time to tumour appearance), neutral (pollinator visits, duration of shelter animal stay), or positive (time to complete wound healing, time to conception; Box 18.1). Censoring is when an individual subject does not experience the event of interest in the study time frame, and is a distinguishing feature of time-to-event data.

BOX 18.1
Examples of Time-to-Event Applications

Time from disease diagnosis to death.
Time to nest failure.
Number of nestlings surviving to fledging.
Comparison of dogs completing guide dog training to those withdrawn early.
Time from disease induction to palpable tumour appearance.
Length of pregnancy in cattle following artificial insemination.
Time from disease diagnosis to disease resolution.
Time from initial disease exposure to first major symptom.
Time from completion of cancer treatment to tumour recurrence.
Time between seizures.
Time between flower visits by pollinators.
Differences in fish survival caused by haul net differences.
Time of shelter animal stay, from shelter entry to adoption.
Time to complete wound healing.

Sample size calculations for time-to-event data are more complicated than those for continuous or binary endpoints and usually require large sample sizes. This makes 'survival' as a primary outcome unacceptable for most laboratory studies using rodents. Specification of a time-to-event variable as a primary outcome is better suited to analysis of large data sets or multi-centre veterinary randomised clinical trials, where there is the potential for observing many more subjects than is feasible for laboratory studies.

18.2 Methodological Considerations

18.2.1 Define the Research Question

The research question will determine both how sample size will be determined and how time-to-event data will be analysed. The three broad categories of research questions are determining the proportion of individuals that will remain free of the event after a certain time; the proportion of individuals that will have the event after a certain time; and the risk of the event at a particular point in time, among those who have survived until that point (Machin et al. 2006).

18.2.2 Define All Survival-Related Items

After formulation of the research question, clear working definitions must be developed for all items used in the calculations. These include the event of interest, the time start, the time scale (hours, days, weeks, months, years), the end of the monitoring or follow-up period, and the end of the study (Box 18.2). Typically, there is a single target event, but more advanced analytical methods are available that can accommodate either multiple or recurrent events.

BOX 18.2
Time-to-Event Study 'Landmarks'

There must be clear, clinically or biologically relevant, repeatable (or standardised), and measurable definitions for:
Event of interest;
Time 'start';
Time 'end' (for subject monitoring);
End of the study;
Time scale for biological effect;
Definition of censoring;
Humane endpoints.

Death as an experimental endpoint in laboratory animals is usually not ethically justifiable. Humane endpoints for research animals are the earliest scientifically justified point at which 'pain or distress in an experimental animal can be prevented, terminated, or relieved' (National Research Council 2011). Appropriate humane endpoints must be selected to accurately predict or indicate pain, distress, morbidity, or death, with welfare plans for monitoring, recognition, and appropriate end-stage actions, for example euthanasia when a pre-specified humane endpoint is attained.¹

¹ Guidelines for humane endpoints include https://oacu.oir.nih.gov/system/files/media/file/2022-04/b13_endpoints_guidelines.pdf; National Research Council (US) Committee on Recognition and Alleviation of Pain in Laboratory Animals (2009). Recognition and Alleviation of Pain in Laboratory Animals. Washington (DC): National Academies Press (US), Chapter 4, Effective Pain Management; and https://nc3rs.org.uk/3rs-resources/humane-endpoints

The time scale for biological effects is critical for assessing the biological or clinical relevance of any differences between groups. For example, a 3-day difference in survival will be clinically relevant for acute infections or diseases with high mortality occurring in 7–10 days. However, for chronic diseases, or conditions that may take weeks to months to evolve, a 3-day difference will have no biological importance (Furuya et al. 2014).

18.2.3 Randomisation Schedule

Subjects may enter a trial sequentially (for example, veterinary clinical trials, or laboratory rodents with uncommon genotypes that depend on the timing of litter drop). As a result, recruitment, enrolment,
and randomisation of eligible subjects may be staggered over long periods of time, varying from weeks to months or years. Randomisation strategies should be planned for uncertainties in availability, to avoid unintentionally uneven allocation ratios. Likewise, follow-up times should be planned with respect to the time allotted to the enrolment period. If the study is terminated at a fixed time regardless of study entry time, individual follow-up times will be extremely variable, with the first subject enrolled having the longest follow-up time and the last enrolled subject the shortest. Staggered entry does not affect sample size calculations for continuous or binary endpoints. However, staggered entry will profoundly affect sample size calculations for time-to-event data, as these depend on the number of events observed, and therefore on both the pattern of study entry and whether event rates are low or high (Witte 2002).

18.2.4 Data Structure

Time-to-event data are characterised by differing time-dependent patterns of survival (time-dependent trajectories, or survivorship curve shape) and by censoring.

Survivorship curves illustrate the distribution of survival over time, measured as the number, or proportion, of individuals surviving at each time t. The shape of the curve will be modified by the strength of the test intervention or agent, genotype, and environmental effects. Graphical examination of patterns can provide insight into potential mechanisms affecting survival in laboratory and clinical investigations. The three main patterns (Figure 18.1) were described in the last century (Pearl and Miner 1935; Demetrius 1978).

Type I survival indicates a constant proportion of individuals dying over the course of the study.
Type II survival exhibits high survival for most of the study. Mortality increases near the end of the monitoring period.
Type III survival exhibits high early mortality followed by low mortality for relatively few individuals over the remainder of the monitoring period.

Figure 18.1: Patterns of survival over time. Type I: constant mortality throughout the study duration. Type II: low early mortality; high survival over most of the study period. Type III: high early mortality for most subjects, with a few longer-term survivors.

Pattern interpretation should be made with caution, especially when sample sizes are small, because very small numbers in the tails of the curves make survival estimates unstable (Altman 1991; Altman and Bland 1998).

Censoring occurs when the event of interest is not observed for a subject, either because the event has not occurred by the time the study ends or because it occurred before the subject entered the study. Censored observations are not 'missing' in the conventional sense. Subjects may be right-censored (the subject does not have the event before the study ends), left-censored (the subject experiences the event before the study begins), or interval-censored (the event occurs within some known time interval, but the exact time cannot be precisely identified) (Table 18.1; Figure 18.2).

Because of censoring, time-to-event data cannot be analysed by the conventional summary statistics used for continuous data. Attempting to impose a normal distribution on these data will result in large apparent outliers. If conventional analysis methods are used, data from censored subjects will be excluded, because exact survival duration cannot be calculated for censored individuals; times are known only for those subjects for which the event occurred. Omitting censored data from analyses contributes to survivorship bias.
Table 18.1: Types of censoring.

Right censoring: The most common type. Occurs when a subject completes the study without experiencing the event.
    Type I right censoring occurs when study duration is fixed beforehand.
    Type II right censoring occurs when the study is completed once a fixed number of events has occurred (study duration is variable).
Left censoring: Occurs when the subject experiences the event before enrolment or monitoring begins.
    Example: in oncology trials, an animal presents with a tumour before the study begins.
Interval censoring: The subject experiences the event within some pre-specified time interval, but the exact time is unknown. Interval censoring occurs when subjects can be monitored only periodically (for example, at weekly or monthly intervals).
    Examples: veterinary studies of client-owned animals followed up at scheduled clinic visits; disease progression studies (Finkelstein and Wolfe 1985; Radke 2003; Rotolo et al. 2017); monitoring of bird nest boxes (Heisey et al. 2007; Fieberg and DelGiudice 2008).

Figure 18.2: Types of censoring. (a) Right censoring occurs when a subject completes the study without experiencing the event. (b) Left censoring occurs when the event occurs before the study begins. (c) Interval censoring occurs when the subject experiences the event within some pre-specified time interval, but the exact time is unknown (for example, death of the animal between clinic visits).
18.2.5 'Responders' Versus 'Non-Responders' and Post Hoc Dichotomisation

After the data are examined, investigators sometimes categorise subjects as 'responders' or 'non-responders', then reanalyse survival differences between these new 'groups'. Senn (2018) lists five reasons for discouraging this practice. First, the operational definitions of 'responder' versus 'non-responder' are arbitrary, because they are usually based on some post hoc threshold or measure of 'improvement'. Second, it is implicitly assumed that subgroup responses will be consistent, predictable, and stable for each subject. Third, it is assumed that subgroup responses are somehow representative of the population response, rather than being an artefact of subgroup composition and subject variability. Fourth, to be valid, statistical tests for group comparisons require random allocation of treatments to experimental units. Finally, because post hoc groups are not randomised, statistical inference tests based on post hoc dichotomisation are inappropriate (Weiss et al. 1983; Atkinson et al. 2019) and will be biased and misleading. If such comparisons are unavoidable, descriptive statistics are recommended rather than inference testing.

18.3 Outcome Data

Time-to-event data can be summarised as both counts and survival times. The choice of outcome variable will determine both the effect size and the method of calculating sample size.

18.3.1 Count Data

Dichotomous categorical outcomes (e.g. 'dead/survived', 'tumour present/absent', 'yes/no') can be summarised as counts (number of subjects) and proportions p (where p is the number of subjects with the event divided by the total number of subjects in the group). Effect size can be estimated as relative risk (or risk ratio, RR) or odds ratio (OR). Parameter estimates of odds ratios obtained from logistic regression can be used as a measure of effect size; because they are dependent on the scale of the predictor variables, ln(odds ratios) are unstandardised effect sizes. Problems associated with odds ratios or relative risk for analysing time-to-event outcomes in meta-analyses have been described by Tierney et al. (2007).

Statistical comparisons of simple proportions at several times t are not recommended for survival studies. Proportions do not contain time and duration information, and estimates are unstable because of subject attrition over time (Machin et al. 2006).

18.3.2 Survival Times

Kaplan-Meier survival curves show the trajectory of events occurring over time. The Kaplan-Meier method is most appropriate for categorical predictor or grouping variables (e.g. experimental drug versus placebo control), or when there are a small number of fixed values (e.g. drug doses 0, 10, 20 mg) that can be considered categorical.

The median survival time M is the time at which half (50%) of the subjects have died or shown the occurrence of the event of interest (Box 18.3). The median is reliable as an effect size estimate when events occur fairly regularly over the study time period. If survival times are highly irregular, the median will tend to be biased and unreliable. Median survival cannot be calculated if more than 50% of subjects remain event-free by the end of follow-up (Machin et al. 2006).

BOX 18.3
Time-to-Event Effect Size: Median Survival Time

Median survival M is the time at which 50% of the subjects have died and 50% remain alive.

The Kaplan-Meier method calculates the survival function S(t). The proportion surviving each time interval is the proportion of subjects still alive at each time t:

\[ d_t = \frac{n_t - n_d}{n_t} \]

where the numerator is the difference between nt, the number alive at the start of time t, and nd, the number that died in the interval. The probability of survival to the next time interval ti is pi = 1 − di/ni. The survival
function S(t) is obtained by multiplying the probabilities at each time point: S(t) = p1 p2 … = (1 − d1/n1)(1 − d2/n2)…. This is formally expressed as

\[ S(t) = \prod_{i=1}^{j} \left(1 - \frac{d_i}{n_i}\right) \]

where Π is the product multiplier and j is the number of times an event is observed. S(t) changes at the time of each event and so is typically presented as a series of steps. Kaplan-Meier curves can also be used to describe the accumulation of events or failures over time. The cumulative distribution function gives the probability that the survival time is less than or equal to a given time t, S = 1 − S(t) (Figure 18.3). Examples include time to complete healing of wounds or ulcers (Moffatt et al. 1992) and probability of malaria recrudescence over 28 days of follow-up time (Dahal et al. 2017).

Figure 18.3: Kaplan-Meier curve of cumulative failure probability.

18.3.3 Hazard Rate and Hazard Ratio

The survival function S(t) is related to the hazard rate at time t as

\[ S(t) = \exp[-h(t)\,t] \]

The hazard rate h(t) is the probability that an individual at time t has an event at that time. It represents the instantaneous event rate for an individual who has already survived to time t; that is, the hazard rate is a measure of instantaneous risk. The relationship between the proportion surviving (ps) and the hazard rate h(t) at a given time t is therefore

\[ h(t) = \frac{-\ln p_s(t)}{t} \]

The relationship between ps, h(t), and the median survival time M is

\[ h(t) = \frac{-\ln p_s}{t} = \frac{-\ln 0.5}{M} = \frac{\ln 2}{M} \]

The hazard ratio (HR) is the ratio of the hazard rate at time t in one group compared to the hazard rate for the comparator group:

\[ HR = \frac{h_1(t)}{h_0(t)} = \frac{\ln p_1}{\ln p_0} \]

Here p0 and p1 are the proportions of subjects with the event for the comparator and test groups, respectively. The hazard ratio can be interpreted in a similar way to a risk ratio (Box 18.4).

BOX 18.4
Time-to-Event Effect Size: Hazard Rate and Hazard Ratio

Hazard rate h(t): instantaneous event rate for a subject already surviving to time t.
Hazard ratio (HR): ratio of the hazard rate at time t in one group compared to the hazard rate for the comparator group.
HR = 1: both groups (treatment and control) are experiencing an equal number of events at any time t.
HR > 1: more events occurred in the treatment group at any time t compared to the control group.
HR < 1: fewer events occurred in the treatment group at any point in time compared to the control group.

When two or more groups are compared, it is often assumed that the hazard ratio is constant over the follow-up period. This is the proportional hazards assumption. If true, the hazard ratio provides an average relative treatment effect over time. The proportional hazards assumption must be checked before the data are analysed, as the log-rank test cannot be used to test the equality of survival functions in the presence of non-proportionality. Survival
curves that cross each other are a strong indicator of non-proportionality. Peto et al. (1977) describe common errors in survival analyses.

18.4 Time-to-Event Sample Size Calculations

Sample size estimates for time-to-event data require an estimate of the effect size (hazard ratio), the expected proportion of subjects in each group that might experience the event, and the expected number of subjects showing the event of interest (Box 18.5).

BOX 18.5
Sample Size Requirements

▪ Median survival time or hazard ratio (effect size)
▪ Proportion of subjects with the event in each group, or prevalence
▪ Number of events
▪ Probabilities for Type I and II error (α and β)
▪ Allocation ratio.

In effect, two sample sizes must be estimated: the number of subjects expected to have the event (nE), and the total number of subjects (N). The number of events that occur determines the variance used in the sample size calculation. If censoring does not occur for any subject, then nE = N. The prevalence of the disease (or proportion of subjects expected to have the event) is PE. Then the total sample size is

\[ N = \frac{n_E}{P_E} \]

Other information needed for time-to-event sample size calculations includes the allocation ratio (the number of subjects allocated to each group) and the desired confidence and power. Two-sided tests are recommended, because the existence of a survival difference to be detected, or the direction of that difference, is usually not known beforehand. Balanced group sizes with a 1:1 allocation ratio result in the most power. For a given power, unbalanced designs will require much larger sample sizes than balanced designs: approximately 12% larger for 2:1 and 33% larger for 3:1 (Hey and Kimmelman 2014). If unequal ratios are planned, sample size estimates must be corrected by the pre-specified allocation ratio.

Effect size. The ratio of median survival times is often the simplest method for estimating the hazard ratio for a new study (Machin et al. 2018):

\[ HR = \frac{h_1(t)}{h_0(t)} = \frac{\ln 2 / M_1}{\ln 2 / M_0} = \frac{M_0}{M_1} \]

where h1(t) and h0(t) are the hazard rates for the test group and the control or comparator group, respectively, and M1 and M0 are the corresponding medians.

Number of events. The total number of subjects with the event (nE) that need to be observed in a two-group study is

\[ n_E = \frac{(r+1)^2}{r} \cdot \frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{(\ln HR)^2} \]

The allocation ratio r describes the balance of subjects in each arm, and the allocation ratio adjustment is (r + 1)²/r. For an allocation ratio of 1 (balanced group sample sizes), the adjustment factor is (1 + 1)²/1 = 4.

The hazard ratio can be approximated from an expected clinically relevant effect. For example, to detect a 50% decrease in events with a new treatment relative to controls, the planning value for HR would be 1/0.5 = 2.0.

Number of subjects. The total number of subjects N is determined by the expected proportion of subjects with the event in each group, p1 and p0:

\[ N = \frac{n_E}{\left[(1-p_0) + r(1-p_1)\right]/(1+r)} \]

If the allocation ratio is 1:1, then

\[ N = \frac{n_E}{\left[(1-p_0) + (1-p_1)\right]/2} \]
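These quantities are easy to compute in a short SAS data step. The sketch below implements the formulas above; the hazard ratio and overall event proportion are illustrative planning values only, not recommendations.

*Required events and total N for a two-group survival comparison;
data tte_n;
  alpha = 0.05;  power = 0.90;
  r  = 1;      *allocation ratio;
  hr = 2;      *planning hazard ratio (illustrative);
  pE = 0.60;   *anticipated overall event proportion (illustrative);
  z_a = quantile('normal', 1 - alpha/2);
  z_b = quantile('normal', power);
  nE = ((r + 1)**2 / r) * (z_a + z_b)**2 / (log(hr))**2;
  N  = nE / pE;
  output;
run;

proc print data=tte_n; var nE N; run;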
The expected value of p1 can be estimated from the planning values for the 'baseline' or control p0 as

\[ p_1 = \exp\left(HR_{new} \ln p_0\right) \]

If a control value is unknown, then an estimate of prevalence P can be substituted.

Example: Rat Tumour Growth

Mantel et al. (1977) monitored 150 female Sprague-Dawley rats for tumour growth; 40 rats developed tumours. Suppose a new study was proposed to test a new drug anticipated to reduce the risk of developing tumours by 50%, with 95% confidence (α = 0.05) and power of 90% (1 − β = 0.9). How many rats would be needed?

In the former study, Nold = 150 and nE,old = 40. For the new study, the hazard ratio HR = 2; for 95% confidence (α = 0.05), z1−α/2 = 1.96, and for power of 90%, z1−β = 1.282. Then the expected number of subjects with tumours in the new study is

\[ n_{E,new} = \frac{4\,(1.96 + 1.282)^2}{(\ln 2)^2} = 87.5 \approx 88 \]

The total sample size required to detect this number of tumours is

\[ N = n_{E,new}\,\frac{N_{old}}{n_{E,old}} = 88 \times \frac{150}{40} = 330 \]

What risk of tumorigenicity could be detected with power of 80% and α of 5% if fewer animals are used? The number of events is nE,old = 40, with α = 0.05 (z1−α/2 = 1.96) and 1 − β = 0.8 (z1−β = 0.8416). The corresponding number of events nE and total sample size N required for a given hazard ratio HR are

\[ n_E = \frac{4\,(1.96 + 0.8416)^2}{(\ln HR)^2} \qquad \text{and} \qquad N = n_E \times \frac{150}{40} \]

Substituting in the values for a range of hazard ratios and calculating sample size N for each shows:

Hazard ratio    nE     N
1.20            944    3542
1.25            631    2364
1.50            191    716
2               65     245
3               26     98
4               16     61
5               12     46

This example shows that only very large effect sizes for time-to-event data can be detected with the small sample sizes typical of many laboratory studies of rodents. A different primary outcome should be selected when the number of animals required is too large to be operationally or ethically feasible, or when the effect size that could be detected with a given power is implausible. Choose a primary outcome variable that is a reasonable surrogate, continuous, and with small variance.

18.5 Veterinary Clinical Trials

In veterinary clinical trials, sample size estimates will be affected by additional factors such as client non-compliance, losses to follow-up, and duration of the trial. These need to be considered in the planning stages, because they will determine the number of events, which in turn determines the variance used in the sample size calculations.

Loss to follow-up. In veterinary clinical studies, subjects that drop out or are lost to follow-up can result in the loss of considerable information, because events for these subjects cannot be recorded. Sample size must be increased, or adjusted, to account for these losses (Nadj):

\[ N_{adj} = \frac{N}{1 - p_f} \]

where pf is the fraction of subjects anticipated to be lost to follow-up. It is assumed that losses to follow-up occur at random and are not related to the health status of the subject or the intervention (Machin et al. 2006).
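For trials of this kind, PROC POWER can also solve log-rank sample sizes directly under an exponential survival assumption, with explicit accrual and follow-up periods. The sketch below is illustrative only: the median survival times and accrual period echo the example that follows, and the follow-up time is an assumed value.

*Log-rank sample size with staggered accrual (illustrative planning values);
proc power;
  twosamplesurvival test=logrank
    groupmedsurvtimes=(766 1228)  /* control and treatment medians, days */
    accrualtime=548               /* recruitment period, days */
    followuptime=940              /* assumed additional follow-up, days */
    alpha=0.05
    power=0.80
    npergroup=.;
run;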
Study duration. The minimum or lower limit to study duration TL is

\[ T_L = n_E / R \]

where nE is the number of events and R is the anticipated entry rate of subjects into the trial. The entry or recruitment rate R is estimated by dividing the number of randomised subjects by the recruitment time, R = N/t. If there are multiple recruiting sites, R is multiplied by the number of sites nsites, so that Rmulti = R × nsites.

The total duration of patient entry into the trial is approximated by the median survival time of the control group (or whichever group has the longer time to develop signs and symptoms) plus the lower limit:

\[ T_{total} = M_c + T_L \]

A more complex equation requiring iterative solutions is available, but this method is preferred for its simplicity (Machin et al. 2006).

Example: Clinical Trial of Dogs with Cardiac Disease – Two-Arm Trial

Boswood et al. (2016) conducted a randomised, blinded, controlled trial comparing pimobendan to placebo for treating dogs showing echocardiographic signs of cardiac enlargement. One goal was to determine if pimobendan delayed the onset of congestive heart failure (CHF). Subjects were randomised 1:1 to treatment arms. There was a total enrolment of 360 subjects, of which 59 in the pimobendan group and 76 in the placebo group developed CHF. Median times to develop symptomatic CHF were 1228 days and 766 days for the pimobendan and control groups, respectively. The power (1 − β) of the study was 0.8.

Number of subjects. How many subjects would be needed for a new study, where the goal was to increase the success rate to 70%, with α of 5% and power of 80%, and assuming a loss to follow-up of 10%?

The 'success rate' for the drug in the previous trial is (180 − 59)/180 = 0.67, or 67% (versus 58% for the placebo). The hazard ratio is

\[ HR = \ln(0.70)/\ln(0.58) = 0.65 \]

The number of events is

\[ n_E = \frac{4\,(1.96 + 0.8416)^2}{(\ln 0.65)^2} = 169.4 \approx 170 \]

Assuming the proportion of events in the placebo group will be 76/180 = 42%, the anticipated proportion of events in the intervention group is

\[ p_1 = \exp(0.65 \ln 0.42) = 0.57 \]

Then N = 170/{[(1 − 0.57) + (1 − 0.42)]/2} = 335.5 ≈ 336. With 10% loss to follow-up, pf = 0.1, and

\[ N_{adj} = 336/(1 - 0.1) \approx 373 \]

Study duration. How much time would a new trial require to recruit the requisite number of subjects?

Suppose the time to recruit 360 subjects in the previous trial was 1.5 years, or 548 days. Then R = 360/548 = 0.66 subjects per day, and TL = 170/0.66 ≈ 259 days. The total estimated study duration is 1228 + 259 ≈ 1487 days, or just over 4 years.

A study of this size and duration usually requires multiple participating sites to recruit enough patients in a reasonable period of time. However, organisation, logistics, coordination, and expense may be considerable.

18.6 Other Study Design Considerations

18.6.1 More Than Two Groups

Sample size calculations can be challenging when time-to-event outcomes must be compared for more than two groups. Most commercial software packages require Cohen's f or ηp² for effect sizes, which are not applicable to time-to-event studies. Machin et al. (2018) recommend a three-step procedure for a factorial design where interactions may
be of interest. Suppose a study is proposed evaluating two factors A and B. With a 2 × 2 factorial design, there are four 'groups' to be evaluated: A only, B only, both A and B, and neither A nor B. Therefore, to assess the effect of each factor, there are four median survival times to be estimated and two hazard ratios to be compared.

1. Estimate hazard ratios for A and B separately (assuming no interaction). Calculate N for each.
2. If the N for each factor is roughly similar, choose the larger N. If hazard ratios (and hence Ns) are very different, prioritise factors based on importance to the research question. For example, suppose A is a comparison of the therapeutic effect of the test drug against a control, versus B, which is a comparison of intravenous versus oral drug delivery routes. Factor A will be of higher priority if the goal is to determine efficacy. Power the study on the most important factor.
3. If interactions (A × B) to assess synergism or antagonism are of greater interest than main effects, then estimate the size of the interaction, and power the study based on the interaction. Be aware that, in general, estimating an interaction requires roughly 16 times the sample size needed to estimate a main effect with equivalent power.²

² Gelman, A. (2018). https://statmodeling.stat.columbia.edu/2018/03/15/need-16-times-sample-size-estimate-interaction-estimate-main-effect/

References

Altman, D.G. (1991). Practical Statistics for Medical Research. London: Chapman & Hall/CRC.
Altman, D.G. and Bland, J.M. (1998). Time to event (survival) data. BMJ 317 (7156): 468–469.
Atkinson, G., Williamson, P., and Batterham, A.M. (2019). Issues in the determination of 'responders' and 'non-responders' in physiological research. Experimental Physiology 104 (8): 1215–1225.
Boswood, A., Häggström, J., Gordon, S.G. et al. (2016). Effect of pimobendan in dogs with preclinical myxomatous mitral valve disease and cardiomegaly: the EPIC study: a randomized clinical trial. Journal of Veterinary Internal Medicine 30 (6): 1765–1779. https://doi.org/10.1111/jvim.14586.
Dahal, P., Simpson, J.A., Dorsey, G. et al. (2017). Statistical methods to derive efficacy estimates of anti-malarials for uncomplicated Plasmodium falciparum malaria: pitfalls and challenges. Malaria Journal 16 (1): 430. https://doi.org/10.1186/s12936-017-2074-7.
Demetrius, L. (1978). Adaptive value, entropy and survivorship curves. Nature 275: 213–214.
Fieberg, J. and DelGiudice, G. (2008). Exploring migration data using interval-censored time-to-event models. Journal of Wildlife Management 72 (5): 1211–1219. https://doi.org/10.2193/2007-403.
Finkelstein, D.M. and Wolfe, R.A. (1985). A semiparametric model for regression analysis of interval-censored failure time data. Biometrics 41 (4): 933–945.
Furuya, Y., Wijesundara, D.K., Neeman, T., and Metzger, D.W. (2014). Use and misuse of statistical significance in survival analyses. MBio 5 (2): e00904-14. https://doi.org/10.1128/mBio.00904-14.
Heisey, D.M., Shaffer, T.L., and White, G.C. (2007). The ABCs of nest survival: theory and application from a biostatistical perspective. Studies in Avian Biology 34: 13–33.
Hey, S.P. and Kimmelman, J. (2014). The questionable use of unequal allocation in confirmatory trials. Neurology 82 (1): 77–79. https://doi.org/10.1212/01.wnl.0000438226.10353.1c.
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. (2018). Sample Sizes for Clinical, Laboratory and Epidemiology Studies, 4e. Wiley-Blackwell.
Machin, D., Cheung, Y.B., and Parmar, M. (2006). Survival Analysis: A Practical Approach, 2e. Wiley.
Mantel, N., Bohidar, N.R., and Ciminera, J.L. (1977). Mantel-Haenszel analyses of litter-matched time-to-response data, with modifications for recovery of inter-litter information. Cancer Research 37: 3863–3868.
Moffatt, C.J., Franks, P.J., Oldroyd, M. et al. (1992). Community clinics for leg ulcers and impact on healing. BMJ 305 (6866): 1389–1392. https://doi.org/10.1136/bmj.305.6866.1389.
National Research Council (2011). Guide for the Care and Use of Laboratory Animals, 8e. Washington: The National Academies Press.
Pearl, R. and Miner, J.R. (1935). Experimental studies on the duration of life. XIV. The comparative mortality of certain lower organisms. Quarterly Review of Biology 10 (1): 60–79.
Peto, R., Pike, M.C., Armitage, P. et al. (1977). Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. Analysis and examples. British Journal of Cancer 35 (1): 1–39. https://doi.org/10.1038/bjc.1977.1.
Radke, B.R. (2003). A demonstration of interval-censored survival analysis. Preventive Veterinary Medicine 59: 241–256.
Rotolo, M.L., Sun, Y., Wang, C. et al. (2017). Sampling guidelines for oral fluid-based surveys of group-housed animals. Veterinary Microbiology 209: 20–29.
Senn, S. (2018). Statistical pitfalls of personalized medicine. Nature 563 (7733): 619–621.
Tierney, J.F., Stewart, L.A., Ghersi, D. et al. (2007). Practical methods for incorporating summary time-to-event data into meta-analysis. Trials 8: 16.
Weiss, G.B., Bunce, H., III, and Hokanson, J.A. (1983). Comparing survival of responders and nonresponders after treatment: a potential source of confusion in interpreting cancer clinical trials. Controlled Clinical Trials 4 (1): 43–52. https://doi.org/10.1016/s0197-2456(83)80011-7.
Witte, J. (2002). Sample size calculations for randomized controlled trials. Epidemiologic Reviews 24 (1): 39–53.
19
Comparing Multiple Factors

CHAPTER OUTLINE HEAD
19.1 Introduction 211
19.2 Design Components 212
    19.2.1 Constructing the F-Distribution 212
19.3 Sample Size Determination Methods 213
    19.3.1 Effect Sizes 213
    19.3.2 Non-Centrality Parameter 213
    19.3.3 Mead's Resource Equation 213
    19.3.4 Skeleton ANOVA 214
19.4 Completely Randomised Design, Single Factor 214
    19.4.1 Sample Size Approximations Based on Mean Differences 217
    19.4.2 Sample Size Approximations Based on Number of Levels for a Single Factor 217
    19.4.3 Sample Size Approximations Based on Expected Range of Variation 218
19.5 Randomised Complete Block Design 219
19.6 Factorial Designs 221
19.7 Split-Plot Design 223
19.8 Repeated-Measures (Within-Subject) Designs 225
    19.8.1 Before-After and Crossover Designs 226
    19.8.2 Repeated Measures on Time: Continuous Outcome 227
    19.8.3 Repeated Measures on Time: Proportions Outcome 227
    19.8.4 Repeated Measures in Space: Spatial Autocorrelation 227
19.A Guinea-Pig Data: Sample SAS Code for Calculating Sample Size for a Single-Factor Four-Level (a) Completely Randomised Design; (b) Randomised Complete Block Design 228
    19.A.1 Completely Randomised Design 228
    19.A.2 Randomised Complete Block Design 228
19.B Sample SAS Code for Calculating Sample Size per Group for a Simple Repeated-Measures Design 229
References 229

ANOVA-type analysis is determined by how the


19.1 Introduction experiment is designed before data are collected
Comparative experiments are designed to investi- (Box 19.1). The ANOVA is a linear statistical regres-
gate cause-and-effect relationships between one sion model where the outcome variable is predicted
or more independent variables (treatments) on from one or more quantitative or qualitative catego-
the dependent variables, or response. Typically rical independent variables. The structuring of mul-
these are analysed by analysis of variance tiple independent variables (or factors) according to
(ANOVA) methods. Correct application of an the number of independent factors to be compared

A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
212 A Guide to Sample Size for Animal-based Studies

BOX 19.1 (discussed in Chapter 20). The most appropriate


Comparing Two or More Groups method for calculating sample size will be closely
aligned with the study objectives, the experimental
Key concepts design, and the intended statistical analysis plan
Factor: One independent variable (O’Brien and Muller 1993; Hurlbert and Lombardi
Level: a group within a factor 2004; Simpson 2015; Gamble et al. 2017).
Block: a categorical classifier for a specific nuisance
factor used to minimise between-subject variation
Common designs:
19.2.1 Constructing the F-Distribution
Single-factor completely randomised design Sample size and power calculations are based on the
Randomised complete block F-distributions for the corresponding null and
Factorial alternative distribution of each factor in the design
Split plot (Chapter 14). The total variance of the data is parti-
Crossover tioned into different components that discriminate
Nested or hierarchical design. between the different sources of variation. In the sim-
plest case, there are three components: variation
between experimental units that receive different
and how variation is to be controlled makes up the treatments (between-group), variation between
experimental design. A good design allows discrimi- experimental units receiving the same treatment
nation of the experimental signal from noise caused (within-group), and the residual error variance (or
by extraneous variation (Winer et al. 1991; Box et al. mean square error) that estimates the random varia-
2005; Montgomery 2012). tion between experimental units. The ratio of the
Design and analysis of variance methods were between-group to within-group mean square is the
developed primarily by Sir Ronald Fisher in the F-statistic for testing the hypothesis of equality of
early half of the twentieth century and subsequently group means. The F-value equals one if there is no
extended by George Box and collaborators (Box difference between group means.
et al. 2005). Detailed coverage of experimental
design is beyond the scope of this book. However, A note on ‘groups’. The idea of the ‘group’ as the
an overview of a few basic designs and methods entity subjected to treatments has occasionally
for estimating sample size are provided. The reader resulted in serious misunderstandings of both
is advised to consult Box et al. (2005), Montgomery the role and method of treatment allocation
(2012), Bate and Clark (2014), Lazic (2016), and and randomisation in ANOVA-type designs
Karp and Fry (2021) for more detailed information. (Festing 2014; Reynolds 2022). In a properly
designed experiment, treatments (experimental
or control interventions) are allocated ran-
19.2 Design Components domly to the experimental units. However, a
Fisher defined the design structure as the logical common practice in the preclinical rodent liter-
strategy by which treatments or treatment combi- ature is to refer to experimental ‘groups’ as
nations are assigned to experimental units. The ‘cohorts’. The term cohort actually has a formal
design structure is based on the number and rela- technical meaning. As used in observational
tionship of independent variables, the factors. Each studies, a cohort is a group of subjects with
factor has a restricted number of fixed values (or group membership defined by some common
levels). In the simplest case, the single-factor (or characteristic or trait. When experimental
one-way) design has one factor and two or more groups are incorrectly defined as cohorts, this
levels. A blocking factor is a categorical classifier suggests that interventions have been non-
for a specific nuisance factor. Blocking factors randomly allocated, an invalid and confounded
are incorporated to minimise variation between assignment strategy (Festing 2020). The
experimental units. Other designs include the ran- emphasis on ‘groups’ has also resulted in
domised block design (RCBD), factorial, crossover, neglect of more powerful experimental designs
and split-plot designs, repeated measures (within- more suited to exploratory studies with multi-
subject) designs, and hierarchical or nested designs ple factors.
Comparing Multiple Factors 213

A note on strain or species comparisons. If the on best-available evidence, practical significance,


study design involves comparison of multiple and the proposed statistically based design for
strains, sample size calculations will depend future experiment. Information for preliminary cal-
on the inference goals and the types of biolog- culations can be obtained from previous studies,
ical units in the study. Power to detect differ- pilot data, or exemplary data. Exemplary data are
ences between strains depends on the number essentially mock data, constructed to have the char-
of strains. At a minimum, three strains are acteristics expected for the future sample, such as
recommended to test hypotheses of strain (or means and variances or distributional properties.
species) differences (Garland and Adolph Exemplary data can be determined from previous
1994). Precision of the estimated effect depends similar studies, informed and scientifically justifia-
on the number of animals sampled within each ble guesswork, and/or computer simulations
strain. Environmental effects due to housing, (Goldman and Hillman 1992; O’Brien and Muller
husbandry, handling, cage and cage stocking 1993; Divine et al. 2010). Researchers should not
density, and even source or vendor can result define ‘small’, ‘medium’, and ‘large’ effect sizes
in large between-animal variation in physiolog- based on general benchmark values. These were
ical and behavioural measurements. Variation originally developed for social and psychology stud-
resulting from the effects of such confounders ies and are usually neither appropriate nor relevant
should be controlled by appropriate blocking for animal-based studies (see Chapter 15).
strategies. Large formal analyses with multiple
strains or species should consider phylogenetic
19.3.2 Non-Centrality Parameter
comparative methods (Cooper et al. 2016).
Sample size formulae may not be available for the
more complex study designs. However, sample sizes
19.3 Sample Size can be customised using simulations based on the
Determination Methods non-centrality parameter λ (Chapter 14). Typically,
calculations of λ use values for the minimum biolog-
The investigator should specify beforehand the major ically important mean difference to be detected and
components required for sample size determination an estimate of the pooled variance of the difference.
(Chapter 14). Choice of study design will also depend Other statistics that can be used are summary statis-
on the number and type of factors to be studied, the tics for each group (means and a measure of varia-
number of levels of each factor, and how they tion, SD or SEM), the group means with an estimate
will be spaced; methods for controlling variation of the pooled variance, or summary ANOVA statis-
(blocking, stratification, covariates), and methods tics (mean square error or F-values) derived from
for incorporating time or within-subject dependen- either prior or exemplary data (O’Brien and Muller
cies (repeated measures, longitudinal data). 1993; Castelloe and O’Brien 2001). Sample size
Power is maximised when the number of experi- calculations will require specifying the analysis
mental units n are balanced across the design. Unbal- objectives; that is, if the goal is to determine (i) a
anced sample size allocation results in increased pre-specified biologically important difference
variance (and therefore reduced precision). Unneces- between groups, (ii) the allowable range of variation
sarily large variations can obscure true treatment that enables detection of the target difference, or
effects, especially when effects are small. The effects (iii) the maximum number of groups that can be
of sample size imbalance on variation and effect size tested for a fixed total sample size. Sample SAS
are described in Chapter 14. codes for simulation-based approximations of sam-
ple size are provided in Appendix 19.A.
19.3.1 Effect Sizes
Effect size for ANOVA-type designs are computed 19.3.3 Mead’s Resource Equation
as f, η2, η2p, the coefficient of determination R2, or Although originally developed for agricultural
adjusted R2adj. Many commercially available sample research, Mead’s resource equation (Mead 1988)
size calculation programmes such as G∗Power use has been suggested for laboratory animal studies
estimates of effect size. Effect sizes should be based (Festing 2018). The resource equation is based on
214 A Guide to Sample Size for Animal-based Studies

the idea that the total information of an experiment variance for estimates for each factor and factor inter-
with N experimental units can be determined from actions, the effect of blocking, the magnitude of each
the total variation based on N − 1 degrees of free- relevant residual error term, potential confounding,
dom. In a simple experiment, the resource equation and information about the power of hypothesis tests.
partitions the variation among three sources: the A skeleton ANOVA is especially useful for estimating
treatment component T, the blocking component sample sizes for complex designs. Sample size formu-
B, and the error component E: lae may not even exist, and determine the optimal
allocation of experimental units among multiple
T + B + E = N −1 sources of variation can be challenging. Examples
The treatment degrees of freedom are k, where k of complex designs include split-plot and factorial
is the number of groups. The blocking variable B designs, designs with complex blocking and cluster-
with b − 1 degrees of freedom, is a known categori- ing, and repeated measures on different factors
cal variable used to minimise variation from sources (Brien 1983; Draper and Smith 1991; Brien and Bailey
not of direct interest to the experimenter. The error 2006, 2009, 2010). The planning rule of thumb is that
component is used for testing hypotheses. Mead there should be at least one degree of freedom for all
(1988) recommends a sample size that allows main factors and interactions of interest and 10–20
10–20 degrees of freedom for the error term. degrees of freedom for the residual error term for test-
Mead’s resource equation is most appropriate for ing hypotheses (Mead 1988).
single-factor designs. A further disadvantage is that
the inclusion of a block term seemingly penalises 19.4 Completely
the experiment by reducing degrees of freedom
for the error term. However, this is misleading Randomised Design, Single
because blocking removes the variation associated Factor
with the nuisance variable from the error term
and therefore can contribute substantially to the The single factor or one-way ANOVA consists of
power and precision of the experiment. one factor with k levels or ‘groups’. Experimental
units are independent of each other, and treatments
are randomly allocated to each experimental unit.
Example: Mead’s Resource Equation: The main disadvantage of the completely rando-
Simple Experiment mised design is that different sources of variation
not accounted for by the model will go into the error
An experiment is planned with four treatments
term. This inflates the mean square error, increas-
with five mice per group, for a total of 20 mice.
ing the error variance and reducing power.
Here, N = 20, B = 0, T = 4, so E = 19 − (4 − 1)
There are three variation components for a
= 16, suggesting the sample size is adequate.
single-factor completely randomised design:
If the experiment is to be run in 5 blocks, then
variation between units receiving different treat-
E = 19 − (5 − 1) − (4 − 1) = 12.
ments (between-treatment or between-group), the
variation between experimental units receiving
the same treatment (within-group), and the residual
19.3.4 Skeleton ANOVA error term (mean square error MSE) that estimates
The skeleton ANOVA can be thought of as an exten- the random variation between experimental units.
sion of the resource equation. The skeleton ANOVA The ratio of the between-group to within-group
table lists the sources of variation and correspond- mean square is the F-statistic (Table 19.1).
ing degrees of freedom for a specific experimental For a balanced design with equal sample sizes for
design (Table 19.1). Candidate values for the number k levels (commonly referred to as ‘groups’), the min-
of factors, levels of each factor, and number of exper- imal sample size per group n is
imental units can be selected, and corresponding
degrees of freedom calculated from those values. 2 F 1 − α,k − 1,k n − 1 MSE
The skeleton analysis of variance is therefore a n≥ 2
relatively simple method of evaluating the likely d
Table 19.1: Skeleton ANOVA for common study designs. N is the total number of experimental units (EUs), n is the sample size per group (number of replications of EUs), and σ²ε is the error or residual variance. Each entry gives the source of variation and its degrees of freedom.

A. Single-factor (one-way) ANOVA, completely randomised design
  Due to 'treatment' (between k groups): k − 1
  Residual error (estimates σ²ε): k(n − 1) = N − k
  Total (corrected): kn − 1 = N − 1
  Total: N

B. Single-factor randomised complete block
  Due to 'treatment' (k groups): k − 1
  Due to blocks (b blocks): b − 1
  Group × block interaction = experimental error: (k − 1)(b − 1)
  Within-group variation = sampling error (MSE; estimates σ²ε): kb(n − 1)
  Total (corrected): kbn − 1 = N − 1

C. Two-factor (two-way) ANOVA; 2 × 2 factorial
  Due to factor A: a − 1
  Due to factor B: b − 1
  A × B interaction: (a − 1)(b − 1)
  Error (estimates σ²ε): ab(n − 1)
  Total (corrected): abn − 1 = N − 1

D. Crossover design
  Due to 'treatment' (A versus B): k − 1
  Period: p − 1
  Sequence (AB versus BA): s − 1
  Subject (sequence): n(t − 1)
  Within-group variation: n − 1
  Error: (k − 1)(nk − 2)
  Total (corrected): nsk − 1 = N − 1

E. Hierarchical (nested) design, B within A
  Due to 'treatment' A (between treatments): a − 1
  B nested within A, B(A): a(b − 1)
  Error: ab(n − 1)
  Total (corrected): abn − 1 = N − 1

F. Single factor A with repeated measures on factor time
  A: a − 1
  Time: t − 1
  Subjects (A): a(n − 1)
  Within-subjects: an(t − 1)
  Total (corrected): atn − 1 = N − 1

G. Two-factor design on factors A and B, with EUs randomised to A and repeated measures on B, such that subjects are nested within A: S(A)
  Between treatments
    A: a − 1
    B: b − 1
    A × B: (a − 1)(b − 1)
    Error: ab − 1
  Within treatments
    S(A): n − 1
    B × S(A): (n − 1)(b − 1)
    Error: ab(n − 1)
  Total (corrected): abn − 1 = N − 1

H. Split-plot design
  Whole 'plot' (effects)
    Replicate on A: r − 1
    Whole plot A: a − 1
    Whole plot error: (r − 1)(a − 1)
  Subplot (effects)
    Subplot B: b − 1
    Subplot B × main A: (a − 1)(b − 1)
    Subplot error: rabn − a(r + b − 1)
where d is the mean difference to be detected, MSE is the mean square error (the estimate of the residual variance), and F₁₋α,ν₁,ν₂ is the critical F-value with confidence (1 − α), with ν₁ = (k − 1) between-treatment degrees of freedom and ν₂ = k(n − 1) within-group degrees of freedom. The effect size is d/√MSE. The non-centrality parameter is

\[ \lambda = \frac{n\, d^2}{2\,\mathrm{MSE}} \]

For two groups with group size n, the total sample size is 2n. For multiple treatment levels k, the total sample size is kn.

The following examples show how to calculate sample size from the non-centrality parameter by iterating over a range of candidate n (Appendix 19.A). Pre-specified values for the mean difference, variance, and confidence are entered, and the corresponding non-centrality parameter and power are calculated for each n. The appropriate sample size is selected by matching n to that which most closely corresponds to the pre-specified power (for example, 80%).
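The iteration is simple to script. The following minimal SAS sketch is a condensed version of the Appendix 19.A code; the values assigned to meandiff and sigma2 are placeholders to be replaced by the pre-specified difference and the expected MSE.

*minimal sketch: power of a one-way ANOVA over candidate n;
data onewaypow;
  k = 4;             *number of groups;
  meandiff = 2.5;    *difference to be detected (placeholder);
  sigma2 = 4.5;      *expected mean square error (placeholder);
  alpha = 0.05;
  do n = 2 to 20;
    ncp = (n*meandiff**2)/(2*sigma2);             *non-centrality parameter;
    F_crit = finv(1-alpha, k-1, k*(n-1), 0);      *critical F under the null;
    power = 1 - probf(F_crit, k-1, k*(n-1), ncp); *power under the alternative;
    output;
  end;
run;
proc print data=onewaypow; var n ncp power; run;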

Example: Guinea-Pig Weight Gain

(Data from Zar 2010.) A small trial was conducted to evaluate the effect of four different diets on weight gain (g) of guinea pigs. The factor was diet, with four levels (A, B, C, D). Diets were randomly assigned to each of 20 guinea-pigs. There were five guinea-pigs per diet group (n = 5/group). The raw data are:

A: 7.0, 9.9, 8.5, 5.1, 10.3
B: 5.3, 5.7, 4.7, 3.5, 7.7
C: 4.9, 7.6, 5.5, 2.8, 8.4
D: 8.8, 8.9, 8.1, 3.3, 9.1

The summary statistics (mean, standard deviation) are: A: 8.2 (2.1); B: 5.4 (1.5); C: 5.8 (2.2); D: 7.6 (2.5).

The ANOVA statistics are:

Source of variation   df   SS       MS      F       P-value
Between groups         3   27.426    9.142   2.034   0.15
Within groups         16   71.924    4.495
Total                 19   99.350

The mean square error MSE is 4.5 and R² = 0.28. The largest difference in weight gain was observed between diets A and B. The difference was 2.8 g, with 95% confidence limits of −0.06 to 5.62 g. The precision of the difference estimate is [5.62 − (−0.06)]/2 = 2.84 g.

19.4.1 Sample Size Approximations Based on Mean Differences

A new study is planned where it is desired to detect a pre-specified mean difference of 2.5 g or 5.0 g between at least two of the four diet groups. What is the minimum sample size that will ensure a power of at least 80% for α = 0.05 to detect a difference of at least 2.5 g? A difference of 5.0 g? (These differences correspond to approximate effect sizes of 1 and 2, respectively.)

Figure 19.1 shows the power of a single-factor, four-level completely randomised design over a range of sample sizes and two effect sizes: 2.5 g and 5.0 g. For a mean difference of 2.5 g with four groups, 17 guinea-pigs per group (N = 68) would be required for a realised power of 0.81 (λ = 11.81). For a mean difference of 5.0 g, a sample size of 5–6 per group (total sample size N = 20–24) would be required for a realised power of 0.80–0.89 (λ = 13.47). Sample size per group becomes smaller as the size of the effect to be detected increases.

Figure 19.1: Guinea-pig growth: power of a single-factor four-level completely randomised design over a range of sample sizes for two effect sizes (maximum mean difference between two diets). Vertical dotted lines show sample size for a target power of 80%. Horizontal dotted lines indicate power of 80% and 90%.

19.4.2 Sample Size Approximations Based on Number of Levels for a Single Factor

The number of levels of a single factor ('groups') that can be realistically tested for a given power can be determined by iteration over a plausible range of experiment sizes. For the guinea pig example, group sample sizes were estimated for experiments consisting of 2–6 groups and a mean difference of 2.5 g, with power of 0.8 and confidence of 95% (Figure 19.2). For a two-group (k = 2)
experiment, a sample size n of 12–13 per group is required, for a total of 24–26 animals. For 6 groups (k = 6), the sample size required is 19–20 per group, for a total of 114–120 animals. Sample sizes per group become smaller as the number of groups is reduced.

Figure 19.2: Guinea-pig growth: sample size per group in a single-factor design for two to six levels (or groups). Increasing the number of groups to be compared greatly increases the sample size required to obtain a given power.

19.4.3 Sample Size Approximations Based on Expected Range of Variation

Sample size and expected power can be estimated for a range of variances by substituting estimates for the grand mean and the mean square error into the minimal difference term. For the guinea pig data, the grand mean is

\[ \bar{Y} = (8.16 + 5.38 + 5.84 + 7.64)/4 = 6.8\ \mathrm{g} \]

The non-centrality parameter λ is

\[ \lambda = \frac{n \sum_{i=1}^{k} a_i^2}{\mathrm{MSE}} \]

where \( \sum_{i=1}^{k} a_i^2 = \sum_{i=1}^{k} (\bar{y}_i - \bar{Y})^2 \) is the sum of the squared corrected group means, and MSE is the mean square error. For the guinea-pig data, the non-centrality parameter is:

\[ \lambda = \frac{n \left[ (8.16-6.8)^2 + (5.38-6.8)^2 + (5.84-6.8)^2 + (7.64-6.8)^2 \right]}{\mathrm{MSE}} \]
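Carrying the arithmetic through (my evaluation of the expression above, using the group means and the MSE of 4.495 from the ANOVA table):

\[ \lambda = \frac{n\,(1.8496 + 2.0164 + 0.9216 + 0.7056)}{4.495} = \frac{5.4932\,n}{4.495} \approx 1.22\,n \]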

In this example, sample sizes were approximated over a range of MSE from 0.5 to 9.5 (Figure 19.3). These values represent effect sizes of approximately 0.55 (small) to over 10 (implausibly large). A sample size of 4 per group might be adequate if the variance is less than ~20% of the mean difference. However, the required sample size per group increases considerably with increasing variance. Even a sample size of 10 per group will not provide sufficient power to detect the target difference of 2.5 g if the variance is much greater than 2.5.

Figure 19.3: Sample size and variance. Increasing error variance greatly increases the sample size required to obtain a given power.

In the presence of large variation, the number of animals required to obtain a desired power of at least 80% will be unacceptably large, or implausibly large effect sizes will be required (Figure 19.4). Study refinement to minimise the effects of between-animal variation, and alternative designs, should be considered. Candidate designs include randomised complete block and factorial designs.

Figure 19.4: Sample size, effect size and power. (The underlying plot shows effect size against power for group sizes n = 2 to 10.)
19.5 Randomised Complete Block Design

The randomised complete block design allocates experimental units according to a predefined characteristic (the blocking variable) into homogeneous groups (blocks), after which treatments are randomly allocated to units within each block. The randomised complete block design is more powerful than a completely randomised design. Blocking improves power and precision by controlling and minimising biological variation, thereby improving the detection of the treatment signal. An added advantage of this class of designs is that the systematic incorporation of blocking ('planned heterogeneity') substantially improves the reliability, validity, and reproducibility of research results (Voelkl et al. 2018; Würbel et al. 2020).

The block is a group of experimental units expected to be more homogeneous than others, based on their classification by a given nuisance factor. The block term is identified by a categorical classifier; it is not a new factor or treatment. Nuisance or blocking variables can be properties of the animals (age, litter, cage or tank), experimental conditions (cage location, time of day, week of study entry, specific technical staff performing the experiment, donor), and experimental location (different laboratories, different sites). Inappropriate specification of the experimental unit and block results in artificial inflation of the true sample size, or pseudo-replication (Figure 19.5).

Figure 19.5: Blocking and pseudo-replication. (a) Blocking. There are eight mice housed four to a cage. Two mice in each cage receive drug A, and two receive drug B. The cage is the block, and the individual mouse is both the experimental and the biological unit. Both levels (A, B) of the treatment (drug) are represented in each block. The total sample size is eight mice, with four mice per drug. (b) False block (pseudo-replication). There are eight mice housed four to a cage. All four mice in one cage receive drug A, and all mice in the second cage receive drug B. Although individual mice are biological units, the experimental unit is now the cage. The total sample size is now two (not eight). There will be zero degrees of freedom to test treatment effects. Analysing the data from the eight mice as if they are independent experimental units artificially inflates the true sample size.

In a balanced block design, every level of the intervention or treatment factor occurs the same number of times within each level of the block. Treatments are randomly assigned to the experimental units within each block. If there are k treatments, then there should be at least k experimental units in each block. Computer-generated block designs can be derived in SAS (e.g. proc plan and proc factex) and in R (e.g. 'blocksdesign'); a minimal sketch follows at the end of this passage.

The skeleton ANOVA for a randomised complete block design with a single factor (Table 19.1) accounts for two sources of error variation. The experimental error term quantifies the random variation between experimental units and is calculated from the interaction between block and treatment. Within-group variation (the mean square error) is due to replication on the blocks and is a measure of sampling error.
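As the sketch promised above, here is one way to generate a randomised complete block allocation with proc plan. This is my own construction, not taken from the text; the seed and the factor names block, unit, and diet are arbitrary choices.

*minimal sketch: randomised complete block allocation with proc plan;
proc plan seed=20240101;
  factors block=5 ordered unit=4;   *5 blocks of 4 experimental units;
  treatments diet=4 random;         *4 treatments randomised within each block;
  output out=rcbd;
run;
proc print data=rcbd; run;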

Method

1. Define the blocking variable so that homogeneous experimental units are grouped together.
2. Randomise treatments allocated to each unit or animal within each block. Ideally, ensure that all treatments are represented equally in each block.
3. Include the (treatment × block) interaction in the analysis to assess the effect of the block as a source of variation.

Example: Guinea-Pig Growth: Randomised Complete Block Design

The 20 animals in the study were actually measured in 'batches' over five weeks. A batch consisted of one complete replicate of all four diets. The blocking variable is the week of study entry. Each block contained four animals, with one of the four diets randomly allocated to each guinea-pig in each block.

Guinea-pig weight gain (g):

Diet   Block 1   Block 2   Block 3   Block 4   Block 5
A        7.0       9.9       8.5       5.1      10.3
B        5.3       5.7       4.7       3.5       7.7
C        4.9       7.6       5.5       2.8       8.4
D        8.8       8.9       8.1       3.3       9.1

The ANOVA table is:

Source of variation     df   SS       MS       F        P-value
Diet (between groups)    3   27.426    9.142   11.825    0.0007
Block                    4   62.647   15.662   20.259   <0.0001
Within groups           12    9.277    0.773
Total                   19   99.350

Blocking was clearly effective in controlling for variation between animals. Compared to the completely randomised design (where the effect of blocking was ignored), diet was now flagged up as statistically significant. Although the block effect required (5 − 1) = 4 degrees of freedom (with correspondingly fewer degrees of freedom for the error term), the mean square error is 0.773, a seven-fold reduction in variation. The largest difference between means, for diets A versus B, is 2.8 g, with a 95% confidence interval of (1.6, 4.0). The precision of the estimate of the mean difference is therefore 2.4/2 = 1.2. The R² is 0.91, indicating considerable improvement in model fit.

For a sample size calculation based on this revised estimate of variance, only 4 guinea-pigs per group (total sample size N = 16) are required to detect a mean difference of 2.5 g for a realised power of 0.827, and 5 per group (N = 20) for a realised power of 0.93.

Example: Experiment to be Conducted over Several Days

An investigator wished to conduct a single-factor three-treatment experiment on mice, with two experimental drugs and one control. The experimental unit is the individual animal. A power analysis indicated the total required sample size is 75, with n = 25 mice per group. However, only 15 animals can be processed in one day. Therefore, the experiment will have to be conducted over five days.

One solution is to set up the study design as a randomised complete block design, with day as the blocking variable. The experiment is operationalised by randomly assigning animals to one of five blocks (block = run day) so that there are 15 animals for each block. Then treatments are randomly allocated to each animal within block, so that all treatments are represented equally in each block. In this example, there will be 5 animals per treatment group on each day.

Multi-batch designs are a type of randomised controlled block experiment in which the entire experiment is replicated, and systematic heterogeneity is accounted for by replicate × treatment interactions. Heterogeneity factors can be extended from different time periods within the laboratory to different experimenters and different laboratories (Karp et al. 2020).
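The block-adjusted powers quoted in the guinea-pig example can be checked with the same non-centrality approach as before. The sketch below is my own, under the assumption that λ = b·d²/(2·MSE) with error degrees of freedom (k − 1)(b − 1), where b blocks supply the replication; with these assumptions it reproduces powers of roughly 0.83 for b = 4 and 0.93 for b = 5.

*minimal sketch: power for a randomised complete block design;
data rcbpow;
  k = 4;            *treatments;
  meandiff = 2.5;
  sigma2 = 0.773;   *block-adjusted MSE from the example;
  alpha = 0.05;
  do b = 3 to 8;    *b = blocks = replicates per treatment;
    ncp = (b*meandiff**2)/(2*sigma2);
    dendf = (k-1)*(b-1);
    F_crit = finv(1-alpha, k-1, dendf, 0);
    power = 1 - probf(F_crit, k-1, dendf, ncp);
    output;
  end;
run;
proc print data=rcbpow; var b ncp power; run;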

Example: Mouse Oncology: Blocking on Donor

An experiment was proposed to test the effect of three candidate drugs on a mouse model of cancer, where recipient mice are inoculated with cells from donor mice. One donor mouse could supply enough cells for three recipients. The study was originally powered with a total sample size of 30 recipient mice and 10 donor mice, for a total sample size of 40. However, the investigator was concerned about variation between donors affecting results, and proposed doubling the original sample size 'to account for the variation'.

The study can instead be designed as a randomised block, with donor mouse as the block. Each drug treatment A, B, and C can be represented once in each block. The number of blocks required is then 30/3 = 10 donor mice. Donor mice are randomly allocated to 10 blocks, each consisting of three recipient mice. The drugs A, B, and C are then randomly allocated to the recipient mice in each block. The experiment is effectively replicated ten times without increasing the total number of mice, and donor variation is accounted for by including the (donor × treatment) interaction in the analysis.

19.6 Factorial Designs

Many laboratory animal-based studies are essentially exploratory in nature. The goal is usually to assess several or many factors of potential interest, with the objective of identifying the most promising and screening out those that appear less promising. Factorial designs are ideal for this type of experiment, as they are nimble, efficient, and well-suited to detecting interactions between multiple factors, which result in most 'discovery by chance' (Box et al. 2005). For most practical purposes, results are easily quantified by second-degree least-squares polynomial regression, with the effect sizes of all main effects and two-way interactions quantified by the respective regression coefficients:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_{12} X_1 X_2 + \ldots + e_{ij} \]

A major advantage of factorial designs is that, because they require relatively few runs per factor, information is maximised for relatively small sample sizes. This class of designs is extremely sparing of animal use (Bate and Clark 2014; Karp and Fry 2021). Factorial designs can consist of any combination of factors and numbers of levels (Table 19.1). The number of possible treatment combinations is found by multiplying together the number of levels of each of the k factors. The fundamental workhorse design (and the basis for screening designs) is the 2^k factorial, with k factors and two levels for each factor. For example, a 2³ full factorial design consists of three factors, each at two levels, for eight possible treatment combinations.

Example: Factorial Designs and Number of Treatment Combinations

A study was proposed to compare the immune response of three strains of mice raised at two temperatures and administered one of 4 drugs (3 test compounds, 1 control). This is a 3 × 2 × 4 factorial with 24 treatment combinations.

For simplicity and ease of interpretation, the number of levels for a quantitative factor is deliberately restricted to two or three values: a low value and a high value that bracket the expected range of possible response, and an intermediate, or centre, value that allows testing for curvature in the response. These qualities make factorial designs far superior to conventional one-factor-at-a-time (OFAT) two-group comparisons (Czitrom 1999). Screening designs for factor reduction ('screening out the trivial many from the significant few') consist of 2^k factorial designs, used when a relatively large number of candidate inputs or interventions thought to affect the response must be reduced to the few that are more promising or important, based on the main effects and interactions with the largest effect sizes. Subsequent experiments can then be designed to optimise responses based on the specific factors identified in the preceding phases (Chapter 6; Montgomery 2012; Collins 2018). Multiple factors can be
studied in fractional factorial designs when the cost and size of full factorials become prohibitive (Box et al. 2005), or if certain combinations of factor levels are too toxic or otherwise logistically unacceptable (Collins et al. 2009). For genuine replication (Box et al. 2005), and so that statistical inference is valid, each run combination is randomly allocated to an experimental unit, and run order is randomised over all treatment combinations and replicate runs.

Example: Weight Gain in Mice on Antibiotics

(Adapted from Snedecor and Cochran 1980.) An experiment was planned to measure weight gain of newly weaned mice given an antibiotic at a high dose ('high' = 1), an intermediate dose (intermediate = 0), or vehicle control ('low' = −1), together with vitamin B12 at corresponding high, intermediate, and control levels. This is a 3 × 3 factorial design with nine treatment combinations:

Run   Antibiotic   Vitamin B12
1         1            1
2         1            0
3         1           −1
4         0            1
5         0            0
6         0           −1
7        −1            1
8        −1            0
9        −1           −1

One complete replicate consists of nine mice, with each treatment combination (run) assigned to one mouse. To obtain an estimate of the experimental variance, the design could be replicated; two replicates require 18 mice. For genuine replication (Box et al. 2005), so that statistical inference is valid, run combination is randomly allocated to each mouse, and the order of all 18 runs is randomised.

Power and sample size for factorial designs. In a factorial design, a replicate is an independent repeat run of each combination of treatment factors (Montgomery 2012). Therefore, rather than thinking of treatments as randomised to 'groups' of animals and basing sample size on the number per group, it is more constructive to evaluate sample size for a factorial design as the number of replicate runs required to estimate the error variance. In a balanced factorial design, power to detect the main effect of a factor is determined by the sample size per level of each factor, not the sample size of the 'group'. This means that in a two-factor design, each animal does double duty by contributing to the main effect estimates for both factors at the same time. That is, when estimating the main effect for each factor, two means are compared based on m = 2n observations from n experimental units. The minimum number of replicates per treatment combination is two.

The effect size ES is obtained from the regression coefficient for each main effect, standardised by the expected standard deviation of the response variable Y:

\[ ES = \beta / SD_Y \]

Sample size is determined by the smallest effect size for a main effect to be detected at a pre-specified level of power (Collins 2018). In a 2 × 2 factorial, there are four means to be estimated, based on m observations. Machin et al. (2018) recommend estimating the total sample size N for each factor separately, assuming no interaction, and selecting the larger N. If the sample size estimates are very different, factors should be prioritised based on the study objectives, and the study powered with respect to the most important factor.

Non-centrality parameter. For a 2 × 2 factorial, three non-centrality parameters λ are calculated, one for each main effect and one for the interaction term. The values for each λ are calculated from a, the number of levels of factor A; b, the number of levels of factor B; τ(A)i, the mean difference of factor A at level i; τ(B)j, the mean difference of factor B at level j; τ(AB)ij, the interaction effect of factor A at level i and factor B at level j; and the mean square error MSE (σ²).

Factor A effects:

\[ \lambda_A = \frac{b\,n \sum_{i=1}^{a} \tau(A)_i^2}{\mathrm{MSE}} \]

Factor B effects:

\[ \lambda_B = \frac{a\,n \sum_{j=1}^{b} \tau(B)_j^2}{\mathrm{MSE}} \]

Factor A × B interaction effects:

\[ \lambda_{AB} = \frac{n \sum_{i=1}^{a} \sum_{j=1}^{b} \tau(AB)_{ij}^2}{\mathrm{MSE}} \]
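These quantities translate directly into a power calculation. The following minimal SAS sketch is my own construction, mirroring the style of the Appendix 19.A code; tau and mse are placeholders for the interaction effect to be detected and the expected residual variance. It computes power for the interaction effect in a 2 × 2 factorial, assuming a constant interaction effect of size tau in each of the a × b cells.

*minimal sketch: power for the A x B interaction in a 2 x 2 factorial;
data fctpow;
  a = 2; b = 2;
  tau = 2.5;      *interaction effect to be detected (placeholder);
  mse = 42.2;     *expected residual variance (placeholder);
  alpha = 0.05;
  do n = 2 to 20;                *replicates per treatment combination;
    lambda = n*a*b*tau**2/mse;   *non-centrality parameter;
    numdf = (a-1)*(b-1);
    dendf = a*b*(n-1);
    F_crit = finv(1-alpha, numdf, dendf, 0);
    power = 1 - probf(F_crit, numdf, dendf, lambda);
    output;
  end;
run;
proc print data=fctpow; var n lambda power; run;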

Definitive studies based on factorial designs will require formal simulation-based power analyses, especially if interaction effects are of primary interest (Lakens and Caldwell 2021). If the interaction is approximately the same size as the main effect, sample sizes will have to be increased four-fold to achieve sufficient power, and will be up to 16 times larger if the interaction is only half the size of the main effect.

Example: Effect of Diet on Mouse Tumorigenesis

(Data from Sødring et al. 2015.) Researchers examined the effects of dietary hemin and nitrite on intestinal tumorigenesis in 80 A/J min/+ mice. The design was a 2 × 2 factorial on the factors hemin (yes = 1, no = 0) and nitrite (yes = 1, no = 0), for four treatment combinations. Mice were randomly assigned to the four diet groups, and after eight weeks on each diet, tumorigenesis was quantified as tumour 'load', measured as total lesion area (mm²). Summary statistics were:

Treatment group   Hemin   Nitrite   n    Mean lesion area (mm²)   SD
1                   0        1      20           9.7              3.6
2                   1        0      21          14.5              9.0
3                   1        1      20          13.1              7.3
4                   0        0      19          10.9              4.5

When analysed as a single-factor ANOVA with four groups, the treatment effect was not statistically significant (p = 0.096). Examination of summary data plots and the standard deviations for the group means suggests that one or more outlying values occur in the hemin-only and hemin + nitrite groups. The residual variance (MSE) was 42.2.

Re-analysis as a 2 × 2 factorial on the factors hemin, nitrite, and the (hemin × nitrite) interaction indicates a statistically significant effect of hemin only (F = 5.67, p = 0.02). The effects of nitrite alone and of the (hemin × nitrite) interaction were not statistically significant.

Suppose a new study was planned to further examine the (hemin × nitrite) interaction, powered to detect an interaction effect of approximately 2.5 mm², with a = b = 2 levels. The non-centrality value is:

\[ \lambda_{AB} = \frac{n\,(2 \times 2 \times 2.5^2)}{42.2} = 0.592\,n \]

Figure 19.6 shows the power curves for estimated sample size per group. The study would require 14 mice per level, for a total of 56. By reducing residual variation by approximately 50% (for example, by identifying and removing the causes of the large outliers observed in the prior study), the revised sample size is 7 mice per level, for a total of 28.

Figure 19.6: Sample size estimates for a factorial design, shown for residual variances s² = 42.2 and s² = 21. Reducing residual variance minimises sample size.

19.7 Split-Plot Design

Split-plot designs contain two sizes of experimental unit: the main unit (whole plot) and the subunits. Each main unit contains a number of subunits.

One factor A is randomly assigned to each main unit, and a second factor B is randomly assigned to each subunit (subplot). There are two error terms. The mean square error for the whole-plot effect for factor A is larger than the mean square for either factor B or the (A × B) interaction term. Split-plot designs are extremely useful for exploring multiple factors when (1) it is more difficult to change the level of one factor compared to the others, (2) one factor requires more replicates or experimental units than the others, or (3) some feature of the experimental conditions precludes assignment of treatment combinations to all experimental units (Kowalski and Potcner 2003). The disadvantages of split plots are the increased complexity of the design and subsequent analyses, and some loss of precision for the main-unit treatment comparison (Snedecor and Cochran 1980). Kowalski and Potcner (2003) describe three scenarios where split-plot designs are superior to a completely randomised or factorial design. The skeleton ANOVA (Table 19.1) provides a relatively simple method for preliminary assessment of experimental unit allocation across factors.

Example: Murine Irradiation Study: Skeleton ANOVA for a Split-Plot Design

An experiment was planned to test biomarker levels of mice irradiated at one of three sub-lethal levels of radiation, then administered one of four drugs thought to act as anti-apoptosis mitigators. Drugs were to be randomly administered to individually marked mice. Batches of 16 mice (4 per drug) were to be irradiated at each radiation level. However, use of the device was very expensive and changing radiation level was difficult, so settings could be changed only once per day. The researcher intended to run all drug treatment combinations at each radiation level over three consecutive days. Are experimental units allocated appropriately?

This is a split-plot design, with radiation level as the whole-plot factor and drug as the subplot factor. There are a = 3 levels of radiation intensity and r = 1 replicate. At the subplot level, there are b = 4 drugs and n = 4 mice per drug.

The skeleton ANOVA constructed from the proposed plan (Table 19.2) shows there are zero error degrees of freedom for testing the effect of radiation level, and therefore no statistical test is possible for this factor. If the effects of radiation are of primary interest, this setup would be a waste of resources and money. If resources and budget allow, each radiation level (a = 3) could be replicated so that there are two runs at each radiation level (r = 2). This setup provides (3 − 1)(2 − 1) = 2 degrees of freedom for the error term. If the effects of drug are of higher priority to the study objectives than radiation level, the investigator could consider choosing a single fixed radiation level and testing several replicates of drug effects.

Table 19.2: Skeleton ANOVA for a proposed split-plot design with a = 3 levels of radiation intensity, r = 1 replication of radiation at the whole-plot level, b = 4 drugs, and n = 4 mice per drug. Zero degrees of freedom for the radiation effect indicate that the design must be revised if radiation effects are of interest.

Whole 'plot' (radiation effects)
  'Plot' (replicate on radiation): r − 1 = 1 − 1 = 0
  Radiation level (whole-plot treatment): a − 1 = 3 − 1 = 2
  Between radiation levels (whole-plot error): (r − 1)(a − 1) = 0 × 2 = 0
Sub 'plot' (drug treatment effects)
  Drug (subplot): b − 1 = 4 − 1 = 3
  Drug × radiation level (subplot × main): (a − 1)(b − 1) = 2 × 3 = 6
  Between mice (subplot error): rabn − a(r + b − 1) = (1 × 3 × 4 × 4) − 3(1 + 4 − 1) = 36

19.8 Repeated-Measures (Within-Subject) Designs

Repeated-measures designs consist of observations made on the same subject, so that observations are correlated (Box 19.2). Types of repeated-measures designs include before-after and crossover designs (with paired observations on the same subject), repeated measures on treatment (different treatments randomly applied to the same subject), repeated measures on time (observations on the same subject are obtained at multiple time points), and spatial autocorrelation designs (observations are correlated over space rather than time). Sample size calculations for repeated-measures data require information on the number of repeated observations per experimental unit and an estimate of the correlation among the repeated observations (Diggle et al. 2002).

BOX 19.2 Repeated-Measures Designs

1. Paired measurements: before-after and crossover designs
2. Repeated measures on treatment: treatments randomly applied to the same subject
3. Repeated measures on time (longitudinal designs): measurements taken on the same subject at two or more time points
4. Spatial autocorrelation: measurements are correlated in space rather than time

Observations are correlated when pairs of observations close to each other are more similar than pairs of observations that are more widely separated. Recall that for a series of n independent observations with mean Ȳ and var(Ȳ) = s²/n, the large-sample normal approximation for sample size is:

\[ N \ge z_{1-\alpha/2}^2 \, (s/d)^2 \]

However, if there is a correlation r between observations made at t time points, then

\[ N \ge z_{1-\alpha/2}^2 \, (s/d)^2 \, \frac{1 + r(t-1)}{t} \]

When comparing the overall difference in response between two or more groups, the group means are averaged over time. The total sample size N increases as the correlation r increases. This is because the average group effect size is a between-subjects comparison of averages, and the variance of the estimate increases with r. However, when comparing group differences across time (the time × intervention interaction), total sample size N decreases as r increases, because the contribution of individual variance to the estimate of the rate of change in response is reduced (Diggle et al. 2002).

Repeated-measures designs can test one or more of three hypotheses: overall group differences (test on main effects), the overall trend over time (time effect), and differences between groups over time (time × intervention interaction) (Winer et al. 1991). Therefore, three sources of variation must be accounted for:

Between-subject variation (random effects): the variation between animals due to the characteristics of the animals in the sample.

Within-subject variation: the variation within a single subject resulting from the effects of time and measurement error. Time dependencies result when measurements are obtained on the same subject over time. Measurements taken at short time intervals will be more highly correlated than those taken further apart.

Interaction effects: the variation attributable to treatment differences across time (time × intervention interaction).

Sample size calculations for repeated-measures designs should include preliminary estimates of the variance and the expected correlation structure for the repeated measurements. The four most common correlation patterns (Guo et al. 2013) are:

1. zero correlation, if observations are independent
2. compound symmetry, if any two sequential observations are equally correlated
3. autoregressive, when time points are equally spaced and correlation decreases with increasing distance between observations
4. unstructured, if correlation is present but without any specific pattern
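To make the correlation adjustment concrete, here is an illustrative calculation with made-up inputs (my numbers, for orientation only): with s/d = 2, t = 4 time points, r = 0.5, and α = 0.05,

\[ N \ge 1.96^2 \times 2^2 \times \frac{1 + 0.5(4-1)}{4} = 3.84 \times 4 \times 0.625 \approx 9.6, \]

so roughly 10 subjects in total, whereas the same inputs with independent observations (r = 0) would give N ≈ 4.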

Choice of a specific correlation structure must be made with care, as the wrong structure, or a structure that is too simple, will inflate Type I error and increase false positives. The conventional Pearson correlation coefficient does not account for the within-subject correlation structure of responses, so it is unsuited for estimating correlation for a repeated-measures study. More appropriate preliminary estimates can be obtained by calculating the correlation based on subject averages of the two measurements (Bland and Altman 1994, 1995). Irimata et al. (2018) describe several alternative methods for estimating correlation for repeated-measures designs. The best is to obtain the relevant variance components from a mixed-model analysis of variance on prior or exemplary data (Castelloe and O'Brien 2001; Hamlett et al. 2004; Irimata et al. 2018). Mixed-model analyses provide the most reliable (and therefore smallest) estimates of total sample size (Hedeker et al. 1999; Guo et al. 2013; Irimata et al. 2018; Shan et al. 2020).

19.8.1 Before-After and Crossover Designs

In these designs, the 'groups' consist of observations paired on the individual subject or experimental unit. Because each subject serves as its own control, between-subject variation is eliminated, and precision is greatly increased. The advantage of these designs is that the same level of statistical power can be achieved with fewer subjects than would be the case if observations were obtained from independent, parallel-arm groups. Disadvantages include carry-over effects (when treatment effects from the first test period persist into the second) and missing data from subject dropout (Senn 2002).

A before-after design consists of two sets of observations made in a fixed sequence. The first set of observations is made on each subject at baseline, before application of the experimental intervention, and the second set after the intervention.

A crossover design has subjects randomly allocated to one of two sequences of two (or more) treatments given consecutively. The sequence order AB or BA is randomised to subject. Subjects allocated to AB receive treatment A first, followed by treatment B, and subjects allocated to BA receive the treatments in reverse order. The great advantage of the crossover design is that two or more factors can be investigated on the same subjects, resulting in a considerable reduction in animal numbers. However, the disadvantage of crossover trials is the potential for carry-over effects: measurements obtained for the second intervention may be affected by the previous intervention, resulting in interaction of the two treatments. The study must be designed so that the washout period between treatment applications is sufficiently long to minimise carry-over effects. Senn (2002) argues against multi-stage testing for carry-over effects, as was recommended in the past.

Sample size in an AB/BA crossover trial can be estimated by using power calculation methods for a paired t-test, two-sample t-test, or equivalence test, depending on the hypothesis to be tested. In SAS, the power analysis for the paired-sample t-test can be performed with proc power; it requires an estimate of the within-subject correlation r. In R, the power analysis can be performed with the pwr.t.test function.

Example: Crossover Trial: Heart Rate Elevation in Cats

(Data from Griffin et al. 2021.) Heart rates of 21 healthy adult cats presenting for routine wellness exams were measured in a randomised two-period, two-treatment crossover trial to assess anxiety levels as a result of exam location. The two locations were an isolated exam room with the owner present (A) or a treatment room without the owner (B). Data from the trial indicated a reduction in heart rate of approximately 30 bpm when cats were examined with the owner present. How many cats are required for a new study if it is desired to detect a clinically relevant difference of 25 bpm from the baseline heart rate of 185 (SD 34) bpm, with a confidence of 95% and power of 80%?

The correlation r was estimated from the variance components of the original data, using a mixed-model analysis of variance and equal correlation (compound symmetry). The subject variance was 543.59 and the residual variance was 343.98, so the correlation was 543.59/(543.59 + 343.98) = 0.6.
Comparing Multiple Factors 227

The sample size for the new study could then be approximated using the formula for computing power for a paired-means t-test, with a baseline mean heart rate of 185 bpm, a mean difference of 25 bpm, and a standard deviation of 34. The new study would require approximately 15 cats.
Appendix 19.B.
19.8.2 Repeated Measures on Time: Continuous Outcome

If the mean difference d between levels can be assumed to be constant over time, then the sample size per group n to estimate the time-averaged response is

\[ n = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2 \, [1 + r(t-1)]\, s^2}{t\, d^2} \]

where r is the correlation of the repeated measures, s² is the assumed common variance in the two groups (or MSE), and t is the number of time points (Diggle et al. 2002). Sample SAS code is provided in Appendix 19.B.

19.8.3 Repeated Measures on Time: Proportions Outcome

Assuming there is a stable difference in proportions p₁ − p₂ between two groups across t time points, the number of subjects n in each of two groups is

\[ n = \frac{\left[ z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{1-\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right]^2 [1 + r(t-1)]}{t\,(p_1-p_2)^2} \]

where p₁ is the proportion responding in group 1 (q₁ = 1 − p₁), p₂ is the response proportion in group 2, p̄ = (p₁ + p₂)/2, r is the common correlation across the n observations, and t is the number of time points (Diggle et al. 2002). Sample SAS code is provided in Appendix 19.B.

Example: Turtle Hatchling Sex Ratio

Sex ratios of red-eared slider turtle (Trachemys scripta) hatchlings are more female-biased in late-season clutches compared to early-season clutches, and certain locations show more sex bias than others (Carter et al. 2018). Suppose researchers wish to sample clutches at two locations four times over the season. Approximately how many nests must be sampled to detect a shift between locations in sex ratio from 0.5 to 0.7, with a confidence of 95% and power of 80%?

In this example, t = 4, p₁ = 0.5, and p₂ = 0.7. If no shift in sex ratio is anticipated over the season, the correlation of repeated outcomes is zero, and n = 23.2, or 24 per location. If the correlation is 0.6 (indicating a positive relationship of sex bias with time), then n = 65.1, or 66 per location. A more sophisticated design could account for the clustering effects of nest within location and hatchlings within nest.

19.8.4 Repeated Measures in Space: Spatial Autocorrelation

Observations that are correlated over space are common in field studies of animal spatial response. For example, animals may be limited in mobility or dispersal, causing aggregation in some areas and avoidance of others, and patterns of aggregation and dispersal may reflect responses to environmental gradients, such as salinity or ambient temperature. Other examples of spatial autocorrelation include studies of brain architecture and function, and the mapping of genetic distances (with distances based on allele frequencies rather than physical distance).

If correlation occurs between spatial clusters, the large-sample normal approximation for the total sample size N is

\[ N = n_s\,n_c \ge z_{1-\alpha/2}^2 \, (s/d)^2 \, [1 + r(n_c - 1)] \]
where N is the total number of experimental units, n_s is the number of experimental units within a cluster, n_c is the number of clusters (the repeated points), and r is the average correlation between clusters (Conquest 1993).

Example: Spatial Correlation and Mussel Shell Length

(Adapted from Conquest 1993.) A researcher planned a study in which freshwater mussels were to be collected at each of S = 10 sampling stations, and the shell length of each mussel measured. The researcher wished to estimate shell length with a precision d of 10% and a level of confidence of 95%. From previous studies, the estimated coefficient of variation (CV) for shell length (s/Ȳ) is thought to be approximately 30%. Sampling stations are close enough that the average spatial correlation r is approximately 0.25. How many mussels need to be sampled to obtain an estimate of mean shell length with the desired precision?

The total number of mussels to be sampled is N ≥ n_s S, where n_s is the number of mussels at each station and S is the number of sampling stations. For CV = s/Ȳ = 0.3/1 = 0.3, d = 0.1, r = 0.25, α = 0.05, z₁₋α/₂ = 1.96, and S = 10, the total sample size is

\[ N \ge z_{1-\alpha/2}^2 \left( \frac{s}{d} \right)^2 [1 + r(S-1)] \]
\[ N \ge 1.96^2 \left( \frac{0.3}{0.1} \right)^2 [1 + 0.25(10-1)] = 112.3 \approx 113 \]

The investigator needs to sample at least 113 mussels. The number of mussels to be sampled at each station is therefore n_s = N/S = 113/10 = 11.3, or approximately 12 mussels per station.

19.A Guinea-Pig Data: Sample SAS Code for Calculating Sample Size for a Single-Factor Four-Level (a) Completely Randomised Design; (b) Randomised Complete Block Design

19.A.1 Completely Randomised Design

proc glm;
  class TRT;
  model wt=TRT;
  means TRT;
  lsmeans TRT / diff pdiff CL;
run;

19.A.2 Randomised Complete Block Design

proc glm order=data;
  class BLOCK TRT;
  model wt=BLOCK TRT;
  means TRT / waller regwq;
run;

title 'power for single-factor ANOVA';
*set sample size per group limit;
%let numSamples = 20;  *maximum sample size per group;
%let kmax = 10;        *maximum number of levels;
data anpow;
  do n = 3 to &numSamples by 1;
    do k = 2 to &kmax by 1;
      meandiff = 2.5;  *biologically meaningful difference to be detected;
      sigma2 = 4.95;
      alpha = 0.05;
      *calculate noncentrality parameter ncp;
      NCP = (n*meandiff**2)/(2*sigma2);
      *calculate power;
      NumDF = k-1;
      DenDF = k*(n-1);
      F_crit = finv(1-alpha, NumDF, DenDF, 0);
      power = 1 - probf(F_crit, NumDF, DenDF, NCP);
      output;
    end;
  end;
run;
proc sort;
  by k;
run;
proc print data=anpow;
run;

19.B Sample SAS Code for Calculating Sample Size per Group for a Simple Repeated-Measures Design

*determines sample size per group;
*5 timepoints (icc=.4);
*difference in proportions of 0.5 and 0.67 (odds ratio = 2);
*power = 0.8 for a 2-tailed 0.05 test;
data test;
  za = probit(.975);
  zb = probit(.8);
  n = 5;
  p1 = 0.5; p2 = 2/3;
  q1 = 1-p1; q2 = 1-p2;
  pbar = (p1+p2)/2;
  qbar = (q1+q2)/2;
  rho = 0.4;
  num = ((za*sqrt(2*pbar*qbar) + zb*sqrt(p1*q1 + p2*q2))**2)*(1 + (n-1)*rho);
  den = n*((p1-p2)**2);
  npergrp = num/den;
proc print;
  var npergrp;
run;
*ans: n = 71.87, round up to 72;

data test;
  alpha = 0.05;
  beta = 0.8;
  za = probit(.975);
  za2 = quantile('normal',1-alpha/2);  *same value as za, retained from original;
  zb = probit(.8);
  zb2 = quantile('normal',beta);       *same value as zb, retained from original;
  t = 4;
  p1 = 0.5; p2 = 0.7;
  q1 = 1-p1; q2 = 1-p2;
  pbar = (p1+p2)/2;
  qbar = (q1+q2)/2;
  rho = 0.6;
  num = ((za*sqrt(2*pbar*qbar) + zb*sqrt(p1*q1 + p2*q2))**2)*(1 + (t-1)*rho);
  den = t*((p1-p2)**2);
  npergrp = num/den;
proc print;
  var npergrp;
run;

References

Bate, S. and Clark, R. (2014). The Design and Statistical Analysis of Animal Experiments. Cambridge: Cambridge University Press.
Bland, J.M. and Altman, D.G. (1994). Correlation, regression, and repeated data. British Medical Journal 308: 896.
Bland, J.M. and Altman, D.G. (1995). Statistics notes: calculating correlation coefficients with repeated observations: part 1: correlation within subjects. BMJ 310: 446. https://doi.org/10.1136/bmj.310.6977.446.
Box, G.E.P., Hunter, W.G., and Hunter, J.S. (2005). Statistics for Experimenters, 2e. New York: Wiley.
Brien, C.J. (1983). Analysis of variance tables based on experimental structure. Biometrics 39: 53–59.
Brien, C.J. and Bailey, R.A. (2006). Multiple randomizations (with discussion). Journal of the Royal Statistical Society, Series B 68: 571–609.
Brien, C.J. and Bailey, R.A. (2009). Decomposition tables for experiments. I. A chain of randomizations. The Annals of Statistics 37: 4184–4213.
Brien, C.J. and Bailey, R.A. (2010). Decomposition tables for experiments. II. Two-one randomizations. The Annals of Statistics 38: 3164–3190.
Carter, A.W., Sadd, B.M., Tuberville, T.D. et al. (2018). Short heatwaves during fluctuating incubation regimes produce females under temperature-dependent sex determination with implications for sex ratios in nature. Scientific Reports 8 (1): 3. https://doi.org/10.1038/s41598-017-17708-0.
Castelloe, J.M. and O'Brien, R.G. (2001). Power and sample size determination for linear models. SUGI 26, Paper 240-26. https://support.sas.com/resources/papers/proceedings/proceedings/sugi26/p240-26.pdf.
Collins, L.M. (2018). Chapter 3: Introduction to the factorial optimization trial. In: Optimization of Behavioral, Biobehavioral, and Biomedical Interventions, Statistics for Social and Behavioral Sciences (ed. L.M. Collins), 67–113. Springer International Publishing AG. https://doi.org/10.1007/978-3-319-72206-1_3.
Collins, L.M., Dziak, J.J., and Li, R. (2009). Design of experiments with multiple independent variables: a resource management perspective on complete and reduced factorial designs. Psychological Methods 14 (3): 202–224. https://doi.org/10.1037/a0015826.
Conquest, L.L. (1993). Statistical approaches to environmental monitoring: did we teach the wrong things? Environmental Monitoring and Assessment 26 (2–3): 107–124. https://doi.org/10.1007/BF00547490.
Cooper, N., Thomas, G.H., and FitzJohn, R.G. (2016). Shedding light on the 'dark side' of phylogenetic comparative methods. Methods in Ecology and Evolution 7 (6): 693–699. https://doi.org/10.1111/2041-210X.12533.
Czitrom, V. (1999). One-factor-at-a-time versus designed experiments. The American Statistician 53: 126–131. https://doi.org/10.1080/00031305.1999.10474445.
Diggle, P.J., Heagerty, P., Liang, K.-Y., and Zeger, S.L. (2002). Analysis of Longitudinal Data, Oxford Statistical Science Series, 2e. Oxford: Oxford University Press.
Divine, G., Kapke, A., Havstad, S., and Joseph, C.L. (2010). Exemplary data set sample size calculation for Wilcoxon–Mann–Whitney tests. Statistics in Medicine 29 (1): 108–115. https://doi.org/10.1002/sim.3770.
Draper, N. and Smith, H. (1991). Applied Regression Analysis, 2e. New York: Wiley.
Festing, M.F.W. (2014). Randomized block experimental designs can increase the power and reproducibility of laboratory animal experiments. ILAR Journal 55: 472–476.
Festing, M.F.W. (2018). On determining sample size in experiments involving laboratory animals. Laboratory Animals 52 (4): 341–350. https://doi.org/10.1177/0023677217738268.
Festing, M.F.W. (2020). The "completely randomized" and the "randomized block" are the only experimental designs suitable for widespread use in preclinical research. Scientific Reports 10: 17577.
Gamble, C., Krishan, A., Stocken, D. et al. (2017). Guidelines for the content of statistical analysis plans in clinical trials. JAMA 318 (23): 2337–2343.
Garland, T. and Adolph, S.C. (1994). Why not to do two-species comparative studies: limitations on inferring adaptation. Physiological Zoology 67: 797–828.
Goldman, A.I. and Hillman, D.W. (1992). Exemplary data: sample size and power in the design of event-time clinical trials. Controlled Clinical Trials 13 (4): 256–271. https://doi.org/10.1016/0197-2456(92)90010-w.
Griffin, F.C., Mandese, W.W., Reynolds, P.S. et al. (2021). Evaluation of clinical examination location on stress in cats: a randomized crossover trial. Journal of Feline Medicine and Surgery 23 (4): 364–369. https://doi.org/10.1177/1098612X20959046.
Guo, Y., Logan, H.L., Glueck, D.H., and Muller, K.E. (2013). Selecting a sample size for studies with repeated measures. BMC Medical Research Methodology 13: 100. https://doi.org/10.1186/1471-2288-13-100.
Hamlett, A., Ryan, L., and Wolfinger, R. (2004). On the use of proc mixed to estimate correlation in the presence of repeated measures. SUGI 29, Paper 198-29. https://support.sas.com/resources/papers/proceedings/proceedings/sugi29/198-29.pdf (accessed 2022).
Hedeker, D., Gibbons, R.D., and Waternaux, C. (1999). Sample size estimation for longitudinal designs with attrition. Journal of Educational and Behavioral Statistics 24: 70–93. https://doi.org/10.3102/10769986024001070.
Hurlbert, S.H. and Lombardi, C.M. (2004). Research methodology: experimental design, sampling design, statistical analysis. In: Encyclopedia of Animal Behavior, vol. 2 (ed. M.M. Bekoff), 755–762. London: Greenwood Press.
Irimata, K., Wakim, P., and Li, X. (2018). Estimation of correlation coefficient in data with repeated measures. Paper 2424-2018, Proceedings SAS Global Forum 2018: 8–11.
Karp, N.A. and Fry, D. (2021). What is the optimum design for my animal experiment? BMJ Open Science 5: e100126.
Karp, N.A., Wilson, Z., Stalker, E. et al. (2020). A multi-batch design to deliver robust estimates of efficacy and reduce animal use – a syngeneic tumour case study. Scientific Reports 10 (1): 6178. https://doi.org/10.1038/s41598-020-62509-7.
Kowalski, S.M. and Potcner, K.J. (2003). How to recognize a split-plot experiment. Quality Progress 36 (11): 60–66.
Lakens, D. and Caldwell, A.R. (2021). Simulation-based power analysis for factorial analysis of variance designs. Advances in Methods and Practices in Psychological Science 4 (1). https://doi.org/10.1177/2515245920951503.
Lazic, S.E. (2016). Experimental Design for Laboratory Biologists. Cambridge: Cambridge University Press.
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. (2018). Sample Sizes for Clinical, Laboratory and Epidemiology Studies, 4e. Wiley-Blackwell.
Mead, R. (1988). The Design of Experiments. Cambridge: Cambridge University Press.
Montgomery, D.C. (2012). Design and Analysis of Experiments, 8e. New York: Wiley.
O'Brien, R.G. and Muller, K.E. (1993). Unified power analysis for t-tests through multivariate hypotheses. In: Applied Analysis of Variance in Behavioral Science (ed. L.K. Edwards), 297–344. New York: Marcel Dekker.
Reynolds, P.S. (2022). Between two stools: preclinical research, reproducibility, and statistical design of experiments. BMC Research Notes 15: 73. https://doi.org/10.1186/s13104-022-05965-w.
Senn, S. (2002). Cross-Over Trials in Clinical Research, 2e. New York: Wiley.
Shan, G., Zhang, H., and Jiang, T. (2020). Correlation coefficients for a study with repeated measures. Computational and Mathematical Methods in Medicine 2020: 7398324. https://doi.org/10.1155/2020/7398324.
Simpson, S.H. (2015). Creating a data analysis plan: what to consider when choosing statistics for a study. The Canadian Journal of Hospital Pharmacy 68 (4): 311–317. https://doi.org/10.4212/cjhp.v68i4.1471.
Snedecor, G.W. and Cochran, W.G. (1980). Statistical Methods, 7e. Ames: Iowa State University Press.
Sødring, M., Oostindjer, M., Egelandsdal, B., and Paulsen, J.E. (2015). Effects of hemin and nitrite on intestinal tumorigenesis in the A/J min/+ mouse model. PLoS ONE 10 (4): e0122880. https://doi.org/10.1371/journal.pone.0122880.
Voelkl, B., Vogt, L., Sena, E.S., and Würbel, H. (2018). Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLoS Biology 16 (2): e2003693. https://doi.org/10.1371/journal.pbio.2003693.
Winer, B.J., Brown, D.R., and Michels, K.M. (1991). Statistical Principles in Experimental Design, 3e. Boston, MA: McGraw-Hill.
Würbel, H., Voelkl, B., Altman, N.S. et al. (2020). Reply to 'It is time for an empirically informed paradigm shift in animal research'. Nature Reviews Neuroscience 21: 661–662. https://doi.org/10.1038/s41583-020-0370-7.
Zar, J.H. (2010). Biostatistical Analysis, 5e. Upper Saddle River: Prentice-Hall.
20
Hierarchical or Nested Data

CHAPTER OUTLINE

20.1 Introduction
20.2 Steps in Multilevel Sample Size Determinations
  20.2.1 Identify the Unit of Randomisation
  20.2.2 No Predictors
  20.2.3 Multilevel Models with Predictors
  20.2.4 Constructing the Model
20.3 Estimating Effect Size
  20.3.1 Cohen's d
  20.3.2 Fixed Effects Regression Coefficients
  20.3.3 Intraclass Correlation Coefficient (ICC)
20.4 Other Considerations: Balance, Sparse Data, Costs
  20.4.1 Balanced Versus Unbalanced Designs
  20.4.2 Sparse Data
  20.4.3 Costs
20.5 Sample Size Determinations
  20.5.1 Rules of Thumb
  20.5.2 Sample Size Based on Design Effect
  20.5.3 Initial Approximations
  20.5.4 Asymptotic Normal Approximation: Balanced Cluster Sizes
  20.5.5 Asymptotic Normal Approximation: Unbalanced Cluster Sizes
  20.5.6 Sample Size Based on the Non-centrality Parameter
  20.5.7 Two-Level Model, Subjects Within Cluster as Unit of Randomisation
  20.5.8 Two-Level Model, Cluster as Unit of Randomisation
20.A Notes on Software
  20.A.1 SAS Codes for Estimating Variance Components
20.B SAS Code for Estimating Variance Components for Calculations of R² and Cohen's f² for House Sparrow Data (Code modified from Selya et al. 2012)
References

within larger organisational units, or clusters


20.1 Introduction (Box 20.1). For example, pups are located within lit-
The focus of many animal-based experiments is the ters, and litters are clustered within dam. Individual
individual subject or biological unit. However, col- mice are assigned to a given drug intervention, the
lections of subjects often exhibit some sort of hierar- individual mouse brain is harvested, then cells are
chical structuring, with observational units nested harvested from different regions within each brain.

A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
234 A Guide to Sample Size for Animal-based Studies

BOX 20.1 when both anticipated sample sizes and effect size
Examples of Hierarchical Data to be detected are small (Goldstein 2003; Moer-
beek 2004).
▪ Nestlings in different nests One solution to these problems is to average
▪ Pups in each litter from each multiparous across all observations for each cluster, then analys-
breeding pair ing the means. However, averaging loses a lot of
▪ One of two drugs is randomly assigned to each information and does not permit exploration of
mouse, the brain of each mouse is harvested, the effects of predictor variables acting at the differ-
and cells from each brain region are isolated ent hierarchical levels of organisation.
and plated Hierarchical or multilevel design models (also
▪ Brains from two strains of mouse are harvested, called nested models, random effects models, mixed
cells are isolated and plated, then one of two models, and variance components models) specifically
drugs is randomly assigned to cells in each well account for the nested structure of organismal units
in a plate and the correlation within clusters. Design features
▪ Multisite veterinary trials where dogs with a adjust for the different variances that occur between
particular cancer within each site are randomly and within clusters (thus ensuring estimation of
assigned to one of two or more treatments correct standard errors). Predictor or explanatory vari-
▪ Synaptic vesicles counts are obtained for each ables (covariates) can be incorporated at both the level
set of neurons harvested from the brains of sev- of the individual observation and the cluster. There are
eral strains of mice two main types of multilevel designs: between-cluster
▪ Cows within herds within locations and within-cluster (Figure 20.1). Between-cluster
▪ Survey studies: Individuals within clinics
within city (a)
▪ Meta-analyses
Cluster A B A B

Units A A B B A A B B
Figure 20.1: Treatment allocation strategies and units of randomisation in multilevel designs. (a) Between-cluster design: the cluster is the experimental unit. Clusters are randomised to intervention, and all observation units within the cluster receive that intervention. (b) Within-cluster design: the unit within the cluster is the experimental unit. Within-cluster units are randomised to intervention, and clusters are a blocking variable. (c) Multi-cluster design: different treatments are assigned to clusters and units within clusters, respectively. Treatment 1 (A or B) is randomly allocated to level-2 units, and Treatment 2 (c or d) is randomly allocated to level-1 units.

Two- and three-level models are common in studies of neurobiology (Aarts et al. 2014, 2015), rodent breeding (e.g. pups within litter within dam; Lazic and Essiou 2013; Hull et al. 2022), ecology (e.g. nestlings within nest within site), agriculture (e.g. livestock within herds within locations), multisite veterinary clinical trials (e.g. subjects within clinic within city), and meta-analyses (individuals within studies).

When units are nested, observations for the units within each cluster will not be independent: the individual units within clusters will be more similar than individual units located in different clusters. Treating observations as independent in subsequent analyses is 'pseudo-replication' and results in artificially inflated sample sizes (Lazic 2010; Lazic and Essiou 2013; Lazic et al. 2018). Failure to account for nesting can result in inflated Type I error rates, increased false positives, and incorrect interpretation of results (Julian 2001; Moerbeek 2004; Aarts et al. 2014, 2015). It can also result in a loss of statistical power to detect any experimental effects that actually exist, especially when both the anticipated sample sizes and the effect size to be detected are small (Goldstein 2003; Moerbeek 2004).

One solution to these problems is to average across all observations for each cluster, then analyse the means. However, averaging loses a lot of information and does not permit exploration of the effects of predictor variables acting at the different hierarchical levels of organisation.

Hierarchical or multilevel design models (also called nested models, random effects models, mixed models, and variance components models) specifically account for the nested structure of organismal units and the correlation within clusters. Design features adjust for the different variances that occur between and within clusters (thus ensuring estimation of correct standard errors). Predictor or explanatory variables (covariates) can be incorporated at both the level of the individual observation and the cluster. There are two main types of multilevel designs: between-cluster and within-cluster (Figure 20.1).
Between-cluster designs randomise the factor or treatment to the whole cluster; experimental units within each cluster are essentially subsamples. Within-cluster designs randomise treatment to units within each cluster. Repeated measures designs (where multiple observations are nested within each subject) and randomised block designs (where at least one replicate from each treatment is nested within blocks) are types of multilevel design.

There are at least two sample size calculations required for multilevel designs: the sample size per cluster (the number of units per cluster) and the total number of clusters. When clustering is present, the correlation between observations within clusters (the intra-class correlation, ICC) will affect statistical power. Larger sample sizes will be required if variance components are large and the sample size at the lowest level (e.g. the number of individuals per cluster) is very small (Konstantopoulos 2008, 2010; Austin and Leckie 2018).

20.2 Steps in Multilevel Sample Size Determinations

There are four steps for sample size determination for multilevel models:

1. Identify the unit of randomisation
2. Determine design features
3. Decide on the effect size: δ, ICC, f², R²
4. Other considerations: balance, sparse data, and costs

20.2.1 Identify the Unit of Randomisation

In randomised experimental studies, test or control interventions can be applied to the whole cluster, to individuals within each cluster, or different interventions may be randomised at different levels.

Cluster-level assignment. When interventions are assigned randomly to whole clusters, the cluster is the unit of randomisation. All individuals from the same cluster will therefore belong to the same experimental or treatment group. Machin et al. (2018) refer to this assignment strategy as aggregate. Sample size determinations prioritise the number of clusters k, not the number of individuals n (Aarts et al. 2014, 2015). Cluster sample size is the priority if the focus of the study is testing the effects of an intervention applied at the cluster level, or if the number of subjects contained in each cluster is anticipated to be small, or else must be fixed a priori during the design phase. Cluster-level designs have the advantage of relative simplicity (only one treatment is allocated to the entire cluster). However, they are liable to a type of representation bias if there is large between-cluster variation in subject characteristics. The requirement for a large number of clusters to be sampled to achieve adequate power may make these studies very expensive, especially for multisite trials or large-scale surveys (Raudenbush and Liu 2000).

Within-cluster assignment. When interventions are assigned randomly to different individuals within the same cluster, the individual is the unit of randomisation, and all treatments are represented in each cluster. Within-cluster designs are sometimes referred to as multilevel blocked trials: individual subjects are within 'blocks' because all treatments are represented in each cluster (Spybrook et al. 2011; Konstantopoulos 2012). Individual randomisation is often preferable to cluster randomisation because the sample sizes required are usually much smaller, and it is easier to control for representation bias. However, individual treatment allocations will not be possible if subjects cannot be individually identified, and 'treatment contamination' (inadvertent application of the wrong intervention) may be a consideration (Torgerson 2001). For testing effects of an intervention applied at the individual level, sample size determinations prioritise the number of individuals n. The number of clusters k will be determined by the anticipated cluster size, or number of individuals within each cluster.

If there are more treatments than individuals per cluster, so that all treatments cannot be represented in each block, incomplete block or fractional factorial designs should be considered.
More complex experiments can be designed where different interventions can be separately assigned to cluster and individual units.

Multilevel designs are best understood as a regression model. Features of the regression model to be considered are the number of levels, or hierarchical units, and the specification of predictor variables (or covariates) at the different levels (if applicable).

20.2.2 No Predictors

The most basic multilevel model has no predictor variables and two hierarchical levels: the individual level (level-1) and the cluster level (level-2). The individual is nested within cluster.

Continuous outcome. The model is described by regression on the two levels as:

Level 1: Yij = β0j + eij,  eij ~ N(0, σ²e)
Level 2: β0j = γ00 + u0j,  u0j ~ N(0, σ²u)

Here Yij is the response measured for the ith individual in the jth cluster, the intercept β0j is the mean response for the jth cluster, and γ00 is the grand mean (mean across all clusters). The covariance matrix of the random effects is designated as the G-matrix. It consists of the variance for the intercept, the variance for the slopes, and the covariance between intercepts and slopes. The level-1 random effect, or error, term is eij, and is assumed to be normally distributed with mean 0 and variance σ²e, the variance between individuals. The level-2 random component is normally distributed with mean 0 and variance σ²u, the variance between clusters. The total observed variance for the response is thus σ²e + σ²u (Bell et al. 2013). Combining the two equations gives the full model:

Yij = γ00 + u0j + eij

that is, the response is the sum of the grand mean plus the between- and within-cluster random effects. In SAS, the variance components (σ²u, σ²e) are obtained from the variance components option in the random statement of proc mixed (SAS code is provided in Appendix 20.A).
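To make the variance structure concrete, the following data step is a minimal sketch that simulates data from this two-level model (the dataset name twolevel, the seed, the grand mean of 10, and the variance components σ²u = 2 and σ²e = 1 are illustrative assumptions, not values from the text). Fitting the Appendix 20.A proc mixed code to the result should approximately recover the two components:

data twolevel;
  call streaminit(20240);                * arbitrary seed for reproducibility;
  do cluster = 1 to 20;                  * level-2 units (clusters);
    u = rand('NORMAL', 0, sqrt(2));      * cluster random effect, sigma2_u = 2;
    do i = 1 to 5;                       * level-1 units (individuals per cluster);
      e = rand('NORMAL', 0, sqrt(1));    * individual error, sigma2_e = 1;
      Y = 10 + u + e;                    * grand mean gamma_00 = 10;
      output;
    end;
  end;
  keep cluster i Y;
run;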
Categorical outcome. When outcome variables are categorical (such as binary, proportions, count, or ordinal data), the assumptions of normally distributed errors no longer apply. The outcome must be transformed with the appropriate non-linear link function in a generalised multilevel model. The response ηij is the log odds of the event occurring for the ith individual in the jth cluster:

ηij = log[μ / (1 − μ)]

where μ = 1/[1 + exp(−β0)]. The model is

Level 1: ηij = β0j
Level 2: β0j = γ00 + u0j,  u0j ~ N(0, τ00)

where β0j is the intercept, or the average log odds of the event occurring in cluster j, and the level-2 error variance term is designated by τ00. Because the variance of a non-Gaussian model is determined by the population mean, the 'level-1 error variance term' as such is not computed as it is with Gaussian data but is estimated as π²/3 = 3.29 (Ene et al. 2015; Hox et al. 2018). The level-2 random component is assumed to belong to the exponential family of distributions: binary, binomial, Poisson, geometric, and negative binomial. Count data are usually modelled with a Poisson distribution. Ordinal data are described by proportional odds, or cumulative logit, models.

In SAS, the variance components are obtained from the variance components option in the random statement of proc glimmix. The grand mean (with 95% confidence intervals) is obtained from the fixed effects intercept (Appendix 20.A). Other distributions can be specified by the relevant distribution and link functions.

20.2.3 Multilevel Models with Predictors

The basic model is easily expanded to include one or more predictors, or covariates, at any or all levels. The level to which a treatment, or intervention, predictor is assigned depends on how the unit of randomisation is defined. The associated slope terms (regression coefficients β1, β2, etc.) are of primary interest, as they describe the effects of the intervention on the outcome. Interpretation is simplified if intervention variables are categorical and dummy-coded (Table 20.1).
Table 20.1: Dummy variables.

Dummy variables are used to streamline calculations when categorical variables with two or more distinct 'levels' are used as predictors (Draper and Smith 1998).

1. Two category 'levels' A and B. One dummy variable Z that takes values of 0 or 1 is required to uniquely code for each group.
Example. One of two treatments is randomly assigned to individuals within a cluster. Treatment is represented by indicators Z1 = 0 (control) and Z1 = 1 (test intervention). The regression equation is
Yij = β0j + β1 Z1 + eij
Example. One of two treatments is randomly assigned to equal numbers of individuals of both sexes, male or female. Sex is represented by indicators Z2 = 0 (male) and Z2 = 1 (female). The regression equation is
Y = β0 + β1 Z1 + β2 Z2 + e

2. More than two category 'levels'. More than two levels require k − 1 dummy variables z1, z2, …, zk−1, each taking values of 0 or 1.
Example. Animals are sampled from each of three locations A, B, C. To code for location, two dummy variables z1 and z2 are required:
Location A: z1 = 1, z2 = 0
Location B: z1 = 0, z2 = 1
Location C: z1 = 0, z2 = 0
The regression equation is
Y = β0 + β1 X1 + a1 z1 + a2 z2 + e
The regression coefficients a1 and a2 estimate the difference between each group and the reference group (0, 0).
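A minimal data step sketch of the three-location coding in Table 20.1 (the input dataset animals and its character variable location are hypothetical names used only for illustration):

data coded;
  set animals;              * hypothetical dataset with location = 'A', 'B', or 'C';
  z1 = (location = 'A');    * logical comparisons return 1 or 0;
  z2 = (location = 'B');    * location C is the reference group (z1 = 0, z2 = 0);
run;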

For example, suppose a two-arm experiment is planned comparing a test drug A versus a vehicle control C. Then 'treatment' can be coded with a single dummy variable to indicate the two groups (0 for C and 1 for A). The regression coefficient will be the quantitative difference in responses between groups and is thus a measure of unstandardised effect size.

Additional covariates can be added to increase precision (reduce standard errors) of the estimated intervention effects and reduce noise, therefore improving signal and increasing power to detect treatment effects if they exist (Hedges and Hedberg 2007; Raudenbush et al. 2007; Konstantopoulos 2008, 2012). Signalment variables (age, sex, reproductive status, weight, etc.) are useful covariates.

Level-1 predictors. If the treatment is applied at the level of the individual, the predictor or treatment indicator variable X is incorporated into the level-1 regression:

Yij = β0j + β1j Xj + eij

with the level-2, or cluster-effect, regressions:

β0j = γ00 + u0j,  u0j ~ N(0, τ00)
β1j = γ10 + u1j,  u1j ~ N(0, τ11)

Here γ00 is the grand mean of the response, γ10 is the treatment effect, or the difference in means between the experimental group and the comparator group (Ȳt − Ȳc), τ00 is the between-cluster variation, and τ11 is the variance for the treatment effect between clusters.

Level-2 predictors. Predictors at the cluster level (level-2) are indicated by W rather than X. The cluster-level regressions are

β0j = γ00 + γ01 Wj + u0j  (level-2 intercept)
β1j = γ10 + γ11 Wj + u1j  (level-2 slope)

The combined model is

Yij = γ00 + γ01 Wj + γ10 Xj + γ11 X1ij W1j + u0j + u1j Xij + eij
The cross-level term X1ij W1j indicates the strength of the adjustment of level-2 characteristics on the level-1 (subject) responses within each cluster (sample SAS code is provided in Appendix 20.B).

20.2.4 Constructing the Model

Bell et al. (2013) and Ene et al. (2015) recommend a 'bottom-up' model-building strategy. The initial model is intercept-only with no predictors; predictors and random effects are then added one at a time. Predictors should be specified a priori, with selection based on the best available knowledge or scientific justification. Final model specification is based on coefficients that are statistically significant (or 'meaningfully large') and model fit based on the deviance test, which assesses differences in the −2 log likelihood values between candidate models.

Model misspecification is indicated by failure to converge, or a non-positive definite G-matrix. The latter indicates that one or more of the random effects variance components are zero. There may be too few observations to properly estimate the random effects (there should be at least p + 1 observations to estimate p random effects), too many random effects specified in the model, or little to no variation between units at the given level. Dropping the zero-variance random effects is an option. An alternative is to use a 'population average' model and dispense entirely with random effects. The disadvantage of this approach is that these models cannot capture between-cluster differences, but this may not matter if cluster effects are not of primary interest or are too small to be of practical importance (McNeish et al. 2016).

20.3 Estimating Effect Size

Effect sizes can be estimated by Cohen's d, unstandardised or standardised regression coefficients, multilevel R², Cohen's f², and/or the intraclass correlation coefficient ICC (Lorah 2018).

20.3.1 Cohen's d

Estimating a d effect size requires specification of both the average difference between groups δ and the variance of effect sizes within and across clusters, expressed as standard deviation s. With clustered studies, there are multiple sources of variation and therefore several choices of the standard deviation to be used for the standardisation: the total variance, the variance for the subjects nested within clusters, the variance between clusters, and, when applicable, the variance for the main effect of treatment.

Initial effect sizes for continuous data can be approximated using Cohen's d, ignoring level-2 clustering and associated predictors. Recall that effect size is expressed in standard deviation units as d = (Ȳ1 − Ȳ0)/σtotal, where Ȳ1 − Ȳ0 is the biologically important difference δ between treatment means, and σtotal is the standard deviation of the outcome, estimated as the square root of the total variance σ²total. For the two-level model, the standard deviation of the outcome is √(σ²u + σ²e), and for three levels √(σ²u + σ²v + σ²e). Values for δ can be a 'best guess' or a hypothesised scientifically meaningful difference Y1 − Y0 to be detected. Basic descriptive or summary statistics can be used to obtain the standard deviation of the outcome Y. If there is little or no prior information on anticipated mean differences and variances, Cohen's benchmark values for small, medium, and large effect sizes could be substituted. However, these values (and their interpretation) will be highly unreliable and are not recommended for animal-based studies (see Chapter 15).

When outcome variables are categorical, the effect size is estimated by the difference in log odds:

δ = η1 − η0 = ln{ [p1 (1 − p0)] / [p0 (1 − p1)] }

where η0 and η1 are the log odds for the control and experimental groups, respectively, with the average treatment effect defined in terms of the log odds of the proportions of 'successes' (the number of events occurring) in each group.
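As a numerical sketch of this effect size (the proportions p0 = 0.20 and p1 = 0.40 are hypothetical values chosen purely for illustration):

data logodds;
  p0 = 0.20;  p1 = 0.40;                         * hypothetical control and treatment proportions;
  delta = log((p1*(1 - p0)) / (p0*(1 - p1)));    * difference in log odds: ln(2.67) = 0.98;
run;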

20.3.2 Fixed Effects Regression Coefficients

For regression-type models, regression coefficients for the fixed effects provide information about the magnitude of the treatment differences
(which is of most biological interest). Feingold (2015) recommends that the regression coefficient for slope β is divided by the pooled within-group standard deviation of the response Y. The estimated effect size will be somewhat smaller if it is estimated without accounting for cluster variances. Because the magnitude of the fixed effects depends on the scale of each independent variable, effect sizes based on different variables in the same study, or on variables across multiple studies, cannot be compared directly (Lorah 2018).

An alternative measure of effect size for regression models is the proportion of variation explained by the predictors. Cohen's f² estimates the proportion of variance in the sample explained by the (categorical) variable for the intervention effect relative to all other covariates in the sample, or R². Here f² is calculated by running the regression model several times to obtain three different variance components: the variance components for the full model with all fixed effects and covariates included, variance components for the covariates-only model (which estimates the variance in the response without the fixed effects), and variance components for the null model (intercept-only model with no predictors). The respective R² for the full and covariate models are calculated from these variance components as

R²full = (σ²null − σ²full) / σ²null ;  R²other = (σ²null − σ²other) / σ²null

For a binary response variable, R² is

R²binary = σ²fixed / (σ²fixed + τ00 + π²/3)

where σ²fixed is the sample variance of the fixed effects linear predictor, and τ00 is the cluster (level-2) error variance term (Austin and Merlo 2017). Cohen's f² is then calculated as

f² = (R²full − R²other) / (1 − R²full)

Sample size N is obtained from f² and the non-centrality parameter λ, where N = λ/f² (see Chapter 14). Selya et al. (2012) describe how to calculate f² using SAS proc mixed. Bates et al. (2011) and Nakagawa and Schielzeth (2013) have developed R packages for estimating R² for multilevel models, further extended by Johnson (2014) and Nakagawa et al. (2017) to include generalised linear models for non-normal outcomes.

Example: Calculating R² and f²: Sparrow Model of Early Life Stress

(Data from Grace et al. 2017.) Nestling house sparrows (Passer domesticus) were used as a model of the effects of early life stress on growth and survival. Nestlings were fed mealworms containing either corticosterone or vehicle control to assess the effect of glucocorticoid exposure (a stress indicator) on body mass gain ('treatment', TRT). Measurements were obtained for mass (M) and tarsus length (TL) on day 12, and SEX (male or female). For this example, data were limited to observations for four nestlings per nest, two of which received corticosterone and two of which received control. There were 31 nestlings in the experimental and control groups, respectively, for a total of 72 nestlings in 18 nests.

The model is a two-level design, with individual nestlings within nest as the unit of randomisation to which treatment interventions were applied. Mass M is the dependent variable, TRT is the fixed effect (1 = corticosterone, 0 = control), and TL and SEX are covariates. R² and Cohen's f² were calculated in SAS (adapted from Selya et al. 2012). Variances and corresponding R² are shown in Table 20.2. Cohen's f² is

f² = (0.234 − 0.070) / (1 − 0.234) = 0.215

Table 20.2: Calculation of R² and f² for sparrow data.

Source; Variance; R²
Full model (TRT, SEX, TL); 2.209; R²full = (2.886 − 2.209)/2.886 = 0.234
Covariates only (SEX, TL); 2.683; R²other = (2.886 − 2.683)/2.886 = 0.070
Intercept-only (no predictors); 2.886
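The R² and f² arithmetic in this example can be verified with a short data step using the three variances reported in Table 20.2 (Appendix 20.B gives the full proc mixed workflow that produces these variances):

data f2_check;
  Vnull = 2.886;  Vfull = 2.209;  Vother = 2.683;   * variances from Table 20.2;
  R2full  = (Vnull - Vfull)/Vnull;                  * = 0.234;
  R2other = (Vnull - Vother)/Vnull;                 * = 0.070;
  f2 = (R2full - R2other)/(1 - R2full);             * = 0.215;
run;
proc print data=f2_check;
run;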
20.3.3 Intraclass Correlation Coefficient (ICC)

The intraclass correlation ICC is the proportion of total variance in the response Y that can be attributed to the cluster. Because it measures strength of association, the ICC can be interpreted as an effect size in the same way as a conventional correlation coefficient r (Snijders and Bosker 1993, 1999; Lorah 2018). It is also a measure of the 'information content' contained in each cluster. As the correlation between observations increases, the level-1 variance (σ²e) becomes smaller, and the ICC increases. The ICC is zero if responses for all individuals are independent of one another. A non-zero ICC denotes responses that are not independent, and an ICC of 1 indicates all the responses in all clusters are the same. A large ICC indicates that the information contained in individual observations is similar. Therefore, if the ICC is large, increasing within-cluster sample size will be redundant and will not contribute much in the way of new information for tests of treatment effects. (When the ICC is calculated as a measure of inter-rater agreement, a high value is desirable as a measure of method reliability.)

If the design is balanced (equal n within each cluster), the between-cluster ICC in a two-level model with a continuous outcome is

ICCu = σ²u / (σ²u + σ²e)

That is, the ICC is the ratio of the between-cluster variance to the total variance. For a three-level model, there are two cluster ICCs to be calculated:

ICCu = σ²u / (σ²u + σ²v + σ²e)
ICCv = σ²v / (σ²u + σ²v + σ²e)

where u and v refer to the level-2 and level-3 clusters, respectively.

For non-Gaussian outcomes (e.g. binary, count, proportions), the ICC for a two-level model is

ICC = τ00 / (τ00 + π²/3)

where τ00 is the intercept variance. For binary and proportion response data, the level-1 error variance term is determined by the population mean. The second term for calculating the total variance is estimated from the logistic link function with a scale factor of 1 and is π²/3 = 3.29 (Ene et al. 2015; Hox et al. 2018).

Initial values of the ICC may be obtained from previously published data. If raw data are available, the ICC can be calculated from the variance components of the intercept-only regression model (Bell et al. 2013; Ene et al. 2015).

Example: Calculating ICC: Magpie Egg Weights

(Data from Reynolds 1996.) Weights were obtained for 118 eggs from 19 Black-billed Magpie (Pica pica) nests. The fitted model is an intercept-only, no-predictor model:

Egg weight per egg per nest = a0 + ui + eij

where the random intercept term a0 is an estimate of mean egg weight, ui is the random effects error term for the between-nest effects, and eij is the residual variance, or the error term for egg. In SAS, the variance components (σ²u, σ²e) are obtained from the variance components option in the random statement, and mean egg weight (with 95% confidence intervals) from the fixed effects intercept (Appendix 20.A).

The covariance parameter estimates are 0.68 for level-2 nest effects (intercept) and 0.12 for level-1 egg effects (residual). Mean egg weight is 9.3 (95% confidence interval 8.9, 9.7) g. The ICC = σ²u/(σ²u + σ²e) = 0.68/(0.68 + 0.12) = 0.85, indicating that 85% of the variation in egg weight can be accounted for by nest of origin.
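A data step sketch of the ICC arithmetic, using the covariance parameter estimates from the magpie example (the binary-outcome calculation at the end uses a hypothetical τ00 = 0.5 purely to illustrate the formula):

data icc_check;
  var_nest = 0.68;  var_egg = 0.12;                   * variance components from proc mixed;
  icc = var_nest/(var_nest + var_egg);                * = 0.85;
  tau00 = 0.5;                                        * hypothetical level-2 intercept variance;
  icc_binary = tau00/(tau00 + constant('PI')**2/3);   * two-level binary outcome, pi**2/3 = 3.29;
run;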
20.4 Other Considerations: Balance, Sparse Data, Costs

20.4.1 Balanced Versus Unbalanced Designs

Balance (equal sample sizes within each cluster) is rarely possible for multilevel animal-based studies, either by design or (more usually) as a result of natural variation. Some experiments are designed to be unbalanced, for example, studies featuring experimental transfer of offspring to artificially increase or reduce litter or clutch size (e.g. Voillemot et al. 2012). Examples of natural variation leading to imbalance include litter size in rodents, clutch sizes of breeding birds, and unequal losses resulting from differential mortality or other types of attrition.

When cluster sizes are expected to be highly variable, the sample size calculation must be adjusted for this variation with the coefficient of variation (CV) of the cluster sizes n. van Breukelen et al. (2007) suggest that unequal cluster sizes can be adjusted for by the minimum relative efficiency of the sample, which works out to an approximate increase in cluster number of 11%.

20.4.2 Sparse Data

Most multilevel animal-based studies are characterised by very small cluster sizes. In general, level-1 sample sizes <10 do not seem to affect Type I error, confidence interval coverage, or statistical bias to any great extent (Maas and Hox 2005; Clarke 2008; Bell et al. 2008, 2010, 2014). At least n = 5 observations per cluster may be sufficient to obtain valid and reliable regression coefficient estimates for two-level models (Clarke 2008) with as few as 10 clusters, and sampling variability can be assessed with bootstrapping or other computer-intensive simulation techniques (Maas and Hox 2005).

However, power to detect differences between treatment and intervention groups may be poor. Simulations indicate that power reached the desired 0.80 only for relatively large level-1 sample sizes (n > 20) and level-2 sample sizes of k > 30. Statistical power for level-2 predictors never achieved 0.80, even for very large n and k. Small n also increased the number of zero-variance random effects (Bell et al. 2010, 2014). Sample size approximations should therefore be based on the number of clusters rather than the number of subjects.

20.4.3 Costs

For multisite veterinary and agricultural trials, and large-scale surveys, maximising the number of sites will be more important than the number of subjects per site for maximising power to detect a treatment effect. However, adding sites will cost considerably more in terms of logistics, time, resources, and money compared to the costs of adding more subjects within each site. Alternatively, the number of sites may be fixed. Therefore, choice of the optimal sample size will entail cost-benefit analyses of power and cost trade-offs, and of whether the primary objectives of the study are to make inferences about the main effect of treatment, assess treatment-by-cluster variation, or assess effects of site characteristics on treatment response. Raudenbush and Liu (2000) provide an excellent tutorial and associated SAS code. Raudenbush et al. (2007) have developed a user-friendly, free software programme for optimising sample size and power for a variety of designs: https://sites.google.com/site/optimaldesignsoftware/home

20.5 Sample Size Determinations

20.5.1 Rules of Thumb

The classic sample size rule of thumb for multilevel studies is a minimum of 30 groups with 30 subjects in each group (the so-called '30/30 rule'; Hox 2010). However, these sample sizes originated with social science, education, and survey studies, and will be wholly unrealistic for many animal-based studies. Typically, the usual cluster size for animal studies will be fewer than 10 subjects, with large variation in cluster size.

When the cluster is the unit of randomisation, the number of clusters to be sampled is most important for determining power, and cluster size will be relatively unimportant. However, when the treatment intervention is applied to subjects rather than the entire cluster, the individual within cluster is the unit of randomisation. The number of subjects required per treatment arm can be determined using the usual power calculation methods, then approximating the number of clusters required to obtain that number of subjects. However, this approximation will be increasingly unreliable as variation in cluster size increases.
20.5.2 Sample Size Based on Design Effect

If an estimate of the ICC is available, sample size can be approximated using the design effect formula. The design effect quantifies the increase in sample size expected for a nested design over that of a simple random sample with no clusters (Kish 1987). That is, if a sample size N has been approximated using conventional power calculations, then N must be multiplied by the design effect (Deff) to adjust for the effects of clustering. The design effect increases with both increased ICC and increased n per cluster, so a larger N would need to be sampled.

The design effect factor Deff will be determined by whether cluster sample sizes are balanced or unbalanced.

Balanced cluster sizes. If sample sizes can be assumed to be approximately equal for each cluster, then the design effect factor is

Deff = 1 + (n − 1) ICC

where n is the within-cluster sample size.

20.5.3 Initial Approximations

Perform power calculations under the assumption of simple random sampling in the usual way. Then multiply that sample size estimate by Deff.

Example: Adjusting Sample Size for Cluster Effects

A new study was proposed where power calculations based on the assumption of simple random sampling indicated that approximately 100 animals would be required. However, it is anticipated that clustering effects will be important. There are 10 subjects per cluster with an ICC of 0.25. What is the revised sample size?

The effective sample size would require

Deff = 1 + (10 − 1)(0.25) = 3.25

that is, 3.25 times as many subjects as a non-clustered random sample. Then the new study would require 3.25 × 100 = 325 subjects. With approximately 10 subjects per cluster, the study requires 325/10 = 32.5 clusters, rounded up to 33 clusters.
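The arithmetic of this example as a brief data step sketch:

data deff_adjust;
  N_srs = 100;  n = 10;  icc = 0.25;   * inputs from the example above;
  deff = 1 + (n - 1)*icc;              * = 3.25;
  N_adj = ceil(N_srs*deff);            * = 325 subjects;
  k = ceil(N_adj/n);                   * = 33 clusters;
run;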
20.5.4 Asymptotic Normal Approximation: Balanced Cluster Sizes

Sample size for a balanced two-arm study with level-1 randomisation (non-aggregate design) is

N = 4 Deff (z1−α/2 + z1−β)² / ES² + z²1−α/2 / 2

where ES is the effect size δ/stotal. For a study with level-2 randomisation (aggregated design),

N = 4 (z1−α/2 + z1−β)² / (δ/sy)² + z²1−α/2 / 2,  where sy = √(Deff s²total / n)

20.5.5 Asymptotic Normal Approximation: Unbalanced Cluster Sizes

For many types of animal-based studies, there will be a considerable amount of between-cluster imbalance. Common examples are litter sizes in rodents and clutch sizes in birds, reptiles, and amphibians. Relative to balanced cluster designs or unclustered studies, large variation in cluster size n contributes to loss of power. As a result, projected sample sizes for clustered studies need to be adjusted for the anticipated variation in cluster sizes.

Machin et al. (2018) recommend an adjusted Deff based on the coefficient of variation CV of the within-cluster sample sizes:

CV(n) = sn / n

where sn and n are the standard deviation and mean of the cluster sizes, respectively. The adjusted design effect is then
Deff = 1 + ICC (n − 1) [1 + ((k − 1)/k) CV²]

where k is the number of clusters.

Example: House Sparrow Clutch Size

(Data from Grace et al. 2017.) In this study, brood size for k = 28 house sparrow nests ranged from 2 to 11 nestlings. Mean brood size n was 5 (SD 2), so the coefficient of variation is 2/5 = 0.4, with ICC ≈ 0.49. Then

Deff = 1 + 0.49 (5 − 1) [1 + ((18 − 1)/18)(0.4)²] = 3.2

Suppose a new study was proposed to test a nestling-level intervention (non-aggregate design), to detect an effect size of 0.8 with two-sided α = 0.05 and power of 80%. Then sample size is

N ≥ 4 (3.2)(1.96 + 1.2816)² / (0.8)² + (1.96)²/2 ≈ 212

Assuming an average of 5 nestlings per nest, approximately 212/5 ≈ 43 nests would need to be sampled.
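A data step sketch of this calculation; note that the unrounded Deff is approximately 3.26, so N works out to roughly 216 unless Deff is first rounded to 3.2 as in the text:

data clutch_n;
  icc = 0.49;  nbar = 5;  cv = 0.4;  k = 18;           * inputs from the example;
  deff = 1 + icc*(nbar - 1)*(1 + ((k - 1)/k)*cv**2);   * = 3.26 (rounded to 3.2 in the text);
  es = 0.8;  alpha = 0.05;
  z_a = quantile('NORMAL', 1 - alpha/2);               * = 1.96;
  z_b = quantile('NORMAL', 0.80);                      * = 1.2816;
  N = 4*deff*((z_a + z_b)/es)**2 + z_a**2/2;           * = 216 here, or 212 with deff = 3.2;
  nests = ceil(N/nbar);                                * about 43-44 nests at 5 nestlings per nest;
run;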

20.5.6 Sample Size Based on the Non-centrality Parameter

The F-statistic for a multilevel design follows a non-central F-distribution F(1, m − 1, λ), where m is the number of clusters and λ is the non-centrality parameter. Because λ is related to the power of the test, the total sample size N and/or the total number of clusters m can be obtained by iterating over different candidate values for any of N, m, and/or effect size to find the values that will result in the designated power. Alternatively, power can be calculated for prespecified values for any of N, k, effect size, and/or variation explained by a covariate (R², f²). Sample SAS code is provided in Appendix 20.A.

The form of λ to be calculated depends on the choice of the unit of randomisation (individual or cluster), the number of hierarchical levels, and whether the effect to be tested is the difference between treatments (with or without covariates), the variance of the treatment effect, or the cluster × treatment effect (Raudenbush and Liu 2000). Spybrook et al. (2011) give itemised descriptions of calculation methods for λ.

20.5.7 Two-Level Model, Subjects Within Cluster as Unit of Randomisation

When treatments Xj are randomly assigned to n individuals within m clusters, the model is Yij = β0j + βjXj + eij. The difference between experimental and control group means is d = Ȳe − Ȳc. Assuming a balanced cluster design, the non-centrality parameter is

λ = m d² / (τ11 + σ²e/n)

where τ11 is the variance for the treatment effect between clusters, and σ²e is the between-subjects variation.

20.5.8 Two-Level Model, Cluster as the Unit of Randomisation

When treatments Wj are randomly assigned to m clusters (with n individuals per cluster), the model is Yij = γ00 + γ01Wj + u0j + eij. The main effect of treatment γ01 is the mean difference between experimental and control group means, Ȳe − Ȳc. Assuming an equal number of clusters per treatment group, the non-centrality parameter is

λ = γ²01 / [ (4/m)(τ + σ²e/n) ]
where m is the number of clusters, n is the number of subjects per cluster, γ01 is the regression coefficient for the treatment effect, τ is the between-cluster variation, and σ²e is the between-subjects variation. The non-centrality parameter can then be used to estimate sample size by iteration. If the design is balanced, the F-statistic for testing the main effect of treatment follows the non-central F distribution with 1 and (m − 1) degrees of freedom.

Example: Sparrow Model of Early Life Stress

(Data from Grace et al. 2017.) Average body masses at 12 days post-hatch of nestling house sparrows (Passer domesticus) treated with either corticosterone or vehicle control were 21.5 and 20.2 g, respectively. The between-nestling variation σ²e was 2.228, and the between-nest variation τ was 2.209.

A new study was planned to investigate if corticosterone induced a reduction of 1.5 g compared to control, with all nestlings in a nest administered the same treatment. It was assumed that the median clutch size would be 6 nestlings per nest. How many nests need to be sampled to detect this difference with 95% confidence and power of 80% and 90%?

The cluster is the unit of randomisation. Figure 20.2 shows the results obtained by calculating the non-centrality parameter over a range of sample sizes and solving for m. For power of 80%, 11 nests must be sampled, and 14 nests for power of 90%.

Figure 20.2: Number of nests to be sampled to attain prespecified power in a study of sparrow early life stress. (The plot shows power, 0 to 1.0, against number of nests, 0 to 20.)
20.A Notes on Software

In SAS, regression coefficients and variance components can be estimated with proc mixed for continuous, normally distributed response variables (Bell et al. 2013), and with proc glimmix for binary or categorical outcomes (Ene et al. 2015; Kiernan 2018). In R, both marginal and conditional R² and ICC for mixed models can be calculated using r2() and icc() in the lme4 package (Bates et al. 2011). Nakagawa and Schielzeth (2013) and Nakagawa et al. (2017) provide R code for obtaining R² from generalised linear mixed-effects models. The Centre for Multilevel Modelling, University of Bristol, has developed the software MLwiN for complex multilevel modelling (http://www.bristol.ac.uk/cmm/). Spybrook et al. (2011) have developed a user-friendly, free software programme for optimising sample size and power when sampling costs must be factored into the study design: https://sites.google.com/site/optimaldesignsoftware/home.

20.A.1 SAS Code for Estimating Variance Components

1. Continuous outcomes. The variance components (σ²u, σ²e) are obtained from the variance components option in the random statement of proc mixed. The grand mean (with 95% confidence intervals) is obtained from the fixed effects intercept. The SAS code is:

proc mixed data=twolevel covtest CL method=ML;
  class cluster;
  model Y = / s CL;
  random intercept / subject=cluster type=vc;
run;
2. Binary outcome. The variance components are obtained from the variance components option in the random statement of proc glimmix:

proc glimmix data=twolevel method=laplace CL;
  class cluster;
  model Y = / s CL dist=binary link=logit oddsratio;
  random intercept / subject=cluster type=vc CL;
  covtest / Wald;
run;

3. Level-2 predictors. Sample SAS code for a treatment applied to subjects (X1) and two additional predictors W1 and W2 at the cluster level is:

proc mixed data=twolevel covtest CL method=ML;
  class cluster;
  model Y = X1 W1 W2 / s CL;
  random intercept X1 / subject=cluster type=vc;
run;

If the treatment was applied at the level-2 cluster level (W1), and there were two level-1 covariates X1 and X2, the code would be:

proc mixed data=twolevel covtest CL method=ML;
  class cluster;
  model Y = X1 X2 W1 / s CL;
  random intercept X1 X2 / subject=cluster type=vc;
run;

4. Sample SAS code for obtaining total sample size from the non-centrality parameter for a two-level model, with subjects within cluster as the unit of randomisation:

%let numClust = 100;
data cluster;
  * iterate through a range of cluster sample sizes;
  do m = 5 to &numClust by 5;
    meandiff = 1.5;     * prespecified difference;
    s2_subj = 2.228;    * pooled between-subject variance;
    s2_clus = 2.209;    * between-cluster variance;
    n = 6;              * assume median clutch size of 6;
    alpha = 0.05;
    * calculate non-centrality parameter;
    NCP = m*(meandiff**2)/(s2_clus + s2_subj/n);
    F_crit = finv(1-alpha, 1, m-1, 0);
    power = 1 - probF(F_crit, 1, m-1, NCP);
    output;
  end;
run;

* output realised power;
proc print data=cluster;
run;

20.B SAS Code for Estimating Variance Components for Calculations of R² and Cohen's f² for House Sparrow Data (Code Modified from Selya et al. 2012)

*1. Calculate variance for full model: level-1 treatment effect & covariates;
ods output CovParms = VarFull;
proc mixed data=sparrow covtest CL method=ML;
  class nest;
  model mass = TRT SEX TARSUS / s CL;
  random INT / subject=nest type=vc;
run;
quit;
ods output close;

*2. Calculate variance for covariates (minus TRT);
ods output CovParms = VarPart;
proc mixed data=sparrow covtest CL method=ML;
  class nest;
  model mass = SEX TARSUS / s CL;
  random INT / subject=nest type=vc;
run;
quit;
ods output close;

*3. Calculate variance for null model (no predictors);
ods output CovParms = VarNull;
proc mixed data=sparrow covtest CL method=ML;
  class nest;
  model mass = / s CL;
  random INT / subject=nest type=vc;
run;
quit;
ods output close;
*Merge datasets;
DATA VarAll;
  merge VarFull(rename=(Estimate=Vfull))
        VarPart(rename=(Estimate=Vpart))
        VarNull(rename=(Estimate=Vnull));
  by CovParm;
  DROP CovParm;
run;

* Calculate R2 and f2;
DATA results;
  set VarAll;
  R2full = (Vnull - Vfull)/Vnull;
  R2part = (Vnull - Vpart)/Vnull;
  f2 = (R2full - R2part)/(1 - R2full);
run;
proc print;
run;
10.1136/jech.2007.060798.
References
Aarts, E., Dolan, C.V., Verhage, M., and van der Sluis, S. (2015). Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives. BMC Neuroscience 16: 94. https://doi.org/10.1186/s12868-015-0228-5.
Aarts, E., Verhage, M., Veenvliet, J.V. et al. (2014). A solution to dependency: using multilevel analysis to accommodate nested data. Nature Neuroscience 17: 491–496.
Austin, P.C. and Leckie, G. (2018). The effect of number of clusters and cluster size on statistical power and Type I error rates when testing random effects variance components in multilevel linear and logistic regression models. Journal of Statistical Computation and Simulation 88 (16): 3151–3163. https://doi.org/10.1080/00949655.2018.1504945.
Austin, P.C. and Merlo, J. (2017). Intermediate and advanced topics in multilevel logistic regression analysis. Statistics in Medicine 36 (20): 3257–3277. https://doi.org/10.1002/sim.7336.
Bates, D., Maechler, M., and Bolker, B. (2011). lme4: linear mixed-effects models. R package, version 0.999375-42. http://CRAN.R-project.org/package=lme4 (accessed 2022).
Bell, B.A., Ene, M., Smiley, W., and Shonenberger, J.A. (2013). A multilevel primer using SAS proc mixed. SAS Global Forum 2013 Proceedings. http://support.sas.com/resources/papers/proceedings13/433-2013.pdf (accessed 2019).
Bell, B.A., Ferron, J.M., and Kromrey, J.D. (2008). Cluster size in multilevel models: the impact of sparse data structures on point and interval estimates in two-level models. In: Proceedings of the Joint Statistical Meetings (Survey Research Methods section), 1122–1129. Alexandria, VA: American Statistical Association.
Bell, B.A., Morgan, G.B., Schoeneberger, J.A. et al. (2010). Dancing the sample size limbo with mixed models: how low can you go? SAS Global Forum 2010 Proceedings (11-14 April 2010). Paper 197-2010, Seattle, WA. https://support.sas.com/resources/papers/proceedings10/197-2010.pdf.
Bell, B.A., Morgan, G.B., Schoeneberger, J.A. et al. (2014). How low can you go? An investigation of the influence of sample size and model complexity on point and interval estimates in two-level linear models. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences 10: 1–11.
Clarke, P. (2008). When can group level clustering be ignored? Multilevel models versus single-level models with sparse data. Journal of Epidemiology and Community Health 62 (8): 752–758. https://doi.org/10.1136/jech.2007.060798.
Draper, N.R. and Smith, H. (1998). Applied Regression Analysis, 3e. New York: Wiley.
Ene, M., Leighton, E.A., Blue, G.L., and Bell, B.A. (2015). Multilevel models for categorical data using SAS proc glimmix: the basics. SAS Global Forum 2015 Proceedings. https://support.sas.com/resources/papers/proceedings15/3430-2015.pdf (accessed 2019).
Feingold, A. (2015). Confidence interval estimation for standardized effect sizes in multilevel and latent growth modelling. Journal of Consulting and Clinical Psychology 83 (1): 157–168. https://doi.org/10.1037/a0037721.
Goldstein, H. (2003). Multilevel Statistical Models, 3e. London: Edward Arnold.
Grace, J.K., Froud, L., Meillère, A., and Angelier, F. (2017). House sparrows mitigate growth effects of post-natal glucocorticoid exposure at the expense of longevity. General and Comparative Endocrinology 253: 1–12. https://doi.org/10.1016/j.ygcen.2017.08.011.
Hedges, L. and Hedberg, E.C. (2007). Intraclass correlation values for planning group randomized trials in education. Educational Evaluation and Policy Analysis 29 (1): 60–87.
Hox, J.J. (2010). Multilevel Analysis: Techniques and Applications, 2e. New York: Routledge.
Hox, J.J., Moerbeek, M., and van de Schoot, R. (2018). Multilevel Analysis: Techniques and Applications, 3e. Taylor & Francis Group.
Hull, M.A., Reynolds, P.S., and Nunamaker, E.A. (2022). Effects of non-aversive versus tail-lift handling on breeding productivity in a C57BL/6J mouse colony. PLoS ONE 17 (1): e0263192. https://doi.org/10.1371/journal.pone.0263192.
Johnson, P.C.D. (2014). Extension of Nakagawa and Schielzeth's R2 GLMM to random slopes models. Methods in Ecology and Evolution 5 (9): 944–946. https://doi.org/10.1111/2041-210X.12225.
Julian, M. (2001). The consequences of ignoring multilevel data structures in non-hierarchical covariance modeling. Structural Equation Modeling 8: 325–352. https://doi.org/10.1207/S15328007SEM0803_1.
Kiernan, K. (2018). Insights into using the GLIMMIX procedure to model categorical outcomes with random effects. Paper SAS2179-2018. https://support.sas.com/resources/papers/proceedings18/2179-2018.pdf.
Kish, L. (1987). Statistical Design for Research. New York: Wiley.
Konstantopoulos, S. (2008). The power of the test for treatment effects in three-level cluster randomized designs. Journal of Research on Educational Effectiveness 1 (1): 66–88. https://doi.org/10.1080/19345740701692522.
Konstantopoulos, S. (2010). Power analysis in two-level unbalanced designs. Journal of Experimental Education 78 (3): 291–317.
Konstantopoulos, S. (2012). The impact of covariates on statistical power in cluster randomized designs: which level matters more? Multivariate Behavioral Research 47 (3): 392–420. https://doi.org/10.1080/00273171.2012.673898.
Lazic, S.E. (2010). The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neuroscience 11: 5. https://doi.org/10.1186/1471-2202-11-5.
Lazic, S.E., Clarke-Williams, C.J., and Munafò, M.R. (2018). What exactly is 'N' in cell culture and animal experiments? PLoS Biology 16 (4): e2005282. https://doi.org/10.1371/journal.pbio.2005282.
Lazic, S.E. and Essiou, L. (2013). Improving basic and translational science by accounting for litter-to-litter variation in animal models. BMC Neuroscience 14: 37. https://doi.org/10.1186/1471-2202-14-37.
Lorah, J. (2018). Effect size measures for multilevel models: definition, interpretation, and TIMSS example. Large-Scale Assessments in Education 6: 8. https://doi.org/10.1186/s40536-018-0061-2.
Maas, C.J.M. and Hox, J.J. (2005). Sufficient sample sizes for multilevel modelling. Methodology 1 (3): 86–92. https://doi.org/10.1027/1614-1881.1.3.86.
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. (2018). Sample Sizes for Clinical, Laboratory and Epidemiology Studies, 4e. Wiley-Blackwell.
McNeish, D., Stapleton, L.M., and Silverman, R.D. (2016). On the unnecessary ubiquity of hierarchical linear modeling. Psychological Methods 22: 114–140. https://doi.org/10.1037/met0000078.
Moerbeek, M. (2004). The consequences of ignoring a level of nesting in multilevel analysis. Multivariate Behavioral Research 39: 129–149. https://doi.org/10.1207/s15327906mbr3901_
Nakagawa, S., Johnson, P.C.D., and Schielzeth, H. (2017). The coefficient of determination R2 and intra-class correlation coefficient from generalized linear mixed-effects models revisited and expanded. Journal of the Royal Society Interface 14: 20170214. https://doi.org/10.1098/rsif.2017.0213.
Nakagawa, S. and Schielzeth, H. (2013). A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution 4 (2): 133–142. https://doi.org/10.1111/j.2041-210x.2012.00261.x.
Raudenbush, S.W. and Liu, X. (2000). Statistical power and optimal design for multisite randomized trials. Psychological Methods 5 (2): 199–213. https://doi.org/10.1037/1082-989X.5.2.199.
Raudenbush, S.W., Martinez, A., and Spybrook, J. (2007). Strategies for improving precision in group-randomized experiments. Educational Evaluation and Policy Analysis 29 (1): 5–29. https://doi.org/10.3102/0162373707299460.
Reynolds, P.S. (1996). Brood reduction and siblicide in Black-billed Magpies (Pica pica). The Auk 113 (1): 189–199. https://doi.org/10.2307/4088945.
Selya, A.S., Rose, J.S., Dierker, L.C. et al. (2012). A practical guide to calculating Cohen's f2, a measure of local effect size, from PROC MIXED. Frontiers in Psychology 3: 111. https://doi.org/10.3389/fpsyg.2012.00111.
Snijders, T.A.B. and Bosker, R.J. (1993). Standard errors and sample sizes for two-level research. Journal of Educational and Behavioural Statistics 18: 237–259.
Snijders, T.A.B. and Bosker, R.J. (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Thousand Oaks, CA: Sage.
Spybrook, J., Bloom, H., Congdon, R. et al. (2011). Optimal design plus empirical evidence: documentation for the "Optimal Design" software. http://hlmsoft.net/od/ (accessed 2022).
Torgerson, D.J. (2001). Contamination in trials: is cluster randomisation the answer? BMJ 322: 355. https://doi.org/10.1136/bmj.322.7282.355.
van Breukelen, G.J., Candel, M.J., and Berger, M.P. (2007). Relative efficiency of unequal versus equal cluster sizes in cluster randomized and multicentre trials. Statistics in Medicine 26 (13): 2589–2603. https://doi.org/10.1002/sim.2740.
Voillemot, M., Hine, K., Zahn, S. et al. (2012). Effects of brood size manipulation and common origin on phenotype and telomere length in nestling collared flycatchers. BMC Ecology 12: 17. https://doi.org/10.1186/1472-6785-12-17.
21

Ordinal Data

CHAPTER OUTLINE HEAD
21.1 Introduction
21.2 Sample Size Considerations
21.3 Sample Size Approximations
21.4 Paired or Matched Ordinal Data
21.5 Sample Size for Observer Agreement Studies
21.A Sample SAS Code For Calculating Cohen's κ from Raw Data
References

21.1 Introduction

Score data are extremely common in animal-based studies, especially for assessment of health- and welfare-related characteristics and for grading histopathological images. Scores are also common in survey and questionnaire studies, e.g. the Likert scale (Box 21.1). These variables are ordinal and consist of data that are categorical with a meaningful ordering or ranking.

Ordinal data are relatively simple to collect, as data collection involves counting the number of subjects in each of a number of predefined categories. However, analysis and interpretation can be difficult. Because the difference between any pair of adjacent scores is qualitative, a rank score can be 'better' or 'worse' than another, but the number codes for the ranks and the intervals between adjacent ranks do not have the same quantitative values as do continuous data measured on a ratio scale. Distances between intervals cannot be standardised (for example, the difference between 'Very good' and 'Good' is not the same as that between 'Moderate' and 'Poor'). Participant preferences for particular categories may be biased (e.g. respondents may prefer a 'neutral' or extreme categories over intermediate categories).

BOX 21.1
Common Examples of Ordinal Data

Likert scores: 1 = Strongly agree; 2 = Agree; 3 = Neutral; 4 = Disagree; 5 = Strongly disagree
Pain scores: 0 = No pain to 10 = Severe pain
Quality of life scores
Exercise tolerance: 1 = Very good; 2 = Good; 3 = Moderate; 4 = Poor
Demeanor: 1 = Alert, responsive; 2 = Mildly depressed; 3 = Moderately depressed; 4 = Minimally responsive; 5 = Unresponsive
Plumage condition scores: 1 = Fully feathered; 2 = Some small bare patches; 3 = Strongly damaged feathers and large bare patches; 4 = Mostly or completely denuded
Body condition scores: 1 = Emaciated; 2 = Underweight; 3 = Ideal weight; 4 = Overweight; 5 = Obese
Histopathologic scores (necrosis): 1 = Rare (<5%); 2 = Multifocal (6–40%); 3 = Coalescing (41–80%); 4 = Diffuse (>80%)



Figure 21.1: Two simulated ordinal data sets on a 5-point scale, both with median score 3. (a) The distribution mirrors
the cognitive ‘expectation’. (b) The median does not represent the clinical picture, which shows most subjects occurring
at either extreme and relatively few occurring at the median.

Basic arithmetic operations (addition, subtraction, multiplication, division) cannot be performed on ordinal data. Descriptive statistics used for continuous data (e.g. mean, median, mode, standard deviation) often do not have a clinically or biologically sensible interpretation when applied to ordinal data. Although it is sometimes suggested that ordinal data can be summarised by the median and mode, such summary statistics can be very misleading. Figure 21.1 shows two data sets with the same median score of 3; however, the biological or clinical interpretation will be very different. In the first sample, the distribution mirrors the cognitive expectation that the summary statistic reflects the central tendency of the data. In the second sample, most subjects occur at both extremes of the ranking, with relatively few occurring at the median.

Ordinal data can be analysed by conventional rank-based non-parametric tests, such as the Mann-Whitney U-test for two unpaired samples, the Kruskal-Wallis test (more than two unmatched groups), and the Friedman test (more than two matched-pair groups, or repeated measures). Cumulative ordinal regression, or cumulative link, models are more powerful alternatives (Agresti 2013).

21.2 Sample Size Considerations

Sample size determinations for ordinal data are based on the proportion of events expected for each category and the anticipated effect size, expressed as an odds ratio, the expected distribution of proportions across categories, and the allocation ratio (Campbell et al. 1995; Machin et al. 2018).

Sample size increases substantially as proportion representation becomes more unbalanced toward dominance of one category. If the mean proportions pi are expected to be roughly similar across all categories, then pi depends on the number of categories k and is approximately 1/k (Table 21.1). For a two-arm study with i = 4 categories of response, where responses across all four categories are expected to be similar, p1 = p2 = p3 = p4 = 1/4 = 0.25. In contrast, if there is one dominant category with sparse representation in the other categories, sample size will be inflated by approximately 43% over the uniform distribution case.

Maximum power will be obtained if sample sizes for each intervention group are equal. To correct for the allocation ratio, the sample size formula is adjusted by (1 + r)²/r. If there is an equal number of subjects in each group, then r = n1/n2 = 1, and the correction is (1 + 1)²/1 = 4. The expected distribution of proportions across categories can be obtained from previous data. If there is no information on the likely response pattern, sample size can be approximated based on the most likely pattern of distribution (Table 21.1).

21.3 Sample Size Approximations

Approximating sample size for ordinal categorical data is a four-step process.
Table 21.1: Sample size inflation resulting from proportion imbalance over four categories. Five common scenarios are
presented. The sample size for the uniform category distribution is represented by n0. The ratio n/n0 is the sample size
inflation caused by imbalances. Sample size requirements increase substantially with increasing dominance of a single
category.
Scenario p1 p2 p3 p4 n/n0
(Proportions p1–p4 correspond to score categories 0–3.)
Equally probable occurrence (uniform) 0.25 0.25 0.25 0.25 1.00
Graduated occurrence 0.10 0.20 0.30 0.40 1.04
Single under-represented category 0.10 0.30 0.30 0.30 1.14
Bimodal representation 0.05 0.05 0.45 0.45 1.15
Single dominant category 0.10 0.10 0.10 0.70 1.43

Source: Adapted from Whitehead (1993).
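The n/n0 ratios in Table 21.1 follow from the (1 − Σ pi³) term of the sample size formula given below; a minimal sketch verifying the single-dominant-category row:

data inflation;
  base = 1 - 4*(0.25**3);              * uniform distribution over k = 4 categories, = 0.9375;
  dom = 1 - (3*(0.10**3) + 0.70**3);   * single dominant category (0.10, 0.10, 0.10, 0.70), = 0.654;
  ratio = base/dom;                    * = 1.43, matching Table 21.1;
run;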

1. Obtain a best guess for the proportions expected for each category in the treatment and control groups.
2. Calculate cumulative proportions P0,i and P1,i for the treatment and control groups.
3. Determine an expected odds ratio OR for the new study, or, if pre-existing data are available, calculate the odds ratio for each category based on the cumulative proportions:

ORi = [P1,i / (1 − P1,i)] / [P0,i / (1 − P0,i)]

A 'summary' odds ratio for planning purposes can be estimated from the geometric mean of all the odds ratios.
4. Calculate the average proportions p of the treatment and control probabilities for each category i.

Then, sample size is

N = [3 (1 + r)² / r] (z1−α/2 + z1−β)² / [ (ln OR)² (1 − Σ pi³) ]

If allocation is equal, then (r + 1)²/r is 4. If all average proportions p are approximately equal, then 3/(1 − Σ pi³) depends only on the total number of categories k, and reduces to 3/[1 − (1/k²)].

Example: Sample Size from Raw Count Data: Lung Tissue Histopathology Changes in Mice

An experiment tested a new anti-inflammatory compound versus an inert control randomly allocated to 40 mice, with 20 mice per group. Histopathology scores (0 = normal, 1 = mild, 2 = moderate, 3 = severe) describing inflammatory changes in lung tissue were obtained from the mice after a two-week course of treatment. The histopathology scores were as follows:

Test: 0 2 1 0 1 2 0 0 0 0 1 0 0 0 3 0 1 0 0 0
Control: 0 0 2 2 2 1 3 1 3 1 0 0 2 0 0 2 2 0 1 1

Suppose it is desired to conduct a new study, for which it is assumed that the distribution of lung damage scores will be similar to the previous study. There will be equal sample sizes in each of the control and intervention groups. What is the new sample size?

First, reorganise the data in tabular form and calculate counts and proportions for each category. The proportions p for each of the i = 4 histopathology categories are shown in Table 21.2.
252 A Guide to Sample Size for Animal-based Studies

Table 21.2: Mouse lung tissue histopathology: calculation


P1,1 1 − P1,1 0 85 1 − 0 85
of proportions for each histopathology category. OR1 = = = 3 78
P0,1 1 − P0,1 0 60 1 − 0 60
Histopathology Number Proportion
category i of mice of mice
Control, Test, Control, Test,
P1,2 1 − P1,2 0 95 1 − 0 95
OR2 = = = 2 11
n0 n1 p0 p1 P0,2 1 − P0,2 0 90 1 − 0 90
0 7 13 0.35 0.65
The odds ratio for the new study is estimated
1 5 4 0.25 0.20 from the geometric mean of the odds ratios
2 6 2 0.30 0.10 derived from the previous data:
3 2 1 0.10 0.05 1 3
ORnew = 3 45 3 78 2 11 = 3 02, or ≈ 3
Total 20 20 1.00 1.00

The average proportion pi is calculated for each


category pair: (poi + p1i)/2 and summarised in
Table 21.3:
Three odds ratios are estimated from these data Then for α = 0.05 and 1 − β = 0.8, sample size
for comparing baseline risk against each of the N is
remaining three categories: 2
3 1 96 + 0 8416
N≥ 4
P1,0 1 − P1,0 0 65 1 − 0 65 1 − 0 145 ln 3 2
OR0 = = = 3 45
P0,0 1 − P0,0 0 35 1 − 0 35
=91.3, or approximately 92.

Table 21.3: Mouse lung tissue histopathology: calculation of expected proportions for a new study based on prior odds
ratios.

Histopathology Number of mice Proportion of mice Average p Cumulative proportion


category i
Control, n0 Test, n1 Control, p0 Test, p1 Control, P0,i Test, P1,i

0 7 13 0.35 0.65 0.50 0.35 0.65


1 5 4 0.25 0.20 0.23 0.60 0.85
2 6 2 0.30 0.10 0.20 0.90 0.95
3 2 1 0.10 0.05 0.08 1.00 1.00
Total 20 20 1.00 1.00

Example: Sample Size for Detecting expected odds ratio in favour of the test inter-
a Pre-Specified Odds Ratio vention is 2. What sample size is required to
detect ORnew = 2 with 95% confidence and
Suppose a new study is planned where the dis-
80% power?
tribution of subjects in the control arm is
Here the proportions and cumulative propor-
expected to be similar to those in the initial
tions for the test group must be estimated from
study. Disproportionately more mice in the test
the control group data and the planned odds ratio
group are expected to be in lowest lung dam-
(Machin et al. 2018). First, calculate the expected
age categories and more mice in the control
cumulative proportion
group with higher lung damage scores. The
Ordinal Data 253

Table 21.4: Mouse lung tissue histopathology: Calculating expected proportions in each histopathology category when
the odds ratio for a new study is pre-specified.
Histopathology Proportion of mice New Cumulative proportion
category i average p
Control, Expected Control, Expected
p0 test p1 P0,i test P1,i

0 0.35 0.519 0.435 0.35 0.519


1 0.25 0.231 0.241 0.60 0.750
2 0.30 0.197 0.249 0.90 0.947
3 0.10 0.053 0.076 1.00 1.000
Total 1.00 1.00

P = ORnew P1 1 − P1 + ORnew P1 The expected proportions pi for each category


are back-calculated as the differences between
Then
cumulative proportions: p1 = 0.519; p2 = (0.750
P1 = 2 0 35 1 − 0 35 + 2 0 35 = 0 519, − 0.519) = 0.231, and so on (Table 21.4). Then
the new 3
k 3
is 3/(1 − 0.112) = 3.38, and
1−
P2 = 2 0 60 1 − 0 60 + 2 0 60 = 0 750, i=1
p

2
P3 = 2 0 90 1 − 0 90 + 2 0 90 = 0 947, N k = 3 38 4
1 96 + 0 8416
and P4 = 1 ln 2 2
= 220.8, or approximately 221.

To estimate sample size for paired data based on a


21.4 Paired or Matched pre-specified odds ratio, Julious et al. (1999) suggest
Ordinal Data first estimating the expected total number of dis-
cordant paired observations. The proportion of dis-
Before-after and crossover studies have each subject
cordant pairs pdisc is s + t and the pre-specified odds
serving as its own control so that data are paired.
ratio OR is s/t. The expected number of discordant
Matched case–control studies match subjects on
pairs ndisc for the pre-specified odds ratio is
pre-specified characteristics except for the condition
of interest (Julious and Campbell 1998; Julious et al. 2
z1 − α 2 OR + 1 + 2 z1 − β OR
1999). These designs result in correlated outcomes. ndisc = 2
In paired studies with binary outcomes (posi- OR − 1
tive = 1, negative = 0), there are four possible out-
comes for the first and second responses (Table 21.5).

Table 21.5: In paired studies with binary outcomes (positive = 1, negative = 0), there are four possible outcomes for the first
and second responses.
First Second Sum of first and
response response second responses

Both negative (agree) 0 0 r


First negative, second positive (discordant) 0 1 s
First positive, second negative (discordant) 1 0 t
Both positive (agree) 1 1 u
Total r+s+t+u
254 A Guide to Sample Size for Animal-based Studies

Finally, the total number of paired observations multiple observers on the same specimen) and
Npair is repeatability (the similarity in scores for the same
observer over multiple assessments on the same
N pair =
2
specimen). The reliability of the measurements is
z1 − α 2 OR + 1 + z1 − β pdisc OR + 1 2 − OR − 1 2 interpreted in the context of a priori definition of
the minimally acceptable difference between obser-
2
pdisc OR − 1 vers (Hallgren 2012). Detailed coverage of the design
of reproducibility and repeatability (R & R) studies is
If there is a single dominant category, Julious beyond the scope of this chapter; however, detailed
et al. (1999) suggest that preliminary sample size guidance can be found elsewhere (e.g. Montgomery
estimates can be obtained by dichotomising the data and Runger 1993a, b; https://sixsigmastudyguide.
into the dominant group and one other by pooling com/repeatability-and-reproducibility-rr).
observations from the other categories, then esti- Rater agreement studies are a special case of
mating sample size as above. repeated binary or categorical ordinal data (Fleiss
and Cohen 1973; Fleiss 1981). In these studies,
Example: Stress Reduction in Healthy Dogs two or more observers assess the same specimens
or items, often several times. Cohen’s kappa coeffi-
Mandese et al. (2021) reported that 27/44 dogs cient (κ) is used to measure both inter-rater and
(61%) showed a major increase in stress levels rel- intra-rater reliability for two raters only. Fleiss’ κ
ative to baseline following separation from own- is used for >2 raters (Fleiss 1971) but is not a gener-
ers during routine physical exams. There were alisation of Cohen’s κ. For estimating inter-rater
16/44 (36%) dogs that showed a reversal in stress reliability, the number of specimens to be assessed,
levels between baseline and the subsequent N, can be estimated from the expected proportion of
assessment (discordance). discordant ratings pdisc and Cohen’s κ as follows:
Suppose a new study was to be designed to
detect a 50% reduction in risk of stress (OR = po − pe
κ=
0.5) with 80% power and 95% confidence. Then, 1 − pe
the number of discordant paired observations
required will be where po is the observed agreement between raters,
and pe is the agreement expected by chance
2 (Appendix 21.A). Then the number of specimens
1 96 0 5 + 1 + 2 0 8416 05
ndisc = 2 ≈ 68 N to be assessed by at least two individual raters
0 5−1
with precision δ, a pre-specified κ, and confidence
From the previous data, it was expected that s interval 100(1 − α)% is
+ t = 0.36. The new OR = 0.5. Therefore, the
4 1−κ
approximate sample size is 68/0.36 ≈ 189 N = z21 − α 2
δ2
κ 2−κ
1 − κ 1 − 2κ +
21.5 Sample Size for 2 pdisc 1 − pdisc

Observer Agreement To estimate within-observer reproducibility, the


number of specimens to be assessed by each rater
Studies is given by
Rater agreement studies are essential for validating
scoring systems by quantifying variation caused by 2 p 1 − p 1 − 2p 1 − p
both the measurement system and the personnel N reprod = z21 − α 2
1 − 2 p δ2
performing the measurements. A valid scoring sys-
tem will produce consistent results when different
observers score the same animal or specimen inde- where
pendently or if the same rater scores the same ani-
mal or specimen multiple times. The goal is to assess 1 − 1 − 2 pdisc
p =
both reproducibility (the similarity in scores made by 2
Ordinal Data 255

Example: Rater Agreement proportion of discordant observations pdisc is 13/


50 = 0.26.
Two pathologists evaluated the same 50 histology Then the number of specimens to be assessed
specimens for evidence of cancerous lesions. The to capture a precision δ of 10% with 95% confi-
data are as follows (Table 21.6): dence is
The observed agreement between two raters p0
is calculated as the proportion of total ties, calcu- 4 1 − 0 48
lated as the sum of [(both scored negative) + N = 1 962 1 − 0 48 1 − 2 0 48
0 12
(both scored positive)] divided by the total num-
ber of readings: 0 48 2 − 0 48
+ ≈ 1532
2 0 26 1 − 0 26
p0 = 21 + 16 50 = 37 50 = 0 74
Sample sizes can be reduced substantially
The probability of agreement by chance pe is when raters are trained up to a consistent high
the proportion of times each rater classified standard before the study begins so that the num-
responses as either positive or negative: ber of discordant observations are minimised. For
example, for good agreement κ =0.8 and pdisc =
pe = 4 + 16 50 9 + 16 50 + 21 + 9 50 0.1, then
21 + 4 50 = 0 20 + 0 3 = 0 5
4 1−0 8
N = 1 962 1−0 8 1−2 0 8
Then κ = (0.74 − 0.5)/(1 − 0.5) = 0.48. Good 0 12
agreement is indicated by κ close to 1. In this exam-
0 8 2−0 8
ple, the agreement is not particularly good. The + ≈ 373
2 0 1 1−0 1

Table 21.6: Data for two pathologists evaluating the same 50 histology specimens.
Rater 1 Rater 2 Number of specimens

Both negative (agree) 0 0 21


First negative, second positive (discordant) 0 1 9
First positive, second negative (discordant) 1 0 4
Both positive (agree) 1 1 16
Total 50

run;
21.A Sample SAS Code For proc freq data=test;
Calculating Cohen’s κ from tables rater1*rater2 /agree;
weight count;
Raw Data test kappa;
exact kappa;
data test; title "Cohen's Kappa Coefficients";
input rater1 rater2 count @@; run;
datalines;
............ The macro MKAPPA to calculate κ for more than
; two raters is described by Chen et al. (2012)
256 A Guide to Sample Size for Animal-based Studies

References Julious, S.A. and Campbell, M.J. (1998). Sample size cal-
culations for paired or matched ordinal data. Statistics
Agresti, A. (2013). Categorical Data Analysis, 3rd ed. New in Medicine 17: 1635–1642.
York: Wiley. Julious, S.A., Campbell, M.J., and Altman, D.G. (1999).
Campbell, M.J., Julious, S.A., and Altman, D.G. (1995). Estimating sample sizes for continuous, binary, and
Estimating sample sizes for binary, ordered categori- ordinal outcomes in paired comparisons: practical hints.
cal, and continuous outcomes in two-group compari- Journal of Biopharmaceutical Statistics 9 (2): 241–251.
sons. BMJ 311: 1145–1148. Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H.
Chen, B., Zaebast, D., and Seel, L. (2012). A macro to (2018). Sample Sizes for Clinical, Laboratory and Epide-
calculate kappa statistics for categorizations by multi- miology Studies, 4th ed. Wiley-Blackwell.
ple raters. https://support.sas.com/resources/ Mandese W, Griffin F. Reynolds P, Blew A, Deriberprey
papers/proceedings/proceedings/sugi30/155-30. A, Estrada, A (2021). Stress in client-owned dogs
pdf (accessed 2022). related to clinical exam location: a randomised crosso-
Fleiss, J.L. (1971). Measuring nominal scale agreement ver trial. Journal of Small Animal Practice, 62(2):82–88.
among many raters. Psychological Bulletin 76: 378–382. https://doi.org/10.1111/jsap.13248.
Fleiss, J.L. (1981). Statistical Methods for Rates and Pro- Montgomery, D.C. and Runger, G.C. (1993a). Gauge
portions, 2nd ed. New York: Wiley. capability analysis and designed experiments. Part I:
Fleiss, J.L. and Cohen, J. (1973). The equivalence of basic methods. Quality Engineering 6 (1): 115–135.
weighted kappa and the intraclass correlation coeffi- Montgomery, D.C. and Runger, G.C. (1993b). Gauge
cient as measures of reliability. Educational and Psy- capability analysis and designed experiments. Part II:
chological Measurement 33 (3): 613–619. https:// experimental design models and variance component
doi.org/10.1177/001316447303300309. estimation. Quality Engineering 6 (2): 289–305.
Hallgren, K.A. (2012). Computing inter-rater reliability Whitehead, J. (1993). Sample size calculation for
for observational data: an overview and tutorial. Tuto- ordered categorical data. Statistics in Medicine 12:
rial in Quantitative Methods for Psychology 8 (1): 23–34. 2257–2271.
https://doi.org/10.20982/tqmp.08.1.p023.
22
Dose-Response Studies

CHAPTER OUTLINE HEAD


22.1 Introduction 257 22.A.1 Calculating Logit and Probit
22.2 Sample Size Requirements 258 Regression Weights 264
22.2.1 Translational Considerations 259 22.A.2 Determining Relative Efficacy
22.3 Dose-Response Regression Models 260 in a Dose-Response Study.
(Data from Kodell et al. 2010.) 264
22.4 Sample Size: Binary Response 260
22.A.3 Solving for Power Using the
22.5 Sample Size: Continuous Response 262 Iterative Exact Method 265
22.5.1 Linear Dose-Response 262
References 265
22.5.2 Nonlinear Dose-Response 263
22.A SAS Code for Dose-Response
Calculations 264

22.1 Introduction BOX 22.1


Preclinical dose-response studies are often used for Dose-Response Studies
preliminary investigations of efficacy, safety, or
Dose-response studies determine
toxicity (Salsberg 2006; Dmitrienko et al. 2007,
Machin et al. 2018). Because they quantify both Efficacy = relative potency of the test intervention
the magnitude and shape of responses, dose- relative to a control
response designs are more informative than the Outcome measures:
classic two-group or ‘one-way’ comparisons
Shape of the dose-response relation
(Wong and Lachenbruch 1996).
Minimum effective dose
The goal of a dose-response study is to compare the
Maximum dose
effect of one or more agents on a biological response
Median effective dose ED50:
relative to a control. Biological responses are meas-
Change in response per unit change in dose (slope)
ured along an ordered dose gradient. Specific objec-
Shift.
tives (Dmitrienko et al. 2007, Bretz et al. 2008)
include establishing if a dose-response relationship
exists at all (proof of concept or proof-of-activity);
determining efficacy of a test compound relative to a Informative dose-response studies depend on
standard or control; defining form and shape of the existence of a positive relationship or trend
dose-related trends; estimating optimal doses; and between dose and response, rather than statistically
estimating the optimal therapeutic window (Box 22.1). significant ‘effects’ that are not dose-related (Kodell

A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
258 A Guide to Sample Size for Animal-based Studies

Response Efficacy outcome measures must be clearly iden-


Y
Slope
tified, quantitative, and measurable. These can be
Maximum binary (e.g. death yes/no; tumour presence/
100% efficacy absence), counts (e.g. cell counts), time to event
(e.g. time to predetermined tumour size or
50% humane endpoint), or continuous (e.g. tumour area,
biomarker concentrations, fluorescence) (Salsberg
2006, Bretz et al. 2008). Baseline characteristics
Dose for each dose group are described by means, stand-
ED50
ard deviations, and group size n. Measures of effect
are expressed in terms of mean or standardised
Response mean differences, with the placebo or vehicle con-
Y
trol group as the reference group.
Control Test Biologically relevant trends are rarely associated
with ‘statistically significant’ differences between
dose groups, and significance tests performed on
Shift
each dose point separately are likely to be both mis-
leading and invalid. Hypothesis testing should focus
either on parameter estimates for curve metrics or
Dose area under the curve (AUC) rather than attempting
Figure 22.1: Key metrics describing dose-response curves. multiple testing on each dose point (Altman 1991).
See text for details. Simple data plots and summary statistics will be
most effective for initial determinations of magni-
and Chen 1991). Therefore, before proceeding with tude and shape.
a dose-response study, the researcher should first
establish if responses to candidate agents are likely
to differ from placebo, and second, if a dose- 22.2 Sample Size
response trend is likely to exist. If a trend is possible,
further information should be collected to deter- Requirements
mine the choice of dose, dose range, and type of con- ‘Good’ information for dose-response studies means
trol. Supporting evidence for decision to proceed precision of metric estimates (smallest possible
with a formal dose-response trial is based on prior standard error) and power for model discrimination
information or proof of concept studies (Laird and tests of efficacy (Wong and Lachenbruch 1996).
et al. 2021). Simulations suggest that the traditional design of
The dose-response curve is quantified by four four dose groups with five animals of each sex for
metrics (Figure 22.1): each group will be underpowered to detect differ-
ences as large as 30% (Slob et al. 2005). The goal
Maximum efficacy: Ceiling, or maximum, res- of a dose-response study is usually to demonstrate
ponse (position on the y-axis); that a test intervention is ‘better’ than a control by
Potency: Position of the curve along the dose axis some biologically relevant amount. Therefore, con-
(x-axis). It measures the activity of the drug in trols are usually negative (placebo, vehicle). Dose-
terms of the dose or concentration of the drug response studies without a control cannot be used
required to produce a defined effect (y-axis). to conclude that a drug effect exists.
Shift is the displacement of the test agent curve Sample size determinations will require the fol-
along the x-axis relative to control; lowing information:
Median effective dose ED50: The dose that on aver- Number of treatment arms. The basic design is
age induces a response in 50% of subjects; a parallel two-arm study, with intervention
Slope: Change in response per unit change treatments and doses randomly allocated to
in dose. animals. Sample size can be greatly reduced
Dose-Response Studies 259

by crossover designs (Senn 2002), factorial shape with only two parameters (Crippa and
designs (Dmitrienko et al. 2007), or Monte Orsini 2016). Interpreting appearance of curve
Carlo-based methods that simultaneously shapes based on historical data must be per-
assess multiple heterogeneous tumour models formed with care, as artefact can be introduced
(Ciecior et al. 2021). by small sample sizes, lack of randomisation,
confounders, large variability, and ‘fishing’
Efficacy metric type: binary, ordinal, time to event,
(analysing large number of endpoints)
continuous. Much smaller sample sizes are
(Thayer et al. 2005). However, design features
needed when the outcome is continuous com-
that increase precision, such as dose placement
pared to categorical or time-to-event outcomes.
and number of dose levels, coupled with con-
Dose range. The dose range must be wide enough sistency in experimental procedures, are more
to cover the expected response range and dis- important than model choice (Slob et al. 2005).
criminate different models of trend (e.g. linear
versus quadratic or nonlinear). Doses should be
chosen to bracket the benchmark dose (usually 22.2.1 Translational Considerations
zero) and the likely maximum dose (Slob et al. Dose-response sample size planning should con-
2005). Choice of the maximum dose must be sider three major factors affecting translation poten-
made with care. If the dose is too high, unex- tial of the model: sex, allometric scaling, and
pected or severe adverse events related to safety application of 3Rs principles. These are described
and tolerance may occur, and if too low, effi- in more detail in Chapter 6.
cacy effects will not be detectable (Pinheiro
et al. 2006, Bretz et al. 2008). Sex. National Institute of Health best-practice
standards for animal-based research strongly
Dose placement. Selection of dosage levels and encourage the inclusion of female animals
how they are spaced are crucial design compo- (Clayton and Collins 2014 and consideration
nents for reliable descriptions of the dose- of sex as a biological variable (Miller et al.
response relationship. At least five or six dose 2017). Regulatory guidelines may mandate test-
groups are recommended, especially when out- ing of both sexes at each dose level (e.g. OECD
comes are highly variable (coefficient of varia- Test Guidelines https://www.oecd.org/). Sam-
tion of 18% or more; Slob et al. 2005). When ple sizes can be minimised by careful choice of
the shape of the response is not known before- statistically–based experimental designs.
hand, equally spaced designs are more robust
than optimal designs (e.g. D-optimal and c- Allometric scaling. Care must be taken when deter-
optimal designs with dose levels determined mining dose ranges derived from interspecific or
by optimisation techniques; Holland-Letz and intraspecific allometric dose conversions. Gen-
Kopp-Schneider 2015). eral scaling relationships may be useful for initial
approximations. However, allometric predic-
Expected response frequency. It is usually assumed tions for pharmacokinetic relationships should
that higher doses will result in more pro- be based on quality data and validated up-to-date
nounced responses compared to lower doses methodologies (Blanchard and Smoliga 2015).
(Dmitrienko et al. 2007).
3Rs principles. Regulatory agencies promote the
Model choice. The shape expected for the response incorporation of reduction and refinement
curve will determine the form of the model to methods and processes into protocols, and
be fitted to the data, and therefore the number increasingly, replacement with non-animal
of regression coefficients to be estimated (Bretz models, such as cell and tissue-based assays.
et al. 2008). Linear regression is useful for first- For example, OECD guidelines for chemical
pass approximations of the overall dose- testing in animals1 provide practical
response relation, and for initial estimates of
nonlinear parameter values. Splines have the 1
https://www.oecd.org/chemicalsafety/testing/
advantage of being able to fit almost any curve oecdguidelinesforthetestingofchemicals.htm
260 A Guide to Sample Size for Animal-based Studies

recommendations for application of 3Rs meth- sometimes be simplified by logarithmic transforma-


ods in acute and chronic toxicity assays. Mouse tion to linearise the model. Residuals and fit statis-
oncology models commonly use tumour bur- tics should be examined before analysis to
den (tumour size, number of tumours, tumour determine if the chosen model provides the best
recurrence) to assess tumorigenicity, progres- fit to the data.
sion, and response to treatment. Because Model building can begin with ordinary least-
tumour burden is a critical humane endpoint, squares linear regression on prior or exemplary data
follow-up time and terminal tumour sizes or anticipated values. Preliminary slope estimates
must be predefined, and humane limits strictly can be obtained from the difference of the expected
adhered to (Nunamaker and Reynolds 2022). mean values for the highest and lowest dose,
divided by the dose range:

meanmax − meanmin
β1,plan =
22.3 Dose-Response dosemax − dosemin
Regression Models Preliminary estimates for minimum and maximum
The basic dose-response design is the parallel-line efficacy are obtained from the anticipated maxi-
assay, modelled as a linear regression: mum and minimum values of the response variable.
Preliminary estimates of the median effective dose
Y = β0 + β1 D + β2 X ED50 can be obtained from the median value of
planned doses covering the range of the anticipated
Here β0 is the intercept, β1 is the slope describing the therapeutic effect.
change in Y with dose D, and β2 is the coefficient for Hypothesis tests are one-tailed. This is because
describing the difference between control (X = 0) most dose-response studies are designed to assess
and test treatment (X = 1) arms (Finney 1978; differences in response to increasing doses, so there
Figure 22.2). is only one pre-specified test direction.
When the response is binary, logit or probit trans-
formations are used to linearise the data. Nonlinear
regression models are used to estimate more com- 22.4 Sample Size: Binary
plex relationships between dose and response.
For continuous response data, the 3-parameter or
Response
4-parameter nonlinear Emax models are commonly Binary responses are modelled as proportions p for
used. When dose concentrations span a very wide each dose group. Therefore the distribution of the
range or increase exponentially, analyses can response will be sigmoid, and the relation is line-
arised by the logit or probit transformation of pro-
portions (Finney 1978, Kodell et al. 2010). The
form of the regression is

Y = F − 1 P = β0 + β1 log 10 D + β2 X
Control Test
Probit Logρ intervention
or logit
where P is the proportion of subjects exposed to dose
D that shows the response. For logit analysis, F is
the logistic cumulative distribution function (cdf ),
where F(x) = 1/(1 + e−x). For probit analysis, F is
ED50 ED50 Log10 (Dose) the normal (Gaussian) cdf.
control test
Efficacy (ρ) is the ratio of the median dose for the
Figure 22.2: The probit regression model for quantifying test agent relative to control:
efficacy between a test and control group. Efficacy is the
horizontal difference between parallel test and control ED50 Test
regression lines. The Y axis is the probit score for a binary ρ=
response, and the X axis is the log-transformed doses. ED50 Control
Dose-Response Studies 261

(Finney 1978). The difference in efficacy (log10ρ) respectively. The total sample size N is the product
between test and control groups is the difference of the sample size per group, the number of groups
in the log10 transformed values. It is shown graphi- in each treatment arm, and the number of treatment
cally as the horizontal distance between parallel test arms, so that N = 2 g ni (Kodell et al. 2010).
and control regression lines (Figure 22.2):

log10 ρ = log10 ED50 Test Example: LD50 Radiation Lethality Study


in Mice
− log 10 ED50 Control
Kodell et al. (2010) reported the results from a 10-
The null hypothesis is log10(ρ) = 0 versus the alter- day radiation lethality study in mice. Eighty mice
native hypothesis HA: log10(ρ) > 0. were randomly allocated to one of five radiation
For g dose groups, the sample size per dose group dose levels and one of two treatment groups (drug
n, is versus control) so that there were 8 mice in each
2 g of 10 groups. Probit regression was fitted to these
2 tdf ,1 − α + t df ,1 − β 1 i = 1 wi
n data to obtain the common slope b and the esti-
2
b1 log10 ρ mated LD50 values for each treatment group
(Appendix 22.A.2).
where b log10(ρ) is the effect size obtained from b The probit slope is 28.51. The log10LD50 (con-
(the estimate of the logit or probit slope based on trol) is 0.95117 and log10LD50(drug) is 1.01814.
prior data), ρ (the target efficacy), and wi are regres- Log efficacy log10(ρ) is
sion weights. The regression weights wi are obtained
from the expected target response proportions Pi in log 10 ρ = log 10 LD50 Drug − log 10 LD50 control
each dose group. = 1 018 − 0 951 = 0 067
For logit analysis, the logit transformation
weights are
The estimate of efficacy is obtained from the anti-
wi = Pi 1 − Pi log of the difference:
For probit regression, the probit weights are ρ = anti-log 10 0 067 = 1 167
2
φ Φ − 1 Pi Therefore, relative efficacy is approximately 1.17.
wi =
Pi 1 − Pi
where φ is the normal probability density function Suppose a new study is planned to detect a range
(pdf ), and Φ−1 is the cumulative density function of relative efficacies of 1.05, 1.1, and 1.2 based on
(cdf ) of the standard normal distribution (SAS code these data. What sample sizes are required to detect
is given in Appendix 22.A.1). these relative efficacies with 95% confidence and
To solve for power, the equation is re-arranged to power 80%?
obtain the critical t-value for 1 − β over a range of
candidate sample sizes ni 1. Obtain the anticipated response rates for
each dose group, and convert to proportions.
2 g
In this example, prior data suggested that
b1 log 10 ρ i = 1 wi anticipated response rates will be approxi-
t df ,1 − β = n − tdf ,1 − α
2 mately 5%, 27.5%, 50%, 72.5%, and 95% for
each of the five dose groups (pi = 0.05,
then power is calculated iteratively from the cumu- 0.275, 0.50, 0.725, 0.95).
lative density function for each t value. The degrees 2. Calculate regression weights. Logit regres-
of freedom df for the t-distribution depend on the sion weights are calculated as wi = Pi
number of dose groups, not the number of subjects. (1 − Pi). Probit regression weights are calcu-
Therefore, df = 2g − 3. For example, if there are g = lated from the cdf and pdf functions
5 dose groups, the degrees of freedom are 2(5) − (Appendix 22.A.1).
3 = 7, and the critical t values for α = 0.05 and 3. Calculate power over a range of candidate n
β = 0.2 are t7,0.95 = 1.895, and t7,0.80 = 0.896, (Appendix 22.A.3). For 95% confidence,
262 A Guide to Sample Size for Animal-based Studies

ρ = 1.2 2
1.0 z1 − α + z1 − β
N =g n=g 2
ρ = 1.15 D b s
0.9
ρ = 1.10
0.8
(Machin et al. 2018). Here g is the number of dose
0.7 levels, and D is the adjustment for the doses di
0.6 and dose range:
Power

0.5 ρ = 1.05
g−1 2
0.4
D= i=0
di − d
0.3
If the doses di are equally spaced, then D reduces
0.2 to g(g2 − 1)/12. The effect size b/s is estimated by the
0.1 regression slope b obtained from the difference in
0.0
expected mean response values at the maximum
3 5 7 9 11 13 15 17 19 and minimum dose: b = (μmax− μmin)/(dmax− dmin),
n per group and the anticipated standard deviation s.
Figure 22.3: Power curves for a range of sample size per
group n and efficacy ρ.
Example: Dose-Response Effects of Vitamin
Source: Data from Kodell et al. (2010).
D on Mouse Retinoblastoma
α = 0.05, and the t value t7,0.95 is 1.895. Based (Data from Dawson et al. 2003). To determine
on the previous data, a common slope of 28.5 the effectiveness of a vitamin D analogue in
is assumed for the calculations. Power is inhibiting retinoblastoma, transgenic mice
calculated by deriving the cdf for each value (N = 175) were fed a vitamin D- and calcium-
of t1 − β, iterating over the entire range of can- restricted diet, then randomised to five dose
didate n and choosing the n that results in groups of a vitamin D analogue: 0.0 (vehicle con-
power >0.8. Total sample size is then calcu- trol), 0.1, 0.3, 0.5, or 1.0 μg. Outcomes were
lated as 2(5) n. tumour area (μm) and serum calcium (mg/dL).
Results are shown in Figure 22.3. A sample size of The authors concluded there was no dose-
n = 8 per group (or total N = 80) has power of only dependent response. However, examination of
46% to detect relative efficacy of 1.05 with α = 0.05. the data suggests that the most dramatic tumour
Approximately 20 mice per group (N = 200) are reductions occurred at doses between 0 and 0.1.
required to detect efficacy this small with 95% con- It was observed that undesirable adverse events,
fidence. In contrast, to detect a larger relative effi- such as hypercalcemia (calcium > 10 mg/dL)
cacy of ρ = 1.2, only two mice per group (N = 20) and mortality, were minimised at very
are required for power >90%. However, it must be low doses.
determined ahead of time if an efficacy of this mag- Suppose a new dose-response study is planned
nitude is even biologically feasible before consider- for candidate doses <0.1 μg, with the objective of
ing such small sample sizes for a study. detecting a per cent reduction in tumour area of
at least 65% with standard deviation of 20%, one-
sided α = 0.05, and power 1 – β = 0.8. The candi-
22.5 Sample Size: date doses were 0.025, 0.05, and 0.1 μg against
vehicle control of 0 μg. In the original study,
Continuous Response the initial tumour area was approximately 90 ×
103 μm3 for the control group. Then the expected
22.5.1 Linear Dose-Response tumour size at 0.1 μg is 31.5, the slope is (90 −
When the response is continuous and normally dis- 31.5)/(0 − 0.1) = −585, and s = 0.2∗585 = 117.
tributed, and the relation between response and There will be g = 4 new dose groups. Doses are
dose levels is linear, total sample size N can be 2
not equally spaced, so D = 4−1 i = 0 di −d The
approximated by:
Dose-Response Studies 263

different groups require the same values for dose


mean dose is (0 + 0.025 + 0.05 + 0.1)/4 = 0.044,
concentrations, and AUC units of measurement
and D = (0.0–0.044)2 + (0.025–0.044)2 + (0.05–
are difficult to interpret.
0.058)2 + (0.1–0.058)2 = 0.005. Approximate sam-
Sample size can be approximated by using a pre-
ple size is
specified set of linear contrasts to capture antici-
1 645 + 0 8416 2 pated differences in curve shape and test hypotheses
N ≥4 ≈ 198 (Tukey et al. 1985, Stewart and Ruberg 2000, Chang
0 005 585 117 2
and Chow 2006, Dmitrienko et al. 2007). Contrasts
or approximately 50 mice per dose level. are a set of integers that code for expected differ-
ences between pairs of treatment means. Contrasts
are constrained to sum to zero. They can be custo-
mised to compare specific effects rather than just
22.5.2 Nonlinear Dose-Response the overall difference between null versus alterna-
Many dose-response patterns are nonlinear. Models tive hypotheses. Contrasts can be coded to compare
are fitted by nonlinear regression, and parameter the mean for a specified dose level to the overall
estimates and associated confidence intervals mean or to make specific comparisons for user-
obtained for all relevant metrics. defined pairs. For example, if it was desired to test
The Emax model is a common choice for nonlin- the mean response for the placebo group against
ear dose-response curves (MacDougall 2006). The responses for three other dose groups, then the con-
3-parameter model describes a concave-down func- trasts would be (−3 1 1 1).
tional response: Minimum sample size depends on the correct
specification of the model shape, and therefore
Emax D
Y i = E0 + choice of contrast coefficients (Chang and Chow
ED50 + D 2006). Therefore, the major disadvantage of this
The 4-parameter model describes a sigmoid func- method is that contrasts need to be determined a
tional response: priori, and model misspecification will result in a
loss of power.
E max Dh The null hypothesis for evaluating means for k
Y i = E0 +
ED50 h + Dh dose groups is:
In both models, D is dose, E0 is the baseline, or H 0 c1 θ1 = c12 θ2 = … = ck θk
zero-dose effect when no drug is present (placebo
or vehicle control), Emax is the ceiling or maximum where ci are contrasts for each expected change in
dose effect, and ED50 is the dose which produces response θi from the reference level. The value of
50% of the maximum effect. The fourth parameter the contrasts must sum to zero. The null hypothesis
is h, the Hill or slope parameter. When h = 1 the k
is H o i = 0 ci θ i = 0, and the alternative hypoth-
model reduces to the 3-parameter form. Initial k
esis for effect is i = 0 ci θi = ε, where ε is the
parameter estimates for fitting nonlinear models
expected improvement in response for the ith
can be obtained from either prior or exemplary data
dose group.
(MacDougall 2006). Nonlinear models cannot pro-
Sample size per dose group n is
duce R2 values for coefficient estimates. Individual
parameter estimates and confidence intervals are 2
used to assess statistical significance of the regres- z1 − α + z1 − β s k c2i
n=
sion coefficients. ε i=0 f
i
The area under curve (AUC) is an alternative
analysis method (Altman 1991), if initial nonlinear where ci are the expected values for the contrasts.
parameter estimates are unavailable, if the nonlin- The standard deviation s can be estimated from
ear model cannot be fitted to the raw data, or prior data. The sample size fraction fi is a measure
the dose range is too narrow to obtain sufficient of the balance of sample sizes across dose groups.
precision of key metrics (i.e. the confidence For example, if there are four dose groups, then a
intervals are too wide to be practical). Disadvan- balanced design has fi = 1/4 = 0.25 (Chang and
tages of this method are that comparisons of Chow 2006).
264 A Guide to Sample Size for Animal-based Studies

datalines;
Example: Dose-Response Effects of Vitamin 0.05 0.275 0.5 0.725 0.95
D on Mouse Retinoblastoma: Contrasts ;
Method data prob;
set prob;
A new mouse retinoblastoma study was planned
to test the hypothesis of a linear dose-response *calculate logit weights;
w_logit=P*(1-P);
with a relative change of 15% in tumour areas
of 90 and 30 μm. The relative change in tumour *calculate probit weights;
area at each dose level (expressed as a proportion norminv = quantile('normal', P);
relative to placebo) can be approximated f=pdf('NORMAL',norminv,0,1);
as θ0 = 0, θ1 = 0 15, θ2 = 0 30, θ3 = 0 65, with w_probit=f*f/w_logit;
run;
standard deviation s of approximately 0.3.
To test the initial hypothesis ‘Is there a
response?’ the contrasts are set up to compare
the control group against the other three dose
22.A.2 Determining Relative
levels: ci = (−3, 1, 1, 1). Then Efficacy in a Dose-Response
ε = − 3 0 + 1 0 15 + 1 0 3 + 1 0 65 = 1 1 Study. (Data from Kodell
et al. 2010.)
The contrasts are ci = (−3, −1, 1, 3). Then ε =
(−3) 0 + (−1) 0.15 + (1(0.3) + (3) 0.65 = 2.1. For /*TRT is the two treatment groups where C is
the control group and
one-sided α = 0.05 and power of 80%, the approx- T is the test drug treatment group */
imate sample size is /*N is number of mice, Response Y is number of
2
deaths */
1 645 + 0 842 03
n= 4 data probit;
11
input trt $ dose N Y;
−3 2 + 1 2 + 1 2 + 1 2 ldose=log10(dose); *log10 transformation for Y;
datalines;
≈ 27, for a total sample size of 27 × 4 = 108 C 7 8 0
C 8 8 1
C 9 8 3
C 10 8 8
C 11 8 8
The R package ‘DoseFinding’ provides a com- T 7 8 0
prehensive set of tools for design, contrast genera- T 8 8 0
tion, multiple contrast tests, and nonlinear T 9 8 0
T 10 8 4
model fits for dose-response models (Bornkamp T 11 8 5
et al. 2023; https://cran.r-project.org/web/ ;
packages/DoseFinding). run;

*The response Y is a proportion, therefore Y =


number responding divided by number per group;
22.A SAS Code for Dose-
*probit model;
Response Calculations proc probit log10 plot=predpplot;
class trt;
model Y/N=Dose trt / lackfit inversecl
22.A.1 Calculating Logit and itprint;
output out=P p=Prob std=std xbeta=xbeta;
Probit Regression Weights run;
*input anticipated response for each dose *logistic model: note that the distribution
group as a proportion; function is 'logistic';
data prob; proc probit log10 plot=predpplot;
input P @@; class trt;
Dose-Response Studies 265

model Y/N=Dose trt / d=logistic lackfit finding experiments. https://cran.r-project.


inversecl itprint; org/web/packages/DoseFinding/index.html
output out=L p=Prob std=std xbeta=xbeta; (accessed 2023).
run; Bretz, F., Hsu, J., Pinheiro, J., Liu, Y. (2008). Dose finding
– a challenge in statistics. Biometrical Journal Biome-
trische Zeitschrift, 50(4):480–504. https://doi.org/
22.A.3 Solving for Power 10.1002/bimj.200810438.
Using the Iterative Exact Chang, M. and Chow, S.-C. (2006). Chapter 14: Power
and sample size for dose response studies. In: Dose
Method Finding in Drug Development, Statistics for Biology
and Health (ed. N. Ting), 220–241. New York: Springer
%let nummax = 20; *maximum sample size https//doi.org/10.1007/0-387-33706-7_9.
n per group;
data prob; Ciecior, W., Ebert, N., Borgeaud, N. et al. (2021). Sample-
do n = 2 to #max by 1; *set increment to size calculation for preclinical dose-response experi-
desired step size; ments using heterogeneous tumour models. Radiother-
alpha=0.05; *confidence; apy and Oncology 158: 262–267. https://doi.org/
rho=1.2; *target efficacy; 10.1016/j.radonc.2021.02.032.
g=5; *number of dose Clayton, J.A. and Collins, F.S. (2014). Policy: NIH to bal-
groups; ance sex in cell and animal studies. Nature 509: 282–283.
DF=2*g-3; *degrees of freedom;
Crippa A, Orsini N (2016). Dose-response meta-analysis
*calculate critical t value for alpha; of differences in means. BMC Medical Research Meth-
t_alpha=quantile('T',1-alpha,DF); odology, 16:91. https://doi.org/10.1186/s12874-
016-0189-0.
*summed probit weights wi; Dawson, D.G., Gleiser, J., Zimbric, M.L. et al. (2003). Tox-
sum_p = 0.22394 + 0.55843 + 0.63662+ icity and dose-response studies of 1-alpha hydroxyvita-
0.55843 + 0.22394; min D2 in LH-beta-tag transgenic mice.
*calculate effect size; Ophthalmology 110 (4): 835–839. https://doi.org/
ES = (28.51*log10(rho))**2; 10.1016/S0161-6420(02)01934-6.
sqrtes = sqrt(n*(ES*sum_p/2)); Dmitrienko, A., Fritsch, H.J., and Ruberg, S.R. (2007).
Chapter 11: Design and analysis of dose-ranging clin-
*calculate t value for beta; ical studies. In: Pharmaceutical Statistics Using SAS:
tbeta=sqrtes-t_alpha; A Practical Guide (ed. A. Dmitrienko, C. Chuang-Stein,
and R. ’Agostino), 273–312. Cary: SAS Institute.
*calculate power;
power=cdf('T',tbeta,DF); Finney, D.J. (1978). Statistical Method in Biological Assay,
output; 3e. New York: Macmillan.
end; Holland-Letz T, Kopp-Schneider A (2015). Optimal
run; experimental designs for dose-response studies with
continuous endpoints. Archives of Toxicology, 89
proc print; (11):2059–68. https://doi.org/10.1007/s00204-
var n power; 014-1335-2.
run;
Kodell, R.L. and Chen, J.J. (1991). Characterization of
dose-response relationships inferred by statistically
References significant trend tests. Biometrics 47: 139–146.
Kodell, R.L., Lensing, S.Y., Landes, R.D. et al. (2010).
Altman, D.G. (1991). Practical Statistics for Medical Determination of sample sizes for demonstrating
Research. Chapman & Hall/CRC. efficacy of radiation countermeasures. Biometrics
Blanchard, O.L., and Smoliga, J.M. (2015). Translating 66: 239–248. https://doi.org/10.1111/j.1541-
dosages from animal models to human clinical 0420.2009.01236.x.
trials – revisiting body surface area scaling. FASEB Laird, G., Xu, L., Liu, M., Liu, J. (2021). Beyond exposure-
Journal, 29(5):1629–34. https://doi.org/10.1096/ response: a tutorial on statistical considerations in
fj.14-269043. dose-ranging studies. Clinical and Translational Sci-
Bornkamp, B., Pinheiro, J., Bretz, F., and Sandig, L. ence, 14(4):1250–1258. https://doi.org/10.1111/
(2023). DoseFinding: planning and analyzing dose cts.12998.
266 A Guide to Sample Size for Animal-based Studies

Macdougall, J. (2006). Analysis of dose–response studies Senn, S. (2002). Crossover Trials in Clinical Research. Chi-
– Emax model, chap 9. In: Dose Finding in Drug Devel- chester: Wiley.
opment, Statistics for Biology and Health (ed. N. Ting), Slob, W., Moerbeek, M., Rauniomaa, E., and Piersma, A.
127–145. New York: Springer https://doi.org/ H. (2005). A statistical evaluation of toxicity study
10.1007/0-387-33706-7_9. designs for the estimation of the benchmark dose in
Machin, D., Campbell, M.J., Tan, S.B., and Tan, S.H. continuous endpoints. Toxicological Sciences 84: 167–
(2018). Sample Sizes for Clinical, Laboratory and Epide- 185. https://doi.org/10.1093/toxsci/kfi004.
miology Studies, 4e. Wiley-Blackwell. Stewart, W.H. and Ruberg, S.J. (2000). Detecting dose
Miller, L.R., Marks, C., Becker, J.B., et al. (2017). Consid- response with contrasts. Statistics in Medicine 19 (7):
ering sex as a biological variable in preclinical 913–921. https://doi.org/10.1002/(sici)1097-
research. FASEB Journal, 31(1):29–34. https:// 0258(20000415)19:7<913::aid-sim397>3.0.
doi.org/10.1096/fj.201600781R. co;2-2.
Nunamaker, E.A. and Reynolds, P.S. (2022). ‘Invisible Thayer, K.A., Melnick, R., Burns, K., Davis, D., Huff, J.
actors’ – how poor methodology reporting compro- (2005). Fundamental flaws of hormesis for public
mises mouse models of oncology: a cross-sectional sur- health decisions. Environmental Health Perspectives,
vey. PLoS ONE 17 (10): e0274738. https://doi.org/ 113(10), 1271–1276. https://doi.org/10.1289/
10.1371/journal.pone.0274738. ehp.7811
Pinheiro, J.C., Bornkamp, B., and Bretz, F. (2006). Design Tukey, J.W., Ciminera, J.L., and Heyse, J.F. (1985). Test-
and analysis of dose-finding studies combining multi- ing the statistical certainty of a response to increasing
ple comparisons and modeling procedures. Journal of doses of a drug. Biometrics 41: 295–301.
Biopharmaceutical Statistics 16: 639–656. Wong, W.K., and Lachenbruch, P.A. (1996). Tutorial in
Salsberg, D. (2006). Dose finding based on preclinical biostatistics. Designing studies for dose response. Sta-
studies, chap 2. In: Dose Finding in Drug Development, tistics in Medicine, 15(4):343–59. https://doi.org/
Statistics for Biology and Health (ed. N. Ting), 18–29. 10.1002/(SICI)1097-0258(19960229)15:4<343
New York: Springer https://doi.org/10.1007/0- ::AID-SIM163>3.0.CO;2-F .
387-33706-7_9.
Index

Please note that page references to Figures will be followed by the letter ‘f’, to Tables by the letter ‘t’.

A research animals, 21–22


absolute risk reduction (ARR), 174 resources for animal-based study planning, 28–29
academic research, non-regulated, 48 sample size problem in, 3–7
adverse events sequential thinking, 27
dose range, 259, 262 sex as a biological variable, 62–63
future risk of, 92–93 signalment, 21
linear dose-response, 262 small studies, 157
occurrence of, 92 statistically defensible, 3
Poisson applications, 119 three ‘Rs’ principles (Replacement, Reduction and
predicting the probability of, 93 Refinement), 3, 4, 17, 19, 39, 41, 259–260
rare nature of, 67 see also laboratory-based animal research; numbers
sample size determination, 64 of animals; right-sizing
severe or unexpected, 93, 259 ANOVA see analysis of variance
see also rare or non-existent events approximations, sample size
Agresti-Caffo method, 192, 193 asymptotic large-scale approximation, 182
Agresti-Coull method, 91, 115, 116, 123 asymptotic normal approximation, 242–243
allocation, 162–163, 250 based on a number of levels for a single factor,
allometric scaling, 63–64, 259 217–218
American Society for Quality (ASQ), 48 based on an expected range of variations, 218, 219f
amphibians, clutch sizes, 242 based on mean differences, 217
amyotrophic lateral sclerosis, in mice, 178 lung tissue histopathology changes, in mice, 251, 252t
analysis of variance (ANOVA), 211, 212 ordinal data, 250–253
correlation ratio, 173 pre-specified odds ratio, detecting, 252
effect sizes, 171, 172t, 173 area under the curve (AUC), 258, 263
one-way, 15, 66, 214 asymptotic approximation
randomisation in ANOVA-type design, 212 large-scale, 182
single-factor, 173, 214, 223 normal, 242–243
skeleton, 214, 215–216t, 219, 224 AUC see area under the curve (AUC)
traditional, 15
animal-based research B
best-practice quality improvement, 48 batch testing for disease detection, 93–95
challenges for right-sizing studies, 4 before-after designs, 24, 42, 170, 190, 226
direct and indirect harm, avoiding, 3, 4 benchmarks, 115, 177, 178
ethically defensible, 3 dose-response studies, 259
experimental design see experimental design metrics, 20, 47
individual subject or biological unit, 233 rule-of-thumb, 176
operationally defensible, 3 values, 176, 195, 213, 238
pilot studies see pilot studies best-practice quality improvement, 48
pseudo-replication in see pseudo-replication Bayesian inference regression method, 129

A Guide to Sample Size for Animal-based Studies, First Edition. Penny S. Reynolds.
© 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.
268 Index

bias C
controls, 24 calculation/determination of sample size, 112–113
correction adjustment methods, 91 adverse events, 64
increasing, 13 asymptotic large-scale approximation, 182
internal validity, 60 asymptotic normal approximation, 242–243
minimising, 13, 17, 19, 23, 26–27 based on t-distribution, 182–183
prior data, 25 clusters see clusters/clustering
in published animal research studies, 4 confidence intervals and standard deviation, 69
reducing, 4 conventional approach, 89–90
removing, 24 covariate-dependent estimates, 148–149
replication vs. repeats, 13 derived from percentage change in means, 183
representation, 235– design effect, based on, 242
research results, 15 effect sizes, 213
selection, 60, 144 empirical and translational pilots, 64–67
spectrum, 144 exact methods, 90
binary data, 130, 137 from exemplary data, 161
binomial distribution, 90–94 hierarchical/nested data, 242–244
and batch testing, 93–95 initial approximations, 242
cumulative probabilities for, 97–98 Lehr’s formula, 183, 195
determining screening numbers, 95 Mead’s resource equation, 213–214
exact calculation, 93 multiple factors, comparing, 213–214
negative see negative binomial distribution non-centrality parameter (NCP), 213, 243
rare or non-existent events, 92–93 non-parametric or parametric estimates, 148
‘Rule of Threes,’ 92 reference intervals, 147–149
see also adverse events; multinomial samples rules of thumb, 147, 183, 241
biological unit, 11 sample-based coverage methods, 147–148
biological variability, 4, 11 skeleton ANOVA, 214
birds time-to-event (survival) data, 205–206
avian influenza, 195 tolerance intervals, 137–138
Black-billed Magpie (Pica pica) clutch size, 240 two-level model, subjects within cluster, 243
breeding, 241 units of randomisation, 243–244
clutch sizes, 243 see also feasibility calculations; sample size
osprey eggshell thickness, 135–136, 138 CAMARADES (collaborative research group), 18, 42
Poisson applications, 119 cancer, in mice
sparrows blocking on donor, 221
house sparrow clutch size, 243 dose-response effects of vitamin D on retinoblastoma,
house sparrow data, calculations of R2 and Cohen’s f 2 262, 264
for, 245–246 effect of diet on tumorigenesis, 223
model of early life stress, 239, 244 model papers, cross-sectional survey, 4
sunbirds, visitation rates to flowers, 120 tumour proliferation, 15
bladder disease, in dogs, 118 canine pulmonary stenosis, 149
bleeding, 91, 117 cardiac disease, in dogs, 207
blocking, 212–214 cats
balanced block design, 219 feline calicivirus (FCV) in shelter cats, 117–118
blocking factor, 212 heart rate elevation in, 226–227
mouse oncology, 221 hypertrophic cardiomyopathy markers in Maine Coon
randomised complete block design, 22, cats, 96–97, 98
219–221 MYBPC3 gene mutation, in Maine Coon cats, 96
by treatment interaction, 73 overweight or obese, weight loss in, 97
variables, 22, 73, 173, 219, 220 cattle, 185, 187
body size Causal Analysis/Diagnosis Decision Information System
and allometric scaling, 63–64 (CADDIS), 68
in large animals, 53, 83, 129 cause and effect, 18
bootstrapping, 112, 131, 147 Cavalier King Charles Spaniels, echocardiography, 147
butterflies, sex ratios, 191
Index 269

cell cultures, 11 width, 111


cell line authentication, 21 see also prediction intervals; reference intervals; tolerance
censoring, 201, 202t intervals
central probability distribution, 157 congestive heart failure (CHF), in dogs, 207
checklists, 3, 50, 51f, 62t consistency, 5, 20
Clinical and Laboratory Standards Institute (CLSI), 147 data extraction procedures, 54
clinical observation, 19 and generalizability, 59
Clopper-Pearson intervals, for proportions, 115, 116 processes and procedures, 55
clusters/clustering, 23 construct validity, 61, 62t
balanced cluster sizes, 242–243 continuous response, 262–264
cluster effects, adjusting sample size for, 242 linear, 262–263
cluster-level assignment, 235 continuous data
two-level model, subjects within cluster, 243–244 comparing two means, 128–129
as units of randomisation, 243, 244 creatinine levels in obese dogs, 128, 129
within-cluster assignment, 235 linear regression, 129–130
coefficient of determination, 172 non-normal, 137, 145, 146
Collaborative Approach to Meta-Analysis and Review of non-parametric tolerance limits, 136
Animal Experimental Studies (CAMARADES) outcome data, 113–115, 170
research group see CAMARADES (collaborative prediction intervals, 127–130
research group) predicting a single future value, 128
comparisons predictions based on a fixed sample size, 128
analysis of variance (ANOVA), 173 single observation, 128
controls, 23, 174t controls
matched-pairs, 24 and appropriate comparators, 23–25
minimum information planning, 156 concurrent, 24
non-linear dose response, 263 historical, 24
OFAT, 27, 65 matching, 23–24
one- or two-sided, 169 negative, 23
proportions, 189 positive, 23
risk family effect sizes, 174 process, 20–21
sample size balance and allocation ratio, 162 redundant, 25
simultaneous confidence intervals (SCI), 114 sham, 23, 25
and statistical hypotheses, 155 standard, 23
strain/species, 213 statistical, 22–23
time-to-event (survival) data, 203 strain, 24
see also two-group comparisons types, 23–25
completely randomised design, 228 vehicle, 23
conceptual replication, 74–75 when unnecessary, 25
confidence intervals, 69–70, 105–106 correlation, 171
comparison of conventional and simultaneous, 115t coverage plots, 68–69
confidence level, 112, 156 COVID-19 pandemic, 94, 95
confidence limits, 105–106 Cramer’s V, 176
definitions, 111–112 creatinine levels, obese dogs, 128, 129
endpoints, 112 crossover designs, 226
negative binomial distribution, 121 cumulative distribution function (cdf ), 90, 97–98, 159
and precision, 111–126
profile plots, 70
relationship between interval width, power and D
significance, 107 data
relative risk (RR), 193 binary, 130, 137
sample SAS code for computing (single sample continuous, 113–115, 127–130, 136, 137
proportion), 123 exemplary, 161, 163–164
sample size calculations, 112–113 extraction and coding, 54
simultaneous, 114, 115t, 122–123 hierarchical/nested, 233–247
two-group comparisons, 191–193 improving quality of, 48
   non-Gaussian, 163–164
   normally distributed, 136, 145
   outcome, 203–205
   paired assessments, in Excel, 55
   plots, 68
   Poisson, 124
   prior, 25
   raw count, 68, 251
   sample, describing, 104
   skewed count, 118–122, 195–196
   sparse, 241
   tidy, 20
   time-to-event, 199–209
data dictionary, 20, 54–55
data management, 20, 54, 83
degenerative mitral valve disease (DMVD), in dogs, 186
descriptive statistics, 6, 14, 103–105, 104t, 105t, 122
   confidence and other intervals, 105–106
   relationship between interval width, power and significance, 107–108
   results, describing, 105
   sample data, 104
   and summaries, 103–109
   see also confidence intervals; prediction intervals; reference intervals; tolerance intervals
determination of sample size see calculation/determination of sample size
diagnostic studies (PIRT), 18
direct replication, 74
Doberman Pinschers, genetically linked bleeding in, 91, 92
dogs
   bladder disease, categories, 118
   blood ammonia tolerance intervals, 137
   canine pulmonary stenosis, 149
   cardiac disease, 207
   degenerative mitral valve disease in, 186
   echocardiography in Cavalier King Charles Spaniels, 147
   genetically linked bleeding in Doberman Pinschers, 91, 92
   obese, 70, 114, 115t, 128, 129
   osteosarcoma pilot study, 95–96
   stress in, 104, 254
   two-arm crossover study, 104
   von Willebrand disease in Doberman Pinschers, 91, 117
dose-response studies, 257–266
   allometric scaling, 259
   binary response, 260–262
   calculating logit and probit regression weights, 264
   continuous response, 262–264
   determining relative efficacy, 264–265
   dose placement, 259
   dose range, 259
   effects of vitamin D on mouse retinoblastoma, 262, 264
   expected response frequency, 259
   goal, 257
   LD50 radiation lethality study in mice, 261
   model choice, 259
   non-linear, 263–264
   regression models, 260
   sample size requirements, 258–260
   SAS code, 264–265
   sex, 259
   solving for power using the iterative exact method, 265
   3Rs principles, 259–260
   translational considerations, 259–260

E

early life stress, sparrow model, 239, 244
echocardiography metrics, 147
EDA see exploratory data analysis (EDA)
effect sizes, 167–180
   basics/basic formula, 169, 170
   ‘best’ effect size, 169
   categories of indices, 168t
   Cohen’s d, 170, 238
   d-family effect sizes, 169–171
      continuous outcomes, 170–171
   fixed effects regression coefficients, 238–239
   fundamental equation, 170
   Glass’s Δ, 170
   group difference effect sizes, 167
   Hedges’s g, 170
   hierarchical/nested data, 238–240
   independent samples, 170–171
   interpretation, 176–178
   intraclass correlation coefficient (ICC), 240
   meaningful, 178
   multiple factors, comparing, 213
   precision, 112
   predetermined, 177
   ratios, interpreting as, 177–178
   r-family (strength of association), 171–173
   risk family, 174–176
   standardised, 169, 170
   strength of association effect sizes, 167
   time-to-event (survival) data, 176, 205
   two-group comparisons, 170–171
   unstandardised, 169–170
empirical pilots, 57–80
   compared with translational pilots, 58t
end-use disposition planning, 22
error, 4, 127, 219
   abstraction, 54
   blocking, 214
   common, 205
   consequences of, 138, 139
   consistent, 54
   conversion, 82
   critical, 50, 51f
   degrees of freedom, 173, 214, 220
   entry, 55
   experimental, 14, 50, 51, 66, 219
   family-wise, 114
   interpretation, 26
   major, 54
   margin of, 106, 111, 146, 148
   mean square error, 129, 171, 212, 213, 214, 216, 217, 219, 220
   measurement, 9, 11, 12, 13, 20, 41, 63, 74, 114
   non-normal, 177
   of omission, 50
   performance, 20, 48
   of the prediction, 128, 129
   procedural, 50
   protocol, 51, 64
   random, 60
   residual, 212, 214
   root mean square, 171
   sampling, 74, 106, 111, 112, 134
   standard, 9, 14, 60, 69, 112, 113, 114, 115, 122, 128, 129, 130, 156, 158, 162, 184
   statistical, 9
   substantial, 129
   systematic, 13, 59, 60
   see also bias
   term, 146, 214, 219, 220, 224, 236, 240
   transcription, 21
   Type I, 9, 14, 26, 85, 107, 114, 156, 189, 226
   Type II, 156, 189
   variance, 160, 212, 214, 218f, 222, 236, 239, 240
   zero, 50
error sum of squares (SSE), 171, 172
estimating sample sizes, 158–162
   Bayesian priors, based on, 24
   covariate-dependent estimates, 148–149
   cumulative distribution function (cdf), 90, 159
   F-distribution, non-central, 160–162
   non-parametric estimates, 148
   parametric estimates, 148
   probability density function (pdf), 159
   reducing variation, 160
   reliability, 9
   t-distribution, non-central, 159–160
   tertile method, 149
   see also calculating/determining sample sizes; sample sizes; variation
ethical issues, 3, 4
evidentiary strength
   assessing, 67–76
      confidence intervals, sample size, 69–70
      coverage plots, 68–69
      design and sample size for replication, 76
      exploratory data analysis (EDA), 68
      half-normal plots, 70, 72f
      interaction plots, 72–73
      profile plots, 70, 71f
      replication, 74–76
      standard deviation, sample size, 69–70
   building, 59–64
      external validity, 58t, 61–64
      internal validity, 58t, 59–60
exact binomial distribution see binomial (exact) distribution
exemplary data, 158, 161–164
   anticipated statistical model, using, 164
experimental design
   analysis of variance (ANOVA) methods, 212
   balanced vs. unbalanced, 162, 163t, 241
   between-subjects, 24
   challenges for right-sizing studies, 4
   clustering, 22–23, 214, 227, 235, 242
   comparisons, 6
   components, 19, 212–213
   conventional, 73
   design effect, sample size based on, 242
   factorial-type designs, 65
   Fisher-type designs, 4, 212
   information density, 64, 65
   information power see information power
   multi-batch designs, 220
   own control, 24
   reducing variation, 22
   replication, 76
   right-sizing experiments, 4
   running experiments in series, 27
   screening see screening designs
   simple, 181
   statistical, 4, 9, 19–20, 65, 76, 259
   strain controls, 24
   structured inputs, 19
   within-subject, 24
   see also before-after designs; crossover designs; factorial designs; split-plot design
experimental units, 9–11, 13f, 73, 156, 162, 169, 177, 181, 203, 212–215, 219–228, 235
   blocking and clustering, 22–23
   comparators and controls, 23
   defining, 9
   independence of, 11
   non-independence, 11
   replicating, 10, 12
   sunfish foraging preferences, 14
   with technical replication, 12
   see also technical replicates
experimental/intervention studies (PICOT), 18
experiments
   adjusting, 48
   animal-based, 5
   blocking, 76
   complex, 48, 55, 70
   concurrent controls, 23
   definitive/confirmatory, 35, 64, 66
   discarding, 4–5
   efficient and cost-effective, 27
   entire, repeats of, 13
   factor importance, 65
   ‘failed,’ 28
   feasibility calculations, 83
   half-normal plots, 70
   independent, 76
   information power, 65
   informative outcome variables, 25, 26
   ‘informative-ness’ of, 111
   interaction plots, 73
   iterative, 27
   large animal, 83
   mini, 75
   multiple, 74–75
   ‘negative,’ 27–28
   ‘noise’ in, 22
   non-informative, 36
   and pilot studies, 37–39, 48, 50, 51f, 58, 60, 64–66, 70, 75
   planning/planned, ix, 39, 47
   preclinical laboratory, 86
   preference, 14
   proof of concept (pPoC), 58, 59t
   proof of principle (pPoP), 58, 59t
   protocol compliance, 51
   pseudo-replication, 12, 13–15, 219, 234
   replication, 73, 74, 75, 76
   reporting, 28
   research hypothesis, 18
   right-sizing, 3, 4, 27
   running in series, 27
   sample size determination, 64
   sequential, 50, 51f
   sham controls, 23
   small, 38, 76
   statistical design see experimental design
   technical replicates, 11, 12
   two-group and OFAT comparisons, 65, 66
   uncontrolled variation in, 47
   validity see validity
   variability in, 21, 86
   in vitro, 11
exploratory and screening studies, 38–39
exploratory data analysis (EDA), 68
external validity
   body size and allometric scaling, 63–64
   representativeness, 61–62

F

face validity, 58t, 61
factor reduction, 27, 221
factorial designs, 19, 221–223
   power and sample size for, 67, 208, 222
feasibility calculations, 4–6, 81–88
   approximation process, 81
   arithmetical formulas, 82
   batch testing for disease detection, 93–95
   counting subjects, 89–99
   determining operational feasibility, 82–88
   experiments, 83
   operational feasibility, 47, 55, 81–86
   pilot studies, 35–43
   problem structuring, 82
   process, 81–82
   reality checks, 82
   refinement, 82
   see also calculation/determination of sample size
feline calicivirus (FCV), shelter cats, 117
fish, 14, 120
fixed effects regression coefficients, 238–239
Food and Drug Administration (FDA)-regulated veterinary trials, 48
fractures
   bone, in rats, 160
   equine
      leg bone, 116
      skull, 130, 137
futility loop studies, 38

G

G*Power, 158, 160, 173, 185
generalizability, 58, 72, 73f
genotyping, 21
geometric distribution, 90
Good Clinical Practice (GCP) guidelines, 48
guinea pigs
   growth, 220
   sample SAS code for calculating sample size, 228–229
   weight gain, 217

H

habituation protocols, 22
half-normal plots, 70–71, 72f
hazard ratio (HR), 168t, 176, 204–208
heart rate elevation, in cats, 226–227
hierarchical/nested data, 233–247
   balanced vs. unbalanced allocation design, 241
   categorical outcome, 236
   constructing the model, 238
   continuous outcome, 236
   costs, 241
   effect size, estimating, 238–240
   examples of hierarchical data, 234
   identifying the unit of randomisation, 235–236
   multilevel models with predictors, 236–238
   predictors, lack of, 236
   sample size determination, 241–244
   software, notes on, 244–245
   sparse, 241
   steps in multilevel sample size determinations, 235–238
high dimensionality studies, 85–86
horses
   equine skull fractures, 130, 137
   racehorse medication, 135, 139
   simulated equine leg bone fractures, 116
humane endpoints, 26
hypergeometric distribution, 90, 96–98
hypertrophic cardiomyopathy (HCM), in Maine Coon cats, 96–97, 98
hypothesis testing, 155–165
   area under the curve (AUC), 258, 263
   definitions, 156
   feasibility calculations as alternative, 81
   non-centrality, 157–158
   null hypothesis statistical tests, 11, 19, 38, 107, 114, 156, 157, 158, 159
      comparison of two groups, 190, 191
      dose-response studies, 261, 263
      effect sizes, 172
      empirical and translational pilots, 68, 69, 70
   power and significance, 155–157
   reducing variation, 20–23
   research hypothesis, 18–19
   scientific hypothesis, 18–19
   statistical hypothesis, 18–19
   statistical tests, 11

I

ICC see intraclass correlation coefficient (ICC)
information
   comparators and controls, 23–25
   designing experiments see experimental design
   experiments, ‘informative-ness’ of, 111
   informative outcome variables, 25–26
   increasing (ten strategies), 17–32
   reducing variation see variation, reducing
   research vs. statistical hypotheses, 18–19
   sequential thinking, 27
   significance, futility of chasing, 27–28
   source and strain, 21
   strategies to increase, 17–32
   structured inputs see experimental design
   ‘well-built’ research question, 17–19
   see also experiments
information density, 57t, 64–65
information power, 58, 64, 65–67
   conventional vs. screening designs, 66
   factor importance, 65–66
   factor reduction, 65
   factor testing, 66
   sample size, 66
informative outcome variables, 25–26
insects
   butterflies, 191
   red mites, 122, 124–125
interaction plots, 72–73
internal validity see validity, internal
intervals
   confidence see confidence intervals
   distances between (ordinal data), 249
   prediction see prediction intervals
   reference see reference intervals
   relationship between interval width, power and significance, 107–108
   short time, 225
   significance threshold, 107
   tolerance see tolerance intervals
intraclass correlation coefficient (ICC), 235, 238–239, 240, 242–244

J

Jeffreys method, 115–116, 123, 137

K

Kaplan-Meier method, 203–204

L

laboratory-based animal research, 58–59
   basic science/laboratory studies, 82–83
   Mead’s resource equation, 213–214
   pilot studies, 37
   preclinical laboratory, 86
   processing capacity, 83
   surgical training laboratory example, 86
   see also animal-based research
large animals
   experiments, available time for, 83
   mammals, 129
   surgery models, 53
leg bone fractures, simulated equine, 116
linear regression, 129–130, 171, 172t
literature reviews, 40
lizards, body temperature, 114
lung tissue histopathology changes, in mice, 251, 252t

M

Magpie, Black-billed, nests, 240
Maine Coon cats, hypertrophic cardiomyopathy in, 96–97, 98
margin of error, 106, 111, 146, 148
   see also precision
matching controls, 23–24
MBID (minimum biologically important difference), 68
Mead’s resource equation, 213–214
mean square error (MSE), 129, 172, 212, 213, 214, 216, 217, 219, 220
measurement error, 9, 11, 12, 13, 20, 41, 63, 74, 114
measurement times, 52–53
meta-analyses, prediction intervals, 130–131
mice
   amyotrophic lateral sclerosis, 178
   anaesthesia duration, 184
   anaesthetic immobilisation times, 128
   on antibiotics, weight gain in, 222
   body size and allometric scaling, 63
   breeding productivity, 161
   C57BL/6 mouse strain, 24
   cancer in, 4, 15, 221, 223, 262
   environmental toxins, effects on growth, 72
   estimating number with desired genotype, 87–88
   heterozygous breeder, 87
   infections, 191, 192–193, 194
   irradiation study on, 224
   laboratory processing capacity experiment, 81
   LD50 radiation lethality study on, 261
   longitudinal data plots for, 52
   lung tissue histopathology changes in, 251, 252t
   number required for a microarray study, 85
   perception of as ‘furry test tubes,’ 28
   photoperiod exposure, 14
   pups, 10f, 72, 87, 88, 105t, 161–164, 233, 234
   repeating of entire experiments, 13
   surgical success study on, 190
   tumorigenesis, 223
   two-arm drug study, 183
   two-arm randomised controlled study of productivity, 105
   two-group comparisons, 190
   see also rodents
microarray studies, 85
MSE see mean square error (MSE)
multi-batch designs, 75
multinomial samples, 118
multiple factors, comparing, 211–231
   before-after designs, 226
   constructing the F-distribution, 212–213
   continuous outcome, 227
   crossover designs, 226
   design components, 212–213
   factorial designs, 221–223
   proportions outcome, 227
   randomised complete block design, 219–221
   repeated measures
      simple repeated-measures design, 229
      in space, 227–228
      over time, 227
      within-subject designs, 225–228
   sample size approximations
      based on a number of levels for a single factor, 217–218
      based on an expected range of variations, 218, 219f
      based on mean differences, 217
   sample size determination methods, 213–214
   split-plot design, 223–224
mussel shell length, 228

N

National Institutes of Health (NIH), 13, 63
negative binomial distribution, 90, 95–96, 119, 121–122
   cumulative probabilities for, 97–98
   evaluating, 124
   fitting counts of red mites on apple leaves, 124–125
   osteosarcoma pilot study, 95–96
   skewed count data, 196
Newcombe (Wilson score intervals) method, 137, 192, 193
N-hacking, 28, 157
NIH see National Institutes of Health (NIH)
non-centrality parameter (NCP), 64, 86, 135, 156, 157–162, 164, 183, 186–187, 213, 216, 218, 222–223, 243–245
non-Gaussian data, 163–164
normal distribution, 90–91
nuisance variables, 22
numbers of animals
   justifiable, 5f
   reducing, 24
   vs. sample size, 9
   verifiable, feasible and fit for purpose, 3
   see also animal-based research; right-sizing

O

observational studies (PECOT), 18
observer agreement studies, sample size for, 254–255
odds ratio (OR), 174–175, 203
   pre-specified, sample size for detecting, 252
   two-group comparisons, 194–195
   using SAS proc genmod to calculate, 178
OFAT see one-factor-at-a-time (OFAT)
oncology, 4, 24, 221, 260
   see also cancer, in mice
one-factor-at-a-time (OFAT) design, 25, 27, 221
   empirical and translational pilots, 65, 66
operating procedures, standardised, 54
operational feasibility, determining, 81, 82–88
   available time for large animal experiments, 83
   basic science/laboratory studies, 82–83
   high dimensionality studies, 85–86
   laboratory processing capacity, 83
   large animal, available time for, 83
   operation and logistics, 83
   training, teaching and skill acquisition, 86
   veterinary clinical trials, 83–85
operational pilot studies, 25, 40, 47–56
   animal surgery procedural checklist, 50
   animal-based research, 48
   benchmark metrics, 47
   best-practice target times, 53
   ‘Can It Work?’ 48
   checklists, 50
   constructing, 49
   deliverables, 47
   goal, 47
   measurement times, 52–53
   no pre-specified performance times, 53
   operational tools, 48–52
   paired data assessments, in Excel, 55
   performance metrics, 52–53
   process/workflow maps, 48–50
   retrospective chart review studies, 53–55
   run charts, 50–52
   sample size considerations, 55
   set-time metrics, 53
   standardising performance, 54–55
   subjective measurements, 53
   surgical implementation for physiological monitoring in swine model, 49–50
   task identification, 47
   see also pilot studies
operational tools, 48–52
OR see odds ratio (OR)
ordinal data, 249–256
   common examples, 249
   paired or matched, 253–254
   rater agreement, 254–255
   sample size approximations, 250–253
osprey eggshell thickness, 135–136, 138
osteosarcoma pilot study, 95–96
outcome data
   basic equation for continuous data, 170
   count data, 203
   hazard rate and ratio, 204–205
   survival times, 203–204
Oxford Centre for Evidence-based Medicine website, 18

P

PECOT framework, 18
performance metrics, 52–53
   no-fail surgical performance, 139
   standardising performance, 54–55
P-hacking, 28, 157
phenotypic responses, 21
PICOT framework, 18
pilot studies
   applications, 36–38
   arithmetic approximations, 42
   categories of studies labelled incorrectly as, 38–39
   defining, 35
   distinguished from exploratory and screening studies, 38–39
   empirical pilots, 57–80
   experiments, 37–39, 48, 50, 51f, 58, 60, 64–66, 70, 75
   general role of, 36
   justification, 40
   literature reviews, 40
   Murphy’s First, Second and Third Law, 36
   operational, 47–56
   planning, 39–40
   principles, 39
   proportionate investment, 39
   rationale for, 35–45
   reporting of results, 38
   role in laboratory animal-based research, 37
   role in veterinary research, 37–38
   sample size, 41–43
   sequential probabilities, 42
   similitude, 39
   stakeholder requirements, 40
   translational pilots, 57–80
   trials, 35, 40–41
   utility, 39
   well-conducted, 36
   see also evidentiary strength; osteosarcoma pilot study
PIRT framework, 18
planned heterogeneity, 21, 75, 219
plots
   coverage, 68–69
   half-normal, 70, 72f
   interaction, 72–73
   longitudinal data, 52
   profile, 70, 71f
   sequential, 52f
Pocock criteria, concurrent controls, 24
Poisson distribution, 118, 119–121
   applications, 119
   evaluating, 124
   examples of applications, 119
   fitting counts of red mites on apple leaves, 124–125
   rules of thumb, 195
   skewed count data, 118–119, 195
   visitation rates of sunbirds to flowers, 120
power
   calculations, 4–5, 10–11, 81, 163–164
   descriptions, 107–108
   hypothesis testing, 155–157
   information, 58, 64, 65–67
   simulation-based analyses, 67
   statistical, 58
pragmatic sample size, 42
precision
   absolute vs. relative, 113
   and confidence intervals, 111–126
   defining, 12, 106
   role of design to increase, 19–23
   internal validity, 60
   precision-based sample size, 42–43
   see also margin of error
predetermined effect sizes, 177
prediction intervals, 105, 106, 127–132
   binary data, 130
   continuous data, 127–130
   defining, 127
   meta-analyses, 130–131
   see also confidence intervals; reference intervals; tolerance intervals
predictive validity, 61
probability density function (pdf), 97, 98, 159
probability distribution, 89, 90
   see also binomial (exact) distribution; hypergeometric distribution; negative binomial distribution; normal distribution; Poisson distribution
probit, 121, 123–124, 229, 260f
procedural variability, 20
process behaviour charts, 50–52
process control, 20–21
process maps, 48–50
profile plots, 70, 71f
proof of concept (pPoC), 58, 59t
proof of principle (pPoP), 58, 59t
proportional hazards assumption, 204
proportions, 115–118, 227
protocols, 5, 20–22, 37, 50, 103
   approval, 50
   breeding, 86, 87
   compliance, 51
   coverage plots, 69
   effect sizes, 169
   errors, 51
   experimental, 36, 37, 47, 48, 74, 87
   fixed-span, 160
   fully articulated, 41
   habituation, 22
   implementation, 64
   individual idealised-span, 160
   large animal experiments, 83
   measurement times, 52
   mouse surgical success, 190
   pilot studies, 38, 40
   randomisation, 24
   research, 37
   standardised, 76
   translational considerations, 259
pseudo-replication, 12, 13–15, 219, 234
P-values, 59, 156, 157

Q

quality assurance (QA), 20, 48, 53
quality improvement (QI), 20, 48
quantified outcomes, 19

R

R code, 116, 136–139, 145–146, 160–161, 244
racehorse medication, 135, 139
randomisation, 4, 13, 19, 22–27, 39, 58t, 60, 63, 65, 76, 156, 200–201, 234f
   in ANOVA-type design, 212
   bias minimisation, 26
   block, 22
   cluster
      cluster-level assignment, 235
      as units of randomisation, 243–244
      within-cluster assignment, 235
      see also clusters/clustering
   formal, 23
   individual, 236
   lack of, 259
   level-2, 242
   non-randomisation, 27
   protocols, 24
   schedule, 200–201
   strategies, 201
   units of, 236, 241, 243–244, 245
      clusters as, 243–244
      identifying, 235–236
   see also randomised complete block design (RCBD)
randomised complete block design (RCBD), 22, 212, 219–221, 228–229
rare or non-existent events, 92–93
   see also adverse events
ratios
   allocation, 162–163, 169, 186, 194, 201, 205, 250
   effect sizes, 177–178
   hazard ratio (HR), 168t, 176, 204–208
   odds ratio see odds ratio (OR)
   sex, 190–191
rats, 160, 206
recommended daily allowance (RDA), 114
red mites, 122, 124–125
Reduction principle, 3, 4, 17, 22, 27, 64
reference intervals, 105, 106, 136, 143–151
   confidence probability, 145
   constructing, 144–147
   cut-point precision, 144, 145
   defining, 143
   interval coverage, 144
   markers, 143, 145
   reference limits, 144, 145
   regression-based reference ranges, 146–147
   sample size determination, 147–149
   selection bias, 144
   specificity, 143
   spectrum bias, 144
   see also confidence intervals; prediction intervals; tolerance intervals
regression analysis
   adjusted/unadjusted R2, 172
   analysis of variance table, 172t
   Bayesian inference method, 129
   calculating logit and probit regression weights, 264
   coefficients, 70, 72, 172
   controls, 24
   degrees of freedom, 14
   dose-response models, 260
   error sum of squares (SSE), 171, 172
   fixed effects regression coefficients, 238–239
   F-statistic, 118, 163–164, 172
   linear regression, 129–130, 171, 172t
   mean square error of the regression, 129
   probit regression model, 260f
   regression sum of squares (RSS), 171, 172
   regression-based reference ranges, 146–147
   r-family (strength of association) effect sizes, 171–172
   total sum of squares (SST), 171, 172
relative efficiency (RE), 162, 163
relative risk (RR), 174, 175, 176, 203
   detecting a given relative risk, 193
   pre-specified risk reduction, 194
   using SAS proc genmod to calculate, 178
repeated measures
   simple repeated-measures design, 229
   in space, 227–228
   on time, 227
   within-subject designs, 212, 213, 225–228
replicates, 12–13
   defining, 10
   experiments, 13, 73
   repeats of entire experiments, 13
   vs. replication, 13
   technical, 9, 11–12
replication
   biological, 9, 11
   conceptual, 74–75
   design and sample size for, 76
   direct, 74
   efficiency-effectiveness trade-offs, 75
   evidentiary strength, assessing, 74–76
   experimental units, 12
   vs. replicates, 13
   types of study, 75t
   units of, 10f, 11t
   see also pseudo-replication
representativeness, 61–62
reptiles
   clutch sizes, 242
   lizards, 114
   turtles, 227
research hypothesis, 18, 19
research question see ‘well-built’ research question
resources for animal-based study planning, 18, 28–29
retinoblastoma, in mice, 262, 264
r-family (strength of association) effect sizes
   analysis of variance (ANOVA), 172t, 173
   correlation, 171
   regression, 171–172
right-sizing
   challenges to, 4
   checklist, 3
   defensibility, 3
   evidence of, 4, 27
   experiments, 3, 4, 27
   thinking, 27
risk difference, 174, 189, 191
risk family effect sizes, 174–176
   Cramer’s V, 176
   interpretation, 175–176
   nominal variables, 176
   odds ratio see odds ratio (OR)
   relative risk see relative risk (RR)
   risk difference, 174
   risk ratio (RR), 174
rodents
   breeding programmes, 22, 86–88
   guinea pigs, 217, 220
   lifespan, 21
   mice see mice
   rats, 160
   rodent p50 in relation to body mass, 129
   source and strain information, 21
routine assessments, EDA, 68
RR see relative risk (RR)
run charts, 48, 50–52

S

sacrificial pseudo-replication, 14
sample size
   balance and allocation, 23, 162–163
   balanced vs. unbalanced allocation design, 162, 163t
   basics, 5, 9–15
   calculating/determining see calculation/determination of sample size
   for comparative studies, minimum planning size, 156
   confidence intervals see confidence intervals
   conventional vs. screening designs, 66
   defining, 9
   dose-response studies, 258–260
   estimating see estimating sample sizes
   experiments see experimental design; experiments
   information density, 64–65
   information power, 65–67
   justification, 4
   large, 26
   vs. numbers of animals, 9
   pilot studies see operational pilot studies; pilot studies
   pragmatic, 42
   precision-based, 42–43
   problem of in animal-based research, 3–7
   for replication, 76
   safety and tolerability, 67
   standard deviation, 69–70
   total (N), 10–11
   veterinary clinical trials, 67, 83–85
   zero-animal, 41–42
SAS codes, 5
   based on z-distribution, 187
   calculating confidence limits for Poisson data, 124
   calculating critical values, 124
   calculation of tolerance, 139
   computing of simultaneous confidence intervals, 122–123
   cumulative probabilities for binomial, negative binomial and hypergeometric distribution with, 97–98
   for dose-response calculations, 264–265
   for estimating variance components, 244–245
   for calculations of R2 and Cohen’s f2 for house sparrow data, 245–246
   evaluating Poisson and negative binomial distributions, 124
   sample size
      based on non-Gaussian data, 163–164
      for calculating Cohen’s κ from raw data, 255
      for computing confidence intervals for a single sample proportion, 123
      for fixed power crossover design (cattle example), 187
      total size based on the non-centrality parameter for t, 187
      veterinary clinical trials, 187–188
   skewed count data, 196–197
   two-group comparisons, 187
SAS programmes, 98
SCI see simultaneous confidence intervals (SCI)
screening designs
   classic, 65
   customized, 66
   definitive, 65, 66, 67
   factor reduction, 221
   sample size, 66
   and workhorse design, 221
selection bias, 144
sequential assembly/thinking, 27
sequential plots, 52f
set times, 52
sex, as a biological variable, 62–63
sex ratios, 190–191, 227
sheep, parasitic worm burdens in Soay type, 122, 196
signalment, animal, 21
simple pseudo-replication, 14
simultaneous confidence intervals (SCI), 114, 115t
   computing, 122–123
single-cell gene expression RNA sequencing experiment, hypothetical, 11t
skewed count data, 118–122
   two-group comparisons, 195–196
skull fractures, equine, 130, 137
SOPs see standardised operating procedures (SOPs)
sparrows
   calculating R2 and f2: model of early life stress, 239
   early life stress, 239, 244
   house sparrow clutch size, 243
   house sparrow data, calculations of R2 and Cohen’s f2 for, 245–246
spatial autocorrelation, 227–228
spectrum bias, 144
split-plot design, 223–224
stakeholder requirements, 40
standard deviation, 69–70, 145, 170
   choice of, 183–185
   one-sample comparison, 184
   paired samples or A/B crossover designs, 184–185
   simplified non-central t correction, 188
   two independent samples, 184
   uncorrected, 187–188
   upper confidence interval (UCL), 188
standard error (SE), 9, 14, 60, 69, 156, 158, 162, 184
   and precision, 112, 113, 114, 115, 122
   prediction intervals, 128, 129, 130
standardisation of housing and husbandry conditions, 21
standardised operating procedures (SOPs), 47–48
standardising of performance, 54–55
statistical hypothesis, 10, 11, 18–19, 27
   descriptive statistics, 103–104
   hypothesis testing, 155
statistical significance, 27–28, 156
   chasing, 27, 28, 157
   and effect sizes, 167, 178
   of regression coefficients, 263
   testing, 47, 55
strain controls, 24
stratification, 22
stress, in dogs, 104, 254
study design, 19, 207–208
study planning resources, 28–29
study quality, 13, 83
   see also quality assurance (QA); quality improvement (QI)
sunfish foraging preferences, 14
swine model, physiological monitoring in, 49–50
SYRCLE (collaborative research group), 18
   Risk of Bias tool, 26

T

target difference, coverage plots, 68–69
technical replicates, 11–12
temporal pseudo-replication, 14
test platform, 18
Three ‘Rs’ principles (Replacement, Reduction and Refinement), 3, 4, 17, 19, 39, 41, 259–260
throw-away studies, 38
time to event effect size, 176
time-to-event (survival) data, 199–209
   censoring, 201, 202t
   defining all survival-related items, 200
   defining the research question, 200
   examples of applications, 199–200
   methodological considerations, 200–203
   more than two groups, 207–208
   randomisation schedule, 200–201
   ‘responders’ vs. ‘non-responders’ and post-hoc dichotomisation, 203
   sample size calculations, 205–206
   structure, 201
   study design considerations, 207–208
   survivorship curves, 201
   veterinary clinical trials, 206–207
tolerance intervals, 105, 106, 133–141, 146
   applications of, 133
   blood ammonia, 137
   calculation of tolerance, 139
   confidence, 134
   defining, 133
   determining sample size for, 137–138
   non-parametric tolerance limits, 136–137
   one-sided lower tolerance interval, 135–136
   one-sided or two-sided, 134, 136
   osprey eggshell thickness, 135–136, 138
   parametric formulations, 134–136
   sample size, based on permissible number of failures, 138–139
   surgical procedures, 139
   two-sided interval distribution-free sample size, 137
   width and bounds, 134
   see also confidence intervals; prediction intervals; reference intervals
track and trace systems, 21
translational pilots, 57–80
   compared with empirical pilots, 58t
treatment interactions, 73
trial-and-error, 4, 27, 38, 39, 48, 65, 66
trials
   pilot studies, 35, 40–41
   planning recruitment numbers, 117
   recruitment of client-owned companion animals, 84
   trial size for a grant proposal, 84
   veterinary clinical trials, 67, 83–85
trimethylamine N-oxide (TMAO), 186
t-tests, 14, 15, 181, 185, 189
turtle hatchling sex ratio, 227
two-group comparisons, 4, 6, 66, 121, 221
   continuous outcomes, 181–188
   effect sizes, 170–171
   one-factor-at-a-time (OFAT), 65, 66, 221
   paired samples, 170–171
   proportions, 189–197
      confidence intervals, 191–193
      difference between two proportions, 190–193
      independent samples, 191–193
      odds ratio (OR), 194–195
      one proportion known, 190
      relative risk (RR), 193–194
      sex ratios, 190–191
      skewed count data, 195–196
   sample SAS code for calculating sample size, 187
   sample size calculation methods, 182–183
   two-group designs, 181
Type I error rates, 9, 14, 26, 85, 107, 114, 156, 189, 226
Type II error, 156, 189

U

underpowered studies, 38
unit of analysis see experimental units
unsubstantiated studies, 38

V

validity, 3
   construct, 58t, 61–63, 62t
   external, 21, 58–64, 74–75
   face, 58t, 61
   internal, 13, 24, 39, 58–60, 61, 64, 74, 76, 108
   predictive, 58t, 61, 76
   sex as a biological variable, 62–63, 259
   study, 12, 59
variability
   between-individual, 21
   between-subject, 225
   biological, 4, 11
   descriptive statistics, 103
   expected, 112
   in experiments, 21, 86
   interaction effects, 225
   procedural, 20, 48
   subject, 203
   technical replicates, 12
   see also variation
variance error, 160, 212, 214, 218f, 222, 236, 239, 240
variation
   between-subject, 225
   interaction effects, 225
   reducing
      between-individual variability, 21
      data management rules, 20
      estimating sample sizes, 160
      pilot studies, 36
      process control, 20–21
      research animals, 21–22
      statistical control, 22–23
   sources of, 19, 22, 24, 37, 66, 171–173, 178, 212–215, 228
   within-subject, 177, 225
veterinary clinical trials, 67, 83–85
   Food and Drug Administration (FDA)-regulated, 48
   loss to follow-up, 206
   non-central t-distribution, 186
   pilot studies, 37–38
   recruitment, 41
   sample SAS code for calculating sample size, 187–188
   sample size for two-arm, 185–188
      conventional method, uncorrected s, 186
      upper confidence limit method, 186
   time-to-event (survival) data, 206–207
von Willebrand disease (vWD), in Doberman Pinschers, 91, 117

W

Wald method
   adjusted Wald see Agresti-Coull method
   confidence interval, 123, 191–192
   equation, 191
   interval limits, 193
   for proportions, 115, 116, 121
water maze testing, 52
welfare indicators, 22, 26, 249
‘well-built’ research question, 17–19
Wilcoxon matched-pairs test, 14
Wilcoxon test, skewed count data, 118
Wilson method, for proportions, 115, 116
workflow maps, 48–50
World Health Organisation (WHO), 48, 50

Z

zebrafish mortality, 120
zero-animal sample size, 41–42