Biostatistics For Dummies, 2nd Edition
Copyright © 2024 by John Wiley & Sons, Inc. All rights reserved, including rights for text and data mining and training
of artificial intelligence technologies or similar technologies.
Media and software compilation copyright © 2024 by John Wiley & Sons, Inc. All rights reserved, including rights for
text and data mining and training of artificial intelligence technologies or similar technologies.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections
107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to
the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River
Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/
permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related
trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written
permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not
associated with any product or vendor mentioned in this book.
For general information on our other products and services, please contact our Customer Care Department within
the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit
https://hub.wiley.com/community/support/dummies.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with
standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to
media such as a CD or DVD that is not included in the version you purchased, you may download this material at
http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
CHAPTER 14: Analyzing Incidence and Prevalence Rates in Epidemiologic Data. . . . 191
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Table of Contents
INTRODUCTION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
About This Book. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Foolish Assumptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Icons Used in This Book. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Beyond the Book. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Where to Go from Here . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
CHAPTER 3: Getting Statistical: A Short Review of
Basic Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Taking a Chance on Probability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Thinking of probability as a number. . . . . . . . . . . . . . . . . . . . . . . . . . 30
Following a few basic rules of probabilities. . . . . . . . . . . . . . . . . . . . 30
Comparing odds versus probability. . . . . . . . . . . . . . . . . . . . . . . . . . 32
Some Random Thoughts about Randomness . . . . . . . . . . . . . . . . . . . . 33
Selecting Samples from Populations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Recognizing that sampling isn’t perfect. . . . . . . . . . . . . . . . . . . . . . . 34
Digging into probability distributions. . . . . . . . . . . . . . . . . . . . . . . . . 35
Introducing Statistical Inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Statistical estimation theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Statistical decision theory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Honing In on Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Getting the language down . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Testing for significance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Understanding the meaning of “p value” as the result of a test. . . 42
Examining Type I and Type II errors. . . . . . . . . . . . . . . . . . . . . . . . . . 42
Grasping the power of a test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Going Outside the Norm with Nonparametric Statistics. . . . . . . . . . . . 47
PART 5: LOOKING FOR RELATIONSHIPS WITH
CORRELATION AND REGRESSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Getting Casual about Cause. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Rothman’s causal pie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Bradford Hill’s criteria of causality . . . . . . . . . . . . . . . . . . . . . . . . . . 298
INDEX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
It is not possible to cover all the subspecialties of biostatistics in one book, because
such a book would have to include chapters on molecular biology, genetics,
agricultural studies, animal research (both inside and outside the lab), clinical
trials, and epidemiological research. So instead, we focus on the most widely
applicable topics of biostatistics and on the topics that are most relevant to human
research based on a survey of graduate-level biostatistics curricula from major
universities.
Only in a few places does this book provide detailed steps about how to perform a
particular statistical calculation by hand. Instruction like that may have been
necessary in the mid-1900s. Back then, statistics students spent hours in a com-
puting lab, which is a room that had an adding machine. Thankfully, we now have
statistical software to do this for us (see Chapter 4 for advice on choosing statistical
software). When describing statistical tests, our focus is always on the concepts
behind the method, how to prepare your data for analysis, and how to interpret
the results. We keep mathematical formulas and derivations to a minimum. We
only include them when we think they help explain what’s going on. If you really
want to see them, you can find them in many biostatistics textbooks, and they’re
readily available online.
Because good study design is crucial for the success of any research, this book
gives special attention to the design of both epidemiologic studies and clinical tri-
als. We also pay special attention to providing advice on how to calculate the num-
ber of participants you need for your study. You will find easy-to-apply examples
of sample-size calculations in the chapters describing significance tests in
Parts 4, 5, and 6, and in Chapter 25.
Foolish Assumptions
We wrote this book to help several kinds of people. We assume you fall into one of
the following categories:
» Doctors, nurses, and other healthcare professionals who want to carry out
human research
What is important to keep in mind when learning biostatistics is that you don’t
have to be a math genius to be a good biostatistician. You also don’t need any
special math skills to be an excellent research scientist who can intelligently
design research studies, execute them well, collect and analyze data properly, and
draw valid conclusions. You just have to have a solid grasp of the basic concepts
and know how to utilize statistical software properly to obtain the output you need
and interpret it.
This icon signals information especially worth keeping in mind. Your main take-
aways from this book should be the material marked with this icon.
We use this icon to flag explanations of technical topics, such as derivations and
computational formulas that you don’t have to know to do biostatistics. They are
included to give you deeper insight into the material.
This icon refers to helpful hints, ideas, shortcuts, and rules of thumb that you can
use to save time or make a task easier. It also highlights different ways of thinking
about a topic or concept.
This icon alerts you to discussion of a controversial topic, a concept that is often
misunderstood, or a pitfall or common mistake to guard against in biostatistics.
If you want to get the big picture of what biostatistics encompasses and the areas
of biostatistics covered in this book, then read Chapter 1. This is a top-level
overview of the book’s topics. Here are a few other special parts of this book you
may want to jump into first, depending on your interest:
» If you want a quick refresher on basic statistics like what you would learn in
a typical introductory course, then read Chapter 3.
» If you want to learn about collecting, summarizing, and graphing data, jump
to Part 3.
» If you need to know about working with survival data, you can go right to
Part 6.
Chapter 1
Biostatistics 101
Biostatistics deals with the design and execution of scientific studies involv-
ing biology, the acquisition and analysis of data from those studies, and the
interpretation and presentation of the results of those analyses. This book
is meant to be a useful and easy-to-understand companion to the more formal
textbooks used in graduate-level biostatistics courses. Because most of these
courses teach how to analyze data from epidemiologic studies and clinical trials,
this book focuses on that as well. In this first chapter, we introduce you to the
fundamentals of biostatistics.
Much of the work in biostatistics is using data from samples to make inferences
about the background population from which the sample was drawn. Now that we
have large databases, it is possible to easily take samples of data. Chapter 6
provides guidance on different ways to take samples of larger populations so you
can make valid population-based estimates from these samples. Sampling is
especially important when doing observational studies. While the clinical trials covered
in this book are experiments, where participants are assigned interventions, in obser-
vational studies participants are merely observed, with data collected and statis-
tics performed to make inferences. Chapter 7 describes these observational study
designs, and the statistical issues that need to be considered when analyzing data
arising from such studies.
Data used in biostatistics are often collected in online databases, but some data
are still collected on paper. Regardless of the source of the data, they must be put
into electronic format and arranged in a certain way to be able to be analyzed
using statistical software. Chapter 8 is devoted to describing how to get your data
into the computer and arrange it properly so it can be analyzed correctly. It also
» In Chapter 11, you see how to compare average values between two or
more groups by using t tests and ANOVAs. We also describe their nonpara-
metric counterparts that can be used with skewed or other non-normally
distributed data.
» Chapter 13 focuses on one specific kind of cross-tab called the fourfold table,
which has exactly two rows and two columns. Because the fourfold table
provides the opportunity for some particularly insightful calculations, it’s
worth a chapter of its own.
» You may want to develop a formula for predicting the value of a variable from
the observed values of one or more other variables. For example, you may
want to predict how long a newly diagnosed cancer patient may survive
based on their age, obesity status, and medical history.
Regression analysis can manage all these tasks and many more. Regression is so
important in biological research that all the chapters in Part 5 are focused on some
aspect of regression.
If you have never learned correlation and regression analysis, read Chapter 15,
which introduces these topics. We cover simple straight-line regression in
Chapter 16, which includes one predictor variable. We extend that to cover
multiple regression with more than one predictor variable in Chapter 17. These
three chapters deal with ordinary linear regression, where you’re trying to predict
the value of a numerical outcome variable from one or more other variables. An
example would be trying to predict mean blood hemoglobin concentration using
variables like age, blood pressure level, and Type II diabetes status. Ordinary
linear regression uses a formula that’s a simple summation of terms, each of
which consists of a predictor variable multiplied by a regression coefficient.
» Poisson regression, where the outcome is the number of events that occur in
an interval of time
» LOWESS curve-fitting, where you fit a smooth, flexible curve to your data without specifying any particular formula
Finally, Part 5 ends with Chapter 20, which provides guidance on the mechanics
of regression modeling, including how to develop a modeling plan, and how to
choose variables to include in models.
The need to study survival with data like these led to the development of survival
analysis techniques. But survival analysis is not only intended to study the
outcome of death. You can use survival analysis to study the time to the first
occurrence of non-death events as well, like remission or recurrence of cancer,
the diagnosis of a particular condition, or the resolution of a particular condition.
Survival analysis techniques are presented in Part 6.
But you should still be familiar with the common statistical distributions that may
describe the fluctuations in your data, or that may be referenced in the course of
performing a statistical calculation. Chapter 24 contains a list of commonly used
distribution functions, with explanations of where you can expect to encounter
those distributions and what they look like. We also include a description of some
of their properties and how they’re related to other distributions. Some of them
are accompanied by a small table of critical values, corresponding to statistical
significance at α = 0.05.
Chapter 2
Overcoming Mathophobia: Reading and Understanding Mathematical Expressions
Let’s face it: Many people fear math, and statistical calculations require math.
In this chapter, we help you become more comfortable with reading
mathematical expressions, which are combinations of numbers, letters, math
operations, punctuation, and grouping symbols. We also help you become more
comfortable with equations, which connect two expressions with an equal sign.
And we review formulas, which are equations designed for specific calculations.
(For simplicity, for the rest of the chapter, we use the term formula to refer to
expressions, equations, and formulas.) We also explain how to write formulas,
which you need to know in order to tell a computer how to do calculations with
your data.
» A typeset format utilizes special symbols, and when printed, the formula is
spread out in a two-dimensional structure, like this:
SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - m)^2}{n - 1}}
» A plain text format prints the formula out as a single line, which is easier to
type if you’re limited to the characters on a keyboard:
SD = sqrt(sum((x[i] - m)^2, i, 1, n) / (n - 1))
You must know how to read both types of formula displays — typeset and plain
text. The examples in this chapter show both styles. But you may never have to
construct a professional-looking typeset formula (unless you’re writing a book,
like we’re doing right now). On the other hand, you’ll almost certainly have to
write out plain text formulas as part of organizing, preparing, editing, and ana-
lyzing your data.
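As a concrete illustration, here is how that plain text formula might be carried out in Python (a minimal sketch; the five data values are made up for this example, and the names x, m, and SD simply mirror the formula above):

import math

x = [86, 110, 95, 125, 64]       # made-up sample values
n = len(x)
m = sum(x) / n                   # the sample mean
SD = math.sqrt(sum((xi - m) ** 2 for xi in x) / (n - 1))
print(SD)                        # about 23.25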
Constants
Constants are values that can be represented explicitly (using the numerals 0
through 9 with or without a decimal point), or symbolically (using a letter in the
Greek or Roman alphabet). Symbolic constants represent a particular value impor-
tant in mathematics, physics, or some other discipline, such as:
» The Greek letter π usually represents 3.14159 (plus a zillion more digits). This
Greek letter is spelled pi and pronounced pie, and represents the ratio of the
circumference of any circle to its diameter.
Mathematicians and scientists use lots of other specific Greek and Roman letters
as symbols for specific constants, but you need only a few of them in your biosta-
tistics work. π and e are the most common, and we define others in this book as
they come up in topics we present.
Variables are always italicized in typeset formulas, but not in plain text formulas.
» A minus sign placed immediately before a variable tells you to reverse the
sign of the value of the variable. Therefore, –x means that if x is positive, you
should now make it negative. But it also means that if x is negative, make it
positive (so, if x was –5 kg, then –x would be 5 kg). Used this way, the minus
sign is referred to as a unary operator because it’s acting on only one variable.
Multiplication
The word term is generic for an individual item or element in a formula. Multipli-
cation of terms is indicated in several ways, as shown in Table 2-1.
You can put terms right next to each other to imply multiplication only when it’s
perfectly clear from the context of the formula that the authors are using only
single-letter variable names (like x and y), and that they’re describing calcula-
tions where it makes sense to multiply those variables together. In other words,
you can’t put numeric terms right after one another to imply multiplication,
Division
Like multiplication, division can be indicated in several ways:
Raising to a power
Raising to a power is a shorthand way to indicate repeated multiplication by the
same number. You indicate raising to a power by:
All the preceding expressions are read as “five to the third power,” “five to the
power of three,” or “five cubed.” It says to multiply three fives together: 5 × 5 × 5,
which gives you 125.
Remember the constant e (2.718. . .)? Almost every time you see e used in a for-
mula, it’s being raised to some power. This means you almost always see e with
an exponent after it. Raising e to a power is called exponentiating, and another way
of representing e^x in plain text is exp(x). Remember, x doesn’t have to be a whole
number. By typing =exp(1.6) in the formula bar in Microsoft Excel (or doing the
equation on a scientific calculator), you see that exp(1.6) equals approximately
4.953. We talk more about exponentiating in other book sections, especially
Chapters 18 and 24.
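If you prefer to check these numbers in Python rather than Excel or a calculator, a couple of lines will do (purely an illustration):

import math

print(5 ** 3)          # 125: five to the third power
print(math.exp(1.6))   # about 4.953: e raised to the power 1.6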
Taking a root
Taking a root involves asking the power question backwards. In other words, we
ask: “What base number, when raised to a certain power, equals a certain num-
ber?” For example, “What number, when raised to the power of 2 (which is
squared), equals 100?” Well, 10 × 10 (also expressed 10²) equals 100, so the square
root of 100 is 10. Similarly, the cube root of 1,000,000 is 100, because 100 × 100 × 100
(also expressed 100³) equals a million.
Root-taking is indicated by a radical sign (√) in a typeset formula, where the term
from which we intend to take the root is located “under the roof” of the radical
sign, as 25 is shown here: √25. If no number appears in the notch of the radical
sign, it is assumed we are taking a square root. Other roots are indicated by putting
a number in the notch of the radical sign. Because 2⁸ is 256, we say 2 is the
eighth root of 256, and we put 8 in the notch of the radical sign covering 256, like
this: ⁸√256. You also can indicate root-taking by expressing it a different way used
in algebra: ⁿ√x is equal to x^(1/n) and can be expressed as x ^ (1/n) in plain text.
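In code, root-taking is usually written as raising to a fractional power. A small Python illustration:

print(100 ** 0.5)            # 10.0: the square root of 100
print(256 ** (1 / 8))        # 2.0: the eighth root of 256
print(1_000_000 ** (1 / 3))  # about 100: the cube root of a million (floating-point results are slightly inexact)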
Looking at logarithms
In addition to root-taking, another way of asking the power question backwards
is by saying, “What exponent (or power) must I raise a particular base number to
in order for it to equal a certain number?” For root-taking, in terms of using a
formula, we specify the power and request the base. With logarithms, we specify
the base and request the power (or exponent).
For example, you may ask, “What power must I raise 10 to in order to get
1,000?” The answer is 3, because 10³ = 1,000. You can say that 3 is the logarithm
of 1,000 (for base 10), or, in mathematical terms: Log₁₀(1,000) = 3. Similarly,
because 2⁸ = 256, you say that Log₂(256) = 8. And because e^1.6 ≈ 4.953,
then Logₑ(4.953) ≈ 1.6.
The most common kind of logarithm used in this book is the natural logarithm, so
in this book we always use Log to indicate natural (base-e) logarithms. When we
want to refer to common logarithms, we use Log₁₀, and when referring to binary
logarithms, we use Log₂.
Calculating an antilog is exactly the same as raising the base to the power of the
logarithm. That is, the base-10 antilog of 3 is the same as 10 raised to the power
of 3 (which is 10³, or 1,000). Similarly, the natural antilog of any number is e
(2.718) raised to the power of that number. As an example, the natural antilog of
5 is e⁵, or approximately 148.41.
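The same relationships are easy to verify with Python’s math module (shown only as an illustration):

import math

print(math.log10(1000))  # 3: the base-10 (common) logarithm of 1,000
print(math.log2(256))    # 8: the base-2 (binary) logarithm of 256
print(math.log(4.953))   # about 1.6: the natural (base-e) logarithm of 4.953
print(math.exp(5))       # about 148.41: the natural antilog of 5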
Factorials
Although a statistical formula may contain an exclamation point, that doesn’t
mean that you should sound excited when you read the formula aloud (although it
may be tempting to do so!). An exclamation mark (!) after a number is shorthand
Even though standard keyboards have a ! key, most computer programs and
spreadsheets don’t let you use ! to indicate factorials. For example, to do the cal-
culation of 5! in Microsoft Excel, you use the formula =FACT(5).
» Factorials can be very large. For example, 10! is 3,628,800, and 170! is about
7.3 × 10³⁰⁶, which is close to the processing limits for many computers.
» 0! isn’t 0; it’s actually 1, the same as 1!, which is also 1. That may not make
obvious sense, but it’s true, so you can just memorize it.
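In Python, the counterpart of Excel’s =FACT() is math.factorial(), and a quick check confirms the points above:

import math

print(math.factorial(5))    # 120, the same answer as Excel's =FACT(5)
print(math.factorial(10))   # 3628800, that is, 3,628,800
print(math.factorial(0))    # 1, because 0! is defined as 1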
Absolute values
The term absolute value refers to the value of a number when it is positive (mean-
ing it has no minus sign before it). You indicate absolute value by placing vertical
bars immediately to the left and right of the number. So |5.7| equals 5.7, and
|–5.7| also equals 5.7. Even though most keyboards have the | (pipe) symbol, the
absolute value is usually indicated in plain text formulas as abs(5.7).
Functions
In this book, a function is a set of calculations that accepts one or more numeric
values (called arguments) and produces a numeric result. Regardless of typeset or
plain text, a function is indicated in a formula by the function name followed by a
set of parentheses that contain the argument or arguments. Here’s an example of
the function square root of x: sqrt(x).
The most commonly used functions have been given standard names. The preced-
ing sections in this chapter covered some of these, including sqrt for square root,
exp for exponentiate, log for logarithm, ln for natural log, fact for factorial, and abs
for absolute value.
Whether doing calculations manually or using software, you need to ensure that
you do your formula calculations in the correct order (called the order of operation).
If you evaluate the terms and operations in the formula in the wrong order, you
will get incorrect results. In a complicated formula, the order in which you evalu-
ate the terms and operations is governed by the interplay of several rules arranged
in a hierarchy. Most computer programs try to follow the customary conventions
that apply to typeset formulas, but you need to check your software’s documentation to
be sure.
Here’s a typical set of operator hierarchy rules. Within each hierarchical level,
operations are carried out from left to right:
In a typeset fraction, evaluate terms and operations above the horizontal bar (the
numerator) first, then terms and operations below the bar (the denominator)
next. After that, divide the numerator by the denominator.
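This last rule about fractions is where plain text formulas most often go wrong. A short Python illustration (with made-up numbers) shows why the parentheses around the numerator and denominator matter:

a, b, c, d = 6, 2, 1, 3      # made-up values for illustration

# Correct: parentheses force the numerator and denominator to be evaluated first
print((a + b) / (c + d))     # (6 + 2) / (1 + 3) = 2.0

# Incorrect: without parentheses, the division happens before the additions
print(a + b / c + d)         # 6 + (2 / 1) + 3 = 11.0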
Equations
An equation has two expressions with an equal sign between them. Most equations
appearing in this book have a single variable name to the left of the equal sign and
a formula to the right, like this: SEM = SD / √N. This style of equation defines the
variable appearing on the left in terms of the calculations specified on the right.
The book also contains another type of equation that appears in algebra, asserting
that the terms on the left side of the equation are equal to the terms on the right.
For example, the equation x + 2 = 3x asserts that x is a number that, when added
to 2, produces a number that’s 3 times as large as the original x. Algebra teaches
you how to solve this equation for x, and it turns out that the answer is x = 1.
One-dimensional arrays
A one-dimensional array can be thought of as a list of values. For instance, you
may record a list of fasting glucose values (in milligrams per deciliter, mg/dL )
from five study participants as 86, 110, 95, 125, and 64. You could use the variable
name Gluc to refer to this array containing five numbers, or elements. Using the
term Gluc in a formula refers to the entire five-element array.
You can refer to one particular element of this array (meaning one glucose mea-
surement) in several ways. You can use the index of the array, which is the number
that indicates the position of the element to which you are referring in the array.
» In a plain text formula, indices are typically indicated using brackets (such as
Gluc[3]).
The index can be a variable like i, so Gluc[i] would refer to the ith element of the
array. The term ith means the variable would be allowed to take on any value
between 1 and the maximum number of elements in the array (which in this case
would be 5).
In some programming languages and statistical books and articles, the indices
start at 0 for the first element, 1 for the second element, and so on, which can be
confusing. In this book, all arrays are indexed starting at 1.
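As a concrete illustration, here is the Gluc array in Python. One caution: unlike the 1-based convention used in this book, Python itself numbers list elements starting at 0, so the book’s Gluc₃ is written Gluc[2] in Python code:

Gluc = [86, 110, 95, 125, 64]   # the five-element array from the text

print(Gluc[2])      # 95: the book's third element (Python counts from 0)
i = 4
print(Gluc[i - 1])  # 125: the "ith" element, using the book's 1-based convention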
Special terms may be used to refer to arrays with one or two dimensions:
Arrays in formulas
If you see an array name in a formula without any subscripts, it usually means
that you have to evaluate the formula for each element of the array, and the result
is an array with the same number of elements. So, if Gluc refers to the array with
the five elements 86, 110, 95, 125, and 64, then the expression 2 × Gluc results in
an array with each element in the same order multiplied by two: 172, 220, 190,
250, and 128.
When an array name appears in a formula with subscripts, the meaning depends
upon the context. It can indicate that the formula is to be evaluated only for some
elements of the array, or it can mean that the elements of the array are to be com-
bined in some way before being used (as described in the next section).
When you see ∑ in a formula, just think of it as saying “sum of.” Assuming an
array named Gluc that is composed of the five elements 86, 110, 95, 125, and 64,
you can read the expression ∑Gluc as “the sum of the Gluc array” or “sum of
Gluc.” To evaluate it, add all five elements together to get 86 + 110 + 95 + 125 + 64,
which equals 480.
Sometimes the ∑ notation is written in a more complex form, where the index
variable i is displayed under (or to the right of) the ∑ as a subscript of the array
name, like this: ∑ᵢ Glucᵢ. Though its meaning is the same as ∑Gluc, you would
read it as, “the sum of the Gluc array over all values of the index i” (which pro-
duces the same result as ∑Gluc, which is 480). The subscripted ∑ form is helpful
in expressing multi-dimensional arrays, when you may want to sum over only
one of the dimensions. For example, if Aᵢ,ⱼ is a two-dimensional array:
10  15  33
25   8   1
then ∑ᵢ Aᵢ,ⱼ means that you should sum over the rows (the i subscript) to get the
one-dimensional array: 35, 23, and 34. Likewise, ∑ⱼ Aᵢ,ⱼ means to sum across the
columns (the j subscript) to get the one-dimensional array: 58, 34.
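Here is that two-dimensional example worked out in Python (an illustration only; a nested list is just one convenient way to store a two-dimensional array):

A = [[10, 15, 33],
     [25,  8,  1]]    # 2 rows (index i) by 3 columns (index j)

# Sum over the rows (the i subscript): one total per column
print([sum(column) for column in zip(*A)])   # [35, 23, 34]

# Sum across the columns (the j subscript): one total per row
print([sum(row) for row in A])               # [58, 34]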
Finally, you may see the full-blown official mathematical ∑ in all its glory, like
this:
\sum_{i=a}^{b} Gluc_i ,
which reads “sum of the Gluc array over values of the index i going from a to b,
inclusive.” So if a was equal to 1, and b was equal to 5, the expression would
become:
\sum_{i=1}^{5} Gluc_i ,
which includes all five elements and again equals 480. But if a was 2 and b was 4, the
expression would say to add up only Gluc₂ + Gluc₃ + Gluc₄, to get 110 + 95 + 125, which
would equal 330:
\sum_{i=2}^{4} Gluc_i = Gluc_2 + Gluc_3 + Gluc_4 = 110 + 95 + 125 = 330
Similarly, a large Greek capital pi (∏) works the same way as ∑ but tells you to multiply
the elements instead of adding them:
\prod_{i=1}^{5} Gluc_i = 86 \times 110 \times 95 \times 125 \times 64 = 7,189,600,000
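These indexed sums and products translate directly into code. A brief Python illustration using the same Gluc values:

import math

Gluc = [86, 110, 95, 125, 64]

print(sum(Gluc))               # 480: the sum of all five elements
print(sum(Gluc[1:4]))          # 330: Gluc2 + Gluc3 + Gluc4 (110 + 95 + 125)
print(math.prod(Gluc))         # 7189600000: the product of all five elements
print([2 * g for g in Gluc])   # [172, 220, 190, 250, 128]: 2 x Gluc, element by element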
Fortunately, to make it easier on all of us, we have scientific notation, which is a way to
represent very small or very large numbers to make them easier for humans to under-
stand. Here are three different ways to express the same number in scientific notation:
1.23 × 10⁷, 1.23E7, or 1.23e7. All three mean “take the number 1.23, and then slide
the decimal point seven spaces to the right (adding zeros as needed).” To work this out
by hand, you could start by adding extra decimal places with zeros, like 1.2300000000.
Then, slide the decimal point seven places to the right to get 12300000.000 and clean it
up to get 12,300,000.
For very small numbers, the number after the E (or e) is negative, indicating that you
need to slide the decimal point to the left. For example, 1.23e–9 is the scientific notation
for 0.00000000123.
Note: Don’t be misled by the “e” that appears in scientific notation — it doesn’t stand for
the 2.718 constant. You should read it as “times ten raised to the power of.”
» Understanding nonparametric
statistical tests
Chapter 3
Getting Statistical: A Short Review of Basic Statistics
This chapter provides an overview of basic concepts often taught in a one-
term introductory statistics course. These concepts form a conceptual
framework for the topics that we cover in greater depth throughout this
book. Here, you get the scoop on probability, randomness, populations, samples,
statistical inference, hypothesis testing, and nonparametric statistics.
Note: We only introduce the concepts here. They’re covered in much greater depth
in Statistics For Dummies and Statistics II For Dummies, both written by Deborah
J. Rumsey, PhD, and published by Wiley. Before you proceed, you may want to
skim through this chapter to review the topics we cover so you can fill any gaps in
your knowledge you may need in order to understand the concepts we introduce.
Probabilities are numbers between 0 and 1 that can be interpreted this way:
» A probability of 0 (or 0 percent) means that the event definitely won’t occur.
» A probability of 1 (or 100 percent) means that the event definitely will occur.
The probability of one particular event happening out of N equally likely events
that could happen is 1/N. So with a deck of 52 different cards, the probability of
drawing any one specific card (such as the ace of spades) compared to any of the
other 51 cards is 1/52.
Even though these rules of probabilities may seem simple when presented here,
applying them together in complex situations — as is done in statistics — can get
tricky in practice. Here are descriptions of the not rule, the and rule, and the or rule.
» The not rule: The probability of some event X not occurring is 1 minus the
probability of X occurring, which can be expressed in an equation like this:
Prob(not X) = 1 - Prob(X)
» The and rule: For two independent events, X and Y, the probability of event X
and event Y both occurring is equal to the product of the probability of each of
the two events occurring independently. Expressed as an equation, the and
rule looks like this:
Prob(X and Y) = Prob(X) × Prob(Y)
As an example of the and rule, imagine that you flip a fair coin and then draw
a card from a deck. What’s the probability of getting heads on the coin flip and
also drawing the ace of spades? The probability of getting heads in a fair coin
flip is 1/2, and the probability of drawing the ace of spades from a deck of
cards is 1/52. Therefore, the probability of having both of these events occur is
1/2 multiplied by 1/52, which is 1/104, or approximately 0.0096 (which is — as
you can see — very unlikely).
» The or rule: For two independent events, X and Y, the probability of X or Y (or
both) occurring is calculated by a more complicated formula, which can be
derived from the preceding two rules. Here is the formula:
Prob(X or Y) = 1 - (1 - Prob(X)) × (1 - Prob(Y))
As an example, suppose that you roll a pair of six-sided dice. What’s the
probability of rolling a 4 on at least one of the two dice? For one die, there is a
1/6 chance of rolling a 4, which is a probability of about 0.167. (The chance of
getting any particular number on the roll of a six-sided die is 1/6, or 0.167.)
Using the formula, the probability of rolling a 4 on at least one of the two dice
is 1 - (1 - 0.167) × (1 - 0.167), which works out to 1 - 0.833 × 0.833, or 0.31,
approximately (the short code example after this list checks these rules numerically).
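Here is that short example in Python. It simply re-computes the coin-and-card and dice examples with the rules above; nothing in it is specific to any statistics package:

p_heads = 1 / 2           # probability of heads on a fair coin flip
p_ace = 1 / 52            # probability of drawing the ace of spades
p_four = 1 / 6            # probability of rolling a 4 on one die

print(1 - p_four)                        # not rule: about 0.833
print(p_heads * p_ace)                   # and rule: about 0.0096 (1/104)
print(1 - (1 - p_four) * (1 - p_four))   # or rule: about 0.31 for at least one 4 on two dice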
The and and or rules apply only to independent events. For example, if there is a 0.7
chance of rain tomorrow, you may make contingency plans. Let’s say that if it
does not rain, there is a 0.9 chance you will have a picnic rather than stay in and read
a book, but if it does rain, there is only a 0.1 chance you will have a picnic rather
The odds of an event equals the probability of the event occurring divided by the
probability of that event not occurring. We already know we can calculate the prob-
ability of the event not occurring by subtracting the probability of the event occur-
ring from 1 (as described in the previous section). With that in mind, you can
express odds in terms of probability in the following formula:
Odds = Prob / (1 - Prob)
With a little algebra (which you don’t need to worry about), you can solve this
formula for probability as a function of odds:
Prob = Odds / (1 + Odds)
Returning to the casino example, if the customer says their odds of losing are
2-to-1, they mean 2/1, which equals 2. If we plug the odds of 2 into the second
equation, we get 2/(1+2), which is 2/3, which can be rounded to 0.6667. The cus-
tomer is correct — they will lose two out of every three times, and win one out of
every three times, on average.
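If you find yourself converting back and forth often, the two formulas are easy to wrap up as small Python functions (a sketch only, not part of any statistics package):

def odds_from_prob(prob):
    # odds = probability of occurring / probability of not occurring
    return prob / (1 - prob)

def prob_from_odds(odds):
    # probability = odds / (1 + odds)
    return odds / (1 + odds)

print(odds_from_prob(0.75))  # 3.0: a 0.75 probability corresponds to 3-to-1 odds
print(prob_from_odds(2))     # about 0.667: 2-to-1 odds of losing, as in the casino example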
As shown in Table 3-1, for very low probabilities, the odds are very close to the prob-
ability. But as the probability increases, the odds increase much faster than the
probability does. By the time probability reaches 0.5, the odds have become 1, and as probability approaches 1, the
odds become infinitely large! This definition of odds is consistent with its
common-language use. As described earlier with the casino example, if the odds
of a horse losing a race are 3:1, that means if you bet on this horse, you have three
chances of losing and one chance of winning, for a 0.75 probability of losing.
TABLE 3-1  Probability versus Odds
Probability   Odds      What It Means
0.9           9         The event will occur 90% of the time (it is nine times as likely to occur as to not occur).
0.75          3         The event will occur 75% of the time (it is three times as likely to occur as to not occur).
0.5           1.0       The event will occur about half the time (it is equally likely to occur or not occur).
0.25          0.3333    The event will occur 25% of the time (it is one-third as likely to occur as to not occur).
0.1           0.1111    The event will occur 10% of the time (it is 1/9th as likely to occur as to not occur).
The important point about the term random is that you can’t predict a specific out-
come if a random element is involved. But just because you can’t predict a specific
outcome with random numbers doesn’t mean that you can’t make any predictions
about these numbers. Statisticians can make reasonably accurate predictions
about how a group of random numbers behave collectively, even if they cannot
predict a specific outcome when any randomness is involved.
To illustrate sampling error, we obtained a data set containing the number of pri-
vate and public airports in each of the United States and the District of Columbia
in 2011 from Statista (available at https://www.statista.com/statistics/
185902/us-civil-and-joint-use-airports-2008/). We started by making a
histogram of the entire data set, which would be considered a census because it
contains the entire population of states. A histogram is a visualization to deter-
mine the distribution of numerical data, and is described more extensively in
Chapter 9. Here, we briefly summarize how to read a histogram:
We first made a histogram of the census, then we took four random samples of 20
states and made a histogram of each of the samples. Figure 3-1 shows the results.
FIGURE 3-1: Distribution of the number of private and public airports in 2011 in the population (of 50 states and the District of Columbia), and in four different samples of 20 states from the same population. © John Wiley & Sons, Inc.
As shown in Figure 3-1, when comparing the sample distributions to the distribu-
tion of the population using the histograms, you can see there are differences.
Sample 2 looks much more like the population than Sample 4. However, they are
all valid samples in that they were randomly selected from the population. The
samples are an approximation to the true population distribution. In addition, the
mean and standard deviation of the samples are likely close to the mean and stan-
dard deviation of the population, but not equal to it. (For a refresher on mean and
standard deviation, see Chapter 9.) These characteristics of sampling error —
where valid samples from the population are almost always somewhat different
than the population — are true of any random sample.
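You can watch sampling error happen with a few lines of Python. The values below are generated at random purely for illustration (they are not the airport data described above), but the behavior is the same: every random sample has a slightly different mean than the population it came from.

import random

random.seed(1)   # fixes the random numbers so the illustration is repeatable

# A made-up "census" of 51 values standing in for a whole population
population = [random.randint(20, 400) for _ in range(51)]
print(sum(population) / len(population))     # the population mean

for _ in range(4):
    sample = random.sample(population, 20)   # a random sample of 20, drawn without replacement
    print(sum(sample) / len(sample))         # each sample mean comes out a little different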
In the following sections, we break down two types of distributions: those that
describe fluctuations in your data, and those that you encounter when performing
statistical tests.
» Accuracy refers to how close your measurement tends to come to the true
value, without being systematically biased in one direction or another. Such a
bias is called a systematic error.
» High precision and high accuracy is an ideal result. It means that each
measurement you take is close to the others, and all of these are close to the
true population value.
» High precision and low accuracy is not as ideal. This is where repeat
measurements tend to be close to one another, but are not that close to the
true value. This situation can occur when you ask survey respondents to self-report
their weight. The average of the answers may be similar survey after survey,
but the answers may be systematically lower than the truth. Although it is easy to
predict what the next measurement will be, the measurement is less useful if
it does not help you know the true value. This indicates you may want to
improve your measurement methods.
» Low precision and high accuracy is also not as ideal. This is where the
measurements are not that close to one another, but are not that far from the
true population value. In this case, you may trust your measurements, but
find that it is hard to predict what the next one will be due to random error.
» Low precision and low accuracy shows the least ideal result, which is a low
level of both precision and accuracy. This can only be improved through
improving measurement methods.
Fortunately, you don’t have to repeat the entire experiment a large number of
times to calculate the SE. You can usually estimate the SE using data from a single
experiment by using confidence intervals.
At this point, you may be wondering how to calculate CIs. If so, turn to Chapter 10,
where we describe how to calculate confidence intervals around means, propor-
tions, event rates, regression coefficients, and other quantities you measure,
count, or calculate.
In its most basic form, statistical decision theory deals with using a sample to
make a decision as to whether a real effect is taking place in the background
population. We use the word effect throughout this book, which can refer to differ-
ent concepts in different circumstances. Examples of effects include the
following:
» Null hypothesis (abbreviated H0): The assertion that any apparent effect
you see in your data is not evidence of a true effect in the population, but is
merely the result of random fluctuations.
» Significance: The conclusion that random fluctuations alone can’t account for
the size of the effect you observe in your data. In this case, H0 must be false,
so you accept HAlt.
» Test statistic: A number calculated from your sample as part of performing
a statistical test, for the purpose of testing H0. In general, the test
» p value: The probability or likelihood that random fluctuations alone (in the
absence of any true effect in the population) can produce the effect observed
in your sample (or one at least as large as the effect you observe in your sample).
The p value is the probability of random fluctuations making the test statistic
at least as large as what you calculate from your sample (or, more precisely,
at least as far away from H0 in the direction of HAlt).
» Type I error: Concluding that HAlt is correct when, in fact, no true effect above
random fluctuations is present.
» Type II error: Concluding that H0 is correct when, in fact, there is indeed a true
effect present that rises above random fluctuations.
1. Reduce your raw sample data down into a single number called a test
statistic.
Each test statistic has its own formula, but in general, the test statistic repre-
sents the magnitude of the effect you’re looking for relative to the magnitude of
the random noise in your data. For example, the test statistic for the unpaired
Student t test for comparing means between two groups is calculated as a
fraction: the difference between the two group means divided by the standard
error of that difference.
To do this, you use complicated formulas to generate the test statistic. Once
the test statistic is calculated, it is placed on a probability distribution. The
distribution describes how much the test statistic bounces around if only
random fluctuations are present (that is, if H0 is true). For example, the Student
t statistic is placed on the Student t distribution. The result from placing the
test statistic on a distribution is known as the p value, which is described in the
next section.
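In practice, statistical software does both steps at once. For example, here is a sketch using the open-source scipy library to run an unpaired Student t test on two small made-up groups; it returns the test statistic and its p value together:

from scipy import stats

group_a = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7]    # made-up measurements for one group
group_b = [6.2, 6.8, 5.9, 7.1, 6.5, 6.9]    # made-up measurements for the other group

result = stats.ttest_ind(group_a, group_b)  # unpaired (independent-samples) Student t test
print(result.statistic)                     # the t test statistic
print(result.pvalue)                        # the p value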
How small should a p value be before we reject the null hypothesis? The technical
answer is this is arbitrary and depends on how much of a risk you’re willing to
take of being fooled by random fluctuations (that is, of making a Type I error). But
in practice, the value of 0.05 has become accepted as a reasonable criterion for
declaring significance, meaning we fail to reject the null for p values of 0.05 or
greater. If you adopt the criterion that p must be less than 0.05 to reject the null
hypothesis and declare your effect statistically significant, this is known as setting
alpha (α) to 0.05, and it limits your likelihood of making a Type I error
to no more than 5 percent.
The truth can be one of two answers — the effect is there, or the effect is not
there. Also, your conclusion from your sample will be one of two answers — the
effect is there, or the effect is not there.
» Your test is not statistically significant, but H0 is false. In this situation, the
interpretation of your test is wrong and does not match truth. Imagine testing
the difference in effect between Drug C and Drug D, where in truth, Drug C
had more effect than Drug D. If your test was not statistically significant, it
would be the wrong result. This situation is called Type II error. The probability
of making a Type II error is represented by the Greek letter beta (β).
We discussed setting α = 0.05, meaning that you are willing to tolerate a Type I
error rate of 5 percent. Theoretically, you could change this number. You can
increase your chance of making a Type I error by increasing your α from 0.05 to a
higher number like 0.10, which is done in rare situations. But if you reduce your α
to a number smaller than 0.05 — like 0.01, or 0.001 — then you run the risk of never
calculating a test statistic with a p value that is statistically significant, even if a
true effect is present. If α is set too low, it means you are being very picky about
accepting a true effect suggested by the statistics in your sample. If a drug really
is effective, you want to get a result that you interpret as statistically significant
when you test it. What this shows is that you need to strike a balance between the
At this point, you may be wondering, “Is there any way to keep both Type I and
Type II error small?” The answer is yes, and it involves power, which is described
in the next section.
» The α level you’ve established for the test — that is, the chance you’re willing
to accept making a Type I error (usually 0.05)
» The actual magnitude of the effect in the population, relative to the amount of
noise in the data
Power, sample size, effect size relative to noise, and α level can’t all be varied
independently. They’re interrelated, because they’re connected and constrained
by a mathematical relationship involving all four quantities.
This relationship between power, sample size, effect size relative to noise, and α
level is often very complicated, and it can’t always be written down explicitly as a
formula. But the relationship does exist. As evidence of this, for any particular
type of test, theoretically, you can determine any one of the four quantities if you
know the other three. So for each statistical test, there are four different ways to
do power calculations, with each way calculating one of the four quantities from
arbitrarily specified values of the other three. (We have more to say about this in
Chapter 5, where we describe practical issues that arise during the design of
research studies.) In the following sections, we describe the relationships between
power, sample size, and effect size, and briefly review how you can perform power
calculations.
» Power versus sample size, for various effect sizes: For all statistical tests,
power always increases as the sample size increases, if the other variables
including α level and effect size are held constant. This relationship is illus-
trated in Figure 3-2. “Eff” is the effect size — the between-group difference
divided by the within-group standard deviation.
Small samples will not be able to produce significant results unless the effect size
is very large. Conversely, statistical tests using extremely large samples including
many thousands of participants are almost always statistically significant unless
the effect size is near zero. In epidemiological studies, which often involve
hundreds of thousands of subjects, statistical tests tend to produce extremely
small (and therefore extremely significant) p values, even when the effect size is
so small that it’s of no practical importance (meaning it is clinically insignificant).
FIGURE 3-2: The power of a statistical test increases as the sample size and the effect size increase. © John Wiley & Sons, Inc.
» Power versus effect size, for various sample sizes: For all statistical tests,
power always increases as the effect size increases, if other variables including
the α level and sample size are held constant. This relationship is illustrated
in Figure 3-3. “N” is the number of participants in each group.
For very large effect sizes, the power approaches 100 percent. For very small
effect sizes, you may think the power of the test would approach zero, but you
can see from Figure 3-3 that it doesn’t go down all the way to zero. It actually
approaches the α level of the test. (Keep in mind that the α level of the test is
the probability of the test producing a significant result when no effect is
truly present.)
» Sample size versus effect size, for various values of power: For all
statistical tests, sample size and effect size are inversely related, if other variables
including α level and power are held constant. Small effects can be detected
only with large samples, and large effects can often be detected with small
samples. This relationship is illustrated in Figure 3-4.
FIGURE 3-4: Smaller effects need larger samples. © John Wiley & Sons, Inc.
» Computer software: The larger statistics packages such as SPSS, SAS, and R
enable you to perform a wide range of power calculations. Chapter 4
describes these different packages. There are also programs specially
designed for conducting power calculations, such as PS and G*Power, which
are described in Chapter 4.
» Web pages: Many of the more common power calculations can be performed
online using web-based calculators. An example of one of these is here:
https://clincalc.com/stats/samplesize.aspx.
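As an example of the software route, the free statsmodels package for Python can solve for any one of the four interrelated quantities when you supply the other three. This sketch asks for the sample size per group for a two-group unpaired t test; the inputs (a medium effect size of 0.5, α = 0.05, 80 percent power) are arbitrary choices for illustration:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Solve for the sample size per group, given effect size, alpha, and power
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(n_per_group)   # roughly 64 participants per group

# Or solve for power, given a planned sample size of 30 per group
print(analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30))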
Parametric tests assume that your data conform to a parametric distribution func-
tion. Because the normal distribution is the most common statistical distribution,
the term parametric test is often used to mean a test that assumes normally dis-
tributed data. But sometimes your data don’t follow a parametric distribution. For
example, it may be very noticeably skewed, as shown in Figure 3-5a.
FIGURE 3-5: Skewed data (a) can sometimes be turned into normally distributed data (b) by taking logarithms. © John Wiley & Sons, Inc.
If you transform your data to get it to assume a normal distribution, any analyses
done on it will need to be “untransformed” to be interpreted. For example, if you
have a data set of patients with different lengths of stay in a hospital, you will
likely have skewed data. If you log-transform these data so that they are normally
distributed, then generate statistics (like calculate a mean), you will need to do an
inverse log transformation on the result before you interpret it.
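Here is what that round trip looks like in Python with made-up length-of-stay values: log-transform, take the mean on the log scale, and then back-transform (inverse log) before interpreting. The back-transformed result is the geometric mean, which is smaller than the ordinary mean for skewed data like these:

import math

length_of_stay = [1, 2, 2, 3, 4, 7, 15, 30]    # made-up, right-skewed data (days in hospital)

logs = [math.log(x) for x in length_of_stay]   # log-transform
mean_of_logs = sum(logs) / len(logs)           # analyze on the log scale
geometric_mean = math.exp(mean_of_logs)        # back-transform (inverse log) before interpreting

print(geometric_mean)   # about 4.4 days, compared with an ordinary mean of 8.0 days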
But sometimes your data are not normally distributed, and for whatever reason,
you give up on trying to do a parametric test. Maybe you can’t find a good trans-
formation for your data, or maybe you don’t want to have to undo the transfor-
mation in order to do your interpretation, or maybe you simply have too small of
Parametric Test                                        Nonparametric Counterpart
One-group or paired Student t test (see Chapter 11)    Wilcoxon Signed-Ranks test
Pearson Correlation test (see Chapter 15)              Spearman Rank Correlation test
Most nonparametric tests involve first sorting your data values, from lowest to
highest, and recording the rank of each measurement. Ranks are like class ranks
in school, where the person with the highest grade point average (GPA) is ranked
number 1, and the person with the next highest GPA is ranked number 2 and so on.
Ranking forces each individual to be separated from the next by one unit of rank.
In data, the lowest value has a rank of 1, the next highest value has a rank of 2, and
so on. All subsequent calculations are done with these ranks rather than with the
actual data values. However, using ranks instead of the actual data loses informa-
tion, so you should avoid using nonparametric tests if your data qualify for
parametric methods.
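As a quick illustration of the parametric-versus-nonparametric pairing shown above, the open-source scipy library offers both kinds of correlation test; the Spearman version works on the ranks of the data rather than the raw values (the numbers here are made up):

from scipy import stats

x = [1.2, 2.4, 3.1, 4.8, 5.0, 9.9]      # made-up data with one unusually large pair
y = [2.0, 2.9, 3.5, 5.1, 5.4, 20.0]

print(stats.pearsonr(x, y))    # Pearson correlation: works on the raw values
print(stats.spearmanr(x, y))   # Spearman correlation: works on the ranks instead
print(stats.rankdata(y))       # the ranks Spearman uses: [1. 2. 3. 4. 5. 6.]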
Chapter 4
Counting on Statistical Software
Before statistical software, complex regressions we could do in theory were
too complicated to do manually using real datasets. It wasn’t until the
1960s with the development of the SAS suite of statistical software that
analysts were able to do these calculations. As technology advanced, different
types of software were developed, including open-source software and web-based
software.
As you may imagine, all these choices led to competition and confusion among
analysts, students, and organizations utilizing this software. Organizations
wonder what statistical packages to implement. Professors wonder which ones to
teach, and students wonder which ones to learn. The purpose of this chapter is
to help you make informed choices about statistical software. We describe and
provide guidance regarding the practical choices you have today among the
statistical software available. We discuss choosing between:
If you were to take a college statistics course in the year 2000, your course would
have likely taught either SAS or SPSS. Professors would have made either SPSS or
SAS available to you for free or for a nominal license fee from your college book-
store. If you take a college statistics course today, you may be in the same
situation — or, you may find yourself learning so-called open-source statistical
software packages. The most common are R and Python. This software is free to
the user and downloadable online because it is built by the user community, not a
company.
As the Internet evolved, more options became available for statistical software. In
addition to the existing stand-alone applications described earlier, specialized
statistical apps were developed that only perform one or a small collection of spe-
cific statistical functions (such as G*Power and PS, which are for calculating sam-
ple sizes). Similarly, web-based online calculators were developed, which are
typically programmed to do one particular function (such as calculate a chi-square
statistic and p value from counts of data, as described in Chapter 12). Some web
pages feature a collection of such calculators.
Comparing Commercial to
Open-Source Software
Before 2010, if an organization performed statistical analysis as part of its core
function, it needed to purchase commercial statistical software like SAS or
SPSS. Advantages of implementing commercial software include the ability to
So, why are new organizations today hesitant to adopt commercial software when
open-source software has so many downsides? The main reason is that the
old advantages of commercial software are not as true anymore. SAS and SPSS
are expensive programs, yet they offer much the same functionality as open-
source R and Python, which are free. In some cases, analysts prefer the
open-source application to the commercial application because they can custom-
ize it more easily to their setting. Also, it is not clear that commercial software is
innovating ahead of open-source software. Organizations do not want to get
entangled with expensive commercial software that eventually starts to perform
worse than free open-source alternatives!
SAS
SAS is the oldest commercial software currently available. It started out as having
two main components — Base SAS and SAS Stat — that provided the most used
statistical calculations. However, today, it has grown to include many additional
components and sublanguages. SAS has always been so expensive that only orga-
nizations with a significant budget can afford to purchase and use it. However,
Today, the experience of using SAS has been modernized. In PC SAS and SAS ODA,
it is easy to view code, log, and output files in different windows and switch back
and forth between them. It is also easier to import data into and out of the SAS
environment and create integrated application pipelines involving the SAS envi-
ronment. The new commercial cloud-based version of SAS called Viya is intended
to be used with data stored in the cloud rather than on SAS servers (see the later
section “Storing Data in the Cloud” for more).
Students often find that SAS is challenging to learn when compared to other sta-
tistical software, especially open-source software. Why learn legacy commercial
software like SAS today, when it is so much harder to learn than other software?
The answer is that SAS is still standard software in some domains, such as phar-
maceutical research. This means that even if those organizations choose to even-
tually migrate away from SAS, they will need to hire SAS users to help with the
migration.
SPSS
SPSS was invented more recently than SAS and runs in a fundamentally different
way. SPSS does not expect you to have a data server the way SAS does. Instead,
SPSS runs as a stand-alone program like PC SAS, and expects you to import data
into it for analysis. Therefore, SAS is more likely to be used in a team environ-
ment, while SPSS tends to have individual users.
Microsoft Excel
Microsoft Excel has been used in some domains for statistical calculations, but it
is difficult to use with large datasets. Excel has built-in functions for summariz-
ing data (such as calculating means and standard deviations talked about in
Chapter 9). It also has common probability distribution functions such as Student
t (Chapter 11) and chi-square (Chapter 12). You can even do straight-line regres-
sion (Chapter 16), as well as more extensive analyses available through add-ins.
These functions can come in handy when doing quick calculations or learning
about statistics, but using Excel for statistical projects poses many challenges.
Using a spreadsheet for statistics means your data are stored in the same place as
your calculations, creating privacy concerns (and a mess!). So, while Excel can be
helpful mathematically — especially when making extra calculations based on
estimates in printed statistical output — it is not a good practice to use it for
extensive statistical projects.
Focusing on Open-Source
and Free Software
Open-source software refers to software that has been developed and supported by
a user community. Although open-source software comes with licenses, the licenses
are typically free but require you to adhere to certain policies when using the software. In this
section, we talk about the two most popular open-source statistical software
packages: R and Python.
Open-source software
The two most popular and extensive open-source statistical programs are R and
Python.
» OpenStat and LazStats are free statistical programs developed by Dr. Bill
Miller that use menus that resemble SPSS. Dr. Miller provides several
excellent manuals and textbooks that support both programs. OpenStat and
LazStats are available at https://openstat.info.
» Epi Info was developed by the U.S. Centers for Disease Control and Prevention (CDC) to
acquire, manage, analyze, and display the results of epidemiological research.
What makes it different than other statistical software is that it contains
modules for creating survey forms and collecting data. Epi Info is available at
https://www.cdc.gov/epiinfo/index.html.
This is an important issue in statistics. When no code files are produced or saved,
you have no record of the steps in your analysis. If you need to be able to repro-
duce your analysis, the only way to be sure of this is to use software that allows
you to save the code so you can run it again.
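For example, a tiny R script like the one below — the file and variable names are hypothetical, and the script is only a sketch — can be saved and rerun at any time to reproduce the same summary:

# analysis.R -- a saved, rerunnable record of the analysis steps
bp <- read.csv("trial_data.csv")      # hypothetical raw data file
bp_clean <- subset(bp, !is.na(sbp))   # drop records with a missing SBP value
summary(bp_clean$sbp)                 # summary statistics for SBP
# Rerunning this script against the same raw file reproduces the same output.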
Although moving data to the cloud may be an onerous task, you may not have any
choice, because physical storage space may be running out. Many new organiza-
tions start with cloud data storage for that reason. Once your data are stored in the
cloud, they are more easily accessed using online analytics platforms such as SAS
Viya and Tableau.
Chapter 5
Conducting Clinical
Research
This chapter provides a closer look at a special kind of human research —
the clinical trial. The purpose of a clinical trial is to test one or more
interventions, such as a medication or other product or action thought to be
therapeutic (such as drinking green tea or exercising). One of the important fea-
tures of a clinical trial is that it uses an experimental study design, meaning that
the study staff assign each participant to an intervention. Because of this, clinical
trials raise serious ethical considerations. On
the other hand, the clinical trial study design provides the highest quality
evidence you can obtain to determine whether or not an intervention actually
works, which is a form of causal inference. In this chapter, we cover approaches
to designing and executing a high-quality clinical trial and explain the ethical
considerations that go along with this.
The objectives are much more specific than the aims. In a clinical trial, the objec-
tives usually refer to the effect of the intervention (treatment being tested) on
specific outcome variables at specific points in time in a group of a specific type of
study participants. In a drug trial, which is a type of experimental clinical research,
an efficacy study may have many individual efficacy objectives, as well as one or
two safety objectives, while a safety study may or may not have efficacy
objectives.
When designing a clinical trial, you should identify one or two primary objectives —
those that are most directly related to the aim of the study. This makes it easier to
determine whether the intervention meets the objectives once your analysis is
complete. You may then identify up to several dozen secondary objectives, which
may involve different variables or the same variables at different time points or in
different subsets of the study population. You may also list a set of exploratory
objectives, which are less important, but still interesting. Finally, if testing a risky
intervention (such as a pharmaceutical), you should list one or more safety
objectives (if this is an efficacy study) or some efficacy objectives (if this is a
safety study).
The objectives you select will determine what data you need to collect, so you have
to choose wisely to make sure all data related to those objectives can be collected
in the timeframe of your study. Also, these data will have to be collected and processed from various
sources, including case report forms (CRFs), surveys, and centers providing labo-
ratory data. These considerations may limit the objectives you choose to study!
Drug clinical trials are usually efficacy studies. Here is an example of each type of
objective you could have in an efficacy study:
» Safety objective: To evaluate the safety of drug XYZ, relative to drug ABC, in
terms of the occurrence of adverse events, changes from baseline in vital signs
such as temperature and heart rate, and changes in laboratory results of safety
panels (including tests on kidney and liver function), in participants with HTN.
For each of these objectives, it is important to specify the time range of participa-
tion subject to the analysis (such as the first week of the trial compared to other
time segments). Also, which groups are being compared for each objective should
be specified.
Hypotheses usually correspond to the objectives but are worded in a way that
directly relates to the statistical testing to be performed. So, the preceding pri-
mary objective may correspond to the following hypothesis: “The mean 12-week
reduction in SBP will be greater in the XYZ group than in the ABC group.” Alter-
natively, the hypothesis may be expressed in a more formal mathematical nota-
tion and as a null and alternate pair (see Chapters 2 and 3 for details on these
terms and the mathematical notation used).
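For instance, writing $\mu_{XYZ}$ and $\mu_{ABC}$ for the mean 12-week reductions in SBP in the XYZ and ABC groups (notation chosen here purely for illustration), the pair could be written as

$$H_0: \mu_{XYZ} - \mu_{ABC} = 0 \quad \text{versus} \quad H_A: \mu_{XYZ} - \mu_{ABC} > 0$$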
In all types of human research, identifying the variables to collect in your study
should be straightforward after you’ve selected a study design and enumerated all
the objectives. In a clinical trial, you will need to operationalize the measurements
you need, meaning you will need to find a way to measure each concept specified
in the objectives. Measurements generally fall into two categories: values that do not
change over time and values that change over time.
Values that do not change over time, such as birth date and medical history, only
need to be recorded at the beginning of the study. In designs including follow-up
visits, values that change over time are measured multiple times over the duration
of data collection for the study. Depending on the research objectives, these could
include weight, medication use, and test results. Most of this data collection is
scheduled as part of study visits, but some may be recorded only at unpredict-
able times, if at all (such as adverse events, and withdrawing from the study
before it is completed).
» Exclusion criteria are used to identify potential participants who do not fall in
the population being studied. They are also used to rule out participation by
individuals who are otherwise in the population being studied but should not
participate for other reasons, such as safety concerns.
» Parallel: In this clinical trial design, each participant receives one of the
interventions, and the groups are compared. Parallel designs are simpler,
quicker, and easier for each participant than crossover designs, but you need
more participants for the statistics to work out. Trials with very long treatment
periods usually have to be parallel.
Using randomization
Randomized controlled trials (RCTs) are the gold standard for clinical research (as
described in Chapters 7 and 20). In an RCT, the participants are randomly allocated
to the treatment groups.
Blinding eliminates bias resulting from the placebo effect, which is where
participants tend to respond favorably to any treatment (even a placebo),
especially when the efficacy variables are subjective, such as pain level.
Double-blinding also eliminates deliberate and subconscious bias in the
investigator’s evaluation of a participant’s condition.
FIGURE 5-1: Simple randomization. © John Wiley & Sons, Inc.
FIGURE 5-2: Random shuffling. © John Wiley & Sons, Inc.
This arrangement is better because there are exactly six participants assigned to
drug and placebo each. But this particular random shuffle happens to assign the drug
more often to the earlier participants and the placebo more often to the later participants (which
is just by chance). If the recruitment period is short, this would be perfectly fine.
However, if these 12 participants were enrolled over a period of five or six months,
seasonal effects may be mistaken for treatment effects, which is an example of
confounding.
To make sure that both treatments are evenly spread across the entire recruitment
period, you can use blocked randomization, in which you divide your subjects into
consecutive blocks and shuffle the assignments within each block. Often the block
size is set to twice the number of treatment groups. For instance, a two-group
study would use a block size of four. This is shown in Figure 5-3.
FIGURE 5-3: Blocked randomization. © John Wiley & Sons, Inc.
You can create simple and blocked randomization lists in Microsoft Excel using
the RAND() built-in function to shuffle the assignments. You can also use the web
page at https://www.graphpad.com/quickcalcs/randomize1.cfm to generate
blocked randomization lists quickly and easily.
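If you would rather script it, here is a minimal sketch in R of blocked randomization for the same situation — two groups, a block size of four, and 12 participants; the seed is arbitrary and only makes the list reproducible:

set.seed(42)                                   # arbitrary seed for a reproducible list
n_blocks  <- 3                                 # 3 blocks of 4 covers 12 participants
one_block <- c("Drug", "Drug", "Placebo", "Placebo")
assignments <- unlist(lapply(1:n_blocks, function(b) sample(one_block)))
# Each block of four holds exactly two Drug and two Placebo slots, shuffled
# independently, so the treatments stay balanced across the enrollment period.
print(assignments)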
You must also allow some extra space in your target sample-size estimate for
some of the enrolled participants to drop out or otherwise not contribute the data
you need for your analysis. For example, suppose that you need full data from
64 participants for sufficient statistical power to answer your main objective. If
you expect a 15 percent attrition rate from the study, which means you expect only
85 percent of the enrolled participants to have analyzable data, then you need
to plan to enroll 64/0.85, or 76, participants in the study.
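The same adjustment takes one line in R, using the numbers from this example:

needed    <- 64                 # participants with analyzable data required for power
retention <- 0.85               # expected proportion retained (15 percent attrition)
ceiling(needed / retention)     # 75.3 rounds up to 76 participants to enroll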
» Title: A title conveys as much information about the trial as you can fit into
one sentence, including the protocol ID, name of the study, clinical phase, type
and structure of trial, type of randomization and blinding, name of the drug or
drugs being tested, treatment regimen, intended effect, and the population
being studied (which could include a reference to individuals with a particular
medical condition). A title can be quite long — this example title has all the
preceding elements:
» Rationale: The rationale for the study states why it makes sense to do this
study at this time and includes a justification for the choice of doses, how
the drug is administered (such as orally or intravenously), duration of drug
administration, and follow-up period.
» Design of the clinical trial: As described in the earlier section “Choosing the
structure of a clinical trial,” the clinical trial’s design defines its structure. This
includes the number of treatment groups as well as consecutive stages of the
study. These stages could include eligibility screening, washout, treatment,
follow-up, and so on. This section often includes a schematic diagram of the
structure of the study.
» Safety considerations: These factors include the known and potential side
effects of each drug included. This section also includes the known and
potential side effects of each procedure in the study, including X-rays, MRI
scans, and blood draws. It also describes steps taken to minimize the risk to
the participants.
» Handling of adverse events: This section describes how adverse events will
be addressed should they occur during the study. It includes a description of
the data that will be recorded, including the nature of the adverse event,
severity, dates and times of onset and resolution, any medical treatment given
for the event, and whether or not the investigator thinks the event was related
to the study drug. It also explains how the research study will support the
participant after the adverse event.
Since the end of World War II, international agreements have established ethical
guidelines for human research all over the world. Regardless of the country in
which you are conducting research, prior to beginning a human research study,
your protocol will need to be approved by at least one ethics board. Selection of
ethics boards for human research depends upon where the research is taking place
and what institutions are involved (see the later section “Working with Institu-
tional Review Boards”).
» Safety: Minimizing the risk of physical harm to the participant from the drug
or drugs being tested and from the procedures involved in the study
» Privacy/confidentiality: Ensuring that data collected during the study are not
breached (stolen) and are not made public in a way that identifies a specific
participant without the participant’s consent
The following sections describe some of the infrastructure that helps protect human
subjects.
Because research ethics are international, other countries have similar agencies,
so international clinical trial oversight can get confusing. An organization called
the International Council for Harmonisation (ICH) works to establish a set of
consistent standards that can be applied worldwide. The FDA and NIH have
adopted many ICH standards (with some modifications).
Most medical centers and academic institutions (as well as some industry part-
ners) run their own IRBs with jurisdiction over research conducted at their insti-
tution. If you’re not affiliated with one of these centers or institutions (for
example, if you’re a freelance biostatistician or clinician), you may need the ser-
vices of a consulting or free-standing IRB. The sponsor of the research may recom-
mend or insist you use a particular IRB for the project.
Potential participants must be told that they can refuse to participate with no
penalty to them, and if they join the study, they can withdraw at any time for any
reason without fear of retribution or the withholding of regular medical care. The
IRB can provide ICF templates with examples of their recommended or required
wording.
Fortunately, such training is readily available. Most hospitals and medical centers
provide training in the form of online courses, workshops, lectures, and other
resources. As you comply with ongoing IRB training, you receive a certification in
human subjects protection. Most IRBs and funding agencies require proof of cer-
tification from study staff. If you don’t have access to that training at your insti-
tution, you can get certified by taking an online tutorial offered by the NIH
(https://grants.nih.gov/policy/humansubjects/research/training-and-
resources.htm).
» In the case of digitally collected data, the central analytic team will run
routines for validating the data. They will communicate with study staff if they
find errors and work them out.
» In the case of manually collected data, data entry into a digital format will be
required. Study staff typically are expected to log into an online database with
CRFs and do data entry from data collected on paper.
» The sponsor of the study will provide detailed data entry instructions and
training to ensure high-quality data collection and validation of the data
collected in the study.
» Exclusion: Exclude a case from an analysis if any of the required variables for
that analysis is missing. This seems simple, but the downside to this approach
is it can reduce the number of analyzable cases, sometimes quite severely.
And if the result is missing for a reason that’s related to treatment efficacy,
excluding the case can bias your results.
Handling multiplicity
Every time you perform a statistical significance test, you run a chance of being
fooled by random fluctuations into thinking that some real effect is present in
your data when, in fact, none exists (review Chapter 3 for a refresher on statistical
testing). If you declare the results of a test statistically significant when in
reality they are not, you are committing a Type I error. When you say that you require
p < 0.05 to declare statistical significance, you’re testing at the 0.05 (or 5 percent)
alpha (α) level. This is another way of saying that you want to limit your Type I
error rate to 5 percent. But that 5 percent error rate applies to each and every sta-
tistical test you run. The more analyses you perform on a data set, the more your
overall α level increases. If you perform two tests at α = 0.05, your chance of at
least one of them coming out falsely significant is about 10 percent. If you run 40
tests, the overall α level jumps to 87 percent! This is referred to as the problem of
multiplicity, or as Type I error inflation.
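You can check these numbers yourself; here is a short R sketch (it assumes the tests are independent, which is the usual back-of-the-envelope assumption):

alpha <- 0.05
k     <- c(1, 2, 40)               # number of significance tests performed
fwer  <- 1 - (1 - alpha)^k         # chance of at least one false-positive result
round(fwer, 2)                     # 0.05 0.10 0.87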
In sponsored clinical trials, the sponsor and DSMB will weigh in on how they want
to see Type I error inflation controlled. If you are working on a clinical trial with-
out a sponsor, you should consult with another professional with experience in
developing clinical trial analyses to advise you on how to control your Type I error
inflation given the context of your study.
Each time an interim analysis is conducted, a process called data close-out must
occur. This creates a data snapshot, and the last data snapshot from a data close-
out process produces the final analytic dataset, or dataset to be used in all analyses.
Data close-out refers to the process where current data being collected are copied
into a research environment, and this copy is edited to prepare it for analysis.
These edits could include adding imputations, unblinding, or creating other vari-
ables needed for analysis. The analytic dataset prepared for each interim analysis
and for final analysis should be stored with documentation, as decisions about
stopping or adjusting the trial are made based on the results of interim analyses.
Chapter 6
Taking All Kinds
of Samples
Sampling — or taking a sample — is an important concept in statistics. As
described in Chapter 3, the purpose of taking a sample — a group of individuals
drawn from a population — is so that you do not have to conduct a census and
measure the whole population. Instead, you measure just the sample and use
statistical approaches to make inferences about the whole, which is called
inferential statistics. You can estimate a measure-
ment of the entire population, which is called a parameter, by calculating a statistic
from your sample.
Some samples do a better job than others at representing the population from
which they are drawn. We begin this chapter by digging more deeply into some
important concepts related to sampling. We then describe specific sampling
approaches and discuss their pros and cons.
For example, imagine that you had a list of all the patients of a particular clinic
and their current ages. Suppose that you calculated the average age of the patients
on your list, and your answer was 43.7 years. That would be a population param-
eter. Now, let’s say you took a random sample of 20 patients from that list and
calculated the mean age of the sample, which would be a sample statistic. Do you
think you would get exactly 43.7 years? Although it is certainly possible, in all
likelihood, the mean of your sample — the statistic — would be a different num-
ber than the mean of your population — the parameter. The fact that most of the
time a sample statistic is not equal to the population parameter is called sampling
error. Sampling error is unavoidable, and as statisticians, we are forced to accept it.
Now, to describe the other type of error, let’s add some drama. Suppose that when
you went to take a sample of those 20 patients, you spilled coffee on the list so you
could not read some of the names. The names blotted out by the coffee were there-
fore ineligible to be selected for your sample. This is unfair to the names under the
coffee stain — they have a zero probability of being selected for your sample, even
though they are part of the population from which you are sampling. This is called
undercoverage, and is considered a type of non-sampling error. Non-sampling error
is essentially a mistake — something goes wrong during the sampling process.
Unlike sampling error, which you have to accept, non-sampling errors such as
undercoverage are mistakes you should avoid making if you can (like spilling coffee).
» If you calculated the mean of these 100 values, you would be doing a simula-
tion of the population parameter.
» If you randomly sampled 20 of these values and calculated the mean, you
would be doing a simulation of a sample statistic.
» If you compared your parameter to the statistic to see how close they were to
each other, you would be doing a simulation of sampling error.
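Here is a minimal version of that simulation in R; the 100 ages are generated randomly, purely for illustration:

set.seed(7)                               # arbitrary seed for reproducibility
ages      <- round(runif(100, 1, 90))     # a made-up population of 100 patient ages
parameter <- mean(ages)                   # the population mean (the parameter)
s         <- sample(ages, 20)             # a random sample of 20 of those values
statistic <- mean(s)                      # the sample mean (the statistic)
parameter - statistic                     # the gap between them illustrates sampling error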
So far we’ve reviewed several concepts related to the act of sampling. However, we
haven’t yet examined different sampling strategies. It matters how you go about
taking a sample from a population; some approaches provide a sample that is
more representative of the population than other approaches. In the next section,
we consider and compare several different sampling strategies.
You may be wondering, “What is the best way to draw a sample that is representa-
tive of the background population?” The honest answer is, “It depends on your
resources.” If you are a government agency, you can invest a lot of resources in
conducting representative sampling from a population for your studies. But if you
are a graduate student working on a dissertation, then based on resources availa-
ble, you probably have to settle for a sample that is not as representative of the
population as a government agency could afford. Nevertheless, you can still use
your judgment to make the wisest decisions possible about your sampling approach.
In practice, an SRS is usually taken using a computer so that you can take advan-
tage of a random number generator (RNG) (and do not have to cut up all that paper).
Imagine that the patient list from which you were sampling was not printed on
paper, but was instead stored in a column in a spreadsheet in Microsoft Excel. You
could use the following steps to take an SRS of 20 patients from this list using the
computer:
You could create another column in the spreadsheet called “Random” and
enter the following formula into the top cell in the column: =RAND(). If you drag
that cell down so that the entire column contains this command, you will see
that Excel populates each cell with a random number between 0 and 1. Each
time Excel recalculates the worksheet, these random numbers are regenerated.
Learners sometimes think that as long as they sort a spreadsheet of data by a col-
umn containing any value and then select a sample of rows from the top, they
have automatically obtained an SRS. This is not correct! If you think about it more
carefully, you will realize why. If you sort names alphabetically, you will see pat-
terns in names (such as religious names, or names associated with certain lan-
guages, countries, or ethnicities). If you sort by another identifying column, such
as email address or city of residence, you will again see patterns in the data. If you
attempt to take an SRS from such data, it will be biased, not random, and not
representative. That is why it is important to use a column with an RNG in it for
sorting if you are taking an SRS electronically.
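In statistical software, the random-number-and-sort step collapses into a single function call; here is a sketch in R, with a stand-in patient list invented for illustration:

set.seed(123)                                    # arbitrary seed
patients <- data.frame(id  = 1:500,              # stand-in list of 500 patients
                       age = sample(1:90, 500, replace = TRUE))
srs <- patients[sample(nrow(patients), 20), ]    # randomly select 20 rows -- an SRS
srs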
Taking an SRS intuitively seems like the optimal way to draw a representative
sample. However, there are caveats. In the previous example, you started with a
clinical population in the form of a printed or electronic list of patients from which
you could draw a sample. But what if you want to sample from patients presenting
to the emergency department during a particular period of time in the future?
Such a list does not exist. In a situation like that, you could use systematic sam-
pling, which is explained later in the section “Engaging in systematic
sampling.”
Another caveat of SRS is that it can miss important subgroups. Imagine that in
your list of clinic patients, only 10 percent were pediatric patients (defined as
patients under the age of 18 years). Because 10 percent of 20 is two, you may
expect that a random sample of 20 patients from a population where 10 percent
are pediatric would include two pediatric patients. But in practice, in a situation
like this, it would not be unusual for an SRS of 20 patients to include zero pediatric
patients. If your SRS needs to ensure representation by certain subgroups, then
you should consider using stratified sampling instead.
Drawing a stratified sample requires you to weight your overall estimate, or else
it will be biased. As an example, imagine that 15 percent of pediatric patients had
an oral health condition, and 50 percent of the rest of the patients had an oral
health condition. Suppose you draw a stratified sample of 20 patients — 10 from the
pediatric population and 10 from the rest of the population. The pediatric patients are
oversampled, because they make up only 10 percent of the background population but
50 percent of your sample. If weights are not applied, the estimate of the percentage
of the population with an oral health condition would be pulled toward the pediatric
value and therefore artificially reduced. That is why it is necessary to apply weights
to overall estimates derived from a stratified sample.
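Here is what that weighting looks like in R, using the numbers from this example (15 percent and 50 percent prevalence, with the pediatric stratum making up 10 percent of the population):

prev       <- c(pediatric = 0.15, other = 0.50)   # prevalence observed in each stratum
pop_share  <- c(pediatric = 0.10, other = 0.90)   # each stratum's share of the population
unweighted <- mean(prev)                          # 0.325: what a 10 + 10 sample gives with no weights
weighted   <- sum(prev * pop_share)               # 0.465: each stratum weighted by its population share
c(unweighted = unweighted, weighted = weighted)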
If you are familiar with large epidemiologic surveillance studies such as the
National Health and Nutrition Examination Survey (NHANES) in the United States,
you may be aware that extremely complex stratified sampling is used in the design
and execution of such studies. Stratified sampling in these studies is unlike the
simple example described earlier, where the stratification involves only two age
groups. In surveillance studies like NHANES, there may be stratified sampling
based on many characteristics, including age, gender, and location of residence. If
you need to select factors on which to stratify, try looking at what factors were
used for stratification in historical studies of the same population. The kind of
stratified sampling used in large-scale surveillance studies is reviewed later in
this chapter in the section “Sampling in multiple stages.”
1. Randomly choose a starting number.
This is your starting number. If you select three, this means that — starting at
6 p.m. — the first patient to whom you would offer your survey would be the
third one presenting to the emergency department.
2. Randomly choose a sampling number.
This is your sampling number. If you select five, then after the first patient to
whom you offered the survey, you would ask every fifth patient presenting
to the emergency department to complete your survey.
3. Continue sampling until you have the sample size you need (or the time
window expires).
Chapter 4 describes the software G*Power that can be used for making
sample-size calculations.
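The bookkeeping for this scheme is just an arithmetic sequence; here is a sketch in R using the starting number of three and the sampling number of five from this example (the number of patients presenting during the window is invented):

start    <- 3                            # randomly chosen starting number
interval <- 5                            # sampling number: every 5th patient afterward
n_seen   <- 60                           # suppose 60 patients present during the time window
selected <- seq(from = start, to = n_seen, by = interval)
selected                                 # positions 3, 8, 13, ..., 58 get offered the survey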
Systematic sampling is not representative if there are any time-related cyclic pat-
terns that could confer periodicity onto the underlying data. For example, suppose
that it was known that most pediatric patients present to the emergency depart-
ment between 6 p.m. and 8 p.m. If you chose to collect data during this time win-
dow, even if you used systematic sampling, you would undoubtedly oversample
pediatric patients.
Sampling clusters
Another challenge you may face as a biostatistician when it comes to sampling
from populations occurs when you are studying an environmental exposure. The
term exposure is from epidemiology and refers to a factor hypothesized to have a
causal impact on an outcome (typically a health condition). Examples of environ-
mental exposures that are commonly studied include air pollution emitted from
factories, high levels of contaminants in an urban water system, and environ-
mental pollution and other dangers resulting from a particular event (such as a
natural disaster).
But cluster sampling is not only done geographically. As another example, clus-
ters of schools may be selected based on school district, rather than geography,
and an SRS drawn from each school. The important takeaway from cluster sam-
pling is that it is a sampling strategy optimized for drawing a representative sam-
ple when studying an exposure known to be uneven across the population.
Thinking this way, both systematic sampling and cluster sampling also add com-
plexity to your sampling frame. In systematic sampling, whether you use a static
list or you sample in real time, you need to keep track of the details of your sam-
pling process. In cluster sampling, you may be using a map or system of group-
ings from which to sample, and that also involves a lot of recordkeeping. You may
be asking by now, “Isn’t there an easier way?”
Yes! There is an easier and more convenient way: convenience sampling. Conve-
nience sampling is what you probably think it is — taking a sample from a popu-
lation based on convenience. For example, when statistics professors want to know
how students feel about a proposed university policy, they may simply survey the
students in their own classes.
The problem is that the answer they get may be very biased. Most of the students
in their classes may come from the sciences, and those studying art or literature
may feel very differently about the same policy. Although this convenience
sample would technically be a sample of the background population of students, it
would be such a biased sample that the results would probably be rejected by the rest of
the faculty — especially those from the art and literature departments!
Given that the results from convenience samples are usually biased, you may
think that convenience sampling is not a good strategy. In actuality, convenience
sampling comes in handy if you have a relatively low-stakes research question.
Customer satisfaction surveys are usually done with convenience samples, such as
those placing an order on a restaurant’s app. It is simple to program such a survey
into an app, and if the food quality is terrific and the service terrible, it will be
immediately evident even from a small convenience sample of app users complet-
ing the survey.
FIGURE 6-1: Example of multi-stage sampling from the National Health and Nutrition Examination Survey (NHANES). © John Wiley & Sons, Inc.
As shown in Figure 6-1, in NHANES, there are four stages of sampling. In the first
stage, primary sampling units, or PSUs, are randomly selected. The PSUs are made
up of counties, or small groups of counties together. Next, in the second stage,
segments — which are a block or group of blocks containing a cluster of
households — are randomly selected from the counties sampled in the first stage.
Next, in the third stage, households are randomly selected from segments. Finally,
in stage four, to select each actual community member who will be offered
participation in NHANES, an individual is randomly selected from each
household sampled in the third stage.
Chapter 7
Having Designs on
Study Design
Biostatistics can be seen as the application of a set of tools to answer ques-
tions posed through human research. When studying samples, these tools
are used in conjunction with epidemiologic study designs in such a way as
to facilitate causal inference, or the ability to determine cause and effect. Some
study designs are better than others at facilitating causal inference. Nevertheless,
regardless of the study design selected, an appropriate sampling strategy and sta-
tistical analysis that complements the study design must be used in conjunction
with it.
FIGURE 7-1: Study design hierarchy. © John Wiley & Sons, Inc.
Observational research is where humans are studied in terms of their health and
behavior, but they are not assigned to do any particular behavior as part of the
study — they are just observed. For example, imagine that a sample of women
were contacted by phone and asked about their use of birth control pills. In this
case, researchers are observing their behavior with regard to birth control pills.
Note that there are only two entries under the experimental category in
Figure 7-1 — small-scale experiments in laboratory settings and clinical trial
designs. The example of assigning research participants to take a new birth
control pill and then testing their lipid profiles is an example of a small-scale
experiment in a laboratory setting. We describe clinical trials later in this chapter
in “Advancing to the clinical trial stage.”
Observational studies, on the other hand, can be further subdivided into two
types: descriptive and analytic. Descriptive study designs include expert opinion,
case studies, case series, ecologic (correlational) studies, and cross-sectional
studies. These designs are called descriptive study designs because they focus on
describing health in populations. (We explain what this means in “Describing
What We See.”) In contrast to descriptive study designs, there are only two types
of analytic study designs: longitudinal cohort studies and case-control studies.
Unlike descriptive studies, analytic studies are designed specifically for causal
inference. These are described in more detail in the section, “Getting
Analytical.”
Getting analytical
Analytic designs include longitudinal cohort studies and case-control studies.
These are the strongest observational study designs for causal inference. Longitu-
dinal cohort studies are used to study causes of more common conditions, like
hypertension (HTN). It is called longitudinal because follow-up data are collected
over years to see which members of the sample, or cohort, eventually get the out-
come, and which members do not. (In a cohort study, none of the participants has
the condition, or outcome, when they enter the study.) The cohort study design is
described in more detail under the section, “Following a cohort over time.”
Case-control studies are used when the outcome is not that common, such as liver
cancer. In the case of rare conditions, first a group of individuals known to have
the rare condition (cases) is identified and enrolled in the study. Then, a compa-
rable group of individuals known to not have the rare condition is enrolled in the
study as controls. The case-control study design is described in greater detail
under the section “Going from case series to case-control.”
In ecologic studies, the experimental units are often entire populations (such as a
region or country). For example, in a study presented in Chapter 16 of Epidemiology
For Dummies by Amal K. Mitra (Wiley), the experimental unit is a country, and
15 countries were included in the analysis. The exposure being investigated is fat
intake from diet (which was operationalized as average saturated fat intake as a
percentage of energy in the diet). The outcome was deaths from coronary heart
disease (CHD), operationalized as 50-year CHD deaths per 1,000 person-years
(see Chapter 15 for more about rates in person-years). Figure 7-3 presents the
results in the form of a scatter plot.
FIGURE 7-3: Ecologic study results. © John Wiley & Sons, Inc.
As shown in Figure 7-3, the country’s average value of the outcome (rate of CHD
deaths) is plotted on the y-axis because it’s the outcome. The exposure, average
dietary fat intake for the country, is plotted on the x-axis. The 15 countries in the
study are plotted according to their x-y coordinates. Notice that the United States
is in the upper-right quadrant of the scatter plot because it has high rates of both
the exposure and outcome. The strong, positive value of correlation coefficient r
(which is 0.92) indicates that there is a strong positive bivariate association
between the exposure and outcome, which is weak evidence for causality (flip to
Chapter 15 for more on correlation).
That is why we also have cross-sectional studies, where the experimental unit is
an individual, not a population. A cross-sectional study takes measurements of
individuals at one point in time — either through an in-person hands-on exami-
nation, or by survey (over the phone, Internet, or in person). The National Health
and Nutrition Examination Survey (NHANES) is a cross-sectional surveillance
effort done by the U.S. government on a sample of residents every year. NHANES
makes many measurements relevant to human health in the United States, includ-
ing dietary fat intake as well as status of many chronic diseases including CHD. If
an analysis of cross-sectional data like NHANES found that there was a strong
positive association between high dietary fat intake and a CHD diagnosis in the
individuals participating, it would still be weak evidence for causation, but would
be stronger than what was found in the ecologic study presented in Figure 7-3.
You can use a fourfold, or 2x2, table to better understand how case-control studies
are different from cohort studies. (Refer to Chapters 13 and 14 for more about 2x2
tables.) As shown in Figure 7-4, the 2x2 table cells are labeled relative to exposure
status (the rows) and outcome or disease status (the columns). For the columns,
D+ stands for having the disease (or outcome), and D– means not having the dis-
ease or outcome. Also, for the rows, E+ means having the exposure, and E– means
not having the exposure. Cell a includes the counts of individuals in the study who
were positive for both the exposure and outcome, and cell d includes the counts of
individuals who were negative for both the exposure and outcome (a and d are
concordant cells because the exposure and outcome statuses agree). In the
discordant cells, b and c, the statuses disagree: cell b counts individuals who were
exposed but did not have the outcome, and cell c counts those who were not
exposed but did have the outcome.
FIGURE 7-4: 2x2 table cells. © John Wiley & Sons, Inc.
The 2x2 table shown in Figure 7-4 is generic — meaning it can be filled in with
data from a cross-sectional study, a case-control study, a cohort study, or even a
clinical trial (if you replace the E+ and E– entries with intervention group assign-
ment). How the results are interpreted from the 2x2 table depends upon the under-
lying study design. In the case of a cross-sectional study, an odds ratio (OR) could
be calculated to quantify the strength of association between the exposure and
outcome (see Chapter 14). However, any results coming from a 2x2 table do not
control for confounding, which is a bias introduced by a nuisance variable associ-
ated with the exposure and the outcome, but not on the causal pathway between
the exposure and outcome (more on confounding in Chapter 20).
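Once the four cells are filled in, the OR itself is one line of arithmetic; here is a sketch in R with invented cell counts, using the a, b, c, d layout from Figure 7-4:

# Rows are E+ then E-, columns are D+ then D- (counts are hypothetical)
tab <- matrix(c(40, 60,     # a = E+ D+,  b = E+ D-
                20, 80),    # c = E- D+,  d = E- D-
              nrow = 2, byrow = TRUE,
              dimnames = list(c("E+", "E-"), c("D+", "D-")))
odds_ratio <- (tab["E+", "D+"] * tab["E-", "D-"]) /
              (tab["E+", "D-"] * tab["E-", "D+"])   # OR = (a*d)/(b*c)
odds_ratio                                          # (40*80)/(60*20) = 2.67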
Imagine that you were examining the cross-sectional association between having
the exposure of obesity (yes/no), and having the outcome of HTN (yes/no). House-
hold income may be a confounding variable, because lower income levels are
associated with barriers to access to high-quality nutrition that could prevent
both obesity and HTN. However, in a bivariate analysis like the one done in a 2x2 table,
there is no ability to control for confounding. To do that, you need to use a regres-
sion model like the ones described in Chapters 15 through 23.
So how would you use a 2x2 table for a case-control study on a statistically rare
condition like liver cancer? Suppose that patients thought to have liver cancer are
referred to a cancer center to undergo biopsies. Those with biopsies that are posi-
tive for liver cancer are placed in a registry. Suppose that in 2023 there were
30 cases of liver cancer found at this center that were placed in the registry. This
would be a case series. Imagine that you had a hypothesis — that high levels of
alcohol intake may have caused the liver cancer. You could interview the cases to
determine their exposure status, or level of alcohol intake before they were diag-
nosed with liver cancer. Imagine that 10 of the 30 reported high alcohol intake.
You might view that as some evidence for your hypothesis — but to interpret it, you
need to know how common high alcohol intake is among comparable people without
liver cancer, which is where controls come in.
FIGURE 7-5: Example of a typical case-control study 2x2 table. © John Wiley & Sons, Inc.
As shown in Figure 7-5, what is important is not the 2x2 table itself, but the order
in which the counts are filled in. Notice that at the beginning of the study, you
already knew the case total was 30, and you had determined that your control total
would be 30 (although you are allowed to sample more controls if you want in a
case-control study).
The correct measure of relative risk to present for a case-control study is the OR (as
described in Chapter 14). It is important to acknowledge that when you present an
OR from a case-control study, you interpret it as an exposure OR, not an outcome
or disease OR. (It is also acceptable to present an OR in a cross-sectional study, but
in that case, you are presenting an outcome or disease OR.)
In a case-control study, because the condition is rare, you are sampling on the
outcome and calculating the likelihood that the cases compared to controls were
exposed. This study design is prone to bias (such as recall bias), which is why cohort
studies are preferred and sit at a higher level of evidence. However, case-control
study designs are necessary for rare diseases.
FIGURE 7-6: Example of a typical cohort study 2x2 table. © John Wiley & Sons, Inc.
As shown in Figure 7-6, the total number of participants is large. This cohort of
600 participants could have been naturally sampled from the population, or they
could be stratified by exposure, meaning that the study design could require a cer-
tain number of participants to be exposed and to be unexposed. Imagine that you
insisted that 300 of your participants have high alcohol intake, and 300 have low
alcohol intake. It may be harder to recruit for the study, but you would be sure to
have enough exposed participants for your statistics to work out. In the case of
Figure 7-6, 210 exposed and 390 unexposed participants were enrolled.
In cohort studies, all the participants are examined upon entering the study, and
those with the outcome are not allowed to participate. Therefore, at the beginning
of the study, none of the 600 participants had the outcome, which is HTN.
A cohort study is essentially a series of cross-sectional studies on the same cohort
called waves. The first wave is baseline, when the participants enter the study (all
of whom do not have the outcome). Baseline values of important variables are
measured (and criteria about baseline values may be used to set inclusion criteria,
such as minimum age for the study). Subsequent waves of cross-sectional data
collection take place at regular time intervals (such as every year or every two
years). Changes in measured baseline values are tracked over time, and subgroups
of the cohort are compared in terms of outcome status. Figure 7-6 shows the
exposure status from baseline, and the outcome status from the first wave.
Because the exposure is measured in a cohort study before any participants get the
outcome, it is considered the highest level of evidence among the observational
study designs. It is far less biased than the case-control study design. Several
measures of relative risk can be used to interpret a cohort study, including the OR,
risk ratio, and incidence rate (see Chapter 14).
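For instance, the risk ratio compares the proportion of exposed and unexposed participants who develop the outcome. Figure 7-6 supplies the group totals of 210 and 390; the outcome counts below are invented purely to show the arithmetic:

exposed_total   <- 210
unexposed_total <- 390
exposed_cases   <- 42      # hypothetical count of exposed participants who developed HTN
unexposed_cases <- 39      # hypothetical count of unexposed participants who developed HTN
risk_exposed   <- exposed_cases / exposed_total       # 0.20
risk_unexposed <- unexposed_cases / unexposed_total   # 0.10
risk_exposed / risk_unexposed                         # risk ratio = 2.0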
It is possible to use a 2x2 table to analyze the results of a high-quality clinical trial
as long as the rows are replaced with the intervention groups. You can report the
same measure of relative risk as for a cohort study; however, the difference is that
the high-quality clinical trial would be seen as having much less bias than the
cohort study — and stronger causal evidence.
We could ask a similar question about observational studies as well. Imagine that
multiple case-control studies were conducted to determine whether having liver
cancer was associated with the exposure of having high prediagnosis alcohol
intake. What is the overall answer? Does high alcohol intake cause liver cancer or
not? You could also imagine that multiple cohort studies could be conducted
examining association between the exposure of high alcohol intake and develop-
ing the outcome of HTN. How would the results of these cohort studies be taken
together to answer the question of whether high alcohol intake actually
causes HTN?
The answer to these questions comes from systematic reviews and meta-analyses. In a sys-
tematic review, researchers set up inclusion and exclusion criteria for reports of
studies. Included in those criteria are requirements for a certain study design — for
example, a review may include only randomized controlled trials.
If you are looking for the highest quality of evidence right now about a current
treatment or exposure and outcome, read the most recent systematic reviews
and meta-analyses on the topic. If there aren’t any, it may mean that the treat-
ment, exposure, or outcome is new, and that there are not a lot of high quality
observational or experimental studies published on the topic yet.
Chapter 8
Getting Your Data into
the Computer
Before you can analyze data, you have to collect it and get it into the
computer in a form that’s suitable for analysis. Chapter 5 describes this
process as a series of steps — figuring out what data you need and how they
are structured, creating data entry forms and computer files to hold your data, and
entering and validating your data.
So why are we devoting a whole chapter to describing, entering, and checking dif-
ferent types of data? It turns out that the topic of data storage is not quite as trivial
as it may seem at first. You need to be aware of some important details or you may
wind up collecting your data the wrong way and finding out too late that you can’t
run the appropriate analysis. This chapter starts by explaining the different levels
of measurement, and shows you how to define and store different types of data. It
also suggests ways to check your data for errors, and explains how to formally
describe your database so that others are able to work with it if you’re not available.
» Ordinal data have categorical values (or levels) that fall naturally into a logical
sequence, like the severity of cancer (Stages I, II, III, and IV), or an agreement
scale (often called a Likert scale) with levels of strongly disagree, somewhat
disagree, neither agree nor disagree, somewhat agree, or strongly agree. Note
that the levels are not necessarily equally spaced with respect to the concep-
tual difference between levels.
» Ratio data, unlike interval data, does have a true zero point. The numerical
value of a ratio variable is directly proportional to how much there is of what
you’re measuring, and a value of zero means there’s nothing at all. Income
and systolic blood pressure are good examples of ratio data; an individual
without a job may have zero income, which is not as bad as having a systolic
blood pressure of 0 mmHg, because then that individual would no longer be
alive!
Making bad decisions (or avoiding making decisions) about exactly how to repre-
sent the data values in your research database can mess it up, and quite possibly
doom the entire study to eventual failure. If you record the values of your variables
the wrong way in your data, it may take an enormous amount of additional effort
to go back and fix them, and depending upon the error, a fix may not even be
possible!
You should also be aware that most software has field-length limitations for text
fields. Although commonly used statistical programs like Microsoft Excel, SPSS,
SAS, R, and Python may allow for long data fields, this does not excuse you from
designing your study so as to limit collection of free-text variables. Flip to
Chapter 4 for an introduction to statistical software.
» Two columns: One for Last, another for First and Middle
You may also want to include separate fields to hold prefixes (Mr., Mrs., Dr., and
so on) and suffixes (Jr., III, PhD, and so forth).
Addresses should be stored in separate fields for street, city, state (or province),
and ZIP code (or comparable postal code).
Nothing is worse than having to deal with a data set in which a categorical variable
has been stored with numerical codes, but there is no key to the codes and the
person who created the data set is no longer available. This is why maintaining a
data dictionary — described later in this chapter in “Creating a File that Describes
Your Data File” — is a critical step for ensuring you analyze your research data
properly.
Microsoft Excel doesn’t care whether you type a word or a number in a cell, which
can create problems when storing data. You can enter Type of Caregiver as N for the
first subject, nurse for the second, NURSE for the third, 1 for the fourth, and Nurse
for the fifth, and Excel won’t stop you or throw up an error. Statistical programs
like R would consider each of these entries as a separate, unique category. Even
worse, you may inadvertently add a blank space in the cell before or after the text,
which will be considered yet another category. Details such as case-sensitivity of
character values (meaning patterns of being upper or lowercase) can impact que-
ries. In Excel, avoid using autocomplete, and enter all levels of categorical vari-
ables as numerical codes (which can be decoded using your data dictionary).
You handle the Choose only one situation just as we describe for Type of Caregiver in
the preceding section — you establish a numeric code for each alternative. For the
Likert scale example, if the item asked about patient satisfaction, you could have a
categorical variable called PatSat, with five possible values: 1 for strongly disagree,
2 for somewhat disagree, 3 for neither agree nor disagree, 4 for somewhat agree,
and 5 for strongly agree. And for the Type of Caregiver example, if only one kind of
caregiver is allowed to be chosen from the three choices of nurse, physician, or
social worker, you can have a categorical variable called CaregiverType with three
possible values: 1 for nurse, 2 for physician, and 3 for social worker. Depending
upon the study, you may also choose to add a 4 for other, and a 9 for unknown
(9, 99, and 999 are codes conventionally reserved for unknown). If you find
unexpected values, it is important to research and document what these mean to
help future analysts encountering the same data.
But the situation is quite different if the variable is Choose all that apply. For the
Type of Caregiver example, if the patient is being served by a team of caregivers,
you have to set up your database differently. Define separate variables in the data-
base (separate columns in Excel) — one for each possible category value. Imagine
that you have three variables called Nurse, Physician, and SW (the SW stands for
social worker). Each variable is a two-value category, also known as a two-state
flag, and is populated as 1 for having the attribute and 0 for not having the attrib-
ute. So, if participant 101’s care team includes only a physician, participant 102’s
care team includes a nurse and a physician, and participant 103’s care team
includes a social worker and a physician, the information can be coded as shown
in the following table.
Participant   Nurse   Physician   SW
101           0       1           0
102           1       1           0
103           0       1           1
If you have variables with more than two categories, missing values theoretically
can be indicated by leaving the cell blank, but blanks are difficult to analyze in
statistical software. Instead, categories should be set up for missing values so they
can be part of the coding system (such as using a numerical code to indicate that
the value is missing or unknown).
Never try to cram multiple choices into one column! For example, don’t enter 1, 2
into a cell in the CaregiverType column to indicate the patient has a nurse and phy-
sician. If you do, you have to painstakingly split your single multi-valued column
into separate two-state flag columns (described earlier) before you analyze the
data. Why not do it right the first time?
Along the same lines, don’t group numerical data into intervals when recording it.
If you know the age to the nearest year, don’t record Age in 10-year intervals (such
as 20 to 29, 30 to 39, 40 to 49, and so on). You can always have the computer do
that kind of grouping later, but you can never recover the age in years if all you
record is the decade.
Some statistical programs let you store numbers in different formats. The pro-
gram may refer to these different storage modes using arcane terms for short, long,
or very long integers (whole numbers) or single-precision (short) or double-precision
(long) floating point (fractional) numbers. Each type has its own limits, which may
vary from one program to another or from one kind of computer to another. For
example, a short integer may be able to represent only whole numbers within the
range from -32,768 to 32,767, whereas a double-precision floating-point number
could easily handle a number like 1.23456789012345 × 10^250. Excel has no trouble
storing numerical data in any of these formats, so to make these choices, it is best
to study the statistical program you will use to analyze the data. That way, you can
make rules for storing the data in Excel that make it easy for you to analyze the
data once it is imported into the statistical program.
» Don’t put two numbers (such as a blood pressure reading of 135 / 85 mmHg)
into one column of data. Excel won't complain about it, but it will treat it as text
rather than as numbers you can analyze.
Missing numerical data requires a little more thought than missing categorical
data. Some researchers use 99 (or 999, or 9999) to indicate a missing value in
categorical data, but this approach should not be used for numeric data (because
the statistical program will see these values as actual measured values, and not
codes for missing data). The simplest technique for indicating missing numerical
data is to leave it blank. Most software treats blank cells as missing data in a cal-
culation, but this changes depending on the software, so it’s important to confirm
how missing values are handled in your analysis.
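For example, R marks missing values as NA, and many functions must be told explicitly how to treat them — a quick sketch:

sbp <- c(118, 135, NA, 142, 127)    # one missing systolic blood pressure value
mean(sbp)                           # returns NA because the missing value propagates
mean(sbp, na.rm = TRUE)             # 130.5, the mean of the four observed values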
Some programs may store a date and time as a Julian Date, whose zero occurred at
noon, Greenwich Mean Time, on Jan. 1, 4713 BC. (Nothing happened on that date;
it’s purely a numerical convenience.)
What if you don’t know the day of the month? This happens a lot with medical
history items; a participant may say, “I got the flu in September 2021.” Most
software (including Excel) insists that a date variable be a complete date, and
won’t accept just a month and a year. In this case, a business rule is created to set
the day (as either the 1st, 15th, or last day of the month). Similarly, if both the
month and day are missing, you can set up a business rule to estimate both.
Because of the way most statistics programs store dates and times, they can easily
calculate intervals between any two points in time by simple subtraction. It is best
practice to store raw dates and times and let the computer calculate the intervals
later (rather than calculating them yourself). For example, if you create variables for
date of birth (DOB) and a visit date (VisDt) in Excel, you can calculate an accurate
age at the time of the visit by subtracting DOB from VisDt (which gives the age in
days) and dividing the result by 365.25 to convert it to years.
» Examine the smallest and largest values in numerical data: Have the
software show you the smallest and largest values for each numerical variable.
This check can often catch decimal-point errors (such as a hemoglobin value of
125 g/dL instead of 12.5 g/dL) or transposition errors (for example, a weight of
517 pounds instead of 157 pounds).
» Sort the values of variables: If your program can show you a sorted list of all
the values for a variable, that’s even better — it often shows misclassified
categories as well as numerical outliers.
» Search for blanks and commas: You can have Excel search for blanks
in category values that shouldn’t have blanks, or for commas in numeric
variables. Make sure the “Match entire cell contents” option is deselected in
the Find and Replace dialog box (you may have to click the Options button to
see the check box). This operation can also be done using statistical software.
Be wary if there are a large number of missing values, because this could indicate
a data collection problem.
» Tabulate categorical variables: You can have your statistics program tabulate
each categorical variable (showing you the frequency each different category
occurred in your data). This check usually finds misclassified categories. Note
that blanks and special characters in character variables may cause incorrect
results when querying, which is why it is important to do this check.
» A variable name (usually no more than ten characters) that’s used when
telling the software what variables you want it to use in an analysis
• If categorical: What codes and descriptors exist for each level of the
category (these are often called picklists, and can be documented on a
separate tab in an Excel data dictionary)
» How missing values are represented in the database (99, 999, “NA,”
and so on)
Database systems queried with SQL and statistical programs like SAS often have a func-
tion that can output information like this about a data set, but it still needs to be
curated by a human. It may be helpful to start your data dictionary with such out-
put, but it is best to complete it in Excel. That way, you can add the human cura-
tion yourself to the Excel data dictionary, and other research team members can
easily access the data dictionary to better understand the variables in the database.
Chapter 9
Summarizing and
Graphing Your Data
A large study can involve thousands of participants, hundreds of variables,
and millions of individual data points. You need to summarize this ocean
of individual values for each variable down to a few numbers, called
summary statistics, that give readers an idea of what the whole collection of
numbers looks like — that is, how they’re distributed.
When presenting your results, you usually want to arrange these summary
statistics into tables that describe how the variables change over time or differ
between categories, or how two or more variables are related to each other. And,
because a picture really is worth a thousand words, you will want to display these
distributions, changes, differences, and relationships graphically. In this chapter,
we show you how to summarize and graph both categorical and numerical data.
Note: This chapter doesn’t cover time-to-event (survival) data, which is the topic
of Chapter 22.
Groups are often compared across columns, and if that is the intention, column
percentages should be displayed. But if you divide these same 60 rural residents
with commercial insurance by their row total of 169 rural residents, you find they
make up 30.6 percent of all rural residents, which is a row percentage. And if you
go on to divide these 60 participants by the total sample size of the study, which
is 422, you find that they make up 14.2 percent of all participants in the study.
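If your counts are in R, the prop.table function produces all three kinds of percentages from a table of counts. Here is a minimal sketch using made-up counts (hypothetical numbers, not the study data):

# Hypothetical counts: residence (rows) by insurance type (columns)
counts <- matrix(c(60, 40,
                   80, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Residence = c("Rural", "Urban"),
                                 Insurance = c("Commercial", "Other")))
round(100 * prop.table(counts, margin = 2), 1)  # column percentages (within each insurance type)
round(100 * prop.table(counts, margin = 1), 1)  # row percentages (within each residence group)
round(100 * prop.table(counts), 1)              # overall percentages (each cell / grand total)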
Categorical data are typically displayed graphically as frequency bar charts and as
pie charts:
» Pie charts: Pie charts indicate the relative number of participants in each
category by the angle of a circular wedge, which can also be considered more
deliciously as a piece of the pie. To create a pie chart manually, you multiply
the percentage of participants in each category by 360, which is the number
of degrees of arc in a full circle, and then divide by 100. By doing that, you are
essentially figuring out what proportion of the circle to devote to that pie
piece. Next, you draw a circle with a compass, and then split it up into wedges
using a protractor — remember from high school math? Trust us, it’s easier to
use statistical software.
Most scientific writers recommend using bar charts rather than pie charts. Bar charts
express more information in a smaller space, and allow for more accurate visual
comparisons.
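If you want to try both in R, each chart is a single function call; this sketch uses made-up category counts rather than the study data:

# Hypothetical frequency counts for four insurance categories
counts <- c(Commercial = 150, Medicare = 120, Military = 70, Other = 82)
barplot(counts, ylab = "Number of participants")  # usually the better choice
pie(counts)                                       # the same data as a pie chart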
FIGURE 9-2: Four different shapes of distributions: normal (a), skewed (b), pointy-topped (c), and bimodal (two-peaked) (d). © John Wiley & Sons, Inc.
» Center: Where along the distribution of the values do the numbers tend
to center?
» Symmetry: If you were to draw a vertical line down the middle of the
distribution, does the distribution shape appear as if the vertical line is a
mirror, reflecting an identical shape on both sides? Or do the sides look
noticeably different — and if so, how?
Like using average skating scores to describe the visual appeal of an Olympic skate
routine, to describe a distribution you need to calculate and report numbers that
measure each of these four characteristics. These characteristics are what we
mean by summary statistics for numerical variables.
Arithmetic mean
The arithmetic mean, also commonly called the mean (or the average), is the most
familiar and most often quoted measure of central tendency. Throughout this
book, whenever we use the two-word term the mean, we’re referring to the arithmetic mean.
You can write the general formula for the arithmetic mean of N values
contained in the variable X in several ways:
$$\text{Arithmetic Mean} = m = \bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{\sum_i X_i}{N} = \frac{\sum X}{N}$$
Some statistical books use notation in which capital X and capital N refer to
census parameters, and lowercase versions of those to refer to sample statistics.
In this book, we make it clear each time we present this notation whether we are
talking about a census or a sample.
Median
Like the mean, the median is a common measure of central tendency. In fact, it
could be argued that the median is the only one of the three that really takes the
word central seriously.
The median of a sample is the middle value in the sorted (ordered) set of numbers.
By definition, half of the numbers are smaller than the median, and half are larger.
The median of a population frequency distribution function (like the curves shown
in Figure 9-2) divides the total area under the curve into two equal parts: Half of
the area under the curve (AUC) lies to the left of the median, and half lies to
the right.
Consider the sample of diastolic blood pressure (DBP) measurements from seven
study participants from the preceding section. If you arrange the values in order
(84, 84, 89, 91, 110, 114, and 116 mmHg), the middle (fourth) value is 91, so the
median DBP for this sample is 91 mmHg.
Statisticians often say that they prefer the median to the mean because the median
is much less strongly influenced by extreme outliers than the mean. For example,
if the largest value for DBP had been very high — such as 150 mmHg instead of
116 mmHg — the mean would have jumped from 98.3 mmHg up to 103.1 mmHg.
But in the same case, the median would have remained unchanged at 91. Here’s an
even more extreme example: If a multibillionaire were to move into a certain
state, the mean family net worth in that state might rise by hundreds of dollars,
but the median family net worth would probably rise by only a few cents (if it were
to rise at all). This is why reports comparing income across regions usually quote
the median rather than the mean.
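You can see this outlier resistance for yourself in R using the seven DBP values from this chapter:

dbp <- c(84, 84, 89, 91, 110, 114, 116)
mean(dbp)      # about 98.3
median(dbp)    # 91
dbp_extreme <- c(84, 84, 89, 91, 110, 114, 150)  # largest value replaced by an extreme outlier
mean(dbp_extreme)    # jumps to about 103.1
median(dbp_extreme)  # still 91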
Mode
The mode of a sample of numbers is the most frequently occurring value in the
sample. One way to remember this is to consider that mode means fashion in
French, so the mode is the most popular value in the data set. But the mode has
several issues when it comes to summarizing the centrality of observed values for
continuous numerical variables. Often there are no exact duplicates, so there is no
mode. If there are any exact duplicates, they usually are not in the center of the
data. And if there is more than one value that is duplicated the same number of
times, you will have more than one mode.
So the mode is not a good summary statistic for sampled data. But it’s useful for
characterizing a population distribution, because it’s the value where the peak of
the distribution function occurs. Some distribution functions can have two peaks
(a bimodal distribution), as shown earlier in Figure 9-2d, indicating two distinct
subpopulations, such as the distribution of age of death from influenza in many
populations, where we see a mode in young children, and another mode in older
adults.
INNER MEAN
The inner mean (also called the trimmed mean) of N numbers is calculated by
removing the lowest value (the minimum) and the highest value (the maximum),
and calculating the arithmetic mean of the remaining N – 2 inner values. For the
sample of seven values of DBP from study participants from the example used
earlier in this chapter (which were 84, 84, 89, 91, 110, 114, and 116 mmHg), you
would drop the minimum and the maximum to compute the inner
mean: (84 + 89 + 91 + 110 + 114)/5 = 488/5 = 97.6 mmHg.
An inner mean that is even more inner can be calculated by making an even
stricter rule. The rule could be to drop the two (or more) of the highest and two (or
more) of the lowest values from the data, and then calculate the arithmetic mean
of the remaining values. In the interest of fairness, you should always chop the
same number of values from the low end as from the high end. Like the median
(discussed earlier in this chapter), the inner mean is more resistant to extreme
values called outliers than the arithmetic mean.
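In R, the mean function’s trim argument does this for you. With the seven DBP values, trim = 0.15 drops one value from each end (trim is the fraction of observations trimmed from each end before averaging):

dbp <- c(84, 84, 89, 91, 110, 114, 116)
mean(dbp, trim = 0.15)   # inner (trimmed) mean: 97.6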
GEOMETRIC MEAN
The geometric mean (often abbreviated GM) can be defined by two different-
looking formulas that produce exactly the same value. The basic definition has
this formula:
$$\text{Geometric Mean} = GM = \sqrt[N]{\prod_{i=1}^{N} X_i}$$
We describe the product symbol Π (the Greek capital pi) in Chapter 2. This formula
is telling you to multiply the values of the N observations together, and then take
the Nth root of the product. Using the numbers from the earlier example (where
you had DBP data on seven participants, with the values 84, 84, 89, 91, 110, 114,
and 116 mmHg), the equation looks like this:
$$GM = \sqrt[7]{84 \times 84 \times 89 \times 91 \times 110 \times 114 \times 116} = \sqrt[7]{83{,}127{,}648{,}764{,}160} \approx 97.4$$
$$\log(GM) = \frac{\sum \log(X)}{N}, \qquad \text{or} \qquad GM = \operatorname{antilog}\!\left(\frac{\sum \log(X)}{N}\right)$$
This formula may look complicated, but it really just says, “The geometric mean
is the antilog of the mean of the logs of the values in the sample.”
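That second form is also the easiest way to get the geometric mean in R — take the antilog (exp) of the mean of the natural logs:

dbp <- c(84, 84, 89, 91, 110, 114, 116)
exp(mean(log(dbp)))   # geometric mean, about 97.4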
$$SD = sd = s = \sqrt{\frac{\sum_i d_i^{\,2}}{N-1}}, \qquad \text{where } d_i = X_i - \bar{X}$$
This formula is saying that you calculate the SD of a set of N numbers by first
subtracting the mean from each value (X_i) to get the deviation (d_i) of each value
from the mean. Then, you square each of these deviations and add up the d_i²
terms. After that, you divide that sum by N – 1, and finally, you take
the square root of that number to get your answer, which is the SD.
For the sample of diastolic blood pressure (DBP) measurements for seven study
participants in the example used earlier in this chapter, where the values are 84,
84, 89, 91, 110, 114, and 116 mmHg and the mean is 98.3 mmHg, you calculate the
SD as follows:
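Carrying out those steps on the seven DBP values (using 98.3 mmHg as the mean and rounding at the end) gives:

$$SD = \sqrt{\frac{(84-98.3)^2 + (84-98.3)^2 + (89-98.3)^2 + (91-98.3)^2 + (110-98.3)^2 + (114-98.3)^2 + (116-98.3)^2}{7-1}} = \sqrt{\frac{1245.4}{6}} \approx 14.4 \text{ mmHg}$$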
» Variance: The variance is just the square of the SD. For the DBP example, the
variance = 14.4² = 207.36.
Range
The range of a set of values is the minimum value subtracted from the maximum
value: Range = maximum – minimum.
Consider the example from the preceding section, where you had DBP measure-
ments from seven study participants (which were 84, 84, 89, 91, 110, 114, and
116 mmHg). The minimum value is 84, the maximum value is 116, and the range
is 32 (equal to 116 – 84).
Centiles
The basic idea of the median is that ½ (half) of your numbers are less than the
median, and the other ½ are greater than the median. This concept can be
extended to other fractions besides ½.
The inter-quartile range (IQR) is the difference between the 25th and 75th centiles
(the first and third quartiles).
FIGURE 9-3: Distributions can be left-skewed (a), symmetric (b), or right-skewed (c). © John Wiley & Sons, Inc.
Figure 9-3b shows a symmetrical distribution. If you look back to Figures 9-2a
and 9-2c, which are also symmetrical, they look like the vertical line in the
center is a mirror reflecting perfect symmetry, so these have no skewness. But
Figure 9-2b has a long tail on the right, so it is considered right skewed (and if you
flipped the shape horizontally, it would have a long tail on the left, and be consid-
ered left-skewed, as in Figure 9-3a).
How do you express skewness in a summary statistic? The most common skew-
ness coefficient, often represented by the Greek letter γ (lowercase gamma), is
calculated by averaging the cubes (third powers) of the deviations of each point
from the mean and scaling by the SD. Its value can be positive, negative, or zero.
» A zero γ indicates unskewed data (Figures 9-2a and 9-2c, and Figure 9-3b).
Notice that in Figure 9-3a, which is left-skewed, the γ = –0.7, and for Figure 9-3c,
which is right-skewed, the γ = 0.7. And for Figure 9-3b — the symmetrical
distribution — the γ = 0, but this almost never happens in real life. So how large
does γ have to be before you suspect real skewness in your data? A rule of thumb
for large samples is that if the absolute value of γ is greater than 4/√N, your data are probably skewed.
Kurtosis
Kurtosis is a less-used summary statistic of numerical data, but you still need to
understand it. Take a look at the three distributions shown in Figure 9-4.
FIGURE 9-4: Three distributions: leptokurtic (a), normal (b), and platykurtic (c). © John Wiley & Sons, Inc.
A good way to compare the kurtosis of the distributions in Figure 9-4 is through
the Pearson kurtosis index. The Pearson kurtosis index is often represented by the
Greek letter k (lowercase kappa), and is calculated by averaging the fourth powers
of the deviations of each point from the mean and scaling by the SD. Its value can
range from 1 to infinity and is equal to 3.0 for a normal distribution. The excess
kurtosis is the amount by which k exceeds (or falls short of) 3.
One way to think of kurtosis is to see the distribution as a body silhouette. If you
think of a typical distribution function curve as having a head (which is near the
center), shoulders on either side of the head, and tails out at the ends, the term
kurtosis refers to whether the distribution curve tends to have
» A pointy head, fat tails, and no shoulders, which is called leptokurtic, and is
shown in Figure 9-4a (where k > 3).
» Broad shoulders, small tails, and not much of a head, which is called
platykurtic. This is shown in Figure 9-4c (where k < 3).
A very rough rule of thumb for large samples is that if k differs from 3 by more than
8/√N, your data have abnormal kurtosis.
mean ± SD (N)
median (minimum – maximum)
Consider the example used earlier in this chapter of seven measures of diastolic
blood pressure (DBP) from a sample of study participants (with the values of 84,
84, 89, 91, 110, 114, and 116 mmHg), where you calculated all these summary sta-
tistics. Remember not to display decimals beyond what were collected in the orig-
inal data. Using this arrangement, the numbers would be reported this way:
98.3 ± 14.4 (7)
91 (84 – 116)
The real utility of this kind of compact summary is that you can place it in each
cell of a table to show changes over time and between groups. For example, a
sample of systolic blood pressure (SBP) measurements taken from study partici-
pants before and after treatment with two different hypertension drugs (Drug A
and Drug B) can be summarized concisely, as shown in Table 9-3.
TABLE 9-3 (each cell shows mean ± SD (N) and median (minimum – maximum)):
Drug A — Before: 138.7 ± 10.3 (40), 139.5 (117 – 161); After: 121.1 ± 13.9 (40), 121.5 (85 – 154); Change: –17.6 ± 8.0 (40), –17.5 (–34 – 4)
Drug B — Before: 141.0 ± 10.8 (40), 143.5 (111 – 160); After: 141.0 ± 15.4 (40), 142.5 (100 – 166); Change: –0.1 ± 9.9 (40), 1.5 (–25 – 18)
FIGURE 9-5: Population distribution of systolic blood pressure (SBP) measurements in mmHg (a) and distribution of a sample from that population (b). © John Wiley & Sons, Inc.
The histogram in Figure 9-5b indicates how the SBP measurements of 60 study
participants randomly sampled from the population might be distributed. Each
bar represents an interval or class of SBP values with a width of ten mmHg. The
height of each bar is proportional to the number of participants in the sample
whose SBP fell within that class.
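In R, a histogram like the one in Figure 9-5b takes a single hist call; this sketch uses simulated SBP values as a stand-in for real data:

set.seed(1)
sbp <- rnorm(60, mean = 120, sd = 15)   # simulated sample of 60 SBP values (mmHg)
hist(sbp, xlab = "SBP (mmHg)", main = "Sample distribution of SBP")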
Log-normal distributions
Because a sample is only an imperfect representation of the population, determining
the precise shape of a distribution can be difficult unless your sample size is very
large. Nevertheless, a histogram usually helps you spot skewed data, as shown in
Figure 9-6a. This kind of shape is typical of a log-normal distribution
(Chapter 25), which is a distribution you often see when analyzing biological
measurements, such as lab values. It’s called log-normal because if you take a
logarithm (of any type) of each data value, the resulting logs will have a normal
distribution, as shown in Figure 9-6b.
FIGURE 9-6: Log-normal data are skewed (a), but the logarithms are normally distributed (b). © John Wiley & Sons, Inc.
Bar charts
One simple way to display and compare the means of several groups of data is
with a bar chart, like the one shown in Figure 9-7a. Here, the bar height for each
group of patients equals the mean (or median, or geometric mean) value of the
enzyme level for patients at the clinic represented by the bar. And the bar chart
becomes even more informative if you indicate the spread of values for each clini-
cal sample by placing lines representing one SD above and below the tops of the
bars, as shown in Figure 9-7b. These lines are always referred to as error bars,
which is an unfortunate choice of words that can cause confusion when error bars
are added to a bar chart. In this case, error refers to statistical error (described in
Chapter 6).
FIGURE 9-7: Bar charts showing mean values (a) and standard deviations (b). © John Wiley & Sons, Inc.
But even with error bars, a bar chart still doesn’t provide a picture of the distribu-
tion of enzyme levels within each group. Are the values skewed? Are there outliers?
Imagine that you made a histogram for each subgroup of patients — Clinic A,
Clinic B, Clinic C, and Clinic D. But if you think about it, four histograms would take
up a lot of space. There is a solution for this! Keep reading to find out what it is.
FIGURE 9-8: Box-and-whiskers charts: no-frills (a) and with variable width and notches (b). © John Wiley & Sons, Inc.
Looking at Figure 9-8a, you notice the box plot for each group has the following
parts:
» A box spanning the interquartile range (IQR), extending from the first quartile
of the variable to the third quartile, thus encompassing the middle 50 percent
of the data.
» A thick horizontal line, drawn at the median, which is also the 50th centile. If
this falls in the middle of the box, your data are not skewed, but if it falls on
either side, be on the lookout for skewness.
» Lines called whiskers extending out to the farthest data point that’s not more
than 1.5 times the IQR away from the box, terminating with a horizontal bar
on each side.
» Individual points lying outside the whiskers, which are considered outliers.
Box plots provide a useful visual summary of the distribution of each subgroup for
comparison, as shown in Figure 9-8a. As mentioned earlier, a median that’s not
located near the middle of the box indicates a skewed distribution.
Some software draws the different parts of a box plot according to different rules,
so you should always check your software’s documentation before you present a
box plot so you can describe your box plot accurately.
» Variable width: The widths of the bars can be scaled to indicate the relative
size of each group.
» Notches: The box can have notches that indicate the uncertainty in the
estimation of the median. If two groups have non-overlapping notches, they
probably have significantly different medians.
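In R, the boxplot function draws everything just described — including the variable-width and notch options — in one call. A minimal sketch using simulated enzyme levels for four clinics (stand-in data, not real results):

set.seed(2)
enzyme <- data.frame(
  clinic = rep(c("A", "B", "C", "D"), each = 30),
  level  = c(rlnorm(30, 3.0, 0.4), rlnorm(30, 3.2, 0.4),
             rlnorm(30, 2.9, 0.5), rlnorm(30, 3.1, 0.3)))
boxplot(level ~ clinic, data = enzyme, notch = TRUE, varwidth = TRUE,
        xlab = "Clinic", ylab = "Enzyme level")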
This chapter focused on univariate and bivariate summary statistics and graphs
that can be developed to help you and others better understand your data. But
many research questions are actually answered using multivariate analysis, which
allows for the control of confounders. Being able to control for confounders is one
of the main reasons biostatisticians opt for regression analysis, which we describe
in Part 5 and Chapter 23. In these chapters, we cover the appropriate summary
statistics and graphical techniques for showing relationships between variables
when setting up multivariate regression models.
Chapter 10
Having Confidence
in Your Results
In Chapter 3, we describe how statistical inference relies on both accuracy and
precision when making estimates from your sample. We also discuss how the
standard error (SE) is a way to indicate the level of precision of your sample
statistic, but that SE is only one way of expressing the preciseness of your statis-
tic. In this chapter, we focus on another way — through the use of a confidence
interval (CI).
We assume that you’re familiar with the concepts of populations, samples, and
statistical estimation theory (see Chapters 3 and 6 if you’re not), and that you
know what SEs are (read Chapter 3 if you don’t). Keep in mind that when you
conduct a human research study, you’re typically enrolling a sample of study par-
ticipants drawn from a hypothetical population. For example, you may enroll a
sample of 50 adult diabetic patients who agree to be in your study as participants,
but they represent the hypothetical population of all adults with diabetes (for
details about sampling, turn to Chapter 6). Any numerical estimate you observe
from your sample is a sample statistic. A statistic is a valid but imperfect estimate
of the corresponding population parameter, which is the true value of that quantity
in the population.
The SE is usually written after a sample mean with a ± (read “plus or minus”)
symbol followed by the number representing the SE. As an example, you may
express a mean and SE blood glucose level measurement from a sample of adult
diabetics as 120 ± 3 mg/dL. By contrast, the CI is written as a pair of numbers —
known as confidence limits (CLs) — separated by a dash. The CI for the sample
mean and SE blood glucose could be expressed like this: 114 – 126 mg/dL. Notice
that 120 mg/dL — the mean — falls in the middle of the CI. Also, note that the
lower confidence limit (LCL) is 114 mg/dL, and the upper confidence limit (UCL) is
126 mg/dL. Instead of LCL and UCL, sometimes abbreviations are used, and are
written with a subscript L or U (as in CL_L or CL_U) indicating the lower and upper
confidence limits, respectively.
Although SEs and CIs are both used as indicators of the precision of a numerical
quantity, they differ in what they are intending to describe (the sample or the
population):
» An SE indicates how much your observed sample statistic may fluctuate if the
same study is repeated a large number of times, so the SE intends to describe
the sample.
» A CI indicates the range that’s likely to contain the true population parameter,
so the CI intends to describe the population.
The confidence level is sometimes abbreviated CL, just like the confidence limit,
which can be confusing. Fortunately, the distinction is usually clear from the con-
text in which CL appears. When it’s not clear, we spell out what CL stands for.
There is a popular simulation to illustrate the interpretation of CIs and help learners
understand what it is like to be 95 percent confident. Imagine that you have a Microsoft
Excel spreadsheet, and you make up an entire population of 100 adult diabetics (maybe
they live on an island?). You make up a blood glucose measurement for each of them
and type it into the spreadsheet. Then, when you take the average of this entire column,
you get the true population parameter (in our simulation). Next, randomly choose a
sample of 50 measurements from your population of 100, and calculate a sample mean
and a 95 percent CI. Your sample mean will probably be different than the population
parameter, but that’s okay — that’s just sampling error.
Here’s where the simulation gets hard. You actually have to take 100 samples of 50. For
each sample, you need to calculate the mean and 95 percent CI. You may find yourself
making a list of the means and CIs from your 100 samples on a different tab in the
spreadsheet. Once you are done with that part, go back and refresh your memory as to
what the original population parameter really is. Get that number, then review all 100
CIs you calculated from all 100 samples of 50 you took from your imaginary population.
Because you made 95 percent CIs, about 95 of your 100 CIs should contain the true popula-
tion parameter (and about 5 of them won’t)! This simulation demonstrates the long-run
coverage interpretation of a confidence level, and helps learners understand what it means to be
95 percent confident about their CI.
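If you’d rather let software do the bookkeeping, here is a minimal R sketch of the same idea. The population values are made up, and each sample is drawn with replacement so that the usual CI assumptions apply:

set.seed(3)
population <- rnorm(100, mean = 120, sd = 25)   # made-up glucose values for 100 adult diabetics
true_mean  <- mean(population)                  # the population parameter

covered <- replicate(100, {
  s  <- sample(population, 50, replace = TRUE)  # one sample of 50 measurements
  ci <- t.test(s)$conf.int                      # its 95 percent CI
  ci[1] <= true_mean && true_mean <= ci[2]      # does the CI contain the parameter?
})
sum(covered)   # typically somewhere around 95 of the 100 intervals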
For any normally distributed sample statistic, the lower and upper confidence
limits can be calculated from the observed value of the statistic (V) and standard
error (SE) of the statistic:
$$CL_L = V - k \times SE \qquad\qquad CL_U = V + k \times SE$$
For the most commonly used confidence level, 95 percent, k is 1.96, or approxi-
mately 2. This leads to the very simple approximation that 95 percent upper con-
fidence limit is about two SEs above the value, and the lower confidence limit is
about two SEs below the value.
To calculate the confidence limits around a mean using the formulas in the pre-
ceding section, you first calculate the SE, which in this case is the standard error
of the mean (SEM). The formula for the SEM is SEM = SD/√N, where SD is the SD
of the sample values, and N is the number of values included in the calculation.
For the fasting blood glucose study sample, where your SD was 40 mg/dL and your
sample size was 25, the SEM is 40/√25, which is equal to 40/5, or 8 mg/dL.
On the basis of your calculations, you would report your result this way: mean
glucose = 130 mg/dL (95 percent CI = 114 – 146 mg/dL).
Please note that you should not report numbers to more decimal places than their
precision warrants. In this example, the digits after the decimal point are practi-
cally meaningless, so the numbers are rounded off.
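Here is the same calculation done in R, using the summary numbers from the glucose example:

mean_glu <- 130; sd_glu <- 40; n <- 25
sem <- sd_glu / sqrt(n)                 # 8 mg/dL
mean_glu + c(-1, 1) * 1.96 * sem        # roughly 114 to 146 mg/dL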
There are multiple approximate formulas for CIs around an observed proportion,
which are also called binomial CIs. Let’s start by unpacking the simplest method
for calculating binomial CIs, which is based on approximating the binomial distri-
bution using a normal distribution (see Chapter 25). The N is the denominator of
the proportion, and you should only use this method when N is large (meaning at
least 50). You should also only use this method if the proportion estimate is not
very close to 0 or 1. A good rule of thumb is the proportion estimate should be
between 0.2 and 0.8.
Using the numbers from the sample of 100 adult diabetics (of whom 70 have their
diabetes under control), you have p = 0.7 and N = 100. Using those numbers, the SE
for the proportion is √(0.7 × (1 – 0.7)/100), or 0.046. From Table 10-1, k is 1.96 for 95
percent confidence limits. So for the confidence limits, CL_L = 0.7 – 1.96 × 0.046 and
CL_U = 0.7 + 1.96 × 0.046. If you calculate these out, you get a 95 percent CI of 0.61 to
0.79 (around the original estimate of 0.7). To express these fractions as percent-
ages, you report your result this way: “The percentage of adult diabetics in the
sample whose diabetes was under control was 70 percent (95 percent CI = 61 – 79
percent).”
There are many approximate formulas for the CIs around an observed event count
or rate, which is also called a Poisson CI. The simplest method to calculate a Pois-
son CI is based on approximating the Poisson distribution by a normal distribu-
tion (see Chapter 24). It should be used only when N is large (at least 50). You first
calculate the SE of the event rate using this formula: SE = √N / T. Next, you use
the normal-based formulas in the earlier section “Before you begin: Formulas for
confidence limits in large samples” to calculate the lower and upper confidence
limits.
Using the numbers from the hospital falls example, N = 36 and T = 3, so the SE for the
event rate is √36/3, which is the same as 6/3, or 2.0. According to Table 10-1, k is 1.96 for
95 percent CLs. So CL_L = 12.0 – 1.96 × 2.0 and CL_U = 12.0 + 1.96 × 2.0, which works out
to 95 percent confidence limits of 8.1 and 15.9. You report your result this way: “The
serious fall rate was 12.0 (95 percent CI = 8.1 – 15.9) per month.”
To calculate the CI around the event count itself, you estimate the SE of the count
N as SE = √N, and then calculate the CI around the observed count using the same
normal-based formulas.
Many other approximate formulas for CIs around observed event counts and rates
are available, most of which are more reliable when your N is small. These formu-
las are too complicated to attempt by hand, but fortunately, many statistical pack-
ages can do these calculations for you. Your best bet is to get the name of the
formula, and then look in the documentation for the statistical software you’re
using to see if it supports a command for that particular CI formula.
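For example, base R’s poisson.test command produces an exact Poisson CI with no hand calculation. Applied to the hospital falls example (36 events over 3 months):

poisson.test(36, T = 3)$conf.int   # exact 95 percent CI for the monthly rate, roughly 8.4 to 16.6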
» If the 95 percent CI around the observed effect size includes the no-effect value,
then the effect is not statistically significant. This means that if the 95 percent
CI of a difference includes 0 or of a ratio includes 1, the difference is not large
enough to be statistically significant at α = 0.05, and we fail to reject the null.
» If the 95 percent CI around the observed effect size does not include the
no-effect value, then the effect is statistically significant. This means that if the
95 percent CI of a difference is entirely above or entirely below 0, or is entirely
above or entirely below 1 with respect to a ratio, the difference is statistically
significant at α = 0.05, and we reject the null.
So you have two different but related ways to estimate if an effect you see in your
sample is a true effect. You can use significance tests, or else you can use CIs.
Which one is better? Even though the two methods are consistent with one
another, in biostatistics, we are encouraged for ethical reasons to report the CIs
rather than the results of significance tests.
» The CI around the mean effect clearly shows you the observed effect size, as
well as the size of the actual interval (indicating your level of uncertainty about
the effect size estimate). It tells you not only whether the effect is statistically
significant, but also can give you an intuitive sense of whether the effect is
clinically important, also known as clinically significant.
» In contrast, the p value is the result of the complex interplay between the
observed effect size, the sample size, and the size of random fluctuations.
These are all boiled down into a single p value that doesn’t tell you whether
the effect was large or small, or whether it’s clinically significant or negligible.
Chapter 11
Comparing Average
Values between Groups
Comparing average values between groups of numbers is part of almost all
biostatistical analyses, and over the years, statisticians have developed
dozens of tests for this purpose. These tests include several different fla-
vors of the Student t test, analyses of variance (ANOVA), and a dizzying collection
of tests named after the men who popularized them, including Welch, Wilcoxon,
Mann-Whitney, and Kruskal-Wallis, to name just a few. The multitude of tests is
enough to make your head spin, which leaves many researchers with the uneasy
feeling that they may be using the wrong statistical test on their data.
In this chapter, we guide you through the menagerie of statistical tests for com-
paring groups of numbers. We start by explaining why there are so many tests
available, then guide you as to which ones are right for which situations. Next, we
show you how to execute these tests using R software, and how to interpret the
output. We focus on tests that are usually provided by modern statistical programs
(like those discussed in Chapter 4, which also explains how to install and get
started with R).
These different factors can occur in any and all combinations, so there are a lot of
potential scenarios. In the following sections, we review situations you may fre-
quently encounter when analyzing biological data, and advise you as to how to
select the most appropriate testing approach given the situation.
Typically, comparing a group mean to a historical control warrants using the one-
group Student t test that we describe in the later section “Surveying Student t tests.”
For data that are not normally distributed, the Wilcoxon Signed-Ranks (WSR) test
can be used instead, although it is not used often so we do not cover it in this
chapter. (If you need a review on what normally distributed means, see Chapter 3.)
» The standard deviation (SD) of the values must be close for both groups
(called the equal variance assumption). As a reminder, the SD is the square root
of the variance. To remember why accounting for variation is important in
sampling, review Chapter 3. Also, Chapter 9 provides more information about
the importance of SD. If the two groups you are comparing have very different
SDs, you should not use a Student t test, because it may not give reliable
results, especially if you are also comparing groups of different sizes. A rule of
thumb is that one group’s SD divided by another group’s SD should not be
more than 1.5 to qualify for a Student t test. If you feel your data do not qualify,
you can use an alternative called the Welch test (also called the Welch t test, or
the unequal-variance t test). As you see later in this chapter under “Surveying
Student t tests,” because the Welch test accounts for both equal and unequal
variance, it is the default t test run by R’s t.test command (you only get the
classic equal-variance test if you set var.equal = TRUE).
The null hypothesis of the one-way ANOVA is that all the groups have the same
mean. The alternative hypothesis is that at least one group has a mean that is
statistically significantly different from at least one of the other groups. The
ANOVA produces a single p value, and if that p is less than your chosen criterion
(typically α = 0.05), you conclude that at least one of the means must be statisti-
cally significantly different from at least one of the other means. (For a refresher
on hypothesis testing and p values, see Chapter 3.) But the problem with ANOVA
is that if it is statistically significant, it doesn’t tell you which groups have means
that are statistically significantly different. If you have a statistically significant
ANOVA, you have to follow-up with one or more so-called post-hoc tests (described
later under “Assessing the ANOVA”), which test for differences between the
means of each pair of groups in your ANOVA.
You can also use the ANOVA to compare just two groups. However, this one-way,
two-level ANOVA produces exactly the same p value as the classic unpaired equal-
variance Student t test.
In ANOVA terminology, the term way refers to how many grouping variables are
involved, and the term level refers to the number of different levels within any one
grouping variable.
When you are comparing means between groups, you are doing a bivariate com-
parison, meaning you are only involving two variables: the group variable and the
outcome. Adjusting for confounding must be done through a multivariate analysis
using regression.
» The values come from the same participants, but at two or more different
times, such as before and after some kind of treatment, intervention, or event.
» The values come from a crossover clinical trial, in which the same participant
receives two or more treatments at two or more consecutive phases of the trial.
The paired Student t test and the one-group Student t test are actually the same
test. When you run a paired t test, the statistical software first calculates the dif-
ference between each pair of numbers. If comparing a post-treatment value to a
pretreatment value, the software would start by subtracting one value from the
other for each participant. Finally, the software would run a test to see if those
mean differences were statistically significantly different from the hypothesized
value of 0 using a one-group test.
We opted not to clutter this chapter with pages of mathematical formulas for the
following tests because based on our own experience, we believe you’ll probably
never have to do one of these tests by hand. If you really want to see the formulas,
we recommend putting the name of the test in quotes in a search engine and look-
ing on the Internet.
1. Calculate the difference (D) between the mean values you are comparing.
2. For the t test, calculate the standard error (SE) of that difference (see
Chapter 10 for a refresher on SE).
3. Calculate the test statistic, which expresses the size of D relative to the size of its SE.
That is: t = D/SE.
4. Determine the degrees of freedom (df). df is a tricky concept, but is easy to calculate.
For t, the df is the total number of observations minus the number of means you
calculated from those observations.
5. Look up the p value, which is the probability that random fluctuations alone could
produce a t value at least as large as the value you just calculated, based upon the
Student t distribution.
The Student t statistic is always calculated using the general equation D/SE. Each
specific type of t test we discussed earlier — including one-group, paired,
unpaired, and Welch — calculates D, SE, and df slightly differently. These different
calculations are summarized in Table 11-1.
Executing a t test
Statistical software packages contain commands that can execute (or run) t tests
(see Chapter 4 for more about these packages). The examples presented here use
R, and in this section, we explain the data structure required for running the var-
ious t tests in R. For demonstration, we use data from the National Health and
Nutrition Examination Survey (NHANES) from 2017–2020 file (available at wwwn.
cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?Cycle=2017-2020).
» For the one-group t test, you need the column of data containing the
variable whose mean you want to compare to the hypothesized value (H), and
you need to know H. R and other software enable you to specify a value for H
and assume 0 if you don’t specify anything. In the NHANES data, the fasting
glucose variable is LBXGLU, so the R code to test the mean fasting glucose
against a maximum healthy level of 100 mg/dL in an R dataframe named
GLUCOSE is t.test(GLUCOSE$LBXGLU, mu = 100).
» For the paired t test, you need two columns of data representing the pair of
numbers you want to enter into the paired t test. For example, in NHANES,
systolic blood pressure (SBP) was measured in the same participant twice
(variables BPXOSY1 and BPXOSY2). To compare these with a paired t test in an
R dataframe named BP, the code is t.test(BP$BPXOSY1, BP$BPXOSY2, paired =
TRUE).
» For the independent t test, you need to have one column coded as the
grouping variable (preferably a two-state flag coded as 0 and 1), and
another column with the value you want to test. We created a two-state flag in
the NHANES data called MARRIED where 1 = married and 0 = all other marital
statuses. To compare mean fasting glucose level between these two groups in
a dataframe named NHANES, we used this code: t.test(NHANES$LBXGLU ~
NHANES$MARRIED).
data: GLUCOSE$LBXGLU
t = 21.209, df = 4743, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
110.1485 112.2158
sample estimates:
mean of x
111.1821
The R output starts by stating what test was run and what data were used, and
then reports the t statistic (21.209), the df (4743), and the p value, which is writ-
ten in scientific notation: < 2.2e–16. If you have trouble interpreting this notation,
just remove the < and then copy and paste the rest of the number into a cell in
Microsoft Excel. If you do that, you will see in the formula bar that the number
resolves to 0.00000000000000022 — which is a very low p value! The shorthand
used for this in biostatistics is p < 0.0001, meaning it is sufficiently small. Because
of this small p value, we reject the null hypothesis and say that the mean glucose
of NHANES participants is statistically significantly different from 100 mg/dL.
But in what direction? For that, it is necessary to read down further in the R out-
put, under 95 percent confidence interval. It says the interval is 110.1485 mg/dL to
112.2158 mg/dL (if you need a refresher on confidence intervals, read Chapter 10).
Because the entire interval is greater than 100 mg/dL, you can conclude that the
NHANES mean is statistically significantly greater than 100 mg/dL.
Now, let’s examine the output from the paired t test of SBP measured two times
in the same participant, which is shown in Listing 11-2.
Paired t-test
Notice a difference between the output shown in Listings 11-1 and 11-2. In
Listing 11-1, the third line of output says, “alternative hypothesis: true mean is
not equal to 100.” That is because we specified the null hypothesis of 100 when we
coded the one-sample t test. Because we did a paired t test in Listing 11-2, this
null hypothesis now concerns 0 because we are trying to see if there is a statisti-
cally significant difference between the first SBP reading and the second in the
same individuals. Why should they be very different at all? In Listing 11-2, the p
value is listed as 1.674e–05, which resolves to 0.00001674 (to be stated as p <
0.0001). We were surprised to see a statistically significant difference! The output
says that the 95 percent confidence interval of the difference is 0.1444651 mmHg
to 0.3858467 mmHg, so this small difference may be statistically significant while
not being clinically significant.
Let’s examine the output from our independent t test of mean fasting glucose
values in NHANES participants who were married compared to participants with
all other marital statuses. This output is shown in Listing 11-3.
But which group is higher? Well, for that, you can look at the last line of the out-
put, where it says that the mean in group 0 (all marital statuses except married)
is 108.8034 mg/dL, and the mean in group 1 (married) is 113.6404 mg/dL. So does
getting married raise your fasting glucose? Before you try to answer that, please
make sure you read up on confounding in Chapter 20!
But what if you just wanted to know if the variance in the fasting glucose mea-
surement in the married group was equal or unequal to the other group, even
though you were doing a Welch test that accommodates both? For that, you can
do an F test. Because we are not sure which group’s fasting glucose would be
higher, we choose a two-sided F test and use this code: var.test(LBXGLU ~
MARRIED, NHANES, alternative = "two.sided"), which produces the output shown in
Listing 11-4.
As shown in Listing 11-4, the p value on the F test is 0.4684. As a rule of thumb, if the
F test’s p value is greater than 0.05 (as it is here), you do not have evidence that the two
groups’ variances are unequal.
The term one-way ANOVA refers to an ANOVA with only one grouping variable in
it. The grouping variable usually has three or more levels because if it has only
two, most analysts just do a t test. In an ANOVA, you are testing how spread out
the means of the various levels are from each other. It is not unusual for students
to be asked to calculate an ANOVA manually in a statistics class, but we skip that
here and just describe the result. One result derived from an ANOVA calculation is
expressed in a test statistic called the F ratio (designated simply as F). The F is the
ratio of how much variability there is between the groups relative to how much
variability there is within the groups. If the null hypothesis is true, and no true
difference exists between the groups (meaning the average fasting glucose in
M = NM = OTH), then the F ratio should be close to 1. Also, F’s sampling fluctua-
tions should follow the Fisher F distribution (see Chapter 24), which is actually a
family of distribution functions characterized by two numbers seen in the ANOVA
calculation: the numerator (between-groups) degrees of freedom, df1, and the
denominator (within-groups) degrees of freedom, df2.
The p value can be calculated from the values of F, df1, and df2 , and the software
performs this calculation for you. If the p value from the ANOVA is statistically
significant — less than 0.05 or your chosen α level — then you can conclude that
the group means are not all equal and you can reject the null hypothesis. Techni-
cally, what that means is that at least one mean was so far away from another
mean that it made the F test result come out far away from 1, causing the p value
to be statistically significant.
Although using post-hoc tests can be helpful, controlling Type I error is not that
easy in reality. There can be issues with the data that may make you not trust the
results of your post-hoc tests, such as having too many levels of the group you are
testing in your ANOVA, or having one or more of the levels with very few partici-
pants (so the results are unstable). Still, if you have a statistically significant
ANOVA, you should do post-hoc t tests, just so you know the answer to the ques-
tion stated earlier.
It’s okay to do these post-hoc tests; you just have to take a penalty. A penalty is
where you deliberately make something harder for yourself in statistics. In this
case, we take a penalty by making it deliberately harder to conclude a p value on a
t test is statistically significant. We do that by adjusting the α to be lower than
0.05. How much we adjust it depends on the post-hoc test we choose.
» Scheffe’s test compares all pairs of groups, but also lets you bundle certain
groups together if doing so makes physical sense. For example, if you have
two treatment groups and a control group (such as Drug A, Drug B, and
Control), you may want to determine whether either drug is different from the
control. In other words, you may want to test Drug A and Drug B as one group
against the control group, in which case you use Scheffe’s test. Scheffe’s test is
the safest to use if you are worried your analysis may be suffering from Type I
error because it is the most conservative. On the other hand, it is less
powerful than the other tests, meaning it will miss a real difference in your
data more often than the other tests.
Running an ANOVA
Running a one-way ANOVA in R is similar to running an independent t test (see
the earlier section “Executing a t test”). However, in this case, we save the results
as an object, and then run R code on that object to get the output of our results.
Let’s turn back to the NHANES data. First, we need to prepare our grouping vari-
able, which is the three-level variable MARITAL (where 1 = married, 2 = never mar-
ried, and 3 = all other marital statuses). Next, we identify our dependent variable,
which is our fasting glucose variable called LBXGLU. Finally, we employ the aov
command to run the ANOVA in R, and save the results in an object called
GLUCOSE_aov. We use the following code: GLUCOSE_aov <- aov(LBXGLU ~
as.factor(MARITAL), data = NHANES). (The reason we have to use the as.factor com-
mand on the MARITAL variable is to make R handle it as an ordinal variable in the
calculation, not a numeric one.) Next, we can get our output by running a summary
command on this object using this code: summary(GLUCOSE_aov).
If you use R for this, you will notice that at the bottom of the output it says Signif.
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1. This is R explaining its coding sys-
tem for p values. It means that if a p value in output is followed by three asterisks,
this is a code for < 0.001. Two asterisks is a code for p < 0.01, and one asterisk
indicates p < 0.05. A period indicates p < 0.1, and no notation indicates the p value
is greater than or equal to 0.1 — meaning by most standards, it is not statistically
significant at all. Other statistical packages often use similar coding to make it
easy for analysts to pick out statistically significant p values in the output.
Several statistical packages that do ANOVAs offer one or more post-hoc tests as
optional output, so programmers tend to request output for both ANOVAs and
post-hoc tests, even before they know whether the ANOVA is statistically signifi-
cant or not, which can be confusing. ANOVA output from other software can
include a lot of extra information, such as a table of the mean, variance, standard
deviation, and count of the observations in each group. It may also include a test
for homogeneity of variances, which tests whether all groups have nearly the same
SDs. In R, the ANOVA output is very lean, and you have to request information like
this in separate commands.
Next is a table (also known as a matrix) with five columns. The first column does
not have a heading, but indicates which levels of MARITAL are being compared in
each row (for example, 2-1 means that 1 = M is being compared to 2 = NM). The
column diff indicates the mean difference between the groups being compared,
with lwr and upr referring to the lower and upper 95 percent confidence limits of
this difference, respectively. (R is using the 95 percent confidence limits because
we specified conf.level = .95 in our code.) Finally, in the last column labeled p adj is
the p value for each test. As you can see by the output, using the Tukey-Kramer
test and α = 0.05, M and NM are statistically significantly different (p = 0.0000102),
and OTH and M are statistically significantly different (p = 0.0030753), but NM and
OTH are not statistically significantly different (p = 0.1101964).
When doing a set of post-hoc tests in any software, the output will be formatted
as a table, with each comparison listed on its own row, and information about the
comparison listed in the columns.
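For reference, output in this format comes from base R’s TukeyHSD function applied to the saved ANOVA object; a sketch of the call (using the GLUCOSE_aov object created earlier):

TukeyHSD(GLUCOSE_aov, conf.level = 0.95)   # Tukey-Kramer post-hoc comparisons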
In a real scenario, after completing your post-hoc test, you would stop here and
interpret your findings. But because we want to explain the Scheffe test, we can
take the opportunity to compare what we find when we run that one, too. Let’s start
by loading the DescTools package using the R code library(DescTools) (Chapter 4
explains how to use packages in R). Next, let’s try the Scheffe test by using the
following code on our existing ANOVA object: ScheffeTest(GLUCOSE_aov).
The Scheffe test output is arranged in a similar matrix, but also includes R’s sig-
nificance codes. This time, according to R’s coding system, M and NM are statisti-
cally significantly different at p < 0.001, and M and OTH are statistically significantly
different at p < 0.01. Although the actual numbers are slightly different, the inter-
pretation is the same as what you saw using the Tukey-Kramer test.
Nonparametric tests don’t compare group means or test for a nonzero mean dif-
ference. Rather, they compare group medians, or they deal with ranking the order
of variables and analyze those ranks. Because of this, the output from R and other
programs will likely focus on reporting the p value of the test.
Only use a nonparametric test if you are absolutely sure your data do not qualify
for a parametric test (meaning t test, ANOVA, and others that require a particular
distribution). Parametric tests are more powerful. In the NHANES example, the
data would qualify for a parametric test; we only showed you the code for non-
parametric tests as an example.
Chapter 12
Comparing Proportions
and Analyzing
Cross-Tabulations
Suppose that you are studying pain relief in patients with chronic arthritis.
Some are taking nonsteroidal anti-inflammatory drugs (NSAIDs), which are
over-the-counter pain medications. But others are trying cannabidiol (CBD),
a new potential natural treatment for arthritis pain. You enroll 100 chronic arthri-
tis patients in your study and you find that 60 participants are using CBD, while
the other 40 are using NSAIDs. You survey them to see if they get adequate pain
relief. Then you record what each participant says (pain relief or no pain relief).
Your data file has two dichotomous categorical variables: the treatment group (CBD
or NSAIDs), and the outcome (pain relief or no pain relief).
You find that 10 of the 40 participants taking NSAIDs reported pain relief, which
is 25 percent. But 33 of the 60 taking CBD reported pain relief, which is 55 percent.
CBD appears to increase the percentage of participants experiencing pain relief by
30 percentage points. But can you be sure this isn’t just a random sampling
fluctuation?
In this chapter, we describe two tests you can use to answer this question: the
Pearson chi-square test, and the Fisher Exact test. We also explain how to esti-
mate power and sample sizes for the chi-square and Fisher Exact tests.
Like with other statistical tests, you can run all the tests in this chapter from
individual-level data in a database, where there is one record per participant. But
the tests in this chapter can also be executed using data that has already been
summarized in the form of a cross-tab:
FIGURE 12-1: The observed results comparing CBD to NSAIDs for the treatment of pain from chronic arthritis. © John Wiley & Sons, Inc.
Figure 12-1 presents the actual data you observed from your survey, where the
observed counts are placed in each of the four cells. As part of the chi-square test
statistic calculation, you now need to calculate an expected count for each cell. This
is done by taking the product of the row and column marginals and dividing them
by the total. So, to determine the expected count in the CBD/pain relief cell, you
would multiply 43 (row marginal) by 60 (column marginal), then divide this by
100 (total) which comes out 25.8. Figure 12-2 presents the fourfold table with the
expected counts in the cells.
FIGURE 12-2: Expected cell counts if the null hypothesis is true (there is no association between either drug and the outcome). © John Wiley & Sons, Inc.
The reason you need these expected counts is that they represent what would
happen under the null hypothesis (meaning if the null hypothesis were true). If
the null hypothesis were true:
As you can see, this expected table assumes that you still have the overall pain
relief rate of 43 percent, but that you also have the pain relief rates in each group
equal to 43 percent. This is what would happen under the null hypothesis.
Now that you have observed and expected counts, you’re no doubt curious as to
how each cell in the observed table differs from its companion cell in the expected
table. To get these numbers, you can subtract each expected count from the
observed count in each cell to get a difference table (observed – expected), as shown
in Figure 12-3.
FIGURE 12-3: Differences between observed and expected cell counts if the null hypothesis is true. © John Wiley & Sons, Inc.
As you review Figure 12-3, because you know the observed and expected tables in
Figures 12-1 and 12-2 always have the same marginal totals by design, you should
not be surprised to observe that the marginal totals in the difference table are all
equal to zero. All four cells in the center of this difference table have the same
absolute value (7.2), with a plus and a minus value in each row and each column.
The pattern just described is always the case for 2 × 2 tables. For larger tables, the
difference numbers aren’t all the same, but they always sum up to zero for each
row and each column.
The values in the difference table in Figure 12-3 show how far off from H 0 your
observed data are. The question remains: Are those difference values larger than
what may have arisen from random fluctuations alone if H 0 is really true? You
need some kind of measurement unit by which to judge how unlikely those differ-
ence values are. Recall from Chapter 10 that the standard error (SE) expresses the
general magnitude of random sampling, so looking at the SE as a type of mea-
surement unit is a good way for judging the size of the differences you may expect
to see from random fluctuations alone. It turns out that it is easy to approximate
the SE of the differences, because this is approximately equal to the square root of
the expected count in that cell (for the CBD/pain-relief cell, √25.8, or about 5.08).
You can “scale” the Ob-Ex difference (in terms of unit of SE) by dividing it by the
SE measurement unit, getting the ratio Diff/SE = 7.2/5.08, or 1.42. This means
that the difference between the observed number of CBD-treated participants who
experience pain relief and the number you would have expected if the CBD had no
effect on pain relief is about 1.42 times as large as you would have expected from
random sampling fluctuations alone. You can do the same calculation for the other
three cells and summarize these scaled differences. Figure 12-4 shows the differ-
ences between observed and expected cell counts, scaled according to the esti-
mated standard errors of the differences.
FIGURE 12-4: Differences between observed and expected cell counts, scaled by the estimated standard errors of the differences. © John Wiley & Sons, Inc.
The next step is to combine these individual scaled differences into an overall
measure of the difference between what you observed and what you would have
expected if the CBD or NSAID use really did not impact pain relief differentially.
You can’t just add them up because the negative and positive differences would
cancel each other out. You want all differences (positive and negative) to contrib-
ute to the overall measure of how far your observations are from what you expected
under H 0 .
FIGURE 12-5: Components of the chi-square statistic: squares of the scaled differences. © John Wiley & Sons, Inc.
You then add up these squared scaled differences: 2.01 + 1.52 + 3.01 + 2.27 = 8.81
to get the chi-square test statistic. This sum is an excellent test statistic to mea-
sure the overall departure of your data from the null hypothesis:
» If the null hypothesis is true (use of CBD or NSAID does not impact pain relief
status), this statistic should be quite small.
When the expected cell counts are very large, the Poisson distribution becomes
very close to a normal distribution (see Chapter 24 for more on the Poisson distri-
bution). If the H 0 is true, each scaled difference should be an approximately nor-
mally distributed random variable with a mean of zero and a standard deviation
of 1. The mean is zero because you subtract the expected value from the observed
value, and the standard deviation is 1 because it is divided by the SE. The sum of
the squares of one or more independent normally distributed random numbers is a
number that follows a chi-square distribution, with degrees of freedom (df) equal to
the number of independent terms being added.
How would you calculate the df for a chi-square test? The answer is it depends on
the number of rows and columns in the cross-tab. For the 2 × 2 cross-tab (fourfold table) in this
example, you added up the four values in Figure 12-5, so you may think that you
should look up the 8.81 chi-square value with 4 df. But you’d be wrong. Note the
italicized word independence in the preceding paragraph. And keep in mind that
the differences (Ob – Ex ) in any row or column always add up to zero. The four
terms making up the 8.81 total aren’t independent of each other. It turns out that
the chi-square test statistic for a fourfold table has only 1 df, not 4. In general, an
N-by-M table, with N rows, M columns, and therefore N × M cells, has only
(N – 1) × (M – 1) df because of the constraints on the row and column sums. In our
case, N — which is the number of rows — is 2, so N – 1 is 1. Also, M — which is the
number of columns — is 2, so M – 1 is 1 also (and 1 times 1 is 1). Don’t feel bad if
this wrinkle caught you by surprise — even Karl Pearson who invented the
chi-square test got that part wrong!
So, if you were to manually look up the chi-square test statistic of 8.81 in a
chi-square table, you would have to look under the distribution for 1 df to find out
the p value. Alternatively, if you got this far and you wanted to use the statistical
software R to look up the p value, you would use the following code: pchisq(8.81, 1,
lower.tail = FALSE). Either way, the p value for chi-square = 8.81, with 1 df, is 0.003.
This means that there’s only a 0.003 probability that random fluctuations could
produce the effect seen, where CBD performs so differently than NSAIDs with
respect to pain relief in chronic arthritis patients. A 0.003 probability is the same
as about 1 chance in 333 (because 1/0.003 ≈ 333), meaning very unlikely, but not impos-
sible. So, if you set α = 0.05, because 0.003 < 0.05, your conclusion would be that
in the chronic arthritis patients in our sample, whether the participant took CBD
or NSAIDs was statistically significantly associated with whether or not they felt
pain relief.
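If you'd rather let R do all of this arithmetic at once, here's a minimal sketch, assuming the Figure 12-1 counts are 33 and 27 in the CBD row and 10 and 30 in the NSAID row (that layout is an assumption here); chisq.test() with correct = FALSE reproduces the uncorrected Pearson statistic:

obs <- matrix(c(33, 27, 10, 30), nrow = 2, byrow = TRUE)   # assumed observed counts
chisq.test(obs, correct = FALSE)   # Pearson chi-square = 8.81, df = 1, p = 0.003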
FIGURE 12-6: A general way of naming the cells of a cross-tab table. © John Wiley & Sons, Inc.
Using these conventions, the basic formulas for the Pearson chi-square test are as
follows:
» Expected values: Ex(i,j) = R(i) × C(j) / T, for i = 1, 2, ..., N and j = 1, 2, ..., M
» Chi-square statistic: χ² = Σ (Ob(i,j) – Ex(i,j))² / Ex(i,j), summed over all N × M cells
where i and j are array indices that indicate the row and column, respectively, of
each cell.
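As a rough sketch of how these formulas translate into R (reusing the same assumed counts as before), you can compute the expected values, the statistic, and its p value in a few lines:

obs <- matrix(c(33, 27, 10, 30), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # Ex(i,j) = R(i) × C(j) / T
chi_sq <- sum((obs - expected)^2 / expected)               # 8.81 for this table
df <- (nrow(obs) - 1) * (ncol(obs) - 1)                    # 1 for a fourfold table
pchisq(chi_sq, df, lower.tail = FALSE)                     # about 0.003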
The chi-square test does have some drawbacks:
» It's not an exact test. The p value it produces is only approximate, so using p < 0.05 as your criterion for statistical significance (meaning setting α = 0.05)
doesn’t necessarily guarantee that your Type I error rate will be only
5 percent. Remember, your Type I error rate is the likelihood you will claim
statistical significance on a difference that is not true (see Chapter 3 for an
introduction to Type I errors). The level of accuracy of the statistical signifi-
cance is high when all the cells in the table have large counts, but it becomes
unreliable when one or more cell counts is very small (or zero). There are
different recommendations as to the minimum counts you need per cell in
order to confidently use the chi-square test. A rule of thumb that many
analysts use is that you should have at least five observations in each cell of
your table (or better yet, at least five expected counts in each cell).
» It’s not good at detecting trends. The chi-square test isn’t good at detecting
small but steady progressive trends across the successive categories of an
ordinal variable (see Chapter 4 if you’re not sure what ordinal is). It may give a
significant result if the trend is strong enough, but it’s not designed specifically
to work with ordinal categorical data. In those cases, you should use a
Mantel-Haenszel chi-square test for trend, which is outside the scope of this
book.
Let’s apply the Yates continuity correction for your analysis of the sample data in
the earlier section “Understanding how the chi-square test works.” Take a look at
Figure 12-3, which has the differences between the values in the observed and
expected cells. The application of the Yates correction changes the 7.20 (or –7.20)
difference in each cell to 6.70 (or –6.70). This lowers the chi-square value from 8.81 to about 7.63, which raises the p value slightly (to about 0.006).
Even though the Yates correction to the Pearson chi-square test is only applicable
to the fourfold table (and not tables with more rows or columns), some statisti-
cians feel the Yates correction is too strict. Nevertheless, it has been automatically
built into statistical software like R, so if you run a Pearson chi-square test in most statistical packages, the Yates correction is applied automatically when you analyze a fourfold table (see Chapter 4 for a discussion of statistical software).
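In R, for example, chisq.test() applies the Yates correction by default whenever it is given a 2×2 table, and you have to turn the correction off explicitly to get the uncorrected statistic (again using the assumed counts from the earlier sketch):

chisq.test(obs)                    # Yates-corrected chi-square, about 7.63
chisq.test(obs, correct = FALSE)   # uncorrected Pearson chi-square, 8.81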
This test is conceptually pretty simple. Instead of taking the product of the mar-
ginals and dividing it by the total for each cell as is done with the chi-square test
statistic, Fisher exact test looks at every possible table that has the same marginal
totals as your observed table. You calculate the exact probability (Pr) of getting
each individual table using a formula that, for a fourfold table (using the notation
for Figure 12-6), is
Pr = (60!)(40!)(43!)(57!) / [(33!)(27!)(10!)(30!)(100!)] = 0.00196
Other possible tables with the same marginal totals as the observed table have
their own Pr values, which may be larger than, smaller than, or equal to the Pr
value of the observed table. The Pr values for all possible tables with a specified set
of marginal totals always add up to exactly 1.
The Fisher Exact test p value is obtained by adding up the Pr values for all tables
that are at least as different from the H 0 as your observed table. For a fourfold
table that means adding up all the Pr values that are less than (or equal to) the Pr
value for your observed table.
For the example in Figure 12-1, the p value comes out to 0.00385, which means
that there's only about 1 chance in 260 (because 1/0.00385 ≈ 260) that random fluctua-
tions could have produced such an apparent effect in your sample.
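Base R's fisher.test() does this summation for you. A minimal sketch, again assuming the observed counts used earlier:

fisher.test(obs)   # two-sided exact p value, about 0.004 for this table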
» It is exact for all tables, with large or small (or even zero) cell counts.
Why do people still use the chi-square test, which is approximate and doesn’t
work for tables with small cell counts? Well, there are several problems with the
Fisher Exact test:
» The Fisher calculations are a lot more complicated, especially for tables larger
than 2 2. Many statistical software packages either don’t offer the Fisher Exact
test or offer it only for fourfold tables. Even when it is offered, the calculation may run for so long that it never finishes, and you have to interrupt the program to stop it. Also, some interactive web pages perform the
Fisher Exact test for fourfold tables (including www.socscistatistics.com/
tests/fisher/default2.aspx). Only the major statistical software packages
(like SAS, SPSS, and R, described in Chapter 4) offer the Fisher Exact test for
tables larger than 2 2 because the calculations are so intense. For this reason,
the Fisher Exact test is only practical for small cell counts.
» Another issue is — like the chi-square test — the Fisher Exact test is not for
detecting gradual trends across ordinal categories.
Earlier in the section “Examining Two Variables with the Pearson Chi-Square
Test,” we used an example of an observational study design in which study par-
ticipants were patients who chose which treatment they were using. In this sec-
tion, we use an example from a clinical trial study design in which study
participants are assigned to a treatment group. The point is that the tests in this
section work on all types of study designs.
Let’s calculate sample size together. Suppose that you’re planning a study to test
whether giving a certain dietary supplement to a pregnant woman reduces her
chances of developing morning sickness during the first trimester of pregnancy,
which is the first three months. This condition normally occurs in 80 percent of
pregnant women, and if the supplement can reduce that incidence rate to only
60 percent, it would be considered a large enough reduction to be clinically sig-
nificant. So, you plan to enroll a group of pregnant women who are early in their
first trimester and randomize them to receive either the dietary supplement or a
placebo that looks, smells, and tastes exactly like the supplement. You will ran-
domly assign each participant to either the supplement group or the placebo
group in a process called randomization. The participants will not be told which
group they are in, which is called blinding. (There is nothing unethical about this
situation because all participants will agree before participating in the study that
they would be willing to take the product associated with each randomized group,
regardless of the one to which they are randomized.)
You’ll have them take the product during their first trimester, and you’ll survey
them to record whether they experience morning sickness during that time (using
You have several ways to estimate the required sample size. The most general and
most accurate way is to use power/sample-size software such as G*Power, which
is described in detail in Chapter 4. Or you can use the online sample-size calcula-
tor at https://clincalc.com/stats/samplesize.aspx, which produces the
same results.
You need to enroll additional subjects to allow for possible attrition during the
study. If you expect x percent of the subjects to drop out, your enrollment should be 100 × (the required analyzable sample size) / (100 – x).
So, if you expect 15 percent of enrolled subjects to drop out and therefore be unan-
alyzable, you need to enroll 100 × 197/(100 – 15), or about 231 participants.
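If you'd rather stay in R, base power.prop.test() gives a quick normal-approximation answer; because it doesn't apply a continuity correction, its result comes out somewhat smaller than the roughly 197 analyzable participants quoted above. A hedged sketch:

power.prop.test(p1 = 0.80, p2 = 0.60, sig.level = 0.05, power = 0.80)
# uncorrected approximation: roughly 80-some participants per group

# Inflating the analyzable sample size for an expected 15 percent dropout rate:
100 * 197 / (100 - 15)   # = 231.8 participants to enroll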
Chapter 13
Taking a Closer Look
at Fourfold Tables
I
n Chapter 12, we show you how to compare proportions between two or more
groups with a cross-tab table. In general, a cross-tab shows the relationship
between two categorical variables. Each row of the table represents one partic-
ular category of one of the variables, and each column of the table represents one
particular category of the other variable. The table can have two or more rows and
two or more columns, depending on the number of different categories or levels
present in each of the two variables. (To refresh your memory about categorical
variables, read Chapter 8.)
Imagine that you are comparing the performance of three treatments (Drug A,
Drug B, and Drug C) in patients who could have four possible outcomes: improved,
stayed the same, got worse, or left the study due to side effects. In such a case,
your treatment variable would have three levels so your cross-tab would have
three rows, and your outcome variable would have four levels so your cross-tab
would have four columns.
But this chapter only focuses on the special case that occurs when both categorical
variables in the table have only two levels. Other words for two-level variables are
dichotomous and binary. A few examples of dichotomous variables are hyperten-
sion status (hypertension or no hypertension), obesity status (obese or not obese),
and pregnancy status (pregnant or not pregnant). The cross-tab of two dichotomous variables has two rows and two columns — a fourfold table.
In the rest of this chapter, we describe many useful calculations that you can
derive from the cell counts in a fourfold table. The statistical software that cross-
tabulates your raw data can provide these indices depending upon the commands
it has available (see Chapter 4 for a review of statistical software). Thankfully (and
uncharacteristically), unlike in most chapters in this book, the formulas for many
indices derived from fourfold tables are simple enough to do manually with a calculator.
Like any other value you calculate from a sample, an index calculated from a four-
fold table is a sample statistic, which is an estimate of the corresponding population
parameter. A good researcher always wants to quote the precision of that estimate.
In Chapter 10, we describe how to calculate the standard error (SE) and confidence
interval (CI) for sample statistics such as means and proportions. Likewise, in this
chapter, we show you how to calculate the SE and CI for the various indices you
can derive from a fourfold table.
For consistency, all the formulas in this chapter refer to the four cell counts of the
fourfold table, and the row totals, column totals, and grand total, in the same
standard way (see Figure 13-1). This convention is used in many online resources
and textbooks.
FIGURE 13-1: These designations for cell counts and totals are used throughout this chapter. © John Wiley & Sons, Inc.
The example given here is for the use of a fourfold table to interpret a cross-
sectional study. If you have heard of a fourfold table being analyzed as part of a
cohort study or longitudinal study, that is referring to a series of cross-sectional
studies done over time to the same group or cohort. Each round of data collection
is called a wave, and fourfold tables can be developed cross-sectionally (using data
from one wave), or longitudinally (using data from two waves).
As described in Chapter 6, you could try simple random sampling (SRS), but this
may not provide you with a balanced number of participants who are positive for
the exposure compared to negative for the exposure. If you are worried about this,
you could try stratified sampling on the exposure (such as requiring half the sam-
ple to be obese, and half the sample to not be obese). Although other sampling
strategies described in Chapter 6 could be used, SRS and stratified sampling are the most common in cross-sectional studies. Why is your sampling strategy so important? As you see in the rest of this chapter, some indices are meaningful only if the sampling was done in a way that supports the corresponding study design.
» Evaluating therapies
Note: These scenarios can also give rise to tables larger than 2×2, and fourfold
tables can arise in other scenarios besides these.
The table in Figure 13-2 indicates that more than half of the obese participants
have HTN and more than half of the non-obese participants don’t have HTN — so
there appears to be a relationship between membership in a particular row and membership in a particular column. You can show that this apparent association is statistically significant in this sample using either a Yates chi-square or a Fisher Exact test on this table (as we describe in Chapter 12). If you do these tests, your p values will be p = 0.016 and p = 0.013, respectively, and at α = 0.05, you will be comfortable rejecting the null hypothesis.
FIGURE 13-2: A fourfold table summarizing obesity and hypertension in a sample of 60 participants. © John Wiley & Sons, Inc.
But when you present the results of this study, just saying that a statistically sig-
nificant association exists between obesity status and HTN status isn’t enough.
You should also indicate how strong this relationship is and in what direction it
goes. A simple solution to this is to present the test statistic and p value to repre-
sent how strong the relationship is, and to present the actual results as row
or column percentages to indicate the direction. For the data in Figure 13-2, you
could say that being obese was associated with having HTN, because 14/21 = 67 percent of obese participants also had HTN, while only 12/39 = 31 percent of non-obese participants had HTN.
The term relative risk refers to the amount of risk one group has relative to another.
This chapter discusses different measures of relative risk that are to be used with
different study designs. It is important to acknowledge here that technically, the
term risk can only apply to cohort studies because you can only be at risk if you
possess the exposure but not the outcome for some period of time in a study, and
only cohort studies have this design feature. However, the other study designs — including cross-sectional and case-control — still aim to estimate the relative risk, so the term is applied loosely to the measures calculated from those designs as well.
In a cohort study, the measure of relative risk used is called the risk ratio (also
called cumulative incidence ratio). To calculate the risk ratio, first calculate the
CIR in the exposed, calculate the CIR in the unexposed, then take a ratio of the CIR
in the exposed to the CIR in the unexposed. This formula could be expressed as RR = (a/r1) / (c/r2).
For this example, as calculated earlier, the CIR for the exposed was 0.667, and the
CIR for the unexposed was 0.308. Therefore, the risk-ratio calculation would be
0.667 / 0.308, which is 2.17. So, in this cohort study, obese participants were slightly
more than twice as likely to be diagnosed as having HTN during follow-up than
non-obese subjects.
In a case-control study, for a measure of relative risk, you must use the odds ratio
(discussed later in the section “Odds ratio”). You cannot use the risk ratio or
prevalence ratio in a case-control study. The odds ratio can also be used as a mea-
sure of relative risk in a cross-sectional study, and can technically be used in a
cohort study, although the preferred measure is the risk ratio.
Let’s go back to discussing the risk ratio. You can calculate an approximate
95 percent confidence interval (CI) around the observed risk ratio using the fol-
lowing formulas, which assume that the logarithm of the risk ratio is normally
distributed:
1. Calculate the standard error (SE) of the log of the risk ratio using the following formula:
SE = √[ b/(a × r1) + d/(c × r2) ]
2. Calculate the quantity Q = e^(1.96 × SE).
3. Find the lower and upper limits of the CI with the following formula:
95% CI = RR/Q to RR × Q
For confidence levels other than 95 percent, replace the z-score of 1.96 in Step 2
with the corresponding z-score shown in Table 10-1 of Chapter 10. As an example,
for 90 percent confidence levels, use 1.64, and for 99 percent confidence levels,
use 2.58.
For the example in Figure 13-2, you calculate the 95 percent CI around the observed risk ratio as follows:
1. SE = √[ 7/(14 × 21) + 27/(12 × 39) ], which is 0.2855.
2. Q = e^(1.96 × 0.2855), which is 1.75.
3. 95% CI = 2.17/1.75 to 2.17 × 1.75, which is 1.24 to 3.80.
Using these calculations, the risk ratio would be expressed as 2.17, 95 percent CI 1.24 to 3.80.
You could also use R to calculate a risk ratio and 95 percent CI for the fourfold
table in Figure 13-2 with the following steps:
1. Create a matrix.
2. Load a library.
For many epidemiologic calculations, you can use the epitools package in R and
use a command from this package to calculate the risk ratio and 95 percent
CI. Load the epitools library with this command: library(epitools).
3. Run the command on the matrix you created.
In this case, run the riskratio.wald command on the obese_HTN matrix you created in Step 1: riskratio.wald(obese_HTN).
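Putting the three steps together, here's a minimal sketch, assuming the counts are typed in row by row exactly as they appear in Figure 13-2; this is what produces the output shown below.

obese_HTN <- matrix(c(14, 7, 12, 27), nrow = 2, byrow = TRUE)   # Step 1: the Figure 13-2 counts
library(epitools)                                               # Step 2: load the epitools package
riskratio.wald(obese_HTN)                                       # Step 3: risk ratio with Wald 95% CI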
$measure
risk ratio with 95% C.I.
Predictor estimate lower upper
Exposed1 1.000000 NA NA
Exposed2 2.076923 1.09512 3.938939
$p.value
two-sided
Predictor midp.exact fisher.exact chi.square
Exposed1 NA NA NA
Exposed2 0.009518722 0.01318013 0.00744125
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
>
Notice that the output is organized under the following headings: $data, $measure,
$p.value, and $correction. Under the $measure section is a centered title that says
risk ratio with 95% C.I. — which is more than a hint! Under that is a table with the
following column headings: Predictor, estimate, lower, and upper. The estimate col-
umn has the risk ratio estimate (which you already calculated by hand and rounded
off to 2.17). The lower and upper columns have the confidence limits, which R
calculated as 1.09512 (round to 1.10) and 3.938939 (round to 3.94), respectively.
You may notice that because R used a slightly different SE formula than our man-
ual calculation, R’s CI was slightly wider.
Odds ratio
The odds of an event occurring is the probability of it happening divided by the
probability of it not happening. Assuming you use p to represent a probability, you
could write the odds equation this way: odds = p / (1 – p). In a fourfold table, you would
represent the odds of the outcome in the exposed as a/b. You would also represent
the odds of the outcome in the unexposed as c/d.
Odds have no units. They are not expressed as percentages. See Chapter 3 for a
more detailed discussion of odds.
When considering cross-sectional and cohort studies, the odds ratio (OR) repre-
sents the ratio of the odds of the outcome in the exposed to the odds of the out-
come in the unexposed. In case-control studies, because of the sampling approach,
the OR represents the ratio of the odds of exposure among those with the outcome
to the odds of exposure among those without the outcome. But because any four-
fold table has only one OR no matter how you calculate it, the actual value of the
OR stays the same, but how it is described and interpreted depends upon the study
design.
Let’s assume that Figure 13-2 presents data on a cross-sectional study, so we will
look at the OR from that perspective. Because you calculate the odds in the exposed
as a/b, and the odds in the unexposed as c/d, the odds ratio is calculated by dividing a/b by c/d, like this: OR = (a/b) / (c/d). For this example, the OR is (14/7) / (12/27), which is 2.00/0.444, which is 4.50. In
this sample, assuming a cross-sectional study, participants who were positive for
the exposure had 4.5 times the odds of also being positive for the outcome com-
pared to participants who were negative for the exposure. In other words, obese
participants had 4.5 times the odds of also having HTN compared to non-obese
participants.
You can calculate an approximate 95 percent CI around the observed OR using the
following formulas, which assume that the logarithm of the OR is normally
distributed:
1. Calculate the standard error of the log of the OR with the following formula:
SE = √( 1/a + 1/b + 1/c + 1/d )
2. Calculate the quantity Q = e^(1.96 × SE).
3. Find the lower and upper limits of the CI with the following formula:
95% CI = OR/Q to OR × Q
Like with the risk ratio CI, for confidence levels other than 95 percent, replace the
z-score of 1.96 in Step 2 with the corresponding z-score shown in Table 10-1 of
Chapter 10. As an example, for 90 percent confidence levels, use 1.64, and for
99 percent confidence levels, use 2.58.
For the example in Figure 13-2, you calculate 95 percent CI around the observed
OR as follows:
1. SE = √( 1/14 + 1/7 + 1/12 + 1/27 ), which is 0.5785.
2. Q = e^(1.96 × 0.5785), which is 3.11.
3. 95% CI = 4.50/3.11 to 4.50 × 3.11, which is 1.45 to 14.0.
Using these calculations, the OR is estimated as 4.5, and the 95 percent CI as 1.45
to 14.0.
To do this operation in R, you would follow the same steps as listed at the end of
the previous section, except in Step 3, the command you’d run on the matrix is
oddsratio.wald() using this code: oddsratio.wald(obese_HTN). The output is laid out
the same way as shown in Listing 13-1, with a $measure section titled odds ratio
with a 95% C.I. In that section, it indicates that the lower and upper confidence
limits are 1.448095 (rounded to 1.45) and 13.98389 (rounded to 13.98), respec-
tively. This time, R’s estimate of the 95 percent CI was close to the one you got
with your manual calculation, but slightly narrower.
A wide 95 percent CI is the sign of an unstable (and not very useful) estimate.
Consider a 95 percent CI for an OR that goes from 1.45 to 14.0. If you are interpreting the results of a cohort study, you are saying that obesity could multiply the odds of getting HTN by as little as 1.45 or by as much as 14! Most researchers try to
solve this problem by increasing their sample size to reduce the size of their SE,
which will in turn reduce the width of the CI.
Most screening tests produce some false positive results, which is when the result
of the test is positive, but the patient is actually negative for the condition. Screen-
ing tests also produce some false negative results, where the result is negative in
patients where the condition is present. Because of this, it is important to know
false positive rates, false negative rates, and other features of screening tests to
consider their level of accuracy in your interpretation of their results.
You usually evaluate a new, experimental screening test for a particular medical
condition by administering the new test to a group of participants. These partici-
pants include some who have the condition and some who do not. For all the
participants in the study, their status with respect to the particular medical condi-
tion has been determined by the gold standard method, and you are seeing how
well your new, experimental screening test compares. You can then cross-
tabulate the new screening test results against the gold standard results repre-
senting the true condition in the participants. You would create a fourfold table in
a framework as shown in Figure 13-3.
FIGURE 13-3: This is how data are summarized when evaluating a proposed new diagnostic screening test. © John Wiley & Sons, Inc.
FIGURE 13-4: Results from a study of a new experimental home pregnancy test. © John Wiley & Sons, Inc.
The structure of the table in Figure 13-4 is important because if your results are
arranged in that way, you can easily calculate at least five important characteris-
tics of the experimental test (in our case, the home pregnancy test) from this
table: accuracy, sensitivity, specificity, positive predictive value (PPV), and nega-
tive predictive value (NPV). We explain how in the following sections.
Overall accuracy
Overall accuracy measures how often a test result comes out the same as the gold
standard result (meaning the test is correct). A perfectly accurate test never pro-
duces false positive or false negative results. In Figure 13-4, cells a and d represent
correct test results, so the overall accuracy of the home pregnancy test is (a + d)/t. Using the data in Figure 13-4, accuracy = (33 + 51)/100, which is 0.84, or 84 percent.
In a screening test with a sensitivity of 100 percent, the test result is always posi-
tive whenever the condition is truly present. In other words, the test will identify
all individuals who truly have the condition. When a perfectly sensitive test comes
out negative, you can be sure the person doesn’t have the condition. You calculate
sensitivity by dividing the number of true positive cases by the total number of cases where the condition was truly present: a/c1 (that is, true positives/all present). Using the data in Figure 13-4, sensitivity = 33/37, which is 0.89. This means that the home test comes out positive in only 89 percent of truly pregnant women; the other 11 percent were really pregnant but had a false negative result on the test.
A perfectly specific test never produces a false positive result for an individual
without the condition. In a test that has a specificity of 100 percent, whenever the
condition is truly absent, the test always has a negative result. In other words, the
test will identify all individuals who truly do not have the condition. When a per-
fectly specific test comes out positive, you can be sure the person has the condi-
tion. You calculate specificity by dividing the number of true negative cases by the
total number of cases where the condition was truly absent: d / c2 (that is, true
negative/all not present). Using the data in Figure 13-4, specificity = 51/63, which is 0.81. This means that among the women who were not pregnant, the home test was negative only 81 percent of the time, and the other 19 percent of women who were truly not pregnant tested as positive. (You can see why it is important to do studies like this
before promoting the use of a particular screening test!)
But imagine you work in a lab that processes the results of screening tests, and
you do not usually have access to the gold standard results. You may ask the ques-
tion, “How likely is a particular screening test result to be correct, regardless of
whether it is positive or negative?” When asking this about positive test results,
you are asking about positive predictive value (PPV), and when asking about neg-
ative test results, you are asking about negative predictive value (NPV). These are
covered in the following sections.
Sensitivity and specificity are important characteristics of the test itself, but the answers to the questions in the preceding paragraph depend on the prevalence of the condition in the background population. If the study population were older women, the prevalence of being pregnant would be lower. That wouldn't change the test's sensitivity and specificity, but it would impact the PPV and NPV, which we discuss in the next sections. For these reasons, it is important to use natural sampling in such a study design.
The negative predictive value (NPV, also called predictive value negative) is the
fraction of all negative test results that are true negatives. In the case of the preg-
nancy test scenario, the NPV is the fraction of the time that a negative screening test result means the woman is truly not pregnant. NPV is the likelihood that a negative test result is correct. You calculate NPV as d/r2. For the data in Figure 13-4, the
NPV is 51 / 55 , which is 0.93. So, if the pregnancy test result is negative, there’s a
93 percent chance that the woman is truly not pregnant.
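Here's a rough sketch of all five characteristics in R, assuming the cell counts implied by the calculations above (33 true positives and 51 true negatives, with 37 truly pregnant women and 55 negative test results, which forces 12 false positives and 4 false negatives):

tp <- 33; fp <- 12; fn <- 4; tn <- 51   # assumed cells a, b, c, and d of Figure 13-4
total <- tp + fp + fn + tn              # 100 participants
accuracy    <- (tp + tn) / total        # 0.84
sensitivity <- tp / (tp + fn)           # 33/37, about 0.89
specificity <- tn / (fp + tn)           # 51/63, about 0.81
ppv         <- tp / (tp + fp)           # 33/45, about 0.73
npv         <- tn / (fn + tn)           # 51/55, about 0.93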
Investigating treatments
In conditions where there are no known treatments, one of the simplest ways to
investigate a new treatment (such as a drug or surgical procedure) is to compare
it to a placebo or sham condition using a clinical trial study design. Because many
forms of dementia have no known treatment, it would be ethical to compare new
treatments for dementia to placebo or a sham treatment in a clinical trial. In those
cases, patients with the condition under study would be randomized (randomly
assigned) to an active group (taking the real treatment) and a control group (that
would receive the sham treatment), as randomization is a required feature of clin-
ical trials. Because some of the participants in the control group may appear to
improve, it is important that participants are blinded as to their group assign-
ment, so that you can tell if outcomes are actually improved in the treatment
compared to the control group.
Suppose you conduct a study where you enroll 200 patients with mild dementia
symptoms, then randomize them so that 100 receive an experimental drug
intended for mild dementia symptoms, and 100 receive a placebo. You have the
participants take their assigned product for six weeks, then you record whether
each participant felt that the product helped their dementia symptoms. You tabu-
late the results in a fourfold table, like Figure 13-5.
According to the data in Figure 13-5, 70 percent of participants taking the new
drug report that it helped their dementia symptoms, which is quite impressive
until you see that 50 percent of participants who received the placebo also reported
improvement. When patients report therapeutic effect from a placebo, it’s called
the placebo effect, and it may come from a lot of different sources, including the
patient’s expectation of efficacy of the product. Nevertheless, if you conduct a
Yates chi-square or Fisher Exact test on the data (as described in Chapter 12)
at α = 0.05, the results show treatment assignment was statistically signifi-
cantly associated with whether or not the participant reported a treatment effect
(p = 0.006 by either test).
Humans who perform such determinations in studies are called raters because
they are assigning ratings, which are values or classifiers that will be used in the
study. For the measurements in your study, it is important to know how consis-
tent such ratings are among different raters engaged in rating the same item. This
is called inter-rater reliability. You will also be concerned with how reproducible
the ratings are if one rater were to rate the same item multiple times. This is called
intra-rater reliability.
FIGURE 13-6: Results of two raters reading the same set of 50 specimens and rating each specimen yes or no. © John Wiley & Sons, Inc.
Looking at Figure 13-6, cell a contains a count of how many scans were rated
yes — there is a tumor — by both Rater 1 and Rater 2. Cell b counts how many
scans were rated yes by Rater 1 but no by Rater 2. Cell c counts how many scans
were rated no by Rater 1 and yes by Rater 2, and cell d shows where Rater 1 and
Rater 2 agreed and both rated the scan no. Cells a and d are considered concordant because the raters agreed, and cells b and c are discordant because the raters disagreed.
Ideally, all the scans would be counted in concordant cells a or d of Figure 13-6,
and discordant cells b and c would contain zeros. A measure of how close the data
come to this ideal is called Cohen's Kappa, and is signified by the Greek lowercase letter kappa: κ. You calculate kappa as κ = 2(ad – bc) / (r1 × c2 + r2 × c1).
If the raters are in perfect agreement, then κ = 1. If you generate completely ran-
dom ratings, you will see a κ = 0. You may think this means κ takes on a positive
value between 0 and 1, but random sampling fluctuations can actually cause κ to
be negative. This situation can be compared to a student taking a true/false test
where the number of wrong answers is subtracted from the number of right
answers as a penalty for guessing. When calculating κ, getting a score less than
zero indicates the interesting combination of being both incorrect and unfortu-
nate, and is penalized!
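As a minimal sketch, you could wrap that formula in a small R function and plug in the four cell counts of any rating table like Figure 13-6:

kappa_2x2 <- function(a, b, c, d) {
  r1 <- a + b; r2 <- c + d    # row totals
  c1 <- a + c; c2 <- b + d    # column totals
  2 * (a * d - b * c) / (r1 * c2 + r2 * c1)
}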
For CIs for κ, you won't find an easy formula, but the fourfold table web page
(https://statpages.info/ctab2x2.html) provides approximate CIs. For the
preceding example, the 95 percent CI is 0.202 to 0.735. This means that for your
two raters, their agreement was 0.514 (95 percent CI 0.202 to 0.735), which sug-
gests that the agreement level was acceptable.
You can construct a similar table to Figure 13-6 for estimating intra-rater reli-
ability. You would do this by having one rater rate the same groups of scans in two
separate sessions. In this case, in the table in Figure 13-6, you'd replace "by Rater" with "in Session" in the row and column labels.
Chapter 14
Analyzing Incidence and
Prevalence Rates in
Epidemiologic Data
E
pidemiology is the study of the causes of health and disease in human
populations. It is sometimes defined as characterizing the three Ds — the
distribution and determinants of human disease (although epidemiology
technically also concerns more positive outcomes, such as human health and
wellness). This chapter describes two concepts central to epidemiology: prevalence
and incidence. Prevalence and incidence are also frequently encountered in other areas of human research. We describe how to calculate incidence rates and
prevalence proportions. Then we concentrate on the analysis of incidence. (For an
introduction to prevalence and to learn how to calculate prevalence ratios, see
Chapter 13.) Later in this chapter, we describe how to calculate confidence inter-
vals around incidence rates and rate ratios, and how to compare incidence rates
between two populations.
Because prevalence is a proportion, it’s analyzed in exactly the same way as any
other proportion. The standard error (SE) of a prevalence ratio can be estimated
by the formula in Chapter 13. Confidence intervals (CIs) for a prevalence estimate
can be obtained from exact methods based on the binomial distribution or from
formulas based on the normal approximation to the binomial distribution. Also,
prevalence can be compared between two or more populations using the chi-
square or Fisher Exact test. For this reason, the remainder of this chapter focuses
on how to analyze incidence rates.
The incidence rate should be estimated by counting events over a narrow enough
interval of time so that the number of observed events is a small fraction of the
total population studied. One year is narrow enough for calculating incidence of
Type II diabetes in adults because 0.02 percent of the adult population develops
diabetes in a year. However, one year isn’t narrow enough to be useful when con-
sidering the incidence of an acute condition like influenza. In influenza and other
infectious diseases, the intervals of interest would be in terms of daily, weekly,
and monthly trends. It’s not very helpful to know that 30 percent of the popula-
tion came down with influenza in a one-year period.
Consider City XYZ, which has a population of 300,000 adults. None of them has
been diagnosed with Type II diabetes. Suppose that in 2023, 30 adults from City
XYZ were newly diagnosed with Type II diabetes. The incidence of adult Type II
diabetes in City XYZ would be calculated with a numerator of 30 cases and a
denominator of 300,000 adults in one year. Using the incidence formula, this
works out to 0.0001 new cases per person-year. As described before, in epidemiology, rates are usually rescaled so that they're expressed as whole numbers, which makes them easier to interpret and envision. For this example, you could express City XYZ's 2023
adult Type II diabetes incidence rate as 1 new case per 10,000 person-years, or as
10 new cases per 100,000 person-years.
Now imagine another city — City ABC — which has a population of 80,000 adults, and, as with City XYZ, none of them had ever been diagnosed with Type II diabetes.
Now, assume that in 2023, 24 adults from City ABC were newly diagnosed with
Type II diabetes. City ABC’s 2023 incidence rate would be calculated as 24 cases in
80,000 individuals in one year, which works out to 24 / 80, 000 or 0.0003 new cases
per person-year. To make the estimate comparable to City XYZ’s estimate, let’s
express City ABC’s estimate as 30 new cases per 100,000 person-years. So, the
2023 adult Type II diabetes incidence rate in City ABC — which is 30 new cases per
100,000 person-years — is three times as large as the 2023 adult Type II diabetes incidence rate in City XYZ (10 new cases per 100,000 person-years).
You may expect that conditions with higher incidence rates would have higher
prevalence than conditions with lower incidence rates. This is true with common
chronic conditions, such as hypertension. But if a condition is acute — including
infectious diseases, such as influenza and COVID-19 — the duration of the condi-
tion may be short. In such a scenario, a high incidence rate may not be paired with
a high prevalence. Relatively rare chronic diseases of long duration — such as
dementia — have low yearly incidence rates, but as human health improves and
humans live longer on average, the prevalence of dementia increases.
2. Divide the lower and upper confidence limits for N by the exposure (E).
Earlier in the chapter, we describe City ABC, which had a population of 80,000
adults without a diagnosis of Type II diabetes. In 2023, 24 new diabetes cases were
identified in adults in City ABC, so the event count (N) is 24, and the exposure (E)
is 80,000 person-years (because we are counting 80,000 persons for one year).
Even though 24 is not that large, let’s use this example to demonstrate calculating
a CI for R. The incidence rate (R) is N /E , which is 24 per 80,000 person-years, or
30 per 100,000 person-years. How precise is this incidence rate?
To answer this, first, you should find the confidence limits for N. Using the
approximate formula, the 95 percent CI around the event count of 24 is 24 – 1.96√24 to 24 + 1.96√24, or 14.4 to 33.6 events. Next, you divide the lower and upper confidence limits of N by the exposure using these formulas: 14.4/80,000 = 0.00018 for the lower limit, and 33.6/80,000 = 0.00042 for the upper limit. Finally, you can express these limits as 18.0 to 42.0 events per 100,000 person-years — the
CI for the incidence rate. Your interpretation would be that City ABC’s 2023 inci-
dence rate for Type II diabetes in adults was 30.0 (95 percent CI 18.0 to 42.0) per
100,000 person-years.
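The same approximate calculation takes only a few lines in R:

n_events <- 24
exposure <- 80000                                         # person-years of observation
ci_count <- n_events + c(-1, 1) * 1.96 * sqrt(n_events)   # about 14.4 to 33.6 events
ci_count / exposure * 100000                              # about 18.0 to 42.0 per 100,000 person-years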
RR = R2/R1 = (N2/E2) / (N1/E1)
For other confidence levels, you can replace the 1.96 in the Q formula with the
appropriate critical z value for the normal distribution.
So, for the 2023 adult Type II diabetes example, you would set N1 = 30, N2 = 24, and RR = 3.0. The equation would be Q = e^(1.96 × √(1/24 + 1/30)), which works out to 1.71, so the 95 percent lower and upper confidence limits would be 3.0/1.71 and 3.0 × 1.71, meaning the CI of the RR would be from 1.75 to 5.13. You would interpret this by saying that the 2023 incidence rate of adult Type II diabetes in City ABC is 3.0 times the rate in City XYZ (95 percent CI 1.75 to 5.13).
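Here's that rate-ratio calculation as a short R sketch:

n1 <- 30; e1 <- 300000                   # City XYZ: new cases and person-years
n2 <- 24; e2 <- 80000                    # City ABC: new cases and person-years
rr <- (n2 / e2) / (n1 / e1)              # 3.0
q  <- exp(1.96 * sqrt(1 / n1 + 1 / n2))  # about 1.71
c(rr / q, rr * q)                        # 95% CI: about 1.75 to 5.13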
For the City ABC and City XYZ adult Type II diabetes 2023 rate comparison, the
observed RR was 3.0, with a 95 percent confidence interval of 1.75 to 5.13. This CI
does not include 1.0 — in fact, it is entirely above 1.0. So, the RR is significantly
greater than 1, and you would conclude that City ABC has a statistically significantly
higher adult Type II diabetes incidence rate than City XYZ (assuming α = 0.05).
To put the formula into words: if the square of the difference is more than
four times the sum, then the event counts are statistically significantly different
at α = 0.05. The value of 4 in this rule approximates 3.84, the chi-square value
corresponding to p = 0.05.
Imagine you learned that in City XYZ, there were 30 fatal car accidents in 2022. In
the following year, 2023, you learned City XYZ had 40 fatal car accidents. You may
wonder: Is driving in City XYZ getting more dangerous every year? Or was the observed
increase from 2022 to 2023 due to random fluctuations? Using the simple rule, you
can calculate (30 – 40)² / (30 + 40) = 100/70 = 1.4, which is less than 4. Having
30 events — which in this case are fatal car accidents — isn’t statistically signifi-
cantly different from having 40 events in the same time period. As you see from
the result, the increase of 10 in one year is likely statistical noise. But had
the number of events increased more dramatically — say from 30 to 50 events —
the increase would have been statistically significant. This is because
(30 – 50)² / (30 + 50) = 400/80 = 5.0, which is greater than 4.
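The rule of thumb is easy to wrap in a one-line R function:

count_rule <- function(n1, n2) (n1 - n2)^2 / (n1 + n2) > 4   # TRUE suggests a significant difference at roughly alpha = 0.05
count_rule(30, 40)   # FALSE: 100/70 = 1.4
count_rule(30, 50)   # TRUE:  400/80 = 5.0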
For example, suppose that you’re designing a study to test whether rotavirus gas-
troenteritis has a higher incidence in City XYZ compared to City ABC. You’ll enroll
an equal number of City XYZ and City ABC residents, and follow them for one year to
see whether they get rotavirus. Suppose that the one-year incidence of rotavirus
in City XYZ is 1 case per 100 person-years (an incidence rate of 0.01 case per
patient-year, or 1 percent per year). You want to have an 80 percent likelihood of
getting a statistically significant result assuming p = 0.05 (you want to set power
at 80 percent and α = 0.05). When comparing the incidence rates, you are only
concerned if they differ by more than 25 percent, which translates to a RR of 1.25.
This means you expect to see 0.01 × 1.25 = 0.0125 cases per patient-year in City ABC.
If you want to use G*Power to do your power calculation (see Chapter 4), under
Test family, choose z tests for population-level tests. Under Statistical test, choose
Proportions: Difference between two independent proportions because the two rates are
independent. Under Type of power analysis, choose A priori: Compute required
sample size – given α, power and effect size, and under the Input Parameters section,
choose two tails so you can test if one is higher or lower than the other. Set Propor-
tion p1 to 0.01 (to represent City XYZ’s incidence rate), Proportion p2 to 0.0125 (to
represent City ABC’s expected incidence rate), α err prob (α) to 0.05, and Power
(1-β err prob) (power) to 0.8 for 80 percent, and keep a balanced Allocation ratio N2/N1 of 1. After clicking Calculate, you'll see that you need at least 27,937 person-years of observation in each group, meaning you'd observe nearly 56,000 participants over a one-year study. The shockingly large target sample size illustrates a challenge
when studying incidence rates of rare illnesses.
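If you don't have G*Power handy, base R's power.prop.test() gives a very similar normal-approximation answer:

power.prop.test(p1 = 0.01, p2 = 0.0125, sig.level = 0.05, power = 0.80)
# n works out to roughly 27,900 per group, or about 56,000 participants in total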
Chapter 15
Introducing Correlation
and Regression
C
orrelation, regression, curve-fitting, model-building — these terms all
describe a set of general statistical techniques that deal with the relation-
ships among variables. Introductory statistics courses usually present only
the simplest form of correlation and regression, equivalent to fitting a straight
line to a set of data. But in the real world, correlations and regressions are seldom
that simple — statistical problems may involve more than two variables, and the
relationship among them can be quite complicated.
The words correlation and regression are often used interchangeably, but they refer
to two different concepts:
» –1: If the variables have an inverse or negative relationship, meaning when one
goes up, the other goes down
Note: The Pearson correlation coefficient measures the extent to which the points
lie along a straight line. If your data follow a curved line, the r value may be low or
zero, as shown in Figure 15-2. All three graphs in Figure 15-2 have the same
amount of random scatter in the points, but they have quite different r values.
Pearson r is based on a straight-line relationship and is too small (or even zero) if
the relationship is nonlinear. So, you shouldn't interpret r = 0 as evidence of lack
of association or independence between two variables. It could indicate only the
lack of a straight-line relationship between the two variables.
FIGURE 15-2: Pearson r is based on a straight-line relationship. © John Wiley & Sons, Inc.
Even when X and Y are completely independent, your calculated r value is almost
never exactly zero. One way to test for a statistically significant association
between X and Y is to test whether r is statistically significantly different from zero
by calculating a p value from the r value (see Chapter 3 for a refresher on p values).
Either way, you get p = 0.098, which is greater than 0.05. At α = 0.05, the r value
of 0.500 is not statistically significantly different from zero (see Chapter 12 for
more about α).
Here are the steps for calculating 95 percent confidence limits around an observed r value of 0.50 for a sample of 12 participants (N = 12). First, transform r to z using the Fisher z transformation:
z = (1/2) log[(1 + 0.5)/(1 – 0.5)] = 0.549
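Here's a compact sketch of the whole procedure in R; atanh() and tanh() are the Fisher z transformation and its inverse:

r <- 0.50; N <- 12
z  <- atanh(r)                     # 0.549
se <- 1 / sqrt(N - 3)              # standard error of z
ci_z <- z + c(-1, 1) * 1.96 * se   # confidence limits on the z scale
tanh(ci_z)                         # back-transform: about -0.104 to 0.835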
Notice that the 95 percent confidence interval goes from –0.104 to 0.835, a range
that includes the value zero. This means that the true r value could indeed be zero,
which is consistent with the non-significant p value of 0.098 that you obtained
from the significance test of r in the preceding section.
For example, if you want to compare an r1 value of 0.4 based on an N1 of 100 par-
ticipants with an r2 value of 0.6 based on an N2 of 150 participants, you perform the
following steps:
z1 = (1/2) log[(1 + 0.4)/(1 – 0.4)] = 0.424
z2 = (1/2) log[(1 + 0.6)/(1 – 0.6)] = 0.693
Then divide the difference between the two z values (0.693 – 0.424 = 0.269) by the standard error of that difference: 0.269/0.131 = 2.05
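A minimal R sketch of this comparison, including the p value for the resulting test statistic:

r1 <- 0.4; n1 <- 100
r2 <- 0.6; n2 <- 150
z_stat <- (atanh(r2) - atanh(r1)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # about 2.05
2 * pnorm(abs(z_stat), lower.tail = FALSE)                              # two-sided p value, roughly 0.04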
For a sample-size calculation for a correlation coefficient, you need to plug in the
following design parameters of the study into the equation:
» The desired α level of the test: The p value that’s considered significant
when you’re testing the correlation coefficient (usually 0.05).
» The desired power of the test: The probability of rejecting the null hypoth-
esis if the alternative hypothesis is true (usually set to 0.8 or 80 percent).
It may be challenging to select an effect size, and context matters. One approach
would be to start by referring to Figure 15-1 to select a potential effect size, then
do a sample-size calculation and see the result. If the result requires more sam-
ples than you could ever enroll, then try making the effect size a little larger and
redoing the calculation until you get a more reasonable answer.
5. Under Effect Size, which is the expected difference between r1 and r2, enter the
effect size you expect.
8. Click Calculate.
The answer will appear under Total sample size. As an example, if you enter these parameters and an effect size of 0.2, the total sample size will be 191.
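If you prefer R to G*Power, the pwr package offers an equivalent calculation, assuming you're testing a single correlation of about 0.2 against zero:

library(pwr)
pwr.r.test(r = 0.2, sig.level = 0.05, power = 0.80)   # n comes out to roughly 190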
» Obtain numerical values for the parameters that appear in the regres-
sion model formula. Chapter 19 explains how to make a regression model
based on a theoretical rather than known statistical distribution (described in
Chapter 3). Such a model is used to develop estimates like the ED50 of a drug,
which is the dose that produces one-half the maximum effect.
If you have only one independent variable, it’s often designated by X, and the
dependent variable is designated by Y. If you have more than one independent
variable, variables are usually designated by letters toward the end of the alphabet
(W, X, Y, Z). Parameters are often designated by letters toward the beginning of the alphabet, such as a, b, c, and d.
In mathematical texts, you may see a regression model with three predictors
written in one of several ways, such as
In practical work, using the actual names of the variables from your data and
using meaningful terms for parameters is easiest to understand and least error-
prone. For example, consider the equation for the first-order elimination of an
injected drug from the blood: Conc = Conc0 × e^(–ke × Time). This form, with its short but meaningful names for the two variables, Conc (blood concentration) and Time (time after injection), and the two parameters, Conc0 (concentration at Time 0) and ke (elimination rate constant), would probably be more meaningful to a reader than Y = a × e^(–bX).
There are different terms for different types of regression. In this book, we refer
to regression models with one predictor in the model as simple regression, or uni-
variate regression. We refer to regression models with multiple predictors as mul-
tivariate regression.
In the next section, we explain how the type of outcome variable determines
which regression to select, and after that, we explain how the mathematical form
of the data influences the type of regression you choose.
» Ordinary regression (also called linear regression) is used when the outcome
is a continuous variable whose random fluctuations are governed by the
normal distribution (see Chapters 16 and 17).
» Survival regression is used when the outcome is a time to event, often called a survival time. Part 6 covers the entire topic of survival analysis, and Chapter 23 focuses on survival regression.
In a linear function, you multiply each predictor variable by a parameter and then
add these products to give the predicted value. You can also have one more param-
eter that isn’t multiplied by anything — it’s called the constant term or the inter-
cept. Here are some linear functions:
» Y = a + bX
» Y = a + bX + cX² + dX³
» Y = a + bX + c × Log(W) + dX/Cos(Z)
In these examples, Y is the dependent variable or the outcome, and X, W, and Z are
the independent variables or predictors. Also, a, b, c, and d are parameters.
Y = a / (b + e^(–cX))
Chapter 16
Getting Straight Talk on
Straight-Line Regression
C
hapter 15 refers to regression analyses in a general way. This chapter
focuses on the simplest type of regression analysis: straight-line regression.
You can visualize it as fitting a straight line to the points in a scatter plot
from a set of data involving just two variables. Those two variables are generally
referred to as X and Y. The X variable is formally called the independent variable (or
the predictor or cause). The Y variable is called the dependent variable (or the out-
come or effect).
» You’ve made a scatter plot of the two variables and the data points seem to
lie, more or less, along a straight line (as shown in Figures 16-1a and 16-1b).
You shouldn’t try to fit a straight line to data that appears to lie along a curved
line (as shown in Figures 16-1c and 16-1d).
» The data points appear to scatter randomly around the straight line over the
entire range of the chart, with no extreme outliers (as shown in Figures 16-1a
and 16-1b).
FIGURE 16-1: Straight-line regression is appropriate for both strong and weak linear relationships (a and b), but not for nonlinear (curved-line) relationships (c and d). © John Wiley & Sons, Inc.
» You want to know the value of the slope and/or intercept (also referred to as
the Y intercept) of a line fitted through the X and Y data points.
» You want to be able to predict the value of Y if you know the value of X.
In straight-line regression, our goal is to develop the best-fitting line for our data.
Using least-squares as a guide, the best-fitting line through a set of data is the
one that minimizes the sum of the squares (SSQ) of the residuals. Residuals are the
vertical distances of each point from the fitted line, as shown in Figure 16-2.
FIGURE 16-2: On average, a good-fitting line has smaller residuals than a bad-fitting line. © John Wiley & Sons, Inc.
SSQ = Σ (a + bXi – Yi)²

If you're good at first-semester calculus, you can find the values of a and b that minimize SSQ by setting the partial derivatives of SSQ with respect to a and b equal to 0. If you stink at calculus, trust that this leads to these two simultaneous equations:

a(N) + b(ΣX) = ΣY
a(ΣX) + b(ΣX²) = ΣXY

Solving these equations gives the least-squares estimates of the intercept and slope:

a = [ (ΣY)(ΣX²) – (ΣX)(ΣXY) ] / [ N(ΣX²) – (ΣX)² ]
b = [ ΣXY – a(ΣX) ] / ΣX²
See Chapter 2 if you don’t feel comfortable reading the mathematical notations or
expressions in this section.
Usually, the data consist of two columns of numbers, one representing the
independent variable and the other representing the dependent variable.
2. Tell the software which variable is the independent variable and which
one is the dependent variable.
Depending on the software, you may type in the variable names, or pick them
from a menu or list in your file.
3. If the software offers output options, tell it that you want it to output
these results:
• Regression table
• Goodness-of-fit measures
Then go look for the output. You should see the output you requested in
Step 3.
Consider how blood pressure (BP) is related to body weight. It may be reasonable to suspect that people who weigh more have higher BP. If you test this hypothesis by recording the weight (in kg) and systolic blood pressure (SBP, in mmHg) of 20 participants, you may get data like the following:
Participant   Weight (kg)   SBP (mmHg)
1   74.4   109
2   85.1   114
3   78.3   94
4   77.2   109
5   63.8   104
6   77.9   132
7   78.9   127
8   60.9   98
9   75.6   126
10   74.5   126
11   82.2   116
12   99.8   121
13   78.0   111
14   71.8   116
15   90.2   115
16   105.4   133
17   100.4   128
18   80.9   128
19   81.8   105
20   109.0   127
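As a hedged sketch (the variable names Wgt and SBP are assumptions chosen to match the output described later in this chapter), here's how you could enter these 20 observations and fit the straight line in R:

Wgt <- c(74.4, 85.1, 78.3, 77.2, 63.8, 77.9, 78.9, 60.9, 75.6, 74.5,
         82.2, 99.8, 78.0, 71.8, 90.2, 105.4, 100.4, 80.9, 81.8, 109.0)
SBP <- c(109, 114, 94, 109, 104, 132, 127, 98, 126, 126,
         116, 121, 111, 116, 115, 133, 128, 128, 105, 127)
model <- lm(SBP ~ Wgt)          # least-squares straight-line regression
summary(model)                  # coefficients table, residual summary, and goodness-of-fit measures
plot(Wgt, SBP); abline(model)   # scatter plot with the fitted line drawn through it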
FIGURE 16-3: Scatter plot of SBP versus body weight. © John Wiley & Sons, Inc.
You can also tell that there aren’t any higher-weight participants with a very low
SBP, because the lower-right part of the graph is rather empty. But this relation-
ship isn’t completely convincing, because several participants in the lower weight
range of 70 to 80 kg have SBPs over 125 mmHg.
A correlation analysis (described in Chapter 15) will tell you how strong this type
of association is, as well as its direction (which is positive in this case). The results
of a correlation analysis help you decide whether or not the association is likely
due to random fluctuations. Assuming it is not, proceeding to a regression analy-
sis provides you with a mathematical formula that numerically expresses the
relationship between the two variables (which are weight and SBP in this example).
» A statement of what you asked the program to do (the code you ran for the
regression)
» A summary of the residuals, including graphs that display the residuals and
help you assess whether they’re normally distributed
The actual equation of the straight line is not SBP = weight but, more accurately, SBP = a + b × weight, with the a (intercept) and b (slope) parameters left out of the model formula you give the software. This is a reminder that although the goal is to evaluate the SBP-versus-weight model conceptually, in reality this relationship will be numerically different for each data set you use.
Evaluating residuals
The residual for a point is the vertical distance of that point from the fitted line.
It's calculated as Residual = Y – (a + bX), where a and b are the intercept and
slope of the fitted straight line, respectively.
Most regression software outputs several measures of how the data points scatter
above and below the fitted line, which provides an idea of the size of the residuals
(see “Summary statistics for the residuals” for how to interpret these measures).
The residuals for the sample data are shown in Figure 16-5.
FIGURE 16-5: Scattergram of SBP versus weight, with the fitted straight line and the residuals of each point from the line. © John Wiley & Sons, Inc.
» The minimum and maximum values: These are labeled as Min and Max,
respectively, and represent the two largest residuals, or the two points that lie
farthest away from the least-squares line in either direction. The minimum is
negative, indicating it is below the line, while the positive maximum is above
the line. The minimum is almost 21 mmHg below the line, while the maximum
lies about 17 mmHg above the line.
» The first and third quartiles: These are labeled 1Q and 3Q on the output. Looking under 1Q, which is the first quartile, you can tell that about 25 percent
of the data points (which would be 5 out of 20) lie more than 4.7 mmHg below
the fitted line. For the third quartile results, you see that another 25 percent
lie more than 6.5 mmHg above the fitted line. The remaining 50 percent of the
points lie within those two quartiles.
» The median: Labeled Median on the output, a median of –3.4 tells you that half
of the residuals, which is 10 of the 20 data points, are less than –3.4, and half are
greater than –3.4. The negative sign means the median lies below the fitted line.
Note: The mean isn’t included in these summary statistics because the mean of
the residuals is always exactly 0 for any kind of regression that includes an
intercept term.
The residual standard error, often called the root-mean-square (RMS) error in regres-
sion output, is a measure of how tightly or loosely the points scatter above or
below the fitted line. You can think of it as the standard deviation (SD) of the resid-
uals, although it’s computed in a slightly different way from the usual SD of a set
of numbers: RMS uses N – 2 instead of N – 1 in the denominator of the SD formula. At the bottom of Figure 16-4, the Residual standard error is given as 9.838 mmHg. You can think of it as another summary statistic for the residuals.
As stated at the beginning of this section, you calculate the residual for each point
by subtracting the predicted Y from the observed Y. As shown in Figure 16-6, a
residuals versus fitted graph displays the values of the residuals plotted along the Y
axis and the predicted Y values from the fitted straight line plotted along the X
axis. A normal Q-Q graph shows the standardized residuals, which are the residuals
divided by the RMS value, along the Y axis, and theoretical quantiles along the X
axis. Theoretical quantiles are what you’d expect the standardized residuals to be if
they were exactly normally distributed.
Together, the two graphs shown in Figure 16-6 provide insight into whether your
data conforms to the requirements for straight-line regression:
» Your data must lie above and below the line randomly across the whole
range of data.
» The average amount of scatter must be fairly constant across the whole range
of data.
You need years of experience examining residual plots before you can interpret
them confidently, so don’t feel discouraged if you can’t tell whether your data
complies with the requirements for straight-line regression from these graphs.
Here’s how we interpret them, allowing for other biostatisticians to disagree:
» We read the residuals versus fitted chart in Figure 16-6 to show the points
lying equally above and below the fitted line, because this appears true
whether you’re looking at the left, middle, or right part of the graph.
» If the residuals are normally distributed, then in the normal Q-Q chart in
Figure 16-6, the points should lie close to the dotted diagonal line, and
shouldn’t display any overall curved shape. Our opinion is that the points
follow the dotted line pretty well, so we’re not concerned about lack of
normality in the residuals.
For straight-line regression, the coefficients table has two rows that correspond
with the two parameters of the straight line:
» The intercept row: This row is labeled (Intercept) in Figure 16-4, but can be
labeled Intercept or Constant in other software.
» The slope row: This row is usually labeled with the name of the independent
variable in your data, so in Figure 16-4, it is named Wgt. It may be labeled Slope
in some programs.
In the example shown in Figure 16-4, the estimated value of the intercept is
76.8602 mmHg, and the estimated value of the slope is 0.4871 mmHg/kg.
» The intercept value of 76.9 mmHg means that a person who weighs 0 kg is
predicted to have a SBP of about 77 mmHg. But nobody weighs 0 kg! The
intercept in this example (and in many straight-line relationships in biology)
has no physiological meaning at all, because 0 kg is completely outside the
range of possible human weights.
» The slope value of 0.4871 mmHg/kg does have a real-world meaning. It means
that every additional 1 kg of weight is associated with a 0.4871 mmHg increase
in SBP. If we multiply both estimates by 10, we could say that every additional
10 kg of body weight is associated with almost a 5 mmHg SBP increase.
Because data from your sample always have random fluctuations, any estimate
you calculate from your data will be subject to random fluctuations, whether it is
a simple summary statistic or a regression coefficient. The SE of your estimate
tells you how precisely you were able to estimate the parameter from your data,
which is very important if you plan to use the value of the slope (or the intercept)
in a subsequent calculation.
» SEs always have the same units as the coefficients themselves. In the
example shown in Figure 16-4, the SE of the intercept has units of mmHg,
and the SE of the slope has units of mmHg/kg.
If you know the value of the SE, you can easily calculate a confidence interval (CI)
around the estimate (see Chapter 10 for more information on CIs). These expres-
sions provide a very good approximation of the 95 percent confidence limits
(abbreviated CL), which mark the low and high ends of the CI around a regression
coefficient: Lower CL = coefficient − 1.96 × SE, and Upper CL = coefficient + 1.96 × SE.
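As a quick R sketch (using the fitted model object from earlier in this chapter), you can compute those limits yourself or let the software do it exactly:
slope <- coef(summary(model))["Wgt", "Estimate"]
se    <- coef(summary(model))["Wgt", "Std. Error"]
c(lower = slope - 1.96 * se, upper = slope + 1.96 * se)   # approximate 95% CL
confint(model)                                            # R's exact confidence limits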
The p value
A column in the regression tables (usually the last one) contains the p value,
which indicates whether the regression coefficient is statistically significantly
different from 0. In Figure 16-4, it is labeled Pr(>|t|), but it can be called a variety
of other names, including p value, p, and Signif.
In Figure 16-4, the p value for the intercept is shown as 5.49e-05, which is equal
to 0.0000549 (see the description of scientific notation in Chapter 2). Assuming
we set α at 0.05, this p value is much less than 0.05, so the intercept is statistically
significantly different from zero. But recall that in this example (and usually in
straight-line regression), the intercept doesn't have any real-world importance.
It equals the estimated SBP for a person who weighs 0 kg, which is nonsensical,
so you probably don't care whether it's statistically significantly different from
zero or not.
If you want to test for a significant correlation between two variables at α = 0.05, you
can look at the p value for the slope of the least-squares straight line. If it’s less
than 0.05, then the X and Y variables are also statistically significantly correlated.
The p value for the significance of the slope in a straight-line regression is always
exactly the same as the p value for the correlation test of whether r is statistically
significantly different from zero, as described in Chapter 15.
The r² value is always positive, because the square of any number is never negative.
But the correlation coefficient can be positive or negative, depending on whether the
fitted line slopes upward or downward. If the fitted line slopes downward, give your
r value a negative sign when you take the square root of r².
Why did the program give you r 2 instead of r in the first place? It’s because r 2 is a
useful estimate called the coefficient of determination. It tells you what percent of
the total variability in the Y variable can be explained by the fitted line.
» An r 2 value of 1 means that the points lie exactly on the fitted line, with no
scatter at all.
» An r 2 value of 0.3 (as in this example) means that 30 percent of the variance in
the dependent variable is explainable by the independent variable in this
straight-line model.
Note: Figure 16-4 also lists the Adjusted R-squared at the bottom right. We talk
about the adjusted r 2 value in Chapter 17 when we explain multiple regression, so
for now, you can just ignore it.
The F statistic
The last line of the sample output in Figure 16-4 presents the F statistic and asso-
ciated p value (under F-statistic). These estimates address this question: Is the
straight-line model any good at all? In other words, how much better is the
straight-line model, which contains an intercept and a predictor variable, at pre-
dicting the outcome compared to the null model?
The null model is a model that contains only a single parameter representing a
constant term with no predictor variables at all. In this case, the null model would
only include the intercept.
Under α = 0.05, if the p value associated with the F statistic is less than 0.05, then
adding the predictor variable to the model makes it statistically significantly bet-
ter at predicting SBP than the null model.
For this example, the p value of the F statistic is 0.013, which is statistically sig-
nificant. It means using weight as a predictor of SBP is statistically significantly
better than just guessing that everyone in the data set has the mean SBP (which is
all the null model can do).
Some statistics programs show the actual equation of the best-fitting straight
line. If yours doesn't, don't worry. Just substitute the coefficients of the intercept
and slope for a and b in the straight-line equation: Y = a + b × X.
Then you can use this equation to predict someone's SBP if you know their weight.
So, if a person weighs 100 kilograms, you can estimate that that person's SBP will
be around 76.9 + 100 × 0.487, which is 76.9 + 48.7, or about 125.6 mmHg. Your
prediction probably won't be exactly on the nose, but it should be better than not
using a predictive model and just guessing.
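In R, the same prediction can be made from the fitted model object (a sketch, assuming the model was fit as lm(SBP ~ Wgt, ...) as earlier in this chapter):
predict(model, newdata = data.frame(Wgt = 100))   # about 125.6 mmHg
76.9 + 0.487 * 100                                # the same arithmetic by hand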
How far off will your prediction be? The residual SE provides a unit of measure-
ment to answer this question. As we explain in the earlier section “Summary sta-
tistics for the residuals,” the residual SE indicates how much the individual points
tend to scatter above and below the fitted line. For the SBP example, this number
is about ±9.8 mmHg, so you can expect your prediction to be within about 10 mmHg most of
the time.
Always look at a scatter plot of your data to make sure outliers aren’t present.
Examine the residuals to ensure they are distributed normally above and
below the fitted line.
» Do you want to show that the two variables are statistically significantly
associated? If so, you want to calculate the sample size required to achieve a
certain statistical power for the significance test (see Chapter 3 for an introduc-
tion to statistical power).
» Do you want to estimate the value of the slope (or intercept) to within a
certain margin of error? If so, you want to calculate the sample size required
to achieve a certain precision in your estimate.
» The number of data points: More data points give you greater precision. SEs
vary inversely with the square root of the sample size. Alternatively, the
required sample size varies inversely with the square of the desired SE. So, if
you quadruple the sample size, you cut the SE in half. This is a very important
and generally applicable principle.
» Tightness of the fit of the observed points to the line: The closer the data
points hug the line, the more precisely you can estimate the regression
coefficients. The effect is directly proportional, in that twice as much Y-scatter
of the points produces twice as large a SE in the coefficients.
» How the data points are distributed across the range of the X variable:
This effect is hard to quantify, but in general, having the data points spread
out evenly over the entire range of X produces more precision than having
most of them clustered near the middle of the range.
Given these factors, how do you strategically design a study and gather data for a
linear regression where you're mainly interested in estimating a regression coef-
ficient to within a certain precision? One practical approach is to first conduct a
study that is small and underpowered, called a pilot study, to estimate the SE of the
coefficient you care about.
But the SE from a pilot study usually isn't small enough (unless you're a lot luck-
ier than we've ever been). That's when you can reach for the square-root law as a
remedy! Follow these steps to calculate the total sample size you need to get the
precision you want:
1. Divide the SE that you got from your pilot study by the SE you want your
full study to achieve.
2. Square that ratio.
3. Multiply the square of the ratio by the sample size of your pilot study.
Imagine that you want to estimate the slope to a precision, or SE, of ±5. If a pilot
study of 20 participants gives you a SE of ±8.4 units, then the ratio is 8.4/5, which
is 1.68. Squaring this ratio gives you 2.82, which tells you that to get an SE of 5,
you need 2.82 × 20, or about 56 participants. And because we assume you took our
advice, you've already recruited the first 20 participants for your pilot study.
Now, you only have to recruit another 36 participants to have a total of 56.
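Here's the whole calculation as a short R sketch, using the numbers from this example:
pilot_n   <- 20      # participants in the pilot study
pilot_se  <- 8.4     # SE of the slope from the pilot study
target_se <- 5       # SE you want the full study to achieve
ratio <- pilot_se / target_se     # 1.68
ratio^2 * pilot_n                 # about 56 participants needed in all
ratio^2 * pilot_n - pilot_n       # about 36 more to recruit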
Admittedly, this estimation is only approximate. But it does give you at least a
ballpark idea of how big your sample size needs to be to achieve the desired
precision.
Chapter 17
More of a Good Thing:
Multiple Regression
Chapter 15 introduces the general concepts of correlation and regression, two
related techniques for detecting and characterizing the relationship between
two or more variables. Chapter 16 describes the simplest kind of regression —
fitting a straight line to a set of data consisting of one independent variable (the
predictor) and one dependent variable (the outcome). The formula relating the pre-
dictor to the outcome, known as the model, is of the form Y = a + b × X, where Y is the
outcome, X is the predictor, and a and b are parameters (also called regression
coefficients). This kind of regression is usually the only one you encounter in an
introductory statistics course, because it is a relatively simple way to do a regres-
sion. It’s good for beginners to learn!
The same idea can be extended to multiple regression models containing more
than one predictor (which estimates more than two parameters). For two predic-
tor variables, you’re fitting a plane, which is a flat sheet. Imagine fitting a set of
points to this plane in three dimensions (meaning you’d be adding a Z axis to your
X and Y). Now, extend your imagination. For more than two predictors, in regres-
sion, you’re fitting a hyperplane to points in four-or-more-dimensional space.
Hyperplanes in multidimensional space may sound mind-blowing, but luckily for
us, the actual formulas are simple algebraic extensions of the straight-line
formulas.
In the following sections, we define some basic terms related to multiple regres-
sion, and explain when you should use it.
In textbooks and published articles, you may see regression models written in
various ways:
» In practical research work, the variables are often given meaningful names,
like Age, Gender, Height, Weight, Glucose, and so on.
Figuring out the best way to introduce categorical predictors into a multiple
regression model is always challenging. You have to set up your data the right
way, or you’ll get results that are either wrong, or difficult to interpret properly.
Following are two important factors to consider.
Imagine that you create a one-way frequency table of a Primary Diagnosis vari-
able from a sample of study participant data. Your results are: Hypertension: 73,
Diabetes: 35, Cancer: 1, and Other: 10. To deal with the sparse Cancer level, you
may want to create another variable in which Cancer is collapsed together with
Other (which would then have 11 rows). Another approach is to create a binary
variable with yes/no levels, such as: Hypertension: 73 and No Hypertension: 46.
But binary variables don't take into account the other levels. You could also make
an indicator variable for each level, an approach we describe later in this section.
Similarly, if your model has two categorical variables with an interaction term (like
Setting + Primary Diagnosis + Setting * Primary Diagnosis), you should prepare a
two-way cross-tabulation of the two variables first (in our example, Setting by
Primary Diagnosis). This lets you check that you have enough participants in each
cell of the table to support the analysis. See Chapter 12
for details about cross-tabulations.
It is best to code binary variables as 0 for not having the attribute or state, and 1
for having the attribute or state. So a binary variable named Cancer should be
coded as Cancer = 1 if the participant has cancer, and Cancer = 0 if they do not.
For categorical variables with more than two levels, it’s more complicated. Even if
you recode the categorical variable from containing characters to a numeric code,
this code cannot be used in regression unless we want to model the category as an
ordinal variable. Imagine a variable coded as 1 = graduated high school, 2 = gradu-
ated college, and 3 = obtained post-graduate degree. If this variable were entered
as a predictor in regression, it would assume equal steps going from code 1 to code 2,
and from code 2 to code 3. Anyone who has applied to college or gone to graduate
school knows these steps are not equal! To solve this problem, you could select a
reference level and create an indicator (dummy) variable for each of the other
levels, as shown in Table 17-1.
TABLE 17-1 Indicator (Dummy) Coding of the PrimaryDx Variable
StudyID PrimaryDx HTN Diab Cancer OtherDx
1 Hypertension 1 0 0 0
2 Diabetes 0 1 0 0
3 Cancer 0 0 1 0
4 Other 0 0 0 1
5 Diabetes 0 1 0 0
Table 17-1 shows theoretical coding for a data set containing the variables StudyID
(for participant ID) and PrimaryDx (for participant primary diagnosis). As shown
in Table 17-1, you take each level and make an indicator variable for it: Hyperten-
sion is HTN, diabetes is Diab, cancer is Cancer, and other is OtherDx. Instead of
including the variable PrimaryDx in the model, you’d include the indicator vari-
ables for all levels of PrimaryDx except the reference level. So, if the reference level
you selected for PrimaryDx was hypertension, you’d include Diab, Cancer, and
OtherDx in the regression, but would not include HTN. To contrast this to the edu-
cation example, in the set of variables in Table 17-1, participants can have a 1 for
one or more indicator variables or just be in the reference group. However, with
the education example, they can only be coded at one level, or be in the reference
group.
Don’t forget to leave the reference-level indicator variable out of the regression,
or your model will break!
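In R, you don't have to build the indicator variables by hand; a factor with a declared reference level does it for you. Here's a minimal sketch (the data-frame name dx is our assumption):
dx$PrimaryDx <- relevel(factor(dx$PrimaryDx), ref = "Hypertension")
model.matrix(~ PrimaryDx, data = dx)   # indicator columns for every level except the reference
Any regression you then fit with PrimaryDx on the right-hand side of the formula uses exactly this coding.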
Imagine that you are interested in whether the outcome of systolic blood pressure
(SBP) can be predicted by age, body weight, or both. Table 17-2 shows a small data
file with variables that could address this research question that we use through-
out the remainder of this chapter. It contains the age, weight, and SBP of 16 study
participants from a clinical population.
TABLE 17-2 Sample Age, Weight, and Systolic Blood Pressure Data for a
Multiple Regression Analysis
Participant ID Age (years) Weight (kg) SBP (mmHg)
1 60 58 117
2 61 90 120
3 74 96 145
4 57 72 129
5 63 62 132
6 68 79 130
7 66 69 110
8 77 96 163
9 63 96 136
10 54 54 115
11 63 67 118
12 76 99 132
13 60 74 111
14 61 73 112
15 65 85 147
16 79 80 138
FIGURE 17-1:
A scatter chart
matrix for a set
of variables prior
to multiple
regression.
© John Wiley & Sons, Inc.
These charts can give you insight into which variables are associated with each
other, how strongly they’re associated, and their direction of association. They
also show whether your data have outliers. The scatter charts in Figure 17-1
indicate that there are no extreme outliers in the data. Each scatter chart also
shows some degree of positive correlation (as described in Chapter 15). In fact,
looking at Figure 17-1, you may guess that the charts correspond to correlation
coefficients between 0.5 and 0.8. In addition to the scatter charts, you can also
have your software calculate correlation coefficients (r values) between each pair
of variables. For this example, here are the results: r = 0.654 for Age versus Weight,
r = 0.661 for Age versus SBP, and r = 0.646 for Weight versus SBP.
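A quick R sketch produces both the scatter-chart matrix and the correlation coefficients (the data-frame name bp2 is our assumption; the variable names match Table 17-2):
pairs(bp2[, c("Age", "Weight", "SBP")])   # scatter chart matrix like Figure 17-1
cor(bp2[, c("Age", "Weight", "SBP")])     # r for every pair of variables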
1. Assemble your data into a file with one row per participant and one
column for each variable you want in the model.
2. Tell the software which variable is the outcome and which are the
predictors.
3. Specify whatever optional output you want from the software, which
could include graphs, summaries of the residuals (observed minus
predicted outcome values), and other useful results.
Now, you should retrieve the output, and look for the optional output you
requested.
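In R, those steps boil down to a couple of lines (a sketch; the data-frame name bp2 matches the one assumed above):
fit <- lm(SBP ~ Age + Weight, data = bp2)   # SBP is the outcome; Age and Weight are the predictors
summary(fit)                                # coefficients table, residual SE, R-squared, F statistic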
» Code: The first line starting with Call: reflects back the code run to execute the
regression, which contains the linear model using variable names: SBP ~
Age + Weight.
• Estimate: The estimated value of the parameter, which tells you how much
the outcome variable changes when the corresponding variable increases
by exactly 1.0 unit, holding all the other variables constant. For example, the
model predicts that if all participants have the same weight, every addi-
tional year of age is associated with an increase in SBP of 0.84 mmHg.
• Standard error: The standard error (SE) is the precision of the estimate,
and is in the column labeled Std. Error. The SE for the Age coefficient is
0.52 mmHg per year, indicating the level of uncertainty around the
0.84 mmHg estimate.
• t value: The t value (which is labeled t value) is the value of the parameter
divided by its SE. For Age, the t value is 0.8446/0.5163, or 1.636.
» Model fit statistics: These are calculations that describe how well the model
fits your data overall.
• F statistic: The F statistic and associated p value (on the last line of the
output) indicate whether the model predicts the outcome statistically
significantly better than a null model. A null model contains only the
intercept term and no predictor variables at all. The very low p value
(0.0088) indicates that age and weight together predict SBP statistically
significantly better than the null model.
» Predicted values for the dependent variable for each participant. This can be
output either as a listing, or as a new variable placed into your data file.
» Residuals (observed minus predicted value) for each participant. Again, this
can be output either as a listing, or as a new variable placed into your data file.
» The amount of variability in the residuals is fairly constant, and not dependent
on the value of the dependent variable.
FIGURE 17-3:
Diagnostic
graphs from a
regression.
© John Wiley & Sons, Inc.
» The residual SE is the average scatter of the observed points from the fitted
model; you want the points to lie close to the fitted values. As shown in Figure 17-2, the
residual SE is about 11 mmHg.
Figure 17-4 shows another way to judge how well the model predicts the outcome.
It’s a graph of observed and predicted values of the outcome variable, with a
superimposed identity line (Observed = Predicted). Your program may offer this
observed versus predicted graph, or you can generate it from the observed and pre-
dicted values of the dependent variable. For a perfect prediction model, the points
would lie exactly on the identity line. The correlation coefficient of these points is
the multiple r value for the regression.
FIGURE 17-4:
Observed versus
predicted
outcomes for the
model SBP ~ Age
+ Weight, for the
data in
Table 17-2.
© John Wiley & Sons, Inc.
If the estimate of the slope for the interaction term has a statistically significant
p value, then the null hypothesis of no interaction is rejected, and the two variables
are interpreted to have a significant interaction. If the sign on the interaction term
is positive, it is a synergistic interaction, and if it is negative, it is called an anti-
synergistic or antagonistic interaction.
In the example from Table 17-2, there's a statistically significant positive correla-
tion between each predictor and the outcome. We figured this out when running
the correlations for Figure 17-1, but you could check our work by using the data in
Table 17-2 in a straight-line regression, as described in Chapter 16. In contrast,
the multiple regression output in Figure 17-2 shows that neither Age nor Weight
are statistically significant in the model, meaning neither has regression coeffi-
cients that are statistically significantly different from zero! Why are they associ-
ated with the outcome in correlation but not multiple regression analysis?
The answer is collinearity. In the regression world, the term collinearity (also
called multicollinearity) refers to a strong correlation between two or more of the
predictor variables. If you run a correlation between Age and Weight (the two pre-
dictors), you'll find that they're statistically significantly correlated with each
other. This situation is what can make a predictor that is clearly correlated with
the outcome on its own lose its statistically significant p value when it is placed
alongside the other predictor in a multiple regression model.
The problem with collinearity is that you cannot tell which of the two predictor
variables is actually influencing the outcome more, because they are fighting over
explaining the variability in the dependent variable. Although models with col-
linearity are valid, they are hard to interpret if you are looking for cause-and-
effect relationships, meaning you are doing causal inference. Chapter 20 provides
philosophical guidance on dealing with collinearity in modeling.
Chapter 18
A Yes-or-No Proposition:
Logistic Regression
You can use logistic regression to analyze the relationship between one or
more predictor variables (the X variables) and a categorical outcome vari-
able (the Y variable). Typical categorical outcomes include the following
two-level variables (which are also called binary or dichotomous):
» To make yes or no predictions about the outcome that take into account the
consequences of false-positive and false-negative predictions. For example,
you can generate a tentative cancer diagnosis from a set of observations and
lab results using a formula that balances the different consequences of a
false-positive versus a false-negative diagnosis.
» To see how one predictor influences the outcome after adjusting for the
influence of other variables. One example is to see how the number of
minutes of exercise per day influences the chance of having a heart attack
after controlling for the effects of age, gender, lipid levels, and other
patient characteristics that could influence the outcome.
TABLE 18-1 Radiation Dose and Mortality Outcome (1 = died, 0 = survived)
Dose (REM) Died Dose (REM) Died
0 0 433 0
10 0 457 1
31 0 559 1
82 0 560 1
92 0 604 1
107 0 632 0
142 0 686 1
173 0 691 1
175 0 702 1
232 0 705 1
266 0 774 1
299 0 853 1
303 1 879 1
326 0 915 1
404 1 977 1
How can you analyze these data with logistic regression? First, make a scatter plot
(see Chapter 16) with the predictor — the dose — on the X axis, and the outcome
of death on the Y axis, as shown in Figure 18-1a.
FIGURE 18-1:
Dose versus
mortality from
Table 18-1: each
individual’s
data (a) and
grouped (b).
© John Wiley & Sons, Inc.
In Figure 18-1a, because the outcome variable is binary, the points are restricted
to two horizontal lines, making the graph difficult to interpret. You can get a bet-
ter picture of the dose-lethality relationship by grouping the doses into intervals.
In Figure 18-1b, we grouped the doses into 200-REM intervals (see Chapter 9),
and plotted the fraction of individuals in each interval who died. Clearly,
Figure 18-1b shows that the chance of dying increases with increasing dose.
If you have a binary outcome, you need to fit a function that has an S shape. The
formula calculating Y must be an expression involving X that — by design — can
never produce a Y value outside of the range from 0 to 1, no matter how large or
small X may become.
Of the many mathematical expressions that produce S-shaped graphs, the logistic
function is ideally suited to this kind of data. In its simplest form, the logistic
function is written like this: Y = 1/(1 + e^−X), where e is the mathematical con-
stant 2.718. . ., the base of the natural logarithm (see Chapter 2). We use e to repre-
sent this number for the rest of the chapter. Figure 18-2a shows the shape of the
logistic function.
The logistic function shown in Figure 18-2 can be made more versatile for repre-
senting observed data by being generalized. The logistic function is generalized by
adding two adjustable parameters named a and b, like this: Y = 1/(1 + e^−(a + bX)).
FIGURE 18-2:
The first graph (a)
shows the shape
of the logistic
function. The
second graph (b)
shows that when
b is 0, the logistic
function becomes
a horizontal
straight line.
© John Wiley & Sons, Inc.
Notice that the a + bX part looks just like the formula for a straight line (see
Chapter 16). It's the rest of the logistic function that bends the straight line into
its characteristic S shape. The middle of the S (where Y = 0.5) always occurs when
X = −a/b. The steepness of the curve in the middle region is determined by b, as
follows:
FIGURE 18-3:
The first graph (a)
shows that when
b is negative,
the logistic
function slopes
downward. The
second graph (b)
shows that when
b is very large,
the logistic
function becomes
a “step function.”
© John Wiley & Sons, Inc.
Because the logistic curve approaches the limits 0.0 and 1.0 for extreme values of
the predictor(s), you should not use logistic regression in situations where the
fraction of individuals positive for the outcome does not approach these two lim-
its. Logistic regression is appropriate for the radiation example because none of
the individuals died at a radiation exposure of zero REMs, and all of the individu-
als died at doses of 686 REMs and higher. Now imagine a study of patients with a
disease where the outcome is a cure: if taking the drug even in very high doses
doesn't always produce a cure, and the disease can resolve on its own without
any drug, the data aren't appropriate for logistic regression. This is because some
patients at high doses would still have an outcome value of 0, and some patients
at zero dose would have an outcome value of 1.
Logistic regression fits the logistic model to your data by finding the values of a
and b that make the logistic curve come as close as possible to all your plotted
points. With this fitted model, you can then predict the probability of the outcome.
See the later section “Predicting probabilities with the fitted logistic formula” for
more details.
Logistic regression determines the values of the regression coefficients that are most
consistent with the observed data using what’s called the maximum likelihood criterion.
The likelihood of any statistical model is the probability (based on the model) of obtain-
ing the values you observed in your data. There’s a likelihood value for each row in the
data set, and a total likelihood (L) for the entire data set. The likelihood value for each
data point is the predicted probability of getting the observed outcome result. For indi-
viduals who died (refer to Table 18-1), the likelihood is the probability of dying (Y) pre-
dicted by the logistic formula. For individuals who survived, the likelihood is the
predicted probability of not dying, which is 1 Y . The total likelihood (L) for the whole
set of individuals is the product of all the calculated likelihoods for each individual.
To find the values of the coefficients that maximize L, it is most practical to find the val-
ues that minimize the quantity −2 multiplied by the natural logarithm of L, which is also
called the –2 log likelihood and abbreviated –2LL. Statisticians also call –2LL the deviance. The
closer the curve designated by the regression formula comes to the observed points,
the smaller this deviance value will be. The actual value of the deviance for a logistic
regression model doesn’t mean much by itself. It’s the difference in deviance between
two models you might be comparing that is important.
Once the deviance is defined, the final step is to identify the values of the coefficients that
minimize the deviance of the observed Y values from the fitted logistic curve. This
may sound challenging, but statistical programs employ elegant and efficient ways to
minimize such a complicated function involving several variables, and use these meth-
ods to obtain the coefficients.
1. Make sure your data set has a column for the outcome variable that is
coded as 1 where the individual is positive for the outcome, and 0 when
they are negative.
If you do not have an outcome column coded this way, use the data manage-
ment commands in your software to generate a new variable coded as 0 for
those who do not have the outcome, and 1 for those who have the outcome,
as shown in Table 18-1.
2. Make sure your data set has a column for each predictor variable, and
that these columns are coded the way you want them to be entered
into the model.
3. Tell your software which variables are the predictors and which is the
outcome.
Depending on the software, you may do this by typing the variable names,
or by selecting the variables from a menu or list.
4. Request whatever optional output you want from the software, if available.
5. Obtain the output you requested, and interpret the resulting model.
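In R, those steps amount to a single glm call with the binomial family. Here's a minimal sketch (the data-frame name rad and the outcome name Died are our assumptions; Dose is the predictor from this example):
logit_model <- glm(Died ~ Dose, family = binomial, data = rad)
summary(logit_model)       # coefficients, SEs, and p values
exp(coef(logit_model))     # odds ratios for the intercept and for Dose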
» One or more pseudo–r2 values: Pseudo–r2 values indicate how much of the
total variability in the outcome is explainable by the fitted model. They are
analogous to how r2 is interpreted in ordinary least-squares regression, as
described in Chapter 17. In Figure 18-4a, two such values are provided under
the labels Cox/Snell R-square and Nagelkerke R-square. The Cox/Snell r2 is 0.577,
and the Nagelkerke r2 is 0.770, both of which indicate that a majority of the
variability in the outcome is explainable by the logistic model.
» The first column usually lists the regression coefficients (under Coeff. in
Figure 18-4a).
» The second column usually lists the standard error (SE) of each coefficient
(under StdErr in Figure 18-4a).
For each predictor variable, the output should also provide the odds ratio (OR) and
its 95 percent confidence interval. These are usually presented in a separate table
as they are in Figure 18-4a under Odds Ratios and 95% Confidence Intervals.
You can write out the formula manually by inserting the value of the regression
coefficients from the regression table into the logistic formula. The final model
produced by the logistic regression program from the data in Table 18-1 and the
resulting logistic curve are shown in Figure 18-5.
Once you have the fitted logistic formula, you can predict the probability of having
the outcome if you know the value of the predictor variable. For example, if an
individual is exposed to 500 REM of radiation, the probability of the outcome is
given by this formula: Probability of Death = 1/(1 + e^−(−4.828 + 0.01146 × 500)), which
equals 0.71. An individual exposed to 500 REM of radiation has a predicted proba-
bility of 0.71 — or a 71 percent chance — of dying shortly thereafter. The predicted
probabilities for each individual are shown in the data listed in Figure 18-4b. You
can also calculate some points of special significance on a logistic curve, as you
find out in the following sections.
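You can check that arithmetic in R, either by hand or with the fitted model from the sketch earlier in this chapter:
1 / (1 + exp(-(-4.828 + 0.01146 * 500)))                   # about 0.71
predict(logit_model, newdata = data.frame(Dose = 500),
        type = "response")                                 # the same prediction from the model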
Be careful with your algebra when evaluating these formulas! The a coefficient in
a logistic regression is often a negative number, and subtracting a negative num-
ber is like adding its absolute value.
Using your high-school algebra, you can solve the logistic formula
Y = 1/(1 + e^−(a + bX))
for X as a function of Y. If you don't remember how to do that, don't worry, here's
the answer:
X = (log(Y/(1 − Y)) − a)/b
where log stands for natural logarithm. If you substitute 0.5 for Y in the preceding
equation because you want to calculate the ED50, the answer is −a/b. Similarly,
substituting 0.8 for Y gives the ED80 as (1.39 − a)/b.
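Here's a short sketch that plugs the fitted coefficients from this example into those formulas:
a <- -4.828; b <- 0.01146       # intercept and slope from the fitted logistic model
-a / b                          # ED50: dose with a 50 percent predicted mortality
(log(0.8 / (1 - 0.8)) - a) / b  # ED80: dose with an 80 percent predicted mortality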
In the following sections, we talk about yes or no predictions. We explain how they
expose the ability of the logistic model to make predictions, and how you can stra-
tegically select the cut value that gives you the best tradeoff between wrongly
predicting yes and wrongly predicting no.
From the classification table shown in Figure 18-6, you can calculate several use-
ful measures of the model’s predicting ability for any specified cut value, includ-
ing the following:
Sensitivity and specificity are especially relevant to screening tests for diseases.
An ideal test would have 100 percent sensitivity and 100 percent specificity, and
therefore, 100 percent overall accuracy. In reality, no test could meet these stan-
dards, and there is a tradeoff between sensitivity and specificity.
By judiciously choosing the cut value for converting a predicted probability into a
yes or no decision, you can often achieve high sensitivity or high specificity, but
it’s hard to maximize both simultaneously. Screening tests are meant to detect
disease, so how you select the cut value depends upon what happens if it produces
a false-positive or false-negative result. This helps you decide whether to priori-
tize sensitivity or specificity.
The sensitivity and specificity of a logistic model depend upon the cut value you
set for the predicted probability. The trick is to select a cut value that gives the
optimal combination of sensitivity and specificity, striking the best balance
between false-positive and false-negative predictions, in light of the different
consequences of each kind of wrong prediction.
An ROC graph has a curve that shows you the complete range of sensitivity and
specificity that can be achieved for any fitted logistic model based on the selected
cut value. The software generates an ROC curve by effectively trying all possible
cut values of predicted probability between 0 and 1, calculating the predicted out-
comes, cross-tabbing them against the observed outcomes, calculating sensitivity
and specificity, and then graphing sensitivity versus specificity. Figure 18-7 shows
the ROC curve from the logistic model developed from the data in Figure 18-1
(using R software; see Chapter 4).
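One convenient way to get this curve in R is the pROC package (our choice here, not the only option). A minimal sketch, reusing the fitted logistic model and the assumed data-frame name from earlier:
library(pROC)
roc_obj <- roc(rad$Died, fitted(logit_model))   # observed outcomes versus predicted probabilities
plot(roc_obj)                                   # sensitivity against specificity
auc(roc_obj)                                    # area under the ROC curve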
FIGURE 18-7:
ROC curve from
dose mortality
data.
© John Wiley & Sons, Inc.
Like Figure 18-7, every ROC graph has sensitivity running up the Y axis, which is
displayed either as fractions between 0 and 1 or as percentages between 0 and 100.
The X axis is either presented from left to right as 1 – specificity , or like it is in
Figure 18-7, where specificity is labeled backwards — from right to left — along
the X axis.
Most ROC curves lie in the upper-left part of the graph area. The farther away
from the diagonal line they are, the better the predictive model is. For a nearly
perfect model, the ROC curve runs up along the Y axis from the lower-left corner
to the upper-left corner, then along the top of the graph from the upper-left cor-
ner to the upper-right corner.
Because of how sensitivity and specificity are calculated, the graph appears as a
series of steps. If you have a large data set, your graph will have more and smaller
steps. For clarity, we show the cut values for predicted probability as a scale along
the ROC curve itself in Figure 18-7, but unfortunately, most statistical software
doesn’t do this for you.
Looking at the ROC curve helps you choose a cut value that gives the best tradeoff
between sensitivity and specificity:
» To have very few false positives: Choose a higher cut value to give a high
specificity. Figure 18-7 shows that by setting the cut value to 0.6, you can
simultaneously achieve about 93 percent specificity and 87 percent sensitivity.
» To have very few false negatives: Choose a lower cut value to give higher
sensitivity. Figure 18-7 shows you that if you set the cut value to 0.3, you can
have almost perfect sensitivity because you’ll be at almost 100 percent, but
your specificity will be only about 75 percent, meaning you’ll have a 25 percent
false positive rate.
The software may optionally display the area under the ROC curve (abbreviated
AUC), along with its standard error and a p value. This is another measure of how
good the predictive model is. The diagonal line has an AUC of 0.5, and there is a
statistical test comparing your AUC to the diagonal line. Under α = 0.05, if the p
value < 0.05, it indicates that your model is statistically significantly better than
the diagonal line at accurately predicting your outcome.
» Don’t fit a logistic function to non-logistic data: Don’t use logistic regres-
sion to fit data that doesn’t behave like the logistic S curve. Plot your grouped
data (as shown earlier in Figure 18-1b), and if it’s clear that the fraction of
positive outcomes isn't leveling off at Y = 0 or Y = 1 for very large or very
small X values, then logistic regression is not the correct modeling approach.
The H-L test described earlier under the section “Assessing the adequacy of
the model” provides a statistical test to determine if your data qualify for
logistic regression. Also, in Chapter 19, we describe a more generalized logistic
model that contains other parameters for the upper and lower leveling-
off values.
» Watch out for collinearity and disappearing significance: When you are
doing any kind of regression and two or more predictor variables are strongly
related with each other, you can be plagued with problems of collinearity. We
describe this problem in Chapter 17, and potential modeling solutions in
Chapter 20.
Also, be careful not to misinterpret odds ratios for numerical predictors, and be
mindful of the complete separation problem, as described in the following
sections.
The value of a regression coefficient depends on the units in which the corre-
sponding predictor variable is expressed. So the coefficient of a height variable
expressed in meters is 100 times larger than the coefficient of height expressed in
centimeters. In logistic regression, ORs are obtained by exponentiating the coef-
ficients, so switching from centimeters to meters corresponds to raising the OR
(and its confidence limits) to the 100th power.
If the predictor variable or variables in your model completely separate the yes
outcomes from the no outcomes, the maximum likelihood method will try to make
the coefficient of that variable infinite, which usually causes an error in the soft-
ware. If the coefficient is positive, the OR tries to be infinity, and if it is negative,
it tries to be 0. The SE of the OR tries to be infinite, too. This may cause your CI to
have a lower limit of 0, an upper limit of infinity, or both.
Check out Figure 18-8, which visually describes the problem. The regression is
trying to make the curve come as close as possible to all the data points. Usually it
has to strike a compromise, because there’s a mixture of 1s and 0s, especially in
the middle of the data. But with perfectly separated data, no compromise is neces-
sary. As b becomes infinitely large, the logistic function morphs into a step func-
tion that touches all the data points (observe where b = 5).
FIGURE 18-8:
Visualizing the
complete
separation (or
perfect predictor)
problem in
logistic
regression.
© John Wiley & Sons, Inc.
Here are two simple approaches you can use if your logistic model has only one
predictor. In each case, you replace the logistic regression equation with another
equation that is somewhat equivalent, and then do a sample-size calculation
based on that. It’s not an ideal solution, but it can give you an answer that’s close
enough for planning purposes.
Chapter 19
Other Useful Kinds
of Regression
This chapter covers regression approaches you're likely to encounter in bio-
statistical work that are not covered in other chapters. They’re not quite as
common as straight-line regression, multiple regression, and logistic regres-
sion (described in Chapters 16, 17, and 18, respectively), but you should be aware of
them. We don’t go into a lot of detail, but we describe what they are, the circum-
stances under which you may want to use them, how to execute the models and
interpret the output, and special situations you may encounter with these models.
Note: We also don’t cover survival regression in this chapter, even though it’s one
of the most important kinds of regression analysis in biostatistics. Survival analy-
sis is the theme of Part 6 of this book, and is the topic of Chapter 23.
Because independent random events like highway accidents should follow a Pois-
son distribution (see Chapter 24), they should be analyzed by a kind of regression
designed for Poisson outcomes. And — surprise, surprise — this type of special-
ized regression is called Poisson regression.
Don’t confuse the generalized linear model with the very similarly named general
linear model. It’s unfortunate that these two names are almost identical, because
they describe two very different things. Now, the general linear model is usually
abbreviated LM, and the generalized linear model is abbreviated GLM, so we will use
those abbreviations. (However, some old textbooks from the 1970s may use GLM to
mean LM, because the generalized linear model had not been invented yet.)
GLM is similar to LM in that the predictor variables usually appear in the model as
the familiar linear combination:
c0 + c1x1 + c2x2 + c3x3 + . . .
where the x's are the predictor variables, and the c's are the regression coefficients
(with c0 being called a constant term, or intercept).
» With LM, the linear combination becomes the predicted value of the outcome,
but with GLM, you can specify a link function. The link function is a transforma-
tion that turns the linear combination into the predicted value. As we note in
Chapter 18, logistic regression applies exactly this kind of transformation: Let’s
call the linear combination V. In logistic regression, V is sent through the
logistic function 1/(1 + e^−V) to convert it into a predicted probability of
having the outcome event. So if you select the correct link function, you can
use GLM to perform logistic regression.
GLM is the Swiss army knife of regression. If you select the correct link function,
you can use it to do ordinary least-squares regression, logistic regression, Poisson
regression, and a whole lot more. Most statistical software offers a GLM function;
that way, other specialized regressions don’t need to be programmed. If the soft-
ware you are using doesn’t offer logistic or Poisson regression, check to see
whether it offers GLM, and if it does, use that instead. (Flip to Chapter 4 for an
introduction to statistical software.)
Running a Poisson regression is similar in many ways to running the other com-
mon kinds of regression, but there are some differences. Here are the steps:
1. Assemble your data into a file with one row per observation and a column
for each variable.
For this example, you have a row of data for each year, so year is the experi-
mental unit. For each row, you have a column containing the outcome values,
which is the number of accidents each year (Accidents). Since you have one
predictor — which is year — you have a column for Year.
2. Tell the software which variables are the predictor variables, and which
one is the outcome.
3. Tell the software that the outcome follows the Poisson distribution, and
specify the link function you want to use.
Step 3 is not obvious, and you may have to consult your software’s help file. In
the R program, as an example, you have to specify both family and link in a
single construction, which looks like this:
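A minimal sketch of that construction (the data-frame name accidents is our assumption; identity-link Poisson fits sometimes need starting values, so we borrow them from a straight-line fit as a practical precaution):
pois_linear <- glm(Accidents ~ Year, family = poisson(link = "identity"),
                   data = accidents,
                   start = coef(lm(Accidents ~ Year, data = accidents)))
summary(pois_linear)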
This code tells R that the outcome is the variable Accidents, the predictor is the
variable Year, and the outcome variable follows the Poisson family of distribu-
tions. The code link “identity” tells R that you want to fit a model in which
the true event rate rises in a linear fashion, meaning that it increases by a
constant amount each year.
Year Accidents
2010 10
2011 12
2012 15
2013 8
2014 8
2015 15
2016 4
2017 20
2018 20
2019 17
2020 29
2021 28
FIGURE 19-2:
Poisson
regression
output.
This output has the same general structure as the output from other kinds of
regression. The most important parts of it are the following:
» The standard error (SE) is labeled Std. Error and is 0.3169, indicating the
precision of the estimated rate increase per year. From the SE, using the rules
given in Chapter 10, the 95 percent confidence interval (CI) around the
estimated annual increase is approximately 1.3298 ± 1.96 × 0.3169, which
gives a 95 percent CI of 0.71 to 1.95 (around the estimate of 1.33).
» The last column, labeled Pr(>|z|), is the p value for the significance of the
increasing trend estimated at 1.33. The Year variable has a p value of
2.71e-05, which is scientific notation (see Chapter 2) for 0.0000271. Using α = 0.05,
the apparent increase in rate over the 12 years would be interpreted as highly
statistically significant.
» AIC (Akaike’s Information Criterion) indicates how well this model fits the data.
The value of 81.72 isn’t useful by itself, but it’s very useful when choosing
between two alternative models, as we explain later in this chapter.
R software can also provide the predicted annual event rate for each year, from
which you can add a trend line to the scatter graph, indicating how you think the
true event rate may vary with time (see Figure 19-3).
FIGURE 19-3:
Poisson
regression,
assuming a
constant increase
in accident rate
per year with
trend line.
© John Wiley & Sons, Inc.
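To model an exponential increase in the accident rate instead, you refit the model with a log link (the default link for the Poisson family). Here's a minimal sketch under the same assumptions as before:
pois_exp <- glm(Accidents ~ Year, family = poisson(link = "log"), data = accidents)
summary(pois_exp)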
This produces the output shown in Figure 19-4 and graphed in Figure 19-5.
FIGURE 19-4:
Output from an
exponential trend
Poisson
regression.
Because of the log link used in this regression run, the coefficients are related to
the logarithm of the event rate. Thus, the relative rate of increase per year is
obtained by taking the antilog of the regression coefficient for Year. This is done
by raising e (the mathematical constant 2.718. . .) to the power of the regression
coefficient for Year: e^0.10414, which is about 1.11. So, according to an exponential
increase model, the annual accident rate increases by a factor of 1.11 each year —
meaning there is an 11 percent increase each year. The dashed-line curve in
Figure 19-5 shows this exponential trend, which appears to accommodate the
steeper rate of increase seen after 2016.
The standard deviation (SD) of a Poisson distribution is equal to the square root of
the mean of the distribution. But if clustering is present, the SD of the data is
larger than the square root of the mean. This situation is called overdispersion.
GLM in R can correct for overdispersion if you designate the distribution family
quasipoisson rather than poisson, like this:
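Here's a sketch (the data-frame name accidents is again our assumption, and the link function should match the model you're checking for overdispersion):
glm(Accidents ~ Year, family = quasipoisson(link = "log"), data = accidents)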
The formula for a nonlinear regression model may be any algebraic expression. It
can involve sums, differences, products, ratios, powers, and roots. These can be
combined together in a formula with logarithmic, exponential, trigonometric, and
other advanced mathematical functions (see Chapter 2 for an introduction to these
items). The formula can contain any number of predictor variables, and any num-
ber of parameters. In fact, nonlinear regression formulas often contain many
more parameters than predictor variables.
Unlike other types of regression covered in this chapter and book, where a regres-
sion command and code are used to generate output, developing a full-blown
nonlinear regression model is more of a do-it-yourself proposition. First, you
have to decide what function you want to fit to your data, making this choice from
the infinite number of possible functions you could select. Sometimes the general
form of the function is determined or suggested by a scientific theory. Using a
theory to guide your development of a nonlinear function means relying on a the-
oretical or mechanistic function, which is more common in the physical sciences
than life sciences. If you choose your nonlinear function based on a function with
a generally similar shape, you are using an empirical function. After choosing the
function, you have to provide starting estimates for the value of each of the
parameters appearing in the function. After that, you can execute the regression.
The software tries to refine your estimates using an iterative process that may or
may not converge on a final set of parameter values.
Raw PK data often consist of the concentration level of the drug in the partici-
pant’s blood at various times after a dose of the drug is administered. Consider a
Phase I trial, in which 10,000 micrograms (μg) of a new drug is given as a single
bolus, which is a rapid injection into a vein, in each participant. Blood samples are
drawn at predetermined times after dosing and are analyzed for drug concentra-
tions. Hypothetical data from one participant are shown in Table 19-2 and graphed
in Figure 19-6. The drug concentration in the blood is expressed in units of μg per
deciliter (μg/dL). Remember, a deciliter is one-tenth of a liter.
» The volume of distribution (Vd): This is the effective volume of fluid or tissue
through which the drug is distributed in the body. This effective volume could
be equal to the blood volume, but could be greater if the drug also spreads
through fatty tissue or other parts of the body. If you know the dose of the drug
infused (Dose), and you know the blood plasma concentration at the moment of
infusion (C0), you can calculate the volume of distribution as Vd = Dose/C0.
But you can't directly measure C0. By the time the drug has distributed evenly
throughout the bloodstream, some of it has already been eliminated from the
body. So C0 has to be estimated by extrapolating the measured concentrations
backward in time to the moment of infusion (Time = 0).
» The elimination half-life (λ): The time it takes for half of the drug in the body
to be eliminated.
TABLE 19-2 Blood Drug Concentration over Time in One Participant
Time After Dose (hours) Concentration (μg/dL)
0.25 57.4
0.5 54.0
1 44.8
1.5 52.7
2 43.6
3 40.1
4 27.9
6 20.6
8 15.0
12 10.0
FIGURE 19-6:
The blood
concentration of
an intravenous
drug decreases
over time in one
participant.
© John Wiley & Sons, Inc.
Conc = C0 × e^(−ke × Time)
1. Get your data into the software.
In R, you can develop arrays called vectors from data sets, but in this example,
we create each vector manually, naming them Time and Conc, using the data
from Table 19-2:
Time <- c(0.25, 0.5, 1, 1.5, 2, 3, 4, 6, 8, 12)
Conc <- c(57.4, 54.0, 44.8, 52.7, 43.6, 40.1, 27.9, 20.6, 15.0, 10.0)
2. Specify the equation to be fitted to the data, using the algebraic syntax
your software requires.
We write the equation using R's algebraic syntax this way: Conc ~ C0 * exp(-ke *
Time), where Conc and Time are your vectors of data, and C0 and ke are the
parameters to be fitted.
3. Tell the software that C0 and ke are parameters to be fitted, and provide
initial estimates for these values.
In this example, C0 (variable C0) is the concentration you expect at the moment
of dosing (at t = 0). From Figure 19-6, it looks like the concentration starts out
around 50, so you can use 50 as an initial guess for C0. The ke parameter (variable
ke) affects how quickly the concentration decreases with time. Figure 19-6
indicates that the concentration seems to decrease by half about every few
hours, so λ should be somewhere around 4 hours. Because λ = 0.693/ke,
a little algebra gives the equation ke = 0.693/λ. If you plug in 4 hours for λ, you
get ke = 0.693/4, or about 0.17, so you may try 0.2 as a starting guess for ke. You tell
R the starting guesses by using the syntax start = list(C0 = 50, ke = 0.2).
The statement in R for nonlinear regression is nls, which stands for nonlinear
least-squares. The full R statement for executing this nonlinear regression model
and summarizing the output is:
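A sketch of that statement, using the vectors and starting values from Steps 1 through 3:
fit1 <- nls(Conc ~ C0 * exp(-ke * Time), start = list(C0 = 50, ke = 0.2))
summary(fit1)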
FIGURE 19-7:
Results of
nonlinear
regression in R.
How precise are these PK parameters? In other words, what is their SE? Unfortu-
nately, uncertainty in any measured quantity will propagate through a mathemat-
ical expression that involves that quantity, and this needs to be taken into account
in calculating the SE. To do this, you can use the online calculator at https://
statpages.info/erpropgt.html. Choose the estimator designed for two vari-
ables, and enter the information from the output into the calculator. You can cal-
culate that Vd = 16.8 ± 0.65 liters, and λ = 4.25 ± 0.43 hours.
R can be asked to generate the predicted value for each data point, from which you
can superimpose the fitted curve onto the observed data points, as in
Figure 19-8.
R also provides the residual standard error (labeled Residual std. err. in Figure 19-7),
which is defined as the standard deviation of the vertical distances of the observed
points from the fitted curve. The value from the output of 3.556 means that the
points scatter about 3.6 μg/dL above and below the fitted curve. Additionally, R
can be asked to provide Akaike’s Information Criterion (AIC), which is useful in
selecting which of several possible models best fits the data.
Because nonlinear regression involves algebra, some fancy math footwork can
help you out. Very often, you can re-express the formula in an equivalent form
that directly involves calculating the parameters you actually want to know.
Here’s how it works for the PK example we use in the preceding sections.
Algebra tells you that because Vd = Dose/C0, then C0 = Dose/Vd. So why not use
Dose/Vd instead of C0 in the formula you're fitting? If you do, it becomes
Conc = (Dose/Vd) × e^(−ke × Time). And you can go even further than that. It turns
out that a first-order exponential-decline formula can be written either as
e^(−ke × Time) or as 0.5^(Time/tHalf), which uses the half-life itself as a parameter.
From the original description of this example, you already know that Dose = 10,000
μg, so you can substitute this value for Dose in the formula to be fitted. You've
already estimated λ (variable tHalf) as 4 hours. Also, you estimated C0 as about 50
μg/dL from looking at Figure 19-6, as we describe earlier. This means you can
estimate Vd (variable Vd) as 10,000/50, which is 200 dL. With these estimates, the
final R statement is
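Something along these lines (a sketch; Time and Conc are the vectors created in the earlier steps):
fit2 <- nls(Conc ~ (10000 / Vd) * 0.5^(Time / tHalf),
            start = list(Vd = 200, tHalf = 4))
summary(fit2)   # reports Vd and tHalf (the half-life) directly, with their SEs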
FIGURE 19-9:
Nonlinear
regression that
estimates the
PK parameters
you want.
From Figure 19-9, you can see the direct results for Vd and tHalf. Using the output,
you can estimate that Vd is 168.2 ± 6.5 dL (or 16.8 ± 0.66 liters), and λ is
4.24 ± 0.43 hours.
Smoothing Nonparametric
Data with LOWESS
Sometimes you want to fit a smooth curve to a set of points that don’t seem to
conform to a common, recognizable distribution, such as normal, exponential,
logistic, and so forth. If you can’t write an equation for the curve you want to fit,
you can’t use linear or nonlinear regression techniques. What you need is essen-
tially a nonparametric regression approach, which would not try to fit any formula/
model to the relationship, but would instead just try to draw a smooth line through
the data points.
Running LOWESS
Suppose that you discover a new hormone called XYZ believed to be produced in
women’s ovaries throughout their lifetimes. Research suggests blood levels of
XYZ should vary with age, in that they are low before going through puberty and
after completing menopause, and high during child-bearing years. You want to
characterize and quantify the relationship between XYZ levels and age as accu-
rately as possible.
Suppose that for your analysis, you are allowed to obtain 200 blood samples drawn
from consenting female participants aged 2 to 90 years for another research proj-
ect, and analyze the specimens for XYZ levels. A graph of XYZ level versus age may
look like Figure 19-10.
FIGURE 19-10:
The relationship
between age and
hormone
concentration
doesn’t conform
to a simple
function.
© John Wiley & Sons, Inc.
In Figure 19-10, you can observe a lot of scatter in these points, which makes it
hard to see the more subtle aspects of the XYZ-age relationship. At what age does
the hormone level start to rise? When does it peak? Does it remain fairly constant
throughout child-bearing years? When does it start to decline? Is the rate of
decline after menopause constant, or does it change with advancing age?
FIGURE 19-11:
The fitted
LOWESS curve
follows the shape
of the data,
whatever it
may be.
© John Wiley & Sons, Inc.
In Figure 19-11, the smoothed curve seems to fit the data quite well across all ages
except the lower ones. The individual data points don’t show any noticeable
upward trend until age 12 or so, but the smoothed curve starts climbing right
from age 3. The curve completes its rise by age 20, and then remains flat until
almost age 50, when it starts declining. The rate of decline seems to be greatest
between ages 50 to 65, after which it declines less rapidly. These subtleties would
be very difficult to spot just by looking at the individual data points without any
smoothed curve.
FIGURE 19-12:
You can adjust
the smoothness
of the fitted curve
by adjusting the
smoothing
fraction.
© John Wiley & Sons, Inc.
» Setting f = 0.667 produces a rather stiff curve that rises steadily between ages
2 and 40, and then declines steadily after that (see dashed line). The value
0.667 represents 2/3, which is what R uses as the default value of the f
parameter if you don’t specify it. This curve misses important features of the
data, like the low pre-puberty hormone levels, the flat plateau during child-
bearing years, and the slowing down of the yearly decrease above age 65. You
can say that this curve shows excessive bias, systematically departing from
observed values in various places along its length.
» Setting f = 0.1, which is at a lower extreme, produces a very jittery curve with
a lot of up-and-down wiggles that can’t possibly relate to actual ages, but
instead reflect random fluctuations in the data (see dark, solid line). You can
say that this curve shows excessive variance, with too many random fluctua-
tions along its length.
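In R, a sketch of this comparison (assuming the data are in vectors named Age and XYZ; this isn't the book's own code) might look like this:

plot(Age, XYZ)                               # scatterplot, as in Figure 19-10
lines(lowess(Age, XYZ, f = 0.667), lty = 2)  # stiff curve using the default fraction
lines(lowess(Age, XYZ, f = 0.1))             # jittery curve using a very small fraction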
Chapter 20
Getting the Hint from
Epidemiologic Inference
In Parts 5 and 6, we describe different types of regression, such as ordinary
least-squares regression, logistic regression, Poisson regression, and survival
regression. In each kind of regression we cover, we describe a situation in which
you are performing multivariable or multivariate regression, which means you are
making a regression model with more than one independent variable. Those
chapters describe the mechanics of fitting these multivariable models, but they
don’t provide much guidance on which independent variables to choose to try to
put in the multivariable model.
The chapters in Parts 5 and 6 also discuss model-fitting, which means the act of
trying to refine your regression model so that it optimally fits your data. When you
have a lot of candidate independent variables (or candidate covariates), part of
model-fitting has to do with deciding which of these variables actually fit in the
model and should stay in, and which ones don’t fit and should be kicked out. Part
of what guides this decision-making process are the mechanics of modeling and
model-fitting. The other main part of what guides these decisions is the hypoth-
esis you are trying to test with your model, which is the focus of this chapter.
As shown in Figure 20-1, inability to exercise and low income are both seen as potential
confounders. That is because they are associated with both the exposure of mili-
tary service and the outcome of amputation, and they are not on the causal path-
way between military service and amputation. In other words, what is causing the
outcome of amputation is not also causing the patient’s inability to exercise, nor
is it also causing the patient to have low income. But whatever is causing the
patient’s amputation is also causing the patient’s retinopathy. That’s because
Type II diabetes causes poor circulation, which causes both retinopathy and
amputation. This means that retinopathy and amputation are on the same causal
pathway, and retinopathy cannot be considered a potential confounder.
Avoiding overloading
You may think that choosing what covariates belong in a regression model is easy.
You just put all the confounders and the exposure in as covariates and you’re done,
right? Well, unfortunately, it’s not that simple. Each time you add a covariate to a
regression model, you increase the amount of error in the model by some amount —
no matter what covariate you choose to add. Although there is no official maximum
to the number of covariates in a model, it is possible to add so many covariates that
the software cannot compute the model, causing an error. In a logistic regression
model as discussed in Chapter 18, each time you add a covariate, you increase the
overall likelihood of the model. In Chapter 17, which focuses on ordinary least-
squares regression, adding a covariate increases your sum of squares.
What this means is that you don’t want to add covariates to your model that just
increase error and don’t help with the overall goal of model fit. A good strategy is
to try to find the best collection of covariates that together deal with as much error
as possible. For example, think of it like roommates who share apartment-
cleaning duties. It’s best if they split up the apartment and each clean different
parts of it, rather than insisting on cleaning up the same rooms, which would be
a waste of time. The term parsimony refers to trying to include the fewest
covariates in your regression model that explain the most variation in the depen-
dent variable. The modeling approaches discussed in the next section explain
ways to develop such parsimonious models.
But if you collected your data based on a hypothesis, you are doing a hypothesis-
driven analysis. Epidemiologic studies require hypothesis-driven analyses, where
you have already selected your exposure and outcome, and now you have to fit a
regression model predicting the outcome, but including your exposure and con-
founders as covariates. You know you need to include the exposure and the out-
come in every model you run. However, you may not know how to decide on which
confounders stay in the model.
You then need to choose a modeling approach, which is the approach you will use
to determine which candidate confounders stay in the model with the exposure
and which ones are removed. There are three common approaches in regression
modeling (although analysts often have their own customized approaches). These approaches
don’t have official names, but we will use terms that are commonly used. They
are: forward stepwise, backward elimination, and stepwise selection.
» Backward elimination: In this approach, the first model you run contains all
your potential covariates, including all the confounders and the exposure.
Using modeling rules, each time you run the model, you remove or eliminate
the confounder contributing the least to the model. You decide which one
that is based on modeling rules you set (such as which confounder has the
largest p value). Theoretically, after you pare away the confounders that do
not meet the rules, you will have a final model. In practice, this process can
run into problems if you have collinear covariates (see Chapters 17 and 18 for
a discussion of collinearity). Your first model — filled with all your potential
covariates — may error out for this reason, and not converge. Also, it is not
clear whether once you eliminate a covariate you should try it again in the
model. This approach often sounds better on paper than it works in practice.
Once you produce your final model, check the p value for the covariate or covari-
ates representing your exposure. If they are not statistically significant, it means
that your hypothesis was incorrect, and after controlling for confounding, your
exposure was not statistically significantly associated with the outcome. However,
if the p value is statistically significant, then you would move on to interpret the
results for your exposure covariates from your regression model. After controlling
for confounding, your exposure was statistically significantly associated with
your outcome. Yay!
Use a spreadsheet to keep track of each model you run and a summary of the
results. Save this in addition to your computer code for running the models. It can
help you communicate with others about why certain covariates were retained and
not retained in your final model.
Understanding Interaction
(Effect Modification)
In Chapter 17, we touch on the topic of interaction (also known as effect modifica-
tion). This is where the relationship between an exposure and an outcome is
strongly dependent upon the status of another covariate. Imagine that you con-
ducted a study of laborers who had been exposed to asbestos at work, and you
found that being exposed to asbestos at work was associated with three times the
odds of getting lung cancer compared to not being exposed. In another study, you
found that individuals who smoked cigarettes had twice the odds of getting lung
cancer compared to those who did not smoke.
Knowing this, what would you predict are the odds of getting lung cancer for
asbestos-exposed workers who also smoke cigarettes, compared to workers who
aren’t exposed to asbestos and do not smoke cigarettes? Do you think it would be
additive — meaning three times for asbestos plus two times for smoking equals
five times the odds? Or do you think it would be multiplicative — meaning three
times two equals six times the odds?
Although this is just an example, it turns out that in real life, the effect of being
exposed to both asbestos and cigarette smoking represents a greater than multi-
plicative synergistic interaction (meaning much greater than six) in terms of the
odds for getting lung cancer. In other words, the risk of getting lung cancer for
cigarette smokers is dependent upon their asbestos-exposure status, and the risk
of lung cancer for asbestos workers is dependent upon their cigarette-smoking
status. Because the factors work together to increase the risk, this is a synergistic
interaction (with the opposite being an antagonistic interaction).
How and when do you model an interaction in regression? Typically, you first fit
your final model using a multivariate regression approach (see the earlier section
“Adjusting for confounders in regression” for more on this). Next, once the final
model is fit, you try to interact the exposure covariate or covariates with a con-
founder that you believe is the other part of the interaction. After that, you look at
the p value on the interaction term and decide whether or not to keep the
interaction.
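As a rough sketch of those steps in R, using a logistic model fit with glm() and hypothetical variable names (workers, lungCancer, asbestos, smoking, and age are placeholders, not the book's data):

# Final model with the exposure and confounders
finalModel <- glm(lungCancer ~ asbestos + smoking + age,
                  family = binomial, data = workers)

# Add the interaction between the exposure and the suspected effect modifier
intModel <- glm(lungCancer ~ asbestos * smoking + age,
                family = binomial, data = workers)
summary(intModel)    # examine the p value on the asbestos:smoking interaction term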
The study designs on the evidence-based pyramid that could be answered with a
regression model include clinical trial, cohort study, case-control study, and
cross-sectional study. If in your final model your exposure is statistically signifi-
cantly associated with your outcome, you now have to see how much evidence you
have that the exposure caused the outcome. This section provides two methods by
which to evaluate the significant exposure and outcome relationship in your
regression: Rothman’s causal pie and Bradford Hill’s criteria of causality.
For example, cigarette smoking is a very strong cause of lung cancer, as is occu-
pational exposure to asbestos. There are other causes, but for each individual,
these other causes would fill up small pieces of the causal pie for lung cancer.
Some may have a higher genetic risk factor for cancer. However, if they do not
smoke and stay away from asbestos, they will not fill up much of their pie tin, and
may have necessary but insufficient cause for lung cancer. However, if they include
both asbestos exposure and smoking in their tin, they are risking filling it up and
getting the outcome.
» First, consider if the data you are analyzing are from a clinical trial or cohort
study. If they are, then you will have met the criterion of temporality, which
means the exposure or intervention preceded the outcome and is especially
strong evidence for causation.
» If the estimate for the exposure in your regression model is large, you can say
you have a strong magnitude of association, and this is evidence of causation.
This is especially true if your estimate is larger than those of the confounders
in the model as well as similar estimates from the scientific literature.
» If the estimate is consistent in size and direction with other analyses, including
previous studies you’ve done and studies in the scientific literature, there is
more evidence for causation.
Chapter 21
Summarizing and
Graphing Survival Data
This chapter describes statistical techniques that deal with a special kind of
numerical data called survival data or time-to-event data. These data reflect
the interval from a particular starting point in time, such as the date a patient
receives a certain diagnosis or undergoes a certain procedure, to the first or only
occurrence of a particular kind of event that represents an endpoint. Because these
techniques are often applied to situations where the endpoint event is death, we
usually call the use of these techniques survival analysis, even when the endpoint is
something less drastic (or final) than death. Survival data could include time from
resolution of a chronic illness symptom to its relapse, but it can also be a desirable
endpoint, such as time to remission of cancer, or time to recovery from an acute
condition. Throughout this chapter, we use terms and examples that imply that
the endpoint is death, such as saying survival time instead of time to event. However,
everything we say also applies to other kinds of endpoints.
You may wonder why you need a special kind of analysis for survival data in the
first place. Why not just treat survival times as ordinary numerical variables? Why
not summarize them as means, medians, standard deviations, and so on, and
graph them as histograms and box-and-whiskers charts? Why not compare sur-
vival times between groups with t tests and ANOVAs? Why not use ordinary least-
squares regression to explore how various factors influence survival time?
A person can die only once, so survival analysis can obviously be used for one-
time events. But other endpoints can occur multiple times, such as having a stroke
or having cancer go into remission. The techniques we describe in this chapter
only analyze time to the first occurrence of the event. More advanced survival
analysis methods are needed for models that can handle multiple occurrences of
an event, and these are beyond the scope of this book.
The starting point of the time interval is somewhat arbitrary, so it must be defined
explicitly every time you do a survival analysis. Imagine that you’re studying the
progression of chronic obstructive pulmonary disease (COPD) in a group of
patients. If you want to study the natural history of the disease, the starting point
can be the diagnosis date. But if you’re instead interested in evaluating the
efficacy of a treatment, the starting point can be defined as the date the
treatment began.
If non-normality were the only problem with survival data, you’d be able to sum-
marize survival times as medians and centiles instead of means and standard
deviations. Also, you could compare survival between groups with nonparametric
Mann-Whitney and Kruskal-Wallis tests instead of t tests and ANOVAs. But time-
to-event data are susceptible to a specific type of missingness called censoring.
Typical parametric and nonparametric regression methods are not equipped to
deal with censoring, so we present survival analysis techniques in this chapter.
Considering censoring
Survival data are defined as the time interval between a selected starting point and
an endpoint that represents an event. But unfortunately, the time the event takes
place can be missing in survival data. This can happen in two general ways:
» You may not be able to observe all the participants in the data until they
have the event. Because of time constraints, at some point, you have to end
the study and analyze your data. If your endpoint is death, hopefully at the
end of your study, some of the participants are still alive! At that point, you
would not know how much longer these participants will ultimately live. You
only know that they were still alive up to the last date they were measured in
the study as part of data collection, or the last date study staff communicated
with them in some way (such as through a follow-up phone call). This date is
called the date of last contact or the last-seen date, and would be the date that
these participants would be censored in your data.
» You may lose track of some participants during the study. Participants
who enroll in a study may be lost to follow-up (LFU), meaning that it is no
longer possible for study staff to locate them and continue to collect data for
the study. These participants are also censored at their date of last contact,
but in the case of LFU, this date is typically well before the observation
period ends.
Figure 21-1 shows the results of a small study of survival in cancer patients after
a surgical procedure to remove a tumor. Ten patients were recruited to participate
in the study and were enrolled at the time of their surgery. The recruitment period
went from Jan. 1, 2010, to the end of Dec. 31, 2011 (meaning a two-year enrollment
period). All participants were then followed until they died, or until the conclusion
of the study, on Dec. 31, 2016, which added five years of additional observation
time after the last enrollment. Each participant has a horizontal timeline that
starts on the date of surgery and ends with either the date of death or the
censoring date.
FIGURE 21-1:
Survival of ten
study participants
following surgery
for cancer.
© John Wiley & Sons, Inc.
In Figure 21-1, observe that each line ends with a code, and there’s a legend at the
bottom. Six of the ten participants (#’s 1, 2, 4, 6, 9, and 10, labeled X) died during
the course of the follow-up study. Two participants (#5 and #7, labeled L) were
LFU at some point during the study, and two participants (#3 and #8, labeled E)
were still alive at the end of the study. So this study has four participants — the
Ls and the Es — with censored survival times.
» The hazard rate is the probability of the participant dying in the next small
interval of time, assuming the participant is alive right now.
» The survival rate is the probability of the participant living for a certain
amount of time after some starting time point.
The first task when analyzing survival data is usually to describe how the hazard
and survival rates vary with time. In this chapter, we show you how to estimate
the hazard and survival rates, summarize them as tables, and display them as
graphs. Most of the larger statistical packages (such as those described in
Chapter 4) allow you to do the calculations we describe automatically, so you may
never have to do them manually. But without first understanding how these
methods work, it’s almost impossible to understand any other aspect of survival
analysis, so we provide a demonstration for instructional purposes.
» You shouldn’t exclude participants with a censored survival time from any
survival analysis!
» You shouldn’t substitute the censored date with some other value, which is
called imputing. When you impute numerical data to replace a missing value, it
is common to use the last observed value for that participant (called last
observation carried forward, or LOCF, imputation). However, you should not
impute dates in survival analysis.
Exclusion and imputation don’t work to fix the missingness in censored data. You
can see why in Figure 21-2, where we’ve slid the timelines for all the participants
over to the left as if they all had their surgery on the same date. The time scale
shows survival time in years after surgery instead of chronological time.
If you exclude all participants who were censored in your analysis, you may be left
with analyzable data on too few participants. In this example, there are only six
uncensored participants, and removing them would weaken the power of the
analysis. Worse, it would also bias the results in subtle and unpredictable ways.
Using the last-seen date in place of the death date for a censored observation may
seem like a legitimate use of LOCF imputation, but because the participant did not
die during the observation period, it is not acceptable. It’s equivalent to assuming
that all censored participants died immediately after the last-contact date. But
this assumption isn’t reasonable, because it would not be unusual for them to live
on for many years. This assumption would also bias your results toward artificially
shorter survival times.
In your analytic data set, only include one variable to represent time observed
(such as Time in days, months, or years), and one variable to represent event
status (such as Event or Death), coded as 1 if they have the event during the
observation period, and 0 if they are censored. Calculate these variables from raw
date variables stored in other parts of the data (such as date of death, date of visit,
and so on).
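Here's a minimal sketch of that calculation in R, assuming hypothetical date columns named surgery_date, death_date (NA if the participant didn't die), and last_contact_date in a data frame named dat:

dat$Event <- ifelse(is.na(dat$death_date), 0, 1)              # 1 = died, 0 = censored
end_date <- dat$last_contact_date                             # censoring date by default
end_date[dat$Event == 1] <- dat$death_date[dat$Event == 1]    # death date if they died
dat$Time <- as.numeric(end_date - dat$surgery_date) / 365.25  # observed time in years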
These calculations can be laid out systematically in a life table, which is also called
an actuarial life table because of its early use by insurance companies. The calcula-
tions only involve addition, subtraction, multiplication, and division, so they can
be done manually. They are easy to set up in a spreadsheet format, and there are
many life-table templates available for Microsoft Excel and other spreadsheet
programs that you can use.
» During the first year after surgery, one participant died (#1), and one partici-
pant was censored (#5, who was LFU).
» During the third year, two participants died (#4 and #9), and none were
censored.
Continue tabulating deaths and censored times for the fourth through seventh
years, and enter these counts into the appropriate cells of a spreadsheet like the
one shown in Figure 21-3.
FIGURE 21-3:
A partially
completed life
table to analyze
the survival times
shown in
Figure 21-2.
© John Wiley & Sons, Inc.
» Put the description of the time interval that defines each slice into Column A.
» Enter the total number of participants alive at the start into Column B in the
0–1 yr row.
» Enter the counts of participants who died within each time slice into Column C
(labeled Died).
» Enter the counts of participants who were censored during each time slice
into Column D (labeled Last Seen Alive).
After you’ve entered all the counts, the spreadsheet will look like Figure 21-3.
Then you perform the calculations shown in the Formula row at the top of the
figure to generate the numbers in all the other cells of the table. (To see what it
looks like when the table is completely filled in, take a sneak peek at
Figure 21-4.)
» Out of the ten participants alive at the start, one died and one was last seen
alive during the first year. This means eight participants (10 – 1 – 1) are
known to still be alive at the start of the second year. The missing partici-
pant is #5, who was LFU during the first year. They are censored and not
counted in any subsequent years.
» Zero participants died or were last seen alive during the second year. So, the
same eight participants are still known to be alive at the start of the third year.
Column E
Column E shows the number of participants at risk for dying during each year. You
may guess that this is the number of participants alive at the start of the interval,
but there’s one minor correction. If any were censored during that year, then
they weren’t technically able to be observed for the entire year. Though they may
die that year, if they are censored before then, the study will miss it. What if you
don’t know exactly when during that year they became censored? If you don’t
have the exact date, you can consider them being observed for half the time
period (in this case, 0.5 years). So the number at risk can be estimated as the
number alive at the start of the year, minus one-half of the number who became
censored during that year, as indicated by the formula for Column E: E = B – D/2.
(Note: To simplify the example, we are using years, but you could use months
instead if you have exact censoring and death dates in your data to improve the
accuracy of your analysis.)
» Ten participants were alive at the start of Year 1, and one participant was
censored during Year 1. To correct for censoring, divide 1 by 2, which is
0.5. Next, subtract 0.5 from 10 to get 9.5. After correcting for censoring, only
9.5 participants are at risk of dying during Year 1.
Column F
Column F shows the Probability of Dying during each interval, assuming the partic-
ipant has survived up to the start of that interval. To calculate this, divide the Died
column by the At Risk column. This represents the fraction of those who were at
risk of dying at the beginning of the interval who actually died during the interval.
Formula for Column F: F = C/E.
» For Year 1, the probability of dying is calculated by dividing the one death by
the 9.5 participants at risk: 1/9.5, or 0.105 (10.5 percent).
» Zero participants died in Year 2. So, for participants surviving Year 1 and alive
at the beginning of Year 2, the probability of dying during Year 2 is 0. Woo-hoo!
Column G
Column G shows the Probability of Surviving during each interval for participants
who have survived up to the start of that interval. Since surviving means not
dying, the equation for this column is 1 – Probability of Dying, as indicated by the
formula for Column G: G = 1 – F.
Column H
Column H shows the cumulative probability of surviving from the time of surgery
all the way through the end of this time slice. To survive from the start time
through the end of any given year (year N), the participant must survive each of
Here’s to fill in Figure 22-3 (with the results shown in Figure 22-4):
» For Year 2, H is the product of G for Year 1 times G for Year 2; that is, 0.895 ×
1.000, or 0.895.
» For Year 3, H is the product of the Gs for Years 1, 2, and 3; that is, 0.895 ×
1.000 × 0.750, or 0.671.
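For instructional purposes, here's a small sketch of the same arithmetic in R (not the book's spreadsheet; the column letters in the comments refer to Figure 21-3):

life_table <- function(n_start, died, censored) {
  k <- length(died)
  alive <- numeric(k)                      # Column B: alive at start of each interval
  alive[1] <- n_start
  for (i in seq_len(k - 1)) alive[i + 1] <- alive[i] - died[i] - censored[i]
  at_risk   <- alive - censored/2          # Column E: E = B - D/2
  p_die     <- died / at_risk              # Column F: F = C/E
  p_survive <- 1 - p_die                   # Column G: G = 1 - F
  cum_surv  <- cumprod(p_survive)          # Column H: running product of the Gs
  data.frame(alive, died, censored, at_risk, p_die, p_survive, cum_surv)
}

# Using the counts quoted in the text for the first three years (one death and
# one censoring in Year 1, none in Year 2, two deaths in Year 3):
life_table(10, died = c(1, 0, 2), censored = c(1, 0, 0))
# cum_surv comes out 0.895, 0.895, 0.671, matching Column H in Figure 21-4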
FIGURE 21-4:
Completed life
table to analyze
the survival
times shown in
Figure 21-2.
© John Wiley & Sons, Inc.
» The hazard and survival values obtained from this life table are estimates
from a sample of the true population hazard and survival functions (in this
case, using one-year intervals).
» The hazard rate obtained from a life table for each time slice is equal to the
Probability of Dying (Column F) divided by the width of the slice. Therefore,
the hazard rate for the first year would be expressed as 0.105 per year, or
10.5 percent per year.
» The cumulative survival probability is always 1.0 (or 100 percent) at time 0,
whenever you designate that time 0 is (in the example, date of surgery), but
it’s not included in the table.
» The cumulative survival function decreases only at the end of an interval that
has at least one observed death, because censored observations don’t cause
a decrease in the estimated survival. Censored observations do, however,
influence the size of the decreases when subsequent events occur. This is
because censoring reduces the number at risk, which is used in the denomi-
nator in the calculation of the death and survival probabilities.
» If an interval contains no events and no censoring, like in the 1–2 years row in
the table in Figure 21-4, it has no impact on the calculations. Notice how all
subsequent values for Column B and for Columns E through H would remain
identical if that row were removed.
» Figure 21-5a is a graph of hazard rates. Hazard rates are often graphed as
bar charts, because each time slice has its own hazard rate in a life table.
FIGURE 21-5:
Hazard function
(a) and survival
function (b)
results from
life-table
calculations.
© John Wiley & Sons, Inc.
The life-table calculations work fine with only one participant per row and pro-
duce what’s called Kaplan-Meier (K-M) survival estimates. You can think of the K-M
method as a very fine-grained life table. Or, you can see a life table as a grouped
K-M calculation.
A K-M worksheet for the survival times is shown in Figure 21-6. It is based on the
one-participant-per-row idea and is laid out much like the usual life-table
worksheet, with a few differences:
» Instead of a column identifying the time slices, there are two columns
identifying the individual participant (Column A) and their survival or censor-
ing time (Column B). The table is ordered from the shortest time to
the longest.
» Instead of two columns containing the number who died and were censored
in each interval, you need only one column indicating whether or not the
participant in that row died (Column C). If they died during the observation
period, use code 1, and if not and they were censored, use code 0.
» These changes mean that Column D labeled Alive at Start now decreases by 1
for each subsequent row.
» The At Risk column in Figure 21-4 isn’t needed, because it can be calculated
from the Alive at Start column. That’s because if the participant is censored,
the probability of dying is calculated as 0, regardless of the value of the
denominator.
» To calculate Column E, the Probability of Dying, divide the Died indicator by the
number of participants alive for that time period in Column D, Alive at Start.
Formula: E = C/D.
FIGURE 21-6:
Kaplan-Meier
calculations.
© John Wiley & Sons, Inc.
Figure 21-7 shows graphs of the K-M hazard and survival estimates from
Figure 21-6. These charts were created using the R statistical software. Most soft-
ware that performs survival analysis can create graphs similar to this. The K-M
FIGURE 21-7:
Kaplan-Meier
estimates of the
hazard (a) and
survival (b)
functions.
© John Wiley & Sons, Inc.
While the K-M survival curve tends to be smoother than the life table survival
curve, just the opposite is true for the hazard curve. In Figure 21-7a, each partic-
ipant has their own very thin bar, and the resulting chart isn’t easy to interpret.
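A minimal sketch of how K-M estimates and a survival plot can be produced in R with the survival package (the data values here are illustrative placeholders, not the study's actual times):

library(survival)

Time  <- c(0.9, 2.1, 2.5, 2.8, 3.4, 4.0, 4.6, 5.2, 6.1, 6.8)  # observed years
Event <- c(1,   0,   1,   1,   0,   1,   1,   0,   1,   0)    # 1 = died, 0 = censored

km <- survfit(Surv(Time, Event) ~ 1)      # Kaplan-Meier estimate
summary(km)                               # survival probability at each death time
plot(km, xlab = "Years after surgery", ylab = "Cumulative survival probability")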
Dates and times should be recorded to suitable precision. If your study timeline
is years, it’s best to keep track of dates to the day. In a Phase I clinical trial
(see Chapter 5), participants may be studied for events that happen in a span
of a few days. In those cases, it’s important to record dates and times to the near-
est hour or minute. You can even envision laboratory studies of intracellular
events where time would have to be recorded with millisecond — or even
microsecond — precision!
Dates and times can be stored in different ways in different statistical software (as
well as Microsoft Excel). Designating columns as being in date format or time for-
mat can allow you to perform calendar arithmetic, allowing you to obtain time
intervals by subtracting one date from another.
The problem is that if you accidentally use your censored indicator instead of your
event indicator when running your survival analysis, you will unknowingly flip
your analysis, and you won’t get any warning or error message from the program.
You’ll only get incorrect results. Worse, depending on how many censored and
uncensored observations you have, the survival curve may also not hint at any
errors. It may look like a perfectly reasonable survival curve for your data, even
though it’s completely wrong.
You have to read your software’s documentation carefully to make sure you code
your event variable correctly. Also, you should always check the program’s output
for the number of censored and uncensored observations and compare them to the
known count of censored and uncensored participants in your data file.
Chapter 22
Comparing Survival
Times
The life table and Kaplan-Meier survival curves described in Chapter 21 are
ideal for summarizing and describing the time to the first or only occur-
rence of a particular event based on times observed in a sample of individu-
als. They correctly incorporate data that reflect when an individual is observed
during the study but does not experience the event, which is called censored data.
Animal and human studies involving endpoints that occur on a short time-scale,
like measurements taken during an experimental surgical procedure, may yield
totally uncensored data. However, the more common situation is that during the
observation period of studies, not all individuals experience the event, so you usu-
ally have censored data on your hands.
In biological research and especially in clinical trials (discussed in Chapter 5), you
often want to compare survival times between two or more groups of individuals.
In humans, this may have to do with survival after cancer surgery. In animals, it
may have to do with testing the toxicity of a potential therapeutic. This chapter
describes an important method for comparing survival curves between two groups
called the log-rank test, and explains how to calculate the sample size you need to
have sufficient statistical power for this test (see Chapter 3). The log-rank test can
be extended to handle three or more groups, but this discussion is beyond the
scope of this book.
There is some ambiguity associated with the name log-rank test. It has also been
called different names (such as the Mantel-Cox test), and has been extended into
variants such as the Gehan-Breslow test. You may also observe that different
software may calculate the log-rank test slightly differently. In this chapter, we
describe the most commonly used form of the log-rank test.
If you have no censored observations in your data, you can skip most of this chapter.
This may happen if, for example, death is your outcome and at the end of your
study period no individuals are alive anymore — they all have died in your study.
As you may guess, this situation is much more common in animal studies than
human studies. But if you have followed all the individuals in your data until they
all experienced the outcome, and you have two or more groups of numbers indi-
cating survival times that you want to compare, you can use approaches described
in Chapter 11. One option is to use an unpaired Student t test to test whether one
group has a statistically significantly longer mean survival time than the other. If
you have three or more groups, you would use an ANOVA instead. But because
survival times are very likely to be non-normally distributed, you may prefer to
use a nonparametric test, such as the Wilcoxon Sum-of-Ranks test or Mann-
Whitney U test, to compare the median survival time between two groups. With
more than two groups, you would use the nonparametric Kruskal-Wallis test.
Suppose that you conduct a toxicity study with laboratory animals of a potential
cancer drug. You obtain 90 experimental mice. The mice are randomly placed in
groups such that 60 receive the drug in their food, and 30 are given control food
with no drug. A laboratory worker observes them and records their vital status
every day after the experiment starts, taking note of when each animal dies or is
censored, meaning they are taken out of the study for another reason (such as not
eating). You perform a life-table analysis on each group of mice — the drug com-
pared to control — as described in Chapter 21, and graph the results. The graph
displays the survival curves shown in Figure 22-1. As a bonus, the two life tables
generated to support this display also provide the summary information needed
for the log-rank test.
The two survival curves in Figure 22-1 look different. The drug group seems to be
showing better survival than the control group. But is this apparent difference
real, or could it be the result of random fluctuations only? The log-rank test
answers this question.
» Group: The group variable contains a code indicating the individual’s group. In
this example, we could use the code Drug = 1 and Control = 2.
» Event status: A variable that indicates the individual’s status at the end of
observation. If they got the event, it is usually coded as 1, and if not or they
are censored, it is coded as 0.
To run the log-rank test, you tell your computer program which variable repre-
sents the group variable, which one means time, and which one contains the
event status. The program should produce a p value for the log-rank test. If you
set α = 0.05 and the p value is less than that, you reject the null and conclude that
the two groups have statistically significantly different survival curves.
In addition to the p value, the program may output median survival time for each
group along with confidence intervals, and difference in median times between
groups. If possible, you will also want to request graphs that show whether your
data are consistent with the hazard proportionality assumption that we describe
later in “Assessing the assumptions.”
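For example, here's a sketch in R with the survival package, assuming a data frame named mice with columns Time, Event, and Group as described above (this isn't the book's own code):

library(survival)

survdiff(Surv(Time, Event) ~ Group, data = mice)   # log-rank test and its p value

fit <- survfit(Surv(Time, Event) ~ Group, data = mice)
fit                                                # median survival per group, with CIs
plot(fit, lty = 1:2, xlab = "Days", ylab = "Survival")   # curves like Figure 22-1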
The log-rank test utilizes information from the life tables needed to produce the
graph shown earlier in Figure 22-1. Figure 22-2 shows a portion of the life tables
that produced the curves shown in Figure 22-1, with the data for the two groups
displayed side by side.
In Figure 22-2, the Drug group’s results are in columns B through E, and the Con-
trol group’s results are in columns F through I. The only measurements needed
from Figure 22-2 for the log-rank test are At risk (columns E and I), meaning
number at risk in each time slice for each group, and Died (columns C and G),
meaning the number of observed deaths in that time slice for each group. The log-
rank test calculations are in a second spreadsheet (shown in Figure 22-3).
FIGURE 22-3:
Basic log-rank
calculations done
manually (but
please use
software
instead!).
© John Wiley & Sons, Inc.
» Columns B and C pertain to the Drug group, and reprint the At risk and Died
columns from Figure 22-2 for that group. Columns D and E pertain to the
Control group and reprint the At risk and Died columns from Figure 22-2 for
that group.
» Columns F and G show the combined total number of individuals at risk and
the total number of individuals who died, which is obtained by combining the
corresponding columns for the two groups.
» Column H, labeled % At Risk, shows Group 1’s percentage of the total number
of at-risk individuals per time slice.
» Column J, labeled Excess Deaths, shows the excess number of actual deaths
compared to the expected number for Group 1.
» Column K shows the variance (equal to the square of the standard deviation)
of the excess deaths. It’s obtained from this complicated formula that’s based
on the properties of the binomial distribution (see Chapter 24):
V = DT × (N1/NT) × (N2/NT) × (NT – DT) / (NT – 1), where DT is the total number of deaths in the time slice, N1 and N2 are the numbers at risk in each group, and NT = N1 + N2.
Next, you add up the excess deaths in all the time slices to get the total number of
excess deaths for Group 1 compared to what you would have expected if the deaths
had been distributed between the two groups in the same ratio as the number of
at-risk individuals.
Then you add up all the variances. You are allowed to do that, because the sum of
the variances of the individual numbers is equal to the variance of the sum of a set
of numbers.
Finally, you divide the total excess deaths by the square root of the total variance
to get a test statistic called Z:
Z = Σ(Excess Deaths) / √Σ(Variances)
Note: By the way, it doesn’t matter which group you assign as Group 1 in these
calculations. The final results come out the same either way.
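For the curious, here's a small sketch of that arithmetic in R, following the column layout of Figure 22-3 (the per-time-slice vectors of at-risk counts and deaths are assumed to be entered by hand from the life tables):

logrank_z <- function(at_risk1, died1, at_risk2, died2) {
  nT <- at_risk1 + at_risk2                 # Column F: total at risk
  dT <- died1 + died2                       # Column G: total deaths
  expected1 <- dT * at_risk1 / nT           # expected deaths for Group 1
  excess    <- died1 - expected1            # Column J: excess deaths
  v <- dT * (at_risk1/nT) * (at_risk2/nT) * (nT - dT) / (nT - 1)  # Column K: variance
  z <- sum(excess) / sqrt(sum(v))           # total excess / square root of total variance
  c(Z = z, p = 2 * (1 - pnorm(abs(z))))     # two-sided p value from the normal distribution
}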
Also, the log-rank test looks for differences in overall survival time. In other
words, it’s not good at detecting differences in shape between two survival curves
with similar overall survival time, like the two curves shown in Figure 22-4. These
two curves actually have the same median survival time, but the survival experi-
ence is different, as shown in the graph. When two survival curves cross over each
other, as shown in Figure 22-4b, the excess deaths are positive for some time
slices and negative for others. This leads them to cancel out when they’re added
up, producing a smaller test statistic z value, which translates to a
larger, non-statistically significant p value.
FIGURE 22-4:
Proportional (a)
and nonpropor-
tional (b) hazards
relationships
between two
survival curves.
© John Wiley & Sons, Inc.
Therefore, one very important assumption of the log-rank test is that the two
groups have proportional hazards, which means the two groups must have gener-
ally similar survival shapes, as shown in Figure 22-4a. Flip to Chapter 21 for more
about survival curves, and read about hazards in more detail in Chapter 23.
» The need to specify an alternative hypothesis: This hypothesis can take the
form of a hazard ratio, described in Chapter 23, where the null hypothesis is
that the hazard ratio = 1. Or, you can hypothesize the difference between two
median survival times.
After opening the PS program, choose the Survival tab, fill in the form, and click
Calculate. The median survival times for the two groups are labeled m1 and m2,
the accrual interval is labeled A, the post-accrual follow-up period is labeled F,
and the group allocation proportion is labeled m. Note that the time variables must
always be entered in the same units (days, in this example). You will also need to
enter your chosen α and power.
Chapter 23
Survival Regression
Survival regression is one of the most commonly used techniques in
biostatistics. It overcomes the limitations of the log-rank test (see
Chapter 22) and allows you to analyze how survival time is influenced by one
or more predictors (the X variables), which can be categorical or numerical. In this
chapter, we introduce survival regression. We specify when to use it, describe its
basic concepts, and show you how to run survival regressions in statistical
software and interpret the output. We also explain how to build prognosis curves
and estimate the sample size you need to support a survival regression.
Note: Because time-to-event data so often describe actual survival, when the
event we are talking about is death, we use the terms death and survival time. But
everything we say about death applies to the first occurrence of any event, like
pre-diabetes patients restoring their blood sugar to normal levels, or cancer sur-
vivors suffering a recurrence of cancer.
» The log-rank test doesn’t handle numerical predictors well. Because this
test compares survival among a small number of categories, it does not work
well for a numerical variable like age. To compare survival among different
age groups with the log-rank test, you would first have to categorize the
participants into age ranges. The age ranges you choose for your groups
should be based on your research question. Because doing this loses the
granularity of the data, this test may be less efficient at detecting gradual
trends across the whole age range.
» The log-rank test doesn’t let you analyze the simultaneous effect of
different predictors. If you try to create subgroups of participants for each
distinct combination of categories for more than one predictor (such as three
treatment groups and three diagnostic groups), you will quickly see that you
have too many groups and not enough participants in each group to support
the test. In this example — with three different treatment groups and three
diagnostic groups — you would have 3 × 3 groups, which is nine, and is already
too many for a log-rank test to be useful. Even if you have 100 participants in
your study, dividing them into nine categories greatly reduces the number of
participants in each category, making the subgroup estimate unstable.
» Adjust for the effects of confounding variables that also influence survival
Most kinds of regression require you to write a formula to fit to your data. The
formula is easiest to understand and work with when the predictors appear in the
function as a linear combination in which each predictor variable is multiplied by a
coefficient, and these terms are all added together (perhaps with another coeffi-
cient, called an intercept, thrown in). Here is an example of a typical regression
formula: y = c0 + c1x1 + c2x2 + c3x3. Linear combinations (such as the c2x2 term from the
example formula) can also have terms with higher powers — like squares or
cubes — attached to the predictor variables. Linear combinations can also have
interaction terms, which are products of two or more predictors, or the same pre-
dictor with itself.
Survival regression takes the linear combination and uses it to predict survival.
But survival data presents some special challenges:
» Censoring: Censoring happens when the event doesn’t occur during the
observation time of the study (which, in human studies, means during
follow-up). Before considering using survival regression on your data, you
need to evaluate the impact censoring may have on the results. You can do
this using life tables, the Kaplan-Meier method, and the log-rank test, as
described in Chapters 21 and 22.
Since 1972, many issues have been identified when using survival regression for
biological data, especially with respect to its appropriateness for the type of data.
One way to examine this is by running a logistic regression model (see Chapter 18)
with the same predictors and outcome as your survival regression model without
including the time variable, and seeing if the interpretation changes.
1. Determine the shape of the overall survival curve produced from the
Kaplan-Meier method.
Luckily, the way your software defines its baseline function doesn’t affect any of
the calculated measures on your output, so you don’t have to worry about it. But
you should be aware of these definitions if you plan to generate prognosis curves,
because the formulas to generate these are slightly different depending upon the
way the computer calculates the baseline survival function.
To understand the flex, look at what happens when you raise this straight line to
various powers, which we refer to as h and illustrate in Figure 23-1b:
» Squaring: If you set h = 2, the y value for every point on the line always comes
out smaller, because the y values are all less than 1. For example, 0.8^2 is 0.64.
» Taking the square root: If you set h = 0.5, the y value of every point on the
line becomes larger. For example, the square root of 0.25 is 0.5.
Notice in Figure 23-1b, both 1^2 and 1^0.5 remain 1, and 0^2 and 0^0.5 both remain 0, so
those two ends of the line don’t change.
Does the same trick work for a survival curve that doesn’t follow any particular
algebraic formula? Yes, it does! Look at Figure 23-2.
FIGURE 23-2:
Raising to a
power works
for survival
curves, too.
© John Wiley & Sons, Inc.
» Figure 23-2a shows a typical survival curve. It’s not defined by any algebraic
formula. It just graphs the table of values obtained by a life-table or Kaplan-
Meier calculation.
» Figure 23-2b shows how the baseline survival curve is flexed by raising every
baseline survival value to a power. You get the lower curve by setting h = 2
and squaring every baseline survival value. You get the upper curve by setting
h = 0.5 and taking the square root of every baseline survival value. Notice
that the two flexed curves keep all the distinctive zigs and zags of the baseline
curve, in that every step occurs at the same time value as it occurs in the
baseline curve.
• The upper curve represents participants who had better survival than
a baseline person at any given moment — meaning they had a lower
hazard rate.
But what should the value of h be? The h value varies from one individual to
another. Keep in mind that the baseline curve describes the survival of a perfectly
average participant, but no individual is completely average. You can think of
every participant in the data as having her very own personalized survival curve,
based on her very own h value, that provides the best estimate of that partici-
pant’s chance of survival over time.
Hazard ratios
Hazard ratios (HRs) are the estimates of relative risk obtained from PH regression.
HRs in survival regression play a role similar to the one that odds ratios play in logistic
regression. They’re also calculated the same way from regression output — by
exponentiating the regression coefficients: HR = e^coefficient.
Keep in mind that hazard is the chance of dying in any small period of time. For
each predictor variable in a PH regression model, a coefficient is produced that —
when exponentiated — equals the HR. The HR tells you how much the hazard rate
is multiplied for participants positive for the predictor compared to the compari-
son group, or for each increase of exactly 1.0 unit in a numerical variable's value. Therefore, a
HR’s numerical value depends on the units in which the variable is expressed in
your data. And for categorical predictors, interpreting the HR depends on how you
code the categories.
• Equal to 1 if the event was known to occur during the observation period
(uncensored)
• Equal to 0 if the event didn’t occur during the observation period (censored)
And as with all regression methods, you designate one or more variables as the
predictors. The rules for representing the predictor variables are the same as
described in Chapter 18:
» For categorical predictors, carefully consider how you recode the data,
especially in terms of selecting a reference group. Consider a five-level age
group variable. Would you want to model it as an ordinal categorical variable,
assuming a linear relationship with the outcome? Or would you prefer using
indicator variables, allowing each level to have its own slope relative to the
reference level? Flip to Chapter 8 for more on recoding categorical variables.
After you assemble and properly code the data, you execute the regression in sta-
tistical software using a similar approach as you use when doing ordinary least-
squares or logistic regression. You need to specify the variables in the regression
model:
Make sure you are careful when you include categorical predictors, especially
indicator variables. All the predictors you introduce should make sense
together in the model.
Most software also lets you specify calculations you want to see on the output. You
should always request at least the following:
» Summary descriptive statistics about the data. These can include number of
censored and uncensored observations, median survival time, and mean and
standard deviation for each predictor variable in the model
After you specify all the input to the program, execute the code, retrieve the out-
put, and interpret the results.
FIGURE 23-3:
Kaplan-Meier
survival curves by
treatment and
clinical center.
© John Wiley & Sons, Inc.
To run a PH regression on the data from this example, you must indicate the fol-
lowing to the software in your code:
» The time-to-event variable. We named this variable Time, and it was coded
in years. For participants who died during the observation period, it was
coded as the number of years from observation beginning until death. For
participants who did not die during the observation period, it contains
number of years they were observed.
» The event status variable. We named this variable Status, and coded it as
1 if the participant was known to have died during the observation period,
and 0 if they did not die.
If you use a numerical variable such as age as a predictor and enter it into the
model, the resulting coefficient will apply to increasing this variable by one unit
(such as for one year of age).
Using the R statistical software, the PH regression can be invoked with a single
command:
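a call to the survival package's coxph() function, sketched here with the variable names described above (the data frame name cancerTrial is a hypothetical placeholder):

library(survival)

phModel <- coxph(Surv(Time, Status) ~ Radiation + CenterCD, data = cancerTrial)
summary(phModel)    # produces output like that shown in Figure 23-4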
Figure 23-4 shows R’s output, using the data that we graph in Figure 23-3. The
output from other statistical programs won’t look exactly like Figure 23-4, but
you should be able to find the main components described in the following
sections.
FIGURE 23-4:
Output of a PH
regression
from R.
One quick check to see whether a predictor is affecting your data in a non-PH way
is to take the following steps:
1. Split the participants into groups according to the levels of that predictor.
2. Plot the Kaplan-Meier survival curve for each group (see Chapter 22).
If the two survival curves for a particular predictor display the slanted figure-
eight pattern shown in Figure 23-5, either don’t use PH regression on those
data, or don’t use that predictor in your PH regression model. That’s because it
violates the assumption of proportional hazards underlying PH regression.
FIGURE 23-5:
Don’t try PH
regression on this
kind of data
because it
violates the PH
assumption.
© John Wiley & Sons, Inc.
Your statistical software may offer several options to test the hazard-
proportionality assumption. Check your software’s documentation to see what
it offers and how to interpret the output. It may offer the following:
» Graphs of the hazard functions versus time, which let you see the extent to
which the hazards are proportional.
» The value of the regression coefficient. This says how much the log of the HR
increases when the predictor variable increases by exactly 1.0 unit. It’s hard to
interpret unless you exponentiate it into a HR. In Figure 23-4, the coefficient
for CenterCD is 0.4522, indicating that for every increase of 1 in CenterCD (which
literally means comparing everyone at Centers C and D to those at Centers A
and B), the logarithm of the hazard increases by 0.4522. When
exponentiated, this translates into a HR of 1.57 (listed on the output under
exp(coef)). As predicted from looking at Figure 23-3, this indicates that those at
Centers C and D together are associated with a higher hazard compared with
those at Centers A and B together. For indicator variables, there will be a row
in the table for each non-reference level, so in this case, you see a row for
Radiation. The coefficient for Radiation is –0.4323, which when exponentiated,
translates to an HR of 0.65 (again listed under exp(coef)). The negative sign
indicates that in this study, radiation treatment is associated with less hazard
and better survival than the comparison treatment, which is chemotherapy.
Interpreting the HRs and their confidence intervals is described in the next
section “Homing in on hazard ratios and their confidence intervals.”
» The p value. Under the assumption that α = 0.05, if the p value is less than
0.05, it indicates that the coefficient is statistically significantly different from 0
after adjusting for the effects of all the other variables that may appear in the
model. In other words, a p value of less than 0.05 means that the correspond-
ing predictor variable is statistically significantly associated with survival. The p
value for CenterCD is shown as 8.09e–06, which is scientific notation for
0.000008, indicating that CenterCD is very significantly associated with survival.
» The HR and its confidence limits, which we describe in the next section.
You may be surprised that no intercept (or constant) row is in the coefficient table
in the output shown in Figure 23-4. PH regression doesn’t include an intercept in
the linear part of the model because the intercept is absorbed into the baseline
survival function.
If the software doesn’t output HRs or their CIs, you can calculate them from the
regression coefficients and standard errors (SEs) as follows: HR = e^coef, and the 95 percent confidence limits are e^(coef – 1.96 × SE) and e^(coef + 1.96 × SE).
In Figure 23-4, the coefficients are listed under coef, and the SEs are listed under
se(coef). HRs are useful and meaningful measures of the extent to which a variable
influences survival.
» The CI around the HR estimated from your sample indicates the range in
which the true HR of the population from which your sample was drawn
probably lies.
In Figure 23-4, the HR for CenterCD is e^0.4522 = 1.57, with a 95 percent CI of 1.29 to
1.92. This means that an increase of 1 in CenterCD (meaning being a participant at
Centers C or D compared to being one at Centers A or B) is statistically signifi-
cantly associated with a 57 percent increase in hazard. This is because multiplying
by 1.57 is equivalent to a 57 percent increase. Similarly, the HR for Radiation (rela-
tive to the comparison, which is chemotherapy) is 0.649, with a 95 percent CI of
0.43 to 0.98. This means that those undergoing radiation had only 65 percent the
hazard of those undergoing chemotherapy, and the relationship is statistically
significant.
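As a quick check of that arithmetic in R, using the CenterCD coefficient from Figure 23-4 and a standard error of roughly 0.10 (an approximate value assumed here for illustration):

coef <- 0.4522
se   <- 0.10
exp(coef)                          # hazard ratio, about 1.57
exp(coef + c(-1.96, 1.96) * se)    # 95 percent confidence limits, about 1.29 to 1.92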
Risk factors, or predictors associated with increased risk of the outcome, have HRs
greater than 1. Protective factors, or predictors associated with decreased risk of the
outcome, have HRs less than 1. In the example, CenterCD is a risk factor, and Radi-
ation is a protective factor.
» Should you include a possible predictor variable (like age) in the model?
» Should you include the squares or cubes of predictor variables in the model
(meaning including age2 or age3 in addition to age)?
» Should you include a term for the interaction between two predictors?
Your software may offer one or more of the following goodness-of-fit measures:
» A likelihood ratio test and associated p value that compares the full model,
which includes all the parameters, to a model consisting of just the overall
baseline function. In Figure 23-4, the likelihood ratio p value is shown as
4.46e−06, which is scientific notation for p = 0.00000446, indicating that a model
that includes the CenterCD and Radiation variables can predict survival
statistically significantly better than just the overall (baseline) survival curve.
The software may also offer a graph of the baseline survival function. If your soft-
ware is using an average-participant baseline (see the earlier section, “The steps
to perform a PH regression”), this graph is useful as an indicator of the entire
group’s overall survival. But if your software uses a zero-participant baseline, the
curve is not helpful.
Suppose that you're modeling survival time (from diagnosis to death) for a group of
cancer patients in which the predictors are age, tumor stage, and tumor grade at the
time of diagnosis. You'd run a PH regression on your data and have the program
generate the baseline survival curve as a table of times and survival probabilities.
After that, whenever a patient is newly diagnosed with cancer, you can take that
person’s age, stage, and grade, and generate an expected survival curve tailored
for that particular patient. (The patient may not want to see it, but at least it could
be done.)
You’ll probably have to do these calculations outside of the software that you use
for the survival regression, but the calculations aren’t difficult and can be done in
a Microsoft Excel spreadsheet. The example in the following sections uses the
small set of sample data that’s preloaded into the online calculator for PH regres-
sion at https://statpages.info/prophaz.html. This particular example has
only one predictor, but the basic idea extends to multiple predictors.
Looking at Figure 23-6, first consider the table in the Baseline Survivor Function
section, which has two columns: time in years, and predicted survival expressed
as a fraction. It also has four rows — one for each time point at which one or more
deaths were actually observed. The baseline survival curve for the example data
starts at 1.0 (100 percent survival) at time 0, as survival curves always do, but this
row isn’t shown in the output. The survival curve remains flat at 100 percent until
year two, when it suddenly drops down to 99.79 percent, where it stays until year
seven, when it drops down to 98.20 percent, and so on.
In the Descriptive Stats section near the start of the output in Figure 23-6, the
average age of the 11 patients in the example data set is 51.1818 years, so the base-
line survival curve shows the predicted survival for a patient who is exactly 51.1818
years old. But suppose that you want to generate a survival curve that’s custom-
ized for a patient who is a different age — like 55 years old. According to the PH
model, you need to raise the entire baseline curve to some power h. This means
you raise each of the four tabulated survival values to the power h.
» The value of the predictor variable for that patient. In this example, the value
of age is 55.
In this example, you subtract the average age, which is 51.18, from the patient’s
age, which is 55, giving a difference of +3.82.
In this example, you multiply 3.82 from Step 1 by the regression coefficient for
age, which is 0.377, giving a product of 1.44 for v.
4. Add all the v values, and call the sum of the individual v values V.
This example has only one predictor variable, which is age, so V equals the v
value you calculate for age in Step 2, which is 1.44.
5. Calculate e^V.
This is the value of h. In this example, e^1.44 gives the value 4.221, which is the h
value for a 55-year-old patient.
6. Raise each of the baseline survival values to the power of h to get the
survival values for the prognosis curve.
You then graph these calculated survival values to give a customized survival
curve for this particular patient. And that’s all there is to it!
To recap: h = e^V, where V is the sum of the individual v values.
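If you'd rather script these steps than work them out in a spreadsheet, here's a minimal Python sketch for the one-predictor example above. The coefficient (0.377), average age (51.1818), and the first two baseline survival values come from the example output; the remaining baseline rows are placeholders you'd fill in from your own output, and the variable names are just illustrative.

import math

# Baseline survival table: (time in years, survival fraction) for the average
# participant. The first two rows come from the example output; add the
# remaining rows from your own output.
baseline = [(2, 0.9979), (7, 0.9820)]

coef_age = 0.377      # regression coefficient for age (from the example)
mean_age = 51.1818    # average age of the 11 patients (the baseline participant)
patient_age = 55      # the patient you want a customized prognosis curve for

v = coef_age * (patient_age - mean_age)  # Steps 1-2: center the age, multiply by the coefficient
V = v                                    # Step 4: sum the v values (only one predictor here)
h = math.exp(V)                          # Step 5: h = e^V, about 4.22 for a 55-year-old

# Step 6: raise each baseline survival value to the power h
for t, s in baseline:
    print(f"Year {t}: predicted survival {s ** h:.4f}")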
» It’s even a little trickier for multivalued categories (such as different clinical
centers) because you have to code each of these variables as a set of indicator
variables.
Very often, sample-size estimates for studies that use regression methods are
based on simpler analytical methods. We recommend that when you’re planning
a study that will be analyzed using PH regression, you base your sample-size esti-
mate on the simpler log-rank test, described in Chapter 22. The free PS program
handles these calculations very well.
» Anticipated enrollment rate: How many participants you hope to enroll per
time period
» Planned duration of follow-up: How long you plan to continue following all
the participants after the last participant has been enrolled before ending the
study and analyzing your data
If you are uncomfortable with estimating sample size for a large study that will be
evaluated with a regression model, consult a statistician with experience in devel-
oping sample-size estimates for similarly designed studies. They will be able to
guide you through the tips and tricks they use to arrive at an adequate sample-size
calculation for your research question and context.
Chapter 24
Ten Distributions Worth
Knowing
This chapter describes ten statistical distribution functions you'll probably
encounter in biological research. For each one, we provide a graph of what
that distribution looks like, as well as some useful or interesting facts and
formulas. You find two general types of distributions here:
» Common test statistic distributions: The last three distributions don’t describe
your observed data. Instead, they describe how a test statistic that is calculated
as part of a statistical significance test will fluctuate if the null hypothesis is true.
The Student t, chi-square, and Fisher F distributions allow you to calculate test
statistics to help you decide if observed differences between groups, associations
between variables, and other effects you want to test should be interpreted as
due to random fluctuations or not. If the apparent effects in your data are due
only to random fluctuations, then you will fail to reject the null hypothesis. These
distributions are used with the test statistics to obtain p values, which indicate
the statistical significance of the apparent effects. (See Chapter 3 for more
information on significance testing and p values.)
» If the null hypothesis is true, the p value from any exact significance
test is uniformly distributed between 0 and 1.
FIGURE 24-1: The uniform distribution. © John Wiley & Sons, Inc.
The Microsoft Excel formula RAND() generates a random number drawn from
the standard uniform distribution.
FIGURE 24-2: The normal distribution at various means and standard deviations. © John Wiley & Sons, Inc.
If a set of log-normal numbers has a mean A and standard deviation D, then the
natural logarithms of those numbers will have a standard deviation
s = √(ln(1 + (D/A)²)) and a mean m = ln(A) − s²/2.
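If you want to apply that conversion in Python rather than by hand, here's a minimal sketch assuming the formulas above (the function name and the example numbers are just illustrative):

import math

def lognormal_to_log_scale(A, D):
    # Mean (m) and SD (s) of the natural logs of log-normal data
    # that has mean A and standard deviation D
    s = math.sqrt(math.log(1 + (D / A) ** 2))
    m = math.log(A) - s ** 2 / 2
    return m, s

m, s = lognormal_to_log_scale(A=10, D=5)  # illustrative values
print(f"log scale: mean = {m:.3f}, SD = {s:.3f}")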
FIGURE 24-4: The binomial distribution. © John Wiley & Sons, Inc.
The formula for the probability of getting x successes in N tries, when the probability
of success on one try is p, is Pr(x, N, p) = p^x (1 − p)^(N−x) × N!/[x!(N − x)!].
Looking across Figure 24-4, you might have guessed that as N gets larger, the
binomial distribution's shape approaches that of a normal distribution with mean
Np and standard deviation √(Np(1 − p)).
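Here's a short Python sketch of that formula, along with the normal approximation's mean and SD (the numbers are just illustrative):

import math

def binomial_prob(x, N, p):
    # Pr(x, N, p) = p^x * (1 - p)^(N - x) * N! / [x! * (N - x)!]
    return math.comb(N, x) * p ** x * (1 - p) ** (N - x)

# Probability of exactly 7 successes in 20 tries with p = 0.5
print(round(binomial_prob(7, 20, 0.5), 4))   # about 0.0739

# Normal approximation for large N: mean N*p, SD sqrt(N*p*(1 - p))
N, p = 20, 0.5
print(N * p, round(math.sqrt(N * p * (1 - p)), 2))   # 10.0 and about 2.24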
FIGURE 24-5: The Poisson distribution. © John Wiley & Sons, Inc.
Looking across Figure 24-5, you might have guessed that as m gets larger, the
Poisson distribution's shape approaches that of a normal distribution, with
mean m and standard deviation √m.
FIGURE 24-6: The exponential distribution. © John Wiley & Sons, Inc.
» If k < 1, there are a lot of early failures, but the failure rate declines
over time.
The Weibull distribution shown in Figure 24-7 leads to survival curves of the form
Survival = e^(−λ × Time^k), which are widely used in industrial statistics. But survival
methods that don't assume a distribution for the survival curve are more common
in biostatistics (we cover examples in Chapters 21, 22, and 23).
FIGURE 24-8: The Student t distribution. © John Wiley & Sons, Inc.
In Figure 24-8, as the degrees of freedom increase, the shape of the Student t
distribution approaches that of the normal distribution.
Table 24-1 shows the critical t value for various degrees of freedom at α = 0.05.
Under α = 0.05, random fluctuations cause the t statistic to exceed the critical t
value only 5 percent of the time. This 5 percent includes exceeding t on either the
positive or negative side. From the table, if you determine your critical t is 2.01 at
50 df and your test statistic is 2.45, it exceeds the critical t and is statistically
significant at α = 0.05. This would also be true if your test statistic were –2.45,
because the table presents only absolute values of critical t.
TABLE 24-1  Critical Values of the Student t Statistic (α = 0.05)
Degrees of Freedom    Critical t Value
1     12.71
2     4.30
3     3.18
4     2.78
5     2.57
6     2.45
8     2.31
10    2.23
20    2.09
50    2.01
∞     1.96
For other α and df values, the Microsoft Excel formula =T.INV.2T(α, df) gives the
critical Student t value.
As you look across Figure 24-9, you may notice that as the degrees of freedom
increase, the shape of the chi-square distribution approaches that of the normal
distribution. Table 24-2 shows the critical chi-square value for various degrees of
freedom at α = 0.05.
Under α = 0.05, random fluctuations cause the chi-square statistic to exceed the
critical chi-square value only 5 percent of the time. If the chi-square value from
your test exceeds the critical value, the test is statistically significant at α = 0.05.
For other α and df values, the Microsoft Excel formula =CHIINV(α, df) gives the
critical chi-square (χ²) value.
TABLE 24-2  Critical Values of the Chi-Square Statistic (α = 0.05)
Degrees of Freedom    Critical Chi-Square Value
1     3.84
2     5.99
3     7.81
4     9.49
5     11.07
6     12.59
7     14.07
8     15.51
9     16.92
10    18.31
Random fluctuations cause F to exceed the critical F value only 5 percent of the
time. If the F value from your ANOVA exceeds this value, the test is statistically
significant at α = 0.05. For other values of α, df1, and df2, the Microsoft Excel
formula =FINV(α, df1, df2) gives the critical F value.
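If you work in Python rather than Excel, the scipy library's percent-point functions give the same critical values as the Excel formulas mentioned above; here's a minimal sketch (the df values are just examples):

from scipy import stats

alpha = 0.05

# Two-tailed critical Student t for 50 df -- matches =T.INV.2T(0.05, 50), about 2.01
t_crit = stats.t.ppf(1 - alpha / 2, df=50)

# Critical chi-square for 3 df -- matches =CHIINV(0.05, 3), about 7.81
chi2_crit = stats.chi2.ppf(1 - alpha, df=3)

# Critical F for 2 and 27 df -- matches =FINV(0.05, 2, 27)
f_crit = stats.f.ppf(1 - alpha, dfn=2, dfd=27)

print(round(t_crit, 2), round(chi2_crit, 2), round(f_crit, 2))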
Chapter 25
Ten Easy Ways to
Estimate How Many
Participants You Need
Sample-size calculations (also called power calculations) tend to frighten
researchers and send them running to the nearest statistician. But if you
need a ballpark idea of how many participants are needed for a new research
project, you can use these ten quick-and-dirty rules of thumb.
Before you begin, take a look at Chapter 3 — especially the sections on hypothesis
testing and the power of a test. That way, you’ll refresh your memory about what
power and sample-size calculations are all about. For your study, you will need to
select the effect size of importance that you want to detect. An effect size could be
a difference of at least 10 mmHg in mean systolic blood-pressure lowering
between groups on two different hypertension drugs, or a correlation of at least
0.7 between two laboratory values. Once you select your effect size and a
compatible statistical test, use the rule in this chapter for that test to calculate
the sample size.
The first six sections tell you how many participants you need to provide complete
data for you to analyze in order to have an 80 percent chance of getting a p value
that’s less than 0.05 when you run the test if a true difference of your effect size
does indeed exist. In other words, we set the parameters to 80 percent power
at α = 0.05, because these values are widely used in biological research. The remaining
four sections tell you how to modify your estimate for other power or α values, and
how to adjust your estimate for unequal group size and dropouts from the study.
» Effect size (E): The difference between the means of two groups divided by
the standard deviation (SD) of the values within a group.
For example, say you’re comparing two hypertension drugs — Drug A and
Drug B — on lowering systolic blood pressure (SBP). You might set the effect size
of importance at 10 mmHg. You also know from prior studies that the SD of the SBP
change is 20 mmHg. Then E = 10/20, or 0.5, and (using the rule of 16/E² participants
per group) you need 16/0.5², or 64 participants in each group (128 total).
» Effect size (E): The difference between the largest and smallest means among
the groups divided by the within-group SD.
Continuing the example from the preceding section, if you’re comparing three
hypertension drugs — Drug A, Drug B, and Drug C — and if any mean difference
of 10 mmHg in SBP between any pair of drug groups is important, then E is still
10/20, or 0.5, but you now need 20/0.5², or 80 participants in each group (240
total).
» Effect size (E): The average of the paired differences divided by the SD of the
paired differences.
Imagine that you’re studying test scores in struggling students before and after
tutoring. You determine a six-point improvement in grade points is the effect size
of importance, and the SD of the changes is ten points. Then E = 6/10, or 0.6, and
you need 8/0.6², or about 22 students, each of whom provides a before score and
an after score.
» Effect size (D): The difference between the two proportions (P1 and P2) that
you're comparing. You also have to calculate the average of the two
proportions: P = (P1 + P2)/2.
For example, if a disease has a 60 percent mortality rate, but you think your drug
can cut this rate in half to 30 percent, then P = (0.6 + 0.3)/2, or 0.45, and
D = 0.6 − 0.3, or 0.3. You need 16 × 0.45 × (1 − 0.45)/0.3², or 44 participants
in each group (88 total).
» Effect size: The correlation coefficient (r) you want to be able to detect.
Imagine that you’re studying the association between weight and blood pressure,
and you want the correlation test to come out statistically significant if these two
variables have a true correlation coefficient of at least 0.2. Then you need to study
8/0.2², or 200 participants.
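If you'd like these rules of thumb in one place, here's a minimal Python sketch of the five rules described so far, using the same formulas as the worked examples (the function names are just convenient labels; round the results up to whole participants):

def n_two_means(E):          # unpaired t test: 16 / E^2 participants per group
    return 16 / E ** 2

def n_anova(E):              # one-way ANOVA: 20 / E^2 participants per group
    return 20 / E ** 2

def n_paired(E):             # paired t test: 8 / E^2 participants (pairs)
    return 8 / E ** 2

def n_two_proportions(P1, P2):   # 16 * P * (1 - P) / D^2 participants per group
    P = (P1 + P2) / 2
    D = abs(P1 - P2)
    return 16 * P * (1 - P) / D ** 2

def n_correlation(r):        # correlation test: 8 / r^2 participants total
    return 8 / r ** 2

print(n_two_means(0.5), n_anova(0.5), n_paired(0.6),
      n_two_proportions(0.6, 0.3), n_correlation(0.2))
# 64.0  80.0  22.2  44.0  200.0 -- matching the worked examples above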
» Effect size: The hazard ratio (HR) you want to be able to detect.
Here's how the formula (the total number of events you need to observe is about
32 divided by the square of the natural logarithm of the HR) works out for several
values of HR greater than 1; a small code sketch follows the table:
Hazard Ratio    Total Number of Events Needed
1.1     3,523
1.2     963
1.3     465
1.4     283
1.5     195
1.75    102
2.0     67
2.5     38
3.0     27
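Here's a minimal Python sketch of the rule behind the table (the function name is just a convenient label):

import math

def events_needed(hr):
    # Approximate number of observed events (such as deaths) needed to detect
    # a hazard ratio of hr with 80 percent power at alpha = 0.05
    return math.ceil(32 / math.log(hr) ** 2)

for hr in (1.1, 1.5, 2.0, 3.0):
    print(hr, events_needed(hr))   # 3523, 195, 67, 27 -- matching the table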
Your enrollment must be large enough and your follow-up must be long enough
to ensure that the required number of events take place during the observation
period. This may be difficult to estimate beforehand as it involves considering
recruitment rates, censoring rates, the shape of the survival curve, and other fac-
tors difficult to forecast. Some research protocols provide only a tentative esti-
mate of the expected enrollment for planning, budgeting, and ethical purposes.
Many state that enrollment and/or follow-up will continue until the required
number of events has been observed. Even with this ambiguity, it is important to
follow the conventions described in this book when designing your study, to avoid
criticism for departing from good general principles.
» For 50 percent power: Use only half as many participants — multiply the
estimate by 0.5.
» For 90 percent power: Increase the sample size by 33 percent — multiply the
estimate by 1.33.
» For 95 percent power: Increase the sample size by 66 percent — multiply the
estimate by 1.66.
For example, if you know from doing a prior sample size calculation that a study
with 70 participants provides 80 percent power to test its primary objective,
then a study that has 1.33 × 70, or 93 participants will have about 90 percent
power to test the same objective. The usual reason to consider power levels other
than 80 percent is a limited sample. If you know that 70 participants provide
80 percent power, but you will only have access to 40, you can estimate the maximum
power you are able to achieve.
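A tiny Python sketch of those power adjustments, using the multipliers listed above (the names are just illustrative):

POWER_MULTIPLIER = {0.50: 0.5, 0.80: 1.0, 0.90: 1.33, 0.95: 1.66}

def adjust_for_power(n_at_80_percent_power, power):
    # Scale a sample size that was calculated for 80 percent power
    return round(n_at_80_percent_power * POWER_MULTIPLIER[power])

print(adjust_for_power(70, 0.90))   # about 93 participants, as in the example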
For example, imagine that you’ve calculated you need a sample size of 100 partici-
pants using α = 0.05 as your criterion for significance. Then your boss says you
have to apply a two-fold Bonferroni correction (see Chapter 11) and use α = 0.025
as your criterion instead. You need to increase your sample size to 100 × 1.2, or 120
participants, to have the same power at the new α level.
» If you want one group twice as large as the other: Increase one group by
50 percent, and reduce the other group by 25 percent. This increases the total
sample size by about 13 percent.
» If you want one group three times as large as the other: Reduce one
group by a third, and double the size of the other group. This increases the
total sample size by about 33 percent.
» If you want one group four times as large as the other: Reduce one group
by 38 percent and increase the other group to 250 percent of its original size
(that is, by 150 percent). This increases the total sample size by about 56 percent.
Suppose that you’re comparing two equal-sized groups, Drug A and Drug B. You’ve
calculated that you need two groups of 32, for a total of 64 participants. Now, you
decide to randomize group assignment using a 2:1 ratio for A:B. To keep the same
power, you'll need 32 × 1.5, or 48 for Drug A, an increase of 50 percent. For Drug B,
you'll want 32 × 0.75, or 24, a decrease of 25 percent, for an overall new total of 72
participants in the study.
Expected Attrition Rate    Increase Your Enrollment By
10%    11%
20%    25%
25%    33%
33%    50%
50%    100%
If your sample size estimate says you need a total of 60 participants with complete
data, and you expect a 25 percent attrition rate, you need to enroll 60 × 1.33, or 80
participants. That way, you’ll have complete data on 60 participants after a quar-
ter of the original 80 are removed from analysis.
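Here's a one-line Python version of that attrition adjustment, equivalent to the table above (the function name is just a convenient label):

import math

def enrollment_needed(n_complete, attrition_rate):
    # Enroll enough participants so that n_complete remain after attrition
    return math.ceil(n_complete / (1 - attrition_rate))

print(enrollment_needed(60, 0.25))   # 80, as in the example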
About the Authors
Monika M. Wahi, MPH, CPH, is a well-published data scientist with more than
20 years of experience, and president of the public health informatics and educa-
tion firm DethWench Professional Services (DPS) (www.dethwench.com). She is
the author of Mastering SAS Programming for Data Warehousing and has coauthored
over 35 peer-reviewed scientific articles. After obtaining her master of public
health degree in epidemiology from the University of Minnesota School of Public
Health, she has served many roles at the intersection of study design, biostatis-
tics, informatics, and research in the public and private sectors, including at
Hennepin County Department of Corrections in Minneapolis, the Byrd
Alzheimer’s Institute in Tampa, and the U.S. Army. After founding DPS, she
worked as an adjunct lecturer at Laboure College in the Boston area for several
years, teaching about the U.S. healthcare system and biostatistics in their bachelor
of nursing program. At DPS, she helps organizations upgrade their analytics
pipelines to take advantage of new research approaches, including open source.
She also coaches professionals moving into data science from healthcare and
other fields on research methods, applied statistics, data governance, informatics,
and management.
John C. Pezzullo, PhD, spent more than half a century working in the physical,
biological, and social sciences. For more than 25 years, he led a dual life at Rhode
Island Hospital as an information technology programmer/analyst (and later
director) while also providing statistical and other technical support to biological
and clinical researchers at the hospital. He then joined the faculty at Georgetown
University as informatics director of the National Institute of Child Health and
Human Development’s Perinatology Research Branch. He created the StatPages
website (https://statpages.info), which provides online statistical calculating
capability and other statistics-related resources.
Dedication
This book is dedicated to my favorite “dummy,” my dear old dad, Bhupinder Nath
“Ben” Wahi. He is actually a calculus whiz. He used to point to the For Dummies
books when we’d see them at the bookstore and say, “Am I a dummy?” Of course,
I won’t answer that! — Monika
Author’s Acknowledgments
First and foremost, I want to acknowledge and honor the late Dr. John Pezzullo,
the original author of this work. It has been a pleasure to revise his thorough and
interesting writing, delivered with the enthusiasm of a true educator. I am also
extremely grateful for Matt Wagner of Fresh Books, who opened the doors
necessary for my coauthorship of this book. Further, I am indebted to editor
Katharine Dvorak for her constant support, guidance, and helpfulness through the
writing process. Additionally, I want to extend a special “thank you” to my amaz-
ing colleague, Sunil Gupta, who provided his typically excellent technical review.
Finally, I want to express my appreciation for all the other members of the For
Dummies team at Wiley who helped me along the way to make this book a success.
Thank you for helping me be the best writer I can be! — Monika
Publisher’s Acknowledgments